

Coordination of verbal and non-verbal actions in human–robot interaction at museums and exhibitions

Akiko Yamazaki a,*, Keiichi Yamazaki b, Matthew Burdelski c, Yoshinori Kuno d, Mihoko Fukushima e

a Department of Media, Tokyo University of Technology, 1404-1 Katakuramachi, Hachioji City, Tokyo 192-0982, Japan
b Faculty of Liberal Arts, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama City, Saitama 338-8570, Japan
c Department of Modern Languages and Literatures, Swarthmore College, Swarthmore, PA 19081, USA
d Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama City, Saitama 338-8570, Japan
e Department of Sociology, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK

Journal of Pragmatics 42 (2010) 2398–2414

A R T I C L E  I N F O

Article history:
Received 12 January 2009

Keywords:
Human–robot interaction
Projectability
Non-verbal action
Robot guide
Museum guide
Interaction analysis

A B S T R A C T

In this article we analyze videotaped data in Japanese of naturally occurring human–human and experimental human–robot interaction in museums and exhibitions, with the goal of developing a robot that can provide explanations in these settings. Based on the initial analysis of human interaction among (i) visitors and (ii) visitors and guides, we observe that verbal and non-verbal actions play an important role in directing others' actions. In particular, in the visitor–guide interactions we discovered that guides make head turns from the exhibit towards a visitor and also point towards features of the exhibit, and that these non-verbal actions are consequential in gaining visitors' response. Furthermore, guides turn their heads at particular places in their talk. We then programmed such coordination of verbal and non-verbal actions in a robot guide and conducted several experimental observations. These observations show that when the robot's talk was coordinated with non-verbal actions, visitors more often responded with their own non-verbal actions such as head nods. The results suggest that in developing robots to employ in interaction with humans as museum and exhibition guides, we need to consider the precise coordination of verbal and non-verbal actions.

© 2009 Elsevier B.V. All rights reserved.

doi:10.1016/j.pragma.2009.12.023

* Corresponding author. Tel.: +81 42 637 2619; fax: +81 42 637 2790.
E-mail addresses: [email protected] (A. Yamazaki), [email protected] (K. Yamazaki), [email protected] (M. Burdelski), [email protected] (Y. Kuno), [email protected] (M. Fukushima).

1. Introduction

Over the last couple of decades, human–computer interaction has occupied a significant role in a broad range of contexts. Robots, in particular, have been employed across an increasing number of social domains such as medical laboratories, factories, and households. While previous research has earnestly investigated human–robot interaction (e.g. Fischer, 2006; Sidner et al., 2005), few studies have explored this interaction with the goal of programming robots that deploy language and embodied resources in interacting with humans. In our research—a collaborative project among researchers in humanities, social sciences, and engineering—we have been exploring ways to develop a robot that can interact with humans in friendly and informative ways.


Our analysis draws on ethnomethodology and conversation analysis, and contributes to the growing body of research that has come to be known as workplace studies. Our research interest from the humanities and social sciences concerns the extent to which visitors to museums and exhibitions interact with a robot guide in comparison to a human guide. We are particularly interested in the speech and non-verbal conduct of the visitors. In addressing this concern, we first recorded naturally occurring interaction at several museums and exhibitions, where we observed visitors interacting with and without a guide. We discuss data from four data sets. The first two are from museums and the other two are from university exhibitions.

1. Communication Museum: Visitors experience a hands-on exhibit called "Elekitel" (Tokyo, Japan).
2. Japanese American National Museum: Guided tours with visitors (Los Angeles, USA).
3. Korean roof tile: A guide explains Korean roof tiles to a visitor at a university exhibition (Saitama University, Japan).
4. Thailand spirit: A guide explains the spirit of Thailand to a visitor at a university exhibition (Future University-Hakodate, Japan).

In our analyses, we discovered that guides coordinate verbal and non-verbal resources in particular ways. Specifically, we found that they gaze towards the visitor(s) and back towards the exhibit at specific points in their talk. Based on these findings, we programmed the coordination of verbal and non-verbal actions in a museum robot guide. We programmed the robot in two modes. In one, the robot turns its head at specific, interactionally significant places in its talk. In the other, the robot turns its head at non-interactionally significant places in its talk. We will discuss what we consider to be interactionally significant in section 3. We investigated human visitor–robot guide interaction in three experiments.

First experiment: A prototype museum robot guide explains an exhibit to participants at the Science Museum, Tokyo.
Second experiment: A humanoid robot (Robovie ver. 2) explains two posters to participants at a laboratory at Saitama University.
Third experiment: The same robot as in the second experiment explains an "air plant" to participants at the laboratory at Saitama University.

As we mentioned above, our research interest lies in the ways visitors respond to a robot guide. We focus in particular on the interactional and sequential organization of participants' actions with regard to the conduct of the robot. Our analysis involves detailed transcription of the interactions and a quantitative summary of the kinds of responses. Though in the experiments we examine one-to-one interactions, we also plan to develop a robot to be employed in multiple-visitor interaction.

2. Head movement and coordinated activities: findings from visitor studies at museums

We have observed and videotaped visitors interacting with and without a guide in several science, art, and history museums, and have analyzed these data in relation to language, body, and environment. In particular, in relation to the body, we have examined how the head is an important resource for directing addressees towards courses of action. For instance, in Fragment 1 a father and his two children are at a science museum without a guide. They are currently at the exhibit called "Elekitel," which discharges sparks when its handle is rotated.

Fragment 1: ‘‘Elekitel’’1

M1 (father), B1 (older boy), and G1 (younger girl)

G1: ((turning Elekitel’s handle))

B1: Kon’na akaruku cha yoku mie nai yo.

Like.this bright if well see NEG FP

‘‘When it’s this bright ((in here)) it’s hard to see ((the sparks))’’

[1] The transcription conventions are the following: The original Japanese talk is presented in italics in the first line. A word-by-word translation or grammatical description is provided in the second line, and a vernacular gloss in English appears in quotation marks in the third line. Abbreviations used in the interlinear gloss are shown in Appendix A. In the Japanese excerpts, unexpressed elements are supplied within double parentheses in the English gloss. A full stop is used to connect words (in the second line) when multiple words are used to translate a Japanese expression.



Fig. 1. The father and children at the exhibit "Elekitel".

M1: A so? ((looks to the left as he finishes talking and steps backwards))

Oh really

"Oh really"

Fig. 2. The father turns his head towards the left.

B1: ((looks towards the same direction as M1))

G1: ((releases the handle and promptly follows M1))

M1: ((walking away)) Otousan ni wa mieru.

Dad for TOP can.see

"Dad ((=I)) can see it"

Fig. 3. The family walks away from Elekitel.



When the girl rotates Elekitel's handle while gazing towards the sparks on the screen, her father and the boy look in the same direction (Fig. 1). When the boy says, 'Kon'na akaruku cha yoku mie nai yo' ("When it's this bright (in here) it's hard to see (the sparks)"), his father responds 'A so?' ("Oh really") and then turns his head towards the left (Fig. 2). The boy then looks in the same direction as his father, the girl releases the handle, and then all three walk away from the exhibit (Fig. 3).

The orientation displayed by the father's head movement signals a completion of engagement with Elekitel and indirectly directs the children to move on to another exhibit. This excerpt suggests that talk, along with non-verbal actions such as head movement, gaze, and body orientation, is a crucial resource for coordinating the actions of hearers.

Fragment 2 is a guided tour at the Japanese American National Museum. Here a guide is explaining a set of pictures to three visitors. At this moment the guide finishes her explanation and begins to walk to the next set of pictures with the visitors.

Fragment 2: Guided-visitor tour at Japanese American National Museum

FG (female guide), M2 (male adult visitor), M3 (male adult visitor), F1 (female adult visitor)

1 F1: Sugoi ne:: ma:: ii wa:: setsumeisitemorau to

      Impressive FP really wonderful FP receive.explanation when

2     yoku wakaru wa:

      well understand FP

      "Impressive! Really wonderful. The explanation helps ((me)) understand ((the exhibit)) so well."

Fig. 4. F1 turns back to exhibit again.

3 FG: A:: jaa jikan ga nai node isogi mashou ka?

      Oh then time SUB NEG as hurry let's QP

      "OK then, shall we move along quickly since we don't have much time?"

      [((FG looks at M2))
      [((M2 looks at FG))

Fig. 5. Mutual gaze between FG and M2. (a) Guide and M2 start to look at each other (left). (b) Guide and M2 look at each other (right).

4 M2: A sou desu ne sumimasen

      Oh right COP FP I'm.sorry

      "Oh ((you're)) right. Sorry."

5     ((Everyone begins to leave the room.))

Fig. 6. M2 opens the door for FG, and F1 and M3 follow them while talking.



Following the guide’s explanation, F1 goes back to the exhibit and looks at it closely, and then makes a positiveassessment in relation to the exhibit and the guide’s explanation (lines 1 and 2, Fig. 4). When the guide proposes to move onto the next exhibit, providing the reason that ‘‘we don’t have much time’’, the guide andM2 start to look at each other at thesame time (Fig. 5). Through this mutual gaze (Fig. 5), M2 displays his understanding and recipiency towards the guide andthe guide displays hermonitoring ofM2’s understanding and recipiency.M2 accepts the guide’s proposal, and thenmoves tothe entrance of the next room where he opens the door for the guide and others (Fig. 6).

As we have shown in this section, it is not only verbal actions but also non-verbal actions, such as head turns and gaze, that are important resources for organizing action in the context of museums and exhibitions. In the next section we discuss prior research on the coordination of verbal and non-verbal actions.

3. Coordination of verbal and non-verbal actions

Academic interest in human non-verbal actions began as early as the 16th century (Kendon, 2004). While many researchers have explored the iconic meaning of gesture, what is less well understood is how verbal and non-verbal actions organize the interactional environment, and what kinds of actions are achieved collaboratively in this environment. According to Hutchins and Palen (1997:38), "space, gesture, and speech are all combined in the construction of complex multilayered representations in which no single layer is complete or coherent by itself." Several researchers have examined this construction in naturally occurring interaction. In particular, Goodwin (2003), observing archeology classes, showed ways that multiple resources including verbal actions, bodily movement, the environment, and tools were closely coordinated in on-site instruction. In our analysis above we have focused on the ways verbal language is coordinated with non-verbal actions such as head movement and gaze. Prior research has shown that head movement is a crucial resource for projecting what the speaker will do next (Streeck, 1995), and thus it becomes an important resource for coordinating actions among multiple participants. However, head movement works effectively in interaction only if it relates to an environment in focus and contributes to the projectability of the turn-taking system (Lerner, 2003). Sacks et al. discuss unit-types and projection as follows:

"There are various unit-types with which a speaker may set out to construct a turn. Unit-types for English include sentential, clausal, phrasal, and lexical constructions. Instances of the unit-types so usable allow a projection of the unit-type under way, and what, roughly, it will take for an instance of that unit-type to be completed" (Sacks et al., 1974:702).

The turn-constructional component of the turn-taking system identifies the types of turn-constructional units (TCUs) as sentential, clausal, and lexical (Sacks et al., 1974:720). The authors define the transition-relevance place (TRP) as follows: "The first possible completion of a first such unit constitutes an initial transition-relevance place" (Sacks et al., 1974:703). A TRP is a place where turn-transfer or speaker change may potentially occur (Tanaka, 1999:27) and the next speaker may begin a turn. In this way, the TRP is a place at which a hearer can display a response to the speaker, though he or she does not necessarily take the turn. In our human–robot interaction, we expect human participants to display non-verbal actions such as head turning and gaze (towards the robot or exhibit) and possibly minimal verbal actions. As for the speaker, the TRP is also a place where he or she can check and monitor whether the hearer is listening and displaying understanding.

When we examined our human guide–visitor interaction in museums and exhibitions, we found that explanations tended to be constructed through the use of sentences. Therefore, the TRP of the guide's explanation is often at the end of a sentence.




In the next section we examine human guide–visitor interaction. As we will see, significant non-verbal actions such as head turns and pointing are often conducted around TRPs.

4. Head turn and utterance: university exhibit

We observed one-to-one interactions between a Japanese guide and a visitor [2] at a university exhibition. A central feature of the interactions observed is that the guide and visitor typically look either at each other or at the exhibit (e.g. poster) at the same time. Another central feature, related to the first, is that the guide and visitor often turn their head towards the exhibit or towards each other. Here we focus on the places in talk when such head turns occur. The data were collected in two situations:

1. Korean roof tile: A guide explains a Korean roof tile to a visitor at a university exhibition (Saitama University, Japan).
2. Thailand spirit: A guide explains the spirit of Thailand to a visitor at a university exhibition (Future University-Hakodate, Japan).

Using these data sets, we will examine those salient places in their talk where guides make such head movements. For the second data set, we will also examine places at which guides point.

The first excerpt illustrates the recurring observation that guides turn their head at TRPs, such as when finishing a sentential unit. In Fragment 3, a guide (left) is explaining the process of making ancient Korean roof tiles to a visitor (right).

Fragment 3: Korean roof tiles

G (guide), V (visitor)

01 G: De ma: [kore ga kansei ban to, kouiu katachi

So well this SUB final version QT this.kind.of form

[(( pointing towards picture ))

02 de ma: [kawa[ra ga dekiru n desu ne:.

so well tile SUB is.made N COP FP

[((turns head, gazes, and nods towards participant))

‘This is the final version, and the tile is made into this form.’

03 V: [((gazes towards guide, nods head))

In this excerpt, when the guide says, "the tile is made," he starts to turn his head towards the visitor (Fig. 7a). This movement allows him to visibly check whether or not the visitor is displaying verbal and non-verbal recipiency at this TRP. This sentence furthermore has a slight rise in intonation at "made" (dekiru), which indicates pre-completion of the on-going turn. At this point, both the guide and the visitor start nodding at the same time. These movements display to each other a certain degree of mutual understanding. The guide and visitor then turn back toward the exhibit (Fig. 7b) and proceed to the next exhibit.

Fig. 7. Mutual gaze. (a) Guide (left) turns towards visitor (right). (b) The guide turns back to the exhibit and proceeds to the next exhibit.

[2] Since we have frequently observed one-to-one guided tours, and have often witnessed guides voluntarily explaining an exhibit to visitors without companions, we believe our findings will be useful in Japanese museums.



In addition to TRPs, guides often turn their head when using deictic words and making hand gestures. We categorized kore (this), koko (here) and asoko (there) as deictic words, which refer to a certain physical place or some virtual space and time. In addition to turning their head at these places, guides also frequently turn their head when saying key words, such as highly significant and unfamiliar words. Fig. 8 shows an example of this case. Here, the guide points at a certain part (the shrine of "Pii": the spirit of Thailand) in the picture while saying "Pii", then he turns his head towards the visitor.

Fig. 8. Spirit of Thailand: example of key word/gesture case. The guide points while saying "Pii."

These observations show ways in which human guides turn their head and gaze towards visitors and in some cases point towards the exhibit during their explanations. These non-verbal resources, as well as other elements such as deictic terms and key words, occur most frequently around TRPs. The data also show that visitors turn their head towards the guide and occasionally nod in response. This suggests that visitor head turn, gaze, and head nod may be an important display of recipiency, engagement, and understanding in this context. These findings further suggest that in order to effectively employ robots as guides in museums and exhibitions, it may be important to program the robot to deploy non-verbal actions such as head turn, gaze, and pointing during its talk at such interactionally significant places (Yamazaki et al., 2008).

5. Developing museum and exhibition robot guides

Robotics engineers have observed that head motion is important for human–robot interaction, leading to the development of robots that move their head. These include Cog (Brooks et al., 1999), Robovie (Ishiguro et al., 2001), and Robita (Matsusaka et al., 2003). In particular, Sidner et al. (2005) have conducted experiments with a guide robot that explains things while moving its head. They found that when the robot turns its head and gazes towards the visitors during its explanation, the frequency of visitor head nodding increases. This suggests that human visitors display a heightened engagement towards a robot that moves its head during its explanation. We do not know, however, whether these head turns were timed at interactionally significant places, and whether this timing would affect visitor head nodding. Thus, we became interested in how visitors respond to a robot that coordinates its verbal and non-verbal actions around interactionally significant places, such as TRPs. To examine this, we conducted three experiments using the following data:

First experiment: A prototype museum guide robot explains Sachiko Kodama's "Morpho Tower" to participants at the Science Museum (Tokyo, Japan).
Second experiment: A humanoid robot (Robovie ver. 2) explains two posters to participants at a university laboratory (Saitama University, Japan).
Third experiment: The robot explains an "air plant" to participants at a university laboratory (Saitama University, Japan).

In the first experiment we used a prototype robot, whereas in the second and third experiments we used Robovie ver. 2. These robots are all autonomous.

5.1. First experiment

We conducted the first experiment at Sachiko Kodama's media art exhibition at the Science Museum in Tokyo. A prototype robot explains the "Morpho Tower" to visitors (Fig. 9) using a synthesized pre-recorded voice. We compared two modes: head turn and no head turn. In the first, the robot turns its head towards the participant at TRPs and key words. In the second, the robot does not turn its head at all. A total of 16 participants interacted with the robot in both modes. The analysis revealed that participants displayed non-verbal actions, in particular head motions such as turns and nods, more frequently in the head turn condition compared with the no head turn condition. We could not be certain, however, about the role of the timing of the robot's head movements. Therefore we conducted a second experiment in which we programmed the robot to turn its head at particular places in its talk in two different modes.

Fig. 9. First experiment.

5.2. Second experiment

In the second experiment we used the humanoid robot Robovie-R ver. 2 (Fig. 10). We programmed it to explain two posters in one of two modes: (1) systematic mode (S-mode), in which the robot turns its head towards the participant at interactionally significant places during its explanation (in this case, TRPs and key words), and (2) unsystematic mode (U-mode), in which the robot turns its head during its talk at places that are not interactionally significant (e.g. the middle of a TCU). Twelve participants participated in interactions in both modes.

Fig. 10. The robot explains two posters to a participant.

We categorized participants' responses towards the robot head turn into two types: (a) head nodding, and (b) gaze towards the robot, which we call 'mutual gaze' because the robot and participant gaze at each other.
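To make the contrast between the two modes concrete, the following minimal sketch (in Python) shows one way such head-turn scheduling could be expressed. It is our own illustration, not the actual Robovie control software: the Segment type, the durations, and the function names are hypothetical.

# Illustrative sketch (not the authors' implementation) of S-mode vs. U-mode
# head-turn scheduling. An explanation is modeled as a list of
# turn-constructional units (TCUs) with known audio durations.

from dataclasses import dataclass

@dataclass
class Segment:
    text: str        # one TCU of the recorded explanation
    duration: float  # playback time in seconds (hypothetical values)

def head_turn_times(segments, mode):
    """Return the times (in seconds) at which the robot starts a head turn.

    S-mode: at interactionally significant places, i.e. the completion of
    each TCU (a transition-relevance place, TRP).
    U-mode: at places that are not interactionally significant, e.g. the
    middle of a TCU.
    """
    times, t = [], 0.0
    for seg in segments:
        if mode == "S":
            times.append(t + seg.duration)      # at the TRP (TCU end)
        else:
            times.append(t + seg.duration / 2)  # mid-TCU
        t += seg.duration
    return times

explanation = [Segment("kore ga kansei ban de,", 2.4),
               Segment("kawara ga dekiru n desu ne.", 3.1)]
print(head_turn_times(explanation, "S"))  # head turns at TCU completions (TRPs)
print(head_turn_times(explanation, "U"))  # head turns mid-TCU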

As shown in Fig. 11, there is a significant difference in participant head nodding between the S-mode and the U-mode (p < 0.01, paired t-test).

Fig. 11. Participant head movement in S-mode and U-mode.

There was no significant difference in mutual gaze between the S-mode and U-mode (Fig. 11). However, as shown in Fig. 12, there was a strong difference between the S-mode and the U-mode in the ratio of participant responses in which head nodding and mutual gaze co-occurred (p < 0.01, paired t-test). That is, in comparison to the U-mode, in the S-mode nodding and gaze occurred together.

Fig. 12. Ratio of co-occurrence of head nodding and mutual gaze in S-mode and U-mode.
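Since each of the twelve participants experienced both modes, the comparison is within-subject, which is why a paired t-test is the appropriate test. A brief sketch of this computation follows; the per-participant counts are hypothetical placeholders, not the study's data.

# Paired t-test over per-participant response counts in the two modes.
# The counts below are hypothetical placeholders, not the article's data.
from scipy import stats

s_mode_nods = [5, 3, 4, 6, 2, 5, 4, 3, 5, 4, 6, 3]  # one count per participant
u_mode_nods = [1, 0, 2, 1, 0, 1, 2, 0, 1, 1, 0, 2]  # same participants, U-mode

t_stat, p_value = stats.ttest_rel(s_mode_nods, u_mode_nods)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")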

This result may be affected by the fact that the participants were simply imitating the robot's head turn, as also suggested in our previous experiments (Kuno et al., 2007; Sidner et al., 2005). When head nodding occurred together with mutual gaze, this indicated a strong display of recipiency of the robot's explanation.

When we examined these data qualitatively, we noticed three issues that had to be addressed:



Page 10: Author's personal copy - Osaka U

Author's personal copy

(1) Standing position: We programmed the robot to automatically turn its head towards the participant's standing position. However, in some sessions participants moved slightly to another position, yet the robot was not programmed to adjust its head accordingly. This could have been one of the reasons why participants' engagement decreased.

(2) Number of participants: The number of participants was relatively small, only 12.

(3) Synthesized voice: Participants reported that they could not clearly distinguish sentence endings, which we believed was due to the quality of the synthesized voice. As discussed above, guide explanations are often composed of sentential units. Tanaka (1999:24) suggests that Japanese speakers "have at their disposal two devices, specific grammatical turn-elements and marked prosodic features" for helping them predict the TRP. In this experiment, while grammatical turn-elements were present, there was a lack of marked prosodic features in the synthesized voice. Thus, it may have been difficult for participants to project when an utterance would come to an end.

5.3. Third experiment

Following these two experiments, we conducted a third experiment (Yamazaki et al., 2008) in which we made several adjustments to the robot (Robovie-R ver. 2). First, instead of synthesized speech, we recorded the voice of a human female. We programmed the robot to explain a poster of a sub-tropical "air plant" (hakarame) (Fig. 13). Second, we programmed the robot to turn its head during its talk in one of two modes in order to examine the effectiveness of the robot's head turn: (1) S-mode, in which the robot turns its head at TRPs (sentence endings) [3], and (2) U-mode, in which the robot turns its head at places that are not interactionally significant (e.g. the middle of a turn-constructional unit). Third, in relation to the experimental scene, we asked each participant to stand at a fixed point on the floor, as opposed to allowing them to choose where to stand [4]. We conducted the experiments with 46 participants (male: 25, female: 21) and asked each one to undergo either the S-mode or the U-mode (24 in the S-mode and 22 in the U-mode).

Fig. 13. The robot explains the 'air plant' to a participant.

We recorded and transcribed the interactions in detail. In particular, we focused on places in the talk where the robot turns its head, and examined how participants responded, as well as the timing of these responses.

Fig. 14 is a transcript of the robot's talk and head turns in the S-mode and U-mode. The first line is the robot's explanation in Japanese. The second line is the place where the robot turns its head in the S-mode. The third line is the robot's head turn in the U-mode. In the second and third lines, we use a hyphen (-) and the letter V. One or more hyphens before a V indicate that the robot is turning its head towards the visitor. The place where the hyphen begins indicates the onset of the robot's head turn. The letter V means that the robot stops its head turn and gazes towards the participant. A hyphen after V means that the robot turns its head back towards the poster. The fourth line is a word-by-word gloss. The fifth line is an English translation.

Fig. 14. The robot explains a poster on 'air plant' to a participant.
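As a small illustration of this notation, the toy parser below recovers the three phases from one annotation line. The column-alignment assumption and the function are our own, for exposition only, not part of the transcript system itself.

# Toy parser for the head-turn notation described above (Fig. 14).
# One or more '-' before 'V': robot is turning its head towards the visitor;
# 'V': robot holds its gaze on the participant;
# '-' after 'V': robot turns its head back towards the poster.
# Column positions are assumed to align with the robot's talk line.

def parse_head_turn(annotation: str):
    v = annotation.index("V")
    onset = annotation.index("-")   # first hyphen: head-turn onset
    back = annotation.find("-", v)  # first hyphen after V: turning back
    return {"turn_onset_col": onset, "gaze_col": v,
            "turn_back_col": back if back != -1 else None}

print(parse_head_turn("   ---V--   "))
# {'turn_onset_col': 3, 'gaze_col': 6, 'turn_back_col': 7}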

Our analysis of participants' responses starts from line 4 of the robot's explanation. Since we wanted to compare the S-mode and U-mode, we did not analyze participant responses at the first head turn, as the robot turns its head at the same place in the S-mode and U-mode. In the S-mode the robot turns its head from the poster to the participant a total of four times, and in the U-mode a total of five times.

In the next section, we analyze data from the third experiment in greater detail.

6. Participant responses towards robot in third experiment

In order to explore the ways participants respond towards the robot, we analyzed data from the third experiment by (1) transcribing every participant response, and (2) conducting a quantitative summary of the different kinds of responses. In particular, we categorized participant non-verbal responses as gaze and head nodding.

6.1. Gaze and head nodding

Here we examine how participants responded non-verbally in relation to the robot's head turns. We examine this in both the S-mode and U-mode, and categorize participants' responses as follows:

(1) +gaze/+nod (gaze with head nodding)
(2) +gaze/-nod (gaze without head nodding)
(3) -gaze/+nod (head nodding without gaze)
(4) -gaze/-nod (no gaze, no head nodding)
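The four labels are simply the cross product of two binary observations; a trivial helper (our own, for illustration) makes the coding scheme explicit:

# Map an observed response (gaze?, nod?) to the four categories above.
def categorize(gaze: bool, nod: bool) -> str:
    return f"{'+' if gaze else '-'}gaze/{'+' if nod else '-'}nod"

assert categorize(True, True) == "+gaze/+nod"
assert categorize(False, False) == "-gaze/-nod"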

These results are summarized in Fig. 15 for the S-mode (a) and U-mode (b).

Fig. 15. Ratio of participant responses.

Fig. 15(a) indicates that in the S-mode +gaze/+nod and +gaze/-nod occur frequently, whereas -gaze/+nod and -gaze/-nod do not. This suggests that in the S-mode participants often responded to the robot's head and gaze turn with their own head and gaze turn, and on some occasions nodded their head (while gazing towards either the robot or the poster).

Fig. 15(b) indicates that in the U-mode +gaze/+nod did not occur frequently, whereas -gaze/-nod was frequent.

We then examined the number of participant responses towards each of the robot's head turns in both the S-mode and U-mode. We quantified the number of head nods in relation to gaze, as shown in Fig. 16.

Fig. 16. Number of head nods in relation to gaze at each head turn. (a) S-mode, (b) U-mode.

In the S-mode, participants consistently nod and gaze in response to each robot head turn. These head nods are occasionally accompanied by gaze towards the robot (Fig. 16(a)). In the U-mode, in contrast, only one participant nodded, which occurred at the first and fourth robot head turns (Fig. 16(b)).

In the S-mode, we noted that the robot turns its head at TRPs. As discussed earlier, a TRP is a place in talk where turn-transfer or speaker change may potentially occur (Tanaka, 1999:27), and the next speaker can take the turn (see Sacks et al., 1974). While in our experiment we do not expect speaker change to occur, we have seen that the TRP is nevertheless a place at which participants often display some kind of response. Furthermore, the robot's head turn makes the subsequent participant (hearer) action a focal point in interaction for displaying recipiency, engagement, and understanding. It may also display that the participant is monitoring and checking the speaker's (robot's) action.

[3] While we realize that TRPs do not only occur at the end of a sentence, in the case of guide explanations, as we observed earlier, speech is often produced in sentential units.

[4] In recent experiments at an actual museum, we have employed a face recognition system that enables the robot to turn its head towards the visitor even when the visitor moves locations.




However, robot gaze should be understood in relation to the sequence; otherwise the participant may not respond to the robot (Fig. 16(b)). When the robot's gaze is produced at an interactionally significant place (e.g. a TRP) in the sequence, participants typically respond by nodding, or nodding and gazing, towards the robot (Fig. 16(a)).

In section 6.2, we examine the timing of participants' head movements in relation to robot head turns, and show that participant responses to robot head turns relate to both the turn-constructional unit and the projectability of its completion point.




6.2. Timing

We also examined the timing of participant responses when the robot turns its head. We categorized the timing of such responses into three categories: synchronized response, delayed response, and non-gaze. We examined these in both the S-mode and the U-mode. When a participant responds (e.g. nods, turns his or her head) to the robot from Head turn onset until Pause end, we categorize this response as a 'synchronized response'. When a participant responds to the robot from Pause end until Head turn termination, we categorize this as a 'delayed response'. When a participant does not respond, we categorize this as 'non-gaze'. Fig. 17 shows how we define the first two terms.

Fig. 17. Participant synchronized and delayed response to robot head turn.

The robot starts to turn its head towards a participant at Head turn onset; this turn takes 1.25 s. From Pause start the robot keeps its head towards the participant for 0.2 s. After Pause end, the robot starts to turn its head back towards the poster, which takes 2.5 s until Head turn termination.
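Given these phase durations, the three-way classification can be written down directly. The function below is our own sketch, with response times measured relative to Head turn onset:

# Classify a participant response relative to one robot head turn, following
# the definitions above. Times are in seconds from Head turn onset;
# None means no response was observed.
from typing import Optional

TURN_TOWARDS = 1.25  # robot turning its head towards the participant
PAUSE = 0.20         # robot holding its gaze on the participant
TURN_BACK = 2.50     # robot turning its head back to the poster

def classify_response(t: Optional[float]) -> str:
    if t is None:
        return "non-gaze"
    if t <= TURN_TOWARDS + PAUSE:              # up to Pause end
        return "synchronized response"
    if t <= TURN_TOWARDS + PAUSE + TURN_BACK:  # up to Head turn termination
        return "delayed response"
    return "non-gaze"

print(classify_response(0.9))   # synchronized response
print(classify_response(2.8))   # delayed response
print(classify_response(None))  # non-gaze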

First, as illustrated in Fig. 18(a), in the S-mode participants consistently show synchronized responses over the course of the experiment (from head turns 1 through 4). Participants often turned their head at the same time as the robot. From closer examination of the videotapes, we observed that some participants occasionally began to turn their head even before the robot began to turn its head towards them. Second, as illustrated in Fig. 18(b), participants' responses in the U-mode are quite different from those in the S-mode. In particular, participants do not consistently show synchronized responses over the course of the experiment. Moreover, non-gaze occurs quite frequently. While some participants produced non-verbal responses, they typically did so in a delayed way. In several of these cases, participants rapidly looked at the robot in a jerky way.

Fig. 18. Ratio of participant response timing towards robot head turn. (a) S-mode, (b) U-mode.

We suggest that synchronized responses occurred in part because participants anticipated the robot turning its head at TRPs. We further suggest that this is related to grammatical features of the Japanese language.

In order to examine this, we compare participant responses at lines 4 and 5 of the S-mode, which have distinctive features of Japanese grammar in relation to turn construction. In line 4, there is SOV word order, and the utterance ends with the final particle ne, a particle that invites the recipient's alignment (Tanaka, 1999). Tanaka (1999) further points out that Japanese is a "delayed projection" language (compared to English) because it is a postpositional language with SOV word order.

In line 5, the sentence begins with "Why it is called an 'air plant'" (naze hakarame to iu ka to iimasu to), and ends with the explanation, "because buds come out" (me ga detekuru kara nan desu). This sentence has a prepositional-like feature, as in English grammar. In this case, a participant is likely able to project when the TRP will arrive after hearing the causal marker 'because' (kara).

We examine the timing of responses of participants at lines 4 and 5 in the S-mode. The results are shown in Fig. 19.

Fig. 19. Timing of participant responses at lines 4 and 5.

As shown in Fig. 19, there is a small difference between the number of synchronized responses at lines 4 and 5. The number of synchronized responses is 18 in line 4 and 17 in line 5. The number of delayed responses is 5 in line 4 and even fewer in line 5. In contrast, the number of non-gaze responses is lower in line 4. Since these differences are small, we cannot determine whether the delayed responses in line 4 are due to the postpositional construction. However, we can say that regardless of the design of the turn, whether prepositional or postpositional, participants typically respond with precise timing towards the robot's head turn.

We also counted the number of participant nods and gazes at lines 4 and 5 in the S-mode. As shown in Fig. 20, the number of instances of gaze and nods at line 4 is smaller than at line 5; at line 5, the number of nods is larger. However, the difference is marginal. This shows that although lines 4 and 5 differ in distinctive grammatical features, participants respond non-verbally, and in many cases do so with precise timing at TRPs.

Fig. 20. Participants' gaze and nods at lines 4 and 5.

7. Conclusion

In this article we have examined human–human and human–robot interaction in the settings of museums and exhibitions. Our findings suggest that robot–human communication may be enhanced by programming a robot to coordinate verbal and non-verbal actions at particular places in its talk.

The coordination we addressed here was mainly head turn and gaze, though we also discussed pointing to an extent. Our findings show that when the guide robot turns its head at interactionally significant places in its talk, participants respond with head turn, gaze, and head nod, and may do so with precise timing. One of these interactionally significant places is the TRP, the place at which the hearer's action as a next speaker becomes interactionally relevant. Since the TCU completion point is projectable, a hearer can anticipate the place of the TRP on the basis of the on-going talk. Participants project the robot's head turn not only by hearing the robot's explanation but also by monitoring the robot's non-verbal actions. Since participants monitor the robot's orientation and listen to the explanation, they nod and respond with precise timing. In human face-to-face interaction, participants monitor both the other's utterance and bodily action in order to display recipiency. In human–robot interaction, projection supported by the coordination of verbal and non-verbal actions is an important resource for displaying recipiency.

Although our findings are based on Japanese, we believe that programming a robot to coordinate verbal and non-verbal actions at interactionally significant places in its talk can be transferred to other languages. Towards this end, we hope to develop a robot that speaks other languages (e.g. English and German) together with researchers from those countries.

While we have examined how the robot's head turn is related to TCUs, there are many issues to consider in relation to verbal and non-verbal (bodily) actions. For example, as we mentioned in section 4, a human guide uses hand gestures while using deictic words and key words. In order to develop our robot more fully, we have to consider other places where coordination can occur.

Moreover, as human guides often explain exhibits to a tour group or other larger audiences, a guide robot should also be able to interact with multiple visitors. We are currently developing a robot that interacts with multiple participants and uses hand gestures, but this needs to be improved through adequate coordination of verbal and non-verbal actions.

Acknowledgements

This work was supported in part by the Ministry of Internal Affairs and Communications under SCOPE, Grants-in-Aid for Scientific Research (KAKENHI 17530373, 19203025, 19024013, 19653043, 21013009, 21300316), and the JSPS New Research Initiatives for Humanities and Social Sciences. We thank Michie Kawashima, Koji Mitsuhashi, Satomi Kuroshima, Yasuko Suga, Shiro Kashimura, Mayumi Nakagawa, Mai Okada, Hideaki Kuzuoka, Marjorie Harness Goodwin, Charles Goodwin, and the two anonymous reviewers.

Fig. 20. Participants’ gaze and nods at lines 4 and 5.



Appendix A. Abbreviations used in the interlinear gloss

COP   copula
FP    final particle
LK    linking nominal
N     nominaliser
NEG   negative
O     object marker
QP    question particle
QT    quotative particle
SUB   nominative particle
TOP   topic particle

[     overlapped actions (speech and non-verbal action) in contiguous lines
:     sound stretch
?     rising intonation

References

Brooks, Rodney, Breazeal, Cynthia, Marjanovic, Matthew, Scassellati, Brian, Williamson, Matthew, 1999. The Cog Project: building a humanoid robot. In: Nehaniv, Christopher (Ed.), Computation for Metaphors, Analogy and Agents. Springer, Berlin, Heidelberg, pp. 52–87.

Fischer, Kerstin, 2006. What Computer Talk is and Isn't: Human–Computer Conversation as Intercultural Communication. Linguistics – Computational Linguistics, vol. 17. AQ-Verlag, Saarbrücken.

Goodwin, Charles, 2003. Pointing as situated practice. In: Kita, Sotaro (Ed.), Pointing: Where Language, Culture, and Cognition Meet. Erlbaum, Mahwah, NJ, pp. 217–242.

Hutchins, Edwin, Palen, Leysia, 1997. Constructing meaning from space, gesture and speech. In: Resnick, Lauren B., Säljö, Roger, Pontecorvo, Clotilde, Burge, Barbara (Eds.), Discourse, Tools, and Reasoning: Essays on Situated Cognition. Springer, Berlin, pp. 23–40.

Ishiguro, Hiroshi, Ono, Tetsuo, Imai, Michita, Maeda, Takeshi, Kanda, Takayuki, Nakatsu, Ryohei, 2001. Robovie: an interactive humanoid robot. Industrial Robot: An International Journal 28 (6), 498–503.

Kendon, Adam, 2004. Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge.

Kuno, Yoshinori, Sadazuka, Kazuhisa, Kawashima, Michie, Yamazaki, Keiichi, Yamazaki, Akiko, Kuzuoka, Hideaki, 2007. Museum guide robot based on sociological interaction analysis. In: Proceedings of CHI '07 (Human Factors in Computing Systems). ACM, pp. 1191–1194.

Lerner, Gene, 2003. Selecting next speaker: the context-sensitive operation of a context-free organization. Language in Society 32, 177–201.

Matsusaka, Yosuke, Tojo, Tsuyoshi, Kobayashi, Tetsunori, 2003. Conversation robot participating in group conversation. IEICE Transactions on Information and Systems E86-D (1), 26–36.

Sacks, Harvey, Schegloff, Emanuel, Jefferson, Gail, 1974. A simplest systematics for the organization of turn-taking for conversation. Language 50, 696–735.

Sidner, Candace, Lee, Christopher, Kidd, Cory, Lesh, Neal, Rich, Charles, 2005. Explorations in engagement for humans and robots. Artificial Intelligence 166 (1–2), 140–164.

Streeck, Jürgen, 1995. On projection. In: Goody, Esther N. (Ed.), Social Intelligence and Interaction: Expressions and Implications of the Social Bias in Human Intelligence. Cambridge University Press, Cambridge, pp. 84–110.

Tanaka, Hiroko, 1999. Turn-Taking in Japanese Conversation: A Study in Grammar and Interaction. John Benjamins, Amsterdam.

Yamazaki, Akiko, Yamazaki, Keiichi, Kuno, Yoshinori, Burdelski, Matthew, Kawashima, Michie, Kuzuoka, Hideaki, 2008. Precision timing in human–robot interaction: coordination of head movement and utterance. In: Proceedings of CHI '08 (Human Factors in Computing Systems). ACM, pp. 131–139.

Akiko Yamazaki is associate professor of sociology and human–computer interaction in the Department of Media at Tokyo University of Technology. Using ethnomethodology and interaction analysis, she analyzes human interactions in public spaces, as well as human–computer and human–robot interactions. She is especially interested in human interaction and human–robot interactions at museums and nursery schools.

Keiichi Yamazaki is professor of sociology at Saitama University. His main interests are ethnomethodology, conversation analysis, CSCW, and human–robot interaction.

Matthew Burdelski is a visiting assistant professor and Mellon postdoctoral fellow at Swarthmore College in the Department of Modern Languages and Literatures. He conducts research on Japanese interaction involving adults and children. He teaches courses in Japanese language, society, and popular culture.

Yoshinori Kuno received the Ph.D. degree in electronics engineering from the University of Tokyo and joined Toshiba Corporation in 1982. From 1987 to 1988, he was a Visiting Scientist at Carnegie Mellon University. In 1993, he moved to Osaka University as an associate professor in the Department of Computer-Controlled Mechanical Systems. Since 2000, he has been a professor in the Department of Information and Computer Sciences, Saitama University. His research interests include computer vision and human–robot interaction.

Mihoko Fukushima is currently a doctoral student in sociology at the University of Essex. Using conversation analysis and 'grammar and interaction', she is conducting research on the use of various linguistic resources, such as speech-style shifts, dialects, and humor, for the display of gender, status, and social relationality in Japanese conversational interaction.
