Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues. Ryu Iida, Masaaki Yasuhara, Takenobu Tokunaga, Tokyo Institute of Technology. IJCNLP 2011 (Nov 9, 2011)




Page 1:

Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues

Ryu Iida, Masaaki Yasuhara, Takenobu Tokunaga
Tokyo Institute of Technology

IJCNLP 2011 (Nov 9 2011)

Page 2:

Research background

- Typical coreference/anaphora resolution
  - Researchers have tackled the problems provided by the MUC, ACE and CoNLL shared tasks (a.k.a. OntoNotes)
  - Mainly focused on the linguistic aspect of the reference function
- Multi-modal research community (Byron, 2005; Prasov and Chai, 2008; Prasov and Chai, 2010; Schütte et al., 2010; Iida et al., 2010)
  - Essential for human-computer interaction
  - Identifying referents of referring expressions in a static scene or a situated world, taking extra-linguistic clues into account

Page 3:

Multi-modal reference resolution

[Figure: an example puzzle scene combining three sources of information]
- dialogue history: "move the triangle to the left" / "Rotate the triangle at top right 60 degrees clockwise" / "All right… done it." / "O.K."
- action history: … piece 1: move (X:230, Y:150); piece 7: move (X:311, Y:510); piece 3: rotate 60°
- eye-gaze

Page 4:

Aim

- Integrate several types of multi-modal information into a machine learning-based reference resolution model
- Investigate which kinds of clues are effective for multi-modal reference resolution

Page 5:

Multi-modal problem setting: related work

- 3D virtual world (Byron 2005; Stoia et al. 2008)
  - e.g. participants controlled an avatar in a virtual world to explore hidden treasures
  - Frequently occurring scene updates
  - Referring expressions are relatively skewed towards exophoric cases
- Static scene (Dale 1992)
  - The centrality and size of each object in the computer display are fixed throughout the dialogues
  - Changes in the visual salience of objects are not observed

Page 6:

Evaluation data set creation

- REX-J corpus (Spanger et al. 2010)
  - Dialogues and transcripts of collaborative work (solving Tangram puzzles) by two Japanese participants
  - The puzzle-solving task was designed to require frequent use of both anaphoric and exophoric referring expressions

Page 7:

Setting for collecting data

[Figure: the solver and the operator are separated by a shield screen; each sees the working area, while the goal shape is shown only to the solver (not available to the operator).]

Page 8:

Collecting eye-gaze data

- Recruited 18 Japanese graduate students and split them into 9 pairs
  - All pairs knew each other previously and were of the same sex and approximately the same age
  - Each pair was instructed to solve 4 different Tangram puzzles
- Used the Tobii T60 Eye Tracker, sampling at 60 Hz with 0.5 degrees of accuracy, to record the users' eye gaze
  - 5 dialogues in which the tracking results contained more than 40% errors were removed

Page 9:

Annotating referring expressions

- Conducted using a multimedia annotation tool, ELAN
- The annotator manually detects a referring expression and then selects its referent out of the possible puzzle pieces shown on the computer display
- Total number of annotated referring expressions: 1,462 instances in 27 dialogues
  - 1,192 instances in solver's utterances (81.5%)
  - 270 instances in operator's utterances (18.5%)

Page 10:

Multi-modal reference resolution: base model

- Ranking candidate referents is important for better accuracy (Iida et al. 2003; Yang et al. 2003; Denis & Baldridge 2008)
- Apply the Ranking SVM algorithm (Joachims, 2002)
  - Learns a weight vector to rank candidates for a given partial ranking of the referents
- Training instances
  - To define the partial ranking of candidate referents, the referent referred to by a given referring expression is simply ranked first and all other candidate referents second (see the sketch below)
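The following is a minimal sketch, in Python, of how such partial-ranking training instances could be written out in the svmlight format expected by SVM-rank (Joachims, 2002). The dictionary layout and field names are illustrative assumptions, not the authors' actual preprocessing code.

```python
# Minimal sketch (not the authors' code): each referring expression becomes one
# query (qid); its true referent gets the higher target value (ranked first) and
# all other candidate referents the lower one (ranked second).

def to_svmrank_lines(referring_expressions):
    """Yield one svmlight-format line per candidate referent."""
    for qid, rex in enumerate(referring_expressions, start=1):
        for cand in rex["candidates"]:
            # Higher target value = preferred (ranked above) within the same qid.
            target = 2 if cand["is_referent"] else 1
            feats = " ".join(
                f"{i}:{v}" for i, v in enumerate(cand["features"], start=1) if v != 0
            )
            yield f"{target} qid:{qid} {feats}"


if __name__ == "__main__":
    example = [{"candidates": [
        {"is_referent": True,  "features": [1.0, 0.0, 0.3]},
        {"is_referent": False, "features": [0.0, 1.0, 0.7]},
    ]}]
    print("\n".join(to_svmrank_lines(example)))
```

The resulting file can be fed directly to the SVM-rank trainer, which learns the weight vector from the pairwise preferences implied by the target values within each qid.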

Page 11:

Feature set

1. Linguistic features: Ling (Iida et al. 2010): 10 features
   - Capture the linguistic salience of each referent based on the discourse history
2. Task-specific features: TaskSp (Iida et al. 2010): 12 features
   - Consider the visual salience based on the recent movements of the mouse cursor and the pieces recently manipulated by the operator
3. Eye-gaze features: Gaze (proposed): 14 features

Page 12:

Eye gaze as clues of reference function

- Eye gaze
  - Saccades: quick, simultaneous movements of both eyes in the same direction
  - Eye fixations: maintaining the visual gaze on a single location
- The direction of eye gaze directly reflects the focus of attention (Richardson et al., 2007)
- Used eye fixations as clues for identifying the pieces focused on
- Separating saccades and eye fixations: dispersion-threshold identification (Salvucci and Anderson, 2001), sketched below
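A minimal sketch of the dispersion-threshold identification idea named above, assuming gaze samples arrive as (time, x, y) tuples; the threshold values and function names are illustrative assumptions, not the settings used in the paper.

```python
# Minimal sketch of dispersion-threshold (I-DT) fixation identification:
# samples that stay tightly clustered for long enough form a fixation,
# everything else is treated as saccade.

def dispersion(window):
    xs = [x for _, x, _ in window]
    ys = [y for _, _, y in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def idt_fixations(samples, max_dispersion=30.0, min_duration=100):
    """samples: gaze samples as (t_msec, x, y), sorted by time.
    Returns fixations as (start_t, end_t, centroid_x, centroid_y)."""
    fixations = []
    i = 0
    while i < len(samples):
        # Start with a window that spans at least min_duration msec.
        j = i
        while j < len(samples) and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j == len(samples):
            break
        if dispersion(samples[i:j + 1]) <= max_dispersion:
            # Keep extending the window while the points stay tightly clustered.
            while j + 1 < len(samples) and dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            xs = [x for _, x, _ in window]
            ys = [y for _, _, y in window]
            fixations.append((window[0][0], window[-1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j + 1  # consume the samples belonging to this fixation
        else:
            i += 1     # first sample is part of a saccade; slide the window
    return fixations
```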

Page 13:

Eye-gaze features

[Figure: timeline of the utterance "First you need to move the smallest triangle to the left", with pieces a-g on the display and fixations on piece_a and piece_b falling in the window [t − T, t], where T = 1500 msec (Prasov and Chai 2010). The features capture how frequently or how long the speaker fixates on each piece; a small sketch of such window-based features follows.]
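As an illustration of window-based gaze features of the kind described here (and named Gaze9/Gaze10 on the feature-weight slides below), the following is a minimal sketch assuming that fixations have already been mapped to the pieces they land on; the function and field names are hypothetical.

```python
# Minimal sketch (assumed representation, not the paper's feature extractor):
# per-piece gaze features over the window [t - T, t] before a referring
# expression uttered at time t.

T = 1500  # msec, window length following Prasov and Chai (2010)

def gaze_features(fixations, pieces, t, window=T):
    """fixations: (start_msec, end_msec, piece_id) intervals, each fixation
    already mapped to the piece it lands on.  Returns {piece_id: features}."""
    window_start = t - window
    fixation_time = {p: 0 for p in pieces}
    for start, end, piece in fixations:
        overlap = min(end, t) - max(start, window_start)  # overlap with [t - T, t]
        if piece in fixation_time and overlap > 0:
            fixation_time[piece] += overlap
    longest = max(fixation_time.values(), default=0)
    return {
        p: {
            "fixated_in_window": fixation_time[p] > 0,                   # cf. Gaze10
            "fixation_time": fixation_time[p],                           # how long
            "longest_fixation": fixation_time[p] > 0
                                and fixation_time[p] == longest,         # cf. Gaze9
        }
        for p in pieces
    }
```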

Page 14:

Empirical evaluation

- Compared models with different combinations of the three types of features
- Conducted 5-fold cross-validation
- Proposed model with model separation (Iida et al. 2010)
  - The referential behaviour of pronouns is completely different from that of non-pronouns
  - Two reference resolution models are created separately:
    - pronoun model: identifies the referent of a given pronoun
    - non-pronoun model: identifies the referent of all other expressions (e.g. NPs)

Page 15:

Results of (non-)pronouns (accuracy, %)

model              pronoun   non-pronoun
Ling                 56.0      65.4
Gaze                 56.7      48.0
TaskSp               79.2      21.1
Ling+Gaze            66.5      75.7
Ling+TaskSp          79.0      67.1
TaskSp+Gaze          78.0      48.4
Ling+TaskSp+Gaze     78.7      76.0

Page 16:

Overall results

model              accuracy (%)
Ling                 61.8
Gaze                 51.2
TaskSp               42.8
Ling+Gaze            72.3
Ling+TaskSp          71.5
TaskSp+Gaze          59.5
Ling+TaskSp+Gaze     77.0

Page 17:

Investigation of the significance of features

Calculate the weight of each feature f according to the following formula:

weight(f) = \sum_{x \in SV} \alpha_x \, I(f, x)

where SV is the set of support vectors in a ranker, \alpha_x is the weight of support vector x, and I(f, x) is a function that returns 1 if feature f occurs in x (and 0 otherwise). A small code sketch of this computation follows.
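A minimal sketch, under the assumption that the support vectors of the (binary-feature) ranker are available as coefficient/feature-set pairs, e.g. parsed from an SVM-rank model file; the names are illustrative, not the authors' implementation.

```python
# Minimal sketch of the weight computation above:
# weight(f) = sum over support vectors x of alpha_x * I(f, x).

def feature_weights(support_vectors):
    """support_vectors: iterable of (alpha_x, feature_ids) pairs, where
    feature_ids is the set of features occurring in support vector x."""
    weights = {}
    for alpha, feature_ids in support_vectors:
        for f in feature_ids:  # I(f, x) = 1 exactly for the features in x
            weights[f] = weights.get(f, 0.0) + alpha
    return weights

# e.g. feature_weights([(0.8, {"Ling6", "Gaze10"}), (-0.3, {"Gaze10"})])
#      -> {"Ling6": 0.8, "Gaze10": 0.5}
```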

Page 18:

Weights of features in each model

          pronoun model         non-pronoun model
rank      feature    weight     feature    weight
1         TaskSp1    0.4744     Ling6      0.6149
2         TaskSp3    0.2684     Gaze10     0.1566
3         Ling1      0.2298     Gaze9      0.1566
4         TaskSp7    0.1929     Gaze7      0.1255
5         TaskSp9    0.1605     Gaze11     0.1225
6         Gaze10     0.1547     Gaze14     0.1134
7         Gaze9      0.1547     Gaze13     0.1134
8         Ling6      0.1442     Gaze12     0.1026
9         Gaze7      0.1267     Ling2      0.1014
10        Ling2      0.1164     Gaze1      0.0750

TaskSp1: the mouse cursor was over a piece at the beginning of uttering the referring expression

TaskSp3: no more than 10 sec has elapsed since the mouse cursor was over a piece

Page 19:

Weights of features in each model (continued)

(Same feature-weight table as the previous slide.)

Ling6: the shape attributes of a piece are compatible with the attributes of the referring expression

Gaze10: there exists a fixation on the piece in the time period [t − T, t]

Gaze9: the fixation time on the piece in the time period [t − T, t] is the longest out of all pieces

Page 20:

Summary

- Investigated the impact of multi-modal information on reference resolution in Japanese situated dialogues
- The results demonstrate that:
  - The referents of pronouns rely on the visual focus of attention, such as that indicated by moving the mouse cursor
  - Non-pronouns are strongly related to eye fixations on their referents
  - Integrating these two types of multi-modal information with linguistic information contributes to increasing the accuracy of reference resolution

Page 21:

Future work

- Further data collection is needed
  - All objects in the Tangram puzzle (i.e. puzzle pieces) have nearly the same size
  - This excludes the factor that a relatively larger object occupying the computer display gains higher prominence over smaller objects
- Zero anaphors in utterances need to be annotated
  - They are used frequently in Japanese