INVESTIGATING REAL-TIME REFERENCE RESOLUTION IN SITUATED DIALOGUE FOR COMPLEX PROBLEM SOLVING

By

XIAOLONG LI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018

© 2018 Xiaolong Li
I dedicate this dissertation to my father Fuhai Li. I wish he could see this.
ACKNOWLEDGMENTS
I would like to express my sincere appreciation to my advisor Dr. Kristy Boyer for her continuous guidance, support, and friendship throughout my Ph.D. study. I also would like to thank my LearnDialogue colleagues for their generous help and support. In particular, I would like to thank Fernando Rodríguez, Jennifer Tsan, and Lydia Pezzullo for their help with document editing, Joseph Wiggins for data annotation, and Mickey Vellukunnel, Mehmet Celepkolu, and Timothy Brown for organizing studies. The friendly and supportive LearnDialogue culture made my Ph.D. study much easier. I also want to thank my family, especially my wife Runqing Wang, for their unconditional support.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 RELATED WORK
  2.1 Coreference Resolution
  2.2 Reference Resolution in Situated Dialogue
  2.3 Summary

3 CORPUS
  3.1 Data Collection
  3.2 Annotation

4 ONLINE REFERRING EXPRESSION EXTRACTION
  4.1 Part-of-speech Tagging for Domain-specific Language
    4.1.1 Approach
    4.1.2 Experiments and Results
  4.2 Noun Phrase Chunking in Tutorial Dialogue
  4.3 Discussion

5 SEMANTIC INTERPRETATION OF REFERRING EXPRESSIONS
  5.1 Semantic Interpretation as Sequence Labeling
    5.1.1 Noun Phrases in Domain Language
    5.1.2 Description Vector
    5.1.3 Joint Segmentation and Labeling
    5.1.4 Features
  5.2 Experiments and Results

6 REFERENCE RESOLUTION FOR SITUATED DIALOGUE SYSTEM
  6.1 Reference Resolution in a Situated Environment
  6.2 Referring Expression Semantic Interpretation
  6.3 Generating a List of Candidate Referents
  6.4 Ranking-based Classification
  6.5 Experiments and Results
    6.5.1 Semantic Parsing
    6.5.2 Candidate Referent Generation
    6.5.3 Identifying Most Likely Referent

7 TUTORIAL DIALOGUE SYSTEM FOR JAVA PROGRAMMING WITH SUPERVISED REFERENCE RESOLUTION
  7.1 User Interface
  7.2 System Functionalities
  7.3 Architecture of the Dialogue Agent
  7.4 Natural Language Understanding Module
    7.4.1 Reference Resolution
    7.4.2 Dialogue Act Classification
    7.4.3 Topic Classification
  7.5 Dialogue Manager
  7.6 Knowledge Base
  7.7 System Utterance Generation

8 EVALUATION OF THE DIALOGUE SYSTEM
  8.1 Proposed Hypotheses
  8.2 User Study
    8.2.1 Participants
    8.2.2 Java Programming Task for the Study
    8.2.3 Procedure
    8.2.4 Data Collection
  8.3 System Usability Evaluation
  8.4 User Engagement Evaluation
  8.5 Online Reference Resolution Evaluation in Tutorial Dialogue Systems

9 DISCUSSION
  9.1 Null Results
  9.2 Data-driven Approach in Building Dialogue Systems
  9.3 Understanding Users’ Java Program - A Challenge in Building Dialogue Systems for Java Programming

10 CONCLUSION
  10.1 Hypothesis Revisited
  10.2 Limitations
  10.3 Future Work

APPENDIX

A PRE-SURVEY
B POST-SURVEY

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

1-1 An excerpt dialogue between a user and the dialogue system.
3-1 Semantic labels of referring expressions.
4-1 Results of baseline tagger (CRF trained on source-domain corpus), Stanford tagger, and our approach (CRF trained on generated target-domain corpus).
4-2 Noun phrase chunking results.
4-3 The features used for noun phrase chunking.
5-1 Semantic labeling accuracy.
6-1 Algorithm to select candidates using learned semantics.
6-2 Features used for segmentation and labeling.
6-3 Reference resolution results.
6-4 Reference resolution results with gold semantic labels.
7-1 Dialogue act set.
7-2 Topics recognized by the topic classifier.
7-3 Sample system response utterances.
8-1 An excerpt dialogue between a user and the Virtual TA.
8-2 An example user action saved in the database.
8-3 An example reference resolution event saved in the database.
8-4 A false positive example of referring expression identification.
8-5 A false negative example of referring expression identification.
9-1 A comparison between human-computer dialogues and human-human dialogues.
A-1 Complete pre-survey results for students who used System Li.
A-2 Complete pre-survey results for students who used System Comparison.
B-1 Complete post-survey results for users who used System Li.
B-2 Complete post-survey results for users who used System Comparison.
LIST OF FIGURES

1-1 Excerpt of tutorial dialogue illustrating reference resolution. Referring expressions are shown in bold.
1-2 Pipeline of online reference resolution in a situated dialogue.
2-1 Relationship between accessibility and referring expression forms.
2-2 Coreference relation example diagram.
2-3 Bayesian network for reference resolution.
2-4 Identifying the most likely referent using the word-as-classifier approach.
3-1 The interface of Ripple, a tutorial dialogue system for Java programming. It includes two windows: a window (on the left) to display the student’s Java code and a window (on the right) for textual messages between student and tutor.
4-1 Steps for referring expression extraction.
4-2 Example of target sentence generation.
5-1 A parse of the outer for loop from the Stanford Parser.
5-2 Segmentation and semantic linking of NP “a 2 dimensional array”.
5-3 Dependency structure of “a 2 dimensional array”.
6-1 Semantic interpretation of referring expressions.
7-1 Architecture of the tutorial dialogue system.
7-2 User interface of the dialogue system.
7-3 Architecture of the dialogue system.
7-4 User intention identification example.
7-5 Structure of the programming task.
8-1 A short instruction with the task description.
8-2 A short instruction with the task description.
8-3 System usability score interpretation.
8-4 Reference resolution process in the dialogue system.
A-1 Pre-survey.
A-2 Pre-survey.
A-3 Pre-survey.
B-1 Post-survey.
B-2 Post-survey.
B-3 Post-survey.
B-4 Post-survey.
B-5 Post-survey.
B-6 Post-survey.
B-7 Post-survey.
B-8 Post-survey.
B-9 Post-survey.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

INVESTIGATING REAL-TIME REFERENCE RESOLUTION IN SITUATED DIALOGUE FOR COMPLEX PROBLEM SOLVING

By

Xiaolong Li

August 2018

Chair: Kristy Elizabeth Boyer
Major: Computer Science
A situated dialogue is embedded in a situated environment, where domain-specific
task completion is usually a central activity. In a situated dialogue, it is essential to
correctly identify the objects that speakers refer to in the environment. This task is
referred to as reference resolution. However, reference resolution is a challenging problem in situated dialogue, and in part because of this challenge, most state-of-the-art situated dialogue systems operate within highly constrained domains. This dissertation presents an
implementation of a tutorial dialogue system for the domain of Java programming, with
real-time reference resolution. The implemented dialogue system identifies and interprets
referring expressions in user utterances in real time. The identified referents are used
to improve the performance of natural language understanding. This dissertation also
examines the impact of different reference resolution approaches on the performance of the
implemented tutorial dialogue system.
The implemented real-time reference resolution approach in this project has three
phases. First, we apply an innovative approach that we developed for more accurate
part-of-speech tagging in domain-specific dialogue. This approach does not require an
annotated corpus for the target domain. Next, we use a Conditional Random Field to
label the semantic structure of the referring expressions. Finally, the learned semantics
are used together with contextual information to perform reference resolution in situated
dialogue. Offline evaluation of the CRF-based reference resolution approach on an existing tutorial dialogue corpus for computer programming showed an accuracy of 61.6%, a dramatic improvement over the 51.3% achieved by an approach based on a manually defined lexicon (Li and Boyer, 2016).
To evaluate the performance of the two reference resolution approaches, we implemented them in a tutorial dialogue system for Java programming. A human subjects study
was conducted to assess the performance of the tutorial dialogue systems with different
reference resolution approaches. In the study, 41 human participants were randomly
assigned to use these two tutorial dialogue systems. Post-survey results were collected
from study participants to evaluate system usability and user engagement. The reference
resolution performed by the dialogue systems was automatically logged into a database
for manual evaluation. After analyzing the data collected in the study, we did not find a significant difference in user satisfaction or user engagement between the dialogue systems with different reference resolution approaches. The possible reasons are discussed in Chapter 9.
This dissertation is one of the few works that attempt to implement a natural language dialogue system for a domain as complex as Java programming. It is also the only known work that compares different reference resolution approaches in a tutorial dialogue system.
In the dialogue system research community, there is an increasing recognition that
natural language dialogue systems need to work in more complex domains. Real-time
reference resolution in situated dialogue is one of the important challenges to achieve such
a goal. This dissertation research has made a step toward real-time reference resolution for
a dialogue system operating in a complex domain.
CHAPTER 1
INTRODUCTION
Dialogue systems must move toward understanding users’ language within situated
environments to assist users with increasingly complex tasks. Situated dialogue is usually
embedded in an environment where domain-specific task completion is a central activity.
One of the essential requirements of situated dialogue systems is to identify the objects
that users refer to during a conversation (Iida et al., 2010; Liu et al., 2014; Liu and
Chai, 2015; Chai et al., 2004). Identifying a speaker’s referents is, itself, a crucial part
of utterance interpretation. Identifying the correct referent for an utterance also helps
other aspects of language understanding—for example, by constraining the likely current
intention (Gorniak and Roy, 2007).
Reference resolution in situated dialogue is challenging because of the ambiguity
inherent within dialogue utterances and the complexity of the environment. Imagine
a dialogue system that assists a novice student in solving a programming problem. To
understand a question or statement the student poses, such as, “Should I use the 2
dimensional array?”, the system must link the referring expression “the 2 dimensional
array” to an object¹ in the environment.
This process is illustrated in Figure 1-1, which shows an excerpt from a corpus of
tutorial dialogue situated in an introductory computer programming task in the Java
programming language. The arrows link referring expressions in the situated dialogue to
their referents in the environment. To identify the referent of each referring expression, it
is essential to capture the semantic structure of the referring expression and the attributes of the object it refers to; for example, “the 2 dimensional array” expresses two attributes, “2 dimensional” and
¹ The word “object” has a technical meaning within the domain of object-oriented programming, which is the domain of the corpus utilized in this work. However, we follow the standard usage of “object” in situated dialogue (Iida et al., 2010), which for programming is any portion of code in the environment.
“array”. At the same time, the dialogue history and the history of user task actions (such
as editing the code) play a key role. To disambiguate the referent of “my array”, temporal
information is needed: in this case, the referent is a variable named “arra”, which is an
array that the student has just created.
[Figure 1-1 here. The figure shows two panels, “Dialogue and task history” on the left (tutor and student utterances interleaved with code-editing events such as “student adds line of code: arra = new int[s.length()];”) and “Environment” on the right (the student’s Java code), with arrows linking referring expressions in the dialogue to their referents in the code.]

Figure 1-1. Excerpt of tutorial dialogue illustrating reference resolution. Referring expressions are shown in bold.²
To tackle the problem of reference resolution in this type of situated dialogue, we
present a pipeline approach that combines a domain-specific part-of-speech (POS) tagger,
semantics from a conditional-random-field-based semantic parser, and salience features from dialogue history and task history. This approach includes three main steps.
First, we extract referring expressions from user utterances. Second, we interpret the
semantics of referring expressions using a conditional random field (CRF) model. The
² Typos and syntactic errors are shown as they appear in the original corpus.
outputs of this step are the object attributes expressed by the referring expressions.
Finally, the learned semantic information and contextual information from the situated
dialogue are used to identify the mentioned objects. This process is illustrated in Figure
1-2. We evaluate this approach on the JavaTutor corpus, a corpus of textual tutorial
dialogue collected within an online environment for computer programming.
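To make this three-step process concrete, the sketch below shows one way the stages could be composed. The function names, the toy extraction and labeling rules, and the candidate list are illustrative stand-ins for the learned models described in later chapters, not the actual implementation.

```python
# Illustrative sketch of the three-step pipeline (cf. Figure 1-2); the
# extraction and labeling rules here are toy stand-ins for learned models.

CANDIDATES = [  # objects in the situated environment (the Java code)
    {"name": "table", "CATEGORY": "Variable", "ARRAY DIMENSION": "2"},
    {"name": "arra",  "CATEGORY": "Variable", "ARRAY DIMENSION": "1"},
]

def extract_referring_expressions(utterance):
    """Step 1: in the real system, POS tagging + NP chunking + classification."""
    return [np for np in ("the 2 dimensional array", "my array") if np in utterance]

def interpret_semantics(expression):
    """Step 2: in the real system, a CRF labels attributes (see Table 3-1)."""
    attrs = {"CATEGORY": "Variable"}
    if "2 dimensional" in expression:
        attrs["ARRAY DIMENSION"] = "2"
    return attrs

def resolve_referent(attrs, candidates):
    """Step 3: rank candidates by attribute compatibility; salience features
    from dialogue and task history would be added to this score."""
    return max(candidates, key=lambda c: sum(c.get(k) == v for k, v in attrs.items()))

for re_ in extract_referring_expressions("Should I use the 2 dimensional array?"):
    print(re_, "->", resolve_referent(interpret_semantics(re_), CANDIDATES)["name"])
# the 2 dimensional array -> table
```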
To enable a task-oriented dialogue system to perform reference resolution in real time, we need to recognize referring expressions in user utterances on the fly. To solve this problem, we need accurate part-of-speech (POS) tags for user
utterances. This dissertation also presents an innovative POS tagging approach within
situated dialogue. In a corpus of textual dialogue for Java programming, the proposed
approach showed a large improvement over the Stanford tagger. Compared to a tagger
trained on the same source data (which includes dialogue) but with no domain adaptation,
overall accuracy improved from 87.14% to 92.76%. For nouns, which are a prevalent and
challenging open word class in domain language, the new approach results in a dramatic
improvement from an F1-score of 0.701 to 0.903. Accordingly, the F1-score of noun phrase
chunking was improved from 0.81 to 0.86.
Prior work on reference resolution has leveraged dialogue history and task history
information to improve the accuracy of reference resolution (Iida et al., 2010, 2011;
Funakoshi et al., 2012). However, these prior approaches have employed relatively simple
semantic information from the referring expressions, such as a manually created lexicon,
or have operated within an environment with a limited set of pre-defined objects. As
this dissertation demonstrates, these prior approaches do not perform well in situated
dialogues for complex problem solving, in which the user creates, modifies, and removes
objects from the environment in unpredictable ways. We combine the semantics learned
by a CRF-based approach together with salience information of objects in the situated
environment to map referring expressions to their referents. The results showed that our
approach achieves substantial improvement over two existing state-of-the-art approaches,
with existing approaches achieving 51.3% accuracy at best, and the new approach
achieving 61.6% accuracy.
[Figure 1-2 here. The figure shows the three pipeline stages, Referring Expression Extraction, Semantic Interpretation of Referring Expressions, and Identifying Referents, applied to the user utterance “… from the actionPerformed method”: the extracted referring expression “the actionPerformed method” is resolved to the referent on line 71, public void actionPerformed(){...}.]

Figure 1-2. Pipeline of online reference resolution in a situated dialogue.
In this dissertation, we present a data-driven tutorial dialogue system for Java
programming. In this dialogue system, we implement the reference resolution pipeline
presented above to identify the user’s referent in real time. The tutorial dialogue system
has four main modules: natural language understanding (NLU) module, dialogue manager
(DM) module, knowledge base (KB) module, and a natural language generation (NLG)
module. The NLU module performs reference resolution, dialogue act classification and
topic classification for an input user utterance. The DM tracks the current programming
progress and user intention. We also authored a set of rules for the DM to generate system dialogue acts in response to input user dialogue acts. The KB module maintains knowledge about the programming problem and the Java language. For the NLG module, we authored a set of system utterances for each system dialogue act. An excerpt dialogue between a user
and the dialogue system is shown in Table 1-1.
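A rough sketch of how these four modules might be wired together follows; the class and method names are hypothetical, chosen only to mirror the description above, not the system’s actual interfaces.

```python
# Hypothetical wiring of the four modules (NLU -> DM -> KB -> NLG);
# all names are illustrative.

class NLU:
    def understand(self, utterance):
        # Reference resolution, dialogue act classification, topic classification.
        return {"referents": [], "dialogue_act": "QUESTION", "topic": "arrays"}

class DialogueManager:
    def __init__(self, rules):          # authored rules: user act -> system act
        self.rules = rules
    def next_act(self, nlu_output):
        return self.rules.get(nlu_output["dialogue_act"], "HINT")

class KnowledgeBase:
    def lookup(self, topic):            # knowledge about the task and Java
        return f"some knowledge about {topic}"

class NLG:
    def __init__(self, templates):      # authored utterances per system act
        self.templates = templates
    def generate(self, act, knowledge):
        return self.templates[act].format(knowledge=knowledge)

nlu = NLU()
dm = DialogueManager({"QUESTION": "ANSWER"})
kb = KnowledgeBase()
nlg = NLG({"ANSWER": "Here is a hint: {knowledge}."})

out = nlu.understand("how to parse a string into an array")
print(nlg.generate(dm.next_act(out), kb.lookup(out["topic"])))
# Here is a hint: some knowledge about arrays.
```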
To evaluate the tutorial dialogue systems with different reference resolution
approaches, we conducted a user study with undergraduate students from the COP 3502 “Programming Fundamentals I” class at the University of Florida.
Table 1-1. An excerpt dialogue between a user and the dialogue system.

Virtual TA: Hi, I’m your virtual TA. I will be helping you on this Java project.
User: does extratDigits method input anything?
Virtual TA: This Java program converts a zip code for US mail into a visual barcode. When you run the program, it asks the user to enter a zip code. The program will then convert it into a barcode and display it. But it doesn’t work yet! To make it work, your task is to complete the extractDigits method. It needs to convert an integer zip code into an array of individual digits, and return that array. Hint: Making the int into a String first is a convenient way to get its individual digits.
Virtual TA: The extractDigits method is located in the PostalFrame class, which should have been opened for you. You are welcome to view the other code, but you only need to work on the extractDigits method.
User: what do I need to do in the extractDigits method?
Virtual TA: You will extract the five digits of a zipcode. Now we need to get each of the 5 digits of that int. Using a string and then getting the individual characters from it is a good way to do that.
User: where is teh zip code
Virtual TA: When the program is run, the user types in a zipCode. The code to get the zipCode from the user is already complete. The zip code the user entered is stored as a private variable, int zipCode on line 34 in PostalFrame.
User: how to parse a string into an array
Virtual TA: You can use the syntax intVariable + "" to convert an integer to a String variable.
. . .
We implemented
two different versions of the tutorial dialogue system with different reference resolution
approaches. System Li implemented reference resolution with semantics learned by a CRF-based approach. The baseline system, System Comparison, implemented a reference resolution approach with a manually authored domain lexicon. In the
evaluation, we investigated the impact of different reference resolution approaches
on the tutorial dialogue system. Specifically, we examined the different approaches’
impacts on user satisfaction using the System Usability Scale (SUS) instrument (Bangor et al., 2008), and on user engagement using the User Engagement Scale (UES) instrument (O’Brien et al., 2018). System Li had an average SUS score of 66.7, and System Comparison had an average SUS score of 68.8; the difference between these two scores was not significant (p = 0.361). System Li had a UES score of 11.8, and System Comparison had a UES score of 12.3; this difference was not significant either (p = 0.236).
We also examined the online accuracy of the two reference resolution approaches.
System Li and System Comparison had accuracies of 21.6% and 19.6%, respectively. After further analysis of the collected data, we found that the low accuracy was caused by the referring expression selection approach. After manually annotating the referring expressions in the collected data, we found the accuracies of these two models were 63.3% and 44.9%,
respectively.
This dissertation makes the following contributions: 1) implementation of a tutorial
dialogue system for Java programming; and 2) evaluation of real-time reference resolution
approaches in the tutorial dialogue system by conducting a human subjects study. We believe these contributions will help the dialogue system research community to better understand reference resolution in situated dialogue systems.
The remainder of the dissertation is structured as follows. Chapter 2 reviews related
work on situated language understanding, and reference resolution in situated dialogue
understanding, summarizing the features and approaches used in prior work. Chapter 3
introduces the corpus of situated dialogue for Java programming, which is used in this
dissertation for model training and empirical evaluation. Chapter 4 describes the process
of online referring expression identification, which extracts referring expressions from
user utterances in real time when the dialogue system is running. Chapter 5 presents
the semantic interpretation of referring expressions using a CRF-based model. Chapter
6 describes the approach for reference resolution with learned semantics from referring
expressions and contextual information of the task-oriented dialogue. We describe
the implementation of the tutorial dialogue system for Java programming in Chapter
7. We present a user study for the tutorial dialogue system in Chapter 8. Chapter 9
is a discussion of observations made while building the tutorial dialogue system and
conducting the user study. The dissertation is concluded in Chapter 10 by summarizing
the presented work and contributions.
CHAPTER 2
RELATED WORK
This chapter reviews previous research on reference resolution within different types
of situated environments. We start with coreference resolution in text, which is closely
related to reference resolution in situated language and has been a well established
research area for decades. Then, we categorize, discuss, and compare previous work on
reference resolution in situated language.
2.1 Coreference Resolution
Coreference resolution discovers antecedents for anaphors in discourse. An anaphor is a linguistic expression whose interpretation depends on another linguistic expression in the context. An antecedent is also a linguistic expression, one that occurs before an anaphor and can be used to interpret it. For example, in the sentence “When you see John, give him this card.”, “John” is an antecedent of “him”, and “him” is an anaphor. A coreference relation consists of an antecedent and an anaphor that refer to the same entity; there may be multiple noun phrases referring to the same entity. Coreference resolution is different from reference resolution in a situated environment; however, they share some similarities, which will be discussed in Section 2.2. Reference resolution has
been inspired by the theories and approaches developed for coreference resolution, such as
centering theory and ranking-based classification approach (Denis and Baldridge, 2008).
Theories for Coreference Resolution
Ariel presented a theory that described the relationship between accessibility of
entities and referring behaviors (Ariel, 1988). She argued that “natural language
primarily provides speakers with means to code the ACCESSIBILITY of the referent
to the addressee.” The accessibility of entities, which indicates how accessible an entity
is to the conversation participants, is “tied to context types in a definitely non-arbitrary
way.” According to the author, there are three types of contexts that are highly related to
reference resolution: community mutual knowledge, physical co-present mutual knowledge,
and linguistic co-present mutual knowledge. Community mutual knowledge is shared
by the speakers and addressees because of belonging to the same community. Physical
co-present mutual knowledge is perceived by the conversation participants in their shared
physical environment. Linguistic co-present mutual knowledge is conveyed by previous
utterances, i.e., dialogue history. All of these three kinds of knowledge determine the
accessibility of possible referents at a given moment. Intuitively, these three context
types provide metrics to measure the salience of entities involved in a conversation.
Ariel also argued that the accessibility of entities determines the form of their
referring expressions. Entities with lower accessibility need more lexical information to be
identified, and vice versa. More detailed relationships between accessibility and the form
of referring expressions are shown in Figure 2-1.
Figure 2-1. Relationship between accessibility and referring expression forms.
Grosz et al. presented a framework based on centering theory to model local
coherence of discourse (Grosz et al., 1995). Centers were defined as entities in an
utterance that served as links to other utterances in the discourse that also contain
the same entities. Each utterance in the discourse was assigned a set of forward-looking
centers and one backward-looking center. The centering framework provided a rule-based
approach to describe a speaker’s attentional state by monitoring the change of centers.
The authors also argued that attentional states were highly related to the choice of
referring expressions. Sidner also pointed out the close relationship between discourse
structure and reference resolution (Sidner, 1986).
Both accessibility theory and centering theory emphasize the importance of salience
information in coreference resolution. We will show that this salience information is also
essential in reference resolution in situated environments.
Models for Coreference Resolution. Early work on coreference resolution used
rule-based approaches (Lappin and Leass, 1994). More recent work usually formulates
coreference resolution as a classification problem as discussed above, which is also
employed by reference resolution in most cases. The difference is that the candidates
of coreference resolution are other referring expressions, while reference resolution has
objects from the situated environment as candidates.
The straightforward approach is to consider referring expressions in pairs, <re_i, re_j>. The binary output of a classification function f(re_i, re_j) indicates whether re_i and re_j have the same referent. Some previous work used decision trees as classification
functions, given the simplicity and categorical nature of the features (Mccarthy and
Lehnert, 1995; Soon et al., 2001). Ponzetto and Strube used a maximum entropy model as
their classification function (Ponzetto and Strube, 2006).
Ranking-based model: In a piece of text, there could be multiple antecedents for a
referring expression. Pairwise matching models consider a single candidate at a time and take only a True/False decision from a binary classifier. However, the output of a binary classifier is usually a probability. This probability, the confidence of making a positive decision, is discarded by such models. To employ this confidence value, Yang et al.
presented an approach using twin-candidates instead of a single candidate as antecedents
(Yang et al., 2003). In this approach, each data sample contained one anaphor and two
candidate antecedents, only one of which was the real antecedent. The model considered
features between these three referring expressions to make a final decision, which took
the comparison between two candidates into consideration. The model achieved better
performance. Using a similar idea, Denis and Baldridge presented a ranking-based model,
which created multiple antecedent candidates <c_0, c_1, ..., c_k> for each anaphor re (Denis and Baldridge, 2008). A binary classifier f(re, c_i) ∈ [0, 1] was then used to compute the compatibility p_i between re and each c_i. These outputs p_i were ranked to select the best candidate from the candidate list as re’s real antecedent. Culotta et al. organized
candidates into clusters and identified all the antecedents for a referring expression at the
same time (Culotta et al., 2007).
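As a concrete, if simplified, illustration of this ranking idea, the sketch below scores each candidate antecedent with a probabilistic classifier and picks the highest-scoring one; the features and training data are toy placeholders, not the features used in the cited work.

```python
# Sketch of ranking-based antecedent selection in the spirit of Denis and
# Baldridge (2008); features and training data are toy placeholders.
from sklearn.linear_model import LogisticRegression

def features(anaphor, candidate):
    # Toy pairwise features; real systems use rich syntactic/semantic cues.
    return [int(anaphor["gender"] == candidate["gender"]),
            int(anaphor["number"] == candidate["number"]),
            anaphor["position"] - candidate["position"]]  # textual distance

# Toy training pairs (feature vectors; 1 = coreferent, 0 = not).
X = [[1, 1, 2], [0, 1, 5], [1, 0, 9], [1, 1, 1]]
y = [1, 0, 0, 1]
clf = LogisticRegression().fit(X, y)

def resolve(anaphor, candidates):
    """Score every candidate with P(coreferent) and return the top-ranked one."""
    probs = clf.predict_proba([features(anaphor, c) for c in candidates])[:, 1]
    return candidates[int(probs.argmax())]

him = {"gender": "m", "number": "sg", "position": 7}
candidates = [{"gender": "m", "number": "sg", "position": 5},   # "John"
              {"gender": "f", "number": "sg", "position": 6}]   # "Mary"
print(resolve(him, candidates))  # picks the gender-compatible candidate
```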
Specialized models: Denis and Baldridge argued that different referring expression
types, namely pronouns, definite noun phrases, and demonstrative noun phrases, were used
differently (Denis and Baldridge, 2008). Thus, they trained different models for each type
of referring expression, which proved to be more accurate for coreference resolution.
2.2 Reference Resolution in Situated Dialogue
Reference resolution in situated language shares similarities with coreference
resolution. Both benefit from semantic interpretation of referring expressions and are
usually formulated as classification problems. However, coreference resolution identifies
a coreference relation between referring expressions within a discourse, whereas reference
resolution in situated language identifies referents of referring expressions in their situated
environment. For example, in Figure 2-2, referring expressions such as “he”, “his”, and “Clinton” that appear later in a piece of text all refer to the referring expression “Bill
Clinton”, which appeared earlier in the same text. In a situated dialogue, as shown in
Figure 1-1, both referring expressions “my array” and “the array” refer to arra, which is
an array that the student had just created.
Figure 2-2. Coreference relation example diagram.
The state of the situated environment also plays an essential role in solving this
problem. This section summarizes the approaches used in existing work on reference
resolution in situated language.
Similar to coreference resolution, reference resolution is usually represented as a
classification problem. Given a referring expression re and a candidate referent e, a
classification function f(re, e) is used to predict the probability that e is re’s referent in
the current context, which includes linguistic context and world state. Each candidate
referent e is an entity in the situated environment, such as “a blue mug on the table”.
Features for Reference Resolution. In previous work, there are three primary types of features: syntactic features, semantic features, and salience features. Unlike coreference resolution, fewer syntactic features are involved in reference resolution in situated language. Coreference resolution searches for relations between referring expressions, in which the syntactic relationship between these referring expressions plays an important role. For reference resolution in situated dialogue, the referents are in the
situated environment, not in the dialogue. The syntactic types of referring expressions,
such as demonstrative pronouns and definite pronouns, are the most commonly used
syntactic features (Chai et al., 2004; Iida et al., 2010). Demonstrative pronouns are
pronouns pointing to specific things, such as “this” and “that”. Definite pronouns, such
as “him” and “it”, are pronouns referring to specific things, which are different from
indefinite pronouns, such as “someone” and “anything”.
Semantic features: As discussed above, situated environments, including objects in
the environment, are usually represented in situated language understanding tasks as
symbols. One of the most important sources of information for identifying the referents of
a referring expression is the semantic compatibility between them. Chai et al. considered
semantic types while creating graphs that represented the relationships between entities
(Chai et al., 2004). Similar to coreference resolution, attributes of entities were also used
for reference resolution in situated language, such as the shape and size of entities (Iida
et al., 2010, 2011).
Salience features: Salience features capture how noticeable and important an entity
is at a given moment. Salience features contain information about what makes a specific
entity more prominent, such as mentioning an entity in recent discourse history, moving or
operating on an entity in recent action history, etc.
Chai et al. aligned deictic gestures, pointing and circling objects in the scene, with
referring expressions found within utterances using the temporal co-occurrence between
them (Chai et al., 2004). Iida et al. studied reference resolution in situated dialogues for
a collaborative game (Iida et al., 2010, 2011). They used dialogue history and operating
history as features to exploit the salience of entities. These features were coded by time
intervals, such as “whether object o_i was operated in the past 10 seconds.” Eye gaze
features have also been used as salience features in some research to improve the accuracy
of reference resolution (Iida et al., 2011; Kennington and Schlangen, 2015).
Different from the semantic features used in previous work, we propose a CRF-based
semantic labeling approach. This approach automatically labels attributes of objects in
referring expressions.
Approaches. Most existing work formulated reference resolution as a supervised
classification problem. Iida et al. used output from SVM classifiers as measurements for
compatibility between a referring expression and the candidate referents (Iida et al., 2010,
2011). They also trained specialized models, a pronoun model and a non-pronoun model,
for different types of referring expressions. Funakoshi et al. presented a Bayesian network
to model the generative process from referent to referring expressions (Funakoshi et al.,
2012). The structure of the Bayesian network is shown in Figure 2-3.
Figure 2-3. Bayesian network for reference resolution.
In this Bayesian network, W, C, X, and D represent words, concepts (attributes), referents,
and a referent domain (a set of referents), respectively. This model also shows how to
resolve a reference to a set of referents.
Most previous work employed semantic features, which in some cases were extracted
using a manually defined lexicon (Chai et al., 2004; Liu et al., 2012) and in some other
cases learned automatically (Matuszek et al., 2014; Schlangen et al., 2016).
Weakly supervised approaches: Some work attempted to build reference resolution
models with less supervision. These approaches need fewer manual annotations, especially
for lexical semantics, when compared to fully supervised approaches. Supervised
approaches usually use a lexicon to label the semantics of referring expressions (Iida
et al., 2010). Thus, the training data for fully supervised approaches contain < re, e >
pairs and lexical semantics of referring expressions. Weakly supervised approaches do
not need lexical semantics as input; instead, their inputs are just the < re, e > pairs.
Weakly supervised approaches learn the alignments between natural language tokens in
re and attributes of e automatically, using the co-occurrences of re and e in training data.
In previous work (Kennington and Schlangen, 2015; Schlangen et al., 2016), the semantics
of natural language tokens were learned using a word-as-classifier approach. The input of
this approach was a set of <re, e> pairs. Each referent e in the dataset was a physical
object in a scene. The goal of this word-as-classifier approach was to learn the alignment
between natural language tokens in re and visual features of e. For each natural language
token w, a logistic regression classifier was learned given all of the co-occurrence of e and
w in training data. Object e was represented as an n-dimensional vector of visual features.
Classifiers were trained for each token w in the training data. When given a new referring
expression re = <w_0, w_1, ...> and a scene with a set of objects e_i, the classifiers for the tokens in this re were applied to each object e_i in the scene to find the best match in terms of compatibility between re and e_i. This process is illustrated in Figure 2-4. In this figure,
x_i is the feature vector of the i-th object in the scene. There is an output, δ(w^T x_i + b), for each object in the scene. The top level represents normalization over all of the outputs from the logistic classifier. With this word-as-classifier approach, the alignments between natural language tokens and visual features of objects were learned automatically without
explicit manual annotation.
Figure 2-4. Identifying the most likely referent using the word-as-classifier approach.
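A minimal sketch of the word-as-classifier idea follows; the visual feature vectors and training pairs are toy data, not the features used in the cited work.

```python
# Minimal sketch of the word-as-classifier approach (Kennington and
# Schlangen, 2015); the visual features and training pairs are toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each object is an n-dimensional visual feature vector, e.g. [R, G, B, size].
train_pairs = [
    ("red", np.array([0.9, 0.1, 0.1, 0.5]), 1),
    ("red", np.array([0.1, 0.9, 0.2, 0.4]), 0),
    ("big", np.array([0.5, 0.5, 0.5, 0.9]), 1),
    ("big", np.array([0.4, 0.6, 0.3, 0.2]), 0),
]

# Train one binary classifier per word from its co-occurring objects.
word_clf = {}
for word in {w for w, _, _ in train_pairs}:
    X = [x for w, x, _ in train_pairs if w == word]
    y = [label for w, _, label in train_pairs if w == word]
    word_clf[word] = LogisticRegression().fit(X, y)

def resolve(expression, scene):
    """Score each object by the product of its word-classifier outputs,
    then normalize over the scene (the top level of Figure 2-4)."""
    scores = np.ones(len(scene))
    for w in expression.split():
        if w in word_clf:
            scores *= word_clf[w].predict_proba(scene)[:, 1]
    return scores / scores.sum()

scene = np.array([[0.9, 0.1, 0.1, 0.8], [0.1, 0.8, 0.2, 0.3]])
print(resolve("the big red one", scene))  # higher probability for object 0
```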
2.3 Summary
This chapter summarizes previous approaches on reference resolution in situated
language. According to the literature review, we found that most previous work performed
reference resolution in a limited setting, either a specific setting containing a fixed set of
objects to evaluate their approach (Kennington and Schlangen, 2015), or in a domain with
a very limited number of objects (Iida et al., 2010). None of these approaches investigates
real-time reference resolution in a situated dialogue system. Different from previous work,
this dissertation reports a real-time reference resolution approach. In addition, we present
an implementation of a tutorial dialogue system for Java programming to evaluate it in a
real-time setting.
CHAPTER 3
CORPUS
This dissertation investigates the reference resolution problem in a tutorial dialogue
system. Given the data-driven nature of the reference resolution and dialogue understanding
techniques used in this research, we employ a corpus of tutorial dialogues from a previous study.
3.1 Data Collection
The corpus was collected within a tutorial dialogue study in which human tutors and
students interacted through a tutorial dialogue interface, Ripple, that supported remote
textual communication (Boyer et al., 2011). The tutorial dialogue interface (Figure 3-1)
consists of two windows that display interactive components: the students’ Java code,
the compilation or execution output associated with the code, and the textual dialogue
messages between the student and tutor. All of the information in these two windows
was synchronized between the student’s screen and the tutor’s screen in real time. The entire
corpus contains 45 Java programming tutoring sessions from student-tutor pairs, with a
total of 4857 utterances, an average of 108 utterances per session. Each of these sessions
lasted approximately one hour. The problem students solved during this tutorial dialogue
involved creating, traversing, and modifying parallel arrays, a challenging task since the
students were novices who were enrolled in an introductory computer programming class.
The dialogues within this domain are characterized by situated features that pertain
to the programming task. A portion of user utterances refer to general Java knowledge,
and in these cases a semantic interpretation can be accomplished by mapping to a
domain-specific ontology (Dzikovska et al., 2007). In contrast, many utterances refer
to concrete entities within the dynamically changing, user-created programming artifact.
Identifying these entities correctly is crucial for generating specific tutorial dialogue moves.
Besides the tutorial dialogue, we also used publicly available corpora for POS
tagging. We performed POS tagging in order to identify referring expressions from user
Figure 3-1. The interface of Ripple, a tutorial dialogue system for Java programming. It includes two windows: a window (on the left) to display the student’s Java code and a window (on the right) for textual messages between student and tutor.
utterances. Our target domain is online synchronous textual task-oriented dialogue
about Java programming. To train a domain-specific POS tagger, we leveraged two
different labeled corpora from source domains. First, we used the CoNLL2000 corpus for
phrase chunking (Tjong and Sang, 2000), which is a labeled Wall Street Journal corpus
with 10,948 sentences. We also used the NPS chat corpus (Forsyth and Martell,
2007), a set of annotated online conversational texts with 10,567 utterances. The target
corpus is a set of textual Java programming tutorial dialogues (Li and Boyer, 2015) that
contains 4,857 utterances (51,721 tokens) in total. The Java programming corpus is
task-oriented, containing not only utterances but also the accompanying Java program
that the interlocutors were creating and discussing. As described below, we utilized
a subset of these Java programs to extract noun phrases to generate the new labeled
training corpus. We also compared this approach to using Java snippets from The Java
Tutorial website to test the benefit of using unrelated Java code.¹
3.2 Annotation
All of the utterances in the 45 tutorial sessions were manually annotated for the
referring expressions that have referents in the parallel Java program. For each referring
expression, we labeled segmentation and semantic labels for each segment, so that each
of these semantic segments represents one attribute in the Java programming domain.
These labeled referring expressions will be used to train statistical models to automatically
annotate referring expressions to provide semantic information for reference resolution.
Noun phrases from the tutorial dialogues were first manually extracted and
annotated. There were 364 grounded noun phrases extracted manually from six tutorial
dialogue sessions used in the current work. Each of these extracted noun phrases has one
or multiple corresponding entities in the programming artifact. Since each word in a noun
phrase is linked to an element in the description vector, the indices in this vector were
used as the label for each word. Annotation of all 346 noun phrases was performed by
one annotator, and 20% of the noun phrases (70 noun phrases) were doubly annotated by
an independent second annotator. The percent agreement was 85.3% and the Kappa was
0.765.
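For context, the kappa reported here is presumably Cohen's kappa, which discounts the observed agreement p_o by the agreement p_e expected by chance:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

With the reported p_o = 0.853 and κ = 0.765, the implied chance agreement is p_e ≈ 0.37.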
We also annotated the semantic labels for each referring expression. A noun phrase is
defined as a phrase which has a noun (or indefinite pronoun) as its head word, or which
performs the same grammatical function as such a phrase (Crystal, 1997). The syntactic
structure of a noun phrase consists of dependents which could include determiners,
adjectives, prepositional phrases, or even a clause. For example, the noun phrase “a 2
dimensional array” occurs within the Java programming corpus. Its head is “array” and
its dependents are “a” as the determiner and “2 dimensional” as an adjective phrase.
¹ https://docs.oracle.com/javase/tutorial/
30
Each of these semantic segments involves an attribute of its real referent in the situated
environment (the parallel Java program in this case). We manually annotated these
semantic segments in referring expressions. The semantic tags we used are listed in Table
3-1.
Table 3-1. Semantic labels of referring expressions.
Attributes         Meaning (in Java programming)          Example
CATEGORY           Category of an entity                  Method, Variable, etc.
NAME               Variable name; often user-created      extractDigit
VAR TYPE           Type of variable                       int, String, etc.
NUMBER             Number of entities                     2
IN CLASS           The class that contains this entity    postalFrame
IN METHOD          The method that contains this entity   actionPerformed
DIR PARENT         Direct parent entity                   For Statement, Method
LINE NUMBER        Line number                            67
SUPER CLASS        Superclass of this entity              JFrame
MODIFIER           Access modifier                        public, private, etc.
ARRAY TYPE         Type of Array                          int, char, etc.
ARRAY DIMENSION    Dimension of array                     2, 1
OBJ CLASS          The class an object instantiates       PostalBarCode
RETURN TYPE        Return type                            String, int, etc.
OTHER              Other attributes                       the, extra, etc.
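As a small worked example of this scheme (my own illustrative segmentation, not a gold annotation from the corpus), the referring expression “the 2 dimensional array” discussed earlier could be segmented and labeled as:

```python
# Illustrative segmentation and labeling with the Table 3-1 tags;
# not a gold annotation from the corpus.
annotation = [
    ("the",           "OTHER"),
    ("2 dimensional", "ARRAY DIMENSION"),
    ("array",         "CATEGORY"),
]
```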
CHAPTER 4
ONLINE REFERRING EXPRESSION EXTRACTION
One of the essential steps to implement reference resolution in a tutorial dialogue
system is to identify referring expressions, which are noun phrases, in user utterances in
real time. This is a challenging task in a tutorial dialogue system for Java programming.
Language used in such a dialogue is usually informal. Utterances may contain many
domain-specific components, such as Java program segments. To accurately identify
noun phrases in these utterances, we need an accurate part-of-speech (POS) tagger. POS
tagging is a very important step for noun phrase chunking, which is the approach used to
tag noun phrases in a given sentence. Since referring expressions are noun phrases in an
utterance, we need to first identify all of the noun phrases in this utterance. Not all noun
phrases have referents in the situated environment. We are only interested in noun phrases
that refer to objects in the environment, in this case the Java code. Consequently, we need
a classification step to identify the referring expressions that are of interest to us.
This chapter includes two sections. Section one reports on an unsupervised approach
I developed for part-of-speech tagging in situated language. Section two reports on noun phrase chunking for utterances in tutorial dialogue. To date, I have developed and evaluated these techniques on corpora. However, as will be described in Chapter 7, I deploy these
approaches within a real-time tutorial dialogue system. The process of referring expression
extraction is shown in Figure 4-1.
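To make the POS-tags-to-noun-phrases step concrete, here is a minimal chunking illustration using NLTK's regular-expression chunker. This is a simplified stand-in for the CRF-based chunker described later, and the grammar pattern is my own illustrative choice.

```python
# Minimal NP chunking sketch over a POS-tagged utterance; a simplified
# stand-in for the CRF-based chunker (pattern is illustrative).
import nltk

tagged = [("but", "CC"), ("why", "WRB"), ("do", "VBP"), ("that", "DT"),
          ("when", "WRB"), ("I", "PRP"), ("could", "MD"), ("just", "RB"),
          ("use", "VB"), ("the", "DT"), ("string", "NN"), ("zip", "NN"),
          ("from", "IN"), ("the", "DT"), ("actionPerformed", "NN"),
          ("method", "NN")]

# NP = optional determiner, any adjectives, one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)
for subtree in tree.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# the string zip
# the actionPerformed method
```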
4.1 Part-of-speech Tagging for Domain-specific Language
In this section, I report a novel but simple domain-adaptation approach that I
developed to improve part-of-speech tagging in task-oriented dialogue. This approach
automatically generates an annotated domain-specific training corpus without any manual
annotation. In a corpus of textual dialogue for Java programming, experiments showed a
large improvement over the Stanford tagger. Compared to a tagger trained on the same
source data (which includes dialogue) but with no domain adaptation, overall accuracy
improved from 87.14% to 92.76%. For nouns, which are the most essential word class for referring expression identification, the new approach results in a dramatic improvement from an F1-score of 0.701 to 0.903.

[Figure 4-1 here. The figure traces the utterance “but why do that when I could just use the string zip from the actionPerformed method” through three steps: POS tagging (yielding the tag sequence CC WRB VBP DT WRB PRP MD RB VB DT NN NN IN DT NN NN), noun phrase chunking, and classification.]

Figure 4-1. Steps for referring expression extraction.
Accurate part of speech (POS) tagging is essential for many natural language
processing tasks, including natural language understanding in dialogue systems. Most
POS taggers are trained on large newswire corpora that support good performance
on open-domain language. However, these taggers encounter performance degradation
when applied to domain-specific language (Jiang and Zhai, 2007), which is often used in
task-oriented dialogue. This degradation is due partly to unknown tokens, but also due
to how known tokens are used. For example, in a Java programming tutorial dialogue, we
see utterances such as, “what I might could do is write if statements to see what range
sum%10 is in,” or, “... so String a = new String(zipCode); would work.” Dialogue systems
must be able to parse this kind of user utterance to react properly. There is much room
for improvement in domain-specific POS tagging: on the Java-programming dialogues
corpus used in this work, the Stanford tagger achieved 85.57% accuracy, compared to its
97.32% accuracy on the type of language on which it was trained (Manning, 2011).
Previous work on domain adaptation for POS tagging has included adding annotated
target domain data (Jiang and Zhai, 2007; Daume, 2009) and using dictionaries to mine
patterns from domain languages (Hovy et al., 2015; Li et al., 2012). We present a different
perspective on POS tagging which does not require any manual labeling. We argue that
generating a grammatical sentence in a new domain is easier than parsing a given sentence
from the same domain, assuming that we can easily extract some domain language from
other sources. The domain language is not annotated per se, but because of the context
in which it occurs, its POS tag can be inferred. We then generate a new set of sentences
for our target domain-specific language with POS tags known, and we build a tagger using
the generated corpus as training data.
The approach was tested on 5 sessions of Java tutoring data collected using Ripple
(mentioned in the previous chapter). The other 40 sessions were used to generate training
data. This will be discussed in detail later in this chapter. Our simple yet effective
method improves upon the Stanford tagger’s performance on domain-specific language
for Java programming, achieving 92.76% accuracy compared to Stanford’s 85.57%, and
we do so without manually tagging any new domain-specific language. The new approach
achieved a recall of 91.9% for nouns (NN) (which account for 17% of all the tokens)
compared with 58.2% from a baseline tagger trained on the same source corpus without
domain adaptation and 71.6% by the Stanford tagger. The accuracy for some other POS
tags, such as adjectives (JJ) and past tense verbs (VBD) also improved significantly with
the reported approach, as did overall precision and recall for all of the POS tags.
4.1.1 Approach
The reported approach is based on the observation that open-domain POS tagging
errors in domain-specific language often occur in noun phrases. For example, “if
statement” is a noun phrase in the domain of Java programming, but taggers trained
on newswire recognize “if” as a subordinate conjunction instead of a noun. They also
cannot recognize examples such as the previously mentioned chunk of code “String a =
new String(zipCode);” as noun phrases. It would be challenging to induce a grammar
from an unlabeled corpus that contains a large proportion of tokens serving a new
grammatical role. Moreover, it is difficult to tag these phrases using preprocessing, since
the code-like phrases used in natural language tend to be informal, following neither the
syntactic rules of the programming language nor those of the natural language in which they are
embedded. Our approach addresses this problem by generating grammatical (though not
semantically meaningful) sentences by substituting domain-specific noun phrases in place
of noun phrases in previously annotated source language.
To create a POS tagger for the target language, we used an annotated source
corpus (CoNLL2000 (Tjong and Sang, 2000)) and a set of domain-specific noun phrases
generated from a corpus of Java programs. We leverage the many similarities between this
domain-specific language and more open-domain language such as newswire: for example,
most other parts of domain-specific sentences, such as “what I might could do is write...”
and “so ... would work” still follow English grammar. Based on this simple idea, we
generate a corpus for the target domain, which is automatically annotated in the process
of generation. The approach substitutes domain-specific chunks into labeled sentences
from the source corpus by replacing part of an existing noun phrase to generate a target
training corpus. Finally, a POS tagger is trained on this corpus to perform POS tagging
for the target domain.
Domain-specific Noun Phrase Generation. To generate a set of labeled
sentences as training data for POS tagging, the reported approach requires that we
first generate a set of domain-specific noun phrases. For the domain of Java programming,
we extracted noun phrases from source code that had been created during dialogues from
our original in-domain corpus. (Later in this section we refer to those dialogues as the
extraction set. These dialogues were not the same ones used to test the POS tagger.)
We began by tokenizing each line of code from the Java programs. Then, we
extracted unigrams, bigrams, and trigrams from the tokenized Java code and treated
these as domain-specific noun phrases. Each token was tagged as a noun (except that
digits were tagged as numbers). The result is a set of domain-specific phrases with known
POS tags for each token.
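A minimal sketch of this extraction step, written here in Python (the tokenizer pattern and the use of the Penn Treebank tags NN and CD are illustrative assumptions, not the exact implementation):

import re

def extract_domain_phrases(java_lines, max_n=3):
    # Extract uni-, bi-, and trigrams from tokenized Java code, tagging
    # each token as a noun (NN) except digits, which become numbers (CD).
    phrases = []
    for line in java_lines:
        # Crude code tokenizer: identifiers, integer literals, punctuation.
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                tags = ["CD" if t.isdigit() else "NN" for t in gram]
                phrases.append(list(zip(gram, tags)))
    return phrases

# extract_domain_phrases(["String a = new String(zipCode);"]) yields
# phrases such as [('String', 'NN'), ('a', 'NN'), ('=', 'NN')].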
Labeled Target Data Generation. Given a grammatical sentence s_source, which
is a sentence from a source language, if s_source contains a noun n_source, we can create
another grammatical sentence s_target by replacing n_source with a domain-specific noun,
n_target. Recall that a noun phrase is “a phrase which has a noun (or indefinite pronoun)
as its head word, or which performs the same grammatical function as such a phrase”
(Crystal, 1997). For a given sentence from the source corpus that has been tagged with
POS labels (such as CoNLL2000), we first check whether it contains a noun phrase. If so,
we replace the head of a noun phrase in s_source with a domain-specific noun phrase. An
example is shown in Figure 4-2, in which the determiner and adjective modifier of the
noun phrase are not replaced. The generated s_target does not make sense semantically,
but it is grammatical, and it is labeled with POS tags. We generate a sentence s_target for
every domain-specific noun phrase generated by the technique described in the previous
subsection. In this way, we create an annotated training set for the target domain.
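The substitution itself can be sketched as follows, assuming CoNLL2000-style (word, POS, chunk) triples; the heuristic of taking the last token of an NP chunk as its head is an assumption of this sketch:

def substitute_np_head(source_tokens, domain_np):
    # source_tokens: (word, pos, chunk) triples from the source corpus,
    #   e.g. [("Confidence", "NN", "B-NP"), ("in", "IN", "O"), ...]
    # domain_np: (word, pos) pairs from the domain NP generation step.
    # Returns a POS-labeled target sentence with the first NP head
    # replaced by the domain-specific noun phrase.
    out, done = [], False
    for i, (word, pos, chunk) in enumerate(source_tokens):
        last_of_np = (chunk in ("B-NP", "I-NP") and
                      (i + 1 == len(source_tokens) or
                       source_tokens[i + 1][2] != "I-NP"))
        if last_of_np and not done:
            out.extend(domain_np)  # splice in the domain noun phrase
            done = True
        else:
            out.append((word, pos))
    return out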
Training POS Taggers. We trained conditional random field (CRF) POS taggers
on the source corpus and on the generated target-domain training corpus, respectively
(Lafferty et al., 2001). We then tested the models on the target domain testing corpus, which
consists of original dialogues (not generated dialogues).
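Training itself is routine given labeled sentences; a sketch using the sklearn-crfsuite package (one plausible toolkit, not necessarily the one used in the original experiments; this feature set is deliberately minimal):

import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i][0]
    return {"word": w.lower(), "suffix3": w[-3:], "is_digit": w.isdigit(),
            "is_upper": w.isupper(),
            "prev": sent[i - 1][0].lower() if i > 0 else "<s>",
            "next": sent[i + 1][0].lower() if i < len(sent) - 1 else "</s>"}

def train_tagger(sentences):
    # sentences: lists of (word, pos) pairs, e.g. the generated corpus.
    X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
    y = [[pos for _, pos in s] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, y)
    return crf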
4.1.2 Experiments and Results
First, the target corpus was split into two sets: the extraction set with 40 dialogue
sessions, and the testing set with 5 dialogue sessions. (Each dialogue session represents
approximately one hour of textual dialogue and collaborative construction of Java code.)
The testing set contains 687 sentences and 6581 tokens. We trained POS taggers using
source data and the automatically generated target data, which serves as the training
data. Both of these taggers were tested on the original (not generated) dialogues from
Figure 4-2. Example of target sentence generation. In the source sentence “Confidence in
the pond is widely expected to take another sharp dive”, the head of the noun phrase
“another sharp dive” is replaced by the domain-specific noun phrase “String a = new
String ( ...”, whose tokens are labeled NN; the determiner and adjective modifier keep
their original POS tags.
the testing set. We also compared our trained POS taggers with results from the latest
Stanford tagger (v3.7.0) (Toutanova et al., 2003).
First, we trained the Baseline POS tagger on all the labeled sentences from the
CoNLL2000 corpus and the NPS chat corpus. We expected this tagger not to perform well
because although it included dialogues, it did not include any domain-specific language for
the target domain.
Next, using our approach, we trained a tagger for the target domain by leveraging
the generated sentences. For each extracted domain-specific noun phrase, we randomly
selected a sentence from CoNLL2000¹ to plug in the domain-specific noun phrase to
generate a labeled target sentence. We generated 96,011 target sentences in this step. A
POS tagger was then trained using these generated target sentences along with all of the
sentences from the NPS chat corpus. The Baseline CRF tagger, the Stanford tagger, and
the Li Approach tagger were all tested on dialogues in the testing set.
¹ We chose CoNLL2000 because it has IOB tags, which makes the substitution simple.
Table 4-1. Results of baseline tagger (CRF trained on source-domain corpus), Stanford
tagger, and our approach (CRF trained on generated target-domain corpus).

                        total  NN     IN     RB     VBZ    JJ     NNS    VBG    VBD
Num.                    6571   1129   511    426    217    205    110    99     56
Baseline     prec.      0.906  0.882  0.926  0.985  0.980  0.680  0.724  0.790  0.711
             recall     0.871  0.582  0.979  0.897  0.889  0.902  0.955  0.990  0.964
             F1         0.879  0.701  0.952  0.939  0.937  0.776  0.824  0.879  0.818
Stanford     prec.      0.900  0.932  0.817  0.697  0.968  0.668  0.794  0.980  0.786
             recall     0.856  0.716  0.941  0.887  0.977  0.844  0.982  0.970  0.786
             F1         0.859  0.810  0.875  0.781  0.972  0.746  0.878  0.975  0.786
Li approach  prec.      0.930  0.887  0.926  0.980  0.981  0.854  0.911  0.933  0.730
(parallel    recall     0.928  0.919  0.982  0.918  0.954  0.859  0.836  0.980  0.964
code)        F1         0.927  0.903  0.954  0.948  0.967  0.856  0.872  0.956  0.831
Li approach  prec.      0.920  0.885  0.928  0.967  0.985  0.744  0.872  0.952  0.743
(general     recall     0.914  0.869  0.980  0.890  0.912  0.878  0.927  0.990  0.982
code)        F1         0.915  0.877  0.953  0.927  0.947  0.805  0.899  0.970  0.846
The accuracies of the Baseline Tagger, the Stanford Tagger, and the Li Tagger were
87.14%, 85.57%, and 92.76%, respectively. The Baseline Tagger performed better than the Stanford
Tagger, since its training set was partly conversational data (NPS chat corpus). Table 4-1
illustrates the combined precision, recall, and F1-score for the testing set and the same
measurements for some of the most frequently occurring POS tags. The overall precision,
recall, and F1-score were all improved by our approach. The F1-score increased from 0.879
(Baseline) to 0.927 (Li Approach), and both are higher than the Stanford tagger (0.859).
The open domain tagger trained with the NPS corpus achieved 0.834 accuracy.
For nouns (NN) in particular, which constitute the largest proportion of tokens
(17%), our approach performed particularly well. Nouns in domain-specific
language are hard to identify: the Baseline tagger achieved recall on NN of only 0.582, and
the Stanford Tagger performed worse on NN than on any other frequently occurring tag
in the set, at 0.716. Our approach achieved recall on NN of 0.919. Besides NN tokens, our
approach also achieved a much higher performance on adjectives (JJ), with an F1-score of
0.856 compared to 0.776 for Baseline and 0.746 for Stanford.
The Java code we used to generate the domain-specific training corpus was parallel
with the dialogues; such parallel code is not always available. To examine whether this
approach could use unrelated Java code, we collected 1968 lines of Java code from Oracle's
The Java™ Tutorials. With the same approach, we generated domain-specific training data and tested
on the same test set. This model achieved 0.913 accuracy, slightly lower than the model
trained with parallel code, but still much higher than models without domain adaptation.
4.2 Noun Phrase Chunking in Tutorial Dialogue
Noun phrase chunking is a type of syntactic analysis which labels all noun phrases
in a sentence (Tjong and Sang, 2000). With the POS tags generated using the approach
presented above, we performed noun phrase chunking of tutorial dialogue utterances
using a linear chain conditional random field (CRF) (Lafferty et al., 2001). In a tutorial
dialogue system, this process will find all noun phrases in user utterances. These noun
phrases are potentially referring expressions which refer to some objects in the shared
programming environment. We followed the approach in prior work to perform noun
phrase chunking (Sha and Pereira, 2003). This approach is tested on an existing corpus
and will be deployed in the dialogue system in Chapter 7. We use a BIO tagging schema,
which annotates each word in an input sentence. Each word is assigned with a tag: B
indicates “beginning of a phrase chunk”, I indicates “in a phrase chunk”, and O means
“out of a phrase chunk”. For example, in the annotated sentence “but/O why/O do/B-VP
that/B-NP when/O I/B-NP could/O just/O use/B-VP the/B-NP string/I-NP zip/I-NP
from/O the/B-NP actionPerformed/B-NP method/B-NP”, B-NP indicates the beginning
of a noun phrase, I-NP means the corresponding word is inside a noun phrase, O means
the corresponding word is not in any phrase chunk. So, “the/B-NP string/I-NP zip/I-NP”
forms a complete noun phrase according to the annotation. Given this tagging schema, we
trained a conditional random field tagger to tag all of the noun phrases for a given input
sentence.
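Decoding the noun phrases out of a predicted BIO sequence is then mechanical; a minimal sketch:

def extract_noun_phrases(words, tags):
    # Collect the maximal spans labeled B-NP / I-NP from a BIO tagging.
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-NP":
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-NP" and current:
            current.append(word)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

# For the annotated example above, the tags for "the string zip from the
# actionPerformed method" decode to ["the string zip", "the",
# "actionPerformed", "method"].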
A linear chain conditional random field (CRF) is a discriminative graphical model for
sequential data tagging. In this noun phrase chunking application, we used it to assign
BIO tags to each token in an input word sequence W = w_0, w_1, ..., w_n. Given a word
sequence W, the probability of a specific tag sequence A = a_0, a_1, ..., a_n is calculated as:

p(A|W) = (1/Z(W)) exp( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(i, w, a_i, a_{i-1}) )

The tag sequence with the highest probability is selected as the optimal annotation:

A* = argmax_{A_i} p(A_i|W)
For training data, we used data from the shared task for CoNLL-2000 (Tjong and
Sang, 2000). This corpus contains part of the Wall Street Journal corpus with BIO
annotations of phrases, comprising 211,727 tokens in total.
This CRF-based approach employed lexical features and the POS tags of the words in a
sentence as features. Brill's transformation-based learning approach was one of the most
influential POS tagging approaches (Brill, 1995), and some of our features are similar to
the rules used in Brill's work. A complete list of features can be found in Table 4-3.
Table 4-2. Noun phrase chunking result.
          tag              precision  recall  F1    # of instances
Baseline  B-NP             0.75       0.91    0.82  2352
          I-NP             0.87       0.75    0.80  1913
          B-NP, I-NP comb  0.80       0.84    0.82  4265
Proposed  B-NP             0.79       0.91    0.85  2352
          I-NP             0.84       0.94    0.89  1913
          B-NP, I-NP comb  0.81       0.92    0.86  4265
The noun phrase chunking results are shown in Table 4-2. The domain adaptation
approach increased the F1-score of noun phrase chunking from 0.82 to 0.86. The new
approach improved the recall from 0.84 to 0.92.
Table 4-3. The features used for noun phrase chunking.
features
the word in lower case
the last three letters of the word
the last two letters of the word
if the word is in upper case
if the word is title case
if the word is a number
the word's POS tag
the last two letters of the word's POS tag
the previous word in lower case
if the previous word is in upper case
if the previous word is title case
if the previous word is a number
the previous word's POS tag
the following word in lower case
if the following word is in upper case
if the following word is title case
if the following word is a number
the following word's POS tag
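A sketch of how the features in Table 4-3 translate into a per-token feature dictionary for the CRF (the feature names here are illustrative):

def chunking_features(sent, i):
    # sent: list of (word, pos) pairs; returns the Table 4-3 features
    # for token i (current, previous, and following word).
    word, pos = sent[i]
    feats = {"word.lower": word.lower(), "word.suffix3": word[-3:],
             "word.suffix2": word[-2:], "word.isupper": word.isupper(),
             "word.istitle": word.istitle(), "word.isdigit": word.isdigit(),
             "pos": pos, "pos.suffix2": pos[-2:]}
    if i > 0:
        pw, pp = sent[i - 1]
        feats.update({"prev.lower": pw.lower(), "prev.isupper": pw.isupper(),
                      "prev.istitle": pw.istitle(),
                      "prev.isdigit": pw.isdigit(), "prev.pos": pp})
    if i < len(sent) - 1:
        nw, np = sent[i + 1]
        feats.update({"next.lower": nw.lower(), "next.isupper": nw.isupper(),
                      "next.istitle": nw.istitle(),
                      "next.isdigit": nw.isdigit(), "next.pos": np})
    return feats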
4.3 Discussion
Qualitative examination shows the ways in which the proposed approach improved
over prior approaches. The example sentence used in the introduction was tagged as:
“... so[IN, IN] String[NN, NNP] a[NN, DT] =[NN, JJ] new[NN, JJ] String[NN, NNP] ([NN, -LRB-]
zipCode[NN, NN] )[NN, -RRB-] ;[NN, :] would[MD, MD]”. In each bracketed pair, the first tag
is from the proposed approach and the second is from the baseline tagger.
The proposed approach also performed very well on detecting change of usage for
domain-specific tokens, such as “the if statement” and “the for loop.” The proposed
approach correctly tagged “if” and “for” in these cases as NN, while in phrases such as “if
I use...” and “...for this method...,” they were correctly tagged as IN. Neither the Baseline
nor the Stanford Tagger could do this. To illustrate, consider an excerpt sentence from the
test set: “thatDT, DT lineNN, NN youPRP, PRP justRB, RB typedVBD, VBD canMD, MD beVB, VB
putVBN, VBN inIN, IN theDT, DT (NN, -LRB- )NN, -RRB- ofIN, IN theDT, DT forNN, IN loopNN, NN”.
In earlier work on domain adaptation for POS tagging, researchers have used
semi-supervised approaches, which employ a small annotated corpus of the target
language and a large annotated source language corpus to train a POS tagger for
the target language (Jiang and Zhai, 2007; Daume, 2009; Finkel and Manning, 2009;
Garrette and Baldridge, 2013; Plank et al., 2014). There has also been some work
using unsupervised approaches to perform domain adaptation, such as by employing
structural correspondence learning (Blitzer, 2006) and word clusters learned from an
unlabeled target data set (Owoputi et al., 2013). Crowd-sourcing has also been leveraged
to implement domain adaptation for POS tagging (Hovy et al., 2015; Li et al., 2012). The
approach reported in this chapter generates labeled training data for the target language
automatically and thus dramatically simplifies the problem.
This chapter has reported a simple but effective domain adaptation approach for
POS tagging. Both quantitative and qualitative evaluation based on a corpus of informal
textual dialogues for Java programming demonstrated the effectiveness of the approach
compared to a Baseline approach and the Stanford tagger. The performance of the
reported approach was particularly evident on challenging noun phrases in the target
language. Experiments showed that even when using domain tokens unrelated to the
target testing corpus, the reported approach dramatically improved POS tagging on the
target language. This is an essential step toward accurate referring expression extraction.
CHAPTER 5SEMANTIC INTERPRETATION OF REFERRING EXPRESSIONS
This chapter presents a novel approach I created to perform semantic interpretation
of referring expressions within a situated environment. Recall that a situated dialogue is
embedded in an environment, where the dialogue usually focuses on a domain-specific task
within this environment. Referring expressions are noun phrases used to refer to entities
in the situated environment. In the context of tutorial dialogue for Java programming,
as shown in Figure 1-1 at the beginning of the introduction, noun phrases like “the
2 dimensional array”, and “the for loop” all refer to some entity in the parallel Java
program. These noun phrases are referring expressions in the situated dialogue for Java
programming.
The approach presented in this chapter performs joint segmentation and labeling of
the noun phrases to link them to attributes of entities within the environment. It is a new
way to provide semantic information for reference resolution in a situated environment.
Evaluation results on a corpus of tutorial dialogue for Java programming demonstrate that
a Conditional Random Field (CRF) model performs well, achieving an accuracy of 89.3%
for linking semantic segments to the correct entity attributes. This work is a step toward
enabling dialogue systems to perform accurate reference resolution.
Previous approaches for semantic interpretation include domain-specific grammars
(Lemon et al., 2001) and open-domain parsers together with a domain-specific lexicon
(Rose, 2000). However, existing techniques are not sufficient to support increasingly
complex task-oriented dialogues due to several challenges. For example, domain-specific
grammars become intractable when applied to more ill-formed domains, and open-domain
parsers may not perform well across domains (McClosky et al., 2010).
To address these challenges, this chapter presents a step toward reference resolution
in situated dialogues for complex problem-solving, in which the number of potential
entities (e.g. a Java variable or a piece of code) is infinite. The present work focuses
on the semantic interpretation of noun phrases, which tend to bear significant semantic
information for each utterance. Although noun phrases are typically small in their
number of tokens, their complexity and semantics vary in important ways. For example,
in the domain of computer programming, two similar noun phrases such as “the 2
dimensional array” and “the 3 dimensional array” refer to two different entities within
the problem-solving artifact. Inferring the semantic structure of the noun phrases is
necessary to differentiate these two references within a dialogue, to ground them in the
task, and to respond to them appropriately. Coreference resolution focuses on discovering
the coreference relationship between pairs of noun phrases in a piece of natural language
text (Culotta et al., 2007; Lappin and Leass, 1994), which is similar to the ultimate goal
of reference resolution in complex problem solving. However, unlike coreference
resolution, reference resolution links natural language expressions to entities in a
real-world environment. Compared with natural language expressions, real-world entities
contain richer information that can be utilized in the task of reference resolution. In
addition, the situated character of the dialogues generated in complex problem solving
introduces more uncertainty into the meaning of noun phrases used to refer to an entity
than is found in a self-contained natural language text; consider, for example, a user saying
“that variable” while highlighting a variable in Java code. Fully understanding “that
variable” requires additional contextual information from the environment in which this
noun phrase was generated.
The current approach leverages the structure of noun phrases, mapping their
segments to attributes of entities to which they should be semantically linked. In order to
overcome the limitation of needing to fully enumerate the entities in the environment, we
represent the entities as automatically extracted vectors of attributes. We then perform
joint segmentation and labeling of the noun phrases in user utterances to map them to
the entity vectors (used to describe entities within the environment). In this way, the
semantics of noun phrases could be grounded by linking segments of noun phrases to
attributes of entities in the environment. The results show that a Conditional Random
Field performs well for this task, achieving 89.3% accuracy. Moreover, even in the
absence of lexical features (using only dependency parse features and parts of speech), the
model achieves 71.3% accuracy, indicating that it may be tolerant to unseen words. The
flexibility of this approach is due in part to the fact that it does not rely on a syntactic
parser's ability to accurately segment within noun phrases, but rather includes parse
features as just one type of feature among several made available to the model. Finally, in
contrast to methods based on bag-of-words such as latent semantic analysis, the reported
approach models the structure of noun phrases to facilitate specific grounding within an
artifact.
5.1 Semantic Interpretation as Sequence Labeling
To interpret the dialogue utterances as described above, our approach focuses first
upon noun phrases, which contain rich semantic information. This section introduces the
approach, based on Conditional Random Fields, to jointly segment the noun phrases and
link those segments to entities within the domain.
5.1.1 Noun Phrases in Domain Language
A noun phrase is defined as “a phrase which has a noun (or indefinite pronoun)
as its head word, or which performs the same grammatical function as such a phrase”
(Crystal, 1997). The syntactic structure of a noun phrase consists of dependents which
could include determiners, adjectives, prepositional phrases, or even a clause. For example,
the noun phrase “a 2 dimensional array” occurs within the Java programming corpus. Its
head is “array” and its dependents are “a” as the determiner and “2 dimensional” as an
adjective phrase. In this simple case the syntactic boundaries also indicate semantic
segments, as these dependents indicate one or more attributes of the head. If this
relationship were always true, the semantic structure understanding task would be a
labeling task that only requires assigning a semantic tag to each syntactic segment of the
noun phrase. But this is not always true, in part because a syntactic parser trained on
an open-domain corpus will not necessarily perform well on domain language (McClosky
Figure 5-1. A parse of “the outer for loop” from the Stanford Parser, which attaches
“for” as a preposition (IN) heading a prepositional phrase over “loop” (NN), rather than
treating “for loop” as the head of the noun phrase.
et al., 2010). For example, in the noun phrase “the outer for loop,” which also occurs
in the Java programming corpus, the head of the noun phrase is “for loop,” but the
syntactic parse (generated by the Stanford parser) of this noun phrase understandably
(but incorrectly) identifies this head as part of a prepositional phrase (Figure 5-1).
To address this challenge, this chapter describes a joint segmentation and semantic
labeling approach that does not require accurate syntactic parsing within noun phrases.
In this approach the head and dependents of each noun phrase are each referred to as a
segment, with exactly one segment per dependent, and one or more words per segment.
Identifying these segments correctly is essential to correct assignment of semantic tags.
Pipeline methods for semantic segmentation rely on stable performance of an open
domain parser, but as described above, this assumption is not desirable for grounding
some domain language. We therefore utilize joint segmentation and labeling, and apply
a Conditional Random Field approach (Lafferty et al., 2001), a natural choice for the
sequential data segmentation and labeling problem.
5.1.2 Description Vector
The goal is to ground each noun phrase to an entity within the problem-solving
artifact, which constitutes the “world” in this domain. To do this, we will link each
semantic segment in a noun phrase to an attribute of an entity in the world. Because the
world can contain any of an infinite set of user-created entities, representation cannot rely
upon exhaustively enumerating the entities. To represent an entity in the domain, we
define a description vector V that specifies the attribute types for entities in the domain.
Then, an entity O in the domain is represented uniquely by an instance of V. The value
of each V_i indicates the value of the corresponding attribute of O, as illustrated in Table 3-1. This
definition of the description vector relies upon the structure of the domain by factorizing
the attributes of entities. With this representation, interpreting a noun phrase involves
linking each segment of the noun phrase to a cell in the description vector. Formally, we
represent a noun phrase as a series of segments:
NP = <s_1, s_2, ..., s_k>

where s_i is the i-th segment in this noun phrase. A noun phrase is also a sequence of
words:

NP = <w_1, w_2, ..., w_n>

where each w_j is the j-th word in the noun phrase. Therefore each segment is a series of
words:

s_i = <w_j, w_{j+1}, ..., w_{j+l-1}>

where l is the length of semantic segment i. Given a noun phrase, the segmentation
problem is thus maximizing the following conditional probability:

p(<s_1, s_2, ..., s_k> | <w_1, w_2, ..., w_n>)
Complementary to the segmentation problem is the semantic linking problem, which is to
link s_i to an attribute a_i, the label of the i-th attribute in the entity description
Figure 5-2. Segmentation and semantic linking of NP “a 2 dimensional array”: the words
w_1 ... w_4 receive word-level attribute labels (NUM, ARR_DIM, ARR_DIM, CATEG.),
and consecutive words with the same label are merged into segments s_1, s_2, s_3 linked
to attributes a_1, a_2, a_3 (NUM, ARR_DIM, CATEG.).
vector. That is, we wish to maximize the probability of the attribute label sequence a
given the segments of the noun phrase:
p(<a_1, a_2, ..., a_k> | <s_1, s_2, ..., s_k>)
Taking consecutive words with the same attribute label as the same semantic segment, the
noun phrase segmentation and semantic linking problem is then:

argmax_a { p(<a_1, a_2, ..., a_n> | <w_1, w_2, ..., w_n>) }

In the tag sequence <a_1, a_2, ..., a_n>, if a_i and a_{i+1} are the same, then w_i and w_{i+1} are
assigned to the same semantic segment with tag a_i. The process of segmentation and
semantic linking is illustrated in Figure 5-2.
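Decoding segments from a predicted word-level label sequence is then a simple merge of adjacent identical labels; a minimal sketch:

from itertools import groupby

def labels_to_segments(words, labels):
    # Merge consecutive words sharing an attribute label into segments;
    # returns (segment_words, attribute_label) pairs.
    segments, idx = [], 0
    for label, group in groupby(labels):
        n = len(list(group))
        segments.append((words[idx:idx + n], label))
        idx += n
    return segments

# labels_to_segments(["a", "2", "dimensional", "array"],
#                    ["NUM", "ARR_DIM", "ARR_DIM", "CATEG"])
# -> [(["a"], "NUM"), (["2", "dimensional"], "ARR_DIM"),
#     (["array"], "CATEG")]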
5.1.3 Joint Segmentation and Labeling
In order to perform this joint segmentation and labeling, we utilize a Conditional
Random Field (CRF), which is a classic approach for sequence segmentation and labeling
(Lafferty et al., 2001). Given the linear nature of our data, we employ a linear chain CRF.
Specifically, given a sequence of words w, the probability of a label sequence a is defined
as

p(a|w) = (1/Z(w)) exp( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(i, w, a_i, a_{i-1}) )

where f_j(i, w, a_i, a_{i-1}) is a feature function. The weights \lambda_j of this feature function are
learned within the training process. The normalization function Z(w) is the sum of the
weighted feature function over all possible label sequences:

Z(w) = \sum_a exp( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(i, w, a_i, a_{i-1}) )

The optimal labeling a* is the one that maximizes the likelihood of the training set,
where K is the number of noun phrases in the corpus:

a* = argmax \sum_{i=1}^{K} log P(a^{(i)} | w^{(i)})
5.1.4 Features
Next, we introduce the features used to train the CRF. The feature function
f_j(i, w, a_i, a_{i-1}) was defined as a binary function, in which w is a feature value. We use
both lexical and syntactic features. In a trained CRF model, the value of f_j(i, w, a_i, a_{i-1})
is known given a combination of parameters (i, w, a_i, a_{i-1}). The features used in the
CRF model include words themselves, word lemmas, parts of speech, and dependency
relationships from the syntactic parse. The word itself, lemmatized words, and parts-of-speech
have all been shown useful within segmentation and labeling tasks, so they are made
available here (Xue and Palmer, 2004). Each of these features is represented as categorical
data. For example, a word is represented as its index in a list of all of the words that
appeared in the corpus.
The dependency structure of natural language has also been shown to be important in
semantic interpretation (Poon and Domingos, 2009). This chapter employs a dependency
feature vector extracted from dependency parses. The head word of each noun phrase is
the root of the dependency tree. Each dependent is a sub-tree directly under the head.
Figure 5-3. Dependency structure of “a 2 dimensional array”: the head is “array”, with
dependent 1 “2 dimensional” attached as amod and dependent 2 “a” attached as det.
We design the dependency feature as a sequence of dependency labels as follows. Given a
dependency tree, words in each semantic segment of the noun phrase are assigned a tag
according to the relationship between them and the head. The relationship between each
segment and head is defined by the dependency type in the dependency tree. For example,
the dependency tree of “a 2 dimensional array” is shown in Figure 5-3. The dependency
features are < det, amod, amod, root >. In this way, the dependency information from an
open-domain parser is encoded as a feature to the semantic labeling model.
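A sketch of computing this dependency feature, under the assumption that the parse arrives as one (head index, relation) pair per token, with the noun phrase head marked by a head index of -1:

def dependency_features(parse):
    # parse: (head_index, relation) per token; the NP head has
    # head_index == -1. Each token receives the relation of the
    # dependent subtree (directly under the head) that contains it,
    # so all words of one dependent share a label.
    labels = []
    for i, (head, rel) in enumerate(parse):
        if head == -1:
            labels.append("root")
            continue
        j = i
        while parse[parse[j][0]][0] != -1:  # climb to the head's child
            j = parse[j][0]
        labels.append(parse[j][1])
    return labels

# For "a 2 dimensional array" with "array" as root:
# dependency_features([(3, "det"), (2, "nummod"), (3, "amod"), (-1, "root")])
# -> ["det", "amod", "amod", "root"], matching Figure 5-3.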
5.2 Experiments and Results
The goal of the experiments is to determine how well the trained CRF can segment
noun phrases and link these segments to the correct attribute of entities in the world. This
section presents the experiments using CRFs trained and tested on the Java programming
tutorial dialogue corpus. As described below, the results were evaluated by comparing
with manually labeled data. Noun phrases from the tutorial dialogues were first manually
extracted and annotated as to their slots in the description vector described in Section
5.1.2. There were 346 grounded noun phrases extracted manually from the six tutorial
dialogue sessions used in the current work. Each of these extracted noun phrases has one
or multiple corresponding entities in the programming artifact. Since each word in a noun
phrase is linked to an element in the description vector, the indices in this vector were
used as the label for each word. Annotation of all 346 noun phrases was performed by
one annotator, and 20% of the noun phrases (70 noun phrases) were doubly annotated
by an independent second annotator. The percent agreement was 85.3% and the Kappa
was 0.765. To extract features, the lemmatization and syntactic parsing were performed
with the Stanford CoreNLP toolkit (Manning et al., 2014). Then, a CRF was trained to
predict the label for each word in a new noun phrase. The training was performed with
the crfChain toolbox (Schmidt and Swersky, 2008).
We use ten-fold cross-validation to evaluate the performance of the CRF in this
problem. Results with different feature combinations are shown in Table 5-1. Manually
labeled data were taken as ground truth for computing accuracy, which is defined as the
percentage of segments correctly labeled. Recall that consecutive words with the same
label in a noun phrase are treated as a segment. Therefore, if a segment s_CRF identified
by the CRF has the same boundary and the same label as a segment s_Human in the
noun phrase containing s_CRF, the segment s_CRF is counted as correct;
otherwise, s_CRF is counted as incorrect. The accuracy is then calculated as the
number of correct segments identified by the CRF divided by the number of segments
annotated manually. As can be seen in Table 5-1, all of the models perform substantially
better than a minimal chance baseline of 43%, which would result from taking each
word as a segment and assigning it the most frequent attribute label. The results
demonstrate important characteristics of the segmentation and labeling model. First,
unlike most previous semantic interpretation work, our semantic interpretation of noun
phrases does not rely on accurate syntactic parse within noun phrases. Rather, we use
a dependency parse from an open-domain parser as only one of several types of features
provided to the model. These dependency features improved the model in most feature
combinations (Table 5-1). The feature combination of words, lemmas, and dependency
parses achieved the best accuracy, 4.8 percentage points higher than the model that only used
word features. This difference is statistically significant (Wilcoxon rank-sum test; n=10;
p=0.02).
Table 5-1. Semantic labeling accuracy.
features                   accuracy
word                       84.5%
word + lemma               85.5%
word + Dep                 87.2%
lemma + Dep                89.1%
word + lemma + Dep         89.3%
word + lemma + POS         86.9%
word + lemma + POS + Dep   88.7%
POS + Dep                  71.3%
Notably, the combination of part-of-speech features and dependency parse features
still performed at 71.3% accuracy, indicating that to some extent, the method may be
tolerant to unseen words.
CHAPTER 6REFERENCE RESOLUTION FOR SITUATED DIALOGUE SYSTEM
Reference resolution in situated dialogues in a complex environment is often fraught
with high ambiguity. In Chapter 4, we presented our approach to extracting referring
expressions from user utterances in real time. Given the extracted referring expressions,
we need to identify their referents in the situated environment, which is the problem of
reference resolution. In this chapter, I report a novel approach that I developed to address
these challenges by combining the learned semantic structure of referring expressions
with dialogue history into a ranking-based model. In this chapter, I evaluate the new
technique on a corpus of human-human tutorial dialogues for computer programming.
The experimental results show a substantial performance improvement over two recent
state-of-the-art approaches. The reported approach makes a stride toward automated
dialogue in complex problem-solving environments, and will be used in the tutorial
dialogue system described in Chapter 7.
6.1 Reference Resolution in a Situated Environment
This section describes a new approach to reference resolution in situated dialogue. It
links each referring expression from the dialogue to its most likely referent object in the
environment. Our approach involves three main steps.
First, referring expressions from the situated dialogue are segmented and labeled
according to their semantic structure. Using a semantic segmentation and labeling
approach I have previously developed (Li and Boyer, 2015), a conditional random field
(CRF) is used for this joint segmentation and labeling task, and the values of the labeled
attributes are then extracted (Section 6.2). The result of this step is learned semantics,
which are attributes of objects expressed within each referring expression. Then, these
learned semantics are utilized within the novel approach reported in this chapter. As
Section 6.3 describes, dialogue and task history are used to filter the objects in the
environment to build a candidate list of referents, and then as Section 6.4 describes, a
ranking-based classification approach is used to select the best matching referent.
For situated dialogue we define Et as the state of the environment at time t. Et
consists of all objects present in the environment. Importantly, the objects in the
environment vary along with the dialogue: at each moment, new objects could be created
(|Et| > |Et−1|), and existing objects could be removed (|Et| < |Et−1|) as the user performs
task actions.
Et = {oi|oi is an object in the environment at time t}
We assume that all of the objects oi are observable in the environment. For example,
in situated dialogues about programming, we can find all of the objects and extract their
attributes using a source code parser. Then, reference resolution is defined as finding a
best-matching oi in Et for referring expression RE.
6.2 Referring Expression Semantic Interpretation
In situated dialogues, a referring expression may contain rich semantic information
about the referent, especially when the context of the situated dialogue is complex.
Approaches such as domain-specific lexicons are limited in their ability to address this
complexity, so we utilize a linear-chain CRF to parse the semantic structure of the
referring expression as presented in Chapter 5. This more automated approach can also
potentially avoid the manual labor required in creating and maintaining a lexicon.
In this approach, every object within the environment must be represented according
to its attributes. We treat the set of all possible attributes of objects as a vector, and
for each object o_i in the environment, we instantiate and populate an attribute vector
Att_Vec_i. For example, the attribute vector for a two-dimensional array in a computer
program could be [CATEGORY = 'array', DIMENSION = '2', LINE = '30', NAME =
'table', ...]. We ultimately represent E_t = {o_i} as the set of all attribute vectors
Att_Vec_i, and for a referring expression we aim to identify Att_Vec_j, the actual referent.
Since a referring expression describes its referents either implicitly or explicitly, the
attributes expressed in it should match the attributes of its referent. We segment referring
expressions and label the semantics of each segment using the CRF; the result is a
set of segments, each of which represents some attribute of its referent. This process is
illustrated in (Figure 6-1 (a)). After segmenting and labeling attributes in the referring
expressions, the attribute “values” are extracted from each semantic segment using regular
expressions (Figure 6-1 (b)), e.g., value “2” is extracted from “2 dimensional” to fill in
the “ARRAY DIM” element in an empty Att V ec. The result is an attribute vector that
represents the referring expression.
Figure 6-1. Semantic interpretation of referring expressions.
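A sketch of this value-extraction step (the attribute names and regular expressions are illustrative):

import re

# Hypothetical per-attribute extractors: numeric values for ARRAY_DIM
# and LINE_NUMBER, the surface string for NAME, and the (categorical)
# segment text for CATEGORY.
EXTRACTORS = {
    "ARRAY_DIM":   lambda seg: re.search(r"\d+", seg).group(),
    "LINE_NUMBER": lambda seg: re.search(r"\d+", seg).group(),
    "NAME":        lambda seg: seg,
    "CATEGORY":    lambda seg: seg.lower(),
}

def to_attribute_vector(segments):
    # segments: (text, attribute_label) pairs from the CRF, e.g.
    # [("2 dimensional", "ARRAY_DIM"), ("array", "CATEGORY")].
    att_vec = {}
    for text, label in segments:
        extract = EXTRACTORS.get(label)
        if extract:
            att_vec[label] = extract(text)
    return att_vec

# to_attribute_vector([("2 dimensional", "ARRAY_DIM"),
#                      ("array", "CATEGORY")])
# -> {"ARRAY_DIM": "2", "CATEGORY": "array"}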
6.3 Generating a List of Candidate Referents
Once the referring expression is represented as an object attribute vector as described
above, we wish to link that vector to the closest-matching object in the environment.
Each object is represented by its own attribute vector, and there may be a large number
of objects in Et. Given a referring expression Rk, we would like to trim the list to keep
only those objects that are likely to be the referent of Rk.
There are two desired criteria for generating the list of candidate referents. First, the
actual referent must be in the candidate list. At the same time, the candidate list should
be as short as possible. We can pare down the set of all objects in Et by considering focus
of attention in dialogue. Early approaches performed reference resolution by estimating
each dialogue participant’s focus of attention (Lappin and Leass, 1994; Grosz et al.,
1995). According to Ariel’s accessibility theory (Ariel, 1988), people tend to use more
precise descriptions such as proper names in referring expressions for referents in long
term memory, and use less precise descriptions such as pronouns for referents in short
term memory. In a precise description, there is more semantic information, while in a
more vague description like a pronoun, there is less semantic information. Thus, these two
sources of information, semantics and focus of attention, work together in identifying a
referent.
Our approach employs this idea in the process of candidate referent selection by
tracking the focus of attention of the dialogue participants from the beginning of the
dialogue through dialogue history and task history, as has been done in prior work we
use for comparison within our experiments (Iida et al., 2010). We also use the learned
semantics of the referring expression (represented as the referring expression’s attribute
vector) as filtering conditions to select candidates.
The candidate generation process consists of three steps.
1. Candidate generation from dialogue history DH.

   DH = <O_d, T_d>

   Here, O_d = <o_d^1, o_d^2, ..., o_d^m> is the sequence of objects that have been
   mentioned since the beginning of the dialogue, and T_d = <t_d^1, t_d^2, ..., t_d^m> is
   the sequence of timestamps at which the corresponding objects were mentioned. All of
   the objects in E_t that were ever mentioned in the dialogue history,
   {o_i | o_i ∈ DH and o_i ∈ E_t}, are added into the candidate list.

2. Candidate generation from task history TH. Similarly, TH = <O_b, T_b>; all of the
   objects in E_t that were ever manipulated by the user are added into the candidate
   list.
Table 6-1. Algorithm to select candidates using learned semantics.

Given a referring expression R_k, whose attribute vector Att_Vec_k has been extracted:
  for each element att_i of Att_Vec_k:
    if att_i is not null:
      for each o in E_t:
        if att_i == o.att_i:
          add o into candidate list C_k
3. Candidate generation using learned semantics, which are the referent's attributes.
   Given a set of attributes extracted from a referring expression, all objects in E_t with
   one of the same attribute values will be added into the candidate list. The attributes
   are considered separately to avoid the case in which a single incorrectly extracted
   attribute could rule out the correct referent. Table 6-1 shows the algorithm used in
   this step; a sketch combining all three steps follows below.
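Putting the three steps together, candidate generation can be sketched as follows (the object and history representations are assumptions of the sketch):

def generate_candidates(env_objects, dialogue_history, task_history, att_vec):
    # env_objects: objects in E_t, each with an id and an attrs dict.
    # dialogue_history / task_history: sets of ids of objects previously
    # mentioned / manipulated. att_vec: attributes extracted from the
    # referring expression. Returns the candidate referent list.
    candidates = []
    for o in env_objects:
        in_history = o.id in dialogue_history or o.id in task_history
        # Attributes are checked separately, so a single incorrectly
        # extracted attribute cannot rule out the true referent.
        matches_attr = any(o.attrs.get(k) == v for k, v in att_vec.items())
        if in_history or matches_attr:
            candidates.append(o)
    return candidates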
6.4 Ranking-based Classification
With the list of candidate referents in hand, we employ a ranking-based classification
model to identify the most likely referent. Ranking-based models have been shown to
perform well for reference resolution problems in prior work (Denis and Baldridge,
2008; Iida et al., 2010). For a given referring expression Rk and its candidate referent
list Ck = {o1, o2, ..., oNk}, in which each oi is an object identified as a candidate
referent, we compute the probability of each candidate oi being the true referent of
Rk, p(Rk, oi) = f(Rk, oi), where f is the classification function. (Note that our approach is
classifier-agnostic. As we describe in Section 6.5.3, we experimented with several different
models.) Then, the candidates are ranked by p(Rk, oi), and the object with the highest
probability is taken as the referent of Rk.
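The ranking step itself is classifier-agnostic; a sketch assuming a trained binary classifier with a scikit-learn-style predict_proba interface and a featurize function that builds the feature vector of Table 6-2:

def resolve_reference(ref_exp, candidates, classifier, featurize):
    # Score each (referring expression, candidate) pair and return the
    # candidate with the highest probability of being the referent.
    best, best_p = None, -1.0
    for o in candidates:
        features = featurize(ref_exp, o)  # SF/DH/TH features (Table 6-2)
        p = classifier.predict_proba([features])[0][1]
        if p > best_p:
            best, best_p = o, p
    return best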
6.5 Experiments and Result
To evaluate the new approach, we performed a set of experiments that compare our
approach with two state-of-the-art approaches. We use the corpus described in Chapter 3.
6.5.1 Semantic Parsing
The referring expressions were extracted from the tutorial dialogues and their
semantic segments and labels were manually annotated. A linear-chain CRF was trained
on that data and used to perform referring expression segmentation and labeling (Li and
Boyer, 2015). The current work reports the first use of that learned semantics approach
for reference resolution.
Next, we proceeded to extract the attribute values, a step that our previous work
did not address. For the example shown in Figure 6-1 (b), from the learned semantic
structure, we may know that “2 dimensional” refers to the dimension of the array, the
attribute “ARRAY_DIM”. (In the current domain there are 14 attributes that comprise
the generic attribute vector V, such as ARRAY_DIM, NUM, and CATEGORY.) To
actually extract the attribute values, we use regular expressions that capture our three
types of attribute values: categorical, numeric, and string. For example, the value type
of “CATEGORY” is categorical, like “method” or “variable”; its values are taken from a
closed set. “NAME” has values that are strings, and the value of “LINE_NUMBER” is numeric.
For categorical attributes, we add the categorical attribute values into the semantic tag
set of the CRF used for segmentation. In this way, the attribute values of categorical
attributes will be generated by the CRF. For attributes with text string values, we take
the whole surface string of the semantic segment as its attribute value. The accuracy of
the entire semantic parsing pipeline is 93.2% using 10-fold cross-validation. The accuracy
is defined as the percentage of manually labeled attribute values that were successfully
extracted from referring expressions.
6.5.2 Candidate Referent Generation
We applied the approach described in Section 6.3 to each session to generate a list of
candidate referents for each referring expression. In a program, there can be more than
one appearance of the same object. We take all of the appearances of the same object to
be the same, since they all refer to the same artifact in the program. The average number
of generated candidates for each referring expression was 44.8. The percentage of referring
expressions whose actual referents were in the generated candidate list, or “hit rate,” was
90.5%, based on manual tagging. This performance indicates that the candidate referent
list generation performs well.
A referring expression could be a pronoun, such as “it” or “that”, which does not
contain attribute information. In previous reference resolution research, it was shown
that training separate models for different kinds of referring expressions could improve
performance (Denis and Baldridge, 2008). We follow this idea and split the dataset
into two groups: referring expressions containing attributes, REFATT (270 referring
expressions), and referring expressions that do not contain attributes, REFNON (76
referring expressions).
The candidate generation approach performed better for the referring expressions
without attributes (hit rate 94.7%), compared to referring expressions with attributes (hit
rate 89.3%). Since the candidate list for referring expressions without attributes relies
solely on dialogue and task history, 94.7% of those referents had been mentioned in the
dialogue or manipulated by the user previously. For referring expressions with attribute
information, the generation of the candidate list also used learned semantic information.
Only 70.0% of those referents had been mentioned in the dialogue or manipulated by the
user before.
6.5.3 Identifying Most Likely Referent
We applied the approach described in Section 6.4 to perform reference resolution on
the corpus of tutorial dialogue. The data from the six manually labeled Java tutoring
sessions were split into a training set and a test set. We used leave-one-dialogue-out cross
validation (which leads to six folds) for the reference resolution experiments. In each
fold, annotated referring expressions from one of the tutoring sessions were taken as the
test set, and data from the other five sessions were the training set. We tested logistic
regression, decision tree, naive Bayes, and neural networks as classifiers to compute the
p(Rk, oi) for each (referring expression, candidate) pair for the ranking-based model. The
features provided to each classifier are shown in Table 6-2.
Table 6-2. Features used for reference resolution.

Learned Semantic Features (SF)
SF1: whether RE has a CATEGORY attribute
SF2: whether RE.CATEGORY == o.CATEGORY
SF3: whether RE has RE.NAME
SF4: whether RE.NAME == o.NAME
SF5: RE.NAME ≈ o.NAME
SF6: whether RE.VAR_TYPE exists
SF7: whether RE.VAR_TYPE == o.VAR_TYPE
SF8: whether RE.LINE_NUMBER exists
SF9: whether RE.LINE_NUMBER == o.LINE_NUMBER
SF10: whether RE.ARRAY_DIMENSION exists
SF11: whether RE.ARRAY_DIMENSION == o.ARRAY_DIMENSION
SF12: CATEGORY of o

Dialogue History (DH) Features
DH1: whether o is the latest mentioned object
DH2: whether o was mentioned in the last 30 seconds
DH3: whether o was mentioned in the last [30, 60] seconds
DH4: whether o was mentioned in the last [60, 180] seconds
DH5: whether o was mentioned in the last [180, 300] seconds
DH6: whether o was mentioned in the last [300, 600] seconds
DH7: whether o was mentioned in the last [600, infinite] seconds
DH8: whether o was never mentioned from the beginning
DH9: string matching between o and RE

Task History (TH) Features
TH1: whether o is the most recent object manipulated
TH2: whether o was manipulated in the last 30 seconds
TH3: whether o was manipulated in the last [30, 60] seconds
TH4: whether o was manipulated in the last [60, 180] seconds
TH5: whether o was manipulated in the last [180, 300] seconds
TH6: whether o was manipulated in the last [300, 600] seconds
TH7: whether o was manipulated in the last [600, infinite] seconds
TH8: whether o was never manipulated from the beginning
TH9: whether o is in the current working window
To evaluate the performance of the new approach, we compare against two other
recent approaches. First, we compare against a ranking-based model that uses dialogue
history and task history features (Iida et al., 2010). This model uses semantics from
a domain-specific lexicon instead of a semantic parser. (Iida et al.'s work was extended
by Funakoshi et al. (Funakoshi et al., 2012), but that work relies upon a handcrafted
probability distribution of referents to concepts, which is not feasible in our domain
since it has no fixed set of possible referents.) Therefore, we compare against their 2010
approach, implementing it in a way that creates the strongest possible baseline: we built
a lexicon directly from our manually labeled semantic segments. First, we split all of
the semantic segments into groups by their tags. Then, for each group of segments, any
token that appeared twice or more was added into the lexicon. Although the necessary
data to do this would not be available in a real application of the technique, it ensures
that the lexicon for the baseline condition has good coverage and creates a high baseline
for our new approach to compare against. Additionally, for fairness of comparison, for
each semantic feature used in our model, we extracted the same feature using the lexicon.
There were three kinds of attribute values in the domain: categorical, string, and numeric
(as described in Section 6.5.1). We extracted categorical attribute values using the
appearance of tokens in the lexicon. We used regular expressions to determine whether
a referring expression contains the name of a candidate referent. We also used regular
expressions to extract attribute values from referring expressions, such as line number. We
also provided the Iida baseline model (Iida et al., 2010) with a feature to indicate string
matching between referring expressions and candidate referents, since this feature was
captured in our model as an attribute.
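The lexicon construction for this baseline can be sketched as follows:

from collections import Counter, defaultdict

def build_lexicon(labeled_segments):
    # labeled_segments: (token_list, semantic_tag) pairs from the
    # manually annotated segments. A token enters the lexicon for a
    # tag if it appears at least twice in that tag's segments.
    counts = defaultdict(Counter)
    for tokens, tag in labeled_segments:
        counts[tag].update(tokens)
    return {tag: {t for t, c in counter.items() if c >= 2}
            for tag, counter in counts.items()}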
We also compared our approach (we call it Li approach here) against a very recent
technique that leveraged a word-as-classifier approach to learn semantic compatibility
between referring expressions and candidate referents (Kennington and Schlangen, 2015).
To create this comparison model, we used a word-as-classifier model to learn the semantics
of referring expressions instead of the CRF. This weakly supervised approach relies on
co-occurrence between words and objects' attributes. We then used the resulting semantic
compatibility in a ranking-based model to select the most likely referent.
The three conditions for our experiment are as follows.
• Iida Baseline Condition: Features including dialogue history, task history, andsemantics from a handcrafted lexicon (Iida et al., 2010).
• Kennington Baseline Condition: Features including dialogue history, task history,and learned semantics from a word-as-classifier model (Kennington and Schlangen,2015).
• Li approach: Features including dialogue history, task history, and learned semanticsfrom CRF.
Within each of these experimental conditions, we varied the classifier used to compute
p(Rk, oi), testing four classifiers: logistic regression (LR), decision tree (DT), naive
Bayes (NB), and neural network (NN). The neural network had one hidden layer, and the
best-performing number of hidden units was 100 (we experimented with values between 50 and
120).
To measure the performance of the reference resolution approaches, we analyzed
accuracy, defined as the percent of referring expressions that were successfully linked to
their referents. We chose accuracy for our metric following standard practice (Iida et al.,
2010; Kennington and Schlangen, 2015) because it provides an overall measure of the
number of (Rk, oi) pairs that were correctly identified. For the rare cases in which one
referring expression referred to multiple referents, the output referent of the algorithm was
taken as correct if it selected any of the multiple referents.
The results are shown in Table 6-3. We focus on comparing the results on referring
expressions that contain attribute information, shown in the table as REFATT . REFATT
accounts for 78% of all of the cases (270 out of 346). Among the three approaches, our
approach (Li approach) outperformed both prior approaches. Compared to the Iida
2010 approach which achieved a maximum of 55.2% accuracy, our approach achieved
68.5% accuracy using a neural net classifier, and this difference is statistically significant
based on the results of a Wilcoxon signed-rank test (n = 6; p = 0.046). Our approach
outperformed the Kennington 2015 approach even more substantially, as its best
performance was 46.3% accuracy (p = 0.028). Intuitively, the better performance of
our model compared to the Iida approach is due to its ability to more accurately model
referring expressions' semantics: semantic labeling finds an optimal segmentation for
each referring expression as a whole, whereas a lexicon approach extracts each kind of
attribute information from referring expressions separately. Note that our approach
and the Iida 2010 approach achieved the same performance on REFNON referring
expressions. Since these referring expressions do not contain attribute information,
these two approaches used the same set of features.
Interestingly, the model using a word-as-classifier approach to learn the semantic
compatibility between referring expressions and referent’s attributes performs the worst.
We believe that the reason for this poor performance is mainly from the way it performs
semantic compositions. It cannot learn structures in referring expressions, such as that
“2 dimensional” is a segment, “dimensional” represents the type of the attribute, and “2”
is the value of the attribute. The word-as-classifier model cannot deal with this complex
semantic composition.
The combined accuracy over REFATT and REFNON was also calculated using the
neural network model. The proposed approach achieved an accuracy of 61.6%, and the
lexicon-based baseline achieved an accuracy of 51.3%.
The results reported above relied on learned semantics. We also performed experiments
using manually labeled, gold-standard semantics of referring expressions. The result in
Table 6-4 shows that ranking-based models have the potential to achieve a considerably
better result, 73.6%, with more accurate semantic information. Given the 85.3%
agreement between two human annotators, the model performs very well, since the
semantics of whole utterances in situated dialogue also play a very important role in
identifying a given referring expression’s referent.
Table 6-3. Reference resolution results.
experimental       f(Rk, oi)    accuracy
condition          classifier   REF_ATT   REF_NON
Iida 2010          LR           0.500     0.440
                   DT           0.537     0.453
                   NB           0.466     0.413
                   NN           0.552     0.373
Kennington 2015    LR           0.463     0.387
                   DT           0.377     0.333
                   NB           0.321     0.400
                   NN           0.422     0.400
Li approach        LR           0.631     0.440
                   DT           0.631     0.453
                   NB           0.493     0.413
                   NN           0.685     0.373
Table 6-4. Reference resolution results with gold semantic labels.
models           accuracy
                 REF_ATT   REF_NON
LR + SEM_gold    0.684     0.429
DT + SEM_gold    0.643     0.429
NB + SEM_gold    0.511     0.377
NN + SEM_gold    0.736     0.325
CHAPTER 7TUTORIAL DIALOGUE SYSTEM FOR JAVA PROGRAMMING WITH SUPERVISED
REFERENCE RESOLUTION
This chapter presents an end-to-end tutorial dialogue system for Java programming
which implements real-time reference resolution. As discussed in the literature review
in Chapter 2, most existing task-oriented dialogue systems are designed to interact with
users in highly constrained domains (Wen et al., 2016; Strik et al., 1997). These systems
either do not need reference resolution functionality due to the simplicity of the domain
(Wen et al., 2016), or perform reference resolution using very simple approaches, such as
keyword matching and a domain-specific lexicon (Vanlehn et al., 2002). Unlike the
constrained domains previous dialogue systems operate on, this dissertation focuses on
the domain of Java programming tutoring. In such a domain, tutorial dialogues frequently
mention objects in the Java program in question. The dialogues within this domain are
characterized by situated features that pertain to the programming task. A portion of
user utterances refer to general Java knowledge. In these cases, semantic interpretation
of a user’s request can be accomplished by mapping to a domain-specific ontology (e.g.,
(Dzikovska et al., 2007)). In contrast, many utterances refer to concrete entities within the
dynamically changing, user-created programming artifact. Identifying these entities correctly
is crucial for understanding a user’s utterance in the specific programming context, and
then generating specific tutorial dialogue moves.
This chapter presents a natural language tutorial dialogue system for Java programming
that implements real-time reference resolution for natural language understanding. This
dialogue system tracks user intention and the world state to provide a task-related context
for user utterance understanding and system dialogue act generation. Here, user intention
means the current subproblem that the user is focusing on, such as “creating an integer
array to store the 5 digits of a zip code”. World state means the completed steps toward the
solution of a programming problem. The tutorial dialogue system software comprises
three parts: a user interface (UI) module, a database module, and an agent module. The
architecture of the whole system is illustrated in Figure 7-1. The UI module is an Eclipse
plugin, which provides an integrated development environment for Java programming.
The database module logs the data generated when a user interacts with the tutorial
dialogue system. The agent module implements all of the machine learning functionality
of the dialogue system. The UI module and the agent module are implemented in a
client-server architecture and communicate by exchanging messages over sockets. This
architecture enables us to implement the UI and the agent using different programming
languages, each best serving the requirements of its module. The UI module
captures user utterances as well as the user's programming actions, and sends them to the
agent module. The agent module processes these user inputs and generates proper system
utterances accordingly. All of the generated data in this process are logged into the
database.
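As a rough illustration of this message flow, the sketch below sends one logged user action to the agent server as a newline-delimited JSON message; the host, port, field names, and framing are assumptions for illustration rather than the system's actual protocol.

    import json
    import socket

    def send_event(host, port, event):
        """Serialize one user event and ship it to the agent server."""
        payload = json.dumps(event).encode("utf-8")
        with socket.create_connection((host, port)) as sock:
            sock.sendall(payload + b"\n")   # newline-delimited JSON messages

    # Example call with hypothetical values:
    # send_event("localhost", 9000, {"type": "TYPING", "addedText": "++",
    #                                "lineNum": 80})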
[Figure omitted. It depicts the client-server architecture: the user interface client (log on/off pane and dialogue pane) sends user utterances and user actions to the agent server, whose NLU (reference resolution, topic classifier, DA classifier), DM (user intention recognizer, world state tracker), knowledge base, and NLG modules produce system utterances; all traffic is logged to the database.]

Figure 7-1. Architecture of the tutorial dialogue system.
To evaluate how different reference resolution approaches impact the performance
of the dialogue system, I implemented two different reference resolution approaches. One
of the reference resolution modules used learned semantics from a CRF-based approach,
which is my novel reference resolution approach as described in Chapter 6. The other
reference resolution module is used for comparison and uses a recent state-of-the-art
approach that relies upon a manually created domain-specific lexicon. Both of these
approaches use contextual information, including user behavior history and dialogue
history for reference resolution. Recall that “user behavior history” in this tutorial
dialogue system means the editing actions conducted by the user, and “dialogue history”
means the objects that were mentioned previously in the tutorial dialogue. In this way,
we can assess the impact of an improved reference resolution approach within a real-time
dialogue system by comparing the system’s performance with the two different reference
resolution models.
Section 7.1 describes the functionalities and implementation of the user interface
module. Section 7.2 defines the boundaries of the dialogue system’s capabilities, i.e. what
functionalities this system is able to perform. Section 7.3 introduces the architecture
of the dialogue system. Section 7.4 describes the approaches used to implement user
utterance understanding in this system. Section 7.5 describes the implementation of the
dialogue manager module. Section 7.6 presents the encoded domain knowledge in this
dialogue system. Section 7.7 describes the utterance generation implementation.
7.1 User Interface
The user interface is illustrated in Figure 7-2. This user interface is embedded in
Eclipse, a widely used integrated development environment (IDE) for Java programming.
The user interface has two panes, a log on/off pane and a dialogue pane. The log on/off
pane displays user’s log on/off status. Users log into the dialogue system using their
Google accounts. This user information is used to distinguish different tutorial sessions.
The dialogue pane displays the tutorial dialogue between a user and the dialogue system.
When a user logs into the dialogue system in the log on/off pane, the system greets
the user and starts a tutoring session for Java programming. The user can talk to the
dialogue system in the dialogue pane using textual messages. In addition, the UI module
implements a set of listeners in Eclipse to capture the user's programming actions, including
source code editing, source code selecting, file opening, file closing, and file creating. All
of the user utterances and programming actions are sent to the agent module as inputs to
the tutorial dialogue system. These data are also logged into a local database for further
analysis.
Figure 7-2. User interface of the dialogue system.
7.2 System Functionalities
Today’s state-of-art task-oriented dialogue systems are still far from engaging in
natural language dialogue with a human user as a human speaker could do. The limitation
of these systems lies with their ability to handle a conversation on various topics and
granularities. Thus, task-oriented dialogue systems usually operate only in a specific
domain, such as an employee information query in a company (Corbin et al., 2015) or
restaurant information requests (Wen et al., 2016).
Before building a task-oriented dialogue system, we need to clearly define the
functionality boundaries of the system. We need to define the topics on which the system
will be able to hold a reasonable conversation with the user, and how the system should
handle out-of-topic user utterances. In this way, we can provide users with a reasonable
expectation of the system's functionalities.
My system is able to hold a reasonable conversation with a human user and help the user complete a Java programming problem. I categorize its functionalities into several types. The key functionalities include the following items:
• Properly start and end a conversation with a human user.
To conduct a conversation with the user, the dialogue system greets the user to draw the user's attention and signal that it is ready to start a conversation. When the session is over, the system closes the conversation.

• Understand and properly respond to a user utterance about program progress.
The knowledge base of this dialogue system includes knowledge about the programming problem. The programming problem is modeled as a tree structure, as shown in Figure 7-5. To complete a task, the user needs to complete a set of subtasks that are the children of the current task in the tree structure. In this way, when the user is confused about the current task, the system helps the user break it down into smaller subtasks that are easier to work with.

• Understand and properly respond to a user utterance about basic Java concepts.
The system understands user utterances about basic Java knowledge in the programming context, such as how to create an array, and provides a proper response.

• Detect the user's out-of-topic utterances and provide a response.
During an interaction, a user's utterance could be off topic. However, the system is only designed to hold a natural language conversation about a specific Java programming problem. The system attempts to detect such user utterances and responds with the goal of refocusing the user on the programming problem.

• Monitor the programming actions of the user and generate proper system utterances.
This dialogue system is mixed-initiative, which means that both the user and the dialogue system can start a conversation. The system is designed to detect the moments when users may need hints from the system.
7.3 Architecture of the Dialogue Agent
Following a typical dialogue system architecture, the dialogue system (the agent
module) has four main modules, as shown in Figure 7-3. The natural language understanding
(NLU) module performs reference resolution, topic classification, and dialogue act
[Figure omitted. It traces the user utterance "Is my for loop correct?" through the pipeline: the NLU performs noun phrase chunking, referring expression extraction ("my for loop"), semantic interpretation (NAME = "for", CATEGORY = FOR_STATEMENT), and referent identification (the statement "for (int i=0; i<=5; i++)"); dialogue act classification yields EVALUATION_QUESTION and topic classification yields AM_I_RIGHT; the DM's user intention identifier (CREATE_for_loop) and world state tracker inform the dialogue policy; NLG then produces a hint about the loop's end condition.]

Figure 7-3. Architecture of the dialogue system.
classification for a user utterance. The inputs to the NLU module are user utterances.
The NLU module identifies any referents in the input user utterance. It also identifies
the user utterance’s dialogue act and topic. The output of the NLU module includes the
entities that the user mentioned in the current user utterance, the dialogue act, and the
formal semantic representation of the input user utterance.
The dialogue manager (DM) module tracks user intention and the task progress of the
task-oriented dialogue. In this tutorial dialogue system, user intention means the current
programming subtask that the user is focusing on. The inputs to this module include the
output from the NLU module, as well as user actions, such as program editing actions.
The DM module outputs a system dialogue act for the current user utterance. It also
updates the user intention and world state. The world state tracker maintains the progress
of the programming task. The dialogue policy model takes the reference resolution results,
the dialogue act and topic of a user utterance and the current state of the Java program
as input, and outputs a system utterance.
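A signature-level sketch may make this data flow concrete; the class and function names below are illustrative stand-ins, not the system's actual identifiers.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class NLUResult:
        dialogue_act: str       # e.g., "GRE" or "EQ"
        topic: str              # e.g., "AM_I_RIGHT"
        referents: List[dict]   # entities resolved in the user's Java program

    def dialogue_policy(nlu: NLUResult, world_state: dict,
                        user_intention: str) -> Optional[str]:
        """Map the interpreted utterance plus task context to a system utterance."""
        if nlu.dialogue_act == "GRE":
            return "Hi, I'm your virtual TA."   # greeting rule (see Section 7.5)
        # ... further rules conditioned on topic, referents, and world state
        return None                             # no rule fired: stay silent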
7.4 Natural Language Understanding Module
The natural language understanding module contains three submodules: a reference
resolution module, a topic classifier, and a dialogue act classifier, as shown in Figure
7-3. The inputs to the NLU module are textual user utterances, the current progress
of the programming task, and the current user intention. The outputs of the NLU module include the referents in the current user utterance, its dialogue act, and its semantics. This section describes the implementation
of the submodules of the natural language understanding module.
7.4.1 Reference Resolution
As discussed in Chapter 2, perceived affordances, which arise from the objects a user perceives in the situated environment, suggest likely user actions. For example, a key
suggests the action of “opening a door”. In a Java programming problem, the user’s
perceived referent could also suggest possible actions. For example, when a user mentions
a two-dimensional array, the most likely action associated with it may be “ask how to
create a two dimensional array”. I use a data-driven approach to discover the relationships
between the mentioned objects and the suggested actions. Reference resolution is also
essential to understanding a user utterance within a context. For example, when the
Java programming problem asks the user to create an integer array called “zipCode”,
the user could say “I don’t know how to create zipCode.” We need to find the referent
of “zipCode” in the Java code, and infer that the user is asking about “how to create an
integer array”. Then we can form a query to the knowledge base to request an answer.
Two different reference resolution approaches were implemented in this dialogue
system for the purpose of comparison. Version 1 implements reference resolution using our approach, which relies on learned semantics as described in Chapter 6. For comparison, I created
a baseline reference resolution module using the same approach as version 1 except that
it uses a manually defined lexicon to represent referring expressions’ semantics instead of
semantics learned by the CRF-based approach.
In Chapter 4, we presented the approach for referring expression extraction, which
extracts all noun phrases in a user utterance. Not all of these noun phrases refer to
objects in the parallel Java program, so we identify referring expressions from these
noun phrases. In this tutorial dialogue system, we first apply a set of rules to filter the
extracted noun phrases. Recall that our reference resolution approaches calculate a
compatibility probability for each referring expression and candidate pair.
compatibility probability = f(referring expression, candidate_i)

where candidate_i is the i-th candidate referent in the candidate list for the referring
expression in question. The candidate with the highest compatibility probability is picked
as the referent. We use the generated compatibility probability by the reference resolution
module as a measure to decide if a noun phrase refers to an object in the Java program.
Any noun phrase with a compatibility probability of 0.90 or higher (f(noun phrase, candidate_i) ≥ 0.90) for any of its candidates was taken as a referring expression.
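A minimal sketch of this thresholded selection follows; only the 0.90 threshold and the choice of the highest-probability candidate come from the text, while the function names are illustrative and score_compatibility stands in for the trained classifier f.

    THRESHOLD = 0.90

    def select_referent(referring_expression, candidates, score_compatibility):
        """Return (referent, prob) for the best candidate, or (None, prob)
        if the noun phrase is judged not to be a referring expression."""
        scored = [(score_compatibility(referring_expression, c), c)
                  for c in candidates]
        prob, best = max(scored, key=lambda pair: pair[0])
        if prob >= THRESHOLD:
            return best, prob    # accepted as a referring expression
        return None, prob        # below threshold: not a referring expression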
7.4.2 Dialogue Act Classification
Dialogue acts are specialized speech acts that model the illocutionary force of
utterances (Austin, 1962). An illocutionary act indicates the speaker’s intention instead of
the user utterance’s surface meaning. For example, when a customer in a restaurant asks
a waiter: “Do you have salt?” The surface meaning of the utterance is a question which
asks whether the waiter has salt. The illocutionary act of this utterance is conveying the
customer's request that she wants some salt.
For dialogue act classification, I use a maximum entropy model. The maximum
entropy model uses three types of features: word unigrams, bigrams, and trigrams from
each user utterance. I use the annotation schema proposed by Can (Can, 2016). The tag
set is shown in Table 7-1. The model is trained using 4857 utterances which are labeled
with dialogue act tags from the Ripple corpus (Boyer et al., 2010). The classification
accuracy of the trained model was 73.6%.
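A comparable classifier can be sketched with scikit-learn, assuming LogisticRegression as the maximum entropy model and CountVectorizer for the unigram-to-trigram features; this is an illustration under those assumptions, not the project's actual training code.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_da_classifier(utterances, tags):
        """utterances: list of str; tags: dialogue act labels such as "Q" or "EQ"."""
        model = make_pipeline(
            CountVectorizer(ngram_range=(1, 3)),   # word unigrams to trigrams
            LogisticRegression(max_iter=1000),     # multinomial maxent model
        )
        model.fit(utterances, tags)
        return model

    # usage: da = train_da_classifier(train_utts, train_tags)
    #        da.predict(["is my for loop correct?"])   # e.g., ["EQ"]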
7.4.3 Topic Classification
Dialogue acts represent utterance-level intentions at a categorical, abstract level. In some cases, knowing the user utterance's dialogue act is enough
for the system to generate a reasonable response, such as a greeting dialogue act from
Table 7-1. Dialogue act set.
Dialogue Act Tag            Explanation                                          Sample Utterance
Question (Q)                A general question about the task                    what would be the best way to do that?
Evaluation Question (EQ)    An evaluation question about the task                isn't that also declared in the same place?
Statement (S)               A statement of a fact                                I was trying to figure out the best way to do that
Grounding (G)               Acknowledgement about a previous utterance           fair enough
Extra-Domain (EX)           Any utterance that is not related to the task        I'm not very good at Java yet
Positive Feedback (PF)      Positive assessment of knowledge or task             yea it's a string
Negative Feedback (NF)      Negative assessment of knowledge or task             i really don't see the point much of this loop really
Lukewarm Feedback (LF)      Assessment having both positive and negative aspects kind of
Greeting (GRE)              Greetings                                            hello
a user. However, in some cases, such as when responding to a user's question, the dialogue
system needs to query the knowledge base. To recognize the topics of user utterances in
the dialogue system, a topic classifier is trained. The topics this classifier recognizes are
listed in Table 7-2. The topic classifier was also implemented using a maximum entropy
model. It takes word unigrams to trigrams of user utterances as features. We manually
selected 492 utterances from the Ripple corpus and tagged them with topic labels. These
492 utterances were used as a training set to train the topic classifier. The accuracy of the
classifier was 63.7%.
7.5 Dialogue Manager
The dialogue manager takes user dialogue acts, user actions, and the recognized topics of user utterances as inputs. It selects a system response according to these inputs and tracks user intention. For example, when a user says "Hi", the dialogue act classifier
predicts this utterance as a GRE, a greeting dialogue act. Then the dialogue manager
generates a system dialogue act GRE, which will be passed to the NLG module to
Table 7-2. Topics recognized by the topic classifier.
Topic                     Explanation                                          Sample Utterance
GET_SUBSTRING             the way to get a substring from a string             okay so should it be zipString.substring(i,i+1)?
GET_ZIP_DIGITS            the way to extract a single digit from a zip code    how do I extract individual digits
CONVERT_ZIPCODE_TO_STR    convert the variable zipCode to a string             Can't manually turn an integer into a string?
CREATE_FOR_LOOP           the way to create a for statement                    what are the three things we need for a loop?
USE_A_LOOP                necessity of using a for statement                   would a for loop be best?
STORE_ZIP_DIGITS          the way to store digits                              with an array?
PROGRESS                  about the progress                                   how do i start the extractDigits method?
STRING_2_INT              convert a string to an integer                       Integer.parseString()?
CREATE_DIGITS_ARRAY       the syntax to create an array for the digits         which is 5, correct? or does it depend?
DECLARE_ARRAY             the syntax to declare an array                       How do we declare an array?
INPUT_ZIPCODE             get the input zip code                               so i need something telling it to get zipcode?
CHAR_2_INT                convert a character to an integer                    can I parse a character to an int
HOW_TO_RUN                the way to run the program                           how to run it?
AM_I_RIGHT                request to check the user's code                     does that make sense?
OOD                       out-of-domain topics                                 Meh, this [ key is stuck.
instantiate it as a system utterance “hi” or “hello”, etc. In another example, a user may
say, "How do I create an array to hold the zip code digits?". A set of rules was authored for the dialogue manager to generate a system response in such cases.
User intention indicates the subtask that the user is working on. User intention
gives the system essential contextual information about the dialogue. As discussed in
the related work section, it can dramatically constrain the possible interpretations of user
utterances. In this dialogue system, the current user intention is used to divide the Java
programming problem into sub-domains. The whole Java programming problem is a
domain for the tutorial dialogue system. Each subtask of the programming problem forms
[Figure omitted. It shows the partially completed PostalFrame class, with the lines of the extractDigits() method mapped onto solution steps such as DECLARE_zipDigits, CREATE_zipcode_str, DECLARE_digit_char, CREATE_for_loop, ASSIGN_digit_char, CONVERT_char_to_int, ASSIGN_zipDigits, and RETURN_zipDigits.]

Figure 7-4. User intention identification example.
a sub-domain for the dialogue system. In this way, the dialogue system could be seen as a
combination of a set of smaller dialogue systems. For each sub-task, we focus on a much
smaller sub-domain, compared with the domain for the whole programming problem.
To identify user intention in the domain of Java programming, we need to understand
the user’s Java source code. Given a programming task, there could be multiple ways to
solve it, i.e., there are multiple paths to follow if we imagine each step in the solution as a
node on a graph.
The first step in understanding the user's program is to perform syntactic parsing so that we know which types of variables were declared, which variables were assigned, and so on. This
information helps us to identify which step the user is working on. For example, creating
an integer array at the beginning of the “extractDigits()” method indicates that the user
is creating an array to hold the 5 digits of a zip code.
We created a rule-based algorithm to interpret the user’s Java program by mapping
each line of the Java program onto a step in the solution. There were 96 rules defined for the
intention identifier. An example of user intention identification is shown in Figure 7-4.
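The fragment below illustrates the flavor of such rules; the few regular expressions shown are hypothetical stand-ins for the 96 actual rules.

    import re

    # Each rule maps a pattern over one line of Java source to a solution step.
    RULES = [
        (re.compile(r"int\s*\[\s*\]\s+\w+\s*=\s*new\s+int\s*\[\s*5\s*\]"),
         "DECLARE_zipDigits"),
        (re.compile(r"String\s+\w+\s*=\s*zipCode\s*\+\s*\"\""),
         "CREATE_zipcode_str"),
        (re.compile(r"for\s*\(\s*int\s+\w+\s*=\s*0\s*;"),
         "CREATE_for_loop"),
    ]

    def identify_step(java_line):
        """Return the solution step a line of Java code corresponds to, if any."""
        for pattern, step in RULES:
            if pattern.search(java_line):
                return step
        return None

    # identify_step("int [] digits = new int[5];")  ->  "DECLARE_zipDigits"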
[Figure omitted. It shows the task tree: the PostalFrame root with child subtasks extractDigits(), calcAndDrawCDigits(), and drawZipCode(), each of which decomposes further (e.g., createAndInitStringZip under extractDigits()).]

Figure 7-5. Structure of the programming task.
7.6 Knowledge Base
To support a reasonable dialogue, the knowledge base contains three types of
knowledge: subtask structure of the programming problem, knowledge about Java
language features, and knowledge needed to solve the programming problem.
The solution to the programming problem is defined as a tree structure, as shown in
Figure 7-5. The root of the tree is the whole programming task. Each node in the tree is
a subtask. In this tutorial dialogue system, the whole programming task is to complete
a method in a Java class called PostalFrame, which translates a five-digit zip code into
a bar code. For the subtask of completing the method "extractDigits()", there are
some smaller subtasks that need to be completed. With this tree structure, we could
understand the user’s progress and provide hints when the user has questions about the
current subtask. As described in Section 7.5, we use a rule-based algorithm to map each
line of a user’s program to a node in this tree structure. This gives us very important
contextual information for the dialogue.
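A minimal sketch of such a tree follows, with node names taken from Figure 7-5; the class design itself is an assumption for illustration.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Subtask:
        name: str
        children: List["Subtask"] = field(default_factory=list)
        completed: bool = False

    root = Subtask("PostalFrame", [
        Subtask("extractDigits()", [
            Subtask("createAndInitStringZip"),
            # ... remaining subtasks of extractDigits()
        ]),
        Subtask("calcAndDrawCDigits()"),
        Subtask("drawZipCode()"),
    ])

    def next_subtask(node):
        """Depth-first search for the first uncompleted leaf: a natural hint target."""
        if not node.children:
            return None if node.completed else node
        for child in node.children:
            hit = next_subtask(child)
            if hit is not None:
                return hit
        return None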
7.7 System Utterance Generation
A set of 99 system utterances was authored to be selected by the dialogue manager.
Table 7-3 shows sample system responses to user questions on different topics. For each
topic, we create multiple system responses with different levels of detail. When the system
detects that a user asks a similar question, the system gives a new response with a more
detailed explanation.
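This escalation policy can be sketched as follows, with abbreviated response strings and illustrative names.

    RESPONSES = {
        "CREATE_FOR_LOOP": [
            "A for loop takes the form: for(start; finish; increment).",
            "For example, for (int i = 0; i < 5; i++) { ... } runs five times.",
        ],
        # ... one list of responses per topic, ordered from brief to detailed
    }

    class ResponseSelector:
        def __init__(self):
            self.times_asked = {}          # topic -> how often it was asked

        def respond(self, topic):
            variants = RESPONSES.get(topic)
            if not variants:
                return None                # unknown topic: defer to the DM
            k = self.times_asked.get(topic, 0)
            self.times_asked[topic] = k + 1
            return variants[min(k, len(variants) - 1)]   # more detail each time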
Table 7-3. Sample system response utterances.
Topic of question         System Response
GET_SUBSTRING             "To get a substring of a string variable, the syntax is stringVariable.substring(start, end+1)."
GET_ZIP_DIGITS            "There are several ways to break an int apart into its individual digits ..."
CONVERT_ZIPCODE_TO_STR    "You can use the syntax intVariable + "" to convert an integer to a String variable."
CREATE_FOR_LOOP           "A for loop takes the form: for(start condition; finish condition; increment statement)."
USE_A_LOOP                "We can start with a for loop. It should loop through the zip code to get out each individual digit."
STORE_ZIP_DIGITS          "You need an int array to hold the 5 digits of a zipcode."
STRING_2_INT              "To cast a string of a number into an integer, you can use the parseInt method: ..."
CREATE_DIGITS_ARRAY       "The syntax to create an array is type[] arrayName."
DECLARE_ARRAY             "For example, you can do int[] digits = ..."
INPUT_ZIPCODE             "When the program is run, the user types in a zipCode ..."
CHAR_2_INT                "To convert a char of a digit into an integer, you can do char digit - '0'."
HOW_TO_RUN                "To run a program in Eclipse, you can right click ..."
OOD                       "I can help you with many aspects of this project, but I might not ..."
CHAPTER 8
EVALUATION OF THE DIALOGUE SYSTEM
This chapter describes a human user study to evaluate the novel reference resolution
approach in the implemented tutorial dialogue system, and compare it with a baseline
approach. The tutorial dialogue system with the reference resolution approach based on
learned semantics is denoted as System Li. This is the treatment condition. The baseline
tutorial dialogue system with a reference resolution approach that uses a manually created
lexicon is denoted as System Comparison. As mentioned in Chapter 6, the baseline model
was adopted from Iida et al. (2010). This is the comparison condition.
The goals of the user study are twofold. First, we evaluate the two dialogue systems’
user satisfaction and user engagement by analyzing study participants’ post-survey data.
In addition, we would like to investigate the performance of the two reference
resolution approaches in System Li and System Comparison in terms of accuracy. To do
this, we manually examined the natural language input users provided, and rated whether
the system properly identified the referent(s) within the user input.
In this chapter, the first section introduces the user study procedure and briefly
describes the collected data. In the second section, a hypothesis test is conducted to
compare user satisfaction and user engagement of the two dialogue systems. Finally, the
third section compares the reference resolution performance of the two dialogue systems.
8.1 Proposed Hypotheses
This dissertation focuses on three hypotheses.
• Hypothesis 1. System Li will outperform System Comparison on accuracy of reference resolution.
In Chapter 6, we compared the two reference resolution approaches with an offline evaluation in which we manually tagged the referring expressions. In an online dialogue system, the system automatically extracts referring expressions while conversing with a human user, and the accuracy of referring expression extraction also plays a key role in the reference resolution pipeline. We would like to examine whether the reference resolution approach with learned semantics still has higher accuracy in such an online dialogue system, given noisy referring expressions.
• Hypothesis 2. System Li will offer higher user satisfaction than System Comparison.
The goal of this tutorial dialogue system is to tutor college students on Java programming. I would like to know how satisfied the human subjects are while using the proposed dialogue system, and to examine the difference between the two conditions in terms of user satisfaction. I expect the Li approach to have higher reference resolution accuracy than the comparison approach in a real-time dialogue system, which should improve the treatment condition's user utterance understanding and lead the system to generate more reasonable responses. So, my hypothesis is that the treatment condition yields higher user satisfaction than the comparison condition. I measure students' satisfaction using their self-reported satisfaction in the post-survey results.
• Hypothesis 3. System Li will yield higher user engagement than System Comparison.
User engagement is another widely used metric for evaluating a dialogue system (Sidner et al., 2005). It measures how frequently the human user talks to the dialogue system. We would like to examine whether users engage more with System Li than with System Comparison. As discussed under Hypothesis 2, the treatment condition will probably generate more reasonable system responses, which will likely increase user engagement.
8.2 User Study
This section introduces the procedure of the user study and the collected data.
8.2.1 Participants
Student participants were recruited from an undergraduate introductory Java
programming class COP 3502 ”Programming Fundamentals I” at the University of
Florida in the 2018 Spring semester. Students voluntarily participated in this study
and were compensated with a small amount of course credit. This study had 43
participants in total, two of whom participated in a pilot study. During the pilot study,
we talked to the participants for feedback to improve the system for the following study
sessions. Data were collected for all of the 43 sessions, but only the data from the remaining 41 sessions were analyzed to address the research questions, due to the potential influence of the communication with the participants in the pilot study.
8.2.2 Java Programming Task for the Study
The study adopted a Java programming task that was previously used in another
research study for dialogue act modeling in task-oriented tutorial dialogue (Boyer, 2010).
The programming task was designed for undergraduate students in an introductory Java
programming course. The programming task examined the use of for statements, arrays,
and String concepts. We provided a partially implemented Java program which took a
5-digit zip code as input and converted it into a postal bar code. When a user ran this
program, it opened a graphical user interface (GUI) to prompt for an input zip code.
When input was entered, the GUI converted it into a bar code and displayed it. The
program separated the five digits in the input integer and converted each single digit into
a bar code. The only missing method in the provided program was the “extractDigits()”
method, which took an integer zip code as input and returned an integer array. This
integer array contained the five separated digits. A task description was provided to each
participant at the beginning of each study session. The task description can be found in
Figures 8-1 and 8-2.
8.2.3 Procedure
For recruitment, we presented a recruitment speech in the COP 3502 class to briefly
introduce the research study, and collected student volunteers’ contact information
through a Google form. The volunteers’ availability was then collected using a Doodle
poll. Student participants were assigned to different study sessions according to their
availability.
The two dialogue systems were installed on 12 LearnDialogue group-owned laptops.
The laptops were numbered. System Li was installed on odd-numbered laptops, and System Comparison was installed on even-numbered laptops. In each study session, we prepared similar numbers of laptops from each group (the number of participants was odd in some study sessions) to ensure that System Li and System Comparison were used by similar numbers of participants.
On arrival, students were seated randomly. They were given consent forms and short
instructions for the study. The instructions included an introduction to the goal of the study and the task description mentioned in the previous subsection. We used the
The goal of this study is to evaluate an intelligent tutoring system (ITS). This ITS provides conversational assistance to students during Java programming. It is important to note that this is an experimental system, which is one of the few research projects that attempt to implement a dialogue system for a complex domain like Java programming. The system may fail to answer some of your questions. It is important to keep in mind that the goal of this study is to evaluate the dialogue system.
This system is designed to assist you through Java programming problems. You can ask questions about the programming problem, such as: “Is it correct?”, “What should I do next?”, “Where should I start from?”. You can also ask questions about “Is my for loop correct?”, “How to declare an array?”
While you are interacting with the conversational agent, you will be working on the following problem.
Postal Bar Codes
The Problem: For faster sorting of letters, the United States Postal Service encourages companies that send large volumes of mail to use a bar code denoting the ZIP code. Using the skeleton GUI program provided for you, you will complete this lab with code to actually generate the bar code for a given zip code.

More About Bar Codes: In postal bar codes, there is a full-height frame bar on each end (and these are drawn automatically by the program provided for you; you don't have to write code to draw these). Each of the five encoded digits is represented by five bars. The five encoded digits are followed by a correction digit.
Figure 8-1. A short instruction with the task description.
written instructions to maintain consistency among all of the different study sessions.
After reading the consent form and the instructions, participants were asked to complete a
pre-survey about their attitudes toward programming.
The Correction Digit: The correction digit is computed as follows: add up all digits, and choose the correct digit to make the sum a multiple of 10. For example, the ZIP code 95014 has sum of digits 19, so the correction digit is 1 to make the sum equal to 20.

What's Already Written? You can see what parts of this program are already written by running the file Main.java. When you do, you should see output like the image below, with a blank zip code slot. You can enter a zip code, and you should see that no bar code is generated (except the first and last full bars which are required for all bar codes).

What's Your Task? Your job is to extract the five-digit zip code from the user's input. The PostalFrame class is the one which handles this task. The only method which you must complete is: extractDigits(). For extractDigits(), you will need to create a variable in the method which stores the zip code as separate digits.

Some Helpful Information: If you can't remember how to do something with the software, please refer to the reference sheet on your desk.
Figure 8-2. A short instruction with the task description (continued).
After the pre-survey, the participants had 40 minutes to work on the programming
task with the assistance of the tutorial dialogue systems they were assigned to.
When participants finished the programming task or 40 minutes had passed, they
were given a post-survey to evaluate the system usability and user engagement.
8.2.4 Data Collection
During the user study, we collected users’ pre-survey and post-survey results. The
pre-survey focused on students’ attitude toward programming, including whether they
viewed programming as an important skill, as well as self-reported programming skill
evaluation. The survey can be found in Appendix A. The post-survey included two parts.
The first part was a widely used instrument, the System Usability Scale (SUS) survey (Bangor et al., 2008), which assesses the usability of a system. The SUS survey contained 10 questions which
reflected users’ evaluation of a system’s usability. The score of each completed SUS survey
ranged from 0 to 100. A higher score indicated better usability. Bangor et al. calculated the mean
SUS score of nearly 3500 surveys in their past 273 studies, which suggested a system
with a SUS score above 70 had better-than-average usability (Bangor et al., 2009). The
second part was the User Engagement Scale (UES) survey, which contained 30
questions. A complete post-survey can be found in Appendix B.
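For reference, the standard SUS scoring rule, which maps the ten 1-5 Likert responses onto the 0-100 scale, can be written as follows; this is the published scoring procedure rather than code from this project.

    def sus_score(responses):
        """responses: the ten 1-5 Likert ratings, in questionnaire order."""
        assert len(responses) == 10
        total = 0
        for i, r in enumerate(responses, start=1):
            total += (r - 1) if i % 2 == 1 else (5 - r)   # odd items positive, even negative
        return total * 2.5    # scale the 0-40 raw sum onto 0-100

    # sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])  ->  75.0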
Besides survey data, I also collected the textual dialogues between participants and
the dialogue systems. A dialogue excerpt is shown in Table 8-1. In the 41 sessions,
there were 2641 utterances collected, 1292 of which were user utterances. There were 64.4
utterances on average in each session, and 31.5 of them were user utterances. In these 41
sessions, the maximum number of utterances was 154, and the minimum number was 17.
The maximum number of user utterances was 76, and the minimum was 8.
Ultimately, 22 student participants used System Li, and 19 participants used
System Comparison.
Users’ programming actions were logged into a local database. These actions include
login, logout, typing, and selecting. These actions were saved in the database as JSON strings. The format of a user action is illustrated in Table 8-2. For example, for a
typing action, the time stamp of the action, the added text, and the line where the action
happened were logged into the database for further analysis.
The reference resolution events that occurred in the dialogue system were also saved into
the database. Each record contained a referring expression, its semantic segments, the
candidate list and the compatibility probability of each candidate.
Table 8-1. An excerpt dialogue between a user and the Virtual TA.
Virtual TA:  Hi, I'm your virtual TA. I will be helping you on this Java project.
User:        does extratDigits method input anything?
Virtual TA:  This Java program converts a zip code for US mail into a visual barcode. When you run the program, it asks the user to enter a zip code. The program will then convert it into a barcode and display it. But it doesn't work yet! To make it work, your task is to complete the extractDigits method. It needs to convert an integer zip code into an array of individual digits, and return that array. Hint: Making the int into a String first is a convenient way to get its individual digits.
Virtual TA:  The extractDigits method is located in the PostalFrame class, which should have been opened for you. You are welcome to view the other code, but you only need to work on the extractDigits method.
User:        what do I need to do in the extractDigits method?
Virtual TA:  You will extract the five digits of a zipcode. Now we need to get each of the 5 digits of that int. Using a string and then getting the individual characters from it is a good way to do that.
User:        where is teh zip code
Virtual TA:  When the program is run, the user types in a zipCode. The code to get the zipCode from the user is already complete. The zip code the user entered is stored as a private variable, int zipCode on line 34 in PostalFrame.
User:        how to parse a string into an array
Virtual TA:  You can use the syntax intVariable + "" to convert an integer to a String variable.
. . .
Table 8-2. An example user action saved in the database.
type: TYPING
timestamp: 1523261514283
addedText: int
lineNum: 79
offset: 1918
currentCode: "import javax.swing.*;\nimport . . . "
referents: []
isRRResult: False
prob: 0.0
affectedLine: "\tint [] digits = int"
In total, there were 18775 records collected throughout the 41 study sessions. The
number of reference resolution events was 1486. An example reference resolution event is
shown in Table 8-3.
As shown in Table 8-3, “noun phrase” is the referring expression; “candidates” field
lists all of the generated candidates from the parallel Java program; “probs” field lists
Table 8-3. An example reference resolution event saved in the database.
{
  noun phrase: charat method
  candidates: [
    {u'category': u'METHOD', u'line number': 81, u'name': u'charAT', . . . },
    {u'category': u'METHOD', u'line number': 40, u'name': u'PostalFrame', . . . },
    {u'category': u'METHOD', u'line number': 41, u'name': u'setSize', . . . },
    . . . ]
  probs: [0.9741117181861638, 0.00036208246341969553, 0.00036208246341969553, . . . ]
  referent: {u'category': u'METHOD', u'line number': 81, u'name': u'charAT', . . . }
  prob: 0.974111718186
  isRRResult: true
  timestamp: 1523546180624
}
the compatibility probability between the referring expression and all of the candidates;
“referent” is the system-selected referent; “prob” is the compatibility probability between
the referring expression and the selected referent.
8.3 System Usability Evaluation
To evaluate the usability of the implemented dialogue systems with two different
reference resolution approaches, student participants of the research study were asked
to complete a post-survey which contained an instrument widely used to assess system
usability (Bangor et al., 2008). The two groups' user responses can be found in Tables B-1
and B-2 in Appendix B. The two systems had a very close mean system usability scale
(SUS) score. The average SUS score of 22 student participants who used System Li was
66.67. The average SUS score of 19 participants who used System Comparison was 68.77.
To interpret SUS scores, Bangor et al. argued that a system with a SUS score over 70 is acceptable, as shown in Figure 8-3. By this criterion, the systems implemented in this project are marginal in the acceptability range, but very close to acceptable.
Figure 8-3. System usability score interpretation.
A t-test showed no significant difference between the two groups' SUS scores (p-value = 0.361).
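Such a comparison can be run, for example, with SciPy; the dissertation does not specify which t-test variant was used, so the Welch's t-test shown here is an assumption.

    from scipy import stats

    def compare_groups(scores_li, scores_comparison):
        """Welch's t-test on the two groups' scores; returns the p-value."""
        t, p = stats.ttest_ind(scores_li, scores_comparison, equal_var=False)
        return p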
8.4 User Engagement Evaluation
Next we examined our hypothesis about the two systems’ user engagement. Besides
the SUS, the post-survey also measured user engagement using the User Engagement Scale (UES) instrument (O'Brien et al., 2018). This instrument included 30 questions.
Participants who used System Li had an average UES score of 11.80, and students who
used System Comparison had an average UES score of 12.27. A complete table of user
responses from the two groups can be found in Tables B-1 and B-2 in Appendix B. A t-test
showed no significant difference between the two groups on UES scores (p-value=0.236).
The number of user utterances also reflected users’ engagement with the dialogue
system. System Li averaged 30.8 user utterances per session, and System Comparison averaged 32.4. There was not a significant difference between them (p-value=0.382).
8.5 Online Reference Resolution Evaluation in Tutorial Dialogue Systems
In Chapter 6, we compared two reference resolution approaches with offline
evaluation. We manually tagged the referring expressions and their referents in the
parallel Java source code. In an online dialogue system, the system automatically
extracted referring expressions, generated candidates and extracted features while having
a conversation with a human user. Without human intervention, errors in one step could
propagate to later steps in the reference resolution pipeline. We would like to examine if
the reference resolution approach with learned semantics still had a higher accuracy in
such an online dialogue system.
In the proposal, we hypothesized that System Li would have a higher reference
resolution performance. To evaluate these two systems’ reference resolution accuracy, we
analyzed the logged reference resolution actions performed by the dialogue systems.
As mentioned in Section 8.2.4, all of the reference resolution events performed
by the dialogue systems were logged in a local database. Each reference resolution
event contained several fields: the referring expression, the candidate list, the compatibility probability for each candidate, and the selected referent from the candidate list. An example reference
resolution event is illustrated in Table 8-3.
The reference resolution events were manually evaluated to calculate accuracies for
the two systems. As discussed earlier, the system selected referring expressions from
noun phrases in a user utterance. The process of reference resolution is illustrated in
Figure 8-4. For a user utterance, the dialogue system first found all the noun phrases
in the utterance. It then filtered all of the extracted noun phrases using a set of rules.
Noun phrases that could never be a referring expression for objects in the Java code, such
as “you” and “me”, were filtered out. Then, the system attempted to find a “referent”
in the parallel Java source for the remaining noun phrases as if they were all referring
expressions. Finally, the system used the compatibility probability (as shown as f in the
figure) between the remaining noun phrases and their “referents” to decide which noun
phrases were real referring expressions. The threshold for the compatibility probability
was set to 90% empirically.
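The pipeline can be sketched compactly as follows; extract_noun_phrases and select_referent (a thresholded selector like the one sketched in Section 7.4.1, with the classifier already bound in) are assumed helpers, and the rule-based stop list is abridged.

    STOP_NPS = {"i", "you", "me", "it", "we"}   # can never refer to Java code

    def identify_referring_expressions(utterance, candidates,
                                       extract_noun_phrases, select_referent):
        """Return (noun phrase, referent, probability) triples that clear
        the 0.90 compatibility threshold."""
        results = []
        for np in extract_noun_phrases(utterance):
            if np.lower() in STOP_NPS:
                continue                           # rule-based filtering
            referent, prob = select_referent(np, candidates)
            if referent is not None:               # cleared the threshold
                results.append((np, referent, prob))
        return results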
The researcher went through all of the reference resolution events which identified
a referent with 90% or higher compatibility probability. There were 417 such reference
resolution events, which was 28.1% of all the logged reference resolution events. System Li
had 320 reference resolution events in this class, and System Comparison had 97.
[Figure omitted. It traces the utterance "I think I should start from the actionPerformed method by creating an array." through noun phrase chunking, rule-based noun phrase filtering, semantic interpretation, and referent identification; "the actionPerformed method" clears the f > 0.9 threshold (prob = 0.97, referent: the method actionPerformed on line 71) and is accepted as a referring expression, while "an array" (prob = 0.58) falls below the threshold and is rejected.]

Figure 8-4. Reference resolution process in the dialogue system.
For each reference resolution event within this class, the identified referent was
manually examined within the involved programming context to determine if the result
was correct.
System Li’s reference resolution accuracy for this set of referring expressions was
21.6%, and System Comparison’s was 19.6%.
Both systems had a much lower reference resolution accuracy on the selected referring expressions than their offline versions, which achieved 61.6% and 51.2%, respectively. The logged reference resolution results were closely examined for reasons
which may shed some light on building online reference resolution approaches in the
future. To have a more accurate understanding of the reference resolution performance
of the dialogue system, I collected all of the reference resolution events, regardless of
the compatibility probability. There were 1486 reference resolution events logged in
the 41 study sessions. Because of the way the system selected referring expressions,
most of these logged reference resolution events were performed by the system on noun
phrases to identify referring expressions. I manually tagged 169 referring expressions for
System Li’s data, and 158 referring expressions for System Comparison’s data. I also
manually identified their referents in the Java code. The result showed System Li had a
63.3% reference resolution accuracy on these 169 manually tagged referring expressions.
System Comparison had an accuracy of 44.9% on the 158 referring expressions. These
accuracies matched the performance of the two reference resolution approaches in their
offline setting.
It appears that the main reason for the poor performance of the online reference
resolution approaches was the inaccurate referring expression extraction. While extracting
referring expressions from all of the recognized noun phrases in a user utterance, we
combined a rule-based approach and the classification result from the classifier we used to
calculate compatibility probabilities between referring expressions and their candidates.
The intuition of using this classifier was that when a noun phrase is compatible with
an entity in the Java code, then it is likely to be a referring expression. However, this
combined approach did not work as expected in practice. When referring expressions in user utterances cannot be accurately identified, reference resolution accuracy directly suffers.
To further illustrate the reasons for the poor referring expression identification, we provide two examples of erroneous referring expression identification in Tables 8-4 and 8-5. In Table 8-4, the noun phrase "a string" was not a referring expression, since it
is not specifically referring to anything in the Java code. However, the user just created a
string in the Java code, called "scode". The noun phrase "a string" carried the attribute VAR_TYPE = "string". Recall that the reference resolution approach takes the semantic
features, dialogue history features and the behavior history features as inputs. Since the
“scode” was just created, the behavior history features suggested that “scode” had a high
probability of being the referent. Also, "scode" was a string variable, so it had a high compatibility probability (0.939) with the noun phrase "a string". This caused a
false positive instance. Similarly, in the false negative example shown in Table 8-5, the noun phrase "the for loop" was a referring expression, and it referred to a for statement in the user's Java program. The reference resolution was performed correctly, but since the for statement had not recently been edited or mentioned, the dialogue history and behavior history features suggested a low compatibility probability (0.791), below the threshold.
From these negative examples, we found that using only the compatibility probability to identify referring expressions is insufficient. The lexical features of referring
expressions and their enclosing utterances also play a key role in referring expression
identification. These features should be considered while building a referring expression
identification classifier.
Table 8-4. A false positive example of referring expression identification.
Utterance:    "what to do next if I have a string of the zipcode"
Noun phrase:  "a string"
Referent:     {category: VARIABLE, line number: 76, name: scode, . . . }
Probability:  0.939
Table 8-5. A false negative example of referring expression identification.
Utterance:    "Is the for loop correct?"
Noun phrase:  "the for loop"
Referent:     {category: STATEMENT_FOR, line number: 78, name: for, . . . }
Probability:  0.791
CHAPTER 9
DISCUSSION
This chapter discusses some of our observations in building the dialogue systems and
conducting the user study.
9.1 Null Results
The previous chapter described the research study to evaluate the two implemented
dialogue systems. We did not find significant results for the hypotheses on user satisfaction
and user engagement. One of the reasons could be the low accuracy of the online
reference resolution approach, which was caused by the referring expression identification
functionality.
Another reason could lie in the difference between human-computer dialogues and
human-human dialogues. We compared the human-human dialogues in the Ripple corpus
and the human-computer dialogues collected in this project by manually annotating the
number of utterances and number of referring events in each session. As shown in Table
9-1, the average number of utterances per session in the Ripple corpus was 130.2, compared with only 64.4 in the human-computer dialogues we collected. In the Ripple corpus, each session lasted about 50-55 minutes, and in the study conducted in this
project, each session lasted about 40 minutes. There is a huge difference between these
two kinds of dialogues in terms of utterance frequencies. In addition, the human-human
dialogues had 0.44 referring events per utterance on average, and human-computer
dialogues only had 0.12. These numbers suggested a different communication pattern in
human-human dialogues and human-computer dialogues. This difference may suggest that
reference resolution plays a different role in human-computer dialogues compared with
human-human dialogues. Further research is needed to explain this phenomenon.
Also, as argued at the beginning of this dissertation, reference resolution plays a key
role in natural language dialogue understanding. However, in a natural language dialogue
system for a complex domain like Java programming, there are many other modules that
influence the performance of the dialogue system, such as the dialogue act classifier, the utterance topic classifier, and the user intention recognizer. Reference resolution takes effect together
with these modules as an integrated system. The improvement of a single module may not
necessarily increase the performance of the whole system.
Table 9-1. A comparison between human-computer dialogues and human-human dialogues.

                  Average #Utt    #RefExp / #Utt
Human-computer    64.4            0.12
Human-human       130.2           0.44
9.2 Data-driven Approach in Building Dialogue Systems
The dialogue systems implemented in this project used data-driven approaches for
most of the essential functionalities, such as dialogue act classification, utterance topic
classification, POS tagging, noun phrase chunking, and reference resolution. Some of these
models are less closely related to the domain of the dialogue system. For example, we can
train a noun phrase chunking model for the dialogue system using training data from the
Wall Street Journal corpus, since the grammar of the English used in tutorial dialogue for Java programming is very similar to that of news text. However, some of the models are more domain-specific, which means they need to be trained using
domain-specific data.
Due to the availability of the Ripple corpus that was described in Chapter 3, we can
use its human-human dialogues as training data to build dialogue act classification and
topic classification models for the dialogue systems in this project. The dialogue systems
in this project support a programming task that is almost the same as that in the Ripple
corpus. So, we can take advantage of this similarity. We looked into the Ripple corpus to
discover the topics most frequently mentioned by the students and the tutors while they approached the programming task, and built topic classifiers for these topics to help
the system better understand user utterances. However, this data-driven approach suffers
from data sparsity problems. For example, one of the important steps in the programming
task is converting a character digit into an integer. When students extract a character
digit from a zip code, they need to convert the character digit into an integer and add
the integer into an array. However, we only found 8 utterances in the Ripple corpus that
are related to converting a character to an integer in Java. It is very hard to learn an accurate classifier for this specific topic given such a small set of training utterances. In addition, there are also some topics that are unique to our dialogue systems, for which we cannot find training data in the Ripple corpus. The COP
3502 class at the University of Florida uses an integrated development environment called
IntelliJ (https://www.jetbrains.com/idea/) to teach Java programming, while the user interface of our dialogue systems is
based on Eclipse. So, students may ask the dialogue system how to run their program. In
this project, we manually created training utterances for these topics to alleviate the data
sparsity problem, but could not totally eliminate it.
9.3 Understanding Users' Java Programs - A Challenge in Building Dialogue Systems for Java Programming
One of the challenges in building a tutorial dialogue system for Java programming lies in understanding the user's Java program. Before answering a user's question that is
related to her Java program, the dialogue system needs to understand the context of
the question. The user’s current program is arguably the most important contextual
information in this case. However, automatic interpretation of a user’s Java program is
a very challenging task. There are two levels of interpretation of a user's Java program: syntactic interpretation and semantic interpretation. The dialogue system's ability to
interpret the user’s Java program directly limits the system’s ability to respond to the
user’s questions regarding her program. We discuss this limitation later in more detail in
this section.
The goal of syntactic interpretation is to determine whether a user's Java program is syntactically correct. In addition, it identifies items such as variable declarations, variable
assignments and method calls. Correctly identifying these operations in a user’s Java
source code is essential to interpret a user’s Java program semantically, i.e. understanding
which step toward the solution the user is working on. For example, when the user
declares an integer array at the beginning of the “extractDigits()” method, the user’s
intention is probably creating an array to hold the 5 integers of the input zip code.
We implemented Java code syntactic parsing using an abstract syntax tree (AST)
parser. When the program is syntactically correct, the parser can generate a parse of the user's Java source code without problems. However, it is more likely than not that the
program is syntactically incorrect when the user needs help from the dialogue system.
For example, the student may ask a question before finishing typing a line of source
code. In this case, the AST parser fails to parse Java source code with syntax errors (such as an incomplete line of Java code). To address this problem, we created a rule-based parser to interpret the user's Java program. This rule-based parser contains a set of patterns. We match the user's program against these patterns to identify the status of the user's progress toward the solution. However, the number of conditions that this rule-based parser can identify bounds the number of conditions that the dialogue system can respond to regarding the user's program. If the source code parser cannot "perceive" a problem in the user's
program, the dialogue system cannot reasonably comment on it. This “granularity” of the
system’s perception directly determines the “granularity” of the dialogues that the system
could conduct. Mulkar-Mehta et al. argued that the "granularity" of a natural language discourse is "the level of detail of description of an event or object" (Mulkar-Mehta et al., 2011).
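The two-level strategy can be sketched as a try-then-fall-back control flow; the third-party javalang parser and the identify_step rules sketched in Section 7.5 are assumptions for illustration, not the system's actual implementation.

    import javalang   # third-party Java parser; an assumption for this sketch

    def interpret_program(source, identify_step):
        """Try a full AST parse; fall back to line-level rules on syntax errors."""
        try:
            return ("ast", javalang.parse.parse(source))   # syntactically valid
        except Exception:                                   # e.g., an incomplete line
            steps = [identify_step(line) for line in source.splitlines()]
            return ("rules", [s for s in steps if s is not None])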
CHAPTER 10
CONCLUSION
This dissertation has reported on the development of a tutorial dialogue system using
an innovative reference resolution approach that I developed (Li and Boyer, 2016). In
Chapter 6, we empirically evaluated this reference resolution approach with an existing
human-human dialogue corpus for Java programming. I then implemented a tutorial
dialogue system for Java programming and deployed the reference resolution approach to
evaluate it in real time with human subjects.
10.1 Hypothesis Revisited
This dissertation focuses on evaluating my novel reference resolution approach in
a tutorial dialogue system. We are interested in how well my approach performs in a
real-time dialogue system compared to a comparison condition, and its impact on user
satisfaction and user engagement when interacting with the system. To serve this goal, we
implemented two tutorial dialogue systems with different reference resolution approaches,
System Li and System Comparison. We had three hypotheses:
Hypothesis I: A reference resolution approach that is more accurate offline is also more accurate in a real-time dialogue system.
Hypothesis II: More accurate reference resolution leads to higher user satisfaction in a
tutorial dialogue system.
Hypothesis III: More accurate reference resolution leads to higher user engagement in
a tutorial dialogue system.
The first hypothesis was confirmed, but we did not find evidence for the second
and the third hypotheses. The performance of a dialogue system is determined by the
performance of multiple different modules. Improving reference resolution accuracy in the
implemented tutorial dialogue system may not directly increase the system performance.
Identifying the "bottleneck" module of the tutorial dialogue system will be an interesting
research question.
Summary. This dissertation has presented our work on automatic referring expression
extraction, semantic labeling of referring expressions, and a reference resolution approach
combining learned semantics and contextual features of the dialogue. The presented
reference resolution approach was evaluated using an existing human-human tutorial
dialogue for Java programming. Then, I presented the implementation of a tutorial
dialogue system for Java programming. I first defined the functionalities the system
requires and then described its architecture and the implementation of its module. To
evaluate the impact of our novel reference resolution approach within the implemented
tutorial dialogue system, I implemented two different versions of reference resolution
approaches and conducted a user study with 41 undergraduate student participants. We
did not find a significant difference on user satisfaction (p=0.361) or user engagement
(p=0.236) between the two systems with different reference resolution approaches.
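For readers who wish to reproduce this kind of comparison, the sketch below shows one way such a two-group test could be computed with the Apache Commons Math library (org.apache.commons.math3). The score arrays are hypothetical placeholders rather than the study data, and the exact statistical test behind the reported p-values is not restated here.

import org.apache.commons.math3.stat.inference.TTest;

// A minimal sketch of a two-sample comparison between the two conditions.
// The arrays below are hypothetical placeholder scores, not the study data.
public class ConditionComparison {
    public static void main(String[] args) {
        double[] systemLi = {7.2, 5.5, 8.0, 6.4, 7.1, 6.8};
        double[] systemComparison = {6.9, 6.1, 7.4, 5.8, 6.6, 7.0};

        // Two-sided p-value for the difference in means; commons-math3
        // implements this as a t-test without assuming equal variances.
        double p = new TTest().tTest(systemLi, systemComparison);
        System.out.println("p = " + p);
    }
}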
Contributions. This project makes two main contributions to the natural language dialogue system research community. First, the implemented tutorial dialogue system is one of the first to support a complex domain like Java programming, in which the entities and the environment change dynamically as a result of the user's actions. Second, this work is the first to investigate real-time reference resolution approaches in such a complex situated dialogue system. We examined both the performance of the reference resolution module and the impact of different reference resolution approaches on the performance of the dialogue system.
To push dialogue systems toward assisting people with increasingly complex tasks, we need to address several challenging problems, including reference resolution. This dissertation investigates this challenge within a task-oriented dialogue system in a complex domain, and it is a step toward practical dialogue systems that support users in more complex domains.
10.2 Limitations
This research project has several limitations.
First, the scale of the user study is limited, which may be one reason we obtained null results for the hypotheses on dialogue system performance. Given limited time, we recruited 43 undergraduate students from the COP 3502 course at the University of Florida; more data could lead to more confident results.
Second, according to the participants' feedback, the system's performance is limited by its ability to accurately understand users' utterances. Participants sometimes needed to rephrase their questions multiple times before the system understood them. More training data would support a more accurate topic classifier, which would in turn improve the system's natural language understanding.
Thirdly, the system’s performance was also limited by the Java program parser. With
a more accurate Java source parser, the system could identify more fine grained errors in
users’ program, and further give more accurate feedback.
10.3 Future Work
This dissertation research investigated the performance of reference resolution approaches in a real-time tutorial dialogue system for Java programming. Both of the evaluated approaches are still far from perfectly identifying users' referents. According to the result analysis, a more accurate referring expression identification approach is required to achieve better reference resolution performance. Another promising research direction is to investigate additional features from the dialogue and the situated environment to inform the reference resolution module; for example, the verbs in the same utterance could be a good feature. In addition, coreference relationships also occur in situated dialogue, and it will be interesting to consider reference resolution and coreference resolution jointly. The tutorial dialogue system can be viewed as a starting point for a series of better-performing dialogue systems, which could be developed by refining some of the modules in the existing system. The tutorial dialogue system could also benefit from input from the introductory Java course's instructors, who presumably have a better understanding of the system's users' Java knowledge and could help the system better adapt to users' needs. Finally, the data-driven system's performance was limited by the lack of training data; having the system learn from its interactions with users will be an interesting research question.
APPENDIX A
PRE-SURVEY
Name
UFID
Please indicate how much you agree or disagree with the following statements. (Response options: Strongly disagree, Disagree, Neutral, Agree, Strongly agree.)

Generally, I have felt secure about attempting computer programming problems.
I am sure I could do advanced work in computer science.
I am sure that I can learn programming.
I think I could handle more difficult programming problems.
I can get good grades in computer science.
I have a lot of self-confidence when it comes to programming.
I'll need programming for my future work.

Figure A-1. Pre-survey.
Please indicate how much you agree or disagree with the following statements. (Same response options as in Figure A-1.)

I study programming because I know how useful it is.
Knowing programming will help me earn a living.
Computer science is a worthwhile and necessary subject.
I'll need a firm mastery of programming for my future work.
I will use programming in many ways throughout my life.
I like writing computer programs.
Programming is enjoyable and stimulating to me.
When a programming problem arises that I can't immediately solve, I stick with it until I have the solution.
Once I start trying to work on a program, I find it hard to stop.
When a question is left unanswered in computer science class, I continue to think about it afterward.
I am challenged by programming problems I can't understand immediately.

Figure A-2. Pre-survey.
ID  Q1-Q6          Q7-Q12         Q13-Q18
2   4 2 4 4 4 3    5 4 5 4 5 5    4 4 4 4 3 3
3   3 3 4 4 3 2    2 3 3 3 1 3    2 2 4 2 2 4
4   1 1 3 1 3 2    2 4 3 4 3 3    3 3 2 2 2 4
5   5 4 5 4 5 5    4 4 4 4 4 3    5 5 5 4 4 4
6   3 3 4 3 3 3    4 4 4 4 4 3    4 4 4 4 3 4
7   4 4 5 4 4 5    5 5 5 5 5 5    5 5 4 5 4 4
8   2 2 4 3 2 2    5 5 5 1 5 5    4 4 3 4 3 4
9   3 3 4 3 3 3    5 5 4 4 5 4    5 5 4 4 4 4
10  5 4 5 4 4 4    5 5 5 5 5 4    4 4 4 3 4 4
11  1 1 4 1 1 2    3 3 3 4 2 3    1 1 4 2 1 5
12  2 1 4 4 2 2    2 4 2 4 2 2    4 3 4 2 2 4
13  3 2 4 3 2 3    2 2 3 3 1 1    3 4 3 4 3 3
14  2 3 4 2 3 1    4 4 3 3 4 3    3 3 2 4 4 4
15  3 2 4 3 2 2    3 4 4 4 3 3    4 4 3 3 4 4
16  4 4 5 4 4 3    5 5 5 5 5 5    4 4 4 5 5 4
17  2 2 4 2 2 1    5 5 5 5 4 3    3 4 4 3 3 4

Table A-1. Complete pre-survey results for students who used System Li.
ID  Q1-Q6          Q7-Q12         Q13-Q18
18  3 3 4 4 4 2    4 4 4 4 4 4    4 4 4 5 3 4
19  4 4 4 3 4 4    4 4 4 4 4 4    4 4 4 4 4 4
20  4 4 4 4 4 4    4 4 4 4 4 4    3 3 4 2 2 4
21  2 3 4 2 4 1    5 5 5 5 5 5    4 4 4 5 2 5
22  3 3 4 3 3 4    5 5 5 5 5 5    3 4 5 3 4 5
23  5 4 5 5 5 5    5 5 4 4 5 5    5 5 5 5 5 5
24  3 3 4 2 4 3    4 4 4 4 3 4    4 4 3 2 2 3
25  3 4 5 4 4 3    5 5 5 5 4 4    5 5 4 4 4 4
26  3 3 5 2 4 2    5 5 5 5 5 5    4 4 4 4 4 4
27  4 3 4 4 4 3    3 4 2 4 3 4    4 4 3 4 3 4
28  2 3 4 3 3 2    4 5 5 5 5 5    4 4 4 4 4 5
29  3 2 4 4 4 3    4 5 4 4 4 4    4 4 4 4 5 4
30  2 2 4 2 3 3    5 5 5 5 5 5    5 5 4 3 4 4
31  2 2 4 2 3 2    2 4 4 4 4 3    3 2 3 2 2 2
32  3 3 3 3 3 3    4 4 4 4 4 4    3 3 3 4 3 4

Table A-2. Complete pre-survey results for students who used System Comparison.
APPENDIX B
POST-SURVEY
Name
UFID
Each of the following items was rated on an 11-point scale from 0 (Strongly Disagree) to 10 (Strongly Agree).

I think that I would like to use this system frequently.
I found the system unnecessarily complex.
I thought the system was easy to use.
I think that I would need the support of a technical person to be able to use this system.

Figure B-1. Post-survey.
I found the various functions in this system were well integrated.
I thought there was too much inconsistency in this system.
I would imagine that most people would learn to use this system very quickly.
I found the system very cumbersome to use.
I felt very confident using the system.
I needed to learn a lot of things before I could get going with this system.

Figure B-2. Post-survey.
This tutoring system is attractive.
This tutoring system was aesthetically appealing.
I liked the graphics and images used in this tutoring system.
This tutoring system appealed to my visual senses.
The screen layout of this tutoring system was visually pleasing.

Figure B-4. Post-survey.
Learning with this tutoring system was worthwhile.
I consider my experience a success.
Doing this task did not work out the way I planned.
My experience was rewarding.
I would recommend this tutoring system to my friends and family.
I lost myself in this task.

Figure B-5. Post-survey.
I was so involved in this task that I lost track of time.
I blocked out things around me while I was working with this tutoring system.
When I was doing this work, I lost track of the world around me.
The time I spent on this task just slipped away.
I was absorbed in the task.
During this experience, I let myself go.

Figure B-6. Post-survey.
I was really drawn into finding the solutions.
I felt involved in this task.
This experience was fun.
I continued to use this tutoring system out of curiosity.
This tutoring system incited my curiosity.

Figure B-7. Post-survey.
I felt interested in this tutoring system.
I felt frustrated while using this tutoring system.
I found this tutoring system confusing to use.
I felt annoyed while using this tutoring system.
I felt discouraged while using this tutoring system.
Using this tutoring system was mentally taxing.

Figure B-8. Post-survey.
This experience was demanding.
I felt in control of the experience.
I could not do something I needed to do with this tutoring system.

Figure B-9. Post-survey.
ID Q1-Q20 Q21-Q41
1 7 1 8 2 8 0 8 0 7 2 5 5 7 5 5 7 5 6 8 8
7 6 5 5 8 8 8 0 5 7 6 6 7 5 0 2 2 2 2 7 52 3 2 3 2 2 2 2 5 2 2 5 5 5 5 5 5 5 5 5 5
2 2 5 2 5 2 2 2 5 2 8 8 8 8 8 9 7 8 5 5 93 5 2 9 0 6 4 10 2 10 1 9 9 9 9 9 9 9 9 8
4 4 4 4 5 5 8 0 8 6 5 6 7 2 1 1 1 1 2 8 34 9 1 8 1 8 3 9 1 9 2 2 2 8 3 3 2 3 7 7 9
7 5 4 5 5 9 10 2 9 9 10 10 10 2 2 2 2 2 2 8 55 3 2 4 3 4 4 6 7 4 5 6 6 7 7 7 7 7 8 8 6
6 6 5 5 6 5 3 6 4 4 5 5 5 7 6 7 4 4 5 5 86 4 3 6 3 5 4 3 6 4 5 4 4 3 3 4 6 6 6 5 4
7 7 7 6 7 4 5 6 5 4 6 6 6 7 3 5 3 4 2 6 67 10 1 8 0 9 3 9 1 6 1 2 2 7 6 3 8 4 4 7 5
9 8 5 6 7 7 7 3 5 8 4 6 6 6 1 5 5 2 2 5 68 2 5 2 7 3 6 3 7 1 5 2 2 5 6 4 5 6 6 7 8
1 3 5 4 5 2 3 3 5 1 7 7 6 7 7 7 5 3 4 7 109 8 0 10 0 7 8 10 5 6 0 10 10 10 10 10 10 10 10 10 10
5 4 3 3 10 8 10 0 10 8 8 7 7 5 0 7 0 0 5 6 810 5 1 6 2 6 6 6 3 4 1 1 3 3 3 3 6 3 6 6 6
5 6 5 5 6 5 6 1 5 5 4 6 6 6 2 4 1 1 1 7 711 6 2 6 6 5 5 5 3 5 3 4 5 6 5 6 6 4 7 5 5
7 7 6 6 7 7 6 5 5 6 6 6 6 5 3 6 4 3 3 6 512 5 2 7 3 6 4 7 5 6 4 4 4 2 4 4 4 5 7 7 5
4 5 6 5 5 6 2 9 5 6 4 6 6 8 2 8 4 1 1 7 713 3 2 7 7 3 8 7 3 2 7 5 7 8 7 7 8 8 8 8 8
3 5 1 1 7 1 4 6 3 3 10 10 10 8 3 2 3 3 5 5 1014 7 5 7 5 3 8 8 2 5 2 7 6 7 7 5 5 5 7 7 7
6 2 2 2 2 6 8 10 8 8 6 6 6 9 3 7 7 5 6 6 915 10 5 8 4 7 6 9 2 4 5 5 9 7 8 8 8 8 9 9 10
10 10 10 10 10 9 9 1 7 8 10 10 10 3 2 2 2 2 2 8 716 10 0 9 0 7 3 9 3 10 2 5 3 7 2 4 6 4 4 7 8
10 7 7 6 7 10 10 1 9 10 8 6 6 6 2 6 2 2 2 7 617 6 3 7 6 4 6 8 2 6 0 3 8 7 7 8 7 8 7 8 7
7 7 7 8 7 7 6 5 7 7 8 8 7 5 5 3 2 3 3 6 818 8 6 6 4 6 7 6 6 4 5 6 6 6 6 6 4 6 6 6
5 5 5 5 6 6 6 6 6 4 6 6 6 6 6 6 6 5 5 6 619 8 1 8 4 5 3 7 2 2 4 7 7 7 5 6 6 5 9 9 8
6 6 5 3 8 7 5 7 6 6 8 6 7 7 4 7 5 3 6 6 1020 4 3 5 3 4 2 7 2 6 1 5 6 5 5 5 5 5 6 4 6
6 6 6 7 7 6 5 6 5 5 5 6 6 7 4 7 4 1 8 6 921 7 2 8 1 3 7 1 8 6 3 3 6 2 3 6 2 7 6 7
7 8 6 7 8 7 7 2 6 7 7 7 8 3 2 6 5 2 7 7 822 8 4 7 2 8 5 8 4 7 3 3 4 6 5 6 6 6 6 5 6
7 6 7 7 7 6 6 6 6 6 8 8 8 5 4 4 3 3 2 6 4
Table B-1. Complete post-survey results for participants who used System Li.
ID Q1-Q20 Q21-Q41
23 3 2 8 5 4 6 6 6 3 2 5 6 7 7 7 7 5 8 7 5
7 7 7 7 7 5 5 5 7 4 6 7 7 10 3 8 8 4 2 4 724 8 1 10 0 8 5 10 0 10 0 1 4 8 7 5 7 5 7 9 7
10 5 3 3 9 9 7 1 7 9 10 10 10 1 0 3 0 0 0 9 825 6 1 8 1 4 6 7 2 8 1 1 3 3 1 1 4 5 7 8 7
8 4 3 4 5 7 7 2 7 5 10 10 10 1 2 0 0 0 0 8 026 3 1 8 1 5 7 6 5 4 2 6 6 10 6 6 6 6 9 9 9
5 8 8 8 8 5 5 2 5 3 10 10 10 6 2 6 2 2 4 7 827 4 7 6 4 4 6 7 4 6 1 2 0 6 5 6 7 6 6 6 6
4 5 5 3 7 6 7 1 6 4 7 6 6 6 3 5 2 2 2 8 628 7 3 10 1 8 7 10 0 10 6 2 2 2 2 2 8 7 8 8 8
8 9 4 6 8 8 8 5 7 8 8 6 6 2 1 2 1 1 2 8 229 3 2 9 0 2 3 8 5 8 3 2 1 3 2 2 3 2 4 5 4
7 7 7 7 7 3 6 6 6 3 6 6 6 5 2 6 3 2 2 7 630 8 0 9 2 7 2 8 1 7 1 9 10 8 8 8 10 10 8 10 7
8 3 5 3 8 9 9 1 10 10 10 7 9 6 1 6 0 1 7 7 031 7 7 3 2 1 8 10 5 5 0 0 10 10 5 10 10 7 10 10 10
5 5 5 5 5 3 3 8 5 5 5 10 10 3 0 5 5 3 5 5 332 5 5 5 6 5 9 5 7 8 10 0 0 5 0 0 8 0 10 10 10
5 7 8 10 10 10 10 10 10 5 7 7 7 3 5 7 0 5 0 10 233 7 4 7 3 7 6 8 2 8 2 4 3 6 3 3 6 5 5 5 5
8 4 5 3 5 7 6 6 6 7 7 7 7 5 4 5 6 4 4 7 434 9 0 10 0 8 1 10 0 8 2 9 9 9 9 9 9 7 10 10 10
6 6 6 6 7 10 10 0 8 10 7 8 9 0 0 0 0 0 0 8 035 7 6 6 2 5 6 8 5 6 7 8 9 9 9 7 9 5 10 10 6
8 8 7 7 8 6 3 9 3 7 8 7 7 7 5 6 7 2 2 5 536 6 3 7 2 6 4 7 3 7 5 6 5 4 4 4 5 3 7 7 6
4 5 5 4 6 6 4 7 6 5 4 5 6 6 3 3 3 2 2 5 737 8 3 7 3 9 5 8 5 7 7 3 7 7 7 7 7 1 7 8 7
8 8 8 8 8 8 10 0 10 8 7 7 7 2 2 2 1 1 1 8 238 6 4 4 2 3 4 5 5 3 4 7 7 6 6 7 6 6 6 6 6
6 5 5 7 7 6 3 7 3 6 6 6 6 7 8 5 5 3 6 439 3 1 9 0 4 2 9 1 4 1 4 4 4 5 4 4 4 4 4 4
4 3 2 4 4 5 3 8 3 4 6 4 4 4 4 4 4 4 4 4 1040 10 3 9 4 9 3 9 4 8 2 4 3 3 3 3 3 3 3 4 5
9 8 6 7 7 9 8 3 8 8 8 8 8 5 3 2 2 1 0 4 641 7 3 7 5 7 3 7 3 5 6 7 5 5 5 5 5 5 5 5 5
5 4 4 4 5 6 5 6 5 6 7 7 7 7 4 7 5 5 4 4 6
Table B-2. Complete post-survey results for participants who used System Comparison.
REFERENCES
Ariel, Mira. “Referring and Accessibility.” Journal of Linguistics 24 (1988).1: 65–87.
Austin, J L. “How To Do Things With Words.” (1962).
Bangor, Aaron, Kortum, Philip T., and Miller, James T. “An Empirical Evaluation of the System Usability Scale.” International Journal of Human-Computer Interaction 24 (2008).6: 574–594.
Bangor, Aaron, Kortum, Philip, and Miller, James. “Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale.” Journal of Usability Studies 4 (2009).3: 114–123.
Blitzer, John, McDonald, Ryan, and Pereira, Fernando. “Domain Adaptation with Structural Correspondence Learning.” Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006). 2006, 120–128.
Boyer, Kristy Elizabeth. Structural and Dialogue Act Modeling in Task-Oriented Tutorial Dialogue. Ph.D. thesis, North Carolina State University, 2010.
Boyer, Kristy Elizabeth, Ha, Eun Young, Phillips, Robert, Wallis, Michael D., Vouk,Mladen A., and Lester, James C. “Dialogue Act Modeling in a Complex Task-OrientedDomain.” Proceedings of the 11th Annual SIGDIAL Meeting on Discourse and Dialogue.2010, 297–305.
Boyer, Kristy Elizabeth, Phillips, Robert, Ingram, Amy, Ha, Eun Young, Wallis,Michael D, Vouk, Mladen A, and Lester, James C. “Investigating the RelationshipBetween Dialogue Structure and Tutoring Effectiveness: A Hidden Markov ModelingApproach.” International Journal of Artificial Intelligence in Education (IJAIED) 21(2011).1: 65–81.
O'Brien, Heather L., Cairns, Paul, and Hall, Mark. “A Practical Approach to Measuring User Engagement with the Refined User Engagement Scale (UES) and New UES Short Form.” International Journal of Human-Computer Studies 112 (2018): 28–39.
Brill, Eric. “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging.” Computational Linguistics 21 (1995).4: 543–565.
Ezen-Can, Aysu. Unsupervised Dialogue Act Modeling for Tutorial Dialogue Systems. Ph.D. thesis, North Carolina State University, 2016.
Chai, Joyce, Hong, Pengyu, and Zhou, Michelle. “A Probabilistic Approach to ReferenceResolution in Multimodal User Interfaces.” Proceedings of the 9th InternationalConference on Intelligent User Interfaces - IUI ’04 (2004): 70–77.
Corbin, Carina, Morbini, Fabrizio, and Traum, David. “Creating a Virtual Neighbor.”Natural Language Dialog Systems and Intelligent Assistants (2015): 203–208.
Crystal, David. A Dictionary of Linguistics and Phonetics (4th ed.). Oxford UniversityPress, 1997.
Culotta, Aron, Wick, Michael, and McCallum, Andrew. “First-Order Probabilistic Models for Coreference Resolution.” Proceedings of the 2007 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2007, 81–88.
Daume, Hal. “Frustratingly Easy Domain Adaptation.” arXiv preprint arXiv:0907.1815(2009).
Denis, Pascal and Baldridge, Jason. “Specialized Models and Reranking for Coreference Resolution.” Proceedings of the Conference on Empirical Methods in Natural Language Processing (2008): 660–669.
Dzikovska, Myroslava O., Callaway, Charles B., Farrow, Elaine, Marques-Pita, Manuel, Matheson, Colin, and Moore, Johanna D. “Adaptive Tutorial Dialogue Systems Using Deep NLP Techniques.” NAACL HLT Demonstrations. 2007, 5–6.
Finkel, Jenny Rose and Manning, Christopher D. “Hierarchical Bayesian Domain Adaptation.” Proceedings of the 2009 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2009). 2009, 602–610.
Forsyth, Eric N. and Martell, Craig H. “Lexical and Discourse Analysis of Online Chat Dialog.” Proceedings of the International Conference on Semantic Computing (ICSC 2007). 2007, 19–26.
Funakoshi, Kotaro, Nakano, Mikio, Tokunaga, Takenobu, and Iida, Ryu. “A Unified Probabilistic Approach to Referring Expressions.” Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue (2012): 237–246.
Garrette, Dan and Baldridge, Jason. “Learning a Part-of-Speech Tagger from Two Hours of Annotation.” Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013). 2013, 138–147.
Gorniak, Peter and Roy, Deb. “Situated Language Understanding as Filtering PerceivedAffordances.” Cognitive Science 31 (2007).2: 197–231.
Grosz, Barbara J., Joshi, Aravind K., and Weinstein, Scott. “Centering: A Framework for Modeling the Local Coherence of Discourse.” Computational Linguistics 21 (1995).2: 203–225.
Hovy, Dirk, Plank, Barbara, and Søgaard, Anders. “Mining for Unambiguous Instances toAdapt Part-of-speech Taggers to New Domains.” Proceedings of the 2015 Conference ofthe North American Chapter of the Association for Computational Linguistics HumanLanguage Technologies (NAACL HLT 2015). 2015, 1256–1261.
Iida, Ryu, Kobayashi, Shumpei, and Tokunaga, Takenobu. “Incorporating Extra-linguistic Information into Reference Resolution in Collaborative Task Dialogue.” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010): 1259–1267.
Iida, Ryu, Yasuhara, Masaaki, and Tokunaga, Takenobu. “Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues.” Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011) (2011): 84–92.
Jiang, Jing and Zhai, Chengxiang. “Instance Weighting for Domain Adaptation inNLP.” the 45th Annual Meeting of the Association of Computational Linguistics. 2007,264–271.
Kennington, Casey and Schlangen, David. “Simple Learning and CompositionalApplication of Perceptually Grounded Word Meanings for Incremental ReferenceResolution.” Proceedings of the Conference for the Association for ComputationalLinguistics (ACL) (2015): 292–301.
Lafferty, John, McCallum, Andrew, and Pereira, Fernando C N. “Conditional RandomFields: Probabilistic Models for Segmenting and Labeling Sequence Data.” Proceedingsof the International Conference on Machine Learning. 2001, 282–289.
Lappin, Shalom and Leass, Herbert J. “An Algorithm for Pronominal AnaphoraResolution.” Computational Linguistics 20 (1994): 535–561.
Lemon, Oliver, Bracy, Anne, Gruenstein, Alexander, and Peters, Stanley. “The WITAS Multi-Modal Dialogue System I.” Proceedings of INTERSPEECH. 2001, 1559–1562.
Li, Shen, Graca, Joao V, and Taskar, Ben. “Wiki-ly Supervised Part-of-Speech Tagging.”the 2012 Joint Conference on Empirical Methods in Natural Language Processing andComputational Natural Language Learning. 2012, 1389–1398.
Li, Xiaolong and Boyer, Kristy Elizabeth. “Semantic Grounding in Dialogue for ComplexProblem Solving.” Proceedings of the 2015 Conference of the North American Chapterof the Association for Computational Linguistics Human Language Technologies(NAACL HLT 2015). 2015, 841–850.
———. “Reference Resolution in Situated Dialogue with Learned Semantics.” the 17thAnnual Meeting of the Special Interest Group on Discourse and Dialogue. 2016, 329–338.
Liu, Changsong and Chai, Joyce Y. “Learning to Mediate Perceptual Differences inSituated Human-Robot Dialogue.” Proceedings of the Twenty-ninth AAAI Conference(AAAI15). 2015, 2288–2294.
Liu, Changsong, She, Lanbo, Fang, Rui, and Chai, Joyce Y. “Probabilistic Labeling for Efficient Referential Grounding Based on Collaborative Discourse.” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL) (2014): 13–18.
Liu, Changsong, Fang, Rui, and Chai, Joyce Yue. “Towards Mediating Shared Perceptual Basis in Situated Dialogue.” Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue (2012): 140–149.
Manning, Christopher D. “Part-of-Speech Tagging from 97% to 100%: Is It Time forSome Linguistics?” International Conference on Intelligent Text Processing andComputational Linguistics. 2011, 171–189.
Manning, Christopher D, Bauer, John, Finkel, Jenny, and Bethard, Steven J. “TheStanford CoreNLP Natural Language Processing Toolkit.” the 52nd Annual Meeting ofthe Association for Computational Linguistics: System Demonstrations (2014): 55–60.
Matuszek, Cynthia, Bo, Liefeng, Zettlemoyer, Luke S, and Fox, Dieter. “Learning fromUnscripted Deictic Gesture and Language for Human-Robot Interactions.” Proceedingsof AAAI 2014 (2014): 2556–2563.
McCarthy, Joseph F. and Lehnert, Wendy G. “Using Decision Trees for Coreference Resolution.” Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (1995).
McClosky, David, Charniak, Eugene, and Johnson, Mark. “Automatic Domain Adaptationfor Parsing.” Proceedings of the 2010 Annual Conference of the North AmericanChapter of the Association for Computational Linguistics (HLT-NAACL). 2010, 28–36.
Mulkar-Mehta, Rutu, Hobbs, Jerry, and Hovy, Eduard. “Granularity in Natural Language Discourse.” Proceedings of the Ninth International Conference on Computational Semantics. 2011, 360–364.
Owoputi, Olutobi, O'Connor, Brendan, Dyer, Chris, Gimpel, Kevin, Schneider, Nathan, and Smith, Noah A. “Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters.” Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013). 2013, 380–390.
Plank, Barbara, Hovy, Dirk, McDonald, Ryan, and Søgaard, Anders. “Adapting Taggersto Twitter with Not-so-distant Supervision.” COLING 2014, the 25th InternationalConference on Computational Linguistics: Technical Papers. 2014, 1783–1792.
Ponzetto, Simone Paolo and Strube, Michael. “Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution.” Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (2006): 192–199.
Poon, Hoifung and Domingos, Pedro. “Unsupervised Semantic Parsing.” Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2009, 1–10.
Rosé, Carolyn P. “A Framework for Robust Semantic Interpretation.” Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference (NAACL). 2000, 311–318.
Schlangen, David, Zarriess, Sina, and Kennington, Casey. “Resolving References toObjects in Photographs using the Words-As-Classifiers Model.” Proceedings of the 54thAnnual Meeting of the Association for Computational Linguistics (ACL 2016) (2016):1213–1223.
Schmidt, Mark and Swersky, Kevin. “http://www.cs.ubc.ca/∼schmidtm/Software/crfChain.html.” 2008.
Sha, Fei and Pereira, Fernando. “Shallow Parsing with Conditional Random Fields.” Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL 2003). 2003, 134–141.
Grosz, Barbara J. and Sidner, Candace L. “Attention, Intentions, and the Structure of Discourse.” Computational Linguistics 12 (1986).3: 175–204.
Sidner, Candace L, Lee, Christopher, Lesh, Neal, and Rich, Charles. “Explorations inEngagement for Humans and Robots.” Artificial Intelligence 166 (2005).1-2: 140–164.
Soon, W. M., Ng, H. T., and Lim, D. C. Y. “A Machine Learning Approach to Coreference Resolution of Noun Phrases.” Computational Linguistics 27 (2001).4: 521–544.
Strik, Helmer, Russel, Albert, Cucchiarini, Catia, Boves, Lou, and Oostdijk, N. “A Spoken Dialogue System for Public Transport Information.” International Journal of Speech Technology 2 (1997): 119–129.
Tjong Kim Sang, Erik F. and Buchholz, Sabine. “Introduction to the CoNLL-2000 Shared Task: Chunking.” Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. 2000, 127–132.
Toutanova, Kristina, Klein, Dan, and Manning, Christopher D. “Feature-RichPart-of-Speech Tagging with a Cyclic Dependency Network.” Human LanguageTechnologies: The 2003 Annual Conference of the North American Chapter of theAssociation for Computational Linguistics. 2003, 252–259.
VanLehn, Kurt, Jordan, Pamela W., Rosé, Carolyn P., Bhembe, Dumisizwe, Böttner, Michael, Gaydos, Andy, Makatchev, Maxim, Pappuswamy, Umarani, Ringenberg, Michael, Roque, Antonio, Siler, Stephanie, and Srivastava, Ramesh. “The Architecture of Why2-Atlas: A Coach for Qualitative Physics Essay Writing.” Proceedings of the Sixth International Conference on Intelligent Tutoring Systems (ITS 2002). 2002, 158–167.
Wen, Tsung-Hsien, Vandyke, David, Mrksic, Nikola, Gasic, Milica, Rojas-Barahona, Lina M., Su, Pei-Hao, Ultes, Stefan, and Young, Steve. “A Network-based End-to-End Trainable Task-oriented Dialogue System.” arXiv preprint arXiv:1604.04562 (2016).
Xue, Nianwen and Palmer, Martha. “Calibrating Features for Semantic Role Labeling.” Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2004, 88–94.
Yang, Xiaofeng, Zhou, Guodong, Su, Jian, and Tan, Chew Lim. “Coreference ResolutionUsing Competition Learning Approach.” Proceedings of the 41st Annual Meeting onAssociation for Computational Linguistics (2003): 176–183.
BIOGRAPHICAL SKETCH
Xiaolong Li received his Ph.D. from the University of Florida in August 2018.
Before that, he received his bachelor’s and master’s degrees in computer engineering and
technology in 2008 and 2012 from Northwestern Polytechnical University and Zhejiang
University in China, respectively. He started his Ph.D. program in computer science in
2012 at North Carolina State University and then transferred to the University of Florida
with the LearnDialogue research group in 2015.