INVESTIGATING REAL-TIME REFERENCE RESOLUTION IN SITUATED DIALOGUE FOR COMPLEX PROBLEM SOLVING

By

XIAOLONG LI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018

© 2018 Xiaolong Li
I dedicate this dissertation to my father Fuhai Li. I wish he could see this.
ACKNOWLEDGMENTS
I would like to express my sincere appreciation to my advisor Dr. Kristy Boyer for her continuous guidance, support, and friendship throughout my Ph.D. study. I also would like to thank my LearnDialogue colleagues for their generous help and support. In particular, I would like to thank Fernando Rodríguez, Jennifer Tsan, and Lydia Pezzullo for their help with document editing, Joseph Wiggins for data annotation, and Mickey Vellukunnel, Mehmet Celepkolu, and Timothy Brown for organizing studies. The friendly and supportive LearnDialogue culture made my Ph.D. study much easier. I also want to thank my family, especially my wife Runqing Wang, for their unconditional support.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 RELATED WORK
  2.1 Coreference Resolution
  2.2 Reference Resolution in Situated Dialogue
  2.3 Summary

3 CORPUS
  3.1 Data Collection
  3.2 Annotation

4 ONLINE REFERRING EXPRESSION EXTRACTION
  4.1 Part-of-speech Tagging for Domain-specific Language
    4.1.1 Approach
    4.1.2 Experiments and Results
  4.2 Noun Phrase Chunking in Tutorial Dialogue
  4.3 Discussion

5 SEMANTIC INTERPRETATION OF REFERRING EXPRESSIONS
  5.1 Semantic Interpretation as Sequence Labeling
    5.1.1 Noun Phrases in Domain Language
    5.1.2 Description Vector
    5.1.3 Joint Segmentation and Labeling
    5.1.4 Features
  5.2 Experiments and Results

6 REFERENCE RESOLUTION FOR SITUATED DIALOGUE SYSTEM
  6.1 Reference Resolution in a Situated Environment
  6.2 Referring Expression Semantic Interpretation
  6.3 Generating a List of Candidate Referents
  6.4 Ranking-based Classification
  6.5 Experiments and Results
    6.5.1 Semantic Parsing
    6.5.2 Candidate Referent Generation
    6.5.3 Identifying Most Likely Referent

7 TUTORIAL DIALOGUE SYSTEM FOR JAVA PROGRAMMING WITH SUPERVISED REFERENCE RESOLUTION
  7.1 User Interface
  7.2 System Functionalities
  7.3 Architecture of the Dialogue Agent
  7.4 Natural Language Understanding Module
    7.4.1 Reference Resolution
    7.4.2 Dialogue Act Classification
    7.4.3 Topic Classification
  7.5 Dialogue Manager
  7.6 Knowledge Base
  7.7 System Utterance Generation

8 EVALUATION OF THE DIALOGUE SYSTEM
  8.1 Proposed Hypotheses
  8.2 User Study
    8.2.1 Participants
    8.2.2 Java Programming Task for the Study
    8.2.3 Procedure
    8.2.4 Data Collection
  8.3 System Usability Evaluation
  8.4 User Engagement Evaluation
  8.5 Online Reference Resolution Evaluation in Tutorial Dialogue Systems

9 DISCUSSION
  9.1 Null Results
  9.2 Data-driven Approach in Building Dialogue Systems
  9.3 Understanding Users’ Java Program - A Challenge in Building Dialogue Systems for Java Programming

10 CONCLUSION
  10.1 Hypothesis Revisited
  10.2 Limitations
  10.3 Future Work

APPENDIX

A PRE-SURVEY
B POST-SURVEY

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

1-1 An excerpt dialogue between a user and the dialogue system.
3-1 Semantic labels of referring expressions.
4-1 Results of baseline tagger (CRF trained on source-domain corpus), Stanford tagger, and our approach (CRF trained on generated target-domain corpus).
4-2 Noun phrase chunking results.
4-3 The features used for noun phrase chunking.
5-1 Semantic labeling accuracy.
6-1 Algorithm to select candidates using learned semantics.
6-2 Features used for segmentation and labeling.
6-3 Reference resolution results.
6-4 Reference resolution results with gold semantic labels.
7-1 Dialogue act set.
7-2 Topics recognized by the topic classifier.
7-3 Sample system response utterances.
8-1 An excerpt dialogue between a user and the Virtual TA.
8-2 An example user action saved in the database.
8-3 An example reference resolution event saved in the database.
8-4 A false positive example of referring expression identification.
8-5 A false negative example of referring expression identification.
9-1 A comparison between human-computer dialogues and human-human dialogues.
A-1 Complete pre-survey results for students who used System Li.
A-2 Complete pre-survey results for students who used System Comparison.
B-1 Complete post-survey results for users who used System Li.
B-2 Complete post-survey results for users who used System Comparison.
LIST OF FIGURES

1-1 Excerpt of tutorial dialogue illustrating reference resolution. Referring expressions are shown in bold.
1-2 Pipeline of online reference resolution in a situated dialogue.
2-1 Relationship between accessibility and referring expression forms.
2-2 Coreference relation example diagram.
2-3 Bayesian network for reference resolution.
2-4 Identifying the most likely referent using the word-as-classifier approach.
3-1 The interface of Ripple, a tutorial dialogue system for Java programming. It includes two windows: a window (on the left) to display the student’s Java code and a window (on the right) for textual messages between student and tutor.
4-1 Steps for referring expression extraction.
4-2 Example of target sentence generation.
5-1 A parse of the outer for loop from the Stanford Parser.
5-2 Segmentation and semantic linking of NP “a 2 dimensional array”.
5-3 Dependency structure of “a 2 dimensional array”.
6-1 Semantic interpretation of referring expressions.
7-1 Architecture of the tutorial dialogue system.
7-2 User interface of the dialogue system.
7-3 Architecture of the dialogue system.
7-4 User intention identification example.
7-5 Structure of the programming task.
8-1 A short instruction with the task description.
8-2 A short instruction with the task description.
8-3 System usability score interpretation.
8-4 Reference resolution process in the dialogue system.
A-1 Pre-survey.
A-2 Pre-survey.
A-3 Pre-survey.
B-1 Post-survey.
B-2 Post-survey.
B-3 Post-survey.
B-4 Post-survey.
B-5 Post-survey.
B-6 Post-survey.
B-7 Post-survey.
B-8 Post-survey.
B-9 Post-survey.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

INVESTIGATING REAL-TIME REFERENCE RESOLUTION IN SITUATED DIALOGUE FOR COMPLEX PROBLEM SOLVING

By

Xiaolong Li

August 2018

Chair: Kristy Elizabeth Boyer
Major: Computer Science
A situated dialogue is embedded in a situated environment, where domain-specific
task completion is usually a central activity. In a situated dialogue, it is essential to
correctly identify the objects that speakers refer to in the environment. This task is
referred to as reference resolution. However, reference resolution is a challenging problem in situated dialogue, and in part because of this challenge, most state-of-the-art situated dialogue systems operate within highly constrained domains. This dissertation presents an
implementation of a tutorial dialogue system for the domain of Java programming, with
real-time reference resolution. The implemented dialogue system identifies and interprets
referring expressions in user utterances in real time. The identified referents are used
to improve the performance of natural language understanding. This dissertation also
examines the impact of different reference resolution approaches on the performance of the
implemented tutorial dialogue system.
The implemented real-time reference resolution approach in this project has three
phases. First, we apply an innovative approach that we developed for more accurate
part-of-speech tagging in domain-specific dialogue. This approach does not require an
annotated corpus for the target domain. Next, we use a Conditional Random Field to
label the semantic structure of the referring expressions. Finally, the learned semantics
are used together with contextual information to perform reference resolution in situated
dialogue. Offline evaluation of the CRF-based reference resolution approach on an existing tutorial dialogue corpus for computer programming showed an accuracy of 61.6%, a dramatic improvement over the 51.3% achieved by an approach based on a manually defined lexicon (Li and Boyer, 2016).
To evaluate the performance of the two reference resolution approaches, we implemented them in a tutorial dialogue system for Java programming. A human subjects study
was conducted to assess the performance of the tutorial dialogue systems with different
reference resolution approaches. In the study, 41 human participants were randomly
assigned to use these two tutorial dialogue systems. Post-survey results were collected
from study participants to evaluate system usability and user engagement. The reference
resolution performed by the dialogue systems was automatically logged into a database
for manual evaluation. After analyzing the data collected in the study, we did not find a significant difference in user satisfaction or user engagement between the dialogue systems with different reference resolution approaches. The possible reasons are discussed in Chapter 9.
This dissertation is one of the few works that attempt to implement a natural language dialogue system for a domain as complex as Java programming. It is also the only known work that compares different reference resolution approaches in a tutorial dialogue system.
In the dialogue system research community, there is an increasing recognition that
natural language dialogue systems need to work in more complex domains. Real-time
reference resolution in situated dialogue is one of the important challenges to achieve such
a goal. This dissertation research has made a step toward real-time reference resolution for
a dialogue system operating in a complex domain.
CHAPTER 1
INTRODUCTION
Dialogue systems must move toward understanding users’ language within situated
environments to assist users with increasingly complex tasks. Situated dialogue is usually
embedded in an environment where domain-specific task completion is a central activity.
One of the essential requirements of situated dialogue systems is to identify the objects
that users refer to during a conversation (Iida et al., 2010; Liu et al., 2014; Liu and
Chai, 2015; Chai et al., 2004). Identifying a speaker’s referents is, itself, a crucial part
of utterance interpretation. Identifying the correct referent for an utterance also helps
other aspects of language understanding—for example, by constraining the likely current
intention (Gorniak and Roy, 2007).
Reference resolution in situated dialogue is challenging because of the ambiguity
inherent within dialogue utterances and the complexity of the environment. Imagine
a dialogue system that assists a novice student in solving a programming problem. To
understand a question or statement the student poses, such as, “Should I use the 2
dimensional array?”, the system must link the referring expression “the 2 dimensional
array” to an object¹ in the environment.
This process is illustrated in Figure 1-1, which shows an excerpt from a corpus of
tutorial dialogue situated in an introductory computer programming task in the Java
programming language. The arrows link referring expressions in the situated dialogue to
their referents in the environment. To identify the referent of each referring expression, it
is essential to capture the semantic structure of the referring expression and the attributes of the object it refers to; for example, “the 2 dimensional array” expresses two attributes, “2 dimensional” and
¹ The word “object” has a technical meaning within the domain of object-oriented programming, which is the domain of the corpus utilized in this work. However, we follow the standard usage of “object” in situated dialogue (Iida et al., 2010), which for programming is any portion of code in the environment.
“array”. At the same time, the dialogue history and the history of user task actions (such
as editing the code) play a key role. To disambiguate the referent of “my array”, temporal
information is needed: in this case, the referent is a variable named “arra”, which is an
array that the student has just created.
[Figure 1-1 here. The figure shows two panels, “Dialogue and task history” on the left (tutor and student utterances interleaved with code-editing events such as “student adds line of code: arra = new int[s.length()];”) and “Environment” on the right (the student’s Java code), with arrows linking referring expressions in the dialogue to their referents in the code.]

Figure 1-1. Excerpt of tutorial dialogue illustrating reference resolution. Referring expressions are shown in bold.²
To tackle the problem of reference resolution in this type of situated dialogue, we
present a pipeline approach that combines a domain-specific part-of-speech (POS) tagger,
semantics from a conditional-random-field-based semantic parser, and salience features from dialogue history and task history. This approach includes three main steps.
First, we extract referring expressions from user utterances. Second, we interpret the
semantics of referring expressions using a conditional random field (CRF) model. The
² Typos and syntactic errors are shown as they appear in the original corpus.
outputs of this step are the object attributes expressed by the referring expressions.
Finally, the learned semantic information and contextual information from the situated
dialogue are used to identify the mentioned objects. This process is illustrated in Figure
1-2. We evaluate this approach on the JavaTutor corpus, a corpus of textual tutorial
dialogue collected within an online environment for computer programming.
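To make this three-step process concrete, the sketch below shows one way the stages could be composed. The function names, the toy extraction and labeling rules, and the candidate list are illustrative stand-ins for the learned models described in later chapters, not the actual implementation.

```python
# Illustrative sketch of the three-step pipeline (cf. Figure 1-2); the
# extraction and labeling rules here are toy stand-ins for learned models.

CANDIDATES = [  # objects in the situated environment (the Java code)
    {"name": "table", "CATEGORY": "Variable", "ARRAY DIMENSION": "2"},
    {"name": "arra",  "CATEGORY": "Variable", "ARRAY DIMENSION": "1"},
]

def extract_referring_expressions(utterance):
    """Step 1: in the real system, POS tagging + NP chunking + classification."""
    return [np for np in ("the 2 dimensional array", "my array") if np in utterance]

def interpret_semantics(expression):
    """Step 2: in the real system, a CRF labels attributes (see Table 3-1)."""
    attrs = {"CATEGORY": "Variable"}
    if "2 dimensional" in expression:
        attrs["ARRAY DIMENSION"] = "2"
    return attrs

def resolve_referent(attrs, candidates):
    """Step 3: rank candidates by attribute compatibility; salience features
    from dialogue and task history would be added to this score."""
    return max(candidates, key=lambda c: sum(c.get(k) == v for k, v in attrs.items()))

for re_ in extract_referring_expressions("Should I use the 2 dimensional array?"):
    print(re_, "->", resolve_referent(interpret_semantics(re_), CANDIDATES)["name"])
# the 2 dimensional array -> table
```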
To enable a task-oriented dialogue system to perform reference resolution in real time, we need to recognize referring expressions in user utterances on the fly. To solve this problem, we need accurate part-of-speech (POS) tags for user
utterances. This dissertation also presents an innovative POS tagging approach within
situated dialogue. In a corpus of textual dialogue for Java programming, the proposed
approach showed a large improvement over the Stanford tagger. Compared to a tagger
trained on the same source data (which includes dialogue) but with no domain adaptation,
overall accuracy improved from 87.14% to 92.76%. For nouns, which are a prevalent and
challenging open word class in domain language, the new approach results in a dramatic
improvement from an F1-score of 0.701 to 0.903. Accordingly, the F1-score of noun phrase
chunking was improved from 0.81 to 0.86.
Prior work on reference resolution has leveraged dialogue history and task history
information to improve the accuracy of reference resolution (Iida et al., 2010, 2011;
Funakoshi et al., 2012). However, these prior approaches have employed relatively simple
semantic information from the referring expressions, such as a manually created lexicon,
or have operated within an environment with a limited set of pre-defined objects. As
this dissertation demonstrates, these prior approaches do not perform well in situated
dialogues for complex problem solving, in which the user creates, modifies, and removes
objects from the environment in unpredictable ways. We combine the semantics learned
by a CRF-based approach together with salience information of objects in the situated
environment to map referring expressions to their referents. The results showed that our
approach achieves substantial improvement over two existing state-of-the-art approaches,
with existing approaches achieving 51.3% accuracy at best, and the new approach
achieving 61.6% accuracy.
[Figure 1-2 here. The figure shows the three pipeline stages, Referring Expression Extraction, Semantic Interpretation of Referring Expressions, and Identifying Referents, applied to the user utterance “… from the actionPerformed method”: the extracted referring expression “the actionPerformed method” is resolved to the referent on line 71, public void actionPerformed(){...}.]

Figure 1-2. Pipeline of online reference resolution in a situated dialogue.
In this dissertation, we present a data-driven tutorial dialogue system for Java
programming. In this dialogue system, we implement the reference resolution pipeline
presented above to identify the user’s referent in real time. The tutorial dialogue system
has four main modules: natural language understanding (NLU) module, dialogue manager
(DM) module, knowledge base (KB) module, and a natural language generation (NLG)
module. The NLU module performs reference resolution, dialogue act classification and
topic classification for an input user utterance. The DM tracks the current programming
progress and user intention. We also authored a set of rules for the DM to generate system dialogue acts in response to input user dialogue acts. The KB module maintains knowledge about the programming problem and the Java language. For the NLG module, we authored a set of system utterances for each system dialogue act. An excerpt dialogue between a user
and the dialogue system is shown in Table 1-1.
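A rough sketch of how these four modules might be wired together follows; the class and method names are hypothetical, chosen only to mirror the description above, not the system’s actual interfaces.

```python
# Hypothetical wiring of the four modules (NLU -> DM -> KB -> NLG);
# all names are illustrative.

class NLU:
    def understand(self, utterance):
        # Reference resolution, dialogue act classification, topic classification.
        return {"referents": [], "dialogue_act": "QUESTION", "topic": "arrays"}

class DialogueManager:
    def __init__(self, rules):          # authored rules: user act -> system act
        self.rules = rules
    def next_act(self, nlu_output):
        return self.rules.get(nlu_output["dialogue_act"], "HINT")

class KnowledgeBase:
    def lookup(self, topic):            # knowledge about the task and Java
        return f"some knowledge about {topic}"

class NLG:
    def __init__(self, templates):      # authored utterances per system act
        self.templates = templates
    def generate(self, act, knowledge):
        return self.templates[act].format(knowledge=knowledge)

nlu = NLU()
dm = DialogueManager({"QUESTION": "ANSWER"})
kb = KnowledgeBase()
nlg = NLG({"ANSWER": "Here is a hint: {knowledge}."})

out = nlu.understand("how to parse a string into an array")
print(nlg.generate(dm.next_act(out), kb.lookup(out["topic"])))
# Here is a hint: some knowledge about arrays.
```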
To evaluate the tutorial dialogue systems with different reference resolution
approaches, we conducted a user study with undergraduate students from the COP 3502 “Programming Fundamentals I” class at the University of Florida.
Table 1-1. An excerpt dialogue between a user and the dialogue system.

Virtual TA: Hi, I’m your virtual TA. I will be helping you on this Java project.
User: does extratDigits method input anything?
Virtual TA: This Java program converts a zip code for US mail into a visual barcode. When you run the program, it asks the user to enter a zip code. The program will then convert it into a barcode and display it. But it doesn’t work yet! To make it work, your task is to complete the extractDigits method. It needs to convert an integer zip code into an array of individual digits, and return that array. Hint: Making the int into a String first is a convenient way to get its individual digits.
Virtual TA: The extractDigits method is located in the PostalFrame class, which should have been opened for you. You are welcome to view the other code, but you only need to work on the extractDigits method.
User: what do I need to do in the extractDigits method?
Virtual TA: You will extract the five digits of a zipcode. Now we need to get each of the 5 digits of that int. Using a string and then getting the individual characters from it is a good way to do that.
User: where is teh zip code
Virtual TA: When the program is run, the user types in a zipCode. The code to get the zipCode from the user is already complete. The zip code the user entered is stored as a private variable, int zipCode on line 34 in PostalFrame.
User: how to parse a string into an array
Virtual TA: You can use the syntax intVariable + "" to convert an integer to a String variable.
. . .
We implemented
two different versions of the tutorial dialogue system with different reference resolution
approaches. System Li implemented reference resolution with semantics learned by a CRF-based approach. The baseline system, System Comparison, implemented a reference resolution approach with a manually authored domain lexicon. In the
evaluation, we investigated the impact of different reference resolution approaches
on the tutorial dialogue system. Specifically, we examined the different approaches’
impacts on user satisfaction using the System Usability Scale (SUS) instrument (Bangor et al., 2008), and on user engagement using the User Engagement Scale (UES) instrument (O’Brien et al., 2018). System Li had an average SUS score of 66.7, and System Comparison had an average SUS score of 68.8; the difference between these two scores was not significant (p = 0.361). System Li had a UES score of 11.8, and System Comparison had a UES score of 12.3; this difference was not significant either (p = 0.236).
We also examined the online accuracy of the two reference resolution approaches.
System Li and System Comparison had accuracies of 21.6% and 19.6%, respectively. After further analysis of the collected data, we found that the low accuracy was caused by the referring expression selection approach. After manually annotating the referring expressions in the collected data, we found the accuracies of these two models were 63.3% and 44.9%,
respectively.
This dissertation makes the following contributions: 1) implementation of a tutorial
dialogue system for Java programming; and 2) evaluation of real-time reference resolution
approaches in the tutorial dialogue system by conducting a human subjects study. We believe these contributions will help the dialogue system research community to better understand reference resolution in situated dialogue systems.
The remainder of the dissertation is structured as follows. Chapter 2 reviews related
work on situated language understanding, and reference resolution in situated dialogue
understanding, summarizing the features and approaches used in prior work. Chapter 3
introduces the corpus of situated dialogue for Java programming, which is used in this
dissertation for model training and empirical evaluation. Chapter 4 describes the process
of online referring expression identification, which extracts referring expressions from
user utterances in real time when the dialogue system is running. Chapter 5 presents
the semantic interpretation of referring expressions using a CRF-based model. Chapter
6 describes the approach for reference resolution with learned semantics from referring
expressions and contextual information of the task-oriented dialogue. We describe
the implementation of the tutorial dialogue system for Java programming in Chapter
7. We present a user study for the tutorial dialogue system in Chapter 8. Chapter 9
is a discussion of observations made while building the tutorial dialogue system and
conducting the user study. The dissertation is concluded in Chapter 10 by summarizing
the presented work and contributions.
CHAPTER 2
RELATED WORK
This chapter reviews previous research on reference resolution within different types
of situated environments. We start with coreference resolution in text, which is closely
related to reference resolution in situated language and has been a well established
research area for decades. Then, we categorize, discuss, and compare previous work on
reference resolution in situated language.
2.1 Coreference Resolution
Coreference resolution discovers antecedents for anaphors in discourse. An anaphor is a linguistic expression whose interpretation depends on another linguistic expression in the context. An antecedent is also a linguistic expression, one that occurs before an anaphor and can be used to interpret it. For example, in the sentence “When you see John, give him this card.”, “John” is an antecedent of “him”, and “him” is an anaphor. A coreference relation consists of an antecedent and an anaphor that refer to the same entity; there may be multiple noun phrases referring to the same entity. Coreference resolution is different from reference resolution in a situated environment; however, they share some similarities, which will be discussed in Section 2.2. Reference resolution has
been inspired by the theories and approaches developed for coreference resolution, such as
centering theory and ranking-based classification approach (Denis and Baldridge, 2008).
Theories for Coreference Resolution
Ariel presented a theory that described the relationship between accessibility of
entities and referring behaviors (Ariel, 1988). She argued that “natural language
primarily provides speakers with means to code the ACCESSIBILITY of the referent
to the addressee.” The accessibility of entities, which indicates how accessible an entity
is to the conversation participants, is “tied to context types in a definitely non-arbitrary
way.” According to the author, there are three types of contexts that are highly related to
reference resolution: community mutual knowledge, physical co-present mutual knowledge,
and linguistic co-present mutual knowledge. Community mutual knowledge is shared
by the speakers and addressees because of belonging to the same community. Physical
co-present mutual knowledge is perceived by the conversation participants in their shared
physical environment. Linguistic co-present mutual knowledge is conveyed by previous
utterances, i.e., dialogue history. All of these three kinds of knowledge determine the
accessibility of possible referents at a given moment. Intuitively, these three context
types provide metrics to measure the salience of entities involved in a conversation.
Ariel also argued that the accessibility of entities determines the form of their
referring expressions. Entities with lower accessibility need more lexical information to be
identified, and vice versa. More detailed relationships between accessibility and the form
of referring expressions are shown in Figure 2-1.
Figure 2-1. Relationship between accessibility and referring expression forms.
Grosz et al. presented a framework based on centering theory to model local
coherence of discourse (Grosz et al., 1995). Centers were defined as entities in an
utterance that served as links to other utterances in the discourse that also contain
the same entities. Each utterance in the discourse was assigned a set of forward-looking
centers and one backward-looking center. The centering framework provided a rule-based
approach to describe a speaker’s attentional state by monitoring the change of centers.
The authors also argued that attentional states were highly related to the choice of
referring expressions. Sidner also pointed out the close relationship between discourse
structure and reference resolution (Sidner, 1986).
Both accessibility theory and centering theory emphasize the importance of salience
information in coreference resolution. We will show that this salience information is also
essential in reference resolution in situated environments.
Models for Coreference Resolution. Early work on coreference resolution used
rule-based approaches (Lappin and Leass, 1994). More recent work usually formulates
coreference resolution as a classification problem as discussed above, which is also
employed by reference resolution in most cases. The difference is that the candidates
of coreference resolution are other referring expressions, while reference resolution has
objects from the situated environment as candidates.
The straightforward approach is to consider referring expressions in pairs, <re_i, re_j>. The binary output of a classification function f(re_i, re_j) indicates whether re_i and re_j have the same referent. Some previous work used decision trees as classification
functions, given the simplicity and categorical nature of the features (Mccarthy and
Lehnert, 1995; Soon et al., 2001). Ponzetto and Strube used a maximum entropy model as
their classification function (Ponzetto and Strube, 2006).
Ranking-based model: In a piece of text, there could be multiple antecedents for a
referring expression. Pairwise matching models consider a single candidate at a time and take only a True/False decision from a binary classifier. However, the output of a binary classifier is usually a probability. This probability, the confidence of making a positive decision, is discarded by such models. To employ this confidence value, Yang et al.
presented an approach using twin-candidates instead of a single candidate as antecedents
(Yang et al., 2003). In this approach, each data sample contained one anaphor and two
candidate antecedents, only one of which was the real antecedent. The model considered
features between these three referring expressions to make a final decision, which took
the comparison between two candidates into consideration. The model achieved better
performance. Using a similar idea, Denis and Baldridge presented a ranking-based model,
which created multiple antecedent candidates <c_0, c_1, ..., c_k> for each anaphor re (Denis and Baldridge, 2008). A binary classifier f(re, c_i) ∈ [0, 1] was then used to compute the compatibility p_i between re and each c_i. These outputs p_i were ranked to select the best candidate from the candidate list as re’s real antecedent. Culotta et al. organized
candidates into clusters and identified all the antecedents for a referring expression at the
same time (Culotta et al., 2007).
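As a concrete, if simplified, illustration of this ranking idea, the sketch below scores each candidate antecedent with a probabilistic classifier and picks the highest-scoring one; the features and training data are toy placeholders, not the features used in the cited work.

```python
# Sketch of ranking-based antecedent selection in the spirit of Denis and
# Baldridge (2008); features and training data are toy placeholders.
from sklearn.linear_model import LogisticRegression

def features(anaphor, candidate):
    # Toy pairwise features; real systems use rich syntactic/semantic cues.
    return [int(anaphor["gender"] == candidate["gender"]),
            int(anaphor["number"] == candidate["number"]),
            anaphor["position"] - candidate["position"]]  # textual distance

# Toy training pairs (feature vectors; 1 = coreferent, 0 = not).
X = [[1, 1, 2], [0, 1, 5], [1, 0, 9], [1, 1, 1]]
y = [1, 0, 0, 1]
clf = LogisticRegression().fit(X, y)

def resolve(anaphor, candidates):
    """Score every candidate with P(coreferent) and return the top-ranked one."""
    probs = clf.predict_proba([features(anaphor, c) for c in candidates])[:, 1]
    return candidates[int(probs.argmax())]

him = {"gender": "m", "number": "sg", "position": 7}
candidates = [{"gender": "m", "number": "sg", "position": 5},   # "John"
              {"gender": "f", "number": "sg", "position": 6}]   # "Mary"
print(resolve(him, candidates))  # picks the gender-compatible candidate
```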
Specialized models: Denis and Baldridge argued that different referring expression
types, namely pronouns, definite noun phrases, and demonstrative noun phrases, were used
differently (Denis and Baldridge, 2008). Thus, they trained different models for each type
of referring expression, which proved to be more accurate for coreference resolution.
2.2 Reference Resolution in Situated Dialogue
Reference resolution in situated language shares similarities with coreference
resolution. Both benefit from semantic interpretation of referring expressions and are
usually formulated as classification problems. However, coreference resolution identifies
a coreference relation between referring expressions within a discourse, whereas reference
resolution in situated language identifies referents of referring expressions in their situated
environment. For example, in Figure 2-2, referring expressions such as “he”, “his”, and “Clinton” that appear later in a piece of text all refer to the referring expression “Bill
Clinton”, which appeared earlier in the same text. In a situated dialogue, as shown in
Figure 1-1, both referring expressions “my array” and “the array” refer to arra, which is
an array that the student had just created.
Figure 2-2. Coreference relation example diagram.
The state of the situated environment also plays an essential role in solving this
problem. This section summarizes the approaches used in existing work on reference
resolution in situated language.
Similar to coreference resolution, reference resolution is usually represented as a
classification problem. Given a referring expression re and a candidate referent e, a
classification function f(re, e) is used to predict the probability that e is re’s referent in
the current context, which includes linguistic context and world state. Each candidate
referent e is an entity in the situated environment, such as “a blue mug on the table”.
Features for Reference Resolution. In previous work, there are three primary types of features: syntactic features, semantic features, and salience features. Unlike coreference resolution, fewer syntactic features are involved in reference resolution in situated language. Coreference resolution searches for relations between referring expressions, in which the syntactic relationship between these referring expressions plays an important role. For reference resolution in situated dialogue, the referents are in the
situated environment, not in the dialogue. The syntactic types of referring expressions,
such as demonstrative pronouns and definite pronouns, are the most commonly used
syntactic features (Chai et al., 2004; Iida et al., 2010). Demonstrative pronouns are
pronouns pointing to specific things, such as “this” and “that”. Definite pronouns, such
as “him” and “it”, are pronouns referring to specific things, which are different from
indefinite pronouns, such as “someone” and “anything”.
Semantic features: As discussed above, situated environments, including objects in
the environment, are usually represented in situated language understanding tasks as
symbols. One of the most important sources of information for identifying the referents of
a referring expression is the semantic compatibility between them. Chai et al. considered
semantic types while creating graphs that represented the relationships between entities
(Chai et al., 2004). Similar to coreference resolution, attributes of entities were also used
for reference resolution in situated language, such as the shape and size of entities (Iida
et al., 2010, 2011).
Salience features: Salience features capture how noticeable and important an entity
is at a given moment. Salience features contain information about what makes a specific
entity more prominent, such as mentioning an entity in recent discourse history, moving or
operating on an entity in recent action history, etc.
Chai et al. aligned deictic gestures, pointing and circling objects in the scene, with
referring expressions found within utterances using the temporal co-occurrence between
them (Chai et al., 2004). Iida et al. studied reference resolution in situated dialogues for
a collaborative game (Iida et al., 2010, 2011). They used dialogue history and operating
history as features to exploit the salience of entities. These features were coded by time
intervals, such as “whether object o_i was operated in the past 10 seconds.” Eye gaze
features have also been used as salience features in some research to improve the accuracy
of reference resolution (Iida et al., 2011; Kennington and Schlangen, 2015).
Different from the semantic features used in previous work, we propose a CRF-based
semantic labeling approach. This approach automatically labels attributes of objects in
referring expressions.
Approaches. Most existing work formulated reference resolution as a supervised
classification problem. Iida et al. used output from SVM classifiers as measurements for
compatibility between a referring expression and the candidate referents (Iida et al., 2010,
2011). They also trained specialized models, a pronoun model and a non-pronoun model,
for different types of referring expressions. Funakoshi et al. presented a Bayesian network
to model the generative process from referent to referring expressions (Funakoshi et al.,
2012). The structure of the Bayesian network is shown in Figure 2-3.
Figure 2-3. Bayesian network for reference resolution.
In this Bayesian network, W, C, X, and D represent words, concepts (attributes), referents,
and a referent domain (a set of referents), respectively. This model also shows how to
resolve a reference to a set of referents.
Most previous work employed semantic features, which in some cases were extracted
using a manually defined lexicon (Chai et al., 2004; Liu et al., 2012) and in some other
cases learned automatically (Matuszek et al., 2014; Schlangen et al., 2016).
Weakly supervised approaches: Some work attempted to build reference resolution
models with less supervision. These approaches need fewer manual annotations, especially
for lexical semantics, when compared to fully supervised approaches. Supervised
approaches usually use a lexicon to label the semantics of referring expressions (Iida
et al., 2010). Thus, the training data for fully supervised approaches contain < re, e >
pairs and lexical semantics of referring expressions. Weakly supervised approaches do
not need lexical semantics as input; instead, their inputs are just the < re, e > pairs.
Weakly supervised approaches learn the alignments between natural language tokens in
re and attributes of e automatically, using the co-occurrences of re and e in training data.
In previous work (Kennington and Schlangen, 2015; Schlangen et al., 2016), the semantics
of natural language tokens were learned using a word-as-classifier approach. The input of
this approach was a set of <re, e> pairs. Each referent e in the dataset was a physical
object in a scene. The goal of this word-as-classifier approach was to learn the alignment
between natural language tokens in re and visual features of e. For each natural language
token w, a logistic regression classifier was learned given all of the co-occurrence of e and
w in training data. Object e was represented as an n-dimensional vector of visual features.
Classifiers were trained for each token w in the training data. When given a new referring
expression re = <w_0, w_1, ...> and a scene with a set of objects e_i, the classifiers for the tokens in this re were applied to each object e_i in the scene to find the best match in terms of compatibility between re and e_i. This process is illustrated in Figure 2-4. In this figure,
x_i is the feature vector of the i-th object in the scene. There is an output, δ(w^T x_i + b), for each object in the scene. The top level represents normalization over all of the outputs from the logistic classifier. With this word-as-classifier approach, the alignments between natural language tokens and visual features of objects were learned automatically without
explicit manual annotation.
Figure 2-4. Identifying the most likely referent using the word-as-classifier approach.
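A minimal sketch of the word-as-classifier idea follows; the visual feature vectors and training pairs are toy data, not the features used in the cited work.

```python
# Minimal sketch of the word-as-classifier approach (Kennington and
# Schlangen, 2015); the visual features and training pairs are toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each object is an n-dimensional visual feature vector, e.g. [R, G, B, size].
train_pairs = [
    ("red", np.array([0.9, 0.1, 0.1, 0.5]), 1),
    ("red", np.array([0.1, 0.9, 0.2, 0.4]), 0),
    ("big", np.array([0.5, 0.5, 0.5, 0.9]), 1),
    ("big", np.array([0.4, 0.6, 0.3, 0.2]), 0),
]

# Train one binary classifier per word from its co-occurring objects.
word_clf = {}
for word in {w for w, _, _ in train_pairs}:
    X = [x for w, x, _ in train_pairs if w == word]
    y = [label for w, _, label in train_pairs if w == word]
    word_clf[word] = LogisticRegression().fit(X, y)

def resolve(expression, scene):
    """Score each object by the product of its word-classifier outputs,
    then normalize over the scene (the top level of Figure 2-4)."""
    scores = np.ones(len(scene))
    for w in expression.split():
        if w in word_clf:
            scores *= word_clf[w].predict_proba(scene)[:, 1]
    return scores / scores.sum()

scene = np.array([[0.9, 0.1, 0.1, 0.8], [0.1, 0.8, 0.2, 0.3]])
print(resolve("the big red one", scene))  # higher probability for object 0
```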
2.3 Summary
This chapter summarizes previous approaches on reference resolution in situated
language. According to the literature review, we found that most previous work performed
reference resolution in a limited setting, either a specific setting containing a fixed set of
objects to evaluate their approach (Kennington and Schlangen, 2015), or in a domain with
a very limited number of objects (Iida et al., 2010). None of these approaches investigates
real-time reference resolution in a situated dialogue system. Different from previous work,
this dissertation reports a real-time reference resolution approach. In addition, we present
an implementation of a tutorial dialogue system for Java programming to evaluate it in a
real-time setting.
CHAPTER 3
CORPUS
This dissertation investigates the reference resolution problem in a tutorial dialogue
system. Given the data-driven nature of the reference resolution and dialogue understanding
techniques used in this research, we employ a corpus of tutorial dialogues from a previous study.
3.1 Data Collection
The corpus was collected within a tutorial dialogue study in which human tutors and
students interacted through a tutorial dialogue interface, Ripple, that supported remote
textual communication (Boyer et al., 2011). The tutorial dialogue interface (Figure 3-1)
consists of two windows that display interactive components: the students’ Java code,
the compilation or execution output associated with the code, and the textual dialogue
messages between the student and tutor. All of the information in these two windows
was synchronized between the student’s screen and the tutor’s screen in real time. The entire
corpus contains 45 Java programming tutoring sessions from student-tutor pairs, with a
total of 4857 utterances, an average of 108 utterances per session. Each of these sessions
lasted approximately one hour. The problem students solved during this tutorial dialogue
involved creating, traversing, and modifying parallel arrays, a challenging task since the
students were novices who were enrolled in an introductory computer programming class.
The dialogues within this domain are characterized by situated features that pertain
to the programming task. A portion of user utterances refer to general Java knowledge,
and in these cases a semantic interpretation can be accomplished by mapping to a
domain-specific ontology (Dzikovska et al., 2007). In contrast, many utterances refer
to concrete entities within the dynamically changing, user-created programming artifact.
Identifying these entities correctly is crucial for generating specific tutorial dialogue moves.
Besides the tutorial dialogue, we also used publicly available corpora for POS
tagging. We performed POS tagging in order to identify referring expressions from user
Figure 3-1. The interface of Ripple, a tutorial dialogue system for Java programming. It includes two windows: a window (on the left) to display the student’s Java code and a window (on the right) for textual messages between student and tutor.
utterances. Our target domain is online synchronous textual task-oriented dialogue
about Java programming. To train a domain-specific POS tagger, we leveraged two
different labeled corpora from source domains. First, we used the CoNLL2000 corpus for
phrase chunking (Tjong and Sang, 2000), which is a labeled Wall Street Journal corpus
with 10,948 sentences. We also used the NPS chat corpus (Forsyth and Martell,
2007), a set of annotated online conversational texts with 10,567 utterances. The target
corpus is a set of textual Java programming tutorial dialogues (Li and Boyer, 2015) that
contains 4,857 utterances (51,721 tokens) in total. The Java programming corpus is
task-oriented, containing not only utterances but also the accompanying Java program
that the interlocutors were creating and discussing. As described below, we utilized
a subset of these Java programs to extract noun phrases to generate the new labeled
training corpus. We also compared this approach to using Java snippets from The Java
Tutorial website to test the benefit of using unrelated Java code.¹
3.2 Annotation
All of the utterances in the 45 tutorial sessions were manually annotated for the
referring expressions that have referents in the parallel Java program. For each referring
expression, we labeled segmentation and semantic labels for each segment, so that each
of these semantic segments represents one attribute in the Java programming domain.
These labeled referring expressions will be used to train statistical models to automatically
annotate referring expressions to provide semantic information for reference resolution.
Noun phrases from the tutorial dialogues were first manually extracted and
annotated. There were 364 grounded noun phrases extracted manually from six tutorial
dialogue sessions used in the current work. Each of these extracted noun phrases has one
or multiple corresponding entities in the programming artifact. Since each word in a noun
phrase is linked to an element in the description vector, the indices in this vector were
used as the label for each word. Annotation of all 346 noun phrases was performed by
one annotator, and 20% of the noun phrases (70 noun phrases) were doubly annotated by
an independent second annotator. The percent agreement was 85.3% and the Kappa was
0.765.
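For context, the kappa reported here is presumably Cohen's kappa, which discounts the observed agreement p_o by the agreement p_e expected by chance:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

With the reported p_o = 0.853 and κ = 0.765, the implied chance agreement is p_e ≈ 0.37.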
We also annotated the semantic labels for each referring expression. A noun phrase is
defined as a phrase which has a noun (or indefinite pronoun) as its head word, or which
performs the same grammatical function as such a phrase (Crystal, 1997). The syntactic
structure of a noun phrase consists of dependents which could include determiners,
adjectives, prepositional phrases, or even a clause. For example, the noun phrase “a 2
dimensional array” occurs within the Java programming corpus. Its head is “array” and
its dependents are “a” as the determiner and “2 dimensional” as an adjective phrase.
¹ https://docs.oracle.com/javase/tutorial/
30
Each of these semantic segments involves an attribute of its real referent in the situated
environment (the parallel Java program in this case). We manually annotated these
semantic segments in referring expressions. The semantic tags we used are listed in Table
3-1.
Table 3-1. Semantic labels of referring expressions.
Attributes         Meaning (in Java programming)          Example
CATEGORY           Category of an entity                  Method, Variable, etc.
NAME               Variable name; often user-created      extractDigit
VAR TYPE           Type of variable                       int, String, etc.
NUMBER             Number of entities                     2
IN CLASS           The class that contains this entity    postalFrame
IN METHOD          The method that contains this entity   actionPerformed
DIR PARENT         Direct parent entity                   For Statement, Method
LINE NUMBER        Line number                            67
SUPER CLASS        Superclass of this entity              JFrame
MODIFIER           Access modifier                        public, private, etc.
ARRAY TYPE         Type of Array                          int, char, etc.
ARRAY DIMENSION    Dimension of array                     2, 1
OBJ CLASS          The class an object instantiates       PostalBarCode
RETURN TYPE        Return type                            String, int, etc.
OTHER              Other attributes                       the, extra, etc.
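As a small worked example of this scheme (my own illustrative segmentation, not a gold annotation from the corpus), the referring expression “the 2 dimensional array” discussed earlier could be segmented and labeled as:

```python
# Illustrative segmentation and labeling with the Table 3-1 tags;
# not a gold annotation from the corpus.
annotation = [
    ("the",           "OTHER"),
    ("2 dimensional", "ARRAY DIMENSION"),
    ("array",         "CATEGORY"),
]
```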
CHAPTER 4
ONLINE REFERRING EXPRESSION EXTRACTION
One of the essential steps to implement reference resolution in a tutorial dialogue
system is to identify referring expressions, which are noun phrases, in user utterances in
real time. This is a challenging task in a tutorial dialogue system for Java programming.
Language used in such a dialogue is usually informal. Utterances may contain many
domain-specific components, such as Java program segments. To accurately identify
noun phrases in these utterances, we need an accurate part-of-speech (POS) tagger. POS
tagging is a very important step for noun phrase chunking, which is the approach used to
tag noun phrases in a given sentence. Since referring expressions are noun phrases in an
utterance, we need to first identify all of the noun phrases in this utterance. Not all noun
phrases have referents in the situated environment. We are only interested in noun phrases
that refer to objects in the environment, in this case the Java code. Consequently, we need
a classification step to identify the referring expressions that are of interest to us.
This chapter includes two sections. Section one reports on an unsupervised approach
I developed for part-of-speech tagging in situated language. Section two reports on noun phrase chunking for utterances in tutorial dialogue. To date, I have developed and evaluated these techniques on corpora. However, as will be described in Chapter 7, I deploy these
approaches within a real-time tutorial dialogue system. The process of referring expression
extraction is shown in Figure 4-1.
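To make the POS-tags-to-noun-phrases step concrete, here is a minimal chunking illustration using NLTK's regular-expression chunker. This is a simplified stand-in for the CRF-based chunker described later, and the grammar pattern is my own illustrative choice.

```python
# Minimal NP chunking sketch over a POS-tagged utterance; a simplified
# stand-in for the CRF-based chunker (pattern is illustrative).
import nltk

tagged = [("but", "CC"), ("why", "WRB"), ("do", "VBP"), ("that", "DT"),
          ("when", "WRB"), ("I", "PRP"), ("could", "MD"), ("just", "RB"),
          ("use", "VB"), ("the", "DT"), ("string", "NN"), ("zip", "NN"),
          ("from", "IN"), ("the", "DT"), ("actionPerformed", "NN"),
          ("method", "NN")]

# NP = optional determiner, any adjectives, one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)
for subtree in tree.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# the string zip
# the actionPerformed method
```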
4.1 Part-of-speech Tagging for Domain-specific Language
In this section, I report a novel but simple domain-adaptation approach that I
developed to improve part-of-speech tagging in task-oriented dialogue. This approach
automatically generates an annotated domain-specific training corpus without any manual
annotation. In a corpus of textual dialogue for Java programming, experiments showed a
large improvement over the Stanford tagger. Compared to a tagger trained on the same
source data (which includes dialogue) but with no domain adaptation, overall accuracy
improved from 87.14% to 92.76%. For nouns, which are the most essential word class for referring expression identification, the new approach results in a dramatic improvement from an F1-score of 0.701 to 0.903.

[Figure 4-1 here. The figure traces the utterance “but why do that when I could just use the string zip from the actionPerformed method” through three steps: POS tagging (yielding the tag sequence CC WRB VBP DT WRB PRP MD RB VB DT NN NN IN DT NN NN), noun phrase chunking, and classification.]

Figure 4-1. Steps for referring expression extraction.
Accurate part of speech (POS) tagging is essential for many natural language
processing tasks, including natural language understanding in dialogue systems. Most
POS taggers are trained on large newswire corpora that support good performance
on open-domain language. However, these taggers encounter performance degradation
when applied to domain-specific language (Jiang and Zhai, 2007), which is often used in
task-oriented dialogue. This degradation is due partly to unknown tokens, but also due
to how known tokens are used. For example, in a Java programming tutorial dialogue, we
see utterances such as, “what I might could do is write if statements to see what range
sum%10 is in,” or, “... so String a = new String(zipCode); would work.” Dialogue systems
must be able to parse this kind of user utterance to react properly. There is much room
for improvement in domain-specific POS tagging: on the Java-programming dialogues
corpus used in this work, the Stanford tagger achieved 85.57% accuracy, compared to its
97.32% accuracy on the type of language on which it was trained (Manning, 2011).
Previous work on domain adaptation for POS tagging has included adding annotated
target domain data (Jiang and Zhai, 2007; Daume, 2009) and using dictionaries to mine
patterns from domain languages (Hovy et al., 2015; Li et al., 2012). We present a different
perspective on POS tagging which does not require any manual labeling. We argue that
generating a grammatical sentence in a new domain is easier than parsing a given sentence
from the same domain, assuming that we can easily extract some domain language from
other sources. The domain language is not annotated per se, but because of the context
in which it occurs, its POS tag can be inferred. We then generate a new set of sentences
for our target domain-specific language with POS tags known, and we build a tagger using
the generated corpus as training data.
The approach was tested on 5 sessions of Java tutoring data collected using Ripple
(mentioned in the previous chapter). The other 40 sessions were used to generate training
data. This will be discussed in detail later in this chapter. Our simple yet effective
method improves upon the Stanford tagger’s performance on domain-specific language
for Java programming, achieving 92.76% accuracy compared to Stanford’s 85.57%, and
we do so without manually tagging any new domain-specific language. The new approach
achieved a recall of 91.9% for nouns (NN) (which account for 17% of all the tokens)
compared with 58.2% from a baseline tagger trained on the same source corpus without
domain adaptation and 71.6% by the Stanford tagger. The accuracy for some other POS
tags, such as adjectives (JJ) and past tense verbs (VBD) also improved significantly with
the reported approach, as did overall precision and recall for all of the POS tags.
4.1.1 Approach
The reported approach is based on the observation that open-domain POS tagging
errors in domain-specific language often occur in noun phrases. For example, “if
statement” is a noun phrase in the domain of Java programming, but taggers trained
on newswire recognize “if” as a subordinate conjunction instead of a noun. They also
cannot recognize examples such as the previously mentioned chunk of code “String a =
new String(zipCode);” as noun phrases. It would be challenging to induce a grammar
from an unlabeled corpus that contains a large proportion of tokens serving a new
grammatical role. Moreover, it is difficult to tag these phrases using preprocessing, since
the code-like phrases used in natural language tend to be informal, following neither the
syntactic rules of the programming language nor those of the natural language in which they are
embedded. Our approach addresses this problem by generating grammatical (though not
semantically meaningful) sentences by substituting domain-specific noun phrases in place
of noun phrases in previously annotated source language.
To create a POS tagger for the target language, we used an annotated source
corpus (CoNLL2000 (Tjong and Sang, 2000)) and a set of domain-specific noun phrases
generated from a corpus of Java programs. We leverage the many similarities between this
domain-specific language and more open-domain language such as newswire: for example,
most other parts of domain-specific sentences, such as “what I might could do is write...”
and “so ... would work” still follow English grammar. Based on this simple idea, we
generate a corpus for the target domain, which is automatically annotated in the process
of generation. The approach substitutes domain-specific chunks into labeled sentences
from the source corpus by replacing part of an existing noun phrase to generate a target
training corpus. Finally, a POS tagger is trained on this corpus to perform POS tagging
for the target domain.
Domain-specific Noun Phrase Generation. To generate a set of labeled
sentences as training data for POS tagging, the reported approach requires that we
first generate a set of domain-specific noun phrases. For the domain of Java programming,
we extracted noun phrases from source code that had been created during dialogues from
our original in-domain corpus. (Later in this section we refer to those dialogues as the
extraction set. These dialogues were not the same ones used to test the POS tagger.)
We began by tokenizing each line of code from the Java programs. Then, we
extracted unigrams, bigrams, and trigrams from the tokenized Java code and treated
these as domain-specific noun phrases. Each token was tagged as a noun (except that
digits were tagged as numbers). The result is a set of domain-specific phrases with known
POS tags for each token.
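A minimal sketch of this extraction step, written here in Python (the tokenizer pattern and the use of the Penn Treebank tags NN and CD are illustrative assumptions, not the exact implementation):

import re

def extract_domain_phrases(java_lines, max_n=3):
    # Extract uni-, bi-, and trigrams from tokenized Java code, tagging
    # each token as a noun (NN) except digits, which become numbers (CD).
    phrases = []
    for line in java_lines:
        # Crude code tokenizer: identifiers, integer literals, punctuation.
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                tags = ["CD" if t.isdigit() else "NN" for t in gram]
                phrases.append(list(zip(gram, tags)))
    return phrases

# extract_domain_phrases(["String a = new String(zipCode);"]) yields
# phrases such as [('String', 'NN'), ('a', 'NN'), ('=', 'NN')].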
Labeled Target Data Generation. Given a grammatical sentence s_source, which
is a sentence from a source language, if s_source contains a noun n_source, we can create
another grammatical sentence s_target by replacing n_source with a domain-specific noun,
n_target. Recall that a noun phrase is “a phrase which has a noun (or indefinite pronoun)
as its head word, or which performs the same grammatical function as such a phrase”
(Crystal, 1997). For a given sentence from the source corpus that has been tagged with
POS labels (such as CoNLL2000), we first check whether it contains a noun phrase. If so,
we replace the head of a noun phrase in s_source with a domain-specific noun phrase. An
example is shown in Figure 4-2, in which the determiner and adjective modifier of the
noun phrase are not replaced. The generated s_target does not make sense semantically,
but it is grammatical, and it is labeled with POS tags. We generate a sentence s_target for
every domain-specific noun phrase generated by the technique described in the previous
subsection. In this way, we create an annotated training set for the target domain.
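The substitution itself can be sketched as follows, assuming CoNLL2000-style (word, POS, chunk) triples; the heuristic of taking the last token of an NP chunk as its head is an assumption of this sketch:

def substitute_np_head(source_tokens, domain_np):
    # source_tokens: (word, pos, chunk) triples from the source corpus,
    #   e.g. [("Confidence", "NN", "B-NP"), ("in", "IN", "O"), ...]
    # domain_np: (word, pos) pairs from the domain NP generation step.
    # Returns a POS-labeled target sentence with the first NP head
    # replaced by the domain-specific noun phrase.
    out, done = [], False
    for i, (word, pos, chunk) in enumerate(source_tokens):
        last_of_np = (chunk in ("B-NP", "I-NP") and
                      (i + 1 == len(source_tokens) or
                       source_tokens[i + 1][2] != "I-NP"))
        if last_of_np and not done:
            out.extend(domain_np)  # splice in the domain noun phrase
            done = True
        else:
            out.append((word, pos))
    return out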
Training POS Taggers. We trained conditional random field (CRF) POS taggers
on the source corpus and on the generated target-domain training corpus, respectively
(Lafferty et al., 2001). We then tested the models on the target domain testing corpus, which
consists of original dialogues (not generated dialogues).
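Training itself is routine given labeled sentences; a sketch using the sklearn-crfsuite package (one plausible toolkit, not necessarily the one used in the original experiments; this feature set is deliberately minimal):

import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i][0]
    return {"word": w.lower(), "suffix3": w[-3:], "is_digit": w.isdigit(),
            "is_upper": w.isupper(),
            "prev": sent[i - 1][0].lower() if i > 0 else "<s>",
            "next": sent[i + 1][0].lower() if i < len(sent) - 1 else "</s>"}

def train_tagger(sentences):
    # sentences: lists of (word, pos) pairs, e.g. the generated corpus.
    X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
    y = [[pos for _, pos in s] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, y)
    return crf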
4.1.2 Experiments and Results
First, the target corpus was split into two sets: the extraction set with 40 dialogue
sessions, and the testing set with 5 dialogue sessions. (Each dialogue session represents
approximately one hour of textual dialogue and collaborative construction of Java code.)
The testing set contains 687 sentences and 6581 tokens. We trained POS taggers using
source data and the automatically generated target data, which serves as the training
data. Both of these taggers were tested on the original (not generated) dialogues from
Figure 4-2. Example of target sentence generation. In the source sentence “Confidence in
the pond is widely expected to take another sharp dive”, the head of the noun phrase
“another sharp dive” is replaced by the domain-specific noun phrase “String a = new
String ( ...”, whose tokens are labeled NN; the determiner and adjective modifier keep
their original POS tags.
the testing set. We also compared our trained POS taggers with results from the latest
Stanford tagger (v3.7.0) (Toutanova et al., 2003).
First, we trained the Baseline POS tagger on all the labeled sentences from the
CoNLL2000 corpus and the NPS chat corpus. We expected this tagger not to perform well
because although it included dialogues, it did not include any domain-specific language for
the target domain.
Next, using our approach, we trained a tagger for the target domain by leveraging
the generated sentences. For each extracted domain-specific noun phrase, we randomly
selected a sentence from CoNLL2000¹ to plug in the domain-specific noun phrase to
generate a labeled target sentence. We generated 96,011 target sentences in this step. A
POS tagger was then trained using these generated target sentences along with all of the
sentences from the NPS chat corpus. The Baseline CRF tagger, the Stanford tagger, and
the Li Approach tagger were all tested on dialogues in the testing set.
¹ We chose CoNLL2000 because it has IOB tags, which makes the substitution simple.
Table 4-1. Results of baseline tagger (CRF trained on source-domain corpus), Stanford
tagger, and our approach (CRF trained on generated target-domain corpus).

                        total  NN     IN     RB     VBZ    JJ     NNS    VBG    VBD
Num.                    6571   1129   511    426    217    205    110    99     56
Baseline     prec.      0.906  0.882  0.926  0.985  0.980  0.680  0.724  0.790  0.711
             recall     0.871  0.582  0.979  0.897  0.889  0.902  0.955  0.990  0.964
             F1         0.879  0.701  0.952  0.939  0.937  0.776  0.824  0.879  0.818
Stanford     prec.      0.900  0.932  0.817  0.697  0.968  0.668  0.794  0.980  0.786
             recall     0.856  0.716  0.941  0.887  0.977  0.844  0.982  0.970  0.786
             F1         0.859  0.810  0.875  0.781  0.972  0.746  0.878  0.975  0.786
Li approach  prec.      0.930  0.887  0.926  0.980  0.981  0.854  0.911  0.933  0.730
(parallel    recall     0.928  0.919  0.982  0.918  0.954  0.859  0.836  0.980  0.964
code)        F1         0.927  0.903  0.954  0.948  0.967  0.856  0.872  0.956  0.831
Li approach  prec.      0.920  0.885  0.928  0.967  0.985  0.744  0.872  0.952  0.743
(general     recall     0.914  0.869  0.980  0.890  0.912  0.878  0.927  0.990  0.982
code)        F1         0.915  0.877  0.953  0.927  0.947  0.805  0.899  0.970  0.846
The accuracies of the Baseline Tagger, the Stanford Tagger, and the Li Tagger were
87.14%, 85.57%, and 92.76%, respectively. The Baseline Tagger performed better than the Stanford
Tagger, since its training set was partly conversational data (NPS chat corpus). Table 4-1
illustrates the combined precision, recall, and F1-score for the testing set and the same
measurements for some of the most frequently occurring POS tags. The overall precision,
recall, and F1-score were all improved by our approach. The F1-score increased from 0.879
(Baseline) to 0.927 (Li Approach), and both are higher than the Stanford tagger (0.859).
The open domain tagger trained with the NPS corpus achieved 0.834 accuracy.
For nouns (NN) in particular, which constitute the largest proportion of tokens
(17%), our approach performed particularly well. Nouns in domain-specific
language are hard to identify: the Baseline tagger achieved recall on NN of only 0.582, and
the Stanford Tagger performed worse on NN than on any other frequently occurring tag
in the set, at 0.716. Our approach achieved recall on NN of 0.919. Besides NN tokens, our
approach also achieved a much higher performance on adjectives (JJ), with an F1-score of
0.856 compared to 0.776 for Baseline and 0.746 for Stanford.
The Java code we used to generate the domain-specific training corpus was parallel
with the dialogues; such parallel code is not always available. To examine whether this
approach could use unrelated Java code, we collected 1968 lines of Java code from Oracle's
The Java™ Tutorials. With the same approach, we generated domain-specific training data and tested
on the same test set. This model achieved 0.913 accuracy, slightly lower than the model
trained with parallel code, but still much higher than models without domain adaptation.
4.2 Noun Phrase Chunking in Tutorial Dialogue
Noun phrase chunking is a type of syntactic analysis which labels all noun phrases
in a sentence (Tjong and Sang, 2000). With the POS tags generated using the approach
presented above, we performed noun phrase chunking of tutorial dialogue utterances
using a linear chain conditional random field (CRF) (Lafferty et al., 2001). In a tutorial
dialogue system, this process will find all noun phrases in user utterances. These noun
phrases are potentially referring expressions which refer to some objects in the shared
programming environment. We followed the approach in prior work to perform noun
phrase chunking (Sha and Pereira, 2003). This approach is tested on an existing corpus
and will be deployed in the dialogue system in Chapter 7. We use a BIO tagging schema,
which annotates each word in an input sentence. Each word is assigned with a tag: B
indicates “beginning of a phrase chunk”, I indicates “in a phrase chunk”, and O means
“out of a phrase chunk”. For example, in the annotated sentence “but/O why/O do/B-VP
that/B-NP when/O I/B-NP could/O just/O use/B-VP the/B-NP string/I-NP zip/I-NP
from/O the/B-NP actionPerformed/B-NP method/B-NP”, B-NP indicates the beginning
of a noun phrase, I-NP means the corresponding word is inside a noun phrase, O means
the corresponding word is not in any phrase chunk. So, “the/B-NP string/I-NP zip/I-NP”
forms a complete noun phrase according to the annotation. Given this tagging schema, we
trained a conditional random field tagger to tag all of the noun phrases for a given input
sentence.
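Decoding the noun phrases out of a predicted BIO sequence is then mechanical; a minimal sketch:

def extract_noun_phrases(words, tags):
    # Collect the maximal spans labeled B-NP / I-NP from a BIO tagging.
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-NP":
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-NP" and current:
            current.append(word)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

# For the annotated example above, the tags for "the string zip from the
# actionPerformed method" decode to ["the string zip", "the",
# "actionPerformed", "method"].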
A linear chain conditional random field (CRF) is a discriminative graphical model for
sequential data tagging. In this noun phrase chunking application, we used it to assign
BIO tags to each token in an input word sequence W = w_0, w_1, ..., w_n. Given a word
sequence W, the probability of a specific tag sequence A = a_0, a_1, ..., a_n is calculated as:

p(A|W) = (1/Z(W)) exp( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(i, w, a_i, a_{i-1}) )

The tag sequence with the highest probability is selected as the optimal annotation:

A* = argmax_{A_i} p(A_i|W)
For training data, we used data from the shared task for CoNLL-2000 (Tjong and
Sang, 2000). This corpus contains part of the Wall Street Journal corpus with BIO
annotations of phrases, comprising 211,727 tokens in total.
This CRF-based approach employed lexical features and the POS tags of the words in a
sentence as features. Brill's transformation-based learning approach was one of the most
influential POS tagging approaches (Brill, 1995), and some of our features are similar to
the rules used in Brill's work. A complete list of features can be found in Table 4-3.
Table 4-2. Noun phrase chunking result.
          tag              precision  recall  F1    # of instances
Baseline  B-NP             0.75       0.91    0.82  2352
          I-NP             0.87       0.75    0.80  1913
          B-NP, I-NP comb  0.80       0.84    0.82  4265
Proposed  B-NP             0.79       0.91    0.85  2352
          I-NP             0.84       0.94    0.89  1913
          B-NP, I-NP comb  0.81       0.92    0.86  4265
The noun phrase chunking results are shown in Table 4-2. The domain adaptation
approach increased the F1-score of noun phrase chunking from 0.82 to 0.86. The new
approach improved the recall from 0.84 to 0.92.
Table 4-3. The features used for noun phrase chunking.
features
the word in lower case
the last three letters of the word
the last two letters of the word
if the word is in upper case
if the word is title case
if the word is a number
the word's POS tag
the last two letters of the word's POS tag
the previous word in lower case
if the previous word is in upper case
if the previous word is title case
if the previous word is a number
the previous word's POS tag
the following word in lower case
if the following word is in upper case
if the following word is title case
if the following word is a number
the following word's POS tag
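A sketch of how the features in Table 4-3 translate into a per-token feature dictionary for the CRF (the feature names here are illustrative):

def chunking_features(sent, i):
    # sent: list of (word, pos) pairs; returns the Table 4-3 features
    # for token i (current, previous, and following word).
    word, pos = sent[i]
    feats = {"word.lower": word.lower(), "word.suffix3": word[-3:],
             "word.suffix2": word[-2:], "word.isupper": word.isupper(),
             "word.istitle": word.istitle(), "word.isdigit": word.isdigit(),
             "pos": pos, "pos.suffix2": pos[-2:]}
    if i > 0:
        pw, pp = sent[i - 1]
        feats.update({"prev.lower": pw.lower(), "prev.isupper": pw.isupper(),
                      "prev.istitle": pw.istitle(),
                      "prev.isdigit": pw.isdigit(), "prev.pos": pp})
    if i < len(sent) - 1:
        nw, np = sent[i + 1]
        feats.update({"next.lower": nw.lower(), "next.isupper": nw.isupper(),
                      "next.istitle": nw.istitle(),
                      "next.isdigit": nw.isdigit(), "next.pos": np})
    return feats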
4.3 Discussion
Qualitative examination shows the ways in which the proposed approach improved
over prior approaches. The example sentence used in the introduction was tagged as:
“... so[IN, IN] String[NN, NNP] a[NN, DT] =[NN, JJ] new[NN, JJ] String[NN, NNP] ([NN, -LRB-]
zipCode[NN, NN] )[NN, -RRB-] ;[NN, :] would[MD, MD]”. In each bracketed pair, the first tag
is from the proposed approach and the second is from the baseline tagger.
The proposed approach also performed very well on detecting change of usage for
domain-specific tokens, such as “the if statement” and “the for loop.” The proposed
approach correctly tagged “if” and “for” in these cases as NN, while in phrases such as “if
I use...” and “...for this method...,” they were correctly tagged as IN. Neither the Baseline
nor the Stanford Tagger could do this. To illustrate, consider an excerpt sentence from the
test set: “thatDT, DT lineNN, NN youPRP, PRP justRB, RB typedVBD, VBD canMD, MD beVB, VB
putVBN, VBN inIN, IN theDT, DT (NN, -LRB- )NN, -RRB- ofIN, IN theDT, DT forNN, IN loopNN, NN”.
In earlier work on domain adaptation for POS tagging, researchers have used
semi-supervised approaches, which employ a small annotated corpus of the target
language and a large annotated source language corpus to train a POS tagger for
the target language (Jiang and Zhai, 2007; Daume, 2009; Finkel and Manning, 2009;
Garrette and Baldridge, 2013; Plank et al., 2014). There has also been some work
using unsupervised approaches to perform domain adaptation, such as by employing
structural correspondence learning (Blitzer, 2006) and word clusters learned from an
unlabeled target data set (Owoputi et al., 2013). Crowd-sourcing has also been leveraged
to implement domain adaptation for POS tagging (Hovy et al., 2015; Li et al., 2012). The
approach reported in this chapter generates labeled training data for the target language
automatically and thus dramatically simplifies the problem.
This chapter has reported a simple but effective domain adaptation approach for
POS tagging. Both quantitative and qualitative evaluation based on a corpus of informal
textual dialogues for Java programming demonstrated the effectiveness of the approach
compared to a Baseline approach and the Stanford tagger. The performance of the
reported approach was particularly evident on challenging noun phrases in the target
language. Experiments showed that even when using domain tokens unrelated to the
target testing corpus, the reported approach dramatically improved POS tagging on the
target language. This is an essential step toward accurate referring expression extraction.
CHAPTER 5SEMANTIC INTERPRETATION OF REFERRING EXPRESSIONS
This chapter presents a novel approach I created to perform semantic interpretation
of referring expressions within a situated environment. Recall that a situated dialogue is
embedded in an environment, where the dialogue usually focuses on a domain-specific task
within this environment. Referring expressions are noun phrases used to refer to entities
in the situated environment. In the context of tutorial dialogue for Java programming,
as shown in Figure 1-1 at the beginning of the introduction, noun phrases like “the
2 dimensional array”, and “the for loop” all refer to some entity in the parallel Java
program. These noun phrases are referring expressions in the situated dialogue for Java
programming.
The approach presented in this chapter performs joint segmentation and labeling of
the noun phrases to link them to attributes of entities within the environment. It is a new
way to provide semantic information for reference resolution in a situated environment.
Evaluation results on a corpus of tutorial dialogue for Java programming demonstrate that
a Conditional Random Field (CRF) model performs well, achieving an accuracy of 89.3%
for linking semantic segments to the correct entity attributes. This work is a step toward
enabling dialogue systems to perform accurate reference resolution.
Previous approaches for semantic interpretation include domain-specific grammars
(Lemon et al., 2001) and open-domain parsers together with a domain-specific lexicon
(Rose, 2000). However, existing techniques are not sufficient to support increasingly
complex task-oriented dialogues due to several challenges. For example, domain-specific
grammars become intractable when applied to more ill-formed domains, and open-domain
parsers may not perform well across domains (McClosky et al., 2010).
To address these challenges, this chapter presents a step toward reference resolution
in situated dialogues for complex problem-solving, in which the number of potential
entities (e.g. a Java variable or a piece of code) is infinite. The present work focuses
on the semantic interpretation of noun phrases, which tend to bear significant semantic
information for each utterance. Although noun phrases are typically small in their
number of tokens, their complexity and semantics vary in important ways. For example,
in the domain of computer programming, two similar noun phrases such as “the 2
dimensional array” and “the 3 dimensional array” refer to two different entities within
the problem-solving artifact. Inferring the semantic structure of the noun phrases is
necessary to differentiate these two references within a dialogue, to ground them in the
task, and to respond to them appropriately. Coreference resolution focuses on discovering
the coreference relationship between pairs of noun phrases in a piece of natural language
text (Culotta et al., 2007; Lappin and Leass, 1994), which is similar to the ultimate goal
of reference resolution in complex problem solving. However, unlike coreference
resolution, reference resolution links natural language expressions to entities in a
real-world environment. Compared with natural language expressions, real-world entities
contain richer information that can be utilized in the task of reference resolution. In
addition, the situated character of the dialogues generated in complex problem solving
introduces more uncertainty into the meaning of noun phrases used to refer to an entity
than is found in a self-contained natural language text; consider, for example, a user saying
“that variable” while highlighting a variable in Java code. Fully understanding “that
variable” requires additional contextual information from the environment in which this
noun phrase was generated.
The current approach leverages the structure of noun phrases, mapping their
segments to attributes of entities to which they should be semantically linked. In order to
overcome the limitation of needing to fully enumerate the entities in the environment, we
represent the entities as automatically extracted vectors of attributes. We then perform
joint segmentation and labeling of the noun phrases in user utterances to map them to
the entity vectors (used to describe entities within the environment). In this way, the
semantics of noun phrases could be grounded by linking segments of noun phrases to
attributes of entities in the environment. The results show that a Conditional Random
Field performs well for this task, achieving 89.3% accuracy. Moreover, even in the
absence of lexical features (using only dependency parse features and parts of speech), the
model achieves 71.3% accuracy, indicating that it may be tolerant to unseen words. The
flexibility of this approach is due in part to the fact that it does not rely on a syntactic
parser's ability to accurately segment within noun phrases, but rather includes parse
features as just one type of feature among several made available to the model. Finally, in
contrast to methods based on bag-of-words such as latent semantic analysis, the reported
approach models the structure of noun phrases to facilitate specific grounding within an
artifact.
5.1 Semantic Interpretation as Sequence Labeling
To interpret the dialogue utterances as described above, our approach focuses first
upon noun phrases, which contain rich semantic information. This section introduces the
approach, based on Conditional Random Fields, to jointly segment the noun phrases and
link those segments to entities within the domain.
5.1.1 Noun Phrases in Domain Language
A noun phrase is defined as “a phrase which has a noun (or indefinite pronoun)
as its head word, or which performs the same grammatical function as such a phrase”
(Crystal, 1997). The syntactic structure of a noun phrase consists of dependents which
could include determiners, adjectives, prepositional phrases, or even a clause. For example,
the noun phrase “a 2 dimensional array” occurs within the Java programming corpus. Its
head is “array” and its dependents are “a” as the determiner and “2 dimensional” as an
adjective phrase. In this simple case the syntactic boundaries also indicate semantic
segments, as these dependents indicate one or more attributes of the head. If this
relationship were always true, the semantic structure understanding task would be a
labeling task that only requires assigning a semantic tag to each syntactic segment of the
noun phrase. But this is not always true, in part because a syntactic parser trained on
an open-domain corpus will not necessarily perform well on domain language (McClosky
Figure 5-1. A parse of “the outer for loop” from the Stanford Parser, which attaches
“for” as a preposition (IN) heading a prepositional phrase over “loop” (NN), rather than
treating “for loop” as the head of the noun phrase.
et al., 2010). For example, in the noun phrase “the outer for loop,” which also occurs
in the Java programming corpus, the head of the noun phrase is “for loop,” but the
syntactic parse (generated by the Stanford parser) of this noun phrase understandably
(but incorrectly) identifies this head as part of a prepositional phrase (Figure 5-1).
To address this challenge, this chapter describes a joint segmentation and semantic
labeling approach that does not require accurate syntactic parsing within noun phrases.
In this approach the head and dependents of each noun phrase are each referred to as a
segment, with exactly one segment per dependent, and one or more words per segment.
Identifying these segments correctly is essential to correct assignment of semantic tags.
Pipeline methods for semantic segmentation rely on stable performance of an open
domain parser, but as described above, this assumption is not desirable for grounding
some domain language. We therefore utilize joint segmentation and labeling, and apply
a Conditional Random Field approach (Lafferty et al., 2001), a natural choice for the
sequential data segmentation and labeling problem.
5.1.2 Description Vector
The goal is to ground each noun phrase to an entity within the problem-solving
artifact, which constitutes the “world” in this domain. To do this, we will link each
semantic segment in a noun phrase to an attribute of an entity in the world. Because the
world can contain any of an infinite set of user-created entities, representation cannot rely
upon exhaustively enumerating the entities. To represent an entity in the domain, we
define a description vector V that specifies the attribute types for entities in the domain.
Then, an entity O in the domain is represented uniquely by an instance of V. The value
of each V_i indicates the value of the corresponding attribute of O, as illustrated in Table 3-1. This
definition of the description vector relies upon the structure of the domain by factorizing
the attributes of entities. With this representation, interpreting a noun phrase involves
linking each segment of the noun phrase to a cell in the description vector. Formally, we
represent a noun phrase as a series of segments:
NP = <s_1, s_2, ..., s_k>

where s_i is the i-th segment in this noun phrase. A noun phrase is also a sequence of
words:

NP = <w_1, w_2, ..., w_n>

where each w_j is the j-th word in the noun phrase. Therefore each segment is a series of
words:

s_i = <w_j, w_{j+1}, ..., w_{j+l-1}>

where l is the length of semantic segment i. Given a noun phrase, the segmentation
problem is thus maximizing the following conditional probability:

p(<s_1, s_2, ..., s_k> | <w_1, w_2, ..., w_n>)
Complementary to the segmentation problem is the semantic linking problem, which is to
link s_i to an attribute a_i, the label of the i-th attribute in the entity description
Figure 5-2. Segmentation and semantic linking of NP “a 2 dimensional array”: the words
w_1 ... w_4 receive word-level attribute labels (NUM, ARR_DIM, ARR_DIM, CATEG.),
and consecutive words with the same label are merged into segments s_1, s_2, s_3 linked
to attributes a_1, a_2, a_3 (NUM, ARR_DIM, CATEG.).
vector. That is, we wish to maximize the probability of the attribute label sequence a
given the segments of the noun phrase:
p(<a_1, a_2, ..., a_k> | <s_1, s_2, ..., s_k>)
Taking consecutive words with the same attribute label as the same semantic segment, the
noun phrase segmentation and semantic linking problem is then:

argmax_a { p(<a_1, a_2, ..., a_n> | <w_1, w_2, ..., w_n>) }

In the tag sequence <a_1, a_2, ..., a_n>, if a_i and a_{i+1} are the same, then w_i and w_{i+1} are
assigned to the same semantic segment with tag a_i. The process of segmentation and
semantic linking is illustrated in Figure 5-2.
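Decoding segments from a predicted word-level label sequence is then a simple merge of adjacent identical labels; a minimal sketch:

from itertools import groupby

def labels_to_segments(words, labels):
    # Merge consecutive words sharing an attribute label into segments;
    # returns (segment_words, attribute_label) pairs.
    segments, idx = [], 0
    for label, group in groupby(labels):
        n = len(list(group))
        segments.append((words[idx:idx + n], label))
        idx += n
    return segments

# labels_to_segments(["a", "2", "dimensional", "array"],
#                    ["NUM", "ARR_DIM", "ARR_DIM", "CATEG"])
# -> [(["a"], "NUM"), (["2", "dimensional"], "ARR_DIM"),
#     (["array"], "CATEG")]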
5.1.3 Joint Segmentation and Labeling
In order to perform this joint segmentation and labeling, we utilize a Conditional
Random Field (CRF), which is a classic approach for sequence segmentation and labeling
(Lafferty et al., 2001). Given the linear nature of our data, we employ a linear chain CRF.
Specifically, given a sequence of words w, the probability of a label sequence a is defined
as

p(a|w) = (1/Z(w)) exp( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(i, w, a_i, a_{i-1}) )

where f_j(i, w, a_i, a_{i-1}) is a feature function. The weights \lambda_j of this feature function are
learned within the training process. The normalization function Z(w) is the sum of the
weighted feature function over all possible label sequences:

Z(w) = \sum_a exp( \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(i, w, a_i, a_{i-1}) )

The optimal labeling a* is the one that maximizes the likelihood of the training set,
where K is the number of noun phrases in the corpus:

a* = argmax \sum_{i=1}^{K} log P(a^{(i)} | w^{(i)})
5.1.4 Features
Next, we introduce the features used to train the CRF. The feature function
f_j(i, w, a_i, a_{i-1}) was defined as a binary function, in which w is a feature value. We use
both lexical and syntactic features. In a trained CRF model, the value of f_j(i, w, a_i, a_{i-1})
is known given a combination of parameters (i, w, a_i, a_{i-1}). The features used in the
CRF model include words themselves, word lemmas, parts of speech, and dependency
relationships from the syntactic parse. The word itself, lemmatized words, and parts-of-speech
have all been shown useful within segmentation and labeling tasks, so they are made
available here (Xue and Palmer, 2004). Each of these features is represented as categorical
data. For example, a word is represented as its index in a list of all of the words that
appeared in the corpus.
The dependency structure of natural language has also been shown to be important in
semantic interpretation (Poon and Domingos, 2009). This chapter employs a dependency
feature vector extracted from dependency parses. The head word of each noun phrase is
the root of the dependency tree. Each dependent is a sub-tree directly under the head.
Figure 5-3. Dependency structure of “a 2 dimensional array”: the head is “array”, with
dependent 1 “2 dimensional” attached as amod and dependent 2 “a” attached as det.
We design the dependency feature as a sequence of dependency labels as follows. Given a
dependency tree, words in each semantic segment of the noun phrase are assigned a tag
according to the relationship between them and the head. The relationship between each
segment and head is defined by the dependency type in the dependency tree. For example,
the dependency tree of “a 2 dimensional array” is shown in Figure 5-3. The dependency
features are < det, amod, amod, root >. In this way, the dependency information from an
open-domain parser is encoded as a feature to the semantic labeling model.
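A sketch of computing this dependency feature, under the assumption that the parse arrives as one (head index, relation) pair per token, with the noun phrase head marked by a head index of -1:

def dependency_features(parse):
    # parse: (head_index, relation) per token; the NP head has
    # head_index == -1. Each token receives the relation of the
    # dependent subtree (directly under the head) that contains it,
    # so all words of one dependent share a label.
    labels = []
    for i, (head, rel) in enumerate(parse):
        if head == -1:
            labels.append("root")
            continue
        j = i
        while parse[parse[j][0]][0] != -1:  # climb to the head's child
            j = parse[j][0]
        labels.append(parse[j][1])
    return labels

# For "a 2 dimensional array" with "array" as root:
# dependency_features([(3, "det"), (2, "nummod"), (3, "amod"), (-1, "root")])
# -> ["det", "amod", "amod", "root"], matching Figure 5-3.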
5.2 Experiments and Results
The goal of the experiments is to determine how well the trained CRF can segment
noun phrases and link these segments to the correct attribute of entities in the world. This
section presents the experiments using CRFs trained and tested on the Java programming
tutorial dialogue corpus. As described below, the results were evaluated by comparing
with manually labeled data. Noun phrases from the tutorial dialogues were first manually
extracted and annotated as to their slots in the description vector described in Section
5.1.2. There were 346 grounded noun phrases extracted manually from the six tutorial
dialogue sessions used in the current work. Each of these extracted noun phrases has one
or multiple corresponding entities in the programming artifact. Since each word in a noun
phrase is linked to an element in the description vector, the indices in this vector were
used as the label for each word. Annotation of all 346 noun phrases was performed by
one annotator, and 20% of the noun phrases (70 noun phrases) were doubly annotated
by an independent second annotator. The percent agreement was 85.3% and the Kappa
was 0.765. To extract features, the lemmatization and syntactic parsing were performed
with the Stanford CoreNLP toolkit (Manning et al., 2014). Then, a CRF was trained to
predict the label for each word in a new noun phrase. The training was performed with
the crfChain toolbox (Schmidt and Swersky, 2008).
We use ten-fold cross-validation to evaluate the performance of the CRF in this
problem. Results with different feature combinations are shown in Table 5-1. Manually
labeled data were taken as ground truth for computing accuracy, which is defined as the
percentage of segments correctly labeled. Recall that consecutive words with the same
label in a noun phrase are treated as a segment. Therefore, if a segment s_CRF identified
by the CRF has the same boundary and the same label as a segment s_Human in the
noun phrase containing s_CRF, the segment s_CRF is counted as correct;
otherwise, s_CRF is counted as incorrect. The accuracy is then calculated as the
number of correct segments identified by the CRF divided by the number of segments
annotated manually. As can be seen in Table 5-1, all of the models perform substantially
better than a minimal chance baseline of 43%, which would result from taking each
word as a segment and assigning it the most frequent attribute label. The results
demonstrate important characteristics of the segmentation and labeling model. First,
unlike most previous semantic interpretation work, our semantic interpretation of noun
phrases does not rely on accurate syntactic parse within noun phrases. Rather, we use
a dependency parse from an open-domain parser as only one of several types of features
provided to the model. These dependency features improved the model in most feature
combinations (Table 5-1). The feature combination of words, lemmas, and dependency
parses achieved the best accuracy, 4.8 percentage points higher than the model that only used
word features. This difference is statistically significant (Wilcoxon rank-sum test; n=10;
p=0.02).
Table 5-1. Semantic labeling accuracy.
features                   accuracy
word                       84.5%
word + lemma               85.5%
word + Dep                 87.2%
lemma + Dep                89.1%
word + lemma + Dep         89.3%
word + lemma + POS         86.9%
word + lemma + POS + Dep   88.7%
POS + Dep                  71.3%
Notably, the combination of part-of-speech features and dependency parse features
still performed at 71.3% accuracy, indicating that to some extent, the method may be
tolerant to unseen words.
CHAPTER 6REFERENCE RESOLUTION FOR SITUATED DIALOGUE SYSTEM
Reference resolution in situated dialogues in a complex environment is often fraught
with high ambiguity. In Chapter 4, we presented our approach to extracting referring
expressions from user utterances in real time. Given the extracted referring expressions,
we need to identify their referents in the situated environment, which is the problem of
reference resolution. In this chapter, I report a novel approach that I developed to address
these challenges by combining the learned semantic structure of referring expressions
with dialogue history into a ranking-based model. In this chapter, I evaluate the new
technique on a corpus of human-human tutorial dialogues for computer programming.
The experimental results show a substantial performance improvement over two recent
state-of-the-art approaches. The reported approach makes a stride toward automated
dialogue in complex problem-solving environments, and will be used in the tutorial
dialogue system described in Chapter 7.
6.1 Reference Resolution in a Situated Environment
This section describes a new approach to reference resolution in situated dialogue. It
links each referring expression from the dialogue to its most likely referent object in the
environment. Our approach involves three main steps.
First, referring expressions from the situated dialogue are segmented and labeled
according to their semantic structure. Using a semantic segmentation and labeling
approach I have previously developed (Li and Boyer, 2015), a conditional random field
(CRF) is used for this joint segmentation and labeling task, and the values of the labeled
attributes are then extracted (Section 6.2). The result of this step is learned semantics,
which are attributes of objects expressed within each referring expression. Then, these
learned semantics are utilized within the novel approach reported in this chapter. As
Section 6.3 describes, dialogue and task history are used to filter the objects in the
environment to build a candidate list of referents, and then as Section 6.4 describes, a
ranking-based classification approach is used to select the best matching referent.
For situated dialogue we define Et as the state of the environment at time t. Et
consists of all objects present in the environment. Importantly, the objects in the
environment vary along with the dialogue: at each moment, new objects could be created
(|Et| > |Et−1|), and existing objects could be removed (|Et| < |Et−1|) as the user performs
task actions.
Et = {oi|oi is an object in the environment at time t}
We assume that all of the objects oi are observable in the environment. For example,
in situated dialogues about programming, we can find all of the objects and extract their
attributes using a source code parser. Then, reference resolution is defined as finding a
best-matching oi in Et for referring expression RE.
6.2 Referring Expression Semantic Interpretation
In situated dialogues, a referring expression may contain rich semantic information
about the referent, especially when the context of the situated dialogue is complex.
Approaches such as domain-specific lexicons are limited in their ability to address this
complexity, so we utilize a linear-chain CRF to parse the semantic structure of the
referring expression as presented in Chapter 5. This more automated approach can also
potentially avoid the manual labor required in creating and maintaining a lexicon.
In this approach, every object within the environment must be represented according
to its attributes. We treat the set of all possible attributes of objects as a vector, and
for each object o_i in the environment, we instantiate and populate an attribute vector
Att_Vec_i. For example, the attribute vector for a two-dimensional array in a computer
program could be [CATEGORY = 'array', DIMENSION = '2', LINE = '30', NAME =
'table', ...]. We ultimately represent E_t = {o_i} as the set of all attribute vectors
Att_Vec_i, and for a referring expression we aim to identify Att_Vec_j, the actual referent.
Since a referring expression describes its referents either implicitly or explicitly, the
attributes expressed in it should match the attributes of its referent. We segment referring
expressions and label the semantics of each segment using the CRF; the result is a
set of segments, each of which represents some attribute of its referent. This process is
illustrated in (Figure 6-1 (a)). After segmenting and labeling attributes in the referring
expressions, the attribute “values” are extracted from each semantic segment using regular
expressions (Figure 6-1 (b)), e.g., value “2” is extracted from “2 dimensional” to fill in
the “ARRAY DIM” element in an empty Att V ec. The result is an attribute vector that
represents the referring expression.
Figure 6-1. Semantic interpretation of referring expressions.
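A sketch of this value-extraction step (the attribute names and regular expressions are illustrative):

import re

# Hypothetical per-attribute extractors: numeric values for ARRAY_DIM
# and LINE_NUMBER, the surface string for NAME, and the (categorical)
# segment text for CATEGORY.
EXTRACTORS = {
    "ARRAY_DIM":   lambda seg: re.search(r"\d+", seg).group(),
    "LINE_NUMBER": lambda seg: re.search(r"\d+", seg).group(),
    "NAME":        lambda seg: seg,
    "CATEGORY":    lambda seg: seg.lower(),
}

def to_attribute_vector(segments):
    # segments: (text, attribute_label) pairs from the CRF, e.g.
    # [("2 dimensional", "ARRAY_DIM"), ("array", "CATEGORY")].
    att_vec = {}
    for text, label in segments:
        extract = EXTRACTORS.get(label)
        if extract:
            att_vec[label] = extract(text)
    return att_vec

# to_attribute_vector([("2 dimensional", "ARRAY_DIM"),
#                      ("array", "CATEGORY")])
# -> {"ARRAY_DIM": "2", "CATEGORY": "array"}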
6.3 Generating a List of Candidate Referents
Once the referring expression is represented as an object attribute vector as described
above, we wish to link that vector to the closest-matching object in the environment.
Each object is represented by its own attribute vector, and there may be a large number
of objects in Et. Given a referring expression Rk, we would like to trim the list to keep
only those objects that are likely to be the referent of Rk.
There are two desired criteria for generating the list of candidate referents. First, the
actual referent must be in the candidate list. At the same time, the candidate list should
be as short as possible. We can pare down the set of all objects in Et by considering focus
of attention in dialogue. Early approaches performed reference resolution by estimating
each dialogue participant’s focus of attention (Lappin and Leass, 1994; Grosz et al.,
1995). According to Ariel’s accessibility theory (Ariel, 1988), people tend to use more
precise descriptions such as proper names in referring expressions for referents in long
term memory, and use less precise descriptions such as pronouns for referents in short
term memory. In a precise description, there is more semantic information, while in a
more vague description like a pronoun, there is less semantic information. Thus, these two
sources of information, semantics and focus of attention, work together in identifying a
referent.
Our approach employs this idea in the process of candidate referent selection by
tracking the focus of attention of the dialogue participants from the beginning of the
dialogue through dialogue history and task history, as has been done in prior work we
use for comparison within our experiments (Iida et al., 2010). We also use the learned
semantics of the referring expression (represented as the referring expression’s attribute
vector) as filtering conditions to select candidates.
The candidate generation process consists of three steps.
1. Candidate generation from dialogue history DH.

   DH = <O_d, T_d>

   Here, O_d = <o_d^1, o_d^2, ..., o_d^m> is the sequence of objects that have been
   mentioned since the beginning of the dialogue, and T_d = <t_d^1, t_d^2, ..., t_d^m> is
   the sequence of timestamps at which the corresponding objects were mentioned. All of
   the objects in E_t that were ever mentioned in the dialogue history,
   {o_i | o_i ∈ DH and o_i ∈ E_t}, are added into the candidate list.

2. Candidate generation from task history TH. Similarly, TH = <O_b, T_b>; all of the
   objects in E_t that were ever manipulated by the user are added into the candidate
   list.
Table 6-1. Algorithm to select candidates using learned semantics.

Given a referring expression R_k, whose attribute vector Att_Vec_k has been extracted:
  for each element att_i of Att_Vec_k:
    if att_i is not null:
      for each o in E_t:
        if att_i == o.att_i:
          add o into candidate list C_k
3. Candidate generation using learned semantics, which are the referent's attributes.
   Given a set of attributes extracted from a referring expression, all objects in E_t with
   one of the same attribute values will be added into the candidate list. The attributes
   are considered separately to avoid the case in which a single incorrectly extracted
   attribute could rule out the correct referent. Table 6-1 shows the algorithm used in
   this step; a sketch combining all three steps follows below.
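Putting the three steps together, candidate generation can be sketched as follows (the object and history representations are assumptions of the sketch):

def generate_candidates(env_objects, dialogue_history, task_history, att_vec):
    # env_objects: objects in E_t, each with an id and an attrs dict.
    # dialogue_history / task_history: sets of ids of objects previously
    # mentioned / manipulated. att_vec: attributes extracted from the
    # referring expression. Returns the candidate referent list.
    candidates = []
    for o in env_objects:
        in_history = o.id in dialogue_history or o.id in task_history
        # Attributes are checked separately, so a single incorrectly
        # extracted attribute cannot rule out the true referent.
        matches_attr = any(o.attrs.get(k) == v for k, v in att_vec.items())
        if in_history or matches_attr:
            candidates.append(o)
    return candidates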
6.4 Ranking-based Classification
With the list of candidate referents in hand, we employ a ranking-based classification
model to identify the most likely referent. Ranking-based models have been shown to
perform well for reference resolution problems in prior work (Denis and Baldridge,
2008; Iida et al., 2010). For a given referring expression Rk and its candidate referent
list Ck = {o1, o2, ..., oNk}, in which each oi is an object identified as a candidate
referent, we compute the probability of each candidate oi being the true referent of
Rk, p(Rk, oi) = f(Rk, oi), where f is the classification function. (Note that our approach is
classifier-agnostic. As we describe in Section 6.5.3, we experimented with several different
models.) Then, the candidates are ranked by p(Rk, oi), and the object with the highest
probability is taken as the referent of Rk.
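The ranking step itself is classifier-agnostic; a sketch assuming a trained binary classifier with a scikit-learn-style predict_proba interface and a featurize function that builds the feature vector of Table 6-2:

def resolve_reference(ref_exp, candidates, classifier, featurize):
    # Score each (referring expression, candidate) pair and return the
    # candidate with the highest probability of being the referent.
    best, best_p = None, -1.0
    for o in candidates:
        features = featurize(ref_exp, o)  # SF/DH/TH features (Table 6-2)
        p = classifier.predict_proba([features])[0][1]
        if p > best_p:
            best, best_p = o, p
    return best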
6.5 Experiments and Result
To evaluate the new approach, we performed a set of experiments that compare our
approach with two state-of-the-art approaches. We use the corpus described in Chapter 3.
6.5.1 Semantic Parsing
The referring expressions were extracted from the tutorial dialogues and their
semantic segments and labels were manually annotated. A linear-chain CRF was trained
on that data and used to perform referring expression segmentation and labeling (Li and
Boyer, 2015). The current work reports the first use of that learned semantics approach
for reference resolution.
Next, we proceeded to extract the attribute values, a step that our previous work
did not address. For the example shown in Figure 6-1 (b), from the learned semantic
structure, we may know that “2 dimensional” refers to the dimension of the array, the
attribute “ARRAY_DIM”. (In the current domain there are 14 attributes that comprise
the generic attribute vector V, such as ARRAY_DIM, NUM, and CATEGORY.) To
actually extract the attribute values, we use regular expressions that capture our three
types of attribute values: categorical, numeric, and string. For example, the value type
of “CATEGORY” is categorical, like “method” or “variable”; its values are taken from a
closed set. “NAME” has values that are strings, and the value of “LINE_NUMBER” is numeric.
For categorical attributes, we add the categorical attribute values into the semantic tag
set of the CRF used for segmentation. In this way, the attribute values of categorical
attributes will be generated by the CRF. For attributes with text string values, we take
the whole surface string of the semantic segment as its attribute value. The accuracy of
the entire semantic parsing pipeline is 93.2% using 10-fold cross-validation. The accuracy
is defined as the percentage of manually labeled attribute values that were successfully
extracted from referring expressions.
6.5.2 Candidate Referent Generation
We applied the approach described in Section 6.3 to each session to generate a list of
candidate referents for each referring expression. In a program, there can be more than
one appearance of the same object. We take all of the appearances of the same object to
be the same, since they all refer to the same artifact in the program. The average number
of generated candidates for each referring expression was 44.8. The percentage of referring
expressions whose actual referents were in the generated candidate list, or “hit rate,” was
90.5%, based on manual tagging. This performance indicates that the candidate referent
list generation performs well.
A referring expression could be a pronoun, such as “it” or “that”, which does not
contain attribute information. In previous reference resolution research, it was shown
that training separate models for different kinds of referring expressions could improve
performance (Denis and Baldridge, 2008). We follow this idea and split the dataset
into two groups: referring expressions containing attributes, REFATT (270 referring
expressions), and referring expressions that do not contain attributes, REFNON (76
referring expressions).
The candidate generation approach performed better for the referring expressions
without attributes (hit rate 94.7%), compared to referring expressions with attributes (hit
rate 89.3%). Since the candidate list for referring expressions without attributes relies
solely on dialogue and task history, 94.7% of those referents had been mentioned in the
dialogue or manipulated by the user previously. For referring expressions with attribute
information, the generation of the candidate list also used learned semantic information.
Only 70.0% of those referents had been mentioned in the dialogue or manipulated by the
user before.
6.5.3 Identifying Most Likely Referent
We applied the approach described in Section 6.4 to perform reference resolution on
the corpus of tutorial dialogue. The data from the six manually labeled Java tutoring
sessions were split into a training set and a test set. We used leave-one-dialogue-out cross
validation (which leads to six folds) for the reference resolution experiments. In each
fold, annotated referring expressions from one of the tutoring sessions were taken as the
test set, and data from the other five sessions were the training set. We tested logistic
regression, decision tree, naive Bayes, and neural networks as classifiers to compute the
p(Rk, oi) for each (referring expression, candidate) pair for the ranking-based model. The
features provided to each classifier are shown in Table 6-2.
Table 6-2. Features used for reference resolution.

Learned Semantic Features (SF)
SF1: whether RE has a CATEGORY attribute
SF2: whether RE.CATEGORY == o.CATEGORY
SF3: whether RE has RE.NAME
SF4: whether RE.NAME == o.NAME
SF5: RE.NAME ≈ o.NAME
SF6: whether RE.VAR_TYPE exists
SF7: whether RE.VAR_TYPE == o.VAR_TYPE
SF8: whether RE.LINE_NUMBER exists
SF9: whether RE.LINE_NUMBER == o.LINE_NUMBER
SF10: whether RE.ARRAY_DIMENSION exists
SF11: whether RE.ARRAY_DIMENSION == o.ARRAY_DIMENSION
SF12: CATEGORY of o

Dialogue History (DH) Features
DH1: whether o is the latest mentioned object
DH2: whether o was mentioned in the last 30 seconds
DH3: whether o was mentioned in the last [30, 60] seconds
DH4: whether o was mentioned in the last [60, 180] seconds
DH5: whether o was mentioned in the last [180, 300] seconds
DH6: whether o was mentioned in the last [300, 600] seconds
DH7: whether o was mentioned in the last [600, infinite] seconds
DH8: whether o was never mentioned from the beginning
DH9: string matching between o and RE

Task History (TH) Features
TH1: whether o is the most recent object manipulated
TH2: whether o was manipulated in the last 30 seconds
TH3: whether o was manipulated in the last [30, 60] seconds
TH4: whether o was manipulated in the last [60, 180] seconds
TH5: whether o was manipulated in the last [180, 300] seconds
TH6: whether o was manipulated in the last [300, 600] seconds
TH7: whether o was manipulated in the last [600, infinite] seconds
TH8: whether o was never manipulated from the beginning
TH9: whether o is in the current working window
To evaluate the performance of the new approach, we compare against two other
recent approaches. First, we compare against a ranking-based model that uses dialogue
history and task history features (Iida et al., 2010). This model uses semantics from
a domain-specific lexicon instead of a semantic parser. (Iida et al.'s work was extended
by Funakoshi et al. (Funakoshi et al., 2012), but that work relies upon a handcrafted
probability distribution of referents to concepts, which is not feasible in our domain
since it has no fixed set of possible referents.) Therefore, we compare against their 2010
approach, implementing it in a way that creates the strongest possible baseline: we built
a lexicon directly from our manually labeled semantic segments. First, we split all of
the semantic segments into groups by their tags. Then, for each group of segments, any
token that appeared twice or more was added into the lexicon. Although the necessary
data to do this would not be available in a real application of the technique, it ensures
that the lexicon for the baseline condition has good coverage and creates a high baseline
for our new approach to compare against. Additionally, for fairness of comparison, for
each semantic feature used in our model, we extracted the same feature using the lexicon.
There were three kinds of attribute values in the domain: categorical, string, and numeric
(as described in Section 6.5.1). We extracted categorical attribute values using the
appearance of tokens in the lexicon. We used regular expressions to determine whether
a referring expression contains the name of a candidate referent. We also used regular
expressions to extract attribute values from referring expressions, such as line number. We
also provided the Iida baseline model (Iida et al., 2010) with a feature to indicate string
matching between referring expressions and candidate referents, since this feature was
captured in our model as an attribute.
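The lexicon construction for this baseline can be sketched as follows:

from collections import Counter, defaultdict

def build_lexicon(labeled_segments):
    # labeled_segments: (token_list, semantic_tag) pairs from the
    # manually annotated segments. A token enters the lexicon for a
    # tag if it appears at least twice in that tag's segments.
    counts = defaultdict(Counter)
    for tokens, tag in labeled_segments:
        counts[tag].update(tokens)
    return {tag: {t for t, c in counter.items() if c >= 2}
            for tag, counter in counts.items()}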
We also compared our approach (we call it Li approach here) against a very recent
technique that leveraged a word-as-classifier approach to learn semantic compatibility
between referring expressions and candidate referents (Kennington and Schlangen, 2015).
To create this comparison model, we used a word-as-classifier model to learn the semantics
of referring expressions instead of the CRF. This weakly supervised approach relies on
co-occurrence between words and objects' attributes. We then used the resulting semantic
compatibility in a ranking-based model to select the most likely referent.
The three conditions for our experiment are as follows.
• Iida Baseline Condition: Features including dialogue history, task history, andsemantics from a handcrafted lexicon (Iida et al., 2010).
• Kennington Baseline Condition: Features including dialogue history, task history,and learned semantics from a word-as-classifier model (Kennington and Schlangen,2015).
• Li approach: Features including dialogue history, task history, and learned semanticsfrom CRF.
Within each of these experimental conditions, we varied the classifier used to compute
p(Rk, oi), testing four classifiers: logistic regression (LR), decision tree (DT), naive
Bayes (NB), and neural network (NN). The neural network had one hidden layer, and the
best-performing number of hidden units was 100 (we experimented with values between 50 and
120).
To measure the performance of the reference resolution approaches, we analyzed
accuracy, defined as the percent of referring expressions that were successfully linked to
their referents. We chose accuracy for our metric following standard practice (Iida et al.,
2010; Kennington and Schlangen, 2015) because it provides an overall measure of the
number of (Rk, oi) pairs that were correctly identified. For the rare cases in which one
referring expression referred to multiple referents, the output referent of the algorithm was
taken as correct if it selected any of the multiple referents.
The results are shown in Table 6-3. We focus on comparing the results on referring
expressions that contain attribute information, shown in the table as REFATT . REFATT
accounts for 78% of all of the cases (270 out of 346). Among the three approaches, our
approach (Li approach) outperformed both prior approaches. Compared to the Iida
2010 approach which achieved a maximum of 55.2% accuracy, our approach achieved
68.5% accuracy using a neural net classifier, and this difference is statistically significant
based on the results of a Wilcoxon signed-rank test (n = 6; p = 0.046). Our approach
outperformed the Kennington 2015 approach even more substantially, as its best
performance was 46.3% accuracy (p = 0.028). Intuitively, the better performance of
our model compared to the Iida approach is due to its ability to more accurately model
referring expressions' semantics: semantic labeling finds an optimal segmentation for
each referring expression as a whole, whereas a lexicon approach extracts each kind of
attribute information from referring expressions separately. Note that our approach
and the Iida 2010 approach achieved the same performance on REFNON referring
expressions. Since these referring expressions do not contain attribute information,
these two approaches used the same set of features.
Interestingly, the model using a word-as-classifier approach to learn the semantic
compatibility between referring expressions and referent’s attributes performs the worst.
We believe that the reason for this poor performance is mainly from the way it performs
semantic compositions. It cannot learn structures in referring expressions, such as that
“2 dimensional” is a segment, “dimensional” represents the type of the attribute, and “2”
is the value of the attribute. The word-as-classifier model cannot deal with this complex
semantic composition.
The combined accuracy over REFATT and REFNON was also calculated using the
neural network model. The proposed approach achieved an accuracy of 61.6%, and the
lexicon-based baseline achieved an accuracy of 51.3%.
The results reported above relied on learned semantics. We also performed experiments
using manually labeled, gold-standard semantics of referring expressions. The result in
Table 6-4 shows that ranking-based models have the potential to achieve a considerably
better result, 73.6%, with more accurate semantic information. Given the 85.3%
agreement between two human annotators, the model performs very well, since the
semantics of whole utterances in situated dialogue also play a very important role in
identifying a given referring expression’s referent.
Table 6-3. Reference resolution results.
experimental       f(Rk, oi)    accuracy
condition          classifier   REF_ATT   REF_NON
Iida 2010          LR           0.500     0.440
                   DT           0.537     0.453
                   NB           0.466     0.413
                   NN           0.552     0.373
Kennington 2015    LR           0.463     0.387
                   DT           0.377     0.333
                   NB           0.321     0.400
                   NN           0.422     0.400
Li approach        LR           0.631     0.440
                   DT           0.631     0.453
                   NB           0.493     0.413
                   NN           0.685     0.373
Table 6-4. Reference resolution results with gold semantic labels.
models           accuracy
                 REF_ATT   REF_NON
LR + SEM_gold    0.684     0.429
DT + SEM_gold    0.643     0.429
NB + SEM_gold    0.511     0.377
NN + SEM_gold    0.736     0.325
CHAPTER 7TUTORIAL DIALOGUE SYSTEM FOR JAVA PROGRAMMING WITH SUPERVISED
REFERENCE RESOLUTION
This chapter presents an end-to-end tutorial dialogue system for Java programming
which implements real-time reference resolution. As discussed in the literature review
in Chapter 2, most existing task-oriented dialogue systems are designed to interact with
users in highly constrained domains (Wen et al., 2016; Strik et al., 1997). These systems
either do not need reference resolution functionality due to the simplicity of the domain
(Wen et al., 2016), or perform reference resolution using very simple approaches, such as
keyword matching and a domain-specific lexicon (Vanlehn et al., 2002). Unlike the
constrained domains previous dialogue systems operate on, this dissertation focuses on
the domain of Java programming tutoring. In such a domain, tutorial dialogues frequently
mention objects in the Java program in question. The dialogues within this domain are
characterized by situated features that pertain to the programming task. A portion of
user utterances refer to general Java knowledge. In these cases, semantic interpretation
of a user’s request can be accomplished by mapping to a domain-specific ontology (e.g.,
(Dzikovska et al., 2007)). In contrast, many utterances refer to concrete entities within the
dynamically changing, user-created programming artifact. Identifying these entities correctly
is crucial for understanding a user’s utterance in the specific programming context, and
then generating specific tutorial dialogue moves.
This chapter presents a natural language tutorial dialogue system for Java programming
that implements real-time reference resolution for natural language understanding. This
dialogue system tracks user intention and the world state to provide a task-related context
for user utterance understanding and system dialogue act generation. Here, user intention
means the current subproblem that the user is focusing on, such as “creating an integer
array to store the 5 digits of a zip code”. World state means the completed steps toward the
solution of a programming problem. The tutorial dialogue system software comprises
three parts: a user interface (UI) module, a database module, and an agent module. The
architecture of the whole system is illustrated in Figure 7-1. The UI module is an Eclipse
plugin, which provides an integrated development environment for Java programming.
The database module logs the data generated when a user interacts with the tutorial
dialogue system. The agent module implements all of the machine learning functionality
of the dialogue system. The UI module and the agent module are implemented in a
client-server architecture and communicate by exchanging messages over sockets. This
architecture enables us to implement the UI and the agent using different programming
languages, each best serving the requirements of its module. The UI module
captures user utterances as well as the user's programming actions, and sends them to the
agent module. The agent module processes these user inputs and generates proper system
utterances accordingly. All of the generated data in this process are logged into the
database.
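As a rough illustration of this message flow, the sketch below sends one logged user action to the agent server as a newline-delimited JSON message; the host, port, field names, and framing are assumptions for illustration rather than the system's actual protocol.

    import json
    import socket

    def send_event(host, port, event):
        """Serialize one user event and ship it to the agent server."""
        payload = json.dumps(event).encode("utf-8")
        with socket.create_connection((host, port)) as sock:
            sock.sendall(payload + b"\n")   # newline-delimited JSON messages

    # Example call with hypothetical values:
    # send_event("localhost", 9000, {"type": "TYPING", "addedText": "++",
    #                                "lineNum": 80})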
[Figure omitted. It depicts the client-server architecture: the user interface client (log on/off pane and dialogue pane) sends user utterances and user actions to the agent server, whose NLU (reference resolution, topic classifier, DA classifier), DM (user intention recognizer, world state tracker), knowledge base, and NLG modules produce system utterances; all traffic is logged to the database.]

Figure 7-1. Architecture of the tutorial dialogue system.
To evaluate how different reference resolution approaches impact the performance
of the dialogue system, I implemented two different reference resolution approaches. One
of the reference resolution modules used learned semantics from a CRF-based approach,
which is my novel reference resolution approach as described in Chapter 6. The other
reference resolution module is used for comparison and uses a recent state-of-the-art
approach that relies upon a manually created domain-specific lexicon. Both of these
approaches use contextual information, including user behavior history and dialogue
history for reference resolution. Recall that “user behavior history” in this tutorial
dialogue system means the editing actions conducted by the user, and “dialogue history”
means the objects that were mentioned previously in the tutorial dialogue. In this way,
we can assess the impact of an improved reference resolution approach within a real-time
dialogue system by comparing the system’s performance with the two different reference
resolution models.
Section 7.1 describes the functionalities and implementation of the user interface
module. Section 7.2 defines the boundaries of the dialogue system’s capabilities, i.e. what
functionalities this system is able to perform. Section 7.3 introduces the architecture
of the dialogue system. Section 7.4 describes the approaches used to implement user
utterance understanding in this system. Section 7.5 describes the implementation of the
dialogue manager module. Section 7.6 presents the encoded domain knowledge in this
dialogue system. Section 7.7 describes the utterance generation implementation.
7.1 User Interface
The user interface is illustrated in Figure 7-2. This user interface is embedded in
Eclipse, a widely used integrated development environment (IDE) for Java programming.
The user interface has two panes, a log on/off pane and a dialogue pane. The log on/off
pane displays user’s log on/off status. Users log into the dialogue system using their
Google accounts. This user information is used to distinguish different tutorial sessions.
The dialogue pane displays the tutorial dialogue between a user and the dialogue system.
When a user logs into the dialogue system in the log on/off pane, the system greets
the user and starts a tutoring session for Java programming. The user can talk to the
dialogue system in the dialogue pane using textual messages. In addition, the UI module
implements a set of listeners in Eclipse to capture the user's programming actions, including
source code editing, source code selecting, file opening, file closing, and file creating. All
of the user utterances and programming actions are sent to the agent module as inputs to
the tutorial dialogue system. These data are also logged into a local database for further
analysis.
Figure 7-2. User interface of the dialogue system.
7.2 System Functionalities
Today’s state-of-art task-oriented dialogue systems are still far from engaging in
natural language dialogue with a human user as a human speaker could do. The limitation
of these systems lies with their ability to handle a conversation on various topics and
granularities. Thus, task-oriented dialogue systems usually operate only in a specific
domain, such as an employee information query in a company (Corbin et al., 2015) or
restaurant information requests (Wen et al., 2016).
Before building a task-oriented dialogue system, we need to clearly define the
functionality boundaries of the system. We need to define the topics on which the system
will be able to hold a reasonable conversation with the user, and how the system should
handle out-of-topic user utterances. In this way, we can provide users with a reasonable
expectation of the system's functionalities.
My system is able to hold a reasonable conversation with a human user and help the user complete a Java programming problem. I categorize its functionalities into several types. The key functionalities include the following items:
• Properly start and end a conversation with a human user.
To conduct a conversation with the user, the dialogue system greets the user to draw the user's attention and signal that it is ready to start a conversation. When the session is over, the system closes the conversation.

• Understand and properly respond to a user utterance about program progress.
The knowledge base of this dialogue system includes knowledge about the programming problem. The programming problem is modeled as a tree structure, as shown in Figure 7-5. To complete a task, the user needs to complete a set of subtasks that are the children of the current task in the tree structure. In this way, when the user is confused about the current task, the system helps the user break it down into smaller subtasks that are easier to work with.

• Understand and properly respond to a user utterance about basic Java concepts.
The system understands user utterances about basic Java knowledge in the programming context, such as how to create an array, and provides a proper response.

• Detect the user's out-of-topic utterances and provide a response.
During an interaction, a user's utterance could be off topic. However, the system is only designed to hold a natural language conversation about a specific Java programming problem. The system attempts to detect such user utterances and responds with the goal of refocusing the user on the programming problem.

• Monitor the programming actions of the user and generate proper system utterances.
This dialogue system is mixed-initiative, which means that both the user and the dialogue system can start a conversation. The system is designed to detect the moments when users may need hints from the system.
7.3 Architecture of the Dialogue Agent
Following a typical dialogue system architecture, the dialogue system (the agent
module) has four main modules, as shown in Figure 7-3. The natural language understanding
(NLU) module performs reference resolution, topic classification, and dialogue act
[Figure omitted. It traces the user utterance "Is my for loop correct?" through the pipeline: the NLU performs noun phrase chunking, referring expression extraction ("my for loop"), semantic interpretation (NAME = "for", CATEGORY = FOR_STATEMENT), and referent identification (the statement "for (int i=0; i<=5; i++)"); dialogue act classification yields EVALUATION_QUESTION and topic classification yields AM_I_RIGHT; the DM's user intention identifier (CREATE_for_loop) and world state tracker inform the dialogue policy; NLG then produces a hint about the loop's end condition.]

Figure 7-3. Architecture of the dialogue system.
classification for a user utterance. The inputs to the NLU module are user utterances.
The NLU module identifies any referents in the input user utterance. It also identifies
the user utterance’s dialogue act and topic. The output of the NLU module includes the
entities that the user mentioned in the current user utterance, the dialogue act, and the
formal semantic representation of the input user utterance.
The dialogue manager (DM) module tracks user intention and the task progress of the
task-oriented dialogue. In this tutorial dialogue system, user intention means the current
programming subtask that the user is focusing on. The inputs to this module include the
output from the NLU module, as well as user actions, such as program editing actions.
The DM module outputs a system dialogue act for the current user utterance. It also
updates the user intention and world state. The world state tracker maintains the progress
of the programming task. The dialogue policy model takes the reference resolution results,
the dialogue act and topic of a user utterance and the current state of the Java program
as input, and outputs a system utterance.
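A signature-level sketch may make this data flow concrete; the class and function names below are illustrative stand-ins, not the system's actual identifiers.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class NLUResult:
        dialogue_act: str       # e.g., "GRE" or "EQ"
        topic: str              # e.g., "AM_I_RIGHT"
        referents: List[dict]   # entities resolved in the user's Java program

    def dialogue_policy(nlu: NLUResult, world_state: dict,
                        user_intention: str) -> Optional[str]:
        """Map the interpreted utterance plus task context to a system utterance."""
        if nlu.dialogue_act == "GRE":
            return "Hi, I'm your virtual TA."   # greeting rule (see Section 7.5)
        # ... further rules conditioned on topic, referents, and world state
        return None                             # no rule fired: stay silent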
7.4 Natural Language Understanding Module
The natural language understanding module contains three submodules: a reference
resolution module, a topic classifier, and a dialogue act classifier, as shown in Figure
7-3. The inputs to the NLU module are textual user utterances, the current progress
of the programming task, and the current user intention. The outputs of the NLU module include the referents in the current user utterance, its dialogue act, and its semantics. This section describes the implementation
of the submodules of the natural language understanding module.
7.4.1 Reference Resolution
As discussed in Chapter 2, perceived affordances, which arise from the objects a user perceives in the situated environment, suggest likely user actions. For example, a key
suggests the action of “opening a door”. In a Java programming problem, the user’s
perceived referent could also suggest possible actions. For example, when a user mentions
a two-dimensional array, the most likely action associated with it may be “ask how to
create a two dimensional array”. I use a data-driven approach to discover the relationships
between the mentioned objects and the suggested actions. Reference resolution is also
essential to understanding a user utterance within a context. For example, when the
Java programming problem asks the user to create an integer array called “zipCode”,
the user could say “I don’t know how to create zipCode.” We need to find the referent
of “zipCode” in the Java code, and infer that the user is asking about “how to create an
integer array”. Then we can form a query to the knowledge base to request an answer.
Two different reference resolution approaches were implemented in this dialogue
system for the purpose of comparison. Version 1 implements reference resolution using our approach, which relies on learned semantics as described in Chapter 6. For comparison, I created
a baseline reference resolution module using the same approach as version 1 except that
it uses a manually defined lexicon to represent referring expressions’ semantics instead of
semantics learned by the CRF-based approach.
In Chapter 4, we presented the approach for referring expression extraction, which
extracts all noun phrases in a user utterance. Not all of these noun phrases refer to
objects in the parallel Java program, so we identify referring expressions from these
noun phrases. In this tutorial dialogue system, we first apply a set of rules to filter the
extracted noun phrases. Recall that our reference resolution approaches calculate a
compatibility probability for each referring expression and candidate pair.
compatibility probability = f(referring expression, candidate_i)

where candidate_i is the i-th candidate referent in the candidate list for the referring
expression in question. The candidate with the highest compatibility probability is picked
as the referent. We use the generated compatibility probability by the reference resolution
module as a measure to decide if a noun phrase refers to an object in the Java program.
Any noun phrase with a compatibility probability of 0.90 or higher (f(noun phrase, candidate_i) ≥ 0.90) for any of its candidates was taken as a referring expression.
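A minimal sketch of this thresholded selection follows; only the 0.90 threshold and the choice of the highest-probability candidate come from the text, while the function names are illustrative and score_compatibility stands in for the trained classifier f.

    THRESHOLD = 0.90

    def select_referent(referring_expression, candidates, score_compatibility):
        """Return (referent, prob) for the best candidate, or (None, prob)
        if the noun phrase is judged not to be a referring expression."""
        scored = [(score_compatibility(referring_expression, c), c)
                  for c in candidates]
        prob, best = max(scored, key=lambda pair: pair[0])
        if prob >= THRESHOLD:
            return best, prob    # accepted as a referring expression
        return None, prob        # below threshold: not a referring expression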
7.4.2 Dialogue Act Classification
Dialogue acts are specialized speech acts that model the illocutionary force of
utterances (Austin, 1962). An illocutionary act indicates the speaker’s intention instead of
the user utterance’s surface meaning. For example, when a customer in a restaurant asks
a waiter: “Do you have salt?” The surface meaning of the utterance is a question which
asks whether the waiter has salt. The illocutionary act of this utterance is conveying the
customer's request that she wants some salt.
For dialogue act classification, I use a maximum entropy model. The maximum
entropy model uses three types of features: word unigrams, bigrams, and trigrams from
each user utterance. I use the annotation schema proposed by Can (Can, 2016). The tag
set is shown in Table 7-1. The model is trained using 4857 utterances which are labeled
with dialogue act tags from the Ripple corpus (Boyer et al., 2010). The classification
accuracy of the trained model was 73.6%.
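A comparable classifier can be sketched with scikit-learn, assuming LogisticRegression as the maximum entropy model and CountVectorizer for the unigram-to-trigram features; this is an illustration under those assumptions, not the project's actual training code.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_da_classifier(utterances, tags):
        """utterances: list of str; tags: dialogue act labels such as "Q" or "EQ"."""
        model = make_pipeline(
            CountVectorizer(ngram_range=(1, 3)),   # word unigrams to trigrams
            LogisticRegression(max_iter=1000),     # multinomial maxent model
        )
        model.fit(utterances, tags)
        return model

    # usage: da = train_da_classifier(train_utts, train_tags)
    #        da.predict(["is my for loop correct?"])   # e.g., ["EQ"]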
7.4.3 Topic Classification
Dialogue acts represent utterance-level intentions at a categorical, abstract level. In some cases, knowing the user utterance's dialogue act is enough
for the system to generate a reasonable response, such as a greeting dialogue act from
Table 7-1. Dialogue act set.
Dialogue Act Tag            Explanation                                          Sample Utterance
Question (Q)                A general question about the task                    what would be the best way to do that?
Evaluation Question (EQ)    An evaluation question about the task                isn't that also declared in the same place?
Statement (S)               A statement of a fact                                I was trying to figure out the best way to do that
Grounding (G)               Acknowledgement about a previous utterance           fair enough
Extra-Domain (EX)           Any utterance that is not related to the task        I'm not very good at Java yet
Positive Feedback (PF)      Positive assessment of knowledge or task             yea it's a string
Negative Feedback (NF)      Negative assessment of knowledge or task             i really don't see the point much of this loop really
Lukewarm Feedback (LF)      Assessment having both positive and negative aspects kind of
Greeting (GRE)              Greetings                                            hello
a user. However, in some cases, such as when responding to a user's question, the dialogue
system needs to query the knowledge base. To recognize the topics of user utterances in
the dialogue system, a topic classifier is trained. The topics this classifier recognizes are
listed in Table 7-2. The topic classifier was also implemented using a maximum entropy
model. It takes word unigrams to trigrams of user utterances as features. We manually
selected 492 utterances from the Ripple corpus and tagged them with topic labels. These
492 utterances were used as a training set to train the topic classifier. The accuracy of the
classifier was 63.7%.
7.5 Dialogue Manager
The dialogue manager takes user dialogue acts, user actions, and the recognized topics of user utterances as inputs. It selects a system response according to these inputs and tracks user intention. For example, when a user says "Hi", the dialogue act classifier
predicts this utterance as a GRE, a greeting dialogue act. Then the dialogue manager
generates a system dialogue act GRE, which will be passed to the NLG module to
Table 7-2. Topics recognized by the topic classifier.
Topic                     Explanation                                          Sample Utterance
GET_SUBSTRING             the way to get a substring from a string             okay so should it be zipString.substring(i,i+1)?
GET_ZIP_DIGITS            the way to extract a single digit from a zip code    how do I extract individual digits
CONVERT_ZIPCODE_TO_STR    convert the variable zipCode to a string             Can't manually turn an integer into a string?
CREATE_FOR_LOOP           the way to create a for statement                    what are the three things we need for a loop?
USE_A_LOOP                necessity of using a for statement                   would a for loop be best?
STORE_ZIP_DIGITS          the way to store digits                              with an array?
PROGRESS                  about the progress                                   how do i start the extractDigits method?
STRING_2_INT              convert a string to an integer                       Integer.parseString()?
CREATE_DIGITS_ARRAY       the syntax to create an array for the digits         which is 5, correct? or does it depend?
DECLARE_ARRAY             the syntax to declare an array                       How do we declare an array?
INPUT_ZIPCODE             get the input zip code                               so i need something telling it to get zipcode?
CHAR_2_INT                convert a character to an integer                    can I parse a character to an int
HOW_TO_RUN                the way to run the program                           how to run it?
AM_I_RIGHT                request to check the user's code                     does that make sense?
OOD                       out-of-domain topics                                 Meh, this [ key is stuck.
instantiate it as a system utterance “hi” or “hello”, etc. In another example, a user may
say, "How do I create an array to hold the zip code digits?". A set of rules was authored for the dialogue manager to generate a system response in such cases.
User intention indicates the subtask that the user is working on. User intention
gives the system essential contextual information about the dialogue. As discussed in
the related work section, it can dramatically constrain the possible interpretations of user
utterances. In this dialogue system, the current user intention is used to divide the Java
programming problem into sub-domains. The whole Java programming problem is a
domain for the tutorial dialogue system. Each subtask of the programming problem forms
[Figure omitted. It shows the partially completed PostalFrame class, with the lines of the extractDigits() method mapped onto solution steps such as DECLARE_zipDigits, CREATE_zipcode_str, DECLARE_digit_char, CREATE_for_loop, ASSIGN_digit_char, CONVERT_char_to_int, ASSIGN_zipDigits, and RETURN_zipDigits.]

Figure 7-4. User intention identification example.
a sub-domain for the dialogue system. In this way, the dialogue system could be seen as a
combination of a set of smaller dialogue systems. For each sub-task, we focus on a much
smaller sub-domain, compared with the domain for the whole programming problem.
To identify user intention in the domain of Java programming, we need to understand
the user’s Java source code. Given a programming task, there could be multiple ways to
solve it, i.e., there are multiple paths to follow if we imagine each step in the solution as a
node on a graph.
The first step in understanding the user's program is to perform syntactic parsing so that we know which types of variables were declared, which variables were assigned, and so on. This
information helps us to identify which step the user is working on. For example, creating
an integer array at the beginning of the “extractDigits()” method indicates that the user
is creating an array to hold the 5 digits of a zip code.
We created a rule-based algorithm to interpret the user’s Java program by mapping
each line of the Java program onto a step in the solution. There were 96 rules defined for the
intention identifier. An example of user intention identification is shown in Figure 7-4.
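The fragment below illustrates the flavor of such rules; the few regular expressions shown are hypothetical stand-ins for the 96 actual rules.

    import re

    # Each rule maps a pattern over one line of Java source to a solution step.
    RULES = [
        (re.compile(r"int\s*\[\s*\]\s+\w+\s*=\s*new\s+int\s*\[\s*5\s*\]"),
         "DECLARE_zipDigits"),
        (re.compile(r"String\s+\w+\s*=\s*zipCode\s*\+\s*\"\""),
         "CREATE_zipcode_str"),
        (re.compile(r"for\s*\(\s*int\s+\w+\s*=\s*0\s*;"),
         "CREATE_for_loop"),
    ]

    def identify_step(java_line):
        """Return the solution step a line of Java code corresponds to, if any."""
        for pattern, step in RULES:
            if pattern.search(java_line):
                return step
        return None

    # identify_step("int [] digits = new int[5];")  ->  "DECLARE_zipDigits"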
[Figure omitted. It shows the task tree: the PostalFrame root with child subtasks extractDigits(), calcAndDrawCDigits(), and drawZipCode(), each of which decomposes further (e.g., createAndInitStringZip under extractDigits()).]

Figure 7-5. Structure of the programming task.
7.6 Knowledge Base
To support a reasonable dialogue, the knowledge base contains three types of
knowledge: subtask structure of the programming problem, knowledge about Java
language features, and knowledge needed to solve the programming problem.
The solution to the programming problem is defined as a tree structure, as shown in
Figure 7-5. The root of the tree is the whole programming task. Each node in the tree is
a subtask. In this tutorial dialogue system, the whole programming task is to complete
a method in a Java class called PostalFrame, which translates a five-digit zip code into
a bar code. For the subtask of completing the method "extractDigits()", there are
some smaller subtasks that need to be completed. With this tree structure, we could
understand the user’s progress and provide hints when the user has questions about the
current subtask. As described in Section 7.5, we use a rule-based algorithm to map each
line of a user’s program to a node in this tree structure. This gives us very important
contextual information for the dialogue.
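A minimal sketch of such a tree follows, with node names taken from Figure 7-5; the class design itself is an assumption for illustration.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Subtask:
        name: str
        children: List["Subtask"] = field(default_factory=list)
        completed: bool = False

    root = Subtask("PostalFrame", [
        Subtask("extractDigits()", [
            Subtask("createAndInitStringZip"),
            # ... remaining subtasks of extractDigits()
        ]),
        Subtask("calcAndDrawCDigits()"),
        Subtask("drawZipCode()"),
    ])

    def next_subtask(node):
        """Depth-first search for the first uncompleted leaf: a natural hint target."""
        if not node.children:
            return None if node.completed else node
        for child in node.children:
            hit = next_subtask(child)
            if hit is not None:
                return hit
        return None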
7.7 System Utterance Generation
A set of 99 system utterances was authored to be selected by the dialogue manager.
Table 7-3 shows sample system responses to user questions on different topics. For each
topic, we create multiple system responses with different levels of detail. When the system
detects that a user asks a similar question, the system gives a new response with a more
detailed explanation.
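This escalation policy can be sketched as follows, with abbreviated response strings and illustrative names.

    RESPONSES = {
        "CREATE_FOR_LOOP": [
            "A for loop takes the form: for(start; finish; increment).",
            "For example, for (int i = 0; i < 5; i++) { ... } runs five times.",
        ],
        # ... one list of responses per topic, ordered from brief to detailed
    }

    class ResponseSelector:
        def __init__(self):
            self.times_asked = {}          # topic -> how often it was asked

        def respond(self, topic):
            variants = RESPONSES.get(topic)
            if not variants:
                return None                # unknown topic: defer to the DM
            k = self.times_asked.get(topic, 0)
            self.times_asked[topic] = k + 1
            return variants[min(k, len(variants) - 1)]   # more detail each time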
Table 7-3. Sample system response utterances.
Topic of question         System Response
GET_SUBSTRING             "To get a substring of a string variable, the syntax is stringVariable.substring(start, end+1)."
GET_ZIP_DIGITS            "There are several ways to break an int apart into its individual digits ..."
CONVERT_ZIPCODE_TO_STR    "You can use the syntax intVariable + "" to convert an integer to a String variable."
CREATE_FOR_LOOP           "A for loop takes the form: for(start condition; finish condition; increment statement)."
USE_A_LOOP                "We can start with a for loop. It should loop through the zip code to get out each individual digit."
STORE_ZIP_DIGITS          "You need an int array to hold the 5 digits of a zipcode."
STRING_2_INT              "To cast a string of a number into an integer, you can use the parseInt method: ..."
CREATE_DIGITS_ARRAY       "The syntax to create an array is type[] arrayName."
DECLARE_ARRAY             "For example, you can do int[] digits = ..."
INPUT_ZIPCODE             "When the program is run, the user types in a zipCode ..."
CHAR_2_INT                "To convert a char of a digit into an integer, you can do char digit - '0'."
HOW_TO_RUN                "To run a program in Eclipse, you can right click ..."
OOD                       "I can help you with many aspects of this project, but I might not ..."
CHAPTER 8
EVALUATION OF THE DIALOGUE SYSTEM
This chapter describes a human user study to evaluate the novel reference resolution
approach in the implemented tutorial dialogue system, and compare it with a baseline
approach. The tutorial dialogue system with the reference resolution approach based on
learned semantics is denoted as System Li. This is the treatment condition. The baseline
tutorial dialogue system with a reference resolution approach that uses a manually created
lexicon is denoted as System Comparison. As mentioned in Chapter 6, the baseline model
was adopted from Iida et al. (2010). This is the comparison condition.
The goals of the user study are twofold. First, we evaluate the two dialogue systems’
user satisfaction and user engagement by analyzing study participants’ post-survey data.
In addition, we would like to investigate the performance of the two reference
resolution approaches in System Li and System Comparison in terms of accuracy. To do
this, we manually examined the natural language input users provided, and rated whether
the system properly identified the referent(s) within the user input.
In this chapter, the first section introduces the user study procedure and briefly
describes the collected data. In the second section, a hypothesis test is conducted to
compare user satisfaction and user engagement of the two dialogue systems. Finally, the
third section compares the reference resolution performance of the two dialogue systems.
8.1 Proposed Hypotheses
This dissertation focuses on three hypotheses.
• Hypothesis 1. System Li will outperform System Comparison on accuracy of reference resolution.
In Chapter 6, we compared the two reference resolution approaches with an offline evaluation in which we manually tagged the referring expressions. In an online dialogue system, the system automatically extracts referring expressions while conversing with a human user, and the accuracy of referring expression extraction also plays a key role in the reference resolution pipeline. We would like to examine whether the reference resolution approach with learned semantics still has higher accuracy in such an online dialogue system, given noisy referring expressions.
• Hypothesis 2. System Li will offer higher user satisfaction than System Comparison.
The goal of this tutorial dialogue system is to tutor college students on Java programming. I would like to know how satisfied the human subjects are while using the proposed dialogue system, and to examine the difference between the two conditions in terms of user satisfaction. I expect the Li approach to have higher reference resolution accuracy than the comparison approach in a real-time dialogue system, which should improve the treatment condition's user utterance understanding and lead the system to generate more reasonable responses. So, my hypothesis is that the treatment condition yields higher user satisfaction than the comparison condition. I measure students' satisfaction using their self-reported satisfaction in the post-survey results.
• Hypothesis 3. System Li will yield higher user engagement than System Comparison.
User engagement is another widely used metric for evaluating a dialogue system (Sidner et al., 2005). It measures how frequently the human user talks to the dialogue system. We would like to examine whether users engage more with System Li than with System Comparison. As discussed under Hypothesis 2, the treatment condition will probably generate more reasonable system responses, which will likely increase user engagement.
8.2 User Study
This section introduces the procedure of the user study and the collected data.
8.2.1 Participants
Student participants were recruited from an undergraduate introductory Java
programming class COP 3502 ”Programming Fundamentals I” at the University of
Florida in the 2018 Spring semester. Students voluntarily participated in this study
and were compensated with a small amount of course credit. This study had 43
participants in total, two of whom participated in a pilot study. During the pilot study,
we talked to the participants for feedback to improve the system for the following study
sessions. Data were collected for all of the 43 sessions, but only the data from the remaining 41 sessions were analyzed to address the research questions, due to the potential influence of the communication with the participants in the pilot study.
8.2.2 Java Programming Task for the Study
The study adopted a Java programming task that was previously used in another
research study for dialogue act modeling in task-oriented tutorial dialogue (Boyer, 2010).
The programming task was designed for undergraduate students in an introductory Java
programming course. The programming task examined the use of for statements, arrays,
and String concepts. We provided a partially implemented Java program which took a
5-digit zip code as input and converted it into a postal bar code. When a user ran this
program, it opened a graphical user interface (GUI) to prompt for an input zip code.
When input was entered, the GUI converted it into a bar code and displayed it. The
program separated the five digits in the input integer and converted each single digit into
a bar code. The only missing method in the provided program was the “extractDigits()”
method, which took an integer zip code as input and returned an integer array. This
integer array contained the five separated digits. A task description was provided to each
participant at the beginning of each study session. The task description can be found in
Figures 8-1 and 8-2.
8.2.3 Procedure
For recruitment, we presented a recruitment speech in the COP 3502 class to briefly
introduce the research study, and collected student volunteers’ contact information
through a Google form. The volunteers’ availability was then collected using a Doodle
poll. Student participants were assigned to different study sessions according to their
availability.
The two dialogue systems were installed on 12 LearnDialogue group-owned laptops.
The laptops were numbered. System Li was installed on odd-numbered laptops, and System Comparison was installed on even-numbered laptops. In each study session, we prepared similar numbers of laptops from each group (the number of participants was odd in some study sessions) to ensure that System Li and System Comparison were used by similar numbers of participants.
On arrival, students were seated randomly. They were given consent forms and short
instructions for the study. The instructions included an introduction to the goal of the study and the task description mentioned in the previous subsection. We used the
The goal of this study is to evaluate an intelligent tutoring system (ITS). This ITS provides conversational assistance to students during Java programming. It is important to note that this is an experimental system, which is one of the few research projects that attempt to implement a dialogue system for a complex domain like Java programming. The system may fail to answer some of your questions. It is important to keep in mind that the goal of this study is to evaluate the dialogue system.
This system is designed to assist you through Java programming problems. You can ask questions about the programming problem, such as: “Is it correct?”, “What should I do next?”, “Where should I start from?”. You can also ask questions about “Is my for loop correct?”, “How to declare an array?”
While you are interacting with the conversational agent, you will be working on the following problem.
Postal Bar Codes
The Problem: For faster sorting of letters, the United States Postal Service encourages companies that send large volumes of mail to use a bar code denoting the ZIP code. Using the skeleton GUI program provided for you, you will complete this lab with code to actually generate the bar code for a given zip code.

More About Bar Codes: In postal bar codes, there is a full-height frame bar on each end (and these are drawn automatically by the program provided for you; you don't have to write code to draw these). Each of the five encoded digits is represented by five bars. The five encoded digits are followed by a correction digit.
Figure 8-1. A short instruction with the task description.
written instructions to maintain consistency among all of the different study sessions.
After reading the consent form and the instructions, participants were asked to complete a
pre-survey about their attitudes toward programming.
The Correction Digit: The correction digit is computed as follows: add up all digits, and choose the correct digit to make the sum a multiple of 10. For example, the ZIP code 95014 has sum of digits 19, so the correction digit is 1 to make the sum equal to 20.

What's Already Written? You can see what parts of this program are already written by running the file Main.java. When you do, you should see output like the image below, with a blank zip code slot. You can enter a zip code, and you should see that no bar code is generated (except the first and last full bars which are required for all bar codes).

What's Your Task? Your job is to extract the five-digit zip code from the user's input. The PostalFrame class is the one which handles this task. The only method which you must complete is: extractDigits(). For extractDigits(), you will need to create a variable in the method which stores the zip code as separate digits.

Some Helpful Information: If you can't remember how to do something with the software, please refer to the reference sheet on your desk.
Figure 8-2. A short instruction with the task description (continued).
After the pre-survey, the participants had 40 minutes to work on the programming
task with the assistance of the tutorial dialogue systems they were assigned to.
When participants finished the programming task or 40 minutes had passed, they
were given a post-survey to evaluate the system usability and user engagement.
8.2.4 Data Collection
During the user study, we collected users’ pre-survey and post-survey results. The
pre-survey focused on students’ attitude toward programming, including whether they
viewed programming as an important skill, as well as self-reported programming skill
evaluation. The survey can be found in Appendix A. The post-survey included two parts.
The first part was a widely used instrument, the System Usability Scale (SUS) survey (Bangor et al., 2008), which assesses the usability of a system. The SUS survey contained 10 questions which
reflected users’ evaluation of a system’s usability. The score of each completed SUS survey
ranged from 0 to 100. A higher score indicated better usability. Bangor et al. calculated the mean
SUS score of nearly 3500 surveys in their past 273 studies, which suggested a system
with a SUS score above 70 had better-than-average usability (Bangor et al., 2009). The
second part was the User Engagement Scale (UES) survey, which contained 30
questions. A complete post-survey can be found in Appendix B.
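For reference, the standard SUS scoring rule, which maps the ten 1-5 Likert responses onto the 0-100 scale, can be written as follows; this is the published scoring procedure rather than code from this project.

    def sus_score(responses):
        """responses: the ten 1-5 Likert ratings, in questionnaire order."""
        assert len(responses) == 10
        total = 0
        for i, r in enumerate(responses, start=1):
            total += (r - 1) if i % 2 == 1 else (5 - r)   # odd items positive, even negative
        return total * 2.5    # scale the 0-40 raw sum onto 0-100

    # sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])  ->  75.0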
Besides survey data, I also collected the textual dialogues between participants and
the dialogue systems. A dialogue excerpt is shown in Table 8-1. In the 41 sessions,
there were 2641 utterances collected, 1292 of which were user utterances. There were 64.4
utterances on average in each session, and 31.5 of them were user utterances. In these 41
sessions, the maximum number of utterances was 154, and the minimum number was 17.
The maximum number of user utterances was 76, and the minimum was 8.
Ultimately, 22 student participants used System Li, and 19 participants used
System Comparison.
Users’ programming actions were logged into a local database. These actions include
login, logout, typing, and selecting. These actions were saved in the database as JSON strings. The format of a user action is illustrated in Table 8-2. For example, for a
typing action, the time stamp of the action, the added text, and the line where the action
happened were logged into the database for further analysis.
The reference resolution events that occurred in the dialogue system were also saved into
the database. Each record contained a referring expression, its semantic segments, the
candidate list and the compatibility probability of each candidate.
Table 8-1. An excerpt dialogue between a user and the Virtual TA.
Virtual TA:  Hi, I'm your virtual TA. I will be helping you on this Java project.
User:        does extratDigits method input anything?
Virtual TA:  This Java program converts a zip code for US mail into a visual barcode. When you run the program, it asks the user to enter a zip code. The program will then convert it into a barcode and display it. But it doesn't work yet! To make it work, your task is to complete the extractDigits method. It needs to convert an integer zip code into an array of individual digits, and return that array. Hint: Making the int into a String first is a convenient way to get its individual digits.
Virtual TA:  The extractDigits method is located in the PostalFrame class, which should have been opened for you. You are welcome to view the other code, but you only need to work on the extractDigits method.
User:        what do I need to do in the extractDigits method?
Virtual TA:  You will extract the five digits of a zipcode. Now we need to get each of the 5 digits of that int. Using a string and then getting the individual characters from it is a good way to do that.
User:        where is teh zip code
Virtual TA:  When the program is run, the user types in a zipCode. The code to get the zipCode from the user is already complete. The zip code the user entered is stored as a private variable, int zipCode on line 34 in PostalFrame.
User:        how to parse a string into an array
Virtual TA:  You can use the syntax intVariable + "" to convert an integer to a String variable.
. . .
Table 8-2. An example user action saved in the database.
type: TYPING
timestamp: 1523261514283
addedText: int
lineNum: 79
offset: 1918
currentCode: "import javax.swing.*;\nimport . . . "
referents: []
isRRResult: False
prob: 0.0
affectedLine: "\tint [] digits = int"
In total, there were 18775 records collected throughout the 41 study sessions. The
number of reference resolution events was 1486. An example reference resolution event is
shown in Table 8-3.
As shown in Table 8-3, “noun phrase” is the referring expression; “candidates” field
lists all of the generated candidates from the parallel Java program; “probs” field lists
Table 8-3. An example reference resolution event saved in the database.
{
  noun phrase: charat method
  candidates: [
    {u'category': u'METHOD', u'line number': 81, u'name': u'charAT', . . . },
    {u'category': u'METHOD', u'line number': 40, u'name': u'PostalFrame', . . . },
    {u'category': u'METHOD', u'line number': 41, u'name': u'setSize', . . . },
    . . . ]
  probs: [0.9741117181861638, 0.00036208246341969553, 0.00036208246341969553, . . . ]
  referent: {u'category': u'METHOD', u'line number': 81, u'name': u'charAT', . . . }
  prob: 0.974111718186
  isRRResult: true
  timestamp: 1523546180624
}
the compatibility probability between the referring expression and all of the candidates;
“referent” is the system-selected referent; “prob” is the compatibility probability between
the referring expression and the selected referent.
8.3 System Usability Evaluation
To evaluate the usability of the implemented dialogue systems with two different
reference resolution approaches, student participants of the research study were asked
to complete a post-survey which contained an instrument widely used to assess system
usability (Bangor et al., 2008). The two groups' user responses can be found in Tables B-1
and B-2 in Appendix B. The two systems had a very close mean system usability scale
(SUS) score. The average SUS score of 22 student participants who used System Li was
66.67. The average SUS score of 19 participants who used System Comparison was 68.77.
To interpret SUS scores, Bangor et al. argued that a system with a SUS score over 70 is acceptable, as shown in Figure 8-3. By this criterion, the systems implemented in this project are marginal in the acceptability range, but very close to acceptable.
Figure 8-3. System usability score interpretation.
A t-test showed no significant difference between the two groups' SUS scores (p-value = 0.361).
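Such a comparison can be run, for example, with SciPy; the dissertation does not specify which t-test variant was used, so the Welch's t-test shown here is an assumption.

    from scipy import stats

    def compare_groups(scores_li, scores_comparison):
        """Welch's t-test on the two groups' scores; returns the p-value."""
        t, p = stats.ttest_ind(scores_li, scores_comparison, equal_var=False)
        return p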
8.4 User Engagement Evaluation
Next we examined our hypothesis about the two systems’ user engagement. Besides
the SUS, the post-survey also measured user engagement using the User Engagement Scale (UES) instrument (O'Brien et al., 2018). This instrument included 30 questions.
Participants who used System Li had an average UES score of 11.80, and students who
used System Comparison had an average UES score of 12.27. A complete table of user
responses from the two groups can be found in Tables B-1 and B-2 in Appendix B. A t-test
showed no significant difference between the two groups on UES scores (p-value=0.236).
The number of user utterances also reflected users’ engagement with the dialogue
system. System Li averaged 30.8 user utterances per session, and System Comparison averaged 32.4. There was not a significant difference between them (p-value=0.382).
8.5 Online Reference Resolution Evaluation in Tutorial Dialogue Systems
In Chapter 6, we compared two reference resolution approaches with offline
evaluation. We manually tagged the referring expressions and their referents in the
parallel Java source code. In an online dialogue system, the system automatically
extracted referring expressions, generated candidates and extracted features while having
a conversation with a human user. Without human intervention, errors in one step could
propagate to later steps in the reference resolution pipeline. We would like to examine if
the reference resolution approach with learned semantics still had a higher accuracy in
such an online dialogue system.
In the proposal, we hypothesized that System Li would have a higher reference
resolution performance. To evaluate these two systems’ reference resolution accuracy, we
analyzed the logged reference resolution actions performed by the dialogue systems.
As mentioned in Section 8.2.4, all of the reference resolution events performed
by the dialogue systems were logged in a local database. Each reference resolution
event contained several fields: the referring expression, the candidate list, the compatibility probability for each candidate, and the selected referent from the candidate list. An example reference
resolution event is illustrated in Table 8-3.
The reference resolution events were manually evaluated to calculate accuracies for
the two systems. As discussed earlier, the system selected referring expressions from
noun phrases in a user utterance. The process of reference resolution is illustrated in
Figure 8-4. For a user utterance, the dialogue system first found all the noun phrases
in the utterance. It then filtered all of the extracted noun phrases using a set of rules.
Noun phrases that could never be a referring expression for objects in the Java code, such
as “you” and “me”, were filtered out. Then, the system attempted to find a “referent”
in the parallel Java source for the remaining noun phrases as if they were all referring
expressions. Finally, the system used the compatibility probability (as shown as f in the
figure) between the remaining noun phrases and their “referents” to decide which noun
phrases were real referring expressions. The threshold for the compatibility probability
was set to 90% empirically.
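The pipeline can be sketched compactly as follows; extract_noun_phrases and select_referent (a thresholded selector like the one sketched in Section 7.4.1, with the classifier already bound in) are assumed helpers, and the rule-based stop list is abridged.

    STOP_NPS = {"i", "you", "me", "it", "we"}   # can never refer to Java code

    def identify_referring_expressions(utterance, candidates,
                                       extract_noun_phrases, select_referent):
        """Return (noun phrase, referent, probability) triples that clear
        the 0.90 compatibility threshold."""
        results = []
        for np in extract_noun_phrases(utterance):
            if np.lower() in STOP_NPS:
                continue                           # rule-based filtering
            referent, prob = select_referent(np, candidates)
            if referent is not None:               # cleared the threshold
                results.append((np, referent, prob))
        return results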
The researcher went through all of the reference resolution events which identified
a referent with 90% or higher compatibility probability. There were 417 such reference
resolution events, which was 28.1% of all the logged reference resolution events. System Li
had 320 reference resolution events in this class, and System Comparison had 97.
[Figure omitted. It traces the utterance "I think I should start from the actionPerformed method by creating an array." through noun phrase chunking, rule-based noun phrase filtering, semantic interpretation, and referent identification; "the actionPerformed method" clears the f > 0.9 threshold (prob = 0.97, referent: the method actionPerformed on line 71) and is accepted as a referring expression, while "an array" (prob = 0.58) falls below the threshold and is rejected.]

Figure 8-4. Reference resolution process in the dialogue system.
For each reference resolution event within this class, the identified referent was
manually examined within the involved programming context to determine if the result
was correct.
System Li’s reference resolution accuracy for this set of referring expressions was
21.6%, and System Comparison’s was 19.6%.
Both systems had a much lower reference resolution accuracy on the selected referring expressions than their offline versions, which achieved 61.6% and 51.2%, respectively. The logged reference resolution results were closely examined for reasons
which may shed some light on building online reference resolution approaches in the
future. To have a more accurate understanding of the reference resolution performance
of the dialogue system, I collected all of the reference resolution events, regardless of
the compatibility probability. There were 1486 reference resolution events logged in
the 41 study sessions. Because of the way the system selected referring expressions,
most of these logged reference resolution events were performed by the system on noun
phrases to identify referring expressions. I manually tagged 169 referring expressions for
System Li’s data, and 158 referring expressions for System Comparison’s data. I also
manually identified their referents in the Java code. The result showed System Li had a
63.3% reference resolution accuracy on these 169 manually tagged referring expressions.
System Comparison had an accuracy of 44.9% on the 158 referring expressions. These
accuracies matched the performance of the two reference resolution approaches in their
offline setting.
It appears that the main reason for the poor performance of the online reference
resolution approaches was the inaccurate referring expression extraction. While extracting
referring expressions from all of the recognized noun phrases in a user utterance, we
combined a rule-based approach and the classification result from the classifier we used to
calculate compatibility probabilities between referring expressions and their candidates.
The intuition of using this classifier was that when a noun phrase is compatible with
an entity in the Java code, then it is likely to be a referring expression. However, this
combined approach did not work as expected in practice. When referring expressions in user utterances cannot be accurately identified, reference resolution accuracy directly suffers.
To further illustrate the reasons for the poor referring expression identification, we provide two examples of erroneous referring expression identification in Tables 8-4 and 8-5. In Table 8-4, the noun phrase "a string" was not a referring expression, since it
is not specifically referring to anything in the Java code. However, the user just created a
string in the Java code, called "scode". The noun phrase "a string" carried the attribute VAR_TYPE = "string". Recall that the reference resolution approach takes the semantic
features, dialogue history features and the behavior history features as inputs. Since the
“scode” was just created, the behavior history features suggested that “scode” had a high
probability of being the referent. Also, "scode" was a string variable, so it had a high compatibility probability (0.939) with the noun phrase "a string". This caused a
false positive instance. Similarly, in the false negative example shown in Table 8-5, the noun phrase "the for loop" was a referring expression, and it referred to a for statement in the user's Java program. The reference resolution was performed correctly, but since the for statement had not recently been edited or mentioned, the dialogue history and behavior history features suggested a low compatibility probability (0.791), below the threshold.
From these negative examples, we found that using only the compatibility probability to identify referring expressions is insufficient. The lexical features of referring
expressions and their enclosing utterances also play a key role in referring expression
identification. These features should be considered while building a referring expression
identification classifier.
Table 8-4. A false positive example of referring expression identification.
Utterance:    "what to do next if I have a string of the zipcode"
Noun phrase:  "a string"
Referent:     {category: VARIABLE, line number: 76, name: scode, . . . }
Probability:  0.939
Table 8-5. A false negative example of referring expression identification.
Utterance:    "Is the for loop correct?"
Noun phrase:  "the for loop"
Referent:     {category: STATEMENT_FOR, line number: 78, name: for, . . . }
Probability:  0.791
CHAPTER 9
DISCUSSION
This chapter discusses some of our observations in building the dialogue systems and
conducting the user study.
9.1 Null Results
The previous chapter described the research study to evaluate the two implemented
dialogue systems. We did not find significant results for the hypotheses on user satisfaction
and user engagement. One of the reasons could be the low accuracy of the online
reference resolution approach, which was caused by the referring expression identification
functionality.
Another reason could lie in the difference between human-computer dialogues and
human-human dialogues. We compared the human-human dialogues in the Ripple corpus
and the human-computer dialogues collected in this project by manually annotating the
number of utterances and number of referring events in each session. As shown in Table
9-1, the average number of utterances per session in the Ripple corpus was 130.2, compared with only 64.4 in the human-computer dialogues we collected. In the Ripple corpus, each session lasted about 50-55 minutes, and in the study conducted in this
project, each session lasted about 40 minutes. There is a huge difference between these
two kinds of dialogues in terms of utterance frequencies. In addition, the human-human
dialogues had 0.44 referring events per utterance on average, and human-computer
dialogues only had 0.12. These numbers suggested a different communication pattern in
human-human dialogues and human-computer dialogues. This difference may suggest that
reference resolution plays a different role in human-computer dialogues compared with
human-human dialogues. Further research is needed to explain this phenomenon.
Also, as argued at the beginning of this dissertation, reference resolution plays a key
role in natural language dialogue understanding. However, in a natural language dialogue
system for a complex domain like Java programming, there are many other modules that
influence the performance of the dialogue system, such as the dialogue act classifier, the utterance topic classifier, and the user intention recognizer. Reference resolution takes effect together
with these modules as an integrated system. The improvement of a single module may not
necessarily increase the performance of the whole system.
Table 9-1. A comparison between human-computer dialogues and human-human dialogues.

                  Average #Utt    #RefExp / #Utt
Human-computer    64.4            0.12
Human-human       130.2           0.44
9.2 Data-driven Approach in Building Dialogue Systems
The dialogue systems implemented in this project used data-driven approaches for
most of the essential functionalities, such as dialogue act classification, utterance topic
classification, POS tagging, noun phrase chunking, and reference resolution. Some of these
models are less closely related to the domain of the dialogue system. For example, we can
train a noun phrase chunking model for the dialogue system using training data from the
Wall Street Journal corpus, since the grammar of the English used in tutorial dialogue for Java programming is very similar to that of news text. However, some of the models are more domain-specific, which means they need to be trained using
domain-specific data.
Due to the availability of the Ripple corpus that was described in Chapter 3, we can
use its human-human dialogues as training data to build dialogue act classification and
topic classification models for the dialogue systems in this project. The dialogue systems
in this project support a programming task that is almost the same as that in the Ripple
corpus. So, we can take advantage of this similarity. We looked into the Ripple corpus to
discover the topics most frequently mentioned by the students and the tutors while they approached the programming task, and built topic classifiers for these topics to help
the system better understand user utterances. However, this data-driven approach suffers
from data sparsity problems. For example, one of the important steps in the programming
task is converting a character digit into an integer. When students extract a character
digit from a zip code, they need to convert the character digit into an integer and add
the integer into an array. However, we only found 8 utterances in the Ripple corpus that
are related to converting a character to an integer in Java. It is very hard to learn an accurate classifier for this specific topic given such a small set of training utterances. In addition, there are also some topics that are unique to our dialogue systems, for which we cannot find training data in the Ripple corpus. The COP
3502 class at the University of Florida uses an integrated development environment called
IntelliJ (https://www.jetbrains.com/idea/) to teach Java programming, while the user interface of our dialogue systems is
based on Eclipse. So, students may ask the dialogue system how to run their program. In
this project, we manually created training utterances for these topics to alleviate the data
sparsity problem, but could not totally eliminate it.
9.3 Understanding Users' Java Programs - A Challenge in Building Dialogue Systems for Java Programming
One of the challenges in building a tutorial dialogue system for Java programming lies in understanding the user's Java program. Before answering a user's question that is
related to her Java program, the dialogue system needs to understand the context of
the question. The user’s current program is arguably the most important contextual
information in this case. However, automatic interpretation of a user’s Java program is
a very challenging task. There are two levels of interpretation of a user's Java program: syntactic interpretation and semantic interpretation. The dialogue system's ability to
interpret the user’s Java program directly limits the system’s ability to respond to the
user’s questions regarding her program. We discuss this limitation later in more detail in
this section.
The goal of syntactic interpretation is to determine whether a user's Java program is syntactically correct. In addition, it identifies items such as variable declarations, variable
assignments and method calls. Correctly identifying these operations in a user’s Java
source code is essential to interpret a user’s Java program semantically, i.e. understanding
which step toward the solution the user is working on. For example, when the user
declares an integer array at the beginning of the “extractDigits()” method, the user’s
intention is probably creating an array to hold the 5 integers of the input zip code.
We implemented Java code syntactic parsing using an abstract syntax tree (AST)
parser. When the program is syntactically correct, the parser can generate a parse of the user's Java source code without problems. However, it is more likely than not that the
program is syntactically incorrect when the user needs help from the dialogue system.
For example, the student may ask a question before finishing typing a line of source
code. In this case, the AST parser fails to parse Java source code with syntax errors (such as an incomplete line of Java code). To address this problem, we created a rule-based parser to interpret the user's Java program. This rule-based parser contains a set of patterns. We match the user's program against these patterns to identify the status of the user's progress toward the solution. However, the number of conditions that this rule-based parser can identify bounds the number of conditions that the dialogue system can respond to regarding the user's program. If the source code parser cannot "perceive" a problem in the user's
program, the dialogue system cannot reasonably comment on it. This “granularity” of the
system’s perception directly determines the “granularity” of the dialogues that the system
could conduct. Mulkar-Mehta et al. argued that the "granularity" of a natural language discourse is "the level of detail of description of an event or object" (Mulkar-Mehta et al., 2011).
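The two-level strategy can be sketched as a try-then-fall-back control flow; the third-party javalang parser and the identify_step rules sketched in Section 7.5 are assumptions for illustration, not the system's actual implementation.

    import javalang   # third-party Java parser; an assumption for this sketch

    def interpret_program(source, identify_step):
        """Try a full AST parse; fall back to line-level rules on syntax errors."""
        try:
            return ("ast", javalang.parse.parse(source))   # syntactically valid
        except Exception:                                   # e.g., an incomplete line
            steps = [identify_step(line) for line in source.splitlines()]
            return ("rules", [s for s in steps if s is not None])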
CHAPTER 10
CONCLUSION
This dissertation has reported on the development of a tutorial dialogue system using
an innovative reference resolution approach that I developed (Li and Boyer, 2016). In
Chapter 6, we empirically evaluated this reference resolution approach with an existing
human-human dialogue corpus for Java programming. I then implemented a tutorial
dialogue system for Java programming and deployed the reference resolution approach to
evaluate it in real time with human subjects.
10.1 Hypothesis Revisited
This dissertation focuses on evaluating my novel reference resolution approach in
a tutorial dialogue system. We are interested in how well my approach performs in a
real-time dialogue system compared to a comparison condition, and its impact on user
satisfaction and user engagement when interacting with the system. To serve this goal, we
implemented two tutorial dialogue systems with different reference resolution approaches,
System Li and System Comparison. We had three hypotheses:
Hypothesis I: A reference resolution approach that is more accurate offline is also more accurate in a real-time dialogue system.
Hypothesis II: More accurate reference resolution leads to higher user satisfaction in a
tutorial dialogue system.
Hypothesis III: More accurate reference resolution leads to higher user engagement in
a tutorial dialogue system.
The first hypothesis was confirmed, but we did not find evidence for the second
and the third hypotheses. The performance of a dialogue system is determined by the
performance of multiple different modules. Improving reference resolution accuracy in the
implemented tutorial dialogue system may not directly increase the system performance.
Identifying the "bottleneck" module of the tutorial dialogue system will be an interesting
research question.
Summary. This dissertation has presented our work on automatic referring expression
extraction, semantic labeling of referring expressions, and a reference resolution approach
combining learned semantics and contextual features of the dialogue. The presented
reference resolution approach was evaluated using an existing human-human tutorial
dialogue for Java programming. Then, I presented the implementation of a tutorial
dialogue system for Java programming. I first defined the functionalities the system
requires and then described its architecture and the implementation of its module. To
evaluate the impact of our novel reference resolution approach within the implemented
tutorial dialogue system, I implemented two different versions of reference resolution
approaches and conducted a user study with 41 undergraduate student participants. We
did not find a significant difference on user satisfaction (p=0.361) or user engagement
(p=0.236) between the two systems with different reference resolution approaches.
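For readers who wish to reproduce this kind of comparison, the sketch below shows one way such a two-group test could be computed with the Apache Commons Math library (org.apache.commons.math3). The score arrays are hypothetical placeholders rather than the study data, and the exact statistical test behind the reported p-values is not restated here.

import org.apache.commons.math3.stat.inference.TTest;

// A minimal sketch of a two-sample comparison between the two conditions.
// The arrays below are hypothetical placeholder scores, not the study data.
public class ConditionComparison {
    public static void main(String[] args) {
        double[] systemLi = {7.2, 5.5, 8.0, 6.4, 7.1, 6.8};
        double[] systemComparison = {6.9, 6.1, 7.4, 5.8, 6.6, 7.0};

        // Two-sided p-value for the difference in means; commons-math3
        // implements this as a t-test without assuming equal variances.
        double p = new TTest().tTest(systemLi, systemComparison);
        System.out.println("p = " + p);
    }
}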
Contributions. This project makes two main contributions to the natural language dialogue system research community. First, the implemented tutorial dialogue system is one of the first to support a complex domain like Java programming, in which the entities and the environment change dynamically as a result of the user's actions. Second, this work is the first to investigate real-time reference resolution approaches in such a complex situated dialogue system. We examined both the performance of the reference resolution module and the impact of different reference resolution approaches on the performance of the dialogue system.
To push dialogue systems toward assisting people with increasingly complex tasks, we need to address several challenging problems, including reference resolution. This dissertation investigates this challenge within a task-oriented dialogue system in a complex domain, and it is a step toward practical dialogue systems that support users in more complex domains.
10.2 Limitations
This research project has several limitations.
First, the scale of the user study is limited, which may be one reason we obtained null results for the hypotheses on dialogue system performance. Given limited time, we recruited 43 undergraduate students from the COP 3502 course at the University of Florida; more data could lead to more confident results.
Second, according to the participants' feedback, the system's performance is limited by its ability to accurately understand users' utterances. Participants sometimes needed to rephrase their questions multiple times before the system understood them. More training data would support a more accurate topic classifier, which would in turn improve the system's natural language understanding.
Thirdly, the system’s performance was also limited by the Java program parser. With
a more accurate Java source parser, the system could identify more fine grained errors in
users’ program, and further give more accurate feedback.
10.3 Future Work
This dissertation research investigated the performance of reference resolution approaches in a real-time tutorial dialogue system for Java programming. Both of the evaluated approaches are still far from perfectly identifying users' referents. According to the result analysis, a more accurate referring expression identification approach is required to achieve better reference resolution performance. Another promising research direction is to investigate additional features from the dialogue and the situated environment to inform the reference resolution module; for example, the verbs in the same utterance could be a good feature. In addition, coreference relationships also occur in situated dialogue, and it will be interesting to consider reference resolution and coreference resolution jointly. The tutorial dialogue system can be viewed as a starting point for a series of better-performing dialogue systems, which could be developed by refining some of the modules in the existing system. The tutorial dialogue system could also benefit from input from the introductory Java course's instructors, who presumably have a better understanding of the system's users' Java knowledge and could help the system better adapt to users' needs. Finally, the data-driven system's performance was limited by the lack of training data; having the system learn from its interactions with users will be an interesting research question.
APPENDIX A
PRE-SURVEY
Name
UFID
Please indicate how much you agree or disagree with the following statements. (Response options: Strongly disagree, Disagree, Neutral, Agree, Strongly agree.)

Generally, I have felt secure about attempting computer programming problems.
I am sure I could do advanced work in computer science.
I am sure that I can learn programming.
I think I could handle more difficult programming problems.
I can get good grades in computer science.
I have a lot of self-confidence when it comes to programming.
I'll need programming for my future work.

Figure A-1. Pre-survey.
Please indicate how much you agree or disagree with the following statements. (Same response options as in Figure A-1.)

I study programming because I know how useful it is.
Knowing programming will help me earn a living.
Computer science is a worthwhile and necessary subject.
I'll need a firm mastery of programming for my future work.
I will use programming in many ways throughout my life.
I like writing computer programs.
Programming is enjoyable and stimulating to me.
When a programming problem arises that I can't immediately solve, I stick with it until I have the solution.
Once I start trying to work on a program, I find it hard to stop.
When a question is left unanswered in computer science class, I continue to think about it afterward.
I am challenged by programming problems I can't understand immediately.

Figure A-2. Pre-survey.
ID  Q1-Q6          Q7-Q12         Q13-Q18
2   4 2 4 4 4 3    5 4 5 4 5 5    4 4 4 4 3 3
3   3 3 4 4 3 2    2 3 3 3 1 3    2 2 4 2 2 4
4   1 1 3 1 3 2    2 4 3 4 3 3    3 3 2 2 2 4
5   5 4 5 4 5 5    4 4 4 4 4 3    5 5 5 4 4 4
6   3 3 4 3 3 3    4 4 4 4 4 3    4 4 4 4 3 4
7   4 4 5 4 4 5    5 5 5 5 5 5    5 5 4 5 4 4
8   2 2 4 3 2 2    5 5 5 1 5 5    4 4 3 4 3 4
9   3 3 4 3 3 3    5 5 4 4 5 4    5 5 4 4 4 4
10  5 4 5 4 4 4    5 5 5 5 5 4    4 4 4 3 4 4
11  1 1 4 1 1 2    3 3 3 4 2 3    1 1 4 2 1 5
12  2 1 4 4 2 2    2 4 2 4 2 2    4 3 4 2 2 4
13  3 2 4 3 2 3    2 2 3 3 1 1    3 4 3 4 3 3
14  2 3 4 2 3 1    4 4 3 3 4 3    3 3 2 4 4 4
15  3 2 4 3 2 2    3 4 4 4 3 3    4 4 3 3 4 4
16  4 4 5 4 4 3    5 5 5 5 5 5    4 4 4 5 5 4
17  2 2 4 2 2 1    5 5 5 5 4 3    3 4 4 3 3 4

Table A-1. Complete pre-survey results for students who used System Li.
ID  Q1-Q6          Q7-Q12         Q13-Q18
18  3 3 4 4 4 2    4 4 4 4 4 4    4 4 4 5 3 4
19  4 4 4 3 4 4    4 4 4 4 4 4    4 4 4 4 4 4
20  4 4 4 4 4 4    4 4 4 4 4 4    3 3 4 2 2 4
21  2 3 4 2 4 1    5 5 5 5 5 5    4 4 4 5 2 5
22  3 3 4 3 3 4    5 5 5 5 5 5    3 4 5 3 4 5
23  5 4 5 5 5 5    5 5 4 4 5 5    5 5 5 5 5 5
24  3 3 4 2 4 3    4 4 4 4 3 4    4 4 3 2 2 3
25  3 4 5 4 4 3    5 5 5 5 4 4    5 5 4 4 4 4
26  3 3 5 2 4 2    5 5 5 5 5 5    4 4 4 4 4 4
27  4 3 4 4 4 3    3 4 2 4 3 4    4 4 3 4 3 4
28  2 3 4 3 3 2    4 5 5 5 5 5    4 4 4 4 4 5
29  3 2 4 4 4 3    4 5 4 4 4 4    4 4 4 4 5 4
30  2 2 4 2 3 3    5 5 5 5 5 5    5 5 4 3 4 4
31  2 2 4 2 3 2    2 4 4 4 4 3    3 2 3 2 2 2
32  3 3 3 3 3 3    4 4 4 4 4 4    3 3 3 4 3 4

Table A-2. Complete pre-survey results for students who used System Comparison.
APPENDIX B
POST-SURVEY
Name
UFID
Each of the following items was rated on an 11-point scale from 0 (Strongly Disagree) to 10 (Strongly Agree).

I think that I would like to use this system frequently.
I found the system unnecessarily complex.
I thought the system was easy to use.
I think that I would need the support of a technical person to be able to use this system.

Figure B-1. Post-survey.
I found the various functions in this system were well integrated.
I thought there was too much inconsistency in this system.
I would imagine that most people would learn to use this system very quickly.
I found the system very cumbersome to use.
I felt very confident using the system.
I needed to learn a lot of things before I could get going with this system.

Figure B-2. Post-survey.
This tutoring system is attractive.
This tutoring system was aesthetically appealing.
I liked the graphics and images used in this tutoring system.
This tutoring system appealed to my visual senses.
The screen layout of this tutoring system was visually pleasing.

Figure B-4. Post-survey.
Learning with this tutoring system was worthwhile.
I consider my experience a success.
Doing this task did not work out the way I planned.
My experience was rewarding.
I would recommend this tutoring system to my friends and family.
I lost myself in this task.

Figure B-5. Post-survey.
I was so involved in this task that I lost track of time.
I blocked out things around me while I was working with this tutoring system.
When I was doing this work, I lost track of the world around me.
The time I spent on this task just slipped away.
I was absorbed in the task.
During this experience, I let myself go.

Figure B-6. Post-survey.
I was really drawn into finding the solutions.
I felt involved in this task.
This experience was fun.
I continued to use this tutoring system out of curiosity.
This tutoring system incited my curiosity.

Figure B-7. Post-survey.
I felt interested in this tutoring system.
I felt frustrated while using this tutoring system.
I found this tutoring system confusing to use.
I felt annoyed while using this tutoring system.
I felt discouraged while using this tutoring system.
Using this tutoring system was mentally taxing.

Figure B-8. Post-survey.
This experience was demanding.
I felt in control of the experience.
I could not do something I needed to do with this tutoring system.

Figure B-9. Post-survey.
ID Q1-Q20 Q21-Q41
1 7 1 8 2 8 0 8 0 7 2 5 5 7 5 5 7 5 6 8 8
7 6 5 5 8 8 8 0 5 7 6 6 7 5 0 2 2 2 2 7 52 3 2 3 2 2 2 2 5 2 2 5 5 5 5 5 5 5 5 5 5
2 2 5 2 5 2 2 2 5 2 8 8 8 8 8 9 7 8 5 5 93 5 2 9 0 6 4 10 2 10 1 9 9 9 9 9 9 9 9 8
4 4 4 4 5 5 8 0 8 6 5 6 7 2 1 1 1 1 2 8 34 9 1 8 1 8 3 9 1 9 2 2 2 8 3 3 2 3 7 7 9
7 5 4 5 5 9 10 2 9 9 10 10 10 2 2 2 2 2 2 8 55 3 2 4 3 4 4 6 7 4 5 6 6 7 7 7 7 7 8 8 6
6 6 5 5 6 5 3 6 4 4 5 5 5 7 6 7 4 4 5 5 86 4 3 6 3 5 4 3 6 4 5 4 4 3 3 4 6 6 6 5 4
7 7 7 6 7 4 5 6 5 4 6 6 6 7 3 5 3 4 2 6 67 10 1 8 0 9 3 9 1 6 1 2 2 7 6 3 8 4 4 7 5
9 8 5 6 7 7 7 3 5 8 4 6 6 6 1 5 5 2 2 5 68 2 5 2 7 3 6 3 7 1 5 2 2 5 6 4 5 6 6 7 8
1 3 5 4 5 2 3 3 5 1 7 7 6 7 7 7 5 3 4 7 109 8 0 10 0 7 8 10 5 6 0 10 10 10 10 10 10 10 10 10 10
5 4 3 3 10 8 10 0 10 8 8 7 7 5 0 7 0 0 5 6 810 5 1 6 2 6 6 6 3 4 1 1 3 3 3 3 6 3 6 6 6
5 6 5 5 6 5 6 1 5 5 4 6 6 6 2 4 1 1 1 7 711 6 2 6 6 5 5 5 3 5 3 4 5 6 5 6 6 4 7 5 5
7 7 6 6 7 7 6 5 5 6 6 6 6 5 3 6 4 3 3 6 512 5 2 7 3 6 4 7 5 6 4 4 4 2 4 4 4 5 7 7 5
4 5 6 5 5 6 2 9 5 6 4 6 6 8 2 8 4 1 1 7 713 3 2 7 7 3 8 7 3 2 7 5 7 8 7 7 8 8 8 8 8
3 5 1 1 7 1 4 6 3 3 10 10 10 8 3 2 3 3 5 5 1014 7 5 7 5 3 8 8 2 5 2 7 6 7 7 5 5 5 7 7 7
6 2 2 2 2 6 8 10 8 8 6 6 6 9 3 7 7 5 6 6 915 10 5 8 4 7 6 9 2 4 5 5 9 7 8 8 8 8 9 9 10
10 10 10 10 10 9 9 1 7 8 10 10 10 3 2 2 2 2 2 8 716 10 0 9 0 7 3 9 3 10 2 5 3 7 2 4 6 4 4 7 8
10 7 7 6 7 10 10 1 9 10 8 6 6 6 2 6 2 2 2 7 617 6 3 7 6 4 6 8 2 6 0 3 8 7 7 8 7 8 7 8 7
7 7 7 8 7 7 6 5 7 7 8 8 7 5 5 3 2 3 3 6 818 8 6 6 4 6 7 6 6 4 5 6 6 6 6 6 4 6 6 6
5 5 5 5 6 6 6 6 6 4 6 6 6 6 6 6 6 5 5 6 619 8 1 8 4 5 3 7 2 2 4 7 7 7 5 6 6 5 9 9 8
6 6 5 3 8 7 5 7 6 6 8 6 7 7 4 7 5 3 6 6 1020 4 3 5 3 4 2 7 2 6 1 5 6 5 5 5 5 5 6 4 6
6 6 6 7 7 6 5 6 5 5 5 6 6 7 4 7 4 1 8 6 921 7 2 8 1 3 7 1 8 6 3 3 6 2 3 6 2 7 6 7
7 8 6 7 8 7 7 2 6 7 7 7 8 3 2 6 5 2 7 7 822 8 4 7 2 8 5 8 4 7 3 3 4 6 5 6 6 6 6 5 6
7 6 7 7 7 6 6 6 6 6 8 8 8 5 4 4 3 3 2 6 4
Table B-1. Complete post-survey results for participants who used System Li.
ID Q1-Q20 Q21-Q41
23 3 2 8 5 4 6 6 6 3 2 5 6 7 7 7 7 5 8 7 5
7 7 7 7 7 5 5 5 7 4 6 7 7 10 3 8 8 4 2 4 724 8 1 10 0 8 5 10 0 10 0 1 4 8 7 5 7 5 7 9 7
10 5 3 3 9 9 7 1 7 9 10 10 10 1 0 3 0 0 0 9 825 6 1 8 1 4 6 7 2 8 1 1 3 3 1 1 4 5 7 8 7
8 4 3 4 5 7 7 2 7 5 10 10 10 1 2 0 0 0 0 8 026 3 1 8 1 5 7 6 5 4 2 6 6 10 6 6 6 6 9 9 9
5 8 8 8 8 5 5 2 5 3 10 10 10 6 2 6 2 2 4 7 827 4 7 6 4 4 6 7 4 6 1 2 0 6 5 6 7 6 6 6 6
4 5 5 3 7 6 7 1 6 4 7 6 6 6 3 5 2 2 2 8 628 7 3 10 1 8 7 10 0 10 6 2 2 2 2 2 8 7 8 8 8
8 9 4 6 8 8 8 5 7 8 8 6 6 2 1 2 1 1 2 8 229 3 2 9 0 2 3 8 5 8 3 2 1 3 2 2 3 2 4 5 4
7 7 7 7 7 3 6 6 6 3 6 6 6 5 2 6 3 2 2 7 630 8 0 9 2 7 2 8 1 7 1 9 10 8 8 8 10 10 8 10 7
8 3 5 3 8 9 9 1 10 10 10 7 9 6 1 6 0 1 7 7 031 7 7 3 2 1 8 10 5 5 0 0 10 10 5 10 10 7 10 10 10
5 5 5 5 5 3 3 8 5 5 5 10 10 3 0 5 5 3 5 5 332 5 5 5 6 5 9 5 7 8 10 0 0 5 0 0 8 0 10 10 10
5 7 8 10 10 10 10 10 10 5 7 7 7 3 5 7 0 5 0 10 233 7 4 7 3 7 6 8 2 8 2 4 3 6 3 3 6 5 5 5 5
8 4 5 3 5 7 6 6 6 7 7 7 7 5 4 5 6 4 4 7 434 9 0 10 0 8 1 10 0 8 2 9 9 9 9 9 9 7 10 10 10
6 6 6 6 7 10 10 0 8 10 7 8 9 0 0 0 0 0 0 8 035 7 6 6 2 5 6 8 5 6 7 8 9 9 9 7 9 5 10 10 6
8 8 7 7 8 6 3 9 3 7 8 7 7 7 5 6 7 2 2 5 536 6 3 7 2 6 4 7 3 7 5 6 5 4 4 4 5 3 7 7 6
4 5 5 4 6 6 4 7 6 5 4 5 6 6 3 3 3 2 2 5 737 8 3 7 3 9 5 8 5 7 7 3 7 7 7 7 7 1 7 8 7
8 8 8 8 8 8 10 0 10 8 7 7 7 2 2 2 1 1 1 8 238 6 4 4 2 3 4 5 5 3 4 7 7 6 6 7 6 6 6 6 6
6 5 5 7 7 6 3 7 3 6 6 6 6 7 8 5 5 3 6 439 3 1 9 0 4 2 9 1 4 1 4 4 4 5 4 4 4 4 4 4
4 3 2 4 4 5 3 8 3 4 6 4 4 4 4 4 4 4 4 4 1040 10 3 9 4 9 3 9 4 8 2 4 3 3 3 3 3 3 3 4 5
9 8 6 7 7 9 8 3 8 8 8 8 8 5 3 2 2 1 0 4 641 7 3 7 5 7 3 7 3 5 6 7 5 5 5 5 5 5 5 5 5
5 4 4 4 5 6 5 6 5 6 7 7 7 7 4 7 5 5 4 4 6
Table B-2. Complete post-survey results for participants who used System Comparison.
REFERENCES
Ariel, Mira. “Referring and Accessibility.” Journal of Linguistics 24 (1988).1: 65–87.
Austin, J L. “How To Do Things With Words.” (1962).
Bangor, Aaron, Kortum, Philip T., and Miller, James T. “An Empirical Evaluation of the System Usability Scale.” International Journal of Human-Computer Interaction 24 (2008).6: 574–594.
Bangor, Aaron, Kortum, Philip, and Miller, James. “Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale.” Journal of Usability Studies 4 (2009).3: 114–123.
Blitzer, John, McDonald, Ryan, and Pereira, Fernando. “Domain Adaptation with Structural Correspondence Learning.” Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006). 2006, 120–128.
Boyer, Kristy Elizabeth. Structural and Dialogue Act Modeling in Task-Oriented Tutorial Dialogue. Ph.D. thesis, North Carolina State University, 2010.
Boyer, Kristy Elizabeth, Ha, Eun Young, Phillips, Robert, Wallis, Michael D., Vouk,Mladen A., and Lester, James C. “Dialogue Act Modeling in a Complex Task-OrientedDomain.” Proceedings of the 11th Annual SIGDIAL Meeting on Discourse and Dialogue.2010, 297–305.
Boyer, Kristy Elizabeth, Phillips, Robert, Ingram, Amy, Ha, Eun Young, Wallis,Michael D, Vouk, Mladen A, and Lester, James C. “Investigating the RelationshipBetween Dialogue Structure and Tutoring Effectiveness: A Hidden Markov ModelingApproach.” International Journal of Artificial Intelligence in Education (IJAIED) 21(2011).1: 65–81.
O'Brien, Heather L., Cairns, Paul, and Hall, Mark. “A Practical Approach to Measuring User Engagement with the Refined User Engagement Scale (UES) and New UES Short Form.” International Journal of Human-Computer Studies 112 (2018): 28–39.
Brill, Eric. “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging.” Computational Linguistics 21 (1995).4: 543–565.
Ezen-Can, Aysu. Unsupervised Dialogue Act Modeling for Tutorial Dialogue Systems. Ph.D. thesis, North Carolina State University, 2016.
Chai, Joyce, Hong, Pengyu, and Zhou, Michelle. “A Probabilistic Approach to ReferenceResolution in Multimodal User Interfaces.” Proceedings of the 9th InternationalConference on Intelligent User Interfaces - IUI ’04 (2004): 70–77.
Corbin, Carina, Morbini, Fabrizio, and Traum, David. “Creating a Virtual Neighbor.”Natural Language Dialog Systems and Intelligent Assistants (2015): 203–208.
Crystal, David. A Dictionary of Linguistics and Phonetics (4th ed.). Oxford UniversityPress, 1997.
Culotta, Aron, Wick, Michael, and McCallum, Andrew. “First-Order Probabilistic Models for Coreference Resolution.” Proceedings of the 2007 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2007, 81–88.
Daume, Hal. “Frustratingly Easy Domain Adaptation.” arXiv preprint arXiv:0907.1815(2009).
Denis, Pascal and Baldridge, Jason. “Specialized Models and Reranking for Coreference Resolution.” Proceedings of the Conference on Empirical Methods in Natural Language Processing (2008): 660–669.
Dzikovska, Myroslava O., Callaway, Charles B., Farrow, Elaine, Marques-Pita, Manuel, Matheson, Colin, and Moore, Johanna D. “Adaptive Tutorial Dialogue Systems Using Deep NLP Techniques.” NAACL HLT Demonstrations. 2007, 5–6.
Finkel, Jenny Rose and Manning, Christopher D. “Hierarchical Bayesian Domain Adaptation.” Proceedings of the 2009 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2009). 2009, 602–610.
Forsyth, Eric N. and Martell, Craig H. “Lexical and Discourse Analysis of Online Chat Dialog.” Proceedings of the International Conference on Semantic Computing (ICSC 2007). 2007, 19–26.
Funakoshi, Kotaro, Nakano, Mikio, Tokunaga, Takenobu, and Iida, Ryu. “A Unified Probabilistic Approach to Referring Expressions.” Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue (2012): 237–246.
Garrette, Dan and Baldridge, Jason. “Learning a Part-of-Speech Tagger from Two Hours of Annotation.” Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013). 2013, 138–147.
Gorniak, Peter and Roy, Deb. “Situated Language Understanding as Filtering PerceivedAffordances.” Cognitive Science 31 (2007).2: 197–231.
Grosz, Barbara J., Joshi, Aravind K., and Weinstein, Scott. “Centering: A Framework for Modeling the Local Coherence of Discourse.” Computational Linguistics 21 (1995).2: 203–225.
Hovy, Dirk, Plank, Barbara, and Søgaard, Anders. “Mining for Unambiguous Instances toAdapt Part-of-speech Taggers to New Domains.” Proceedings of the 2015 Conference ofthe North American Chapter of the Association for Computational Linguistics HumanLanguage Technologies (NAACL HLT 2015). 2015, 1256–1261.
Iida, Ryu, Kobayashi, Shumpei, and Tokunaga, Takenobu. “Incorporating Extra-linguistic Information into Reference Resolution in Collaborative Task Dialogue.” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010): 1259–1267.
Iida, Ryu, Yasuhara, Masaaki, and Tokunaga, Takenobu. “Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues.” Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011) (2011): 84–92.
Jiang, Jing and Zhai, Chengxiang. “Instance Weighting for Domain Adaptation inNLP.” the 45th Annual Meeting of the Association of Computational Linguistics. 2007,264–271.
Kennington, Casey and Schlangen, David. “Simple Learning and CompositionalApplication of Perceptually Grounded Word Meanings for Incremental ReferenceResolution.” Proceedings of the Conference for the Association for ComputationalLinguistics (ACL) (2015): 292–301.
Lafferty, John, McCallum, Andrew, and Pereira, Fernando C N. “Conditional RandomFields: Probabilistic Models for Segmenting and Labeling Sequence Data.” Proceedingsof the International Conference on Machine Learning. 2001, 282–289.
Lappin, Shalom and Leass, Herbert J. “An Algorithm for Pronominal AnaphoraResolution.” Computational Linguistics 20 (1994): 535–561.
Lemon, Oliver, Bracy, Anne, Gruenstein, Alexander, and Peters, Stanley. “The WITAS Multi-Modal Dialogue System I.” Proceedings of INTERSPEECH. 2001, 1559–1562.
Li, Shen, Graca, Joao V, and Taskar, Ben. “Wiki-ly Supervised Part-of-Speech Tagging.”the 2012 Joint Conference on Empirical Methods in Natural Language Processing andComputational Natural Language Learning. 2012, 1389–1398.
Li, Xiaolong and Boyer, Kristy Elizabeth. “Semantic Grounding in Dialogue for ComplexProblem Solving.” Proceedings of the 2015 Conference of the North American Chapterof the Association for Computational Linguistics Human Language Technologies(NAACL HLT 2015). 2015, 841–850.
———. “Reference Resolution in Situated Dialogue with Learned Semantics.” the 17thAnnual Meeting of the Special Interest Group on Discourse and Dialogue. 2016, 329–338.
Liu, Changsong and Chai, Joyce Y. “Learning to Mediate Perceptual Differences inSituated Human-Robot Dialogue.” Proceedings of the Twenty-ninth AAAI Conference(AAAI15). 2015, 2288–2294.
Liu, Changsong, She, Lanbo, Fang, Rui, and Chai, Joyce Y. “Probabilistic Labeling for Efficient Referential Grounding Based on Collaborative Discourse.” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL) (2014): 13–18.
Liu, Changsong, Fang, Rui, and Chai, Joyce Yue. “Towards Mediating Shared Perceptual Basis in Situated Dialogue.” Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue (2012): 140–149.
Manning, Christopher D. “Part-of-Speech Tagging from 97% to 100%: Is It Time forSome Linguistics?” International Conference on Intelligent Text Processing andComputational Linguistics. 2011, 171–189.
Manning, Christopher D, Bauer, John, Finkel, Jenny, and Bethard, Steven J. “TheStanford CoreNLP Natural Language Processing Toolkit.” the 52nd Annual Meeting ofthe Association for Computational Linguistics: System Demonstrations (2014): 55–60.
Matuszek, Cynthia, Bo, Liefeng, Zettlemoyer, Luke S, and Fox, Dieter. “Learning fromUnscripted Deictic Gesture and Language for Human-Robot Interactions.” Proceedingsof AAAI 2014 (2014): 2556–2563.
McCarthy, Joseph F. and Lehnert, Wendy G. “Using Decision Trees for Coreference Resolution.” Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (1995).
McClosky, David, Charniak, Eugene, and Johnson, Mark. “Automatic Domain Adaptationfor Parsing.” Proceedings of the 2010 Annual Conference of the North AmericanChapter of the Association for Computational Linguistics (HLT-NAACL). 2010, 28–36.
Mulkar-Mehta, Rutu, Hobbs, Jerry, and Hovy, Eduard. “Granularity in Natural Language Discourse.” Proceedings of the Ninth International Conference on Computational Semantics. 2011, 360–364.
Owoputi, Olutobi, O'Connor, Brendan, Dyer, Chris, Gimpel, Kevin, Schneider, Nathan, and Smith, Noah A. “Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters.” Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013). 2013, 380–390.
Plank, Barbara, Hovy, Dirk, McDonald, Ryan, and Søgaard, Anders. “Adapting Taggersto Twitter with Not-so-distant Supervision.” COLING 2014, the 25th InternationalConference on Computational Linguistics: Technical Papers. 2014, 1783–1792.
Ponzetto, Simone Paolo and Strube, Michael. “Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution.” Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (2006): 192–199.
Poon, Hoifung and Domingos, Pedro. “Unsupervised Semantic Parsing.” Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2009, 1–10.
Rosé, Carolyn P. “A Framework for Robust Semantic Interpretation.” Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference (NAACL). 2000, 311–318.
Schlangen, David, Zarriess, Sina, and Kennington, Casey. “Resolving References toObjects in Photographs using the Words-As-Classifiers Model.” Proceedings of the 54thAnnual Meeting of the Association for Computational Linguistics (ACL 2016) (2016):1213–1223.
Schmidt, Mark and Swersky, Kevin. “http://www.cs.ubc.ca/∼schmidtm/Software/crfChain.html.” 2008.
Sha, Fei and Pereira, Fernando. “Shallow Parsing with Conditional Random Fields.” Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL 2003). 2003, 134–141.
Grosz, Barbara J. and Sidner, Candace L. “Attention, Intentions, and the Structure of Discourse.” Computational Linguistics 12 (1986).3: 175–204.
Sidner, Candace L, Lee, Christopher, Lesh, Neal, and Rich, Charles. “Explorations inEngagement for Humans and Robots.” Artificial Intelligence 166 (2005).1-2: 140–164.
Soon, W. M., Ng, H. T., and Lim, D. C. Y. “A Machine Learning Approach to Coreference Resolution of Noun Phrases.” Computational Linguistics 27 (2001).4: 521–544.
Strik, Helmer, Russel, Albert, Cucchiarini, Catia, Boves, Lou, and Oostdijk, N. “A Spoken Dialogue System for Public Transport Information.” International Journal of Speech Technology 2 (1997): 119–129.
Tjong Kim Sang, Erik F. and Buchholz, Sabine. “Introduction to the CoNLL-2000 Shared Task: Chunking.” Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning. 2000, 127–132.
Toutanova, Kristina, Klein, Dan, and Manning, Christopher D. “Feature-RichPart-of-Speech Tagging with a Cyclic Dependency Network.” Human LanguageTechnologies: The 2003 Annual Conference of the North American Chapter of theAssociation for Computational Linguistics. 2003, 252–259.
VanLehn, Kurt, Jordan, Pamela W., Rosé, Carolyn P., Bhembe, Dumisizwe, Böttner, Michael, Gaydos, Andy, Makatchev, Maxim, Pappuswamy, Umarani, Ringenberg, Michael, Roque, Antonio, Siler, Stephanie, and Srivastava, Ramesh. “The Architecture of Why2-Atlas: A Coach for Qualitative Physics Essay Writing.” Proceedings of the Sixth International Conference on Intelligent Tutoring Systems (ITS 2002). 2002, 158–167.
Wen, Tsung-Hsien, Vandyke, David, Mrksic, Nikola, Gasic, Milica, Rojas-Barahona, Lina M., Su, Pei-Hao, Ultes, Stefan, and Young, Steve. “A Network-based End-to-End Trainable Task-oriented Dialogue System.” arXiv preprint arXiv:1604.04562 (2016).
Xue, Nianwen and Palmer, Martha. “Calibrating Features for Semantic Role Labeling.” Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2004, 88–94.
Yang, Xiaofeng, Zhou, Guodong, Su, Jian, and Tan, Chew Lim. “Coreference ResolutionUsing Competition Learning Approach.” Proceedings of the 41st Annual Meeting onAssociation for Computational Linguistics (2003): 176–183.
BIOGRAPHICAL SKETCH
Xiaolong Li received his Ph.D. from the University of Florida in August 2018.
Before that, he received his bachelor’s and master’s degrees in computer engineering and
technology in 2008 and 2012 from Northwestern Polytechnical University and Zhejiang
University in China, respectively. He started his Ph.D. program in computer science in
2012 at North Carolina State University and then transferred to the University of Florida
with the LearnDialogue research group in 2015.