arXiv:2004.04305v2 [cs.CL] 1 May 2020

Conversation Learner – A Machine Teaching Tool for Building Dialog Managers for Task-Oriented Dialog Systems

Swadheen Shukla†∗, Lars Liden†∗, Shahin Shayandeh†∗, Eslam Kamal‡, Jinchao Li†, Matt Mazzola†, Thomas Park†, Baolin Peng†, Jianfeng Gao†

†Microsoft Research AI
‡Microsoft Corporation

{swads,laliden,shahins,eskam,jincli,mattm,thpar,bapeng,jfgao}@microsoft.com

Abstract

Traditionally, industry solutions for building a task-oriented dialog system have relied on helping dialog authors define rule-based dialog managers, represented as dialog flows. While dialog flows are intuitively interpretable and good for simple scenarios, they lack the flexibility needed to handle complex dialogs. On the other hand, purely machine-learned models can handle complex dialogs, but they are considered to be black boxes and require large amounts of training data. In this demonstration, we showcase Conversation Learner, a machine teaching tool for building dialog managers. It combines the best of both approaches by enabling dialog authors to create a dialog flow using familiar tools, converting the dialog flow into a parametric model (e.g., neural networks), and allowing dialog authors to improve the dialog manager (i.e., the parametric model) over time by leveraging user-system dialog logs as training data through a machine teaching interface.

1 Introduction

The proliferation of messaging applications and hardware devices with personal assistants has spurred the imagination of many in the technology industry to create task-oriented dialog systems that help users complete a wide range of tasks through natural language conversations. Tasks include customer support, IT helpdesk, information retrieval, appointment booking, etc. The wide variety of tasks has created the need for a flexible task-oriented dialog system development platform that can support many different use cases, while remaining simple for developers to use and maintain.

A task-oriented dialog system is typically built as a combination of three discrete systems, performing language understanding (for identifying user intent and extracting associated information), dialog management (for guiding users towards task completion), and language generation (for converting agent actions to natural-language system responses). The Dialog Manager (DM) contains two sub-systems: the Dialog State Tracker (DST) for keeping track of the current dialog state, and the Dialog Policy (DP) for determining the next action to be taken in a given dialog instance. The DP relies on the internal state provided by the DST to select an action, which can be a response to the user, or some operation on the back-end database (DB). In this paper, we present a novel approach to building DMs.

∗Equal contribution.

[Diagram components: input x; Language Understanding (NLU); Dialog Manager (DM), containing the Dialog State Tracker (DST) and the Dialog Policy (DP); DB; Language Generation (NLG); output y.]

Figure 1: An architecture for a task-oriented dialog system.

In a typical industrial implementation of a task-oriented dialog system, the DM is expressed as a dialog flow, which is often a finite state machine, with nodes representing dialog activities (system actions) and edges representing conditions (dialog states that represent the previous user-system interactions). Since a dialog flow can be viewed as a set of rules that specify the flow between dialog states, it may also be called a rule-based DM.

There has been an increasing need for tools to help dialog authors¹ develop and maintain rule-based DMs. These tools are often implemented as drag-and-drop WYSIWYG tools that allow users to specify and visualize all the details of the dialog flow. They often have deep integration with popular Integrated Development Environments (IDEs) as editing frontends. Examples of rule-based or partially rule-based DMs include Microsoft’s Power Virtual Agents (PVA)² and Bot Framework (BF) Composer³, Google’s Dialog Flow⁴, IBM’s Watson Assistant⁵, Facebook’s Wit.ai⁶ and Amazon’s Lex⁷. It should be noted that most of these tools have some built-in machine-learned NLU capabilities, i.e., intent classification and entity detection, that can be leveraged to trigger different rule-based dialog flows, e.g., asking appropriate questions based on missing slots from the dialog state.

¹ In this paper, “author” may refer to developers, business owners, or domain experts who define and maintain the conversational aspects of a task-oriented dialog system.

However, a rule-based DM suffers from two major problems. First, these systems can have difficulty handling complex dialogs. Second, updating a rule-based DM to handle unexpected user responses and off-track conversations is often difficult due to the rigid structure of the dialog flow, the long tail (sparseness) of user-system dialogs, and the complexity in jumping to unrelated parts of the flow.

In end-to-end approaches proposed recently (Madotto et al., 2018; Lei et al., 2018), the DM is implemented as a neural network model that is trained directly on text transcripts of dialogs. Gao et al. (2019) present a survey of recent approaches. One benefit provided by using a neural network model is that the network infers a latent representation of dialog state, eliminating the need for explicitly specifying dialog states. Neural-based DMs have been an area of active development for the research community as well as in industry; PyDial (Ultes et al., 2017), ParlAI (Miller et al., 2017), Plato (Papangelis et al., 2020), Rasa (Bocklisch et al., 2017), DeepPavlov (Burtsev et al., 2018), and ConvLab (Lee et al., 2019) are a few examples. However, these machine-learned neural DMs are often viewed as black boxes from which dialog authors have difficulty interpreting why individual use cases succeed or fail. Further, these approaches often lack a general mechanism for accepting task-specific knowledge and constraints, thus requiring a large number of validated dialog transcripts for training. Collection and curation of this type of corpus is often infeasible.

² https://powervirtualagents.microsoft.com/
³ https://github.com/microsoft/BotFramework-Composer
⁴ https://dialogflow.com/
⁵ https://www.ibm.com/watson/
⁶ https://wit.ai/
⁷ https://aws.amazon.com/lex/

This paper presents Conversation Learner, a machine teaching tool for building DMs, which combines the strengths of both rule-based and machine-learned approaches. Conversation Learner is based on Hybrid Code Networks (HCNs) (Williams et al., 2017) and the machine teaching discipline (Simard et al., 2017). Conversation Learner allows dialog authors to (1) import a dialog flow developed using popular dialog composers, (2) convert the dialog flow to an HCN-based DM, (3) continuously improve the HCN-based DM by reviewing user-system dialog logs and providing updates via a machine teaching UI, and (4) convert the (revised) HCN-based DM back into a dialog flow for further editing and verification.

Section 2 describes the architecture and main components of Conversation Learner. Section 3 demonstrates Conversation Learner features. Section 4 presents a case study of using Conversation Learner as the DM of a text-based customer support dialog system.

2 Conversation Learner

Development of any DM follows an iterative process of generation, testing, and revision. Conversation Learner follows a three-stage DM development process:

1. Dialog authors develop a rule-based DM (dialog flow) using a dialog composer.

2. The DM is imported into an HCN dialog system. Users (or human subjects recruited for system fine-tuning) interact with the system and generate user-system dialog logs.

3. Dialog authors revise the DM by selecting representative failed dialogs from the logs and teaching the system to complete these dialogs successfully. They then run regression testing and return to step 2.

This development process is illustrated in Figure 2 (Bottom). The overall architecture of Conversation Learner is shown in Figure 2 (Top). It consists of four components: (1) a DM converter that converts a dialog flow between rule-based and HCN-based DM representations; (2) an HCN-based DM engine; (3) a machine teaching module that allows dialog authors to revise the HCN-based DM; and (4) an evaluation module that allows side-by-side comparison of the dialogs generated by different DMs. We describe each component in detail below.

Figure 2: The architecture of Conversation Learner (Top) and the development of DMs using Conversation Learner (Bottom).

2.1 HCN-based DM

The Conversation Learner HCN consists of a set of task-specific action templates, an entity module, a set of action masks, and a Recurrent Neural Network (RNN). Each action template can be a textual communicative action, a rich card, or an API call. The entity module detects entity mentions in user utterances, grounds the entity mentions (e.g., by mapping an entity mention to a specific row in a dataset), and performs entity substitution in a selected action template to produce a fully-formed action (e.g., by mapping the template “the weather of [city]?” to “the weather of Seattle?”).
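The entity substitution step can be illustrated with a minimal sketch. The function name `fill_template` and the bracketed-slot syntax handling are assumptions made for illustration, modeled on the “[city]” example above; they are not the Conversation Learner implementation.

```python
import re

def fill_template(template, entities):
    """Replace each [slot] in an action template with its grounded entity value.

    Hypothetical helper: slots without a known entity are left untouched so
    that missing values stay visible in the produced action.
    """
    def substitute(match):
        slot = match.group(1)
        return entities.get(slot, match.group(0))

    return re.sub(r"\[(\w+)\]", substitute, template)

filled = fill_template("the weather of [city]?", {"city": "Seattle"})
print(filled)
```

With the entity `city` grounded to “Seattle”, the template from the text becomes the fully-formed action “the weather of Seattle?”.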

Each action mask represents an “if-then” rule that determines the set of valid actions for some conditions (i.e., particular dialog states or user inputs).

The RNN maintains dialog states and selects system actions. For each turn in a training dialog, a set of features, including the user utterance embedding, its bag-of-words vector, and the set of extracted entities, is concatenated to form a feature vector that is passed to the RNN, specifically a Long Short-Term Memory (LSTM) network. The RNN computes a hidden state vector, which is retained for the next timestep. Next, a softmax activation layer is used to calculate a probability distribution over the available system action templates. An action mask is then applied, the result is renormalized, and the highest-probability action is selected as the best response for the current turn.
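The softmax, masking, and renormalization steps described above can be sketched as follows. This is a minimal illustration in plain Python, assuming the RNN output is available as a list of raw scores (`logits`) and the action mask as a list of 0/1 flags; both names, and the function `select_action`, are hypothetical.

```python
import math

def select_action(logits, mask):
    """Return (index, probabilities) of the best valid action.

    logits: raw scores from the RNN, one per action template (assumed input).
    mask:   1 if the action is valid in the current state, else 0.
    """
    # Softmax over all action templates (shifted by the max for stability).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Zero out disallowed actions, then renormalize so the rest sum to 1.
    masked = [p * m for p, m in zip(probs, mask)]
    norm = sum(masked)
    if norm == 0:
        raise ValueError("action mask disallows every action")
    masked = [p / norm for p in masked]

    # Pick the highest-probability action among those still allowed.
    best = max(range(len(masked)), key=masked.__getitem__)
    return best, masked

best, probs = select_action([2.0, 1.0, 0.5], [0, 1, 1])
print(best, probs)
```

In this example the first action has the highest raw score but is masked out, so the second action is selected.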

The HCN can be trained on a collection of user-system dialogs. For each system response in a dialog, the action template is labeled. Training the HCN takes two steps. First, all unique action templates are imported into the HCN. Then, the RNN, which maps states to action templates, is optimized to minimize the categorical cross-entropy on the training data. More specifically, each dialog forms one minibatch, and updates of the RNN are done via non-truncated back-propagation through time.
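As a small worked example of the training objective, the sketch below computes the categorical cross-entropy for one dialog treated as a single minibatch. Averaging over turns (rather than summing) is an assumption made here, and the function name `dialog_loss` is hypothetical.

```python
import math

def dialog_loss(turn_probs, labels):
    """Mean categorical cross-entropy over the turns of one dialog.

    turn_probs: per-turn probability distributions over action templates.
    labels:     index of the labeled (correct) action template for each turn.
    """
    losses = [-math.log(probs[label]) for probs, label in zip(turn_probs, labels)]
    return sum(losses) / len(losses)

# Two-turn dialog with two action templates; the labeled actions are 0 and 1.
loss = dialog_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
print(loss)
```

The loss falls to zero only when the model assigns probability 1 to the labeled action template at every turn.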

Readers are referred to Williams et al. (2017) for a detailed description of HCN. It should be noted that Conversation Learner (CL) uses the same network architecture as HCN, with enhancements and modifications to generate context features from training samples.

2.2 DM converter

The DM converter converts a rule-based DM, developed using a dialog composer, to an HCN-based DM, which can then be improved via training dialogs and machine teaching.

Given a dialog flow, the DM converter automatically generates a set of training dialogs that represent the dialog flow. This process is done by performing an exhaustive set of walks over the dialog flow and generating training dialog instances for each walk. Rules that determine transitions in the dialog flow are represented as action masks in the HCN. The HCN is trained on the generated training dialogs as described in Section 2.1.
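The walk-based generation of training dialogs can be sketched as a depth-first enumeration of paths through the flow. The graph encoding, the function name `enumerate_walks`, the rule for cutting loops, and the "app_help" response text are all assumptions for illustration; the actual converter may represent and traverse flows differently.

```python
def enumerate_walks(flow, start):
    """Yield one training dialog per walk from the start node to a terminal node.

    flow maps a node name to (system_action, [(user_utterance, next_node), ...]).
    Each dialog is a list of ("agt" | "usr", text) turns.
    """
    def walk(node, turns, visited):
        action, edges = flow[node]
        turns = turns + [("agt", action)]
        # Skip edges back to nodes already on this walk, so loops terminate.
        open_edges = [(u, n) for u, n in edges if n not in visited]
        if not open_edges:
            yield turns
            return
        for utterance, nxt in open_edges:
            yield from walk(nxt, turns + [("usr", utterance)], visited | {nxt})

    yield from walk(start, [], {start})

# Hypothetical three-node flow, loosely modeled on a font-size support topic.
flow = {
    "ask_scope": ("Would you like to change the font size in an app or on your screen?",
                  [("in an app", "app_help"), ("text on screen", "screen_help")]),
    "app_help": ("Change the font size in the app's settings.", []),
    "screen_help": ("Change the size of text using Display settings.", []),
}
dialogs = list(enumerate_walks(flow, "ask_scope"))
for dialog in dialogs:
    print(dialog)
```

Each branch of the flow yields one training dialog, so this flow produces two dialogs of three turns each.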

The DM converter can also convert a revised HCN-based DM back to a dialog flow by aggregating the individual training dialogs back into a graph for further editing and verification using a dialog composer.
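The reverse direction can be sketched as merging dialogs that share a prefix into a single transition graph. This is a simplification under stated assumptions: nodes are keyed by the system action alone, the "<start>" marker and the function name `dialogs_to_graph` are hypothetical, and the real converter may track richer state.

```python
from collections import defaultdict

def dialogs_to_graph(dialogs):
    """Merge (user_utterance, system_action) turn sequences into a graph.

    Returns {previous_action: {user_utterance: next_action}}; "<start>" marks
    the beginning of a dialog. Dialogs with a common prefix share edges.
    """
    graph = defaultdict(dict)
    for dialog in dialogs:
        previous = "<start>"
        for utterance, action in dialog:
            graph[previous][utterance] = action
            previous = action
    return dict(graph)

training_dialogs = [
    [("hi", "ask_scope"), ("in an app", "app_help")],
    [("hi", "ask_scope"), ("text on screen", "screen_help")],
]
graph = dialogs_to_graph(training_dialogs)
print(graph)
```

The two dialogs collapse into one "ask_scope" node with two outgoing edges, which is the shape a dialog composer would display for editing.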

2.3 Machine Teaching

The HCN-based DM can be improved via machine teaching (Simard et al., 2017). “Machine teaching” is an active learning paradigm that focuses on leveraging the knowledge and expertise of domain experts as “teachers”. This paradigm puts a strong emphasis on tools and techniques that enable teachers (particularly non-data scientists and non-machine-learning experts) to visualize data, find potential problems, and provide corrections or additional training inputs in order to improve the system’s performance.

For Conversation Learner, we developed a UI for visualizing and editing logged user-system dialogs that had failed to complete their tasks successfully. The teacher does not need to revise the DM directly (e.g., via writing code or by modifying dialog structure in a hierarchical composer tool). The teacher simply corrects cases where the dialog system responded poorly or incorrectly.

The teacher can make three types of corrections: (1) correct entity detection and grounding errors; (2) correct state-to-action mapping; or (3) create a new action template.

In cases where a large number of logged dialogs exists, we use active learning to provide a ranking of the candidate dialogs most likely to benefit from machine teaching intervention.
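The ranking criterion is not specified here, so the following sketch uses one common active learning heuristic as a stand-in: dialogs are ordered by the DM's lowest per-turn confidence, least confident first. The function name `rank_for_review`, the log format, and the confidence values are all hypothetical.

```python
def rank_for_review(logged_dialogs):
    """Order logged dialogs so that the least confident ones come first.

    logged_dialogs maps a dialog id to the DM's per-turn confidence, i.e. the
    probability it assigned to the action it selected at each turn.
    """
    return sorted(logged_dialogs, key=lambda dialog_id: min(logged_dialogs[dialog_id]))

logs = {
    "d1": [0.98, 0.95],
    "d2": [0.97, 0.41, 0.90],
    "d3": [0.72, 0.88],
}
ranking = rank_for_review(logs)
print(ranking)
```

Under this heuristic a single low-confidence turn is enough to move a dialog to the top of the teacher's review queue.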

Our empirical tests show that machine teaching requires considerably fewer training samples than traditional machine learning approaches to improve system performance. We commonly observe significant improvements in DM performance by providing a dozen or fewer teaching examples.

There are three main reasons the HCN + machine teaching combination is so effective. First, dialog authors are generally subject matter experts who can make well-informed decisions about which actions the DM should perform in individual situations. If the DM has to automatically learn an action policy from logs, a large corpus of data is required. Second, the HCN allows dialog authors to explicitly encode domain-specific knowledge as action templates (bot activities or responses) and action masks (when a given response should be disallowed) without learning. Third, we can use intelligent filtering to select the most impactful failed dialogs for teachers to review.

2.4 Regression Testing

To effectively compare the performance of various dialog systems using different DMs, we developed a regression testing module. The module replays user utterances from transcripts of existing conversations against the DMs being tested; each DM then provides response action(s) for each turn. The module displays side-by-side comparisons of the resulting conversations from each DM, up to the point where the DM responses diverge. Human judges then rate the conversations as “left better”, “right better” or “same”.

At the end of the rating session, a report is generated showing the performance of conversational flow amongst the DMs, as rated by the human judges.
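The replay step can be sketched as feeding the same user utterances to two DMs and stopping at the first turn where their responses differ. The callable interface, the function name `first_divergence`, and the toy DMs below are assumptions for illustration only.

```python
def first_divergence(user_utterances, dm_left, dm_right):
    """Replay utterances through two DMs; return the first turn where they differ.

    dm_left and dm_right are callables mapping the utterance history so far to
    a response. Returns (turn_index, left_response, right_response), or None
    if the two DMs agree on every turn.
    """
    history = []
    for turn, utterance in enumerate(user_utterances):
        history.append(utterance)
        left, right = dm_left(history), dm_right(history)
        if left != right:
            return turn, left, right
    return None

def rule_dm(history):
    # Toy rule-based DM: only recognizes one exact phrase.
    return "ok" if history[-1] == "adjust font size" else "Please rephrase."

def learned_dm(history):
    # Toy learned DM: accepts every utterance.
    return "ok"

result = first_divergence(["adjust font size", "make text bigger"], rule_dm, learned_dm)
print(result)
```

Only dialogs with a divergence need to be shown to human judges; dialogs for which the function returns None can be counted as “same” automatically.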

3 System Demonstration

The system demonstration consists of the following steps:

1. The dialog author creates a rule-based DM using a dialog composer tool. We showcase the usage of the Microsoft Power Virtual Agents system for this step.

2. The rule-based DM is exported to a common representation, e.g., Microsoft’s ‘.dialog’ schema⁸.

3. The dialog flow is then imported into the Conversation Learner system. The import process involves traversing the dialog flow and creating dialog transcript instances for each traversal.

4. The generated dialog transcripts are run through both the rule-based system and Conversation Learner to demonstrate the improvement in dialog-handling flexibility provided by Conversation Learner’s machine-learned DM.

5. For transcripts containing failures to complete the user’s desired task, we demonstrate the machine teaching UI used to ‘correct’ errors and generate new training examples for revising the DM.

4 Case Study

To evaluate Conversation Learner’s performance, we used dialog transcripts from Microsoft’s text-based customer support system⁹. This customer support system uses a sophisticated rule-based DM to handle various customer support scenarios. It was created and is regularly updated by domain experts. We chose to use transcripts of actual user-system dialogs in order to measure the performance of the DMs in a realistic setting.

The customer support system consists of hundreds of “topics”, out of which we selected three: “language settings”, “installing or changing fonts”, and “sign in problems”. Table 1 shows an example of a user-system dialog for the “change font” topic. Figure 3 shows the representation of a specific topic’s dialog flow in the Microsoft Power Virtual Agents dialog composer. We exported the dialog flow graph of the support system from this system, then followed the process described in Section 2.2 to train our machine-learned HCN-based DM. Figure 4 shows an example of a generated train dialog.

⁸ https://github.com/microsoft/botbuilder-dotnet/blob/master/schemas/sdk.schema
⁹ These transcripts were scrubbed of any personally identifiable information.

Turn  Dialog
1     usr: How to reduce font size
      agt: How can I help you with fonts today?
2     usr: Adjust font size
      agt: Would you like to change the font size in an app or the size of text on your screen?
3     usr: Change size of text on screen
      agt: Change the size of text in Windows 10 using Display settings or use Magnifier.
4     agt: Did that solve your problem?
      usr: Yes
5     agt: Great! Glad that helped.

Table 1: A sample human-agent dialog from the customer support logs.

Figure 3: An example of a rule-based dialog defined in the Microsoft Power Virtual Agents system.

After generating the HCN-based DM, we ran the set of dialog transcripts against both rule-based and HCN-based DMs. For the majority of conversations, users followed the expected flow, so the HCN-based DM produced the same results as the rule-based DM. For those that differed, we used human judges to do a blind qualitative evaluation of the conversations and choose the conversation that provided the best task-completion result.

Figure 4: An example train dialog generated by traversing the different paths of a dialog tree (left), and DM actions generated from the tree (right).

User Rating         # of Convs.   %
CL is same          2749          91.63%
CL is better        136           4.53%
CL is worse         115           3.83%
Overall variation                 0.7% (better)

Table 2: Initial results of human evaluation of 3000 dialogs against Conversation Learner (CL) and a rule-based dialog system.

As shown in Table 2, the HCN-based DM provided better results for 136 transcripts, but there was an almost equal number of dialogs (115) where the rule-based DM was rated better. The rule-based DM may perform better in cases where specialized hard-coded logic was added to handle issues such as input normalization or rewriting.

An example dialog comparison is shown in Figures 5 and 6. As seen in Figure 6, the HCN-based DM handles cases where the user utterance does not match a known phrase better than the rule-based system. The HCN is also able to handle unexpected transitions between dialog nodes.

In a rule-based system, updating the DM to handle unexpected user responses and off-track conversations is much harder due to the rigid structure of the dialog flow graph, the long-tail nature of user-system dialog transcripts (a large number of sparse examples), and the complexity in transitioning between unrelated parts of the dialog flow.

Next, we demonstrate that the performance of the HCN DM can be substantially improved by adding just 3 to 5 teaching examples via the machine teaching UI presented in Section 2.3. For this experiment, we chose dialog transcripts that contained common patterns of conversational problems, like users switching context, repeating themselves or asking follow-up questions, and corrected the dialog policy by creating or selecting the appropriate system action to resolve the problem. Once these additional examples were added, as illustrated in Figure 7, the share of dialogs in which the HCN-based DM was rated better than the rule-based DM roughly tripled, from 4.53% to 13.8%.

Figure 5: A sample dialog from a rule-based system. Notice that the system just repeats its previous question (as a rule) since it did not understand the user reply.

Figure 6: Same dialog as Figure 5 in Conversation Learner. Notice that the user response is accurately generalized to one of the available options when possible.

Figure 7: Machine Teaching UI for correcting dialogs to revise DM.

User Rating         # of Convs.   %
CL is same          2562          85.4%
CL is better        414           13.8%
CL is worse         24            0.81%
Overall variation                 12.99% (better)

Table 3: Results of human evaluation of 3000 dialogs after improving the Conversation Learner (CL) model with machine teaching.
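The percentages and the “overall variation” row in Tables 2 and 3 can be recomputed from the raw counts. The sketch below assumes that overall variation is the share rated better minus the share rated worse, which matches the reported figures; the function name `summarize` is hypothetical.

```python
def summarize(same, better, worse):
    """Return percentage shares and the net variation (better minus worse)."""
    total = same + better + worse

    def share(count):
        return 100.0 * count / total

    return {
        "same": share(same),
        "better": share(better),
        "worse": share(worse),
        "net": share(better) - share(worse),
    }

initial = summarize(same=2749, better=136, worse=115)  # counts from Table 2
taught = summarize(same=2562, better=414, worse=24)    # counts from Table 3
print({key: round(value, 2) for key, value in initial.items()})
print({key: round(value, 2) for key, value in taught.items()})
```

Recomputed this way, the Table 3 counts give 0.80% for “CL is worse” and a net variation of 13.00%, marginally different from the reported 0.81% and 12.99%; the gap is presumably a rounding artifact.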

As shown in Table 3, minimal intervention from a dialog author, in the form of a small number of corrections to problematic user-system dialog logs, can have a significant impact on the performance of the DM. As new users interact with the system and new transcripts are generated, the dialog author can continuously improve the HCN DM’s performance by making corrections and adding new training data.

5 Conclusion

In this paper, we presented Conversation Learner, a machine teaching tool for building dialog managers. We have shown that the CL HCN-based DM can be bootstrapped from a rule-based DM while preserving the behavior expected from the rule-based system. Using the CL machine teaching UI, the dialog author can provide corrections to the logged user-system dialogs and further improve the performance of CL’s DM. We demonstrated this through a case study based on dialog transcripts from Microsoft’s text-based customer support system, where the gains were approximately 13%.

We are planning to extend this work by looking into the following problems: (1) investigating the effectiveness of different ranking algorithms for log correction recommendation, (2) optimizing the number of training samples and action masks generated from the rule-based DM, and (3) improving the predictions of the HCN-based DM by looking into alternative network architectures.

References

Tom Bocklisch, Joey Faulkner, Nick Pawlowski, and Alan Nichol. 2017. Rasa: Open source language understanding and dialogue management. CoRR, abs/1712.05181.

Mikhail Burtsev, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara Baymurzina, Nickolay Bushkov, Olga Gureenkova, Taras Khakhulin, Yuri Kuratov, Denis Kuznetsov, Alexey Litinsky, Varvara Logacheva, Alexey Lymar, Valentin Malykh, Maxim Petrov, Vadim Polulyakh, Leonid Pugachev, Alexey Sorokin, Maria Vikhreva, and Marat Zaynutdinov. 2018. DeepPavlov: Open-source library for dialogue systems. In Proceedings of ACL 2018, System Demonstrations, pages 122–127, Melbourne, Australia. Association for Computational Linguistics.

Jianfeng Gao, Michel Galley, and Lihong Li. 2019. Neural approaches to conversational AI. Foundations and Trends® in Information Retrieval, 13(2-3):127–298.

Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Xiang Li, Yaoqin Zhang, Zheng Zhang, Jinchao Li, Baolin Peng, Xiujun Li, Minlie Huang, and Jianfeng Gao. 2019. ConvLab: Multi-domain end-to-end dialog system platform. CoRR, abs/1904.08637.

Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1437–1447, Melbourne, Australia. Association for Computational Linguistics.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1468–1478, Melbourne, Australia. Association for Computational Linguistics.

Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. CoRR, abs/1705.06476.

Alexandros Papangelis, Mahdi Namazifar, Chandra Khatri, Yi-Chia Wang, Piero Molino, and Gokhan Tur. 2020. Plato dialogue system: A flexible conversational AI research platform. CoRR, abs/2001.06463.

Patrice Y. Simard, Saleema Amershi, David Maxwell Chickering, Alicia Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh, Johan Verwey, Mo Wang, and John Wernsing. 2017. Machine teaching: A new paradigm for building machine learning systems. CoRR, abs/1707.06742.

Stefan Ultes, Lina M. Rojas Barahona, Pei-Hao Su, David Vandyke, Dongho Kim, Iñigo Casanueva, Paweł Budzianowski, Nikola Mrkšić, Tsung-Hsien Wen, Milica Gasic, and Steve Young. 2017. PyDial: A multi-domain statistical dialogue system toolkit. In Proceedings of ACL 2017, System Demonstrations, pages 73–78, Vancouver, Canada. Association for Computational Linguistics.

Jason D. Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. CoRR, abs/1702.03274.
