
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6344–6354, July 5–10, 2020. ©2020 Association for Computational Linguistics


Dynamic Fusion Network for Multi-Domain End-to-end Task-Oriented Dialog

Libo Qin†, Xiao Xu†, Wanxiang Che†∗, Yue Zhang‡, Ting Liu†
†Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
‡School of Engineering, Westlake University, China
‡Institute of Advanced Technology, Westlake Institute for Advanced Study
†{lbqin, xxu, car, tliu}@ir.hit.edu.cn, [email protected]

Abstract

Recent studies have shown remarkable success in end-to-end task-oriented dialog systems. However, most neural models rely on large training data, which are only available for a certain number of task domains, such as navigation and scheduling. This makes it difficult to scale to a new domain with limited labeled data. Moreover, there has been relatively little research on how to effectively use data from all domains to improve the performance of each domain and also of unseen domains. To this end, we investigate methods that can make explicit use of domain knowledge and introduce a shared-private network to learn shared and specific knowledge. In addition, we propose a novel Dynamic Fusion Network (DF-Net) which automatically exploits the relevance between the target domain and each domain. Results show that our model outperforms existing methods on multi-domain dialogue, achieving the state of the art in the literature. Besides, with little training data, we show its transferability by outperforming the prior best model by 13.9% on average.

1 Introduction

Task-oriented dialogue systems (Young et al., 2013) help users to achieve specific goals such as restaurant reservation or navigation inquiry. In recent years, end-to-end methods in the literature usually take the sequence-to-sequence (Seq2Seq) model to generate a response from a dialogue history (Eric and Manning, 2017; Eric et al., 2017; Madotto et al., 2018; Wen et al., 2018; Gangi Reddy et al., 2019; Qin et al., 2019b; Wu et al., 2019a). Taking the dialogue in Figure 1 as an example, to answer the driver's query about the "gas station", the end-to-end dialogue system directly generates a system response given the query and a corresponding knowledge base (KB).

∗Corresponding author.

Knowledge Base (KB):
Address | Distance | POI type | POI | Traffic info
5672 barringer street | 5 miles | certain address | 5672 barringer street | no traffic
200 Alester Ave | 2 miles | gas station | Valero | road block nearby
899 Ames Ct | 5 miles | hospital | Stanford Childrens Health | moderate traffic
481 Amaranta Ave | 1 miles | parking garage | Palo Alto Garage R | moderate traffic

Dialogue:
Driver: Address to the gas station.
Car: Valero is located at 200 Alester Ave.
Driver: OK, please give me directions via a route that avoids all heavy traffic.
Car: Since there is a road block nearby, I found another route for you and I sent it on your screen.

Figure 1: Example of a task-oriented dialogue that incorporates a knowledge base (KB) from the SMD dataset (Eric et al., 2017). Words with the same color refer to entities queried from the KB. Better viewed in color.

Though achieving promising performance, end-to-end models rely on a considerable amount of labeled data, which limits their usefulness for new and extended domains. In practice, we cannot collect rich datasets for each new domain. Hence, it is important to consider methods that can effectively transfer knowledge from a source domain with sufficient labeled data to a target domain with limited or little labeled data.

Existing work can be classified into two main categories. As shown in Figure 2(a), the first strand of work (Eric and Manning, 2017; Eric et al., 2017; Madotto et al., 2018; Wu et al., 2019a) simply combines multi-domain datasets for training. Such methods can implicitly extract the shared features but fail to effectively capture domain-specific knowledge. As shown in Figure 2(b), the second strand of work (Wen et al., 2018; Qin et al., 2019b) trains a model separately for each domain, which can better capture domain-specific features. However, those methods ignore shared knowledge between different domains (e.g., the word location exists in both the schedule domain and the navigation domain).

We consider addressing the limitation of existing work by modeling knowledge connections between domains explicitly.


Figure 2: Methods for multi-domain dialogue. Previous work either trains a general model on mixed multi-domain datasets (a) or on each domain separately (b). The basic shared-private framework is shown in (c). Our proposed extension with the dynamic fusion mechanism is shown in (d).

In particular, a simple baseline to incorporate domain-shared and domain-private features is the shared-private framework (Liu et al., 2017; Zhong et al., 2018; Wu et al., 2019b). As shown in Figure 2(c), it includes a shared module to capture domain-shared features and a private module for each domain. The method explicitly differentiates shared and private knowledge. However, this framework still has two issues: (1) given a new domain with extremely little data, the private module can fail to effectively extract the corresponding domain knowledge; (2) the framework neglects the fine-grained relevance across certain subsets of domains (e.g., the schedule domain is more relevant to the navigation domain than to the weather domain).

To address the above issues, we further propose a novel Dynamic Fusion Network (DF-Net), which is shown in Figure 2(d). In contrast to the shared-private model, a dynamic fusion module (see §2.3) is further introduced to explicitly capture the correlation between domains. In particular, a gate is leveraged to automatically find the correlation between the current input and all domain-specific models, so that a weight can be assigned to each domain for extracting knowledge. Such a mechanism is adopted for both the encoder and the decoder, and also for a memory module that queries knowledge base features. Given a new domain with little or no training data, our model can still make the best use of existing domains, which cannot be achieved by the baseline model.

We conduct experiments on two public benchmarks, namely SMD (Eric et al., 2017) and Multi-WOZ 2.1 (Budzianowski et al., 2018). Results show that our framework consistently and significantly outperforms the current state-of-the-art methods. With limited training data, our framework outperforms the prior best methods by 13.9% on average.

To the best of our knowledge, this is the first work to effectively explore a shared-private framework in multi-domain end-to-end task-oriented dialog. In addition, when given a new domain with few-shot or zero-shot data, our extended dynamic fusion framework can utilize fine-grained knowledge to obtain desirable accuracies, which makes it more adaptable to new domains.

All datasets and code are publicly available at: https://github.com/LooperXX/DF-Net.

2 Model Architecture

We build our model based on a Seq2Seq dialogue generation model (§2.1), as shown in Figure 3(a). To explicitly integrate domain awareness, as shown in Figure 3(b), we first propose to use a shared-private framework (§2.2) to learn shared and the corresponding domain-specific features. Next, we further use a dynamic fusion network (§2.3) to dynamically exploit the correlation between all domains for fine-grained knowledge transfer, which is shown in Figure 3(c). In addition, adversarial training is applied to encourage the shared module to generate domain-shared features.

2.1 Seq2Seq Dialogue Generation

We define Seq2Seq task-oriented dialogue generation as finding the system response Y according to the input dialogue history X and the KB B. Formally, the probability of a response is defined as

$$p(Y \mid X, B) = \prod_{t=1}^{n} p(y_t \mid y_1, \dots, y_{t-1}, X, B), \qquad (1)$$

where y_t represents an output token. In a vanilla Seq2Seq task-oriented dialogue system (Eric and Manning, 2017), a long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997) is used to encode the dialogue history X = (x_1, x_2, ..., x_T) (T is the number of tokens in the dialogue history) to produce shared context-sensitive hidden states H = (h_1, h_2, ..., h_T):


Figure 3: Workflow of our baseline and our proposed models: (a) baseline model, (b) shared-private model, (c) dynamic fusion model.

$$\mathbf{h}_i = \mathrm{BiLSTM}_{enc}\big(\phi^{emb}(x_i), \mathbf{h}_{i-1}\big), \qquad (2)$$

where φ^{emb}(·) represents the word embedding matrix. An LSTM is also used to repeatedly predict outputs (y_1, y_2, ..., y_{t-1}) via the decoder hidden states (h_{dec,1}, h_{dec,2}, ..., h_{dec,t}). For the generation of y_t, the model first calculates an attentive representation h'_{dec,t} of the dialogue history over the encoding representation H. Then, the concatenation of h_{dec,t} and h'_{dec,t} is projected to the vocabulary space V by U:

$$\mathbf{o}_t = U[\mathbf{h}_{dec,t}, \mathbf{h}'_{dec,t}], \qquad (3)$$

where o_t is the score (logit) for the next-token generation. The probability of the next token y_t ∈ V is finally calculated as:

$$p(y_t \mid y_1, \dots, y_{t-1}, X, B) = \mathrm{Softmax}(\mathbf{o}_t). \qquad (4)$$

Different from typical text generation with a Seq2Seq model, successful conversations in a task-oriented dialogue system heavily depend on accurate knowledge base (KB) queries. We adopt the global-to-local memory pointer mechanism (GLMP) (Wu et al., 2019a) to query the entities in the KB, which has shown the best performance. An external knowledge memory is proposed to store the knowledge base (KB) B and the dialogue history X. The KB memory is designed for the knowledge source while the dialogue memory is used for directly copying history words. The entities in the external knowledge memory are represented in a triple format and stored in the memory module, which can be denoted as M = [B; X] = (m_1, ..., m_{b+T}), where m_i is one of the triplets of M, and b and T denote the number of KB entries and dialogue history tokens, respectively. For a k-hop memory network, the external knowledge is composed of a set of trainable embedding matrices C = (C^1, ..., C^{k+1}). We can query knowledge in both the encoder and the decoder to enhance the model's interaction with the knowledge module.

Query Knowledge in Encoder We adopt the last hidden state as the initial query vector:

$$\mathbf{q}^1_{enc} = \mathbf{h}_T. \qquad (5)$$

In addition, the model loops over k hops and computes the attention weights at each hop k using

$$p^k_i = \mathrm{Softmax}\big((\mathbf{q}^k_{enc})^{\top}\mathbf{c}^k_i\big), \qquad (6)$$

where c^k_i is the embedding in the i-th memory position using the embedding matrix C^k. We obtain the global memory pointer G = (g_1, ..., g_{b+T}) by applying g^k_i = Sigmoid((q^k_{enc})^⊤ c^k_i), which is used to filter the external knowledge for relevant information during decoding.

Finally, the model reads out the memory o^k by the weighted sum over C^{k+1} and updates the query vector q^{k+1}_{enc}. Formally,

$$\mathbf{o}^k_{enc} = \sum_i p^k_i \mathbf{c}^{k+1}_i, \qquad \mathbf{q}^{k+1}_{enc} = \mathbf{q}^k_{enc} + \mathbf{o}^k_{enc}. \qquad (7)$$

q^{k+1}_{enc} can be seen as the encoded KB information and is used to initialize the decoder.
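The encoder-side query of Eqs. (5)–(7) is a standard multi-hop memory read. A simplified sketch is given below; the per-hop memory embeddings C^k are assumed to be precomputed tensors, and taking the gate from the last hop is one plausible reading of the description above:

```python
import torch

def query_memory(q, C):
    """Multi-hop memory read, a sketch of Eqs. (5)-(7).

    q : (batch, d)       initial query vector, e.g. h_T (Eq. 5)
    C : list of K+1 memory embedding tensors, each (batch, b+T, d)
    Returns the final query vector q^{K+1} and the global memory pointer G.
    """
    K = len(C) - 1
    G = None
    for k in range(K):
        scores = torch.bmm(C[k], q.unsqueeze(2)).squeeze(2)   # (q^k)^T c^k_i for every slot
        p = torch.softmax(scores, dim=1)                      # attention weights p^k_i (Eq. 6)
        G = torch.sigmoid(scores)                             # global memory pointer g^k_i
        o = torch.bmm(p.unsqueeze(1), C[k + 1]).squeeze(1)    # o^k = sum_i p^k_i c^{k+1}_i
        q = q + o                                             # q^{k+1} = q^k + o^k (Eq. 7)
    return q, G
```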

Query Knowledge in Decoder We use a sketch tag to denote all the possible slot types, starting with a special token (e.g., @address stands for all addresses). When a sketch tag is generated by Eq. 4 at timestep t, we use the concatenation of the hidden state h_{dec,t} and the attentive representation h'_{dec,t} to query knowledge, which is similar to the process of querying knowledge in the encoder:


Figure 4: The dynamic fusion layer for fusing domain-shared and domain-specific features.

$$\mathbf{q}^1_{dec} = [\mathbf{h}_{dec,t}, \mathbf{h}'_{dec,t}], \qquad (8)$$

$$p^k_i = \mathrm{Softmax}\big((\mathbf{q}^k_{dec})^{\top}\mathbf{c}^k_i\, g^k_i\big). \qquad (9)$$

Here, we can treat P_t = (p^k_1, ..., p^k_{b+T}) as the probabilities of queried knowledge, and select the word with the highest probability from the query result as the generated word.
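Concretely, the decoder query of Eqs. (8)–(9) and the entity selection can be sketched as below; the single-hop simplification and the helper name are assumptions made for illustration:

```python
import torch

def fill_sketch_tag(h_dec, h_attn, C_k, g):
    """Sketch of the decoder-side KB query (Eqs. 8-9), reduced to one hop.

    h_dec, h_attn : decoder hidden state and attentive representation; their
                    concatenation matches the memory embedding size d.
    C_k           : (batch, b+T, d) memory embeddings at hop k
    g             : (batch, b+T)    global memory pointer used as a filter
    Returns the memory position whose object word replaces the sketch tag.
    """
    q = torch.cat([h_dec, h_attn], dim=-1)                    # q^1_dec = [h_dec,t ; h'_dec,t] (Eq. 8)
    scores = torch.bmm(C_k, q.unsqueeze(2)).squeeze(2) * g    # (q^k_dec)^T c^k_i * g^k_i
    p = torch.softmax(scores, dim=1)                          # P_t over memory positions (Eq. 9)
    return p.argmax(dim=1)                                    # index of the selected entity
```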

2.2 Shared-Private Encoder-Decoder Model

The model in Section 2.1 is trained over mixed multi-domain datasets, and the model parameters are shared across all domains. We call such a model the shared encoder-decoder model. Here, we propose to use a shared-private framework including a shared encoder-decoder for capturing domain-shared features and a private model for each domain to consider the domain-specific features explicitly. Each instance X goes through both the shared and its corresponding private encoder-decoder.

Enhancing Encoder Given an instance along with its domain, the shared-private encoder-decoder generates a sequence of encoder vectors denoted as H^{s,d}_{enc}, including shared and domain-specific representations from the corresponding encoders:

$$H^{\{s,d\}}_{enc} = (\mathbf{h}^{\{s,d\}}_{enc,1}, \dots, \mathbf{h}^{\{s,d\}}_{enc,T}) = \mathrm{BiLSTM}^{\{s,d\}}_{enc}(X). \qquad (10)$$

The final shared-specific encoding representation H^f_{enc} is a mixture:

$$H^f_{enc} = W_2\big(\mathrm{LeakyReLU}(W_1[H^s_{enc}, H^d_{enc}])\big). \qquad (11)$$

For ease of exposition, we define the shared-specific fusion function as:

$$\mathrm{shprivate}: (H^s_{enc}, H^d_{enc}) \rightarrow H^f_{enc}. \qquad (12)$$
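Eqs. (11)–(12) amount to a two-layer feed-forward mixer over the concatenation of the shared and private representations; a minimal sketch (dimensions are assumptions) is:

```python
import torch
import torch.nn as nn

class SharedSpecificFusion(nn.Module):
    """Sketch of shprivate(.) in Eqs. (11)-(12)."""
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.w1 = nn.Linear(2 * hidden_dim, hidden_dim)   # W_1 applied to [H^s ; H^d]
        self.w2 = nn.Linear(hidden_dim, hidden_dim)       # W_2
        self.act = nn.LeakyReLU()

    def forward(self, h_shared, h_private):
        # h_shared, h_private: (..., hidden_dim); returns the fused H^f of the same shape
        return self.w2(self.act(self.w1(torch.cat([h_shared, h_private], dim=-1))))
```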

In addition, self-attention has been shown useful for obtaining context information (Zhong et al., 2018). Finally, we follow Zhong et al. (2018) and use self-attention over H^f_{enc} to get the context vector c^f_{enc}. We replace h_T with c^f_{enc} in Eq. 5, which makes our query vector combine the domain-shared feature with the domain-specific feature.

Enhancing Decoder At step t of the decoder, the private and shared hidden states are:

$$\mathbf{h}^{\{s,d\}}_{dec,t} = \mathrm{LSTM}^{\{s,d\}}_{dec,t}(X). \qquad (13)$$

We also apply the shared-specific fusion function to the hidden states, and the mixture vector is:

$$\mathrm{shprivate}: (\mathbf{h}^s_{dec,t}, \mathbf{h}^d_{dec,t}) \rightarrow \mathbf{h}^f_{dec,t}. \qquad (14)$$

Similarly, we obtain the fused attentive representation h^{f'}_{dec,t} by applying attention from h^f_{dec,t} over H^f_{enc}. Finally, we replace [h_{dec,t}, h'_{dec,t}] in Eq. 8 with [h^f_{dec,t}, h^{f'}_{dec,t}], which incorporates shared and domain-specific features.

2.3 Dynamic Fusion for Querying Knowledge

The shared-private framework can capture the corresponding domain-specific features, but it neglects the fine-grained relevance across certain subsets of domains. We further propose a dynamic fusion layer to explicitly leverage all domain knowledge, which is shown in Figure 4. Given an instance from any domain, we first feed it to multiple private encoder-decoders to obtain domain-specific features from all domains. Next, all domain-specific features are fused by a dynamic domain-specific feature fusion module, followed by a shared-specific feature fusion to obtain shared-specific features.

Dynamic Domain-Specific Feature Fusion Given domain-specific features from all domains, a Mixture-of-Experts (MoE) mechanism (Guo et al., 2018) is adapted to dynamically incorporate all domain-specific knowledge for the current input, in both the encoder and the decoder. We describe in detail how fusion works at decoding timestep t; the fusion process for the encoder is the same. Given all domain feature representations at decoding step t, {h^{d_i}_{dec,t}}^{|D|}_{i=1}, where |D| represents the number of domains, an expert gate E takes {h^{d_i}_{dec,t}} as input and outputs a softmax score α_{t,i} that represents the degree of correlation between each domain and the current input token. We achieve this by a simple feedforward layer:

$$\alpha_t = \mathrm{Softmax}(W \cdot \mathbf{h}^{d}_{dec,t} + b). \qquad (15)$$

The final domain-specific feature vector is a mixture of all domain outputs, dictated by the expert gate weights α_t = (α_{t,1}, ..., α_{t,|D|}), which can be written as

$$\mathbf{h}^{df}_{dec,t} = \sum_i \alpha_{t,i}\,\mathbf{h}^{d_i}_{dec,t}.$$
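A minimal sketch of this dynamic domain-specific fusion is shown below; scoring each private feature with a shared linear layer is one plausible realization of Eq. (15), so treat the exact parameterization as an assumption:

```python
import torch
import torch.nn as nn

class DomainExpertGate(nn.Module):
    """MoE-style fusion: weight every domain expert and mix their features (Eq. 15)."""
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)   # W, b of Eq. (15), applied per expert

    def forward(self, expert_feats):
        # expert_feats: (batch, |D|, hidden_dim) -- the private features h^{d_i}_{dec,t}
        alpha = torch.softmax(self.scorer(expert_feats).squeeze(-1), dim=-1)  # (batch, |D|)
        fused = torch.sum(alpha.unsqueeze(-1) * expert_feats, dim=1)          # h^{df}_{dec,t}
        return fused, alpha   # alpha is additionally supervised by the MoE loss (Eq. 16)
```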

During training, taking the decoder for example, we apply the cross-entropy loss L^{moe}_{dec} as the supervision signal for the expert gate to predict the domain of each token in the response, where the expert gate output α_t can be treated as the t-th token's predicted domain probability distribution produced by the multiple private decoders. Hence, the more accurate the domain prediction is, the more the correct expert contributes:

$$\mathcal{L}^{moe}_{dec} = -\sum_{t=1}^{n}\sum_{i=1}^{|D|} \big(e_i \cdot \log(\alpha_{t,i} \mid \theta_s, \theta_{m_{dec}})\big), \qquad (16)$$

where θ_s represents the parameters of the encoder-decoder model, θ_{m_{dec}} represents the parameters of the MoE module (Eq. 15) in the decoder, and e_i ∈ {0, 1} indicates whether the response with n tokens belongs to domain d_i. Similarly, we can obtain L^{moe}_{enc} for the encoder and sum them up as L^{moe} = L^{moe}_{enc} + L^{moe}_{dec}.

L^{moe} is used to encourage samples from a certain source domain to use the correct expert, so that each expert learns the corresponding domain-specific features. When a new domain has little or no labeled data, the expert gate can automatically calculate the correlation between the different domains and the target domain, and thus better transfer knowledge from the different source domains in both the encoder and the decoder.

Shared-Specific Feature Fusion We directly apply the shprivate operation to fuse the shared and the final domain-specific features:

$$\mathrm{shprivate}: (\mathbf{h}^s_{dec,t}, \mathbf{h}^{df}_{dec,t}) \rightarrow \mathbf{h}^f_{dec,t}. \qquad (17)$$

Finally, we denote the dynamic fusion function as dynamic(h^s_{dec,t}, {h^{d_i}_{dec,t}}^{|D|}_{i=1}). Similar to Section 2.2, we replace [h_{dec,t}, h'_{dec,t}] in Eq. 8 with [h^f_{dec,t}, h^{f'}_{dec,t}]. The other components are kept the same as in the shared-private encoder-decoder framework.

Dataset | Domains | Train | Dev | Test
SMD | Navigate, Weather, Schedule | 2,425 | 302 | 304
Multi-WOZ 2.1 | Restaurant, Attraction, Hotel | 1,839 | 117 | 141

Table 1: Statistics of datasets.

Adversarial Training To encourage the model to learn domain-shared features, we apply adversarial learning on the shared encoder and decoder modules. Following Liu et al. (2017), a gradient reversal layer (Ganin and Lempitsky, 2014) is introduced after the domain classifier layer. The adversarial training loss is denoted as L_{adv}. We follow Qin et al. (2019a), and the final loss function of our Dynamic Fusion Network is defined as:

$$\mathcal{L} = \gamma_b \mathcal{L}_{basic} + \gamma_m \mathcal{L}_{moe} + \gamma_a \mathcal{L}_{adv}, \qquad (18)$$

where L_{basic} is kept the same as in GLMP (Wu et al., 2019a), and γ_b, γ_m and γ_a are hyper-parameters. More details about L_{basic} and L_{adv} can be found in the appendix.

3 Experiments

3.1 Datasets

Two publicly available datasets are used in this paper: SMD (Eric et al., 2017) and an extension of Multi-WOZ 2.1 (Budzianowski et al., 2018) in which we equip every dialogue with the corresponding KB.¹ The detailed statistics are presented in Table 1. We follow the same partition as Eric et al. (2017), Madotto et al. (2018) and Wu et al. (2019a) on SMD, and Budzianowski et al. (2018) on Multi-WOZ 2.1.

3.2 Experimental Settings

The dimensionality of the embeddings and LSTM hidden units is 128. The dropout ratio we use in our framework is selected from {0.1, 0.2} and the batch size from {16, 32}. In the framework, we adopt the weight tying trick (Wu et al., 2019a). We use Adam (Kingma and Ba, 2015) to optimize the parameters in our model and adopt the suggested hyper-parameters for optimization. All hyper-parameters are selected according to the validation set. More details about hyper-parameters can be found in the Appendix.

3.3 Baselines

We compare our model with the following state-of-the-art baselines.

¹The constructed datasets will be publicly available for further research.


Model | BLEU (SMD) | F1 (SMD) | Navigate F1 | Weather F1 | Calendar F1 | BLEU (Multi-WOZ 2.1) | F1 (Multi-WOZ 2.1) | Restaurant F1 | Attraction F1 | Hotel F1
Mem2Seq (Madotto et al., 2018) | 12.6 | 33.4 | 20.0 | 32.8 | 49.3 | 6.6 | 21.62 | 22.4 | 22.0 | 21.0
DSR (Wen et al., 2018) | 12.7 | 51.9 | 52.0 | 50.4 | 52.1 | 9.1 | 30.0 | 33.4 | 28.0 | 27.1
KB-retriever (Qin et al., 2019b) | 13.9 | 53.7 | 54.5 | 52.2 | 55.6 | - | - | - | - | -
GLMP (Wu et al., 2019a) | 13.9 | 60.7 | 54.6 | 56.5 | 72.5 | 6.9 | 32.4 | 38.4 | 24.4 | 28.1
Shared-Private framework (Ours) | 13.6 | 61.7 | 56.3 | 56.5 | 72.8 | 6.6 | 33.8 | 39.8 | 26.0 | 28.3
Dynamic Fusion framework (Ours) | 14.4* | 62.7* | 57.9* | 57.6* | 73.1* | 9.4* | 35.1* | 40.9* | 28.1* | 30.6*

Table 2: Main results. The numbers with * indicate that the improvement of our framework over all baselines is statistically significant with p < 0.05 under t-test.

Model | Entity F1 (%) Test | ∆
Full model | 62.7 | -
w/o Domain-Shared Knowledge Transfer | 59.0 | 3.7
w/o Dynamic Fusion Mechanism | 60.9 | 1.8
w/o Multi-Encoder | 61.0 | 1.7
w/o Multi-Decoder | 58.9 | 3.8
w/o Adversarial Training | 61.6 | 1.1

Table 3: Ablation tests on the SMD test set.

• Mem2Seq (Madotto et al., 2018): the model takes the dialogue history and KB entities as input and uses a pointer gate to control whether to generate a vocabulary word or select an input word as the output.

• DSR (Wen et al., 2018): the model leverages a dialogue state representation to retrieve the KB implicitly and applies a copying mechanism to retrieve entities from the knowledge base while decoding.

• KB-retriever (Qin et al., 2019b): the model adopts a retriever module to retrieve the most relevant KB row and filter out irrelevant information for the generation process.

• GLMP (Wu et al., 2019a): the framework adopts the global-to-local pointer mechanism to query the knowledge base during decoding and achieves state-of-the-art performance.

For Mem2Seq, DSR, and KB-retriever², we adopt the results reported by Qin et al. (2019b) and Wu et al. (2019a). For GLMP, we rerun their public code to obtain results on the same datasets.³

²For the Multi-WOZ 2.1 dataset, most dialogues are supported by more than a single row, which cannot be processed by KB-retriever, so we compare our framework with it on the SMD and CamRest datasets.

³Note that we find Wu et al. (2019a) report macro entity F1 as the micro F1, so we rerun their models (https://github.com/jasonwu0731/GLMP) to obtain results.

3.4 Results

Following prior work (Eric et al., 2017; Madotto et al., 2018; Wen et al., 2018; Wu et al., 2019a; Qin et al., 2019b), we adopt the BLEU and Micro Entity F1 metrics to evaluate model performance. The results on the two datasets are shown in Table 2, from which we can observe that: 1) The basic shared-private framework outperforms the best prior model GLMP on all the datasets. This indicates that the combination of domain-shared and domain-specific features can better enhance each domain's performance compared with only utilizing implicit domain-shared features. 2) Our framework achieves state-of-the-art performance on the two multi-domain task-oriented dialog datasets, namely SMD and Multi-WOZ 2.1. On the SMD dataset, our model has the highest BLEU compared with the baselines, which shows that our framework can generate more fluent responses. More importantly, our model outperforms GLMP by 2.0% overall, 3.3% in the Navigate domain, 1.1% in the Weather domain and 0.6% in the Schedule domain on the entity F1 metric, which indicates that considering the relevance between the target domain input and all domains is effective for enhancing the performance of each domain. On the Multi-WOZ 2.1 dataset, the same trend of improvement is observed, which further shows the effectiveness of our framework.

3.5 Analysis

We study the strengths of our model from several perspectives on the SMD dataset. We first conduct several ablation experiments to analyze the effect of different components in our framework. Next, we conduct domain adaptation experiments to verify the transferability of our framework given a new domain with little or no labeled data. In addition, we provide a visualization of the dynamic fusion layer and a case study to better understand how the module contributes to the performance.


Figure 5: Performance of domain adaptation on different subsets of the original training data: (a) Navigate domain, (b) Weather domain, (c) Schedule domain.

Figure 6: Zero-shot performance (F1 score) on each domain of the SMD dataset (GLMP vs. ours: Navigation 6.3 vs. 8.8, Weather 4.7 vs. 11.0, Schedule 17.6 vs. 42.7). The x-axis domain name indicates that the domain is unseen while the other two domains are the same as in the original dataset.

Model | Correct | Fluent | Humanlike
GLMP | 3.4 | 3.9 | 4.0
Our framework | 3.6 | 4.2 | 4.2
Agreement | 53% | 61% | 74%

Table 4: Human evaluation of responses on the randomly selected dialogue histories.

3.5.1 Ablation

Several ablation experiments are conducted and the results are shown in Table 3. In detail, 1) w/o Domain-Shared Knowledge Transfer denotes that we remove the domain-shared feature and keep only the fused domain-specific feature for generation. 2) w/o Dynamic Fusion Mechanism denotes that we simply sum all domain-specific features rather than using the MoE mechanism to dynamically fuse them. 3) w/o Multi-Encoder denotes that we remove the multi-encoder module and adopt one shared encoder in our framework. 4) w/o Multi-Decoder denotes that we remove the multi-decoder module and adopt one shared decoder. 5) w/o Adversarial Training denotes that we remove adversarial training from the experimental setting. Generally, all the proposed components contribute to the final performance. Specifically, we can clearly observe the effectiveness of our dynamic fusion mechanism: removing the dynamic domain-specific knowledge fusion causes a 1.8% drop, and a similar trend holds when removing the domain-shared knowledge transfer. This further verifies that domain-shared and domain-specific features both benefit the performance of each domain.

Figure 7: Distribution of the mixture-of-experts mechanism across source domains for 100 randomly selected examples in each domain on the SMD dataset.

3.5.2 Domain Adaptation

Low-Resource Setting To simulate the low-resource setting, we keep two domains unchanged and vary the ratio of the remaining domain's original training data over [1%, 5%, 10%, 20%, 30%, 50%]. The results are shown in Figure 5. We can find that: (1) Our framework outperforms the GLMP baseline on all ratios of the original dataset. When the data is only 5% of the original dataset, our framework outperforms GLMP by 13.9% on average over all domains. (2) Our framework trained with 5% of the training data can achieve comparable and even better performance compared to GLMP with 50% of the training data on some domains. This implies that our framework effectively transfers knowledge from other domains to achieve better performance on the low-resource new domain.

Zero-Shot Setting We further evaluate the domain adaptation ability in the zero-shot setting, given an unseen domain. We randomly remove one domain from the training set, and the other domains' data remain unchanged to train the model. During testing, the unseen domain input uses the MoE to automatically calculate the correlation between the other domains and the current input to obtain the results.


Figure 8: Case of the expert gate distribution in the SMD dataset (Navigation@5%; dialogue history: "Find location and address to home that is nearest me. Your home is at 56 cadwell street. Thanks, set the gps for there." Response: "No problem."). Text segments in red appear in both the schedule and navigation domains.

Results are shown in Figure 6: our model significantly outperforms GLMP on all three domains, which further demonstrates the transferability of our framework.

3.5.3 Visualization of Dynamic Fusion Layer

To better understand what our dynamic fusion layer has learned, we visualize the gate distribution for each domain in the low-resource (5%) setting, which fuses domain-specific knowledge across various cases. As shown in Figure 7, for a specific target domain, different examples may have different gate distributions, which indicates that our framework successfully learns how to transfer knowledge between different domains. For example, the navigation column contains 100 examples from its test set, and each row shows the corresponding expert value. More specifically, in the navigation column, we observe that the expert value for the schedule domain is larger than that for the weather domain, which indicates the schedule domain transfers more knowledge to navigation than the weather domain does.

3.5.4 Case Study

Furthermore, we provide one case from the navigation domain and its corresponding expert gate distribution. The case is generated with 5% of the training data in the navigation domain while the other two domains' datasets are kept the same, which better shows how the other two domains transfer knowledge to the low-resource domain. As shown in Figure 8, the expert value of the schedule domain is larger than that of the weather domain, which indicates that schedule contributes more than weather. In further exploration, we find that the words "location" and "set" appear in both the navigation and schedule domains, which shows that schedule has a closer relation with navigation than weather does, indicating that our model successfully transfers knowledge from the closest domain.

3.5.5 Human Evaluation

We provide a human evaluation of our framework and the other baseline models. We randomly generated 100 responses, based on distinct dialogue histories from the SMD test data. Following Wen et al. (2018) and Qin et al. (2019b), we hired human experts and asked them to judge the quality of the responses according to correctness, fluency, and humanlikeness on a scale from 1 to 5.

Results are illustrated in Table 4. We can see that our framework outperforms GLMP on all metrics, which is consistent with the automatic evaluation.

4 Related Work

Existing end-to-end task-oriented systems can be classified into two main classes. One series of work trains a single model on the mixed multi-domain dataset. Eric et al. (2017) augment the vocabulary distribution by concatenating KB attention to generate entities. Lei et al. (2018) first integrate dialogue state tracking into end-to-end task-oriented dialog. Madotto et al. (2018) combine end-to-end memory networks (Sukhbaatar et al., 2015) with sequence generation. Gangi Reddy et al. (2019) propose a multi-level memory architecture which first addresses queries, followed by results and finally each key-value pair within a result. Wu et al. (2019a) propose a global-to-local pointer mechanism to query the knowledge base. Compared with their models, our framework can not only explicitly utilize domain-specific knowledge but also consider the different relevance between domains. Another series of work trains a model on each domain separately. Wen et al. (2018) leverage a dialogue state representation to retrieve the KB implicitly. Qin et al. (2019b) first adopt a KB-retriever to explicitly query the knowledge base. Their work considers only domain-specific features. In contrast, our framework explicitly leverages domain-shared features across domains.

The shared-private framework has been explored in many other task-oriented dialog components. Liu and Lane (2017) apply a shared-private LSTM to generate shared and domain-specific features. Zhong et al. (2018) propose a global-local architecture to learn shared features across all slots and specific features for each slot. More recently, Zhang et al. (2018) utilize the shared-private model for text style adaptation. In our work, we explore the shared-private framework in end-to-end task-oriented dialog to better transfer domain knowledge for querying the knowledge base. In addition, we take inspiration from Guo et al. (2018), who successfully apply the mixture-of-experts (MoE) mechanism to multi-source domain adaptation and cross-lingual adaptation tasks. Our model not only combines the strengths of MoE to incorporate domain-specific features, but also applies adversarial training to encourage generating shared features. To the best of our knowledge, we are the first to effectively explore the shared-private framework in multi-domain end-to-end task-oriented dialog.

5 Conclusion

In this paper, we propose to use a shared-private model to investigate explicitly modeling domain knowledge for multi-domain dialog. In addition, a dynamic fusion layer is proposed to dynamically capture the correlation between a target domain and all source domains. Experiments on two datasets show the effectiveness of the proposed models. Besides, our model can quickly adapt to a new domain with little annotated data.

Acknowledgements

We thank Min Xu, Jiapeng Li, Jieru Lin and Zhouyang Li for their insightful discussions. We also thank all anonymous reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011 and 61772153. Besides, this work also received support via a Westlake-BrightDreams Robotics research grant.

References

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.

Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49, Saarbrucken, Germany. Association for Computational Linguistics.

Mihail Eric and Christopher Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 468–473, Valencia, Spain. Association for Computational Linguistics.

Revanth Gangi Reddy, Danish Contractor, Dinesh Raghu, and Sachindra Joshi. 2019. Multi-level memory for task oriented dialogs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3744–3754, Minneapolis, Minnesota. Association for Computational Linguistics.

Yaroslav Ganin and Victor Lempitsky. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.

Jiang Guo, Darsh Shah, and Regina Barzilay. 2018. Multi-source domain adaptation with mixture of experts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4694–4703, Brussels, Belgium. Association for Computational Linguistics.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1437–1447, Melbourne, Australia. Association for Computational Linguistics.

Bing Liu and Ian Lane. 2017. Multi-domain adversarial learning for slot filling in spoken language understanding. arXiv preprint arXiv:1711.11310.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–10, Vancouver, Canada. Association for Computational Linguistics.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1468–1478, Melbourne, Australia. Association for Computational Linguistics.

Libo Qin, Wanxiang Che, Yangming Li, Haoyang Wen, and Ting Liu. 2019a. A stack-propagation framework with token-level intent detection for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2078–2087, Hong Kong, China. Association for Computational Linguistics.

Libo Qin, Yijia Liu, Wanxiang Che, Haoyang Wen, Yangming Li, and Ting Liu. 2019b. Entity-consistent end-to-end task-oriented dialogue system with KB retriever. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 133–142, Hong Kong, China. Association for Computational Linguistics.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2440–2448.

Haoyang Wen, Yijia Liu, Wanxiang Che, Libo Qin, and Ting Liu. 2018. Sequence-to-sequence learning for task-oriented dialogue with dialogue state representation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3781–3792, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019a. Global-to-local memory pointer networks for task-oriented dialogue. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Haiming Wu, Yue Zhang, Xi Jin, Yun Xue, and Ziwen Wang. 2019b. Shared-private LSTM for multi-domain text classification. In Natural Language Processing and Chinese Computing - 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9-14, 2019, Proceedings, Part II, pages 116–128.

Steve J. Young, Milica Gasic, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Ye Zhang, Nan Ding, and Radu Soricut. 2018. SHAPED: Shared-private encoder-decoder for text style adaptation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1528–1538, New Orleans, Louisiana. Association for Computational Linguistics.

Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive encoder for dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1458–1467, Melbourne, Australia. Association for Computational Linguistics.


Hyperparameter Name | SMD | Multi-WOZ 2.1
Batch Size | 16 | 32
Hidden Size | 128 | 128
Embedding Size | 128 | 128
Learning Rate | 0.001 | 0.001
Dropout Ratio | 0.2 | 0.1
Teacher Forcing Ratio | 0.9 | 0.9
Number of Memory Network's Hops | 3 | 3

Table 5: Hyperparameters we use for the SMD and Multi-WOZ 2.1 datasets.

A Appendices

A.1 Hyperparameters Setting

The hyperparameters used for the SMD and Multi-WOZ 2.1 datasets are shown in Table 5.

A.2 Basic Loss Function

The loss L_{basic} used in our shared-private encoder-decoder model is the same as in GLMP. Different from the standard sequence-to-sequence model with attention mechanism, we use [h^f_{dec,t}, h^{f'}_{dec,t}] to replace [h_{dec,t}, h'_{dec,t}] and then obtain the sketch word probability distribution P^{vocab}_t. Based on the gold sketch response Y^s = (y^s_1, ..., y^s_n), we calculate the standard cross-entropy loss L_v as follows:

$$P^{vocab}_t = \mathrm{Softmax}(U[\mathbf{h}^f_{dec,t}, \mathbf{h}^{f'}_{dec,t}]), \qquad (19)$$

$$\mathcal{L}_v = \sum_{t=1}^{n} -\log\big(P^{vocab}_t(y^s_t)\big). \qquad (20)$$

Given the system response Y, we get the global memory pointer label sequence G^{label} = (g_1, ..., g_{b+T}) and the local memory pointer label sequence L^{label} = (l_1, ..., l_n) as follows:

$$g_i = \begin{cases} 1 & \text{if } \mathrm{Object}(m_i) \in Y \\ 0 & \text{otherwise} \end{cases}, \qquad (21)$$

$$l_t = \begin{cases} \max(z) & \text{if } \exists z \text{ s.t. } y_t = \mathrm{Object}(m_z) \\ b + T + 1 & \text{otherwise} \end{cases}, \qquad (22)$$

where m_i represents one triplet in the external knowledge M = [B; X] = (m_1, ..., m_{b+T}), and the Object(·) function denotes getting the object word from a triplet.
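As an illustration, the label construction of Eqs. (21)–(22) can be sketched as follows (the function and variable names are ours; indexing is 0-based in the sketch, whereas the equations use 1-based positions):

```python
def build_pointer_labels(memory, response_tokens):
    """Sketch of Eqs. (21)-(22): build global and local pointer labels.

    memory          : list of (subject, relation, object) triples m_1..m_{b+T}
    response_tokens : gold response tokens y_1..y_n
    """
    objects = [obj for (_, _, obj) in memory]       # Object(m_i) for every triple
    response_set = set(response_tokens)

    # Eq. (21): g_i = 1 iff the object word of m_i appears in the response.
    g_label = [1 if obj in response_set else 0 for obj in objects]

    # Eq. (22): l_t points to the last memory position whose object equals y_t,
    # or to the extra "null" slot (b+T+1 in the paper's 1-based indexing) otherwise.
    null_idx = len(memory)
    l_label = []
    for y in response_tokens:
        positions = [z for z, obj in enumerate(objects) if obj == y]
        l_label.append(max(positions) if positions else null_idx)
    return g_label, l_label
```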

Then, L_g can be written as follows:

$$\mathcal{L}_g = -\sum_{i=1}^{b+T}\big(g_i \cdot \log \hat{g}_i + (1 - g_i)\cdot\log(1 - \hat{g}_i)\big), \qquad (23)$$

where ĝ_i denotes the predicted global memory pointer value.

Based on L^{label} and P_t = (p^k_1, ..., p^k_{b+T}), we can calculate the standard cross-entropy loss L_l as follows:

$$\mathcal{L}_l = \sum_{t=1}^{n} -\log\big(P_t(l_t)\big). \qquad (24)$$

Finally, L_{basic} is the weighted sum of the three losses:

$$\mathcal{L}_{basic} = \gamma_g \mathcal{L}_g + \gamma_v \mathcal{L}_v + \gamma_l \mathcal{L}_l, \qquad (25)$$

where γ_g, γ_v and γ_l are hyperparameters.

A.3 Adversarial Training

We apply a Convolutional Neural Network (CNN) as the domain classifier in both the shared encoder and the shared decoder to identify the domain of the shared representations of the dialogue history H^s_{enc} and the response H^s_{dec}. Taking the encoder for example, based on H^s_{enc}, we extract the context representation c^s_{enc} with the CNN, and then β_{enc} ∈ R^{|D|} can be calculated as follows:

$$\beta_{enc} = \mathrm{Sigmoid}\big(\mathrm{LeakyReLU}(W_{enc}(\mathbf{c}^s_{enc}))\big), \qquad (26)$$

ing its parameters θd to minimize the sequence-level binary cross-entropy loss Ladv

enc as follows:

maxθs

minθd

Ladvenc =−

|D|∑i=1

(ei · log(βenc,i|θs,θd)

+(1− ei) · log(1− βenc,i|θs,θd)),(27)

where β_{enc,i} represents the probability that the input dialogue history belongs to domain d_i. Similarly, we can obtain L^{adv}_{dec} and sum them up as L^{adv} = L^{adv}_{enc} + L^{adv}_{dec}.

In order to update the encoder-decoder model parameters θ_s underlying the domain classifier, we introduce the gradient reversal layer to reverse the gradient direction, which trains our model to extract domain-shared features that confuse the classifier. On the one hand, we train the domain classifier to minimize the domain classification loss. On the other hand, we update the parameters of the network underlying the domain classifier to maximize the domain classification loss, which works adversarially against the domain classifier. This encourages our shared encoder and decoder to extract domain-shared features.
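The gradient reversal layer itself can be implemented as an identity mapping in the forward pass that negates (and optionally scales) the gradient in the backward pass. The following is a standard PyTorch sketch of this idea, not the authors' released code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal (Ganin and Lempitsky, 2014): identity forward, flipped gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing into the shared encoder-decoder.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    # Inserted between the shared representation and the domain classifier, so that
    # minimizing the classifier loss pushes the shared features to become domain-invariant.
    return GradReverse.apply(x, lambd)
```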