
Towards Deep Conversational Recommendations

Raymond Li¹,², Samira Ebrahimi Kahou¹,³, Hannes Schulz³, Vincent Michalski⁴,⁵, Laurent Charlin⁵,⁶, and Chris Pal¹,²,⁵

¹Ecole Polytechnique de Montréal  ²Element AI  ³Microsoft Research Montreal  ⁴Université de Montréal  ⁵Mila  ⁶HEC Montréal

Abstract

There has been growing interest in using neural networks and deep learning techniques to create dialogue systems. Conversational recommendation is an interesting setting for the scientific exploration of dialogue with natural language as the associated discourse involves goal-driven dialogue that often transforms naturally into more free-form chat. This paper provides two contributions. First, until now there has been no publicly available large-scale dataset consisting of real-world dialogues centered around recommendations. To address this issue and to facilitate our exploration here, we have collected REDIAL, a dataset consisting of over 10,000 conversations centered around the theme of providing movie recommendations. We make this data available to the community for further research. Second, we use this dataset to explore multiple facets of conversational recommendations. In particular we explore new neural architectures, mechanisms, and methods suitable for composing conversational recommendation systems. Our dataset allows us to systematically probe model sub-components addressing different parts of the overall problem domain, ranging from sentiment analysis and cold-start recommendation generation to detailed aspects of how natural language is used in this setting in the real world. We combine such sub-components into a full-blown dialogue system and examine its behavior.

1 Introduction

Deep-learning-based approaches to creating dialogue systems provide extremely flexible solutions for the fundamental algorithms underlying dialogue systems. In this paper we explore fundamental algorithmic elements of conversational recommendation systems through examining a suite of neural architectures for sub-problems of conversational recommendation making.

It is well known that deep learning techniques require considerable amounts of data to be effective. Addressing this need, we provide a new dataset of 10,000 dialogues to the community to facilitate the study of discourse with natural language when making recommendations is an explicit goal of the exchange. Our setting of interest and our new dataset, named REcommendations through DIALog (REDIAL)¹, are centered around conversations about movies where one party in the conversation is seeking recommendations and the other party is providing recommendations. Our decision for focusing on this domain is motivated in part by the following.

A good discussion with a friend, librarian, movie rental store clerk or movie fan can be an enjoyable experience, leading to new ideas for movies that one might like to watch. We shall refer to this general setting as conversational movie recommendation. While dialogue systems are sometimes characterized as falling into the categories of goal-directed dialogue vs. chit-chat, we observe that discussions about movies often combine various elements of chit-chat, goal-directed dialogue, and even question answering in a natural way. As such the practical goal of creating conversational

¹ https://redialdata.github.io/website/

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

arXiv:1812.07617v2 [cs.LG] 4 Mar 2019


recommendation systems provides an excellent setting for the scientific exploration of the continuum between these tasks.

This paper makes a number of contributions. First, we provide the only real-world, two-party conversational corpus of this form (that we are aware of) to the community. We outline the data-collection procedure in Section 3. Second, we use this corpus to systematically propose and evaluate neural models for key sub-components of an overall conversational recommendation system. We focus our exploration on three key elements of such a system, consisting of: 1) Making recommendations; we examine sampling based methods for learning to make recommendations in the cold-start setting using an autoencoder [1]. We present this model in Section 4.3 and evaluate it in Section 5. Prior work with such models has not examined the cold-start setting, which must be addressed in our dialogue set-up. 2) Classifying opinions or the sentiment of a dialogue participant with respect to a particular movie. For this task, throughout the dialogue, whenever a new movie is discussed we instantiate an RNN-based sentiment-prediction model. This model is used to populate the autoencoder-based recommendation engine above. We present this model component and our analysis of its behavior and performance in Sections 4.2 and 5 respectively. 3) We compose the components outlined above into a complete neural dialogue model for conversation and recommendation. For this aspect of the problem we examine a novel formulation of a hierarchical recurrent encoder-decoder (HRED) model [2] with a switching mechanism inspired by Gulcehre et al. [3] that allows suggested movies to be integrated into the model for the dialogue acts of the recommender. As our new dataset is relatively small for neural network techniques, our modular approach allows one to train sub-components on other, larger data sources, whereas naïvely training end-to-end neural models from scratch using only our collected dialogue data can lead to overfitting.

2 Related Work

While we are aware of no large-scale public dataset of human-to-human dialogue on the subject of movie recommendations, we review some of the most relevant work of which we are aware below. We also review a selection of prior work on related methods in Section 4, just prior to introducing each component of our model.

Dodge et al. [4] introduced four movie dialogue datasets comprising the Facebook Movie Dialog Data Set. There is a QA dataset, a recommendation dataset, and a QA + recommendation dataset. All three are synthetic datasets built from the classic MovieLens ratings dataset [5]² and the Open Movie Database³. Others have also explored procedures for generating synthetic dialogues from ratings data [6]. The fourth dataset is a Reddit dataset composed of around 1M dialogues from the movie subreddit⁴. The recommendation dataset is the closest to what we propose; however, it is synthetically generated from natural language patterns, and the answers are always a single movie name. The Reddit dataset is also similar to ours in the sense that it consists of natural conversations on the topic of movies. However, the exchanges are more free-form and obtaining a good recommendation is not a goal of the discourse.

Krause et al. [7] introduce a dataset of self-dialogues collected for the Amazon Alexa Prize competition⁵, using Amazon Mechanical Turk (AMT). The workers are asked to imagine a conversation between two individuals on a given topic and to play both roles. The topics are mostly about movies, music, and sport. The conversations are not specifically about movie recommendations, but have the advantage of being quite natural compared to the Facebook Movie Dialog Data Set. The authors use this data to develop a chat bot made of several components, including a rule-based component, a matching-score component that compares the context with similar conversations from the data to output a message from the data, and a (generative) recurrent neural network (RNN). They perform human evaluation of the matching-score component.

Some older work from the PhD thesis of Johansson [8] involved collecting a movie-recommendation-themed dialogue corpus with 24 dialogues, consisting of 2684 utterances and a mean of 112 utterances per dialogue. In contrast, our corpus has over 10k conversations and 160k utterances. See Serban et al. [9] for an updated survey of corpora for data-driven dialogue systems.

The recommender-systems literature has also proposed models for conversational systems. These approaches are goal-oriented and combine various different modules, each designed (and trained) independently [10, 11]. Further, these approaches either rely on tracking the state of the dialogue

² https://grouplens.org/datasets/movielens/  ³ http://en.omdb.org  ⁴ http://reddit.com/r/movie  ⁵ https://developer.amazon.com/alexaprize


using slot-value pairs [12, 13] or focus on different objectives such as minimizing the number of user queries needed to obtain good recommendations [14]. Other approaches [15, 16, 17, 18] use reinforcement learning to train goal-oriented dialogue systems. Sun and Zhang [18] apply it to conversational recommendations: a simulated user allows them to train the dialogue agent to extract the facet values needed to make an appropriate recommendation. In contrast, we propose a conditional generative model of (natural language) recommendation conversations, and our contributed dataset allows one both to train sub-modules and to explore end-to-end trainable models.

3 REDIAL dataset collection

Here we formalize the setup of a conversation involving recommendations for the purposes of data collection. To provide some additional structure to our data (and models) we define one person in the dialogue as the recommendation seeker and the other as the recommender. To obtain data in this form, we developed an interface and pairing mechanism mediated by Amazon Mechanical Turk (AMT). Our task setup is very similar to that used by Das et al. [19] to collect dialogue data around an image guessing game, except that we focus on movie recommendations. We pair up AMT workers and give each of them a role. The movie seeker has to explain what kind of movie he/she likes, and asks for movie suggestions. The recommender tries to understand the seeker's movie tastes, and recommends movies. All exchanges of information and recommendations are made using natural language.

We add additional instructions to improve the data quality and guide the workers to converse the way we expect them to. We ask workers to use formal language and to produce conversations of roughly ten messages minimum. We also require that at least four different movies are mentioned in every conversation. Finally, we ask workers to converse only about movies, and notably not to mention Mechanical Turk or the task itself. See Figure 4 in the supplementary material for a screenshot of the interface.

In addition, we ask that every movie mention is tagged using the '@' symbol. When workers type '@', the following characters are used to find matching movie names, and workers can choose a movie from that list. This allows us to detect exactly which movies are mentioned and when. We gathered entities from DBpedia that were of type <http://dbpedia.org/ontology/Film> to obtain a list of movies, but also allow workers to add a movie to the list if it is not present already. We also obtained movie release dates from DBpedia. Note that the year or release date of a movie can be essential to differentiate movies with the same name but released at different dates.
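The '@'-triggered lookup described above can be sketched as a simple prefix match over the movie list; this is an illustrative sketch, not the authors' interface code, and the function name and example titles are our own.

```python
# Illustrative sketch of the '@' movie lookup: return titles whose
# lowercased name starts with the characters typed after '@'.
def match_movies(prefix, movie_list):
    prefix = prefix.lower()
    return [title for title in movie_list if title.lower().startswith(prefix)]

movies = ["The Sixth Sense (1999)", "The Silence of the Lambs (1991)", "Up (2009)"]
print(match_movies("the si", movies))
# → ['The Sixth Sense (1999)', 'The Silence of the Lambs (1991)']
```

In practice the interface would also need to handle release years to disambiguate movies sharing a title, as noted above.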

Workers are (separately from the ongoing discussion) asked three questions for each movie: (1) whether the movie was mentioned by the seeker, or was a suggestion from the recommender ("suggested" label); (2) whether the seeker has seen the movie ("seen" label): one of Seen it, Haven't seen it, or Didn't say; (3) whether the seeker liked the movie or the suggestion ("liked" label): one of Liked, Didn't like, Didn't say. We will refer to these additional labels as movie dialogue forms. Both workers have to answer these forms even though they only concern the seeker's movie tastes. We use those ratings to validate data collection: the two workers agreeing on the forms is generally an indicator of conscientious workers. Ideally, the two participants would give the same answer to every form, but it is possible that their answers do not coincide (because of carelessness, or dialogue ambiguity). The released dataset provides both workers' answers. The movie dialogue forms therefore allow us to evaluate sub-components of an overall neural dialogue system more systematically; for example, one can train and evaluate a sentiment analysis model directly using these labels. We believe that predicting sentiment from dialogues poses an interesting sub-challenge within conversational recommendation, as the sentiment can be expressed in a question-answer form over several dialogue utterances.
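A single form answer can be pictured as the following record; the field names here are illustrative and are not the released dataset's exact schema.

```python
# Hypothetical layout of one movie dialogue form answer, with both workers'
# responses side by side (field names are illustrative, not the dataset schema).
form_answer = {
    "movie": "The Sixth Sense (1999)",
    "seeker": {"suggested": False, "seen": "Seen it", "liked": "Liked"},
    "recommender": {"suggested": False, "seen": "Seen it", "liked": "Liked"},
}

# Agreement between the two workers' answers can flag conscientious work.
def workers_agree(ans):
    return ans["seeker"] == ans["recommender"]
```

A validation pass over a conversation could then count the fraction of movies on which the two workers agree.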

In each conversation, the number of movies mentioned varies, so we have different numbers of movie dialogue form answers for each conversation. The distribution of the different classes of the movie dialogue form is shown in Table 1a. The liked/disliked/did not say label is highly imbalanced. This is standard for recommendation data [20], since people are naturally more likely to talk about movies that they like, and the recommender's objective is to recommend movies that the seeker is likely to like. Table 1b shows an example conversation from the dataset.

For the AMT HIT we collect data in English and restrict the data collection to countries where English is the main language. The fact that we pair workers together slows down the data collection, since two people must be online at the same time to do the task, so a sizable pool of workers is required to make the collection possible. Meanwhile, the task is quite demanding, and we have to select qualified workers. HIT reward and qualification requirements were decisive in obtaining good conversation quality


# conversations           10006
# utterances              182150
# users                   956
# movie mentions          51699
  seeker mentioned        16278
  recommender suggested   35421
  not seen                16516
  seen                    31694
  did not say             3489
  disliked (4.9%)         2556
  liked (81%)             41998
  did not say (14%)       7145

Table 1a. (Above) REDIAL data statistics. For the movie dialogue forms, the numbers shown represent the seeker's answers. Table 1b. (Right) Conversation excerpts (HUMAN followed by response by SEEKER) and model outputs (OURS, our proposed approach, compared to HRED, a generic dialogue model [2]).

HUMAN: hello, can i help you find a movie ?
HRED: hi what kind of movies do you like
OURS: hi, how can i help you ?
SEEKER: hello ! i would like to find a suspenseful, but clean family friendly movie .
... ...
SEEKER: 2001 : a space odyssey might be a great option. anything else that you would suggest ?
HUMAN: you can try planet of the apes the older one is quite suspenseful and family friendly .
HRED: i haven’t seen that one yet but i heard it was good. i haven’t seen that one. have you seen the last house on the left ?
OURS: star wars : the force awakens is also a good one return of the jedi all good movies
SEEKER: those sound good ! i ’m going to look into those movies.
HUMAN: i hope you enjoy, have a nice one
HRED: have you seen foxcatcher ? it ’s about a man who has a rich guy.
OURS: i hope i was able to help you find a good movie to watch
SEEKER: thank you for your help ! have a great night ! good bye

Note: We provide additional conversation examples and model outputs in the supplementary material.

while still ensuring that people could get paired together. We launched preliminary HITs to find a compromise and finally set the reward to $0.50 per person for each completed conversation (so each conversation costs us $1, plus taxes), and ask that workers meet the following requirements: (1) approval percentage greater than 95; (2) number of approved HITs greater than 1000; and (3) they must be in the United States, Canada, the United Kingdom, Australia, or New Zealand.

4 Our Approach

We aim at developing an agent capable of chatting with a partner and asking questions about their movie tastes in order to make movie recommendations. One might therefore characterize our system as a recommendation "chat-bot". The complete architecture of our approach is illustrated in Figure 1. Starting from the bottom of Figure 1, there are four sub-components: (1) a hierarchical recurrent encoder following the HRED [2] architecture, using general purpose representations based on the GenSen model [21]; (2) a switching decoder inspired by Gulcehre et al. [3], modeling the dialogue acts generated by the recommender; (3) after each dialogue act, our model detects if a movie entity has been discussed (with the @identifier convention) and we instantiate an RNN focused on classifying the seeker's sentiment or opinion regarding that entity. As such, there are as many of these RNNs as there are movie entities discussed in the discourse. The sentiment analysis RNNs are used to indicate the user opinions forming the input to (4), an autoencoder-based recommendation module [1]. The autoencoder recommender's output is used by the decoder through a switching mechanism. Some of these components can be pre-trained on external data, thus compensating for the small data size. Notably, the switching mechanism allows one to include the recommendation engine, which we trained using the significantly larger MovieLens data. We provide more details for each of these components below and describe the training procedure in the supplementary materials.

4.1 Our Hierarchical Recurrent Encoder

Our dialogue model is reminiscent of the hierarchical recurrent encoder-decoder (HRED) architecture proposed and developed in Sordoni et al. [2] and Serban et al. [22]. We reuse their hierarchical architecture, but we modify the decoder so that it can take explicit movie recommendations into


Figure 1: Our proposed model for conversational recommendations.

account, and we modify the encoder to take general purpose sentence (GenSen) representations arising from a bidirectional Gated Recurrent Unit (GRU) [23] as input. Since our new dataset consists of about 10k dialogues (which is relatively small for deep learning techniques), we use pre-trained GenSen representations obtained from the encoder outlined in Subramanian et al. [21]. These representations have led to higher performance across a variety of new tasks in lower data regimes (e.g. with only 10k examples). We use the embeddings and first layer of the GenSen sentence encoder, which are pre-trained on multiple language tasks, and we keep them frozen during training of our model. To deal with movies discussed in the dialogue via the @movie convention for movie entities, @movie tokens in the input are replaced by the corresponding word tokens of the movie's title.
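The @movie substitution described above amounts to expanding each movie-entity token into the word tokens of its title. The sketch below is illustrative; the identifier string and function name are our own, not values from the dataset.

```python
# Sketch of the @movie token expansion: each movie-entity token is replaced
# by the word tokens of its title (the identifier here is hypothetical).
movie_titles = {"@111776": "the sixth sense"}

def expand_movie_tokens(tokens, movie_titles):
    out = []
    for tok in tokens:
        if tok in movie_titles:
            out.extend(movie_titles[tok].split())  # title words replace the token
        else:
            out.append(tok)
    return out

print(expand_movie_tokens(["you", "would", "like", "@111776", "."], movie_titles))
# → ['you', 'would', 'like', 'the', 'sixth', 'sense', '.']
```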

More formally, we model each utterance $U_m$ as a sequence of $N_m$ words $U_m = (w_{m,1}, \ldots, w_{m,N_m})$, where the tokens $w_{m,n}$ are either words from a vocabulary $V$ or movie names from a set of movies $V'$. We also use a scalar $s_m \in \{-1, 1\}$ appended to each utterance to indicate the role (recommender or seeker), such that a dialogue of $M$ utterances can be represented as $D = ((U_1, s_1), \ldots, (U_M, s_M))$. We use a GRU to encode utterances and dialogues. Given an input sequence $(i_1, \ldots, i_T)$, the network computes reset gates $r_t$, input gates $z_t$, new gates $n_t$, and forward hidden state $\vec{h}_t$ as follows:

$$r_t = \sigma(W_{ir} i_t + W_{hr} \vec{h}_{t-1} + b_r), \qquad z_t = \sigma(W_{iz} i_t + W_{hz} \vec{h}_{t-1} + b_z)$$
$$n_t = \tanh\!\left(W_{in} i_t + b_{in} + r_t \circ (W_{hn} \vec{h}_{t-1} + b_{hn})\right), \qquad \vec{h}_t = (1 - z_t) \circ n_t + z_t \circ \vec{h}_{t-1}$$

where the $W_{**}$ and $b_{*}$ are the learned parameters. In the case of a bi-directional GRU, the backward hidden state $\overleftarrow{h}_t$ is computed the same way, but takes the inputs in reverse order. In a multi-layer GRU, the hidden states of the first layer $(\vec{h}^{(1)}_1, \ldots, \vec{h}^{(1)}_T)$ (or the concatenation of the forward and backward hidden states $[\vec{h}^{(1)}_1; \overleftarrow{h}^{(1)}_1], \ldots, [\vec{h}^{(1)}_T; \overleftarrow{h}^{(1)}_T]$ for a bi-directional GRU) are passed as inputs to the second layer, and so on. For the utterance encoder, words are embedded in a 2048-dimensional space. Each utterance is then passed to the sentence encoder bi-directional GRU. The final hidden state of the last layer is used as the utterance representation $u = [\vec{h}^{(-1)}_T; \overleftarrow{h}^{(-1)}_T]$. We obtain a sequence of utterance representations $u_1, \ldots, u_M$. To assist the conversation encoder we append a binary-valued scalar $s_m$ to each utterance representation $u_m$, indicating if the sender is the seeker or the recommender. The sequence $u'_1, \ldots, u'_M$ is passed to the conversation encoder unidirectional GRU, which produces conversation representations at each step of the dialogue: $h_1, \ldots, h_M$.
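The GRU update above can be sketched in a few lines of NumPy; this is an illustrative single-step implementation of the standard gate equations, not the authors' code, and the function name and argument layout are our own.

```python
import numpy as np

# Minimal NumPy sketch of one GRU step: reset gate r_t, input gate z_t,
# new gate n_t, and the forward hidden state update (illustrative code).
def gru_step(i_t, h_prev, W_ir, W_hr, b_r, W_iz, W_hz, b_z, W_in, b_in, W_hn, b_hn):
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    r_t = sigmoid(W_ir @ i_t + W_hr @ h_prev + b_r)                   # reset gate
    z_t = sigmoid(W_iz @ i_t + W_hz @ h_prev + b_z)                   # input gate
    n_t = np.tanh(W_in @ i_t + b_in + r_t * (W_hn @ h_prev + b_hn))   # new gate
    return (1.0 - z_t) * n_t + z_t * h_prev                           # h_t
```

With all weights at zero, both gates equal 0.5 and the new gate is zero, so the step simply halves the previous hidden state, which is a convenient sanity check.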

4.2 Dynamically Instantiated RNNs for Movie Sentiment Analysis

In a test setting, users would not provide explicit ratings for movies mentioned in the conversation. Their sentiment can however be inferred from the utterances themselves. Therefore, to drive our autoencoder-based recommendation module, we build a model that takes as input both the dialogue and a movie name, and predicts for that movie the answers to the associated movie dialogue form. We remind the reader that both workers answer the movie dialogue form, but it only concerns the seeker's movie tastes. It often happens that the two workers do not agree on all the answers to the forms. This may come either from a real ambiguity in the dialogue, or from worker carelessness (data noise). So the model predicts different answers for the seeker and for the recommender. For each participant it learns to predict three labels: the "suggested" label (binary), the "seen" label (categorical with three classes), and the "liked" label (categorical with three classes), for a total of 14 dimensions.

Let us denote by $D = \{(x_i, y_i),\ i = 1..N\}$ the training set, where $x_i = (D_i, m_i)$ is the pair of a dialogue $D_i$ and a movie name $m_i$ that is mentioned in $D_i$, and

$$y_i = (\underbrace{y_i^{sugg}, y_i^{seen}, y_i^{liked}}_{\text{seeker's answers}},\ \underbrace{y_i'^{sugg}, y_i'^{seen}, y_i'^{liked}}_{\text{recommender's answers}}) \quad (1)$$

are the labels in the movie dialogue form corresponding to movie $m_i$ in dialogue $D_i$. So if 5 movies were mentioned in dialogue $D$, this dialogue appears 5 times in a training epoch.

The model is based on the hierarchical encoder (Section 4.1). For sentiment analysis, we modify the utterance encoder to take the movie $m$ into account. After the first layer of the utterance encoder GRU (which is pre-trained), we add a dimension to the hidden states that indicates for each word whether it is part of a movie mention. For example, if we condition on the movie The Sixth Sense, then the input ["<s>", "you", "would", "like", "the", "sixth", "sense", ".", "</s>"] produces the movie mention feature [0, 0, 0, 0, 1, 1, 1, 0, 0]. The utterance and conversation encoding then continue as described in Section 4.1, producing dialogue representations $h_1, \ldots, h_M$ at each dialogue step.
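Building the movie mention feature above is a matter of marking which token positions match the conditioned title. The sketch below is illustrative (function name is ours) and assumes the title appears as a contiguous token sequence.

```python
# Sketch of the movie-mention feature: mark token positions that belong to
# the conditioned movie's title (illustrative; assumes contiguous title tokens).
def mention_feature(tokens, title_tokens):
    n = len(title_tokens)
    feat = [0] * len(tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == title_tokens:
            for k in range(start, start + n):
                feat[k] = 1
    return feat

tokens = ["<s>", "you", "would", "like", "the", "sixth", "sense", ".", "</s>"]
print(mention_feature(tokens, ["the", "sixth", "sense"]))
# → [0, 0, 0, 0, 1, 1, 1, 0, 0]
```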

The dialogue representation at the last utterance, $h_M$, is passed to a fully connected layer. The resulting vector has 14 dimensions. We apply a sigmoid to the first component to obtain the predicted probability $o_i^{sugg}$ that the seeker answered that the movie was suggested by the recommender. We apply a softmax to the next three components to obtain the predicted probabilities $o_i^{seen}$ for the seeker's answer to the not-seen/seen/did-not-say variable. We apply a softmax to the following three components to obtain the predicted probabilities $o_i^{liked}$ for the seeker's answer to the disliked/liked/did-not-say variable. The last 7 components are treated the same way to obtain the probabilities of answers according to the recommender: $o'^{sugg}, o'^{seen}, o'^{liked}$. We denote the parameters of the neural network by $\theta$ and by $o_i = f_\theta(x_i) = (o_i^{sugg}, o_i^{seen}, o_i^{liked}, o_i'^{sugg}, o_i'^{seen}, o_i'^{liked})$ the prediction of the model. We minimize the sum of the three corresponding cross-entropy losses.
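The split of the 14-dimensional output into the six predictions can be sketched as follows; this is an illustrative decomposition (function name is ours), not the authors' code.

```python
import numpy as np

# Sketch of splitting the 14-dim output: sigmoid on the binary "suggested"
# components, softmax on each three-way "seen" / "liked" group (illustrative).
def split_predictions(logits):
    softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    o_sugg = sigmoid(logits[0])        # seeker: "suggested" (binary)
    o_seen = softmax(logits[1:4])      # seeker: not-seen / seen / did-not-say
    o_liked = softmax(logits[4:7])     # seeker: disliked / liked / did-not-say
    o2_sugg = sigmoid(logits[7])       # recommender's answers, same layout
    o2_seen = softmax(logits[8:11])
    o2_liked = softmax(logits[11:14])
    return o_sugg, o_seen, o_liked, o2_sugg, o2_seen, o2_liked
```

At zero logits each sigmoid output is 0.5 and each softmax output is uniform over its three classes.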

4.3 The Autoencoder Recommender

At the start of each conversation, the recommender has no prior information on the movie seeker (cold start). During the conversation, the recommender gathers information about the movie seeker and (implicitly) builds a profile of the seeker's movie preferences. Sedhain et al. [1] developed a user-based autoencoder for collaborative filtering (U-Autorec), a model capable of predicting ratings for users not seen in the training set. We use a similar model and pre-train it with MovieLens data [5].

We have $M$ users, $|V'|$ movies, and a partially observed user-movie rating matrix $R \in \mathbb{R}^{M \times |V'|}$. Each user $u \in \{1, \ldots, M\}$ can be represented by a partially observed vector $r^{(u)} = (R_{u,1}, \ldots, R_{u,|V'|})$. Sedhain et al. [1] project $r^{(u)}$ into a smaller space with a fully connected layer, then retrieve the full ratings vector $\hat{r}^{(u)} = h(r^{(u)}; \theta)$ with another fully connected layer. So during training they minimize the following loss:

$$L_R(\theta) = \sum_{u=1}^{M} \left\| r^{(u)} - h(r^{(u)}; \theta) \right\|_O^2 + \lambda \|\theta\|^2 \quad (2)$$

where $\|\cdot\|_O$ is the L2 norm when considering the contribution of observed ratings only, and $\lambda$ controls the regularization strength.
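The per-user term of Equation 2 can be computed with a binary mask over the observed entries; the sketch below is illustrative (function name is ours), not the paper's implementation.

```python
import numpy as np

# NumPy sketch of one user's term in Equation 2: the observed-only norm
# ||r - h(r)||^2_O via a binary mask, plus the L2 penalty (illustrative).
def autorec_loss(r, r_hat, observed_mask, theta, lam):
    reconstruction = np.sum(observed_mask * (r - r_hat) ** 2)  # observed entries only
    regularization = lam * np.sum(theta ** 2)                  # lambda * ||theta||^2
    return reconstruction + regularization
```

Summing this term over all users recovers the full objective; unobserved entries contribute nothing to the reconstruction error.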


To improve the performance of this model in the early stages of performing recommendations (i.e. in the cold-start setting), we train it as a denoising autoencoder [24]. We denote by $N_u$ the number of observed ratings in the user vector $r^{(u)}$. During training, we sample the number of inputs kept, $p$, uniformly at random in $\{1, \ldots, N_u - 1\}$. Then we draw $p$ inputs uniformly without replacement among all the observed inputs in $r^{(u)}$, which gives us a noisy user vector $\tilde{r}^{(u)}$. The term inside the sum of Equation 2 becomes $\| r^{(u)} - h(\tilde{r}^{(u)}; \theta) \|_O^2$. The validation procedure is not changed: the complete input from the training set is used at validation or test time.
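The input corruption above can be sketched as follows; this is illustrative code (function name is ours) that represents unobserved entries as NaN and assumes $N_u \geq 2$.

```python
import numpy as np

# Sketch of the denoising corruption: keep p of the N_u observed ratings,
# p drawn uniformly from {1, ..., N_u - 1} (illustrative; unobserved = NaN,
# assumes the user has at least two observed ratings).
def corrupt_ratings(r, rng):
    observed = np.flatnonzero(~np.isnan(r))     # indices of observed ratings
    p = rng.integers(1, len(observed))          # number of inputs kept
    kept = rng.choice(observed, size=p, replace=False)
    r_noisy = np.full_like(r, np.nan)
    r_noisy[kept] = r[kept]
    return r_noisy
```

The model is then trained to reconstruct the full observed vector from this corrupted input.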

4.4 Our Decoder with a Movie Recommendation Switching Mechanism

Let us place ourselves at step $m$ in dialogue $D$. The sentiment analysis RNNs presented above predict, for each movie mentioned so far, whether the seeker liked it or not, using the previous utterances. These predictions are used to create an input $r_{m-1} \in \mathbb{R}^{|V'|}$ for the recommendation system. The recommendation system uses this input to produce a full vector of ratings $\hat{r}_{m-1} \in \mathbb{R}^{|V'|}$. The hierarchical encoder (Section 4.1) produces the current context $h_{m-1}$ from the previous utterances. The recommendation vector $\hat{r}_{m-1}$ and the context $h_{m-1}$ are used by the decoder to predict the next utterance by the recommender.

For the decoder, a GRU decodes the context to predict the next utterance step by step. To select between the two types of tokens (words or movie names), we use a switch, as Gulcehre et al. [3] did for the pointer softmax. The decoder GRU's hidden state is initialized with the context, $h'_{m,0} = h_{m-1}$, and the sentence is decoded as follows: $h'_{m,n} = \mathrm{GRU}(h'_{m,n-1}, w_{m,n})$, $v_{m,n} = \mathrm{softmax}(W h'_{m,n})$, where $v_{m,n} \in \mathbb{R}^{|V|}$ is the predicted probability distribution for the next token $w_{m,n+1}$, given that this token is a word. The recommendation vector $\hat{r}_{m-1}$ is used to obtain a predicted probability distribution vector $v'_{m,n} \in \mathbb{R}^{|V'|}$ for the next token $w_{m,n+1}$, given that this token is a movie name: $v'_{m,n} = \mathrm{softmax}(\hat{r}_{m-1}) = v'_{m,0}\ \forall n$. Note that we use the same movie distribution $v'_{m,0}$ during the whole utterance decoding. Indeed, while the recommender's message is being decoded, it does not gather additional information about the seeker's movie preferences, so the movie distribution should not change. A switching network conditioned on the context $h_{m-1}$ and the hidden state $h'_{m,n}$ predicts the probability $d_{m,n}$ that the next token $w_{m,n+1}$ is a word and not a movie name.
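Given the switch probability, the word and movie distributions combine into a single distribution over the joint vocabulary. The sketch below is an illustrative mixing step (function name is ours), not the authors' implementation.

```python
import numpy as np

# Sketch of the switch: the scalar d (probability the next token is a word)
# mixes the word distribution over V with the movie distribution over V'
# into one distribution over V ∪ V' (illustrative code).
def switched_distribution(v_word, v_movie, d):
    return np.concatenate([d * v_word, (1.0 - d) * v_movie])

v_word = np.array([0.7, 0.3])    # toy distribution over words V
v_movie = np.array([0.9, 0.1])   # toy distribution over movies V'
full = switched_distribution(v_word, v_movie, 0.8)
# full sums to 1: the word half carries mass d, the movie half mass 1 - d.
```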

Such a switching mechanism allows us to include an explicit recommendation system in the dialogue agent. One issue with this method is that the recommendations are conditioned on the movies mentioned in the dialogue, but not directly on the language. For example, our system would be unable to provide recommendations to someone who just asks for "a good sci-fi movie". Initial experiments conditioning the recommendation system on the dialogue hidden state led to overfitting; this could be an interesting direction for future work. Another issue is that it relies on the use of the '@' symbol to mention movies, which could be addressed by adding an entity recognition module.

5 Experiments

We propose to evaluate the recommendation and sentiment-analysis modules separately using established metrics. We believe that these individual metrics will improve when modules are more tightly coupled in the recommendation system and thus provide a proxy for overall dialogue quality. We also perform an utterance-level human evaluation to compare responses generated by different models in similar settings.

Evaluating models in a fully interactive setting, conversing with a human, is the ultimate testing environment. However, evaluating even one response utterance at a time is an open challenge (e.g., [25]). We leave such evaluation for future work.

Movie sentiment analysis performance: We use the movie dialogue forms from our data to train and evaluate our proposed RNN-based movie sentiment analysis formulation. The results obtained for the seeker's answers and the recommender's answers are highly similar, thus we present the results only for the seeker's answers. We focus on understanding whether models are able to correctly infer the seen vs. not seen, and liked vs. not liked assessments from the forms. Because of the class imbalance (i.e. 81% of movies were liked, vs. 4.9% which were disliked), we weight the loss to compensate.
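The paper does not specify the exact weighting scheme; a common choice, sketched here, is to weight each class inversely to its frequency so that rare classes (e.g. "disliked") contribute as much to the loss as frequent ones. The label set below is a toy example mimicking the stated imbalance, not data from REDIAL.

```python
import numpy as np

def balanced_class_weights(labels):
    """Inverse-frequency class weights: weight_c = N / (K * count_c),
    so that the total weighted mass is equal across the K classes."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy label set mimicking the reported imbalance: mostly "liked"
labels = np.array(["liked"] * 81 + ["disliked"] * 5 + ["neutral"] * 14)
w = balanced_class_weights(labels)
```

These per-class weights would then multiply the per-example cross-entropy terms during training.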


(a) (top row) Confusion matrices for the seen label. (bottom row) Confusion matrices for the liked label. (left column) Baseline GRU experiment. (middle) Our method with separate objectives. (right) Our method, jointly trained. We also provide Cohen's kappa coefficient for each matrix.

(b) Confusion matrix for the Cartesian product predictions of seen and liked labels using our method.

Figure 2: Confusion matrices for movie sentiment analysis on the validation set.

We compare with two simpler approaches. First, a baseline approach in which we pass the GenSen encodings of the sentences between the first and the last mention of a movie into a GRU layer. This is followed by a fully connected layer from the last hidden state. The prediction is made from the mean probability over all the sentences. Second, instead of using a single hierarchical encoder that is jointly trained to predict the three labels (suggested, seen and liked), we train the same model with only one of the three objectives (seen or liked) and demonstrate that the joint training regularizes the model. Figure 2a shows the confusion matrices for the seen and liked prediction tasks for, from left to right, the baseline model, our model trained on single objectives, and our method outlined in Section 4.2 and illustrated in the blue region of Figure 1. We also provide Cohen's kappa coefficient [26] for each model and prediction task. Cohen's kappa measures the agreement between the true label and the predictions. For each prediction task, our jointly trained model has a higher kappa coefficient than the two other baselines. The full confusion matrix for the Cartesian product of predictions is shown in Figure 2b. All results are on the validation set.
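Cohen's kappa [26] corrects the raw agreement for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the chance agreement implied by the two label distributions. A minimal sketch (with toy labels mimicking the class imbalance, not REDIAL data) shows why it is the right metric here: a majority-class predictor gets 81% accuracy but zero kappa.

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: chance-corrected agreement between two labelings."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n   # observed agreement
    ct, cp = Counter(y_true), Counter(y_pred)
    p_e = sum(ct[c] * cp[c] for c in ct) / n ** 2           # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy imbalanced labels: 81% "liked" (1), 19% otherwise (0).
y_true = [1] * 81 + [0] * 19
always_liked = [1] * 100     # majority-class predictor: 81% accurate, kappa = 0
```

This is why we report kappa alongside the confusion matrices: with 81% of movies liked, accuracy alone would flatter trivial predictors.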

Table 2: RMSE for movie recommendations. RMSE is shown for ratings on a 0–1 scale. For the MovieLens experiment, we show the RMSE on a 0.5–5 scale in parentheses.⁶

                                                       Experiments on REDIAL
Training procedure    Experiments on MovieLens    No pre-training    Pre-trained on MovieLens
Standard              0.182 ± 0.0002 (0.820)      0.35               0.29
Denoising Autorec     0.179 ± 0.0002 (0.805)      0.33               0.28

Movie recommendation quality: We use the "latest" MovieLens dataset⁷, which contains 26 million ratings across 46,000 movies, given by 270,000 users. It contains 2.6 times more ratings, but also across 4.6 times more movies, than MovieLens-10M, the dataset used in Sedhain et al. [1]. First, we evaluate the model on the MovieLens dataset. Randomly chosen user-item ratings are held out for validation and test, and only training ratings are used as inputs. Following Sedhain et al. [1], we sampled the training, validation, and test sets in an 80-10-10 proportion, and repeated this splitting procedure five times, reporting the average RMSE.

We also examine how the model performs on the ratings from our data (REDIAL), with and without pre-training on MovieLens. This experiment ignores the conversational aspect of our data and focuses only on the like/dislike ratings provided by users. We chose to consider only the ratings given by the movie seeker, and to ignore the responses where they answered "did not say either way". We end up with a set of binary ratings for each conversation. To place ourselves in the setting of a recommender that meets a new movie seeker (cold-start setting), we consider each conversation as a

⁶ Due to an error in our code, the original published version of the paper incorrectly reported some of these results. Results are now updated to the ones from our (newly released) accompanying code. The new results do not alter the study's conclusions. ⁷ https://grouplens.org/datasets/movielens/latest/, retrieved September 2017.


separate user. Randomly chosen conversations are held out for validation, and each rating, in turn, is predicted using all other ratings (from the same conversation) as inputs. We binarize the MovieLens observations (they range between 0.5 and 5) for pre-training, by choosing a threshold that gives a similar distribution of 0s and 1s as in our data. Knowing that our data has 94.3% "liked" ratings, we chose a rating threshold of 2: ratings higher or equal are considered "liked", ratings lower are considered "disliked". The binarized MovieLens dataset then has 93.7% "liked" ratings. In each experiment, for the two training procedures (standard and denoising), we perform a hyper-parameter search on the validation set.
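The threshold selection described above can be sketched as a one-dimensional search over the possible rating cutoffs, picking the one whose "liked" fraction best matches the target. This is an illustrative sketch on synthetic ratings, not the MovieLens data; the paper reports that threshold 2 was chosen against the 94.3% target.

```python
import numpy as np

def pick_binarization_threshold(ratings, target_liked_frac):
    """Choose the cutoff t (ratings >= t count as 'liked') whose liked
    fraction is closest to the target fraction."""
    best_t, best_gap = None, float("inf")
    for t in np.arange(0.5, 5.5, 0.5):         # candidate MovieLens cutoffs
        gap = abs(np.mean(ratings >= t) - target_liked_frac)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Synthetic rating sample skewed toward high ratings:
rng = np.random.default_rng(0)
ratings = rng.choice([1.0, 2.0, 3.0, 4.0, 5.0], size=1000,
                     p=[0.03, 0.04, 0.13, 0.40, 0.40])
t = pick_binarization_threshold(ratings, target_liked_frac=0.943)
```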

Table 2 shows the RMSE obtained on the test set. In the experiment on the MovieLens dataset, the denoising training procedure brings a slight improvement over the standard training procedure. After pre-training on MovieLens, the performance of the models on our data is significantly improved.
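The two scales reported in Table 2 are related by the width of the MovieLens rating range (0.5 to 5, a span of 4.5), assuming a linear rescaling to 0–1, which the paper does not state explicitly. Under that assumption, the RMSE values convert by a simple factor of 4.5, which is consistent with the table's numbers up to rounding.

```python
# Assumed linear rescaling r -> (r - 0.5) / 4.5, so RMSE scales by the span.
SPAN = 5.0 - 0.5

def rmse_to_01(rmse_raw):
    """Convert an RMSE on the 0.5-5 scale to the 0-1 scale."""
    return rmse_raw / SPAN

def rmse_to_raw(rmse_01):
    """Convert an RMSE on the 0-1 scale back to the 0.5-5 scale."""
    return rmse_01 * SPAN
```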

Figure 3: Results of human assessment of dialogue quality. The percentages are relative to the total number of ranking tasks, so that bars of the same color sum to 1.

Overall dialogue quality assessment: We run a user study to assess the overall quality of the responses of our model compared to HRED. Ten participants were each presented with ten complete real dialogues from our validation set, performing 56 ranking tasks, one for each recommender's utterance in those ten dialogues. At the point where the human recommender provided their response in the real dialogue, we show: the text generated by our HRED baseline, our model, and the true response, in a random order. The participant is asked to give the dialogue responses a rank from 1–3, with 1 being the best and 3 the worst. We allow ties, so that multiple responses could be given the same rank (e.g., rankings of the form 1, 2, 2 were possible if one response was clearly the best, but the other two were of equivalent quality). In Figure 3, we show the percentage of times that each model was given each ranking. The true response was ranked first 349 times, our model 267 times, and HRED 223 times.

6 Discussion and Conclusions

We presented REDIAL, a new, high-utility dataset of real-world, human-generated conversations around the theme of providing movie recommendations. While 10,000 conversations will likely be insufficient to train an end-to-end neural model from scratch, we believe that this shortage of data is a systematic problem in goal-oriented dialogue settings and needs to be addressed on the modeling side. We use this dataset to explore a novel modular formulation of a fully neural architecture for conversational movie recommendations. The dataset has been collected in such a way that subtasks such as sentiment analysis and movie recommendation can be explored and evaluated separately or within the context of a complete dialogue system.

We introduced a novel overall architecture for this problem domain which leverages general-purpose sentence representations and hierarchical encoder-decoder architectures, extending them with dynamically instantiated RNN models that drive an autoencoder-based recommendation engine. We find tremendous benefit from this modularization in that it allows one to pre-train the recommendation engine on other, larger data sources specialized for the recommendation task alone. Further, our proposed switching mechanism allows one to integrate recommendations within a recurrent decoder, mixing high-quality suggestions into the overall dialogue framework.

Our proposed architecture is not specific to movies and applies to other types of products, given that a conversational recommendation dataset is available in that domain. Our utterance-level evaluation compares the responses generated by different models in a given context, controlling for confounding variables to some extent. In that context, our model outperforms the HRED baseline. However, we did not yet evaluate whole conversations between our model and a human user. Future work could improve this evaluation setting by asking more precise questions of the human evaluators. Instead of asking which response is the best in a general way, we could ask, for example, which response provides the best recommendation given the context, or which is the most fluent. This would allow us to gain insight into what parts of the model could be improved.


References

[1] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. AutoRec: Autoencoders meet collaborative filtering. In International Conference on World Wide Web, pages 111–112, 2015.

[2] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In International Conference on Information and Knowledge Management, pages 553–562, 2015.

[3] Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany, August 2016. Association for Computational Linguistics.

[4] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv:1511.06931 [cs.CL], 2015.

[5] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

[6] Alessandro Suglia, Claudio Greco, Pierpaolo Basile, Giovanni Semeraro, and Annalina Caputo. An automatic procedure for generating datasets for conversational recommender systems. In Proceedings of Dynamic Search for Complex Tasks, 8th International Conference of the CLEF Association, CLEF, 2017.

[7] Ben Krause, Marco Damonte, Mihai Dobre, Daniel Duma, Joachim Fainberg, Federico Fancellu, Emmanuel Kahembwe, Jianpeng Cheng, and Bonnie Webber. Edina: Building an open domain socialbot with self-dialogues. arXiv:1709.09816 [cs.CL], 2017.

[8] Pontus Johansson. Design and Development of Recommender Dialogue Systems. PhD thesis, Institutionen för datavetenskap, 2004.

[9] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. A survey of available corpora for building data-driven dialogue systems. arXiv:1512.05742 [cs.CL], 2015.

[10] Mehmet H. Göker, Pat Langley, and Cynthia A. Thompson. A personalized system for conversational recommendations. arXiv:1107.0029 [cs.IR], 2011.

[11] Claudio Greco, Alessandro Suglia, Pierpaolo Basile, and Giovanni Semeraro. Converse-et-impera: Exploiting deep learning and hierarchical reinforcement learning for conversational recommender systems. In Conference of the Italian Association for Artificial Intelligence, pages 372–386. Springer, 2017.

[12] Dwi H. Widyantoro and Z. K. A. Baizal. A framework of conversational recommender system based on user functional requirements. In 2nd International Conference on Information and Communication Technology (ICoICT), pages 160–165, 2014.

[13] Pontus Wärnestål, Lars Degerstedt, and Arne Jönsson. Emergent conversational recommendations: A dialogue behavior approach. In SIGDIAL Workshop on Discourse and Dialogue, 2007.

[14] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. Towards conversational recommender systems. In SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 815–824, 2016.

[15] Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. arXiv:1511.04636, 2015.

[16] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv:1703.06585 [cs.CV], 2017.

[17] Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. End-to-end task-completion neural dialogue systems. arXiv:1703.01008, 2017.

[18] Yueming Sun and Yi Zhang. Conversational recommender system. arXiv:1806.03277, 2018.

[19] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Conference on Computer Vision and Pattern Recognition, volume 2, 2017.

[20] Benjamin M. Marlin, Richard S. Zemel, Sam T. Roweis, and Malcolm Slaney. Collaborative filtering and the missing at random assumption. In UAI, pages 267–275, 2007.

[21] Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR, 2018.

[22] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784, 2016.

[23] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.

[24] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.

[25] Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv:1603.08023, 2016.

[26] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.

[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

[28] Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, and Mathieu Bastian. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLoS ONE, 9(6), 2014.


A Data collection interface

(a) Seeker interface

(b) Recommender interface

Figure 4: Data collection interface.

B Training Procedure

Two parts of our model are frozen after pre-training: the first layer of our utterance encoder (which uses the GenSen representations [21]) and the sentiment analysis module (after being pre-trained to predict the movie ratings). The recommender system is pre-trained on the MovieLens dataset but will be fine-tuned with the rest of the network. Our model is then trained with teacher forcing using our dialogue dataset. The training objective is to maximize the likelihood of the next recommender utterance (which may contain words and movie entities). Thus, the switching decoder and the hierarchical encoder (except for the utterance encoder's first layer) are trained from scratch using our dataset. We retain 20% of the training conversations for the validation set. We optimize the


parameters with Adam [27] and a learning rate of 0.001. At test time, we use a beam search of size 10 to generate the response.
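The per-module training regime above can be summarized as a small configuration table. The module names below are illustrative labels of our own, not identifiers from the released code; only the frozen/fine-tuned/from-scratch split is taken from the text.

```python
# Which parts of the model receive gradient updates during dialogue training.
TRAINING_PLAN = {
    "utterance_encoder.first_layer": "frozen (pre-trained GenSen [21])",
    "sentiment_analysis": "frozen (pre-trained on REDIAL movie ratings)",
    "recommender_autoencoder": "fine-tuned (pre-trained on MovieLens)",
    "hierarchical_encoder": "trained from scratch",
    "switching_decoder": "trained from scratch",
}

def trainable_modules(plan):
    """Modules whose parameters are updated by the dialogue-level objective."""
    return sorted(m for m, status in plan.items() if not status.startswith("frozen"))
```

In a deep learning framework this split would typically be realized by excluding the frozen modules' parameters from the optimizer (or disabling their gradients) while passing the rest to Adam.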

C Conversation samples

We show in Tables 3, 4, 5, 6, and 7 some randomly picked conversations from the validation set, and the responses generated by HRED and by our model. At each utterance, we condition on the ground-truth previous turns (labeled "SEEKER" and "HUMAN" in the tables) to generate the response.

Table 3: Sample conversation from validation set.
SEEKER: hi ! i 'm looking for an action filled movie similar to jurassic park
HUMAN: ok , have you seen jurassic world ?
HRED: hi ! i love that one ! have you seen the conjuring or the conjuring 2 ?
OURS: hi what kind of movies do you like ?

SEEKER: yes i loved that movie as well . are there any good movies without dinosaurs ?
HUMAN: have you seen king kong ?
HRED: no , i have n't seen it yet . have you seen the avengers ? it 's about a couple of friends who are their own .
OURS: i have n't seen that one . i will have to check it out . have you seen jurassic park

SEEKER: yes ! that was another great giant monster movie : )
HUMAN: what about jaws & jaws 2 ? oldies but goodies scary action ? suspense gives me the rumblies
HRED: i have n't seen that one yet but i heard it was good . i have n't seen it yet . i have n't seen it yet . i have n't seen it yet .
OURS: i have n't seen that one yet . i will have to check it out . what about walking with dinosaurs ?

SEEKER: i absolutely loved the first one . one of my favorite movies ever : ) honestly i can 't say i remember much about the second one . jaws 3-d was quite funny and jaws : the revenge was just weird . i do love suspense too ...
HUMAN: i like suspense but sometimes i can barely handle it ! it gives me anxiety ! lol
HRED: no , i have n't seen it yet . i will have to check it out . thanks for the chat .
OURS: i have n't seen that one . i will have to check it out . i hope i helped .

SEEKER: that 's understandable . as long as it 's not too gory i do n't generally have a problem with those things .
HUMAN: well , it was great chatting with you ! have a great one !
HRED: i 'm glad i was able to help you today
OURS: have you seen immortals ? it 's a war movie but it 's pretty good

SEEKER: you too ! thanks !


Table 4: Sample conversation from validation set.
HUMAN: hello .
HRED: hi what kind of movies do you like
OURS: hi , how can i help you ?

SEEKER: hi
HUMAN: what type of movie can i suggest to you ?
HRED: what kind of movies do you like ? i 'm good
OURS: what kind of movies do you like ? what kind of movies do you like ?

SEEKER: i am looking for a slap stick comedy from the 80 's or 90 's
HUMAN: not familiar with this , please give me a minute .
HRED: have you seen the conjuring or the conjuring 2 ? it was really good and very funny
OURS: what kind of movies do you like ? i 'm good , thanks for asking ! have a great day !

SEEKER: okay movies like airplane ! and airplane ii : the sequel
HUMAN: oh , okay . i saw those and thought they were great . did you see caddyshack or dumb and dumber ?
HRED: i have n't seen that one yet but i heard it was good . have you seen the last house on the left ? it 's pretty good .
OURS: have you seen coming to america ? or the naked gun hot shots !

SEEKER: yes , those were pretty good
HUMAN: how about spaceballs , i thought this was pretty funny .
HRED: have you seen foxcatcher ? it 's a comedy about a man on a true story
OURS: have you seen dumb and dumber to ? it 's a classic !

SEEKER: haha that one was great
HUMAN: okay , let ' me see if i can offer one more .
HRED: have you seen foxcatcher ? it 's a comedy about a man that turns out on a true story .
OURS: i have n't seen that one yet . i 'll have to look it up .

SEEKER: awesome
HUMAN: robin hood : men in <unk> ? this one had me rolling .
HRED: have you seen foxcatcher ? it 's about a man who has a rich guy who is a <unk> rich guy
OURS: have you seen monty python 's life of brian ? that 's a good one !

SEEKER: haha i forgot about that one ! i 'll be watching that one tonight
HUMAN: great .
HRED: it stars nanjiani , zoe kazan , holly hunter , ray romano , adeel akhtar , and anupam <unk> no , i have n't seen it yet .
OURS: have you seen caddyshack ii ? it 's a good one .

SEEKER: thanks for all of your help i hope you have a great day
HUMAN: welcome , glad to help . have a nice day . good bye
HRED: you 're welcome ! have a great day !
OURS: you 're welcome ! have a great day !


Table 5: Sample conversation from validation set.
SEEKER: hi !
HUMAN: hi ! how is your morning going ?
HRED: hi , what kind of movies do you like
OURS: hello ! what kind of movie are you looking for ?

SEEKER: great , thanks ! i was thinking of starting the day off with a great comedy , any suggestions ?
HUMAN: i love the hangover it always makes me laugh . and i bet lots of people are hungover this morning ! airplane ! is another classic of mine .
HRED: that 's a good one ! have you seen the lego movie ? it 's a great movie !
OURS: i have n't seen that one ! what kind of movies do you like ? i can help you with that .

SEEKER: i 've seen the hangover it 's really good ! i thought about watching knocked up have you seen it ?
HUMAN: i enjoyed that movie . i enjoyed spy it 's a well done parody .
HRED: it 's about a man who gets revenge on a cyborg and a hockey team . it 's really good . it 's really good . i also like the girl on the train
OURS: i have n't seen that one . i will have to check it out . i hope i was able to help you find a good one ! have a great day !

SEEKER: thanks ! i 'll check it out !
HUMAN: sure ! have a great day !
HRED: you 're welcome ! have a great day !
OURS: have you seen revenge of the nerds ? it 's really good . it 's a classic .

SEEKER: you too .

Figure 5: 2D embedding of movies in our conversation database. The edge weight in the similarity matrix is proportional to the number of co-occurrences in the same dialogue. Left: all movies, colored by number of occurrences from light blue (low) to red (high). Right: names of movies with highest number of occurrences. Embedding via Jacomy et al. [28].


Table 6: Sample conversation from validation set.
HUMAN: good day !
HRED: hi what kind of movies do you like
OURS: hi , how can i help you ?

SEEKER: hey i 'm really into romance movies any good suggestions ? thanks !
HUMAN: hmm . let me think for a sec . have you seen the big sick ? it 's romantic and funny
HRED: what kind of movies do you like to watch ? have you seen the conjuring 2 ?
OURS: what kind of movies do you like ? i 'm good , and you ?

SEEKER: i like movies like beauty and the beast and a walk to remember oh and lol i have never seen the the big sick or never been kissed and romeo and juliet
HUMAN: i highly recommend ! great story ! have you seen the notebook ? that 's a big one <unk> romance fans
HRED: i have n't seen that one yet but i 've heard good things about it !
OURS: i have n't seen that one yet but i have heard good things about the longest ride and the fault in our stars

SEEKER: i love the the notebook a really great film .
HUMAN: and if you liked romeo and juliet then i have to reccommend titanic for leo . also a recent sweet romantic comedy was the duff
HRED: i have n't seen that one , but i 've heard good things about it !
OURS: i have n't seen that one yet but i 've heard good things about it have you seen message in a bottle ?

SEEKER: ahh the great titanic what a classic thanks for recommending that one . i will definitely have to check out the the duff thanks so much for your help !
HUMAN: no problem ! i have one more to suggest . phantom thread ! it 's new and i did n't like it as much but a lot of others do ! it <unk> romance in a unique way no problem at all ! i hope you enjoy the films you decide to watch and enjoy the rest of your day !
HRED: you 're welcome ! have a great day !
OURS: you 're welcome ! have a great night !

SEEKER: thanks i will defiantly check out the phantom thread i really appreciate it and you have a great day !


Table 7: Sample conversation from validation set.
SEEKER: hello
HUMAN: hello
HRED: hi , how can i help you ?
OURS: hello , what kind of movie are you looking for ?

SEEKER: how are you this evening ?
HUMAN: what type of movies do you like ? good , thanks
HRED: what kind of movies do you like ? i 'm good
OURS: can i help you find a movie ?

SEEKER: i am looking for a musical . what do you suggest ?
HUMAN: have you seen enchanted ? maybe you 'd like la la land .
HRED: have you seen get out yet ? it 's pretty good
OURS: i have n't seen that one yet but have heard good things about it have you seen prometheus ?

SEEKER: no i have n't seen that one . i have been meaning to watch it though . oh that 's a good suggestion . what about something older ?
HUMAN: the rocky horror picture show is a great one .
HRED: i have n't seen that one yet but heard good things about ender 's game i have n't seen it yet but i have n't seen it yet but i 've heard of it .
OURS: i 'm trying to think of something along the same way . sweeney todd : the demon barber of fleet street is another good one

SEEKER: i 've seen that before when i was younger . i did n't know whether to laugh or be scared . lol
HUMAN: fiddler on the roof maybe ? sorry , i have n't seen many musicals
HRED: i have n't seen that one . i will have to check it out . thanks for chatting !
OURS: i have n't seen grease but i have heard good things about it .

SEEKER: yes ! that 's great ! thank you for the suggestions ! i try to watch them soon , especially enchanted ! hope you have a good night !
HUMAN: thanks you too
HRED: you 're welcome ! have a great night !
OURS: no problem ! have a good day !
