
Learning to Understand Phrases by Embedding the Dictionary

Felix Hill
Computer Laboratory, University of Cambridge

Kyunghyun Cho∗
Courant Institute of Mathematical Sciences and Centre for Data Science, New York University

Anna Korhonen
Department of Theoretical and Applied Linguistics, University of Cambridge

Yoshua Bengio
CIFAR Senior Fellow, Université de Montréal

Abstract

Distributional models that learn rich semantic word representations are a success story of recent NLP research. However, developing models that learn useful representations of phrases and sentences has proved far harder. We propose using the definitions found in everyday dictionaries as a means of bridging this gap between lexical and phrasal semantics. Neural language embedding models can be effectively trained to map dictionary definitions (phrases) to (lexical) representations of the words defined by those definitions. We present two applications of these architectures: reverse dictionaries that return the name of a concept given a definition or description and general-knowledge crossword question answerers. On both tasks, neural language embedding models trained on definitions from a handful of freely-available lexical resources perform as well or better than existing commercial systems that rely on significant task-specific engineering. The results highlight the effectiveness of both neural embedding architectures and definition-based training for developing models that understand phrases and sentences.

1 Introduction

Much recent research in computational semantics has focussed on learning representations of arbitrary-length phrases and sentences. This task is challenging partly because there is no obvious gold standard of phrasal representation that could be used

∗ Work mainly done at the University of Montreal.

in training and evaluation. Consequently, it is difficult to design approaches that could learn from such a gold standard, and also hard to evaluate or compare different models.

In this work, we use dictionary definitions to address this issue. The composed meaning of the words in a dictionary definition (a tall, long-necked, spotted ruminant of Africa) should correspond to the meaning of the word they define (giraffe). This bridge between lexical and phrasal semantics is useful because high quality vector representations of single words can be used as a target when learning to combine the words into a coherent phrasal representation.

This approach still requires a model capable of learning to map between arbitrary-length phrases and fixed-length continuous-valued word vectors. For this purpose we experiment with two broad classes of neural language models (NLMs): Recurrent Neural Networks (RNNs), which naturally encode the order of input words, and simpler (feed-forward) bag-of-words (BOW) embedding models. Prior to training these NLMs, we learn target lexical representations by training the Word2Vec software (Mikolov et al., 2013) on billions of words of raw text.

We demonstrate the usefulness of our approach by building and releasing two applications. The first is a reverse dictionary or concept finder: a system that returns words based on user descriptions or definitions (Zock and Bilac, 2004). Reverse dictionaries are used by copywriters, novelists, translators and other professional writers to find words for notions or ideas that might be on the tip of their tongue.


Transactions of the Association for Computational Linguistics, vol. 4, pp. 17–30, 2016. Action Editor: Chris Callison-Burch. Submission batch: 9/2015; revised 12/2015; revised 1/2016; published 2/2016.

©2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


For instance, a travel-writer might look to enhance her prose by searching for examples of a country that people associate with warm weather or an activity that is mentally or physically demanding. We show that an NLM-based reverse dictionary trained on only a handful of dictionaries identifies novel definitions and concept descriptions comparably or better than commercial systems, which rely on significant task-specific engineering and access to much more dictionary data. Moreover, by exploiting models that learn bilingual word representations (Vulic et al., 2011; Klementiev et al., 2012; Hermann and Blunsom, 2013; Gouws et al., 2014), we show that the NLM approach can be easily extended to produce a potentially useful cross-lingual reverse dictionary.

The second application of our models is as a general-knowledge crossword question answerer. When trained on both dictionary definitions and the opening sentences of Wikipedia articles, NLMs produce plausible answers to (non-cryptic) crossword clues, even those that apparently require detailed world knowledge. Both BOW and RNN models can outperform bespoke commercial crossword solvers, particularly when clues contain a greater number of words. Qualitative analysis reveals that NLMs can learn to relate concepts that are not directly connected in the training data and can thus generalise well to unseen input. To facilitate further research, all of our code, training and evaluation sets (together with a system demo) are published online with this paper.1

2 Neural Language Model Architectures

The first model we apply to the dictionary-based learning task is a recurrent neural network (RNN). RNNs operate on variable-length sequences of inputs; in our case, natural language definitions, descriptions or sentences. RNNs (with LSTMs) have achieved state-of-the-art performance in language modelling (Mikolov et al., 2010) and image caption generation (Kiros et al., 2015), and approach state-of-the-art performance in machine translation (Bahdanau et al., 2015).

During training, the input to the RNN is a dictionary definition or sentence from an encyclopedia.

1 https://www.cl.cam.ac.uk/~fh295/

The objective of the model is to map these defining phrases or sentences to an embedding of the word that the definition defines. The target word embeddings are learned independently of the RNN weights, using the Word2Vec software (Mikolov et al., 2013).

The set of all words in the training data constitutes the vocabulary of the RNN. For each word in this vocabulary we randomly initialise a real-valued vector (input embedding) of model parameters. The RNN 'reads' the first word in the input by applying a non-linear projection of its embedding v1, parameterised by input weight matrix W and b, a vector of biases.

A1 = φ(Wv1 + b)

yielding the first internal activation state A1. In our implementation, we use φ(x) = tanh(x), though in theory φ can be any differentiable non-linear function. Subsequent internal activations (after time-step t) are computed by projecting the embedding of the tth word and using this information to 'update' the internal activation state.

At = φ(UAt−1 +Wvt + b).

As such, the values of the final internal activation state units AN are a weighted function of all input word embeddings, and constitute a 'summary' of the information in the sentence.
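To make the read operation concrete, the following minimal Python/NumPy sketch implements the update above on a toy vocabulary. The dimensions, the random initialisation and the example definition are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab = {"a": 0, "tall": 1, "long-necked": 2, "spotted": 3,
         "ruminant": 4, "of": 5, "africa": 6}
emb_dim, hid_dim = 8, 16

V = rng.normal(scale=0.1, size=(len(vocab), emb_dim))   # input embeddings v_i
W = rng.normal(scale=0.1, size=(hid_dim, emb_dim))      # input weight matrix W
U = rng.normal(scale=0.1, size=(hid_dim, hid_dim))      # update weights U
b = np.zeros(hid_dim)                                   # bias vector b

def read_definition(words):
    """Return the final activation A_N after reading the word sequence."""
    A = np.zeros(hid_dim)
    for t, w in enumerate(words):
        v_t = V[vocab[w]]
        if t == 0:
            A = np.tanh(W @ v_t + b)            # A_1 = phi(W v_1 + b)
        else:
            A = np.tanh(U @ A + W @ v_t + b)    # A_t = phi(U A_{t-1} + W v_t + b)
    return A

summary = read_definition("a tall long-necked spotted ruminant of africa".split())
print(summary.shape)   # (16,)
```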

2.1 Long Short Term Memory

A known limitation when training RNNs to read language using gradient descent is that the error signal (gradient) on the training examples either vanishes or explodes as the number of time steps (sentence length) increases (Bengio et al., 1994). Consequently, after reading longer sentences the final internal activation AN typically retains useful information about the most recently read (sentence-final) words, but can neglect important information near the start of the input sentence. LSTMs (Hochreiter and Schmidhuber, 1997) were designed to mitigate this long-term dependency problem.

At each time step t, in place of the single internal layer of units A, the LSTM RNN computes six internal layers iw, gi, gf, go, h and m. The first, iw, represents the core information passed to the LSTM


unit by the latest input word at t. It is computed as a simple linear projection of the input embedding vt (by input weights Ww) and the output state of the LSTM at the previous time step ht−1 (by update weights Uw):

iwt = Wwvt + Uwht−1 + bw

The layers gi, gf and go are computed as weighted sigmoid functions of the input embeddings, again parameterised by layer-specific weight matrices W and U:

gst = 1 / (1 + exp(−(Wsvt + Usht−1 + bs)))

where s stands for one of i, f or o. These vectors take values on [0, 1] and are often referred to as gating activations. Finally, the internal memory state, mt, and new output state, ht, of the LSTM at t are computed as

mt = iwt ⊙ git + mt−1 ⊙ gft

ht = got ⊙ φ(mt),

where ⊙ indicates elementwise vector multiplication and φ is, as before, some non-linear function (we use tanh). Thus, gi determines to what extent the new input word is considered at each time step, gf determines to what extent the existing state of the internal memory is retained or forgotten in computing the new internal memory, and go determines how much this memory is considered when computing the output state at t.

The sentence-final memory state of the LSTM, mN, a 'summary' of all the information in the sentence, is then projected via an extra non-linear projection (parameterised by a further weight matrix) to a target embedding space. This layer enables the target (defined) word embedding space to take a different dimension to the activation layers of the RNN, and in principle enables a more complex definition-reading function to be learned.
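A compact NumPy sketch of this reading step is given below. The parameter shapes, the tanh choice for the final projection, and the toy input are assumptions for illustration (the paper's Theano implementation and layer sizes differ), but the gate equations follow the formulas above.

```python
import numpy as np

rng = np.random.RandomState(1)
emb_dim, hid_dim, target_dim = 8, 12, 10   # illustrative sizes, not the paper's

def layer_params(out_dim, in_dim):
    # one (W, U, b) triple per internal layer
    return (rng.normal(scale=0.1, size=(out_dim, in_dim)),
            rng.normal(scale=0.1, size=(out_dim, out_dim)),
            np.zeros(out_dim))

params = {name: layer_params(hid_dim, emb_dim) for name in ("w", "i", "f", "o")}
W_out = rng.normal(scale=0.1, size=(target_dim, hid_dim))   # extra projection to target space

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_read(embeddings):
    """Read a sequence of word embeddings and return the projected summary."""
    h = np.zeros(hid_dim)    # output state h_t
    m = np.zeros(hid_dim)    # internal memory state m_t
    for v in embeddings:
        Ww, Uw, bw = params["w"]
        iw = Ww @ v + Uw @ h + bw                      # i^w_t: core input information
        g = {s: sigmoid(Ws @ v + Us @ h + bs)          # gating activations in [0, 1]
             for s, (Ws, Us, bs) in params.items() if s != "w"}
        m = iw * g["i"] + m * g["f"]                   # m_t = i^w_t . g^i_t + m_{t-1} . g^f_t
        h = g["o"] * np.tanh(m)                        # h_t = g^o_t . phi(m_t)
    return np.tanh(W_out @ m)                          # non-linear projection of m_N

definition = [rng.normal(size=emb_dim) for _ in range(6)]  # stand-in for embedded words
print(lstm_read(definition).shape)   # (10,)
```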

2.2 Bag-of-Words NLMs

We implement a simpler linear bag-of-words (BOW) architecture for encoding the definition phrases. As with the RNN, this architecture learns an embedding vi for each word in the model vocabulary, together with a single matrix of input projection weights W.

The BOW model simply maps an input definition with word embeddings v1 . . . vn to the sum of the projected embeddings, Wv1 + . . . + Wvn. This model can also be considered a special case of an RNN in which the update function U and nonlinearity φ are both the identity, so that 'reading' the next word in the input phrase updates the current representation more simply:

At = At−1 + Wvt.
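As a quick illustration, a minimal sketch of this encoder (with assumed toy dimensions) makes the order-invariance explicit:

```python
import numpy as np

rng = np.random.RandomState(2)
emb_dim, target_dim = 8, 10                     # toy sizes
W = rng.normal(scale=0.1, size=(target_dim, emb_dim))

def bow_encode(word_embeddings):
    """sum_i W v_i: the order of the words does not matter."""
    return sum(W @ v for v in word_embeddings)

vs = [rng.normal(size=emb_dim) for _ in range(5)]
assert np.allclose(bow_encode(vs), bow_encode(list(reversed(vs))))   # permutation-invariant
```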

2.3 Pre-trained Input Representations

We experiment with variants of these models in which the input definition embeddings are pre-learned and fixed (rather than randomly-initialised and updated) during training. There are several potential advantages to taking this approach. First, the word embeddings are trained on massive corpora and may therefore introduce additional linguistic or conceptual knowledge to the models. Second, at test time, the models will have a larger effective vocabulary, since the pre-trained word embeddings typically span a larger vocabulary than the union of all dictionary definitions used to train the model. Finally, the models will then map to and from the same space of embeddings (the embedding space will be closed under the operation of the model), so conceivably could be more easily applied as a general-purpose 'composition engine'.

2.4 Training Objective

We train all neural language models M to map the input definition phrase sc defining word c to a location close to the pre-trained embedding vc of c. We experiment with two different cost functions for the word-phrase pair (c, sc) from the training data. The first is simply the cosine distance between M(sc) and vc. The second is the rank loss

max(0, m − cos(M(sc), vc) + cos(M(sc), vr))

where vr is the embedding of a randomly-selected word from the vocabulary other than c. This loss function was used for language models, for example, in (Huang et al., 2012). In all experiments we apply a margin m = 0.1, which has been shown to work well on word-retrieval tasks (Bordes et al., 2015).
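In code, the two costs look roughly as follows; here `pred` stands for the model output M(sc), `target` for vc and `neg` for the embedding vr of a sampled confounder (all names and the toy vectors are illustrative).

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_loss(pred, target):
    """Cosine distance between M(s_c) and v_c."""
    return 1.0 - cos(pred, target)

def rank_loss(pred, target, neg, m=0.1):
    """Hinge loss: the defined word should beat a random confounder by margin m."""
    return max(0.0, m - cos(pred, target) + cos(pred, neg))

rng = np.random.RandomState(3)
pred, target, neg = rng.normal(size=(3, 500))
print(cosine_loss(pred, target), rank_loss(pred, target, neg))
```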


2.5 Implementation Details

Since training on the dictionary data took 6–10 hours, we did not conduct a hyper-parameter search on any validation sets over the space of possible model configurations such as embedding dimension, or size of hidden layers. Instead, we chose these parameters to be as standard as possible based on previous research. For fair comparison, any aspects of model design that are not specific to a particular class of model were kept constant across experiments.

The pre-trained word embeddings used in all of our models (either as input or target) were learned by a continuous bag-of-words (CBOW) model using the Word2Vec software on approximately 8 billion words of running text.2 When training such models on massive corpora, large embedding lengths of up to 700 have been shown to yield the best performance (see e.g. (Faruqui et al., 2014)). The pre-trained embeddings used in our models were of length 500, as a compromise between quality and memory constraints.

In cases where the word embeddings are learned during training on the dictionary objective, we make these embeddings shorter (256), since they must be learned from much less language data. In the RNN models, at each time step each of the four LSTM RNN internal layers (gating and activation states) had length 512, another standard choice (see e.g. (Cho et al., 2014)). The final hidden state was mapped linearly to length 500, the dimension of the target embedding. In the BOW models, the projection matrix projects input embeddings (either learned, of length 256, or pre-trained, of length 500) to length 500 for summing.

All models were implemented with Theano (Bergstra et al., 2010) and trained with minibatch SGD on GPUs. The batch size was fixed at 16 and the learning rate was controlled by adadelta (Zeiler, 2012).
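For readers unfamiliar with adadelta, the sketch below shows one parameter update as described by Zeiler (2012), applied to a toy quadratic objective; it is a schematic stand-in for illustration, not the paper's Theano training code.

```python
import numpy as np

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    """One adadelta step (Zeiler, 2012) for a single parameter array."""
    state["g2"] = rho * state["g2"] + (1 - rho) * grad ** 2
    step = -np.sqrt(state["dx2"] + eps) / np.sqrt(state["g2"] + eps) * grad
    state["dx2"] = rho * state["dx2"] + (1 - rho) * step ** 2
    return param + step

# toy demonstration: minimise w^2 starting from w = 5
w = np.array([5.0])
state = {"g2": np.zeros_like(w), "dx2": np.zeros_like(w)}
for _ in range(1000):
    w = adadelta_update(w, 2 * w, state)   # gradient of w^2 is 2w
print(w)   # gradually approaches 0
```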

2 The Word2Vec embedding models are well known; further details can be found at https://code.google.com/p/word2vec/. The training data for this pre-training was compiled from various online text sources using the script demo-train-big-model-v1.sh from the same page.

3 Reverse Dictionaries

The most immediate application of our trained models is as a reverse dictionary or concept finder. It is simple to look up a definition in a dictionary given a word, but professional writers often also require suitable words for a given idea, concept or definition.3 Reverse dictionaries satisfy this need by returning candidate words given a phrase, description or definition. For instance, when queried with the phrase an activity that requires strength and determination, the OneLook.com reverse dictionary returns the concepts exercise and work. Our trained RNN model can perform a similar function, simply by mapping a phrase to a point in the target (Word2Vec) embedding space, and returning the words corresponding to the embeddings that are closest to that point.
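Retrieval therefore reduces to nearest-neighbour search by cosine similarity in the target space. The sketch below assumes a toy vocabulary, random stand-in target vectors and a placeholder `encode` function in place of the trained RNN/BOW model.

```python
import numpy as np

rng = np.random.RandomState(4)
vocab = ["exercise", "work", "giraffe", "glide"]            # toy target vocabulary
targets = rng.normal(size=(len(vocab), 500))                # stand-ins for Word2Vec vectors
targets /= np.linalg.norm(targets, axis=1, keepdims=True)

def encode(phrase):
    # stand-in for the trained model M(s_c); a real system runs the RNN/BOW here
    return rng.normal(size=500)

def reverse_dictionary(phrase, k=3):
    q = encode(phrase)
    sims = targets @ (q / np.linalg.norm(q))                # cosine similarities
    return [vocab[i] for i in np.argsort(-sims)[:k]]

print(reverse_dictionary("an activity that requires strength and determination"))
```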

Several other academic studies have proposed reverse dictionary models. These generally rely on common techniques from information retrieval, comparing definitions in their internal database to the input query, and returning the word whose definition is 'closest' to that query (Bilac et al., 2003; Bilac et al., 2004; Zock and Bilac, 2004). Proximity is quantified differently in each case, but is generally a function of hand-engineered features of the two sentences. For instance, Shaw et al. (2013) propose a method in which the candidates for a given input query are all words in the model's database whose definitions contain one or more words from the query. This candidate list is then ranked according to a query-definition similarity metric based on the hypernym and hyponym relations in WordNet, features commonly used in IR such as tf-idf, and a parser.

There are, in addition, at least two commercial online reverse dictionary applications, whose architecture is proprietary knowledge. The first is the Dictionary.com reverse dictionary,4 which retrieves candidate words from the Dictionary.com dictionary based on user definitions or descriptions. The second is OneLook.com, whose algorithm searches 1061 indexed dictionaries, including

3 See the testimony from professional writers at http://www.onelook.com/?c=awards

4 Available at http://dictionary.reference.com/reverse/


all major freely-available online dictionaries and resources such as Wikipedia and WordNet.

3.1 Data Collection and Training

To compile a bank of dictionary definitions for training the model, we started with all words in the target embedding space. For each of these words, we extracted dictionary-style definitions from five electronic resources: WordNet, The American Heritage Dictionary, The Collaborative International Dictionary of English, Wiktionary and Webster's. We chose these five dictionaries because they are freely-available via the WordNik API,5 but in theory any dictionary could be chosen. Most words in our training data had multiple definitions. For each word w with definitions {d1 . . . dn} we included all pairs (w, d1) . . . (w, dn) as training examples.

To allow models access to more factual knowledge than might be present in a dictionary (for instance, information about specific entities, places or people), we supplemented this training data with information extracted from Simple Wikipedia.6 For every word in the model's target embedding space that is also the title of a Wikipedia article, we treat the sentences in the first paragraph of the article as if they were (independent) definitions of that word. When a word in Wikipedia also occurs in one (or more) of the five training dictionaries, we simply add these pseudo-definitions to the training set of definitions for the word. Combining Wikipedia and dictionaries in this way resulted in ≈ 900,000 word-'definition' pairs of ≈ 100,000 unique words.

To explore the effect of the quantity of training data on the performance of the models, we also trained models on subsets of this data. The first subset comprised only definitions from WordNet (approximately 150,000 definitions of 75,000 words). The second subset comprised only words in WordNet and their first definitions (approximately 75,000 word-definition pairs).7 For all variants of RNN and BOW models, however, reducing the training data in this way resulted in a clear reduction in performance on all tasks.

5 See http://developer.wordnik.com

6 https://simple.wikipedia.org/wiki/Main_Page

7 As with other dictionaries, the first definition in WordNet generally corresponds to the most typical or common sense of a word.

For brevity, we therefore do not present these results in what follows.

3.2 Comparisons

As a baseline, we also implemented two entirely unsupervised methods using the neural (Word2Vec) word embeddings from the target word space. In the first (W2V add), we compose the embeddings for each word in the input query by pointwise addition, and return as candidates the nearest word embeddings to the resulting composed vector.8 The second baseline (W2V mult) is identical except that the embeddings are composed by elementwise multiplication. Both methods are established ways of building phrase representations from word embeddings (Mitchell and Lapata, 2010).
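A minimal sketch of these two baselines, with made-up toy vectors standing in for the real Word2Vec embeddings:

```python
import numpy as np

rng = np.random.RandomState(5)
query = "an activity that requires strength".split()
w2v = {w: rng.normal(size=300) for w in query}       # toy stand-ins for Word2Vec vectors

def compose_add(words):
    return np.sum([w2v[w] for w in words], axis=0)   # W2V add

def compose_mult(words):
    return np.prod([w2v[w] for w in words], axis=0)  # W2V mult

print(compose_add(query)[:3], compose_mult(query)[:3])
```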

None of the models or evaluations from previous academic research on reverse dictionaries is publicly available, so direct comparison is not possible. However, we do compare performance with the commercial systems. The Dictionary.com system returned no candidates for over 96% of our input definitions. We therefore conduct detailed comparison with OneLook.com, which is the first reverse dictionary tool returned by a Google search and seems to be the most popular among writers.

3.3 Reverse Dictionary Evaluation

To our knowledge there are no established means of measuring reverse dictionary performance. In the only previous academic research on English reverse dictionaries that we are aware of, evaluation was conducted on 300 word-definition pairs written by lexicographers (Shaw et al., 2013). Since these are not publicly available, we developed new evaluation sets and make them freely available for future evaluations.

The evaluation items are of three types, designed to test different properties of the models. To create the seen evaluation, we randomly selected 500 words from the WordNet training data (seen by all models), and then randomly selected a definition for each word. Testing models on the resulting 500 word-definition pairs assesses their ability to recall or decode previously encoded information. For the

8 Since we retrieve all answers from embedding spaces by cosine similarity, addition of word embeddings is equivalent to taking the mean.


Each cell reports median rank, accuracy@10/100, and rank variance.

Model | Seen dictionary definitions (500 WN defs) | Unseen dictionary definitions (500 WN defs) | Concept descriptions (200)
W2V add (unsup.) | - | 923 .04/.16 163 | 339 .07/.30 150
W2V mult (unsup.) | - | 1000 .00/.00 10* | 1000 .00/.00 27*
OneLook | 0 .89/.91 67 | - | 18.5 .38/.58 153
RNN cosine (NLM) | 12 .48/.73 103 | 22 .41/.70 116 | 69 .28/.54 157
RNN w2v cosine (NLM) | 19 .44/.70 111 | 19 .44/.69 126 | 26 .38/.66 111
RNN ranking (NLM) | 18 .45/.67 128 | 24 .43/.69 103 | 25 .34/.66 102
RNN w2v ranking (NLM) | 54 .32/.56 155 | 33 .36/.65 137 | 30 .33/.69 77
BOW cosine (NLM) | 22 .44/.65 129 | 19 .43/.69 103 | 50 .34/.60 99
BOW w2v cosine (NLM) | 15 .46/.71 124 | 14 .46/.71 104 | 28 .36/.66 99
BOW ranking (NLM) | 17 .45/.68 115 | 22 .42/.70 95 | 32 .35/.69 101
BOW w2v ranking (NLM) | 55 .32/.56 155 | 36 .35/.66 138 | 38 .33/.72 85

Table 1: Performance of different reverse dictionary models in different evaluation settings (median rank, accuracy@10/100, rank variance). *Low variance in mult models is due to consistently poor scores, so not highlighted.

unseen evaluation, we randomly selected 500 words from WordNet and excluded all definitions of these words from the training data of all models.

Finally, for a fair comparison with OneLook, which has both the seen and unseen pairs in its internal database, we built a new dataset of concept descriptions that do not appear in the training data for any model. To do so, we randomly selected 200 adjectives, nouns or verbs from among the top 3000 most frequent tokens in the British National Corpus (Leech et al., 1994) (but outside the top 100). We then asked ten native English speakers to write a single-sentence 'description' of these words. To ensure the resulting descriptions were good quality, for each description we asked two participants who did not produce that description to list any words that fitted the description (up to a maximum of three). If the target word was not produced by one of the two checkers, the original participant was asked to re-write the description until the validation was passed.9 These concept descriptions, together with other evaluation sets, can be downloaded from our website for future comparisons.

Given a test description, definition, or question, all models produce a ranking of possible word answers based on the proximity of their representations of the input phrase and all possible output words. To quantify the quality of a given ranking, we report three statistics:

9 Re-writing was required in 6 of the 200 cases.

Test set | Word | Description
Dictionary definition | valve | "control consisting of a mechanical device for controlling fluid flow"
Concept description | prefer | "when you like one thing more than another thing"

Table 2: Style difference between dictionary definitions and concept descriptions in the evaluation.

the median rank of the correct answer (over the whole test set, lower better), the proportion of test cases in which the correct answer appears in the top 10/100 in this ranking (accuracy@10/100 - higher better) and the variance of the rank of the correct answer across the test set (rank variance - lower better).
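These three statistics are straightforward to compute from the rank of the correct answer on each test item; a small helper (assuming ranks are counted from 1 for the top candidate) might look like:

```python
import numpy as np

def evaluate(ranks, ks=(10, 100)):
    """Summarise the rank of the correct answer on each test item (1 = top candidate)."""
    ranks = np.asarray(ranks)
    accuracy = {k: float(np.mean(ranks <= k)) for k in ks}
    return float(np.median(ranks)), accuracy, float(np.var(ranks))

print(evaluate([1, 3, 250, 12, 7]))   # median rank, accuracy@10/100, rank variance
```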

3.4 Results

Table 1 shows the performance of the different models in the three evaluation settings. Of the unsupervised composition models, elementwise addition is clearly more effective than multiplication, which almost never returns the correct word as the nearest neighbour of the composition. Overall, however, the supervised models (RNN, BOW and OneLook) clearly outperform these baselines.

The results indicate interesting differences between the NLMs and the OneLook dictionary search engine. The Seen (WN first) definitions in Table 1 occur in both the training data for the NLMs and the lookup data for the OneLook model. Clearly the OneLook algorithm is better than NLMs at


retrieving already available information (returning 89% of correct words among the top-ten candidates on this set). However, this is likely to come at the cost of a greater memory footprint, since the model requires access to its database of dictionaries at query time.10

The performance of the NLM embedding models on the (unseen) concept descriptions task shows that these models can generalise well to novel, unseen queries. While the median rank for OneLook on this evaluation is lower, the NLMs retrieve the correct answer in the top ten candidates approximately as frequently, within the top 100 candidates more frequently and with lower variance in ranking over the test set. Thus, NLMs seem to generalise more 'consistently' than OneLook on this dataset, in that they generally assign a reasonably high ranking to the correct word. In contrast, as can also be verified by querying our web demo, OneLook tends to perform either very well or poorly on a given query.11

When comparing between NLMs, perhaps the most striking observation is that the RNN models do not significantly outperform the BOW models, even though the BOW model output is invariant to changes in the order of words in the definition. Users of the online demo can verify that the BOW models recover concepts from descriptions strikingly well, even when the words in the description are permuted. This observation underlines the importance of lexical semantics in the interpretation of language by NLMs, and is consistent with some other recent work on embedding sentences (Iyyer et al., 2015).

It is difficult to observe clear trends in the differences between NLMs that learn input word embeddings and those with pre-trained (Word2Vec) input embeddings. Both types of input yield good performance in some situations and weaker performance in others. In general, pre-training input embeddings seems to help most on the concept descriptions, which are furthest from the training data in terms of linguistic style. This is perhaps unsurprising, since models that learn input embeddings from the dictionary data acquire all of their

10 The trained neural language models are approximately half the size of the six training dictionaries stored as plain text, so would be hundreds of times smaller than the OneLook database of 1061 dictionaries if stored this way.

11 We also observed that the mean ranking for NLMs was lower than for OneLook on the concept descriptions task.

conceptual knowledge from this data (and thus may overfit to this setting), whereas models with pre-trained embeddings have some semantic memory acquired from general running-text language data and other knowledge acquired from the dictionaries.

3.5 Qualitative Analysis

Some example output from the various models is presented in Table 3. The differences illustrated here are also evident from querying the web demo. The first example shows how the NLMs (BOW and RNN) generalise beyond their training data. Four of the top five responses could be classed as appropriate in that they refer to inhabitants of cold countries. However, inspecting the WordNik training data, there is no mention of cold or anything to do with climate in the definitions of Eskimo, Scandinavian, Scandinavia etc. Therefore, the embedding models must have learned that coldness is a characteristic of Scandinavia, Siberia, Russia, relates to Eskimos etc. via connections with other concepts that are described or defined as cold. In contrast, the candidates produced by the OneLook and (unsupervised) W2V baseline models have nothing to do with coldness.

The second example demonstrates how the NLMs generally return candidates whose linguistic or conceptual function is appropriate to the query. For a query referring explicitly to a means, method or process, the RNN and BOW models produce verbs in different forms or an appropriate deverbal noun. In contrast, OneLook returns words of all types (aerodynamics, draught) that are arbitrarily related to the words in the query. A similar effect is apparent in the third example. While the candidates produced by the OneLook model are the correct part of speech (Noun), and related to the query topic, they are not semantically appropriate. The dictionary embedding models are the only ones that return a list of plausible habits, the class of noun requested by the input.

3.6 Cross-Lingual Reverse Dictionaries

We now show how the RNN architecture can be easily modified to create a bilingual reverse dictionary: a system that returns candidate words in one language given a description or definition in another. A bilingual reverse dictionary could have clear applications for translators or transcribers.


Input Description | OneLook | W2V add | RNN | BOW
"a native of a cold country" | 1:country 2:citizen 3:foreign 4:naturalize 5:cisco | 1:a 2:the 3:another 4:of 5:whole | 1:eskimo 2:scandinavian 3:arctic 4:indian 5:siberian | 1:frigid 2:cold 3:icy 4:russian 5:indian
"a way of moving through the air" | 1:drag 2:whiz 3:aerodynamics 4:draught 5:coefficient of drag | 1:the 2:through 3:a 4:moving 5:in | 1:glide 2:scooting 3:glides 4:gliding 5:flight | 1:flying 2:gliding 3:glide 4:fly 5:scooting
"a habit that might annoy your spouse" | 1:sisterinlaw 2:fatherinlaw 3:motherinlaw 4:stepson 5:stepchild | 1:annoy 2:your 3:might 4:that 5:either | 1:bossiness 2:jealousy 3:annoyance 4:rudeness 5:boorishness | 1:infidelity 2:bossiness 3:foible 4:unfaithfulness 5:adulterous

Table 3: The top-five candidates for example queries (invented by the authors) from different reverse dictionary models. Both the RNN and BOW models are without Word2Vec input and use the cosine loss.

Input description | RNN EN-FR | W2V add | RNN + Google
"an emotion that you might feel after being rejected" | triste, pitoyable, repugnante, epouvantable | insister, effectivement, pourquoi, nous | sentiment, regretter, peur, aversion
"a small black flying insect that transmits disease and likes horses" | mouche, canard, hirondelle, pigeon | attentivement, pouvions, pourrons, naturellement | voler, faucon, mouches, volant

Table 4: Responses from cross-lingual reverse dictionary models to selected queries. Underlined responses are 'correct' or potentially useful for a native French speaker.

Indeed, the problem of attaching appropriate words to concepts may be more common when searching for words in a second language than in a monolingual context.

To create the bilingual variant, we simply replace the Word2Vec target embeddings with those from a bilingual embedding space. Bilingual embedding models use bilingual corpora to learn a space of representations of the words in two languages, such that words from either language that have similar meanings are close together (Hermann and Blunsom, 2013; Chandar et al., 2014; Gouws et al., 2014). For a test-of-concept experiment, we used English-French embeddings learned by the state-of-the-art BilBOWA model (Gouws et al., 2014) from the Wikipedia (monolingual) and Europarl (bilingual) corpora.12 We trained the RNN model to map from English definitions to English words in the bilingual space. At test time, after reading an English definition, we then simply return the nearest French word neighbours to that definition.
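Concretely, only the retrieval vocabulary changes relative to the monolingual sketch in Section 3: the English definition is encoded into the shared bilingual space and the nearest French vectors are returned. Everything in the sketch below (the vocabulary, dimensionality, and the `encode_en` stand-in) is an illustrative assumption.

```python
import numpy as np

rng = np.random.RandomState(6)
fr_vocab = ["triste", "mouche", "voler", "peur"]         # toy French vocabulary
fr_vecs = rng.normal(size=(len(fr_vocab), 40))           # stand-ins for bilingual vectors
fr_vecs /= np.linalg.norm(fr_vecs, axis=1, keepdims=True)

def encode_en(definition):
    # stand-in for the RNN trained to map English definitions into the bilingual space
    return rng.normal(size=40)

def cross_lingual_reverse_dictionary(definition, k=2):
    q = encode_en(definition)
    sims = fr_vecs @ (q / np.linalg.norm(q))
    return [fr_vocab[i] for i in np.argsort(-sims)[:k]]

print(cross_lingual_reverse_dictionary("an emotion that you might feel after being rejected"))
```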


12 The approach should work with any bilingual embeddings. We thank Stephan Gouws for doing the training.

Because no benchmarks exist for quantitative evaluation of bilingual reverse dictionaries, we compare this approach qualitatively with two alternative methods for mapping definitions to words across languages. The first is analogous to the W2V Add model of the previous section: in the bilingual embedding space, we first compose the embeddings of the English words in the query definition with elementwise addition, and then return the French word whose embedding is nearest to this vector sum. The second uses the RNN monolingual reverse dictionary model to identify an English word from an English definition, and then translates that word using Google Translate.

Table 4 shows that the RNN model can be effectively modified to create a cross-lingual reverse dictionary. It is perhaps unsurprising that the W2V Add model candidates are generally the lowest in quality given the performance of the method in the monolingual setting. In comparing the two RNN-based methods, the RNN (embedding space) model appears to have two advantages over the RNN + Google approach. First, it does not require online access to a bilingual word-word mapping as defined e.g. by Google Translate.


Second, it is less prone to errors caused by word sense ambiguity. For example, in response to the query an emotion you feel after being rejected, the bilingual embedding RNN returns emotions or adjectives describing mental states. In contrast, the monolingual+Google model incorrectly maps the plausible English response regret to the verbal infinitive regretter. The model makes the same error when responding to a description of a fly, returning the verb voler (to fly).

3.7 Discussion

We have shown that simply training RNN or BOW NLMs on six dictionaries yields a reverse dictionary that performs comparably to the leading commercial system, even with access to much less dictionary data. Indeed, the embedding models consistently return syntactically and semantically plausible responses, which are generally part of a more coherent and homogeneous set of candidates than those produced by the commercial systems. We also showed how the architecture can be easily extended to produce bilingual versions of the same model.

In the analyses performed thus far, we only test the dictionary embedding approach on tasks that it was trained to accomplish (mapping definitions or descriptions to words). In the next section, we explore whether the knowledge learned by dictionary embedding models can be effectively transferred to a novel task.

4 General Knowledge (crossword) Question Answering

The automatic answering of questions posed in natural language is a central problem of Artificial Intelligence. Although web search and IR techniques provide a means to find sites or documents related to language queries, at present, internet users requiring a specific fact must still sift through pages to locate the desired information.

Systems that attempt to overcome this, via fully open-domain or general knowledge question-answering (open QA), generally require large teams of researchers, modular design and powerful infrastructure, exemplified by IBM's Watson (Ferrucci et al., 2010). For this reason, much academic research focuses on settings in which the scope of the task is reduced.

This has been achieved by restricting questions to a specific topic or domain (Molla and Vicedo, 2007), allowing systems access to pre-specified passages of text from which the answer can be inferred (Iyyer et al., 2014; Weston et al., 2015), or centering both questions and answers on a particular knowledge base (Berant and Liang, 2014; Bordes et al., 2014).

In what follows, we show that the dictionary embedding models introduced in the previous sections may form a useful component of an open QA system. Given the absence of a knowledge base or web-scale information in our architecture, we narrow the scope of the task by focusing on general knowledge crossword questions. General knowledge (non-cryptic, or quick) crosswords appear in national newspapers in many countries. Crossword question answering is more tractable than general open QA for two reasons. First, models know the length of the correct answer (in letters), reducing the search space. Second, some crossword questions mirror definitions, in that they refer to fundamental properties of concepts (a twelve-sided shape) or request a category member (a city in Egypt).13

4.1 Evaluation

General Knowledge crossword questions come in different styles and forms. We used the Eddie James crossword website to compile a bank of sentence-like general-knowledge questions.14 Eddie James is one of the UK's leading crossword compilers, working for several national newspapers. Our long question set consists of the first 150 questions (starting from puzzle #1) from his general-knowledge crosswords, excluding clues of fewer than four words and those whose answer was not a single word (e.g. kingjames).

To evaluate models on a different type of clue, we also compiled a set of shorter questions based on the Guardian Quick Crossword. Guardian questions still require general factual or linguistic knowledge, but are generally shorter and somewhat more cryptic than the longer Eddie James clues.

13 As our interest is in the language understanding, we do not address the question of fitting answers into a grid, which is the main concern of end-to-end automated crossword solvers (Littman et al., 2002).

14 http://www.eddiejames.co.uk/


We again formed a list of 150 questions, beginning on 1 January 2015 and excluding any questions with multiple-word answers. For clear contrast, we excluded those few questions of length greater than four words. Of these 150 clues, a subset of 30 were single-word clues. All evaluation datasets are available online with the paper.

As with the reverse dictionary experiments, candidates are extracted from models by inputting definitions and returning words corresponding to the closest embeddings in the target space. In this case, however, we only consider candidate words whose length matches the length specified in the clue.
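Candidate generation is thus the reverse-dictionary ranking with an added length filter; a minimal sketch with a toy vocabulary and a placeholder encoder, as before:

```python
import numpy as np

rng = np.random.RandomState(7)
vocab = ["guilt", "giraffe", "exercise", "eiger", "joshua"]   # toy vocabulary
vecs = rng.normal(size=(len(vocab), 50))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

def encode(clue):
    return rng.normal(size=50)      # stand-in for the trained definition encoder

def crossword_candidates(clue, answer_length, k=5):
    q = encode(clue)
    sims = vecs @ (q / np.linalg.norm(q))
    ranked = [vocab[i] for i in np.argsort(-sims)]
    return [w for w in ranked if len(w) == answer_length][:k]   # length filter

print(crossword_candidates("culpability", answer_length=5))
```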

Test set | Word | Description
Long (150) | Baudelaire | "French poet and key figure in the development of Symbolism."
Short (120) | satanist | "devil devotee"
Single-Word (30) | guilt | "culpability"

Table 5: Examples of the different question types in the crossword question evaluation dataset.

4.2 Benchmarks and Comparisons

As with the reverse dictionary experiments, we compare RNN and BOW NLMs with a simple unsupervised baseline of elementwise addition of Word2Vec vectors in the embedding space (we discard the ineffective W2V mult baseline), again restricting candidates to words of the pre-specified length. We also compare to two bespoke online crossword-solving engines. The first, One Across (http://www.oneacross.com/), is the candidate generation module of the award-winning Proverb crossword system (Littman et al., 2002). Proverb, which was produced by academic researchers, has featured in national media such as New Scientist, and beaten expert humans in crossword solving tournaments. The second comparison is with Crossword Maestro (http://www.crosswordmaestro.com/), a commercial crossword solving system that handles both cryptic and non-cryptic crossword clues (we focus only on the non-cryptic setting), and has also been featured in national media.15

15 See e.g. http://www.theguardian.com/crosswords/crossword-blog/2012/mar/08/crossword-blog-computers-crack-cryptic-clues

We are unable to compare against a third well-known automatic crossword solver, Dr Fill (Ginsberg, 2011), because code for Dr Fill's candidate-generation module is not readily available. As with the RNN and baseline models, when evaluating existing systems we discard candidates whose length does not match the length specified in the clue.

Certain principles connect the design of the existing commercial systems and differentiate them from our approach. Unlike the NLMs, they each require query-time access to large databases containing common crossword clues, dictionary definitions, the frequency with which words typically appear as crossword solutions and other hand-engineered and task-specific components (Littman et al., 2002; Ginsberg, 2011).

4.3 Results

The performance of models on the various question types is presented in Table 6. When evaluating the two commercial systems, One Across and Crossword Maestro, we have access to web interfaces that return up to approximately 100 candidates for each query, so can only reliably record membership of the top ten (accuracy@10).

On the long questions, we observe a clear advantage for all dictionary embedding models over the commercial systems and the simple unsupervised baseline. Here, the best performing NLM (RNN with Word2Vec input embeddings and ranking loss) ranks the correct answer third on average, and in the top-ten candidates over 60% of the time.

As the questions get shorter, the advantage of the embedding models diminishes. Both the unsupervised baseline and One Across answer the short questions with comparable accuracy to the RNN and BOW models. One reason for this may be the difference in form and style between the shorter clues and the full definitions or encyclopedia sentences in the dictionary training data. As the length of the clue decreases, finding the answer often reduces to generating synonyms (culpability - guilt), or category members (tall animal - giraffe). The commercial systems can retrieve good candidates for such clues among their databases of entities, relationships and common crossword answers.



Each cell reports average rank, accuracy@10/100, and rank variance.

Model | Long (150) | Short (120) | Single-Word (30)
One Across | .39 (accuracy@10) | .68 (accuracy@10) | .70 (accuracy@10)
Crossword Maestro | .27 (accuracy@10) | .43 (accuracy@10) | .73 (accuracy@10)
W2V add | 42 .31/.63 92 | 11 .50/.78 66 | 2 .79/.90 45
RNN cosine | 15 .43/.69 108 | 22 .39/.67 117 | 72 .31/.52 187
RNN w2v cosine | 4 .61/.82 60 | 7 .56/.79 60 | 12 .48/.72 116
RNN ranking | 6 .58/.84 48 | 10 .51/.73 57 | 12 .48/.69 67
RNN w2v ranking | 3 .62/.80 61 | 8 .57/.78 49 | 12 .48/.69 114
BOW cosine | 4 .60/.82 54 | 7 .56/.78 51 | 12 .45/.72 137
BOW w2v cosine | 4 .60/.83 56 | 7 .54/.80 48 | 3 .59/.79 111
BOW ranking | 5 .62/.87 50 | 8 .58/.83 37 | 8 .55/.79 39
BOW w2v ranking | 5 .60/.86 48 | 8 .56/.83 35 | 4 .55/.83 43

Table 6: Performance of different models on crossword questions of different length. The two commercial systems are evaluated via their web interface so only accuracy@10 can be reported in those cases.

Unsupervised Word2Vec representations are also known to encode these sorts of relationships (even after elementwise addition for short sequences of words) (Mikolov et al., 2013). This would also explain why the dictionary embedding models with pre-trained (Word2Vec) input embeddings outperform those with learned embeddings, particularly for the shortest questions.

4.4 Qualitative Analysis

A better understanding of how the different models arrive at their answers can be gained from considering specific examples, as presented in Table 7. The first three examples show that, despite the apparently superficial nature of its training data (definitions and introductory sentences), embedding models can answer questions that require factual knowledge about people and places. Another notable characteristic of these models is the consistent semantic appropriateness of the candidate set. In the first case, the top five candidates are all mountains, valleys or places in the Alps; in the second, they are all biblical names. In the third, the RNN model retrieves currencies, in this case performing better than the BOW model, which retrieves entities of various type associated with the Netherlands. Generally speaking (as can be observed by the web demo), the 'smoothness' or consistency in candidate generation of the dictionary embedding models is greater than that of the commercial systems. Despite its simplicity, the unsupervised W2V addition method is at times also surprisingly effective, as shown by the fact that it returns Joshua in its top candidates for the second query.

The final example in Table 7 illustrates the surprising power of the BOW model.

In the training data there is a single definition for the correct answer Schoenberg: United States composer and musical theorist (born in Austria) who developed atonal composition. The only word common to both the query and the definition is 'composer' (there is no tokenization that allows the BOW model to directly connect atonal and atonality). Nevertheless, the model is able to infer the necessary connections between the concepts in the query and the definition to return Schoenberg as the top candidate.

Despite such cases, it remains an open question whether, with more diverse training data, the world knowledge required for full open QA (e.g. secondary facts about Schoenberg, such as his family) could be encoded and retained as weights in a (larger) dynamic network, or whether it will be necessary to combine the RNN with an external memory that is less frequently (or never) updated. This latter approach has begun to achieve impressive results on certain QA and entailment tasks (Bordes et al., 2014; Graves et al., 2014; Weston et al., 2015).

5 Conclusion

Dictionaries exist in many of the world's languages. We have shown how these lexical resources can constitute valuable data for training the latest neural language models to interpret and represent the meaning of phrases and sentences.


Input Description | One Across | Crossword Maestro | BOW | RNN
"Swiss mountain peak famed for its north face (5)" | 1:noted 2:front 3:Eiger 4:crown 5:fount | 1:after 2:favor 3:ahead 4:along 5:being | 1:Eiger 2:Crags 3:Teton 4:Cerro 5:Jebel | 1:Eiger 2:Aosta 3:Cuneo 4:Lecco 5:Tyrol
"Old Testament successor to Moses (6)" | 1:Joshua 2:Exodus 3:Hebrew 4:person 5:across | 1:devise 2:Daniel 3:Haggai 4:Isaiah 5:Joseph | 1:Isaiah 2:Elijah 3:Joshua 4:Elisha 5:Yahweh | 1:Joshua 2:Isaiah 3:Gideon 4:Elijah 5:Yahweh
"The former currency of the Netherlands (7)" | 1:Holland 2:general 3:Lesotho | 1:Holland 2:ancient 3:earlier 4:onetime 5:qondam | 1:Guilder 2:Holland 3:Drenthe 4:Utrecht 5:Naarden | 1:Guilder 2:Escudos 3:Pesetas 4:Someren 5:Florins
"Arnold, 20th Century composer pioneer of atonality (10)" | 1:surrealism 2:laborparty 3:tonemusics 4:introduced 5:Schoenberg | 1:disharmony 2:dissonance 3:bringabout 4:constitute 5:triggeroff | 1:Schoenberg 2:Christleib 3:Stravinsky 4:Elderfield 5:Mendelsohn | 1:Mendelsohn 2:Williamson 3:Huddleston 4:Mandelbaum 5:Zimmerman

Table 7: Responses from different models to example crossword clues. In each case the model output is filtered to exclude any candidates that are not of the same length as the correct answer. BOW and RNN models are trained without Word2Vec input embeddings and with the cosine loss.

While humans use the phrasal definitions in dictionaries to better understand the meaning of words, machines can use the words to better understand the phrases. We used two dictionary embedding architectures, a recurrent neural network architecture with a long short-term memory and a simpler linear bag-of-words model, to explicitly exploit this idea.

On the reverse dictionary task that mirrors its training setting, NLMs that embed all known concepts in a continuous-valued vector space perform comparably to the best known commercial applications despite having access to many fewer definitions. Moreover, they generate smoother sets of candidates and require no linguistic pre-processing or task-specific engineering. We also showed how the description-to-word objective can be used to train models useful for other tasks. NLMs trained on the same data can answer general-knowledge crossword questions, and indeed outperform commercial systems on questions containing more than four words. While our QA experiments focused on crosswords, the results suggest that a similar embedding-based approach may ultimately lead to improved output from more general QA and dialog systems and information retrieval engines in general.

We make all code, training data, evaluation sets and both of our linguistic tools publicly available online for future research.

In particular, we propose the reverse dictionary task as a comparatively general-purpose and objective way of evaluating how well models compose lexical meaning into phrase or sentence representations (whether or not they involve training on definitions directly).

In the next stage of this research, we will explore ways to enhance the NLMs described here, especially in the question-answering context. The models are currently not trained on any question-like language, and would conceivably improve on exposure to such linguistic forms. We would also like to understand better how BOW models can perform so well with no 'awareness' of word order, and whether there are specific linguistic contexts in which models like RNNs or others with the power to encode word order are indeed necessary. Finally, we intend to explore ways to endow the model with richer world knowledge. This may require the integration of an external memory module, similar to the promising approaches proposed in several recent papers (Graves et al., 2014; Weston et al., 2015).

Acknowledgments

KC and YB acknowledge the support of the following organizations: NSERC, Calcul Quebec, Compute Canada, the Canada Research Chairs and CIFAR.


FH and AK were supported by a Google Faculty Research Award, and AK further by a Google European Fellowship.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the Association for Computational Linguistics.

James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).

Slaven Bilac, Timothy Baldwin, and Hozumi Tanaka. 2003. Improving dictionary accessibility by maximizing use of available knowledge. Traitement Automatique des Langues, 44(2):199–224.

Slaven Bilac, Wataru Watanabe, Taiichi Hashimoto, Takenobu Tokunaga, and Hozumi Tanaka. 2004. Dictionary search based on the target word description. In Proceedings of NLP 2004.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. Proceedings of EMNLP.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems, pages 1853–1861.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. Proceedings of the North American Chapter of the Association for Computational Linguistics.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. In AI magazine, volume 31(3), pages 59–79.

Matthew L. Ginsberg. 2011. Dr. FILL: Crosswords and an implemented solver for singly weighted CSPs. In Journal of Artificial Intelligence Research, pages 851–886.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2014. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of NIPS Deep Learning Workshop.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Karl Moritz Hermann and Phil Blunsom. 2013. Multilingual distributed representations without word alignment. In Proceedings of ICLR.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the Association for Computational Linguistics.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daume III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of EMNLP.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2015. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics. To appear.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. Proceedings of COLING.

Geoffrey Leech, Roger Garside, and Michael Bryant. 1994. CLAWS4: The tagging of the British National Corpus. In Proceedings of COLING.

Michael L. Littman, Greg A. Keim, and Noam Shazeer. 2002. A probabilistic approach to solving crossword puzzles. Artificial Intelligence, 134(1):23–55.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of INTERSPEECH 2010.



Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Diego Molla and Jose Luis Vicedo. 2007. Question answering in restricted domains: An overview. Computational Linguistics, 33(1):41–61.

Ryan Shaw, Anindya Datta, Debra VanderMeer, and Kaushik Dutta. 2013. Building a scalable database-driven reverse dictionary. Knowledge and Data Engineering, IEEE Transactions on, 25(3):528–540.

Ivan Vulic, Wim De Smet, and Marie-Francine Moens. 2011. Identifying word translations from comparable corpora using latent topic models. In Proceedings of the Association for Computational Linguistics.

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.

Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Michael Zock and Slaven Bilac. 2004. Word lookup on the basis of associations: From an idea to a roadmap. In Proceedings of the ACL Workshop on Enhancing and Using Electronic Dictionaries.
