Association for Computational Linguistics · Preface Welcome to the 1st Workshop on Sense, Concept and Entity Representations and their Applications (SENSE 2017). The aim of SENSE

SENSE 2017

EACL 2017Workshop on Sense, Concept and Entity Representations and

their Applications

Proceedings of the Workshop

April 4, 2017Valencia, Spain

c©2017 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)209 N. Eighth StreetStroudsburg, PA 18360USATel: +1-570-476-8006Fax: [email protected]

ISBN 978-1-945626-50-0

ii

Preface

Welcome to the 1st Workshop on Sense, Concept and Entity Representations and their Applications(SENSE 2017). The aim of SENSE 2017 is to focus on addressing one of the most important limitationsof word-based techniques in that they conflate different meanings of a word into a single representation.SENSE 2017 brings together researchers in lexical semantics, and NLP in general, to investigate andpropose sense-based techniques as well as to discuss effective ways of integrating sense, concept andentity representations into downstream applications.

The workshop is targeted at covering the following topics:

• Utilizing sense/concept/entity representations in applications such as Machine Translation,Information Extraction or Retrieval, Word Sense Disambiguation, Entity Linking, TextClassification, Semantic Parsing, Knowledge Base Construction or Completion, etc.

• Exploration of the advantages/disadvantages of using sense representations over wordrepresentations.

• Proposing new evaluation benchmarks or comparison studies for sense vector representations.

• Development of new sense representation techniques (unsupervised, knowledge-based or hybrid).

• Compositionality of senses: learning representations for phrases and sentences.

• Construction and use of sense representations for languages other than English as well asmultilingual representations.

We received 21 submissions, accepting 15 of them (acceptance rate: 71%).

We would like to thank the Program Committee members who reviewed the papers and helped to improvethe overall quality of the workshop. We also thank Aylien for their support in funding the best paperaward. Last, a word of thanks also goes to our invited speakers, Roberto Navigli (Sapienza University ofRome) and Hinrich Schütze (University of Munich).

Jose Camacho-Collados and Mohammad Taher PilehvarCo-Organizers of SENSE 2017

iii

Organizers:

Jose Camacho-Collados, Sapienza University of RomeMohammad Taher Pilehvar, University of Cambridge

Program Committee:

Eneko Agirre, University of the Basque CountryClaudio Delli Bovi, Sapienza University of RomeLuis Espinosa-Anke, Pompeu Fabra UniversityLucie Flekova, Darmstadt University of TechnologyGraeme Hirst, University of TorontoEduard Hovy, Carnegie Mellon UniversityIgnacio Iacobacci, Sapienza University of RomeRichard Johansson, University of GothenburgDavid Jurgens, Stanford UniversityOmer Levy, University of WashingtonAndrea Moro, MicrosoftRoberto Navigli, Sapienza University of RomeArvind Neelakantan, University of Massachusetts AmherstLuis Nieto Piña, University of GothenburgSiva Reddy, University of EdinburghHoracio Saggion, Pompeu Fabra UniversityHinrich Schütze, University of MunichPiek Vossen, University of AmsterdamIvan Vulic, University of CambridgeTorsten Zesch, University of Duisburg-EssenJianwen Zhang, Microsoft Research

Invited Speakers:

Roberto Navigli, Sapienza University of RomeHinrich Schütze, University of Munich

v

Table of Contents

Compositional Semantics using Feature-Based Models from WordNetPablo Gamallo and Martín Pereira-Fariña . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Automated WordNet Construction Using Word EmbeddingsMikhail Khodak, Andrej Risteski, Christiane Fellbaum and Sanjeev Arora . . . . . . . . . . . . . . . . . . . . 12

Improving Verb Metaphor Detection by Propagating Abstractness to Words, Phrases and IndividualSenses

Maximilian Köper and Sabine Schulte im Walde . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

Improving Clinical Diagnosis Inference through Integration of Structured and Unstructured KnowledgeYuan Ling, Yuan An and Sadid Hasan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Classifying Lexical-semantic Relationships by Exploiting Sense/Concept RepresentationsKentaro Kanada, Tetsunori Kobayashi and Yoshihiko Hayashi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Supervised and unsupervised approaches to measuring usage similarityMilton King and Paul Cook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Lexical Disambiguation of Igbo using Diacritic RestorationIgnatius Ezeani, Mark Hepple and Ikechukwu Onyenwe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds

Mahmoud El-Haj, Paul Rayson, Scott Piao and Stephen Wattam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationAlexander Panchenko, Stefano Faralli, Simone Paolo Ponzetto and Chris Biemann . . . . . . . . . . . . 72

One Representation per Word - Does it make Sense for Composition?Thomas Kober, Julie Weeds, John Wilkie, Jeremy Reffin and David Weir . . . . . . . . . . . . . . . . . . . . . 79

Elucidating Conceptual Properties from Word EmbeddingsKyoung-Rok Jang and Sung-Hyon Myaeng. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .91

TTCSˆe: a Vectorial Resource for Computing Conceptual SimilarityEnrico Mensa, Daniele P. Radicioni and Antonio Lieto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

Measuring the Italian-English lexical gap for action verbs and its impact on translationLorenzo Gregori and Alessandro Panunzi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

Word Sense Filtering Improves Embedding-Based Lexical SubstitutionAnne Cocos, Marianna Apidianaki and Chris Callison-Burch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of UnambigousSynonyms

Aleksander Wawer and Agnieszka Mykowiecka . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

vii

Workshop Program

8:30 - 9:30 Registration

9:30 - 11:00 Session 1

• 9:30-9:40 - Opening Remarks

• 9:40-10:00 - Short paper presentations

Improving Verb Metaphor Detection by Propagating Abstractness to Words, Phrasesand Individual SensesMaximilian Köper and Sabine Schulte im Walde

Using Linked Disambiguated Distributional Networks for Word Sense Disambigua-tionAlexander Panchenko, Stefano Faralli, Simone Paolo Ponzetto and Chris Biemann

• 10:00-11:00 - Invited talk by Roberto Navigli (Sapienza University)

11:00 - 11:30 Coffee break

11:30 - 13:00 Session 2

• 11:30-11:45 - Lightning talks (posters)

Compositional Semantics using Feature-Based Models from WordNetPablo Gamallo and Martín Pereira-Fariña

Automated WordNet Construction Using Word EmbeddingsMikhail Khodak, Andrej Risteski, Christiane Fellbaum and Sanjeev Arora

Classifying Lexical-semantic Relationships by Exploiting Sense/Concept Represen-tationsKentaro Kanada, Tetsunori Kobayashi and Yoshihiko Hayashi

Supervised and unsupervised approaches to measuring usage similarityMilton King and Paul Cook

Lexical Disambiguation of Igbo using Diacritic RestorationIgnatius Ezeani, Mark Hepple and Ikechukwu Onyenwe

ix

Creating and Validating Multilingual Semantic Representations for Six Languages:Expert versus Non-Expert CrowdsMahmoud El-Haj, Paul Rayson, Scott Piao and Stephen Wattam

Elucidating Conceptual Properties from Word EmbeddingsKyoung-Rok Jang and Sung-Hyon Myaeng

TTCSˆe: a Vectorial Resource for Computing Conceptual SimilarityEnrico Mensa, Daniele P. Radicioni and Antonio Lieto

Measuring the Italian-English lexical gap for action verbs and its impact on trans-lationLorenzo Gregori and Alessandro Panunzi

Supervised and Unsupervised Word Sense Disambiguation on Word EmbeddingVectors of Unambigous SynonymsAleksander Wawer and Agnieszka Mykowiecka

• 11:45-13:00 - Poster session

13:00 - 14:30 Lunch

14:30 - 16:00 Session 3

• 14:30-15:00 - Invited talk by Hinrich Schütze (University of Munich)

• 15:00-16:00 - Presentations of the best paper award candidates

Improving Clinical Diagnosis Inference through Integration of Structured and Un-structured KnowledgeYuan Ling, Yuan An and Sadid Hasan

One Representation per Word - Does it make Sense for Composition?Thomas Kober, Julie Weeds, John Wilkie, Jeremy Reffin and David Weir

Word Sense Filtering Improves Embedding-Based Lexical SubstitutionAnne Cocos, Marianna Apidianaki and Chris Callison-Burch

16:00 - 16:30 Coffee break

16:30 - 17:15 Session 4

• 16:30-17:15 - Open discussion, best paper award and closing remarks

x

Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 1–11,Valencia, Spain, April 4 2017. c©2017 Association for Computational Linguistics

Compositional Semantics using Feature-Based Models from WordNet

Pablo GamalloCentro Singular de Investigacion en

Tecnoloxıas da Informacion (CiTIUS)Universidade de Santiago de Compostela

Galiza, [email protected]

Martın Pereira-FarinaCentre for Argument

Technology (ARG-tech)University of Dundee

Dundee, DD1 4HN, Scotland (UK)[email protected]

Departamento de Filosofıa e AntropoloxıaUniversidade de Santiago de CompostelaPz. de Mazarelos, 15782, Galiza, Spain

[email protected]

AbstractThis article describes a method to buildsemantic representations of composite ex-pressions in a compositional way by usingWordNet relations to represent the mean-ing of words. The meaning of a targetword is modelled as a vector in whichits semantically related words are assignedweights according to both the type of therelationship and the distance to the tar-get word. Word vectors are composi-tionally combined by syntactic dependen-cies. Each syntactic dependency triggerstwo complementary compositional func-tions: the named head function and depen-dent function. The experiments show thatthe proposed compositional method per-forms as the state-of-the-art for subject-verb expressions, and clearly outperformsthe best system for transitive subject-verb-object constructions.

1 Introduction

The principle of compositionality (Partee, 1984)states that the meaning of a complex expressionis a function of the meaning of its constituentparts and of the mode of their combination. Inthe recent years, different distributional semanticmodels endowed with a compositional componenthave been proposed. Most of them define words ashigh-dimensional vectors where dimensions rep-resent co-occurring context words. This distribu-tional semantic representation makes it possibleto combine vectors using simple arithmetic opera-tions such as addition and multiplication, or moreadvanced compositional methods such as learningfunctional words as tensors and composing con-stituents through inner product operations.

Notwithstanding, these models are usually qual-ified as black box systems because they are usuallynot interpretable by humans. Currently, the field ofinterpretable computational models is gaining rel-evance1 and, therefore, the development of moreexplainable and understandable models in compo-sitional semantics is also an open challenge. in thisfield. On the other hand, distributional semanticmodels, given the size of the vectors, needs signif-icant resources and they are dependent on particu-lar corpus, which can generate some biases in theirapplication to different languages.

Thus, in this paper, we will pay attention tocompositional approaches which employ otherkind of word semantic models, such as those basedon the WordNet relationships; i.e., synsets, hyper-nyms, hyponyms, etc. Only in (Faruqui and Dyer,2015) we can find a proposal for word vector rep-resentation using hand-crafted linguistic resources(WordNet, FrameNet, etc.), although a composi-tional frame is not explicitly adopted. Therefore,to the best of our knowledge, this is the first workusing WordNet to build compositional semanticinterpretations. Thus, in this article, we proposea method to compositionally build the semanticrepresentation of composite expressions using afeature-based approach (Hadj Taieb et al., 2014):constituent elements are induced by WordNet re-lationships.

However, this proposal raises a serious prob-lem: the semantic representation of two syntacti-cally related words (e.g. the verb run and the nouncomputer in “the computer runs”) encodes incom-patible information and there is no direct way ofcombining the features used to represent the mean-ing of the two words. On the one hand, the verb

1http://www.darpa.mil/program/explainable-artificial-intelligence

1

run is related by synonymy, hypernym, hyponymand entailment to other verbs and, on the other, thenoun computer is put in relation with other nounsby synonymy, hypernym, hyponym, and so on.

In order to solve this drawback, on the basisof previous work on dependency-based distribu-tional compositionality (Thater et al., 2010; Erkand Pado, 2008), we distinguish between directdenotation and selectional preferences within adependency relation. More precisely, when twowords are syntactically related, for instance com-puter and the verb run by the subject relation, webuild two contextualized senses: the contextual-ized sense of computer given the requirements ofrun and the contextualized sense of run given com-puter.

The sense of computer is built by combiningthe semantic features of the noun (its direct de-notation) with the selectional preferences imposedby the verb. The features of the noun are builtfrom the set of words linked to computer in Word-Net, while the selectional preferences of run inthe subject position are obtained by combining thefeatures of all the nouns that can be the nominalsubject of the verb (i.e. the features of runners).Then, the two sets of features are combined andthe resulting new set represents the specific senseof the noun computer as nominal subject of run.The sense of the verb given the noun is built in aanalogous way: the semantic features of the verbare combined with the (inverse) selectional pref-erences imposed by the noun, resulting in a newcompositional representation of the verb run whenit is combined with computer at the subject po-sition. The two new compositional feature setsrepresent the contextualized senses of the two re-lated words. During the contextualization process,ambiguous or polysemous words may be disam-biguated in order to obtain the right representation.

For dealing with any sequence with N (lexi-cal) words (e.g., “the coach runs the team”), thesemantic process can be applied in two differentways: from left-to-right and from right-to-left. Inthe first case, it is appliedN−1 times dependency-by-dependency in order to obtain N contextual-ized senses, one per lexical word. Thus, firstly,the subject dependency builds two contextualizedsenses: that of run given the noun coach and thatof the noun given the verb. Then, the direct objectdependency is applied on the already contextual-ized sense of the verb in order to contextualize it

again given team at the direct object position. Thisdependency also yields the contextualized sense ofthe object given the verb and its nominal subject(coach+run). At the end of the interpretation pro-cess, we obtain three fully contextualized senses.In the second case, from right-to-left, the semanticprocess process is applied in a similar way, beingcontextualized (and disambiguated) using the re-strictions imposed by the verb and its nominal ob-ject (run+team). As in the first case, three slightlydifferent word senses are also obtained.

Lastly, word sense disambiguation is out of theaim of this paper. Here, we only use WordNet forextracting semantic information from words, butnot to identify word senses.

The article is organized as follow: In the nextsection (2), different approaches on ontologicalfeature-based representations and compositionalsemantics are introduced and discussed. Then,sections 3 and 4 respectively describe our feature-based semantic representation and compositionalstrategy. In Section 5, some experiments are per-formed to evaluate the quality of the word modelsand compositional word vectors. Finally, relevantconclusions are reported in Section 6.

2 Related Work

Our approach relies on two tasks: to build feature-based representations using WordNet relations,and to build compositional vectors using theWordNet representations. In this section, we willexamine work related to these two tasks.

2.1 Feature-Based Approaches

Tversky (1977), in order to define a similaritymeasure, assumes that any object can be repre-sented as a collection (set) of features or proper-ties. Therefore, a similarity metric is a feature-matching process between two objects. This con-sists of a linear combination of the measures oftheir common and distinctive features. It is worthnoting that this is a non-symmetric measure.

In the particular case of semantic similarity met-rics, each word or concept is featured by means ofa set of words (Hadj Taieb et al., 2014). Framedinto an ontology such as WordNet, these setsof words are obtained from taxonomic (hyper-nym, hyponym, etc.) and non-taxonomic (synsets,glosses, meronyms, etc.) properties (Meng et al.,2013), although these last ones are classified assecondary in many cases (Slimani, 2013). The

2

main objective of this approach is to capture thesemantic knowledge induced by ontological rela-tionships.

Our model is partly inspired by that definedin (Rodrıguez and Egenhofer, 2003). It proposesthat the set of properties that characterizes a wordmay be stratified into three groups: i) synsets;ii) features (e.g., meronyms, attributes, hyponym,etc.), and, iii) neighbor concepts (those linked viasemantic pointers). Each one of these strata isweighted according to its contribution to the rep-resentation of the concept. The measure analyzesthe overlapping among the three strata between thetwo terms under comparison.

2.2 Compositional Strategies

Several models for compositionality in vectorspaces have been proposed in recent years, andmost of them use bag-of-words as basic distribu-tional representations of word contexts. The ba-sic approach to composition, explored by Mitchelland Lapata (2008; 2009; 2010), is to combine vec-tors of two syntactically related words with arith-metic operations: addition and component-wisemultiplication. The additive model produces a sortof union of word contexts, whereas multiplicationhas an intersective effect. According to Mitchelland Lapata (2008), component-wise multiplica-tion performs better than the additive model. How-ever, in (Mitchell and Lapata, 2009; Mitchell andLapata, 2010), these authors explore weighted ad-ditive models giving more weight to some con-stituents in specific word combinations. For in-stance, in a noun-subject-verb combination, theverb is provided with higher weight because thewhole construction is closer to the verb than tothe noun. Other weighted additive models are de-scribed in (Guevara, 2010) and (Zanzotto et al.,2010). All these models have in common the factof defining composition operations for just wordpairs. Their main drawback is that they do not pro-pose a more systematic model accounting for alltypes of semantic composition. They do not focuson the logical aspects of the functional approachunderlying compositionality.

Other distributional approaches develop soundcompositional models of meaning inspired byMontagovian semantics, which induce the compo-sitional meaning of the functional words from ex-amples adopting regression techniques commonlyused in machine learning (Krishnamurthy and

Mitchell, 2013; Baroni and Zamparelli, 2010; Ba-roni, 2013; Baroni et al., 2014). In our approach,by contrast, compositional functions, which aredriven by dependencies and not by functionalwords, are just basic arithmetic operations on vec-tors as in (Mitchell and Lapata, 2008). Arithmeticapproaches are easy to implement and producehigh-quality compositional vectors, which makesthem a good choice for practical applications (Ba-roni et al., 2014).

Other compositional approaches based on Cat-egorial Grammar use tensor products for com-position (Grefenstette et al., 2011; Coecke etal., 2010). A neural network-based methodwith tensor factorization for learning the embed-dings of transitive clauses has been introducedin (Hashimoto and Tsuruoka, 2015). Two prob-lems arise with tensor products. First, they re-sult in an information scalability problem, sincetensor representations grow exponentially as thephrases grow longer (Turney, 2013). And sec-ond, tensor products did not perform as well ascomponent-wise multiplication in Mitchell andLapata’s (2010) experiments.

There are also works focused on the notionof sense contextualization, e.g., Dinu and Lapata(2010) work on context-sensitive representationsfor lexical substitution. Reddy et al. (2011) workon dynamic prototypes for composing the seman-tics of noun-noun compounds and evaluate theirapproach on a compositionality-based similaritytask.

So far, all the cited works are based on bag-of-words to represent vector contexts and, then,word senses. However, there are a few works us-ing vector spaces structured with syntactic infor-mation. Thater et al. (2010) distinguish betweenfirst-order and second-order vectors in order to al-low two syntactically incompatible vectors to becombined. This work is inspired by that describedin (Erk and Pado, 2008). Erk and Pado (2008) pro-pose a method in which the combination of twowords, a and b, returns two vectors: a vector a’representing the sense of a given the selectionalpreferences imposed by b, and a vector b’ standingfor the sense of b given the (inverse) selectionalpreferences imposed by a. A similar strategy isreported in Gamallo (2017). Our approach is an at-tempt to join the main ideas of these syntax-basedmodels (namely, second-order vectors, selectionalpreferences and two returning words per combi-

3

nation) in order to apply them to WordNet-basedword representations.

3 Semantic Features from WordNet

A word meaning is described as a feature-valuestructure. The features are the words with whichthe target word is related to in the ontology (e.g., inWordNet, hypernym, hyponym, etc.) and the val-ues correspond to weights computed taking intoaccount two parameters: the relation type and theedge-counting distance between the target wordand each word feature (i.e. the number of rela-tions required to achieve the feature from the tar-get word) (Rada et al., 1989).

The algorithm to set the feature values is the fol-lowing. Given a target word w1 and the feature setF , where wi ∈ F if wi is a word semantically re-lated to w1 in WordNet, the weight for the relationbetween w1 and wi is computed by equation 1:

weight(w1, wi) =R∑

j=1

1length(w1, wi, rj)

(1)

where R is the number of different semantic re-lations (e.g. synonymy/synset, hyperonymy, hy-ponymy, etc) that WordNet defines for the part-of-speech of the target word. For instance, nounshave five different relations, verbs four and ad-jectives just two. length(w1, wi, rj is the lengthof the path from the target word w1 to its fea-ture wi in relation rj . length(w1, wi, rj) = 1when rj stands for the synonymy relationship,i.e. when w1 and wi belong to the same synset;length(w1, wi, rj) = 2 if wi is at the first levelwithin the hierarchy associated to relation rj .

For instance, the length value of a direct hyper-nym is 2 because there is a distance of two arcswith regard to the target word: the first arc goesfrom the target word to a synset and the second oneis the hyperonymy relation between the direct hy-pernym and the synset. The length value increasesin one unit as the hierarchy level goes up, so atlevel 4, the length score is 5 and then the partialweight is 1/5 = 0.2. For some non-taxonomic re-lations, namely meronymy, holonymy and coordi-nates, there is only one level in WordNet, but thedistance is 3 since the target word and the wordfeature (part, whole or coordinate term) are sepa-rated by a synset and a hypernym.

As a feature word wi may be related to thetarget w1 via different semantic relations (with-

out distinguishing between different word senses),the final weight is the addition of all partialweights. For instance, take the noun car. It isrelated to automobile through two different re-lationships: they belong to the same synset andthe latter is a direct hypernym of the former, soweight(car, automobile) = 1/1 + 1/2 = 1.5.

To compute compositional operations on words,the feature-value structure associated to each wordis modeled as a vector, where features are dimen-sions, words are objects, and weights the valuesfor each object/dimension position.

4 Compositional Semantics

4.1 Syntactic Dependencies AsCompositional Functions

Our approach is also inspired in (Erk and Pado,2008). Here, semantic composition is modeled interms of function application driven by binary de-pendencies. A dependency is associated in the se-mantic space with two compositional functions onword vectors: the head and the dependent func-tions. To explain how they work, let us take thedirect object relation (dobj) between the verb runand the noun team in the expression “run a team”.The head function, dobj↑, combines the vector ofthe head verb ~run with the selectional preferencesimposed by the noun, which is also a vector ofWordNet features, and noted ~team◦. This combi-nation is performed by component-wise multipli-cation and results in a new vector ~rundobj↑, whichrepresents the contextualized sense of run giventeam in the dobj relation:

dobj↑( ~run, ~team◦) = ~run� ~team◦ = ~rundobj↑

To build the (inverse) selectional preferences im-posed by the dependent word team as direct ob-ject on the verb, we require a reference corpus toextract all those verbs of which team is the di-rect object. The selectional preferences of teamas direct object of a verb, and noted ~team◦, is anew vector obtained by component-wise additionof the vectors of all those verbs (e.g. create, sup-port, help, etc) that are in dobj relation with thenoun team:

~team◦ =∑

~v∈ T

~v

where T is the vector set of verbs having team asdirect object (except run). T is thus included in the

4

subspace of verb vectors. Component-wise addi-tion has an union effect.

Similarly, the dependent function, dobj↓, com-bines the noun vector ~team with the selectionalpreferences imposed by the verb, noted ~run◦, bycomponent-wise multiplication. Such a combina-tions builds the new vector of ~teamdobj↓, whichstands for the contextualized sense of team givenrun in the dobj relation:

dobj↓( ~run◦, ~team) = ~team� ~run◦ = ~teamdobj↓

The selectional preferences imposed by thehead word run to its direct object are representedby the vector ~run◦, which is obtained by addingthe vectors of all those nouns (e.g. company,project, marathon, etc) which are in relation dobjwith the verb run:

~run◦ =∑

~v∈ R

~v

where R is the vector set of nouns playing the di-rect object role of run (except team). R is includedin the subspace of nominal vectors.

Each multiplicative operation results in acompositional vector of a contextualized word.Component-wise multiplication has an intersec-tive effect. The vector standing for the selectionalpreferences restricts the vector of the target wordby assigning weight 0 to those WordNet featuresthat are not shared by both vectors. The new com-positional vector as well as the two constituents allbelong to the same vector subspace (the subspaceof nouns, verbs, or adjectives).

Notice that, in approaches to computationalsemantics inspired by Combinatory CategorialGrammar (Steedman, 1996) and Montagovian se-mantics (Montague, 1970), the interpretation pro-cess for composite expressions such as “run ateam” or “electric coach” relies on rigid function-argument structures: relational expressions, likeverbs and adjectives, are used as predicates whilenouns and nominals are their arguments. In thecomposition process, each word is supposed toplay a rigid and fixed role: the relational wordis semantically represented as a selective func-tion imposing constraints on the denotations ofthe words it combines with, while non-relationalwords are in turn seen as arguments filling the con-straints imposed by the function. For instance, run

and electric would denote functions while teamand coach would be their arguments.

By contrast, we deny the rigid “predicate-argument” structure. In our compositional ap-proach, dependencies are the active functions thatcontrol and rule the selectional requirements im-posed by the two related words. Thus, each con-stituent word imposes its selectional preferenceson the other within a dependency-based construc-tion. This is in accordance with non-standard lin-guistic research which assumes that the words in-volved in a composite expression impose seman-tic restrictions on each other (Pustejovsky, 1995;Gamallo et al., 2005; Gamallo, 2008).

4.2 Recursive Compositional ApplicationIn our approach, the consecutive application of thesyntactic dependencies found in a sentence is ac-tually the process of building the contextualizedsense of all the lexical words which constitute it.Thus, the whole sentence is not assigned to anunique meaning (which could be the contextual-ized sense of the root word), but one sense perlemma, being the sense of the root just one ofthem.

This incremental process may have two di-rections: from left-to-right and vice versa (i.e.,from right-to-left). Figure 1 illustrates theincremental process of building the sense ofwords dependency-by-dependency from left-to-right. Thus, given the composite expression “thecoach runs the team” and its dependency analysisdepicted in the first row of the figure, two com-positional processes are driven by the two depen-dencies involved in the analysis (nsubj and dobj).Each dependency is decomposed into two func-tions: head (nsubj↑ and dobj↑) and dependent(nsubj↓ and dobj↓) functions.2 The first compo-sitional process applies, on the one hand, the headfunction nsubj↑ on the denotation of the head verb( ~run) and on the selectional preferences requiredby coach ( ~coach◦), in order to build a contex-tualized sense of the verb: ~runnsubj↑ . On theother hand, the dependent function nsubj↓ buildsthe sense of coach as nominal subject of run:~coachnsubj↓. Then, the contextualized head vec-

tor is involved in the compositional process drivenby dobj. At this level of semantic composition, theselectional preferences imposed on the noun team

2We do not consider the meaning of determiners, auxiliaryverbs, or tense affixes. Quantificational issues associated tothem are also beyond the scope of this work.

5

stand for the semantic features of all those nounswhich may be the direct object of coach+run. Atthe end of the process, we have not obtained onesingle sense for the whole expression, but one con-textualized sense per lexical word: ~coachnsubj↓,~runnsubj↑+dobj↑ and ~teamdobj↓.

In other case, from right-to-left, the verb run isfirst restricted by team at the direct object position,and then by its subject coach. In addition, thisnoun is now restricted by the selectional prefer-ences imposed by run and team, that is, it is com-bined with the semantic features of all those nounsthat may be the nominal subject of run+team.

5 Experiments

We have performed several similarity-based ex-periments using the semantic word model definedin Section 3 and the compositional algorithm de-scribed in 4.3 First, in Subsection 5.1, we evaluatejust word similarity without composition. Then,in Subsection 5.2, we evaluate the simple compo-sitional approach by making use of a dataset withsimilar noun-verb pairs (NV constructions). Fi-nally, the recursive application of compositionalfunctions is evaluated in Subsection 5.3, by mak-ing use of a dataset with similar noun-verb-nounpairs (NVN constructions).

In all experiments, we made use of datasetssuited for the task at hand, and compare our resultswith those obtained by the best systems for thecorresponding dataset. Moreover, in order to buildthe selectional preferences of the syntactically re-lated words, we used the British National Corpus(BNC). Syntactic analysis on BNC was performedwith the dependency parser DepPattern (Gamalloand Gonzalez, 2011; Gamallo, 2015), previouslyPoS tagged with Tree-Tagger (Schmid, 1994).

5.1 Word Similarity

Recently, the use of word similarity methods hasbeen criticised as a reliable technique for evalu-ating distributional semantic models (Batchkarovet al., 2016), given the small size of the datasetsand the limitation of context information as well.However, given this procedure still is widely ac-cepted, we have performed two different kinds ofexperiments: rating by similarity and synonym de-tection with multiple-choice questions.

3Both the software and the semantic word modelare freely available at http://fegalaz.usc.es/˜gamallo/resources/CompWordNet.tar.gz.

5.1.1 Rating by SimilarityIn the first experiment, we use the WordSim353dataset (Finkelstein et al., 2002), which was con-structed by asking humans to rate the degree of se-mantic similarity between two words on a numer-ical scale. This is a small dataset with 353 wordpairs. The performance of a computational systemis measured in terms of correlation (Spearman) be-tween the scores assigned by humans to the wordpairs and the similarity Dice coefficient assignedby our system (WN) built with the WordNet-basedmodel space.

Table 1 compares the Spearman correlation ob-tained by our model, WN, with that obtained bythe corpus-based system described in (Halawi etal., 2012), which is the highest score reached sofar on that dataset. Even if our results were clearlyoutperformed by that corpus-based method, WNseems to behave well if compared with the state-of-the-art knowledge-based (unsupervised) strat-egy reported in (Agirre et al., 2009).

Systems ρ MethodWN 0.69 knowledge(Hassan and Mihalcea, 2011) 0.62 knowledge(Agirre et al., 2009) 0.66 knowledge(Halawi et al., 2012) 0.81 corpus

Table 1: Spearman correlation between the Word-Sim353 dataset and the rating obtained by ourknowledge-based system WN and the state-of-the-art for both knowledge and corpus-based strate-gies.

5.1.2 Synonym Detection withMultiple-Choice Questions

In this evaluation task, a target word is presentedwith four synonym candidates, one of them beingthe correct synonym of the target. For instance,for the target deserve, the system must choose be-tween merit (the correct one), need, want, andexpect. Accuracy is the number of correct an-swers divided by the total number of words in the

Systems Noun Adj Verb AllWN 0.85 0.85 0.75 0.80(Freitag et al., 2005) 0.76 0.76 0.64 0.72(Zhu, 2015) 0.71 0.71 0.63 0.69(Kiela et al., 2015) - - - 0.88

Table 2: Accuracy of three systems on the WBSTtest (synonym detection on nouns, adjectives, andverbs)

6

coach run team

nsubj dobj

~coachnsubj↓ ~runnsubj↑ ~team

nsubj↑( ~run, ~coach◦), nsubj↓( ~run◦, ~coach)

~coachnsubj↓ ~runnsubj↑+dobj↑ ~teamdobj↓

dobj↑( ~runnsubj↑, ~team◦), dobj↓( ~coach + run◦, ~team)

Figure 1: Syntactic analysis of the expression “the coach runs the team” and left-to-right constructionof the word senses.

dataset.The dataset is an extended TOEFL test, called

the WordNet-based Synonymy Test (WBST) pro-posed in (Freitag et al., 2005). WBST was pro-duced by generating automatically a large setof TOEFL-like questions from the synonyms inWordNet. In total, this procedure yields 9,887noun, 7,398 verb, and 5,824 adjective questions,a total of 23,509 questions, which is a very largedataset. Table 2 shows the results. In this case, theaccuracy obtained by WN for the three syntacticcategories is close to state-of-the-art corpus-basedmethod for this task (Kiela et al., 2015), which is aneural network trained with a huge corpus contain-ing 8 billion words from English Wikipedia andnewswire texts.

5.2 Noun-Verb CompositionThe first experiment aimed at evaluating our com-positional strategy uses the test dataset by Mitchelland Lapata (2008), which comprises a total of3,600 human similarity judgments. Each itemconsists of an intransitive verb and a subject noun,which are compared to another noun-verb pair(NV) combining the same noun with a synonymof the verb that is chosen to be either similar odissimilar to the verb in the context of the givensubject. For instance, “child stray” is related to“child roam”, being roam a synonym of stray.The dataset was constructed by extracting NVcomposite expressions from the British NationalCorpus (BNC) and verb synonyms from WordNet.In order to evaluate the results of the tested sys-tems, Spearman correlation is computed betweenindividual human similarity scores and the sys-tems’ predictions.

In this experiment, we compute the similaritybetween the contextualized heads of two NV com-posites and between their contextualized depen-dent expressions. For instance, we compute thesimilarity between “eye flare” vs “eye flame” bycomparing first the verbs flare and flame whencombined with eye in the subject position (headfunction), and by comparing how (dis)similar isthe noun eye when combined with both the verbsflare and flame (dependent function). In addition,as we are provided with two similarities (head anddep) for each pair of compared expressions, it ispossible to compute a new similarity score by av-eraging the results of head and dependent func-tions (head+dep).

Table 3 shows the Spearman’s correlation val-ues (ρ) obtained by the three versions of WN:only head function (head), only dependent func-tion (dep) and average of both (head+dep). Thelatter score value is comparable to the state-of-the-art system for this dataset, reported in (Erk andPado, 2008). It is also very similar to the most re-cent results described in (Dinu et al., 2013), wherethe authors made use of the compositional strategydefined in (Baroni and Zamparelli, 2010).

Systems ρWN (head+dep) 0.29WN (head) 0.26WN (dep) 0.14(Erk and Pado, 2008) 0.27(Dinu et al., 2013) 0.26

Table 3: Spearman correlation for intransitive ex-pressions using the benchmark by Mitchell andLapata (2008)

7

5.3 Noun-Verb-Noun Composition

The last experiment consists in evaluating thequality of compositional vectors built by meansof the consecutive application of head and de-pendency functions associated with nominal sub-ject and direct object. The experiment is per-formed on the dataset developed in (Grefenstetteand Sadrzadeh, 2011a). The dataset was built us-ing the same guidelines as Mitchell and Lapata(2008), using transitive verbs paired with subjectsand direct objects: NVN composites.

Given our compositional strategy, we are able tocompositional build several vectors that somehowrepresent the meaning of the whole NVN com-posite expression. In order to known which isthe best compositional strategy and be exhaustiveand complete, we evaluate all of them; i.e., bothleft-to-right and right-to-left strategies. Thus, takeagain the expression “the coach runs the team”.If we follow the left-to-right strategy (noted nv-n),at the end of the compositional process, we obtaintwo fully contextualized senses:

nv-n head The sense of the head run, as a resultof being contextualized first by the prefer-ences imposed by the subject and then by thepreferences required by the direct object. Wenote nv-n head) the final sense of the head ina NVN composite expression following theleft-to-right strategy.

nv-n dep The sense of the object team, as a re-sult of being contextualized by the prefer-ences imposed by run previously combinedwith the subject coach. We note nv-n dep thefinal sense of the direct object in a NVN com-posite expression following the left-to-rightstrategy.

If we follow the right-to-left strategy (noted n-vn), at the end of the compositional process, weobtain two fully contextualized senses:

n-nv head The sense of the head run as a result ofbeing contextualized first by the preferencesimposed by the object and then by the sub-ject.

n-nv dep The sense of the subject coach, as aresult of being contextualized by the prefer-ences imposed by run previously combinedwith the object team.

Systems ρWN (nv-n head+dep) 0.35WN (nv-n head) 0.34WN (nv-n dep) 0.13WN (n-vn head+dep) 0.50WN (n-vn head) 0.35WN (n-vn dep) 0.44WN (n-vn+nv-n) 0.47(Grefenstette and Sadrzadeh, 2011b)* 0.28(Van De Cruys et al., 2013)* 0.37(Tsubaki et al., 2013)* 0.44(Milajevs et al., 2014) 0.46(Polajnar et al., 2015) 0.35(Hashimoto et al., 2014) 0.48(Hashimoto and Tsuruoka, 2015) 0.48Human agreement 0.75

Table 4: Spearman correlation for transitive ex-pressions using the benchmark by Grefenstetteand Sadrzadeh (2011)

Table 4 shows the Spearman’s correlation val-ues (ρ) obtained by all the different versions builtfrom our model WN. The best score was achievedby averaging the head and dependent similarityvalues derived from the n-vn (right-to-left) strat-egy. Let us note that, for NVN composite ex-pressions, the left-to-right strategy seems to buildless reliable compositional vectors than the right-to-left counterpart. Besides, the combination ofthe two strategies (n-vn+nv-n) does not improvethe results of the best one (n-vn).4. The score val-ues obtained by the different versions of the right-to-left strategy outperform other systems for thisdataset (see results reported below in the table).Our best strategy (ρ = 0.50) also outperforms theneural network strategy described in (Hashimotoand Tsuruoka, 2015), which achieved 0.48 with-out considering extra linguistic information not in-cluded in the dataset. The (ρ) scores for this taskare reported for averaged human ratings. This isdue to a disagreement in previous work regard-ing which metric to use when reporting results.We mark with asterisk those systems reporting (ρ)scores based on non-averaged human ratings.

6 Conclusions

In this paper, we described a compositional modelbased on WordNet features and dependency-basedfunctions on those features. It is a recursive pro-posal since it can be repeated from left-to-right orfrom right-to-left and the sense of each constituentword is performed in a recursive way.

4n-vn+nv-n is computed by averaging the similarities ofboth n-vn head+dep and nv-n head+dep

8

Our compositional model tackles the problemof information scalability. This problem statesthat the size of semantic representations should notgrow exponentially, but proportionally; and, infor-mation must not be loss using fixed size of compo-sitional vectors. In our approach, however, even ifthe size of the compositional vectors is fixed, thereis no information loss since each word of the com-posite expression is associated to a compositionalvector representing its context-sensitive sense. Inaddition, the compositional vectors do not growexponentially since their size is fixed by the vectorspace: they are all first-order (or direct) vectors.Finally, the number of vectors increases in propor-tion to the number of constituent words found inthe composite expression. So, both points are suc-cessfully solved.

In future work, we will try to design a compo-sitional model based on word semantic represen-tations combining WordNet-based features withsyntactic-based distributional contexts as well asextend our model to full sentences instead of thesimple ones described in this paper.

Acknowledgements

We would like to thank the anonymous review-ers for helpful comments and suggestions. Thiswork was supported by a 2016 BBVA Founda-tion Grant for Researchers and Cultural Creators,and by project TELPARES (FFI2014-51978-C2-1-R) and grant TIN2014-56633-C3-1-R, Ministryof Economy and Competitiveness. It has receivedfinancial support from the Consellerıa de Cultura,Educacion e Ordenacion Universitaria (Postdoc-toral Training Grants 2016 and Centro Singular deInvestigacion de Galicia accreditation 2016-2019,ED431G/08) and the European Regional Develop-ment Fund (ERDF).

ReferencesEneko Agirre, Enrique Alfonseca, Keith Hall, Jana

Kravalova, Marius Pasca, and Aitor Soroa. 2009.A study on similarity and relatedness using distribu-tional and wordnet-based approaches. In Proceed-ings of Human Language Technologies: The 2009Annual Conference of the North American Chap-ter of the Association for Computational Linguistics,NAACL ’09, pages 19–27, Stroudsburg, PA, USA.Association for Computational Linguistics.

Marco Baroni and Roberto Zamparelli. 2010. Nounsare vectors, adjectives are matrices: Represent-ing adjective-noun constructions in semantic space.

In Proceedings of the 2010 Conference on Em-pirical Methods in Natural Language Processing,EMNLP’10, pages 1183–1193, Stroudsburg, PA,USA.

Marco Baroni, Raffaella Bernardi, and Roberto Zam-parelli. 2014. Frege in space: A program for com-positional distributional semantics. LiLT, 9:241–346.

Marco Baroni. 2013. Composition in distributionalsemantics. Language and Linguistics Compass,7:511–522.

Miroslav Batchkarov, Thomas Kober, Jeremy Reffin,Julie Weeds, and David Weir. 2016. A critique ofword similarity as a method for evaluating distribu-tional semantic models. In Proceedings of the ACLWorkshop on Evaluating Vector Space Representa-tions for NLP, Berlin, Germany.

B. Coecke, M. Sadrzadeh, and S. Clark. 2010. Math-ematical foundations for a compositional distribu-tional model of meaning. Linguistic Analysis, 36(1-4):345–384.

Georgiana Dinu and Mirella Lapata. 2010. Measur-ing distributional similarity in context. In Proceed-ings of the 2010 Conference on Empirical Methodsin Natural Language Processing, pages 1162–1172,Cambridge, MA, October. Association for Compu-tational Linguistics.

G. Dinu, N. Pham, and M. Baroni. 2013. Generalestimation and evaluation of compositional distribu-tional semantic models. In ACL 2013 Workshop onContinuous Vector Space Models and their Compo-sitionality (CVSC 2013), pages 50–58, East Strouds-burg PA.

Katrin Erk and Sebastian Pado. 2008. A structuredvector space model for word meaning in context. InProceedings of EMNLP, Honolulu, HI.

Manaal Faruqui and Chris Dyer. 2015. Non-distributional word vector representations. In Pro-ceedings of ACL.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias,Ehud Rivlin, Zach Solan, Gadi Wolfman, and Ey-tan Ruppin. 2002. Placing search in context: theconcept revisited. ACM Trans. Inf. Syst., 20(1):116–131.

Dayne Freitag, Matthias Blume, John Byrnes, Ed-mond Chow, Sadik Kapadia, Richard Rohwer, andZhiqiang Wang. 2005. New experiments in distribu-tional representations of synonymy. In Proceedingsof the Ninth Conference on Computational NaturalLanguage Learning, pages 25–32.

Pablo Gamallo and Isaac Gonzalez. 2011. A grammat-ical formalism based on patterns of part-of-speechtags. International Journal of Corpus Linguistics,16(1):45–71.

9

Pablo Gamallo, Alexandre Agustini, and GabrielLopes. 2005. Clustering Syntactic Positions withSimilar Semantic Requirements. ComputationalLinguistics, 31(1):107–146.

Pablo Gamallo. 2008. The meaning of syntactic de-pendencies. Linguistik OnLine, 35(3):33–53.

Pablo Gamallo. 2015. Dependency parsing with com-pression rules. In International Workshop on Pars-ing Technology (IWPT 2015), Bilbao, Spain.

Pablo Gamallo. 2017. The role of syntactic dependen-cies in compositional distributional semantics. Cor-pus Linguistics and Linguistic Theory.

Edward Grefenstette and Mehrnoosh Sadrzadeh.2011a. Experimental support for a categorical com-positional distributional model of meaning. In Con-ference on Empirical Methods in Natural LanguageProcessing.

Edward Grefenstette and Mehrnoosh Sadrzadeh.2011b. Experimenting with transitive verbs in a dis-cocat. In Workshop on Geometrical Models of Nat-ural Language Semantics (EMNLP-2011).

Edward Grefenstette, Mehrnoosh Sadrzadeh, StephenClark, Bob Coecke, and Stephen Pulman. 2011.Concrete sentence spaces for compositional distri-butional models of meaning. In Proceedings of theNinth International Conference on ComputationalSemantics, IWCS ’11, pages 125–134.

Emiliano Guevara. 2010. A regression model ofadjective-noun compositionality in distributional se-mantics. In Proceedings of the 2010 Workshop onGEometrical Models of Natural Language Seman-tics, GEMS ’10.

M. A. Hadj Taieb, M. Ben Aouicha, andA. Ben Hamadou. 2014. Ontology-based approachfor measuring semantic similarity. EngineeringApplications of Artificial Intelligence, 36:238–261.

Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, andYehuda Koren. 2012. Large-scale learning of wordrelatedness with constraints. In Proceedings of the18th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, KDD ’12,pages 1406–1414.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2015.Learning embeddings for transitive verb disam-biguation by implicit tensor factorization. In Pro-ceedings of the 3rd Workshop on Continuous VectorSpace Models and their Compositionality, pages 1–11, Beijing, China, July. Association for Computa-tional Linguistics.

Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa,and Yoshimasa Tsuruoka. 2014. Jointly learn-ing word representations and composition functionsusing predicate-argument structures. In Proceed-ings of the 2014 Conference on Empirical Methodsin Natural Language Processing (EMNLP), pages

1544–1555, Doha, Qatar, October. Association forComputational Linguistics.

S. Hassan and R. Mihalcea. 2011. Semantic related-ness using salient semantic analysis. In Proceedingsof AAAI Conference on Artificial Intelligence.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015.Specializing word embeddings for similarity or re-latedness. In Lluıs Marquez, Chris Callison-Burch,Jian Su, Daniele Pighin, and Yuval Marton, edi-tors, EMNLP, pages 2044–2048. The Associationfor Computational Linguistics.

Jayant Krishnamurthy and Tom Mitchell, 2013. Pro-ceedings of the Workshop on Continuous VectorSpace Models and their Compositionality, chapterVector Space Semantic Parsing: A Framework forCompositional Vector Space Models, pages 1–10.Association for Computational Linguistics.

Lingling Meng, Runquing Huang, and Junzhong Gu.2013. A review of semantic similarity measures inwordnet. International Journal of Hybrid Informa-tion Technology, 6(1).

Dmitrijs Milajevs, Dimitri Kartsaklis, MehrnooshSadrzadeh, and Matthew Purver. 2014. Evaluatingneural word representations in tensor-based compo-sitional settings. In Alessandro Moschitti, Bo Pang,and Walter Daelemans, editors, EMNLP, pages 708–719. ACL.

Jeff Mitchell and Mirella Lapata. 2008. Vector-basedmodels of semantic composition. In Proceedings ofACL-08: HLT, pages 236–244.

Jeff Mitchell and Mirella Lapata. 2009. Languagemodels based on semantic composition. In Proceed-ings of EMNLP, pages 430–439.

Jeff Mitchell and Mirella Lapata. 2010. Compositionin distributional models of semantics. Cognitive Sci-ence, 34(8):1388–1439.

Richard Montague. 1970. Universal grammar. theoria.Theoria, 36:373–398.

Barbara Partee. 1984. Compositionality. In FrankLandman and Frank Veltman, editors, Varietiesof Formal Semantics, pages 281–312. Dordrecht:Foris.

Tamara Polajnar, Laura Rimell, and Stephen Clark.2015. An exploration of discourse-based sentencespaces for compositional distributional semantics.In Proceedings of the First Workshop on LinkingComputational Models of Lexical, Sentential andDiscourse-level Semantics, pages 1–11, Lisbon, Por-tugal, September. Association for ComputationalLinguistics.

James Pustejovsky. 1995. The Generative Lexicon.MIT Press, Cambridge.

10

R. Rada, H. Mili, E. Bicknell, and M. Blettner. 1989.Development and application of a metric on se-mantic nets. Systems, Man and Cybernetics, IEEETransactions on, 19(1):17–30.

Siva Reddy, Ioannis P. Klapaftis, Diana McCarthy, andSuresh Manandhar. 2011. Dynamic and static pro-totype vectors for semantic composition. In Fifth In-ternational Joint Conference on Natural LanguageProcessing, IJCNLP 2011, Chiang Mai, Thailand,November 8-13, 2011, pages 705–713.

M. Andrea Rodrıguez and Max J. Egenhofer. 2003.Determining semantic similarity among entityclasses from different ontologies. IEEE Trans.Knowl. Data Eng., 15(2):442–456.

H. Schmid. 1994. Probabilistic part-of-speech taggingusing decision trees. In International Conference onNew Methods in Language Processing.

Thabet Slimani. 2013. Description and evaluation ofsemantic similarity measures approaches. Interna-tional Journal of Computer Applications, 80(1):25–33.

Mark Steedman. 1996. Surface Structure and Inter-pretation. The MIT Press.

Stefan Thater, Hagen Furstenau, and Manfred Pinkal.2010. Contextualizing semantic representations us-ing syntactically enriched vector models. In Pro-ceedings of the 48th Annual Meeting of the Associa-tion for Computational Linguistics, pages 948–957,Stroudsburg, PA, USA.

Masashi Tsubaki, Kevin Duh, Masashi Shimbo, andYuji Matsumoto. 2013. Modeling and learning se-mantic co-compositionality through prototype pro-jections and neural networks. In EMNLP, pages130–140. ACL.

Peter D. Turney. 2013. Domain and function: A dual-space model of semantic relations and compositions.Journal of Artificial Intelligence Research (JAIR),44:533–585.

A. Tversky. 1977. Features of similarity. Psichologi-cal Review, 84(4).

Tim Van De Cruys, Thierry Poibeau, and Anna Ko-rhonen. 2013. A Tensor-based Factorization Modelof Semantic Compositionality. In Conference of theNorth American Chapter of the Association of Com-putational Linguistics (HTL-NAACL), pages 1142–1151, Atlanta, United States, June.

Fabio Massimo Zanzotto, Ioannis Korkontzelos,Francesca Fallucchi, and Suresh Manandhar. 2010.Estimating linear models for compositional distribu-tional semantics. In Proceedings of the 23rd Inter-national Conference on Computational Linguistics,COLING ’10, pages 1263–1271.

Peng Zhu. 2015. N-grams based linguistic search en-gine. International Journal of Computational Lin-guistics Research, 6(1):1–7.

11


Automated WordNet Construction Using Word Embeddings

Mikhail Khodak, Andrej Risteski, Christiane Fellbaum andSanjeev Arora

Computer Science Department, Princeton University, 35 Olden St., Princeton, New Jersey 08540

{mkhodak,risteski,fellbaum,arora}@cs.princeton.edu

Abstract

We present a fully unsupervised methodfor automated construction of WordNetsbased upon recent advances in distribu-tional representations of sentences andword-senses combined with readily avail-able machine translation tools. The ap-proach requires very few linguistic re-sources and is thus extensible to multipletarget languages. To evaluate our methodwe construct two 600-word test sets forword-to-synset matching in French andRussian using native speakers and evalu-ate the performance of our method alongwith several other recent approaches. Ourmethod exceeds the best language-specificand multi-lingual automated WordNets inF-score for both languages. The databaseswe construct for French and Russian, bothlanguages without large publicly availablemanually constructed WordNets, will bepublicly released along with the test sets.

1 Introduction

A WordNet is a lexical database for languagesbased upon a structure introduced by the PrincetonWordNet (PWN) for English in which sets of cog-nitive synonyms, or synsets, are interconnectedwith arcs standing for semantic and lexical rela-tions between them (Fellbaum, 1972). WordNetsare widely used in computational linguistics, in-formation retrieval, and machine translation. Con-structing one by hand is time-consuming and dif-ficult, motivating a search for automated or semi-automated methods. We present an unsupervisedmethod based on word embeddings and word-sense induction and build and evaluate WordNetsfor French and Russian. Our approach needs onlya large unannotated corpus like Wikipedia in the

target language and machine translation (MT) be-tween that language and English.

A standard minimal WordNet design is to havesynsets connected by hyponym-hypernym rela-tions and linked back to PWN (Global WordNetAssociation, 2017). This allows for applicationsto cross-lingual tasks and rests on the assumptionthat synsets and their relations are invariant acrosslanguages (Sagot and Fiser, 2008). For example,while all senses of English word tie may not lineup with all senses of French word cravate, thesense “necktie” will exist in both languages andbe represented by the same synset.

Thus MT is often used for automated Word-Nets to generate a set of candidate synsets for eachword w in the target language by getting a set ofEnglish translations of w and using them to queryPWN (we will refer to this as MT+PWN). Thenumber of candidate synsets produced may not besmall, even as large as a hundred for some polyse-mous verbs. Thus one needs a way to select fromthe candidates of w those synsets that are its truesenses. The main contributions of this paper is anew word embedding-based method for matchingwords to synsets and the release of two large word-synset matching test sets for French and Russian.1

Though there has been some work using word-vectors for WordNets (see Section 2), the result-ing databases have been small, containing lessthan 1000 words. Using embeddings for this taskis challenging due to the need for good ways touse PWN synset information and account for thebreakdown of cosine-similarity for polysemouswords. We approach the first issue by representingsynset information using recent work on sentence-embeddings by Arora et al. (2017). To handle pol-ysemy we devise a sense clustering scheme basedon Word Sense Induction (WSI) via linear alge-

1 https://github.com/mkhodak/pawn

12

bra over word-vectors (Arora et al., 2016a). Wedemonstrate how this sense purification procedureeffectively combines clustering with embeddings,thus being applicable to many word-sense disam-biguation (WSD) and induction-related tasks. Us-ing both techniques yields a WordNet method thatoutperforms other language-independent methodsas well as language-specific approaches such asWOLF, the French WordNet used by the NaturalLanguage ToolKit (Sagot and Fiser, 2008; Bondand Foster, 2013; Bird et al., 2009).

Our second contribution is the creation of twonew 600-word test sets in French and Russian thatare larger and more comprehensive than any cur-rently available, containing 200 each of nouns,verbs, and adjectives. They are constructed by pre-senting native speakers with all candidate synsetsproduced as above by MT+PWN and treating thesenses they pick out as “ground truth” for measur-ing precision and recall. The motivation behindseparating by part-of-speech (POS) is that nounsare often easier than adjectives and verbs, so re-porting one number — as done by some past work— allows high noun performance to mask low per-formance on adjectives and verbs.

Using these test sets, we can begin addressingthe difficulties of evaluation for non-English auto-mated WordNets due to the use of different andunreported test data, incompatible metrics (e.g.matching synsets to words vs. retrieving words forsynsets), and differing cross-lingual dictionaries.In this paper we use the test sets to evaluate ourmethod and several other automated WordNets.

2 Related Work

There have been many language-specific ap-proaches for building automated WordNets, no-tably for Korean (Lee et al., 2000), French (Sagotand Fiser, 2008; Pradet et al., 2013), and Persian(Montazery and Faili, 2010). These approachesalso use MT+PWN to get candidate word-synsetpairs, but often use further resources — such asbilingual corpora, expert knowledge, or WordNetsin related languages — to select correct senses.

The Korean construction depends on a classifiertrained on 3260 word-sense matchings that yields93.6% precision and 77.1% recall, albeit only onnouns. The Persian WordNet uses a scoring func-tion based on related words between languages(requiring expert knowledge and parallel corpora)and achieves 82.6% precision, though without re-

porting recall and POS-separated statistics.The most comparable results to ours are from

the Wordnet Libre du Francais (WOLF) of Sagotand Fiser (2008), which leverages multiple Euro-pean WordNet projects. Our best method exceedsthis approach on our test set and benefits from hav-ing far fewer resource requirements. The Wordnetdu Francais (WoNeF) of Pradet et al. (2013) de-pends on combining linguistic models by a votingscheme. Their performance is found to be gener-ally below WOLF’s, so we compare to the latter.

There has also been work on multi-languageWordNets, specifically the Extended Open Multi-lingual Wordnet (OMW) (Bond and Foster, 2013),which scraped Wiktionary, and the Universal Mul-tilingual Wordnet (UWN) (de Melo and Weikum,2009), which used multiple translations to rateword-sense matches. In our evaluation both pro-duce high-precision/low-coverage WordNets.

Finally, there have been recent vector ap-proaches for an Arabic WordNet (Tarouti andKalita, 2016) and a Bengali WordNet (Nasirud-din et al., 2014). The Arabic effort uses a cosine-similarity threshold for correcting direct transla-tion and reports a precision of 78.4% on synonymmatching, although its small size (943 synsets)indicates a poor precision/recall trade-off. TheBengali WordNet paper examines WSI on wordvectors, evaluating clustering methods on sevenwords and achieving F1-scores of at best 52.1%. Itis likely that standard clustering techniques are in-sufficient when one needs many thousands of clus-ters, an issue we address via sparse coding.

Our use of distributional word embeddings toconstruct WordNets is the latest in a long line oftheir applications, e.g. approximating word simi-larity and solving word-analogies (Mikolov et al.,2013). The latter discovery was cited as the in-spiration for the theoretical model in Arora et al.(2016b), whose Squared-Norm (SN) vectors weuse; the computation is similar in form and per-formance to GloVe (Pennington et al., 2014).

3 Methods for WordNet Construction

The basic WordNet method is as follows. Given atarget word w, we use a bilingual dictionary to getits translations in English and let its set of candi-date synsets be all PWN senses of the translations(MT+PWN). We then assign a score to each synsetand accept as correct all synsets with score abovea threshold α If no synset is above the cutoff we

13

assign a scoreto each

candidate synset

French-EnglishDictionary

French word

'dalle'PrincetonWordNet

flagstone

flag

slab

��

correct synsets:flag.n.06 -slab.n.01 �

retrieved synsets:slab.n.01 �

TargetWord

get translations of target word using a

bilingual dictionary or machine translation (MT)

get set of candidate synsetsby querying Princeton WordNet (PWN) using

translations of

Translations

MT + PWN SynsetScoring

ThresholdMatching

return all synsetswith score abovea threshold

Candidates SynsetScore

Threshold 0.41

flag.n.01

flag.n.04

flag.n.06

flag.n.07

iris.n.01

masthead.n.01

pin.n.08

slab.n.01

score = 0.280

score = 0.222

score = 0.360

score = 0.161

score = 0.200

score = 0.195

score = 0.251

score = 0.521

Figure 1: The score-threshold procedure for French word w = dalle (flagstone, slab). Candidate synsetsgenerated by MT+PWN are given a score and matched to w if the score is above a threshold α.

accept only the highest-scoring synset. This score-threshold method is illustrated in Figure 1.

Thus we need methods to assign high scores tocorrect candidate synsets of w and low scores toincorrect ones. We use unsupervised word vec-tors in the target language computed from a largetext corpus (e.g. Wikipedia). Section 3.1 presentsa simple baseline that improves upon MT+PWNvia a cosine-similarity metric between w and eachsynset’s lemmas. A more sophisticated synset rep-resentation method using sentence-embeddings isdescribed in Section 3.2. Finally, in Section 3.3 wediscuss WSI-based procedures for improving thescore-threshold method for words with fine sensedistinctions or poorly-annotated synsets.

In this section we assume a vocabulary V of tar-get language words with associated d-dimensionalunit word-vectors vw ∈ Rd for d � |V | (e.g.d = 300 for vocabulary size 50000) trained on alarge text corpus. Each word w ∈ V also has a setof candidate synsets found by MT+PWN. We callsynsets S, S′ in PWN related, denoted S ∼ S′, ifone is a hyponym, meronym, antonym, or attributeof the other, if they share a verb group, or S = S′.

3.1 Baseline: Average Similarity Method

This method for scoring synsets can be seen as asimple baseline. Given a candidate synset S, wedefine TS ⊂ V as the set of translations of itslemmas from English to the target language. Thescore of S is then 1

|TS |∑

w′∈TSvw · vw′ , the av-

erage cosine similarity between w and the trans-lated lemmas of S. Although straightforward, thisscoring method is quite noisy, as averaging word-vectors dilutes similarity performance, and doesnot use all synset information provided by PWN.

3.2 Method 1: Synset RepresentationTo improve upon this baseline we need a bettervector representation of S to score S via cosinesimilarity with vw. Previous efforts in synset andsense embeddings (Iacobacci et al., 2015; Rotheand Schutze, 2015) often use extra resources suchas WordNet or BabelNet for the target language(Navigli and Ponzetto, 2012). As such databasesare not always available, we propose a synset rep-resentation uS that is unsupervised, needing no ex-tra resources beyond MT and PWN, and leveragesrecent work on sentence embeddings.

This new representation combines embeddingsof synset information given by PWN, e.g. synsetrelations, definitions, and example sentences. Tocreate these embeddings we first consider thequestion of how to represent a list of words L as avector in Rd. One way is to simply take the nor-malized sum v

(SUM)L of their word-vectors, where

v(SUM)L =

∑w′∈L

vw′

Potentially more useful is to compute a vectorv

(SIF )L via the sentence embedding formula of

Arora et al. (2017), based on smooth inverse fre-quency (SIF) weighting, which (for a = 10−4 andbefore normalization) is expressed as

v(SIF )L =

∑w′∈L

a

a+ P{w′}vw′

SIF is similar in spirit to TF-IDF (Salton andBuckley, 1988) and builds on work of Wieting etal. (2016); it has been found to perform well onother similarity tasks (Arora et al., 2017).

We find that SIF improves performance on sen-tences but not on translated lemma lists (Figure 2),

14

likely because sentences contain many distractorwords that SIF will weight lower while the pres-ence of distractors among lemmas is independentof word frequency. Thus to compute the synsetscore uS · vw we make the vector representationuS of S the element-wise average of:

• v(SUM)TS

TS is the set of translations oflemmas of S (as in Section 3.1).

• v(SUM)RS

RS =

( ⋃S′∼S

TS′

)\TS is the

set of lemma translations ofsynsets S′ related to S.

• v(SIF )DS

DS is the list of tokens in thetranslated definition of S.

• 1|Es|

∑E∈ES

v(SIF )E

ES contains the lists of tokensin the translated example sent-ences of S (excluded if S hasno example sentences).

3.3 Method 2: Better Matching Using WSI

We found through examination that the score-threshold method we have used so far performspoorly in two main cases:

(a) the word w has no candidate synset withscore high-enough to clear the threshold.

(b) w has multiple closely related synsets that areall correct matches but some of which have amuch lower score than others.

Here we address both issues by using sense infor-mation found by applying a word-sense inductionmethod first introduced in Arora et al. (2016a).

We summarize their WSI-model – referred tohenceforth as Linear-WSI — in Section 3.3.1.Then in Section 3.3.2 we devise a sense purifica-tion procedure for constructing a word-cluster foreach induced sense of a word. Applying this pro-cedure to construct word-cluster representations ofcandidates synsets provides an additional metricfor the correctness of word-synset matches thatcan be used to devise a w-specific threshold αw toameliorate problem (a). Meanwhile, using Linear-WSI to associate similar candidate synsets of w toeach other provides a way to address problem (b).We explain these methodologies in Section 3.3.3.

3.3.1 Summary of Linear-WSI ModelIn Arora et al. (2016a) the authors posit that thevector of a polysemous word can be linearly de-composed into vectors associated to its senses.

Avg. Similarity(Baseline)

Unweighted Summation

SIF-WeightedSummation

Synset Rep.(Method 1)

Synset Lemmas

Synset Lemmas +Related Lemmas

Synset Definition

Synset Definition +Example Sentences

All Synset Information

Average LemmaSimilarity

Synset Representation

0.55 0.60 0.65

Figure 2: F -score comparison between using un-weighted summation and SIF-weighted summa-tion for embedding PWN synset information.

Thus for w = tie — which can be an article ofclothing, a drawn match, and so on — we wouldhave vw ≈ avw(clothing) + bvw(match) + . . . fora, b ∈ R. It is unclear how to find such sense-vectors, but one expects different words to haveclosely related sense-vectors, e.g. for w′ = bowthe vector vw′(clothing) would be close to the vec-tor vw(clothing) of tie. Thus the Linear-WSI modelproposes using sparse coding, namely finding a setof unit basis vectors a1, . . . , ak ∈ Rd s.t. ∀w ∈ V ,

vw =k∑i=1

Rwiai + ηw, (1)

for k > d, ηw a noise vector, and at most s coef-ficients Rwi nonzero. The hope is that the sense-vector vw(clothing) of tie is in the neighborhood ofa vector ai s.t. Rwi > 0. Indeed, for k = 2000 ands = 5, Arora et al. (2016a) report that solving (1)represents English word-senses as well as a com-petent non-native speaker and significantly betterthan older clustering methods for WSI.

3.3.2 Sense Purification

While (1) is a good WSI method, its use for build-ing WordNets is hampered by its inability to pro-duce more than a few thousand senses ai, as set-ting a large k yields repetitive rather than differentsenses. As this is far fewer than the number ofsynsets in PWN, we use sense purification to ad-dress this by extracting a cluster of words relatedto ai as well as w to represent each sense. In ad-dition to w and ai, the procedure takes as input asearch-space V ′ ⊂ V and set size n. Then we find

15

(pistol)

(shooting)

(to shoot)

(species)

(crossbow)(kind)

(family)

(plant)

(coriander)

(cilantro)

(celery)

(garlic)

Figure 3: Isometric mapping of sense-cluster vectors for w = brilliant, fox, and лук (bow, onion). w ismarked by a star and each sense ai of w, shown by a large marker, has an associated cluster of wordswith the same marker shape. Contours are densities of vectors close to w and at least one sense ai.

C ⊂ V ′ of size n containing w that maximizes

f = minx=ai or

x=vw′ :w′∈CMedian{x · vw′ : w′ ∈ C} (2)

A cluster C maximizing this must ensure that nei-ther ai nor any word w′ ∈ C (including w′ = w)has low average cosine similarity with cluster-words, resulting in a dense cluster close to bothw and ai. We explain this further in Appendix A.

We illustrate this method in Figure 3 by pu-rifying the senses of English words brilliantand fox and Russian word лук, which has thesenses “bow” (weapon), “onion” (vegetable), and“onion” (plant). Note how correct senses are re-covered across POS and language and for properand common noun senses.

3.3.3 Applying WSI to Synset MatchingThe problem addressed by sense purification isthat senses ai induced by Linear-WSI have toomany related words; purification solves this byextracting a cluster of words related to w fromthe words close to ai. When translating WordNetsynsets, we have a similar problem in that transla-tions of a synset’s lemmas may not be relevant tothe synset itself. Thus we can try to create a puri-fied representation of each candidate synset S ofwby extracting a cluster of translated lemmas closeto w and one of its induced senses. We run purifi-cation on every sense ai in the sparse representa-tion (1) of w, using as a search-space V ′ = VS theset of translations of all lemmas of synsets S′ re-lated to S in PWN (i.e. S′ ∼ S as defined in Sec-tion 3). To each synset S we associate the senseaS and corresponding cluster CS that are optimalin the objective (2) among all senses ai of w.

Although we find a sense aS and purified repre-sentation CS for each candidate synset of w, we

note that an incorrect synset is likely to have alower objective value (2) than a correct synset asit likely has fewer words related to w in its search-space VS . However, using fS = f(w, aS , CS) as asynset score is difficult as some synsets have verysmall search-spaces, leading to inconsistent scor-ing even for correct synsets.

Instead we use fS as part of a w-specific thresh-old αw = min{α, uS∗ · vw}, where uS is thevector representation of S from Section 3.2 andS∗ = arg max fS + uS · vw. This attempts to ad-dress problem (a) of the score-threshold method— that some words have no synsets above the cut-off α and returning only the highest-scoring synsetin such cases will not retrieve multiple sense forpolysemous words. By the construction of αw, ifw is polysemous, a synset other than the one withthe highest score uS ·vw may have a better value offS and thus found to be S∗; then all synsets withhigher scores will be matched to w, allowing formultiple matches even if no synset clears the cut-off α. Otherwise if w is monosemous, we expectthe correct synset S to have the highest value forboth uS · vw and fS , making S∗ = S. Then if nosynset has score greater than α the threshold willbe set to uS · vw, the highest synset-score, so onlythe highest scoring synset will be matched to w.

To address problem (b) of the threshold method— that w might have multiple correct and closelyrelated candidate synsets of which only someclear the cutoff — we observe that closely relatedsynsets of w will have similar search-spaces VSand so are likely to be associated to the same senseai. For example, in Figure 4 the candidates ofdalle related to its correct meanings as a flagstoneor slab are associated to the same sense a892 whiledistractor synsets related to the incorrect sense as a

16

Candidates SynsetScore

Threshold 0.50

AssociatedSense

French word

'dalle'

get a set of candidate synsets by querying

Princeton WordNet using translations of

MT + PWN ThresholdMatching

use sense purification toassociate a sense

to each candidate synset

match to allcandidate synsets with score abovea threshold

��

Sense-Associationby Purification

Sense-AgglomerationProcedure

perform cutoff matching with a lower threshold on all synsets sharing a sense

with an already matched synset

targetword

Threshold 0.25

correct synsets:flag.n.06 �slab.n.01 �

retrieved synsets:flag.n.06 �slab.n.01 �

sense =

sense =

sense =

sense =

sense =

sense =

sense =

sense =

��

flag.n.01

flag.n.04

flag.n.06

flag.n.07

iris.n.01

masthead.n.01

pin.n.08

slab.n.01

score = 0.280

score = 0.222

score = 0.360

score = 0.161

score = 0.200

score = 0.195

score = 0.251

score = 0.521

Figure 4: The score-threshold and sense-agglomeration procedure for French word w = dalle (flagstone,slab). Candidate synsets are given a score and matched to w if they clear a high threshold αw (as inSection 3.2). If an unmatched synset shares a sense ai with a matched synset, it is compared to a lowthreshold β (the sense-agglomeration procedure in Section 3.3.3).

flag are mostly matched to other senses. This mo-tivates the sense agglomeration procedure, whichuses threshold-clearing synsets to match synsetswith the same sense below the score-threshold.For β < α, the procedure is roughly as follows:

1. Run score-threshold method with cutoff αw.

2. For each synset S with score uS · vw belowthe threshold, check for a synset S′ with scoreuS′ · vw above the threshold and aS′ = aS .

3. If uS ·vw ≥ β and the clustersCS , CS′ satisfya cluster similarity condition, match S to w.

We include the lower cutoff β because even sim-ilar synsets may not both be senses of the sameword. The cluster similarity condition, available inAppendix B.2, ensures the relatedness of synsetssharing a sense ai, as an erroneous synset S′ maybe associated to the same sense as a correct one.

In summary, the improved synset matching ap-proach has two steps: 1-conduct score-thresholdmatching using the modified threshold αw; 2-runthe sense-agglomeration procedure on all sensesai of w having at least one threshold-clearingsynset S. Although for fixed α both steps focuson improving the recall of the threshold method, inpractice they allow α to be higher, so that both pre-cision and recall are improved. For example, notethe recovery of a correct synset left unmatched bythe score-threshold method in the simplified de-piction of sense-agglomeration shown in Figure 4.

4 Evaluation of Methods

We evaluate our method’s accuracy and coverageby constructing and testing WordNets for Frenchand Russian. For both we train 300-dimensionalSN word embeddings (Arora et al., 2016b) onco-occurrences of words occurring at least 1000times, or having candidate PWN synsets and oc-curring at least 100 times, in the lemmatizedWikipedia corpus. This yields |V | ≈ 50000. ForLinear-WSI we run sparse coding with sparsitys = 4 and basis-size k = 2000 and use set-sizen = 5 for purification. To get candidate synsetswe use Google and Microsoft Translators and thedictionary of the translation company ECTACO,while for sentence-length MT we use Microsoft.

4.1 Testsets

A common way to evaluate accuracy of an auto-mated WordNet is to compare its synsets or wordmatchings to a manually-constructed one. How-ever, the existing ELRA French Wordnet 2 is notpublic and half the size of ours while RussianWordNets are either even smaller and not linkedto PWN3 or obtained via direct translation4.

Instead we construct test sets for each languagethat allow for evaluation of our methods and oth-ers. We randomly chose 200 each of adjectives,nouns, and verbs from the set of target languagewords whose English translations appear in thesynsets of the Core WordNet. Their “ground truth”

2 http://catalog.elra.info/product_info.php?products_id=550

3 http://project.phil.spbu.ru/RussNet/4 http://wordnet.ru/

17

Method POS F.5-Score∗ F1-Score∗ Precision∗ Recall∗ Coverage Synsets α β

Direct Translation(MT + PWN)

Adj. 50.3 59.3 46.1 100.0 99.9 11271Noun 56.2 64.6 52.2 100.0 100.0 74477Verb 41.4 51.3 37.0 100.0 100.0 13017Total 49.3 58.4 45.1 100.0 100.0 100174†

Wordnet Libredu Francais

(WOLF)(Sagot and Fiser, 2008)


Universal Wordnet(de Melo and Weikum, 2009)


Extended OpenMultilingual Wordnet

(Bond and Foster, 2013)


Baseline:Average Similarity

(Section 3.1)

Adj. 62.8±0.0 62.6±0.0 65.3±0.0 68.5±0.0 88.7 9687 0.31Noun 67.3±0.0 65.4±0.1 71.6±0.1 69.0±0.1 92.2 37970 0.27Verb 51.8±0.0 50.8±0.1 55.9±0.1 57.0±0.1 83.5 10037 0.26Total 60.6±0.0 59.6±0.0 64.3±0.0 64.9±0.0 90.0 58962†

Method 1:Synset Representation

(Section 3.2)



+ Linear-WSI(Section 3.3)

Adj. 67.7±0.0 62.5±0.1 76.9±0.1 62.6±0.1 91.2 8912 0.56 0.42Noun 73.0±0.0 66.0±0.1 83.7±0.1 62.0±0.2 90.9 34001 0.50 0.25Verb 64.4±0.0 55.9±0.0 79.3±0.0 51.5±0.1 93.6 9262 0.46 0.28Total 68.4±0.0 61.5±0.0 80.0±0.0 58.7±0.1 91.5 53208†

∗Micro-averages over a randomly held-out half of the data; parameters tuned on the other half. 95% asymptotic confidenceintervals found with 10000 randomized trials.† Includes adverb synsets. For the last three methods they are matched with the same parameter values (α and β) as for adjectives.

Table 1: French WordNet Results

word senses are picked by native speakers, whowere asked to perform the same matching task de-scribed in Section 3, i.e. select correct synsetsfor a word given a set of candidates generated byMT + PWN. For example, the French word foiehas one translation, liver with four PWN synsets:1-“glandular organ”; 2-“liver used as meat”; 3-“person with a special life style”; 4-“someone liv-ing in a place.” The first two align with sensesof foie while the others do not, so the expertmarks the first two as good and the others as neg-ative. Two native speakers for each language weretrained by an author with knowledge of WordNetand at least 10 years of experience in each lan-guage. Inconsistencies in the matchings of the twospeakers were resolved by the same author.

We get 600 words and about 12000 candidateword-synset pairs for each language, with adjec-tives and nouns having on average about 15 can-didates and verbs having about 30. These num-bers makes the test set larger than many others,

with French, Korean, and Persian WordNets citedin Section 2 being evaluated on 183 pairs, 3260pairs, and 500 words, respectively. Accuracy mea-sured with respect to this ground truth estimateshow well an algorithm does compared to humans.

A significant characteristic of this test set isits dependence on the machine translation sys-tem used to get candidate synsets. While this canleave out correct synset matches that the systemdid not propose, by providing both correct and in-correct candidate synsets we allow future authorsto focus on the semantic challenge of selectingcorrect senses without worrying about finding thebest bilingual dictionary. This allows dictionary-independent evaluation of automated WordNets,an important feature in an area where the specifictranslation systems used are rarely provided infull. When comparing the performance of our con-struction to that of previous efforts on this test set,we do not penalize word-synset matches in whichthe synset is not among the candidate synsets we

18

generate for that word, negating the loss of pre-cision incurred by other methods due to the useof different dictionaries. We also do not penalizeother WordNets for test words they do not contain.

In addition to precision and recall, we report theCoverage statistic as the proportion of the Core setof most-used synsets, a semi-automatically con-structed set of about 5000 PWN frequent senses,that are matched to at least one word (Fellbaum,1972). While an imperfect metric given differentsense usage by language, the synsets are universal-enough for it to be a good indicator of usability.

4.2 Evaluation Results

We report the evaluation of methods in Section 3in Tables 1 & 2 alongside evaluations of UWN(de Melo and Weikum, 2009), OWM (Bond andFoster, 2013), and WOLF (Sagot and Fiser, 2008).Parameters α and β were tuned to maximize themicro-averaged F.5-score 1.25·Precision·Recall

.25·Precision+Recall , usedinstead of F1 to prioritize precision, which is of-ten more important for application purposes.

Our synset representation method (Section 3.2)exceeds the similarity baseline by 6% in F.5-scorefor French and 10% for Russian. For French it iscompetitive with the best other WordNet (WOLF)and in both languages exceeds both multi-lingualWordNets. Improving this method via Linear-WSI(Section 3.3) leads to 2% improvement in F.5-score for French and 1% for Russian. Our methodsalso perform best in F1-score and Core coverage.

As expected from a Wiktionary-scrapingmethod, OMW achieves the best precisionacross languages, although it and UWN havelow recall and Core coverage. The performanceof our best method for French exceeds that ofWOLF in F.5-score across POS while achievingsimilar coverage. WOLF’s recall performanceis markedly lower than the evaluation in Sagotand Fiser (2008, Table 4); we believe this stemsfrom our use of words matched to Core synsets,not random words, leading to a more difficulttest set as common words are more-polysemousand have more synsets to retrieve. There is nocomparable automated Russian-only WordNet,with only semi-automated and incomplete efforts(Yablonsky and Sukhonogov, 2006).

Comparing across POS, we do best on nounsand worst on verbs, likely due to the greater pol-ysemy of verbs. Between languages, performanceis similar for adjectives but slightly worse on Rus-

sian nouns and much worse on Russian verbs.The discrepancy in verbs can be explained by a

difference in treating the reflexive case and aspec-tual variants due to the grammatical complexity ofRussian verbs. In French, making a verb reflexiverequires adding a word while in Russian the verbitself changes, e.g. to wash→to wash oneself islaver→se laver in French but мыть→мыться inRussian. Thus we do not distinguish the reflex-ive case for French as the token found is the samebut for Russian we do, so both мыть and мыть-ся may appear and have distinct synset matches.Thus matching Russian verbs is challenging as thereflexive usage of a verb is often contextually sim-ilar to the non-reflexive usage. Another compli-cation for Russian verbs is due to aspectual verbpairs; thus to do has aspects (делать, сделать)in Russian that are treated as distinct verbs whilein French these are just different tenses of the verbfaire. Both factors pose challenges for differentiat-ing Russian verb senses by a distributional model.

Overall however the method is shown to be ro-bust to how close the target language is to English,with nouns and adjectives performing well in bothlanguages and the difference for verbs stemmingfrom an intrinsic quality rather than dissimilaritywith English. This can be further examined bytesting the method on a non-European language.

5 Conclusion and Future Work

We have shown how to leverage recent advancesin word embeddings for fully-automated WordNetconstruction. Our best approach combining sen-tence embeddings, and recent methods for WSIobtains performance 5-16% above the naive base-line in F.5-score as well as outperforming previ-ous language-specific and multi-lingual methods.A notable feature of our work is that we requireonly a large corpus in the target language and au-tomated translation into/from English, both avail-able for many languages lacking good WordNets.

We further contribute new 600-word human-annotated test sets split by POS for French andRussian that can be used to evaluate future au-tomated WordNets. These larger test sets give amore accurate picture of a construction’s strengthsand weaknesses, revealing some limitations ofpast methods. With WordNets in French and Rus-sian largely automated or incomplete, the Word-Nets we build also add an important tool for multi-lingual natural language processing.

19

Method POS F.5-Score∗ F1-Score∗ Precision∗ Recall∗ Coverage Synsets α β

Direct Translation(MT + PWN)


Universal Wordnet(de Melo and Weikum, 2009)


Extended OpenMultilingual Wordnet

(Bond and Foster, 2013)


Baseline:Average Similarity

(Section 3.1)



(Section 3.2)



+ Linear-WSI(Section 3.3)

Adj. 69.7±0.0 64.9±0.1 77.3±0.0 63.6±0.1 93.3 9359 0.43 0.35Noun 71.6±0.0 67.6±0.0 78.1±0.0 68.0±0.1 91.0 31699 0.46 0.33Verb 54.4±0.0 49.7±0.1 64.9±0.1 52.6±0.2 91.9 8582 0.44 0.33Total 65.2±0.0 60.7±0.0 73.4±0.0 61.4±0.1 91.5 50850†

∗Micro-averages over a randomly held-out half of the data; parameters tuned on the other half. 95% asymptotic confidenceintervals found with 10000 randomized trials.† Includes adverb synsets. For the last three methods they are matched with the same parameter values (α and β) as for adjectives.

Table 2: Russian WordNet Results

Further improvement to our work may comefrom other methods in word-embeddings, suchas multi-lingual word-vectors (Faruqui and Dyer,2014). Our techniques can also be combined withothers, both language-specific and multi-lingual,for automated WordNet construction. In addition,our method for associating multiple synsets to thesame sense can contribute to efforts to improvePWN through sense clustering (Snow et al., 2007).Finally, our sense purification procedure, whichuses word-vectors to extract clusters representingword-senses, likely has further WSI and WSD ap-plications; such exploration is left to future work.

Acknowledgments

We thank Angel Chang and Yingyu Liang forhelpful discussion at various stages of this pa-per. This work was supported in part by NSFgrants CCF-1302518, CCF-1527371, Simons In-vestigator Award, Simons Collaboration Grant,and ONR-N00014-16-1-2329.

References

Michal Aharon, Michael Elad, and Alfred Bruckstein.2006. K-svd: An algorithm for designing overcom-plete dictionaries for sparse representation. IEEETransactions on Signal Processing, 54(11).

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma,and Andrej Risteski. 2016a. Linear algebraic struc-ture of word sense, with applications to polysemy.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma,and Andrej Risteski. 2016b. Rand-walk: A la-tent variable model approach to word embeddings.Transactions of the Association for ComputationalLinguistics, 4:385–399.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017.A simple but tough-to-beat baseline for sentence em-beddings. In International Conference on LearningRepresentations. To Appear.

Steven Bird, Edward Loper, and Ewan Klein.2009. Natural Language Processing with Python.O’Reilly Media Inc.

Francis Bond and Ryan Foster. 2013. Linking and ex-tending an open multilingual wordnet. In Proceed-ings of the 51st Annual Meeting of the Associationfor Computational Linguistics.

20

Gerard de Melo and Gerhard Weikum. 2009. Towardsa universal wordnet by learning from combined evi-dence. In Proceedings of the 18th ACM Conferenceon Information and Knowledge Management.

Manaal Faruqui and Chris Dyer. 2014. Improvingvector space word representations using multilingualcorrelation. In Proceedings of the 14th Conferenceof the European Chapter of the Association for Com-putational Linguistics.

Christiane Fellbaum. 1972. WordNet: An ElectronicLexical Database. MIT Press, Cambridge, MA.

Global WordNet Association. 2017. Wordnets in theworld.

Ignaci Iacobacci, Mohammad Taher Pilehvar, andRoberto Navigli. 2015. Sensembed: Learning senseembeddings for word and relational similarity. InProceedings of the 53rd Annual Meeting of the As-sociation for Computational Linguistics.

Changki Lee, Geunbae Lee, and Seo JungYun. 2000.Automated wordnet mapping using word sense dis-ambiguation. In Proceedings of the 2000 Joint SIG-DAT Conference on Empirical Methods in NaturalLanguage Processing and Very Large Corpora.

Tomas Mikolov, Kai Chen, Greg Corrado, and JeffreyDean. 2013. Efficient estimation of word represen-tations in vector space.

Mortaza Montazery and Heshaam Faili. 2010. Auto-matic persian wordnet construction. In Proceedingsof the 23rd International Conference on Computa-tional Linguistics.

Mohammad Nasiruddin, Didier Schwab, and AndonTchechmedjiev. 2014. Induction de sens pour en-richir des ressources lexicales. In 21eme TraitementAutomatique des Langues Naturelles.

Roberto Navigli and Simone Paolo Ponzetto. 2012.Babelnet: The automatic construction, evaluationand application of a wide-coverage multilingual se-mantic network. Artificial Intelligence, 193.

Jeffrey Pennington, Richard Socher, and Christo-pher D. Manning. 2014. Glove: Global vectors forword representation. In Proceedings of EmpiricalMethods in Natural Language Processing.

Quentin Pradet, Gael de Chalendar, and Jeanne Bague-nier Desormeaux. 2013. Wonef, an improved, ex-panded and evaluated automatic french translationof wordnet. In Proceedings of the Seventh GlobalWordnet Conference.

Sascha Rothe and Hinrich Schutze. 2015. Autoex-tend: Extending word embeddings to embeddingsfor synsets and lexemes. In Proceedings of the 53rdAnnual Meeting of the Association for Computa-tional Linguistics.

Benoıt Sagot and Darja Fiser. 2008. Building a freefrench wordnet from multilingual resources. In Pro-ceedings of the Sixth International Language Re-sources and Evaluation Conference.

Gerald Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. In-formation Processing & Management, 24(5).

Rion Snow, Sushant Prakash, Daniel Jurafsky, and An-drew Y. Ng. 2007. Learning to merge word senses.In Proceedings of the 2007 Joint Conference on Em-pirical Methods in Natural Language Processingand Computational Natural Language Learning.

Feras Al Tarouti and Jugal Kalita. 2016. Enhancingautomatic wordnet construction using word embed-dings. In Proceedings of the Workshop on Multilin-gual and Cross-lingual Methods in NLP.

John Wieting, Mohit Bansal, Kevin Gimpel, and KarenLivescu. 2016. Towards universal paraphrastic sen-tence embeddings. In International Conference onLearning Representations.

Sergey Yablonsky and Andrej Sukhonogov. 2006.Semi-automated english-russian wordnet construc-tion: Initial resources, software and methods oftranslation. In Proceedings of the Global WordNetConference.

A Purification Procedure

As discussed in Section 3.3.1, the Linear-WSImodel (Arora et al., 2016a) posits that there ex-ists an overcomplete basis a1, . . . , ak ∈ Rd of unitvectors such that each wordw ∈ V can be approx-imately represented by a linear combination of atmost s basis vectors (see Equation 1). Finding thebasis vectors ai and their coefficientsRwi requiressolving the optimization problem

minimize ‖Y −RA‖2subject to ‖Rw‖0 ≤ s ∀ w ∈ V

where Y ∈ R|V |×n has word-vectors as its rows,R ∈ R|V |×k is the matrix of coefficients Rwi, andA ∈ Rk×d has the overcomplete basis a1, . . . , akas its rows. This problem is non-convex and canbe solved approximately via the K-SVD algorithm(Aharon et al., 2006).

Given a word w, the purification procedure isa method for purifying the senses induced viaLinear-WSI (the vectors ai s.t. Rwi 6= 0) by rep-resenting them as word-clusters related to both thesense itself and the word. This is done so as tocreate a more fine-grained collection of senses, asLinear-WSI does not perform well for more thana few thousand basis vectors, far fewer than the

21

number of word-senses (Arora et al., 2016a). Theprocedure is inspired by the hope that words re-lated to each sense of w will form clusters nearboth vw and one of its senses ai. Given a fixed set-size n and a search-space V ′ ⊂ V , we realize thishope via the optimization problem

maximizeC⊂V ′

γ

subject to

γ ≤ Median{vx · vw′ : w′ ∈ C\{x}} ∀ x ∈ Cγ ≤ Median{ai · vw′ : w′ ∈ C}w ∈ C, |C| = n

This problem is equivalent to maximizing (2) withconstraints |C| = n and w ∈ C ⊂ V ′. The ob-jective value is constrained to be the lowest me-dian cosine similarity between any word x ∈ Cor the sense ai and the rest of C, so optimizing itensures that the words in C are closely related toeach other and to ai. Forcing the cluster to containw leads to the words in C being close to w as well.

For computational purposes we solve this prob-lem approximately using a greedy algorithm thatstarts with C = {w} and repeatedly adds the wordin V ′\C that results in the best objective value un-til |C| = n. Speed of computation is also a reasonfor using a search-space V ′ ⊂ V rather than theentire vocabulary as a source of words for the clus-ter; we found that restricting V ′ to be all words w′

s.t. min{vw′ · vw, vw′ · ai} ≥ .2 dramatically re-duces processing time with little performance loss.

The purification procedure represents only onesense of w, so to perform WSI we generate clus-ters for all senses ai s.t. Rwi > 0. If two senseshave clusters that share a word other than w, onlythe cluster with the higher objective value is re-turned. To find the clusters displayed in Figure 3we use this procedure with cluster size n = 5 onEnglish and Russian SN vectors decomposed withbasis size k = 2000 and sparsity s = 4.

B Applying WSI to Synset Matching

We use the purification procedure to represent thecandidate synsets of a wordw by clusters of wordsrelated to w and one if its senses ai. First, for eachsynset S we define the search-space

VS =⋃S′∼S

TS′

where TS′ is the set of translations of lemmas ofS′ as in Section 3.1. Then given a word w, one of

its candidate synsets S, and a fixed set size n, werun the following procedure:

1. For each sense ai in the sparse representation(1) of w let Ci be the output cluster of thepurification procedure run on sense ai withsearch-space V ′ = VS .

2. Return the (sense, sense-cluster) pair(aS , CS) with the highest purificationprocedure objective value among senses ai.

In the following methods we will assume that eachcandidate synset S of w has a sense aS and sense-cluster CS associated to it in this way. Examplesof such clusters for the French word dalle (flag-stone, slab) are provided in Table 3.

B.1 Word-Specific ThresholdOne application of Linear-WSI is the creation ofa word-specific threshold αw to use in the score-threshold method instead of a global cutoff α. Wedo this by using the quality of the sense cluster CSof each candidate synset S of w as an indicator ofthe correctness of that synset. Recalling that uSis the synset representation of S as a vector (seeSection 3.2) and letting fS = f(w, aS , CS) be theobjective value (2), we find αw as follows:

1. Find S∗ = arg maxS is a candidate of w

fS + uS · vw.

2. Let αw = min{α, uS∗ · vw}.As this modified threshold may be lower thanmore than one candidate synset of w it allows formultiple synset matches even when no synset hasscore high enough to clear the threshold α.

B.2 Sense-Agglomeration ProcedureWe use Linear-WSI more explicitly through thesense-agglomeration procedure, which attempts torecover unmatched synsets using matched synsetssharing the same sense ai ofw. We define the clus-ter similarity ρ between word-clusters C1, C2 ⊂V as the median of all cosine-similarities of pairsof words in their set product, i.e.

ρ(C1, C2) = Median{vx · vy : x ∈ C1, y ∈ C2}.Then we say that two clusters C1, C2 are similar if

ρ(C1, C2) ≥ min{ρ(C1, C1), ρ(C2, C2)},i.e. if their cluster similarity with each other ex-ceeds at least one’s cluster similarity with itself.

22

Synset AssociatedSense aS

Purified Cluster Representation CSObjective Valuef(w, aS , CS)

flag.n.01 a789 poteau (goalpost), fleche (arrow), haut (high, top), mat (matte) 0.23flag.n.04 a892 flamme (flame), fanion (pennant), guidon, signal 0.06flag.n.06 a892 dallage (paving), carrelage (tiling), pavement, pavage (paving) 0.36flag.n.07 a1556 pan (section), empennage, queue, tail 0.14iris.n.01 a1556 bœuf (beef), usine (factory), plante (plant), puant (smelly) 0.07masthead.n.01 a1556 inscription, lettre (letter), catalogue (catalog), cotation (quotation) 0.10pin.n.08 a1556 trou (hole), tertre (mound), marais (marsh), pavillon (house, pavillion) 0.17slab.n.01 a892 carrelage (tiling), carreau (tile), tuile (tile), batiment (building) 0.27

Table 3: Purified Synset Representations of dalle (flagstone, slab). Note how the correct candidatesynsets (bolded) have clusters of words closely related to the correct meaning while the other clustershave many unrelated words, leading to lower objective values.

Then given a global low threshold β ≤ α, for eachsense ai in the sparse representation (1), sense-agglomeration consists of the following algorithm:

1. Let Mi be the set of candidate synsets S of wthat have aS = ai and score above the thresh-old αw. Stop if Mi = ∅.

2. Let Ui be the set of candidate synsets S of wthat have aS = ai and score below the thresh-old αw. Stop if Ui = ∅.

3. For each synset S ∈ Ui, ordered by synset-score, check that CS is similar (in the abovesense) to all clusters CS′ for S′ ∈Mi and hasscore higher than β. If both are true, add S toMi and remove S from Ui. Otherwise stop.

The sense-agglomeration procedure allows an un-matched synset S to be returned as a correct synsetof w provided it shares a sense with a differentmatched synset S′ and satisfies cluster similarityand score-threshold constraints.

23


Improving Verb Metaphor Detection by PropagatingAbstractness to Words, Phrases and Individual Senses

Maximilian Köper and Sabine Schulte im WaldeInstitut für Maschinelle Sprachverarbeitung

Universität Stuttgart, Germany{maximilian.koeper,schulte}@ims.uni-stuttgart.de

Abstract

Abstract words refer to things that can notbe seen, heard, felt, smelled, or tastedas opposed to concrete words. Amongother applications, the degree of abstract-ness has been shown to be a useful infor-mation for metaphor detection. Our con-tribution to this topic are as follows: i) wecompare supervised techniques to learnand extend abstractness ratings for hugevocabularies ii) we learn and investigatenorms for multi-word units by propagat-ing abstractness to verb-noun pairs whichlead to better metaphor detection, iii) weovercome the limitation of learning a sin-gle rating per word and show that multi-sense abstractness ratings are potentiallyuseful for metaphor detection. Finally,with this paper we publish automaticallycreated abstractness norms for 3 millionEnglish words and multi-words as wellas automatically created sense-specific ab-stractness ratings.

1 Introduction

The standard approach to studying abstractness isto place words on a scale ranging between ab-stractness and concreteness. Alternately, abstract-ness can also be given a taxonomic definition inwhich the abstractness of a word is determined bythe number of subordinate words (Kammann andStreeter, 1971; Dunn, 2015).

In psycholinguistics abstractness is commonlyused for concept classification (Barsalou andWiemer-Hastings, 2005; Hill et al., 2014;Vigliocco et al., 2014). In computational work, ab-stractness has become an established informationfor the task of automatic detection of metaphoricallanguage. So far metaphor detection has been car-

ried out using a variety of features including se-lectional preferences (Martin, 1996; Shutova andTeufel, 2010; Shutova et al., 2010; Haagsma andBjerva, 2016), word-level semantic similarity (Liand Sporleder, 2009; Li and Sporleder, 2010),topic models (Heintz et al., 2013), word embed-dings (Dinh and Gurevych, 2016) and visual in-formation (Shutova et al., 2016).

The underlying motivation of using abstract-ness in metaphor detection goes back to Lakoffand Johnson (1980), who argue that metaphor isa method for transferring knowledge from a con-crete domain to an abstract domain. Abstractnesswas already applied successfully for the detectionof metaphors across a variety of languages (Tur-ney et al., 2011; Dunn, 2013; Tsvetkov et al.,2014; Beigman Klebanov et al., 2015; Köper andSchulte im Walde, 2016b).

The abstractness information itself is typicallytaken from a dictionary, created either by man-ual annotation or by extending manually col-lected ratings with the help of supervised learn-ing techniques that rely on word representations.While potentially less reliable, automatically cre-ated norm-based abstractness ratings can easilycover huge dictionaries. Although some meth-ods have been used to learn abstractness, literaturelacks a comparison of these learning techniques.

We compare and evaluate different learningtechniques. In addition we show and investigatethe usefulness of extending abstractness ratings tophrases as well as individual word senses. We ex-trinsically evaluate these techniques on two verbmetaphor detection tasks: (i) a type-based settingthat makes use of phrase ratings, (ii) a token-basedclassification for multi-sense abstractness norms.Both settings benefit from our approach.

24

2 Experiments

2.1 Propagating Abstractness: AComparison of Approaches & Ressources

2.1.1 Comparison of ApproachesTurney et al. (2011) first aproached to automati-cally create abstractness norms for 114 501 words,relying on manual ratings based on the MRC Psy-cholinguistic Database (Coltheart, 1981). The un-derlying algorithm (Turney and Littman, 2003) re-quires vector representation and annotated trainingsamples of words. The algorithm itself performs agreedy forward search over the vocabulary to learnso-called paradigm words. Once paradigm wordsfor both classes (abstract & concrete) are learned,a rating can be assigned to every word by com-paring its vector representation against the vectorrepresentations of the paradigm words.

Köper and Schulte im Walde (2016a) used thesame algorithm for a large collection of Germanlemmas, and in the same way additional cre-ated ratings for multiple norms including valency,arousal and imageability.

A different method that has been used to ex-tend abstractness norms based on low-dimensionalword embeddings and a Linear Regression classi-fier (Tsvetkov et al., 2013; Tsvetkov et al., 2014).

We compare approaches across different pub-licly available vector representations1, to study po-tential differences across vector dimensionality wecompare vectors between 50 and 300 dimensions.The Glove vectors (Pennington et al., 2014) havebeen trained on 6billion tokens of Wikipedia plusGigaword (V=400K), while the word2vec cbowmodel (Mikolov et al., 2013) was trained on aGoogle internal news corpus with 100billion to-kens (V=3million). For training and testing werelied on the ratings from Brysbaert et al. (2014),Dividing the ratings into 20% test (7 990) and 80%training (31 964) for tuning hyper parameters wetook 1 000 ratings from the training data. Wekept the ratio between word classes. Evaluation isdone by comparing the new created ratings againstthe test (gold) ratings using Spearman’s rank-ordercorrelation. We first reimplemented the algorithmfrom Turney and Littman (2003) (T&L 03). In-spired by recent findings of Gupta et al. (2015) weapply the hypothesis that distributional vectors im-

1http://nlp.stanford.edu/projects/glove/https://code.google.com/archive/p/word2vec/

plicitly encode attributes such as abstractness anddirectly feed the vector representation of a wordinto a classifier, either by using linear regression(L-Reg), a regression forest (Reg-F) or a fullyconnected feed forward neural network with up totwo hidden layers (NN).2

T&L 03 L-Reg. Reg-F. NN

Glove50 .76 .76 .78 .79Glove100 .80 .79 .79 .85Glove200 .78 .78 .76 .84Glove300 .76 .78 .74 .85W2V300 .83 .84 .79 .90

Table 1: Spearman’s ρ for the test ratings. Com-paring representations and regression methods.

Table 1 shows clearly that we can learn ab-stractness ratings with a very high correlation onthe test data using the word representations fromGoogle (W2V300) together with a neural net-work for regression (ρ=.90). The NN method sig-nificantly outperforms all other methods, usingSteiger (1980)’s test (p < 0.001).

2.1.2 Comparison of RessourcesBased on the comparison of methods in the pre-vious section we propagated abstractness ratingsto the entire vocabulary of the W2V300 dataset(3million words) and compare the correlation withother existing norms of abstractness. For this com-parison we use the common subset of two manu-ally and one automatically created resource: MRCPsycholinguistic Database, ratings from Brysbaertet al. (2014) and the automatically created ratingsfrom Turney et al. (2011). We map all existingratings, as well as our newly created ratings, tothe same interval using the method from Köperand Schulte im Walde (2016a). The mapping isperformed using a continuous function, that mapsthe ratings to an interval ranging from very ab-stract (0) to very concrete (10). The common sub-set contains 3 665 ratings. Figure 1 shows the re-sulting pairwise correlation between all four re-sources. Despite being created automatically, wesee that the newly created ratings provide a highcorrelation with both manually created collections(ρ for MRS=.91, Brysbaert=.93). In addition, thevocabulary of our ratings is much larger than anyexisting database. Thus this new collection might

2NN Implementation based on https://github.com/amten/NeuralNetwork

25

0.8

0.82

0.84

0.86

0.88

0.9

0.92

0.94

0.96

0.98

1MRC BRYS TURNEY NEW

MRC

BRYS

TURNEY

NEW

1 0.92

1

0.85

0.84

1

0.91

0.93

0.88

1

Figure 1: Pairwise Spearman’s ρ on commonlycovered subset. Red = high correlation.

be useful, especially for further research which re-quires large vocabulary coverage.3

2.2 Abstractness for PhrasesA potential advantage of our method is that ab-stractness can be learned for multi-word units aslong as the representation of these units live in thesame distributional vector space as the words re-quired for the supervised training.

In this section we explore if ratings propa-gated to verb-noun phrases provide useful infor-mation for metaphor detection. As dataset we re-lied on the collection from Saif M. Mohammadand Turney (2016), who annotated different sensesof WordNet verbs for metaphoricity (Fellbaum,1998).

We used the same subset of verb–direct objectand verb–subject relations as used in Shutova etal. (2016). As preprocessing step we concatenatedverb-noun phrases by relying on dependency in-formation based on a web corpus, the ENCOW14corpus (Schäfer and Bildhauer, 2012; Schäfer,2015). We removed words and phrases that ap-peared less than 50 times in our corpus, thus ourselection covers 535 pairs, 238 of which weremetaphorical and 297 literal.

Given a verb-noun phrase, such asstamp person, we obtained vector representationsusing word2vec and the same hyper-parametersthat were used for the W2V300 embeddings(Section 2.1.1) together with the best learning

3Ratings available at http://www.ims.uni-stuttgart.de/data/en_abst_norms.html

method (NN). The technique allows us to propa-gate abstractness to every vector, thus we learnabstractness ratings for all three constituents:verb, noun and the entire phrase.

For the metaphor classification experiment weuse the rating score and apply the Area UnderCurve (AUC) metric. AUC is a metric for bi-nary classification. We assume that literal in-stances gain higher scores (= more concrete) thanmetaphorical word pairs. AUC considers all pos-sible thresholds to divide the data into literal andmetaphorical. In addition to the rating score wealso show results based on cosine similarity andfeature combinations (Table 2).

Feat. Name Type AUC

- Random baseline .501 V-NN cosine .752 V-Phrase cosine .703 NN-Phrase cosine .684 V rating .535 NN rating .786 Phrase rating .71

Comb 1+2+3 cosine .75Comb 4+5+6 rating .74Comb all(1-6) mixed .80Comb 1+5+6 best .84

Table 2: AUC Score single features and com-binations. Classifying literal and metaphoricalphrases based on the Saif M. Mohammad and Tur-ney (2016) dataset.

As shown in Table 2, the rating of the verb alone(AUC=.53) provides almost no useful information.The best performance based on a single feature isthe abstractness value of the noun (.78) followedby the cosine between verb and noun vector repre-sentation (.75). The phrase rating alone performsmoderate (.71). However when combining fea-tures we found that the best combinations are ob-tained by integrating the phrase rating. In more de-tail, combining noun and phrase rating (5+6) ob-tains a AUC of (.80). When adding the cosine (1)we obtain the best score of (.84). For comparison,the verb plus noun ratings (4+5) obtains a lowerscore (.72), this shows that the phrase rating pro-vides complementary and useful information.

26

2.3 Sense-specific Abstractness Ratings

In this section we investigate if automaticallylearned multi-sense abstractness ratings, that ishaving different ratings per word sense, are poten-tially useful for the task of metaphor detection.

Recent advances in word representation learn-ing led to the development of algorithms for non-parametric and unsupervised multi-sense repre-sentation learning (Neelakantan et al., 2014; Liuet al., 2015; Li and Jurafsky, 2015; Bartunov etal., 2016). Using these techniques one can learna different vector representation per word sense.Such representations can be combined with ourabstractness learning method from section 2.1.1.

While in theory any multi-sense learning tech-nique can be applied, we decided for the oneintroduced by Pelevina et al. (2016), as it per-forms sense learning after single senses have beenlearned. Starting from the public W2V300 repre-sentations we apply the multi-sense learning tech-nique using the default settings and learn sense-specific word representations. Finally we prop-agate abstractness to every newly created senserepresentation by using the exact same model andtraining data as in Section 1. For a given word in asentence we can now disambiguate the word senseby comparing its sense-specific vector representa-tion to all context words. The context words arerepresented using the (single sense) global repre-sentation. We always pick the sense representationthat obtains the largest similarity, measured by co-sine. The potential advantage of this method isthat in a metaphor detection system we are nowable to look up word-sense-specific abstractnessratings instead of globally obtained ratings.

For this experiment we use the VU Amster-dam Metaphor Corpus (Steen, 2010) (VUA), fo-cusing on verb metaphors. The collection contains23 113 verb tokens in running text, annotated asbeing used literally or metaphorically. In additionwe present results for the TroFi metaphor dataset(Birke and Sarkar, 2006) containing 50 verbs and3 737 labeled sentences. We pre-processed bothrecourses using Stanford CoreNLP (Manning etal., 2014) for lemmatization, part-of-speech tag-ging and dependency parsing.

We present results by applying ten-fold cross-validation over the entire data. For the VUA weadditionally present results for the test data us-ing the same training/test split as in Beigman Kle-banov et al. (2016).

Abstractness norms are implemented using thesame five feature dimensions as used by Turneyet al. (2011) plus dimensions respectively for sub-ject and object, thus we rely on the seven feature,namely:

1. Rating of the verbs subject

2. Rating of the verbs object

3. Average rating of all nouns (excluding propernames)

4. Average rating of all proper names

5. Average rating of all verbs, excluding the tar-get verb

6. Average rating of all adjectives

7. Average rating of all adverbs

For classification we used a balanced LogisticRegression classifier following the findings fromBeigman Klebanov et al. (2015). While this de-fault setup tries to generalize over unseen verbs byonly looking at a verb’s context we further presentresults for a second setup that uses a 6th feature:namely the lemma of the target verb itself (+L).The purpose of the second system is to describeperformance with respect to the state of the art(Beigman Klebanov et al., 2016), which amongother features also uses the verb lemma.

Feat. TroFi(10F) VUA(10F) VUA(Test)

1S .72 .42 .44MS .74 .44* .46

1S(+L) .74 .61 .62MS(+L) .75 .61 .62

Table 3: F-score (Metaphor). Classifying literaland metaphorical verbs based on the VUA andTroFi dataset. MS = multi-sense, 1S= single sense.

As shown in Table 3, the mutli-sense ratingsconstantly outperform the single-sense ratings ina direct comparison on all three sets. The dif-ference in performance of single and multi-senseratings is statistically significant on the full VUAdataset, using the χ2 test and ∗ for p < 0.05.However we also notice that the effect vanishesas soon as we combine the ratings with the lemmaof the verb, which is especially the case for theVUA dataset where the lemma increases the per-formance by a large margin. In contrast to relatedwork, the system with the verb unigram (+UL)

27

can be considered state-of-the-art. When apply-ing the same evaluation as Beigman Klebanov etal. (2016), namely a macro-average over the fourgenres of VUA, we obtain an average f-score of.60 by using only eight feature dimensions and ab-stractness ratings as external resource.4

3 Conclusion

In this paper we compared supervised methodsto propagate abstractness norms to words. Weshowed that a neural-network outperforms othermethods. In addition we showed that norms formulti-words phrases can be beneficial for typebased metaphor detection. Finally we showed hownorms can be learned for sense representations andthat sense specific norms show a clear tendency toimprove token-based verb metaphor detection.

Acknowledgments

The research was supported by the DFG Col-laborative Research Centre SFB 732 (Maximil-ian Köper) and the DFG Heisenberg FellowshipSCHU-2580/1 (Sabine Schulte im Walde).

ReferencesLawrence W Barsalou and Katja Wiemer-Hastings.

2005. Situating Abstract Concepts. Grounding cog-nition: The role of perception and action in memory,language, and thought, pages 129–163.

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin,and Dmitry Vetrov. 2016. Breaking Sticks and Am-biguities with Adaptive Skip-gram. In Proceedingsof the International Conference on Artificial Intelli-gence and Statistics, pages 130–138, Cadiz, Spain.

Beata Beigman Klebanov, Chee Wee Leong, andMichael Flor. 2015. Supervised Word-levelMetaphor Detection: Experiments with Concrete-ness and Reweighting of Examples. In Proceedingsof the Third Workshop on Metaphor in NLP, pages11–20, Denver, Colorado.

Beata Beigman Klebanov, Chee Wee Leong, E. DarioGutierrez, Ekaterina Shutova, and Michael Flor.2016. Semantic Classifications for Detection ofVerb Metaphors. In Proceedings of the h AnnualMeeting of the Association for Computational Lin-guistics, pages 101–106, Berlin, Germany.

Julia Birke and Anoop Sarkar. 2006. A Clustering Ap-proach for the Nearly Unsupervised Recognition ofNonliteral Language. In Proceedings of the 11thConference of the European Chapter of the ACL,pages 329–336, Trento, Italy.

4SOA results from Klebanov obtain also a score of .60

Marc Brysbaert, AmyBeth Warriner, and Victor Kuper-man. 2014. Concreteness Ratings for 40 ThousandGenerally known English Word Lemmas. BehaviorResearch Methods, pages 904–911.

Max Coltheart. 1981. The MRC psycholinguisticdatabase. The Quarterly Journal of ExperimentalPsychology Section A, 33(4):497–505.

Erik-Lân Do Dinh and Iryna Gurevych. 2016.Token-Level Metaphor Detection using Neural Net-works. In Proceedings of The Fourth Workshop onMetaphor in NLP, pages 28–33, San Diego, CA,USA.

Jonathan Dunn. 2013. What Metaphor IdentificationSystems can tell us About Metaphor-in-language. InProceedings of the First Workshop on Metaphor inNLP, pages 1–10, Atlanta, Georgia.

Jonathan Dunn. 2015. Modeling Abstractness andMetaphoricity. Metaphor and Symbol, pages 259–289.

Christiane Fellbaum. 1998. A Semantic Networkof English Verbs. In Christiane Fellbaum, editor,WordNet – An Electronic Lexical Database, Lan-guage, Speech, and Communication. MIT Press,Cambridge, MA.

Abhijeet Gupta, Gemma Boleda, Marco Baroni, andSebastian Padó. 2015. Distributional Vectors En-code Referential Attributes. In Proceedings of Con-ference on Empirical Methods in Natural LanguageProcessing, pages 12–21, Lisbon, Portugal.

Hessel Haagsma and Johannes Bjerva. 2016. Detect-ing Novel Metaphor using Selectional Preference In-formation. In Proceedings of The Fourth Workshopon Metaphor in NLP, pages 10–17, San Diego, CA,USA.

Ilana Heintz, Ryan Gabbard, Mahesh Srivastava, DaveBarner, Donald Black, Majorie Friedman, and RalphWeischedel. 2013. Automatic Extraction of Lin-guistic Metaphors with LDA Topic Modeling. InProceedings of the First Workshop on Metaphor inNLP, pages 58–66, Atlanta, Georgia.

Felix Hill, Anna Korhonen, and Christian Bentz.2014. A Quantitative Empirical Analysis of theAbstract/Concrete Distinction. Cognitive Science,38:162–177.

Richard Kammann and Lynn Streeter. 1971. TwoMeanings of Word Abstractness. Journal of VerbalLearning and Verbal Behavior, 10(3):303 – 306.

Maximilian Köper and Sabine Schulte im Walde.2016a. Automatically Generated Affective Normsof Abstractness, Arousal, Imageability and Valencefor 350 000 German Lemmas. In Proceedingsof the 10th International Conference on LanguageResources and Evaluation, pages 2595–2598, Por-toroz, Slovenia.

28

Maximilian Köper and Sabine Schulte im Walde.2016b. Distinguishing Literal and Non-Literal Us-age of German Particle Verbs. In Proceedings of theConference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, pages 353–362, San Diego,California.

George Lakoff and Mark Johnson. 1980. Metaphorswe live by. University of Chicago Press.

Jiwei Li and Dan Jurafsky. 2015. Do Multi-Sense Em-beddings Improve Natural Language Understand-ing? In Proceedings of the Conference on Empiri-cal Methods in Natural Language Processing, pages1722–1732, Lisbon, Portugal.

Linlin Li and Caroline Sporleder. 2009. ClassifierCombination for Contextual Idiom Detection With-out Labelled Data. In Proceedings of the Confer-ence on Empirical Methods in Natural LanguageProcessing, pages 315–323.

Linlin Li and Caroline Sporleder. 2010. Using Gaus-sian Mixture Models to Detect Figurative Languagein Context. In Proceedings of the Annual Confer-ence of the North American Chapter of the Associa-tion for Computational Linguistics, pages 297–300.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015.Learning Context-Sensitive Word Embeddings withNeural Tensor Skip-Gram Model. In Proceedingsof the Twenty-Fourth International Joint Conferenceon Artificial Intelligence, pages 1284–1290, BuenosAires, Argentina.

Christopher D. Manning, Mihai Surdeanu, John Bauer,Jenny Finkel, Steven J. Bethard, and David Mc-Closky. 2014. The Stanford CoreNLP Natural Lan-guage Processing Toolkit. In Proceedings of the As-sociation for Computational Linguistics (ACL) Sys-tem Demonstrations, pages 55–60.

James H. Martin. 1996. Computational Approaches toFigurative Language. Metaphor and Symbolic Ac-tivity.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013. Distributed Represen-tations of Words and Phrases and their Composi-tionality. In C.J.C. Burges, L. Bottou, M. Welling,Z. Ghahramani, and K.Q. Weinberger, editors, Ad-vances in Neural Information Processing Systems26, pages 3111–3119.

Arvind Neelakantan, Jeevan Shankar, Alexandre Pas-sos, and Andrew McCallum. 2014. Efficient Non-parametric Estimation of Multiple Embeddings perWord in Vector Space. In Proceedings of the Con-ference on Empirical Methods in Natural LanguageProcessing, pages 1059–1069, Doha, Qatar.

Maria Pelevina, Nikolay Arefiev, Chris Biemann, andAlexander Panchenko. 2016. Making Sense ofWord Embeddings. In Proceedings of the 1st Work-shop on Representation Learning for NLP, pages174–183, Berlin, Germany, August.

Jeffrey Pennington, Richard Socher, and Christo-pher D. Manning. 2014. Glove: Global Vectorsfor Word Representation. In Proceedings of Con-ference on Empirical Methods in Natural LanguageProcessing, pages 1532–1543, Doha, Qatar.

Ekaterina Shutova Saif M. Mohammad and Peter D.Turney. 2016. Metaphor as a Medium for Emotion:An Empirical Study. In Proceedings of the FifthJoint Conference on Lexical and Computational Se-mantics (*Sem), pages 23–33, Berlin, Germany.

Roland Schäfer and Felix Bildhauer. 2012. BuildingLarge Corpora from the Web Using a New EfficientTool Chain. In Proceedings of the 8th InternationalConference on Language Resources and Evaluation,pages 486–493, Istanbul, Turkey.

Roland Schäfer. 2015. Processing and Querying LargeWeb Corpora with the COW14 Architecture. In Pi-otr Banski, Hanno Biber, Evelyn Breiteneder, MarcKupietz, Harald Lüngen, and Andreas Witt, editors,Proceedings of the 3rd Workshop on Challenges inthe Management of Large Corpora, pages 28–24,Lancaster.

Ekaterina Shutova and Simone Teufel. 2010.Metaphor Corpus Annotated for Source - Target Do-main Mappings. In Proceedings of the InternationalConference on Language Resources and Evaluation,pages 3225–3261, Valletta, Malta.

Ekaterina Shutova, Lin Sun, and Anna Korhonen.2010. Metaphor Identification Using Verb andNoun Clustering. In Proceedings of the 23rd Inter-national Conference on Computational Linguistics,pages 1002–1010.

Ekaterina Shutova, Douwe Kiela, and Jean Maillard.2016. Black Holes and White Rabbits: MetaphorIdentification with Visual Features. In Proceedingsof the Conference of the North American Chapter ofthe Association for Computational Linguistics: Hu-man Language Technologies, pages 160–170.

G. Steen. 2010. A Method for Linguistic MetaphorIdentification: From MIP to MIPVU. Converg-ing Evidence in Language and Communication Re-search. John Benjamins Publishing Company.

James H Steiger. 1980. Tests for Comparing Elementsof a Correlation Matrix. Psychological Bulletin.

Yulia Tsvetkov, Elena Mukomel, and Anatole Gersh-man. 2013. Cross-lingual Metaphor Detection Us-ing Common Semantic Features. In Proceedings ofthe Workshop on Metaphor in NLP, pages 45–51,Atlanta, USA.

Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman,Eric Nyberg, and Chris Dyer. 2014. Metaphor De-tection with Cross-Lingual Model Transfer. In Pro-ceedings of the 52nd Annual Meeting of the Associa-tion for Computational Linguistics, pages 248–258.

29

Peter D. Turney and Michael L. Littman. 2003. Mea-suring Praise and Criticism: Inference of SemanticOrientation from Association. ACM Transactionson Information Systems, pages 315–346.

Peter Turney, Yair Neuman, Dan Assaf, and Yohai Co-hen. 2011. Literal and Metaphorical Sense Identi-fication through Concrete and Abstract Context. InProceedings of the Conference on Empirical Meth-ods in Natural Language Processing, pages 680–690, Edinburgh, UK.

Gabriella Vigliocco, Stavroula-Thaleia Kousta,Pasquale Anthony Della Rosa, David P Vinson,Marco Tettamanti, Joseph T Devlin, and Stefano FCappa. 2014. The Neural Representation ofAbstract Words: the Role of Emotion. CerebralCortex, pages 1767–1777.

30


Improving Clinical Diagnosis Inference through Integration of Structuredand Unstructured Knowledge

Yuan Ling, Yuan AnCollege of Computing & Informatics

Drexel UniversityPhiladelphia, PA, USA

{yl638,ya45}@drexel.edu

Sadid A. HasanArtificial Intelligence LaboratoryPhilips Research North America

Cambridge, MA, [email protected]

Abstract

This paper presents a novel approach tothe task of automatically inferring themost probable diagnosis from a givenclinical narrative. Structured KnowledgeBases (KBs) can be useful for such com-plex tasks but not sufficient. Hence, weleverage a vast amount of unstructuredfree text to integrate with structured KBs.The key innovative ideas include buildinga concept graph from both structured andunstructured knowledge sources and rank-ing the diagnosis concepts using the en-hanced word embedding vectors learnedfrom integrated sources. Experiments onthe TREC CDS and HumanDx datasetsshowed that our methods improved the re-sults of clinical diagnosis inference.

1 Introduction and Related Work

Clinical diagnosis inference is the problem of au-tomatically inferring the most probable diagno-sis from a given clinical narrative. Many health-related information retrieval tasks can greatly ben-efit from the accurate results of clinical diagnosisinference. For example, in recent Text REtrievalConference (TREC) Clinical Decision Supporttrack (CDS1), diagnosis inference from medicalnarratives has improved the accuracy of retriev-ing relevant biomedical articles (Roberts et al.,2015; Hasan et al., 2015; Goodwin and Harabagiu,2016).

Solutions to the clinical diagnostic inferencingproblem require a significant amount of inputsfrom domain experts and a variety of sources (Fer-rucci et al., 2013; Lally et al., 2014). To ad-dress such complex inference tasks, researchers(Yao and Van Durme, 2014; Bao et al., 2014;Dong et al., 2015) have utilized structured KBs

1http://www.trec-cds.org/

that store relevant information about various en-tity types and relation triples. Many large-scaleKBs have been constructed over the years, suchas WordNet (Miller, 1995), Yago (Suchanek etal., 2007), Freebase (Bollacker et al., 2008), DB-pedia (Auer et al., 2007), NELL (Carlson et al.,2010), UMLS Metathesaurus (Bodenreider, 2004)etc. However, using KBs alone for inference tasks(Bordes et al., 2014) has certain limitations such asincompleteness of knowledge, sparsity, and fixedschema (Socher et al., 2013; West et al., 2014).

On the other hand, unstructured textual re-sources such as free texts from Wikipedia gen-erally contain more information than structuredKBs. As a supplementary knowledge to mitigatethe limitations of structured KBs, unstructured textcombined with structured KBs provides improvedresults for related tasks, for example, clinical ques-tion answering (Miller et al., 2016). For pro-cessing text, word embedding models (e.g. skip-gram model (Mikolov et al., 2013b; Mikolov etal., 2013a)) can efficiently discover and representthe underlying patterns of unstructured text. Wordembedding models represent words and their rela-tionships as continuous vectors. To improve wordembedding models, previous works have also suc-cessfully leveraged structured KBs (Bordes et al.,2011; Weston et al., 2013; Wang et al., 2014; Zhouet al., 2015; Liu et al., 2015).

Motivated by the superior power of the integra-tion of structured KBs and unstructured free text,we propose a novel approach to clinical diagno-sis inference. The novelty lies in the ways of in-tegrating structured KBs with unstructured text.Experiments showed that our methods improvedclinical diagnosis inference from different aspects(Section 5.4). Previous work on diagnosis in-ference from clinical narratives either formulatesthe problem as a medical literature retrieval task(Zheng and Wan, 2016; Balaneshin-kordan andKotov, 2016) or as a multiclass multilabel classi-

31

fication problem in a supervised setting (Hasan etal., 2016; Prakash et al., 2016). To the best of ourknowledge, there is no work on diagnoses infer-ence from clinical narratives conducted in an un-supervised way. Thus, we build such baselines forthis task.

2 Overview of the Approach

Our approach includes four steps in general: 1)extracting source concepts, q, from clinical narra-tives, 2) iteratively identifying corresponding ev-idence concepts, a, from KBs and unstructuredtext, 3) representing both source and evidenceconcepts in a weighted graph via a regularizer-enhanced skip-gram model, and 4) ranking the rel-evant evidence concepts (i.e. diagnoses) based ontheir association with the source concepts, S(q, a)(computed by weighted dot product of two vec-tors), to generate the final output. Figure 1 showsthe overview using an illustrative example.

Given source concepts as input, we build anedge-weighted graph representing the connectionsamong all the concepts by iteratively retrievingevidence concepts from both KBs and unstruc-tured text. The weights of the edges represent thestrengths of the relationships between concepts.Each concept is represented as a word embeddingvector. We combine all the source concept vec-tors into a single vector representing a clinical sce-nario. Source concepts are differentiated accord-ing to the weighting scheme in Section 4.2. Evi-dence concepts are also represented as vectors andranked according to their relevance to the sourceconcepts. For each clinical case, we find the mostprobable diagnoses from the top-ranked evidenceconcepts.

3 Knowledge Sources of EvidenceConcepts

In this study, we use UMLS Metathesaurus (Bo-denreider, 2004) and Freebase (Bollacker et al.,2008) as the structured KBs. Both KBs pro-vide semantic relation triples in the followingformat: <concept1, relation, concept2>. Weselect UMLS relation types that are relevantto the problem of clinical diagnosis inference.These types include disease-treatment, disease-prevention, disease-finding, sign or symptom,causes etc. Freebase contains a large numberof triples from multiple domains. We select61,243 triples from freebase that are classified as

medicine relation types. There are 19 such rela-tion types in total. Most of them fall under the“medicine.disease” category.

For unstructured text, we use articles fromWikipedia and MayoClinic corpus as the supple-mentary knowledge source. Important clinicalconcepts mentioned in a Wikipedia/MayoClinicpage can serve as a critical clue to a clinical di-agnosis. For example, in Figure 1, we see that“dyspnea”, “shortness of breath”, “tachypnea” etc.are the related signs and symptoms of the “Pul-monary Embolism” diagnosis. We select 37,245Wikipedia pages under the clinical diseases andmedicine category in this study. Most of thepage titles represent disease names. In addition,MayoClinic2 disease corpus contains 1,117 pages,which include sections of Symptoms, Causes,Risk Factors, Treatments and Drugs, Prevention,etc.

4 Methodology

4.1 Building Weighted Concept GraphBoth the source and the evidence concepts arerepresented as nodes in a graph. A clinical caseis represented as a set of source concept nodes:q = {q1, q2, . . .}. We build a weighted conceptgraph from source concepts using Algorithm 1.

Algorithm 1: Build Concept GraphInput : source concept nodes qOutput: graphG = (V,E)

1 S = q and V = q;2 while S 6= ∅ do3 for each qi in S do4 if distance(qi, q) > 2 then5 continue;6 end7 if triple< qi, r, aj > in KBs then8 wij = 1;9 e = (qi, aj) and e.value = wij ;

10 insert aj to V and S;11 insert e toE;12 end13 Use qi as query, search in Unstructured Text Corpora, get

ResultR;14 for each page-similarity pair (p, sij) inR do15 e = (qi, title(p)) and e.value = sij ;16 insert title(p) to V and S;17 insert e toE;18 end19 remove qi from S;20 end21 end

Two kinds of evidence concept nodes are addedto the graph: 1) the entities from KBs (UMLS andFreebase) (step 7-12 in Algorithm 1), and 2) theentities from unstructured text pages (step 13-18).If there exists a triple < qi, r, aj > in KBs, where

2http://www.mayoclinic.org/diseases-conditions

32

Figure 1: Overview of our system.

r refers to a relation, an edge is used to connectnode qi and node aj . wij represents the weight forthat edge, and let wij = 1, if the correspondingtriple occurs at least once. Due to the incomplete-ness of the KBs, there may exist multiple missingconnections between a potential evidence conceptaj and a source concept qi. Unstructured knowl-edge from Wikipedia and MayoClinic can replen-ish these missing connections. For each page p,the page title represents an evidence concept aj .We use each source concept qi as a query, andpage p as a document, and then calculate a query-document similarity to measure the edge weightwij between node aj and node qi. We only takeevidence concepts as all nodes connected to sourceconcepts in a distance of at most 2 (step 4-6).

4.2 Representing Clinical Case

We combine the source concepts q and get a sin-gle vector vq to represent the clinical case narra-tive. The source concepts from narratives for clin-ical diagnosis inference should be differentiated.Some source concepts are major symptoms for adiagnosis, while others are less critical. These ma-jor source concepts should be identified and givenhigher weight values. We develop two kinds ofweighting schema for the differential expressionof the source concepts. The source concept is rep-resented as vq = 1

N

∑qi∈q γivqi . N is the total

number of source concepts. vqi is the vector rep-resentation for one source concept qi.

(1) A longer concept usually convey more infor-mation (e.g. malar rash vs. rash), so it should begiven more weights. We define this weight valueas γ1 = #Words inConcept.

(2) For some commonly seen concepts (e.g.fever), usually, there are more edges connected tothem. Sometimes, a common concept is less im-portant for diagnosis inference, while some uniqueconcepts are critical to infer a specific diagnosis.We define this weight value for each concept asγ2 = 1

#Connected Edges . A higher weight valuemeans the source concept is more unique.

4.3 Inferring Concepts for Diagnosis

Extracting Potential Evidence Concepts: Fromsource concept nodes q, we find their con-nected concepts in the graph as evidence concepts.Traversing all edges in a graph is computation-ally expensive and often unnecessary for findingpotential diagnoses. The solution is to use a sub-graph. We follow the idea proposed in Bordes etal. (2014). The evidence concepts are defined asall nodes connected to source concepts in a dis-tance of at most 2.

Ranking Evidence Concepts: We rank eachevidence concept a′ according to its matchingscore S(q, a′) to the source concepts. The match-ing score S(q, a′) is a dot product of embeddingrepresentation of the evidence concept a′ and thesource concept q by taking the edge weights wij

into consideration. S(q, a′) = wijva′ · vq. va′ and

33

vq are embedding representations for a′ and q. Theembedding E ∈ Rk×N for concepts are trainedusing embedding models (Section 4.4). N is thetotal number of concepts and k is the predefineddimensions for the embedding vector. Each con-cept in the graph can find a k dimensional vectorrepresentation in E. For a set of source conceptsand evidence concepts A(q), the top-ranked evi-dence concept can be computed as:

a = argmax(a′∈A(q))S(q, a′) (1)

4.4 Word Embedding Models

We use the skip-gram model as the basic model.The skip-gram model predicts surrounding wordswt−c, . . . , wt−1, wt+1, . . . , wt+c given the currentcenter word wt. We further enhance the skip-grammodel by adding a graph regularizer. Given a se-quence of training words w1, w2, . . . , wT , the ob-jective function is:

J = max1

T

T∑t=1

(1−λ)∑

−c≤j≤c,j 6=0

log p(wt+j |wt)−λR∑

r=1

D(vt, vr),

(2)

where vt and vr are the representation vectors forwordwt and wordwr. λ is a parameter to leveragethe graph regularizer and original objective. Sup-pose, wordwt is mentioned having relations with aset of other wordswr, r ∈ {1, . . . , R} in KBs. Thegraph regularizer λ

∑Rr=1D(vt, vr) integrates ex-

tra knowledge about semantic relationships amongwords within the graph structure. D(vt, vr) repre-sents the distance between vt and vr. In our ex-periments, the distance between two concepts ismeasured using KL-Divergence. D(vt, vr) can becalculated using any other types of distance met-rics. By minimizing D(vt, vr), we expect if twoconcepts have a close relation in KBs, their vectorrepresentations will also be close to each other.

5 Experiments

5.1 Datasets for Clinical Diagnosis Inference

Our first dataset is from the 2015 TREC CDStrack (Roberts et al., 2015). It contains 30 top-ics, where each topic is a medical case narrativethat describes a patient scenario. Each case is as-sociated with the ground truth diagnosis. We useMetaMap3 to extract the source concepts from anarrative and then manually refine them to removeredundancy.

3https://metamap.nlm.nih.gov/

Our second dataset is curated from HumanDx4,a project to foster integrating efforts to map healthproblems to their possible diagnoses. We curatediagnosis-findings relationships from HumanDxand create a dataset with 459 diagnosis-findingsentries. Note that, the findings from this datasetare used as the given source concepts for a clinicalscenario.

5.2 Training Data for Word Embeddings

We curate a biomedical corpus of around 5Msentences from two data sources: PubMed Cen-tral5 from the 2015 TREC CDS snapshot6 andWikipedia articles under the “Clinical Medicine”category7. After sentence splitting, word tok-enization, and stop words removal, we train ourword embedding models on this corpus. UMLSMetathesaurus and Freebase are used as KBs totrain the graph regularizer. We use stochastic gra-dient descent (SGD) to maximize the objectivefunction and set the parameters empirically.

5.3 Results

We use Mean Reciprocal Rank (MRR) and Aver-age Precision at 5 (P@5) to evaluate our models.MRR is a statistical measure to evaluate a processthat generates a list of possible responses to a sam-ple of queries, ordered by probability of correct-ness. Average P@5 is calculated as precision attop 5 predicted results divided by the total numberof topics. Since our dataset only has one correctdiagnosis for each topic, all results have poor Av-erage P@5 scores.

Table 1 presents the results for our experi-ments. We report two baselines: Skip-gram refersto the basic word embedding model, and Skip-gram* refers to the graph-regularized model us-ing KBs. We also show the results for using dif-ferent unstructured knowledge sources and differ-ent weighting schema. We can see that the bestscores are obtained by the graph-regularized mod-els with both the unstructured knowledge sourceswith variable weighting schema (Section 4.2).

5.4 Discussion

Unstructured text is a critical supplement: Weanalyze the source concepts and the correspondingevidence concepts for CDS topics, and investigate

4https://www.humandx.org/5https://www.ncbi.nlm.nih.gov/pmc/6http://www.trec-cds.org/2015.html#documents7https://en.wikipedia.org/wiki/Category:Clinical medicine

34

TREC CDS HumanDxMethod MRR Average P@5 MRR Average P@5

BaselinesSkip-gram 21.66 8.88 18.56 5.08Skip-gram* 22.60 8.88 18.63 5.15

Skip-gram* + Different Unstructured Text DatasetsWikipedia 26.01 8.96 19.42 5.76MayoClinic 32.64 9.52 19.46 5.80Both 32.29 9.60 19.12 5.76

Skip-gram* + Both Text Datasets + Different Weightsγ1 32.22 10.40 21.09 5.88γ2 32.77 12.00 20.86 5.93

Table 1: Evaluation results.

the origin of the correct diagnoses. 70% of thecorrect diagnoses can be inferred from Wikipedia,60% of the correct diagnoses from MayoClinic,56% of the correct diagnoses from Freebase, andonly 7% are from UMLS. Hence, Wikipedia andMayoClinic are very important sources for findingthe correct diagnoses.

Source concepts should be differentiated: Inclinical narratives, some concepts are more criticalthan others for the clinical diagnosis inference. Wedeveloped two weighting schema to assign higherweight values to more important concepts. The re-sults in Table 1 show that differentiating the sourceconcepts with different weight values has a largeimpact on the model performance.

Enhanced skip-gram is better: We proposethe enhanced skip-gram model by using a graphregularizer to integrate the semantic relationshipsamong concepts from KBs. Experimental resultsshow that diagnosis inference is improved by us-ing word embedding representations from the en-hanced skip-gram model.

6 Conclusion

We proposed a novel approach to the task of clin-ical diagnosis inference from clinical narratives.Our method overcomes the limitations of struc-tured KBs by making use of the integrated struc-tured and unstructured knowledge. Experimen-tal results showed that the enhanced skip-grammodel with differential expression of source con-cepts improved the performance on two bench-mark datasets.

ReferencesSoren Auer, Christian Bizer, Georgi Kobilarov, Jens

Lehmann, Richard Cyganiak, and Zachary Ives.2007. Dbpedia: A nucleus for a web of open data.In The semantic web, pages 722–735. Springer.

Saeid Balaneshin-kordan and Alexander Kotov. 2016.Optimization method for weighting explicit and la-tent concepts in clinical decision support queries.In Proceedings of the 2016 ACM on InternationalConference on the Theory of Information Retrieval,pages 241–250. ACM.

Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao.2014. Knowledge-based question answering as ma-chine translation. Cell, 2(6).

Olivier Bodenreider. 2004. The unified medical lan-guage system (umls): integrating biomedical termi-nology. Nucleic acids research, 32(suppl 1):D267–D270.

Kurt Bollacker, Colin Evans, Praveen Paritosh, TimSturge, and Jamie Taylor. 2008. Freebase: a col-laboratively created graph database for structuringhuman knowledge. In Proceedings of the 2008 ACMSIGMOD international conference on Managementof data, pages 1247–1250. ACM.

Antoine Bordes, Jason Weston, Ronan Collobert, andYoshua Bengio. 2011. Learning structured embed-dings of knowledge bases. In Conference on artifi-cial intelligence.

Antoine Bordes, Sumit Chopra, and Jason Weston.2014. Question answering with subgraph embed-dings. arXiv preprint arXiv:1406.3676.

Andrew Carlson, Justin Betteridge, Bryan Kisiel,Burr Settles, Estevam R Hruschka Jr, and Tom MMitchell. 2010. Toward an architecture for never-ending language learning. In AAAI, volume 5,page 3.

Li Dong, Furu Wei, Ming Zhou, and Ke Xu.2015. Question answering over freebase with multi-column convolutional neural networks. In Proceed-ings of Association for Computational Linguistics,pages 260–269.

David Ferrucci, Anthony Levas, Sugato Bagchi, DavidGondek, and Erik T Mueller. 2013. Watson: beyondjeopardy! Artificial Intelligence, 199:93–105.

Travis R. Goodwin and Sanda M. Harabagiu. 2016.Medical question answering for clinical decision

35

support. In Proceedings of the 25th ACM Inter-national on Conference on Information and Knowl-edge Management, pages 297–306. ACM.

Sadid A. Hasan, Yuan Ling, Joey Liu, and OladimejiFarri. 2015. Using neural embeddings for diag-nostic inferencing in clinical question answering. InTREC.

Sadid A. Hasan, Siyuan Zhao, Vivek Datla, Joey Liu,Kathy Lee, Ashequl Qadir, Aaditya Prakash, andOladimeji Farri. 2016. Clinical question answeringusing key-value memory networks and knowledgegraph. In TREC.

Adam Lally, Sugato Bachi, Michael A. Barborak,David W. Buchanan, Jennifer Chu-Carroll, David A.Ferrucci, Michael R. Glass, Aditya Kalyanpur,Erik T. Mueller, J. William Murdock, et al. 2014.Watsonpaths: scenario-based question answeringand inference over unstructured information. York-town Heights: IBM Research.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, andYu Hu. 2015. Learning semantic word embeddingsbased on ordinal knowledge constraints. In Pro-ceedings of the 53rd Annual Meeting of the Associ-ation for Computational Linguistics and the 7th In-ternational Joint Conference on Natural LanguageProcessing (ACL-IJCNLP), pages 1501–1511.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jef-frey Dean. 2013a. Efficient estimation of wordrepresentations in vector space. arXiv preprintarXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor-rado, and Jeff Dean. 2013b. Distributed representa-tions of words and phrases and their compositional-ity. In Advances in neural information processingsystems, pages 3111–3119.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason We-ston. 2016. Key-value memory networks for di-rectly reading documents. CoRR, abs/1606.03126.

George A. Miller. 1995. Wordnet: a lexicaldatabase for english. Communications of the ACM,38(11):39–41.

Aaditya Prakash, Siyuan Zhao, Sadid A. Hasan, VivekDatla, Kathy Lee, Ashequl Qadir, Joey Liu, andOladimeji Farri. 2016. Condensed memory net-works for clinical diagnostic inferencing. arXivpreprint arXiv:1612.01848.

Kirk Roberts, Matthew S. Simpson, Ellen Voorhees,and William R. Hersh. 2015. Overview of the trec2015 clinical decision support track. In TREC.

Richard Socher, Danqi Chen, Christopher D. Manning,and Andrew Ng. 2013. Reasoning with neural ten-sor networks for knowledge base completion. In Ad-vances in Neural Information Processing Systems,pages 926–934.

Fabian M. Suchanek, Gjergji Kasneci, and GerhardWeikum. 2007. Yago: a core of semantic knowl-edge. In Proceedings of the 16th international con-ference on World Wide Web, pages 697–706. ACM.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and ZhengChen. 2014. Knowledge graph and text jointly em-bedding. In EMNLP, pages 1591–1601. Citeseer.

Robert West, Evgeniy Gabrilovich, Kevin Murphy,Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014.Knowledge base completion via search-based ques-tion answering. In Proceedings of the 23rd inter-national conference on World wide web, pages 515–526. ACM.

Jason Weston, Antoine Bordes, Oksana Yakhnenko,and Nicolas Usunier. 2013. Connecting languageand knowledge bases with embedding models for re-lation extraction. arXiv preprint arXiv:1307.7973.

Xuchen Yao and Benjamin Van Durme. 2014. Infor-mation extraction over structured data: Question an-swering with freebase. In ACL (1), pages 956–966.Citeseer.

Ziwei Zheng and Xiaojun Wan. 2016. Graph-basedmulti-modality learning for clinical decision sup-port. In Proceedings of the 25th ACM Internationalon Conference on Information and Knowledge Man-agement, pages 1945–1948. ACM.

Guangyou Zhou, Tingting He, Jun Zhao, and Po Hu.2015. Learning continuous word embedding withmetadata for question retrieval in community ques-tion answering. In Proceedings of ACL, pages 250–259.

36


Classifying Lexical-semantic Relationships by Exploiting Sense/ConceptRepresentations

Kentaro Kanada, Tetsunori Kobayashi and Yoshihiko HayashiWaseda University, Japan

[email protected]@waseda.jp

[email protected]

Abstract

This paper proposes a method for classi-fying the type of lexical-semantic relationbetween a given pair of words. Given aninventory of target relationships, this taskcan be seen as a multi-class classificationproblem. We train a supervised classi-fier by assuming that a specific type oflexical-semantic relation between a pair ofwords would be signaled by a carefully de-signed set of relation-specific similaritiesbetween the words. These similarities arecomputed by exploiting “sense represen-tations” (sense/concept embeddings). Theexperimental results show that the pro-posed method clearly outperforms an ex-isting state-of-the-art method that does notutilize sense/concept embeddings, therebydemonstrating the effectiveness of thesense representations.

1 Introduction

Given a pair of words, classifying the type oflexical-semantic relation that could hold betweenthem may have a range of applications. In particu-lar, discovering typed lexical-semantic relation in-stances is vital in building a new lexical-semanticresource, as well as for populating an existinglexical-semantic resource. As argued in (Boyd-Graber et al., 2006), even Princeton WordNet(henceforth PWN) (Miller, 1995) is noted for itssparsity of useful internal lexical-semantic rela-tions. A distributional thesaurus (Weeds et al.,2014), usually built with an automatic methodsuch as that described in (Rychly and Kilgar-riff, 2007), often comprises a disorganized seman-tic network internally, where a variety of lexical-semantic relations are incorporated without havingproper relation labels attached. These issues could

be addressed if an accurate method for classifyingthe type of lexical-semantic relation is available.

A number of research studies on the classifi-cation of lexical-semantic relationships have beenconducted. Among them, Necsulescu et al. (2015)recently presented two classification methods thatutilize word-level feature representations includ-ing word embedding vectors. Although the re-ported results are superior to the compared sys-tems, neither of the proposed methods exploited“sense representations,” which are described as thefine-grained representations of word senses, con-cepts, and entities in the description of this work-shop1.

Motivated by the above-described issues andprevious work, this paper proposes a supervisedclassification method that exploits sense represen-tations, and discusses their utilities in the lexicalrelation classification task. The major rationalesbehind the proposed method are: (1) a specifictype of lexical-semantic relation between a pair ofwords would be indicated by a carefully designedset of relation-specific similarities associated withthe words; and (2) the similarities could be ef-fectively computed by exploiting sense represen-tations.

More specifically, for each word in the pair, wefirst collect relevant sets of sense/concept nodes(node sets) from an existing lexical-semantic re-source (PWN), and then compute similarities forsome designated pairs of node sets, where eachnode is represented by an embedding vector de-pending on its type (sense/concept). In termsof its design, each node set pair is constructedsuch that it is associated with a specific typeof lexical-semantic relation. The resulting ar-ray of similarities, along with the underlyingword/sense/concept embedding vectors is finally

1https://sites.google.com/site/senseworkshop2017/background

37

fed into the classifier as features.The empirical results that use the BLESS

dataset (Baroni and Lenci, 2011) demonstrate thatour method clearly outperformed existing state-of-the-art methods (Necsulescu et al., 2015) that didnot employ sense/concept embeddings, confirm-ing that properly combining the similarity featuresalso with the underlying semantic/conceptual-level embeddings is indeed effective. These re-sults in turn highlight the utility of “the sense rep-resentations” (the sense/concept embeddings) cre-ated by the existing system referred to as AutoEx-tend (Rothe and Schutze, 2015).

The remainder of the paper first reviews relatedwork (section 2), and then presents our approach(section 3). As our experiments (section 4) uti-lize the BLESS dataset, the experimental resultsare directly compared with that of (Necsulescuet al., 2015) (section 5). Although our methodswere proved to be superior through the experi-ments, our operational requirement (sense/conceptembeddings should be created from the underly-ing lexical-semantic resource) could be problem-atic especially when having to process unknownwords. We conclude the present paper by dis-cussing future work to address this issue (section6).

2 Related work

A lexical-semantic relationship is a fundamentalrelationship that plays an important role in manyNLP applications. A number of research effortshave been devoted to developing an automated andaccurate method to type the relationship betweenan arbitrary pair of words. Most of these stud-ies (Fu et al., 2014; Kiela et al., 2015; Shwartzet al., 2016), however, concentrated on the hyper-nymy relation, since it is the most fundamentalrelationship that forms the core taxonomic struc-ture in a lexical-semantic resource. In compari-son, fewer studies considered a broader range oflexical-semantic relations, e.g., (Necsulescu et al.,2015) and our present work.

Lenci and Benotto (2012), among thehypernymy-centered researches, comparedexisting directional similarity measures (Kotler-man et al., 2010) to identify hypernyms, andproposed a new measure that slightly modifiedan existing measure. The rationale behind theirwork is: as hypernymy is a prominent asymmetricsemantic relation, it might be detected by the

higher similarity score yielded by an asymmetricsimilarity measure. Their idea of exploiting aspecific type of similarity to detect a specific typeof lexical-semantic relationship is highly feasible.

Recently, distributional and distributed wordrepresentations (word embeddings) have beenwidely utilized, partly because the offset vec-tor simply brought about by vector subtractionover word embeddings can capture some rela-tional aspects including a lexical-semantic rela-tionship. Given these useful resources, Weeds etal. (2014) presented a supervised classification ap-proach that employs a pair of distributional vec-tors for a given word pair as the feature, argu-ing that concatenation and subtraction were al-most equally effective vector operations. Simi-lar lines of work were presented by (Necsulescuet al., 2015) and (Vylomova et al., 2016): theformer suggested concatenation might be slightlysuperior to subtraction, whereas the latter espe-cially highlighted the subtraction. Here it shouldbe noted that Necsulescu et al. (2015) employedtwo kinds of vectors: one is a CBOW-based vec-tor (Mikolov et al., 2013b), and the other involvesword embeddings with a dependency-based skip-gram model (Levy and Goldberg, 2014).

The present work exploits semantic/conceptual-level embeddings, which were actually derivedby applying the AutoExtend (Rothe and Schutze,2015) system. Among the recent proposalsfor deriving semantic/conceptual-level embed-dings (Huang et al., 2012; Pennington et al., 2014;Neelakantan et al., 2014; Iacobacci et al., 2015),we adopt the AutoExtend system, since it ele-gantly exploits the network structure provided byan underlying semantic resource, and naturallyconsumes existing word embeddings. More im-portantly, the underlying word embeddings are di-rectly comparable with the derived sense represen-tations. In the present work, we applied the Au-toExtend system to the Word2Vec CBOW embed-dings (Mikolov et al., 2013b) by referring to PWNversion 3.0 as the underlying lexical-semantic re-source. As far as the authors know, AutoExtend-derived embeddings have been evaluated in thetasks of similarity measurements and word sensedisambiguation: they are yet to be applied to a se-mantic relation classification task.

There are a few datasets (Baroni and Lenci,2011; Santus et al., 2015) available that were pre-pared for the evaluation of lexical-semantic rela-

38

. . .

alligator1

alligator

alligator2

. . .

alligator.n.01 alligator.n.02

leather.n.01 crocodilian_reptile.n.01

reptile1

reptile

reptile.n.01

word

sense

concept

. . .

hypernym

Lalligator Lreptile

node set pair

Salligator Sreptile

Salligator

hyper

Figure 1: Creating node sets and node set pairs (in hypernymy relation).

tion classification tasks. We utilized the BLESSdataset (Baroni and Lenci, 2011) in order to di-rectly compare our proposed method with the re-lated methods given in (Necsulescu et al., 2015).

3 Proposed method

We adopt a supervised learning approach for clas-sifying the type of semantic relationship betweena pair of words (w1, w2), expecting that the plau-sibility of a specific semantic lexical-relationshipcan be measured by the similarity between thesenses/concepts associated with each of the words.

Figure 1 exemplifies the fundamental rationalebehind the proposed method. We assume that theplausibility of the hypernymy relation between “al-ligator” (w1) and “reptile” (w2) can be mainlymeasured by the similarity between the set of hy-pernym concepts of “alligator” (Shyper

w1 ) and theset of concepts of “reptile” (Sw2). Based on thisassumption, we calculate the similarities by thefollowing steps. Recall that these similarities areassumed to measure the plausibilities of relation-ships that could hold between a given word pair.

1. Collect pre-defined types of node sets foreach word (five types; detailed in section3.1).

2. Build some useful pairs of node sets by con-sidering the possible relationships assumed tobe held between the words (7 pair types; de-tailed in section 3.2).

3. Calculate the similarities for each node setpair by three types of calculation methods(detailed in section 3.3).

In total we calculate 21 (7 pairs × 3 meth-ods) similarities per word pair along with the co-sine similarity between word embeddings. We usethese similarities and vector pairs that yielded thesimilarities as feature.

3.1 Collecting node sets for each word

By consulting PWN, we collect the following fivetypes of node sets for each word. These node settypes are selected so as to characterize relevantlexical-semantic relationships in the target inven-tory detailed in section 4.1.

• Lw: a set of senses that a word w has

• Sw: a set of concepts each denoted by a mem-ber of Lw

• Shyperw : a set of concepts whose member is

directly linked from a member of Sw by thePWN hypernymy relation

• Sattriw : a set of concepts whose member is

directly linked from a member of Sw by thePWN attribute relation

• Smerow : a set of concepts whose member is

directly linked from a member of Sw by thePWN meronymy relation

39

3.2 Building node set pairs

Given a pair of words (w1, w2), we build seventypes of node set pairs as shown in Table 1. Eachrow in the table defines the combination of nodesets and presents the associated mnemonic.

Table 1: Node set pairs built for (w1, w2).Node set pair MnemonicLw1 Lw2 senseSw1 Sw2 concept

Shyperw1 Sw2 hyper

Shyperw1 Shyper

w2 coordSattri

w1Sw2 attri 1

Sw1 Sattriw2

attri 2Smero

w1Sw2 mero

These types of node set pairs are defined in ex-pecting that:

• sense, concept: captures semantic similar-ity/relatedness between the words;

• hyper: captures hypernymy relation betweenthe words;

• coord: dictates if the words share a commonhypernym;

• attri 1, attri 2: dictates if w1 describes someaspect of w2 (attri 1) or vice versa (attri 2);

• mero: captures the meronymy relation be-tween the words.

Note that the italicized words indicate lexicalrelationships often used in linguistic literature.

3.3 Similarity calculation

In a pair of node sets, each node set could havea different number of elements, meaning that wecannot apply element-wise computation (e.g., co-sine) for measuring the similarity between thenode sets. We thus propose the following threesimilarity calculation methods and compare themin the experiments.

In the following formulations: c indicates a cer-tain node set pair type defined in Table 1; (Xw1 ,Xw2) is the node set pair for (w1, w2) specifiedby c; and sim( ~x1, ~x2) is the cosine similarity be-tween ~x1 and ~x2.

simcmax method:

simcmax(w1, w2) = max

x1∈Xw1 ,x2∈Xw2

sim( ~x1, ~x2)

(1)As the formula defines, this method selects a com-bination of the node sets that yield the maximumsimilarity, implying that it achieves a disambigua-tion functionality.

The vector pair from the most similar node sets( ~x1, ~x2) is also used as feature. The actual usageof this pair in the experiments is detailed in section4.2.

simcsum method:

simcsum(w1, w2) = sim(

∑x1∈Xw1

~x1,∑

x2∈Xw2

~x2)

(2)As defined by the formula, this method firstly

makes a holistic meaning representation by sum-ming all embeddings of the nodes contained ineach node set. We devised this method with theexpectation that it could dictate semantic related-ness rather than semantic similarity (Budanitskyand Hirst, 2006). The pair of the summed embed-dings (

∑x1∈Xw1

~x1,∑

x2∈Xw2~x2) is also used as

feature.

simcmed method:

simcmed(w1, w2) = median

x1∈Xw1 ,x2∈Xw2

sim( ~x1, ~x2)

(3)The method is expected to express the similar-ity between mediated representations of each nodeset. Instead of the arithmetic average, we employthe median to select a representative node in eachnode set, allowing us to use the associated vectorpair as feature.

4 Experiments

We evaluated the effectiveness of the proposed su-pervised approach by conducting a series of clas-sification experiments using the BLESS dataset(Baroni and Lenci, 2011). Among the possi-ble learning algorithms, we adopted the RandomForestTMalgorithm as it maintains a balance be-tween performance and efficiency. The resultsare assessed by using standard measures such asPrecision (P ), Recall (R), and F1. We em-ployed the pre-trained Word2Vec embeddings2.

2300-dimensional vectors by CBoW, availableat https://code.google.com/archive/p/word2vec/

40

We trained sense/concept embeddings by apply-ing the AutoExtend system3 (Rothe and Schutze,2015) while using the Word2Vec embeddings asthe input and consulting PWN 3.0 as the underly-ing lexical-semantic resource.

4.1 DatasetWe utilized the BLESS dataset, which was de-veloped for the evaluation of distributional se-mantic models. It provides 14,400 tetrads of(w1, w2, lexical-semantic relation type, topical do-main type): where the topical domain type des-ignates a semantic class from the coarse semanticclassification system consisting of 17 English con-crete noun categories (e.g., tools, clothing, vehi-cles, and animals). The lexical-semantic relationtypes defined in BLESS and their counts are de-scribed as follows:

• COORD (3565 word pairs): they are co-hyponyms (e.g., alligator-lizard).

• HYPER (1337 word pairs): w2 is a hypernymof w1 (e.g., alligator-animal).

• MERO (2943 word pairs): w2 is a com-ponent/organ/member of w1 (e.g., alligator-mouth).

• ATTRI (2731 word pairs): w2 is an adjectiveexpressing an attribute of w1 (e.g., alligator-aquatic).

• EVENT (3824 word pairs): w2 is a verb re-ferring to an action/activity/happening/eventassociated with w1 (e.g., alligator-swim).

Note here that these lexical-semantic relationtypes are not completely concord with the PWNrelations described in section 3.1.

Data division: In order to compare the per-formance for the present task we divided thedata in three ways: In-domain, Out-of-domain(as employed in (Necsulescu et al., 2015)), andCollapsed-domain. For the In-domain setting, thedata in the same domain were used both for train-ing and testing. We thus conducted a five-foldcross validation for each domain. For the Out-of-domain setting, one domain is used for testing andthe remaining data is used for training. In addi-tion, we prepared the Collapsed-domain setting,where we conducted a 10-fold cross validation forthe entire dataset irrespective of the domain.

3The default hyperparameters were used.

4.2 Comparing methodsA supervised relation classification systemreferred to as WECE (Word EmbeddingsClassification systEm) in (Necsulescu et al.,2015) was especially chosen for comparisons,since this method combines and uses the wordembeddings of a given word pair (w1, w2) asfeature. They compare two types of approachesdescribed as WECEoffset and WECEconcat:WECEoffset uses the offset of the wordembeddings ( ~w2 − ~w1) as the feature vector,whereas WECEconcat uses the concatenationof the word embeddings. Moreover, they usetwo types of word embeddings: a bag-of-wordsmodel (BoW ) (Mikolov et al., 2013a) anda dependency-based skip-gram model (Dep)(Levy and Goldberg, 2014). In summary, theWECE system has the following variations:WECEoffset

BoW , WECEoffsetDep , WECEconcat

BoW andWECEconcat

Dep .As described in Section 3, we utilize 22 kinds

of similarity and the underlying vector as fea-tures. In order to make reasonable comparisons,we compare two vector composition methods. Inaddition to the already described array of similari-ties, the Proposalconcat method uses the concate-nated vector of the underlying vectors, whereasthe Proposalconcat method employs the differ-ence vector. As a result, the dimensionalities ofthe resulting vectors employed in these methodsare 13,222 (22 similarities + 22× 600 dimensionsfor concatenated vectors) and 6,622 (22 similari-ties + 22× 300 dimensions for difference vectors),respectively.

Baseline: As detailed in section 3, our meth-ods utilize PWN neighboring concepts linked byparticular lexical-semantic relationships, such as(hypernymy, attribute, and meronymy).We thus set the baseline as follows while respect-ing the direct relational links defined in PWN.

• Given a word pair (w1, w2), if any concept inSw1 and that in Sw2 are directly linked by acertain relationship in PWN, let w1 and w2 bein the relation.

Note that the baseline method cannot find anyword pair that is annotated to have the EVENTrelation in the BLESS dataset, because there areno links in PWN that share the same or a simi-lar definition. Likewise it is not capable for themethod to find any word pair with the ATTRI

41

In-domain Out-of-domain Collapsed-domainP R F1 P R F1 P R F1

WECEoffsetBoW 0.900 0.909 0.904 0.680 0.669 0.675 - - -

WECEoffsetDep 0.853 0.865 0.859 0.687 0.623 0.654 - - -

Proposaloffset 0.913 0.907 0.906 0.766 0.762 0.753 0.867 0.867 0.865WECEconcat

BoW 0.899 0.910 0.904 0.838 0.570 0.678 - - -WECEconcat

Dep 0.859 0.870 0.865 0.782 0.638 0.703 - - -Proposalconcat 0.973 0.971 0.971 0.839 0.819 0.812 0.970 0.970 0.970

Table 2: Comparison of the overall classification results.

relation in BLESS, because the definition of theattribute relationship in PWN differs fromthe definition of the ATTRI relation in the BLESSdataset. Thus, we can only compare the results forthe relationships COORD, HYPER, and MEROwith the common measures, P , R, and F1.

5 Results

5.1 Major results

Table 2 compares our results with that of theWECE systems in the three data set divisions(In/Out/Collapsed domains). The results show thatProposalconcat performed best in all measures ofeach division (shown in bold font). We observetwo common trends across the approaches includ-ing WECE: (1) Every score in the Out-of-domainsetting was lower than that in the In-domain set-ting; and (2) The methods using vector concatena-tion achieved higher scores than those using vectoroffsets. The former trend is reasonable, since in-formation that is also more relevant to the test datais contained in the training data in the In-domainsettings. The latter trend suggests that concate-nated vectors may be more informative than off-set vectors, supporting the conclusion presented in(Necsulescu et al., 2015).

Nevertheless, the results in the table clearlyshow that Proposaloffset outperformed bothWECEconcat

BoW and WECEconcatDep not only in Pre-

cision but also in Recall in the Out-of-domainsetting. This may confirm that sense represen-tations, acquired by exploiting a richer structureencoded in PWN, are richer in semantic contentthan word embeddings learned from textual cor-pora, and hence, even the offset vectors are capa-ble of abstracting some characteristics of potentiallexical-semantic relations between a word pair ef-fectively.

Table 3 breaks down the results obtained

Relationship P R F1COORD 0.761 0.559 0.645by Baseline 0.550 0.108 0.180HYPER 0.767 0.654 0.706by Baseline 0.746 0.199 0.314MERO 0.625 0.809 0.705by Baseline 0.934 0.034 0.065ATTRI 0.913 0.995 0.952EVENT 0.974 0.983 0.979

Table 3: Breakdown of the results obtained byProposalOoD

concat. The Baseline results are shownin italics.

by Proposalconcat in the Out-of-domain setting(ProposalOoD

concat), and compares them with theBaseline results, showing that Proposalconcat

clearly outperformed the Baseline in Recall andF1. This clearly confirms that the direct relationallinks defined in PWN are insufficient for classify-ing the BLESS relationships. With respect to theinternal comparison of the ProposalOoD

concat results,a prominent fact is the high-performance classi-fication of the ATTRI and EVENT relationships.By definition, these relationships link a noun to anadjective (ATTRI) or to a verb (EVENT), whereasthe COORD, HYPER, and MERO relationshipsconnect a noun to another noun. This may sug-gest that the information carried by part-of-speechplays a role in this classification task.

Table 4 further details the results obtained byProposalOoD

concat by showing the confusion ma-trix, also endorsing that the fine-grained classifi-cation of inter-noun relationships (COORD, HY-PER, and MERO) is far more difficult than dis-tinguishing cross-POS relationships (ATTRI andEVENT). In particular, as suggested in (Shwartzet al., 2016), synonymy is difficult to distinguishfrom hypernymy even by humans.

42

HYPER COORD ATTRI MERO EVENTHYPER 875 157 10 282 13COORD 189 1994 220 1118 44ATTRI 1 13 2716 1 0MERO 74 423 24 2380 42EVENT 2 32 5 27 3758

Table 4: Confusion matrix for the results by ProposalOoDconcat in the Out-of-domain setting.

5.2 Ablation tests

This section more closely considers the re-sults in the Out-of-domain setting. Table 5shows the results of the ablation tests for theProposalOoD

concat setting, comparing the effective-ness of the source of the similarities. Eachrow other than ProposalOoD

concat displays the re-sults when the designated feature is ablated. The−word row shows the result when ablating the601-dimensional features created from the pairof word embeddings, and the other rows showthe results when ablating the corresponding 1803-dimensional features (three similarities and thethree vector pairs that yielded the similarities) gen-erated from each node set pair.

P R F1 F1 diffProposalOoD

concat 0.839 0.819 0.812 -−word 0.845 0.827 0.819 0.008−sense 0.833 0.815 0.806 -0.006−concept 0.826 0.809 0.802 -0.010−coord 0.834 0.811 0.803 -0.009−hyper 0.826 0.803 0.800 -0.012−attri1 0.826 0.806 0.798 -0.014−attri2 0.842 0.820 0.814 0.002−mero 0.835 0.813 0.806 -0.006

Table 5: Ablation tests comparing the effective-ness of each node set.

This table suggests that the concept, hyper, andattri1 node set pairs are effective, as indicatedby the relatively large decreases in F1. Surpris-ingly, however, the features generated from theword embeddings affected the performance. Thisimplies that abstract-level semantics encoded insense/concept embeddings are more robust in theclassification of the target lexical-semantic rela-tionships. However, the utility of sense embed-dings was modest.This may result from the learn-ing method in AutoExtend: it tries to split a wordembedding into the senses’ embeddings withoutconsidering the virtual distribution of senses in the

Word2Vec training corpus. It is a potential futurework to address this issue.

Table 6 compares the effectiveness of the typesof features. The only similarities row shows the re-sults when ablating the vectoral features and onlyusing the 22-dimensional similarity features (21semantic/conceptual-level similarities along witha word-level similarity). On the other hand, theonly vector pairs row shows the results from theadverse setting, using the 22 vector pairs (using13,200-dimensional features).


concat 0.839 0.819 0.812 -only similarities 0.704 0.687 0.683 -0.129only vector pairs 0.834 0.812 0.804 -0.007

Table 6: Effectiveness comparison of the types offeatures.

It is shown that using vectorial features wouldproduce more accurate results than simply usingthe similarity features, confirming the general as-sumption: more features yield more accurate re-sults. However, we would have to emphasize that,even only with the similarity features, our ap-proach outperformed the comparable method inRecall (shown in the Out-of domain columns ofTable 2 ).

Table 7 shows the results of the other abla-tion tests, comparing the effectiveness of the sim-ilarity calculation methods. Each row in the ta-ble displays the result when ablating the 4207-dimensional features (seven similarities plus sevenvector pairs that yielded these similarities).

As the results in the table show, the F1 scoresdid not change significantly in each ablated con-dition, showing that the effect provided by the ab-lated method is completed by the remaining meth-ods. There exists some redundancy in preparingthese three calculation methods.

43


concat 0.839 0.819 0.812 -−simmax 0.835 0.812 0.805 -0.007−simsum 0.843 0.822 0.816 0.004−simmed 0.838 0.811 0.805 -0.007

Table 7: Additional ablation tests comparing thesimilarity calculation methods.

6 Discussion

This section discusses two issues: the first is asso-ciated with the usage of PWN in the experimentsusing BLESS, and the other is concerned with the“lexical memorization” problem.

Usage of PWN: As detailed in Section 3, ourmethods utilize neighboring concepts linkedby the particular lexical-semantic relationshypernymy, attribute, and meronymydefined in PWN. Some may consider that theHYPER, ATTRI, and MERO relationships canbe estimated simply by consulting the above-mentioned PWN relationships. However, this isdefinitely NOT the case, since almost all of thesemantic relation instances in the BLESS datasetare not immediately defined in PWN: Among the14,400 BLESS instances, only 951 are definedin PWN. For an obvious example, there are nolinks in PWN that are labeled event, which is atype of semantic relation defined in BLESS. Thelow Recall results presented in Table 3 endorsedthis fact, and clearly show the sparsity of usefulsemantic links in PWN. However, for some ofthe lexical-semantic relation types that exhibittransitivity, such as hypernymy, consulting thePWN indirect links could be effectively utilizedto improve the results.

Lexical memorization: Levy et al. (2015) re-cently argued that the results achieved by many su-pervised methods are inflated because of the “lex-ical memorization” effect, which only learns “pro-totypical hypernymy” relations. For example, if aclassifier encounters many positive examples suchas (X, animals), where X is a hyponym of animal(e.g., dog, cat, ...), it only learns to classify anyword pair as a hypernym as far as the second wordis “animal.” We argue that our method can be ex-pected to be relatively free from this issue. Thesimilarity features are not affected by this effect,since any similarity calculation is a symmetric op-eration, and independent of word order. More-

over, the simcmax or simc

med method selects a pairof sense/concept embeddings, where the combi-nation usually differs depending on the combi-nation of node sets. On one hand, the simc

sum

method could be affected by the memorization ef-fect, since the vectorial feature for the prototypicalhypernym is invariable.

7 Conclusion

This paper proposed a method for classifying thetype of lexical-semantic relation that could holdbetween a given pair of words. The empiricalresults clearly outperformed those previously ob-tained with the state-of-the-art results, demonstrat-ing that our rationales behind the proposal arevalid. In particular, it would be reasonable toassume that the plausibility of a specific type oflexical-semantic relation between the words couldbe chiefly recovered by a carefully designed setof relation-specific similarities. These results alsohighlight the utility of “the sense representations,”since our similarity calculation methods rely onthe sense/concept embeddings created by the Au-toExtend system.

Future work could follow two directions. First,we need to improve the classification of inter-nounsemantic relations. We may particularly need todistinguish the hypernym relationship, which isasymmetric, from the symmetric coordinate rela-tionship. In this regard, we would need to improvethe creation of node sets and the combinations tocapture the innate difference of the relationships.

Second, we need to address the potential draw-back of our proposal, which comes from our oper-ational requirement: a lack of sense/concept em-beddings is crucial, as we cannot collect relevantnode sets in this case. Therefore, we need to de-velop a method to assign some of the existing con-cepts to an unknown word, which is not containedin PWN, by seeking the nearest concept in the re-source. A possible method would first seek thenearest concept in the underlying lexical-semanticresource for an unknown word, and then induce arevised set of sense/concept embeddings by itera-tively applying the AutoExtend system.

Acknowledgments

The present research was supported by JSPSKAKENHI Grant Number JP26540144 andJP25280117.

44

ReferencesMarco Baroni and Alessandro Lenci. 2011. How we

blessed distributional semantic evaluation. In Pro-ceedings of the GEMS 2011 Workshop on GEomet-rical Models of Natural Language Semantics, pages1–10. Association for Computational Linguistics.

Jordan Boyd-Graber, Christiane Fellbaum, Daniel Os-herson, and Robert Schapire. 2006. Adding dense,weighted connections to wordnet. In Proceedingsof the third international WordNet conference, pages29–36.

Alexander Budanitsky and Graeme Hirst. 2006. Eval-uating wordnet-based measures of lexical semanticrelatedness. Computational Linguistics, 32(1):13–47.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, HaifengWang, and Ting Liu. 2014. Learning semantic hier-archies via word embeddings. In Proceedings of the52nd Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages1199–1209, Baltimore, Maryland, June. Associationfor Computational Linguistics.

Eric Huang, Richard Socher, Christopher Manning,and Andrew Ng. 2012. Improving word represen-tations via global context and multiple word proto-types. In Proceedings of the 50th Annual Meeting ofthe Association for Computational Linguistics (Vol-ume 1: Long Papers), pages 873–882, Jeju Island,Korea, July. Association for Computational Linguis-tics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, andRoberto Navigli. 2015. Sensembed: Learningsense embeddings for word and relational similarity.In Proceedings of the 53rd Annual Meeting of theAssociation for Computational Linguistics and the7th International Joint Conference on Natural Lan-guage Processing (Volume 1: Long Papers), pages95–105, Beijing, China, July. Association for Com-putational Linguistics.

Douwe Kiela, Felix Hill, and Stephen Clark. 2015.Specializing word embeddings for similarity or re-latedness. In Proceedings of the 2015 Conference onEmpirical Methods in Natural Language Process-ing, pages 2044–2048, Lisbon, Portugal, September.Association for Computational Linguistics.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and MaayanZhitomirsky-Geffet. 2010. Directional distribu-tional similarity for lexical inference. Natural Lan-guage Engineering, 16(04):359–389.

Alessandro Lenci and Giulia Benotto. 2012. Identify-ing hypernyms in distributional semantic spaces. In*SEM 2012: The First Joint Conference on Lexicaland Computational Semantics – Volume 1: Proceed-ings of the main conference and the shared task, andVolume 2: Proceedings of the Sixth InternationalWorkshop on Semantic Evaluation (SemEval 2012),

pages 75–79, Montreal, Canada, 7-8 June. Associa-tion for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52ndAnnual Meeting of the Association for Computa-tional Linguistics (Volume 2: Short Papers), pages302–308, Baltimore, Maryland, June. Associationfor Computational Linguistics.

Omer Levy, Steffen Remus, Chris Biemann, and IdoDagan. 2015. Do supervised distributional meth-ods really learn lexical inference relations? In Pro-ceedings of the 2015 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies, pages970–976, Denver, Colorado, May–June. Associationfor Computational Linguistics.


Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013b. Distributed represen-tations of words and phrases and their composition-ality. In Advances in neural information processingsystems, pages 3111–3119.


Silvia Necsulescu, Sara Mendes, David Jurgens, NuriaBel, and Roberto Navigli. 2015. Reading betweenthe lines: Overcoming data sparsity for accurateclassification of lexical relationships. In Proceed-ings of the Fourth Joint Conference on Lexical andComputational Semantics, pages 182–192, Denver,Colorado, June. Association for Computational Lin-guistics.

Arvind Neelakantan, Jeevan Shankar, Alexandre Pas-sos, and Andrew McCallum. 2014. Efficientnon-parametric estimation of multiple embeddingsper word in vector space. In Proceedings of the2014 Conference on Empirical Methods in Natu-ral Language Processing (EMNLP), pages 1059–1069, Doha, Qatar, October. Association for Com-putational Linguistics.

Jeffrey Pennington, Richard Socher, and ChristopherManning. 2014. Glove: Global vectors for wordrepresentation. In Proceedings of the 2014 Con-ference on Empirical Methods in Natural LanguageProcessing (EMNLP), pages 1532–1543, Doha,Qatar, October. Association for Computational Lin-guistics.

Sascha Rothe and Hinrich Schutze. 2015. Autoex-tend: Extending word embeddings to embeddingsfor synsets and lexemes. In Proceedings of the53rd Annual Meeting of the Association for Compu-tational Linguistics and the 7th International Joint

45

Conference on Natural Language Processing (Vol-ume 1: Long Papers), pages 1793–1803, Beijing,China, July. Association for Computational Linguis-tics.

Pavel Rychly and Adam Kilgarriff. 2007. An ef-ficient algorithm for building a distributional the-saurus (and other sketch engine developments). InProceedings of the 45th Annual Meeting of the As-sociation for Computational Linguistics CompanionVolume Proceedings of the Demo and Poster Ses-sions, pages 41–44, Prague, Czech Republic, June.Association for Computational Linguistics.

Enrico Santus, Frances Yung, Alessandro Lenci, andChu-Ren Huang. 2015. Evalution 1.0: an evolvingsemantic dataset for training and evaluation of dis-tributional semantic models. In Proceedings of the4th Workshop on Linked Data in Linguistics (LDL-2015), pages 64–69.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016.Improving hypernymy detection with an integratedpath-based and distributional method. In Proceed-ings of the 54th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 2389–2398, Berlin, Germany, August.Association for Computational Linguistics.

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, andTimothy Baldwin. 2016. Take and took, gaggleand goose, book and read: Evaluating the utility ofvector differences for lexical relation learning. InProceedings of the 54th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long Papers), pages 1671–1682, Berlin, Germany,August. Association for Computational Linguistics.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir,and Bill Keller. 2014. Learning to distinguish hy-pernyms and co-hyponyms. In Proceedings of COL-ING 2014, the 25th International Conference onComputational Linguistics: Technical Papers, pages2249–2259, Dublin, Ireland, August. Dublin CityUniversity and Association for Computational Lin-guistics.

46


Supervised and unsupervised approaches to measuring usage similarity

Milton King and Paul CookFaculty of Computer Science, University of New Brunswick

Fredericton, NB E3B 5A3, [email protected], [email protected]

Abstract

Usage similarity (USim) is an approach todetermining word meaning in context thatdoes not rely on a sense inventory. Instead,pairs of usages of a target lemma are ratedon a scale. In this paper we propose un-supervised approaches to USim based onembeddings for words, contexts, and sen-tences, and achieve state-of-the-art resultsover two USim datasets. We further con-sider supervised approaches to USim, andfind that although they outperform unsu-pervised approaches, they are unable togeneralize to lemmas that are unseen in thetraining data.

1 Usage similarity

Word senses are not discrete. In many cases, fora given instance of a word, multiple senses froma sense inventory are applicable, and to varyingdegrees (Erk et al., 2009). For example, considerthe usage of wait in the following sentence takenfrom Jurgens and Klapaftis (2013):

1. And is now the time to say I can hardlywait for your impending new novel about theAlamo?

Annotators judged the WordNet (Fellbaum, 1998)senses glossed as ‘stay in one place and antici-pate or expect something’ and ‘look forward to theprobable occurrence of’, to have applicability rat-ings of 4 out of 5, and 2 out of 5, respectively, forthis usage of wait. Moreover, Erk et al. (2009)also showed that this issue cannot be addressedsimply by choosing a coarser-grained sense inven-tory. That a clear line cannot be drawn between thevarious senses of a word has been observed as farback as Johnson (1755). Some have gone so far as

to doubt the existence of word senses (Kilgarriff,1997).

Sense inventories also suffer from a lack ofcoverage. New words regularly come into us-age, as do new senses for established words. Fur-thermore, domain-specific senses are often not in-cluded in general-purpose sense inventories. Thisissue of coverage is particularly relevant for socialmedia text, which contains a higher rate of out-of-vocabulary words than more-conventional texttypes (Baldwin et al., 2013).

These issues pose problems for natural lan-guage processing tasks such as word sense disam-biguation and induction, which rely on, and seekto induce, respectively, sense inventories, and havetraditionally assumed that each instance of a wordcan be assigned one sense.1 In response to this,alternative approaches to word meaning have beenproposed that do not rely on sense inventories. Erket al. (2009) carried out an annotation task on “us-age similarity” (USim), in which the similarity ofthe meanings of two usages of a given word arerated on a five-point scale.

Lui et al. (2012) proposed the first computa-tional approach to USim. They considered ap-proaches based on topic modelling (Blei et al.,2003), under a wide range of parameter settings,and found that a single topic model for all tar-get lemmas (as opposed to one topic model pertarget lemma) performed best on the dataset ofErk et al. (2009). Gella et al. (2013) consideredUSim on Twitter text, noting that this model ofword meaning seems particularly well-suited tothis text type because of the prevalence of out-of-vocabulary words. Gella et al. (2013) also con-sidered topic modelling-based approaches, achiev-ing their best results using one topic model per

1Recent word sense induction systems and evaluationshave, however, considered graded senses and multi-sense ap-plicability (Jurgens and Klapaftis, 2013).

47

target word, and a document expansion strategybased on medium frequency hashtags to combatthe data sparsity of tweets due to their relativelyshort length. The methods of Lui et al. (2012) andGella et al. (2013) are unsupervised; they do notrely on any gold standard USim annotations.

In this paper we propose unsupervised ap-proaches to USim based on embeddings for words(Mikolov et al., 2013; Pennington et al., 2014),contexts (Melamud et al., 2016), and sentences(Kiros et al., 2015), and achieve state-of-the-artresults over the USim datasets of both Erk et al.(2009) and Gella et al. (2013). We then considersupervised approaches to USim based on thesesame methods for forming embeddings, whichoutperform the unsupervised approaches, but per-form poorly on lemmas that are unseen in thetraining data.

2 USim models

In this section we describe how we represent a tar-get word usage in context, and then how we usethese representations in unsupervised and super-vised approaches to USim.

2.1 Usage representation

We consider four ways of representing an instanceof a target word based on embeddings for words,contexts, and sentences. For word embeddings,we consider word2vec (Mikolov et al., 2013) andGloVe (Pennington et al., 2014). In each case werepresent a token instance of the target word in asentence as the average of the word embeddingsfor the other words occurring in the sentence, ex-cluding stopwords.

Context2vec (Melamud et al., 2016) can beviewed as an extension of word2vec’s continuousbag-of-words (CBOW) model. In CBOW, the con-text of a target word token is represented as the av-erage of the embeddings for words within a fixedwindow. In contrast, context2vec uses a richer rep-resentation based on a bidirectional LSTM cap-turing the full sentential context of a target wordtoken. During training, context2vec embeds thecontext of word token instances in the same vec-tor space as word types. As this model explicitlyembeds word contexts it seems particularly well-suited to USim.

Kiros et al. (2015) proposed skip-thoughts, asentence encoder that can be viewed as a sentence-level version of word2vec’s skipgram model, i.e.,

during training, the encoding of a sentence isused to predict surrounding sentences. Kiros etal. (2015) showed that skip-thoughts out-performsprevious approaches to measuring sentence-levelrelatedness. Although our goal is to determine themeaning of a word in context, the meaning of asentence could be a useful proxy for this.2

2.2 Unsupervised approach

In the unsupervised setup, we measure the simi-larity between two usages of a target word as thecosine similarity between their vector representa-tions, obtained by one of the methods describedin Section 2.1. This method does not require goldstandard training data.

2.3 Supervised approach

We also consider a supervised approach. For agiven pair of token instances of a target word, t1and t2, we first form vectors v1 and v2 represent-ing each of the two instances of the target, usingone of the approaches in Section 2.1. To representeach pair of instances, we follow the approach ofKiros et al. (2015). We compute the componen-twise product, and absolute difference, of v1 andv2, and concatenate them. This gives a vector oflength 2d — where d is the dimensionality of theembeddings used — representing each pair of in-stances. We then train ridge regression to learna model to predict the similarity of unseen usagepairs.

3 Materials and methods

3.1 USim Datasets

We evaluate our methods on two USim datasetsrepresenting two different text types: ORIGINAL,the USim dataset of Erk et al. (2009), and TWIT-TER from Gella et al. (2013). Both USim datasetscontain pairs of sentences; each sentence in eachpair includes a usage of a particular target lemma.Each sentence pair is rated on a scale of 1–5 forhow similar in meaning the usages of the targetwords are in the two sentences.

ORIGINAL consists of sentences from Mc-Carthy and Navigli (2007), which were drawnfrom a web corpus (Sharoff, 2006). This datasetcontains 34 lemmas, including nouns, verbs, ad-jectives, and adverbs. Each lemma is the target

2Inference requires only a single sentence, so the modelcan infer skip-thought vectors for sentences taken out-of-context, as in the USim datasets.

48

word in 10 sentences. For each lemma, sentencepairs (SPairs) are formed based on all pairwisecomparisons, giving 45 SPairs per lemma. An-notations were provided by three native Englishspeakers, with the average taken as the final goldstandard similarity. In a small number of cases theannotators were unable to judge similarity. Erk etal. (2009) removed these SPairs from the dataset,resulting in a total of 1512 SPairs.

TWITTER contains SPairs for ten nouns fromORIGINAL. In this case the “sentences” are infact tweets. 55 SPairs are provided for each noun.Unlike ORIGINAL, the SPairs are not formed onthe basis of all pairwise comparisons amongst asmaller set of sentences. This dataset was anno-tated via crowd sourcing and carefully cleaned toremove outlier annotations.

3.2 Evaluation

Following Lui et al. (2012) and Gella et al. (2013)we evaluate our systems by calculating Spear-man’s rank correlation coefficient between thegold standard similarities and the predicted simi-larities. This enables direct comparison of our re-sults with those reported in these previous studies.

We evaluate our supervised approaches usingtwo cross-validation methodologies. In the firstcase we apply 10-fold cross-validation, randomlypartitioning all SPairs for all lemmas in a givendataset. Using this approach, the test data for agiven fold consists of SPairs for target lemmasthat were seen in the training data. To determinehow well our methods generalize to unseen lem-mas, we consider a second cross-validation setupin which we partition the SPairs in a given datasetby lemma. Here the test data for a given fold con-sists of SPairs for one lemma, and the training dataconsists of SPairs for all other lemmas.

3.3 Embeddings

We train word2vec’s skipgram model on twocorpora:3 (1) a corpus of English tweets col-lected from the Twitter Streaming APIs4 fromNovember 2014 to March 2015 containing 1.3 bil-lion tokens; and (2) an English Wikipedia dumpfrom 1 September 2015 containing 2.6 billion to-kens. Because of the relatively-low cost of train-ing word2vec, we consider several settings of

3In preliminary experiments the alternative word2vecCBOW model achieved substantially lower correlations thanskipgram, and so CBOW was not considered further.

4https://dev.twitter.com/

D W ORIGINAL TWITTER

50 2 0.251 0.24650 5 0.262 0.27250 8 0.286 0.282

100 2 0.267 0.248100 5 0.273 0.253100 8 0.273 0.298300 2 0.275 0.266300 5 0.279 0.295300 8 0.281 0.300

Table 1: Spearman’s ρ on each dataset usingthe unsupervised approach with word2vec embed-dings trained using several settings for the numberof dimensions (D) and window size (W ). The bestρ for each dataset is shown in boldface.

window size (W=2,5,8) and number of dimen-sions (D=50,100,300). Embeddings trained onWikipedia and Twitter are used for experiments onORIGINAL and TWITTER, respectively.

For the other embeddings we use pre-trainedmodels. We use GloVe vectors from Wikipediaand Twitter, with 300 and 200 dimensions, for ex-periments on ORIGINAL and TWITTER, respec-tively.5 For context2vec we use a 600 dimen-sional model trained on the ukWaC (Ferraresi etal., 2008), a web corpus of approximately 2 bil-lion tokens.6 We use a skip-thoughts model with4800 dimensions, trained on a corpus of books.7

We use these context2vec and skip-thoughts mod-els for experiments on both ORIGINAL and TWIT-TER.

4 Experimental results

We first consider the unsupervised approach usingword2vec for a variety of window sizes and num-ber of dimensions. Results are shown in Table 1.All correlations are significant (p < 0.05). Onboth ORIGINAL and TWITTER, for a given num-ber of dimensions, as the window size is increased,ρ increases. Embeddings for larger window sizestend to better capture semantics, whereas em-beddings for smaller window sizes tend to bet-ter reflect syntax (Levy and Goldberg, 2014); the

5http://nlp.stanford.edu/projects/glove/

6https://github.com/orenmel/context2vec

7https://github.com/ryankiros/skip-thoughts

49

Dataset Embeddings Unsupervised SupervisedAll Lemma

ORIGINAL

Word2vec 0.281* 0.435* 0.220*GloVe 0.218* 0.410* 0.230*

Skip-thoughts 0.177* 0.436* 0.099*Context2vec 0.302* 0.417* 0.172*

TWITTER

Word2vec 0.300* 0.384* 0.196*GloVe 0.122* 0.314* 0.134*

Skip-thoughts 0.095* 0.360* 0.058Context2vec 0.122* 0.193* 0.067

Table 2: Spearman’s ρ on each dataset usingthe unsupervised method, and supervised meth-ods with cross-validation folds based on randomsampling across all lemmas (All) and holding outindividual lemmas (Lemma), for each embeddingapproach. The best ρ for each experimental setup,on each dataset, is shown in boldface. Significantcorrelations (p < 0.05) are indicated with *.

more-semantic embeddings given by larger win-dow sizes appear to be better-suited to the taskof predicting USim. For a given window size, ahigher number of dimensions also tends to achievehigher ρ. For example, for a given window size,D = 300 gives a higher ρ than D = 50 in eachcase, except for W = 8 on ORIGINAL.

The best correlations reported by Lui et al.(2012) on ORIGINAL, and Gella et al. (2013)on TWITTER, were 0.202 and 0.29, respectively.The best parameter settings for our unsupervisedapproach using word2vec embeddings achievehigher correlations, 0.286 and 0.300, on ORIGI-NAL and TWITTER, respectively. Lui et al. (2012)and Gella et al. (2013) both report drastic vari-ation in performance for different settings of thenumber of topics in their models. We also observesome variation with respect to parameter settings;however, any of the parameter settings consideredachieves a higher correlation than Lui et al. (2012)on ORIGINAL. For TWITTER, parameter settingswith W ≥ 5 and D ≥ 100 achieve a correlationcomparable to, or greater than, the best reportedby Gella et al. (2013)

We now consider the unsupervised approach,using the other embeddings. Based on the previ-ous findings for word2vec, we only consider thismodel with W = 8 and D = 300 here. Resultsare shown in Table 2 in the column labeled “Unsu-pervised”. For ORIGINAL, context2vec performsbest (and indeed outperforms word2vec for all pa-rameter settings considered). This result demon-strates that approaches to predicting USim that ex-plicitly embed the context of a target word can

outperform approaches based on averaging wordembeddings (i.e., word2vec and GloVe) or em-bedding sentences (skip-thoughts). This result isparticularly strong because we consider a rangeof parameter settings for word2vec, but only usedthe default settings for context2vec.8 Word2vecdoes however perform best on TWITTER. Therelatively poor performance of context2vec andskip-thoughts here could be due to differences be-tween the text types these embedding models weretrained on and the evaluation data. GloVe per-forms poorly, even though it was trained on tweetsfor these experiments, but that it performs lesswell than word2vec is consistent with the findingsfor ORIGINAL.

Turning to the supervised approach, we firstconsider results for cross-validation based on ran-domly partitioning all SPairs in a dataset (column“All” in Table 2). The best correlation on TWIT-TER (0.384) is again achieved using word2vec,while the best correlation on ORIGINAL (0.434)is obtained with skip-thoughts. The difference inperformance amongst the various embedding ap-proaches is, however, somewhat less here than inthe unsupervised setting. For each embedding ap-proach, and each dataset, the correlation in thesupervised setting is better than that in the unsu-pervised setting, suggesting that if labeled train-ing data is available, supervised approaches cangive substantial improvements over unsupervisedapproaches to predicting USim.9 However, thisexperimental setup does not show the extent towhich the supervised approach is able to gener-alize to previously-unseen lemmas.

The column labeled “Lemma” in Table 2 showsresults for the supervised approach for cross-validation using lemma-based partitioning. Inthese experiments, the test data consists of usagesof a target lemma that was not seen as a targetlemma during training. For each dataset, the corre-lations achieved here for each type of embeddingare lower than those of the corresponding unsu-pervised method, with the exception of GloVe. In

8The context2vec model has 600 dimensions, and wastrained on the ukWac, whereas our word2vec model forORIGINAL is trained on Wikipedia. To further compare theseapproaches we also trained word2vec on the ukWaC with 600dimensions and a window size of 8. These word2vec settingsalso did not outperform context2vec.

9These results on ORIGINAL must be interpreted cau-tiously, however. The same sentences, albeit in differentSPairs, occur in both the training and testing data for a givenfold. This issue does not affect TWITTER.

50

the case of ORIGINAL, the higher correlation forGloVe relative to the unsupervised setup appearsto be largely due to improved performance on ad-verbs. Nevertheless, for each dataset, the correla-tions achieved by GloVe are still lower than thoseof the best unsupervised method on that dataset.These results demonstrate that the supervised ap-proach generalizes poorly to new lemmas. Thisnegative result indicates an important directionfor future work — identifying strategies to train-ing supervised approaches to predicting USim thatgeneralize to unseen lemmas.

5 Conclusions

Word senses are not discrete, and multiple sensesare often applicable for a given usage of a word.Moreover, for text types that have a relatively-high rate of out-of-vocabulary words, such as so-cial media text, many words will be missing fromsense inventories. USim is an approach to deter-mining word meaning in context that does not relyon a sense inventory, addressing these concerns.

We proposed unsupervised approaches to USimbased on embeddings for words, contexts, and sen-tences. We achieved state-of-the-art results overUSim datasets based on Twitter text and more-conventional texts. We further considered super-vised approaches to USim based on these samemethods for forming embeddings, and found thatalthough these methods outperformed the unsu-pervised approaches, they performed poorly onlemmas that were unseen in the training data.

The approaches to learning word embeddingsthat we considered (word2vec and GloVe) bothlearn a single vector representing each word type.There are, however, approaches that learn multipleembeddings for each type that have been appliedto predict word similarity in context (Huang et al.,2012; Neelakantan et al., 2014, for example). Infuture work, we intend to also evaluate such ap-proaches for the task of predicting usage similar-ity. We also intend to consider alternative strate-gies to training supervised approaches to USim inan effort to achieve better performance on unseenlemmas.

Acknowledgments

This research was financially supported by theNatural Sciences and Engineering Research Coun-cil of Canada, the New Brunswick InnovationFoundation, ACENET, and the University of New

Brunswick.

ReferencesTimothy Baldwin, Paul Cook, Marco Lui, Andrew

MacKinlay, and Li Wang. 2013. How noisy so-cial media text, how diffrnt social media sources?In Proceedings of the 6th International Joint Con-ference on Natural Language Processing (IJCNLP2013), pages 356–364, Nagoya, Japan.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan.2003. Latent dirichlet allocation. Journal of Ma-chine Learning Research, 3:993–1022.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord.2009. Investigations on word senses and word us-ages. In Proceedings of the Joint Conference of the47th Annual Meeting of the ACL and the 4th Interna-tional Joint Conference on Natural Language Pro-cessing of the AFNLP, pages 10–18, Singapore.

Christiane Fellbaum, editor. 1998. WordNet: An Elec-tronic Lexical Database. MIT Press, Cambridge,MA.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, andSilvia Bernardini. 2008. Introducing and evaluatingukWaC, a very large web-derived corpus of English.In Proceedings of the 4th Web as Corpus Workshop:Can we beat Google, pages 47–54, Marrakech, Mo-rocco.

Spandana Gella, Paul Cook, and Bo Han. 2013. Unsu-pervised word usage similarity in social media texts.In Second Joint Conference on Lexical and Compu-tational Semantics (*SEM), Volume 1: Proceedingsof the Main Conference and the Shared Task: Se-mantic Textual Similarity, pages 248–253, Atlanta,USA.

Eric Huang, Richard Socher, Christopher Manning,and Andrew Ng. 2012. Improving word represen-tations via global context and multiple word proto-types. In Proceedings of the 50th Annual Meeting ofthe Association for Computational Linguistics (Vol-ume 1: Long Papers), pages 873–882, Jeju Island,Korea.

Samuel Johnson. 1755. A Dictionary of the EnglishLanguage. London.

David Jurgens and Ioannis Klapaftis. 2013. Semeval-2013 task 13: Word sense induction for graded andnon-graded senses. In Proceedings of the 7th In-ternational Workshop on Semantic Evaluation (Se-mEval 2013), pages 290–299, Atlanta, USA.

Adam Kilgarriff. 1997. “I Don’t Believe in WordSenses”. Computers and the Humanities, 31(2):91–113.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov,Richard S. Zemel, Antonio Torralba, Raquel Urta-sun, and Sanja Fidler. 2015. Skip-thought vec-tors. In C. Cortes, N. D. Lawrence, D. D. Lee,

51

M. Sugiyama, and R. Garnett, editors, Advances inNeural Information Processing Systems 28, pages3276–3284. Curran Associates, Inc.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the52nd Annual Meeting of the Association for Co-mouational Linguistics, pages 302–308, Maryland,USA.

Marco Lui, Timothy Baldwin, and Diana McCarthy.2012. Unsupervised estimation of word usage simi-larity. In Proceedings of the Australasian LanguageTechnology Association Workshop 2012, pages 33–41, Dunedin, New Zealand.

Diana McCarthy and Roberto Navigli. 2007. Semeval-2007 task 10: English lexical substitution task. InProceedings of the Fourth International Workshopon Semantic Evaluations (SemEval-2007), pages48–53, Prague, Czech Republic.

Oren Melamud, Jacob Goldberger, and Ido Dagan.2016. context2vec: Learning generic context em-bedding with bidirectional lstm. In Proceedingsof The 20th SIGNLL Conference on ComputationalNatural Language Learning, pages 51–61.

Tomas Mikolov, Kai Chen, Greg Corrado, and JeffreyDean. 2013. Efficient estimation of word represen-tations in vector space. In Proceedings of Workshopat the International Conference on Learning Repre-sentations, 2013, Scottsdale, USA.

Arvind Neelakantan, Jeevan Shankar, Alexandre Pas-sos, and Andrew McCallum. 2014. Efficientnon-parametric estimation of multiple embeddingsper word in vector space. In Proceedings of the2014 Conference on Empirical Methods in NaturalLanguage Processing (EMNLP), pages 1059–1069,Doha, Qatar.

Jeffrey Pennington, Richard Socher, and Christo-pher D. Manning. 2014. Glove: Global vectors forword representation. In Empirical Methods in Nat-ural Language Processing (EMNLP), pages 1532–1543.

Serge Sharoff. 2006. Open-source corpora: Using thenet to fish for linguistic data. Corpus Linguistics,11(4):435–462.

52


Lexical Disambiguation of Igbo through Diacritic Restoration

Ignatius Ezeani Mark Hepple Ikechukwu OnyenweNLP Research Group,

Department of Computer Science,University of Sheffield, UK.

{ignatius.ezeani,m.r.hepple,i.onyenwe}@sheffield.ac.uk

AbstractProperly written texts in Igbo, a low re-source African language, are rich in bothorthographic and tonal diacritics. Dia-critics are essential in capturing the dis-tinctions in pronunciation and meaning ofwords, as well as in lexical disambigua-tion. Unfortunately, most electronic textsin diacritic languages are written withoutdiacritics. This makes diacritic restorationa necessary step in corpus building andlanguage processing tasks for languageswith diacritics. In our previous work, webuilt some n−gram models with simplesmoothing techniques based on a closed-world assumption. However, as a classi-fication task, diacritic restoration is wellsuited for and will be more generalisablewith machine learning. This paper, there-fore, presents a more standard approachto dealing with the task which involvesthe application of machine learning algo-rithms.

1 Introduction

Diacritics are marks placed over, under, or througha letter in some languages to indicate a differentsound value from the same letter. English does nothave diacritics (apart from a few borrowed words)but many of the world’s languages use a widerange of diacritized letters in their orthography.Automatic Diacritic Restoration Systems (ADRS)enable the restoration of missing diacritics in texts.Many forms of such tools have been proposed, de-signed and developed but work on Igbo is still inits early stages.

1.1 Diacritics and Igbo languageIgbo, a major Nigerian language and the nativelanguage of the people of the south-eastern Nige-

ria, is spoken by over 30 million people world-wide. It uses the Latin scripts and has many di-alects. Most written works, however, use the of-ficial orthography produced by the O. nwu. Com-mittee1.

The orthography has 8 vowels (a, e, i, o, u, i., o. ,u. ) and 28 consonants (b, gb, ch, d, f, g, gw, gh, h,j, k, kw, kp, l, m, n, nw, ny, n, p, r, s, sh, t, v, w, y,z).

Table 1, shows Igbo characters with their ortho-graphic or tonal (or both) diacritics and possiblechanges in meanings of the words they appear in2.

Char Ortho Tonala – a,a, ae – e,e, ei i. ı, ı, i, ı., ı., i.o o. o, o, o, o. , o. , o.u u. u, u, u, u. , u. , u.m – m,m, mn n n,n, n

Table 1: Igbo diacritic complexity

Most Igbo electronic texts collected from so-cial media platforms are riddled with flaws rang-ing from dialectal variations and spelling errors tolack of diacritics. For instance, consider this rawexcerpt from a chat on a popular Nigerian onlinechat forum www.nairaland.com3:

otu ubochi ka’m no na amaghi ihe muna uwa ga-eje. kam noo n’eche ihea,otu mmadu wee kpoturum,m lee anyao buru nwoke mara mma puru iche,mmaya turu m n’obi.o gwam si nne kedu

1http://www.columbia.edu/itc/mealac/pritchett/00fwp/igbo/txt onwu 1961.pdf

2m and n, nasal consonants, are sometimes treated as tonemarked vowels.

3Source: http://www.nairaland.com/189374/igbo-love-messages

53

k’idi.onu ya dika onu ndi m’ozi, ihu yadika anyanwu ututu,ahu ya n’achakwabara bara ka mmiri si n’okwute. ka iheniile no n’agbam n’obi,o sim na ohurum n’anya.na ochoro k’anyi buru enyi,ahukwuru m ya n’anya.anyi wee kweko-rita wee buru enyi onye’a m n’ekwumaka ya bu odinobi m,onye ihe yan’amasi m

In the above example, you can observe thatthere is zero presence of diacritics - tonal or or-thographic - in the entire text. As pointed outabove, although there are other issues with regardsto standard in the text, lack of diacritics seems tobe harder to control or avoid than the others. Thisis partly because diacritics or lack of it does af-fect human understanding a great deal; and alsothe rigours a writer will go through to insert themmay not worth the effort. The challenge, however,is that NLP systems built and trained with suchpoor quality non standard data will most likely beunreliable.

1.2 Diacritic restoration and other NLPsystems

Diacritic restoration is important for other NLPsystems such as speech recognition, text gener-ation and machine translations systems. For ex-ample, although most translation systems are nowvery impressive, not a lot of them support Igbolanguage. However, for the few that do (e.g.Google Translate), diacritic restoration still playsa huge role in how well they perform. The exam-ple below shows the effect of diacritic marks onthe output of Google Translate’s Igbo-to-Englishtranslation.

Statement Google Translate CommentO ji egbe ya gbuo egbe He used his gun to kill gun wrongO ji egbe ya gbuo egbe He used his gun to kill kite correctAkwa ya di n’elu akwa ya It was on the bed in his room fairAkwa ya di n’elu akwa ya his clothes on his bed correctOke riri oke ya Her addiction confusedOke riri oke ya Mouse ate his share correctO jiri ugbo ya bia He came with his farm wrongO jiri u. gbo. ya bia He came with his car correct

Table 2: Diacritic disambiguation for GoogleTranslate

1.3 Diacritic restoration and WSDYarowsky (1994a) observed that, although dia-critic restoration is not a hugely popular task inNLP research, it shares similar properties with

such tasks as word sense disambiguation with re-gards to resolving both syntactic and semantic am-biguities. Indeed it was referred to as an instanceof a closely related class of problems which in-cludes word choice selection in machine transla-tion, homograph and homophone disambiguationand capitalisation restoration (Yarowsky, 1994b).

Diacritic restoration, like sense disambiguation,is not an end in itself but an “intermediate task”(Wilks and Stevenson, 1996) which supports bet-ter understanding and representation of meaningsin human-machine interactions. In most non-diacritic languages, sense disambiguation systemscan directly support such tasks as machine transla-tion, information retrieval, text processing, speechprocessing etc. (Ide and Veronis, 1998). But ittakes more for diacritic languages, where possible,to produce standard texts. So for those languages,to achieve good results with such systems as listedabove, diacritic restoration is required as a boostfor the sense disambiguation task.

We note however, that although diacriticrestoration is related to word sense disambigua-tion (WSD), it does not eliminate the need forsense disambiguation. For example, if the word-key akwa is successfully restored to akwa, it couldstill be referring to either bed or bridge. Anothergood example is the behaviour of Google Trans-late as the context around the word akwa changes.

Statement Google Translate CommentAkwa ya di n’elu akwa It was on the high confusedAkwa ya di n’elu akwa ya It was on the bed in his room fairAkwa ya di n’elu akwa His clothing was on the bridge okayAkwa ya di n’elu akwa ya His clothing on his bed good

Table 3: Disambiguation challenge for GoogleTranslate

The last two statements, with proper diacrit-ics on the ambiguous wordkey akwa seem bothcorrect. Some disambiguation system in GoogleTranslate must have been used to select the rightform. However, it highlights the fact that such adisambiguation system may perform better whendiacritics are restored.

2 Problem Definition

As explained above, lack of diacritics can oftenlead to some lexical ambiguities in written Igbosentences. Although a human reader can, in mostcases, infer the intended meaning from context,the machine may not. Consider the sentences insections 2.1 and 2.2 and their literal translations:

54

Figure 1: Illustrative View of the DiacriticRestoration Process (Ezeani et al., 2016)

2.1 Missing orthographic diacritics1. Nwanyi ahu banyere n’ugbo ya. (The woman

entered her [farm|boat/craft])

2. O kwuru banyere olu ya. (He/she talkedabout his/her [neck/voice|work/job])

2.2 Missing tonal diacritics1. Nwoke ahu nwere egbe n’ulo ya. (That man

has a [gun|kite] in his house)

2. O dina n’elu akwa. (He/she is lying on the[cloth|bed,bridge|egg|cry])

3. Egwu ji ya aka. (He/she is held/gripped by[fear|song/dance/music])

Ambiguities arise when diacritics – ortho-graphic or tonal – are omitted in Igbo texts. In thefirst examples, we could see that ugbo(farm) andu. gbo. (boat/craft) as well as olu(neck/voice) ando. lu. (work/job) were candidates in their sentences.

Also the second examples show that egbe(kite)and egbe(gun); akwa(cloth), akwa(bed orbridge), akwa(egg), or even akwa(cry) in a philo-sophical or artistic sense; as well as egwu(fear)and egwu(music) are all qualified to replace theambiguous word in their respective sentences.

3 Related Literature

Diacritic restoration techniques for low resourcelanguages adopt two main approaches: wordbased and character based.

3.1 Word level diacritic restorationDifferent schemes of the word-based approachhave been described. They generally involve pre-processing, candidate generation and disambigua-tion. Simard (1998) applied POS-tags and HMM

language models for French. On the Croatianlanguage, Santic et al. (2009) used substitutionschemes, a dictionary and language models in im-plementing a similar architecture. For Spanish,Yarowsky (1999) used dictionaries with decisionlists, Bayesian classification and Viterbi decodingthe surrounding context.

Crandall (2005), using Bayesian approach,HMM and a hybrid of both, as well as differ-ent evaluation method, attempted to improve onYarowsky’s work. Cocks and Keegan (2011)worked on Maori using naıve Bayes and word-based n-grams relating to the target word as in-stance features. Tufis and Chitu (1999) used POStagging to restore Romanian texts but backed off tocharacter-based approach to deal with “unknownwords”. Generally, there seems to be a consensuson the superiority of the word-based approach forwell resourced languages.

3.2 Grapheme or letter level diacriticrestoration

For low-resource languages, there is often lack ofadequate data and resources (large corpora, dic-tionaries, POS-taggers etc.). Mihalcea (2002) aswell as Mihalcea and Nastase (2002) argued thatletter-based approach will help to resolve the issueof lack of resources. They implemented instancebased and decision tree classifiers which gave ahigh letter-level accuracy. However, their evalua-tion method implied a possibly much lower word-level accuracy.

Versions of Mihalcea’s approach with improvedevaluation methods have been implemented onother low resourced languages (Wagacha et al.,2006; De Pauw et al., 2011; Scannell, 2011). Wa-gacha et al. (2006), for example, reviewed theevaluation method in Mihalcea’s work and intro-duced a word-level method for Gikuyu. De Pauwet al. (2011) extended Wagacha’s work by apply-ing the method to multiple languages.

Our earlier work on Igbo diacritic restoration(Ezeani et al., 2016) was more of a proof of con-cept aimed at extending the initial work done byScannell (2011). We built a number of n–grammodels – basically unigrams, bigrams and tri-grams – along with simple smoothing techniques.Although we got relatively high results, our eval-uation method was based on a closed-world as-sumption where we trained and tested on the sameset of data. Obviously, that assumption does not

55

model the real world and so it is being addressedin this paper.

3.3 Igbo Diacritic Restoration

Igbo is low-resourced and is generally neglectedin NLP research. However, an attempt at restoringIgbo diacritics was reported by Scannell (2011)in which a combination of word- and character-level models were applied. Two lexicon lookupmethods were used: LL which replaces ambigu-ous words with the most frequent word and LL2that uses a bigram model to determine the right re-placement.

They reported word-level accuracies of 88.6%and 89.5% for the models respectively. But thesize of training corpus (31k tokens with 4.3k wordtypes) was too little to be representative and therewas no language speaker in the team to validatethe data used and the results produced. Therefore,we implemented a range of more complex n-grammodels, using similar evaluation techniques, ona comparatively larger sized corpus (1.07m with14.4k unique tokens) and had improved on theirresults (Ezeani et al., 2016).

In this work, we introduce machine learning ap-proaches to further generalise the process and tobetter learn the intricate patterns in the data thatwill help better restoration.

4 Experimental Setup

4.1 Experimental Data

The corpus used in these experiments were col-lected from the Igbo version of the bible availablefrom the Jehova Witness website4. The basic cor-pus statistics are presented in Table 4.

In Table 4, we refer to the “latinized” form of aword as its wordkey5. Less than 10% (529/15696)of the wordkeys are ambiguous. However, theseambiguous wordkeys represent 529 ambiguoussets that yield 348,509 of the corpus words (i.e.words that share the same wordkey with at leastone other word). These ambiguous words consti-tutes approximately 38.22% (348,509/911892) ofthe entire corpus. Some of the top most occurring,as well as the bottom least occurring ambiguoussets are shown in Table 5.

4jw.org5Expectedly, many Igbo words are the same with their

wordkeys

Item NumberTotal tokens 1070429Total words 902150Numbers/punctuations 168279Unique words 563383Ambiguous words 348509Wordkeys 15696Unique wordkeys 15167Ambiguous wordkeys 529

2 variants 5023 variants 154 variants 105 variants 2>5 variants 0

Approx. ambiguity 38.22%

Table 4: Corpus statistics

Top Variants(count)na(29272) na(1332), na(27940)o(22418) o(4757), o(64), o(5), o. (17592)Bottom Variants(count)Giteyim(2) Giteyi.m(1), Giteyim(1)Galim(2) Galim(1), Gali.m(1)

Table 5: Most and least frequent wordkeys

4.2 Preprocessing

The preprocessing task relied substantially on theapproaches used by Onyenwe et al. (2014). Ob-served language based patterns were preserved.For example, ga–, na– and n’ are retained as theyare due to the specific lexical functions the spe-cial characters “–” or “ ’ ” confer on them. Forinstance, while na implies conjunction (e.g. ji naede: yam and cocoa-yam), na– is a verb auxiliary(e.g. Obi na–agba o. so. : Obi is running) and n’ isshortened form of the preposition na (e.g. O. di.n’elu akwa: It is on the bed.).

Also for consistency, diacritic formats are nor-malized using the unicode’s Normalization FormCanonical NFC composition. For example, thecharacter e from the combined unicode charac-ters e (u0065) and ´ (u0301) will be decomposedand recombined as a single canonically equivalentcharacter e (u00e9). Also the character n, which isoften wrongly replaced with n and n in some text,is generally restored back to its standard form.

The diacritic marking of the corpus used in thisresearch is sufficient but not full or perfect. Theorthographic diacritics (mostly dot-belows) have

56

been included throughout. However, the tonal dia-critics are fairly sparse, having been included onlywhere they were for disambiguation (i.e. wherethe reader might not be able to decide the correctform from context.

Therefore, through manual inspection, someobserved errors and anomalies were corrected bylanguage speakers. For example, 3153 out of 3154occurrences of the key mmadu were of the classmmadu. . The only one that was mmadu was cor-rected to mmadu. after inspection. By repeatingthis process, a lot of the generated ambiguous setswere resolved and removed from the list to reducethe noise. Examples are as shown in the table be-low:

wordkey Freq var1Freq var2Freqakpu 106 akpu. –1 akpu. –105agbu 112 agbu. –111 agbu. –1aka 3690 aka–3689 aka–1iri 2036 iri–2035 i.ri.–1

Table 6: Some examples of corrected and removedambiguous set

4.3 Feature extraction for training instancesThe feature sets for the classification modelswere based on the works of Scannell (2011) oncharacter-based restoration which was extendedby Cocks and Keegan (2011) to deal with word-based restoration for Maori. These features con-sist of a combination of n-grams – represented inthe form (x,y), where x is the relative position tothe target key and y is the token length – at dif-ferent positions within the left and right context ofthe target word. The datasets are built as describedbelow for each of the ambiguous keys:

• FS1[(-1,1), (1,1)]: Unigrams on each side ofthe target key

• FS2[(-2,2), (2,2)]: Bigrams on each side

• FS3[(-3,3), (3,3)]: Trigrams on each side

• FS4[(-4,4), (4,4)]: 4-grams on each side

• FS5[(-5,5), (5,5)]: 5-grams on each side

• FS6[(-2,1), (-1,1), (1,1), (2,1)]: 2 unigramson both sides

• FS7[(-3,1), (-2,1), (-1,1), (1,1), (2,1), (3,1)]:3 unigrams on each side

• FS8[(-4,1), (-3,1), (-2,1), (-1,1), (1,1), (2,1),(3,1), (4,1)]: 4 unigrams on each side

• FS9[(-5,1), (-4,1), (-3,1), (-2,1), (-1,1), (1,1),(2,1), (3,1), (4,1), (5,1)]: 5 unigrams on eachside

• FS10[(-2,2), (-1,1), (1,1), (2,2)]: 1 unigramand 1 bigram on each side

• FS11[(-3,3), (-2,2), (2,2), (3,3)]: 1 bigramand 1 trigram on each side

• FS12[(-3,3), (-2,2), (-1,1), (1,1), (2,2), (3,3)]:1 unigram, 1 bigram and 1 trigram on eachside

• FS13[(-4,4), (-3,3), (-2,2), (-1,1), (1,1), (2,2),(3,3), (4,4)]: 1 unigram, 1 bigram, 1 trigramand a 4-gram on each side

4.3.1 Appearance threshold and stratificationWe removed low-frequency wordkeys in our databy defining an appearance threshold as a percent-age of the total tokens in our data. This is given bythe

appThreshold =C(wordkeys)C(tokens)

∗ 100

and wordkeys with appThreshold below the statedvalue6 were removed from the experiment.

As part of our data preparation for a stan-dard cross-validation, we also passed each ofour datasets through a simple stratification pro-cess. Instances of each label7, where possible, areevenly distributed to appear at least once in eachfold or removed from the dataset.

Our stratification algorithm basically picks onlylabels from each dataset that have a population psuch that p >= nfolds. nfolds is the number offolds which in our case has a default value of 10.In order to make the task a little more challeng-ing, this process was augmented by the removalof some high frequency, but low entropy datasetswhere using the most common class (MCC) pro-duces very high accuracies8. Entropy is looselyused here to refer to the degree of dominance of aparticular class across the dataset and it is simplydefined as:

entropy = 1− max[Count(labeli)]len(dataset)

6In this work, we used an appThreshold of 0.005%7labels are basically diacritic variants.8Datasets with more than 95% accuracy on the most com-

mon class (i.e. with entropy lower than 0.05) were removed.

57

where i = 1..n and n =number of distinct labelsin the dataset. Table 7 shows 10 of the 30 lowestentropy datasets that were removed by this pro-cess.

wordkey Counts MCCscore Label(count)e(2) 5476 99.78% e(12); e(5464)anyi(2) 5390 99.63% anyi.(5370); anyi.(20)ma(2) 6713 99.61% ma(6687); ma(26)ike(2) 3244 99.54% ike(15); ike(3229)unu(2) 8662 99.53% unu(41); unu(8621)Ha(2) 2266 99.29% Ha(16); Ha(2250)a(2) 12275 99.10% a(12165); a(110)onye(2) 8937 98.87% onye(8836); onye(101)ohu(2) 790 98.73% ohu(780); o.hu. (10)eze(2) 2633 98.14% eze(2584); eze(49)

Table 7: Low entropy datasets

At end of these pruning processes, our remain-ing datasets came to 110 with the distribution asfollows:

• datasets with only 2 variants: = 93

• datasets with 3 variants: = 7



Some datasets that originally had multiple vari-ants lost some of their variants. For example, thedataset from akwa which originally had five vari-ants and 1067 instances comprising of akwa (355),akwa(485), akwa(216), akwa(1) and akwa(10) re-tained only four variants (after dropping akwa)and 1066 instances.

4.4 Classification algorithmsThis work applied versions of five of the com-monly used machine learning algorithms in NLPclassification tasks namely:

- Linear Discriminant Analysis(LDA)

- K Nearest Neighbors(KNN)

- Decision Trees(DTC)

- Support Vector Machines(SVC)

- Naıve Bayes(MNB)

Their default parameters on Scikit-learn toolkitwere used with 10-fold cross-validation and theevaluation metrics used is mainly the accuracy ofprediction of the correct diacritic form in the testdata. The effect of the accuracy obtained for a

FS1 FS2 FS3 FS4 FS5 FS6 FS7 FS8 FS9 FS10 FS11 FS12 FS13

50.00%

55.00%

60.00%

65.00%

70.00%

75.00%

80.00%

85.00%

90.00%

95.00%

100.00%

Performance of all Algorithms on all Models

LDA

KNN

DTC

SVM

MNB

Models

Accu

racie

s

Baseline

Figure 2: Evaluation of algorithm performance oneach feature set model

dataset on the overall performance depends on theweight of the dataset. Each dataset is assigned aweight corresponding to the number of instancesit generates from the corpus which is determinedby its frequency of occurrence.

So the actual performance of each learning al-gorithm, on a particular feature set model, is theoverall weighted average of the its performancesacross all the 110 datasets. The bottom line ac-curacy is the result of replacing each word withits wordkey which gave an accuracy of 30.46%.However, the actual baseline to beat is 52.79%which is achieved by always predicting the mostcommon class.

4.5 Results and DiscussionsThe results of our experiments are as shown in Ta-ble 8 and Figure 2.

Models LDA KNN DTC SVM MNBBaseline: 52.79%FS1 77.65% 91.47% 94.49% 74.64% 74.64%FS2 77.65% 91.47% 94.49% 74.64% 74.64%FS3 74.48% 73.70% 84.60% 74.92% 74.64%FS4 73.71% 67.18% 81.00% 74.64% 74.64%FS5 74.68% 62.48% 76.70% 74.64% 74.64%FS6 76.21% 85.98% 91.54% 71.39% 71.39%FS7 72.74% 79.20% 90.94% 71.39% 71.39%FS8 72.74% 79.20% 90.94% 71.39% 71.39%FS9 76.99% 73.88% 89.50% 75.46% 74.67%FS10 76.18% 85.41% 92.89% 75.11% 74.64%FS11 73.94% 74.83% 86.23% 75.29% 74.64%FS12 76.99% 73.88% 89.50% 75.46% 74.67%FS13 76.99% 73.88% 89.50% 75.46% 74.67%

Table 8: Summary of results

The experiments indicate that on the average allthe algorithms were able to beat the baseline on allmodels. The decision tree algorithm (DTC) per-formed best across all models with an average ac-curacy of 88.64% (Figure 3), and the highest accu-racy of 94.49% (Table 8) on both the FS1 and FS2

58

models. However, with an average standard devi-ation of 0.076 (Figure 4) for its results, it appearsto be the least reliable.

As the next best performing algorithm, KNNfalls below DTC in average accuracy (91.47%)but seems slightly more reliable. It did, however,struggle more than others as the dimension of fea-ture n-grams increased (see its performance onFS3, FS4 and FS5). This may be due to the in-crease in sparsity of features and the difficulty tofind similar neighbours. The other algorithms –LDA, SVM and MNB – just trailed behind andalthough their results are a lot more reliable espe-cially SVM and MNB (Figure 4). But this maybe an indication that their strategies are not explo-rative enough. However, it could be observed thatthey traced a similar path in the graph and also hadtheir highest results with the same set of models(i.e. FS9, FS12 and FS13) with wider context.

On the models, we observed that the unigramsand bigrams have better predictive capacity thanthe other n-grams. Most of the algorithms gotcomparatively good results with FS1, FS2, FS6and FS10 (Figure 5) each of which has the uni-gram closest to the target word (i.e. in the±1 posi-tion) in the feature set. Also, models that excludedthe closest unigrams on both sides (e.g. FS11)and those with fairly wider context did not per-form comparatively well across algorithms.

LDA KNN DTC SVM MNB

65.0%

70.0%

75.0%

80.0%

85.0%

90.0%

Comparison of Performances of Algorithms on Models

Algorithms

Ave

rag

e A

ccu

racy a

cro

ss m

od

els

Figure 3: Average performance of algorithms

Again, it appears that beyond the three closestunigrams (i.e. those in the −3 through +3 posi-tions), the classifiers tend to be confused by ad-ditional context information. Generally, FS1 andFS2 stood out across all algorithms as the bestmodels while FS6 and FS7 also did well espe-cially with DTC, KNN and LDA.

LDA KNN DTC SVM MNB

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

Standard Deviation of Algorithm Performance

Algorithm

Sta

nd

ard

De

via

tio

n

Figure 4: Average standard deviation for algo-rithms

FS1 FS2 FS3 FS4 FS5 FS6 FS7 FS8 FS9 FS10 FS11 FS12 FS1366.0%

68.0%

70.0%

72.0%

74.0%

76.0%

78.0%

80.0%

82.0%

84.0%

Comparison of Average Performances of Models

Models

Ac

cu

rac

y

Figure 5: Average performance of models

4.6 Future Research DirectionAlthough our results show a substantial improve-ment from the baseline accuracy by all the algo-rithms on all the models, there is still a lot of roomfor improvement. Our next experiments will in-volve attempts to improve the results by focusingon the following key aspects:

- Reviewing the feature set models:So far we have used instances with similarfeatures on both sides of the target words. Inour next experiments, we may consider vary-ing these features.

- Exploiting the algorithms:We were more explorative with the algo-rithms and so only the default parameters ofthe algorithms on Scikit-learn were tested.Subsequent experiments will involve tuningthe parameters of the algorithms and possiblyusing more evaluation metrics.

- Expanding data size and genre:A major challenge for this research work islack of substantially marked corpora. So al-though, we achieved a lot with the bible data,

59

it is inadequate and not very representative ofthe contemporary use of the language. Futureresearch efforts will apply more resources toincreasing the data size across other genres.

- Predicting unknown words:Our work is yet to properly address the prob-lem of unknown words. We are considering acloser inspection of the structural patterns inthe target word to see if they contain elementswith predictive capacity.

- Broad based standardization:Beside lack of diacritics online Igbo textsare riddled with spelling errors, lack of stan-dard orthographic and dialectal forms, poorwriting styles, foreign words and so on. Itmay therefore be good to consider a broaderbased process that includes, not just diacriticrestoration but other aspects of standardiza-tion.

- Interfacing with other NLP systems:Although it seems obvious, it will be inter-esting to investigate, in empirical terms, therelationship between diacritic restoration andothers NLP tasks and systems such as POS-tagging, morphological analysis and even thebroader field of word sense disambiguation.

Acknowledgements

The IgboNLP Project @ The University ofSheffield, UK is funded by TETFund & NnamdiAzikiwe University, Nigeria.

ReferencesJohn Cocks and Te-Taka Keegan. 2011. A Word-

based Approach for Diacritic Restoration in Maori.In Proceedings of the Australasian Language Tech-nology Association Workshop 2011, pages 126–130,Canberra, Australia, December.

David Crandall. 2005. AutomaticAccent Restoration in Spanish text.http://www.cs.indiana.edu/∼djcran/projects/674 final.pdf. [Online: accessed 7-January-2016].

Guy De Pauw, Gilles maurice De Schryver, LaurettePretorius, and Lori Levin. 2011. Introduction tothe Special Issue on African Language Technology.Language Resources and Evaluation, 45:263–269.

Ignatius Ezeani, Mark Hepple, and IkechukwuOnyenwe, 2016. Automatic Restoration of Diacrit-ics for Igbo Language, pages 198–205. Springer In-ternational Publishing, Cham.

Nancy Ide and Jean Veronis. 1998. Introductionto the special issue on word sense disambiguation:The state of the art. Comput. Linguist., 24(1):2–40,March.

Rada F. Mihalcea and Vivi Nastase. 2002. Letter levellearning for language independent diacritics restora-tion. In: Proceedings of CoNLL-2002, Taipei, Tai-wan, pages 105–111.

Rada F. Mihalcea. 2002. Diacritics Restoration:Learning from Letters versus Learning from Words.In: Gelbukh, A. (ed.) CICLing LNCS, pages 339–348.

Ikechukwu Onyenwe, Chinedu Uchechukwu, andMark Hepple. 2014. Part-of-speech Tagset and Cor-pus Development for igbo, an African Language.LAW VIII - The 8th Linguistic Annotation Work-shop., pages 93–98.

Kevin P. Scannell. 2011. Statistical unicodification ofafrican languages. Language Resource Evaluation,45(3):375–386, September.

Michel Simard. 1998. Automatic Insertion of Ac-cents in French texts. Proceedings of the Third Con-ference on Empirical Methods in Natural LanguageProcessing, pages 27–35.

Dan Tufis and Adrian Chitu. 1999. Automatic Diacrit-ics Insertion in Romanian texts. In Proceedings ofCOMPLEX99 International Conference on Compu-tational Lexicography, pages 185–194, Pecs, Hun-gary.

Nikola Santic, Jan Snajder, and Bojana Dalbelo Basic,2009. Automatic Diacritics Restoration in CroatianTexts, pages 126–130. Dept of Info Sci, Faculty ofHumanities and Social Sciences, University of Za-greb , 2009. ISBN: 978-953-175-355-5.

Peter W. Wagacha, Guy De Pauw, and Pauline W.Githinji. 2006. A Grapheme-based Approach toAccent Restoration in Gikuyu. In Proceedings ofthe Fifth International Conference on Language Re-sources and Evaluation, pages 1937–1940, Genoa,Italy, May.

Yorick Wilks and Mark Stevenson. 1996. The gram-mar of sense: Is word-sense tagging much more thanpart-of-speech tagging? CoRR, cmp-lg/9607028.

David Yarowsky. 1994a. A Comparison of Corpus-based Techniques for Restoring Accents in Spanishand French Text. In Proceedings, 2nd Annual Work-shop on Very Large Corpora, pages 19–32, Kyoto.

David Yarowsky. 1994b. Decision Lists for LexicalAmbiguity Resolution: Application to Accent Res-olution in Spanish and French Text. In Proceedingsof ACL-94, Las Cruces, New Mexico.

David Yarowsky, 1999. Corpus-based Techniques forRestoring Accents in Spanish and French Text, pages99–120. Kluwer Academic Publishers.

60


Creating and Validating Multilingual Semantic Representations for SixLanguages: Expert versus Non-Expert Crowds

Mahmoud El-Haj, Paul Rayson, Scott Piao and Stephen WattamSchool of Computing and Communications, Lancaster University, Lancaster, UK

[email protected]

Abstract

Creating high-quality wide-coverage mul-tilingual semantic lexicons to supportknowledge-based approaches is a chal-lenging time-consuming manual task.This has traditionally been performed bylinguistic experts: a slow and expensiveprocess. We present an experiment inwhich we adapt and evaluate crowdsourc-ing methods employing native speakers togenerate a list of coarse-grained senses un-der a common multilingual semantic tax-onomy for sets of words in six languages.451 non-experts (including 427 Mechan-ical Turk workers) and 15 expert partic-ipants semantically annotated 250 wordsmanually for Arabic, Chinese, English,Italian, Portuguese and Urdu lexicons. Inorder to avoid erroneous (spam) crowd-sourced results, we used a novel task-specific two-phase filtering process whereusers were asked to identify synonyms inthe target language, and remove erroneoussenses.

1 Introduction

Machine understanding of the meaning of words,phrases, sentences and documents has challengedcomputational linguists since the 1950s, and muchprogress has been made at multiple levels. Differ-ent types of semantic annotation have been devel-oped, such as word sense disambiguation, seman-tic role labelling, named entity recognition, senti-ment analysis and content analysis. Common toall of these tasks, in the supervised setting, is therequirement for a wide coverage semantic lexiconacting as a knowledge base from which to selector derive potential word or phrase level sense an-notations.

The creation of large-scale semantic lexical re-sources is a time-consuming and difficult task. Fornew languages, regional varieties, dialects, or do-mains the task will need to be repeated and thenrevised over time as word meanings evolve. Inthis paper, we report on work in which we adaptcrowdsourcing techniques to speed up the creationof new semantic lexical resources. We evaluatehow efficient the approach is, and how robust thesemantic representation is across six languages.

The task that we focus on here is a particularlychallenging one. Given a word, each annotatormust decide on its meaning[s] and assign the wordto single or multiple tags in a pre-existing seman-tic taxonomy. This task is similar to that under-taken by trained lexicographers during the processof writing or updating dictionary entries. Even forexperts, this is a complex task. Kilgarriff (1997)highlighted a number of issues related to lexicog-raphers ‘lumping’ or ‘splitting’ senses of a wordand cautioned that even lexicographers do not be-lieve in words having a “discrete, non-overlappingset of senses”. Veronis (2001) showed that inter-annotator agreement is very low in sense taggingusing a traditional dictionary. For our purpose, weuse the USAS taxonomy.1 If a linguist were un-dertaking this task, as they have done in the pastwith Finnish (Lofberg et al., 2005) and Russian(Mudraya et al., 2006) USAS taxonomies, theywould first spend some time learning the seman-tic taxonomy. In this experimental scenario, weaim to investigate whether or not non-expert na-tive speakers can succeed on the word-to-sensesclassification task without being trained on the tax-onomy in advance, therefore mitigating a signifi-cant overhead for the work. In addition, furthermotivation for our experiments is to validate theapplicability of the USAS taxonomy (Rayson et

1The UCREL Semantic Analysis System (USAS), seehttp://ucrel.lancaster.ac.uk/usas/

61

al., 2004), with a non-expert crowd, as a frame-work for multilingual sense representation. TheUSAS taxonomy was selected for this experimentsince it offers a manageable coarse-grained setof categories that have already been applied to anumber of languages. This taxonomy is distinctfrom other potential choices, such as WordNet.The USAS tagset is originally loosely based onthe Longman Lexicon of Contemporary English(McArthur, 1981) and has a hierarchical structurewith 21 major domains (see table 1) subdividinginto three levels. Versions of the USAS taggeror tagset exist in 15 languages in total and foreach language, native speakers have re-evaluatedthe applicability of the tagset with some specificextensions for Chinese (Qian and Piao, 2009) butotherwise the tagset is stable across all languages.For each language tagger, separate linguistic re-sources (lexicons) have been created, but they alluse the same taxonomy.

Domain DescriptionA General and abstract termsB The body and the individualC Arts and craftsE EmotionF Food and farmingG Government and publicH Architecture, housing and the homeI Money and commerce in industryK Entertainment, sports and gamesL Life and living thingsM Movement, location, travel and

transportN Numbers and measurementO Substances, materials, objects and

equipmentP EducationQ Language and communicationS Social actions, states and processesT TimeW World and environmentX Psychological actions, states and

processesY Science and technologyZ Names and grammar

Table 1: USAS top level semantic fields

In terms of main contributions, our researchgoes beyond the previous work on crowdsourc-ing word meanings which requires workers to pick

a word sense from an existing list that matchesprovided contextual examples, such as a concor-dance list. In our work, we require the partici-pants to define the list of all possible senses that aword could take in different contexts. We also seethat our two-stage filtering process tailored for thistask helps to improve results. We compare inter-rater scores for two groups of experts and non-experts to examine the feasibility of extractinghigh-quality semantic lexicons via the untrainedcrowd. Non-experts achieved results between 45-97% for accuracy, between 48-92% for complete-ness, with an average of 18% of tasks having er-roneous senses being left in. Experts scored 64-96% for accuracy, 72-95% for completeness, butachieve better results in terms of only 1% of erro-neous senses left behind. Our experimental resultsshow that the non-expert crowdsourced annotationprocess is of a good quality and comparable to thatof expert linguists in some cases, although thereare variations across different languages. Crowd-sourcing provides a promising approach for thespeedy generation and expansion of semantic lex-icons on a large scale. It also allows us to validatethe semantic representations embedded in our tax-onomy in the multilingual setting.

2 Related Work

The crowdsourcing approach, in particular Me-chanical Turk (MTurk), has been successfully ap-plied for a number of different Natural LanguageProcessing (NLP) tasks. Alonso and Mizzaro(2009) adopted MTurk for five types of NLP tasks,resulting in high agreement between expert goldstandard labels and non-expert annotations, wherea small number of workers can emulate an ex-pert. With the possibility of achieving good re-sults quickly and cheaply, MTurk has been testedfor a variety of tasks, such as image annotation(Sorokin and Forsyth, 2008), Wikipedia articlequality assessment (Kittur et al., 2008), machinetranslation (Callison-Burch, 2009), extracting keyphrases from documents (Yang et al., 2009), andsummarization (El-Haj et al., 2010). Practical is-sues such as payment and task design play an im-portant part in ensuring the quality of the resultingwork. Many designers pay between $0.01 to $0.10for a task taking a few minutes. Quality controland evaluation are usually achieved through con-fidence scores and gold-standards (Donmez et al.,2009; Bhardwaj et al., 2010). Past research has

62

shown (Aker et al., 2012) that the use of radio but-ton design seems to lead to better results comparedto the free text design. Particularly important inour case is the language demographics of MTurk(Pavlick et al., 2014), since we need to find enoughnative speakers in a number of languages.

There is a growing body of crowdsourcing workrelated to semantic annotation. Snow et al. (2008)applied MTurk to the Word Sense Disambiguation(WSD) task and achieved 100% precision withsimple majority voting for the correct sense ofthe word ‘president’ in 177 example sentences.Rumshisky et al. (2012) derived a sense inven-tory and sense-annotated corpus from MTurkerscomparison of senses in pairs of example sen-tences. They used clustering methods to iden-tify the strength of coders’ tags, something thatis poorly suited to rejecting work from spammers(participants who try to cheat the system withscripts or random answers) and would likely nottransfer well to our experiment.

Akkaya et al. (2010) also performed WSD us-ing MTurk workers. They discuss a number ofmethods for ensuring quality, accountability, andconsistency using 9 tasks per word and simple ma-jority voting. Kapelner et al. (2012) increased thescale to 1,000 words for the WSD task and foundthat workers repeating the task do not learn with-out feedback. A set-based agreement metric wasused by Passonneau et al. (2006) to assess thevalidity of polysemous selections of word sensesfrom WordNet categories. Their objective was totake into account similarity between items within aset, however, this may not be desirable in our casedue to the limited depth of the USAS taxonomy.

Directly related to our research here are theexperiments reported in Piao et al. (2015). Aset of prototype semantic lexicons were auto-matically generated by transferring semantic tagsfrom the existing USAS English semantic lex-icon entries to their translation equivalents inItalian, Chinese and Portuguese via dictionariesand bilingual lexicons. While some dictionar-ies involved, including Chinese/English and Por-tuguese/English dictionaries, provided high qual-ity lexical translations for core vocabularies ofthese languages, the bilingual lexicons, includingFreeLang English/Italian, English/Portuguese lex-icons2 and LDC English/Chinese word list, con-tain erroneous and inaccurate translations. To re-

2http://www.freelang.net/dictionary/

duce the error rate, some manual cleaning was car-ried out, particularly on the English-Italian bilin-gual lexicons. Because of the substantial amountof time needed for such manual work, the rest ofthe lexical resources were used with only minorsporadic manual checking. Due to the noise intro-duced from the bilingual lexicons, as well as theambiguous nature of the translation, the automati-cally generated semantic lexicons for the three lan-guages contain errors, including erroneous seman-tic tags caused by incorrect translations, and inac-curate semantic tags caused by ambiguous trans-lations. When these automatically generated lexi-cons were integrated and applied in the USAS se-mantic tagger, the tagger suffered from error ratesof 23.51%, 12.31% and 10.28% for Italian, Chi-nese and Portuguese respectively.

The improvement of the semantic lexicons istherefore an urgent and challenging task, and wehypothesise that the crowdsourcing approach canpotentially provide an effective means for address-ing this issue on a large scale, while at the sametime allowing us to further validate the representa-tion of word senses in the USAS sense inventory(i.e. the semantic tagset) for these languages.

3 Semantic Labeling Experiment

We test the wisdom of the crowd in buildinglexicons and applying the same multilingual se-mantic representation in six languages: Arabic,Chinese, English, Italian, Portuguese and Urdu.These languages were selected to provide a rangeof language families, inflectional and derivationalmorphology, while covering significant number ofspeakers worldwide. For each language, we ran-domly selected 250 words. All experiments pre-sented here use the USAS taxonomy to describesemantic categories3 (Rayson et al., 2004).

3.1 Gold Standard Semantic Tags

To prepare gold standard data we asked a groupof linguists (up to two per language) to manu-ally check 250 randomly selected words for eachof the six languages, starting from the data pro-vided by Piao et al. (2015). For additional lan-guages (those not in Piao et al. (2015)), the Arabicand Urdu gold standards were completed manu-ally by native speaker linguists who translated the250 words, and we instructed the translators to optfor the most familiar Arabic or Urdu equivalents

3http://ucrel.lancaster.ac.uk/usas/semtags.txt

63

of the English words. This was further confirmedby checking the list of words by two other Arabicand Urdu native speakers. Here the base form ofverbs in Arabic is taken to be the present simpleform of the verb in the interest of convenience andbecause there is no part of speech tag for ‘presenttense of a lexical verb’. Hence, the three-letterpast tense verbs are tagged as ‘past form of lex-ical verb’, rather than as base forms. Also, whilepresent and past participles (e.g., ‘interesting’, ‘in-terested’) are tagged as adjectives in English, theseare labeled in Arabic as nouns in stand-alone po-sitions but they can also function as adjective pre-modifying nouns. Linguists then used the USASsemantic tagset to semantically label each wordwith the most suitable senses.

3.2 Non-expert Participants

Non-expert participants are defined as those whoare not familiar with the USAS taxonomy in ad-vance of the experiment. We selected Amazon’sMechanical Turk4 – an online marketplace forwork that requires human intelligence – and pub-lished “Human Intelligence Tasks” (HITs) for En-glish, Chinese and Portuguese only. For Ara-bic, Italian, and Urdu initial experiments usingMTurk showed that not enough native speakersare available to complete the tasks. Therefore,we employed 12 non-expert participants directlywith four native speakers for each of the three lan-guages. All participants used the same interface(Figure 1).

On MTurk, we paid workers an average of 7 USdollars per hour. We paid Portuguese workers 50%more to try and attract more participants due tothe lack of Portuguese native speakers on MTurk.We paid the other directly contacted participantsan average of 8 British pounds per hour. Thosepayments were made using Amazon5 and AppleiTunes6 vouchers.

3.3 Expert Participants

Expert participants are defined as those who werealready familiar with the USAS taxonomy beforethe experiments took place. For four languages(Arabic, English, Chinese and Urdu) we asked atotal of 15 participants (3 for English and 4 forthe other languages) to carry out the same task as

4http://www.mturk.com5https://www.amazon.co.uk6http://www.apple.com

the non-experts. All expert and non-expert partici-pants (whether MTurk workers or direct-contact)used the same user interface (Figure 1) as de-scribed in section 3.5.

3.4 Experimental DesignObtaining reliable results from the crowd remainsa challenging task (Kazai et al., 2009), whichrequires a careful experimental design and pre-selection of crowdsourcers. In our experiments,we worked on minimising the effort required byparticipants through designing a user-friendly in-terface.7 Aside from copying the final output codeto a text-box everything else is done using mouseclicks. Poorly designed experiments can nega-tively affect the quality of the results conductedby MTurk workers (Downs et al., 2010; Welinderet al., 2010; Kazai, 2011).

Feedback from a short sample run with localtesters helped us update the interface and providemore information to make the task efficient. Fig-ure 1 shows a sample task for the word ‘car’. Themajority of the testers were able to complete thetask within five minutes. In response to feedbackby some of the testers we provided the “Instruc-tions” section in the six languages under consider-ation.

Term to Classify

carTags

Please add tags to describe the term, arranging them from the most commonly used sense to the least.

Please label the word below with a number of tags that represent its meaning. Attach as many (no

more than 10), or as few, as you deem appropriate for all senses of the word, placing them in descending order of importance. Tags can be assigned in a (positive or negative manner (for example,

"occasionally" is tagged negatively for frequency

To assign a tag, click on the add tag button, and select a box from the category selection. This will add

an entry in the list, that .can then be sorted so that the most commonly used tag is at the top

Please remove any unrelated tags and make sure you do not exceed 10 tags in total

To help you with identifying common senses of a word, we have provided a number of links to

dictionaries, thesauri, and corpora (where you can see realworld usage). The References button will

bring these up over the top of other things, so you can still browse the tags

Example: "Jordan" could be : 'Geographical name' or 'Personal name'

Please submit your selections by clicking in the Submit button at the end of the page and wait until you

receive a confirmation message where you need to copy the outputcode and provide it to us, as in

the following example: f962bed5616612c8c7053f6e97e72b129edb4990-b5f5-47ab-8e7e-

bfc61c6515ec

Instructions Istruzioni التعليمات 用户指南 Information Contact

Movement/transportation: land

American football

SubmitAdd Tag References

Figure 1: Sample Task for the word “Car”

3.5 Online Semantic LabelingAs shown in Figure 1, we asked the participants tolabel each word presented to them with a number

7Our underlying code is freely available on GitHub athttps://github.com/UCREL/BLC-WSD-Frontend

64

References

Click on any of the following links for more

information about the word: happy

dictionary.com

thesaurus.com

British National Corpus

hide

Figure 2: Dictionary and Thesauri References

of tags that represent the word’s possible mean-ings. The participants were asked to attach asmany, or as few, as they deemed appropriate forall senses of the word, placing them in descendingorder of likelihood.

To assign a tag, the participants click on theAdd Tag button, and navigate to a box from thecategory selection where they can select a subcat-egory (Figures 3 and 4). By following these stepsthe participants add an entry in the list, that canthen be sorted by dragging and dropping the se-lected tags so that the most commonly used tag isat the top. We asked the users to remove any un-related tags and make sure they do not exceed 10tags in total.

General & Abstract

Terms

15 subcategories

The Body &

Individual

5 subcategories

Arts & Crafts

1 subcategories

Life & Living Things

3 subcategories

Numbers &

Measurement

6 subcategories

Education

1 subcategories

Science &

Technology

2 subcategories

Time

4 subcategories

Money &

Commerce

4 subcategories

Figure 3: Categories

To help them when identifying common sensesof a given word, we provided a number of links todictionaries, thesauri, and corpora (where they cansee real-world usage) for each language. The Ref-erences are displayed alongside the interface, sothey can still browse the tags (Figure 2). Partici-pants are free to use other resources as they see fit.The participants then needed to submit their selec-tions by clicking the Submit button at the end of

Anatomy and physiology

Terms relating to the (human) body and bodily processes

Health and disease

Terms relating to the (state of the) physical condition

Clothes and personal belongings

Terms relating to clothes and other personal belongings

Cleaning and personal care

Terms relating to domestic/personal hygiene

+ Select ‒ Select + Select ‒ Select

+ Select ‒ Select + Select ‒ Select

Figure 4: Subcategories

the page and wait until they receive a confirmationmessage where they need to copy the output-codeand provide it to us.

For each word we targeted a total of four non-expert participants and four expert participants toallow measurement and comparison of the agree-ment within each group to investigate the variabil-ity of task results and participants, rather than totake a simple weighted combination to produce anagreed list.

3.6 FilteringEven though crowdsourcing has been shown tobe effective in achieving expert quality for a va-riety of NLP tasks (Snow et al., 2008; Callison-Burch, 2009), we still needed to filter out work-ers who were not taking the task seriously or wereattempting to manipulate the system for personalgain (spamming).

In order to avoid these spamming crowdsourcedresults, we designed a novel task-specific twostage filtering process that we considered moresuitable for this type of task than previous filteringapproaches. Our two stage process encompassesfilters that are appropriate for experts and non-experts, and is applicable whether participants areusing MTurk or not.

In stage one filtering, we asked the MTurkworkers to select the correct synonym of the pre-sented word from a list of noisy candidates in or-der to avoid rejection of their HITs. The list con-tained four words where only one word correctlyfitted as a synonym. In order to set up the firstfiltering task for MTurk workers (on English, Chi-nese and Portuguese tasks), we used MultilingualWordNet to obtain the most common synonym foreach word. The stage one filtering was not neededfor Arabic, Italian and Urdu, since these non-

65

expert participants were directly contacted and weknew that they were native speakers and would notsubmit random results. The synonyms were vali-dated by linguists in each of the three languagesand choices were randomly shuffled before be-ing presented to the workers. Stage one filteringremoved 12% of English HITS, 2% of Chinese,and 6% of the Portuguese submissions. In our re-sults presented below, we only considered tasks byworkers who chose the correct synonyms and re-jected the others.

For the stage two filter, we injected random er-roneous senses for each of the 250 words into theinitial list of tags and the participants were ex-pected to remove these in order to pass. We de-liberately injected wrong and unrelated semantictags in between ‘potentially’ correct ones beforeshuffling the order of the tags. For example, ex-amining the pre-selected tags for the word ‘car’ inFigure 1 we can see that the semantic tag ‘Ameri-can Football’ is unrelated to the word ‘car’ and infact does not exist in the USAS semantic taxon-omy. The potentially correct tag such as ‘Move-ment/transportation: land’ does exist in the se-mantic lexicon. Results where participants fail topass stage two are still retained in the experimentand we report on the usefulness of this filter insection 4. All participants (MTurk workers anddirectly-contacted; experts and non-experts) un-dertook stage 2 filtering. Our experimental designdid not reveal to the participants any details of thetwo stage filtering process.

4 Results and Discussion

To evaluate the results8 we adopted three mainmetrics (inspired by those used in the SemEvalprocedures): Accuracy, Completeness, and Corre-lation.

Accuracy: measure the accuracy of the partic-ipant’s selection of tags (WTags) by counting thematching tags between the worker’s selection andthe gold standards (GTags). To compute Accu-racy we divide the number of matching tags by thenumber of tags selected by the participants.

Accuracy =WTags ∩GTags

WTags

Completeness: measure the completeness of theparticipant’s selection of tags (WTags) by finding

8All expert and non-expert results are available asCSV files on https://github.com/UCREL/Multilingual-USAS/tree/master/papers/eacl2017 sense workshop

whether the gold standard tags (GTags) are com-pletely or partially contained within the worker’sselection. To compute Completeness we divide thenumber of matching tags by the number of goldstandard tags.

Completeness =WTags ∩GTags

GTags

Correlation: To test the similarity of tag selec-tion between workers and gold standards we usedSpearman’s rank correlation coefficient.

In addition to the three metrics mentioned abovewe used three factors that work as indicators of thequality of the tagging process:

• Strict: Whether worker’s tags are identical tothe gold standard (same tags in the same or-der);

• First Tag Correct: Whether the first tag se-lected by the worker matches the first tag inthe gold standard;

• Fuzzy: Whether tags selected by a worker arecontained within the gold standard tags (inany order).

For each language we asked for up to four an-notators per word (1,000 HITs per language). ForPortuguese, where the participants were all fromMTurk, we only received 694 HITs even thoughwe paid participants working on Portuguese 50%more than we paid for the English tasks.

Table 29 shows the aggregate averages of thenon-expert HITs. In total, direct-contact partic-ipants and MTurk workers performed well andachieved comparable results to the gold standardin places. Around 50% of the English, Chinese,Urdu, and Portuguese HITs had the correct tags se-lected with around 15% being identical to the goldstandards. In nearly all of the cases, Portugueseworkers chose the correct tags although they werein a different order than the gold standard. Arabicparticipants achieved a high completeness scorerelative to the gold standard tags, but a close anal-ysis of the results show the participants have sug-gested more tags than the gold standards.

The results for English suggest that the non-expert workers are consistent as can be observed

9For this and the following tables, we use: Acc: Accuracy,Com: Completeness, Corr: Spearman’s, Err: Erroneous tags,Str: Strict, 1st: first tag correct, Fuz: Fuzzy.

66

Lang Acc Com Cor Err Str 1st FuzEn 61.4 69.3 0.38 29% 16% 65% 38%Ar 55.5 87.1 0.35 8% 8% 55% 19%Zh 45.2 56.1 0.22 2% 15% 46% 27%It 45.7 47.9 0.06 31% 7% 38% 22%Pt 58.5 56.3 0.21 18% 19% 50% 94%Ur 97.6 91.9 0.51 1% 53% 78% 95%

Table 2: Summary of performance [Non experts]

by looking at the Accuracy (Acc) and Complete-ness (Com) results. Spearman’s correlation (Cor)suggests that the workers correlate with the expertgold standard tags, which is consistent with previ-ous findings that MTurk is effective for a variety ofNLP tasks through achieving expert quality (Snowet al., 2008; Callison-Burch, 2009). The majorityof the workers matched the first tag correctly (1st)by ordering the tags so the most important (coresense) tag appeared at the top of their selection.The erroneous tags (Err) column shows that manyworkers did not remove some of the deliberately-wrong tags (see Section 3.5). This reflects the lackof training of the workers, but our checking of theresults showed that the erroneous tags were not se-lected as first choice. Strict (Str) and Fuzzy (Fuz)show that many workers were consistent with thegold standard tags in terms of both tag selectionand order. It is worth mentioning that languagesdiffer in terms of ambiguity (e.g. Urdu is less am-biguous than Arabic) which can be observed in thedifferences between language results.

As mentioned earlier, we did not use MTurk forthe Arabic lexicon, due to the lack of Arabic na-tives speakers on MTurk. Instead, we found fourstudent volunteers who offered to help in seman-tically tagging the words, again without any train-ing on the tagset. The results show consistent ac-curacy and completeness. It is worth noting thatthe Arabic participants obtained higher accuracyand completeness scores by having higher agree-ment with the gold standard tags. The Arabic lan-guage participants selected fewer erroneous tagsthan the English ones. The majority of the par-ticipants got the first tag correct. Arabic partici-pants failed to match the order of tags in the goldstandards as reflected by lower correlation. Thisis expected due to the fact that Arabic is highlyinflectional and derivational, which increases am-biguity and presents a challenge to the interpreta-tion of the words (El-Haj et al., 2014). Difficulties

in knowing the exact sense of an out of contextArabic word could result in disagreement when itcomes to ordering the senses (see Section 3.1).

For the Chinese language, the result table showsthat there is a slightly lower correlation betweenthe non-expert workers’ tags and the gold stan-dards. Observing the erroneous results column wecan see that the workers have made very few mis-takes and deleted the random unrelated tags. TheStrict and Fuzzy scores suggest the results to be ofhigh quality. The participants managed to get thefirst tag correct in many cases.

For the Italian language, we sourced four non-expert undergrad student participants who are allnative Italian speakers but not familiar with thetagset. The participants’ results do not correlatewell with the gold standards. As the tags descrip-tion are all in English the annotators found it diffi-cult to correctly select the senses and had to trans-late some tags into Italian which could have re-sulted in shifting the meaning of those tags, tocommunicate with them we had an Italian/Englishbilingual linguist as a mediator.

As with Arabic and Italian, for the Urdu lan-guage we sourced four non-expert participantswho are all native Urdu speakers but not familiarwith the tagset. Urdu results show that participantscorrelate well with the gold standards. We alsonotice a lower percentage of erroneous tags thanother languages. The First Tag and Fuzzy scoressuggest the results to be of high quality. The par-ticipants also managed to get the first tag correct inmany cases. The participants all agreed it was easyto define word senses with the words being lessambiguous compared to other languages. This isshown in the high results achieved when comparedto non-experts of the other languages.

We received only 694 HITs for Portuguese taskson MTurk, which suggests there are fewer Por-tuguese speakers compared to English and Chi-nese speakers among the MTurk workers. Theresults for Portuguese in some cases are of veryhigh quality. It should be noted that the gold stan-dard tags were selected and manually checked by aBrazilian Portuguese native speaker expert. Thereis a difference between European and Brazil-ian Portuguese which could result in ambiguouswords for speakers from the two regions (Frotaand Vigario, 2001).

Table 3 shows the results obtained by using thesecond filtering mechanism to discard HITs where

67

Lang Acc Com CorEn 70.4 69.0 0.36Ar 56.6 87.6 0.34Zh 45.6 55.9 0.22It 54.4 53.5 0.09Pt 61.3 54.0 0.20Ur 97.6 91.9 0.51

Table 3: Summary of performance with SecondFilter [Non experts]

random erroneous tags were not completely re-moved. This enables us to increase accuracy forEnglish by 9.0%, Italian by 8.7% and Portugueseby 2.8% without negatively affecting complete-ness or correlation.

In order to allow better interpretation of the non-experts’ scores, we repeated the experiments ona smaller scale with up to four experts per lan-guage (English, Arabic, Chinese and Urdu), whowere already familiar with the USAS taxonomyand were researchers in the fields of corpus orcomputational linguistics. Experts used the sametask interface to assign senses to 50 words each.The results are presented in Table 4. Most notably,experts consistently excel at removing erroneoustags, leaving only a very small number in the data.

English experts performed much better thanEnglish non-experts on completeness, correlationand strict measures while their accuracy scores arecomparable. Arabic experts performed much bet-ter than Arabic non-experts on the accuracy, strictand fuzzy scores while the 1st score is compa-rable. Chinese experts performed slightly betterthan Chinese non-experts on Accuracy, complete-ness and Fuzzy while other scores were compa-rable. Urdu experts scored relatively more highlyon strict and 1st measures while other scores werecomparable to Urdu non-experts. Finally, Tables 5and 6 show the Observed Agreement (OA), Fleiss’Kappa and Krippendorff’s alpha scores for theinter-rater agreement between Expert and Non Ex-pert participants. According to (Landis and Koch,1977) our inter-rater scores show fair agreementbetween annotators. This serves to illustrate thetask is complex even for experts.

Overall these results show that untrained crowd-sourcing workers can produce results that are com-parable to those of experts when performing se-mantic annotation tasks. Directly-contacted andMTurk workers achieved similar levels of results

Lang Acc Com Cor Err Str 1st FuzEn 66.1 83 0.61 1% 31% 75% 40%Ar 78.8 72.4 0.22 1% 39% 51% 73%Zh 50.4 60.2 0.21 1% 15% 44% 31%Ur 96.2 94.8 0.69 1% 63% 89% 93%

Table 4: Summary of performance [Experts]

Language Measure OA Fleiss K–alphaEnglish First Tag 0.82 0.46 0.46

Fuzzy 0.64 0.27 0.27Strict 0.69 0.32 0.32

Arabic First Tag 0.77 0.55 0.55Fuzzy 0.84 0.59 0.59Strict 0.21 0.55 0.55

Chinese First Tag 0.62 0.23 0.24Fuzzy 0.75 0.41 0.41Strict 0.83 0.31 0.32

Urdu First Tag 0.83 0.10 0.10Fuzzy 0.91 0.35 0.35Strict 0.71 0.37 0.38

Table 5: Total Inter-rater agreement [Experts].

overall. This shows that the novel two-phase fil-tering method used in our experiment is effectivefor maintaining the quality of the results.


In order to accelerate the task of creating multilin-gual semantic lexicons with coarse-grained wordsenses using a common multilingual semantic rep-resentation scheme, we employed non-expert na-tive speakers via MTurk who were not trainedwith the semantic taxonomy. Overall, the non-expert participants semantically tagged 250 wordsin each of six languages: Arabic, Chinese, En-glish, Italian, Portuguese and Urdu. We analysedthe results using a number of metrics to considerthe correct likelihood order of tags relative to agold-standard, along with correct removal of ran-dom erroneous semantic tags, and completenessof tag lists. Crowdsourcing has been applied suc-cessfully for other NLP tasks in previous research,and we build on previous success in WSD tasks inthree ways. Firstly, we have specific requirementsfor semantic tagging purposes in terms of placingcoarse-grained senses into a semantic taxonomyrather than stand-alone definitions. Hence, our ex-perimental set-up allows us to validate the senseinventory in a multilingual setting, carried out herefor six languages. Secondly, we extend the usual

68

Language Measure OA Fleiss K–alphaEnglish First Tag 0.71 0.36 0.36

Fuzzy 0.58 0.11 0.11Strict 0.79 0.20 0.20

Arabic First Tag 0.66 0.32 0.32Fuzzy 0.71 0.05 0.05Strict 0.86 0.03 0.03

Chinese First Tag 0.73 0.45 0.45Fuzzy 0.74 0.38 0.38Strict 0.85 0.41 0.41

Italian First Tag 0.67 0.31 0.31Fuzzy 0.67 0.03 0.03Strict 0.89 0.12 0.13

Portuguese First Tag 0.64 0.22 0.22Fuzzy 0.63 0.13 0.13Strict 0.80 0.18 0.18

Urdu First Tag 0.72 0.16 0.16Fuzzy 0.95 0.45 0.45Strict 0.74 0.49 0.49

Table 6: Total Inter-rater agreement [Non Ex-perts].

classification task of putting a word into one ofan existing list of senses, instead asking partici-pants to list all possible senses that a word couldtake in different contexts. Thirdly, we have de-ployed a novel two-stage filtering approach whichhas been shown to improve the quality of our re-sults by filtering out spam responses using a sim-ple synonym recognition task as well as HITs re-moving random erroneous tags. Our experimentsuggests that the crowdsourcing process can pro-duce results of good quality and is comparable tothe work done by expert linguists. We showed thatit is possible for native speakers to apply the hier-archical semantic taxonomy without prior trainingby the application of a graphical browsing inter-face to assist selection and annotation process.

In the future, we will apply the method on alarger scale to the full semantic lexicons includ-ing multiword expressions, which are importantfor contextual semantic disambiguation. We willalso investigate whether adaptations to our methodare required to include more languages such asCzech, Malay and Spanish. In order to pursue thework beyond the existing languages in the USASsystem, we will extend bootstrapping methods re-ported in Piao et al. (2015) with vector-based tech-niques and evaluate their appropriateness for mul-tiple languages. Finally, we will test whether(a) provision of words in context through concor-

dances, (b) prototypical examples for each seman-tic tag, or (c) semantic tag labels in the same lan-guage as the task word, as part of the resourcesavailable to participants would further enhance theaccuracy of the crowdsourcing annotation process.

Acknowledgements

The crowdsourcing framework and interface wasoriginally produced in an EPSRC Impact Accel-eration Award at Lancaster University in 2013.This work has also been supported by theUK ESRC/AHRC-funded CorCenCC (The Na-tional Corpus of Contemporary Welsh) Project(ref. ES/M011348/1), the ESRC-funded Centrefor Corpus Approaches to Social Science (ref.ES/K002155/1), and by the UCREL research cen-tre, Lancaster University, UK via Wmatrix soft-ware licence fees. We would also like to ex-press our thanks to the many expert and non-expertparticipants who were also involved in the ex-periments including: Dawn Archer (ManchesterMetropolitan University, UK), Francesca Bianchi(Universita del Salento, Italy), Carmen Dayrell(Lancaster University, UK), Xiaopeng Hu (ChinaCenter for Information Industry Development,CCID, China), Olga Mudraya (Right Word Up,UK), Sheryl Prentice (Lancaster University, UK),Yuan Qi (China Center for Information IndustryDevelopment, CCID, China), Jawad Shafi (COM-SATS Institute of Information technology, Pak-istan), and May Wong (University of Hong Kong,Hong Kong, China).

References

Ahmet Aker, Mahmoud El-Haj, Udo Kruschwitz, andM-Dyaa Albakour. 2012. Assessing Crowdsourc-ing Quality through Objective Tasks. In 8th Lan-guage Resources and Evaluation Conference, Istan-bul, Turkey. LREC 2012.

Cem Akkaya, Alexander Conrad, Janyce Wiebe, andRada Mihalcea. 2010. Amazon mechanical turkfor subjectivity word sense disambiguation. In Pro-ceedings of the NAACL HLT 2010 Workshop onCreating Speech and Language Data with Ama-zon’s Mechanical Turk, pages 195–203, Los Ange-les, USA. Association for Computational Linguis-tics.

Omar Alonso and Stefano Mizzaro. 2009. Can we getrid of TREC Assessors? Using Mechanical Turk forRelevance Assessment. In SIGIR ’09: Workshop onThe Future of IR Evaluation, Boston, USA.

69

Vikas Bhardwaj, Rebecca J. Passonneau, Ansaf Salleb-Aouissi, and Nancy Ide. 2010. Anveshan: A frame-work for analysis of multiple annotators’ labelingbehavior. In Proceedings of the Fourth LinguisticAnnotation Workshop, LAW IV ’10, pages 47–55,Stroudsburg, PA, USA. Association for Computa-tional Linguistics.

Chris Callison-Burch. 2009. Fast, Cheap, and Cre-ative: Evaluating Translation Quality Using Ama-zon’s Mechanical Turk. In Proceedings of the 2009Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP), pages 286–295, Singa-pore. Association for Computational Linguistics.

Pinar Donmez, Jaime G. Carbonell, and Jeff Schneider.2009. Efficiently learning the accuracy of labelingsources for selective sampling. In Proceedings ofthe 15th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, KDD ’09,pages 259–268, New York, NY, USA. ACM.

Julie S. Downs, Mandy B. Holbrook, Steve Sheng,and Lorrie Faith Cranor. 2010. Are your partic-ipants gaming the system?: Screening mechanicalturk workers. In Proceedings of the SIGCHI Con-ference on Human Factors in Computing Systems,CHI ’10, pages 2399–2402, New York, NY, USA.ACM.

Mahmoud El-Haj, Udo Kruschwitz, and Chris Fox.2010. Using Mechanical Turk to Create a Cor-pus of Arabic Summaries. In Language Resources(LRs) and Human Language Technologies (HLT) forSemitic Languages workshop held in conjunctionwith the 7th International Language Resources andEvaluation Conference (LREC 2010)., pages 36–39,Valletta, Malta. LREC 2010.

Mahmoud El-Haj, Udo Kruschwitz, and Chris Fox.2014. Creating language resources for under-resourced languages: methodologies, and experi-ments with Arabic. Language Resources and Eval-uation, pages 1–32.

Sonia Frota and Marina Vigario. 2001. On thecorrelates of rhythmic distinctions: The Euro-pean/Brazilian Portuguese case. pages 247–275.

Adam Kapelner, Krishna Kaliannan, H.AndrewSchwartz, Lyle Ungar, and Dean Foster. 2012.New insights from coarse word sense disambigua-tion in the crowd. In Proceedings of COLING 2012:Posters, pages 539–548, Mumbai, India. The COL-ING 2012 Organizing Committee.

Gabriella Kazai, Natasa Milic-Frayling, and JamieCostello. 2009. Towards methods for the collec-tive gathering and quality control of relevance as-sessments. In Proceedings of the 32nd InternationalACM SIGIR Conference on Research and Develop-ment in Information Retrieval, SIGIR ’09, pages452–459, New York, NY, USA. ACM.

Gabriella Kazai. 2011. In search of quality in crowd-sourcing for search engine evaluation. In PaulClough, Colum Foley, Cathal Gurrin, Gareth J.F.Jones, Wessel Kraaij, Hyowon Lee, and VanessaMudoch, editors, Advances in Information Re-trieval, volume 6611 of Lecture Notes in ComputerScience, pages 165–176. Springer Berlin Heidel-berg.

Adam Kilgarriff. 1997. “I Don’t Believe in WordSenses”. Computers and the Humanities, 31(2):91–113.

Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008.Crowdsourcing user studies with mechanical turk.In Proceedings of the SIGCHI Conference on Hu-man Factors in Computing Systems, CHI ’08, pages453–456, New York, NY, USA. ACM.

J.Richard Landis and Gary Koch. 1977. The mea-surement of observer agreement for categorical data.Biometrics, 33(1):159–174.

Laura Lofberg, Scott Piao, Asko Nykanen, KristaVarantola, Paul Rayson, and Jukka-Pekka Juntunen.2005. A semantic tagger for the Finnish language.In Proceedings of Corpus Linguistics 2005, Birm-ingham, UK.

Tom McArthur. 1981. Longman Lexicon of Contem-porary English. Longman, London, UK.

Olga Mudraya, Bogdan Babych, Scott Piao, PaulRayson, and Andrew Wilson. 2006. Developinga Russian semantic tagger for automatic semanticannotation. In Proceedings of Corpus Linguistics2006, pages 290–297, St. Petersburg, Russia.

Rebecca Passonneau, Nizar Habash, and Owen Ram-bow. 2006. Inter-annotator agreement on amultilingual semantic annotation task. In Pro-ceedings of the Fifth International Conference onLanguage Resources and Evaluation (LREC’06),Genoa, Italy. European Language Resources Asso-ciation (ELRA).

Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev,and Chris Callison-Burch. 2014. The language de-mographics of Amazon Mechanical Turk. Transac-tions of the Association for Computational Linguis-tics, 2:79–92.

Scott Piao, Francesca Bianchi, Carmen Dayrell, An-gela D’Egidio, and Paul Rayson. 2015. Develop-ment of the multilingual semantic annotation sys-tem. In Proceedings of North American Chapterof the Association for Computational Linguistics –Human Language Technologies Conference, Den-ver, USA. (NAACL HLT 2015).

Yufang Qian and Scott Piao. 2009. The developmentof a semantic annotation scheme for chinese kinship.Corpora, 4(2):189–208.

70

Paul Rayson, Dawn Archer, Scott Piao, and AnthonyMcEnery. 2004. The UCREL semantic analysissystem. In Proceedings of the Beyond Named En-tity Recognition Semantic Labelling for NLP tasksworkshop, pages 7–12, Lisbon, Portugal.

Aanna Rumshisky, Nick Botchan, Sophie Kushkuley,and James Pustejovsky. 2012. Word sense inven-tories by non-experts. In Nicoletta Calzolari (Con-ference Chair), Khalid Choukri, Thierry Declerck,Mehmet Uur Doan, Bente Maegaard, Joseph Mar-iani, Jan Odijk, and Stelios Piperidis, editors, Pro-ceedings of the Eight International Conference onLanguage Resources and Evaluation (LREC’12),pages 4055–4059, Istanbul, Turkey, May. EuropeanLanguage Resources Association (ELRA).

Rion Snow, Brendan O’Connor, Daniel Jurafsky, andAndrew Y. Ng. 2008. Cheap and Fast – But is itGood? Evaluating Non-Expert Annotations for Nat-ural Language Tasks. In Proceedings of the 2008Conference on Empirical Methods in Natural Lan-guage Processing, pages 254–263. Association forComputational Linguistics.

Alexander. Sorokin and David Forsyth. 2008. Utilitydata annotation with amazon mechanical turk. In InIEEE Conference on Computer Vision and PatternRecognition Workshops, pages 1–8.

Jean Veronis. 2001. Sense tagging: does it makesense? In Proceedings of Corpus Linguistics 2001,Lancaster, UK. UCREL.

Peter Welinder, Steve Branson, Serge Belongie, andPietro Perona. 2010. The Multidimensional Wis-dom of Crowds. In J. Lafferty, C. K. I. Williams,J. Shawe-Taylor, R. S. Zemel, and A. Culotta, ed-itors, Advances in Neural Information ProcessingSystems 23, pages 2424–2432.

Yin Yang, Nilesh Bansal, Wisam Dakka, Panagio-tis Ipeirotis, Nick Koudas, and Dimitris Papadias.2009. Query by document. In Proceedings ofthe Second ACM International Conference on WebSearch and Data Mining, WSDM ’09, pages 34–43,New York, NY, USA. ACM.

71


Using Linked Disambiguated Distributional Networks forWord Sense Disambiguation

Alexander Panchenko‡, Stefano Faralli†, Simone Paolo Ponzetto†, and Chris Biemann‡

‡Language Technology Group, Computer Science Dept., University of Hamburg, Germany†Web and Data Science Group, Computer Science Dept., University of Mannheim, Germany

{panchenko,biemann}@informatik.uni-hamburg.de{faralli,simone}@informatik.uni-mannheim.de

Abstract

We introduce a new method for unsuper-vised knowledge-based word sense disam-biguation (WSD) based on a resource thatlinks two types of sense-aware lexical net-works: one is induced from a corpus us-ing distributional semantics, the other ismanually constructed. The combination oftwo networks reduces the sparsity of senserepresentations used for WSD. We evalu-ate these enriched representations withintwo lexical sample sense disambiguationbenchmarks. Our results indicate that (1)features extracted from the corpus-basedresource help to significantly outperform amodel based solely on the lexical resource;(2) our method achieves results compara-ble or better to four state-of-the-art unsu-pervised knowledge-based WSD systemsincluding three hybrid systems that alsorely on text corpora. In contrast to thesehybrid methods, our approach does not re-quire access to web search engines, textsmapped to a sense inventory, or machinetranslation systems.

1 Introduction

The representation of word senses and the dis-ambiguation of lexical items in context is an on-going long-established branch of research (Agirreand Edmonds, 2007; Navigli, 2009). Tradition-ally, word senses are defined and represented inlexical resources, such as WordNet (Fellbaum,1998), while more recently, there is an increasedinterest in approaches that induce word sensesfrom corpora using graph-based distributional ap-proaches (Dorow and Widdows, 2003; Biemann,2006; Hope and Keller, 2013), word sense embed-dings (Neelakantan et al., 2014; Bartunov et al.,

2016) and combination of both (Pelevina et al.,2016). Finally, some hybrid approaches emerged,which aim at building sense representations us-ing information from both corpora and lexical re-sources, e.g. (Rothe and Schutze, 2015; Camacho-Collados et al., 2015a; Faralli et al., 2016). Inthis paper, we further explore the last strain ofresearch, investigating the utility of hybrid senserepresentation for the word sense disambiguation(WSD) task.

In particular, the contribution of this paper isa new unsupervised knowledge-based approach toWSD based on the hybrid aligned resource (HAR)introduced by Faralli et al. (2016). The key dif-ference of our approach from prior hybrid meth-ods based on sense embeddings, e.g. (Rothe andSchutze, 2015), is that we rely on sparse lexicalrepresentations that make the sense representationreadable and allow to straightforwardly use thisrepresentation for word sense disambiguation, aswill be shown below. In contrast to hybrid ap-proaches based on sparse interpretable represen-tations, e.g. (Camacho-Collados et al., 2015a), ourmethod requires no mapping of texts to a sense in-ventory and thus can be applied to larger text col-lections. By linking symbolic distributional senserepresentations to lexical resources, we are able toimprove representations of senses, leading to per-formance gains in word sense disambiguation.

2 Related Work

Several prior approaches combined distributionalinformation extracted from text (Turney and Pan-tel, 2010) from text with information availablein lexical resources, such as WordNet. Yu andDredze (2014) proposed a model to learn wordembeddings based on lexical relations of wordsfrom WordNet and PPDB (Ganitkevitch et al.,2013). The objective function of their model

72

combines the objective function of the skip-grammodel (Mikolov et al., 2013) with a term that takesinto account lexical relations of a target word.Faruqui et al. (2015) proposed a related approachthat performs a post-processing of word embed-dings on the basis of lexical relations from thesame resources. Pham et al. (2015) introduced an-other model that also aim at improving word vec-tor representations by using lexical relations fromWordNet. The method makes representations ofsynonyms closer than representations of antonymsof the given word. While these three models im-prove the performance on word relatedness eval-uations, they do not model word senses. Jauharet al. (2015) proposed two models that tackle thisshortcoming, learning sense embeddings using theword sense inventory of WordNet.

Iacobacci et al. (2015) proposed to learnsense embeddings on the basis of the BabelNetlexical ontology (Navigli and Ponzetto, 2012).Their approach is to train the standard skip-gram model on a pre-disambiguated corpus us-ing the Babelfy WSD system (Moro et al., 2014).NASARI (Camacho-Collados et al., 2015a) re-lies on Wikipedia and WordNet to produce vec-tor representations of senses. In this approach,a sense is represented in lexical or sense-basedfeature spaces. The links between WordNet andWikipedia are retrieved from BabelNet. MUFFIN(Camacho-Collados et al., 2015b) adapts severalideas from NASARI, extending the method to themulti-lingual case by using BabelNet synsets in-stead of monolingual WordNet synsets.

The approach of Chen et al. (2015) to learn-ing sense embeddings starts from initializationof sense vectors using WordNet glosses. It pro-ceeds by performing a more conventional contextclustering, similar what is found to unsupervisedmethods such as (Neelakantan et al., 2014; Bar-tunov et al., 2016).

Rothe and Schutze (2015) proposed a methodthat learns sense embedding using word embed-dings and the sense inventory of WordNet. Theapproach was evaluated on the WSD tasks usingfeatures based on the learned sense embeddings.

Goikoetxea et al. (2015) proposed a method forlearning word embeddings using random walks ona graph of a lexical resource. Nieto Pina and Jo-hansson (2016) used a similar approach based onrandom walks on a WordNet to learn sense embed-dings.

All these diverse contributions indicate the ben-efits of hybrid knowledge sources for learningword and sense representations.

3 Unsupervised Knowledge-based WSDusing Hybrid Aligned Resource

We rely on the hybrid aligned lexical semantic re-source proposed by Faralli et al. (2016) to performWSD. We start with a short description of this re-source and then discuss how it is used for WSD.

3.1 Construction of the Hybrid AlignedResource (HAR)

The hybrid aligned resource links two lexicalsemantic networks using the method of Faralliet al. (2016): a corpus-based distributionally-induced network and a manually-constructed net-work. Sample entries of the HAR are presentedin Table 1. The corpus-based part of the resource,called proto-conceptualization (PCZ), consists ofsense-disambiguated lexical items (PCZ ID), dis-ambiguated related terms and hypernyms, as wellas context clues salient to the lexical item. Theknowledge-based part of the resource, calledconceptualization, is represented by synsets ofthe lexical resource and relations between them(WordNet ID). Each sense in the PCZ network issubsequently linked to a sense of the knowledge-based network based on their similarity calculatedon the basis of lexical representations of sensesand their neighbors. The construction of the PCZinvolves the following steps (Faralli et al., 2016):

Building a Distributional Thesaurus (DT). Atthis stage, a similarity graph over terms is inducedfrom a corpus, where each entry consists of themost similar 200 terms for a given term using theJoBimText method (Biemann and Riedl, 2013).

Word Sense Induction. In DTs, entries of pol-ysemous terms are mixed, i.e. they contain relatedterms of several senses. The Chinese Whispers(Biemann, 2006) graph clustering is applied to theego-network (Everett and Borgatti, 2005) of theeach term, as defined by its related terms and con-nections between then observed in the DT to de-rive word sense clusters.

Labeling Word Senses with Hypernyms.Hearst (1992) patterns are used to extract hy-pernyms from the corpus. These hypernymsare assigned to senses by aggregating hypernym

73

PCZ ID WordNet ID Related Terms Hypernyms Context Cluesmouse:0 mouse:1 rat:0, rodent:0, monkey:0, ... animal:0, species:1, ... rat:conj and, white-footed:amod, ...mouse:1 mouse:4 keyboard:1, computer:0, printer:0 ... device:1, equipment:3, ... click:-prep of, click:-nn, ....keyboard:0 keyboard:1 piano:1, synthesizer:2, organ:0 ... instrument:2, device:3, ... play:-dobj, electric:amod, ..keyboard:1 keyboard:1 keypad:0, mouse:1, screen:1 ... device:1, technology:0 ... computer, qwerty:amod ...

Table 1: Sample entries of the hybrid aligned resource (HAR) for the words “mouse” and “keyboard”.Trailing numbers indicate sense identifiers. Relatedness and context clue scores are not shown for brevity.

relations over the list of related terms for the givensense into a weighted list of hypernyms.

Disambiguation of Related Terms and Hyper-nyms. While target words contain sense distinc-tions (PCZ ID), the related words and hypernymsdo not carry sense information. At this step, eachhypernym and related term is disambiguated withrespect to the induced sense inventory (PCZ ID).For instance, the word “keyboard” in the list ofrelated terms for the sense “mouse:1” is linkedto its “device” sense represented (“keyboard:1”)as “mouse:1” and “keyboard:1” share neighborsfrom the IT domain.

Retrieval of Context Clues. Salient contexts ofsenses are retrieved by aggregating salient depen-dency features of related terms. Context featuresthat have a high weight for many related terms ob-tain a high weight for the sense.

3.2 HAR Datasets

We experiment with two different corpora for PCZinduction as in (Faralli et al., 2016), namely a 100million sentence news corpus (news) from Giga-word (Parker et al., 2011) and LCC (Richter et al.,2006), and a 35 million sentence Wikipedia cor-pus (wiki).1 Chinese Whispers sense clustering isperformed with the default parameters, producingan average number of 2.3 (news) and 1.8 (wiki)senses per word in a vocabulary of 200 thousandwords each, with the usual power-law distributionof sense cluster sizes. On average, each senseis related to about 47 senses and has assigned 5hypernym labels. These disambiguated distribu-tional networks were linked to WordNet 3.1 usingthe method of Faralli et al. (2016).

3.3 Using the Hybrid Aligned Resource inWord Sense Disambiguation

We experimented with four different ways of en-riching the original WordNet-based sense repre-

1The used PCZ and HAR resources are available at:https://madata.bib.uni-mannheim.de/171

sentation with contextual information from theHAR on the basis of the mappings listed below:

WordNet. This baseline model relies solely onthe WordNet lexical resource. It builds senserepresentations by collecting synonyms and sensedefinitions for the given WordNet synset andsynsets directly connected to it. We removed stopwords and weight words with term frequency.

WordNet + Related (news). This model aug-ments the WordNet-based representation with re-lated terms from the PCZ items (see Table 1). Thissetting is designed to quantify the added value oflexical knowledge in the related terms of PCZ.

WordNet + Related (news) + Context (news).This model includes all features of the previousmodels and complements them with context cluesobtained by aggregating features of the wordsfrom the WordNet + Related (news) (see Table 1).

WordNet + Related (news) + Context (wiki).This model is built in the same way as the pre-vious model, but using context clues derived fromWikipedia (see Section 3.2).

In the last two models, we used up to 5000most relevant context clues per word sense. Thisvalue was set experimentally: performance of theWSD system gradually increased with the num-ber of context clues reaching a plateau at the valueof 5000. During aggregation, we excluded stopwords and numbers from context clues. Besides,we transformed syntactic context clues presentedin Table 1 to terms, stripping the dependency type.so they can be added to other lexical representa-tions. For instance, the context clue “rat:conj and”of the entry “mouse:0” was reduced to the feature“rat”.

Table 2 demonstrates features extracted fromWordNet as compared to feature representationsenriched with related terms of the PCZ.

Each WordNet word sense is represented withone of the four methods described above. Thesesense representations are subsequently used to per-

74

Model Sense Representation

WordNet memory, device, floppy, disk, hard, disk, disk, computer, science, computing, diskette, fixed, disk, floppy, magnetic,disc, magnetic, disk, hard, disc, storage, device

WordNet + Related (news)

recorder, disk, floppy, console, diskette, handset, desktop, iPhone, iPod, HDTV, kit, RAM, Discs, Blu-ray, computer, GB, mi-crochip, site, cartridge, printer, tv, VCR, Disc, player, LCD, software, component, camcorder, cellphone, card, monitor, display,burner, Web, stereo, internet, model, iTunes, turntable, chip, cable, camera, iphone, notebook, device, server, surface, wafer,page, drive, laptop, screen, pc, television, hardware, YouTube, dvr, DVD, product, folder, VCR, radio, phone, circuitry, partition,megabyte, peripheral, format, machine, tuner, website, merchandise, equipment, gb, discs, MP3, hard-drive, piece, video, storagedevice, memory device, microphone, hd, EP, content, soundtrack, webcam, system, blade, graphic, microprocessor, collection,document, programming, battery, keyboard, HD, handheld, CDs, reel, web, material, hard-disk, ep, chart, debut, configuration,recording, album, broadcast, download, fixed disk, planet, pda, microfilm, iPod, videotape, text, cylinder, cpu, canvas, label,sampler, workstation, electrode, magnetic disc, catheter, magnetic disk, Video, mobile, cd, song, modem, mouse, tube, set, ipad,signal, substrate, vinyl, music, clip, pad, audio, compilation, memory, message, reissue, ram, CD, subsystem, hdd, touchscreen,electronics, demo, shell, sensor, file, shelf, processor, cassette, extra, mainframe, motherboard, floppy disk, lp, tape, version, kilo-byte, pacemaker, browser, Playstation, pager, module, cache, DVD, movie, Windows, cd-rom, e-book, valve, directory, harddrive,smartphone, audiotape, technology, hard disk, show, computing, computer science, Blu-Ray, blu-ray, HDD, HD-DVD, scanner,hard disc, gadget, booklet, copier, playback, TiVo, controller, filter, DVDs, gigabyte, paper, mp3, CPU, dvd-r, pipe, cd-r, playlist,slot, VHS, film, videocassette, interface, adapter, database, manual, book, channel, changer, storage

Table 2: Original and enriched representations of the third sense of the word “disk” in the WordNet senseinventory. Our sense representation is enriched with related words from the hybrid aligned resource.

form WSD in context. For each test instance con-sisting of a target word and its context, we selectthe sense whose corresponding sense representa-tion has the highest cosine similarity with the tar-get word’s context.

4 Evaluation

We perform an extrinsic evaluation and show theimpact of the hybrid aligned resource on wordsense disambiguation performance. While thereexist many datasets for WSD (Mihalcea et al.,2004; Pradhan et al., 2007; Manandhar et al.,2010, inter alia), we follow Navigli and Ponzetto(2012) and use the SemEval-2007 Task 16 onthe “Evaluation of wide-coverage knowledge re-sources” (Cuadros and Rigau, 2007). This task isspecifically designed for evaluating the impact oflexical resources on WSD performance. The Sem-Eval-2007 Task 16 is, in turn, based on two “lexi-cal sample” datasets, from the Senseval-3 (Mihal-cea et al., 2004) and SemEval-2007 Task 17 (Prad-han et al., 2007) evaluation campaigns. The firstdataset has coarse- and fine-grained annotations,while the second contains only fine-grained senseannotations. In all experiments, we use the offi-cial task’s evaluator to compute standard metricsof recall, precision, and F-score.

5 Results

Impact of the corpus-based features. Figure 1compares various sense representations in termsof F-score. The results show that, expandingWordNet-based sense representations with distri-butional information gives a clear advantage overthe original representation on both Senseval-3 andSemEval-2007 datasets. Using related words spe-cific to a given WordNet sense provides dramatic

improvements in the results. Further expansion ofthe sense representation with context clues (cf. Ta-ble 1) provide a modest further improvement onthe SemEval-2007 dataset and yield no further im-provement on the case of the Senseval-3 dataset.

Comparison to the state-of-the-art. We com-pare our approach to four state-of-the-art sys-tems: KnowNet (Cuadros and Rigau, 2008), Ba-belNet, WN+XWN (Cuadros and Rigau, 2007),and NASARI. KnowNet builds sense representa-tions based on snippets retrieved with a web searchengine. We use the best configuration reported inthe original paper (KnowNet-20), which extendseach sense with 20 keywords. BabelNet in itscore relies on a mapping of WordNet synsets andWikipedia articles to obtain enriched sense rep-resentations. The WN+XWN system is the top-ranked unsupervised knowledge-based system ofSenseval-3 and SemEval-2007 datasets from theoriginal competition (Cuadros and Rigau, 2007).It alleviates sparsity by combining WordNet withthe eXtended WordNet (Mihalcea and Moldovan,2001). The latter resource relies on parsing ofWordNet glosses.

For KnowNet, BabelNet, and WN+XWN weuse the scores reported in the respective originalpublications. However, as NASARI was not eval-uated on the datasets used in our study, we usedthe following procedure to obtain NASARI-basedsense representations: Each WordNet-based senserepresentation was extended with all features fromthe lexical vectors of NASARI.2

Thus, we compare our method to three hybridsystems that induce sense representations on the

2We used the version of lexical vectors (July 2016) featur-ing 4.4 million of BabelNet synsets, yet covering only 72%of word senses of the two datasets used in our experiments.

75

Figure 1: Performance of different word sense representation strategies.

Senseval-3 fine-grained SemEval-2007 fine-grainedModel Precision Recall F-score Precision Recall F-scoreRandom 19.1 19.1 19.1 27.4 27.4 27.4WordNet 29.7 29.7 29.7 44.3 21.0 28.5WordNet + Related (news) 47.5 47.5 47.5 54.0 50.0 51.9WordNet + Related (news) + Context (news) 47.2 47.2 47.2 54.8 51.2 52.9WordNet + Related (news) + Context (wiki) 46.9 46.9 46.9 55.2 51.6 53.4BabelNet 44.3 44.3 44.3 56.9 53.1 54.9KnowNet 44.1 44.1 44.1 49.5 46.1 47.7NASARI (lexical vectors) 32.3 32.2 32.2 49.3 45.8 47.5WN+XWN 38.5 38.0 38.3 54.9 51.1 52.9

Table 3: Comparison of our approach to the state of the art unsupervised knowledge-based methods onthe SemEval-2007 Task 16 (weighted setting). The best results overall are underlined.

basis of WordNet and texts (KnowNet, BabelNet,NASARI) and one purely knowledge-based sys-tem (WN+XWN). Note that we do not include thesupervised TSSEM system in this comparison, asin contrast to all other considered methods, it re-lies on a large sense-labeled corpus.

Table 3 presents results of the evaluation.On the Senseval-3 dataset, our hybrid modelsshow better performance than all unsupervisedknowledge-based approaches considered in ourexperiment. On the SemEval-2007 dataset, theonly resource which exceeds the performance ofour hybrid model is BabelNet. The extra perfor-mance of BabelNet on the SemEval dataset canbe explained by its multilingual approach: addi-tional features are obtained using semantic rela-tions across synsets in different languages. Be-sides, machine translation is used to further enrichcoverage of the resource (Navigli and Ponzetto,2012).

These results indicate on the high quality ofthe sense representations obtained using the hy-brid aligned resource. Using related words ofinduced senses improves WSD performance bya large margin as compared to purely WordNet-based model on both datasets. Adding extra con-textual features further improves slightly resultson one dataset. Thus, we recommend enrichingsense representations with related words and op-

tionally with context clues. Finally, note that,while our method shows competitive results com-pared to other state-of-the-art hybrid systems, itdoes not require access to web search engines(KnowNet), texts mapped to a sense inventory(BabelNet, NASARI), or machine translation sys-tems (BabelNet).

6 Conclusion

The hybrid aligned resource (Faralli et al., 2016)successfully enriches sense representations of amanually-constructed lexical network with fea-tures derived from a distributional disambiguatedlexical network. Our WSD experiments on twodatasets show that this additional information ex-tracted from corpora let us substantially outper-form the model based solely on the lexical re-source. Furthermore, a comparison of our senserepresentation method with existing hybrid ap-proaches leveraging corpus-based features demon-strate its state-of-the-art performance.

Acknowledgments

We acknowledge the support of the DeutscheForschungsgemeinschaft (DFG) under the project”JOIN-T: Joining Ontologies and Semantics In-duced from Text”.

76

ReferencesEneko Agirre and Philip Edmonds. 2007. Word sense

disambiguation: Algorithms and applications, vol-ume 33. Springer Science & Business Media.

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin,and Dmitry Vetrov. 2016. Breaking sticks and am-biguities with adaptive skip-gram. In Proceedings ofthe 19th International Conference on Artificial Intel-ligence and Statistics (AISTATS’2016), pages 130–138, Cadiz, Spain. JMLR: W&CP volume 51.

Chris Biemann and Martin Riedl. 2013. Text: Now in2D! A Framework for Lexical Expansion with Con-textual Similarity. Journal of Language Modelling,1(1):55–95.

Chris Biemann. 2006. Chinese whispers - an efficientgraph clustering algorithm and its application to nat-ural language processing problems. In Proceedingsof TextGraphs: the First Workshop on Graph BasedMethods for Natural Language Processing, pages73–80, New York City. Association for Computa-tional Linguistics.

Jose Camacho-Collados, Mohammad Taher Pilehvar,and Roberto Navigli. 2015a. Nasari: a novelapproach to a semantically-aware representation ofitems. In Proceedings of the 2015 Conference ofthe North American Chapter of the Association forComputational Linguistics: Human Language Tech-nologies, pages 567–577, Denver, Colorado. Asso-ciation for Computational Linguistics.

Jose Camacho-Collados, Mohammad Taher Pilehvar,and Roberto Navigli. 2015b. A unified multilingualsemantic representation of concepts. In Proceedingsof the 53rd Annual Meeting of the Association forComputational Linguistics and the 7th InternationalJoint Conference on Natural Language Processing(Volume 1: Long Papers), pages 741–751, Beijing,China. Association for Computational Linguistics.

Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang.2015. Improving distributed representation of wordsense via wordnet gloss composition and contextclustering. In Proceedings of the 53rd Annual Meet-ing of the Association for Computational Linguisticsand the 7th International Joint Conference on Natu-ral Language Processing (Volume 2: Short Papers),pages 15–20, Beijing, China. Association for Com-putational Linguistics.

Montse Cuadros and German Rigau. 2007. Semeval-2007 task 16: Evaluation of wide coverage knowl-edge resources. In Proceedings of the FourthInternational Workshop on Semantic Evaluations(SemEval-2007), pages 81–86, Prague, Czech Re-public. Association for Computational Linguistics.

Montse Cuadros and German Rigau. 2008. KnowNet:Building a large net of knowledge from the web. InProceedings of the 22nd International Conferenceon Computational Linguistics (Coling 2008), pages

161–168, Manchester, UK, August. Coling 2008 Or-ganizing Committee.

Beate Dorow and Dominic Widdows. 2003. Discov-ering Corpus-Specific Word Senses. In Proceedingsof the Tenth Conference on European Chapter of theAssociation for Computational Linguistics - Volume2, EACL ’03, pages 79–82, Budapest, Hungary. As-sociation for Computational Linguistics.

Martin Everett and Stephen P. Borgatti. 2005. Egonetwork betweenness. Social Networks, 27(1):31–38.

Stefano Faralli, Alexander Panchenko, Chris Biemann,and Simone P. Ponzetto. 2016. Linked disam-biguated distributional semantic networks. In In-ternational Semantic Web Conference (ISWC’2016),pages 56–64, Kobe, Japan. Springer.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar,Chris Dyer, Eduard Hovy, and Noah A. Smith.2015. Retrofitting word vectors to semantic lexi-cons. In Proceedings of the 2015 Conference ofthe North American Chapter of the Association forComputational Linguistics: Human Language Tech-nologies, pages 1606–1615, Denver, Colorado. As-sociation for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An ElectronicDatabase. MIT Press, Cambridge, MA.

Juri Ganitkevitch, Benjamin Van Durme, and ChrisCallison-Burch. 2013. Ppdb: The paraphrasedatabase. In Proceedings of the 2013 Conference ofthe North American Chapter of the Association forComputational Linguistics: Human Language Tech-nologies, pages 758–764, Atlanta, Georgia. Associ-ation for Computational Linguistics.

Josu Goikoetxea, Aitor Soroa, and Eneko Agirre.2015. Random walks and neural network languagemodels on knowledge bases. In Proceedings of the2015 Conference of the North American Chapterof the Association for Computational Linguistics:Human Language Technologies, pages 1434–1439,Denver, Colorado. Association for ComputationalLinguistics.

Marti A. Hearst. 1992. Automatic acquisition of hy-ponyms from large text corpora. In Proceedingsof the 6th International Conference on Computa-tional Linguistics (COLING’1992), pages 539–545,Nantes, France.

David Hope and Bill Keller. 2013. MaxMax: AGraph-Based Soft Clustering Algorithm Applied toWord Sense Induction. In Computational Linguis-tics and Intelligent Text Processing: 14th Interna-tional Conference, CICLing 2013, Samos, Greece,March 24-30, 2013, Proceedings, Part I, pages 368–381. Springer Berlin Heidelberg, Berlin, Heidelberg.

Ignacio Iacobacci, Mohammad Taher Pilehvar, andRoberto Navigli. 2015. Sensembed: Learningsense embeddings for word and relational similarity.

77

In Proceedings of the 53rd Annual Meeting of theAssociation for Computational Linguistics and the7th International Joint Conference on Natural Lan-guage Processing (Volume 1: Long Papers), pages95–105, Beijing, China. Association for Computa-tional Linguistics.

Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy.2015. Ontologically grounded multi-sense repre-sentation learning for semantic vector space models.In Proceedings of the 2015 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,pages 683–693, Denver, Colorado. Association forComputational Linguistics.

Suresh Manandhar, Ioannis P. Klapaftis, Dmitriy Dli-gach, and Sameer S. Pradhan. 2010. SemEval-2010task 14: Word sense induction & disambiguation. InProceedings of the 5th International Workshop onSemantic Evaluation (ACL’2010), pages 63–68, Up-psala, Sweden. Association for Computational Lin-guistics.

Rada Mihalcea and Dan Moldovan. 2001. extendedwordnet: Progress report. In In Proceedings ofNAACL Workshop on WordNet and Other LexicalResources, pages 1–5, Pittsburgh, PA, USA. Asso-ciation for Computational Linguistics.

Rada Mihalcea, Timothy Chklovski, and Adam Kil-garriff. 2004. The SENSEVAL-3 English lexicalsample task. In SENSEVAL-3: Third InternationalWorkshop on the Evaluation of Systems for the Se-mantic Analysis of Text, pages 25–28, Barcelona,Spain. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S.Corrado, and Jeffrey Dean. 2013. Distributed rep-resentations of words and phrases and their com-positionality. In Proceedings of Advances in Neu-ral Information Processing Systems 26 (NIPS’2013),pages 3111–3119, Harrahs and Harveys, CA, USA.

Andrea Moro, Alessandro Raganato, and Roberto Nav-igli. 2014. Entity linking meets word sense disam-biguation: a unified approach. Transactions of theAssociation for Computational Linguistics, 2:231–244.

Roberto Navigli and Simone Paolo Ponzetto. 2012.Babelnet: The automatic construction, evaluationand application of a wide-coverage multilingual se-mantic network. Artificial Intelligence, 193:217–250.

Roberto Navigli. 2009. Word sense disambiguation: Asurvey. ACM CSUR, 41(2):1–69.

Arvind Neelakantan, Jeevan Shankar, Alexandre Pas-sos, and Andrew McCallum. 2014. Efficientnon-parametric estimation of multiple embeddingsper word in vector space. In Proceedings of the2014 Conference on Empirical Methods in NaturalLanguage Processing (EMNLP), pages 1059–1069,

Doha, Qatar. Association for Computational Lin-guistics.

Luis Nieto Pina and Richard Johansson. 2016. Em-bedding senses for efficient graph-based word sensedisambiguation. In Proceedings of TextGraphs-10:the Workshop on Graph-based Methods for NaturalLanguage Processing, pages 1–5, San Diego, CA,USA. Association for Computational Linguistics.

Robert Parker, David Graff, Junbo Kong, Ke Chen, andKazuaki Maeda. 2011. English Gigaword Fifth Edi-tion. Linguistic Data Consortium, Philadelphia.

Maria Pelevina, Nikolay Arefiev, Chris Biemann, andAlexander Panchenko. 2016. Making sense of wordembeddings. In Proceedings of the 1st Workshopon Representation Learning for NLP, pages 174–183, Berlin, Germany. Association for Computa-tional Linguistics.

Nghia The Pham, Angeliki Lazaridou, and Marco Ba-roni. 2015. A multitask objective to inject lexicalcontrast into distributional semantics. In Proceed-ings of the 53rd Annual Meeting of the Associationfor Computational Linguistics and the 7th Interna-tional Joint Conference on Natural Language Pro-cessing (Volume 2: Short Papers), pages 21–26, Bei-jing, China. Association for Computational Linguis-tics.

Sameer Pradhan, Edward Loper, Dmitriy Dligach, andMartha Palmer. 2007. Semeval-2007 task-17: En-glish lexical sample, srl and all words. In Proceed-ings of the Fourth International Workshop on Se-mantic Evaluations (SemEval-2007), pages 87–92,Prague, Czech Republic. Association for Computa-tional Linguistics.

Matthias Richter, Uwe Quasthoff, Erla Hallsteinsdottir,and Chris Biemann. 2006. Exploiting the LeipzigCorpora Collection. In Proceedings of the FifthSlovenian and First International Language Tech-nologies Conference (IS-LTC), Ljubljana, Slovenia.

Sascha Rothe and Hinrich Schutze. 2015. Autoex-tend: Extending word embeddings to embeddingsfor synsets and lexemes. In Proceedings of the53rd Annual Meeting of the Association for Compu-tational Linguistics and the 7th International JointConference on Natural Language Processing (Vol-ume 1: Long Papers), pages 1793–1803, Beijing,China, July. Association for Computational Linguis-tics.

Peter D. Turney and Patrick Pantel. 2010. From fre-quency to meaning: Vector space models of seman-tics. JAIR, 37:141–188.

Mo Yu and Mark Dredze. 2014. Improving lexicalembeddings with semantic knowledge. In Proceed-ings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 2: Short Pa-pers), pages 545–550, Baltimore, Maryland. Asso-ciation for Computational Linguistics.

78


One Representation per Word — Does it make Sense for Composition?

Thomas Kober, Julie Weeds, John Wilkie, Jeremy Reffin and David WeirTAG laboratory, Department of Informatics, University of Sussex

Brighton, BN1 9RH, UK{t.kober, j.e.weeds, jw478, j.p.reffin, d.j.weir}@sussex.ac.uk

Abstract

In this paper, we investigate whether ana priori disambiguation of word senses isstrictly necessary or whether the meaningof a word in context can be disambiguatedthrough composition alone. We evaluatethe performance of off-the-shelf single-vector and multi-sense vector models ona benchmark phrase similarity task and anovel task for word-sense discrimination.We find that single-sense vector modelsperform as well or better than multi-sensevector models despite arguably less cleanelementary representations. Our findingsfurthermore show that simple compositionfunctions such as pointwise addition areable to recover sense specific informationfrom a single-sense vector model remark-ably well.

1 Introduction

Distributional word representations based oncounting co-occurrences have a long history innatural language processing and have successfullybeen applied to numerous tasks such as sentimentanalysis, recognising textual entailment, word-sense disambiguation and many other importantproblems. More recently low-dimensional anddense neural word embeddings have received aconsiderable amount of attention in the researchcommunity and have become ubiquitous in numer-ous NLP pipelines in academia and industry. Onefundamental simplifying assumption commonlymade in distributional semantic models, however,is that every word can be encoded by a single rep-resentation. Combining polysemous lexemes intoa single vector has the consequence of essentiallycreating a weighted average of all observed mean-ings of a lexeme in a given text corpus.

Therefore a number of proposals have beenmade to overcome the issue of conflating severaldifferent senses of an individual lexeme into asingle representation. One approach (Reisingerand Mooney, 2010; Huang et al., 2012) is to trydirectly inferring a predefined number of sensesfrom data and subsequently label any occurrencesof a polysemous lexeme with the inferred inven-tory. Similar approaches are proposed by Reddyet al. (2011) and Kartsaklis et al. (2013) who showthat appropriate sense selection or disambigua-tion typically improves performance for compo-sition of noun phrases (Reddy et al., 2011) andverb phrases (Kartsaklis et al., 2013). Dinu andLapata (2010) proposed a model that representsthe meaning of a word as a probability distribu-tion over latent senses which is modulated basedon contextualisation, and report improved perfor-mance on a word similarity task and the lexi-cal substitution task. Other approaches leveragean existing lexical resource such as BabelNet orWordNet to obtain sense labels a priori to creat-ing word representations (Iacobacci et al., 2015),or as a postprocessing step after obtaining initialword representations (Chen et al., 2014; Pilehvarand Collier, 2016). While these approaches haveexhibited strong performance on benchmark wordsimilarity tasks (Huang et al., 2012; Iacobacci etal., 2015) and some downstream processing taskssuch as part-of-speech tagging and relation identi-fication (Li and Jurafsky, 2015), they have beenweaker than the single-vector representations atpredicting the compositionality of multi-word ex-pressions (Salehi et al., 2015), and at tasks whichrequire the meaning of a word to be consideredin context; e.g, word sense disambiguation (Ia-cobacci et al., 2016) and word similarity in con-text (Iacobacci et al., 2015).

In this paper we consider what happens whendistributional representations are composed to

79

form representations for larger units of meaning.In a compositional phrase, the meaning of thewhole can be inferred from the meaning of itsparts. Thus, assuming compositionality, the rep-resentation of a phrase such as black mood, shouldbe directly inferable from the representations forblack and for mood. Further, one might supposethat composing the correct senses of the individ-ual lexemes would result in a more accurate repre-sentation of that phrase. However, our counter-hypothesis is that the act of composition con-textualises or disambiguates each of the lexemesthereby making the representations of individualsenses redundant. We investigate this hypothe-sis by evaluating the performance of single-vectorrepresentations and multi-sense representations atboth a benchmark phrase similarity task and at anovel word-sense discrimination task.

Our contributions in this work are thus as fol-lows. First, we provide quantitative and qualita-tive evidence that even simple composition func-tions have the ability to recover sense-specific in-formation from a single-vector representation ofa polysemous lexeme in context. Second, we in-troduce a novel word-sense discrimination task1,which can be seen as the first stage of word-sensedisambiguation. The goal is to find whether theoccurrences of a lexeme in two or more senten-tial contexts belong to the same sense or not, with-out necessarily labelling the senses. While it hasreceived relatively little attention in recent years,it is an important natural language understandingproblem and can provide important insights intothe process of semantic composition.

2 Evaluating Distributional Models ofComposition

For evaluation we use several readily availableoff-the-shelf word embeddings, that have al-ready been shown to work well for a numberof different NLP applications. We compare the300-dimensional skip-gram word2vec (Mikolovet al., 2013) word embeddings2 to the depen-dency based version of word2vec — henceforthdep2vec3 (Levy and Goldberg, 2014) — and the

1Our task is available from https://github.com/tttthomasssss/sense2017

2Available from: https://code.google.com/p/word2vec/

3Available from: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

SENSEMBED model4 by Iacobacci et al. (2015),which creates word-sense embeddings by per-forming word-sense disambiguation prior to run-ning word2vec.

We note that word2vec and dep2vec usea single vector per word approach and thereforeconflate the different senses of a polysemous lex-eme. On the other hand, SENSEMBED utilises Ba-belfy (Moro et al., 2014) as an external knowledgesource to perform word-sense disambiguation andsubsequently creates one vector representation perword sense.

For composition we use pointwise addition forall models as this has been shown to be a strongbaseline in a number of studies (Hashimoto etal., 2014; Hill et al., 2016). We also experi-mented with pointwise multiplication as compo-sition function but, similar to Hill et al. (2016),found its performance to be very poor (results notreported). We model any out-of-vocabulary itemsas a vector consisting of all zeros and determineproximity of composed meaning representationsin terms of cosine similarity. We lowercase andlemmatise the data in our task but do not performnumber or date normalisation, or removal of rarewords.

3 Phrase Similarity

Our first evaluation task is the benchmark phrasesimilarity task of Mitchell and Lapata (2010). Thisdataset consists of 108 adjective-noun (AN), 108noun-noun (NN) and 108 verb-object (VO) pairs.The task is to compare a compositional model’ssimilarity estimates with human judgements bycomputing Spearman’s ρ. An average ρ of 0.47-0.48 represents the current state-of-the-art perfor-mance on this task (Hashimoto et al., 2014; Koberet al., 2016; Wieting et al., 2015).

For single-sense representations, the strategyfor carrying out this task is simple. For eachphrase in each pair, we compose the constituentrepresentations and then compute the similarityof each pair of phrases using the cosine similar-ity. For multi-sense representations, we adaptedthe strategy which has been used successfully invarious word similarity experiments (Huang et al.,2012; Iacobacci et al., 2015). Typically, for eachword pair, all pairs of senses are considered andthe similarity of the word pair is considered to be

4Available from: http://lcl.uniroma1.it/sensembed/

80

the similarity of the closest pair of senses. Thefact that this strategy works well suggests thatwhen humans are asked to judge word similar-ity, the pairing automatically primes them to se-lect the closest senses. Extending this to phrasesimilarity requires us to compose each possiblepair of senses for each phrase and then selectthe sense configuration which results in maximalphrase similarity. For comparison, we also giveresults for the configuration which results in mini-mal phrase similarity and the arithmetic mean5 ofall sense configurations.

3.1 Results

Model AN NN VO Averageword2vec 0.47 0.46 0.45 0.46dep2vec 0.48 0.46 0.45 0.46SENSEMBED:max 0.39 0.39 0.32 0.37SENSEMBED:min 0.24 0.12 0.22 0.19SENSEMBED:mean 0.46 0.35 0.37 0.39

Table 1Results for the Mitchell and Laptata (2010) dataset.

Table 1 shows that the simple strategy of addinghigh quality single-vector representations is verycompetitive with the state-of-the-art for this task.None of the strategies for selecting a sense config-uration for the multi-sense representations couldcompete with the single sense representations onthis task. One possible explanation is that the com-monly adopted closest sense strategy is not effec-tive for composition since the composition of in-correct senses may lead to spuriously high similar-ities (for two “implausible” sense configurations).

Table 2 lists a number of example phrase pairswith low average human similarity scores in theMitchell and Lapata (2010) test set. The resultsshow the tendency of the closest sense strategywith SENSEMBED (SE) to overestimate the sim-ilarity of dissimilar phrase pairs. For a compari-son we manually labelled the lexemes in the sam-ple phrases with the appropriate BabelNet sensesprior to composition (SE*). Human (H) similar-ity scores are normalised and averaged for an eas-ier comparison, model estimates represent cosinesimilarities.

4 Word Sense Discrimination

Word-sense discrimination can be seen as the firststage of word-sense disambiguation, where the

5We also tried the geometric mean and the median butthese performed comparably with the arithmetic mean.

Phrase 1 Phrase 2 SE SE* Hbuy land leave house 0.49 0.28 0.26close eye stretch arm 0.40 0.31 0.25wave hand leave company 0.42 0.08 0.20drink water use test 0.29 0.04 0.18european state present position 0.28 -0.03 0.19high point particular case 0.41 0.10 0.21

Table 2Tendency of SENSEMBED (SE) to overestimate the similarityon phrase pairs with low average human similarity when theclosest sense strategy is used.

goal is to find whether two or more occurrences ofthe same lexeme express identical senses, withoutnecessarily labelling the senses yet. It has receivedrelatively little attention despite its potential forproviding important insights into semantic com-position, focusing in particular on to the abilityof compositional distributional semantic models toappropriately contextualise a polysemous lexeme.

Work on word-sense discrimination has suf-fered from the absence of a benchmark task as wellas a clear evaluation methodology. For exampleSchutze (1998) evaluated his model on a datasetconsisting of 20 polysemous words (10 naturallyambiguous lexemes and 10 artificially ambiguous“pseudo-lexemes”) in terms of accuracy for coarsegrained sense distinctions, and an information re-trieval task. Pantel and Lin (2002), and Van deCruys (2008) used automatically extracted wordsfrom various newswire sources and evaluated theoutput of their models in comparison to WordNetand EuroWordNet, respectively. Purandare andPedersen (2004) used a subset of the words fromthe SENSEVAL-2 task and evaluated their modelsin terms of precision, recall and F1-score of howwell available sense tags match with clusters dis-covered by their algorithms. Akkaya et al. (2012)used the concatenation of the SENSEVAL-2 andSENSEVAL-3 tasks and evaluated their models interms of cluster purity and accuracy. Finally,Moen et al. (2013) used the semantic textual sim-ilarity (STS) 2012 task, which is based on humanjudgements of the similarity between two sen-tences.

One contribution of our work is a novel word-sense discrimination task, evaluated on a numberof robust baselines in order to facilitate future re-search in that area. In particular, our task offers atestbed for assessing the contextualisation abilityof compositional distributional semantic models.The goal is, for a given polysemous lexeme in con-text, to identify the sentence from a list of options

81

that is expressing the same sense of that lexemeas the given target sentence. These two sentences— the target and the “correct answer” — can ex-hibit any degree of semantic similarity as long asthey convey the same sense of the target lexeme.Table 3 shows an example of the polysemous ad-jective black in our task. The goal of any modelwould be to determine that the expressed sense ofblack in the sentence She was going to set himfree from all of the evil and black hatred he hadbrought to the world is identical to the expressedsense of black in the target sentence Or should theyrebut the Democrats’ black smear campaign withthe evidence at hand.

Our task assesses the ability of a model to dis-criminate a particular sense in a sentential contextfrom any other senses and thus provides an excel-lent testbed for evaluating multi-sense word vec-tor models as well as compositional distributionalsemantic models. By composing the representa-tion of a target lexeme with its surrounding con-text, it should be possible to determine its sense.For example, composing black smear campaignshould lead to a compositional representation thatis closer to the composed representation of blackhatred than to black mood, black sense of humouror black coffee. This essentially uses the similarityof the compositional representation of a lexeme’scontext to determine its sense. Similar approachesto word-sense disambiguation have already beensuccessfully used in past works (Akkaya et al.,2012; Basile et al., 2014).

4.1 Task Construction

For the construction of our dataset we made use ofdata from two english dictionaries (Oxford Dictio-nary and Collins Dictionary), accessible via theirrespective web APIs6, as well as examples fromthe sense annotated corpus SemCor (Miller et al.,1993). Our use of dictionary data is motivated bya number of favourable properties which make it avery suitable data source for our proposed task:

• The content is of very high-quality and cu-rated by expert lexicographers.

• All example sentences are carefully crafted inorder to unambiguously illustrate the usage

6https://developer.oxforddictionaries.com for the Oxford Dictionary,https://www.collinsdictionary.com/api/ forthe Collins Dictionary. We use NLTK 3.2 to access SemCor.

of a particular sense for a given polysemouslexeme.

• The granularity of the sense inventory reflectscommon language use7.

• The example sentences are typically free ofany domain bias wherever possible.

• The data is easily accessible via a web API.

By using the data from curated resources we wereable to avoid a setup as a sentence similarity taskand any potentially noisy crowd-sourced humansimilarity judgements.

We were furthermore able to collect data fromvarying frequency bands, enabling an assessmentof the impact of frequency on any model. Fig-ure 1 shows the number of target lexemes per fre-quency band. While the majority of lexemes, withreference to a cleaned October 2013 Wikipediadump8, is in the middle band, there is a consid-erable amount of less frequent lexemes. The mostfrequent target lexeme in our task is the verb bewith ≈1.8m occurrences in Wikipedia, and theleast frequent lexeme is the verb ruffle with only57 occurrences. The average target lexeme fre-quency is ≈95k for adjectives, and ≈45k−46k fornouns and verbs9.

Figure 1: Binned frequency distribution of the polysemoustarget lexemes in our task.

7The Oxford dictionary lists 5 different senses for thenoun “bank”, whereas WordNet 3.0 lists 10 synsets, for ex-ample distinguishing “bank” as the concept for a financialinstitution and “bank” as a reference to the building wherefinancial transactions take place.

8We removed any articles with fewer than 20 page views.9The overall number of unique word types is smaller than

the number of examples in our task as there are a number oflexemes that can occur with more than one part-of-speech.

82

Sense Definition SentenceTarget full of anger or hatred Or should they rebut the Democrats’ black smear campaign with

the evidence at hand?Option 1 full of anger or hatred She was going to set him free from all of the evil and black hatred

he had brought to the world.Option 2 (of a person’s state of mind) I’ve been in a black mood since September 2001, it’s hanging over

full of gloom or misery; very depressed me like a penumbra.Option 3 (of humour) presenting tragic or harrowing Over the years I have come to believe that fate either hates me, or

situations in comic terms has one hell of a black sense of humour.Option 4 (of coffee or tea) served without milk The young man was reading a paperback novel and sipping a

steaming mug of hot, black coffee.

Table 3: Example of the polysemous adjective black in our task. The goal for any model is to predict option 1 as expressing thesame sense of black as the target sentence.

4.2 Task Setup DetailsWe collected data for 3 different parts-of-speech:adjectives, nouns and verbs. We furthermore cre-ated task setups with varying numbers of sensesto distinguish (2-5 senses) for a given target lex-eme. This is to evaluate how well a model is ableto discriminate different degrees of polysemy ofany lexeme. For any task setup evaluating for nsenses, we included all lexemes with > n sensesand randomly sampled n senses from its inventory.For each lexeme, we furthermore ensured that ithad at least 2 example sentences per sense. Forthe available senses of any given lexeme, we ran-domly chose a sense as the target sense, and fromits list of example sentences randomly sampled 2sentences, one as the target example and one asthe “correct answer” for the list of candidate sen-tences. Finally we once again randomly sampledthe required number of other senses and examplesentences to complete the task setup. Using ran-dom sampling of word senses and targets aims toavoid a predominant sense bias.

For each part-of-speech we created a develop-ment split for parameter tuning and a test split forthe final evaluation. Table 4 shows the number ofexamples for each setup variant of our task. Thebiggest category are polysemous nouns, represent-ing roughly half of the data, followed by verbsrepresenting another third, and the smallest cate-gory are adjectives taking up the remaining≈17%.We measure performance in terms of accuracy of

2 senses 3 senses 4 senses 5 sensesAdjective 66/209 47/170 37/137 28/115Noun 170/618 125/499 100/412 74/345Verb 127/438 71/354 72/295 56/256Total 363/1265 263/1023 209/844 164/716

Table 4: Number of examples per part-of-speech and numberof senses (#dev examples/#train examples).

correctly predicting which two sentences share the

same sense of a given target lexeme. Accuracy hasthe advantage of being much easier to interpret —in absolute terms as well as in the relative differ-ence between models — in comparison to othercommonly used evaluation metrics such as clus-ter purity measures or correlation metrics such asSpearman ρ and Pearson r.

4.3 Experimental Setup

In this paper we compare the compositional mod-els outlined earlier with two baselines, a randombaseline and a word-overlap baseline of the ex-tracted contexts. For the single-vector represen-tations, we composed the target lexeme with all ofthe words in the context window and compared itwith the equivalent representation of each of theoptions (lexeme plus context words). The optionwith the highest cosine similarity was deemed tobe the selected sense. For SENSEMBED, we com-posed all sense vectors of a target lexeme with thegiven context and then used the closest sense strat-egy (Iacobacci et al., 2015) on composed represen-tations to choose the predicted sense10. The word-overlap baseline is simply the number of words incommon between the context window for the tar-get and each of the options.

We experimented with symmetric linear bag-of-words contexts of size 1, 2 and 4 around thetarget lexeme. We also experimented with de-pendency contexts, where first-order dependencycontexts performed almost identical to using a 2-word bag-of-words context window (results notreported). We excluded stop words prior to ex-tracting the context window in order to maximisethe number of content words. We break ties forany of the methods — including the baselines —by randomly picking one of the options with the

10We also tried an all-by-all senses composition, howeverfound this to be computationally not tractable.

83

highest similarity to the composed representationof the target lexeme with its context. Statisticalsignificance between the best performing modeland the word overlap baseline is computed by us-ing a randomised pairwise permutation test (Efronand Tibshirani, 1994).

4.4 Results

Table 5 shows the results for all context windowsizes across all parts-of-speech and number ofsenses. All models substantially outperform therandom baseline for any number of senses. In-terestingly the word overlap baseline is compet-itive for all context window sizes. While it isa very simple method, it has already been foundto be a strong baseline for paraphrase detectionand semantic textual similarity (Dinu and Thater,2012). One possible explanation for its robustperformance on our task is an occurrence of theone-sense-per-collocation hypothesis (Yarowsky,1993). The performance of all other modelsis roughly in the same ballpark for all parts-of-speech and number of senses, suggesting that theyform robust baselines for future models. Whilethe results are relatively mixed for adjectives,word2vec appears to be the strongest model forpolysemous nouns and verbs.

The perhaps most interesting observation in Ta-ble 5 is that word2vec and dep2vec are per-forming as well or better than SENSEMBED de-spite the fact that the former conflate the senses ofa polysemous lexeme in a single vector represen-tation. Figure 2 shows the average performance ofall models across parts-of-speech per number ofsenses and for all context window sizes.

SENSEMBED and Babelfy

One possible explanation for SENSEMBED notoutperforming the other methods despite itscleaner encoding of different word senses in theabove experiments is that at train time, it had ac-cess to sense labels from Babelfy. At test timeon our task however, it did not have any senselabels available. We therefore sense tagged the5-sense noun subtask with Babelfy and re-ranSENSEMBED. As Table 6 shows, access to senselabels at test time did not give a substantive perfor-mance boost, representing further support for ourhypothesis that composition in single-sense vectormodels might be sufficient to recover sense spe-cific information.

Frequency Range

We chose the 2-sense noun subtask to estimate thedegree sensitivity of target lexeme frequency onour task we merged the [1, 1k) and [1k, 10k), and[50k, 100k) and [100k,∞) frequency bands fromFigure 1, and sampled an equal number of targetwords from each band. Table 7 reports the resultsfor this experiment. All methods outperform therandom and word overlap baseline and appear tobe working better for less frequent lexemes. Onepossible explanation for this behaviour is that lessfrequent lexemes have fewer senses and poten-tially less subtle sense differences than more fre-quent lexemes, which would make them easier todiscriminate by distributional semantic methods.

5 Discussion

Our results suggest that pointwise addition in asingle-sense vector model such as word2vec isable to discriminate the sense of a polysemouslexeme in context in a surprisingly effective wayand represents a strong baseline for future work.Distributional composition can therefore be inter-preted as a process of contextualising the mean-ing of a lexeme. This way, composition does notonly act as a way to represent the meaning of aphrase as a whole, but also as a local discrimina-tor for any lexemes in the phrase. For examplethe composed representation of dry clothes shouldonly keep contexts that dry shares with clotheswhile suppressing contexts it shares with weatheror wine. Hence, one would expect the same to hap-pen with a polysemous lexeme such as bank in thecontext of river and account, respectively.

Recent work by Arora et al. (2016) has shownthat the different senses of a polysemous lexemereside in a linear substructure within a single vec-tor and are recoverable by sparse coding. There isfurthermore evidence that additive composition inlow-dimensional word embeddings approximatesan intersection of the contexts of two distributionalword vectors (Tian et al., 2015). It therefore seemsplausible that an intersective composition functionshould be able to recover sense specific informa-tion.

To qualitatively analyse this hypothesis weused the word2vec and SENSEMBED vectorsto compose a small number of example phrasesby pointwise addition and calculated their top 5nearest neighbours in terms of cosine similarity.For SENSEMBED we manually sense tagged the

84

Symmetric context window of size 1Adjective Noun Verb

Senses 2 3 4 5 2 3 4 5 2 3 4 5Random 0.53 0.32 0.25 0.14 0.47 0.32 0.23 0.19 0.47 0.31 0.23 0.18Word Overlap 0.63 0.46 0.47 0.40 0.55 0.40 0.37 0.34 0.54 0.44 0.38 0.29

word2vec 0.70 0.56 0.61† 0.54† 0.66‡ 0.52‡ 0.50‡ 0.44‡ 0.63‡ 0.56‡ 0.52‡ 0.43‡

dep2vec 0.65 0.64‡ 0.57 0.57‡ 0.64‡ 0.50‡ 0.49‡ 0.48‡ 0.63‡ 0.55‡ 0.50‡ 0.43‡

SENSEMBED 0.67 0.54 0.56 0.56† 0.64‡ 0.49‡ 0.50‡ 0.43‡ 0.62† 0.53‡ 0.49‡ 0.38†



word2vec 0.70 0.64† 0.58 0.55 0.71‡ 0.63‡ 0.59‡ 0.54‡ 0.68‡ 0.64‡ 0.58‡ 0.49‡

dep2vec 0.71 0.65‡ 0.58 0.57‡ 0.70‡ 0.57‡ 0.55‡ 0.55‡ 0.66† 0.64‡ 0.54† 0.46†

SENSEMBED 0.72‡ 0.62 0.61 0.52 0.69‡ 0.56† 0.57‡ 0.51† 0.67‡ 0.65† 0.57 0.45



word2vec 0.71 0.65† 0.65 0.57 0.73‡ 0.61‡ 0.62‡ 0.57‡ 0.71‡ 0.62† 0.57 0.53‡

dep2vec 0.72 0.66† 0.60 0.54 0.71‡ 0.55 0.56† 0.53† 0.67 0.62 0.54 0.50SENSEMBED 0.75 0.59 0.62 0.55 0.69‡ 0.57† 0.58‡ 0.53† 0.68‡ 0.62† 0.55 0.47

Table 5Performance overview for all parts-of-speech and number of senses, ‡ statistically significant at the p < 0.01 level in com-parison to the Word Overlap baseline; † statistically significant at the p < 0.05 level in comparison to the Word Overlapbaseline.

Figure 2: Average performance across parts-of-speech per number of senses and context window.

Noun - 5 SensesContext Window Size 1 2 4word2vec 0.44 0.54 0.57dep2vec 0.48 0.55 0.53SENSEMBED 0.43 0.51 0.53SENSEMBED & Babelfy 0.45 0.49 0.54

Table 6Results on the 5-sense noun subtask with SENSEMBED hav-ing access to Babelfy sense labels at test time.

phrases with the appropriate BabelNet sense la-bels prior to composition. We omitted the Babel-Net sense labels in the neighbour list for brevity,

Noun - 2 Senses, context window size = 2Frequency Band < 10k 10k – 50k ≥ 50kRandom 0.51 0.51 0.51Word Overlap 0.66 0.60 0.56word2vec 0.81 0.64 0.66dep2vec 0.77 0.67 0.66SENSEMBED 0.74 0.68 0.60

Table 7Results on a subsample of the 2-sense noun subtask acrossfrequency bands.

however they were consistent with the intendedsense in all cases. Table 8 supports the view ofcomposition as a way of contextualising the mean-

85

ing of a lexeme. In all cases in our example theword2vec neighbours reflect the intended senseof the polysemous lexeme, providing evidence forthe linear substructure of word senses in a singlevector as discovered by Arora et al. (2016), andsuggesting that distributional composition is ableto recover sense specific information from a poly-semous lexeme. The very fine-grained sense-levelvector space of SENSEMBED is giving rise to avery focused neighbourhood, however there doesnot seem to be any advantage over word2vecfrom a qualitative point of view when using simpleadditive composition.

6 Related Work

The perhaps most popular tasks for evaluating theability of a model to capture or encode the differ-ent senses of a polysemous lexeme in a given con-text are the english lexical substitution task (Mc-Carthy and Navigli, 2007) and the Microsoft sen-tence completion challenge (Zweig and Burges,2011). Both tasks require any model to fill an ap-propriate word into a pre-defined slot in a givensentential context. The sentence completion chal-lenge provides a list of candidate words while theenglish lexical substitution task does not. How-ever, neither task focuses on polysemy and the en-glish lexical substitution task conflates the prob-lems of discriminating word senses and findingmeaning preserving substitutes.

Dictionary definitions have previously beenused to evaluate compositional distributional se-mantic models where the goal is to match a dictio-nary entry with its corresponding definition (Kart-saklis et al., 2012; Polajnar and Clark, 2014).These datasets are commonly set up as retrievaltasks, but generally do not test the ability of amodel to disambiguate a polysemous word in con-text, or discriminate multiple definitions of thesame word.

Our task also provides a novel evaluationfor compositional distributional semantic models,where the predominant strategy is to estimate thesimilarity of two short phrases (Bernardi et al.,2013; Grefenstette and Sadrzadeh, 2011; Kart-saklis and Sadrzadeh, 2014; Mitchell and Lap-ata, 2008; Mitchell and Lapata, 2010) or sen-tences (Agirre et al., 2016; Huang et al., 2012;Marelli et al., 2014) in comparison to human pro-vided gold-standard judgements. One problemwith these similarity tasks is that the similarity

or relatedness of two sentences is very difficultto judge — especially on a fine-grained scale —even for humans. This frequently results in a rel-atively high variance of judgements and low inter-annotator agreement (Batchkarov et al., 2016).The short phrase datasets typically have a fixedstructure that only test a very small fraction of thepossible grammatical constructions in which a lex-eme can occur, and furthermore provide very littlecontext. The use of full sentences remedies thelack of context and grammatical variation, how-ever can still contain a significant level of noisedue to the automatic construction of the dataset orthe variance in human ratings. In contrast, ourtask is not set up as a sentence similarity taskand therefore avoids the use of human similarityjudgements.

Our task is similar to word-sense induction(WSI), however we only focus on discriminatingthe sense of a polysemous lexeme in context ratherthan inducing a set of senses from raw data andappropriately tagging subsequent occurrences ofpolysemous instances with the inferred inventory.Separating the sense discrimination task from theproblem of sense induction has the advantage ofmaking our task applicable to evaluating composi-tional distributional semantic models in order totest their ability to appropriately contextualise apolysemous lexeme. Due to not requiring anymodels to perform an extra step for sense induc-tion, our task is easier to evaluate as no matchingbetween sense clusters identified by a model andsome gold standard sense classes needs to be per-formed, as typically proposed in the WSI litera-ture (Agirre and Soroa, 2007; Manandhar et al.,2010).

Most closely related to our task are the Stan-ford Contextual Word Similarity (SCWS) datasetby Huang et al. (2012) and the Usage Similar-ity (USim) task by Erk et al. (2009). The goalin both tasks is to estimate the similarity of twopolysemous words in context in comparison tohuman provided gold standard judgements. Inthe SCWS dataset typically two different lexemesare considered whereas in USim and our task thesame lexemes with different contexts are com-pared. Instead of relying on crowd-sourced hu-man gold-standard similarity judgements, whichcan be prone to a considerable amount of noise11,

11For example the average standard deviation of humanratings in the SCWS dataset is ≈3 on a 10-point scale, and

86

Phrase word2vec neighbours SENSEMBED neighboursriver bank bank, river, creek, lake, rivers bank, river, stream, creek, river basinbank account account, bank, accounts, banks, citibank bank, banks, the bank, pko bank polski, handlowydry weather weather, dry, wet weather, wet, unreasonably warm dry, weather, humid, cold, cooldry clothes dry, clothes, clothing, rinse thoroughly, wet dry, clothes, warm, cold, wetcapital city capital, city, cities, downtown, town city, capital, the capital city, town, provincial capitalcapital asset capital, asset, assets, investment, worth capital, asset, investment, assets, investorpower plant plant, power, plants, coalfired, megawatt power, plant, near-limitless, pulse-power, power of the windgarden plant plant, garden, plants, gardens, vegetable garden plant, garden, plants, oakville assembly, solanaceouswindow bar bar, window, windows, doorway, door window, bar, windows, glass window, wallsandwich bar bar, sandwich, restaurant, burger, diner sandwich, bar, restaurant, hot dog, cakegasoline tank gasoline, tank, fuel, gallon, tanks gasoline, tank, fuel, petrol, kerosenearmored tank armored, tank, tanks, M1A1 Abrams, armored vehicle armored, armoured, tank, tanks, light tankdesert rock rock, desert, rocks, desolate expanse, arid desert desert rock, the desert, deserts, badlandsrock band rock, band, rockers, bands, indie rock band, rock, group, the band, rock group

Table 8Nearest neighbours of composed phrases for word2vec and SENSEMBED. Distributional composition in word2vec is ableto recover sense specific information remarkably well. Some neighbours are phrases because they have been encoded as asingle token in the original vector space.

we leverage the high-quality content of availableenglish dictionaries. Furthermore, our task is notformulated as estimating the similarity betweentwo lexemes in context, but identifying the sen-tences that use the same sense of a given polyse-mous lexeme.

7 Conclusion

While elementary multi-sense representations ofwords might capture a more fine grained semanticpicture of a polysemous word, that advantage doesnot appear to transfer to distributional composi-tion in a straightforward way. Our experimentson a popular phrase similarity benchmark and ournovel word-sense discrimination task have demon-strated that semantic composition does not appearto benefit from a fine grained sense inventory, butthat the ability to contextualise a polysemous lex-eme in single-sense vector models is sufficient forsuperior performance. We furthermore have pro-vided qualitative and quantitative evidence that anintersective composition function such as point-wise addition for neural word embeddings is ableto discriminate the meaning of a word in context,and is able to recover sense specific informationremarkably well.

Lastly, our experiments have uncovered an im-portant question for multi-sense vector models,namely how to exploit the fine-grained senselevel representations for distributional composi-tion. Our novel word-sense discrimination taskprovides an excellent testbed for compositionaldistributional semantic models, both followinga single-sense or multi-sense vector modelling

can be up to 4–5 in some cases.

paradigm, due to its focus on assessing the abil-ity of a model to appropriately contextualise themeaning of a word. Our task furthermore pro-vides another evaluation option away from intrin-sic evaluations which are based on often noisy hu-man similarity judgements, while also not beingembedded in a downstream task.

In future work we aim to extend our eval-uation to more complex compositional distribu-tional semantic models such as the lexical func-tion model (Paperno et al., 2014) or the AnchoredPacked Dependency Tree framework (Weir et al.,2016). We furthermore want to investigate howfar the sense-discriminating ability of compositioncan be leveraged for other tasks.

Acknowledgments

We would like to thank our anonymous reviewersfor their helpful comments.

ReferencesEneko Agirre and Aitor Soroa. 2007. Semeval-2007

task 02: Evaluating word sense induction and dis-crimination systems. In Proceedings of the FourthInternational Workshop on Semantic Evaluations(SemEval-2007), pages 7–12, Prague, Czech Repub-lic, June. Association for Computational Linguis-tics.

Eneko Agirre, Carmen Banea, Daniel Cer, MonaDiab, Aitor Gonzalez-Agirre, Rada Mihalcea, Ger-man Rigau, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, mono-lingual and cross-lingual evaluation. In Proceed-ings of the 10th International Workshop on Seman-tic Evaluation (SemEval-2016), pages 497–511, SanDiego, California, June. Association for Computa-tional Linguistics.

87

Cem Akkaya, Janyce Wiebe, and Rada Mihalcea.2012. Utilizing semantic composition in distribu-tional semantic models for word sense discrimina-tion and word sense disambiguation. In Proceedingsof ICSC, pages 45–51.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma,and Andrej Risteski. 2016. Linear algebraic struc-ture of word senses, with applications to polysemy.CoRR, abs/1601.03764.

Pierpaolo Basile, Annalina Caputo, and Giovanni Se-meraro. 2014. An enhanced lesk word sense dis-ambiguation algorithm through a distributional se-mantic model. In Proceedings of COLING 2014,the 25th International Conference on ComputationalLinguistics: Technical Papers, pages 1591–1600,Dublin, Ireland, August. Dublin City University andAssociation for Computational Linguistics.

Miroslav Batchkarov, Thomas Kober, Jeremy Reffin,Julie Weeds, and David Weir. 2016. A critique ofword similarity as a method of evaluating distribu-tional semantic models. In Proceedings of the 1stWorkshop on Evaluating Vector-Space Representa-tions for NLP, pages 7–12. Association for Compu-tational Linguistics.

Raffaella Bernardi, Georgiana Dinu, Marco Marelli,and Marco Baroni. 2013. A relatedness benchmarkto test the role of determiners in compositional dis-tributional semantics. In Proceedings of the 51st An-nual Meeting of the Association for ComputationalLinguistics (Volume 2: Short Papers), pages 53–57,Sofia, Bulgaria, August. Association for Computa-tional Linguistics.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun.2014. A unified model for word sense represen-tation and disambiguation. In Proceedings of the2014 Conference on Empirical Methods in Natu-ral Language Processing (EMNLP), pages 1025–1035, Doha, Qatar, October. Association for Com-putational Linguistics.

Georgiana Dinu and Mirella Lapata. 2010. Measur-ing distributional similarity in context. In Proceed-ings of the 2010 Conference on Empirical Methodsin Natural Language Processing, pages 1162–1172,Cambridge, MA, October. Association for Compu-tational Linguistics.

Georgiana Dinu and Stefan Thater. 2012. Saarland:Vector-based models of semantic textual similarity.In *SEM 2012: The First Joint Conference on Lexi-cal and Computational Semantics – Volume 1: Pro-ceedings of the main conference and the shared task,and Volume 2: Proceedings of the Sixth Interna-tional Workshop on Semantic Evaluation (SemEval2012), pages 603–607, Montreal, Canada, 7-8 June.Association for Computational Linguistics.

Bradley Efron and Robert Tibshirani. 1994. An Intro-duction to the Bootstrap. CRC press.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord.2009. Investigations on word senses and word us-ages. In Proceedings of the Joint Conference of the47th Annual Meeting of the ACL and the 4th In-ternational Joint Conference on Natural LanguageProcessing of the AFNLP, pages 10–18, Suntec, Sin-gapore, August. Association for Computational Lin-guistics.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011.Experimental support for a categorical composi-tional distributional model of meaning. In Proceed-ings of the 2011 Conference on Empirical Meth-ods in Natural Language Processing, pages 1394–1404, Edinburgh, Scotland, UK., July. Associationfor Computational Linguistics.

Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa,and Yoshimasa Tsuruoka. 2014. Jointly learn-ing word representations and composition functionsusing predicate-argument structures. In Proceed-ings of the 2014 Conference on Empirical Methodsin Natural Language Processing (EMNLP), pages1544–1555, Doha, Qatar, October. Association forComputational Linguistics.

Felix Hill, KyungHyun Cho, Anna Korhonen, andYoshua Bengio. 2016. Learning to understandphrases by embedding the dictionary. Transactionsof the Association for Computational Linguistics,4:17–30.

Eric Huang, Richard Socher, Christopher Manning,and Andrew Ng. 2012. Improving word represen-tations via global context and multiple word proto-types. In Proceedings of the 50th Annual Meeting ofthe Association for Computational Linguistics (Vol-ume 1: Long Papers), pages 873–882, Jeju Island,Korea, July. Association for Computational Linguis-tics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, andRoberto Navigli. 2015. Sensembed: Learningsense embeddings for word and relational similarity.In Proceedings of the 53rd Annual Meeting of theAssociation for Computational Linguistics and the7th International Joint Conference on Natural Lan-guage Processing (Volume 1: Long Papers), pages95–105, Beijing, China, July. Association for Com-putational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, andRoberto Navigli. 2016. Embeddings for word sensedisambiguation: An evaluation study. In Proceed-ings of the 54th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 897–907, Berlin, Germany, August. As-sociation for Computational Linguistics.

Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. Astudy of entanglement in a categorical framework ofnatural language. In Proceedings of the 11th Work-shop on Quantum Physics and Logic (QPL).

Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and StephenPulman. 2012. A unified sentence space for

88

categorical distributional-compositional semantics:Theory and experiments. In Proceedings of COL-ING 2012: Posters, pages 549–558, Mumbai, India,December. The COLING 2012 Organizing Commit-tee.

Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and StephenPulman. 2013. Separating disambiguation fromcomposition in distributional semantics. In Pro-ceedings of the Seventeenth Conference on Com-putational Natural Language Learning, pages 114–123, Sofia, Bulgaria, August. Association for Com-putational Linguistics.

Thomas Kober, Julie Weeds, Jeremy Reffin, and DavidWeir. 2016. Improving sparse word representationswith distributional inference for semantic composi-tion. In Proceedings of the 2016 Conference on Em-pirical Methods in Natural Language Processing,pages 1691–1702, Austin, Texas, November. Asso-ciation for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52ndAnnual Meeting of the Association for Computa-tional Linguistics (Volume 2: Short Papers), pages302–308, Baltimore, Maryland, June. Associationfor Computational Linguistics.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense em-beddings improve natural language understanding?In Proceedings of the 2015 Conference on Empiri-cal Methods in Natural Language Processing, pages1722–1732, Lisbon, Portugal, September. Associa-tion for Computational Linguistics.

Suresh Manandhar, Ioannis Klapaftis, Dmitriy Dligach,and Sameer Pradhan. 2010. Semeval-2010 task 14:Word sense induction & disambiguation. In Pro-ceedings of the 5th International Workshop on Se-mantic Evaluation, pages 63–68, Uppsala, Sweden,July. Association for Computational Linguistics.

Marco Marelli, Stefano Menini, Marco Baroni, LuisaBentivogli, Raffaella bernardi, and Roberto Zam-parelli. 2014. A sick cure for the evaluation ofcompositional distributional semantic models. InNicoletta Calzolari, Khalid Choukri, Thierry De-clerck, Hrafn Loftsson, Bente Maegaard, JosephMariani, Asuncion Moreno, Jan Odijk, and SteliosPiperidis, editors, Proceedings of the Ninth Interna-tional Conference on Language Resources and Eval-uation (LREC’14), pages 216–223, Reykjavik, Ice-land, May. European Language Resources Associa-tion (ELRA). ACL Anthology Identifier: L14-1314.

Diana McCarthy and Roberto Navigli. 2007. Semeval-2007 task 10: English lexical substitution task. InProceedings of the Fourth International Workshopon Semantic Evaluations (SemEval-2007), pages48–53, Prague, Czech Republic, June. Associationfor Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013. Distributed representa-

tions of words and phrases and their composition-ality. In C.J.C. Burges, L. Bottou, M. Welling,Z. Ghahramani, and K.Q. Weinberger, editors, Ad-vances in Neural Information Processing Systems26, pages 3111–3119. Curran Associates, Inc.

George A. Miller, Claudia Leacock, Randee Tengi, andRoss T. Bunker. 1993. A semantic concordance. InProceedings of the Arpa Workshop on Human Lan-guage Technology, pages 303–308.

Jeff Mitchell and Mirella Lapata. 2008. Vector-basedmodels of semantic composition. In Proceedings ofthe 46th Annual Meeting of the Association for Com-putational Linguistics: Human Language Technol-ogy Conference, pages 236–244, Columbus, Ohio,June. Association for Computational Linguistics.

Jeff Mitchell and Mirella Lapata. 2010. Compositionin distributional models of semantics. Cognitive Sci-ence, 34(8):1388–1429.

Hans Moen, Erwin Marsi, and Bjorn Gamback. 2013.Towards dynamic word sense discrimination withrandom indexing. In Proceedings of the Workshopon Continuous Vector Space Models and their Com-positionality, pages 83–90, Sofia, Bulgaria, August.Association for Computational Linguistics.

Andrea Moro, Alessandro Raganato, and Roberto Nav-igli. 2014. Entity linking meets word sense disam-biguation: A unified approach. Transactions of theAssociation for Computational Linguistics, 2:231–244.

Patrick Pantel and Dekang Lin. 2002. Discoveringword senses from text. In Proceedings of the EighthACM SIGKDD International Conference on Knowl-edge Discovery and Data Mining, KDD ’02, pages613–619, New York, NY, USA. ACM.

Denis Paperno, Nghia The Pham, and Marco Baroni.2014. A practical and linguistically-motivated ap-proach to compositional distributional semantics. InProceedings of the 52nd Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long Papers), pages 90–99, Baltimore, Maryland,June. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Nigel Collier. 2016.De-conflated semantic representations. In Proceed-ings of the 2016 Conference on Empirical Methodsin Natural Language Processing, pages 1680–1690,Austin, Texas, November. Association for Computa-tional Linguistics.

Tamara Polajnar and Stephen Clark. 2014. Improv-ing distributional semantic vectors through contextselection and normalisation. In Proceedings of the14th Conference of the European Chapter of the As-sociation for Computational Linguistics, pages 230–238, Gothenburg, Sweden, April. Association forComputational Linguistics.

89

Amruta Purandare and Ted Pedersen. 2004. Wordsense discrimination by clustering contexts in vec-tor and similarity spaces. In Hwee Tou Ngand Ellen Riloff, editors, HLT-NAACL 2004 Work-shop: Eighth Conference on Computational Natu-ral Language Learning (CoNLL-2004), pages 41–48, Boston, Massachusetts, USA, May 6 - May 7.Association for Computational Linguistics.

Siva Reddy, Ioannis Klapaftis, Diana McCarthy, andSuresh Manandhar. 2011. Dynamic and static pro-totype vectors for semantic composition. In Pro-ceedings of 5th International Joint Conference onNatural Language Processing, pages 705–713, Chi-ang Mai, Thailand, November. Asian Federation ofNatural Language Processing.

Joseph Reisinger and Raymond J. Mooney. 2010.Multi-prototype vector-space models of word mean-ing. In Human Language Technologies: The 2010Annual Conference of the North American Chap-ter of the Association for Computational Linguistics,pages 109–117, Los Angeles, California, June. As-sociation for Computational Linguistics.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2015.A word embedding approach to predicting the com-positionality of multiword expressions. In Proceed-ings of the 2015 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics: Human Language Technologies, pages977–983, Denver, Colorado, May–June. Associationfor Computational Linguistics.

Hinrich Schutze. 1998. Automatic word sense dis-crimination. Computational Linguistics, 24(1):97–123, mar.

Ran Tian, Naoaki Okazaki, and Kentaro Inui. 2015.The mechanism of additive composition. CoRR,abs/1511.08407.

Tim Van de Cruys. 2008. Using three way data forword sense discrimination. In Proceedings of the22nd International Conference on ComputationalLinguistics (Coling 2008), pages 929–936, Manch-ester, UK, August. Coling 2008 Organizing Com-mittee.

David Weir, Julie Weeds, Jeremy Reffin, and ThomasKober. 2016. Aligning packed dependency trees: atheory of composition for distributional semantics.Computational Linguistics, special issue on FormalDistributional Semantics, 42(4):727–761, Decem-ber.

John Wieting, Mohit Bansal, Kevin Gimpel, and KarenLivescu. 2015. From paraphrase database to com-positional paraphrase model and back. Transactionsof the Association for Computational Linguistics,3:345–358.

David Yarowsky. 1993. One sense per collocation. InProceedings of the Workshop on Human LanguageTechnology, HLT ’93, pages 266–271, Stroudsburg,

PA, USA. Association for Computational Linguis-tics.

Geoffrey Zweig and J.C. Chris Burges. 2011. Themicrosoft research sentence completion challenge.Technical report, Microsoft Research.

90


Elucidating Conceptual Properties from Word Embeddings

Kyoung-Rok JangSchool of Computing

KAISTDaejeon, South Korea

[email protected]

Sung-Hyon MyaengSchool of Computing

KAISTDaejeon, South Korea

[email protected]

Abstract

In this paper, we introduce a method ofidentifying the components (i.e. dimen-sions) of word embeddings that stronglysignifies properties of a word. By eluci-dating such properties hidden in word em-beddings, we could make word embed-dings more interpretable, and also couldperform property-based meaning compar-ison. With the capability, we can answerquestions like “To what degree a givenword has the property cuteness?” or “Inwhat perspective two words are similar?”.We verify our method by examining howthe strength of property-signifying compo-nents correlates with the degree of proto-typicality of a target word.

1 Introduction

Modeling the meaning of words has long beenstudied and served as a basis for almost every kindof NLP tasks. Most recent word modeling tech-niques are based on neural networks, and the wordrepresentations produced by such techniques arecalled word embeddings, which are usually low-dimensional, dense vectors of continuous-valuedcomponents. Although word embeddings havebeen proved for their usefulness in many tasks, thequestion of what are represented in them is under-studied.

Recent studies report empirical evidence that in-dicates word embeddings may reflect some prop-erty information of a target word (Erk, 2016; Levyet al., 2015). Learning the properties of a wordwould be helpful because many NLP tasks can berelated to “finding words that possess similar prop-erties”, which include finding synonyms, namedentity recognition (NER). Without a method forexplicating what properties are contained in em-

beddings, however, researchers have mostly fo-cused on improving the performance in well-known semantic benchmark tasks (e.g. SimLex-999) as a way to find better embeddings.

Performing well in such benchmark tasks isvaluable but provides little help in understandingthe inside of the black box. For instance, it is notpossible to answer to questions like “To what de-gree a given word has the property cuteness?”.

One way to solve this problem is to elucidateproperties that are encoded in word embeddingsand associate them with task performances. Withthe capability, we can not only enhance our un-derstanding of word embeddings but also make iteasier to make comparisons among heterogeneousword embedding models in more coherent ways.Our immediate goal in this paper is to show thefeasibility of explicating properties contained inword embeddings.

Our research can be seen as an attempt to in-crease the interpretability of word embeddings. Itis in line with an attempt to provide a human-understandable explanation for complex machinelearning models, with which we can gain enoughconfidence to use them in decision-making pro-cesses.

There has been a line of work devoted to identi-fying components that are important for perform-ing various NLP tasks such as sentiment analysisor NER (Faruqui et al., 2015; Fyshe et al., 2015;Herbelot and Vecchi, 2015; Karpathy et al., 2016;Li et al., 2016a; Li et al., 2016b; Rothe et al.,2016). Those works are analogous to ours in thatthey try to inspect the role of the components inword embedding. However, they just attempt toidentify key features for specific tasks rather thanelucidating properties. In contrast, our question is“what comprises word embeddings?” not “whatcomponents are important for performing well ina specific task?”

91

2 Feasibility Study

2.1 BackgroundWord embeddings can be seen as representingconcepts of a word. As such, we attempt to designan property-related experiment around manipula-tion of concepts. In particular, we bring in the cat-egory theory (Murphy, 2004) where the notion ofcategory is defined to be “grouping concepts thatshare similar properties”. In other words, prop-erties have a direct bearing on concepts and theircategories, according to the theory.

On the other hand, researchers have argued thatsome concepts are more typical (or central) thanothers in a category (Rosch, 1973; Rosch, 1975).For instance, apple is more typical than olive inthe fruit category. The typicality is a graded phe-nomenon, and may rise due to the strength of ‘es-sential’ properties that make a concept a specificcategory.

The key ideas from the above are 1) conceptsof the same category share similar properties and2) some concepts that have strong essential prop-erties are considered more typical in specific cate-gory, and they guided our experiment design.

2.2 DesignThe goal of this study is to show the feasibil-ity of sifting property information from word em-beddings. We assume that a concept’s propertyinformation is captured and distributed over oneor more components (dimensions) of embeddingsduring the learning process. Since the conceptsthat belong to the same category are likely to sharesimilar properties, there should be some salientcomponents that are shared among them. We callsuch components as SIG-PROPS (for significantproperties) of a specific category.

In this feasibility study, we hypothesize that thestrength of SIG-PROPS is strongly correlated withthe degree of concept’s typicality. This is basedon the theory introduced in Section 2.1, that thetypicality phenomenon rises due to the strengthof essential properties a target concept possesses.So the concept that has (higher/lower) SIG-PROPS

values than others should be (more typical/lesstypical) than other concepts.

2.3 DatasetsFor our experiment dealing with typicality of con-cepts, we needed both (pre-trained) word embed-dings and a dataset that encodes typicality scores

of concepts to a set of categories. Below we de-scribe two datasets we used in our experiment:HyperLex and Non-Negative Sparse Embedding(NNSE).

2.3.1 Dataset: Non-Negative SparseEmbedding (NNSE)

One desirable quality we wanted from the wordembeddings to be used in our experiment is thatthere should be clear contrast between informa-tive and non-informative components. In ordinarydense word embeddings, usually every componentis filled with a non-zero value.

The Non-Negative Sparse Embedding (NNSE)(Murphy et al., 2012) fulfills the condition inthe sense that insignificant components are set tozero. The NNSE component values falling be-tween 0 and 1 (non-negative) are generated by ap-plying the non-negative sparse coding algorithm(Hoyer, 2002) to ordinary word embeddings (e.g.word2vec).

2.3.2 Dataset: HyperLexHyperLex is a dataset and evaluation resourcethat quantifies the extent of the semantic categorymembership (Vulic et al., 2016). A total of 2,616concept pairs are included in the dataset, and thestrength of category membership is given by na-tive English speakers and recorded in graded man-ner. This graded category membership can be in-terpreted as a ‘typicality score’ (1–10). Some sam-ples are shown in Table 1.

Concept Category Scorebasketball activity 10

spy agent 8handbag bowl 0

Table 1: A HyperLex sample. The score is theanswer to the question “To what degree is concepta type of category?”

3 Experiment

3.1 Preparation

We first prepared pre-trained NNSE embeddings.The authors released pre-trained model on theirwebsite1. We used the model trained with depen-dency context (‘Dependency model’ on the web-site), because as reported in (Levy and Goldberg,2014), models trained on dependency context tend

1http://www.cs.cmu.edu/ bmurphy/NNSE/

92

to prefer functional similarity (hogwarts — sun-nydale) rather than topical similarity (hogwarts —dumbledore)2. The embeddings are more sparsethan ordinary embeddings and have 300 compo-nents.

Next we fetched HyperLex dataset at the au-thor’s website3. To make the settings suitable toour experiment goal, we selected categories withthe following criteria:

1. The categories and instances must be con-crete nouns (e.g. food). This is becausepeople are more coherent in producing theproperties of concrete nouns (Murphy, 2004).So the embeddings of concrete nouns shouldcontain more clear property information thanother types of words.

2. The categories must contain enough numberof instances (not 1 or 2). This is to gain reli-able result.

3. Some categories are sub-category of anotherselected category while others are not related.This is to see the discriminative and overlap-ping effect of identified SIG-PROPS betweencategories. Related categories should share aset of strong SIG-PROPS, while unrelated cat-egories shouldn’t.

As a result, we selected five categories: food,fruit (sub-category of food), animal, bird (sub-category of animal), and instrument. We fetchedthe concepts that belong to the categories and thenfiltered out those that aren’t contained in the pre-trained NNSE embeddings. The final size of eachcategory is shown in Table 2.

Category # of Conceptsfood 54fruit 9

animal 46bird 16

instrument 14

Table 2: The size of selected categories

In the next section, we explain how we identi-fied SIG-PROPS of each category.

2We thought “sharing similar function” is more compati-ble with the notion of sharing similar properties. The topicalsimilarity is less indicative of having properties in common.

3http://people.ds.cam.ac.uk/iv250/hyperlex.html

3.2 Identification of SIG-PROPS

The goal of this step is to find SIG-PROPS thatmight represent each category. Simply put, SIG-PROPS are the components that have on averagehigh value compared to other components of theconcepts in the same category. We identified SIG-PROPS by 1) calculating an average value of eachcomponent across the concepts with the same cat-egory, then 2) choosing those components whoseaverage value is above h. We empirically set h to0.2.

Category SIG-PROPSComp. ID Avg.

instrument c88 0.806c258 0.769

animal c154 0.587c265 0.221

bird c154 0.550c265 0.213

food c207 0.298c233 0.269

fruit

c229 0.492c27 0.369c156 0.349c44 0.264c233 0.206

Table 3: SIG-PROPS of each category. The stringsin “Comp. ID” column are the component IDs(c1–c300). “Avg.” column indicates the averagevalue of the component across all the concepts un-der that category.

Table 3 shows the identified SIG-PROPS. Thenumber of SIG-PROPS is different across cate-gories. Interestingly, there is component overlapbetween taxonomically similar categories (‘c154’and ‘c265’ between animal–bird, ‘c233’ betweenfood–fruit), while there is none between unrelatedcategories (instrument–animal–food).

This initial observation is encouraging for ourfeasibility study in that indeed SIG-PROPS canplay a role of distinguishing or associating cate-gories. We argue that the identified SIG-PROPS

strongly characterize each category, showing thatwe can associate vector components with proper-ties.

3.3 Correlation between SIG-PROPS andconcepts typicality scores

In this section, we check how the strength of SIG-PROPS correlates with the typicality scores. Notethat the range of SIG-PROPS values differ from cat-egory to category — those for instrument are es-pecially high, which might indicate they are highly

93

representative.Our assumption is that if the identified SIG-

PROPS truly represent the essential quality of a cat-egory, the strength of SIG-PROPS should be pro-portional to concepts’ typicality scores (equation1).

Strength(SIG-PROPS)∝ Typicality(Concept)

(1)

We observe this phenomenon by calculatingPearson correlation between the strength of SIG-PROPS and typicality scores. For instance, supposewe calculate the correlation between ‘c88’ (88thcomponent) of instrument concepts and their typ-icality score. We inspect the instrument conceptsone by one, collect their ‘c88’ values (x) and typ-icality scores (y), and then measure the tendencyof changes in the two variables.

The result is shown in Table 4–8 where the col-umn ‘Rank’ shows the rank of the component’scorrelation score (compared to other components).For instance, in Table 4 ‘c88’ has the highest cor-relation with the concept’s typicality score.

SIG-PROPS Correlation Corr. rankc88 0.926 1stc258 0.918 2nd

Table 4: Corr(SIG-PROPS, typicality): instrument


Table 5: Corr(SIG-PROPS, typicality): animal


Table 6: Corr(SIG-PROPS, typicality): bird

SIG-PROPS Correlation Corr. rankc233 0.255 1stc120 0.224 2ndc207 0.216 4thc192 0.030 104th

Table 7: Corr(SIG-PROPS, typicality): food

SIG-PROPS Correlation Corr. rankc229 0.743 1stc233 0.540 4thc27 0.516 5thc44 0.474 7th

c156 -0.663 85th

Table 8: Corr(SIG-PROPS, typicality): fruit

As the results show, there is clear tendency SIG-PROPS having high correlation with the typicalityscores. Most of the SIG-PROPS showed meaning-ful correlation (> 0.5) with the typicality score orplaced at the top in the component–typicality cor-relation ranking. The result strongly indicates thateven when we apply the simple method of iden-tifying SIG-PROPS and regarding them as proper-ties, they serve as strong indicators for the con-cept’s typicality.


Although limited in scale, our work showed thefeasibility of discovering properties from wordembeddings. Not only SIG-PROPS can be used toincrease the interpretability of word embeddings,but also enable us more elaborate, property-basedmeaning comparison.

Our next step would be checking the applica-bility to general NLP tasks (e.g. NER, synonymidentification). Also, applying our method to wordembeddings that have more granular components(e.g. 2,500) might be helpful for identifying SIG-PROPS in more granular level.

Acknowledgments

This work was supported by Institute for Infor-mation & communications Technology Promo-tion(IITP) grant funded by the Korea govern-ment(MSIP) (No. R0126-16-1002, Developmentof agro-livestock cloud and application service forbalanced production, transparent distribution andsafe consumption based on GS1)

ReferencesKatrin Erk. 2016. What do you know about an alliga-

tor when you know the company it keeps? Seman-tics and Pragmatics Article, 9(17):1–63.

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, ChrisDyer, and Noah A. Smith. 2015. Sparse overcom-plete word vector representations. In Proceedingsof the 53rd Annual Meeting of the Association forComputational Linguistics and the 7th International

94

Joint Conference on Natural Language Processing(Volume 1: Long Papers), pages 1491–1500, Bei-jing, China, July.

Alona Fyshe, Leila Wehbe, Partha P. Talukdar, BrianMurphy, and Tom M. Mitchell. 2015. A composi-tional and interpretable semantic space. In Proceed-ings of the 2015 Conference of the North AmericanChapter of the Association for Computational Lin-guistics: Human Language Technologies, pages 32–41, Denver, Colorado, May–June.

Aurelie Herbelot and Eva Maria Vecchi. 2015. Build-ing a shared world: Mapping distributional tomodel-theoretic semantic spaces. In Proceedings ofEMNLP, number September, pages 22–32, Lisbon,Portugal.

P. O. Hoyer. 2002. Non-negative sparse coding. InNeural Networks for Signal Processing - Proceed-ings of the IEEE Workshop, volume 2002-Janua,pages 557–565.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2016.Visualizing and Understanding Recurrent Networks.International Conference on Learning Representa-tions (ICLR), pages 1–13.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52ndAnnual Meeting of the Association for Computa-tional Linguistics (Volume 2: Short Papers), pages302–308, Baltimore, Maryland, June.

Omer Levy, Steffen Remus, Chris Biemann, and IdoDagan. 2015. Do supervised distributional meth-ods really learn lexical inference relations? In Pro-ceedings of the 2015 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies, pages970–976, Denver, Colorado, May–June.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Juraf-sky. 2016a. Visualizing and Understanding NeuralModels in NLP. In Naacl, pages 1–10.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016b. Un-derstanding Neural Networks through Representa-tion Erasure. In Arxiv.

Brian Murphy, Partha Pratim Talukdar, and TomMitchell. 2012. Learning Effective and In-terpretable Semantic Models using Non-NegativeSparse Embedding. In Proceedings of COLING2012: Technical Papers, number December 2012,pages 1933–1950.

Gregory Murphy. 2004. The big book of concepts.MIT press.

Eleanor H. Rosch. 1973. Natural categories. Cogni-tive Psychology, 4(3):328–350.

Eleanor H. Rosch. 1975. Cognitive representations ofsemantic categories. Journal of experimental psy-chology: General, 104(3):192.

Sascha Rothe, Sebastian Ebert, and Hinrich Schutze.2016. Ultradense word embeddings by orthogonaltransformation. In Proceedings of the 2016 Con-ference of the North American Chapter of the As-sociation for Computational Linguistics: HumanLanguage Technologies, pages 767–777, San Diego,California, June.

Ivan Vulic, Daniela Gerz, Douwe Kiela, Felix Hill, andAnna Korhonen. 2016. HyperLex: A Large-ScaleEvaluation of Graded Lexical Entailment. Arxiv.

95


TTCSE: a Vectorial Resourcefor Computing Conceptual Similarity

Enrico MensaUniversity of Turin, Italy

Dipartimento di [email protected]

Daniele P. RadicioniUniversity of Turin, Italy

Dipartimento di [email protected]

Antonio LietoUniversity of Turin, Italy

ICAR-CNR, Palermo, [email protected]

Abstract

In this paper we introduce the TTCSE , alinguistic resource that relies on BabelNet,NASARI and ConceptNet, that has nowbeen used to compute the conceptual sim-ilarity between concept pairs. The con-ceptual representation herein provides uni-form access to concepts based on Babel-Net synset IDs, and consists of a vector-based semantic representation which iscompliant with the Conceptual Spaces, ageometric framework for common-senseknowledge representation and reasoning.The TTCSE has been evaluated in a pre-liminary experimentation on a conceptualsimilarity task.

1 Introduction

The development of robust and wide-coverage re-sources to use in different sorts of application(such as text mining, categorization, etc.) hasknown in the last few years a tremendous growth.In this paper we focus on computing conceptualsimilarity between pairs of concepts, based on aresource that extends and generalizes an attemptcarried out in (Lieto et al., 2016a). In particular,the TTCSE—so named after Text to ConceptualSpaces-Extended— has been acquired by inte-grating two different sorts of linguistic resources,such as the encyclopedic knowledge availablein BabelNet (Navigli and Ponzetto, 2012) andNASARI (Camacho-Collados et al., 2015), andthe common-sense grasped by ConceptNet (Speerand Havasi, 2012). The resulting representationenjoys the interesting property of being anchoredto both resources, thereby providing a uniformconceptual access grounded on the sense identi-fiers provided by BabelNet.

Conceptual Spaces (CSs) can be thought of as

a particular class of vector representations whereknowledge is represented as a set of limited thoughcognitively relevant quality dimensions; in thisrepresentation a geometrical structure is associ-ated to each quality dimension, mostly based oncognitive accounts. In this setting, concepts cor-respond to convex regions, and regions with dif-ferent geometrical properties correspond to dif-ferent sorts of concepts (Gardenfors, 2014). Thegeometrical features of CSs have a direct appealfor common-sense representation and common-sense reasoning, since prototypes (the most rele-vant representatives of a category from a cogni-tive point of view, see (Rosch, 1975)) correspondto the geometrical centre of a convex region, thecentroid. Also exemplars-based representationscan be mapped onto points in a multidimensionalspace, and their similarity can be computed asthe distance intervening between each two points,based on some suitable metrics such as Euclideanor Manhattan distance. etc..

The CS framework has been recently used toextend and complement the representational andinferential power allowed by formal ontologies—and in general symbolic representation—, thatare not suited for representing defeasible, proto-typical knowledge, and for dealing with the cor-responding typicality-based conceptual reasoning(Lieto et al., 2017). Also, wide-coverage seman-tic resources such as DBPedia and ConceptNet, infact, in different cases fail to represent the sort ofcommon-sense information based on prototypicaland default information which is usually requiredto perform forms of plausible reasoning.1 In this

1Although DBPedia contains information on many sortsof entities, due to its explicit encyclopedic commitment,common-sense information is dispersed among textual de-scriptions (e.g., in the abstracts) rather than being available ina well-structured formal way. For instance, the fork entity canbe categorized as an object, whilst there is no structured infor-mation about its typical usage. On the other hand, Concept-

96

paper we explore whether and to what extent alinguistic resource describing concepts by meansof qualitative and synthetic vectorial representa-tion is suited to assess the conceptual similaritybetween pairs of concepts.

2 Vector representations with the TTCSE

The TTCSE has been designed to build resourcesencoded as general conceptual representations.We presently illustrate how the resource is built,deferring to Section 3 the description of the con-trol strategy designed to use it in the computationof conceptual similarity.

The TTCSE takes in input a concept c referredthrough a BabelNet synset ID, and produces asoutput a vector representation ~c where the in-put concept is described along some semantic di-mensions. In turn, filling each such dimensionamounts to finding a set of appropriate concepts:features act like vector space dimensions, and theyare based on ConceptNet relationships.2 The di-mensions are filled with BabelNet synset IDs, sothat finally each concept c residing in the linguis-tic resource can be defined as

~c =⋃d∈D{〈IDd, {c1, · · · , ck}〉} (1)

where IDd is the identifier of the d-th dimension,and {c1, · · · , ck} is the set of values chosen for d.

The control strategy implemented by the TTCSE

includes two main steps, semantic extraction(composed by the extraction and concept identi-fication phases) and the vector injection.

Net is more suited to structurally represent common-senseinformation related to typicality. However, in ConceptNetthe coverage of this type of knowledge component is some-times not satisfactory. For similar remarks on such resources,claiming for the need of new resources more suited to repre-sent common-sense information, please also refer to (Basileet al., 2016).

2The full list of the employed properties, whichwere selected from the most salient properties in Con-ceptNet, includes: INSTANCEOF, RELATEDTO, ISA,ATLOCATION, DBPEDIA/GENRE, SYNONYM, DERIVED-FROM, CAUSES, USEDFOR, MOTIVATEDBYGOAL, HAS-SUBEVENT, ANTONYM, CAPABLEOF, DESIRES, CAUS-ESDESIRE, PARTOF, HASPROPERTY, HASPREREQUI-SITE, MADEOF, COMPOUNDDERIVEDFROM, HASFIRST-SUBEVENT, DBPEDIA/FIELD, DBPEDIA/KNOWNFOR, DB-PEDIA/INFLUENCEDBY, DEFINEDAS, HASA, MEMBEROF,RECEIVESACTION, SIMILARTO, DBPEDIA/INFLUENCED,SYMBOLOF, HASCONTEXT, NOTDESIRES, OBSTRUCT-EDBY, HASLASTSUBEVENT, NOTUSEDFOR, NOTCA-PABLEOF, DESIREOF, NOTHASPROPERTY, CREATEDBY,ATTRIBUTE, ENTAILS, LOCATIONOFACTION, LOCATED-NEAR.

Extraction The TTCSE takes in input c andbuilds a bag-of-concepts C including the conceptsassociated to c through one or more ConceptNetrelationships. All ConceptNet nodes related tothe input concept c are collected: namely, wetake the corresponding ConceptNet node for eachterm in the WordNet (Miller, 1995) synset of c,sc ∈ WN-sync. For each such term we extractall terms t linked through d, one of the aforemen-tioned ConceptNet relationships: that is, we col-lect the terms sc d→ t and store them in the set T .Each sc can be considered as a different lexical-ization for the same concept c, so that all t can begrouped in T , that finally contains all terms asso-ciated in any way to c.

Since ConceptNet does not provide any directanchoring mechanism to associate its terms tomeaning identifiers, it is necessary to determinewhich of the terms t ∈ T are relevant for theconcept c. In other words, when we access theConceptNet page for a certain term, we find notonly the association regarding that term with thesense conveyed by c, but also all the associationsregarding it in any other meaning. To select only(and possibly all) the associations that concern thesense individuated through the previous phase, weintroduce the notion of relevance. To give an in-tuition of this process, the terms found in Con-ceptNet are considered as relevant (and thus re-tained) either if they exhibit a heavy weight in theNASARI vector corresponding to the consideredconcept, or if they share at least some terms withthe NASARI vector (further details on a similarapproach can be found in (Lieto et al., 2016a)).

Concept identification Once the set of relevantterms has been extracted, we need to lift them tothe corresponding concept(s), which will be usedas value for the features. We stress, in fact, thatdimension fillers are concepts rather than terms(please refer to Eq. 1). In the concept identifica-tion step, we exploit NASARI in order to provideeach term t ∈ T with a BabelNet synset ID, thusfinally converting it into the bag-of-concepts C.

Given a ti ∈ T , we distinguish two main cases.If ti is contained in one or more synsets inside theNASARI vector of c, we obtain ci (the concept un-derlying ti) by directly assigning to ti the identi-fier of the heaviest weighted synset that containsit.3 Otherwise, if ti is not included in any of the

3NASARI unified vectors are composed by a head con-

97

synsets in the NASARI vector associated to c, weneed to choose a vector among all possible ones:we first select a list of candidate vectors (that is,those containing ti in their vector head), and thenchoose the best one by retaining the vector wherec’s ID has highest weight.

For example, given in input the concept bankintended as a financial institution, we inspect theedges of the ConceptNet node ‘bank’ and its syn-onyms. Then, thanks to the relevance notion weget rid of associations such as ‘bank ISA flightmaneuver’ since the term ‘flight maneuver’ is notpresent in the vector associated to the conceptbank. Conversely, we accept sentences such as‘bank HASA branch’ (i.e., ‘branch’ is added to T ).Finally, ‘branch’ goes through the concept identi-fication phase, resulting in a concept ci and then itis added to C.

Vector injection The bag-of-concepts C is thenscanned, and each value is injected in the templatefor ~c. Each value {c1, . . . , cn ∈ C} is still pro-vided with the relationship that linked it to c inConceptNet, so this value is employed to fill thecorresponding feature in ~c. For example, if ck isextracted from the ConceptNet relation USEDFOR

(i.e., c USEDFOR→ ck), the value ck will be added tothe set of entities that are used for c.

2.1 Building the TTCSE resource

In order to build the set of vectors in the TTCSE

resource, the system took in input 16, 782 con-cepts. Such concepts have been preliminarilycomputed (Lieto et al., 2016b) by starting from the10K most frequent nouns present in the Corpus ofContemporary American English (COCA).4 Then,for each input concept the TTCSE scans some 3MConceptNet nodes to retrieve the terms that appearinto the WordNet synset of the input. This stepallows to browse over 11M associations avail-able in ConceptNet, and to extract on average 155ConceptNet nodes for each input concept. Sub-sequently, the TTCSE exploits the 2.8M NASARIvectors to decide whether each of the extractednodes is relevant or not w.r.t. the input concept,and then it tries to associate a NASARI vector toeach of them (concept identification step). On av-

cept (represented by its ID in the first position) and a body,that is a list of synsets related to the head concept. Eachsynset ID is followed by a number that grasps the strengthof its correlation with the head concept.

4http://corpus.byu.edu/full-text/.

erage, 14.90 concepts are used to fill each vector.5

3 Computing Conceptual Similariy

One main assumption underlying our approach isthat two concepts are similar insofar as they sharesome values on the same dimension, such as whenthey are both used for the same ends, they sharecomponents, etc.. Consequently, our metrics doesnot employ WordNet taxonomy and distances be-tween pairs of nodes, such as in (Wu and Palmer,1994; Leacock et al., 1998; Schwartz and Gomez,2008), nor it depends on information content ac-counts either, such as in (Resnik, 1998a; Jiang andConrath, 1997).

The representation available to the TTCSE is en-tirely filled with conceptual identifiers, so to assessthe similarity between two such values we checkwhether both the concept vector ~ci and the vec-tor ~cj share the same (concept) value for the samedimension d ∈ D, and our similarity along eachdimension basically depends on this simple intu-ition:

sim(~ci,~cj) =1|D| ·

∑d∈D

|di ∩ dj |.

The score computed by the TTCSE system canbe justified based on the dimensions actuallyfilled: this explanation can be built automatically,since the similarity between ~ci and ~cj is a func-tion of the sum of the number of shared elementsin each dimension, so that one can argue thata given score x is due to the fact that along agiven dimension d both concepts share some val-ues (e.g., sim(table, chair) = x because each oneis a (ISA) ‘furniture’, both are USEDFOR ‘eating’,‘studying’ and ‘working’; both can be found AT-LOCATION ‘home’, ‘office’; and each one HASA‘leg’).

Ultimately, the TTCSE collects informationalong the 44 dimensions listed in footnote 2, sothat we are in principle able to assess in how farsimilar they are along each and every dimension.However, our approach is presently limited bythe actual average filling factor, and by the noisethat can be possibly collected by an automaticprocedure built on top of the BabelNet knowledgebase. Since we need to deal with noisy and incom-plete information, some adjustments to the aboveformula have been necessary in order to handle

5The final resource is available for download at the URLhttp://ttcs.di.unito.it.

98

—intra dimension— the possibly unbalancednumber of concepts that characterize the differentdimensions; and to prevent —inter dimensions—the computation from being biased by more richlydefined concepts (i.e., those with more dimensionsfilled). The computation of the conceptual simi-larity score is thus based on the following formula:

sim(~ci,~cj) =1|D∗| ·

∑d∈D

|di ∩ dj |β (αa+ (1− α) b) + |di ∩ dj |

where |di ∩ dj | counts the number of conceptsthat are used as fillers for the dimension d inthe concept ~ci and ~cj , respectively; and a and bare computed as a = min(|di − dj |, |dj − di|),b = max(|di − dj |, |dj − di|); and |D∗| counts thedimensions filled with at least one concept in bothvectors.

This formula is known as the Symmetrical Tver-sky’s Ratio Model (Jimenez et al., 2013), whichis a symmetrical reformulation for the Tversky’sratio model (Tversky, 1977). It enjoys the fol-lowing properties: i) it allows grasping the num-ber of common traits between the two vectors (atthe numerator); ii) it allows tuning the balance be-tween cardinality differences (through the param-eter α), and between |di∩dj | and |di−dj |, |dj−di|(through the parameter β). Interestingly, by set-ting α = .5 and β = 1 the formula equals thepopular Dice’s coefficient. The parameters α andβ were set to .8 and .2 for the experimentation.

4 Evaluation

In the experimentation we addressed the concep-tual similarity task at the sense level, that is theTTCSE system has been fed with sense pairs. Weconsidered three datasets,6 namely the RG, MCand WS-Sim datasets, first designed in (Ruben-stein and Goodenough, 1965; Miller and Charles,1991) and (Agirre et al., 2009), respectively. His-torically, while the first two (RG and MC) datasetswere originally conceived for similarity measures,the WS-Sim dataset was developed as a subsetof a former dataset built by (Finkelstein et al.,2001) created to test similarity by including pairsof words related through specific relationships,such as synonymy, hyponymy, and unrelated. Allsenses were mapped onto WordNet 3.0.

The similarity scores computed by the TTCSE

system have been assessed through Pearsons r

6Publicly available at the URL http://www.seas.upenn.edu/˜hansens/conceptSim/.

ρ r

RG 0.78 0.85MC 0.77 0.80WS-Sim 0.64 0.54

Table 1: Spearman (ρ) and Pearson (r) correla-tions obtained over the three datasets.

and Spearmans ρ correlations, that are largelyadopted for the conceptual similarity task (for arecent compendium of the approaches please referto (Pilehvar and Navigli, 2015)). The former mea-sure captures the linear correlation of two vari-ables as their covariance divided by the productof their standard deviations, thus basically allow-ing to grasp differences in their values, whilst theSpearman correlation is computed as the Pearsoncorrelation between the rank values of the consid-ered variables, so it is reputed to be best suited toassess results in a similarity ranking setting whererelative scores are relevant (Schwartz and Gomez,2011; Pilehvar and Navigli, 2015).

Table 1 shows the results obtained by the systemin a preliminary experimentation.

Provided that the present task of sense-levelsimilarity is slightly different from word-levelsimilarity (about this distinction, please referto (Pilehvar and Navigli, 2015)), and our resultscan be thus hardly compared to those in litera-ture, the reported figures are still far from thestate of the art, where the Spearman correlation ρreaches 0.92 for the RG dataset (Pilehvar and Nav-igli, 2015), 0.92 for the MC dataset (Agirre et al.,2009), and 0.81 for the WS-Sim dataset (Halawiet al., 2012; Tau Yih and Qazvinian, 2012).7

However, we remark that the TTCSE employsvectors of a very limited size w.r.t. the standardvector-based resources used in the current mod-els of distributional semantics (as mentioned, eachvector is defined, on average, through 14.90 con-cepts). Moreover, due to the explicit groundingprovided by connecting the NASARI feature val-ues to the corresponding properties in ConceptNet,the TTCSE can be used to provide the scores re-turned as output with an explanation, based on theshared concepts along some given dimension. Atthe best of our knowledge, this is a unique fea-ture, that cannot be easily reached by methods

7Rich references to state-of-the-art results and works ex-perimenting on the mentioned datasets can be found on theACL Wiki, at the URL https://goo.gl/NQlb6g.

99

based on Latent Semantic Analysis (such as thosepioneered by (Deerwester et al., 1990)) and canbe only partly approached by techniques exploit-ing taxonomic structures (Resnik, 1998b; Baner-jee and Pedersen, 2003). Conversely, few and rel-evant traits are present in the final linguistic re-source, which is thus synthetic and more cogni-tively plausible (Gardenfors, 2014).

In some cases —27 concept pairs out of theoverall 190 pairs— the system was not able to re-trieve an ID for one of the concepts in the pair:such pairs were excluded from the computation ofthe final accuracy. Missing concepts may be lack-ing in (at least one of) the resources upon whichthe TTCSE is built: including further resourcesmay thus be helpful to overcome this limitation.Also, difficulties stemmed from insufficient infor-mation for the concepts at stake: this phenomenonwas observed, e.g., when both concepts have beenfound, but no common dimension has been filled.This sort of difficulty shows that the coverage ofthe resource still needs to be enhanced by improv-ing the extraction phase, so to add further conceptsper dimension, and to fill more dimensions.

5 Conclusions

In this paper we have introduced a novel resource,the TTCSE , which is compatible with the Con-ceptual Spaces framework and aims at putting to-gether encyclopedic and common-sense knowl-edge. The resource has been employed to computethe conceptual similarity between concept pairs.Thanks to its representational features it allowsimplementing a simple though effective heuristicsto assess similarity: that is, concepts are similarinsofar as they share some values along the samedimension. However, further heuristics will be in-vestigated in the next future, as well.

A preliminary experimentation has been run,employing three different datasets. Provided thatwe consider the obtained results as encouraging,the experimentation clearly points out that there isroom for improvement along two main axes: di-mensions must be filled with further information,and also the quality of the extracted informationshould be improved. Both aspects will be the ob-ject of our future efforts.

ReferencesEneko Agirre, Enrique Alfonseca, Keith Hall, Jana

Kravalova, Marius Pasca, and Aitor Soroa. 2009.

A study on similarity and relatedness using distribu-tional and wordnet-based approaches. In Procs ofHuman Language Technologies: The 2009 AnnualConference of the North American Chapter of theAssociation for Computational Linguistics, pages19–27. ACL.

S. Banerjee and T. Pedersen. 2003. Extended glossoverlaps as a measure of semantic relatedness. InProceedings of the Eighteenth International JointConference on Artificial Intelligence, pages 805–810.

Valerio Basile, Soufian Jebbara, Elena Cabrio, andPhilipp Cimiano. 2016. Populating a knowledgebase with object-location relations using distribu-tional semantics. In Eva Blomqvist, Paolo Cian-carini, Francesco Poggi, and Fabio Vitali, editors,EKAW, volume 10024 of Lecture Notes in ComputerScience, pages 34–50.

Jose Camacho-Collados, Mohammad Taher Pilehvar,and Roberto Navigli. 2015. NASARI: a novelapproach to a semantically-aware representation ofitems. In Proceedings of NAACL, pages 567–577.

S. Deerwester, S. Dumais, G. Furnas, T. Landauer, andR. Harshman. 1990. Indexing by latent semanticanalysis. Journal of the American Society for Infor-mation Science, 41:391–407.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias,Ehud Rivlin, Zach Solan, Gadi Wolfman, and Ey-tan Ruppin. 2001. Placing search in context: Theconcept revisited. In Proceedings of the 10th inter-national conference on World Wide Web, pages 406–414. ACM.

Peter Gardenfors. 2014. The geometry of meaning:Semantics based on conceptual spaces. MIT Press.

Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, andYehuda Koren. 2012. Large-scale learning of wordrelatedness with constraints. In Qiang Yang, DeepakAgarwal, and Jian Pei, editors, KDD, pages 1406–1414. ACM.

Jay J. Jiang and David W. Conrath. 1997. Semanticsimilarity based on corpus statistics and lexical tax-onomy. In Proceedings of International Conferenceon Research in Computational Linguisics, Taiwan.

Sergio Jimenez, Claudia Becerra, Alexander Gelbukh,Av Juan Dios Batiz, and Av Mendizabal. 2013.Softcardinality-core: Improving text overlap withdistributional measures for semantic textual simi-larity. In Second Joint Conference on Lexical andComputational Semantics, volume 1, pages 194–201.

Claudia Leacock, George A Miller, and MartinChodorow. 1998. Using corpus statistics and word-net relations for sense identification. ComputationalLinguistics, 24(1):147–165.

100

Antonio Lieto, Enrico Mensa, and Daniele P. Radi-cioni. 2016a. A Resource-Driven Approach for An-choring Linguistic Resources to Conceptual Spaces.In Procs of the XV International Conference of theItalian Association for Artificial Intelligence, vol-ume 10037 of LNAI, pages 435–449. Springer.

Antonio Lieto, Enrico Mensa, and Daniele P. Radi-cioni. 2016b. Taming sense sparsity: a common-sense approach. In Proceedings of Third ItalianConference on Computational Linguistics (CLiC-it2016) & Fifth Evaluation Campaign of Natural Lan-guage Processing and Speech Tools for Italian.

Antonio Lieto, Daniele P. Radicioni, and ValentinaRho. 2017. Dual PECCS: A Cognitive Systemfor Conceptual Representation and Categorization.Journal of Experimental & Theoretical Artificial In-telligence, 29(2):433–452.

George A. Miller and Walter G. Charles. 1991. Con-textual correlates of semantic similarity. Languageand cognitive processes, 6(1):1–28.


Roberto Navigli and Simone Paolo Ponzetto. 2012.BabelNet: The automatic construction, evaluationand application of a wide-coverage multilingual se-mantic network. Artificial Intelligence, 193:217–250.

Mohammad Taher Pilehvar and Roberto Navigli. 2015.From senses to texts: An all-in-one graph-based ap-proach for measuring semantic similarity. Artif. In-tell., 228:95–128.

Philip Resnik. 1998a. Semantic similarity in a taxon-omy: An information-based measure and its appli-cation to problems of ambiguity in natural language.Journal of Artificial Intelligence Research, 11(1).

Philip Resnik. 1998b. Semantic similarity in a taxon-omy: An information-based measure and its appli-cation to problems of ambiguity in natural language.Journal of Artificial Intelligence Research, 11(1).

Eleanor Rosch. 1975. Cognitive Representations ofSemantic Categories. Journal of experimental psy-chology: General, 104(3):192–233.

Herbert Rubenstein and John B. Goodenough. 1965.Contextual correlates of synonymy. Communica-tions of the ACM, 8(10):627–633.

Hansen A. Schwartz and Fernando Gomez. 2008. Ac-quiring knowledge from the web to be used as selec-tors for noun sense disambiguation. In Procs of theTwelfth Conference on Computational Natural Lan-guage Learning, pages 105–112. ACL.

Hansen A Schwartz and Fernando Gomez. 2011. Eval-uating semantic metrics on tasks of concept simi-larity. In Proc. Int. Florida Artif. Intell. Res. Soc.Conf.(FLAIRS), page 324.

Robert Speer and Catherine Havasi. 2012. Represent-ing General Relational Knowledge in ConceptNet 5.In LREC, pages 3679–3686.

Wen Tau Yih and Vahed Qazvinian. 2012. Measuringword relatedness using heterogeneous vector spacemodels. In HLT-NAACL, pages 616–620. The Asso-ciation for Computational Linguistics.

Amos Tversky. 1977. Features of similarity. Psycho-logical review, 84(4):327.

Zhibiao Wu and Martha Palmer. 1994. Verbs seman-tics and lexical selection. In Proceedings of the 32ndannual meeting on Association for ComputationalLinguistics, pages 133–138. ACL.

101


Measuring the Italian-English lexical gap for action verbs and its impacton translation

Lorenzo GregoriUniversity of Florence

[email protected]

Alessandro PanunziUniversity of Florence

[email protected]

Abstract

This paper describes a method to measurethe lexical gap of action verbs in Italianand English by using the IMAGACT on-tology of action. The fine-grained cate-gorization of action concepts of the datasource allowed to have wide overview ofthe relation between concepts in the twolanguages. The calculated lexical gap forboth English and Italian is about 30% ofthe action concepts, much higher than pre-vious results. Beyond this general num-bers a deeper analysis has been performedin order to evaluate the impact that lexi-cal gaps can have on translation. In partic-ular a distinction has been made betweenthe cases in which the presence of a lexi-cal gap affects translation correctness andcompleteness at a semantic level. The re-sults highlight a high percentage of con-cepts that can be considered hard to trans-late (about 18% from English to Italianand 20% from Italian to English) and con-firms that action verbs are a critical lexicalclass for translation tasks.

1 Introduction

Lexical gap is a well known phenomenon in lin-guistics and its identification allows to discoversome relevant features related to the semanticcategorization operated by languages. A lexicalgap corresponds to a lack of lexicalization of acertain concept in a given language. This phe-nomenon traditionally emerged from the analysisof a single language by means of the detection ofempty spaces in a lexical matrix (see the semi-nal works by Leher (1974) and Lyons (1977); seealso Kjellmer (2003)). Anyway, lexical gap be-comes a major issue when comparing two or more

languages, as in translation tasks (Ivir, 1977). Inthis latter case, a lexical gap can be defined as theabsence of direct lexeme in one language whilecomparing two languages during translation (Cvi-likaite, 2006). The presence of lexical gaps be-tween two languages is more than a theoreticalproblem, having a strong impact in several relatedfields: lexicographers need to deal with lexicalgaps in the compilation of bilingual dictionaries(Gouws, 2002); in knowledge representation thecreation of multilanguage linguistic resources re-quire a strategy to cope with the lack of concepts(Jansseen, 2004); the lexical transfer process is af-fected by the presence of lexical gaps in automatictranslation system, reducing their accuracy (San-tos, 1990).

Even if in literature it’s possible to find manyexamples of gaps, it’s hard to estimate them. Thisis due to the fact that most of the gaps are re-lated to small semantic differences that are hard toidentify: available linguistic resources usually rep-resent a coarse-grained semantics, so while theyare useful to discriminate the prominent senses ofwords, they can’t capture small semantic shifts. Inaddition to it, a multilanguage resource is requiredfor this purpose, but these resources are normallybuilt up through a mapping between two or moremonolingual resources and this cause an approx-imation in concept definitions: similar conceptstend to be grouped together in a unitary conceptthat represent the core-meaning and lose their se-mantic specificities.

2 IMAGACT

Verbs are a critical lexical class for disambiguationand translation tasks: they are much more polyse-mous than nouns and, moreover, their ambiguityis hard to resolve (Fellbaum et al., 2001). In par-ticular the representation of word senses as sepa-

102

rate entities is tricky, since their boundaries are of-ten vague causing the senses to be under-specifiedand overlapping. From this point of view the sub-class of general verbs represent a crucial point, be-cause these verbs are characterized by both highfrequency in the use of language and high ambi-guity.

IMAGACT1 is a visual ontology of action thatprovides a video-based translation and disam-biguation framework for general verbs. The re-source is built on an ontology containing a fine-grained categorization of action concepts, eachrepresented by one or more video prototypes asrecorded scenes and 3D animations.

IMAGACT currently contains 1,010 sceneswhich encompass the action concepts most com-monly referred to in everyday language usage.Data are derived from the manual annotation ofverb occurrences in spontaneous spoken corpora(Moneglia et al., 2012); the dataset has been com-piled by selecting action verbs with the highestfrequency in the corpora and comprises 522 Ital-ian and 554 English lemmas. Although the setof retrieved actions is necessarily incomplete, thismethodology ensures to have a significant pictureof the main action types performed in everydaylife2.

The links between verbs and video scenes arebased on the co-referentiality of different verbswith respect to the action expressed by a scene(i.e. different verbs can describe the same ac-tion, visualized in the scene). The visual represen-tations convey the action information in a cross-linguistic environment and IMAGACT may thusbe exploited to discover how the actions are lexi-calized in different languages.

In addition to it IMAGACT contains a semanticclassification of each lemma, that is divided intoTypes: each verb Type identifies an action con-cept and contains one ore more scenes, that workas prototypes of that concept. Type classificationis manually performed in Italian and English inparallel, through a corpus-based annotation pro-cedure by native language annotators (Monegliaet al., 2012); this allowed to have a discrimina-tion of verb Types based only on the annotatorcompetence, without any attempt to fit the datainto predefined semantic models. Validation re-sults (Gagliardi, 2014) highlight a good rate of

1http://www.imagact.it2see Moneglia et al. (2012) and Moneglia (2014b) for

details about corpora annotation numbers and methodology.

Type discrimination agreement: a Cohen k of 0,82for 2 expert annotators and a Fleiss k of 0.73 for 4non-expert ones3.

For these features IMAGACT ontology is a re-liable data source to measure the lexical gap be-tween Italian and English: in fact verb Types aredefined independently, but linked together throughthe scenes. The comparison of Types in differentlanguage through their action prototypes allows toidentify the action concepts that are shared be-tween the two languages and the ones that don’tmatch with any concept in the other language; inthis case we have a lexical gap.

3 Type relations

In this frame we can perform a set-based compar-ison, considering a Type as just a set of scenes.A Type is a lexicalized concept, so a partition ofthe meaning, but semantic features are not repre-sented in the ontology and, in fact, they are un-known: data are derived from the ability of com-petent speaker in performing a categorization ofsimilar items with respect to a lemma, without anyattempt to formalize semes. So if we look at thedatabase we can say that Types are merely sets ofscenes.

Comparing a Type (T1) of a verb in source lan-guage (V1) with a Type (T2) of a verb in target lan-guage (V2) we can have 5 possible configurations:

1. T1 ≡ T2: two Types are equivalent if theycontain the same set of scenes;

2. T1 ∩ T2 = ∅: two Types are disjoint if theydon’t share any scene;

3. T1 ⊂ T2: T1 is a subset of T2 if any scene ofT1 is also a scene of T2 and the 2 Types arenot equivalent;

4. T1 ⊃ T2: T1 is a superset of T2 if any sceneof T2 is also a scene of T1 and the 2 Types arenot equivalent;

5. T1∩T2 6= ∅∧T1 * T2∧T1 + T2: two Typesare partially overlapping if they share somescenes and each Type have some scenes notbelonging to the other one.

3Inter-annotator agreement (ITA) measured on WordNetby expert annotators on comparable verb sets (score for fine-grained sense inventories): ITA = 72% on SemEval-2007dataset (Pradhan et al., 2007); ITA = 71% on SensEval-2dataset (Palmer et al., 2007).

103

Figure 1: Two Equivalent Types belonging to theItalian verb toccare and to the English verb touch.

It’s important to discuss these cases separately,because each one of them highlights a differentsemantic relation between verbs and has differentimplications for translation.

When two Types are equivalents (case 1) the 2languages share the action concept the Types rep-resent: we could say that there is an interlinguisticconcept. This case is not problematic for transla-tion: each occurrence of the verb V1 that belongsto Type T1 can be translated with V2; moreoverwe can apply V1 to translate any occurrence of V2

belonging to T2.For example one Type of the English verb to

touch and one Type of the Italian verb toccareare equivalent. They share 3 video scenes: Marytouches the doll, Mary touches the table and Johntouches Mary (see Fig. 1). Each scene is con-nected to a different set of verbs (i.e. to brush, tograze, to caress), representing a specific semanticconcept, but they are kept together by a more gen-eral concept both in Italian and in English. So inany of these actions the verb to touch can be safelytranslated in Italian with toccare and vice versa.

If two Types are disjoint (case 2) the Types re-fer to unrelated semantic concepts and we can as-sume that translation between an occurrence of V1

belonging to T2 can not be translated with V2.In cases 3 and 4 the Types are hierarchically re-

lated and we can assume the existence of a seman-tic relation that links a general Type with a spe-cific one. Although we can not induce the typeof this relation that could be hyponym, entailment,troponym and so on. In this configuration we cansee that translation is safe from specific to general,but not vice versa: in case 3 any occurrence of V1

belonging to T1 can be translated with V2, whilein case 4 V2 can not be safely applied, because it

Figure 2: Two Hierarchically related Types be-longing to the Italian verb accoltellare and to theEnglish verb stab.

encodes only a sub-part of the concept representedby T1.

For example Type 1 of the English verb to staband Type 1 of the Italian verb accoltellare cate-gorize action where a sharp object pierces a body,but while stab can be applied to describe actionsindependently on their aim and the tool used, ac-coltellare is applicable only when the agent vol-untarily injures someone and the action is accom-plished with a knife. In this case the Italian Typeis more specific than the English one, so transla-tion is safe from Italian to English (stab can beused to translate any occurrence of accoltellare-Type 1), but not vice versa: stab-Type 1 can not bealways translated with accoltellare, because a partof its variation is covered by other Italian verbslike trafiggere, penetrare or attraversare.

Finally a partial overlap between Types (case5) doesn’t allow to induce any semantic relationbetween Types: in these cases we have differ-ent concepts that can refer the same action. Nor-mally these happen when the action is interpretedfrom two different points of view and categorizedwithin unrelated lexical concepts. In this casewe have a translation relation between V1 andV2 without having any semantic relation betweentheir Types.

For example the Italian verb abbassare, that isfrequently translated with lower in English, canalso be translated with position when applied tosome (but not every) actions belonging to Type 1,categorizing actions involving the body; moreoverwe have the same translation relation from En-glish to Italian where sometimes (but not always)position-Type 2 can be translated with abbassare.Here there are two Types that represent semanti-

104

Figure 3: Two partially overlapping Types belong-ing to the Italian verb abbassare and to the Englishverb position.

cally independent concepts, but that can both beapplied to describe some actions, like Mary posi-tions herself lower and other similar ones.

This happens rarely in Italian - English (14Types on our dataset) and in any of these casesthere are other translation verbs as possible alter-natives. Despite this, Type overlaps identificationis very relevant, because it allows to discover un-expected translation candidates (i.e. target verbsthat have a translation relation but not a seman-tic relation with the source verb) that can not beextracted from a lexico-semantic resource. In ad-dition to it Type overlaps identification is crucialif the target verb is the only one translation pos-sibility and this can happen, especially betweentwo languages that are very far: some evidencesfor example have been discovered in Italian andChinese (Pan, 2016) through a deep comparisonof Italian Types with Chinese verbs that refer tothe same scenes. This work allowed to identifysome positive occurrences of this interesting phe-nomenon, but can not be exploited for its numericquantification: indeed an exhaustive analysis thatinvolves the relation between action concepts canbe made only between Italian and English, sinceIMAGACT contains the verb Type discriminationin these two languages only.

4 Lexical gap identification

4.1 Dataset building

In order to measure the lexical gaps in Italian andEnglish we created a working dataset by select-ing the set of Types that have a full mapping inthe two languages. We need to consider that IMA-GACT annotation process has been carried out in

several steps: firstly verbs were annotated througha corpus-based procedure and Types were createdand validated by mother tongue speakers on thebasis of their linguistic competence; then for eachconcept a scene was produced to provide a proto-typical representation of it; after that a mappingbetween Italian and English was performed bylinking the scenes to the Types of each language;finally annotators were requested to recheck eachscene and add the missing verbs that are applica-ble to it. This last revision enriched the scene withmore verbs that don’t belong to any Type.

We decided to exclude from the dataset all thescenes (and the related Types) that contain untypedverbs, considering that a partial typing does notensure the coherence of verb Type discrimination:in fact it’s not possible to be sure that the creationof Types for these new instances would preservethe original Type distinction.

After this pruning we obtained a set of 1,000Italian Types and 1,027 English Types, that referto 501 and 535 verbs respectively (see Table 1).

IT ENTypes 1,000 1,027Verbs 501 535Scenes 980 917

Table 1: Number of Types, verbs and scenes be-longing to the dataset.

4.2 Methodology

According to our dataset, we can easily estimatethe lexical gap by measuring the number of Typesin source language that don’t have an equivalentType in target language. Namely for each conceptin source language we are going to verify if there isa concept in target language that refer to the sameset of actions (represented by video prototypes);if the match is not found we have a lexical gap intarget language.

As we can see in table 2, the action concepts thatare lexicalized in Italian and without a correspond-ing match in English are 33,6% (English gap); onthe contrary the Italian gap for English concepts is29,02%.

Before going ahead we need to do some con-siderations about these numbers. First of all wecan see that these percentages are much higherthan the ones calculated by Bentivogli and Pi-anta (2000), that found 7,4% of gaps for verbs in

105

IT→ EN EN→ ITTotal Types 1,000 1,027Equiv. Types 664 (66,4%) 729 (70,98%)Lexical gaps 336 (33,6%) 298 (29,02%)

Table 2: Types in source language that have andhave not an equivalent Type in target language.

English-to-Italian comparison. This is a big shift,but it’s not surprising if we consider the differ-ences of the two experiments in terms of method-ology and dataset:

• IMAGACT Type distinction is more fine-grained in respect to WordNet synsets (Bar-tolini et al., 2014);

• the experiment by Bentivogli and Pianta wasled on MultiWordNet, in which multilan-guage Wordnets are created on the basis ofthe Princeton Wordnet sense distinction (Pi-anta et al., 2002); this methodology introducean approximation in the concepts definition;

• the 7,4% of Bentivogli and Pianta is a gen-eral value on verbs, while our experiment isfocused on action verbs, which are a stronglyambiguous lexical class (Moneglia, 2014a);

• the dictionary-based methodology proposedby Bentivogli and Pianta is nearly opposite toIMAGACT reference-based approach.

Beyond these general considerations a lemma-by-lemma comparison with the experiment ofBentivogli and Pianta (whose dataset is currentlynot available) would better explain this numericdifference.

5 Lexical gaps and translation problems

Besides a general measure of the gaps for actionconcept it’s important to go a step beyond to ver-ify in which cases the presence of a lexical gapimpacts the translation quality. In order to do this,we divided the Types without an equivalent in tar-get language in three categories:

• leaf Types: these Types in source languagerepresent concepts that are more specific thanother ones in target language; in this case theonly Type in target language that have a par-tial match with the Type in source languageis a superset (case 3);

• root Types: these Types in source languagerepresent concepts that are more general thanother ones in target language: the only Typein target language that have a partial matchwith the Type in source language is a subset(case 4);

• middle Types: these Types have a partialmatch in target language both with a moregeneral Type and with a more specific one(both cases 3 and 4).

As we mentioned before we did not find anycase in which a partial overlapping Type (case 5) isthe only one possible match in Italian and Englishcomparison; so these cases are counted within thethree categories above.

5.1 Root Types and uncertain translations

Starting from this classification we can see thatroot Types are the critical ones in terms of trans-lation: in fact we don’t have a unique lexicalizedconcept in target language that is able to repre-sent the concept in source language; instead wehave more than one Type (and multiple verbs) thatcover different subparts of the whole general con-cept variation. In these cases we need to have extrainformation about the action in order to translate itproperly. From a computational point of view wecan say that a word sense disambiguation of thesource verb is not enough to reach a correct trans-lation verb.

The two sentences The cell phone lands on thecarpet and The pole vaulter lands on the mat, forexample, belong to the same action concept ac-cording to the semantics of the verb to land4. InItalian there is not a unique Type that collects thesetwo actions: it’s possible to use atterrare for theathlete, but it is not allowed for the phone, forwhich we need to make a semantic shift and usethe verb cadere (that is more similar to fall down).Again cadere is not appropriate for the athlete, be-cause it implies that the athlete stumbles and falls.

So this action concept that is lexicalized in En-glish with to land does not have a unique trans-lation verb in Italian, and extra informations arerequired to translate it properly (if the theme is anhuman being or an object, in this specific case).

Table 3 show the number of leaf, root and mid-dle Types in Italian and English; we can see that

4unlike The butterfly lands on the flower or The airplanelands that belong to different concepts of to land.

106

IT→ EN EN→ ITLexical gaps 336 298Leaf Types 217 (64.58%) 200 (67.58%)Root Types 47 (13.99%) 43 (14.43%)Middle Types 72 (21.43%) 55 (18.46%)

Table 3: Number of Leaf, Root and Middle Typesin Italian and English (percentages on the lexicalgaps).

root Types represent the 14% of the lexical gaps inboth the languages, corresponding to 4-5% of thetotal Types.

5.2 General Types and lossy translations

Root Types are the most critical case for a transla-tion task, because they affect the correctness; be-sides there are also other kinds of lexical gaps thatimpact on translation. In particular is useful toestimate how semantically far is the best transla-tion candidate in the cases in which we can applya more general Type to translate the concept in thesource language. In fact in both leaf and middleTypes we have a Type in target language that ismore general to the source one, so it is safely ap-plicable to any occurrence belonging to the sourceType. This is not free from problems, because intranslation we use a more general verb, so we misssome semes that are encoded in the source verb.In fact in this case we still have a translation prob-lem, which is not in finding a possible target verb,but in adding more information in other lexical el-ement of the sentence to fill the lack of semanticinformation. In this case the gap does not affectthe correctness of the translation, but its complete-ness.

For example the English verb to plonk does nothave a correspondence in Italian. In particular asentence like John plonks the books on the tablebelongs to a Type of plonk that is a leaf Type (sothere is a possible translation verb in Italian), butfor which the nearest Italian Type is much wider,belonging to the very general verb mettere. In thiscase it’s possible to translate in Italian with Johnmette i libri sul tavolo, but losing all the informa-tion regarding the way the books are placed on thetable (mettere is more similar with to put); an ad-dition of other lexical elements to the sentence isrequired to fill this gap in Italian.

Conversely we can say that a small distance be-tween the source and the target Type does not have

a negative effect on translation. Type 1 of the En-glish verb to throw and Type 1 of the Italian verblanciare categorize a wide set of actions in whichan object is thrown by a person independently onthe presence of a destination or on the action aim(John throws the bowling bowl, John throws therock in the field, John throws the paper in the boxetc.). However these two Types are not equivalent,because the Italian one comprise also actions per-formed in a limited space with a highly controlledmovement, like Marco lancia una monetina, thatrequire another verb in English like to toss (Marcotosses a coin). In this case the small gap betweenthe Italian concept and English one does not affectthe translation: in fact we can say that lanciare canbe used to translate properly any action belongingto throw - Type 1.

Given this consideration a measure of the se-mantic distance with the translation verb is use-ful to evaluate the loss: this can be easily donefrom IMAGACT dataset by calculating the ratiobetween the cardinality (i.e. the number of scenes)of the source Type, T1, and the one of the nearesttarget Type, T2 (the Type with the minimum cardi-nality among the Types in target language that aresupersets of the source Type). This ratio estimatesthe overlapping between the Types:

overlap =|T1||T2|

Data are represented in Figure 4, reporting thenumber of Types (Italian and English) for eachoverlap values, where this values are divided in 10ranges.

We considered semantically distant those Typeswith overlap < 0.4 (sharing less than 2 scenesover 5). These high distance Types (see Table 4)are 150 for Italian (51.9% of leaf + middle Typesand 15% of the total Types) and 145 for English(56.86% of leaf + middle Types and 14.12% of thetotal Types).

Basically we see that not only root Types, butalso a relevant part of leaf and middle Types (morethan 50% both in Italian and English) represent acritical point for translation.

IT→ EN EN→ ITLeaf+Mid T. 289 255Low dist. T. 139 (48.1%) 110 (43.14%)High dist. T. 150 (51.9%) 145 (56.86%)

Table 4: Distance from the nearest general Type intarget language.

107

Figure 4: Number of Italian and English Types foreach overlap range.

Within this numbers, that are quite homoge-neous between the two languages, we can see thatin the overlap range from 0 to 0.2 there are muchmore Italian Types than English ones (19% of leaf+ middle Types against 9%); conversely EnglishTypes are more distributed in the range from 0.3to 0.4 (Figure 4). This means that in this area ofextreme distance between the source and the tar-get concept, we have an higher semantic loss inthe translation from Italian to English.

Finally we can have have an overall value oftranslation critical Types, by summing up the onesbelonging to high distance Types class and the rootTypes. The verbs these Types belong to are theverbs for which the selection of a good translationcandidate is problematic. Results are reported inTables 5 and 6 and confirm that lexical gaps in ac-tion verbs have a strong impact on translation.

IT ENTotal Types 1,000 1,027Root Types 47 (4.7%) 43 (4.2%)High dist. T. 150 (15.0%) 145 (14.12%)Critical Types 197 (19.7%) 188 (18.3%)

Table 5: Number of translation critical Types.

IT ENTotal Verbs 501 535Verbs w/ r.T. 39 (7.78%) 38 (7.1%)Verbs w/ h.d.T. 109 (21.76%) 125 (23.36%)Critical Verbs 136 (27.15%) 154 (28.79%)

Table 6: Number of verbs with root Types and highdistance Types.

6 Conclusions

In this paper a methodology for measuring the lex-ical gap of action verbs is described and applied toItalian and English, by exploiting IMAGACT on-tology. We measured 33.6% of English gap and29.02% of Italian gap. Then this result have beeninvestigated, in order to discover when and why alexical gap can affect a translation task. The re-sults show that 19.7% of Italian Types and 18.3%of English ones represent action concept that arecritical from a translation perspective: these con-cepts are lexicalized by 27.15% of the Italian verbsand 28.79% of the English verbs that we consid-ered in our analysis. In addition to it the distinctionbetween concepts that can not be correctly trans-lated with a single lemma (root Types) and con-cepts that can be translated with a sensible seman-tic loss (high distance Types) is a relevant informa-tion that can lead to a different translation strategy.

Finally we feel important to note that behindthese numeric values there are lists of verbs andconcepts and this information could be integratedin Machine Translation and Computer AssistedTranslation Systems to improve their accuracy.

Acknowledgments

This research has been supported by the MOD-ELACT Project, funded by the Futuro in Ricerca2012 programme (Project Code RBFR12C608);http://modelact.lablita.it.

ReferencesRoberto Bartolini, Valeria Quochi, Irene De Felice,

Irene Russo, and Monica Monachini. 2014. Fromsynsets to videos: Enriching italwordnet multi-modally. In Nicoletta Calzolari (Conference Chair),Khalid Choukri, Thierry Declerck, Hrafn Lofts-son, Bente Maegaard, Joseph Mariani, AsuncionMoreno, Jan Odijk, and Stelios Piperidis, editors,Proceedings of the Ninth International Conferenceon Language Resources and Evaluation (LREC’14),Reykjavik, Iceland, may. European Language Re-sources Association (ELRA).

Luisa Bentivogli and Emanuele Pianta. 2000. Lookingfor lexical gaps. In Proceedings of the ninth EU-RALEX International Congress, Stuttgart, Germany.

Jurgita Cvilikaite. 2006. Lexical gaps: resolution byfunctionally complete units of translation. Darbai irdienos, 2006, nr. 45, p. 127-142.

Christiane Fellbaum, Martha Palmer, Hoa Trang Dang,Lauren Delfs, and Susanne Wolf. 2001. Manual and

108

automatic semantic annotation with wordnet. Word-Net and Other Lexical Resources, pages 3–10.

Gloria Gagliardi. 2014. Validazione dellontologia del-lazione IMAGACT per lo studio e la diagnosi delMild Cognitive Impairment. Ph.D. thesis, Univer-sity of Florence.

Rufus Hjalmar Gouws. 2002. Equivalent relations,context and cotext in bilingual dictionaries. Hermes,28(1):195–209.

Vladimir Ivir. 1977. Lexical gaps: A contrastiveview. Studia Romanica et Anglica Zagrabiensia,(43):167–176.

M. Jansseen. 2004. Multilingual lexical databases,lexical gaps, and simullda. International Journal ofLexicography, 17(2):137–154.

Gran Kjellmer. 2003. Lexical gaps. In Extendingthe scope of corpus-based research, pages 149–158.Brill.

A. Lehrer. 1974. Semantic Fields and Lexical Struc-ture. North-Holland linguistic series, 11. North-Holland.

John Lyons. 1977. Semantics, volume i. CambridgeUP, Cambridge.

Massimo Moneglia, Francesca Frontini, GloriaGagliardi, Irene Russo, Alessandro Panunzi, andMonica Monachini. 2012. Imagact: deriving anaction ontology from spoken corpora. Proceed-ings of the Eighth Joint ACL-ISO Workshop onInteroperable Semantic Annotation (isa-8), pages42–47.

Massimo Moneglia. 2014a. Natural Language On-tology of Action: A Gap with Huge Consequencesfor Natural Language Understanding and MachineTranslation. In Zygmunt Vetulani and Joseph Mar-iani, editors, Human Language Technology Chal-lenges for Computer Science and Linguistics, vol-ume 8387 of Lecture Notes in Computer Science,pages 379–395. Springer International Publishing.

Massimo Moneglia. 2014b. The semantic variation ofaction verbs in multilingual spontaneous speech cor-pora. In T. Raso and H. Mello, editors, Spoken Cor-pora and Linguistics Studies, pages 152–190. JohnBenjamins Publishing Company.

Martha Palmer, Hoa Trang Dang, and ChristianeFellbaum. 2007. Making fine-grained andcoarse-grained sense distinctions, both manuallyand automatically. Natural Language Engineering,13(02):137–163.

Yi Pan. 2016. Verbi dazione in Italiano e inCinese Mandarino. Implementazione e validazionedel cinese nellontologia interlinguistica dell’azioneIMAGACT. Ph.D. thesis, Universita degli Studi diFirenze.

Emanuele Pianta, Luisa Bentivogli, and Christian Gi-rardi. 2002. MultiWordNet: developing an alignedmultilingual database. In Proceedings of the FirstInternational Conference on Global WordNet, pages21–25.

Sameer S. Pradhan, Edward Loper, Dmitriy Dligach,and Martha Palmer. 2007. Semeval-2007 task 17:English lexical sample, srl and all words. In Pro-ceedings of the 4th International Workshop on Se-mantic Evaluations, pages 87–92. Association forComputational Linguistics.

Diana Santos. 1990. Lexical gaps and idioms inmachine translation. In Proceedings of the 13thConference on Computational Linguistics - Volume2, COLING ’90, pages 330–335, Stroudsburg, PA,USA. Association for Computational Linguistics.

109


Word Sense Filtering Improves Embedding-Based Lexical Substitution

Anne Cocos∗, Marianna Apidianaki∗† and Chris Callison-Burch∗∗ Computer and Information Science Department, University of Pennsylvania

† LIMSI, CNRS, Universite Paris-Saclay, 91403 Orsay{acocos,marapi,ccb}@seas.upenn.edu

Abstract

The role of word sense disambiguation inlexical substitution has been questioneddue to the high performance of vectorspace models which propose good sub-stitutes without explicitly accounting forsense. We show that a filtering mecha-nism based on a sense inventory optimizedfor substitutability can improve the resultsof these models. Our sense inventoryis constructed using a clustering methodwhich generates paraphrase clusters thatare congruent with lexical substitution an-notations in a development set. The re-sults show that lexical substitution can stillbenefit from senses which can improve theoutput of vector space paraphrase rankingmodels.

1 Introduction

Word sense has always been difficult to de-fine and pin down (Kilgarriff, 1997; Erk et al.,2013). Recent successes of embedding-based,sense-agnostic models in various semantic taskscast further doubt on the usefulness of word sense.Why bother to identify senses if even humans can-not agree upon their nature and number, and ifsimple word-embedding models yield good resultswithout using any explicit sense representation?

Word-based models are successful in varioussemantic tasks even though they conflate multipleword meanings into a single representation. Basedon the hypothesis that capturing polysemy couldfurther improve their performance, several workshave focused on creating sense-specific word em-beddings. A common approach is to cluster thecontexts in which the words appear in a corpusto induce senses, and relabel each word tokenwith the clustered sense before learning embed-

dings (Reisinger and Mooney, 2010; Huang et al.,2012). Iacobacci et al. (2015) disambiguate thewords in a corpus using a state-of-the-art WSDsystem and then produce continuous representa-tions of word senses based on distributional infor-mation obtained from the annotated corpus. Mov-ing from word to sense embeddings generally im-proves their performance in word and relationalsimilarity tasks but is not beneficial in all settings.Li and Jurafsky (2015) show that although multi-sense embeddings give improved performance intasks such as semantic similarity, semantic rela-tion identification and part-of-speech tagging, theyfail to help in others, like sentiment analysis andnamed entity extraction (Li and Jurafsky, 2015).

We show how a sense inventory optimized forsubstitutability can improve the rankings providedby two sense-agnostic, vector-based lexical sub-stitution models. Lexical substitution requiressystems to predict substitutes for target word in-stances that preserve their meaning in context(McCarthy and Navigli, 2007). We consider asense inventory with high substitutability to be onewhich groups synonyms or paraphrases that aremutually-interchangeable in the same contexts. Incontrast, sense inventories with low substitutabil-ity might group words linked by different types ofrelations. We carry out experiments with a syntac-tic vector-space model (Thater et al., 2011; Apid-ianaki, 2016) and a word-embedding model forlexical substitution (Melamud et al., 2015). In-stead of using the senses to refine the vector rep-resentations as in (Faruqui et al., 2015), we usethem to improve the lexical substitution rankingsproposed by the models as a post-processing step.Our results show that senses can improve the per-formance of vector-space models in lexical substi-tution tasks.

110

2 A sense inventory for substitution

2.1 Paraphrase substitutability

The candidate substitutes used by our rank-ing models come from the Paraphrase Database(PPDB) XXL package (Ganitkevitch et al.,2013).1 Paraphrase relations in the PPDB aredefined between words and phrases which mightcarry different senses. Cocos and Callison-Burch(2016) used a spectral clustering algorithm tocluster PPDB XXL into senses, but the clusterscontain noisy paraphrases and paraphrases linkedby different types of relations (e.g. hypernyms,antonyms) which are not always substitutable. Weuse a slightly modified version of their method tocluster paraphrases where both the number of clus-ters (senses) and their contents are optimized forsubstitutability.

2.2 A measure of substitutability

We define a substitutability metric that quantifiesthe extent to which a sense inventory aligns withhuman-generated lexical substitution annotations.We then cluster PPDB paraphrases using the sub-stitutability metric to optimize the sense clustersfor substitutability.

Given a sense inventory C, we can define thesenses of a target word t as a set of sense clusters,C(t) = {c1,c2, . . .ck}, where each cluster containswords corresponding to a single sense of t. Intu-itively, if a sense inventory corresponds with sub-stitutability, then each sense cluster ci should havetwo qualities: first, words within ci should be in-terchangeable with t in the same set of contexts;and second, ci should not be missing any wordsthat are interchangeable in those same contexts.We therefore operationalize the definition of sub-stitutability as follows.

We begin measuring substitutability with a lex-ical substitution dataset, consisting of sentenceswhere content words have been manually anno-tated with substitutes (see example in Table 1). Wethen use normalized mutual information (NMI)(Strehl and Ghosh, 2002) to quantify the levelof agreement between the automatically generatedsense clusters and human-suggested substitutes.NMI is an information theoretic measure of clus-ter quality. Given two clusterings U and V over

1PPDB paraphrases come into packages of different sizes(going from S to XXXL): small packages contain high-precision paraphrases while larger ones have high coverage.All are available from paraphrase.org

Sentence Annotated Substitutes (Count)In this world, one’sword is a promise.

vow (1), utterance (1), tongue(1), speech (1)

Silverplate: code wordfor the historic missionthat would end WorldWar II.

phrase (3), term (2), ver-biage(1), utterance (1), signal(1), name (1), dictate (1), des-ignation (1), decree (1)

I think she only heardthe last words of myspeech.

bit (3), verbiage (2), part(2), vocabulary (1), terminol-ogy (1), syllable (1), phrasing(1), phrase (1), patter (1), ex-pression (1), babble (1), anec-dote (1)

Table 1: Example annotated sentences for the tar-get word word.N from the CoInCo (Kremer et al.,2014) lexical substitution dataset. Numbers aftereach word indicate the number of annotators whomade that suggestion.

a set of items, it measures how much each clus-tering reduces uncertainty about the other (Vinhet al., 2009) in terms of their mutual informationI(U,V ) and entropies H(U),H(V ):

NMI(U,V ) =I(U,V )√

H(U)H(V )

To calculate the NMI between a sense inventoryfor target word t and its set of annotated substi-tutes, we first define the substitutes as a clustering,Bt = {b1,b2, . . .bn}, where bi denotes the set ofsuggested substitutes for each of n sentences. Ta-ble 1, for example, gives the clustered substitutesfor n = 3 sentences for target word t = word.N,where b1 = {vow, utterance, tongue, speech}. Wethen define the substitutability of the sense inven-tory, Ct , with respect to the annotated substitutes,Bt , as NMI(Ct ,Bt).2 Given many target words, wecan further aggregate the substitutability of senseinventory C over the set of targets T in B into asingle substitutability score:

substitutabilityB(C) = ∑t∈T

NMI(Ct ,Bt)|T |

2.3 Optimizing for SubstitutabilityHaving defined a substitutability score, we nowautomatically generate word sense clusters fromthe Paraphrase Database that maximize it. Theidea is to use the substitutability score to choosethe best number of senses for each target wordwhich will be the number of output clusters (k)generated by our spectral clustering algorithm.

2In calculating NMI, we ignore words that do not appearin both Ct and Bt .

111

2.4 Spectral clustering method2.4.1 Constructing the Affinity MatrixThe spectral clustering algorithm (Yu and Shi,2003) takes as input an affinity matrix A ∈ Rn×n

encoding n items to be clustered, and an integerk. It generates k non-overlapping clusters of the nitems. Each entry ai j in the affinity matrix denotesa similarity measurement between items i and j.Entries in A must be nonnegative and symmetric.The affinity matrix can also be thought of as de-scribing a graph, where the n rows and columnscorrespond to nodes, and each entry ai j gives theweight of an edge between nodes i and j. Becausethe matrix must be symmetric, the graph is undi-rected.

Given a target word t, we call its set of PPDBparaphrases PP(t). Note that PP(t) excludes t it-self. In our most basic clustering method, we clus-ter paraphrases for target t as follows. Given thelength-n set of t’s paraphrases, PP(t), we constructthe n×n affinity matrix A where each shared-indexrow and column corresponds to some word p ∈PP(t). We set entries equal to the cosine similaritybetween the applicable words’ embeddings, plusone: ai j = cos(vi,v j)+1 (to enforce non-negativesimilarities). For our implementation we use 300-dimensional part-of-speech-specific word embed-dings vi generated using the gensim word2vecpackage (Mikolov et al., 2013b; Mikolov et al.,2013a; Rehurek and Sojka, 2010).3 In Figure 1awe show a set of paraphrases, linked by PPDBrelations, and in Figure 1b we show the corre-sponding basic affinity matrix, encoding the para-phrases’ distributional similarity.

In order to aid further discussion, we point outthat the affinity matrix used for the basic cluster-ing method encodes a fully-connected graph G ={PP(t),EALL

PP } with paraphrases PP(t) as nodes,and edges between every pair of words, EALL

PP =PP(t)×PP(t). As for all variations on the cluster-ing method, the matrix entries correspond to dis-tributional similarity.

2.4.2 MaskingThe affinity matrix in Figure 1b ignores the graphstructure inherent in PPDB, where edges connectonly words that are paraphrases of one another.We experiment with enforcing the PPDB structure

3The word2vec parameters we use are a context win-dow of size 3, learning rate alpha from 0.025 to 0.0001, min-imum word count 100, sampling parameter 1e−4, 10 negativesamples per target word, and 5 training epochs.

in the affinity matrix through a technique we call‘masking.’ By masking, we mean allowing pos-itive values in the affinity matrix only where therow and column correspond to paraphrases thatappear as pairs in PPDB (Figure 1a). All entriescorresponding to paraphrase pairs that are not con-nected in the PPDB graph (Figure 1a) are forced to0.

More concretely, in the masked affinity matrix,we set each entry ai j for which i and j are not para-phrases in PPDB to zero. The masked affinity ma-trix encodes the graph G = {PP(t),EMASK

PP } withedges connecting only pairs of words that are inPPDB, EMASK

PP = {(pi, p j) | pi ∈ P(p j)}. Figure 1cshows the masked affinity matrix corresponding tothe PPDB structure in Figure 1a.

2.4.3 Optimizing kBecause spectral clustering requires the numberof output clusters, k, to be specified as input, foreach target word we run the clustering algorithmfor a range of k between 1 and the minimum of (n,20). We then choose the k that maximizes the NMIof the resulting clusters with the human-annotatedsubstitutes for that target in the development data.

2.5 Method variationsIn addition to using the substitutability score tochoose the best number of senses for each targetword, we also experiment with two variations onthe basic spectral clustering method to increase thescore further: filtering by a paraphrase confidencescore and co-clustering with WordNet (Fellbaum,1998).

2.5.1 PPDB Score ThresholdingEach paraphrase pair in the PPDB is associatedwith a set of scores indicating the strength ofthe paraphrase relationship. The recently addedPPDB2.0 Score (Pavlick et al., 2015) was calcu-lated using a supervised scoring model trained onhuman judgments of paraphrase quality.4 Apidi-anaki (2016) showed that the PPDB2.0 Score itselfis a good metric for ranking substitution candi-dates in context, outperforming some vector spacemodels when the number of candidates is high.With this in mind, we experimented with using aPPDB2.0 Score threshold to discard noisy PPDB

4The human judgments were used to fit a regression to thefeatures available in PPDB 1.0 plus numerous new featuresincluding cosine word embedding similarity, lexical overlapfeatures, WordNet features and distributional similarity fea-tures.

112

p1

p2

p5

p4

p3

p6

p7

(a) PPDB graph with n = 7paraphrases PP(t) to be clus-tered.

p1 p2 p3 p4 p5 p6 p7p1p2p3p4p5p6p7

(b) Unmasked affinity matrix for input to ba-sic clustering algorithm, for paraphrases inFig 1a. This matrix encodes a fully-connectedgraph G = {PP(t),EALL

PP }.

p1 p2 p3 p4 p5 p6 p7p1p2p3p4p5p6p7

(c) Masked affinity matrix for input to clus-tering algorithm, enforcing the paraphraselinks from the graph in Fig 1a, G ={PP(t),EMASK

PP }.

Figure 1: Unclustered PPDB graph and its corresponding affinity matrices, encoding distributional sim-ilarity, for input to the basic (1b) and masked (1c) clustering algorithms. Masking zeros-out the valuesof all cells corresponding to paraphrases not connected in the PPDB graph. Cell shading corresponds tothe distributional similarity score between words, with darker colors representing higher measurements.

XXL paraphrases prior to sense clustering. Ourobjective was to begin the clustering process witha clean set of paraphrases for each target word,eliminating erroneous paraphrases that might pol-lute the substitutable sense clusters. We imple-mented PPDB score thresholds in a range from 0to 2.5.

2.5.2 Co-Clustering with WordNetPPDB is large and inherently noisy. WordNethas smaller coverage but well-defined semanticstructure in the form of synsets and relations.We sought a way to marry the high coverage ofPPDB with the clean structure of WordNet by co-clustering the two resources, in hopes of creatinga sense inventory that is both highly-substitutableand high-coverage.

The basic unit in WordNet is the synset, a setof lemmas sharing the same meaning. WordNetalso connects synsets via relations, such as hyper-nymy, hyponymy, entailment, and ‘similar-to’. Wedenote as L(s) the set of lemmas associated withsynset s. We denote as R(s) the set of synsetsthat are related to synset s with a hypernym, hy-ponym, entailment, or similar-to relationship. Fi-nally, we denote as S(t) the set of synsets to whicha word t belongs. We denote as S+(t) the set of t’ssynsets, plus all synsets to which they are related;S+(t) = S(t)∪⋃

s′∈S(t) S(s′). In other words, S+(t)includes all synsets to which t is connected by a

path of length at most 2 via one of the relationsencoded in R(s).

For co-clustering, we generate the affinity ma-trix for a graph with m+n nodes corresponding tothe n words in PP(t) and the m synsets in S+(t),and edges between every pair of nodes. Becausethe edge weights are cosine similarity betweenvector embeddings, we need a way to construct anembedding for each synset in S+(t).5 We there-fore generate compositional embeddings for eachsynset s that are equal to the weighted average ofthe embeddings for the lemmas l ∈ L(s), where theweights are the PPDB2.0 scores between t and l:

vs =∑l∈L(s) PPDB2.0Score(t, l)× vl

∑l∈L(s) PPDB2.0Score(t, l)

The unmasked affinity matrix used for input tothe co-clustering method, then, encodes the graphG = {PP(t)∪ S+(t),EALL

PP ∪EALLPS ∪EALL

SS }, whereEALL

PS contains edges between every paraphraseand synset, and EALL

SS contains edges between ev-ery pair of synsets.

We also define masked versions of the co-clustering affinity matrix. In a masked affinitymatrix, positive entries are only allowed for en-tries where the row and column correspond to enti-

5We don’t use the NASARI embeddings (Camacho-Collados et al., 2015) because these are available only fornouns.

113

p1

p2

p5

p4

p3

p6

p7

s1 s2

s3s4

(a) Graph showing n = 7 PPDBparaphrases PP(t) and m = 4WordNet synsets S+(t) to beclustered.

p1 p2 p3 p4 p5 p6 p7 s1 s2 s3 s4p1p2p3p4p5p6p7

s1s2s3s4

(b) Unmasked affinity matrix for input to ba-sic clustering algorithm, for paraphrases andsynsets in Fig 2a. This matrix encodes a fully-connected graph G = {PP(t)∪S+(t),EALL

PP ∪EALL

PS ∪EALLSS }.

p1 p2 p3 p4 p5 p6 p7 s1 s2 s3 s4p1p2p3p4p5p6p7

s1s2s3s4

(c) Affinity matrix for input to cluster-ing algorithm, enforcing the paraphrase-paraphrase (EMASK

PP ) and paraphrase-synset(EMASK

PS ) links from the graph in 2a, but al-lowing all synset-synset links (EALL

PP ).

Figure 2: Paraphrase/synset graph for input to the co-clustering model, and its corresponding affinitymatrices for the basic (2b) and masked (2c) clustering algorithms. Cell shading corresponds to thedistributional similarity score between words/synsets.

ties (paraphrases or synsets) that are connected bythe underlying knowledge base (PPDB or Word-Net). Just as we defined masking for paraphrase-paraphrase links (EPP) to allow only positive val-ues corresponding to paraphrase pairs found inPPDB, here we separately define masking forparaphrase-synset (EPS) and synset-synset (ESS)based on WordNet synsets and relations. Whenapplying the clustering algorithm, it is possible toelect to use the masked version for any or all ofEPP, EPS, and ESS. In our experiments we try allcombinations.

For the synset-synset links, we define themasked version EMASK

SS as including only nonzeroedge weights where a hypernym, hyponym, en-tailment or similar-to relationship connects twosynsets: EMASK

SS = {(su,sv) | su ∈ R(sv) or sv ∈R(su)}. For the paraphrase-synset links, we definethe masked version EMASK

PS to include only nonzeroedge weights where the paraphrase is a lemmain the synset, or is a paraphrase of a lemma inthe synset (excluding the target word): EMASK

PS ={(pi,su) | pi ∈ L(su) or |(P(pi)− t)∩ L(su)|> 0}.We need to exclude the target word when calcu-lating the overlap because otherwise all words inPP(t) would connect to all synsets in S(t). Figure2 depicts the graph, unnmasked and masked affin-ity matrices for the co-clustering method.

2.6 Clustering Experiments

2.6.1 Datasets

We run clustering experiments using targets andhuman-generated substitution data drawn fromtwo lexical substitution datasets. The first is the“Concepts in Context” (CoInCo) corpus (Kremeret al., 2014), containing over 15K sentences cor-responding to nearly 4K unique target words. Wedivide the CoInCo dataset into development andtest sets by first finding all target words that haveat least 10 sentences. For each of the 327 result-ing targets, we randomly divide the correspond-ing sentences into 60% development instances and40% test instances. The resulting developmentand test sets have 4061 and 2091 sentences respec-tively. We cluster the 327 target words in the re-sulting subset of CoInCo, performing all optimiza-tion using the development portion.

In order to evaluate how well our method gen-eralizes to other data, we also create clusters fortarget words drawn from the SemEval 2007 En-glish Lexical Substitution shared task dataset (Mc-Carthy and Navigli, 2007). The entire test por-tion of the SemEval dataset contains 1700 an-notated sentences for 170 target words. We fil-ter this data to keep only sentences with one ormore human-annotated substitutes that overlap ourPPDB XXL paraphrase vocabulary. The result-ing test set, which we use for evaluating SemEval

114

targets, has 1178 sentences and 157 target words.We cluster each of the 157 targets, using the Co-InCo development data to optimize substitutabil-ity for the 32 SemEval targets that also appear inCoInCo. For the rest of the SemEval targets wechoose a number of senses equal to its WordNetsynset count.

2.6.2 Clustering Method VariationsWe try all combinations of the following parame-ters in our clustering model:

• Clustering method: We try the basic cluster-ing – clustering Paraphrases Only – and theWordNet Co-Clustering method.

• PPDB2.0 Score Threshold: We cluster para-phrases of each target having a PPDB2.0Score above a threshold, ranging from 0-3.0.

• Masking: When clustering paraphrases only,we can either use the PP-PP mask or al-low positive similarities between all words.When co-clustering, we try all combina-tions of the PP-PP, PP-SYN, and SYN-SYNmasks.

For each combination, we evaluate the NMIsubstitutability of the resulting sense inventoryover our CoInCo and SemEval test instances. Thesubstitutability results are given in Tables 2 and 3.

3 Filtering Substitutes

3.1 A WSD oracle

We now question whether it is possible to improvethe rankings of current state-of-the-art lexical sub-stitution systems by using the optimized sense in-ventory as a filter. Our general approach is to takea set of ranked substitutes generated by a vector-based model. Then, we see whether filtering theranked substitutes to bring words belonging to thecorrect sense of the target to the top of the rankingswould improve the overall ranking results.

Assuming that we have a WSD oracle that isable to choose the most appropriate sense for atarget word in context, this corresponds to nomi-nating substitutes from the applicable sense clusterand elevating them in the list of ranked substitutesoutput by the state-of-the-art lexical substitutionsystem. If sense filtering successfully improvesthe quality of ranked substitutes, we can say thatthe sense inventory captures substitutability well.

3.2 Ranking ModelsOur approach requires a set of rankings producedby a high-quality lexical substitution model tostart. We generate substitution rankings for eachtarget/sentence pair in the test sets using a syntac-tic vector-space model (Thater et al., 2011; Apidi-anaki, 2016) and a state-of-the-art model based onword embeddings (Melamud et al., 2015).

The syntactic vector space model of Apidianaki(2016) (Syn.VSM) demonstrated an ability to cor-rectly choose appropriate PPDB paraphrases for atarget word in context. The vector features cor-respond to syntactic dependency triples extractedfrom the English Gigaword corpus 6 analyzed withStanford dependencies (Marneffe et al., 2006).Syn.VSM produces a score for each (target, sen-tence, substitute) tuple based on the cosine sim-ilarity of the substitute’s basic vector representa-tion with the target’s contextualized vector (Thateret al., 2011). The contextualized vector is derivedfrom the basic meaning vector of the target wordby reinforcing its dimensions that are licensed bythe context of the specific instance under consider-ation. More specifically, the contextualized vectorof a target is obtained through vector addition andcontains information about the target’s direct syn-tactic dependents.

The second set of rankings comes from the Ad-dCos model of Melamud et al. (2015). AddCosquantifies the fit of substitute word s for targetword t in context C by measuring the semanticsimilarity of the substitute to the target, and thesimilarity of the substitute to the context:

AddCos(s, t,W ) =|W |·cos(s, t) + ∑w∈W cos(s,w)

2 · |W |(1)

The vectors s and t are word embeddings of thesubstitute and target generated by the skip-gramwith negative sampling model (Mikolov et al.,2013b; Mikolov et al., 2013a). The context Wis the set of words appearing within a fixed-widthwindow of the target t in a sentence (we use a win-dow (cwin) of 1), and the embeddings c are con-text embeddings generated by skip-gram. In ourimplementation, we train 300-dimensional wordand context embeddings over the 4B words in theAnnotated Gigaword (AGiga) corpus (Napoles etal., 2012) using the gensim word2vec package

6http://catalog.ldc.upenn.edu/LDC2003T05

115

(Mikolov et al., 2013b; Mikolov et al., 2013a;Rehurek and Sojka, 2010).7

3.3 Substitution metrics

Lexical substitution experiments are usually eval-uated using generalized average precision (GAP)(Kishida, 2005). GAP compares a set of predictedrankings to a set of gold standard rankings. Scoresrange from 0 to 1; a perfect ranking, in which allhigh-scoring substitutes outrank low-scoring sub-stitutes, has a score of 1. For each sentence in theCoInCo and SemEval test sets, we consider thePPDB paraphrases for the target word to be thecandidates, and we set the test set annotator fre-quency to be the gold score. Words in PPDB thatwere not suggested by annotators receive a goldscore of 0.001. Predicted scores are given by thetwo ranking models, Syn.VSM and AddCos.

3.4 Filtering Method

Sense filtering is intended to boost the rank of sub-stitutes that belong to the most appropriate senseof the target given the context. We run this exper-iment as a two-step process.

First, given a target and sentence, we obtain thePPDB XXL paraphrases for the target word andrank them using the Syn.VSM and the AddCosmodels.8 We calculate the overall unfiltered GAPscore on the test set for each ranking model as theaverage GAP over sentences.

Next, we evaluate the ability of a sense inven-tory to improve the GAP score through filtering.We implement sense filtering by adding a largenumber (10000) to the ranking model’s score forwords belonging to a single sense. We assumean oracle that finds the cluster which maximallyimproves the GAP score using this sense filteringmethod. If the sense inventory corresponds wellto substitutability, we should expect this filteringto improve the ranking by downgrading proposedsubstitutes that do not fall within the correct sensecluster.

We calculate the maximum sense-restrictedGAP score for the inventories produced by eachvariation on our clustering model, and compare

7The word2vec training parameters we use are a contextwindow of size 3, learning rate alpha from 0.025 to 0.0001,minimum word count 100, sampling parameter 1e−4, 10 neg-ative samples per target word, and 5 training epochs.

8For the SemEval test set, we rank PPDB XXL para-phrases having a PPDB2.0 Score with the target of at least2.54.

this to the unfiltered GAP score for each rankingmodel.

3.5 BaselinesWe compare the extent to which our optimizedsense inventories improve lexical substitutionrankings to the results of two baseline sense in-ventories.

• WordNet+: a sense inventory formed fromWordNet 3.0. For each CoInCo target wordthat appears in WordNet, we take its senseclusters to be its synsets, plus lemmas be-longing to hypernyms and hyponyms of eachsynset.

• PPDBClus: a much larger, abeit noisier,sense inventory obtained by automaticallyclustering words in the PPDB XXL package.To obtain this sense inventory we clusteredparaphrases for all targets in the CoInCodataset using the method outlined in Cocosand Callison-Burch (2016), with PPDB2.0Score serving as the similarity metric.

We assess the substitutability of these sensebaseline inventories with respect to the human-annotated substitutes in the CoInCo and SemEvaldatasets, and also use them for sense filtering.

Finally, we wish to estimate the impact ofthe NMI-based optimization procedure (Section2.4.3) on the quality of the senses used for fil-tering. We compare the performance of the opti-mized CoInCo sense inventory, where the numberof clusters, k, for a target word is defined throughNMI optimization (called ‘Choose-K: OptimizeNMI’), to an inventory induced from CoInCowhere k equals the number of synsets availablefor the target word in WordNet (called ‘Choose-K: #WN Synsets).

4 Results

We report substitutability, and the unfiltered andbest sense-filtered GAP scores achieved using theparaphrase-only clustering method and the co-clustering method in Tables 2 and 3.

The average unfiltered GAP scores for theSyn.VSM rankings over the CoInCo and SemEvaltest sets are 0.528 and 0.673 respectively.9 All

9The score reported by Apidianaki (2016) for theSyn.VSM model with XXL PPDB paraphrases on CoInCowas 0.56. The difference in scores is due to excluding fromour clustering experiments target words that did not have atleast 10 sentences in CoInCo.

116

Syn.VSM AddCos (cwin=1)substCoInCo Unfiltered GAP Oracle GAP Unfiltered GAP Oracle GAP

PPDBClus 0.254

0.528

0.661

0.533

0.656WordNet 0.252 0.655 0.651

Choose-K: # WN Synsets (avg) 0.205 0.639 0.636Choose-K: # WN Synsets (max, no co-clustering) 0.250* 0.695* 0.690*

Choose-K: # WN Synsets (max, co-clustering) 0.241** 0.690** 0.683**Choose-K: Optimize NMI (avg) 0.282 0.668 0.662

Choose-K: Optimize NMI (max, no co-clustering) 0.331* 0.719 * 0.714 ***Choose K: Optimize NMI (max, co-clustering) 0.314** 0.718 **** 0.710 **

Table 2: Substitutablity (NMI) of resulting sense inventories, and GAP scores of the unfiltered and best sense-filtered rankingsproduced by the Syn.VSM and AddCos models, for the CoInCo annotated dataset. Configurations for the best-performing senseinventories were: * Min PPDB Score 2.0, cluster PP’s only, use PP-PP mask; ** Min PPDB Score 2.0, co-clustering, use PP-PPmask only; *** Min PPDB Score 1.5, cluster PP’s only, use PP-PP mask; **** Min PPDB Score 2.0, co-clustering, use PP-PP,Syn-Syn masks only

Syn.VSM AddCos (cwin=1)substSemEval Unfiltered GAP Oracle GAP Unfiltered GAP Oracle GAP

PPDBClus 0.357

0.673

0.855

0.410

0.634WordNet 0.291 0.774 0.595

Average of all Clustered Sense Inventories 0.367 0.841 0.569Max basic (no co-clustering) sense inventory 0.448* 0.917* 0.626*

Max co-clustered sense inventory 0.449** 0.906*** 0.612****

Table 3: Substitutablity (NMI) of resulting sense inventories, and GAP scores of the unfiltered and best sense-filtered rankingsproduced by the Syn.VSM and AddCos models, for the SemEval07 annotated dataset. Configurations for the best-performingsense inventories were: * Min PPDB Score 2.31, cluster PP’s only, use PP-PP mask; ** Min PPDB Score 2.54, co-clustering,use PP-SYN mask only; *** Min PPDB Score 2.54, co-clustering, use PP-SYN mask only; **** Min PPDB Score 2.31,co-clustering, use PP-SYN mask only.

baseline and cluster sense inventories are capableof improving these GAP scores when we use thebest sense as a filter. Syntactic models generallygive very good results with small paraphrase sets(Kremer et al., 2014) but their performance seemsto degrade when they need to deal with larger andnoisier substitute sets (Apidianaki, 2016). Ourresults suggest that finding the most appropriatesense of a target word in context can improve theirlexical substitution results.

The trend in results is similar for the AddCosrankings. The average unfiltered GAP scores forthe AddCos rankings over the CoInCo and Se-mEval test sets are 0.533 and 0.410 respectively.The GAP scores of the unfiltered AddCos rankingsare much lower than after filtering with any base-line or cluster sense inventory, showing that lexicalsubsitutition rankings based on word-embeddingscan also be improved using senses.

To assess the impact of the NMI-based opti-mization procedure on the results, we compare theperformance of two sense inventories on the Co-InCo rankings: one where the number of clusters(k) for a target word is defined through NMI op-timization and another one, where k is equal tothe number of synsets available for the target word

in WordNet. We find that for both the Syn.VSMand AddCos ranking models, filtering using thesense inventory with the NMI-optimized k outper-forms the results obtained when the inventory withk equal to the number of synsets is used.

Furthermore, we find that the NMI substi-tutability score is a generally good indicator ofhow much improvement we see in GAP score dueto oracle sense filtering. We calculated the Pear-son correlation of a sense inventory’s NMI withits oracle GAP score to be 0.644 (calculated overall target words in the CoInCo test set, includingGAP results for both the Syn.VSM and AddCosranking models). This suggests that NMI is a rea-sonable measure of substitutability.

We find that for all methods, applying a PPDB-Score Threshold prior to clustering is an effectiveway of removing noisy, non-substitutable para-phrases from the sense inventory. When we usethe resulting sense inventory for filtering, this ef-fectively elevates only high-quality paraphrases inthe lexical substitution rankings. This supports thefinding of Apidianaki (2016), who showed that thePPDB2.0 Score itself is an effective lexical substi-tution ranking metric when large and noisy para-phrase substitute sets are involved.

117

Finally, we discover that co-clustering withWordNet did not produce any significant improve-ment in NMI or GAP score over clustering para-phrases alone. This could suggest that the addedstructure from WordNet did not improve over-all substitutability of the resulting sense invento-ries, or that our co-clustering method did not ef-fectively incorporate useful structural informationfrom WordNet.

5 Conclusion

We have shown that despite the high performanceof word-based vector-space models, lexical sub-stitution can still benefit from word senses. Wehave defined a substitutability metric and proposeda method for automatically creating sense invento-ries optimized for substitutability. The number ofsense clusters in an optimized inventory and theircontents are aligned with lexical substitution an-notations in a development set. Using the bestfitting cluster in each context as a filter over therankings produced by vector-space models boostsgood substitutes and improves the models’ scoresin a lexical substitution task.

For choosing the cluster that best fits a con-text, we used an oracle experiment which findsthe maximum GAP score achievable by a sense byboosting the ranking model’s score for words be-longing to a single sense. The cluster that achievedthe highest GAP score in each case was selected.The task of finding the most appropriate sense incontext still remains. But the improvement in lex-ical substitution results shows that word sense in-duction and disambiguation can still benefit state-of-the-art word-based models for lexical substitu-tion.

Our sense filtering mechanism can be applied tothe output of any vector-space substitution modelat a post-processing step. In future work, we in-tend to experiment with models that account forsenses during embedding learning. The models ofHuang et al. (2012) and Li and Jurafsky (2015)learn multi-prototype, or sense-specific, embed-ding representations and are able to choose thebest-fitted ones for words in context. These mod-els have up to now been tested in several NLPtasks but have not yet been applied to lexical sub-stitution. We will experiment with using the em-beddings chosen by these models for specific wordinstances for ranking candidate substitutes in con-text. The comparison with the results presented in

this paper will show whether it is preferable to ac-count for senses before or after actual lexical sub-stitution.

Acknowledgments

This material is based in part on research spon-sored by DARPA under grant number FA8750-13-2-0017 (the DEFT program). The U.S. Gov-ernment is authorized to reproduce and distributereprints for Governmental purposes. The viewsand conclusions contained in this publication arethose of the authors and should not be interpretedas representing official policies or endorsementsof DARPA and the U.S. Government.

This work has also been supported by theFrench National Research Agency under projectANR-16-CE33-0013.

We would like to thank our anonymous review-ers for their thoughtful and helpful comments.

ReferencesMarianna Apidianaki. 2016. Vector-space models for

PPDB paraphrase ranking in context. In Proceed-ings of EMNLP, pages 2028–2034, Austin, Texas.

Jose Camacho-Collados, Mohammad Taher Pilehvar,and Roberto Navigli. 2015. NASARI: a Novel Ap-proach to a Semantically-Aware Representation ofItems. In Proceedings of NAACL/HLT 2015, pages567–577, Denver, Colorado.

Anne Cocos and Chris Callison-Burch. 2016. Clus-tering paraphrases by word sense. In Proceedingsof NAACL/HLT 2016, pages 1463–1472, San Diego,California.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord.2013. Measuring word meaning in context. Com-putational Linguistics, 39(3):511–554, 9.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar,Chris Dyer, Eduard Hovy, and Noah A. Smith.2015. Retrofitting Word Vectors to Semantic Lex-icons. In Proceedings of NAACL/HLT, Denver, Col-orado.

Christiane Fellbaum, editor. 1998. WordNet: an elec-tronic lexical database. MIT Press.

Juri Ganitkevitch, Benjamin VanDurme, and ChrisCallison-Burch. 2013. PPDB: The ParaphraseDatabase. In Proceedings of NAACL, Atlanta, Geor-gia.

Eric Huang, Richard Socher, Christopher Manning,and Andrew Ng. 2012. Improving Word Represen-tations via Global Context and Multiple Word Pro-totypes. In Proceedings of the ACL (Volume 1: LongPapers), pages 873–882, Jeju Island, Korea.

118

Ignacio Iacobacci, Mohammad Taher Pilehvar, andRoberto Navigli. 2015. SensEmbed: LearningSense Embeddings for Word and Relational Simi-larity. In Proceedings of the ACL/IJCNLP, pages95–105, Beijing, China.

Adam Kilgarriff. 1997. I don’t believe in word senses.Computers and the Humanities, 31(2):91–113.

Kazuaki Kishida. 2005. Property of average precisionand its generalization: An examination of evalua-tion indicator for information retrieval experiments.National Institute of Informatics Tokyo, Japan.

Gerhard Kremer, Katrin Erk, Sebastian Pado, and Ste-fan Thater. 2014. What Substitutes Tell Us - Anal-ysis of an “All-Words” Lexical Substitution Corpus.In Proceedings of EACL, pages 540–549, Gothen-burg, Sweden.

Jiwei Li and Dan Jurafsky. 2015. Do Multi-Sense Em-beddings Improve Natural Language Understand-ing? In Proceedings of the EMNLP, pages 1722–1732, Lisbon, Portugal.

M. Marneffe, B. Maccartney, and C. Manning. 2006.Generating Typed Dependency Parses from PhraseStructure Parses. In Proceedings of LREC-2006,Genoa, Italy.

Diana McCarthy and Roberto Navigli. 2007.SemEval-2007 Task 10: English Lexical Substitu-tion Task. In Proceedings of SemEval, pages 48–53,Prague, Czech Republic.

Oren Melamud, Omer Levy, and Ido Dagan. 2015. ASimple Word Embedding Model for Lexical Substi-tution. In Proceedings of the 1st Workshop on VectorSpace Modeling for Natural Language Processing,pages 1–7, Denver, Colorado.


Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013b. Distributed represen-tations of words and phrases and their composition-ality. In Advances in neural information processingsystems, pages 3111–3119.

Courtney Napoles, Matthew Gormley, and BenjaminVan Durme. 2012. Annotated Gigaword. In Pro-ceedings of the Joint Workshop on Automatic Knowl-edge Base Construction and Web-scale KnowledgeExtraction (AKBC-WEKEX), pages 95–100, Mon-treal, Canada.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch,Benjamin Van Durme, and Chris Callison-Burch.2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, andstyle classification. In Proceedings of ACL/IJCNLP,pages 425–430, Beijing, China.

Radim Rehurek and Petr Sojka. 2010. SoftwareFramework for Topic Modelling with Large Cor-pora. In Proceedings of the LREC 2010 Workshopon New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta.

Joseph Reisinger and Raymond J. Mooney. 2010.Multi-Prototype Vector-Space Models of WordMeaning. In Proceedings of HLT/NAACL, pages109–117, Los Angeles, California.

Alexander Strehl and Joydeep Ghosh. 2002. Clus-ter ensembles—a knowledge reuse framework forcombining multiple partitions. Journal of machinelearning research, 3(Dec):583–617.

Stefan Thater, Hagen Furstenau, and Manfred Pinkal.2011. Word Meaning in Context: A Simple and Ef-fective Vector Model. In Proceedings of IJCNLP,pages 1134–1143, Chiang Mai, Thailand.

Nguyen Xuan Vinh, Julien Epps, and James Bailey.2009. Information theoretic measures for cluster-ings comparison: is a correction for chance neces-sary? In Proceedings of the 26th Annual Inter-national Conference on Machine Learning, pages1073–1080. ACM.

Stella X. Yu and Jianbo Shi. 2003. Multiclass spec-tral clustering. In Proceedings of International Con-ference on Computer Vision (ICCV 03), pages 313–319. IEEE.

119


Supervised and Unsupervised Word Sense Disambiguation on WordEmbedding Vectors of Unambiguous Synonyms

Aleksander WawerInstitute of Computer Science

PASJana Kazimierza 5

01-248 Warsaw, [email protected]

Agnieszka MykowieckaInstitute of Computer Science

PASJana Kazimierza 5

01-248 Warsaw, [email protected]

Abstract

This paper compares two approaches toword sense disambiguation using wordembeddings trained on unambiguous syn-onyms. The first one is an unsupervisedmethod based on computing log proba-bility from sequences of word embeddingvectors, taking into account ambiguousword senses and guessing correct sensefrom context. The second method is super-vised. We use a multilayer neural networkmodel to learn a context-sensitive transfor-mation that maps an input vector of am-biguous word into an output vector repre-senting its sense. We evaluate both meth-ods on corpora with manual annotations ofword senses from the Polish wordnet.

1 Introduction

Ambiguity is one of the fundamental features ofnatural language, so every attempt to understandNL utterances has to include a disambiguationstep. People usually do not even notice ambi-guity because of the clarifying role of the con-text. A word market is ambiguous, and it isstill such in the phrase the fish market while ina longer phrase like the global fish market it isunequivocal because of the word global, whichcannot be used to describe physical place. Thus,distributional semantics methods seem to be anatural way to solve the word sense discrimina-tion/disambiguation task (WSD). One of the firstapproaches to WSD was context-group sense dis-crimination (Schütze, 1998) in which sense rep-resentations were computed as groups of simi-lar contexts. Since then, distributional semanticmethods were utilized in very many ways in su-pervised, weekly supervised and unsupervised ap-proaches.

Unsupervised WSD algorithms aim at resolv-ing word ambiguity without the use of annotatedcorpora. There are two popular categories ofknowledge-based algorithms. The first one orig-inates from the Lesk (1986) algorithm, and ex-ploit the number of common words in two sensedefinitions (glosses) to select the proper meaningin a context. Lesk algorithm relies on the set ofdictionary entries and the information about thecontext in which the word occurs. In (Basile etal., 2014) the concept of overlap is replaced bysimilarity represented by a DSM model. The au-thors compute the overlap between the gloss ofthe meaning and the context as a similarity mea-sure between their corresponding vector represen-tations in a semantic space. A semantic space isa co-occurrences matrix M build by analysing thedistribution of words in a large corpus, later re-duced using Latent Semantic Analysis (Landauerand Dumais, 1997). The second group of algo-rithms comprises graph-based methods which usestructure of semantic nets in which different typesof word sense relations are represented and linked(e.g. WordNet, BabelNet). They used variousgraph-induced information, e.g. Page Rank algo-rithm (Mihalcea et al., 2004).

In this paper we present a method of word sensedisambiguation, i.e. inferring an appropriate wordsense from those listed in Polish wordnet, usingword embeddings in both supervised and unsuper-vised approaches. The main tested idea is to cal-culate sense embeddings using unambiguous syn-onyms (elements of the same synsets) for a par-ticular word sense. In section 2 we shortly presentexisting results for WSD for Polish as well as otherworks related to word embeddings for other lan-guages, while section 3 presents annotated datawe use for evaluation and supervised model train-ing. Next sections describe the chosen method ofcalculating word sense embeddings, our unsuper-

120

vised and supervised WSD experiments and somecomments on the results.

2 Existing Work

2.1 Polish WSD

There was very little research done in WSD forPolish. The first one from the few more vis-ible attempts comprise a small supervised ex-periment with WSD in which machine learn-ing techniques and a set of a priori defined fea-tures were used, (Kobylinski, 2012). Next, in(Kobylinski and Kopec, 2012), extended Leskknowledge-based approach and corpus-based sim-ilarity functions were used to improve previousresults. These experiments were conducted onthe corpora annotated with the specially designedset of senses. The first one contained generaltexts with 106 polysemous words manually anno-tated with 2.85 sense definitions per word on av-erage. The second, smaller, WikiEcono corpus(http://zil.ipipan.waw.pl/plWikiEcono) was anno-tated by another set of senses for 52 polysemouswords. It contains 3.62 sense definitions per wordon average. The most recent work on WSD forPolish (Kedzia et al., 2015) utilizes graph-basedapproaches of (Mihalcea et al., 2004) and (Agirreet al., 2014). This method uses both plWordnetand SUMO ontology and was tested on KPWr dataset (Broda et al., 2012) annotated with plWord-net senses — the same data set which we use inour experiments. The highest precision of 0.58was achieved for nouns. The results obtained bydifferent WSD approaches are very hard to com-pare because of different set of senses and testdata used and big differences in results obtainedby the same system on different data. (Tripodiand Pelillo, 2017) reports the results obtained bythe best systems for English at the level of 0.51-0.85% depending on the approach (supervised orunsupervised) and the data set. The only systemfor Polish to which to some extend we can com-pare our approach is (Kedzia et al., 2015).

2.2 WSD and Word Embeddings

The problem of WSD has been approached fromvarious perspectives in the context of word embed-dings.

Popular approach is to generate multiple em-beddings per word type, often using unsupervisedautomatic methods. For example, (Reisinger andMooney, 2010; Huang et al., 2012) cluster con-

texts of each word to learn senses for each word,then re-label them with clustered sense for learn-ing embeddings. (Neelakantan et al., 2014) intro-duce flexible number of senses: they extend sensecluster list when a new sense is encountered by amodel.

(Iacobacci et al., 2015) use an existing WSDalgorithm to automatically generate large sense-annotated corpora to train sense-level embeddings.(Taghipou and Ng, 2015) prepare POS-specificembeddings by applying a neural network withtrainable embedding layer. They use those embed-dings to extend feature space of a supervised WSDtool named IMS.

In (Bhingardive et al., 2015), the authors pro-pose to exploit word embeddings in an unsuper-vised method for most frequent sense detectionfrom the untagged corpora. Like in our work, thepaper explores creation of sense embeddings withthe use of WordNet. As the authors put it, senseembeddings are obtained by taking the average ofword embeddings of each word in the sense-bag.The sense-bag for each sense of a word is obtainedby extracting the context words from the WordNetsuch as synset members (S), content words in thegloss (G), content words in the example sentence(E), synset members of the hypernymy-hyponymysynsets (HS), and so on.

3 Word-Sense Annotated Treebank

The main obstacle in elaborating WSD method forPolish is lack of semantically annotated resourceswhich can be applied for training and evaluation.In our experiment we used an existing one whichuse wordnet senses – semantic annotation (Ha-jnicz, 2014) of Składnica (Wolinski et al., 2011).The set is a rather small but carefully prepared re-source and contains constituency parse trees forPolish sentences. The adapted version of Skład-nica (0.5) contains 8241 manually validated trees.Sentence tokens are annotated with fine-grainedsemantic types represented by Polish wordnetsynsets from plWordnet 2.0 plWordnet, Piasecki etal., 2009, http://plwordnet.pwr.wroc.pl/wordnet/).The set contains lexical units of three open partsof speech: adjectives, nouns and verbs. Therefore,only tokens belonging to these POS are annotated(as well as abbreviations and acronyms). Skład-nica contains about 50K nouns, verbs and adjec-tives for annotation, and 17410 of them belong-ing to 2785 (34%) sentences has been already an-

121

notated. For 2072 tokens (12%), the lexical unitappropriate in the context has not been found inplWordnet.

4 Obtaining Sense Embeddings

In this section we describe the method of obtainingsense-level word embeddings. Unlike most of theapproaches described in Section 2.2, our methodis applied to manually sense-labeled corpora.

In Wordnet, words either occur in multiplesynsets (are therefore ambiguous and subject ofWSD), or in one synset (are unambiguous). Ourapproach is to focus on synsets that contain bothambiguous and unambiguous words. In Skad-nica 2.0 (Polish WordNet) we found 28766 synsetsmatching these criteria and therefore potentiallysuitable for our experiments.

Let us consider a synset containing followingwords: ‘blemish’, ‘deface’, ‘disfigure’. Word‘blemish’ appears also in other synsets (is ambigu-ous) while words ‘deface’ and ‘disfigure’ are spe-cific for this synset and do not appear in any othersynset (are unambiguous).

We assume that embeddings specific to a senseor synset can be approximated by unambiguouspart of the synset. While some researchers suchas (Bhingardive et al., 2015) take average em-beddings of all synset-specific words, even us-ing glosses and hyperonymy, we use unambiguouswords to generate word2vec embedding vector ofa sense.

During training, each occurrence of unambigu-ous word in corpus is substituted for a synsetidentifier. As in the provided example, each oc-currence of ‘deface’ and ‘disfigure’ would be re-placed by its sense identifier, the same for bothunambiguous words. We’ll later use these sensevectors to distinguish between senses of ambigu-ous ‘blemish’ given their contexts.

We train word2vec vectors using substitutionmechanism described above on a dump of all Pol-ish language Wikipedia and 300-million subset ofthe National Corpus of Polish (Przepiórkowski etal., 2012). The embedding size is set to 100, allother word2vec parameters have the default valueas in (Rehurek and Sojka, 2010). The model isbased on lemmatized (base word forms) so onlythe occurrences of forms with identical lemmasare taken into account.

5 Unsupervised Word Sense Recognition

In this section we are proposing a simple unsuper-vised approach to WSD. The key idea is to useword embeddings in probabilistic interpretationand application comparable to language model-ing, however without building any additional mod-els or parameter-rich systems. The method is de-rived from (Taddy, 2015), where it was used witha bayesian classifier and vector embedding inver-sion to classify documents.

(Mikolov et al., 2013) describe two alterna-tive methods of generating word embeddings: theskip-gram, which represents conditional probabil-ity for a word’s context (surrounding words) andCBOW, which targets the conditional probabilityfor each word given its context. None of thesecorresponds to a likelihood model, but as (Taddy,2015) note they can be interpreted as componentsin a composite likelihood approximation. Let w= [w1. . . wT ] denote an ordered vector of words.The skip-gram in (Mikolov et al., 2013) yields thepairwise composite log likelihood:

logpV(w) =T∑

j=1

T∑i=1

1[1≤|k−j|≤b]logpV(wk|wj)

(1)We use the above formula to compute probabil-

ity of a sentence. Unambiguous words are repre-sented as their word2vec representations deriveddirectly from corpus. In case of ambiguous words,we substitute them for each possible sense vector(generated from unambiguous parts of synsets, ashas been previously described). Therefore, for anambiguous word to be disambiguated, we gener-ate as many variants of a sentence as there are itssenses, and compute each variant’s likelihood us-ing formula 1. Ambiguous words which occur inthe context are omitted (although we might alsoreplace them with an averaged vector representingall their meanings). Finally, we select the mostprobable variant.

Because the method involves no model train-ing, we evaluate it directly over the whole dataset without dividing it into train and test sets forcross-validation.

6 Supervised Word Sense Recognition

In the supervised approach, we train neural net-work models to predict word senses. In our exper-iment, neural network model acts as a regression

122

function F transforming word embeddings pro-vided at input into sense (synset identifiers) vec-tors.

As the network architecture we selected LSTM(Hochreiter and Schmidhuber, 1997). Neural net-work model consists of one LSTM layer followedby a dense (perceptron) layer at the output. Wetrain the network using mean standard error lossfunction.

Input data consists of the sequences of fiveword2vec embeddings: of two words that makeleft and right symmetric contexts of each inputword to be disambiguated, and the word itself rep-resented by the average vector of vectors repre-senting all its senses. Ambiguous words for whichthere are no embeddings are represented by zerovectors (padded). Zero vectors are also added ifthe context is too short. This data is used to trainLSTM model (Keras 1.0.1 https://keras.io/) linked with the subsequent dense layer withsigmoid activation function.

At the final step, we transform the output intosynsets rather than vectors. We select the most ap-propriate sense from a set of possible sense inven-tory, taking into account continuous output struc-ture. In this step, neural network output layer(which is a vector of the same size as input em-beddings, but transformed) is compared with eachpossible sense vector. To compare vectors, we usecosine similarity measure, defined between anytwo vectors.

We compute cosine similarity between neuralnetwork output vector nnv and each sense frompossible sense inventory S, and select the sensewith the maximum cosine similarity towards nnv.

To test each neural network set-up we use 30-fold cross-validation.

7 Results

In this section we put summary of the resultsobtained on our test set, as well as two base-line results. The corpus consisted of 2785 sen-tences and 303 occurences of annotated ambigu-ous words which could be disambiguated by ouralgorithms, i.e. there were unambiguous equiva-lents of its senses and there were appropiate wordembeddings for at least one of the other senses ofthis word. There were 5571 occurences of wordswhich occurred only in one sense.

Table 1 presents precision of both tested meth-ods computed over the Skladnica dataset. The

set contains 344 occurrences of ambiguous wordswhich were eligible for our method. For the unsu-pervised approach we tested a window of 5 and 10words around the analyzed word.

The ambiguous words from the sentence otherthan the one being disambiguated at the momentare either omitted or represented as a vector rep-resenting all their occurrences. The uniq variantomit all other ambiguous words from the sentencewhile in the all variant we use not disambiguatedrepresentation of these words.

Method Settings Precision

random baseline N/A 0.47

MFS baseline N/A 0.73

pagerank N/A 0.52

unsupervised

5 word, all 0.5075 word, uniq 0.50710 word, uniq 0.52910 word, all 0.513

supervised

750 epochs 0.6731000 epochs 0.6802000 epochs 0.6904000 epochs 0.667

Table 1: Precision of word-sense disambiguationmethods for Polish.

In the supervised approach the best results wereobtained for 2000 epochs but they did not differmuch from these obtained after 1000 epochs.For comparison, we include two baseline values:

• random baseline select random sense fromuniform random probability distribution,

• MFS baseline use most frequent sense ascomputed from the same corpus (There is noother available sense frequency data for Pol-ish, that could be obtained from manually an-notated sources.)

The table also includes results computed usingpagerank WSD algorithm developed at the PWR(Kedzia et al., 2015). These results were obtainedfor all the ambiguous words occurring within thesample, so cannot be directly compared to our re-sults.

As the results indicate, unsupervised methodperforms at the level of random sense selection.

123

Below there are two examples of the analyzed sen-tences.

• lek przed nicoscia łaczy sie z doswiadczeniempustki ‘fear of nothingness combines with theexperience of emptiness’: in this sentence,Polish ambiguous words ‘nothingness’ and‘emptiness’ were resolved correctly while anambiguous words ‘experience’ does not haveunambiguous equivalents.

• na tym nie koncza sie problemy ‘that does notstop problems’: in this example ambiguousword ‘problem’ was not resolved correctly,but this case is difficult also for humans.

The low quality of the results might be the ef-fect of a relatively short context available as theanalysed text is not continuous.

It might have also pointed out to the difficultyof the test set. Senses in plWodnet are very numer-ous and hard to differentiate even for human. Butthe results of the supervised method falsify this as-sumption.

Our supervised approach gave much better re-sults although they are also not very good as theamount of annotated data is rather small. In thisapproach more epochs resulted in a slight modelover-fitting.

8 Conclusions

Our work introduced two methods of word sensedisambiguation based on word embeddings, su-pervised and unsupervised. The first approach as-sumes probabilistic interpretation of embeddingsand computes log probability from sequences ofword embedding vectors. In place of ambiguousword we put embeddings specific for each possiblesense and evaluate the likelihood of thus obtainedsentences. Finally we select the most probablesentence. The second supervised method is basedon a neural network trained to learn a context-sensitive transformation that maps an input vectorof ambiguous word into an output vector repre-senting its sense. We compared the performanceof both methods on corpora with manual anno-tations of word senses from the Polish wordnet(plWordnet). The results show the low quality ofthe unsupervised method and suggest the superior-ity of the supervised version in comparison to thepagerank method on the set of words which wereeligible for our approach. Although the baseline

in which just the most frequent sense is chosen isstill a little better, this is probably due to a verylimited training set available for Polish.

Acknowledgments

The paper is partially supported by the Polish Na-tional Science Centre project Compositional dis-tributional semantic models for identification, dis-crimination and disambiguation of senses in Pol-ish texts number 2014/15/B/ST6/05186.

ReferencesEneko Agirre, Oier López de Lacalle, and Aitor Soroa.

2014. Random walks for knowledge-based wordsense disambiguation. Comput. Linguist., 40(1):57–84, March.

Pierpaolo Basile, Annalina Caputo, and Giovanni Se-meraro. 2014. An enhanced lesk word sense dis-ambiguation algorithm through a distributional se-mantic model. In Proceedings of COLING 2014,the 25th International Conference on ComputationalLinguistics, Dublin, Irleand. Association for Com-putational Linguistics.

Sudha Bhingardive, Dhirendra Singh, Rudra Murthy,Hanumant Redkar, and Pushpak Bhattacharyya.2015. Unsupervised most frequent sense detectionusing word embeddings. In DENVER.

Bartosz Broda, Michał Marcinczuk, Marek Maziarz,Adam Radziszewski, and Adam Wardynski. 2012.KPWr: Towards a free corpus of polish. Proceed-ings of LREC’12.

Elzbieta Hajnicz. 2014. Lexico-semantic annotation ofskładnica treebank by means of PLWN lexical units.In Heili Orav, Christiane Fellbaum, and Piek Vossen,editors, Proceedings of the 7th International Word-Net Conference (GWC 2014), pages 23–31, Tartu,Estonia. University of Tartu.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Longshort-term memory. Neural Comput., 9(8):1735–1780, November.

Eric H. Huang, Richard Socher, Christopher D. Man-ning, and Andrew Y. Ng. 2012. Improving wordrepresentations via global context and multiple wordprototypes. In Proceedings of the 50th Annual Meet-ing of the Association for Computational Linguis-tics: Long Papers - Volume 1, ACL ’12, pages 873–882, Stroudsburg, PA, USA. Association for Com-putational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, andRoberto Navigli. 2015. Sensembed: Learningsense embeddings for word and relational similarity.In Proceedings of the 53rd Annual Meeting of theAssociation for Computational Linguistics and the

124

7th International Joint Conference on Natural Lan-guage Processing of the Asian Federation of NaturalLanguage Processing, ACL 2015, July 26-31, 2015,Beijing, China, Volume 1: Long Papers, pages 95–105.

Łukasz Kobylinski and Mateusz Kopec. 2012. Se-mantic similarity functions in word sense dis-ambiguation. In Petrand Horák Sojka, AlešandKopecek, and Karel Ivanand Pala, editors, Text,Speech and Dialogue: 15th International Confer-ence, TSD 2012, Brno, Czech Republic, September3-7, 2012. Proceedings, pages 31–38, Berlin, Hei-delberg. Springer.

Łukasz Kobylinski. 2012. Mining class associationrules for word sense disambiguation. In Pascal Bou-vry, Mieczysław A. Kłopotek, Franck Leprevost,Małgorzata Marciniak, Agnieszka Mykowiecka, andHenryk Rybinski, editors, Security and IntelligentInformation Systems: International Joint Confer-ence, SIIS 2011, Warsaw, Poland, June 13-14, 2011,Revised Selected Papers, volume 7053 of LectureNotes in Computer Science, pages 307–317, Berlin,Heidelberg. Springer.

Paweł Kedzia, Maciej Piasecki, and Marlena Or-linska. 2015. Word sense disambiguation basedon large scale Polish CLARIN heterogeneous lexi-cal resources. Cognitive Studies| Études cognitives,15:269–292.

Rada Mihalcea, Paul Tarau, and Elizabeth Figa. 2004.Pagerank on semantic networks, with applicationto word sense disambiguation. In Proceedings ofthe 20th International Conference on ComputationalLinguistics, COLING ’04, Stroudsburg, PA, USA.Association for Computational Linguistics.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.2013. Linguistic regularities in continuous spaceword representations. In Human Language Tech-nologies: Conference of the North American Chap-ter of the Association of Computational Linguis-tics, Proceedings, June 9-14, 2013, Westin PeachtreePlaza Hotel, Atlanta, Georgia, USA, pages 746–751.

Arvind Neelakantan, Jeevan Shankar, Alexandre Pas-sos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings perword in vector space. In Proceedings of the 2014Conference on Empirical Methods in Natural Lan-guage Processing, EMNLP 2014, October 25-29,2014, Doha, Qatar, A meeting of SIGDAT, a SpecialInterest Group of the ACL, pages 1059–1069.

Adam Przepiórkowski, Mirosław Banko, Rafał L.Górski, and Barbara Lewandowska-Tomaszczyk,editors. 2012. Narodowy Korpus Jezyka Polskiego.Wydawnictwo Naukowe PWN, Warsaw.

Radim Rehurek and Petr Sojka. 2010. SoftwareFramework for Topic Modelling with Large Cor-pora. In Proceedings of the LREC 2010 Workshop

on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.

Joseph Reisinger and Raymond J Mooney. 2010.Multi-prototype vector-space models of word mean-ing. In Human Language Technologies: The 2010Annual Conference of the North American Chap-ter of the Association for Computational Linguistics,pages 109–117. Association for Computational Lin-guistics.

Hinrich Schütze. 1998. Automatic word sensediscrimination. Computational Linguistics, 24(1):97–123.

Matt Taddy. 2015. Document classification by inver-sion of distributed language representations. CoRR,abs/1504.07295.

Kaveh Taghipou and Hwee Tou Ng. 2015. Semi-supervised word sense disambiguation using wordembeddings in general and specific domains. InHuman Language Technologies: The 2015 AnnualConference of the North American Chapter of theACL, page 314–323. Association for ComputationalLinguistics.

Rocco Tripodi and Marcello Pelillo. 2017. A game-theoretic approach to word sense disambiguation.Computational Linguistics.

Marcin Wolinski, Katarzyna Głowinska, and MarekSwidzinski. 2011. A preliminary version of Skład-nica—a treebank of Polish. In Zygmunt Vetulani,editor, Proceedings of the 5th Language & Technol-ogy Conference: Human Language Technologies asa Challenge for Computer Science and Linguistics,pages 299–303, Poznan, Poland.

125

Author Index

An, Yuan, 31Apidianaki, Marianna, 110Arora, Sanjeev, 12

Biemann, Chris, 72

Callison-Burch, Chris, 110Cocos, Anne, 110Cook, Paul, 47

El-Haj, Mahmoud, 61Ezeani, Ignatius, 53

Faralli, Stefano, 72Fellbaum, Christiane, 12

Gamallo, Pablo, 1Gregori, Lorenzo, 102

Hasan, Sadid, 31Hayashi, Yoshihiko, 37Hepple, Mark, 53

Jang, Kyoung-Rok, 91

Kanada, Kentaro, 37Khodak, Mikhail, 12King, Milton, 47Kobayashi, Tetsunori, 37Kober, Thomas, 79Köper, Maximilian, 24

Lieto, Antonio, 96Ling, Yuan, 31

Mensa, Enrico, 96Myaeng, Sung-Hyon, 91Mykowiecka, Agnieszka, 120

Onyenwe, Ikechukwu, 53

Panchenko, Alexander, 72Panunzi, Alessandro, 102Pereira-Fariña, Martín, 1Piao, Scott, 61Ponzetto, Simone Paolo, 72

Radicioni, Daniele P., 96

Rayson, Paul, 61Reffin, Jeremy, 79Risteski, Andrej, 12

Schulte im Walde, Sabine, 24

Wattam, Stephen, 61Wawer, Aleksander, 120Weeds, Julie, 79Weir, David, 79Wilkie, John, 79

127

Documents

Association for Computational Linguistics · Preface Welcome to the 1st Workshop on Sense, Concept and Entity Representations and their Applications (SENSE 2017). The aim of SENSE