
    Automatic Building of Wordnets

Eduard Barbu* and Verginica Barbu Mititelu**

* Graphitech Italy, 2, Salita Dei Molini, 38050 Villazzano, Trento, Italy, [email protected]

** Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania, [email protected]

    Abstract

In what follows we will present a two-phase methodology for automatically building a wordnet (that we call target wordnet) strictly aligned with an already available wordnet (source wordnet). In the first phase the synsets for the target language are automatically generated and mapped onto the source language synsets using a series of heuristics. In the second phase the salient relations that can be automatically imported are identified and the procedure for their import is explained. The assumptions behind such a methodology will be stated, the heuristics employed will be presented and their success evaluated against a case study (automatically building a Romanian wordnet using PWN).

1. Introduction

The importance of a wordnet for NLP applications can hardly be overestimated. The Princeton WordNet (PWN) (Fellbaum 1998) is now a mature lexical ontology which has demonstrated its efficiency in a variety of tasks (word sense disambiguation, machine translation, information retrieval, etc.). Inspired by the success of PWN, many languages started to develop their own wordnets taking PWN as a model (cf. http://www.globalwordnet.org/gwa/wordnet_table.htm). Furthermore, in both the EuroWordNet (Vossen 1998) and BalkaNet (Tufiș 2004) projects the synsets from different versions of PWN (1.5 and 2.0) were used as ILI repositories. The created wordnets were linked by means of interlingual relations through this ILI repository1.

The rapid progress in building a new wordnet and linking it with an already tested wordnet (usually PWN) is hindered by the amount of time and effort needed for developing such a resource. To take a recent example, the development of core wordnets (of about 20000 synsets, as is the case with the Romanian wordnet) for the Balkan languages took three years (2001-2004).

1 In both projects a number of synsets expressing language-specific concepts were added to the ILI repository.

In what follows we present a methodology that can be used for automatically building wordnets strictly aligned (that is, using only the EQ_SYNONYM relation) with an already available wordnet. We have started our experiment with the study of nouns, so the data presented here are valid only for this grammatical category.

We call the wordnet already available the Source wordnet (as mentioned before, this is usually a version of PWN), and the wordnet to be built and linked with the Source wordnet will be named the Target wordnet.

The methodology we present has two phases. In the first one the synsets for the target language are automatically generated and mapped onto the source language synsets using a series of heuristics. In the second phase the salient relations that can be automatically imported are identified and the procedure for their import is explained.

The paper has the following organization. Firstly, we state the implicit assumptions in building a wordnet strictly aligned with other wordnets. Then we shortly describe the resources that one needs in order to apply the heuristics, as well as the criteria we used in selecting the source language test synsets to be implemented. Finally, we state the problem to be solved in a more formal way, present the heuristics employed and evaluate their success against a case study (automatically building a Romanian wordnet using PWN 2.0).

2. Assumptions

The assumptions that we considered necessary for automatically building a target wordnet using a Source wordnet are the following:

1. There are word senses that can be clearly identified. This assumption is implicit whenever one builds a wordnet, aligned or not with other wordnets. This premise was extensively questioned, among others, by (Kilgarriff 1997), who argues that word senses do not have a real ontological status, but exist only relative to a task. We will not discuss this issue here.

2. A rejection of the strong reading of the Sapir-Whorf hypothesis (Carroll 1964) (the principle of linguistic relativity). Simply stated, the principle of linguistic relativity says that


language shapes our thought. There are two variants of this principle: strong determinism and weak determinism. According to strong determinism, language and thought are identical. This hypothesis has few followers today, if any, and the evidence against it comes from various sources, among which is the possibility of translation into other languages. However, the weak version of the hypothesis is largely accepted. One can view reality and our organization of reality by analogy with the spectrum of colors, which is a continuum in which we place arbitrary boundaries (white, green, black, etc.). Different languages will cut this continuous spectrum differently. For example, Russian and Spanish have no words for the concept blue. This weak version of the principle of linguistic relativity warns us, however, that a specific source wordnet could not be used for automatically building any target wordnet. We further discuss this below.

3. The acceptance of the conceptualization made by the source wordnet. By conceptualization we understand the way in which the source wordnet sees reality by identifying the main concepts to be expressed and their relationships. For specifying how different languages can differ with respect to the conceptual space they reflect, we will follow (Sowa 1992), who considers three distinct dimensions:

accidental. The two languages have different notations for the same concepts. For example, the Romanian word măr and the English word apple lexicalize the same concept.

systematic. The systematic dimension defines the relation between the grammar of a language and its conceptual structures. It deals with the fact that some languages are SVO or VSO, etc., some are analytic and others agglutinative. Even if this is an important difference between languages, the systematic dimension has little import for our problem.

cultural. The conceptual space expressed by a language is determined by environmental, cultural factors, etc. It could be the case, for example, that the concepts that define the legal systems of different countries are not mutually compatible. So when someone builds a wordnet starting from a source wordnet he/she should ask himself/herself what the parts (if any) are that can be safely transferred into the target language. More precisely, what the parts are that share the same conceptual space.

The assumption that we make use of is that the differences between the two languages (source and target) are merely accidental: they have different lexicalizations for the same concepts. As the conceptual space is already expressed by the Source wordnet structure using a language notation, our task is to find the concepts' notations in the target language.

When the Source wordnet is not perfect (the real situation), a drawback of the automatic mapping approach is that all the mistakes existent in the source wordnet are transferred into the target wordnet. Consider the following senses of the noun hindrance in PWN:

1. hindrance, deterrent, impediment, balk, baulk, check, handicap -- (something immaterial that interferes with or delays action or progress)
   => cognitive factor -- (something immaterial (as a circumstance or influence) that contributes to producing a result)

2. hindrance, hitch, preventive, preventative, encumbrance, incumbrance, interference -- (any obstruction that impedes or is burdensome)
   => artifact, artefact -- (a man-made object taken as a whole)

We listed senses 1 and 2 of the word hindrance together with one of their hyperonyms. As one can see, in PWN a distinction is made between hindrance as a cognitive factor and hindrance as an artefact, so that these are two different senses of the word hindrance. According to this definition a speed bump can be classified as a hindrance because it is an artifact, but a stone that lies in someone's path cannot be one, because it is not a man-made object.

Another possible problem appears because the previously made assumption about the sameness of the conceptual space is not always true, as the following example shows:

mister, Mr -- (a form of address for a man)
sir -- (term of address for a man)

In Romanian both mister and sir in the listed senses are translated by the word domn. But in Romanian it would be artificial to create two distinct synsets for the word domn, as they are not different, not even as far as their connotations are concerned.

3. Selection of concepts and resources used

When we selected the set of synsets to be implemented in Romanian we followed two criteria.

The first criterion states that the selected set should be structured in the source wordnet (i.e. every selected synset should be linked by at least one semantic relation with other selected synsets). This is dictated by the methodology we have adopted (automatic mapping and automatic relation import). If we want to obtain a wordnet in the target language and not just some isolated synsets, this criterion is self-imposing.

The second criterion is related to the evaluation stage. To properly evaluate the built wordnet, it should be compared with a gold standard. The gold standard that we use is the Romanian wordnet (RoWN) developed in the BalkaNet project2.

2 One can argue that this Romanian wordnet is not perfect and is definitely incomplete. However, PWN is not perfect either. Moreover, it is indisputable that, at least in the case of ontologies (lexical or formal), a manually or semi-automatically built ontology is much better than an automatically built one.


For fulfilling both criteria we chose a subset of noun concepts from the RoWN with the property that its projection on PWN 2.0 is closed under the hyperonymy and the meronymy relations. Moreover, this subset includes the upper level part of the PWN lexical ontology. The projection of this subset on PWN 2.0 comprises 9716 synsets that contain 19624 literals.

For the purpose of automatic mapping of this subset we used an in-house dictionary built from many sources. The dictionary has two main components.

The first component consists of the above-mentioned 19624 literals and their Romanian translations. We must make sure that this part of the dictionary is as complete as possible: ideally, all senses of the English words should be translated. For that we used the (Levițchi & Bantaș 1992) dictionary and other dictionaries available on the web.

The second component of the dictionary is the (Levițchi & Bantaș 1992) dictionary itself. Some dictionaries (in our case the dictionary extracted from the already available Romanian wordnet) also have sense numbers specified, but, from our experience, this information is highly subjective, does not match the senses as defined by PWN and is not consistent across different dictionaries, so we chose to disregard it.

The second resource used is the Romanian Explanatory Dictionary (EXPD 1996), whose entries are numbered to reflect the dependencies between different senses of the same word.

4. Notation introduction

In this section we introduce the notations used in the paper and we outline the guiding idea of all the heuristics we used:

1. By TL we denote the target lexicon. In our experiment TL will contain Romanian words (nouns). TL = {rw_1, rw_2, ..., rw_m}, where rw_i, with i = 1..m, denotes a target word.

2. By SL we denote the source lexicon. In our case SL will contain English words (nouns). SL = {ew_1, ew_2, ..., ew_n}, where ew_j, with j = 1..n, denotes a source word.

3. WT and WS are the wordnets for the target language and the source language, respectively.

4. ew_j^k denotes the k-th sense of the word ew_j.

5. BD is a bilingual dictionary which acts as a bridge between SL and TL. BD = (SL, TL, M) is a 3-tuple, where M is a function that associates to each word in SL a set of words in TL. For an arbitrary word ew_j ∈ SL, M(ew_j) = {rw_1, rw_2, ..., rw_k}.

Formally, the bilingual dictionary maps words and not word senses. If word senses had been mapped, then building WT from a WS would have been trivial.
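To make the notation concrete, here is a minimal sketch of how BD and the function M can be represented, assuming a plain word-to-translations mapping; the sample entries and names are invented for illustration only.

```python
from typing import Dict, Set

# BD seen through the function M: each English (SL) word is mapped to the set of
# its Romanian (TL) translations; word senses are not distinguished at this level.
BD: Dict[str, Set[str]] = {
    "apple": {"măr"},                      # hypothetical entries,
    "hindrance": {"piedică", "obstacol"},  # for illustration only
}

def M(source_word: str) -> Set[str]:
    """Return the set of TL translations of an SL word (empty set if unknown)."""
    return BD.get(source_word, set())
```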

If we ignore the information given by the definitions associated with word senses, then, formally, a sense of a word in PWN is distinguished from other word senses only by the set of relations it contracts in the semantic network. This set of relations defines what is called the position of a word in the semantic network. Ideally, every sense of a word should be unambiguously identified by a set of connections; it should have a unique position in the semantic net. Unfortunately this is not the case in PWN. There are many cases when different senses of a word have the same position in the semantic network (i.e. they have precisely the same connections with other word senses).

The idea of our heuristics can be summed up in three points:

1. Increase the number of relations in the Source wordnet in order to obtain a unique position for each word sense. For this, an external resource to which the wordnet is linked can be used, such as WordNet Domains.

2. Try to derive useful relations between the words in the target language. For this one can use corpora, monolingual dictionaries, already classified sets of documents, etc.

3. In the mapping stage of the procedure, take advantage of the structures built at points 1 and 2.

We have developed so far a set of four heuristics and we plan to supplement them in the future.

5. The first heuristic rule

The first heuristic exploits the fact that synonymy enforces equivalence classes on word senses.

Let EnSyn = {ew_{j_1}^{i_1}, ew_{j_2}^{i_2}, ..., ew_{j_n}^{i_n}} (where ew_{j_1}, ew_{j_2}, ..., ew_{j_n} are the words in the synset and the superscripts denote their sense numbers) be an SL synset with length(EnSyn) > 1. We impose the length of a synset to be greater than one when at least one component word is not a variant of the other words; so we disregard synsets such as {artefact, artifact}. For achieving this we computed the well-known Levenshtein distance between the words in the synset. The BD translations of the words in the synset will be:

M(ew_{j_1}^{i_1}) = {rw_{1,1}, ..., rw_{1,m}}
M(ew_{j_2}^{i_2}) = {rw_{2,1}, ..., rw_{2,k}}
...
M(ew_{j_n}^{i_n}) = {rw_{n,1}, ..., rw_{n,t}}

We build the corresponding TL synset as:

1. M(ew_{j_k}^{i_k}), if there exists a word ew_{j_k}^{i_k} ∈ EnSyn such that its number of senses NoSenses(ew_{j_k}^{i_k}) = 1;

2. M(ew_{j_1}^{i_1}) ∩ M(ew_{j_2}^{i_2}) ∩ ... ∩ M(ew_{j_n}^{i_n}) otherwise.

Words belonging to the same synset in SL should have a common translation in TL. Above we distinguished two cases:

1. At least one of the words in the synset is monosemous. In this case we build the TL synset as the set of translations of the monosemous word.

2. All words in the synset are polysemous. The corresponding TL synset will be constructed as the intersection of all TL translations of the SL words in the synset.

Taking the actual RoWN as a gold standard, we can evaluate the results of our heuristics by comparing the obtained synsets with those in the RoWN. We distinguish five possible cases:

1. The synsets are equal (this case will be labeled Identical).
2. The generated synset has all the literals of the correct synset and some more (Over-generation).
3. The generated synset and the gold one have some literals in common and some different (Overlap).
4. The generated synset literals form a proper subset of the gold synset (Under-generation).
5. The generated synset has no literals in common with the correct one (Disjoint).

The cases Over-generation, Overlap and Disjoint will be counted as errors. The other two cases, namely Identical and Under-generation, will be counted as successes3.

The evaluation of the first heuristic is given in Table 1, at the end of section 9. The "percent mapped" column contains the percentage of synsets mapped by the heuristic out of the total number of synsets (9716). The "percent errors" column represents the percentage of mapped synsets wrongly assigned by the heuristic. The high number of mapped synsets proves the quality of the first part of the dictionary we used. The only type of error we encountered is Over-generation.
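The following is a minimal sketch of the first heuristic, assuming the dictionary is available as a mapping M from English words to sets of Romanian translations and that the number of senses of each English word is known; the variant filter based on the Levenshtein distance and its threshold are our own illustrative choices.

```python
from typing import Dict, List, Set

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def first_heuristic(en_syn: List[str],
                    M: Dict[str, Set[str]],
                    no_senses: Dict[str, int],
                    variant_threshold: int = 2) -> Set[str]:
    """Build a candidate TL synset from the literals of an SL synset."""
    # The heuristic only applies to synsets with at least two literals that are
    # not mere spelling variants of each other ({artefact, artifact} is skipped).
    distinct: List[str] = []
    for w in en_syn:
        if all(levenshtein(w, d) > variant_threshold for d in distinct):
            distinct.append(w)
    if len(distinct) < 2:
        return set()
    # Case 1: some literal is monosemous -- its translations form the TL synset.
    for w in en_syn:
        if no_senses.get(w, 0) == 1:
            return set(M.get(w, set()))
    # Case 2: all literals are polysemous -- intersect their translation sets.
    result = set(M.get(en_syn[0], set()))
    for w in en_syn[1:]:
        result &= M.get(w, set())
    return result
```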

6. The second heuristic rule

The second heuristic draws on the fact that, in the case of nouns, the hyperonymy relation can be interpreted as an IS-A relation4. It is also based on two related observations:

1. A hyperonym and its hyponyms carry some common information.

2. The information common to the hyperonym and the hyponym increases as you go down in the hierarchy.

Let EnSyn1 = {ew_{j_1}^{i_1}, ..., ew_{j_t}^{i_t}} and EnSyn2 = {ew_{h_1}^{l_1}, ..., ew_{h_s}^{l_s}} be two SL synsets such that EnSyn1 HYP EnSyn2, meaning that EnSyn1 is a hyperonym of EnSyn2. Then we generate the translation lists of the words in the synsets. The intersection within each synset is computed as:

TL_EnSyn1 = M(ew_{j_1}) ∩ M(ew_{j_2}) ∩ ... ∩ M(ew_{j_t})
TL_EnSyn2 = M(ew_{h_1}) ∩ M(ew_{h_2}) ∩ ... ∩ M(ew_{h_s})

The generated synset in the target language will be computed as:

TL_Synset = TL_EnSyn1 ∩ TL_EnSyn2

Given the above considerations, it is possible that a hyponym and its hyperonym have the same translation in the other language, and this is more probable as you descend in the hierarchy. The procedure formally described above is applied to each synset in the source list. It generates the lists of common translations for all the words in the hyperonym and hyponym synsets and then constructs the TL synsets by intersecting these lists. In case the intersection is not empty, the created synset will be assigned to both SL synsets.

Because the procedure generates autohyponym synsets, this could be an indication that the created synsets could be clustered in TL.

It is possible that a TL synset be assigned to two different source synset pairs, as in the figure below. So we need to perform a clean-up procedure and choose the assignment that maximizes the sum of the depth levels of the two synsets.

3 The Under-generation case means that the resulting synset is not rich enough; it does not mean that it is incorrect.

4 This is not entirely true, because in PWN the hyperonym relation can also be interpreted as an INSTANCE-OF relation, as some instances are also included in PWN (e.g. New York, Adam, etc.).

[Figure: a hyperonym chain on levels k, k+1 and k+2; the middle synset shares a common translation both with its hyperonym (level k) and with its hyponym (level k+2).]

In the figure, common information is found between the middle synset and the upper synset (its hyperonym) and also between the middle synset and the lower synset (its hyponym). Our procedure will prefer the second assignment.

The results of the second heuristic are presented in Table 2, at the end of section 9. The low number of mapped synsets (10%) is due to the fact that we did not find many common translations between hyperonyms and their hyponyms.
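Below is a sketch of the second heuristic under the same assumptions about M; common_translations reproduces the within-synset intersection, and prefer_deeper only hints at the clean-up step that keeps, for a candidate synset, the hyperonym/hyponym pair lying deepest in the hierarchy.

```python
from typing import Dict, List, Optional, Set, Tuple

def common_translations(syn: List[str], M: Dict[str, Set[str]]) -> Set[str]:
    """Translations shared by all literals of a synset (the TL_EnSyn set)."""
    result = set(M.get(syn[0], set())) if syn else set()
    for w in syn[1:]:
        result &= M.get(w, set())
    return result

def second_heuristic(hyperonym: List[str], hyponym: List[str],
                     M: Dict[str, Set[str]]) -> Optional[Set[str]]:
    """Candidate TL synset shared by a hyperonym synset and one of its hyponyms."""
    tl = common_translations(hyperonym, M) & common_translations(hyponym, M)
    return tl or None

def prefer_deeper(assignments: List[Tuple[str, int]]) -> str:
    """Clean-up: among the source pairs a TL synset was assigned to, keep the one
    with the largest sum of depth levels.  Each item is (pair_id, depth_sum)."""
    return max(assignments, key=lambda item: item[1])[0]
```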

7. The third heuristic

The third heuristic takes advantage of an external relation imposed over the wordnet. At IRST, PWN 1.6 was augmented with a set of domain labels, the resulting resource being called WordNet Domains (Magnini & Cavaglia 2000). PWN 1.6 synsets have been semi-automatically linked with a set of 200 domain labels taken from the Dewey Decimal Classification, the world's most widely used library classification system. The domain labels are hierarchically organized and each synset received one or more domain labels. For the synsets that cannot be labeled unambiguously the default label factotum has been used.

Because in the BalkaNet project the RoWN has been aligned with PWN 2.0, we performed a mapping between PWN 1.6 and PWN 2.0. By comparison with PWN 1.6, PWN 2.0 has additional synsets and its structure is also slightly modified. As a consequence, not all PWN 2.0 synsets can be reached from PWN 1.6, either because they are completely new or because they could not be unambiguously mapped. This results in some PWN 2.0 synsets having no domains. For their labelling we used the three rules below (a code sketch of these rules follows the next paragraph):

1. If one of the direct hyperonyms of the unlabeled synset has been assigned a domain, then the synset will automatically receive the father's domain; conversely, if one of the hyponyms is labelled with a domain and the father lacks a domain, then the father synset will receive the son's domain.

2. If a holonym of the synset is assigned a specific domain, then the meronym will receive the holonym's domain, and conversely.

3. If a domain label cannot be assigned, then the synset will receive the default factotum label.

The idea of using domains is helpful for distinguishing word senses (different senses of a word are assigned to different domains). The best case is when each sense of a word has been assigned a distinct domain. But even if the same domain label is assigned to two or more senses of a word, in most cases we can assume that this is a strong indication of a fine-grained distinction. It is very probable that the distinction is preserved in the target language by the same word.
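A sketch of the three labelling rules above, assuming each PWN 2.0 synset id appears as a key in a domains map (with an empty set when no label was inherited from PWN 1.6) and that the hyperonym/hyponym and holonym/meronym neighbours are available as adjacency maps; the data layout is our own assumption.

```python
from typing import Dict, List, Set

def propagate_domains(domains: Dict[str, Set[str]],
                      hyper: Dict[str, List[str]], hypo: Dict[str, List[str]],
                      holo: Dict[str, List[str]], mero: Dict[str, List[str]]) -> None:
    """Assign domain labels to PWN 2.0 synsets left unlabeled by the 1.6 mapping."""
    for syn in domains:
        if domains[syn]:
            continue                       # already labelled via PWN 1.6
        # Rule 1 first (hyperonym/hyponym), then rule 2 (holonym/meronym):
        # the ordering of the neighbour list encodes this priority.
        neighbours = (hyper.get(syn, []) + hypo.get(syn, []) +
                      holo.get(syn, []) + mero.get(syn, []))
        for other in neighbours:
            if domains.get(other):
                domains[syn] = set(domains[other])
                break
        else:
            domains[syn] = {"factotum"}    # rule 3: default label
```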

We labelled every word in the BD dictionary with its domain label. For English words the domain is automatically generated from the English synset labels. For labelling Romanian words we used two methods:

1. We downloaded a collection of documents from web directories such that the categories of the downloaded documents match the categories used in WordNet Domains. The downloaded document set underwent a preprocessing procedure with the following steps:

a. Feature extraction. The first phase consists in finding a set of terms that represents the documents adequately. The documents were POS-tagged and lemmatized, and the nouns were selected as features.

b. Feature selection. In this phase the features that provide less information were eliminated. For this we used the well-known χ² statistic.

The χ² statistic checks if there is a relationship between being in a certain group and a characteristic that we want to study. In our case we want to measure the dependency between a term t and a category c. The formula for χ² is:

χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D))

where:
A is the number of times t and c co-occur;
B is the number of times t occurs without c;
C is the number of times c occurs without t;
D is the number of times neither c nor t occurs;
N is the total number of documents.

For each category we computed the χ² score between that category and the noun terms of our documents. Then, for choosing the terms that discriminate well for a certain category, we used the formula below (where m denotes the number of categories; a code sketch of this computation follows the dictionary entry example below):

χ²_max(t) = max_{i=1..m} χ²(t, c_i)

2. We took advantage of the fact that some words have already been assigned subject codes in various dictionaries. We performed a manual mapping of these codes onto the domain labels used at IRST. The Romanian words that could not be associated with domain information were associated with the default factotum domain.

The following entry is a BD dictionary entry augmented with domain information:

M(ew_1 [D1]) = {rw_1 [D1, D2], rw_2 [D1, D3], rw_i [D2, D4]}

In the square brackets the domains that pertain to each word are listed.
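The following is a direct transcription of the χ² selection step described under method 1 above; the counts A, B, C and D are assumed to have been gathered from the downloaded document collection, and the function names are ours.

```python
from typing import Dict, Tuple

def chi_square(A: int, B: int, C: int, D: int) -> float:
    """Chi-square association between a term t and a category c.

    A: documents of c containing t        B: documents outside c containing t
    C: documents of c without t           D: documents with neither c nor t
    """
    N = A + B + C + D                        # total number of documents
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def chi_square_max(counts_per_category: Dict[str, Tuple[int, int, int, int]]) -> float:
    """Score of a term: its best chi-square value over all m categories."""
    return max(chi_square(*counts) for counts in counts_per_category.values())
```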

Let again EnSyn1 = {ew_{j_1}^{i_1}, ew_{j_2}^{i_2}, ..., ew_{j_n}^{i_n}} be an SL synset and D_i its associated domain. Then the TL synset will be constructed as follows:

TL_Synset = ∪_{m=1..n} M(ew_{j_m}), where each rw ∈ M(ew_{j_m}) has the property that its domain matches the domain of EnSyn1, that is: it is either the same as the domain of EnSyn1, subsumes the domain of EnSyn1 in the IRST domain label hierarchy, or is subsumed by the domain of EnSyn1 in the IRST domain label hierarchy.

For each synset in the SL we generated all the translations of its literals in the TL. Then the TL synset is built using only those TL literals whose domain matches the SL synset domain.

The results of this heuristic are given in Table 3, at the end of section 9.
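A sketch of the third heuristic, assuming the bilingual dictionary now carries a domain label for each translation (as in the augmented entry above) and that the IRST domain hierarchy is available as a child-to-parent map; domain_matches implements the equality-or-subsumption test, and all names are ours.

```python
from typing import Dict, List, Set, Tuple

def subsumes(ancestor: str, label: str, parent: Dict[str, str]) -> bool:
    """True if `ancestor` dominates `label` in the domain label hierarchy."""
    while label in parent:
        label = parent[label]
        if label == ancestor:
            return True
    return False

def domain_matches(d: str, syn_domain: str, parent: Dict[str, str]) -> bool:
    return (d == syn_domain or subsumes(d, syn_domain, parent)
            or subsumes(syn_domain, d, parent))

def third_heuristic(en_syn: List[str], syn_domain: str,
                    M_dom: Dict[str, Set[Tuple[str, str]]],
                    parent: Dict[str, str]) -> Set[str]:
    """Union of the translations whose domain matches the SL synset domain.

    M_dom maps an English word to a set of (Romanian word, domain) pairs."""
    tl_synset: Set[str] = set()
    for ew in en_syn:
        for rw, d in M_dom.get(ew, set()):
            if domain_matches(d, syn_domain, parent):
                tl_synset.add(rw)
    return tl_synset
```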

8. The fourth heuristic rule

The fourth heuristic takes advantage of the fact that the source synsets have an associated gloss and also that target words that are translations of source words have associated glosses in EXPD. As with the third heuristic, the procedure comprises a preprocessing phase. We preprocessed both resources (PWN and EXPD):

1. We automatically lemmatized and tagged all the glosses of the synsets in the SL.

2. We automatically lemmatized and tagged all the definitions of the words that are translations of SL words.

3. We chose as features for representing the glosses the set of nouns.

The target definitions were automatically translated using the bilingual dictionary. All possible source definitions were generated by translating each lemmatized noun in the TL definition. Thus, if a TL definition of one TL word is represented by the vector [rw_1, rw_2, ..., rw_p], then the number of SL vectors generated will be N = n_d * t_{w_1} * t_{w_2} * ... * t_{w_p}, where n_d is the number of definitions the target word has in the monolingual dictionary (EXPD) and t_{w_k}, with k = 1..p, is the number of translations that the noun w_k has in the bilingual dictionary.

By RGloss we denote the set of SL representation vectors of the TL glosses of a TL word: RGloss = {T_1, T_2, ..., T_n}. By S_v we denote the vector of the SL synset gloss. The procedure for generating the TL synset is the following: for each SL synset we generate the TL list of the translations of all the words in the synset. Then for each word in the TL list of translations we compute the similarity between S_v and its RGloss. The computation is done in two steps:

1. We give the vectors in S_v and RGloss a binary representation. The number (m) of positions a vector has will be equal to the number of distinct words existent in S_v and in all vectors of RGloss. The value 1 in the vector means that a word is present and the value 0 means that the word is absent.

2. For each vector T_i in RGloss we compute the product S_v · T_i = Σ_{j=1..m} s_j * t_j. If there exists at least one T_i such that S_v · T_i >= 2, we compute max_i(S_v · T_i) and we add the word to the TL synset.

Notice that by using this heuristic rule we can automatically add a gloss to the TL synset. As one can see in Table 4 at the end of section 9, the number of incomplete synsets is high. The low percentage of mapped synsets is due to the low agreement between the glosses in Romanian and English.
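Because the vectors are binary, the dot product S_v · T_i reduces to the number of nouns shared by the SL synset gloss and a translated TL definition; the sketch below uses this observation and assumes the glosses have already been reduced to noun sets, with names of our own choosing.

```python
from typing import Dict, Iterable, List, Set

def gloss_similarity(sl_gloss_nouns: Set[str],
                     translated_defs: Iterable[Set[str]]) -> int:
    """Best overlap between the SL synset gloss and the SL translations of the
    TL word's EXPD definitions (the RGloss vectors)."""
    return max((len(sl_gloss_nouns & t) for t in translated_defs), default=0)

def fourth_heuristic(sl_gloss_nouns: Set[str],
                     candidate_words: List[str],
                     r_gloss: Dict[str, List[Set[str]]],
                     threshold: int = 2) -> Set[str]:
    """Keep the candidate TL words whose translated definitions share at least
    `threshold` nouns with the SL synset gloss."""
    return {w for w in candidate_words
            if gloss_similarity(sl_gloss_nouns, r_gloss.get(w, [])) >= threshold}
```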

9. Combining results

For choosing the final synsets we devised a set of meta-rules by evaluating the pros and cons of each heuristic rule. For example, given the high quality of the dictionary, the probability that the first heuristic fails is very low, so the synsets obtained using it are automatically selected. A synset obtained using the other heuristics will be selected, and moreover will replace a synset obtained using the first heuristic, only if it is obtained independently by heuristics 3 and 2, or by heuristics 3 and 4. A synset that is not selected by the above meta-rules will be selected only if it is obtained by heuristic 3 and the ambiguity of its members is at most 2. Table 5 at the end of this section shows the combined results of our heuristics.

As one can observe there, for 106 synsets in PWN 2.0 the Romanian equivalent synsets could not be found. There also resulted 635 synsets smaller than the corresponding synsets in the RoWN.
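One possible reading of the meta-rules above, for a single SL synset; h1..h4 are the candidate TL synsets produced by the four heuristics (None when a heuristic did not fire), ambiguity(w) is the number of senses of the TL word w, and both the function shape and the rule ordering are our own interpretation.

```python
from typing import Callable, Optional, Set

def combine(h1: Optional[Set[str]], h2: Optional[Set[str]],
            h3: Optional[Set[str]], h4: Optional[Set[str]],
            ambiguity: Callable[[str], int]) -> Optional[Set[str]]:
    """Select the final TL synset according to the meta-rules."""
    # A synset obtained independently by heuristics 3 and 2, or by 3 and 4,
    # replaces the one obtained by the first heuristic.
    if h3 is not None and (h3 == h2 or h3 == h4):
        return h3
    # Otherwise the first heuristic, when it fired, is trusted as is.
    if h1 is not None:
        return h1
    # Finally, a synset produced by heuristic 3 alone is accepted only if none
    # of its members has more than two senses.
    if h3 is not None and all(ambiguity(w) <= 2 for w in h3):
        return h3
    return None
```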

Table 1: The results of the first heuristic
  Number of mapped synsets: 8493
  Percent mapped: 87
  Error types: Over-generation 210, Overlap 0, Disjoint 0
  Correct: Under-generation 300, Identical 7983
  Percent errors: 2

Table 2: The results of the second heuristic
  Number of mapped synsets: 1028
  Percent mapped: 10
  Error types: Over-generation 213, Overlap 0, Disjoint 150
  Correct: Under-generation 230, Identical 435
  Percent errors: 35

Table 3: The results of the third heuristic
  Number of mapped synsets: 7520
  Percent mapped: 77
  Error types: Over-generation 689, Overlap 0, Disjoint 0
  Correct: Under-generation 0, Identical 6831
  Percent errors: 9

Table 4: The results of the fourth heuristic
  Number of mapped synsets: 3527
  Percent mapped: 36
  Error types: Over-generation 25, Overlap 0, Disjoint 78
  Correct: Under-generation 547, Identical 2877
  Percent errors: 3

Table 5: The combined results of the heuristics
  Number of mapped synsets: 9610
  Percent mapped: 98
  Error types: Over-generation 615, Overlap 0, Disjoint 250
  Correct: Under-generation 635, Identical 8110
  Percent errors: 9

10. Import of relations

After building the target synsets, an investigation of the nature of the relations that structure the source wordnet should be made in order to establish which of them can be safely transferred into the target wordnet. As one expects, the conceptual relations can be safely transferred because these relations hold between concepts. The only lexical relation that holds between nouns and that was subject to scrutiny was the antonymy relation. We concluded that this relation can also be safely imported. The importing algorithm works as described below.

If two source synsets S1 and S2 are linked by a semantic relation R in WS and if T1 and T2 are the corresponding aligned synsets in WT, then T1 and T2 will be linked by the relation R. If in WS there are intervening synsets between S1 and S2, then we will set the relation R between the corresponding TL synsets only if R is declared as a transitive relation (R+, unlimited number of compositions, e.g. hypernymy) or a partially transitive relation (Rk, with k a user-specified maximum number of compositions, larger than the number of intervening synsets between S1 and S2). For instance, we defined all the holonymy relations as partially transitive (k = 3).
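A sketch of this import rule, assuming the WS synsets on the path from S1 to S2 are given as a list of synset ids, the alignment as a WS-to-WT map, and the transitivity policy as a map from relation names to a maximum number of compositions (None standing for the unlimited R+ case); the relation names and k values shown are only illustrative, apart from the holonymy setting k = 3 mentioned above.

```python
from typing import Dict, List, Optional, Tuple

def import_relation(path: List[str], relation: str,
                    aligned: Dict[str, str],
                    max_compositions: Dict[str, Optional[int]]
                    ) -> Optional[Tuple[str, str, str]]:
    """Return (T1, relation, T2) if R can be transferred along `path`, else None."""
    s1, s2 = path[0], path[-1]
    if s1 not in aligned or s2 not in aligned:
        return None                        # one of the ends has no target synset
    intervening = len(path) - 2            # synsets strictly between S1 and S2
    k = max_compositions.get(relation, 0)  # 0: the relation is not transitive
    if intervening > 0 and not (k is None or k > intervening):
        return None
    return (aligned[s1], relation, aligned[s2])

# Illustrative policy: hypernymy unlimited (R+), holonymy partially transitive
# (k = 3), antonymy imported only between directly linked synsets.
max_compositions = {"hypernym": None, "holonym": 3, "antonym": 0}
```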

11. Conclusion and future work

Other experiments of automatically building wordnets that we are aware of are (Atserias et al. 1997) and (Lee et al. 2000). They combine several methods, using monolingual and bilingual dictionaries, for obtaining a Spanish wordnet and a Korean one, respectively, starting from PWN 1.5.

However, our approach is characterized by the fact that it gives an accurate evaluation of the results by automatically comparing them with a manually built wordnet. We also explicitly state the assumptions of this automatic approach. Our approach is the first to use an external resource (WordNet Domains) in the process of automatically building a wordnet.

We obtained a version of RoWN that contains 9610 synsets and 11969 relations with 91% accuracy.

The results obtained encourage us to develop other heuristics. The success of our procedure was facilitated by the quality of the bilingual dictionary we used.

Some of the heuristics developed here may be applied to the automatic construction of synsets of other parts of speech. That is why we also plan to extend our experiment to adjectives and verbs. Their evaluation would be of great interest in our opinion.

Finally, we would like to thank the three anonymous reviewers for helping us improve the final version of the paper.

References:

(Atserias et al. 1997) J. Atserias, S. Climent, X. Farreres, G. Rigau, H. Rodríguez, Combining Multiple Methods for the Automatic Construction of Multilingual WordNets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, 1997.

(Carroll 1964) J. B. Carroll (Ed.), Language, Thought and Reality: Selected Writings of Benjamin Lee Whorf, The MIT Press, Cambridge, MA, 1964.

(EXPD 1996) Dicționarul explicativ al limbii române, 2nd edition, București, Univers Enciclopedic, 1996.

(Fellbaum 1998) Ch. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.

(Kilgarriff 1997) A. Kilgarriff, I don't believe in word senses. In Computers and the Humanities, 31 (2), 91-113, 1997.

(Lee et al. 2000) C. Lee, G. Lee, J. Seo, Automatic WordNet mapping using Word Sense Disambiguation. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC 2000), Hong Kong, 2000.

(Levițchi & Bantaș 1992) L. Levițchi and A. Bantaș, Dicționar englez-român, București, Teora, 1992.

(Magnini & Cavaglia 2000) B. Magnini and G. Cavaglia, Integrating subject field codes into WordNet. In Proceedings of LREC-2000, Athens, Greece.

(Sowa 1992) J. F. Sowa, Logical Structure in the Lexicon. In J. Pustejovsky and S. Bergler (Eds.), Lexical Semantics and Commonsense Reasoning, LNAI 627, Springer-Verlag, Berlin, 39-60, 1992.

(Tufiș 2004) D. Tufiș (Ed.), Special Issue on the BalkaNet Project of the Romanian Journal of Information Science and Technology, vol. 7, no. 1-2, 2004.

(Vossen 1998) P. Vossen (Ed.), EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Dordrecht, Kluwer, 1998.