8/7/2019 89_barbu
1/8
Automatic Building of Wordnets
Eduard Barbu* and Verginica Barbu Mititelu**
*Graphitech Italy, 2, Salita Dei Molini, 38050 Villazzano, Trento, Italy
[email protected]
**Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania
[email protected]
Abstract
In what follows we will present a two-phase methodology for automatically building a wordnet (that we call target wordnet) strictly aligned with an already available wordnet (source wordnet). In the first phase the synsets for the target language are automatically generated and mapped onto the source language synsets using a series of heuristics. In the second phase the salient relations that can be automatically imported are identified and the procedure for their import is explained. The assumptions behind such a methodology will be stated, the heuristics employed will be presented and their success evaluated against a case study (automatically building a Romanian wordnet using PWN).
1. Introduction
The importance of a wordnet for NLP applications can hardly be overestimated. The Princeton WordNet (PWN) (Fellbaum 1998) is now a mature lexical ontology which has demonstrated its efficiency in a variety of tasks (word sense disambiguation, machine translation, information retrieval, etc.). Inspired by the success of PWN, many languages started to develop their own wordnets taking PWN as a model (cf. http://www.globalwordnet.org/gwa/wordnet_table.htm). Furthermore, in both the EuroWordNet (Vossen 1998) and BalkaNet (Tufiş 2004) projects the synsets from different versions of PWN (1.5 and 2.0) were used as ILI repositories. The created wordnets were linked by means of interlingual relations through this ILI repository1.
The rapid progress in building a new wordnet and linking it with an already tested wordnet (usually PWN) is hindered by the amount of time and effort needed for developing such a resource. To take a recent example, the development of core wordnets (of about 20000 synsets, as is the case with the Romanian wordnet) for Balkan languages took three years (2001-2004).

1 In both projects a number of synsets expressing language-specific concepts were added to the ILI repository.
In what follows we present a methodology that can be used for automatically building wordnets strictly aligned (that is, using only the EQ_SYNONYM relation) with an already available wordnet. We have started our experiment with the study of nouns, so the data presented here are valid only for this grammatical category.
We call the already available wordnet the Source wordnet (as mentioned before, this is usually a version of PWN), and the wordnet to be built and linked with the Source wordnet will be named the Target wordnet.
The methodology we present has two phases. In the first one the synsets for the target language are automatically generated and mapped onto the source language synsets using a series of heuristics. In the second phase the salient relations that can be automatically imported are identified and the procedure for their import is explained.
The paper has the following organization. Firstly we state the implicit assumptions in building a wordnet strictly aligned with other wordnets. Then we shortly describe the resources that one needs in order to apply the heuristics, and also the criteria we used in selecting the source language test synsets to be implemented. Finally, we state the problem to be solved in a more formal way, present the heuristics employed and evaluate their success against a case study (automatically building a Romanian wordnet using PWN 2.0).
2. Assumptions
The assumptions that we considered necessary for automatically building a target wordnet using a Source wordnet are the following:
1. There are word senses that can be clearly identified. This assumption is implicit when one builds a wordnet, aligned or not with other wordnets. This premise was extensively questioned, among others, by (Kilgarriff 1997), who thinks that word senses do not have a real ontological status, but exist only relative to a task. We will not discuss this issue here.
2. A rejection of the strong reading of the Sapir-Whorf hypothesis (Carroll 1964) (the principle of linguistic relativity). Simply stated, the principle of linguistic relativity says that
language shapes our thought. There are two variants of this principle: strong determinism and weak determinism. According to strong determinism, language and thought are identical. This hypothesis has few followers today, if any, and the evidence against it comes from various sources, among which is the possibility of translation into other languages. However, the weak version of the hypothesis is largely accepted. One can view reality and our organization of reality by analogy with the spectrum of colors, which is a continuum in which we place arbitrary boundaries (white, green, black, etc.). Different languages will cut this continuous spectrum differently. For example, Russian and Spanish have no words for the concept blue. This weak version of the principle of linguistic relativity warns us, however, that a specific source wordnet might not be usable for automatically building just any target wordnet. We further discuss this below.
3. The acceptance of the conceptualization made by the source wordnet. By conceptualization we understand the way in which the source wordnet sees reality by identifying the main concepts to be expressed and their relationships. For specifying how different languages can differ with respect to the conceptual space they reflect, we will follow (Sowa 1992), who considers three distinct dimensions:
accidental. The two languages have different notations for the same concepts. For example, the Romanian word măr and the English word apple lexicalize the same concept.
systematic. The systematic dimension defines the relation between the grammar of a language and its conceptual structures. It deals with the fact that some languages are SVO or VSO, etc., some are analytic and others agglutinative. Even if it is an important difference between languages, the systematic dimension has little import for our problem.
cultural. The conceptual space expressed by a language is determined by environmental, cultural and other factors. It could be the case, for example, that the concepts that define the legal systems of different countries are not mutually compatible. So when someone builds a wordnet starting from a source wordnet, he/she should ask himself/herself what the parts (if any) are that could be safely transferred into the target language; more precisely, what the parts are that share the same conceptual space.
The assumption that we make use of is that the differences between the two languages (source and target) are merely accidental: they have different lexicalizations for the same concepts. As the conceptual space is already expressed by the Source wordnet structure using a language notation, our task is to find the concepts' notations in the target language.
When the Source wordnet is not perfect (the real situation), a drawback of the automatic mapping approach is that all the mistakes existent in the source wordnet are transferred into the target wordnet. Consider the following senses of the noun hindrance in PWN:
1. hindrance, deterrent, impediment, balk, baulk, check, handicap -- (something immaterial that interferes with or delays action or progress)
   => cognitive factor -- (something immaterial (as a circumstance or influence) that contributes to producing a result)
2. hindrance, hitch, preventive, preventative, encumbrance, incumbrance, interference -- (any obstruction that impedes or is burdensome)
   => artifact, artefact -- (a man-made object taken as a whole)
We listed senses 1 and 2 of the word hindrance together with one of their hyperonyms. As one can see, in PWN a distinction is made between hindrance as a cognitive factor and hindrance as an artefact, so these are two different senses of the word hindrance. According to this definition a speed bump can be classified as a hindrance because it is an artifact, but a stone that stays in the path of someone cannot be one, because it is not a man-made object.
Another possible problem appears because the previously made assumption about the sameness of the conceptual space is not always true, as the following example shows:
mister, Mr -- (a form of address for a man)
sir -- (term of address for a man)
In Romanian both mister and sir in the listed senses are translated by the word domn. But in Romanian it would be artificial to create two distinct synsets for the word domn, as they are not different, not even as far as their connotations are concerned.
3. Selection of concepts and resources used
When we selected the set of synsets to be implemented in Romanian we followed two criteria.
The first criterion states that the selected set should be structured in the source wordnet (i.e. every selected synset should be linked by at least one semantic relation with other selected synsets). This is dictated by the methodology we have adopted (automatic mapping and automatic relation import). If we want to obtain a wordnet in the target language and not just some isolated synsets, this criterion is self-imposing.
The second criterion is related to the evaluation stage. To properly evaluate the built wordnet, it should be compared with a golden standard. The golden standard that we use will be the Romanian Wordnet (RoWN) developed in the BalkaNet project2.
2 One can argue that this Romanian wordnet is not perfect and is definitely incomplete. However, PWN is not perfect either. Moreover, it is indisputable that, at least in the case of ontologies (lexical or
For fulfilling both criteria we chose a subset of noun concepts from the RoWN that has the property that its projection on PWN 2.0 is closed under the hyperonymy and meronymy relations. Moreover, this subset includes the upper level part of the PWN lexical ontology. The projection of this subset on PWN 2.0 comprises 9716 synsets that contain 19624 literals.
For the purpose of automatic mapping of this subset we used an in-house dictionary built from many sources. The dictionary has two main components:
The first component consists of the above-mentioned 19624 literals and their Romanian translations. We must make sure that this part of the dictionary is as complete as possible. Ideally, all senses of the English words should be translated. For that we used the (Leviţchi & Bantaş 1992) dictionary and other dictionaries available on the web.
The second component of the dictionary is the (Leviţchi & Bantaş 1992) dictionary itself.
Some dictionaries (in our case the dictionary extracted from the already available Romanian wordnet) also have sense numbers specified, but, from our experience, this information is highly subjective, does not match the senses as defined by PWN and is not consistent across different dictionaries, so we chose to disregard it.
The second resource used is the Romanian Explanatory Dictionary (EXPD 1996), whose entries are numbered to reflect the dependencies between different senses of the same word.
4. Notation introduction
In this section we introduce the notations used in the paper and we outline the guiding idea of all the heuristics we used:
1. By TL we denote the target lexicon. In our experiment TL will contain Romanian words (nouns). TL = {rw_1, rw_2, ..., rw_m}, where rw_i, with i = 1..m, denotes a target word.
2. By SL we denote the source lexicon. In our case SL will contain English words (nouns). SL = {ew_1, ew_2, ..., ew_n}, where ew_j, with j = 1..n, denotes a source word.
3. WT and WS are the wordnets for the target language and the source language, respectively.
4. ew_j^k denotes the k-th sense of the word ew_j.
5. BD is a bilingual dictionary which acts as a bridge between SL and TL. BD = (SL, TL, M) is a 3-tuple, where M is a function that associates to each word in SL a set of words in TL. For an arbitrary word ew_j ∈ SL, M(ew_j) = {rw_1, rw_2, ..., rw_k}.
formal), a manually or semi-automatically built ontology is much
better than an automatically built one.
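Under the notation above, the bilingual dictionary BD can be pictured as a plain word-to-word-set mapping. The sketch below uses a few hypothetical entries for illustration; they are not taken from the paper's actual dictionary:

```python
# M maps SL (English) words to sets of TL (Romanian) words.
# Note that M maps words, not word senses.
M = {
    "apple": {"măr"},
    "hindrance": {"piedică", "obstacol"},
    "obstruction": {"obstacol"},
}

SL = set(M)                            # source lexicon
TL = set().union(*M.values())          # target lexicon

def translations(ew):
    """M(ew): the set of TL words the dictionary associates with an SL word."""
    return M.get(ew, set())
```

For instance, translations("hindrance") yields the set {"piedică", "obstacol"}; nothing in this mapping says which TL word corresponds to which sense of the SL word, which is exactly the problem the heuristics address.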
Formally, the bilingual dictionary maps words and not word senses. If word senses had been mapped, then building WT from WS would have been trivial.
If we ignore the information given by the definitions associated with word senses, then, formally, a sense of a word in the PWN is distinguished from other word senses only by the set of relations it contracts in the semantic network. This set of relations defines what is called the position of a word in the semantic network. Ideally, every sense of a word should be unambiguously identified by a set of connections; it should have a unique position in the semantic net. Unfortunately this is not the case in PWN. There are many cases when different senses of a word have the same position in the semantic network (i.e. they have precisely the same connections with other word senses).
The idea of our heuristics could be summed up in three points:
1. Increase the number of relations in the Source wordnet to obtain a unique position for each word sense. For this an external resource can be used to which the wordnet is linked, such as WordNet Domains.
2. Try to derive useful relations between the words in the target language. For this one can use corpora, monolingual dictionaries, already classified sets of documents, etc.
3. In the mapping stage of the procedure take advantage of the structures built at points 1 and 2.
We have developed so far a set of four heuristics and we plan to supplement them in the future.
5. The first heuristic rule
The first heuristic exploits the fact that synonymy enforces equivalence classes on word senses.
Let EnSyn = {ew_j1^i1, ew_j2^i2, ..., ew_jn^in} (where ew_j1, ew_j2, ..., ew_jn are the words in the synset and the superscripts denote their sense numbers) be an SL synset, with length(EnSyn) > 1. We impose the length of a synset to be greater than one when at least one component word is not a variant of the other words. So we disregard synsets such as {artefact, artifact}. For achieving this we computed the well-known Levenshtein distance between the words in the synset. The BD translations of the words in the synset will be:

M(ew_j1) = {rw_i11, ..., rw_i1m}
M(ew_j2) = {rw_i21, ..., rw_i2k}
...
M(ew_jn) = {rw_in1, ..., rw_int}

We build the corresponding TL synset as:
1. M(ew_jk) if there exists ew_jk ∈ EnSyn such that the number of senses NoSenses(ew_jk) = 1
2. M(ew_j1) ∩ M(ew_j2) ∩ ... ∩ M(ew_jn) otherwise.

Words belonging to the same synset in SL should have a common translation in TL. Above we distinguished two cases:
1. At least one of the words in the synset is monosemous. In this case we build the TL synset as the set of translations of the monosemous word.
2. All words in the synset are polysemous. The corresponding TL synset will be constructed as the intersection of all TL translations of the SL words in the synset.
Taking the actual RoWN as a gold standard, we can evaluate the results of our heuristics by comparing the obtained synsets with those in the RoWN. We distinguish five possible cases:
1. The synsets are equal (this case will be labeled Identical).
2. The generated synset has all the literals of the correct synset and some more (Over-generation).
3. The generated synset and the golden one have some literals in common and some different (Overlap).
4. The generated synset's literals form a proper subset of the golden synset (Under-generation).
5. The generated synset has no literals in common with the correct one (Disjoint).
The cases Over-generation, Overlap and Disjoint will be counted as errors. The other two cases, namely Identical and Under-generation, will be counted as successes3.
The evaluation of the first heuristic is given in Table 1, at the end of section 9. The "percent mapped" column contains the percentage of synsets mapped by the heuristic out of the total number of synsets (9716). The "percent errors" column represents the percentage of synsets, out of the mapped ones, wrongly assigned by the heuristic. The high number of mapped synsets proves the quality of the first part of the dictionary we used. The only type of error we encountered is Over-generation.
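The two construction cases of the first heuristic, together with the Levenshtein filter for spelling variants, can be sketched as follows. This is a minimal sketch: the dictionary, the sense counts and the variant-distance threshold of 1 are hypothetical illustrations, not the paper's actual data:

```python
def levenshtein(a, b):
    """Classic edit distance, used to detect spelling variants such as artefact/artifact."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_tl_synset(en_syn, M, no_senses):
    """Heuristic 1: if a member of the SL synset is monosemous, take its
    translations; otherwise intersect the translations of all members."""
    # Disregard synsets such as {artefact, artifact}, whose members are
    # mere spelling variants of each other.
    if len(en_syn) < 2 or all(levenshtein(a, b) <= 1
                              for a in en_syn for b in en_syn if a != b):
        return set()
    for w in en_syn:                       # case 1: a monosemous member exists
        if no_senses[w] == 1:
            return set(M[w])
    tl_syn = set(M[en_syn[0]])             # case 2: intersect all translation sets
    for w in en_syn[1:]:
        tl_syn &= set(M[w])
    return tl_syn
```

With M = {"hindrance": ["piedică", "obstacol"], "impediment": ["obstacol"]} and both words polysemous, the generated TL synset is {"obstacol"}.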
6. The second heuristic rule
The second heuristic draws on the fact that, in the case of nouns, the hyperonymy relation can be interpreted as an IS-A relation4. It is also based on two related observations:
1. A hyperonym and its hyponyms carry some common information.
2. The information common to the hyperonym and the hyponym increases as you go down in the hierarchy.
Let EnSyn1 = {ew_11^i11, ..., ew_1t^i1t} and EnSyn2 = {ew_21^i21, ..., ew_2s^i2s} be two SL synsets such that EnSyn1 HYP EnSyn2, meaning that EnSyn1 is a hyperonym of EnSyn2. Then we generate the translation lists of the words in the synsets. The intersection is computed as:

TL_EnSyn1 = M(ew_11) ∪ M(ew_12) ∪ ... ∪ M(ew_1t)
TL_EnSyn2 = M(ew_21) ∪ M(ew_22) ∪ ... ∪ M(ew_2s)

3 The Under-generation case means that the resulting synset is not rich enough; it does not mean that it is incorrect.
4 This is not entirely true, because in PWN the hyperonymy relation can also be interpreted as an INSTANCE-OF relation, as some instances are also included in PWN (e.g. New York, Adam, etc.).
The generated synset in the target language will be computed as:

TL_Synset = TL_EnSyn1 ∩ TL_EnSyn2

Given the above considerations, it is possible that a hyponym and its hyperonym have the same translation in the other language, and this is the more probable the lower you descend in the hierarchy. The procedure formally described above is applied for each synset in the source list. It generates the lists of common translations for all words in the hyperonym and hyponym synsets and then constructs the TL synsets by intersecting these lists. In case the intersection is not empty, the created synset will be assigned to both SL synsets.
Because the procedure generates autohyponym synsets, this could be an indication that the created synsets could be clustered in TL.
It is possible that a TL synset be assigned to two different source synset pairs, as in the figure below. So we need to perform a clean-up procedure and choose the assignment that maximizes the sum of the depth levels of the two synsets.
[Figure: three synsets on levels k, k+1 and k+2 of the hierarchy; the middle synset shares common translations both with its hyperonym (level k) and with its hyponym (level k+2).]
In the figure, common information is found between the middle synset and the upper synset (its hyperonym) and also between the middle synset and the lower synset (its hyponym). Our procedure will prefer the second assignment.
The results of the second heuristic are presented in Table 2, at the end of section 9. The low number of mapped synsets (10%) is due to the fact that we did not find many common translations between hyperonyms and their hyponyms.
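The second heuristic amounts to intersecting the pooled translations of a hyperonym synset with those of one of its hyponym synsets. A minimal sketch, with hypothetical dictionary entries:

```python
def translation_list(synset, M):
    """Pool the BD translations of all words in an SL synset."""
    pooled = set()
    for w in synset:
        pooled |= set(M.get(w, ()))
    return pooled

def heuristic2(hyper_syn, hypo_syn, M):
    """Common TL translations of a hyperonym synset and one of its hyponyms.
    A non-empty result is assigned to both SL synsets (autohyponymy in TL)."""
    return translation_list(hyper_syn, M) & translation_list(hypo_syn, M)
```

For example, with M = {"obstruction": ["obstacol"], "hindrance": ["obstacol", "piedică"]}, the common translation {"obstacol"} would be assigned to both the hyperonym and the hyponym synset.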
7. The third heuristic
The third heuristic takes advantage of an external relation imposed over the wordnet. At IRST, PWN 1.6 was augmented with a set of domain labels, the resulting resource being called WordNet Domains (Magnini & Cavaglia 2000). PWN 1.6 synsets have been semi-automatically linked with a set of 200 domain labels taken from the Dewey Decimal Classification, the world's most widely used library classification system. The domain labels are hierarchically organized and each synset received one or more domain labels. For the synsets that cannot be labeled unambiguously the default label factotum has been used.
Because in the BalkaNet project the RoWN has been aligned with PWN 2.0, we performed a mapping
between PWN 1.6 and PWN 2.0. By comparison with PWN 1.6, PWN 2.0 has additional new synsets and its structure is also slightly modified. As a consequence, not all PWN 2.0 synsets can be reached from PWN 1.6, either because they are completely new or because they could not be unambiguously mapped. This results in some PWN 2.0 synsets that have no domains. For their labelling we used the three rules below:
1. If one of the direct hyperonyms of the unlabeled synset has been assigned a domain, then the synset will automatically receive the father's domain; conversely, if one of the hyponyms is labelled with a domain and the father lacks a domain, then the father synset will receive the son's domain.
2. If a holonym of the synset is assigned a specific domain, then the meronym will receive the holonym's domain, and conversely.
3. If a domain label cannot be assigned, then the synset will receive the default factotum label.
The idea of using domains is helpful for distinguishing word senses (different senses of a word are assigned to different domains). The best case is when each sense of a word has been assigned a distinct domain. But even if the same domain label is assigned to two or more senses of a word, in most cases we can assume that this is a strong indication of a fine-grained distinction. It is very probable that the distinction is preserved in the target language by the same word.
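The three labelling rules above can be sketched as a single pass over the unlabeled synsets. The graph encoding below (synset id to related ids) is a hypothetical stand-in for the actual PWN data structures:

```python
def propagate_domains(synsets, hypernyms, holonyms):
    """Assign domains to unlabeled synsets using rules 1-3 above.
    `synsets` maps id -> set of domain labels (empty if unlabeled);
    `hypernyms` / `holonyms` map id -> list of related synset ids."""
    def neighbours(sid, rel):
        # each rule applies in both directions (father<->son, holonym<->meronym)
        out = list(rel.get(sid, []))
        out += [s for s, targets in rel.items() if sid in targets]
        return out

    for sid, domains in synsets.items():
        if domains:
            continue
        for rel in (hypernyms, holonyms):          # rule 1, then rule 2
            for nb in neighbours(sid, rel):
                if synsets.get(nb):
                    domains |= synsets[nb]         # copy the neighbour's domains
                    break
            if domains:
                break
        if not domains:                            # rule 3: default label
            domains.add("factotum")
    return synsets
```

A single pass covers direct neighbours only; in practice one could iterate until no synset changes, but the paper does not specify this, so it is left as a design choice here.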
We labelled every word in the BD dictionary with its domain label. For English words the domain is automatically generated from the English synset labels. For labelling Romanian words we used two methods:
1. We downloaded a collection of documents from web directories such that the categories of the downloaded documents match the categories used in WordNet Domains. The downloaded document set underwent a pre-processing procedure with the following steps:
a. Feature extraction. The first phase consists in finding a set of terms that represents the documents adequately. The documents were POS tagged and lemmatized and the nouns were selected as features.
b. Feature selection. In this phase the features that provide less information were eliminated. For this we used the well-known χ² statistic. The χ² statistic checks whether there is a relationship between membership in a certain group and a characteristic that we want to study. In our case we want to measure the dependency between a term t and a category c. The formula for χ² is:

χ²(t, c) = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D))

where:
A is the number of times t and c co-occur
B is the number of times t occurs without c
C is the number of times c occurs without t
D is the number of times neither c nor t occurs
N is the total number of documents

For each category we computed the χ² score between that category and the noun terms of our documents. Then, for choosing the terms that discriminate well for a certain category, we used the formula below (where m denotes the number of categories):

χ²_max(t) = max_{i=1..m} χ²(t, c_i)
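The χ² term-category statistic can be sketched directly from the contingency counts; the toy counts in the example are illustrative only:

```python
def chi_square(A, B, C, D):
    """χ²(t, c) for a term/category contingency table:
    A: docs with both t and c;  B: t without c;
    C: c without t;             D: neither."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom

def chi_square_max(term_category_counts):
    """χ²_max(t): the best category score for a term, over the m categories."""
    return max(chi_square(*counts) for counts in term_category_counts)
```

For a term perfectly correlated with a category (A = D = 10, B = C = 0) the score equals N: chi_square(10, 0, 0, 10) gives 20.0, while an independent term (A = B = C = D = 5) scores 0.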
2. We took advantage of the fact that some words have already been assigned subject codes in various dictionaries. We performed a manual mapping of these codes onto the domain labels used at IRST. The Romanian words that could not be associated with domain information were assigned the default factotum domain.
The following is a BD dictionary entry augmented with domain information:

M([ew_1, D1]) = {[rw_1, D2], [rw_2, D3], [rw_i, D4]}

In the square brackets the domains that pertain to each word are listed.
Let again EnSyn1 = {ew_1^i1, ..., ew_n^in} be an SL synset and D_i its associated domain. Then the TL synset will be constructed as follows:

TL_Synset = ∪_{m=1..n} M(ew_m)

where each rw_i ∈ M(ew_m) has the property that its domain matches the domain of EnSyn1, that is: it either is the same as the domain of EnSyn1, subsumes the domain of EnSyn1 in the IRST domain labels hierarchy, or is subsumed by the domain of EnSyn1 in the IRST domain labels hierarchy.
For each synset in the SL we generated all the translations of its literals in the TL. Then the TL synset is built using only those TL literals whose domain matches the SL synset domain.
The results of this heuristic are given in Table 3 atthe end of section 9.
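The domain-matching construction of the third heuristic can be sketched as follows; the small child-to-parent domain hierarchy and the dictionary entries are hypothetical illustrations, not the IRST label set:

```python
def domain_matches(d, syn_domain, parents):
    """True if d equals, subsumes, or is subsumed by syn_domain in a
    domain hierarchy given as a child -> parent mapping."""
    def ancestors(x):
        seen = set()
        while x in parents:
            x = parents[x]
            seen.add(x)
        return seen
    return d == syn_domain or d in ancestors(syn_domain) or syn_domain in ancestors(d)

def heuristic3(en_syn, syn_domain, M, tl_domains, parents):
    """TL synset = union of translations whose domain matches the SL synset domain."""
    out = set()
    for ew in en_syn:
        for rw in M.get(ew, ()):
            if any(domain_matches(d, syn_domain, parents)
                   for d in tl_domains.get(rw, ())):
                out.add(rw)
    return out
```

For instance, with a hierarchy where "tennis" is a child of "sport", a TL word labelled "tennis" matches an SL synset labelled "sport", while one labelled "factotum" does not.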
8. The fourth heuristic rule
The fourth heuristic takes advantage of the fact that the source synsets have an associated gloss and also that target words that are translations of source words have associated glosses in EXPD. As with the third heuristic, the procedure comprises a preprocessing phase. We preprocessed both resources (PWN and EXPD):
1. We automatically lemmatized and tagged all the glosses of the synsets in the SL.
2. We automatically lemmatized and tagged all the definitions of the words that are translations of SL words.
3. We chose as features for representing the glosses the set of nouns.

The target definitions were automatically translated using the bilingual dictionary. All possible source definitions were generated by translating each lemmatized noun in the TL definition. Thus, if a TL definition of one TL word is represented by the vector [rw_1, rw_2, ..., rw_p], then the number of SL vectors generated will be N = n_d * t_w1 * t_w2 * ... * t_wp, where n_d is the number of definitions the target word has in the monolingual dictionary (EXPD) and t_wk, with k = 1..p, is the number of translations that the noun w_k has in the bilingual dictionary.

By RGloss we denote the set of SL representation vectors of the TL glosses of a TL word: RGloss = {T_1, T_2, ..., T_n}. By S_v we denote the vector of the SL synset gloss. The procedure for generating the TL synset is: for each SL synset we generate the TL list of the translations of all words in the synset. Then for each word in the TL list of translations we compute the similarity between S_v and its RGloss. The computation is done in two steps:
1. We give the vectors S_v and RGloss a binary representation. The number of positions (m) a vector has will be equal to the number of distinct words existent in S_v and in all the vectors of RGloss. A 1 in a vector means that the corresponding word is present and a 0 means that it is absent.
2. For each vector T_i in RGloss we compute the product S_v · T_i = Σ_{j=1..m} s_j * t_j. If there exists at least one T_i such that S_v · T_i >= 2, we compute max(S_v · T_i) and we add the word to the TL synset.

Notice that by using this heuristic rule we can automatically add a gloss to the TL synset. As one can see in Table 4 at the end of section 9, the number of incomplete synsets is high. The low percentage of mapped synsets is due to the low agreement between the glosses in Romanian and English.

9. Combining results
For choosing the final synsets we devised a set of meta-rules by evaluating the pros and cons of each heuristic rule. For example, given the high quality of the dictionary, the probability that the first heuristic will fail is very low, so the synsets obtained using it are automatically selected. A synset obtained using the other heuristics will be selected, and moreover will replace a synset obtained using the first heuristic, only if it is obtained independently using heuristics 3 and 2, or using heuristics 3 and 4. A synset not selected by the above meta-rules will still be selected if it is obtained by heuristic 3 and the ambiguity of its members is at most 2. Table 5 at the end of this section shows the combined results of our heuristics.
As one can observe there, for 106 synsets in PWN 2.0 the Romanian equivalent synsets could not be found. There also resulted 635 synsets smaller than the corresponding synsets in the RoWN.
| Mapped synsets | Percent mapped | Over-generation (error) | Overlap (error) | Disjoint (error) | Under-generation (correct) | Identical (correct) | Percent errors |
| 8493 | 87 | 210 | 0 | 0 | 300 | 7983 | 2 |
Table 1: The results of the first heuristic
| Mapped synsets | Percent mapped | Over-generation (error) | Overlap (error) | Disjoint (error) | Under-generation (correct) | Identical (correct) | Percent errors |
| 1028 | 10 | 213 | 0 | 150 | 230 | 435 | 35 |
Table 2: The results of the second heuristic
| Mapped synsets | Percent mapped | Over-generation (error) | Overlap (error) | Disjoint (error) | Under-generation (correct) | Identical (correct) | Percent errors |
| 7520 | 77 | 689 | 0 | 0 | 0 | 6831 | 9 |
Table 3: The results of the third heuristic
| Mapped synsets | Percent mapped | Over-generation (error) | Overlap (error) | Disjoint (error) | Under-generation (correct) | Identical (correct) | Percent errors |
| 3527 | 36 | 25 | 0 | 78 | 547 | 2877 | 3 |
Table 4: The results of the fourth heuristic
| Mapped synsets | Percent mapped | Over-generation (error) | Overlap (error) | Disjoint (error) | Under-generation (correct) | Identical (correct) | Percent errors |
| 9610 | 98 | 615 | 0 | 250 | 635 | 8110 | 9 |
Table 5: The combined results of the heuristics
10. Import of relations
After building the target synsets, an investigation of the nature of the relations that structure the source wordnet should be made to establish which of them can be safely transferred into the target wordnet. As one expects, the conceptual relations can be safely transferred because these relations hold between concepts. The only lexical relation that holds between nouns and that was subject to scrutiny was the antonymy relation. We concluded that this relation can also be safely imported. The importing algorithm works as described below.
If two source synsets S1 and S2 are linked by a semantic relation R in WS and if T1 and T2 are the corresponding aligned synsets in WT, then they will be linked by the relation R. If in WS there are intervening synsets between S1 and S2, then we will set the relation R between the corresponding TL synsets only if R is declared a transitive relation (R+, unlimited number of compositions, e.g. hypernymy) or a partially transitive relation (Rk, with k a user-specified maximum number of compositions, larger than the number of intervening synsets between S1 and S2). For instance, we defined all the holonymy relations as partially transitive (k=3).
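The import rule can be sketched as a bounded breadth-first search along relation R; the graph encoding and the toy synset ids below are hypothetical:

```python
def import_relation(s1, s2, rel_edges, mapping, k=None):
    """If S1 and S2 are connected via relation R through at most k compositions
    (k=None meaning R is fully transitive), link their aligned TL synsets.
    `rel_edges` maps a synset id to the ids it points to via R;
    `mapping` maps source synset ids to their aligned target synset ids."""
    if s1 not in mapping or s2 not in mapping:
        return None                        # both endpoints must be aligned
    visited = {s1}
    frontier = set(rel_edges.get(s1, ()))
    depth = 1                              # number of compositions so far
    while frontier:
        if s2 in frontier:
            return (mapping[s1], mapping[s2])   # set T1 R T2
        if k is not None and depth >= k:
            return None                    # partial transitivity bound exceeded
        visited |= frontier
        frontier = {n for s in frontier for n in rel_edges.get(s, ())} - visited
        depth += 1
    return None
```

With a chain a -> b -> c and only a and c aligned, a partially transitive relation with k=3 still links the two target synsets, while k=1 does not.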
11. Conclusion and future work
Other experiments of automatically building wordnets that we are aware of are (Atserias et al. 1997) and (Lee et al. 2000). They combine several methods, using monolingual and bilingual dictionaries for obtaining a Spanish wordnet and a Korean one, respectively, starting from PWN 1.5.
However, our approach is characterized by the fact that it gives an accurate evaluation of the results by automatically comparing them with a manually built wordnet. We also explicitly state the assumptions of this automatic approach. Our approach is the first to use an external resource (WordNet Domains) in the process of automatically building a wordnet.
We obtained a version of RoWN that contains 9610 synsets and 11969 relations with 91% accuracy.
The results obtained encourage us to develop other heuristics. The success of our procedure was facilitated by the quality of the bilingual dictionary we used.
Some heuristics developed here may be applied for the automatic construction of synsets of other parts of speech. That is why we also plan to extend our experiment to adjectives and verbs. Their evaluation would be of great interest in our opinion.
Finally, we would like to thank the three anonymous reviewers for helping us improve the final version of the paper.
References:
(Atserias et al. 1997) J. Atserias, S. Climent, X. Farreres, G. Rigau, H. Rodríguez, Combining Multiple Methods for the Automatic Construction of Multilingual
WordNets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, 1997.
(Carroll 1964) J. B. Carroll (Ed.). 1964. Language, Thought and Reality: Selected Writings of Benjamin Lee Whorf, The MIT Press, Cambridge, MA.
(EXPD 1996) Dicţionarul explicativ al limbii române, 2nd edition, Bucureşti, Univers Enciclopedic, 1996.
(Fellbaum 1998) Ch. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.
(Kilgarriff 1997) A. Kilgarriff, I don't believe in word senses. In Computers and the Humanities, 31 (2), 91-113, 1997.
(Lee et al. 2000) C. Lee, G. Lee, J. Seo, Automatic WordNet mapping using Word Sense Disambiguation. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC 2000), Hong Kong, 2000.
(Leviţchi & Bantaş 1992) L. Leviţchi and A. Bantaş, Dicţionar englez-român, Bucureşti, Teora, 1992.
(Magnini & Cavaglia 2000) B. Magnini and G. Cavaglia, Integrating subject field codes into WordNet. In Proceedings of LREC-2000, Athens, Greece.
(Sowa 1992) J. F. Sowa, Logical Structure in the Lexicon. In J. Pustejovsky and S. Bergler (Eds.) Lexical Semantics and Commonsense Reasoning, LNAI 627, Springer-Verlag, Berlin, 39-60, 1992.
(Tufiş 2004) D. Tufiş (Ed.), Special Issue on the BalkaNet Project of Romanian Journal of Information Science and Technology, vol. 7, no. 1-2, 2004.
(Vossen 1998) P. Vossen, A Multilingual Database with
Lexical Semantic Networks, Dordrecht, Kluwer, 1998.