
    Automatic Building of Wordnets

Eduard Barbu* and Verginica Barbu Mititelu**

* Graphitech Italy, 2, Salita Dei Molini, 38050 Villazzano, Trento, Italy, [email protected]

** Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania, [email protected]

    Abstract

In what follows we will present a two-phase methodology for automatically building a wordnet (that we call target wordnet) strictly aligned with an already available wordnet (source wordnet). In the first phase the synsets for the target language are automatically generated and mapped onto the source language synsets using a series of heuristics. In the second phase the salient relations that can be automatically imported are identified and the procedure for their import is explained. The assumptions behind such a methodology will be stated, the heuristics employed will be presented and their success evaluated against a case study (automatically building a Romanian wordnet using PWN).

1. Introduction

The importance of a wordnet for NLP applications can hardly be overestimated. The Princeton WordNet (PWN) (Fellbaum 1998) is now a mature lexical ontology which has demonstrated its efficiency in a variety of tasks (word sense disambiguation, machine translation, information retrieval, etc.). Inspired by the success of PWN, many languages started to develop their own wordnets taking PWN as a model (cf. http://www.globalwordnet.org/gwa/wordnet_table.htm). Furthermore, in both the EuroWordNet (Vossen 1998) and BalkaNet (Tufiș 2004) projects the synsets from different versions of PWN (1.5 and 2.0) were used as ILI repositories. The created wordnets were linked by means of interlingual relations through this ILI repository1.

The rapid progress in building a new wordnet and linking it with an already tested wordnet (usually PWN) is hindered by the amount of time and effort needed for developing such a resource. To take a recent example, the development of core wordnets (of about 20000 synsets, as is the case with the Romanian wordnet) for the Balkan languages took three years (2001-2004).

1 In both projects a number of synsets expressing language-specific concepts were added to the ILI repository.

In what follows we present a methodology that can be used for automatically building wordnets strictly aligned (that is, using only the EQ_SYNONYM relation) with an already available wordnet. We have started our experiment with the study of nouns, so the data presented here are valid only for this grammatical category.

We call the wordnet already available the Source wordnet (as mentioned before, this is usually a version of PWN), and the wordnet to be built and linked with the Source wordnet will be named the Target wordnet.

The methodology we present has two phases. In the first one the synsets for the target language are automatically generated and mapped onto the source language synsets using a series of heuristics. In the second phase the salient relations that can be automatically imported are identified and the procedure for their import is explained.

The paper has the following organization. Firstly, we state the implicit assumptions in building a wordnet strictly aligned with other wordnets. Then we shortly describe the resources that one needs in order to apply the heuristics, as well as the criteria we used in selecting the source language test synsets to be implemented. Finally, we state the problem to be solved in a more formal way, present the heuristics employed and evaluate their success against a case study (automatically building a Romanian wordnet using PWN 2.0).

2. Assumptions

The assumptions that we considered necessary for automatically building a target wordnet using a Source wordnet are the following:

1. There are word senses that can be clearly identified. This assumption is implicit whenever one builds a wordnet, aligned or not with other wordnets. This premise was extensively questioned, among others, by (Kilgarriff 1997), who argues that word senses do not have a real ontological status, but exist only relative to a task. We will not discuss this issue here.

2. A rejection of the strong reading of the Sapir-Whorf hypothesis (Carroll 1964) (the principle of linguistic relativity). Simply stated, the principle of linguistic relativity says that


language shapes our thought. There are two variants of this principle: strong determinism and weak determinism. According to strong determinism, language and thought are identical. This hypothesis has few followers today, if any, and the evidence against it comes from various sources, among which is the possibility of translation into other languages. However, the weak version of the hypothesis is largely accepted. One can view reality and our organization of reality by analogy with the spectrum of colors, which is a continuum in which we place arbitrary boundaries (white, green, black, etc.). Different languages will cut this continuous spectrum differently. For example, Russian and Spanish have no words for the concept blue. This weak version of the principle of linguistic relativity warns us, however, that a specific source wordnet could not be used for automatically building any target wordnet. We further discuss this below.

3. The acceptance of the conceptualization made by the source wordnet. By conceptualization we understand the way in which the source wordnet sees reality by identifying the main concepts to be expressed and their relationships. For specifying how different languages can differ with respect to the conceptual space they reflect, we will follow (Sowa 1992), who considers three distinct dimensions:

accidental. The two languages have different notations for the same concepts. For example, the Romanian word măr and the English word apple lexicalize the same concept.

systematic. The systematic dimension defines the relation between the grammar of a language and its conceptual structures. It deals with the fact that some languages are SVO or VSO, etc., some are analytic and others agglutinative. Even if this is an important difference between languages, the systematic dimension has little import for our problem.

cultural. The conceptual space expressed by a language is determined by environmental, cultural factors, etc. It could be the case, for example, that the concepts that define the legal systems of different countries are not mutually compatible. So when someone builds a wordnet starting from a source wordnet he/she should ask himself/herself what the parts (if any) are that can be safely transferred into the target language. More precisely, what the parts are that share the same conceptual space.

The assumption that we make use of is that the differences between the two languages (source and target) are merely accidental: they have different lexicalizations for the same concepts. As the conceptual space is already expressed by the Source wordnet structure using a language notation, our task is to find the concepts' notations in the target language.

When the Source wordnet is not perfect (the real situation), a drawback of the automatic mapping approach is that all the mistakes existent in the source wordnet are transferred into the target wordnet. Consider the following senses of the noun hindrance in PWN:

1. hindrance, deterrent, impediment, balk, baulk, check, handicap -- (something immaterial that interferes with or delays action or progress)
   => cognitive factor -- (something immaterial (as a circumstance or influence) that contributes to producing a result)

2. hindrance, hitch, preventive, preventative, encumbrance, incumbrance, interference -- (any obstruction that impedes or is burdensome)
   => artifact, artefact -- (a man-made object taken as a whole)

We listed senses 1 and 2 of the word hindrance together with one of their hyperonyms. As one can see, in PWN a distinction is made between hindrance as a cognitive factor and hindrance as an artefact, so that these are two different senses of the word hindrance. According to this definition a speed bump can be classified as a hindrance because it is an artifact, but a stone that lies in someone's path cannot be one, because it is not a man-made object.

Another possible problem appears because the previously made assumption about the sameness of the conceptual space is not always true, as the following example shows:

mister, Mr -- (a form of address for a man)
sir -- (term of address for a man)

In Romanian both mister and sir in the listed senses are translated by the word domn. But in Romanian it would be artificial to create two distinct synsets for the word domn, as they are not different, not even as far as their connotations are concerned.

3. Selection of concepts and resources used

When we selected the set of synsets to be implemented in Romanian we followed two criteria.

The first criterion states that the selected set should be structured in the source wordnet (i.e. every selected synset should be linked by at least one semantic relation with other selected synsets). This is dictated by the methodology we have adopted (automatic mapping and automatic relation import). If we want to obtain a wordnet in the target language and not just some isolated synsets, this criterion is self-imposing.

The second criterion is related to the evaluation stage. To properly evaluate the built wordnet, it should be compared with a gold standard. The gold standard that we use is the Romanian wordnet (RoWN) developed in the BalkaNet project2.

2 One can argue that this Romanian wordnet is not perfect and is definitely incomplete. However, PWN is not perfect either. Moreover, it is indisputable that, at least in the case of ontologies (lexical or formal), a manually or semi-automatically built ontology is much better than an automatically built one.


For fulfilling both criteria we chose a subset of noun concepts from the RoWN with the property that its projection on PWN 2.0 is closed under the hyperonymy and the meronymy relations. Moreover, this subset includes the upper level part of the PWN lexical ontology. The projection of this subset on PWN 2.0 comprises 9716 synsets that contain 19624 literals.

For the purpose of automatic mapping of this subset we used an in-house dictionary built from many sources. The dictionary has two main components.

The first component consists of the above-mentioned 19624 literals and their Romanian translations. We must make sure that this part of the dictionary is as complete as possible: ideally, all senses of the English words should be translated. For that we used the (Levițchi & Bantaș 1992) dictionary and other dictionaries available on the web.

The second component of the dictionary is the (Levițchi & Bantaș 1992) dictionary itself. Some dictionaries (in our case the dictionary extracted from the already available Romanian wordnet) also have sense numbers specified, but, from our experience, this information is highly subjective, does not match the senses as defined by PWN and is not consistent across different dictionaries, so we chose to disregard it.

The second resource used is the Romanian Explanatory Dictionary (EXPD 1996), whose entries are numbered to reflect the dependencies between different senses of the same word.

4. Notation introduction

In this section we introduce the notations used in the paper and we outline the guiding idea of all the heuristics we used:

1. By TL we denote the target lexicon. In our experiment TL will contain Romanian words (nouns). TL = {rw_1, rw_2, ..., rw_m}, where rw_i, with i = 1..m, denotes a target word.

2. By SL we denote the source lexicon. In our case SL will contain English words (nouns). SL = {ew_1, ew_2, ..., ew_n}, where ew_j, with j = 1..n, denotes a source word.

3. WT and WS are the wordnets for the target language and the source language, respectively.

4. ew_j^k denotes the k-th sense of the word ew_j.

5. BD is a bilingual dictionary which acts as a bridge between SL and TL. BD = (SL, TL, M) is a 3-tuple, where M is a function that associates to each word in SL a set of words in TL. For an arbitrary word ew_j ∈ SL, M(ew_j) = {rw_1, rw_2, ..., rw_k}.

Formally, the bilingual dictionary maps words and not word senses. If word senses had been mapped, then building WT from a WS would have been trivial.
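To make the notation concrete, here is a minimal sketch of how BD and the function M can be represented, assuming a plain word-to-translations mapping; the sample entries and names are invented for illustration only.

```python
from typing import Dict, Set

# BD seen through the function M: each English (SL) word is mapped to the set of
# its Romanian (TL) translations; word senses are not distinguished at this level.
BD: Dict[str, Set[str]] = {
    "apple": {"măr"},                      # hypothetical entries,
    "hindrance": {"piedică", "obstacol"},  # for illustration only
}

def M(source_word: str) -> Set[str]:
    """Return the set of TL translations of an SL word (empty set if unknown)."""
    return BD.get(source_word, set())
```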

If we ignore the information given by the definitions associated with word senses, then, formally, a sense of a word in PWN is distinguished from other word senses only by the set of relations it contracts in the semantic network. This set of relations defines what is called the position of a word in the semantic network. Ideally, every sense of a word should be unambiguously identified by a set of connections; it should have a unique position in the semantic net. Unfortunately this is not the case in PWN. There are many cases when different senses of a word have the same position in the semantic network (i.e. they have precisely the same connections with other word senses).

The idea of our heuristics can be summed up in three points:

1. Increase the number of relations in the Source wordnet in order to obtain a unique position for each word sense. For this, an external resource to which the wordnet is linked can be used, such as WordNet Domains.

2. Try to derive useful relations between the words in the target language. For this one can use corpora, monolingual dictionaries, already classified sets of documents, etc.

3. In the mapping stage of the procedure, take advantage of the structures built at points 1 and 2.

We have developed so far a set of four heuristics and we plan to supplement them in the future.

5. The first heuristic rule

The first heuristic exploits the fact that synonymy enforces equivalence classes on word senses.

Let EnSyn = {ew_{j_1}^{i_1}, ew_{j_2}^{i_2}, ..., ew_{j_n}^{i_n}} (where ew_{j_1}, ew_{j_2}, ..., ew_{j_n} are the words in the synset and the superscripts denote their sense numbers) be an SL synset with length(EnSyn) > 1. We impose the length of a synset to be greater than one when at least one component word is not a variant of the other words; so we disregard synsets such as {artefact, artifact}. For achieving this we computed the well-known Levenshtein distance between the words in the synset. The BD translations of the words in the synset will be:

M(ew_{j_1}^{i_1}) = {rw_{1,1}, ..., rw_{1,m}}
M(ew_{j_2}^{i_2}) = {rw_{2,1}, ..., rw_{2,k}}
...
M(ew_{j_n}^{i_n}) = {rw_{n,1}, ..., rw_{n,t}}

We build the corresponding TL synset as:

1. M(ew_{j_k}^{i_k}), if there exists a word ew_{j_k}^{i_k} ∈ EnSyn such that its number of senses NoSenses(ew_{j_k}^{i_k}) = 1;

2. M(ew_{j_1}^{i_1}) ∩ M(ew_{j_2}^{i_2}) ∩ ... ∩ M(ew_{j_n}^{i_n}) otherwise.

Words belonging to the same synset in SL should have a common translation in TL. Above we distinguished two cases:

1. At least one of the words in the synset is monosemous. In this case we build the TL synset as the set of translations of the monosemous word.

2. All words in the synset are polysemous. The corresponding TL synset will be constructed as the intersection of all TL translations of the SL words in the synset.

Taking the actual RoWN as a gold standard, we can evaluate the results of our heuristics by comparing the obtained synsets with those in the RoWN. We distinguish five possible cases:

1. The synsets are equal (this case will be labeled Identical).
2. The generated synset has all the literals of the correct synset and some more (Over-generation).
3. The generated synset and the gold one have some literals in common and some different (Overlap).
4. The generated synset literals form a proper subset of the gold synset (Under-generation).
5. The generated synset has no literals in common with the correct one (Disjoint).

The cases Over-generation, Overlap and Disjoint will be counted as errors. The other two cases, namely Identical and Under-generation, will be counted as successes3.

The evaluation of the first heuristic is given in Table 1, at the end of section 9. The "percent mapped" column contains the percentage of synsets mapped by the heuristic out of the total number of synsets (9716). The "percent errors" column represents the percentage of mapped synsets wrongly assigned by the heuristic. The high number of mapped synsets proves the quality of the first part of the dictionary we used. The only type of error we encountered is Over-generation.
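The following is a minimal sketch of the first heuristic, assuming the dictionary is available as a mapping M from English words to sets of Romanian translations and that the number of senses of each English word is known; the variant filter based on the Levenshtein distance and its threshold are our own illustrative choices.

```python
from typing import Dict, List, Set

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def first_heuristic(en_syn: List[str],
                    M: Dict[str, Set[str]],
                    no_senses: Dict[str, int],
                    variant_threshold: int = 2) -> Set[str]:
    """Build a candidate TL synset from the literals of an SL synset."""
    # The heuristic only applies to synsets with at least two literals that are
    # not mere spelling variants of each other ({artefact, artifact} is skipped).
    distinct: List[str] = []
    for w in en_syn:
        if all(levenshtein(w, d) > variant_threshold for d in distinct):
            distinct.append(w)
    if len(distinct) < 2:
        return set()
    # Case 1: some literal is monosemous -- its translations form the TL synset.
    for w in en_syn:
        if no_senses.get(w, 0) == 1:
            return set(M.get(w, set()))
    # Case 2: all literals are polysemous -- intersect their translation sets.
    result = set(M.get(en_syn[0], set()))
    for w in en_syn[1:]:
        result &= M.get(w, set())
    return result
```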

6. The second heuristic rule

The second heuristic draws on the fact that, in the case of nouns, the hyperonymy relation can be interpreted as an IS-A relation4. It is also based on two related observations:

1. A hyperonym and its hyponyms carry some common information.

2. The information common to the hyperonym and the hyponym increases as you go down in the hierarchy.

Let EnSyn1 = {ew_{j_1}^{i_1}, ..., ew_{j_t}^{i_t}} and EnSyn2 = {ew_{h_1}^{l_1}, ..., ew_{h_s}^{l_s}} be two SL synsets such that EnSyn1 HYP EnSyn2, meaning that EnSyn1 is a hyperonym of EnSyn2. Then we generate the translation lists of the words in the synsets. The intersection within each synset is computed as:

TL_EnSyn1 = M(ew_{j_1}) ∩ M(ew_{j_2}) ∩ ... ∩ M(ew_{j_t})
TL_EnSyn2 = M(ew_{h_1}) ∩ M(ew_{h_2}) ∩ ... ∩ M(ew_{h_s})

The generated synset in the target language will be computed as:

TL_Synset = TL_EnSyn1 ∩ TL_EnSyn2

Given the above considerations, it is possible that a hyponym and its hyperonym have the same translation in the other language, and this is more probable as you descend in the hierarchy. The procedure formally described above is applied to each synset in the source list. It generates the lists of common translations for all the words in the hyperonym and hyponym synsets and then constructs the TL synsets by intersecting these lists. In case the intersection is not empty, the created synset will be assigned to both SL synsets.

Because the procedure generates autohyponym synsets, this could be an indication that the created synsets could be clustered in TL.

It is possible that a TL synset be assigned to two different source synset pairs, as in the figure below. So we need to perform a clean-up procedure and choose the assignment that maximizes the sum of the depth levels of the two synsets.

3 The Under-generation case means that the resulting synset is not rich enough; it does not mean that it is incorrect.

4 This is not entirely true, because in PWN the hyperonym relation can also be interpreted as an INSTANCE-OF relation, as some instances are also included in PWN (e.g. New York, Adam, etc.).

[Figure: a hyperonym chain on levels k, k+1 and k+2; the middle synset shares a common translation both with its hyperonym (level k) and with its hyponym (level k+2).]

In the figure, common information is found between the middle synset and the upper synset (its hyperonym) and also between the middle synset and the lower synset (its hyponym). Our procedure will prefer the second assignment.

The results of the second heuristic are presented in Table 2, at the end of section 9. The low number of mapped synsets (10%) is due to the fact that we did not find many common translations between hyperonyms and their hyponyms.
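Below is a sketch of the second heuristic under the same assumptions about M; common_translations reproduces the within-synset intersection, and prefer_deeper only hints at the clean-up step that keeps, for a candidate synset, the hyperonym/hyponym pair lying deepest in the hierarchy.

```python
from typing import Dict, List, Optional, Set, Tuple

def common_translations(syn: List[str], M: Dict[str, Set[str]]) -> Set[str]:
    """Translations shared by all literals of a synset (the TL_EnSyn set)."""
    result = set(M.get(syn[0], set())) if syn else set()
    for w in syn[1:]:
        result &= M.get(w, set())
    return result

def second_heuristic(hyperonym: List[str], hyponym: List[str],
                     M: Dict[str, Set[str]]) -> Optional[Set[str]]:
    """Candidate TL synset shared by a hyperonym synset and one of its hyponyms."""
    tl = common_translations(hyperonym, M) & common_translations(hyponym, M)
    return tl or None

def prefer_deeper(assignments: List[Tuple[str, int]]) -> str:
    """Clean-up: among the source pairs a TL synset was assigned to, keep the one
    with the largest sum of depth levels.  Each item is (pair_id, depth_sum)."""
    return max(assignments, key=lambda item: item[1])[0]
```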

7. The third heuristic

The third heuristic takes advantage of an external relation imposed over the wordnet. At IRST, PWN 1.6 was augmented with a set of domain labels, the resulting resource being called WordNet Domains (Magnini & Cavaglia 2000). PWN 1.6 synsets have been semi-automatically linked with a set of 200 domain labels taken from the Dewey Decimal Classification, the world's most widely used library classification system. The domain labels are hierarchically organized and each synset received one or more domain labels. For the synsets that cannot be labeled unambiguously the default label factotum has been used.

Because in the BalkaNet project the RoWN has been aligned with PWN 2.0, we performed a mapping between PWN 1.6 and PWN 2.0. By comparison with PWN 1.6, PWN 2.0 has additional synsets and its structure is also slightly modified. As a consequence, not all PWN 2.0 synsets can be reached from PWN 1.6, either because they are completely new or because they could not be unambiguously mapped. This results in some PWN 2.0 synsets having no domains. For their labelling we used the three rules below (a code sketch of these rules follows the next paragraph):

1. If one of the direct hyperonyms of the unlabeled synset has been assigned a domain, then the synset will automatically receive the father's domain; conversely, if one of the hyponyms is labelled with a domain and the father lacks a domain, then the father synset will receive the son's domain.

2. If a holonym of the synset is assigned a specific domain, then the meronym will receive the holonym's domain, and conversely.

3. If a domain label cannot be assigned, then the synset will receive the default factotum label.

The idea of using domains is helpful for distinguishing word senses (different senses of a word are assigned to different domains). The best case is when each sense of a word has been assigned a distinct domain. But even if the same domain label is assigned to two or more senses of a word, in most cases we can assume that this is a strong indication of a fine-grained distinction. It is very probable that the distinction is preserved in the target language by the same word.
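A sketch of the three labelling rules above, assuming each PWN 2.0 synset id appears as a key in a domains map (with an empty set when no label was inherited from PWN 1.6) and that the hyperonym/hyponym and holonym/meronym neighbours are available as adjacency maps; the data layout is our own assumption.

```python
from typing import Dict, List, Set

def propagate_domains(domains: Dict[str, Set[str]],
                      hyper: Dict[str, List[str]], hypo: Dict[str, List[str]],
                      holo: Dict[str, List[str]], mero: Dict[str, List[str]]) -> None:
    """Assign domain labels to PWN 2.0 synsets left unlabeled by the 1.6 mapping."""
    for syn in domains:
        if domains[syn]:
            continue                       # already labelled via PWN 1.6
        # Rule 1 first (hyperonym/hyponym), then rule 2 (holonym/meronym):
        # the ordering of the neighbour list encodes this priority.
        neighbours = (hyper.get(syn, []) + hypo.get(syn, []) +
                      holo.get(syn, []) + mero.get(syn, []))
        for other in neighbours:
            if domains.get(other):
                domains[syn] = set(domains[other])
                break
        else:
            domains[syn] = {"factotum"}    # rule 3: default label
```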

We labelled every word in the BD dictionary with its domain label. For English words the domain is automatically generated from the English synset labels. For labelling Romanian words we used two methods:

1. We downloaded a collection of documents from web directories such that the categories of the downloaded documents match the categories used in WordNet Domains. The downloaded document set underwent a preprocessing procedure with the following steps:

a. Feature extraction. The first phase consists in finding a set of terms that represents the documents adequately. The documents were POS-tagged and lemmatized, and the nouns were selected as features.

b. Feature selection. In this phase the features that provide less information were eliminated. For this we used the well-known χ² statistic.

The χ² statistic checks if there is a relationship between being in a certain group and a characteristic that we want to study. In our case we want to measure the dependency between a term t and a category c. The formula for χ² is:

χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D))

where:
A is the number of times t and c co-occur;
B is the number of times t occurs without c;
C is the number of times c occurs without t;
D is the number of times neither c nor t occurs;
N is the total number of documents.

For each category we computed the χ² score between that category and the noun terms of our documents. Then, for choosing the terms that discriminate well for a certain category, we used the formula below (where m denotes the number of categories; a code sketch of this computation follows the dictionary entry example below):

χ²_max(t) = max_{i=1..m} χ²(t, c_i)

2. We took advantage of the fact that some words have already been assigned subject codes in various dictionaries. We performed a manual mapping of these codes onto the domain labels used at IRST. The Romanian words that could not be associated with domain information were associated with the default factotum domain.

The following entry is a BD dictionary entry augmented with domain information:

M(ew_1 [D1]) = {rw_1 [D1, D2], rw_2 [D1, D3], rw_i [D2, D4]}

In the square brackets the domains that pertain to each word are listed.
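The following is a direct transcription of the χ² selection step described under method 1 above; the counts A, B, C and D are assumed to have been gathered from the downloaded document collection, and the function names are ours.

```python
from typing import Dict, Tuple

def chi_square(A: int, B: int, C: int, D: int) -> float:
    """Chi-square association between a term t and a category c.

    A: documents of c containing t        B: documents outside c containing t
    C: documents of c without t           D: documents with neither c nor t
    """
    N = A + B + C + D                        # total number of documents
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def chi_square_max(counts_per_category: Dict[str, Tuple[int, int, int, int]]) -> float:
    """Score of a term: its best chi-square value over all m categories."""
    return max(chi_square(*counts) for counts in counts_per_category.values())
```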

Let again EnSyn1 = {ew_{j_1}^{i_1}, ew_{j_2}^{i_2}, ..., ew_{j_n}^{i_n}} be an SL synset and D_i its associated domain. Then the TL synset will be constructed as follows:

TL_Synset = ∪_{m=1..n} M(ew_{j_m}), where each rw ∈ M(ew_{j_m}) has the property that its domain matches the domain of EnSyn1, that is: it is either the same as the domain of EnSyn1, subsumes the domain of EnSyn1 in the IRST domain label hierarchy, or is subsumed by the domain of EnSyn1 in the IRST domain label hierarchy.

For each synset in the SL we generated all the translations of its literals in the TL. Then the TL synset is built using only those TL literals whose domain matches the SL synset domain.

The results of this heuristic are given in Table 3, at the end of section 9.
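A sketch of the third heuristic, assuming the bilingual dictionary now carries a domain label for each translation (as in the augmented entry above) and that the IRST domain hierarchy is available as a child-to-parent map; domain_matches implements the equality-or-subsumption test, and all names are ours.

```python
from typing import Dict, List, Set, Tuple

def subsumes(ancestor: str, label: str, parent: Dict[str, str]) -> bool:
    """True if `ancestor` dominates `label` in the domain label hierarchy."""
    while label in parent:
        label = parent[label]
        if label == ancestor:
            return True
    return False

def domain_matches(d: str, syn_domain: str, parent: Dict[str, str]) -> bool:
    return (d == syn_domain or subsumes(d, syn_domain, parent)
            or subsumes(syn_domain, d, parent))

def third_heuristic(en_syn: List[str], syn_domain: str,
                    M_dom: Dict[str, Set[Tuple[str, str]]],
                    parent: Dict[str, str]) -> Set[str]:
    """Union of the translations whose domain matches the SL synset domain.

    M_dom maps an English word to a set of (Romanian word, domain) pairs."""
    tl_synset: Set[str] = set()
    for ew in en_syn:
        for rw, d in M_dom.get(ew, set()):
            if domain_matches(d, syn_domain, parent):
                tl_synset.add(rw)
    return tl_synset
```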

8. The fourth heuristic rule

The fourth heuristic takes advantage of the fact that the source synsets have an associated gloss and also that target words that are translations of source words have associated glosses in EXPD. As with the third heuristic, the procedure comprises a preprocessing phase. We preprocessed both resources (PWN and EXPD):

1. We automatically lemmatized and tagged all the glosses of the synsets in the SL.

2. We automatically lemmatized and tagged all the definitions of the words that are translations of SL words.

3. We chose as features for representing the glosses the set of nouns.

The target definitions were automatically translated using the bilingual dictionary. All possible source definitions were generated by translating each lemmatized noun in the TL definition. Thus, if a TL definition of one TL word is represented by the vector [rw_1, rw_2, ..., rw_p], then the number of SL vectors generated will be N = n_d * t_{w_1} * t_{w_2} * ... * t_{w_p}, where n_d is the number of definitions the target word has in the monolingual dictionary (EXPD) and t_{w_k}, with k = 1..p, is the number of translations that the noun w_k has in the bilingual dictionary.

By RGloss we denote the set of SL representation vectors of the TL glosses of a TL word: RGloss = {T_1, T_2, ..., T_n}. By S_v we denote the vector of the SL synset gloss. The procedure for generating the TL synset is the following: for each SL synset we generate the TL list of the translations of all the words in the synset. Then for each word in the TL list of translations we compute the similarity between S_v and its RGloss. The computation is done in two steps:

1. We give the vectors in S_v and RGloss a binary representation. The number (m) of positions a vector has will be equal to the number of distinct words existent in S_v and in all vectors of RGloss. The value 1 in the vector means that a word is present and the value 0 means that the word is absent.

2. For each vector T_i in RGloss we compute the product S_v · T_i = Σ_{j=1..m} s_j * t_j. If there exists at least one T_i such that S_v · T_i >= 2, we compute max_i(S_v · T_i) and we add the word to the TL synset.

Notice that by using this heuristic rule we can automatically add a gloss to the TL synset. As one can see in Table 4 at the end of section 9, the number of incomplete synsets is high. The low percentage of mapped synsets is due to the low agreement between the glosses in Romanian and English.
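Because the vectors are binary, the dot product S_v · T_i reduces to the number of nouns shared by the SL synset gloss and a translated TL definition; the sketch below uses this observation and assumes the glosses have already been reduced to noun sets, with names of our own choosing.

```python
from typing import Dict, Iterable, List, Set

def gloss_similarity(sl_gloss_nouns: Set[str],
                     translated_defs: Iterable[Set[str]]) -> int:
    """Best overlap between the SL synset gloss and the SL translations of the
    TL word's EXPD definitions (the RGloss vectors)."""
    return max((len(sl_gloss_nouns & t) for t in translated_defs), default=0)

def fourth_heuristic(sl_gloss_nouns: Set[str],
                     candidate_words: List[str],
                     r_gloss: Dict[str, List[Set[str]]],
                     threshold: int = 2) -> Set[str]:
    """Keep the candidate TL words whose translated definitions share at least
    `threshold` nouns with the SL synset gloss."""
    return {w for w in candidate_words
            if gloss_similarity(sl_gloss_nouns, r_gloss.get(w, [])) >= threshold}
```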

9. Combining results

For choosing the final synsets we devised a set of meta-rules by evaluating the pros and cons of each heuristic rule. For example, given the high quality of the dictionary, the probability that the first heuristic fails is very low, so the synsets obtained using it are automatically selected. A synset obtained using the other heuristics will be selected, and moreover will replace a synset obtained using the first heuristic, only if it is obtained independently by heuristics 3 and 2, or by heuristics 3 and 4. A synset that is not selected by the above meta-rules will be selected only if it is obtained by heuristic 3 and the ambiguity of its members is at most 2. Table 5 at the end of this section shows the combined results of our heuristics.

As one can observe there, for 106 synsets in PWN 2.0 the Romanian equivalent synsets could not be found. There also resulted 635 synsets smaller than the corresponding synsets in the RoWN.
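One possible reading of the meta-rules above, for a single SL synset; h1..h4 are the candidate TL synsets produced by the four heuristics (None when a heuristic did not fire), ambiguity(w) is the number of senses of the TL word w, and both the function shape and the rule ordering are our own interpretation.

```python
from typing import Callable, Optional, Set

def combine(h1: Optional[Set[str]], h2: Optional[Set[str]],
            h3: Optional[Set[str]], h4: Optional[Set[str]],
            ambiguity: Callable[[str], int]) -> Optional[Set[str]]:
    """Select the final TL synset according to the meta-rules."""
    # A synset obtained independently by heuristics 3 and 2, or by 3 and 4,
    # replaces the one obtained by the first heuristic.
    if h3 is not None and (h3 == h2 or h3 == h4):
        return h3
    # Otherwise the first heuristic, when it fired, is trusted as is.
    if h1 is not None:
        return h1
    # Finally, a synset produced by heuristic 3 alone is accepted only if none
    # of its members has more than two senses.
    if h3 is not None and all(ambiguity(w) <= 2 for w in h3):
        return h3
    return None
```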

Table 1: The results of the first heuristic
  Number of mapped synsets: 8493
  Percent mapped: 87
  Error types: Over-generation 210, Overlap 0, Disjoint 0
  Correct: Under-generation 300, Identical 7983
  Percent errors: 2

Table 2: The results of the second heuristic
  Number of mapped synsets: 1028
  Percent mapped: 10
  Error types: Over-generation 213, Overlap 0, Disjoint 150
  Correct: Under-generation 230, Identical 435
  Percent errors: 35

Table 3: The results of the third heuristic
  Number of mapped synsets: 7520
  Percent mapped: 77
  Error types: Over-generation 689, Overlap 0, Disjoint 0
  Correct: Under-generation 0, Identical 6831
  Percent errors: 9

Table 4: The results of the fourth heuristic
  Number of mapped synsets: 3527
  Percent mapped: 36
  Error types: Over-generation 25, Overlap 0, Disjoint 78
  Correct: Under-generation 547, Identical 2877
  Percent errors: 3

Table 5: The combined results of the heuristics
  Number of mapped synsets: 9610
  Percent mapped: 98
  Error types: Over-generation 615, Overlap 0, Disjoint 250
  Correct: Under-generation 635, Identical 8110
  Percent errors: 9

10. Import of relations

After building the target synsets, an investigation of the nature of the relations that structure the source wordnet should be made in order to establish which of them can be safely transferred into the target wordnet. As one expects, the conceptual relations can be safely transferred because these relations hold between concepts. The only lexical relation that holds between nouns and that was subject to scrutiny was the antonymy relation. We concluded that this relation can also be safely imported. The importing algorithm works as described below.

If two source synsets S1 and S2 are linked by a semantic relation R in WS and if T1 and T2 are the corresponding aligned synsets in WT, then T1 and T2 will be linked by the relation R. If in WS there are intervening synsets between S1 and S2, then we will set the relation R between the corresponding TL synsets only if R is declared as a transitive relation (R+, unlimited number of compositions, e.g. hypernymy) or a partially transitive relation (Rk, with k a user-specified maximum number of compositions, larger than the number of intervening synsets between S1 and S2). For instance, we defined all the holonymy relations as partially transitive (k = 3).
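A sketch of this import rule, assuming the WS synsets on the path from S1 to S2 are given as a list of synset ids, the alignment as a WS-to-WT map, and the transitivity policy as a map from relation names to a maximum number of compositions (None standing for the unlimited R+ case); the relation names and k values shown are only illustrative, apart from the holonymy setting k = 3 mentioned above.

```python
from typing import Dict, List, Optional, Tuple

def import_relation(path: List[str], relation: str,
                    aligned: Dict[str, str],
                    max_compositions: Dict[str, Optional[int]]
                    ) -> Optional[Tuple[str, str, str]]:
    """Return (T1, relation, T2) if R can be transferred along `path`, else None."""
    s1, s2 = path[0], path[-1]
    if s1 not in aligned or s2 not in aligned:
        return None                        # one of the ends has no target synset
    intervening = len(path) - 2            # synsets strictly between S1 and S2
    k = max_compositions.get(relation, 0)  # 0: the relation is not transitive
    if intervening > 0 and not (k is None or k > intervening):
        return None
    return (aligned[s1], relation, aligned[s2])

# Illustrative policy: hypernymy unlimited (R+), holonymy partially transitive
# (k = 3), antonymy imported only between directly linked synsets.
max_compositions = {"hypernym": None, "holonym": 3, "antonym": 0}
```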

11. Conclusion and future work

Other experiments of automatically building wordnets that we are aware of are (Atserias et al. 1997) and (Lee et al. 2000). They combine several methods, using monolingual and bilingual dictionaries, for obtaining a Spanish wordnet and a Korean one, respectively, starting from PWN 1.5.

However, our approach is characterized by the fact that it gives an accurate evaluation of the results by automatically comparing them with a manually built wordnet. We also explicitly state the assumptions of this automatic approach. Our approach is the first to use an external resource (WordNet Domains) in the process of automatically building a wordnet.

We obtained a version of RoWN that contains 9610 synsets and 11969 relations with 91% accuracy.

The results obtained encourage us to develop other heuristics. The success of our procedure was facilitated by the quality of the bilingual dictionary we used.

Some of the heuristics developed here may be applied to the automatic construction of synsets of other parts of speech. That is why we also plan to extend our experiment to adjectives and verbs. Their evaluation would be of great interest in our opinion.

Finally, we would like to thank the three anonymous reviewers for helping us improve the final version of the paper.

References:

(Atserias et al. 1997) J. Atserias, S. Climent, X. Farreres, G. Rigau, H. Rodríguez, Combining Multiple Methods for the Automatic Construction of Multilingual WordNets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, 1997.

(Carroll 1964) J. B. Carroll (Ed.), Language, Thought and Reality: Selected Writings of Benjamin Lee Whorf, The MIT Press, Cambridge, MA, 1964.

(EXPD 1996) Dicționarul explicativ al limbii române, 2nd edition, București, Univers Enciclopedic, 1996.

(Fellbaum 1998) Ch. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.

(Kilgarriff 1997) A. Kilgarriff, I don't believe in word senses. In Computers and the Humanities, 31 (2), 91-113, 1997.

(Lee et al. 2000) C. Lee, G. Lee, J. Seo, Automatic WordNet mapping using Word Sense Disambiguation. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC 2000), Hong Kong, 2000.

(Levițchi & Bantaș 1992) L. Levițchi and A. Bantaș, Dicționar englez-român, București, Teora, 1992.

(Magnini & Cavaglia 2000) B. Magnini and G. Cavaglia, Integrating subject field codes into WordNet. In Proceedings of LREC-2000, Athens, Greece.

(Sowa 1992) J. F. Sowa, Logical Structure in the Lexicon. In J. Pustejovsky and S. Bergler (Eds.), Lexical Semantics and Commonsense Reasoning, LNAI 627, Springer-Verlag, Berlin, 39-60, 1992.

(Tufiș 2004) D. Tufiș (Ed.), Special Issue on the BalkaNet Project of the Romanian Journal of Information Science and Technology, vol. 7, no. 1-2, 2004.

(Vossen 1998) P. Vossen (Ed.), EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Dordrecht, Kluwer, 1998.