
ACL-IJCNLP 2009

MWE 2009

2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation, Applications

Proceedings of the Workshop

6 August 2009, Suntec, Singapore

Production and Manufacturing by
World Scientific Publishing Co Pte Ltd
5 Toh Tuck Link
Singapore 596224

©2009 The Association for Computational Linguistics and The Asian Federation of Natural Language Processing

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org

ISBN 978-1-932432-60-2 / 1-932432-60-4

Introduction

The ACL 2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE'09) took place on August 6, 2009 in Singapore, immediately following the annual meeting of the Association for Computational Linguistics (ACL). This is the fifth time this workshop has been held in conjunction with ACL, following the meetings in 2003, 2004, 2006 and 2007.

The workshop focused on Multi-Word Expressions (MWEs), which represent an indispensable part of natural languages and appear steadily on a daily basis, both novel and already existing but paraphrased, which makes them important for many natural language applications. Unfortunately, while easily mastered by native speakers, MWEs are often non-compositional, which poses a major challenge for both foreign language learners and automatic analysis.

The growing interest in MWEs in the NLP community has led to many specialized workshops held every year since 2001 in conjunction with ACL, EACL and LREC; there have also been two recent special issues on MWEs published by leading journals: the International Journal of Language Resources and Evaluation and the Journal of Computer Speech and Language.

As a result of the overall progress in the field, the time has come to move from basic preliminary research to actual applications in real-world NLP tasks. Thus, in MWE'09 we were interested in the overall process of dealing with MWEs, asking for original research on the following four fundamental topics:

Identification: Identifying MWEs in free text is a very challenging problem. Due to the variability of expression, it does not suffice to collect and use a static list of known MWEs; complex rules and machine learning are typically needed as well.

Interpretation: Semantically interpreting MWEs is a central issue. For some kinds of MWEs, e.g., noun compounds, it could mean specifying their semantics using a static inventory of semantic relations, e.g., WordNet-derived. In other cases, an MWE's semantics could be expressible by a suitable paraphrase.

Disambiguation: Most MWEs are ambiguous in various ways. A typical disambiguation task is to determine whether an MWE is used non-compositionally (i.e., figuratively) or compositionally (i.e., literally) in a particular context.

Applications: Identifying MWEs in context and understanding their syntax and semantics is important for many natural language applications, including but not limited to question answering, machine translation, information retrieval, information extraction and textual entailment. Still, despite the growing research interest, there are not enough successful applications in real NLP problems, which we believe is the key for the advancement of the field.

Of course, the above topics largely overlap. For example, identification can require disambiguating between literal and idiomatic uses, since MWEs are typically required to be non-compositional by definition. Similarly, interpreting three-word noun compounds like morning flight ticket and plastic water bottle requires disambiguation between a left and a right syntactic structure, while interpreting two-word compounds like English teacher requires disambiguating between (a) 'teacher who teaches English' and (b) 'teacher coming from England (who could teach any subject, e.g., math)'.

We received 18 submissions, and given our limited capacity as a one-day workshop, we were only able to accept 9 full papers for oral presentation: an acceptance rate of 50%.

We would like to thank the members of the Program Committee for their timely reviews. We would also like to thank the authors for their valuable contributions.

Dimitra Anastasiou, Chikara Hashimoto, Preslav Nakov and Su Nam Kim
Co-Organizers


Organizers

Dimitra Anastasiou, Localisation Research Centre, Limerick University, Ireland
Chikara Hashimoto, National Institute of Information and Communications Technology, Japan
Preslav Nakov, National University of Singapore, Singapore
Su Nam Kim, University of Melbourne, Australia

Program Committee

Iñaki Alegria, University of the Basque Country (Spain)
Timothy Baldwin, University of Melbourne (Australia)
Colin Bannard, Max Planck Institute (Germany)
Francis Bond, National Institute of Information and Communications Technology (Japan)
Gaël Dias, Beira Interior University (Portugal)
Ulrich Heid, Stuttgart University (Germany)
Stefan Evert, University of Osnabrück (Germany)
Afsaneh Fazly, University of Toronto (Canada)
Nicole Grégoire, University of Utrecht (The Netherlands)
Roxana Girju, University of Illinois at Urbana-Champaign (USA)
Kyo Kageura, University of Tokyo (Japan)
Brigitte Krenn, Austrian Research Institute for Artificial Intelligence (Austria)
Éric Laporte, University of Marne-la-Vallée (France)
Rosamund Moon, University of Birmingham (UK)
Diana McCarthy, University of Sussex (UK)
Jan Odijk, University of Utrecht (The Netherlands)
Stephan Oepen, University of Oslo (Norway)
Darren Pearce, London Knowledge Lab (UK)
Pavel Pecina, Charles University (Czech Republic)
Scott Piao, University of Manchester (UK)
Violeta Seretan, University of Geneva (Switzerland)
Stan Szpakowicz, University of Ottawa (Canada)
Beata Trawiński, University of Tübingen (Germany)
Peter Turney, National Research Council of Canada (Canada)
Kiyoko Uchiyama, Keio University (Japan)
Begoña Villada Moirón, University of Groningen (The Netherlands)
Aline Villavicencio, Federal University of Rio Grande do Sul (Brazil)


Table of Contents

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto ........ 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan ........ 9

Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada ........ 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn ........ 23

A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan ........ 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha ........ 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang ........ 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi ........ 55

Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita ........ 63

Workshop Program

Friday, August 6, 2009

8:30–8:45  Welcome and Introduction to the Workshop

Session 1 (8:45–10:00): MWE Identification and Disambiguation

8:45–9:10  Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
           Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto

9:10–9:35  Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
           Su Nam Kim and Min-Yen Kan

9:35–10:00  Verb Noun Construction MWE Token Classification
            Mona Diab and Pravin Bhutada

10:00–10:30  BREAK

Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55  Exploiting Translational Correspondences for Pattern-Independent MWE Identification
             Sina Zarrieß and Jonas Kuhn

10:55–11:20  A re-examination of lexical association measures
             Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45  Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
             R. Mahesh K. Sinha

11:45–13:50  LUNCH

Session 3 (13:50–15:30): Applications

13:50–14:15  Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
             Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang

14:15–14:40  Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
             Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05  Abbreviation Generation for Japanese Multi-Word Expressions
             Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30  Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00  BREAK

16:00–17:00  General Discussion

17:00–17:15  Closing Remarks


Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, André Machado♣, Maria José Finatto♥

♦Department of Computer Science, Federal University of São Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. Particularly, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.

1 Introduction

A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine) and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).

Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.

In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.

The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and section 6 finishes this paper with some conclusions and proposals for future work.

2 Related Work

The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter) and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.

A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply, and the sources of information they use. Although some work on MWEs is type independent (e.g., (Zhang et al., 2006; Villavicencio et al., 2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g., (Pearce, 2002; Baldwin, 2005) for English and (Piao et al., 2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g., (Villada Moirón and Tiedemann, 2006; Caseli et al., 2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g., ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kinds of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).

Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ²) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ². In addition, Evert and Krenn (2005) found that, for MWE identification, the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature over the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).

Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).

Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus by a time-demanding process.

Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.

3 The Corpus and Reference Lists

The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.¹ The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students on the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g., removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students and resulted in 2,407 terms, with 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
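The subsumption step just described (dropping an n-gram that is contained in a longer glossary n-gram) can be sketched as follows; this is an illustrative reconstruction, not the authors' script, and the function name is our own.

```python
def drop_subsumed(ngrams):
    """Keep only n-grams not contained in a longer n-gram.

    Containment is tested on whole-word subsequences, so that
    'aleitamento materno' is dropped in favour of
    'aleitamento materno exclusivo'."""
    items = sorted(set(ngrams), key=lambda s: len(s.split()), reverse=True)
    kept = []
    for term in items:
        words = term.split()
        def contained_in(longer):
            lw = longer.split()
            return any(lw[i:i + len(words)] == words
                       for i in range(len(lw) - len(words) + 1))
        if not any(contained_in(longer) for longer in kept):
            kept.append(term)
    return kept

print(drop_subsumed(["aleitamento materno",
                     "aleitamento materno exclusivo"]))
# -> ['aleitamento materno exclusivo']
```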

4 Statistically-Driven and Alignment-Based methods

4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language and type independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ² and log-likelihood (Press et al., 1992), among others, have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others.

¹ Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim


In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed to differentiate MWEs from non-MWEs, but the same was not true of χ² (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the measures was then calculated for these n-grams, and we ranked each n-gram according to each measure. The average of all the rankings is used as the combined measure of the MWE candidates.
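As an illustration (not the authors' code; the paper uses the Ngram Statistics Package), the sketch below computes PMI for a bigram from raw counts and combines several measures by averaging their ranks, as described above. The count values in the example are hypothetical, except that 202 echoes the corpus frequency of aleitamento materno.

```python
import math

def pmi(bigram_count, w1_count, w2_count, n):
    """Pointwise mutual information of a bigram in a corpus of n tokens."""
    return math.log2((bigram_count / n) /
                     ((w1_count / n) * (w2_count / n)))

def combined_ranking(measures):
    """Average the ranks a candidate receives under several association
    measures (e.g. one dict of PMI scores, one of MI scores) and order
    candidates by that average rank."""
    avg_rank = {}
    for scores in measures:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, cand in enumerate(ordered, start=1):
            avg_rank[cand] = avg_rank.get(cand, 0.0) + rank / len(measures)
    return sorted(avg_rank, key=avg_rank.get)

# hypothetical component counts, corpus size from the paper
print(round(pmi(202, 230, 540, 785_448), 2))
```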

4.2 Alignment-Based method

The second of the MWE extraction approaches to be investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, i.e., a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s_1 ... s_n, with n ≥ 2) and a target word sequence T (T = t_1 ... t_m, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that: (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.

Thus, notice that the alignment-based MWE extraction method does not rely on the conceptual asymmetries between languages, since it does not expect that a source sequence of words be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.

It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++² (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.

In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as gold standard.

To perform the extraction of MWE candidates following the alignment-based approach, first the original corpus had to be sentence and word aligned and Part-of-Speech (POS) tagged. For these preprocessing steps were used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium³ (Armentano-Oller et al., 2006).

² GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html

³ Apertium is an open-source machine translation engine and toolbox available at http://www.apertium.org


From the preprocessed corpus, the MWE candidates are extracted as those in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that: (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.

Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in (Caseli et al., 2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as (Caseli et al., 2009) plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral, and (b) a minimum frequency threshold of 2.
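As an illustration of how such filters operate, the sketch below renders filter F3 as a predicate over a candidate's POS sequence and frequency. The tag names are our own placeholders, not Apertium's actual tagset, and the check is simplified to the edge-tag and threshold conditions described above.

```python
# POS classes that F3 forbids at the start or end of a candidate
# (illustrative tag names, not the real tagset)
BAD_EDGE_TAGS = {"det", "adv", "conj", "prep", "verb", "pron", "num"}

def passes_f3(pos_tags, frequency):
    """Filter F3: reject candidates that start or end with a
    closed-class word, or that occur fewer than 2 times."""
    return (frequency >= 2
            and pos_tags[0] not in BAD_EDGE_TAGS
            and pos_tags[-1] not in BAD_EDGE_TAGS)

print(passes_f3(["noun", "adj"], 202))   # True  (aleitamento materno)
print(passes_f3(["conj", "noun"], 10))   # False (begins with conjunction)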

5 Experiments and Results

Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as, for instance, jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 × precision × recall) / (precision + recall)) figures for the association measures.

PMI                             alignment-based
Online Mendelian Inheritance    faixa etária
Beta Technology Incorporated    aleitamento materno
Lange Beta Technology           estados unidos
Óxido Nítrico Inalatório        hipertensão arterial
jogar video game                leite materno
e um de                         couro cabeludo
e a do                          bloqueio lactíferos
se que de                       emocional anatomia
e a da                          neonato a termo
e de não                        duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach

pt MWE candidates        PMI      MI
All candidates:
proposed bigrams       64,839  64,839
correct MWEs            1,403   1,403
precision (%)            2.16    2.16
recall (%)              98.73   98.73
F (%)                    4.23    4.23
proposed trigrams      54,548  54,548
correct MWEs              701     701
precision (%)            1.29    1.29
recall (%)              96.03   96.03
F (%)                    2.55    2.55
Top candidates:
proposed bigrams        1,421   1,421
correct MWEs              155     261
precision (%)           10.91   18.37
recall (%)              10.91   18.37
F (%)                   10.91   18.37
proposed trigrams         730     730
correct MWEs               44      20
precision (%)            6.03    2.74
recall (%)               6.03    2.74
F (%)                    6.03    2.74

Table 2: Evaluation of MWE candidates ranked by PMI and MI

The first half of the table uses all the candidates, and the second half uses the top 1,421 bigram and 730 trigram candidates. From these latter results we can see that the top candidates produced by these measures do not agree well with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.
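As a worked example of how the figures in the first half of Table 2 fit together, take the PMI row for all proposed bigrams: precision = 1,403 / 64,839 ≈ 2.16%, recall = 1,403 / 1,421 ≈ 98.73%, and F = (2 × 2.16 × 98.73) / (2.16 + 98.73) ≈ 4.23%.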

On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the numbers of candidates filtered following the three filters described in section 4.2: F1, F2 and F3.

To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary.


pt MWE candidates             F1      F2      F3
filtered by POS patterns  24,996  21,544  32,644
filtered by frequency      9,012  11,855   1,442
final set                    269     878     191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates        F1     F2     F3
proposed bigrams        250    754    169
correct MWEs             48     95     65
precision (%)         19.20  12.60  38.46
recall (%)             3.38   6.69   4.57
F (%)                  5.75   8.74   8.18
proposed trigrams        19    110     20
correct MWEs              1      9      4
precision (%)          5.26   8.18  20.00
recall (%)             0.14   1.23   0.55
F (%)                  0.27   2.14   1.07
proposed bi+trigrams    269    864    189
correct MWEs             49    104     69
precision (%)         18.22  12.04  36.51
recall (%)             2.28   4.83   3.21
F (%)                  4.05   6.90   5.90

Table 4: Evaluation of MWE candidates

In this automatic comparison, we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

The different values of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not being considered at the moment.

After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in (Caseli et al., 2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in section 4.2) were manually evaluated.

From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total amount) were found among the bigrams or trigrams in the Glossary (see Table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of its being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was: (1) a proper name, or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.

The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates a good agreement.
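For reference, the kappa statistic for such a two-judge, true/false annotation can be computed from the 2×2 agreement table as below. The counts used are hypothetical, chosen only to illustrate the computation; the paper reports k = 0.73 over the 122 judged candidates.

```python
def cohen_kappa(tt, tf, ft, ff):
    """Cohen's kappa from a 2x2 contingency table of two judges'
    true/false decisions: tt = both true, tf = judge 1 true and
    judge 2 false, and so on."""
    n = tt + tf + ft + ff
    observed = (tt + ff) / n
    # chance agreement from each judge's marginal distribution
    p1_true, p2_true = (tt + tf) / n, (tt + ft) / n
    expected = p1_true * p2_true + (1 - p1_true) * (1 - p2_true)
    return (observed - expected) / (1 - expected)

# hypothetical counts for 122 judged candidates
print(round(cohen_kappa(tt=88, tf=6, ft=6, ff=22), 2))  # -> 0.72
```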

In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating with regard to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.

6 Conclusions and Future Work

MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but, due to their role in communication, they need to be properly accounted for in NLP applications and tasks.

In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work.


Moreover, the results obtained show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.

Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both, or to at least one, of the judges were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to say that, in this manual analysis, the human experts classified the MWEs as true independently of their being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.

In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.

Acknowledgments

We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank the financial support of FAPESP in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.

References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50–59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047–2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49–56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117–134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.


Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide and G. Perissinotto, editors, Research in Humanities Computing, pages 165–178, Oxford. Oxford University Press.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534–559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33–40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440–447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1–7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17–24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1–15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034–1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415–432.

Yi Zhang, Valia Kordoni, Aline Villavicencio and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia, July. Association for Computational Linguistics.



Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles

Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.

1 Introduction

Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of the document. Keyphrases can serve as a representative summary of the document and also serve as high quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications, such as document summarizers (Davanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).

In the past, various attempts have been made to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.

Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g., (Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007)), most of them do not effectively deal with various keyphrase forms, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.

Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. That leads to a redundancy in studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, the supervised approach involves a large amount of manual labor, thus reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthy study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.

This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method which improves the coverage of candidates that occur in various forms.


Our second contribution re-examines existing keyphrase extraction features reported in the literature, in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.

In the remaining sections, we give an overview of related work in Section 2 and our proposals on candidate selection and feature engineering in Sections 4 and 5, and describe our system architecture and data in Section 6. We then evaluate our proposals, discuss outcomes and conclude our work in Sections 7, 8 and 9, respectively.

2 Related Work

The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). Features used can be categorized into three broad groups: (1) document cohesion features (i.e., relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and, to a lesser extent, (2) keyphrase cohesion features (i.e., relationship among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e., relationship among components in a keyphrase) (Park et al., 2004).

The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TF·IDF (i.e., term frequency, inverse document frequency) and first occurrence in the document. TF·IDF measures the document cohesion, and the first occurrence implies the importance of the abstract or introduction, which indicates that keyphrases have a locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. Textract (Park et al., 2004) also ranks the candidate keyphrases by its judgment of keyphrases' degree of domain specificity, based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extract keyphrases automatically from single documents, utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of other documents in the same cluster.

3 Keyphrase Analysis

In previous studies, KEA employed the indexing words as candidates, whereas others, such as (Park et al., 2004; Nguyen and Kan, 2007), generated handcrafted regular expression rules. However, none carefully undertook the analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used the articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for the data in detail.

Syntactically, keyphrases can be formed by either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words, such as adjectives and adverbs (e.g., mobile network, fast computing, partially observable Markov decision process), albeit with few incidences. They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well to the part-of-speech (POS) patterns used in modern keyphrase extraction systems.

However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked.


Frequency:
(Rule1) Frequency heuristic, i.e., frequency ≥ 2 for simplex words vs. frequency ≥ 1 for NPs

Length:
(Rule2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
(e.g., synchronous concurrent program vs. model of multiagent interaction)

Alternation:
(Rule3) of-PP form alternation
(e.g., number of sensor = sensor number, history of past encounter = past encounter history)
(Rule4) Possessive alternation
(e.g., agent's goal = goal of agent, security's value = value of security)

Extraction:
(Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)* (NN|NNS|NNP|NNPS)
(e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
(Rule6) Simplex Word/NP + IN + Simplex Word/NP
(e.g., quality of service, sensitivity of VOIP traffic (VOIP traffic extracted), simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules

In our study, we also found that keyphrases occur as simple conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when they form an of-PP. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).
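These alternations (Rule3 and Rule4 in Table 1 below) can be operationalized by generating the variant forms of a candidate and treating them as equivalent during matching. The helpers below are an illustrative sketch under that reading, not the authors' code.

```python
def of_pp_alternations(phrase):
    """Rule3: generate word-order variants for an NP in of-PP form,
    e.g. 'quality of service' -> {'quality of service',
    'service quality'}."""
    variants = {phrase}
    words = phrase.split()
    if "of" in words[1:-1]:
        i = words.index("of")
        head, modifier = words[:i], words[i + 1:]
        variants.add(" ".join(modifier + head))
    return variants

def possessive_alternation(phrase):
    """Rule4: rewrite a possessive as an of-PP,
    e.g. "agent's goal" -> 'goal of agent'."""
    words = phrase.split()
    if words and words[0].endswith("'s"):
        return " ".join(words[1:] + ["of", words[0][:-2]])
    return phrase

print(of_pp_alternations("quality of service"))
print(possessive_alternation("agent's goal"))
```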

4 Candidate Selection

We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated to the study of term extraction, since the top Nth-ranked terms become the keyphrases of documents. In previous studies, KEA employed the indexing words as candidates, whereas others, such as (Park et al., 2004; Nguyen and Kan, 2007), generated handcrafted regular expression rules. However, none carefully undertook the analysis of keyphrases.

In this section, before we present our method, we first describe the details of our keyphrase analysis.

In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.

Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (≥ 2) and NPs (≥ 1). Note that 30% of NPs occurred only once in our data.

Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007).


However, our Rule6 for NPs in of-PP form broadens the coverage of possible candidates: given an NP in of-PP form, not only do we collect the simplex word(s), but we also extract the non-of-PP form of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
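To make Rule5 and Rule6 concrete, the following sketch applies them to a POS-tagged sentence encoded as word/TAG tokens. It is our illustrative reconstruction of the rules in Table 1 using Penn Treebank tags; the real system may differ in tokenization and tagset details.

```python
import re

# Rule5: a sequence of adjectives/nouns ending in a noun
NP = r"(?:\w+/JJ[RS]?\s+|\w+/NNP?S?\s+)*\w+/NNP?S?"
# Rule6: NP + "of"/IN + NP, e.g. "quality of service"
NP_OF_PP = rf"{NP}\s+of/IN\s+{NP}"

def strip_tags(span):
    return " ".join(w.split("/")[0] for w in span.split())

def extract_candidates(tagged):
    """Return Rule5/Rule6 candidates from a tagged sentence,
    e.g. 'the/DT quality/NN of/IN service/NN ...'."""
    cands = set()
    for m in re.finditer(NP_OF_PP, tagged):
        cands.add(strip_tags(m.group()))   # full of-PP form (Rule6)
    for m in re.finditer(NP, tagged):
        cands.add(strip_tags(m.group()))   # plain NPs inside (Rule5)
    return cands

print(extract_candidates("the/DT quality/NN of/IN service/NN is/VBZ low/JJ"))
# -> {'quality of service', 'quality', 'service'}
```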

5 Feature Engineering

With wider candidate selection criteria, the onus of filtering out irrelevant candidates becomes the responsibility of careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S, U).

5.1 Document Cohesion

Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TF·IDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TF·IDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.

F1: TF·IDF (S, U). TF·IDF indicates document cohesion by looking at the frequency of terms in the documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is in requiring a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF in substrings as well as in exact matches. For example, grid computing is often a substring of other phrases, such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e., we separately treat simplex words and NPs to compute TF. To make our IDFs broadly representative, we employed the Google n-gram counts that were computed over terabytes of data. Given this large, generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence, such features are useful in unsupervised approaches as well. The following list shows the variations of TF·IDF employed as features in our system (a small sketch of two of these variations follows the list):

bull (F1a) TFIDF

bull (F1b) TF including counts of substrings

bull (F1c) TF of substring as a separate feature

bull (F1d) normalized TF by candidate types(ie simplex words vs NPs)

bull (F1e) normalized TF by candidate types asa separate feature

bull (F1f) IDF using Google n-gram
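The F1 variants above could be computed along the following lines (our illustration only; the exact counting and normalization schemes, and the Google n-gram lookup passed in as `ngram_count`, are assumptions):

```python
import math

def f1_variants(cand, all_cands, text, ngram_count, n_total=1e12):
    """Illustrative computation of the F1 feature family (not the authors' code)."""
    tf_exact = text.count(cand)
    # F1b/F1c: occurrences of the candidate as a substring of longer candidates,
    # e.g. "grid computing" inside "grid computing algorithm".
    tf_substr = sum(text.count(c) for c in all_cands if cand in c and c != cand)
    # F1d/F1e: normalize TF separately for simplex words and NPs (scheme assumed).
    same_type = [c for c in all_cands if (" " in c) == (" " in cand)]
    tf_norm = tf_exact / max(1, len(same_type))
    # F1f: corpus-independent IDF from the (assumed) Google n-gram counts.
    idf = math.log(n_total / (1 + ngram_count(cand)))
    return {"F1a": tf_exact * idf, "F1b": tf_exact + tf_substr,
            "F1c": tf_substr, "F1d": tf_norm, "F1f": idf}
```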

F2 First Occurrence (S,U): KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g. in abstract and introduction sections) and newswire.

F3 Section Information (S,U): Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title, and/or references.

F4 Additional Section Information (S,U): We first added the related work or previous work section as a piece of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations: we used the substrings that occur in section headers and reference titles as keyphrases; we counted the co-occurrence of candidates (i.e. the section TF) across all key sections, which indicates the correlation among key sections; and we assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.

• (F4a) section 'related/previous work'

• (F4b) counting substrings occurring in key sections

• (F4c) section TF across all key sections

• (F4d) weighting key sections according to the portion of keyphrases found

F5 Last Occurrence (S,U): Similar to the distance feature in KEA, the position of the last occurrence of a candidate may also indicate the importance of a keyphrase, as keyphrases tend to appear in the last part of documents, such as the conclusion and discussion.

5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.

F6 Co-occurrence of Another Candidate in Section (S,U): When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence we used the number of sections in which candidates co-occur.

F7 Title Overlap (S): In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.

• (F7a) co-occurrence (Boolean) in title collection

• (F7b) co-occurrence (TF) in title collection

F8 Keyphrase Cohesion (S,U): Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top N ranked candidates and the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs, and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.

1 It contains 1.3M titles from articles, papers, and reports.
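A minimal sketch of this contextual-similarity approximation (the window size and the content-word filtering shown here are our assumptions):

```python
import math
from collections import defaultdict

def context_vectors(sentences, cand_words, window=3):
    """Candidate words in rows, neighboring words in columns (co-occurrence counts)."""
    vec = {w: defaultdict(int) for w in cand_words}
    for toks in sentences:
        for i, t in enumerate(toks):
            if t in vec:
                for n in toks[max(0, i - window):i + window + 1]:
                    if n != t:  # in the paper: nouns, verbs, and adjectives only
                        vec[t][n] += 1
    return vec

def cosine(u, v):
    num = sum(u[k] * v.get(k, 0) for k in u)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0
```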

5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).

F9 Term Cohesion (S,U): Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier (see the sketch after the list below).

• (F9a) term cohesion as in (Park et al., 2004)

• (F9b) TF normalized by candidate type (i.e. simplex words vs. NPs)

• (F9c) different weights applied by candidate type

• (F9d) normalized TF and different weighting by candidate type
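A sketch of the F9 baseline (the multiword extension of Dice and the reading of the discount as 10% are our assumptions about Park et al. (2004), as described above):

```python
def dice_cohesion(tf_phrase, constituent_tfs, is_simplex):
    """Dice-style term cohesion; constituent_tfs are the TFs of the component words."""
    n = len(constituent_tfs)
    dice = (n * tf_phrase) / max(1, sum(constituent_tfs))
    # Park et al. (2004) discount simplex-word cohesion relative to NPs;
    # we read the paper's "by 10" as a 10% discount.
    return 0.9 * dice if is_simplex else dice
```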

5.4 Other Features

F10 Acronym (S): Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence we used it only with N&K, to attest our candidate selection method.

F11 POS Sequence (S): Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. Their study showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like the acronym feature, this one is also subject to the data set.

F12 Suffix Sequence (S): Similar to the acronym feature, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latinate derivational morphology for technical keyphrases. This feature is also data-dependent, and is thus used in the supervised approach only.

F13 Length of Keyphrases (S,U): Barker and Corrnacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.

6 System and Data

To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections, such as the title and individual sections, via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4, and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers, using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.
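As a rough illustration of the first pipeline stage (glue code of ours; only the pdftotext invocation is shown, and the remaining stages are summarized in comments):

```python
import subprocess

def pdf_to_text(pdf_path):
    # Convert the PDF article to raw text; "-" writes the text to stdout.
    return subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True).stdout

# The later stages (heuristic section splitting, sentence segmentation, ParsCit
# reference parsing, POS tagging and lemmatization, then the Naive Bayes /
# Maximum Entropy classifiers) each consume the previous stage's output.
```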

For evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found that many were missing or appeared only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases, the numbers in parentheses denote the number of keyphrases in of-PP form, and Found means the number of author-assigned and reader-assigned keyphrases found in the documents.

2 http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3 http://wing.comp.nus.edu.sg/parsCit/
4 http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5 http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6 http://maxent.sourceforge.net/index.html
7 C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence - Multiagent Systems), and J.4 (Social and Behavioral Sciences - Economics)

           Author        Reader          Combined
Total      1252 (53)     3110 (111)      3816 (146)
NPs        904           2537            3027
Average    3.85 (4.01)   12.44 (12.88)   15.26 (15.85)
Found      769           2509            2864

Table 2: Statistics of keyphrases

7 Evaluation

The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information, and additional section information (i.e. F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.

Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e. the exact matching scheme) over the top 5, 10, and 15 candidates.

BestFeatures includes F1c (TF of substring as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion by (Park et al., 2004)), and F13 (length of keyphrases). Best w/o TFIDF means using all best features except TFIDF.

In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).8

In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and · indicates no effect, or an effect unconfirmed due to only small changes in performance. Again, Supervised denotes Maximum Entropy training, and Unsupervised is our unsupervised approach.

8 Discussion

We compared the performance of our candidate selection and feature engineering with simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer candidates tend to act as noise and decrease the overall performance.

8 Due to the page limits, we present only the best performance.

Candidates               System    C   Top 5 (M/P/R/F)               Top 10 (M/P/R/F)              Top 15 (M/P/R/F)
All                      KEA       U   0.03 / 0.64 / 0.21 / 0.32     0.09 / 0.92 / 0.60 / 0.73     0.13 / 0.88 / 0.86 / 0.87
All                      KEA       S   0.79 / 15.84 / 5.19 / 7.82    1.39 / 13.88 / 9.09 / 10.99   1.84 / 12.24 / 12.03 / 12.13
All                      N&K       S   1.32 / 26.48 / 8.67 / 13.06   2.04 / 20.36 / 13.34 / 16.12  2.54 / 16.93 / 16.64 / 16.78
All                      baseline  U   0.92 / 18.32 / 6.00 / 9.04    1.57 / 15.68 / 10.27 / 12.41  2.20 / 14.64 / 14.39 / 14.51
All                      baseline  S   1.15 / 23.04 / 7.55 / 11.37   1.90 / 18.96 / 12.42 / 15.01  2.44 / 16.24 / 15.96 / 16.10
Length<=3                KEA       U   0.03 / 0.64 / 0.21 / 0.32     0.09 / 0.92 / 0.60 / 0.73     0.13 / 0.88 / 0.86 / 0.87
Length<=3                KEA       S   0.81 / 16.16 / 5.29 / 7.97    1.40 / 14.00 / 9.17 / 11.08   1.84 / 12.24 / 12.03 / 12.13
Length<=3                N&K       S   1.40 / 27.92 / 9.15 / 13.78   2.10 / 21.04 / 13.78 / 16.65  2.62 / 17.49 / 17.19 / 17.34
Length<=3                baseline  U   0.92 / 18.40 / 6.03 / 9.08    1.58 / 15.76 / 10.32 / 12.47  2.20 / 14.64 / 14.39 / 14.51
Length<=3                baseline  S   1.18 / 23.68 / 7.76 / 11.69   1.90 / 19.00 / 12.45 / 15.04  2.40 / 16.00 / 15.72 / 15.86
Length<=3 + Alternation  KEA       U   0.01 / 0.24 / 0.08 / 0.12     0.05 / 0.52 / 0.34 / 0.41     0.07 / 0.48 / 0.47 / 0.47
Length<=3 + Alternation  KEA       S   0.83 / 16.64 / 5.45 / 8.21    1.42 / 14.24 / 9.33 / 11.27   1.87 / 12.45 / 12.24 / 12.34
Length<=3 + Alternation  N&K       S   1.53 / 30.64 / 10.04 / 15.12  2.31 / 23.08 / 15.12 / 18.27  2.88 / 19.20 / 18.87 / 19.03
Length<=3 + Alternation  baseline  U   0.98 / 19.68 / 6.45 / 9.72    1.72 / 17.24 / 11.29 / 13.64  2.37 / 15.79 / 15.51 / 15.65
Length<=3 + Alternation  baseline  S   1.33 / 26.56 / 8.70 / 13.11   2.09 / 20.88 / 13.68 / 16.53  2.69 / 17.92 / 17.61 / 17.76

Table 3: Performance on proposed candidate selection (M = match count, P = precision, R = recall, F = F-score, in %)

Features        C   Top 5 (M/P/R/F)              Top 10 (M/P/R/F)             Top 15 (M/P/R/F)
Best            U   1.14 / 22.8 / 7.47 / 11.3    1.92 / 19.2 / 12.6 / 15.2    2.61 / 17.4 / 17.1 / 17.3
Best            S   1.56 / 31.2 / 10.2 / 15.4    2.50 / 25.0 / 16.4 / 19.8    3.15 / 21.0 / 20.6 / 20.8
Best w/o TFIDF  U   1.14 / 22.8 / 7.4 / 11.3     1.92 / 19.2 / 12.6 / 15.2    2.61 / 17.4 / 17.1 / 17.3
Best w/o TFIDF  S   1.56 / 31.1 / 10.2 / 15.4    2.46 / 24.6 / 16.1 / 19.4    3.12 / 20.8 / 20.4 / 20.6

Table 4: Performance on feature engineering

Effect  Method        Features
+       Supervised    F1a, F2, F3, F4a, F4d, F9a
+       Unsupervised  F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       Supervised    F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
-       Unsupervised  F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
·       Supervised    F1e, F10, F11, F12
·       Unsupervised  F1b

Table 5: Performance effect of each feature

We also confirmed that candidate alternation offers flexibility in the form of keyphrases, leading to higher candidate coverage as well as better performance.

To re-examine the features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting to TF and/or IDF did not affect performance; we attribute this to the fact that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it contains a heuristic factor that reduces the weight by 10% for simplex words; normally, term cohesion applies to NPs only, hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the findings on each feature.

As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same, namely section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.


9 Conclusion

We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works, while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.

10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.

References

Ken Barker and Nadia Corrnacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pages 10-17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of ACL, 1989, pages 76-83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pages 28-30.

Ernesto D'Avanzo and Bernado Magnini. A keyphrase-based approach to summarization: the LAKE system at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pages 433-447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-specific keyphrase extraction. In Proceedings of IJCAI, 1999, pages 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pages 81-104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic keyword extraction using domain knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beata Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pages 537-544.

Mario Jarmasz and Caroline Barriere. Using semantic similarity over a tera-byte corpus to compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding topic words for hierarchical summarization. In Proceedings of SIGIR, 2001, pages 349-357.

Y. Matsuo and M. Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pages 157-169.

Olena Medelyan and Ian Witten. Thesaurus-based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pages 296-297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pages 207-223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase extraction in scientific publications. In Proceedings of ICADL, 2007, pages 317-326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic glossary extraction: beyond terminology identification. In Proceedings of COLING, 2004, pages 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä, and Timo Honkela. A language-independent approach to keyphrase extraction and evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to extract keyphrases from text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pages 434-439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical automatic keyphrase extraction. In Proceedings of ACM DL, 1999, pages 254-256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based clustering and summarization of web page collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.


Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu

Abstract

We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC), which vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem, experimenting with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.

1 Introduction

In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.

For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE's meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others; for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.

To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning, and that over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as this could potentially have detrimental effects on the quality of the translation.


In this paper we address the problem of MWE classification for verb-noun construction (VNC) tokens in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study we carry out a supervised learning investigation using support vector machines, which uses some of the features that have been shown to help in unsupervised approaches to the problem.

This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.

2 Multi-word Expressions

MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come and ad hoc, and non-fixed expressions such as break new ground and speak of the devil. The VNCs on which we focus in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet whose semantics are more or less transparent. This category consists of Light Verb Constructions (LVC), such as make a living, and Verb Particle Constructions (VPC), such as write-up and call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.

3 Related Work

Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine whether the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.

Hashimoto and Kawahara (2008) (HK08) is the first large-scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features, such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.

4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying MWE expression tokens in context. We specifically focus on VNC MWE expressions. We use the data annotated by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.1 YamCha is based on Support Vector Machine technology, using degree-2 polynomial kernels.

We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside of a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.

We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features, such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma or citation form of the word (LEMMA), and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on; the sketch below illustrates this input encoding.
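The column-based encoding can be sketched as follows (formatting assumptions ours; YamCha consumes one token per line with tab-separated feature columns and the gold label last):

```python
def yamcha_lines(tokens, pos_tags, labels):
    """Emit rows of: token, POS, last-3-character NGRAM feature, IOB label."""
    rows = []
    for tok, pos, lab in zip(tokens, pos_tags, labels):
        rows.append("\t".join([tok, pos, tok[-3:], lab]))
    return "\n".join(rows) + "\n\n"  # a blank line terminates the sentence

print(yamcha_lines(["John", "kicked", "the", "bucket", "last", "Friday"],
                   ["NN", "VBD", "Det", "NN", "ADV", "NN"],
                   ["O", "B-I", "I-I", "I-I", "O", "O"]))
```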

With the NE feature, we followed the same representation as for the other features, as a separate column as described above, referred to as Named Entity Separate (NES). For named entity recognition (NER) we use the BBN IdentiFinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and one with a binary feature indicating whether or not a word is an NE (NES-Bin). Moreover, we added another experimental condition, Named Entity InText (NEI), in which we changed the words' representation in the input to their NE class. For example, under the NEI condition our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER" (and Friday by "DAY").

1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder

5 Experiments and Results

5.1 Data

We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC)3. The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown, as well as those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4, corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.

5.2 Evaluation Metrics

We use Fβ=1 (F-measure), the harmonic mean of (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
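For reference, the metric can be written out as a small helper (a standard definition, not code from the paper):

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta: the weighted harmonic mean of precision and recall (beta = 1 here)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```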

5.3 Results

We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The FREQ baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that this baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.6

In Table 2 we present the results yielded per feature and per condition. We initially experimented with different context sizes to decide on the optimal window size for our learning framework; these results are presented in Table 1. Once that was determined, we proceeded to add features.

3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.
6 We could have easily identified all VNC syntactic configurations corresponding to verb object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set exclusively.

Noting that a window size of ±3 yields the best results, we use it as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the FREQ baseline. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic FREQ baseline, with an overall F-measure of 77.04%.

Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P with NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.

6 Discussion

The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features yields the best results. In general, performance on the classification and identification of idiomatic expressions is much better. This may be due to the fact that the data has many more idiomatic token examples for training. We also note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

We analyzed some of the errors made in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones (23 times) and the opposite (4 times).

Some of the errors where the VNC should have been classified as literal, but the system classified it as idiomatic, are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal occur for the MWE types hit the road, blow trumpet, blow whistle, hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

The system also handled As the ball hit the post, the referee blew the whistle correctly, identifying blew the whistle as a literal VNC in this context and hit the post as another literal VNC.

7 Conclusion

In this study we explored a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system; using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58%. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.


      IDM-F   LIT-F   Overall F   Overall Acc
±1    77.93   48.57   71.78       96.22
±2    85.38   55.61   79.71       97.06
±3    86.99   55.68   81.25       96.93
±4    86.22   55.81   80.75       97.06
±5    83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying the context window size

                  IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ              70.02   89.16   78.44   0       0       0       69.68
TOK               81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA           83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM           83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS             83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P             86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full          85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin           84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI               89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P    89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P     90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P         91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations

8 Acknowledgement

The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments of two anonymous reviewers, which helped to make this publication more concise and better presented.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89-96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329-336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41-48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96-103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expression tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992-1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353-360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317-324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369-379, Prague, Czech Republic, June. Association for Computational Linguistics.

Dan I. Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97-108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33-40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1-15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100-108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754-762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, Czech Republic, June. Association for Computational Linguistics.


Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de

Abstract

Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.

1 Introduction

Parallel corpora have proved to be a valuable resource, not only for statistical machine translation, but also for the crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:

(1) Der Rat sollte unsere Position berücksichtigen.
    (the Council should our position consider)

(2) The Council should take account of our position.

This sentence pair is taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that our method is able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure. A sketch of this two-step pipeline is given below.
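The sketch uses pointwise mutual information as one representative association measure (the paper itself names no single measure, so this choice is ours):

```python
import math

def pmi(pair_count, w1_count, w2_count, n_tokens):
    """Pointwise mutual information, one common association measure for step 2."""
    return math.log2((pair_count * n_tokens) / (w1_count * w2_count))

def rank_candidates(pair_counts, word_counts, n_tokens):
    # Step 1 (not shown) restricts candidates to a target pattern, e.g.
    # verb+particle; step 2 ranks the surviving candidates by association score.
    return sorted(pair_counts,
                  key=lambda p: pmi(pair_counts[p], word_counts[p[0]],
                                    word_counts[p[1]], n_tokens),
                  reverse=True)
```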

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for each item in a target language (and whether these one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item, and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches, which need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realizations is already very useful.

In this work we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on the frequency distributions in a given corpus, and is based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 therefore evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.

2 Multiword Translations as MWEs

The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                 1-1    1-n    n-1    n-n    No.
anheben (v1)         53.5   21.2   9.2    1.6    325
bezwecken (v2)       16.7   51.3   0.6    31.3   150
riskieren (v3)       46.7   35.7   0.5    1.7    182
verschlimmern (v4)   30.2   21.5   28.6   44.5   275

Table 1: Proportions of types of translational correspondences (token level) in our gold standard

We constructed a gold standard covering all English translations of four German verb lemmas, extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object, and lie in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl containing occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by (Graça et al., 2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions for the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) and verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.

A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations: This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.


(3) Sie verschlimmern die Übel.
    (they aggravate the misfortunes)

(4) Their action is aggravating the misfortunes.

Verb particle combinations: A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der Ausschuss bezweckt, den Institutionen ein politisches Instrument an die Hand zu geben.
    (the committee intends the institutions a political instrument at the hand to give)

(6) The committee set out to equip the institutions with a political instrument.

Verb preposition combinations: While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.

(7) Sie werden den Treibhauseffekt verschlimmern.
    (they will the greenhouse effect aggravate)

(8) They will add to the greenhouse effect.

Light verb constructions (LVCs): This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).

(9) Ich werde die Sache nur noch verschlimmern.
    (I will the thing only just aggravate)

(10) I am just making things more difficult.

Idioms: This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to the gold standard's limited size and the source items we chose.

(11) Sie bezwecken die Umgestaltung in eine zivile Nation.
     (they intend the conversion into a civil nation)

(12) They have in mind the conversion into a civil nation.

          v1        v2        v3        v4
N_type    22 (26)   41 (47)   26 (35)   17 (24)
V Part    22.7      4.9       0.0       0.0
V Prep    36.4      41.5      3.9       5.9
LVC       18.2      29.3      88.5      88.2
Idiom     0.0       2.4       0.0       0.0
Para      36.4      24.3      11.5      23.5

Table 2: Proportions (in %) of MWE types per lemma

Paraphrases: From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for the semantic processing of parallel corpora. An example is given below.

(13) Wir brauchen bessere Zusammenarbeit, um die Rückzahlungen anzuheben.
     (we need better cooperation to the repayments-OBJ increase)

(14) We need greater cooperation in this respect to ensure that repayments increase.

Table 2 displays the proportions of the MWEcategories for the number of types of one-to-manycorrespondences in our gold standard We filteredthe types due to morphological variations only (theoverall number of types is indicated in brackets)Note that some types in our gold standard fall intoseveral categories eg they combine a verb prepo-sition with a verb particle construction For allof the verbs the number of types belonging tocore MWE categories largely outweighs the pro-portion of paraphrases As we already observedin our analysis of general translation categorieshere again the different verb lemmas show strik-ing differences with respect to their realization inEnglish translations For instanceanheben(enincrease) or bezwecken(en intend) are frequently

25

translated with verb particle or preposition combi-nations while the other verbs are much more of-ten translated by means of LVCs Also the morespecific LVC patterns differ largely among theverbs Whileverschlimmern(en aggravate) hasmany different adjectival LVC correspondencesthe translations ofriskieren(en risk) are predomi-nantly nominal LVCs The fact that we found veryfew idioms in our gold standard may be simplyrelated to our arbitrary choice of German sourceverbs that do not have an English idiom realiza-tion (see our experiment on a random set of verbsin Section 33)

In general, one-to-many translational correspondences seem to provide a very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.

3 Multiword Translation Detection

This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.

3.1 One-to-many Alignments

To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the rate of translational correspondences that the system discovers on the type level against the one-to-many translations in our gold standard. By type, we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the

verb    n > 0           n > 1           n > 3
        FPR    FNR      FPR    FNR      FPR    FNR
v1      0.97   0.93     1.0    1.0      1.0    1.0
v2      0.93   0.9      0.5    0.96     0.0    0.98
v3      0.88   0.83     0.8    0.97     0.67   0.97
v4      0.98   0.92     0.8    0.92     0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments

translation types is so low that already at a threshold of 3 almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that correspond to the source token is rare.

This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die Korruption behindert die Entwicklung
     The corruption impedes   the development
(16) Corruption constitutes an obstacle to development

Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.

3.2 Aligning Syntactic Configurations

High-precision detection of one-to-many translations thus involves two major problems: 1) How to identify sentences or configurations where reliable lexical correspondences can be found. 2) How to align target items that have a low occurrence correlation.

We argue that both of these problems can be addressed by taking syntactic information into account.


As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations, b) filtering the configurations, c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs will briefly characterize these steps.

[Figure content omitted: two aligned dependency trees for the sentence pair in (15)/(16), the German verb behindert with arguments X1 and Y1, and an English tree rooted in create dominating X2, an, obstacle, to, and Y2.]

Figure 1: Example of a typical syntactic MWE configuration

Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (D_G, D_E, A_GE). D_G is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. D_E is the set of dependency triples of the target sentence. A_GE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.

Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.

Filtering: Given our source candidates, a valid parallel configuration (D_G, D_E, A_GE) is then defined by the following conditions:
1. The source configuration D_G is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn ∈ D_G, there is a tuple (sn, tn) ∈ A_GE, i.e. every dependent has an alignment.
3. There is a target item t1 ∈ D_E such that for each tn there is a path p ⊂ D_E, p = (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn), that connects t1 and tn. Thus, the target dependents have a common root.

To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition on the configuration not to contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb; a sketch of the filtering conditions is given below.
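The filtering conditions translate almost directly into code. The following is a minimal sketch, assuming dependency triples and alignments are stored as plain tuples in the (D_G, D_E, A_GE) representation above; the path-length cutoff, the coordination label "CC", and all function names are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

# Hypothetical layout: D_G, D_E are sets of (head, rel, dependent) triples;
# A_GE is a set of (source_token, target_token) alignment pairs.

def target_path_exists(D_E, root, dep, max_path_len=3):
    """Condition 3 plus the noise filter: a dependency path of bounded
    length connects the candidate target root to an aligned dependent."""
    frontier = deque([(root, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if node == dep:
            return True
        if depth < max_path_len:
            for head, rel, child in D_E:
                if head == node and rel != "CC":  # discard coordination
                    frontier.append((child, depth + 1))
    return False

def is_valid_configuration(D_G, D_E, A_GE, source_verb):
    # Condition 1: the source verb and its syntactic dependents.
    dependents = [d for (h, rel, d) in D_G if h == source_verb]
    if not dependents:
        return False
    # Condition 2: every dependent must have an alignment.
    align = dict(A_GE)
    if any(d not in align for d in dependents):
        return False
    # Condition 3: some target item dominates all aligned dependents.
    nodes = {h for (h, r, d) in D_E} | {d for (h, r, d) in D_E}
    return any(all(target_path_exists(D_E, t1, align[d]) for d in dependents)
               for t1 in nodes)
```

The head-switching check (a reliably aligned target item outside the source configuration) would be one additional alignment lookup on top of this skeleton.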

Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in 3.2 was that all


the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.

Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) ∈ D_E such that tx is an element of our target configuration, we need to decide whether (s1, ti) ∈ A_GE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti ∈ T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.
2. For each ti, tx such that there is a (ti, rel, tx) ∈ D_E, compute corr(s1, T + tx).
3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.

As the Dice coefficient is often reported to give the best results, e.g. in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:

corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti ∈ T, and freq(s1) accordingly. The observation frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
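A minimal sketch of this post-editing loop under the Dice definition above; `sentence_pairs` (a list of (source_lemmas, target_lemmas) set pairs) and the traversal order of the dependents are illustrative assumptions.

```python
def dice(s1, T, sentence_pairs):
    """corr(s1, T) = 2 * freq(s1 and T) / (freq(s1) + freq(T))."""
    f_s1 = sum(1 for src, tgt in sentence_pairs if s1 in src)
    f_T = sum(1 for src, tgt in sentence_pairs if T <= tgt)
    f_joint = sum(1 for src, tgt in sentence_pairs if s1 in src and T <= tgt)
    return 2.0 * f_joint / (f_s1 + f_T) if f_s1 + f_T else 0.0

def post_edit(s1, T, D_E, sentence_pairs):
    """Greedily add target dependents whose inclusion does not lower
    the correlation with the source item (steps 1-3 above)."""
    T = set(T)
    for head, rel, dep in D_E:
        if head in T and dep not in T:
            if dice(s1, T | {dep}, sentence_pairs) >= dice(s1, T, sentence_pairs):
                T.add(dep)
    return T
```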

The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation for the English MWE run the risk of.

3.3 Evaluation

Our translational approach to MWE extraction bears the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We will first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma don't figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that doesn't constrain the syntactic type of the parallel argument relations. While not decreasing the FNR, the FPR decreases significantly. This means that the syntactic filters mainly fire on noisy configurations and don't decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down


the number of candidate configurations that our algorithm detects. As the dependency parses don't provide deep analyses for tense or control phenomena, very often a verb's arguments don't figure as its syntactic dependents and no configuration is found. Future work will explore the impact of deep syntactic analysis for the detection of translational correspondences.

MWE evaluation: In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. Therefore, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.

4 Related Work

The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was

        Strict Filter       Relaxed Filter
        FPR     FNR         FPR     FNR
v1      0.0     0.96        0.5     0.96
v2      0.25    0.88        0.47    0.79
v3      0.25    0.74        0.56    0.63
v4      0.0     0.875       0.56    0.84

Table 4: False positive and false negative rate of one-to-many translations

Trans. type     Proportion      MWE type    Proportion
MWEs            57.5            V+Part      8.2
                                V+Prep      51.8
                                LVC         32.4
                                Idiom       10.6
Paraphrases     24.4
Alternations    1.0
Noise           17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs

established by (Wu, 1997). Our way to use syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moiron and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches don't rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.

By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado, 2009).

The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already


been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of simplex realization in a source language to an idiosyncratic multiword realization in the target language.

5 Conclusion

We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for monolingual interpretation of MWEs.

Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment, and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.

References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12-20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1-7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597-604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240-1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, pp. 311-326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57-65.

Joao de Almeida Varelas Graca, Joana Paulo Pardal, Luísa Coheur, and Diamantino Antonio Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical report, Tech. Rep. 38/2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48-54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79-86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216-2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54-57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1-15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152-156.

Begona Villada Moiron and Jorg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33-40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1-8.



A Re-examination of Lexical Association Measures

Abstract

We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which shows significant improvement over the original AMs.

1 Introduction

Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood

ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.

While the previous works have evaluated AMs, there have been few details on why the AMs perform as they do. Such a detailed analysis is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.

We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g. put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the

Hung Huu Hoang
Dept. of Computer Science
National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg


mobility of noun phrase complements in transitive VPCs: they can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, the verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.

2 Classification of Association Measures

Although different AMs have different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e. the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure this association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ² test.

Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then

follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and dice similarity; distance metrics such as the Euclidean and Manhattan norm; and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.

Table 1 lists all AMs used in our discussion. The lower left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g. [M2] Mutual Information) for the reader's convenience.

3 Evaluation

We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.

3.1 Evaluation Data

In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus, as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.

For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.


Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.

M1 Joint Probability: f(xy) / N
M2 Mutual Information: (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3 Log likelihood ratio: 2 Σ_ij f_ij log(f_ij / f̂_ij)
M4 Pointwise MI (PMI): log( P(xy) / (P(x∗) P(∗y)) )
M5 Local-PMI: f(xy) × PMI
M6 PMI^k: log( N f(xy)^k / (f(x∗) f(∗y)) )
M7 PMI²: log( N f(xy)² / (f(x∗) f(∗y)) )
M8 Mutual Dependency: log( P(xy)² / (P(x∗) P(∗y)) )
M9 Driver-Kroeber: a / √((a+b)(a+c))
M10 Normalized expectation: 2a / (2a + b + c)
M11 Jaccard: a / (a + b + c)
M12 First Kulczynski: a / (b + c)
M13 Second Sokal-Sneath: a / (a + 2(b + c))
M14 Third Sokal-Sneath: (a + d) / (b + c)
M15 Sokal-Michiner: (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
M17 Hamann: ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds ratio: ad / bc
M19 Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q: (ad − bc) / (ad + bc)
M21 Brawn-Blanquet: a / max(a+b, a+c)
M22 Simpson: a / min(a+b, a+c)
M23 S cost: log(1 + min(b, c)/(a + 1))^(−1/2)
M24 Adjusted S Cost*: log(1 + max(b, c)/(a + 1))^(−1/2)
M25 Laplace: (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace*: (a + 1) / (a + max(b, c) + 2)
M27 Fager: [M9] − (1/2) max(b, c)
M28 Adjusted Fager*: [M9] − (1/√(aN)) max(b, c)
M29 Normalized PMIs*: PMI / NF(α), PMI / NFmax
M30 Simplified normalized PMI for VPCs*: log(ad) / (α × b + (1 − α) × c)
M31 Normalized MIs*: MI / NF(α), MI / NFmax

NF(α) = α P(x∗) + (1 − α) P(∗y), α ∈ [0, 1]; NFmax = max(P(x∗), P(∗y)).

The variables form the contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy); b = f12 = f(xȳ); c = f21 = f(x̄y); d = f22 = f(x̄ȳ); f(x∗) = a + b; f(∗y) = a + c. Here w̄ stands for all words except w, ∗ stands for all words, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x∗) f(∗y) / N.


permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g. the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs. A sketch of the two extraction windows is given below.
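As a rough sketch of the two search windows just described (the Penn tag prefixes, the truncated particle set, and the token format are simplifying assumptions, and the leftward passive search for LVCs is omitted for brevity):

```python
LIGHT_VERBS = {"do", "get", "give", "have", "make", "put", "take"}
PARTICLES = {"up", "off", "on", "out", "down", "in"}  # subset of the 38 used

def vpc_candidates(tagged, max_np=5):
    """tagged: list of (token, Penn POS tag) pairs. For each particle,
    search leftward for the nearest verb, allowing an NP of up to 5 words."""
    for i, (tok, _) in enumerate(tagged):
        if tok in PARTICLES:
            for j in range(i - 1, max(i - 2 - max_np, -1), -1):
                if tagged[j][1].startswith("VB"):
                    yield tagged[j][0], tok
                    break

def lvc_candidates(tagged, max_gap=4):
    """For each light verb, search rightward for the nearest noun,
    allowing up to 4 intervening words (quantifiers, modifiers, ...)."""
    for i, (tok, _) in enumerate(tagged):
        if tok in LIGHT_VERBS:
            for j in range(i + 1, min(i + 2 + max_gap, len(tagged))):
                if tagged[j][1].startswith("NN"):
                    yield tok, tagged[j][0]
                    break
```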

Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in (Baldwin, 2005), and 464 annotated LVC candidates used in (Tan et al., 2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data        LVC data    Mixed
Total (freq ≥ 6)     413             100         513
Positive instances   117 (28.33%)    28 (28%)    145 (28.26%)

Table 2: Evaluation data sizes (type count, not token)

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.

3.2 Evaluation Metric

To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
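For concreteness, a minimal sketch of AP over a ranked candidate list (this follows the plain "precision at each correctly ranked positive" form; tie handling and interpolation details vary between implementations):

```python
def average_precision(ranked_labels):
    """ranked_labels: gold labels (True = MWE) ordered by descending AM score.
    AP is the mean of the precision values taken at each positive item."""
    hits, precisions = 0, []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy ranking with three positives among five candidates:
print(average_precision([True, False, True, True, False]))  # ~0.806
```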

3.3 Evaluation Results

Figure 1 (a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that is relied on in hypothesis tests.

4 Rank Equivalence

We note that some AMs, although not mathematically equivalent (i.e. assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r_C M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck), for all cj, ck belonging to C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:


Corollary: If M1 ≡r_C M2, then AP_C(M1) = AP_C(M2), where AP_C(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the following 3 AMs: Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with their (most simple) representatives from each rank equivalent group.

¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types. Hence, we still consider such AM pairs to be rank equivalent.

1) [M2] Mutual Information, [M3] Log likelihood ratio
2) [M7] PMI², [M8] Mutual Dependency, [M9] Driver-Kroeber (a.k.a. Ochiai)
3) [M10] Normalized expectation, [M11] Jaccard, [M12] First Kulczynski, [M13] Second Sokal-Sneath (a.k.a. Anderberg)
4) [M14] Third Sokal-Sneath, [M15] Sokal-Michiner, [M16] Rogers-Tanimoto, [M17] Hamann
5) [M18] Odds ratio, [M19] Yule's ω, [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs
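Row 5 illustrates the idea well: Yule's Q equals (OR − 1)/(OR + 1), a strictly increasing function of the odds ratio OR = ad/bc, so the two measures must order any candidate set identically. A small numerical check of this claim (the random contingency tables are purely an illustrative device):

```python
import random

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

def yules_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

random.seed(0)
tables = [tuple(random.randint(1, 500) for _ in range(4)) for _ in range(1000)]
by_or = sorted(range(len(tables)), key=lambda i: odds_ratio(*tables[i]))
by_q = sorted(range(len(tables)), key=lambda i: yules_q(*tables[i]))
assert by_or == by_q  # identical rankings, hence identical average precision
```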

5 Examination of Association Measures

We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.

5.1 Mutual Information and Pointwise Mutual Information

In Figure 1, over 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in identifying LVCs.

[Plots omitted. Figure 1a: AP profile of AMs examined over our VPC data set. Figure 1b: AP profile of AMs examined over our LVC data set. Both plots show AP values ranging from 0.1 to 0.6.]

Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Bold points indicate AMs discussed in this paper. Legend: hypothesis test AMs; ◊ Class I AMs excluding hypothesis test AMs; + context-based AMs.


In this section, we show how their performance can be improved significantly.

Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U, V) = Σ_{u,v} p(uv) log( p(uv) / (p(u∗) p(∗v)) )

In the context of bigrams, the above formula can be simplified to [M2] MI = (1/N) Σ_ij f_ij log(f_ij / f̂_ij). While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log( P(xy) / (P(x∗) P(∗y)) ) = log( N f(xy) / (f(x∗) f(∗y)) )

It has long been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI with f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying f(xy) directly with PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                    VPCs           LVCs           Mixed
Best [M6] PMI^k       54.7 (k=.13)   57.3 (k=.85)   54.4 (k=.32)
[M4] PMI              51.0           56.6           51.5
[M5] Local-PMI        25.9           39.3           27.2
[M1] Joint Prob.      17.0           28.0           17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.

Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X, Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and PMI(x, y) ≤ min(−log P(x∗), −log P(∗y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α P(x∗) + (1 − α) P(∗y), with α ∈ [0, 1], and NFmax = max(P(x∗), P(∗y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC task, respectively. If one re-writes MI as (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.

AM            VPCs           LVCs           Mixed
MI / NF(α)    50.8 (α=.48)   58.3 (α=.47)   51.6 (α=.5)
MI / NFmax    50.8           58.4           51.8
[M2] MI       27.3           43.5           28.9
PMI / NF(α)   59.2 (α=.8)    55.4 (α=.48)   58.8 (α=.77)
PMI / NFmax   56.5           51.7           55.6
[M4] PMI      51.0           56.6           51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
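A minimal sketch of the two normalizations in the contingency notation of Table 1 (the counts in the example call are invented for illustration):

```python
from math import log

def pmi(a, b, c, d):
    """[M4] PMI = log(N * f(xy) / (f(x*) * f(*y)))."""
    N = a + b + c + d
    return log(N * a / ((a + b) * (a + c)))

def pmi_nf_alpha(a, b, c, d, alpha=0.8):
    """PMI / NF(alpha), NF(alpha) = alpha * P(x*) + (1 - alpha) * P(*y)."""
    N = a + b + c + d
    nf = alpha * (a + b) / N + (1 - alpha) * (a + c) / N
    return pmi(a, b, c, d) / nf

def pmi_nf_max(a, b, c, d):
    """PMI / NFmax, NFmax = max(P(x*), P(*y))."""
    N = a + b + c + d
    return pmi(a, b, c, d) / max((a + b) / N, (a + c) / N)

# A bigram with a very frequent second constituent (e.g. a particle):
# the large marginal inflates NFmax and damps the normalized score.
print(pmi(50, 30, 5000, 100000), pmi_nf_max(50, 30, 5000, 100000))
```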

5.2 Penalization Terms

It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulas whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Brawn-Blanquet (a.k.a. Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except


for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in AMs [M23] S Cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                      VPCs    LVCs    Mixed
[M21] Brawn-Blanquet    47.8    57.8    48.6
[M22] Simpson           24.9    38.2    26.0
[M24] Adjusted S Cost   48.5    57.7    49.2
[M23] S cost            24.9    38.8    26.0
[M26] Adjusted Laplace  48.6    57.7    49.3
[M25] Laplace           24.1    38.8    25.4

Table 6: Replacing min() with max() in selected AMs

In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is no stranger, but rank equivalent to [M7] PMI². In our application, this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ by a scaled-down version of max(b, c). We have approximately derived 1/√(aN) as a lower-bound estimate of max(b, c), using the independence assumption, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                    VPCs    LVCs    Mixed
[M28] Adjusted Fager  56.4    54.3    55.4
[M27] Fager           55.2    43.9    52.5

Table 7: Performance of Fager and its adjusted version

The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based on only the frequencies of particles. So it is necessary that we scale b and c properly, as in [M14]: −α × b − (1 − α) × c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14]: PMI / (α × b + (1 − α) × c). The denominator of [MR14] is obtained by removing the minus sign from [M14], so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α P(x∗) + (1 − α) P(∗y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically tried to simplify [MR14] to the following AM: [M30] log(ad) / (α × b + (1 − α) × c). The setting of α in Table 8 below is taken from the best α setting obtained in the experiment on the normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.

AM                              VPCs          LVCs           Mixed
PMI / NF(α)                     59.2 (α=.8)   55.4 (α=.48)   58.8 (α=.77)
[M30] log(ad)/(α·b + (1−α)·c)   60.0 (α=.8)   48.4 (α=.48)   58.8 (α=.77)
[M14] Third Sokal-Sneath        56.5          45.3           54.6

Table 8: AP performance of suggested VPCs' penalization terms and AMs

With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α · c^(1−α). The best α value is also taken from the normalized PMI experiments (Table 5), which is nearly .5. Under this setting, this penalization term is exactly the denominator of the [M18] Odds Ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.


AM                VPCs    LVCs    Mixed
−b − c            56.5    45.3    54.6
1/(bc)            50.2    53.2    50.2
[M18] Odds ratio  44.3    56.7    45.6

Table 9: AP performance of suggested LVCs' penalization terms and AMs

6 Conclusions

We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and we find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired T test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.

We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α × b + (1 − α) × c is a good penalization term for VPCs, while b^α · c^(1−α) is suitable for LVCs. Our introduced α tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger

web-based datasets of English to verify the performance gains of our modified AMs.

Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).

References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98-104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188-195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39-46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proc. of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530-1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651-658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco, and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50-53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143-177.

Tan, Yee Fan, Kan, Min-Yen, and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, pages 49-56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119-129, Rio de Janeiro, Brazil.

Appendix A. List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.



Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

R. Mahesh K. Sinha
Department of Computer Science & Engineering
Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in

Abstract

A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of the light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.

1 Introduction

Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multi-word expression (MWE) where a noun, a verb, or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007), and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features. They report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.

In this paper, we present a simple strategy for mining CPs in Hindi using the projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of CPs in Hindi is outlined; this is followed by the system design, experimentation, and results.

2 Complex Predicates in Hindi

A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective, or an adverb as the main predicator, followed by a light verb (LV). Thus, a CP can be a noun+LV, an adjective+LV, a verb+LV, or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning


distinct from that of the LV. CPs are also referred to as complex or compound verbs.

Given below are some examples (1) CP=noun+LV noun = ashirwad blessings LV = denaa to give usane mujhe ashirwad diyaa उसन मझ आश वाद दया he me blessings gave he blessed me (2) No CP usane mujhe ek pustak dii उसन मझ एक प तक द he me one book gave he gave me a book In (1) the light verb diyaa (gave) in its past tense form with the noun ashirwad (blessings) makes a complex predicate verb form ashirwad diyaa (blessed) in the past tense form The CP here is ashirwad denaa and its corresponding English translation is lsquoto blessrsquo On the other hand in ex-ample (2) the verb dii (gave) is a simple verb in past tense form and is not a light verb Although same Hindi verb denaa (to give) is used in both the examples it is a light verb in (1) and a main verb in (2) Whether it acts as a light verb or not depends upon the semantics of the preceding noun However it is observed that the English meaning in case of the complex predicate is not derived from the individual meanings of the con-stituent words It is this observation that forms basis of our approach for mining of CPs (3) CP=adjective+LV adjective=khush happy LV=karanaa to do usane mujhe khush kiyaa उसन मझ खश कया he me happy did he pleased me Here the Hindi verb kiyaa (did) is the past tense form of a light verb karanaa (to do) and the pre-ceding word khush (happy) is an adjective The CP here is khush karanaa (to please) (4) CP=verb+LV verb = paRhnaa to read LV = lenaa to take usane pustak paRh liyaa

उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take) and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual, a modal or an intensifier.

(5) CP = verb+LV; verb = phaadanaa 'to tear', LV = denaa 'to give'
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give) and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa 'to give', LV = maaranaa 'to hit, to kill'
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill) and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning. This may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas 'back', LV1 = karanaa 'to do', LV2 = denaa 'to give'; or CP = CP+LV; CP = vaapas karanaa 'to return', LV = denaa 'to give'
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave


'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV denaa (to give) signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs: vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.

From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.

3 System Design

As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows:

1) Align the sentences of the Hindi-English corpus.

2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).

3) For each Hindi light verb, generate all the morphological forms (Figure 1).

4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).

5) For each Hindi-English aligned sentence, execute the following steps:

a) For each light verb of Hindi (Table 1), execute the following steps:

i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).

ii) If the LV or its morphological derivative is found, then search for the equivalent English meanings in any of the morphological forms (Figure 2) in the corresponding aligned English sentence.

iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, then exit, i.e., go to step (a).

iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.

v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).

vi) If W is an 'exit word' (Figure 4), then exit, i.e., go to step (a); else the identified CP is W+LV.
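The procedure can be made concrete with a short sketch. The following Python fragment is a minimal illustration of steps 5(a)(i)-(vi), not the authors' actual implementation; the resource names (light_verb_forms, english_meaning_forms, stop_words, exit_words) are hypothetical stand-ins for the data of Table 1 and Figures 1-4:

    # Illustrative sketch of the CP mining procedure described above.
    def mine_cps(aligned_pairs, light_verb_forms, english_meaning_forms,
                 stop_words, exit_words):
        """aligned_pairs: list of (hindi_tokens, english_tokens) pairs.
        light_verb_forms: dict light_verb -> set of its morphological forms.
        english_meaning_forms: dict light_verb -> set of English forms."""
        mined = []
        for hindi, english in aligned_pairs:            # outer loop: whole corpus
            english_set = set(w.lower() for w in english)
            for lv, forms in light_verb_forms.items():  # step (a): each light verb
                for k, token in enumerate(hindi):
                    if token not in forms:
                        continue                        # step (i): LV at position K
                    # step (ii): any English meaning of the LV in the aligned sentence?
                    if english_set & english_meaning_forms[lv]:
                        continue                        # meaning present -> not a CP
                    # steps (iii)-(v): scan leftwards, skipping stop words
                    j = k - 1
                    while j >= 0 and hindi[j] in stop_words:
                        j -= 1
                    if j < 0 or hindi[j] in exit_words:
                        continue                        # step (vi): exit word -> no CP
                    mined.append((hindi[j], lv))        # identified CP: W + LV
        return mined

The inner loop over positions naturally picks up multiple CPs in one sentence, mirroring the nested loop structure described below.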

Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings have been shown within parentheses and are not used for matching.

light verb (base form)    meaning as a simple verb
baithanaa बैठना            sit
bananaa बनना              make, become, build, construct, manufacture, prepare
banaanaa बनाना            make, build, construct, manufacture, prepare
denaa देना                give
lenaa लेना                take
paanaa पाना               obtain, get
uthanaa उठना              rise, arise, get up
uthaanaa उठाना            raise, lift, wake up
laganaa लगना              feel, appear, look, seem
lagaanaa लगाना            fix, install, apply
cukanaa चुकना             (finish)
cukaanaa चुकाना           pay
karanaa करना              (do)
honaa होना                happen, become, be
aanaa आना                 come
jaanaa जाना               go
khaanaa खाना              eat
rakhanaa रखना             keep, put
maaranaa मारना            kill, beat, hit
daalanaa डालना            put
haankanaa हाँकना           drive

Table 1: Some of the common light verbs in Hindi


For each of the Hindi light verbs, all morphological forms are generated. A few illustrations are given in Figures 1(a) and 1(b). Similarly, for each of the English meanings of the light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations of the same.

There are a number of words that can appear in between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. These are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.

We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list used. However, this list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, and so the nature of the word preceding the LV is unknown to the system.

Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa जाना (to go): jaa जा (stem), jaae जाए (imperative), jaao जाओ (imperative), jaaeM जाएँ (imperative), jaauuM जाऊँ (first-person), jaane जाने (infinitive oblique), jaanaa जाना (infinitive masculine singular), jaanii जानी (infinitive feminine singular), jaataa जाता (indefinite masculine singular), jaatii जाती (indefinite feminine singular), jaate जाते (indefinite masculine plural/oblique), jaaniiM जानीं (infinitive feminine plural), jaatiiM जातीं (indefinite feminine plural), jaaoge जाओगे (future masculine singular), jaaogii जाओगी (future feminine singular), gaii गई (past feminine singular), jaauuMgaa जाऊँगा (future masculine first-person singular), jaayegaa जायेगा (future masculine third-person singular), jaauuMgii जाऊँगी (future feminine first-person singular), jaayegii जायेगी (future feminine third-person singular), gaye गये (past masculine plural/oblique), gaiiM गईं (past feminine plural), gayaa गया (past masculine singular), gayii गयी (past feminine singular), jaayeMge जायेंगे (future masculine plural), jaayeMgii जायेंगी (future feminine plural), jaakara जाकर (completion), ...

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa लेना (to take): le ले (stem), lii ली (past), leM लें (imperative), lo लो (imperative), letaa लेता (indefinite masculine singular), letii लेती (indefinite feminine singular), lete लेते (indefinite masculine plural/oblique), liiM लीं (past feminine plural), luuM लूँ (first-person), legaa लेगा (future masculine third-person singular), legii लेगी (future feminine third-person singular), lene लेने (infinitive oblique), lenaa लेना (infinitive masculine singular), lenii लेनी (infinitive feminine singular), liyaa लिया (past masculine singular), leMge लेंगे (future masculine plural), loge लोगे (future masculine singular), letiiM लेतीं (indefinite feminine plural), luuMgaa लूँगा (future masculine first-person singular), luuMgii लूँगी (future feminine first-person singular), lekara लेकर (completion), ...

Figure 2: Morphological derivatives of sample English meanings: sit → sit, sits, sat, sitting; give → give, gives, gave, given, giving; ...


Figure 3: Stop words in Hindi used by the system: नहीं (no/not), न (no/not, particle), भी (also, particle), ही (only, particle), तो (then, particle), क्यों (why), क्या (what, particle), कहाँ (where, particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरंभ में (in the beginning), अंत में (in the end), आखिर में (in the end)

Figure 4: A few exit words in Hindi used by the system: ने (ergative case marker), को (accusative case marker), का (possessive case marker), के (possessive case marker), की (possessive case marker), से (from/by/with), में (in/into), पर (on/but), और (and, particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that, particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा (my), मेरे (my), मेरी (my), तुम्हारा (your), तुम्हारे (your), तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना (own), अपनी (own), अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका (whose), जिसकी (whose), जिसके (whose), जिनको (to whom), जिनके (to whom)

The inner loop of the procedure identifies multiple CPs that may be present in a sentence. The outer loop is for mining the CPs in the entire corpus. The experimentation and results are discussed in the following section.

4 Experimentation and Results

The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%. The F-measure is 88% to 97%. This is a remarkable and somewhat surprising result for the simple methodology, which involves little linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al 2006 (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; theirs is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                 File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                    112     193     102      43     133     107
Total no. of CPs (N)                200     298     150      46     188     151
Correctly identified CPs (TP)       195     296     149      46     175     135
V-V CPs                              56      63       9       6      15      20
Incorrectly identified CPs (FP)      17      44       7      11      16      20
Unidentified CPs (FN)                 5       2       1       0      13      16
Accuracy (%)                      97.50   99.33   99.33  100.00   93.08   89.40
Precision (TP/(TP+FP)) (%)        91.98   87.05   95.51   80.70   91.62   87.09
Recall (TP/(TP+FN)) (%)           97.50   98.33   99.33  100.00   93.08   89.40
F-measure (2PR/(P+R)) (%)          94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation



Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence: (i) काम करना (to work), (ii) अच्छा लगना (to feel good, enjoy), (iii) सलाह लेना (to seek advice), (iv) खुशी होना (to feel happy, good), (v) मदद करना (to help)

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।

The CPs identified in the sentence: (i) आदर्श बनना (to be a role model), (ii) कद्र करना (to respect)

Here also, the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not being checked. However, the mining success rate obtained suggests that these cases are small in number in practice. The use of the 'stop words' in allowing intervening words within the CPs helps a lot in improving the performance. Similarly, the use of the 'exit words' avoids a lot of incorrect identifications.

5 Conclusions

The simple empirical method for mining CPs outlined in this work yields an average of 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process. This will increase both the alignment accuracy and the speed.

The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family

References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553–564.



Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2

1Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn
2Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
huangyun@comp.nus.edu.sg

Abstract

Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.

1 Introduction

The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, the translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.

A multiword expression (MWE) can be considered as a word sequence with relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWEs. Sag et al (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e. the property that the whole expression cannot be derived from its component words. Stanford University launched a MWE project1, in which different qualities of MWEs were presented. For bilingual multiword expressions (BiMWEs), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language, and (2) the source phrase and the target phrase must be translations of each other exactly, i.e. there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al 2005; Bannard 2007; Fazly and Stevenson 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.

For machine translation, Piao et al (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve the SMT quality. In their work, a bilingual MWE in the training corpus was grouped as

1 http://mwe.stanford.edu


one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.

In this paper we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving the SMT performance rather than improving the word alignment quality. In detail, the differences are listed as follows:

• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving the word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics only lead to small gains in translation performance (Ayan and Dorr 2006; Fraser and Marcu 2007).

Besides the above differences, there are some advantages to our approaches:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would further propagate to the SMT system and hurt SMT performance.

• We conduct experiments on domain-specific corpora. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e. a domain-specific corpus has high MWE coverage. For another, after investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are more flexible and concise. Take, for example, the Chinese term whose meaning is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or without this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.

• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus it benefits from the full exploitation of available resources without greatly increasing the time and space complexity of the SMT system.

2 http://www.statmt.org/moses

The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction to monolingual and bilingual MWE extraction.

2 Bilingual Multiword Expression Extraction

In this section we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.

2.1 Automatic Extraction of MWEs

In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin 2001; Piao et al 2005), (2) syntactic approaches (Fazly and Stevenson 2006; Bannard 2007), and (3) semantic approaches (Baldwin et al 2003; Cruys and Moiron 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparation step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. Most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.

Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".

Definition 1 (Unit): A unit is any sub-string of the given sentence. For example, "A B", "C", "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List): A list is an ordered sequence of units which exactly cover the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score): The score function is only defined on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit3. For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select): The selecting operator is to find the two adjacent units with maximum score in a list.

Definition 5 (Reduce): The reducing operator is to remove two specific adjacent units, concatenate them, and put the resulting unit back in the removed position. For example, if we reduce unit "B C" and unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".

Initially, every word in the sentence is considered as one unit, and all these units form an initial list L. If the sentence is of length N, the list of course contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L and insert the reduced result into the final set S. Our algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.

3 We use a stoplist to eliminate the units containing function words by setting their score to 0.

Figure 1: Example of the Hierarchical Reducing Algorithm (the hierarchical structure of the example sentence, built from the LLRs of adjacent words)

Let us make the algorithm clearer with an example. Assume the threshold score is 20 and the given sentence is the Chinese phrase meaning "healthy tea for preventing hyperlipidemia"4. Figure 1 shows the hierarchical structure of the given sentence (based on the LLRs of adjacent words). In this example, four MWEs (corresponding to "hyper-lipid", "hyper-lipid-emia", "healthy tea" and "preventing hyperlipidemia") are extracted, in that order, and the sub-strings above the dotted line in Figure 1 are not extracted.

4 The words of the sentence gloss as: "preventing", "hyper-", "-lipid-", "-emia", "for", "healthy", "tea".

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are scored with the minimum LLR of adjacent words in them, which gives the lexical confidence of extracted MWEs. Finally, for a given sentence of length N, HRA definitely terminates within N − 1 iterations, which is very efficient.
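As a concrete illustration, the following Python sketch implements the select/reduce loop under the definitions above. It is our reading of the algorithm rather than the authors' code; the LLR table and threshold are assumed inputs:

    # A minimal sketch of the LLR-based Hierarchical Reducing Algorithm.
    # `llr` is assumed to be a precomputed table of log-likelihood ratios
    # for adjacent word pairs (with pairs involving function words set
    # to 0 via the stoplist, as in footnote 3).
    def hra(sentence, llr, threshold):
        units = list(sentence)          # initially, every word is one unit
        mwes = set()
        while len(units) > 1:           # termination condition (2)
            # select: the score of adjacent units (u, v) is the LLR of the
            # last word of u and the first word of v
            scores = [llr.get((units[i].split()[-1], units[i + 1].split()[0]), 0)
                      for i in range(len(units) - 1)]
            best = max(range(len(scores)), key=scores.__getitem__)
            if scores[best] < threshold:
                break                   # termination condition (1)
            # reduce: concatenate the two units and put the result back
            units[best:best + 2] = [units[best] + " " + units[best + 1]]
            mwes.add(units[best])       # insert the reduced result into S
        return mwes

    # e.g. hra(["A", "B", "C", "D", "E"], llr_table, threshold=20)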

However, HRA has a problem: it would extract substrings before extracting the whole string, even if the substrings only appear in that particular whole string, which we consider useless. To solve this problem, we use contextual features, contextual entropy (Luo and Sun 2003), and C-value (Frantzi and Ananiadou 1996) to filter out those substrings which exist only in a few MWEs.

2.2 Automatic Extraction of MWE's Translation

In subsection 2.1 we described the algorithm to obtain MWEs; in this subsection we introduce the procedure to find their translations in the parallel corpus.

For mining the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are listed as follows:

1. Run GIZA++5 to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs where the source language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between source phrase and target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning 1993). The first four features are exactly the same as the four translation probabilities used in the traditional phrase-based system (Koehn et al 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.

5 http://www.fjoch.com/GIZA++.html

We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
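For illustration, a minimal sketch of the perceptron training step is given below. The feature extractor phi, which would compute the five translation features and five language features listed above, is left abstract and is a hypothetical name:

    # Sketch of the perceptron classifier used to accept or reject
    # candidate translations of a MWE (standard Collins-style updates;
    # not the authors' actual implementation).
    def train_perceptron(data, phi, epochs=10):
        """data: list of (candidate_pair, label) with label in {+1, -1};
        phi(pair) returns a feature dict: name -> value."""
        w = {}
        for _ in range(epochs):
            for pair, y in data:
                feats = phi(pair)
                score = sum(w.get(f, 0.0) * v for f, v in feats.items())
                if y * score <= 0:                  # misclassified: update
                    for f, v in feats.items():
                        w[f] = w.get(f, 0.0) + y * v
        return w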

3 Application of Bilingual MWEs

Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as we described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.

3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och 2002), many meaningless phrases are extracted, which results in inaccuracy of phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will become more reasonable. Wu et al (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. The fact that additional resources can improve domain-specific SMT performance was proved by many researchers (Wu et al 2008; Eck et al 2004). However, our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to represent this method.

3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) once pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source language phrase contains a MWE (as a substring) and the target language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect that this feature can help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to represent this method.
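A minimal sketch of this idea is shown below, assuming a Moses-style phrase table with " ||| "-separated fields (source, target, scores, ...) and simplifying the MWE check to substring containment; the function name is illustrative:

    # Append a 0/1 MWE-indicator value to the scores field of each entry
    # in a Moses-style phrase table.
    def add_mwe_feature(phrase_table_lines, bimwes):
        """bimwes: set of (source_mwe, target_mwe) string pairs."""
        for line in phrase_table_lines:
            fields = line.rstrip("\n").split(" ||| ")
            src, tgt = fields[0], fields[1]
            flag = any(s in src and t in tgt for s, t in bimwes)
            fields[2] = fields[2] + (" 1" if flag else " 0")
            yield " ||| ".join(fields)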

3.3 Additional Phrase Table of Bilingual MWEs

Wu et al (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As to probability assignment, we simply assign 1 to the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in an input sentence, the decoder searches all candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to represent this method.

4 Experiments

4.1 Data

We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

For the traditional medicine domain, Table 1 shows the data statistics. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.

                       Chinese     English
Training  Sentences    120355
          Words        4688873     4737843
Dev       Sentences    1000
          Words        31722       32390
Test      Sentences    1000
          Words        41643       40551

Table 1: Traditional medicine corpus

6 http://www.speech.sri.com/projects/srilm

For the chemical industry domain, Table 2 gives detailed information about the data. In this experiment, 59,466 bilingual MWEs are extracted.

                       Chinese     English
Training  Sentences    120856
          Words        4532503     4311682
Dev       Sentences    1099
          Words        42122       40521
Test      Sentences    1099
          Words        41069       39210

Table 2: Chemical industry corpus

We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al 2002) as our evaluation criterion. For this evaluation there is only one reference per test sentence. We also perform a statistical significance test between two translation results (Collins et al 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.

4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced to each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in the log-linear model. To obtain the best translation e of the source sentence f, the log-linear model uses the following equation:

    e = argmax_e p(e|f) = argmax_e Σ_{m=1}^{M} λ_m h_m(e, f)    (1)

in which h_m and λ_m denote the m-th feature and weight. The weights are automatically tuned by minimum error rate training (Och 2002) on the development set.

7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm


4.3 Results

Methods           BLEU
Baseline          0.2658
Baseline+BiMWE    0.2661
Baseline+Feat     0.2675
Baseline+NewBP    0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experiment results. From this table we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the biggest improvement of 0.61 BLEU points compared with the baseline system. The Baseline+Feat method comes next with a 0.17 BLEU point improvement. And Baseline+BiMWE achieves slightly higher translation quality than the baseline system.

To our disappointment, however, none of these improvements are statistically significant. We manually examine the extracted bilingual MWEs which are labeled positive by the perceptron algorithm and find that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assume x follows a Gaussian distribution; then the ranking score of a phrase pair (s, t) is defined by the following formula:

    Score(s, t) = log(LLR(s, t)) × (1 / (√(2π) σ)) × exp(−(x − μ)² / (2σ²))    (2)

Here the mean μ and variance σ² are estimated from the training set. After ranking by score, we select the top 50,000, 60,000 and 70,000 bilingual MWEs to perform the three methods mentioned in Section 3. The results are shown in Table 4.

Methods           50000    60000    70000
Baseline          0.2658   0.2658   0.2658
Baseline+BiMWE    0.2671   0.2686   0.2715
Baseline+Feat     0.2728   0.2734   0.2712
Baseline+NewBP    0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain

From this table we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50000, 60000), the Baseline+Feat method performs better

than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as that of (Lambert and Banchs 2005).
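Equation (2) can be computed directly, as in the following sketch; the parameter names are illustrative, and LLR(s, t) is assumed to be positive:

    import math

    # Ranking score of equation (2): the LLR of the phrase pair is weighted
    # by a Gaussian density over the source/target length ratio x, with
    # mean mu and standard deviation sigma estimated from the training set.
    def rank_score(llr, src_len, tgt_len, mu, sigma):
        x = src_len / tgt_len
        gauss = (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
                 / (math.sqrt(2 * math.pi) * sigma))
        return math.log(llr) * gauss

Phrase pairs whose length ratio is far from the corpus-typical value thus receive a low score even when their LLR is high, which is what allows the noisy positives to be pushed down the ranked list.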

To verify the assumption that bilingual MWEs do indeed improve SMT performance, and not only on a particular domain, we also perform experiments on the chemical industry domain. Table 5 shows the results. From this table we can see that the three methods can improve the translation performance on the chemical industry domain as well as on the traditional medicine domain.

Methods           BLEU
Baseline          0.1882
Baseline+BiMWE    0.1928
Baseline+Feat     0.1917
Baseline+NewBP    0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain

4.4 Discussion

In order to know in what respects our methods improve the performance of translation, we manually analyze some test sentences and give some examples in this subsection.

(1) For the first example in Table 6, the Chinese phrase meaning "dredging meridians" is aligned to other words and not correctly translated by the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since this bilingual MWE has been added into the training corpus and then aligned by GIZA++.

(2) For the second example in Table 6, the Chinese word meaning "medicated tea" has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as its translation, while the Baseline+Feat system chooses "medicated tea", because the additional feature gives a high probability to the correct translation "medicated tea".

Src: [Chinese source sentence]
Ref: the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE: the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src: [Chinese source sentence]
Ref: the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat: may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples


5 Conclusion and Future Works

This paper presents the LLR-based hierarchical reducing algorithm to automatically extract bilingual MWEs, and investigates the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to represent whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies can also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are rough and simple. More sophisticated approaches should be taken into account when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in a phrase-based SMT system, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.

Acknowledgments

This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the unified medical language system. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward and Weijing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.



Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp

Abstract

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

1 Introduction

Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used for several NLP applications such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik 1995) and Conditional Random Fields (CRFs) (Lafferty et al 2001) with a hand-annotated corpus (Krishnan and D.Manning 2006; Kazama and Torisawa 2008; Sasano and Kurohashi 2008; Fukushima et al 2008; Nakano and Hirai 2004; Masayuki and Matsumoto 2003).

In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.

When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is apart from "shinyou" by three morphemes, cannot be utilized. To cope with this problem, Nakano et al (Nakano and Hirai 2004) and Sasano et al (Sasano and Kurohashi 2008) utilized information of the head of the bunsetsu1. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.

(1) kikoku-shita Kazama-san-wa
    return home  Mr. Kazama TOP
    'Mr. Kazama, who returned home'

(2) hatsudou-shita shinyou-kumiai-kyusai-ginkou-no setsuritsu-mo
    invoke         credit union relief bank GEN    establishment
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to                 gaikoku-jin-touroku-hou-ihan-no          utagai-de
    private document falsification and foreigner registration law violation GEN suspicion INS
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 describes an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives an experimental result.

2 Related Work

In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX Committee 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.

class           example
PERSON          Kimura Syonosuke
LOCATION        Taiheiyou (Pacific Ocean)
ORGANIZATION    Jimin-tou (Liberal Democratic Party)
ARTIFACT        PL-houan (PL bill)
DATE            21-seiki (21st century)
TIME            gozen-7-ji (7 a.m.)
MONEY           500-oku-en (50 billion yen)
PERCENT         20 percent

Table 1: NE classes and their examples

NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.

In general, a machine learning method is formalized as a sequential labeling problem. This problem first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al 1998), S is assigned to an NE composed of one morpheme; B, I, E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE2. The labels S, B, I and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
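For illustration, the 33-label inventory can be enumerated as follows (a sketch; the class names follow Table 1):

    # S/B/I/E positional prefixes for each of the eight NE classes, plus O.
    ne_classes = ["PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
                  "DATE", "TIME", "MONEY", "PERCENT"]
    labels = [p + "-" + c for c in ne_classes for p in ("S", "B", "I", "E")] + ["O"]
    assert len(labels) == 8 * 4 + 1          # 33 labels in total

    # e.g. "Kimura Syonosuke" as a two-morpheme PERSON would be labeled:
    #   Kimura -> B-PERSON, Syonosuke -> E-PERSON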

The model for label estimation is learned based on machine learning. The following features are generally utilized: the characters, the type of characters, the POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVMs or CRFs have been proposed.

2 Besides, there are the IOB1 and IOB2 algorithms, using only I, O, B, and the IOE1 and IOE2 algorithms, using only I, O, E (Kim and Veenstra 1999).

Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin"): (a) the initial state, in which every chunk in the bunsetsu has an estimated label and score, e.g., "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Yoshiharu-Meijin" MONEY 0.075, "Meijin" OTHERe 0.245; (b) the final output, in which the analysis proceeds toward the upper right and the best assignment, "Habu-Yoshiharu" (PERSON) + "Meijin" (OTHERe) with score 0.438 + 0.245, is chosen.


Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al utilized as features the word subclass of the NE in the analysis direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu (Nakano and Hirai 2004). Sasano et al utilized a cache feature, coreference results, a syntactic feature, and a caseframe feature as structural features (Sasano and Kurohashi 2008).

Some work acquires knowledge from large unannotated corpora and applies it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained from a huge amount of dependency relation pairs (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge amount of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al., 2008).

In Japanese NER research, the CRL NE data is usually utilized for evaluation. This data includes approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using this data.

3 Overview of Proposed Method

Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu.

Then the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the label assignment "Habu-Yoshiharu" (PERSON) and the label assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in a general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.
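A minimal sketch of the chunk enumeration this implies, assuming a bunsetsu is given as a list of morphemes (the function name is ours, not the paper's):

    def enumerate_chunks(morphemes):
        # every contiguous span of morphemes in the bunsetsu is a chunk
        n = len(morphemes)
        return [(i, j, "-".join(morphemes[i:j]))
                for i in range(n) for j in range(i + 1, n + 1)]

    # e.g. Habu, Habu-Yoshiharu, Habu-Yoshiharu-Meijin, Yoshiharu,
    # Yoshiharu-Meijin, Meijin
    print(enumerate_chunks(["Habu", "Yoshiharu", "Meijin"]))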

The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.


Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin" ("Habu-Yoshiharu" is PERSON, "Meijin" is OTHERe, and the remaining chunks are invalid).

4 Model Learning

4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a single bunsetsu in the CRL NE data. Exceptionally, the following expressions located over multiple bunsetsus tend to be an NE:

• expressions enclosed in parentheses (e.g., "Himeyuri-no tou" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.3

For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.

Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned their correct labels from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (shown in Table 1) and five further classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

A chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled as OTHERb, OTHERi, or OTHERe, respectively.4

3 As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4 Each OTHER label is assigned to the longest chunk that satisfies its condition.

1. number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type5

4. the combination of the character types of adjoining morphemes
   - for the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word subclass, and several features provided by the morphological analyzer JUMAN

6. several features6 provided by the parser KNP

7. string of the morpheme in the chunk

8. IPADIC7 feature
   - if the string of the chunk is registered in the following categories of IPADIC: "person", "location", "organization", and "general", this feature fires

9. Wikipedia feature
   - if the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")

10. cache feature
    - when the same string as the chunk appears in the preceding context, the label of the preceding chunk is used as the feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
    - for example, in the ga (NOM) case of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot; hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
    - when the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model

A chunk that is neither any of the eight NE classes nor any of the above four OTHER classes is labeled as invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned to all the chunks.

5 The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6 When a morpheme has an ambiguity, all the corresponding features fire.

7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja


The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of the morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features of each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1 + exp(−βx)), and the transformed value is normalized so that the values of the 13 labels in a chunk sum to one.
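A minimal sketch of this score transformation (our Python rendering of the formula above; β was set to 1 in the experiments, as noted in Section 6):

    import math

    def label_scores(svm_outputs, beta=1.0):
        # squash each raw SVM output with the sigmoid 1 / (1 + exp(-beta * x)),
        # then renormalize so the 13 label scores of a chunk sum to one
        sig = [1.0 / (1.0 + math.exp(-beta * x)) for x in svm_outputs]
        total = sum(sig)
        return [s / total for s in sig]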

The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk in which the label invalid has the highest score, the label with the second highest score is adopted.

4.2 Label Comparison Model

This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up on either side of "vs." (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise it is negative. The maximum number of chunks in each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for the model learning.

Then features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used for each label estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining slots.

Figure 4: Assignment of positive/negative examples (e.g., "Habu-Yoshiharu" (PSN) vs. "Habu + Yoshiharu" (PSN + MNY) is positive (+1), while "Habu-Yoshiharu-Meijin" (ORG) vs. "Habu-Yoshiharu + Meijin" (PSN + OTHERe) is negative (-1)).


Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for the model learning.
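A minimal sketch of this feature layout, under our reading of the description (each chunk contributes a 13-dimensional vector of 12 label dimensions plus the SVM score; the helper names are ours):

    MAX_CHUNKS, DIM = 5, 13

    def set_features(chunk_vectors):
        # chunk_vectors: one 13-dim list per chunk in the set; unused chunk
        # slots are padded with 13-dim zero vectors
        padded = list(chunk_vectors) + [[0.0] * DIM] * (MAX_CHUNKS - len(chunk_vectors))
        return [v for chunk in padded for v in chunk]  # 5 * 13 = 65 values

    def example_features(first_set, second_set):
        # an example is positive when the first set is the correct assignment
        return set_features(first_set) + set_features(second_set)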

5 Analysis

First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.
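This decoding step can be sketched as follows, assuming two black-box functions standing in for the learned models: best_label(i, j), which returns the best non-invalid label of the chunk spanning morphemes i..j-1, and prefer_first(a, b), which applies the label comparison model to two assignments.

    def decode(n, best_label, prefer_first):
        # cell[(i, j)] holds the best label assignment for morphemes i..j-1,
        # filled bottom-up from short spans to the whole bunsetsu, as in CKY
        cell = {}
        for length in range(1, n + 1):
            for i in range(n - length + 1):
                j = i + length
                best = [(i, j, best_label(i, j))]   # the span as a single chunk
                for k in range(i + 1, j):           # every binary split point
                    candidate = cell[(i, k)] + cell[(k, j)]
                    if not prefer_first(best, candidate):
                        best = candidate
                cell[(i, j)] = best
        return cell[(0, n)]                         # the final label assignment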

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the label assignment "E" with the label assignment "D+F".


chunk: Habu-Yoshiharu | Habu, Yoshiharu
label: PERSON | PERSON, MONEY
vector: V11 0 0 0 0 | V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu"; V11, V21, V22, and 0 are vectors whose dimension is 13).

Figure 6: The label comparison model; (a) label assignment for the cell "Habu-Yoshiharu" (comparing "Habu-Yoshiharu" against "Habu" + "Yoshiharu"), (b) label assignment for the cell "Habu-Yoshiharu-Meijin" (comparing "Habu-Yoshiharu-Meijin", "Habu-Yoshiharu" + "Meijin", and "Habu" + "Yoshiharu" + "Meijin").

In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated: here, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparison step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.

6 Experiment

To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In this data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were used for neither the model learning8 nor the evaluation.

Figure 7: 5-fold cross validation.

We performed 5-fold cross validation, following previous work. Unlike previous work, our method has to learn the SVM models twice; therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)-(e). Then, the label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same holds for parts (d) and (e). Then the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.

8 Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.
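A minimal sketch of this two-stage scheme for one fold; the helper functions are hypothetical stand-ins for the SVM training, feature extraction, and analysis steps described above.

    def two_stage_fold(parts, test, train_label_model,
                       comparison_features, train_comparison_model, analyze):
        rest = [p for p in parts if p != test]       # e.g. (b)-(e) when testing (a)
        model1 = train_label_model(rest)             # label estimation model 1
        features = []
        for held_out in rest:                        # label estimation models 2b..2e
            model2 = train_label_model([p for p in rest if p != held_out])
            features += comparison_features(model2, held_out)
        comparison = train_comparison_model(features)
        return analyze(test, model1, comparison)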


Recall | Precision
ORGANIZATION: 81.83 (3008/3676) | 88.37 (3008/3404)
PERSON: 90.05 (3458/3840) | 93.87 (3458/3684)
LOCATION: 91.38 (4992/5463) | 92.44 (4992/5400)
ARTIFACT: 46.72 (349/747) | 74.89 (349/466)
DATE: 93.27 (3327/3567) | 93.12 (3327/3573)
TIME: 88.25 (443/502) | 90.59 (443/489)
MONEY: 93.85 (366/390) | 97.60 (366/375)
PERCENT: 95.33 (469/492) | 95.91 (469/489)
ALL-SLOT: 87.87 | 91.79
F-measure: 89.79

Table 3: Experimental result


In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.

7 Discussion

7.1 Comparison with Previous Work

Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge amount of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus, and Sasano et al. utilized the results of coreference resolution as features for the model learning. Therefore, by incorporating such knowledge and/or analysis results into our method, the performance could be improved further.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, for the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about the head of the bunsetsu, "jyouyaku" (treaty).

9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html

In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

For the bunsetsu "gaikoku-jin-touroku-hou-ihan" (violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for the label estimation, since the head of this bunsetsu is "ihan". In contrast, our method can utilize the information of "hou" when estimating the label of the chunk "gaikoku-jin-touroku-hou".

7.2 Error Analysis

There were some errors in analyzing katakana words. In the following example, although the correct analysis labels "Batistuta" as PERSON, the system labeled it as OTHERs:

(4) Italy-de katsuyaku-suru Batistuta-wo kuwaeta Argentine
    Italy-LOC active Batistuta-ACC call Argentine
    'Argentine called Batistuta, who was active in Italy'

There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were also some errors in applying the label comparison model even though the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis labels "HongKong-seityou" as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although "HongKong-seityou" was correctly labeled as ORGANIZATION in the initial state, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning to adjust the value of β in the sigmoid function and to refine the features for the label comparison model.


F1 | analysis unit | distinctive features
(Fukushima et al., 2008): 89.29 | character | Web
(Kazama and Torisawa, 2008): 88.93 | character | Wikipedia, Web
(Sasano and Kurohashi, 2008): 89.40 | morpheme | structural information
(Nakano and Hirai, 2004): 89.03 | character | bunsetsu feature
(Masayuki and Matsumoto, 2003): 87.21 | character |
(Isozaki and Kazawa, 2003): 86.77 | morpheme |
proposed method: 89.79 | compound noun | Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on the CRL NE data using cross validation)

Figure 8: An example of the error in the label comparison model; (a) initial state ("HongKong" LOCATION 0.266, "HongKong-seityou" ORGANIZATION 0.205, "seityou" OTHERe 0.184), (b) the final output ("HongKong + seityou", LOC + OTHERe, 0.266 + 0.184).


8 Conclusion

This paper proposed a bottom-up Named Entity Recognition method using two-stage machine learning. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.

We are planning to integrate this method with the syntactic and case analysis method of Kawahara and Kurohashi (2007), performing syntactic, case, and named entity analysis simultaneously to improve the overall accuracy.

References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese)

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970–979. (in Japanese)

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304–311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407–415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173–179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. pages 1121–1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282–289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8–15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934–941. (in Japanese)

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607–612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171–178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.


Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

[The body of this paper did not survive text extraction; only scattered fragments remain, e.g., the abbreviation example ファミリーレストラン (family restaurant) → ファミレス.]

Author Index

Bhutada, Pravin, 17
Cao, Jie, 47
Caseli, Helena, 1
Diab, Mona, 17
Finatto, Maria José, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55
Hoang, Hung Huu, 31
Huang, Yun, 47
Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55
Liu, Qun, 47
Lu, Yajuan, 47
Machado, Andre, 1
Ren, Zhixiang, 47
Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63
Villavicencio, Aline, 1
Wakaki, Hiromi, 63
Zarrieß, Sina, 23




Of course, the above topics largely overlap. For example, identification can require disambiguating between literal and idiomatic uses, since MWEs are typically required to be non-compositional by definition. Similarly, interpreting three-word noun compounds like morning flight ticket and plastic water bottle requires disambiguation between a left and a right syntactic structure, while interpreting two-word compounds like English teacher requires disambiguating between (a) 'teacher who teaches English' and (b) 'teacher coming from England (who could teach any subject, e.g., math)'.

We received 18 submissions and, given our limited capacity as a one-day workshop, we were only able to accept 9 full papers for oral presentation: an acceptance rate of 50%.

We would like to thank the members of the Program Committee for their timely reviews. We would also like to thank the authors for their valuable contributions.

Dimitra Anastasiou, Chikara Hashimoto, Preslav Nakov, and Su Nam Kim
Co-Organizers


Organizers

Dimitra Anastasiou, Localisation Research Centre, Limerick University (Ireland)
Chikara Hashimoto, National Institute of Information and Communications Technology (Japan)
Preslav Nakov, National University of Singapore (Singapore)
Su Nam Kim, University of Melbourne (Australia)

Program Committee

Inaki Alegria, University of the Basque Country (Spain)
Timothy Baldwin, University of Melbourne (Australia)
Colin Bannard, Max Planck Institute (Germany)
Francis Bond, National Institute of Information and Communications Technology (Japan)
Gael Dias, Beira Interior University (Portugal)
Ulrich Heid, Stuttgart University (Germany)
Stefan Evert, University of Osnabruck (Germany)
Afsaneh Fazly, University of Toronto (Canada)
Nicole Gregoire, University of Utrecht (The Netherlands)
Roxana Girju, University of Illinois at Urbana-Champaign (USA)
Kyo Kageura, University of Tokyo (Japan)
Brigitte Krenn, Austrian Research Institute for Artificial Intelligence (Austria)
Eric Laporte, University of Marne-la-Vallee (France)
Rosamund Moon, University of Birmingham (UK)
Diana McCarthy, University of Sussex (UK)
Jan Odijk, University of Utrecht (The Netherlands)
Stephan Oepen, University of Oslo (Norway)
Darren Pearce, London Knowledge Lab (UK)
Pavel Pecina, Charles University (Czech Republic)
Scott Piao, University of Manchester (UK)
Violeta Seretan, University of Geneva (Switzerland)
Stan Szpakowicz, University of Ottawa (Canada)
Beata Trawinski, University of Tubingen (Germany)
Peter Turney, National Research Council of Canada (Canada)
Kiyoko Uchiyama, Keio University (Japan)
Begona Villada Moiron, University of Groningen (The Netherlands)
Aline Villavicencio, Federal University of Rio Grande do Sul (Brazil)


Table of Contents

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria José Finatto ... 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan ... 9

Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada ... 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn ... 23

A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan ... 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha ... 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang ... 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi ... 55

Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita ... 63


Workshop Program

Friday, August 6, 2009

8:30–8:45 Welcome and Introduction to the Workshop

Session 1 (8:45–10:00): MWE Identification and Disambiguation

8:45–9:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria José Finatto

9:10–9:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

9:35–10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00–10:30 BREAK

Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55–11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45–13:50 LUNCH

Session 3 (13:50–15:30): Applications

13:50–14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang

14:15–14:40 Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00 BREAK

16:00–17:00 General Discussion

17:00–17:15 Closing Remarks



Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, Andre Machado♣, Maria José Finatto♥

♦Department of Computer Science, Federal University of São Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examine how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.

1 Introduction

A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose.

For Jackendoff (1997), the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).

Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.

In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examine how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources and/or, with some semantic information, for monolingual resources.



The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.

2 Related Work

The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket meaning to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.


A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g., (Zhang et al., 2006; Villavicencio et al., 2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003), and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g., (Pearce, 2002; Baldwin, 2005) for English and (Piao et al., 2006) for Chinese), but some work has also benefitted from asymmetries between languages, using information from one language to help deal with MWEs in the other (e.g., (Villada Moirón and Tiedemann, 2006; Caseli et al., 2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g., ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).

Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy, and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that, for MWE identification, the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold.


Nonetheless, Villavicencio et al. (2007), discussing the influence of corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).

Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).

Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), in which missing lexical entries for MWEs and related constructions are detected via error mining methods; this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.

Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.

3 The Corpus and Reference Lists

The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify, and support translation students in the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g., removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms with 1,421 bigrams, 730 trigrams, and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).

4 Statistically-Driven and Alignment-Based Methods

4.1 Statistically-Driven Method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur together frequently, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992), and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007).

1 Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim


In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the four measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
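A minimal sketch of the ranking step, assuming simple relative-frequency estimates (the actual experiments used the Ngram Statistics Package; the helper names below are ours):

    import math

    def pmi(bigram_count, w1_count, w2_count, n):
        # pointwise mutual information of a bigram, from corpus counts
        return math.log2((bigram_count / n) / ((w1_count / n) * (w2_count / n)))

    def combined_rank(rankings):
        # rankings: one dict per measure mapping candidate -> rank position;
        # candidates are ordered by the average of their ranks
        candidates = rankings[0].keys()
        return sorted(candidates,
                      key=lambda c: sum(r[c] for r in rankings) / len(rankings))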

4.2 Alignment-Based Method

The second of the MWE extraction approaches investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, i.e., a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.

Thus, notice that the alignment-based MWE extraction method does not rely on conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.

It is also important to say that although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines in some way the phrases of SMT systems.

In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as a gold standard.

To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006) were used, respectively.

2 GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html

3 Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org


From the preprocessed corpus, the MWE candidates are extracted as those sequences in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009) or (b) have a frequency below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
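A minimal sketch of this extraction step, assuming word alignments are given as (source position, target position) pairs for one sentence pair; for simplicity, a "target unit" is a single target position here.

    from collections import defaultdict

    def mwe_candidates(source_words, alignments):
        # group source positions by the target position they are linked to
        by_target = defaultdict(list)
        for s, t in alignments:
            by_target[t].append(s)
        candidates = set()
        for positions in by_target.values():
            positions.sort()
            contiguous = positions == list(range(positions[0], positions[-1] + 1))
            if len(positions) >= 2 and contiguous:  # two or more consecutive words
                candidates.add(" ".join(source_words[i] for i in positions))
        return candidates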

Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun or beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in Caseli et al. (2009), mainly: (a) patterns beginning with a determiner, auxiliary verb, pronoun, adverb, or conjunction, or with surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why), and prepositions (from, to, of); and (b) a minimum frequency threshold of 2.

The third one (F3) is the same as Caseli et al. (2009) plus: (a) patterns beginning or finishing with a determiner, adverb, conjunction, preposition, verb, pronoun, or numeral, and (b) a minimum frequency threshold of 2.

5 Experiments and Results

Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and by the alignment-based approach. Although some of the results are good, especially among the top candidates, there is still considerable noise, as, for instance, jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists), and F-measure ((2 × precision × recall)/(precision + recall)) figures for the association measures, using all the candidates (in the first half of the table) and using the top 1,421 bigram and 730 trigram candidates (in the second half).
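A minimal sketch of these metrics as used in the tables below (our code, not the authors'):

    def evaluate(candidates, reference):
        # precision, recall, and F-measure of a candidate list against a
        # reference list, as defined above
        correct = len(set(candidates) & set(reference))
        precision = correct / len(candidates)
        recall = correct / len(reference)
        f = 2 * precision * recall / (precision + recall) if correct else 0.0
        return precision, recall, f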

PMI                              alignment-based
Online Mendelian Inheritance     faixa etária
Beta Technology Incorporated     aleitamento materno
Lange Beta Technology            estados unidos
Óxido Nítrico Inalatório         hipertensão arterial
jogar video game                 leite materno
e um de                          couro cabeludo
e a do                           bloqueio lactíferos
se que de                        emocional anatomia
e a da                           neonato a termo
e de não                         duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and the alignment-based approach

pt MWE candidates       PMI      MI
(all candidates)
proposed bigrams        64839    64839
correct MWEs            1403     1403
precision (%)           2.16     2.16
recall (%)              98.73    98.73
F (%)                   4.23     4.23
proposed trigrams       54548    54548
correct MWEs            701      701
precision (%)           1.29     1.29
recall (%)              96.03    96.03
F (%)                   2.55     2.55
(top candidates)
proposed bigrams        1421     1421
correct MWEs            155      261
precision (%)           10.91    18.37
recall (%)              10.91    18.37
F (%)                   10.91    18.37
proposed trigrams       730      730
correct MWEs            44       20
precision (%)           6.03     2.74
recall (%)              6.03     2.74
F (%)                   6.03     2.74

Table 2: Evaluation of MWE candidates - PMI and MI

first half of the table) and using the top 1421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI performed better for bigrams, while for trigrams PMI performed better.
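For reference, pointwise mutual information for a bigram can be computed from relative-frequency estimates as below; this is the generic textbook formulation, not necessarily the exact estimator used in these experiments.

```python
import math

def pmi_scores(bigram_counts, unigram_counts):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ) for each bigram."""
    n_bi = sum(bigram_counts.values())
    n_uni = sum(unigram_counts.values())
    scores = {}
    for (w1, w2), c in bigram_counts.items():
        p_xy = c / n_bi
        p_x = unigram_counts[w1] / n_uni
        p_y = unigram_counts[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```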

On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered by the three filters described in Section 4.2: F1, F2 and F3.

To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary.


pt MWE candidates           F1       F2       F3
filtered by POS patterns    24996    21544    32644
filtered by frequency       9012     11855    1442
final set                   269      878      191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates        F1       F2       F3
proposed bigrams         250      754      169
correct MWEs             48       95       65
precision (%)            19.20    12.60    38.46
recall (%)               3.38     6.69     4.57
F (%)                    5.75     8.74     8.18
proposed trigrams        19       110      20
correct MWEs             1        9        4
precision (%)            5.26     8.18     20.00
recall (%)               0.14     1.23     0.55
F (%)                    0.27     2.14     1.07
proposed bi+trigrams     269      864      189
correct MWEs             49       104      69
precision (%)            18.22    12.04    36.51
recall (%)               2.28     4.83     3.21
F (%)                    4.05     6.90     5.90

Table 4: Evaluation of MWE candidates

In this automatic comparison, we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only the bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not being considered at the moment.

After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists, namely the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in (Caseli et al., 2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.

From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total amount) were found among the bigrams or trigrams in the

Glossary (see Table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true if it is a multiword expression or false otherwise, independently of it being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name, or (2) a sequence of words whose meaning cannot be obtained by compounding the meanings of its component words.

The judgments of the two judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates good agreement.
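For reference, kappa over the two judges' binary true/false decisions reduces to observed versus chance agreement, as in this small sketch (the label lists are illustrative):

```python
def cohen_kappa(labels_a, labels_b):
    """k = (P(A) - P(E)) / (1 - P(E)) over two judges' label sequences."""
    n = len(labels_a)
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from each judge's marginal label distribution
    p_e = sum((labels_a.count(l) / n) * (labels_b.count(l) / n)
              for l in set(labels_a) | set(labels_b))
    return (p_a - p_e) / (1 - p_e)

# e.g., cohen_kappa(["true", "true", "false"], ["true", "false", "false"])
```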

In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating with respect to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, of the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.

6 Conclusions and Future Work

MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.

In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained


show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.

Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of candidates considered true MWEs by both or by at least one of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to note that in this manual analysis the human experts classified the MWEs as true independently of them being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.

In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.

Acknowledgments

We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for the financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.

References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50-59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98-104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047-2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370-381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistics. Computational Linguistics, 22(2):249-254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49-56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117-134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450-466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28.

Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165-178. Oxford University Press, Oxford.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534-559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459-484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33-40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440-447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1-7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17-24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1-15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034-1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415-432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36-44, Sydney, Australia, July. Association for Computational Linguistics.


Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles

Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used in the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance as evaluation accounts for candidate coverage. Our work also attests that many existing features are usable in unsupervised extraction.

1 Introduction

Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also serve as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarizers (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).

In the past, various attempts have been made to boost automatic keyphrase extraction performance based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten,

2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.

Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g., (Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007)), most of them do not effectively deal with the various keyphrase forms, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.

Secondly, previous studies have shown the effectiveness of their own features, but not many compared their features with other existing features. This leads to redundancy in studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, this approach involves a large amount of manual labor, thus reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthwhile study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.

This paper targets these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method which improves the coverage of candidates that occur in various forms. Our second contribution re-examines existing keyphrase extraction features


reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.

In the remaining sections, we give an overview of related work in Section 2, our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. We then evaluate our proposals, discuss outcomes and conclude our work in Sections 7, 8 and 9, respectively.

2 Related Work

The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e., the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent (2) keyphrase cohesion features (i.e., the relationship among keyphrases) (Turney, 2003), and (3) term cohesion features (i.e., the relationship among components in a keyphrase) (Park et al., 2004).

The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TFIDF (i.e., term frequency, inverse document frequency) and first occurrence in the document. TFIDF measures document cohesion, and the first occurrence reflects the importance of the abstract or introduction, indicating that keyphrases have locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source.

Textract (Park et al., 2004) also ranks candidate keyphrases by its judgment of the keyphrases' degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases from single documents by utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of the other documents in the same cluster.

3 Keyphrase Analysis

In previous studies, KEA employed the indexing words as candidates, whereas others such as (Park et al., 2004; Nguyen and Kan, 2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for the data in detail.

Syntactically, keyphrases can be formed by either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g., mobile network, fast computing, partially observable Markov decision process), despite few incidences. They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.

However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study we also found that keyphrases occur as simple


Criteria      Rules
Frequency     (Rule1) Frequency heuristic, i.e., frequency >= 2 for simplex words vs. frequency >= 1 for NPs

Length        (Rule2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
              (e.g., synchronous concurrent program vs. model of multiagent interaction)

Alternation   (Rule3) of-PP form alternation
              (e.g., number of sensor = sensor number, history of past encounter = past encounter history)
              (Rule4) Possessive alternation
              (e.g., agent's goal = goal of agent, security's value = value of security)

Extraction    (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
              (e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
              (Rule6) Simplex Word/NP IN Simplex Word/NP
              (e.g., quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
              simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules

conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = (IDF), Bayes' theorem, agent's dominant strategy).

A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when forming the of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).

4 Candidate Selection

We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated with term extraction studies, since the top Nth ranked terms become keyphrases in documents. In previous studies, KEA employed the indexing words as candidates, whereas others such as (Park et al., 2004; Nguyen and Kan, 2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In

this section, before we present our method, we first describe our keyphrase analysis in detail.

In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.

Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic: our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.

Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule6 for NPs in of-PP form broadens the coverage of


possible candidates, i.e., given an NP in of-PP form, not only do we collect the simplex word(s), but we also extract the non-of-PP form of the NP from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
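A rough sketch of how Rule5, and the Rule6 extension just described, can be applied over a POS-tagged sentence; the regex mirrors Table 1, while the helper names and the tag-string trick are ours, not the authors' implementation.

```python
import re

# Rule5 from Table 1: optional adjectives/nouns ending in a noun
NP = r"(?:(?:NN|NNS|NNP|NNPS|JJ|JJR|JJS)\s)*(?:NN|NNS|NNP|NNPS)"

def rule5_candidates(tagged):
    """tagged: list of (word, POS). Returns base-NP candidate strings."""
    tags = " ".join(t for _, t in tagged)
    out = []
    for m in re.finditer(r"(?:^|(?<=\s))" + NP + r"(?=\s|$)", tags):
        start = tags[:m.start()].count(" ")        # token index of match start
        end = start + m.group(0).count(" ") + 1    # one past the last token
        out.append(" ".join(w for w, _ in tagged[start:end]))
    return out

# Rule6 would additionally join two Rule5 NPs across a preposition (IN),
# keeping the full of-PP form as well as both component NPs.

# rule5_candidates([("effective", "JJ"), ("algorithm", "NN"), ("of", "IN"),
#                   ("grid", "NN"), ("computing", "NN")])
# -> ["effective algorithm", "grid computing"]
```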

5 Feature Engineering

With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls on careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S, U).

5.1 Document Cohesion

Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.

F1: TFIDF (S, U) TFIDF indicates document cohesion by looking at the frequency of terms in the documents and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of this feature is that it requires a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF in substrings as well as in exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid com-

puting algorithm. We also normalize TF with respect to candidate types, i.e., we separately treat simplex words and NPs when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system (a small sketch of the substring-counting variants follows the list):

- (F1a) TFIDF

- (F1b) TF including counts of substrings

- (F1c) TF of substrings as a separate feature

- (F1d) TF normalized by candidate type (i.e., simplex words vs. NPs)

- (F1e) normalized TF by candidate type as a separate feature

- (F1f) IDF using Google n-grams
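A hedged sketch of the substring-aware TF variant (F1b) and the per-type normalization (F1d); candidates are document-internal strings, and the exact counting scheme here is our reading of the description above.

```python
from collections import Counter

def tf_with_substrings(candidates):
    """F1b: credit a candidate for occurrences inside longer candidates."""
    exact = Counter(tuple(c.split()) for c in candidates)
    tf = Counter(exact)
    for cand in exact:
        n = len(cand)
        for other, c in exact.items():
            if other != cand and len(other) > n and any(
                    other[i:i + n] == cand for i in range(len(other) - n + 1)):
                tf[cand] += c          # e.g. "grid computing" inside
    return tf                          # "grid computing algorithm"

def normalized_tf(tf):
    """F1d: normalize TF separately for simplex words and NPs."""
    totals = {True: 0, False: 0}
    for cand, c in tf.items():
        totals[len(cand) > 1] += c
    return {cand: c / (totals[len(cand) > 1] or 1) for cand, c in tf.items()}
```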

F2: First Occurrence (S, U) KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.

F3: Section Information (S, U) Nguyen and Kan (2007) used the identity of the specific document section a candidate occurs in. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title and/or references.

F4: Additional Section Information (S, U) We first added the related work or previous work section as section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction


contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.

- (F4a) section 'related/previous work'

- (F4b) counting substrings occurring in key sections

- (F4c) section TF across all key sections

- (F4d) weighting key sections according to the portion of keyphrases found

F5: Last Occurrence (S, U) Similar to distance in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.

5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.

F6: Co-occurrence of Another Candidate in a Section (S, U) When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.

F7: Title Overlap (S) In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of what words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.

- (F7a) co-occurrence (Boolean) in title collection

- (F7b) co-occurrence (TF) in title collection

F8: Keyphrase Cohesion (S, U) Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top N ranked candidates and the remainder. In the

1It contains 1.3M titles from articles, papers and reports.

original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.
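A rough sketch of that contextual-similarity approximation: rows are candidate words, columns are neighboring content words, and cosine similarity compares the resulting context vectors. The window size and equal weighting are our illustrative choices.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, candidate_words, window=3):
    """Build a word-by-context co-occurrence matrix as nested counts."""
    vecs = defaultdict(Counter)
    for sent in sentences:                       # sent: list of tokens
        for i, tok in enumerate(sent):
            if tok in candidate_words:
                lo = max(0, i - window)
                for neighbor in sent[lo:i] + sent[i + 1:i + 1 + window]:
                    vecs[tok][neighbor] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```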

5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).

F9: Term Cohesion (S, U) Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our earlier discussion (see the sketch after this list):

- (F9a) term cohesion by (Park et al., 2004)

- (F9b) TF normalized by candidate type (i.e., simplex words vs. NPs)

- (F9c) applying different weights by candidate type

- (F9d) normalized TF and different weighting by candidate type
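A minimal sketch of the F9a-style scoring for a multi-word candidate; the frequency tables and the exact treatment of simplex words (only described above as a 10% discount) are our assumptions.

```python
def dice(pair_freq, freq_w1, freq_w2):
    """Dice (1945): 2 * f(w1 w2) / (f(w1) + f(w2))."""
    return 2.0 * pair_freq / (freq_w1 + freq_w2)

def term_cohesion(candidate, word_freq, pair_freq):
    words = candidate.split()
    if len(words) >= 2:
        # for longer NPs we pair the first and last word, an illustrative choice
        return dice(pair_freq.get(candidate, 0),
                    word_freq[words[0]], word_freq[words[-1]])
    # simplex words carry no internal association; score them by a
    # discounted relative frequency (the discount handling is our guess)
    return 0.1 * word_freq[candidate] / sum(word_freq.values())
```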

5.4 Other Features

F10: Acronym (S) Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K to attest our candidate selection method.

F11: POS sequence (S) Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like acronym, this is also subject to the data set.


F12: Suffix sequence (S) Similar to acronym, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latinate derivational morphology for technical keyphrases. This feature is also data-dependent, and thus used in the supervised approach only.

F13: Length of Keyphrases (S, U) Barker and Corrnacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.

6 System and Data

To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections, such as title and section heads, via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4 and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.

For evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found many were missing or present only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the

2http://www.eng.ritsumei.ac.jp/asao/resources/sentseg
3http://wing.comp.nus.edu.sg/parsCit
4http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6http://maxent.sourceforge.net/index.html
7C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence - Multiagent Systems) and J.4 (Social and Behavioral Sciences - Economics)

number of author-assigned and reader-assigned keyphrases found in the documents.

          Author        Reader          Combined
Total     1252 (53)     3110 (111)      3816 (146)
NPs       904           2537            3027
Average   3.85 (4.01)   12.44 (12.88)   15.26 (15.85)
Found     769           2509            2864

Table 2: Statistics of keyphrases

7 Evaluation

The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information and additional section information (i.e., F1-4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence and suffix sequence.

Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5th, 10th and 15th candidates.

BestFeatures includes F1c: TF of substrings as a separate feature, F2: first occurrence, F3: section information, F4d: weighting key sections, F5: last occurrence, F6: co-occurrence of another candidate in a section, F7b: title overlap, F9a: term cohesion by (Park et al., 2004), and F13: length of keyphrases. Best-TFIDF means using all best features except TFIDF.

In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S)8.

In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and ~ indicates no effect or an unconfirmed effect due to small changes in performance. Again, Supervised denotes Maximum Entropy training and Unsupervised is our unsupervised approach.

8 Discussion

We compared the performance of our candidate selection and feature engineering with simple KEA, N&K and our baseline system. In evaluating candidate selection, we found that longer-

8Due to the page limits, we present the best performance.


                                          Five                           Ten                            Fifteen
Method                 System    C    Match Prec  Recall F        Match Prec  Recall F        Match Prec  Recall F
All Candidates         KEA       U    0.03  0.64  0.21   0.32     0.09  0.92  0.60   0.73     0.13  0.88  0.86   0.87
                       KEA       S    0.79  15.84 5.19   7.82     1.39  13.88 9.09   10.99    1.84  12.24 12.03  12.13
                       N&K       S    1.32  26.48 8.67   13.06    2.04  20.36 13.34  16.12    2.54  16.93 16.64  16.78
                       baseline  U    0.92  18.32 6.00   9.04     1.57  15.68 10.27  12.41    2.20  14.64 14.39  14.51
                       baseline  S    1.15  23.04 7.55   11.37    1.90  18.96 12.42  15.01    2.44  16.24 15.96  16.10
Length<=3 Candidates   KEA       U    0.03  0.64  0.21   0.32     0.09  0.92  0.60   0.73     0.13  0.88  0.86   0.87
                       KEA       S    0.81  16.16 5.29   7.97     1.40  14.00 9.17   11.08    1.84  12.24 12.03  12.13
                       N&K       S    1.40  27.92 9.15   13.78    2.10  21.04 13.78  16.65    2.62  17.49 17.19  17.34
                       baseline  U    0.92  18.40 6.03   9.08     1.58  15.76 10.32  12.47    2.20  14.64 14.39  14.51
                       baseline  S    1.18  23.68 7.76   11.69    1.90  19.00 12.45  15.04    2.40  16.00 15.72  15.86
Length<=3 Candidates   KEA       U    0.01  0.24  0.08   0.12     0.05  0.52  0.34   0.41     0.07  0.48  0.47   0.47
+ Alternation          KEA       S    0.83  16.64 5.45   8.21     1.42  14.24 9.33   11.27    1.87  12.45 12.24  12.34
                       N&K       S    1.53  30.64 10.04  15.12    2.31  23.08 15.12  18.27    2.88  19.20 18.87  19.03
                       baseline  U    0.98  19.68 6.45   9.72     1.72  17.24 11.29  13.64    2.37  15.79 15.51  15.65
                       baseline  S    1.33  26.56 8.70   13.11    2.09  20.88 13.68  16.53    2.69  17.92 17.61  17.76

Table 3: Performance on Proposed Candidate Selection

                        Five                          Ten                           Fifteen
Features        C    Match Prec  Recall F       Match Prec  Recall F       Match Prec  Recall F
Best            U    1.14  22.8  7.47   11.3    1.92  19.2  12.6   15.2    2.61  17.4  17.1   17.3
                S    1.56  31.2  10.2   15.4    2.50  25.0  16.4   19.8    3.15  21.0  20.6   20.8
Best            U    1.14  22.8  7.4    11.3    1.92  19.2  12.6   15.2    2.61  17.4  17.1   17.3
w/o TFIDF       S    1.56  31.1  10.2   15.4    2.46  24.6  16.1   19.4    3.12  20.8  20.4   20.6

Table 4: Performance on Feature Engineering

Effect  Method        Features
+       Supervised    F1a, F2, F3, F4a, F4d, F9a
        Unsupervised  F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       Supervised    F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
        Unsupervised  F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
~       Supervised    F1e, F10, F11, F12
        Unsupervised  F1b

Table 5: Performance on Each Feature

length candidates tend to act as noise and so decreased the overall performance. We also confirmed that candidate alternation offers flexibility in keyphrase form, leading to higher candidate coverage as well as better performance.

To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that the performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not impact performance. We attribute this to the fact that many keyphrases are substrings of candidates and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addi-

tion, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion is subject to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.

As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.


9 Conclusion

We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works, while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.

10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.

References

Ken Barker and Nadia Corrnacchia. 2000. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence.

Regina Barzilay and Michael Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, pages 10-17.

Kenneth Church and Patrick Hanks. 1989. Word association norms, mutual information, and lexicography. In Proceedings of ACL, pages 76-83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. 2008. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, pages 28-30.

Ernesto D'Avanzo and Bernado Magnini. 2005. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC 2005.

F. Damerau. 1993. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 29:433-447.

Lee Dice. 1945. Measures of the amount of ecologic association between species. Journal of Ecology, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. 1999. Domain Specific Keyphrase Extraction. In Proceedings of IJCAI, pages 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. 1999. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 27:81-104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. 2005. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. 2001. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing 2001.

Annette Hulth and Beata Megyesi. 2006. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING 2006, pages 537-544.

Mario Jarmasz and Caroline Barriere. 2004. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. 2001. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR 2001, pages 349-357.

Y. Matsuo and M. Ishizuka. 2004. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 13(1):157-169.

Olena Medelyan and Ian Witten. 2006. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries 2006, pages 296-297.

Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207-223.

Thuy Dung Nguyen and Min-Yen Kan. 2007. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL 2007, pages 317-326.

Youngja Park, Roy Byrd, and Branimir Boguraev. 2004. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING 2004, pages 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Polla, and Timo Honkela. 2008. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING 2008.

Peter Turney. 1999. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057.

Peter Turney. 2003. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI 2003, pages 434-439.

Xiaojun Wan and Jianguo Xiao. 2008. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. 1999. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL 1999, pages 254-256.

Ian Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Yongzheng Zhang, Nur Zinchir-Heywood, and Evangelos Milios. 2004. Term based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence 2004.



Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu

Abstract

We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.

1 Introduction

In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) applications that aim at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.

For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE's meaning is transparent, i.e., predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the

MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as being a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs but have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.

To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning and that over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as a wrong decision could potentially have detrimental effects on the quality of the translation.


In this paper we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.

This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.

2 Multi-word Expressions

MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional, fixed expressions such as kingdom come, ad hoc, and non-fixed expressions such as break new ground, speak of the devil. The VNCs which we are focusing on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living and Verb Particle Constructions (VPC) such as write-up, call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.

3 Related Work

Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001;

Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine whether the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one MWE idiomatic expression.

Hashimoto and Kawahara (2008) (HK08) is the first large-scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features such as POS, lemma information and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional, such as voice, negativity and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.

4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for identifying MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the Yam-

18

Cha sequence labeling system1 YamCha is basedon Support Vector Machines technology using de-gree 2 polynomial kernels

We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside of a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside of an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John/O kicked/B-I the/I-I bucket/I-I last/O Friday/O. We experiment with some basic features and some more linguistically motivated ones.
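The scheme can be made concrete with a small sketch (ours, not the authors' code) that maps a sentence with a known VNC span and class onto the five tags:

    # Minimal sketch: derive the 5-tag IOB labeling from a VNC span.
    # span = (start, end) token indices (end exclusive); label is "I" or "L".
    def iob_tags(tokens, span, label):
        start, end = span
        tags = []
        for i in range(len(tokens)):
            if i == start:
                tags.append("B-" + label)   # chunk-initial token
            elif start < i < end:
                tags.append("I-" + label)   # chunk-internal token
            else:
                tags.append("O")            # outside any VNC chunk
        return tags

    tokens = ["John", "kicked", "the", "bucket", "last", "Friday"]
    print(list(zip(tokens, iob_tags(tokens, (1, 4), "I"))))
    # [('John', 'O'), ('kicked', 'B-I'), ('the', 'I-I'), ('bucket', 'I-I'),
    #  ('last', 'O'), ('Friday', 'O')]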

We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features, such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma or citation form of the word (LEMMA), and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John/NN/O kicked/VBD/B-I the/Det/I-I bucket/NN/I-I last/ADV/O Friday/NN/O. Likewise, adding the NGRAM feature would be represented as follows: John/NN/ohn/O kicked/VBD/ked/B-I the/Det/the/I-I bucket/NN/ket/I-I last/ADV/ast/O Friday/NN/day/O, and so on.
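The column-based input consumed by YamCha-style taggers can be sketched as follows (our illustration; the POS tags here are placeholders rather than real tagger output):

    # One token per row, one column per feature set, gold IOB tag last.
    def feature_rows(tokens, pos_tags, iob):
        rows = []
        for tok, pos, tag in zip(tokens, pos_tags, iob):
            ngram = tok[-3:]            # NGRAM feature: last 3 characters
            rows.append("\t".join([tok, pos, ngram, tag]))
        return "\n".join(rows)

    print(feature_rows(
        ["John", "kicked", "the", "bucket", "last", "Friday"],
        ["NN", "VBD", "Det", "NN", "ADV", "NN"],
        ["O", "B-I", "I-I", "I-I", "O", "O"]))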

With the NE feature we followed the same representation as for the other features, as a separate column, as described above; we refer to this as Named Entity Separate (NES). For named entity recognition (NER) we use the BBN Identifinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and one with a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition in which we changed the words' representation in the input to their NE class: Named Entity InText (NEI). For example, for the NEI condition, our running example is represented as follows: PER/NN/ohn/O kicked/VBD/ked/B-I the/Det/the/I-I bucket/NN/ket/I-I last/ADV/ast/O DAY/NN/day/O, where John is replaced by the NE class "PER" and Friday by "DAY".

1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder


5 Experiments and Results

5.1 Data

We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-Token expressions drawn from the entire British National Corpus (BNC).3 The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4 corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.

5.2 Evaluation Metrics

We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
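For reference, Fβ=1 is the standard balanced harmonic mean of precision and recall:

    F_{\beta=1} = \frac{2 \cdot P \cdot R}{P + R}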

5.3 Results

We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The FREQ baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.6

3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression, hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.



In Table 2 we present the results yielded per feature and per condition. We experimented with different context sizes initially to decide on the optimal window size for our learning framework; these results are presented in Table 1. Once that was determined, we proceeded to add features.

Noting that a window size of ±3 yields the best results, we use that as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic FREQ baseline, with an overall F-measure of 77.04%.

Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence the significant contribution to performance.

6 Discussion

The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features

6 We could easily have identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set exclusively.

yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors result from identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors confusing idiomatic cases with literal ones 23 times, and the opposite 4 times.

Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal are for the MWE types hit the road, blow trumpet, blow whistle, hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand, Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

Also, in As the ball hit the post, the referee blew the whistle, the system identified blew the whistle as a literal VNC in this context, and it identified hit the post as another literal VNC MWE token.

7 Conclusion

In this study we explore a set of features that contribute to supervised binary classification of VNC token expressions. The use of NER significantly improves the performance of the system; using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance with an overall F-measure of 84.58%. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.


       IDM-F   LIT-F   Overall F   Overall Acc
±1     77.93   48.57   71.78       96.22
±2     85.38   55.61   79.71       97.06
±3     86.99   55.68   81.25       96.93
±4     86.22   55.81   80.75       97.06
±5     83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying context window size.

                  IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ              70.02   89.16   78.44   0       0       0       69.68
TOK               81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA           83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM           83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS             83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P             86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full          85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin           84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI               89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P    89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P     90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P         91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data using different features and their combinations.

8 Acknowledgement

The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers who helped make this publication more concise and better presented.

References

Timothy Baldwin, Collin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.



Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de

Abstract

Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.

1 Introduction

Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:

(1) Der Rat sollte unsere Position berücksichtigen
    the council should our position consider

(2) The Council should take account of our position

This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that the method is able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb-particle constructions (Baldwin and Villavicencio, 2002) or verb-PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider as MWEs those groups of tokens in a target language which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for each item in a target language (and whether these one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches, which need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.

In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, being based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 therefore evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.

2 Multiword Translations as MWEs

The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                  1-1    1-n    n-1    n-n    No.
anheben (v1)          53.5   21.2   9.2    1.6    325
bezwecken (v2)        16.7   51.3   0.6    31.3   150
riskieren (v3)        46.7   35.7   0.5    1.7    182
verschlimmern (v4)    30.2   21.5   28.6   44.5   275

Table 1: Proportions (in %) of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus.

These verbs subcategorize for a nominative subject and an accusative object, and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl containing occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) and verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.

A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations. This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.


(3) Sie verschlimmern die Übel
    they aggravate the misfortunes

(4) Their action is aggravating the misfortunes

Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der Ausschuss bezweckt, den Institutionen ein politisches Instrument an die Hand zu geben
    the committee intends the institutions a political instrument at the hand to give

(6) The committee set out to equip the institutions with a political instrument

Verb preposition combinations. While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.

(7) Sie werden den Treibhauseffekt verschlimmern
    they will the greenhouse effect aggravate

(8) They will add to the greenhouse effect

Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).

(9) Ich werde die Sache nur noch verschlimmern
    I will the thing only just aggravate

(10) I am just making things more difficult

Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to the gold standard's limited size and the source items we chose.

(11) Sie bezwecken die Umgestaltung in eine zivile Nation
     they intend the conversion into a civil nation

(12) They have in mind the conversion into a civil nation

           v1        v2        v3        v4
Ntype      22 (26)   41 (47)   26 (35)   17 (24)
V Part     22.7       4.9       0.0       0.0
V Prep     36.4      41.5       3.9       5.9
LVC        18.2      29.3      88.5      88.2
Idiom       0.0       2.4       0.0       0.0
Para       36.4      24.3      11.5      23.5

Table 2: Proportions (in %) of MWE types per lemma.

Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for the semantic processing of parallel corpora. An example is given below.

(13) Wir brauchen bessere Zusammenarbeit, um die Rückzahlungen anzuheben
     we need better cooperation to the repayments(OBJ) increase

(14) We need greater cooperation in this respect to ensure that repayments increase

Table 2 displays the proportions of the MWE categories over the types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variation only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of the general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) and bezwecken (en. intend) are frequently translated with verb particle or verb preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs, which do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).

In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.

3 Multiword Translation Detection

This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.

3.1 One-to-many Alignments

To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate, at the type level, the translational correspondences that the system discovers against the one-to-many translations in our gold standard. By type we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the translation types is so low that already at a threshold of 3 almost all types get filtered out.

       n > 0          n > 1          n > 3
verb   FPR    FNR     FPR    FNR     FPR    FNR
v1     0.97   0.93    1.0    1.0     1.0    1.0
v2     0.93   0.9     0.5    0.96    0.0    0.98
v3     0.88   0.83    0.8    0.97    0.67   0.97
v4     0.98   0.92    0.8    0.92    0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.

This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that detection of the exact set of tokens that corresponds to the source token is rare.

This low precision of one-to-many alignments is not very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that does not necessarily correlate with that of the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die Korruption behindert die Entwicklung
     the corruption impedes the development

(16) Corruption constitutes an obstacle to development

Another limitation of the word alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much less likely to be found.

3.2 Aligning Syntactic Configurations

High-precision detection of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found; 2) how to align target items that have a low occurrence correlation.

We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering of the configurations; c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.

Figure 1: Example of a typical syntactic MWE configuration (the German verb behindert with its arguments X1, Y1, aligned to an English tree rooted in create, dominating an ... obstacle to and the arguments X2, Y2).

Data. We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (DG, DE, AGE). DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. DE is the set of dependency triples of the target sentence. AGE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
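A minimal sketch (with hypothetical names, in Python) of how one sentence pair in such a resource might be stored, mirroring the triple (DG, DE, AGE) defined above:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Dep:
        head: int    # index of the head word (s1 or t1)
        rel: str     # dependency relation type
        dep: int     # index of the dependent word (s2 or t2)

    @dataclass
    class BiSentence:
        src_tokens: list   # German tokens
        tgt_tokens: list   # English tokens
        D_G: set           # source dependency triples (s1, rel, s2)
        D_E: set           # target dependency triples (t1, rel, t2)
        A_GE: set          # word alignment pairs (s1, t1)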

Candidate Generation. This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.

Filtering. Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in DG, there is a tuple (sn, tn) in AGE, i.e. every dependent has an alignment.
3. There is a target item t1 in DE such that for each tn there is a path p ⊆ DE, p = (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn), that connects t1 and tn. Thus, the target dependents have a common root.

To filter out noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering does not exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, the target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb; a sketch of the core conditions is given below.
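The following sketch (ours; it reuses the BiSentence structure above, and its helper names are hypothetical) illustrates conditions 2 and 3 together with the path-length noise filter:

    def has_common_root(D_E, targets, max_path=3):
        # Condition 3: some target item t1 reaches every aligned dependent
        # via a dependency path; max_path is the noise filter on path length.
        children = {}
        for d in D_E:
            children.setdefault(d.head, set()).add(d.dep)
        def reachable(root, depth):
            seen, frontier = {root}, {root}
            for _ in range(depth):
                frontier = {c for f in frontier for c in children.get(f, ())}
                seen |= frontier
            return seen
        return any(targets <= reachable(t1, max_path) for t1 in children)

    def valid_configuration(bisent, dependents):
        targets = set()
        for sn in dependents:                     # condition 2: every source
            tn = {t for (s, t) in bisent.A_GE if s == sn}
            if not tn:                            # dependent must be aligned
                return False
            targets |= tn
        return has_common_root(bisent.D_E, targets)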

Alignment Post-editing. In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in Section 3.2 was that all the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.

Thus, for the set of items ti for which there is a dependency relation (tx, rel, ti) in DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) belongs to AGE. This translation problem largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). We therefore propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti in T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.

2. For each ti, tx such that there is a (ti, rel, tx) in DE, compute corr(s1, T + tx).

3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.

As the Dice coefficient is often reported to give the best results, e.g. in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:

    corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti in T, and freq(s1) accordingly. The observed frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.

The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation of the English MWE run the risk of.

3.3 Evaluation

Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation. In the first experiment, we extracted all types of translational correspondences for the verbs annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than purely cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma do not figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that does not constrain the syntactic type of the parallel argument relations. While the FNR does not decrease, the FPR increases significantly. This means that the syntactic filters mainly fire on noisy configurations and do not decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down the number of candidate configurations that our algorithm detects. As the dependency parses do not provide deep analyses for tense or control phenomena, very often a verb's arguments do not figure as its syntactic dependents and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.

MWE evaluation. In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. We selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the portion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.

4 Related Work

The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation, which specifies formal relations between bilingual parses, was established by Wu (1997).

       Strict Filter     Relaxed Filter
       FPR     FNR       FPR     FNR
v1     0.0     0.96      0.5     0.96
v2     0.25    0.88      0.47    0.79
v3     0.25    0.74      0.56    0.63
v4     0.0     0.875     0.56    0.84

Table 4: False positive and false negative rate of one-to-many translations.

Trans. type     Proportion    MWE type    Proportion
MWEs            57.5          V Part       8.2
                              V Prep      51.8
                              LVC         32.4
                              Idiom       10.6
Paraphrases     24.4
Alternations     1.0
Noise           17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.

Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches do not rely on the lexical semantic knowledge about MWEs encoded in one-to-many translations.

By contrast, previous approaches to paraphrase extraction have made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado, 2009).

The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already been exploited in, e.g., morphology projection (Yarowsky et al., 2001) and word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of a simplex realization in a source language to an idiosyncratic multiword realization in the target language.

5 Conclusion

We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further exploit these for the monolingual interpretation of MWEs.

Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment, and it captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.

References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 49(1):311–326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38, INESC-ID, Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 Workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.



A Re-examination of Lexical Association Measures

Hung Huu Hoang
Dept. of Computer Science
National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which shows significant improvement over the original AMs.

1 Introduction

Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter, AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood ratio, and the T score.

In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.

While previous works have evaluated AMs, they offer few details on why the AMs perform as they do. Such a detailed analysis is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to the further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.

We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g. put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the mobility of noun phrase complements in transitive VPCs.



These complements can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.

2 Classification of Association Measures

Although different AMs take different approaches to measuring association, we observe that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e. the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure this association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ² test.
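As an illustration, two Class I AMs computed from a bigram's contingency counts a, b, c, d (notation as in Table 1; this is our sketch, not code from the paper):

    import math

    def pmi(a, b, c, d):
        # log [ N * f(xy) / (f(x*) * f(*y)) ]
        N = a + b + c + d
        return math.log(N * a / ((a + b) * (a + c)))

    def t_score(a, b, c, d):
        # compares observed f(xy) with its expectation under independence
        N = a + b + c + d
        expected = (a + b) * (a + c) / N
        return (a - expected) / math.sqrt(a)

    print(pmi(30, 5, 10, 955), t_score(30, 5, 10, 955))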

Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through the context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents.

It then follows that the contexts in which the phrase and its constituents appear will differ (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as the Kullback-Leibler divergence and Jensen-Shannon divergence.
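A minimal sketch of the context-similarity strategy, using cosine similarity over toy context vectors (the counts are invented purely for illustration):

    import math
    from collections import Counter

    def cosine(u, v):
        dot = sum(u[w] * v[w] for w in u if w in v)
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    phrase_ctx = Counter({"mission": 4, "order": 3, "complete": 2})  # "carry out"
    verb_ctx = Counter({"bag": 5, "weight": 4, "mission": 1})        # "carry"
    # A low similarity hints at non-compositionality of the phrase.
    print(cosine(phrase_ctx, verb_ctx))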

Table 1 lists all AMs used in our discussion. The legend accompanying the table defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g. [M2] Mutual Information) for the reader's convenience.

3 Evaluation

We first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we discuss how the performance of AMs is measured in our experiments.

3.1 Evaluation Data

In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.

For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.


Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put and take. Next, we search to the right of the light verbs for the nearest noun, permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc.

AM / Name / Formula

M1  Joint Probability:       f(xy) / N
M2  Mutual Information (MI): (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3  Log likelihood ratio:    2 Σ_ij f_ij log(f_ij / f̂_ij)
M4  Pointwise MI (PMI):      log( P(xy) / (P(x*) P(*y)) )
M5  Local-PMI:               f(xy) × PMI
M6  PMI^k:                   log( N f(xy)^k / (f(x*) f(*y)) )
M7  PMI²:                    log( N f(xy)² / (f(x*) f(*y)) )
M8  Mutual Dependency:       log( P(xy)² / (P(x*) P(*y)) )
M9  Driver-Kroeber:          a / √((a+b)(a+c))
M10 Normalized expectation:  2a / (2a + b + c)
M11 Jaccard:                 a / (a + b + c)
M12 First Kulczynski:        a / (b + c)
M13 Second Sokal-Sneath:     a / (a + 2(b + c))
M14 Third Sokal-Sneath:      (a + d) / (b + c)
M15 Sokal-Michiner:          (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto:         (a + d) / (a + 2b + 2c + d)
M17 Hamann:                  ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds ratio:              ad / bc
M19 Yule's ω:                (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q:                (ad − bc) / (ad + bc)
M21 Brawn-Blanquet:          a / max(a+b, a+c)
M22 Simpson:                 a / min(a+b, a+c)
M23 S cost:                  log(1 + min(b, c)/(a + 1))^(−1/2)
M24 Adjusted S cost (*):     log(1 + max(b, c)/(a + 1))^(−1/2)
M25 Laplace:                 (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace (*):    (a + 1) / (a + max(b, c) + 2)
M27 Fager:                   [M9] − (1/2) max(b, c)
M28 Adjusted Fager (*):      [M9] − max(b, c) / √(aN)
M29 Normalized PMIs (*):     PMI / NF(α), PMI / NFmax
M30 Simplified normalized PMI for VPCs (*): log(ad) / (α·b + (1 − α)·c)
M31 Normalized MIs (*):      MI / NF(α), MI / NFmax

where NF(α) = α·P(x*) + (1 − α)·P(*y) with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)).

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.

Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies (x̄ stands for all words except x, ȳ for all words except y; N is the total number of bigrams):

a = f11 = f(xy)     b = f12 = f(xȳ)     f(x*) = a + b
c = f21 = f(x̄y)     d = f22 = f(x̄ȳ)     f(x̄*) = c + d
f(*y) = a + c       f(*ȳ) = b + d       N = a + b + c + d

The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.
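To make the legend concrete, the following minimal sketch (our own illustration, not code from the paper) builds the four contingency cells from bigram counts and evaluates a few of the listed AMs:

    import math

    def contingency(f_xy, f_x, f_y, N):
        # a: bigram count f(xy); b, c: marginal remainders; d: everything else
        a = f_xy
        b = f_x - f_xy    # x occurs without y
        c = f_y - f_xy    # y occurs without x
        d = N - a - b - c
        return a, b, c, d

    def pmi(a, b, c, d):                     # [M4]
        N = a + b + c + d
        return math.log(N * a / ((a + b) * (a + c)))

    def jaccard(a, b, c, d):                 # [M11]
        return a / (a + b + c)

    def odds_ratio(a, b, c, d):              # [M18]
        return (a * d) / (b * c)

For example, with f(xy) = 30, f(x*) = 50, f(*y) = 60 and N = 10000, contingency() yields (30, 20, 30, 9920) and pmi() returns log(100) ≈ 4.6.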


If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, again allowing a maximum of 4 intervening words. The above extraction process produced a total of 8,652 VPC and 11,465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1,629 VPCs and 1,254 LVCs.

Separately, we use the following two available sources of annotations: 3,078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al. (2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data      LVC data  Mixed
Total (freq ≥ 6)     413           100       513
Positive instances   117 (28.33%)  28 (28%)  145 (28.26%)

Table 2: Evaluation data sizes (type count, not token). While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.

3.2 Evaluation Metric

To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall over only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
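For clarity, here is a minimal sketch (our own, assuming candidates are sorted by descending AM score) of the AP computation used throughout:

    def average_precision(ranked_labels):
        # ranked_labels: gold labels (True = genuine MWE) of candidates,
        # sorted by descending AM score; AP averages precision at each hit.
        hits, precisions = 0, []
        for rank, is_mwe in enumerate(ranked_labels, start=1):
            if is_mwe:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(precisions) if precisions else 0.0

For instance, average_precision([True, False, True]) returns (1/1 + 2/3) / 2 ≈ 0.83.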

3.3 Evaluation Results

Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex; this can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest; their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that hypothesis tests rely on.

4 Rank Equivalence

We note that some AMs, although not mathematically equivalent (i.e., not assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets; hence they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡_C^r M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:


Corollary: If M1 ≡_C^r M2, then APC(M1) = APC(M2), where APC(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the three AMs Odds ratio, Yule's ω and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with the simplest representative from each rank equivalent group.

¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence we still consider such AM pairs to be rank equivalent.

1) [M2] Mutual Information; [M3] Log likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (aka Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (aka Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
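As a quick empirical check of such groupings, one can test whether two AMs order a candidate set identically; the sketch below is our own illustration, with AMs represented as functions over (a, b, c, d) tuples:

    def rank_equivalent(am1, am2, candidates, eps=1e-12):
        # Empirically check rank equivalence over a candidate set: every pair
        # must be ordered identically (including ties) by the two measures.
        s1 = [am1(*c) for c in candidates]
        s2 = [am2(*c) for c in candidates]
        for j in range(len(candidates)):
            for k in range(j + 1, len(candidates)):
                d1, d2 = s1[j] - s1[k], s2[j] - s2[k]
                if (d1 > eps) != (d2 > eps) or (abs(d1) <= eps) != (abs(d2) <= eps):
                    return False
        return True

For group 5, passing the odds ratio ad/bc and Yule's Q, (a*d − b*c)/(a*d + b*c), over any candidate set should return True, since Q is a monotone transform of the odds ratio.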

5 Examination of Association Measures

We highlight two important findings from our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.

5.1 Mutual Information and Pointwise Mutual Information

Over the 82 AMs in Figure 1, PMI ranks 11th in identifying VPCs, while MI ranks 35th in identifying LVCs.

[Figure 1a: AP profile of the AMs examined over our VPC data set]

[Figure 1b: AP profile of the AMs examined over our LVC data set]

Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets (AP axis from 0.1 to 0.6). Bold points indicate AMs discussed in this paper. Markers: hypothesis test AMs; ◊ Class I AMs excluding hypothesis test AMs; + context-based AMs.


In this section we show how their performance can be improved significantly.

Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U, V) = Σ_{u,v} p(uv) log( p(uv) / (p(u*) p(*v)) )

In the context of bigrams, the above formula can be simplified to

[M2] MI = (1/N) Σ_ij f_ij log( f_ij / f̂_ij )

While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log( P(xy) / (P(x*) P(*y)) ) = log( N f(xy) / (f(x*) f(*y)) )

It has long been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of the two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille 1994) or multiplying PMI by f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying f(xy) directly with PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs          LVCs          Mixed
[M6] PMI^k (best)  .547 (k=.13)  .573 (k=.85)  .544 (k=.32)
[M4] PMI           .510          .566          .515
[M5] Local-PMI     .259          .393          .272
[M1] Joint Prob.   .170          .280          .175

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.

Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze 1999, p. 66). In particular, we can show that MI(X, Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently that MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α·P(x*) + (1 − α)·P(*y) with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank first in the VPC and LVC tasks, respectively. If one rewrites MI as MI = (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.
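The following minimal sketch (our own, with the contingency cells a, b, c, d as defined in Table 1) shows the proposed normalization of PMI:

    import math

    def normalized_pmi(a, b, c, d, alpha=None):
        # Divide PMI by NF(alpha) = alpha*P(x*) + (1 - alpha)*P(*y),
        # or by NFmax = max(P(x*), P(*y)) when alpha is None.
        N = a + b + c + d
        pmi = math.log(N * a / ((a + b) * (a + c)))
        px, py = (a + b) / N, (a + c) / N
        nf = max(px, py) if alpha is None else alpha * px + (1 - alpha) * py
        return pmi / nf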

AM            VPCs          LVCs          Mixed
MI / NF(α)    .508 (α=.48)  .583 (α=.47)  .516 (α=.5)
MI / NFmax    .508          .584          .518
[M2] MI       .273          .435          .289
PMI / NF(α)   .592 (α=.8)   .554 (α=.48)  .588 (α=.77)
PMI / NFmax   .565          .517          .556
[M4] PMI      .510          .566          .515

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.

5.2 Penalization Terms

It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, i.e., formulae whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Brawn-Blanquet (aka Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S cost and [M25] Laplace, where we replaced the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                      VPCs  LVCs  Mixed
[M21] Brawn-Blanquet    .478  .578  .486
[M22] Simpson           .249  .382  .260
[M24] Adjusted S cost   .485  .577  .492
[M23] S cost            .249  .388  .260
[M26] Adjusted Laplace  .486  .577  .493
[M25] Laplace           .241  .388  .254

Table 6: Replacing min() with max() in selected AMs.

In the [M27] Fager AM, the penalization term (1/2) max(b, c) is subtracted from the first term, which is rank equivalent to [M7] PMI². In our application this AM is not good, since the second term is far larger than the first term, which is less than 1; as such, Fager is largely equivalent to just −(1/2) max(b, c). In order to make use of the first term, we need to replace the constant 1/2 so that the subtracted term becomes a scaled-down version of max(b, c). We have approximately derived √(aN) as a lower-bound estimate of max(b, c) under the independence assumption, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                    VPCs  LVCs  Mixed
[M28] Adjusted Fager  .564  .543  .554
[M27] Fager           .552  .439  .525

Table 7: Performance of Fager and its adjusted version.

The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs; thus this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary to scale b and c properly, giving the scaled variant of [M14]: −α·b − (1 − α)·c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with

[MR14] = PMI / (α·b + (1 − α)·c)

The denominator of [MR14] is obtained by removing the minus signs from the scaled [M14], so that it can be used as a penalization term. The choice of PMI for the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α·P(x*) + (1 − α)·P(*y), which was successfully used to divide PMI in the normalized PMI experiment. We heuristically simplified [MR14] to the following AM:

[M30] = log(ad) / (α·b + (1 − α)·c)

The settings of α in Table 8 below are taken from the best α settings obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], while computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.

AM                               VPCs         LVCs          Mixed
PMI / NF(α)                      .592 (α=.8)  .554 (α=.48)  .588 (α=.77)
[M30] log(ad) / (α·b + (1−α)·c)  .600 (α=.8)  .484 (α=.48)  .588 (α=.77)
[M14] Third Sokal-Sneath         .565         .453          .546

Table 8: AP performance of suggested VPC penalization terms and AMs.

With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α · c^(1−α). The best α value is again taken from the normalized PMI experiments (Table 5), and is nearly 0.5; under this setting, this penalization term is essentially the denominator of [M18] Odds ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.


AM                VPCs  LVCs  Mixed
−b − c            .565  .453  .546
1/(bc)            .502  .532  .502
[M18] Odds ratio  .443  .567  .456

Table 9: AP performance of suggested LVC penalization terms and AMs.

6 Conclusions

We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II). We find that the latter are not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts; this phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, showing that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of .588 compared to .515 when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of .583.

We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α·b + (1 − α)·c is a good penalization term for VPCs, while b^α · c^(1−α) is suitable for LVCs. The introduced tuning parameter α should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.

Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).

References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de/, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, M. Yen and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.

Appendix A: List of particles used in identifying verb-particle constructions

about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

R. Mahesh K. Sinha

Department of Computer Science & Engineering
Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in

Abstract

A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of the light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.

1 Introduction

Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family, and have been widely studied (Hook 1974; Abbi 1992; Verma 1993; Mohanan 1994; Singh 1994; Butt 1995; Butt and Geuder 2001; Butt and Ramchand 2001; Butt et al. 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al. 2007) and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi 2005), with limited success. Chakrabarti et al. (2008) present a method for the automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina 2006), using the GIZA++ word alignment tool to align the projected POS information; a precision of 83% and a recall of 46% have been reported.

In this paper we present a simple strategy for mining CPs in Hindi using the projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation and results.

2 Complex Predicates in Hindi

A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh 1994; Butt 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.

Given below are some examples.

(1) CP = noun+LV; noun = ashirwad (blessings), LV = denaa (to give)
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP:
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings) makes the complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2) the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2); whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush (happy), LV = karanaa (to do)
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa (to read), LV = lenaa (to take)
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual, a modal or an intensifier.

(5) CP = verb+LV; verb = phaadanaa (to tear), LV = denaa (to give)
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa (to give), LV = maaranaa (to hit, to kill)
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning; this may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV, with CP = vaapas karanaa (to return) and LV = denaa (to give)
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another is to consider these as a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act); the semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.

From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.

3 System Design

As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb's meaning in the aligned English sentence. The steps involved are as follows (a minimal sketch of the inner loop is given after the list):

1) Align the sentences of the Hindi-English corpus.

2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).

3) For each Hindi light verb, generate all the morphological forms (Figure 1).

4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).

5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for the Hindi light verb (LV) or any of its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or a morphological derivative is found, then search the corresponding aligned English sentence for any of the morphological forms of the equivalent English meanings (Figure 2).
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); if a match is found, then exit, i.e., go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when the word is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e., go to step (a); else the identified CP is W+LV.
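The sketch below is our own rendering of steps (i)-(vi) for one aligned sentence pair; the data structures lv_forms and en_meanings (mapping each light verb to the form sets of Figures 1 and 2) are illustrative assumptions:

    def mine_cps(hi_words, en_words, lv_forms, en_meanings, stop_words, exit_words):
        # hi_words: tokenized Hindi sentence; en_words: set of tokens from the
        # aligned English sentence; lv_forms[lv] / en_meanings[lv]: form sets.
        found = []
        for lv, forms in lv_forms.items():
            for k, w in enumerate(hi_words):
                if w not in forms:
                    continue                        # step (i): locate the LV
                if en_meanings[lv] & en_words:
                    continue                        # step (ii): literal meaning present
                j = k - 1
                while j >= 0 and hi_words[j] in stop_words:
                    j -= 1                          # step (iv): skip stop words
                if j >= 0 and hi_words[j] not in exit_words:
                    found.append((hi_words[j], w))  # step (vi): CP = W + LV
        return found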

Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings are shown within parentheses and are not used for matching.

light verb (base form)   root verb meaning
baithanaa (बैठना)         sit
bananaa (बनना)           make/become/build/construct/manufacture/prepare
banaanaa (बनाना)          make/build/construct/manufacture/prepare
denaa (देना)              give
lenaa (लेना)              take
paanaa (पाना)             obtain/get
uthanaa (उठना)           rise/arise/get-up
uthaanaa (उठाना)          raise/lift/wake-up
laganaa (लगना)           feel/appear/look/seem
lagaanaa (लगाना)          fix/install/apply
cukanaa (चुकना)           (finish)
cukaanaa (चुकाना)         pay
karanaa (करना)           (do)
honaa (होना)              happen/become/be
aanaa (आना)              come
jaanaa (जाना)             go
khaanaa (खाना)           eat
rakhanaa (रखना)          keep/put
maaranaa (मारना)          kill/beat/hit
daalanaa (डालना)          put
haankanaa (हाँकना)        drive

Table 1: Some of the common light verbs in Hindi.


For each of the Hindi light verbs, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each of the English meanings of the light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations.

There are a number of words that can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.

We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the exit words in our implementation; Figure 4 shows a partial list. This list can be augmented based on an analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.

Figure 2: Morphological derivatives of sample English meanings.
sit: sit, sits, sat, sitting
give: give, gives, gave, given, giving

Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa (जाना, to go):
jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgaa, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgii, jaakara
(In the original figure, each Devanagari form is glossed for tense, aspect, gender, number and person, e.g. jaataa = go, indefinite masculine singular; jaakara = go, completion.)

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa (लेना, to take):
le, lii, leM, lo, letaa, letii, lete, liiM, luuM, legaa, legii, lene, lenaa, lenii, liyaa, leMge, loge, letiiM, luuMgaa, luuMgii, lekara
(Glossed analogously in the original figure, e.g. liyaa = take, past masculine singular; lekara = take, completion.)



The inner loop of the procedure identifies the multiple CPs that may be present in a sentence; the outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.

4 Experimentation and Results

The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%; the F-measure is 88% to 97%. This is a remarkable and somewhat surprising result for such a simple methodology without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; theirs is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                 File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                   112     193     102      43     133     107
Total no. of CPs (N)               200     298     150      46     188     151
Correctly identified CPs (TP)      195     296     149      46     175     135
V-V CPs                             56      63       9       6      15      20
Incorrectly identified CPs (FP)     17      44       7      11      16      20
Unidentified CPs (FN)                5       2       1       0      13      16
Accuracy (%)                     97.50   99.33   99.33  100.00   93.08   89.40
Precision (TP/(TP+FP)) (%)       91.98   87.05   95.51   80.70   91.62   87.09
Recall (TP/(TP+FN)) (%)          97.50   98.33   99.33  100.00   93.08   89.40
F-measure (2PR/(P+R)) (%)         94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation.

Figure 4: A few exit words in Hindi used by the system:
ने (ergative case marker), को (accusative case marker), का (possessive case marker), के (possessive case marker), की (possessive case marker), से (from/by/with), में (in/into), पर (on/but), और (and, Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that, Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा (my), मेरे (my), मेरी (my), तुम्हारा (your), तुम्हारे (your), तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना (own), अपनी (own), अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका (whose), जिसकी (whose), जिसके (whose), जिनको (to whom), जिनके (to whom)

Figure 3: Stop words in Hindi used by the system:
नहीं (no/not), न (no/not, Hindi particle), भी (also, Hindi particle), ही (only, Hindi particle), तो (then, Hindi particle), क्यों (why), क्या (what, Hindi particle), कहाँ (where, Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अंत में (in the end), आखिर में (in the end)


Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents, who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं |
The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. खुशी होना (to feel happy, good)
v. मदद करना (to help)

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al. 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं | अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की क़द्र करते हैं |
The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. क़द्र करना (to respect)

Here also the two identified CPs are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not checked. However, the mining success rate obtained shows that such cases are few in practice. The use of the 'stop words', allowing intervening words within CPs, helps a lot in improving the performance; similarly, the use of the 'exit words' avoids many incorrect identifications.

5 Conclusions

The simple empirical method for mining CPs outlined in this work yields an average 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both the alignment accuracy and the speed.

The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family.

References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (editors), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (editors), Semi-lexical Categories: On the content of function words and the function of content words, Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing - 2005, Jeju Island, Korea, 553–564.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 47–54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren¹, Yajuan Lü¹, Jie Cao¹, Qun Liu¹, Yun Huang²

¹Key Lab of Intelligent Info. Processing
Institute of Computing Technology
Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190, China
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn

²Department of Computer Science
School of Computing
National University of Singapore
Computing 1, Law Link, Singapore 117590
huangyun@comp.nus.edu.sg

Abstract

Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, a state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.

1 Introduction

The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al. 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To exploit the robustness of phrases, and to make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into an existing phrase-based model.

A multiword expression (MWE) can be considered a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and different researchers emphasize different properties. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e., the property that the whole expression cannot be derived from its component words. Stanford University launched an MWE project¹, in which different qualities of MWEs were presented. For a bilingual multiword expression (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is an MWE in the source language, and (2) the source phrase and the target phrase are exact translations of each other, i.e., there is no additional (boundary) word in the target phrase that has no corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al. 2005; Bannard 2007; Fazly and Stevenson 2006). Since an MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications, such as information retrieval, question answering, word sense disambiguation and so on.

For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as

¹ http://mwe.stanford.edu


one unique token before training the alignment models. They reported that both alignment quality and translation accuracy improved on a small corpus. However, in a further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system so as to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system, and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are listed as follows:

• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses², which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics lead only to small gains in translation performance (Ayan and Dorr 2006; Fraser and Marcu 2007).

Besides the above differences, there are some advantages to our approaches:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would otherwise propagate to the SMT system and hurt SMT performance.

• We conduct experiments on a domain-specific corpus. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e., a domain-specific corpus has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since such MWEs are flexible and concise. Take the Chinese term "软坚散结" for example: the meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or without this context. Such terms are difficult to translate even for humans, let alone for machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.

² http://www.statmt.org/moses

• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. This allows full exploitation of the available resources without greatly increasing the time and space complexity of the SMT system.

The remainder of the paper is organized as follows: Section 2 describes the bilingual MWE extraction technique; Section 3 proposes three methods to apply bilingual MWEs in an SMT system; Section 4 presents the experimental results; Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we give only a brief introduction to monolingual and bilingual MWE extraction.

2 Bilingual Multiword Expression Extraction

In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus; after that, we look for the translations of the extracted MWEs in the parallel corpus.

2.1 Automatic Extraction of MWEs

In the past two decades, many different approaches to automatic MWE identification have been reported. In general, these approaches fall into three main trends: (1) statistical approaches (Pantel and Lin 2001; Piao et al. 2005), (2) syntactic approaches (Fazly and Stevenson 2006; Bannard 2007), and (3) semantic approaches (Baldwin et al. 2003; Cruys and Moiron 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches consider only frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures take only two words into account, so it is not easy to extract MWEs containing three or more words.

The Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al. 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate our algorithm, we first define some useful terms; in the following definitions, we assume the given sentence is "A B C D E".

Definition 2 List A list is an ordered sequence ofunits which exactly cover the given sentence ForexampleldquoArdquoldquo B C Drdquoldquo Erdquo forms a list

Definition 3 Score The score function only de-fines on two adjacent units and return the LLRbetween the last word of first unit and the firstword of the second unit3 For example the scoreof adjacent unit ldquoB Crdquo and ldquoD Erdquo is defined asLLR(ldquo Crdquoldquo Drdquo)

Definition 4 Select The selecting operator is tofind the two adjacent units with maximum scorein a list

Definition 5 ReduceThe reducing operator is toremove two specific adjacent units concatenatethem and put back the result unit to the removedposition For example if we want to reduce unitldquoB Crdquo and unit ldquoDrdquo in list ldquoArdquoldquo B Crdquoldquo Drdquoldquo Erdquowe will get the listldquoArdquoldquo B C Drdquoldquo Erdquo

Initially, every word in the sentence is considered one unit, and all these units form an initial list L; if the sentence is of length N, the list of course contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with the maximum score in L, naming them U1 and U2; and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. Our algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.

³ We use a stoplist to eliminate units containing function words, by setting their score to 0.
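A minimal sketch of the HRA (our own, assuming an llr(w1, w2) function precomputed from corpus counts) is given below:

    def hra(sentence, llr, threshold=2.0, stoplist=frozenset()):
        # LLR-based Hierarchical Reducing Algorithm (sketch).
        # sentence: list of words; llr(w1, w2): association of adjacent words.
        def score(u1, u2):
            w1, w2 = u1[-1], u2[0]
            return 0.0 if w1 in stoplist or w2 in stoplist else llr(w1, w2)

        units = [[w] for w in sentence]          # initial list: one unit per word
        mwes = []
        while len(units) > 1:
            # Select: the adjacent pair with maximum score
            i = max(range(len(units) - 1),
                    key=lambda k: score(units[k], units[k + 1]))
            if score(units[i], units[i + 1]) < threshold:
                break                            # termination condition (1)
            # Reduce: concatenate the pair and record the result as an MWE
            units[i:i + 2] = [units[i] + units[i + 1]]
            mwes.append(" ".join(units[i]))
        return mwes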

[Figure 1: Example of the Hierarchical Reducing Algorithm. The figure shows the hierarchical structure built over the sentence 防 高 血 脂 的 保健 茶, with the LLR values of adjacent words (14.71, 675.52, 105.96, 0, 0, 80.96) at the bottom and the successively reduced units above; a dotted line separates the units that are extracted from those that are not.]

Let us make the algorithm clearer with an example. Assume the threshold score is 2.0 and the given sentence is "防 高 血 脂 的 保健 茶".⁴ Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs (高 血, 高 血 脂, 保健 茶, 防 高 血 脂) are extracted, in that order, and the substrings above the dotted line in Figure 1 are not extracted.

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are scored by the minimum LLR of adjacent words within them, which gives lexical confidence to the extracted MWEs. Finally, for a given sentence of length N, the HRA is guaranteed to terminate within N − 1 iterations, which is very efficient.

However, the HRA has a problem: it extracts substrings before extracting the whole string, even if the substrings appear only in that particular whole string, which we consider useless. To solve this problem, we use contextual features,

⁴The whole sentence means "healthy tea for preventing hyperlipidemia"; the glosses of the individual Chinese words are [preventing], [hyper-], [-lipid-], [-emia], [for], [healthy], [tea].


namely contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.

2.2 Automatic Extraction of MWEs' Translations

In subsection 2.1 we described the algorithm for obtaining MWEs; in this subsection we introduce the procedure for finding their translations from a parallel corpus.

To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:

1. Run GIZA++⁵ to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs whose source-language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.

⁵http://www.fjoch.com/GIZA++.html

We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
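This classification step can be sketched with a standard perceptron over the ten features above; feature extraction is abstracted into plain vectors, and the code is an illustration under that assumption, not the authors' implementation:

import numpy as np

def perceptron_train(X, y, epochs=10):
    """Train a binary perceptron (update only on misclassified examples).

    X -- (n_samples, n_features) array of translation/language features
    y -- labels in {+1, -1}: is the candidate a correct translation?
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified
                w += yi * xi                   # standard perceptron update
                b += yi
    return w, b

def classify(w, b, x):
    # +1: keep the candidate translation; -1: discard it
    return 1 if np.dot(w, x) + b > 0 else -1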

3 Application of Bilingual MWEs

Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, further research is still needed on how to integrate bilingual MWEs into an SMT system. In this section we propose three methods for utilizing bilingual MWEs, and we compare their performance in Section 4.

3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them to the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect the alignment to be modified and the probability estimation to become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance has been shown by many researchers (Wu et al., 2008; Eck et al., 2004). Our method, however, shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.

3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains a bilingual MWE. In other words, if the source-language phrase contains an MWE (as a substring) and


the target-language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
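As a sketch, the feature can be computed per phrase-table entry as follows, assuming a dictionary bimwe that maps each extracted source-side MWE to its set of translations (a hypothetical structure of ours, not part of the paper):

def mwe_feature(src_phrase, tgt_phrase, bimwe):
    """Return 1 if the source phrase contains an extracted MWE as a
    substring and the target phrase contains one of its translations
    as a substring, else 0."""
    for mwe, translations in bimwe.items():
        if mwe in src_phrase and any(t in tgt_phrase for t in translations):
            return 1
    return 0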

3.3 Additional Phrase Table of Bilingual MWEs

Wu et al. (2008) proposed a method to construct a phrase table from a manually made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As for probability assignment, we simply assign 1 to the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in an input sentence, the decoder searches for candidate translation phrases in both phrase tables during translation. We use "Baseline+NewBP" to denote this method.

4 Experiments

4.1 Data

We run experiments on two domain-specific patent corpora: one for the traditional medicine domain and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit⁶ to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.

                       Chinese      English
Training  Sentences          120,355
          Words       4,688,873    4,737,843
Dev       Sentences            1,000
          Words          31,722       32,390
Test      Sentences            1,000
          Words          41,643       40,551

Table 1: Traditional medicine corpus

⁶http://www.speech.sri.com/projects/srilm/

Table 2 gives detailed information for the chemical industry domain. In this experiment, 59,466 bilingual MWEs are extracted.

                       Chinese      English
Training  Sentences          120,856
          Words       4,532,503    4,311,682
Dev       Sentences            1,099
          Words          42,122       40,521
Test      Sentences            1,099
          Words          41,069       39,210

Table 2: Chemical industry corpus

We test translation quality on the test set and use the open-source tool mteval-v11b.pl⁷ to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation there is only one reference per test sentence. We also perform a statistical significance test between two translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.

4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; and (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in the log-linear model. To obtain the best translation ê of the source sentence f, the log-linear model uses the following equation:

\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)   (1)

in which h_m and λ_m denote the m-th feature and weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
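A minimal sketch of decoding with Equation (1), where features stands for the feature functions h_m and weights for the tuned λ_m (illustrative names only, not Moses internals):

def loglinear_score(e, f, features, weights):
    """Score a candidate translation e of source f:
    sum over m of lambda_m * h_m(e, f), as in Equation (1)."""
    return sum(lam * h(e, f) for lam, h in zip(weights, features))

def best_translation(candidates, f, features, weights):
    # arg max over candidate translations (Equation 1)
    return max(candidates,
               key=lambda e: loglinear_score(e, f, features, weights))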

⁷http://www.nist.gov/speech/tests/mt/resources/scoring.htm


4.3 Results

Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experimental results. From this table we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement, and Baseline+BiMWE achieves slightly higher translation quality than the baseline system.

To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined by the following formula:

\mathrm{Score}(s, t) = \log(\mathrm{LLR}(s, t)) \times \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}   (2)

Here the mean µ and variance σ² are estimated from the training set. After ranking by score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
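Equation (2) is straightforward to compute; a minimal sketch, with mu and sigma estimated from the length ratios observed in the training pairs (names are ours):

import math

def ranking_score(llr_st, x, mu, sigma):
    """Equation (2): log(LLR(s,t)) weighted by a Gaussian density
    over the source/target length ratio x."""
    gauss = (1.0 / (math.sqrt(2 * math.pi) * sigma)) \
            * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return math.log(llr_st) * gauss

# Usage sketch: rank the positive-labeled pairs and keep the top k, e.g.
# top = sorted(pairs, key=lambda p: ranking_score(*p), reverse=True)[:50000]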

From this table we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better

Methods            50,000   60,000   70,000
Baseline                    0.2658
Baseline+BiMWE     0.2671   0.2686   0.2715
Baseline+Feat      0.2728   0.2734   0.2712
Baseline+NewBP     0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain

than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; and (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as that of Lambert and Banchs (2005).

To verify the assumption that bilingual MWEs improve SMT performance not only in one particular domain, we also perform experiments in the chemical industry domain. Table 5 shows the results. From this table we can see that these three methods improve translation performance in the chemical industry domain as well as in the traditional medicine domain.

Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain

4.4 Discussion

In order to understand in what respects our methods improve translation performance, we manually analyzed some test sentences; this subsection gives some examples.

(1) For the first example in Table 6, the Chinese phrase meaning "dredging meridians" is aligned to other words and is not correctly translated by the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since the corresponding bilingual MWE has been added into the training corpus and then aligned by GIZA++.

(2) For the second example in Table 6, the Chinese word meaning "medicated tea" has two candidate translations in the phrase table, "tea" and "medicated tea". The baseline system chooses "tea" as the translation, while the Baseline+Feat system chooses "medicated tea", because the additional feature gives a high probability to the correct translation.


Src:      (Chinese source sentence; characters not recoverable)
Ref:      the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE:   the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src:      (Chinese source sentence; characters not recoverable)
Ref:      the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat:    may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples


5 Conclusion and Future Work

This paper has presented the LLR-based Hierarchical Reducing Algorithm for automatically extracting bilingual MWEs and investigated the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains a bilingual MWE performs best in most cases. The other two strategies also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are rough and simple. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in phrase-based SMT systems, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will conduct further studies on improving statistical machine translation using domain bilingual MWEs.

Acknowledgments

This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: An extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9-16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89-96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1-8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1-5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531-540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1-8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792-798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337-344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41-46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293-303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48-54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396-403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9-16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90-99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24-30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36-46.

Kishore Papineni, Salim Roukos, Todd Ward, and Weijing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311-318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378-397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1-15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17-24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993-1000.


Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 55-62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp

Abstract

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, so a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

1 Introduction

Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).

In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.

When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized as S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized as B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information from the head of the bunsetsu.¹ In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized as B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.

¹Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.


(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama TOP)
    'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In determining the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on methods based on sequential labeling. Section 3 gives an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.

2 Related Work

In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX

Class          Example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21st century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples

Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.

NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach has achieved better performance than the rule-based approach.

In general, a machine learning method is formalized as a sequential labeling problem, which assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, and E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.² The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 x 4 + 1).

The model for label estimation is learned by machine learning. The following features are generally utilized: characters, type of

²Besides, there are the IOB1 and IOB2 algorithms, using only I, O, B, and the IOE1 and IOE2 algorithms, using only I, O, E (Kim and Veenstra, 1999).


(Figure 2: An overview of our proposed method for the bunsetsu "Habu-Yoshiharu-Meijin". (a) Initial state: each chunk carries its estimated label and score, e.g. "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Yoshiharu" MONEY 0.075, "Yoshiharu-Meijin" OTHERe 0.092, "Meijin" OTHERe 0.245. (b) Final output: the analysis proceeds toward the upper-right cell, comparing assignments such as "Habu-Yoshiharu + Meijin" (PSN+OTHERe, 0.438+0.245) and "Yoshiharu + Meijin" (MNY+OTHERe, 0.075+0.245).)

character, POS, etc., of the morpheme and the surrounding two morphemes. Methods utilizing SVMs or CRFs have been proposed.

Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word subclass of the NE in the analysis direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized cache features, coreference results, syntactic features, and case-frame features as structural features (Sasano and Kurohashi, 2008).

Some work has acquired knowledge from large unannotated corpora and applied it to NER. Kazama et al. utilized a named entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of dependency relation pairs (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") using lexical patterns over a large Web corpus (Fukushima et al., 2008).

In Japanese NER research, the CRL NE data are usually used for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.

3 Overview of Proposed Method

Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu.

Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), in which the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the assignment "Habu-Yoshiharu" (PERSON) and the assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.

The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.


(Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin(-ga)": "Habu-Yoshiharu" is PERSON, "Meijin" is OTHERe, and the remaining chunks are invalid.)

4 Model Learning

4.1 Label Estimation Model

This model estimates the label of each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions, which span multiple bunsetsus, tend to be an NE:

• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, the bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.³

For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.

Next, for learning the label estimation model, all the chunks in a bunsetsu are given their correct labels from a hand-annotated corpus. The label set comprises 13 classes: the eight NE classes (shown in Table 1) and five further classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

A chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled OTHERb, OTHERi, or OTHERe, respectively.⁴

³As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

⁴Each OTHER label is assigned to the longest chunk that satisfies its condition within a chunk.

1. Number of morphemes in the chunk

2. The position of the chunk in its bunsetsu

3. Character type⁵

4. The combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"

5. Word class, word subclass, and several features provided by the morphological analyzer JUMAN

6. Several features⁶ provided by the parser KNP

7. String of the morpheme in the chunk

8. IPADIC⁷ feature
   - If the string of the chunk is registered in the following categories of IPADIC, this feature fires: "person", "location", "organization", and "general"

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - The hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")

10. Cache feature
    - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used as the feature

11. Particles that the bunsetsu includes

12. The morphemes, particles, and head morpheme in the parent bunsetsu

13. The NE/category ratio in a case slot of the predicate/noun case frame (Sasano and Kurohashi, 2008)
    - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. Parenthesis feature
    - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model

A chunk that is neither one of the eight NE classes nor one of the above four OTHERs is labeled invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and the other chunks are labeled invalid.

Next, the label estimation model is learned from the data in which the above label set has been assigned

⁵The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

⁶When a morpheme has ambiguity, all the corresponding features fire.

⁷http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja


to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of a morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features for each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1 + exp(-βx)), and the transformed value is normalized so that the sum of the values of the 13 labels for a chunk is one.
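A small sketch of this score transformation over the 13 one-vs-rest SVM outputs (β = 1 in the experiments of Section 6; the names are ours):

import math

def label_distribution(svm_outputs, beta=1.0):
    """Map raw one-vs-rest SVM outputs {label: x} to a distribution:
    apply the sigmoid 1/(1+exp(-beta*x)), then normalize to sum to one."""
    probs = {lab: 1.0 / (1.0 + math.exp(-beta * x))
             for lab, x in svm_outputs.items()}
    total = sum(probs.values())
    return {lab: p / total for lab, p in probs.items()}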

The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk where the label invalid has the highest score, the label with the second highest score is adopted.

4.2 Label Comparison Model

This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled PERSON

• "Habu" is labeled PERSON and "Yoshiharu" is labeled MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up, separated by "vs." (the left one is called the first set, the right one the second set). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks for each set is five, so examples in which the first or second set has more than five chunks are not used for model learning.

Then, features are assigned to each example. The feature (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used

positive:
  +1  "Habu-Yoshiharu" (PSN) vs. "Habu" + "Yoshiharu" (PSN + MNY)
  +1  "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe) vs. "Habu" + "Yoshiharu" + "Meijin" (PSN + MONEY + OTHERe)
negative:
  -1  "Habu-Yoshiharu-Meijin" (ORG) vs. "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe)

Figure 4: Assignment of positive/negative examples

for the labels estimated by the label estimation model, and the last (13th) dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining positions.

Figure 5 illustrates the feature for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using an SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.
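A sketch of how one comparison example can be vectorized, following the layout of Figure 5: each set contributes up to five 13-dimensional chunk slots (label scores plus the raw SVM output, per the paper's description), zero-padded, first set then second set. The helper names are ours:

import numpy as np

MAX_CHUNKS = 5   # maximum chunks per set
CHUNK_DIM = 13   # 12 label dimensions + 1 SVM-score dimension

def chunk_vector(label_scores, svm_score):
    """13-dim vector for one chunk: label scores, then the SVM score."""
    v = np.zeros(CHUNK_DIM)
    v[:len(label_scores)] = label_scores
    v[-1] = svm_score
    return v

def example_vector(first_set, second_set):
    """Concatenate chunk vectors for the two compared label assignments,
    zero-padding each set to MAX_CHUNKS slots."""
    def set_block(chunks):
        block = np.zeros(MAX_CHUNKS * CHUNK_DIM)
        for i, (scores, svm) in enumerate(chunks[:MAX_CHUNKS]):
            block[i * CHUNK_DIM:(i + 1) * CHUNK_DIM] = chunk_vector(scores, svm)
        return block
    return np.concatenate([set_block(first_set), set_block(second_set)])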

5 Analysis

First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B"; that is, "Habu-Yoshiharu" is labeled PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the


chunk    Habu-Yoshiharu       Habu       Yoshiharu
label    PERSON               PERSON     MONEY
vector   V11 0 0 0 0          V21 V22 0 0 0

Figure 5: An example of the feature for the label comparison model (the example is "Habu-Yoshiharu" vs. "Habu + Yoshiharu"; V11, V21, V22, and 0 are vectors of dimension 13)

(Figure 6: The label comparison model. (a) Label assignment for the cell "Habu-Yoshiharu": the assignment "Habu-Yoshiharu" (B) is compared with "Habu" + "Yoshiharu" (A+D). (b) Label assignment for the cell "Habu-Yoshiharu-Meijin": the assignments "Habu-Yoshiharu-Meijin" (C), "Habu-Yoshiharu" + "Meijin" (B+F), and "Habu" + "Yoshiharu" + "Meijin" (A+D+F) are compared.)

label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F"; that is, "Yoshiharu" is labeled MONEY and "Meijin" is labeled OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper-right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparison step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
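The procedure of this section can be sketched as a CKY-style chart over morpheme spans; estimate_label stands in for the label estimation model and prefer_split for the pairwise label comparison model (both placeholders, not the authors' code):

def analyze_bunsetsu(morphemes, estimate_label, prefer_split):
    """CKY-style bottom-up label assignment (sketch).

    estimate_label(chunk) -> (label, score) for a chunk string
    prefer_split(whole, split) -> True if the split assignment should
        be preferred over the current whole-chunk assignment
    Each chart cell holds a list of (chunk, label) pairs.
    """
    n = len(morphemes)
    chart = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            chunk = "".join(morphemes[i:j])
            best = [(chunk, estimate_label(chunk)[0])]  # whole chunk
            for k in range(i + 1, j):                   # every split point
                split = chart[(i, k)] + chart[(k, j)]
                if prefer_split(best, split):
                    best = split
            chart[(i, j)] = best
    return chart[(0, n)]  # label assignment for the whole bunsetsu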

6 Experiment

To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled OPTIONAL and were used for neither

(Figure 7: 5-fold cross-validation. The corpus (CRL NE data) is divided into parts (a)-(e). The label estimation model 1 is learned from parts (b)-(e); label estimation models 2b-2e are each learned from three of those parts and applied to the remaining one to obtain features for the label comparison model, which is then learned from the obtained features.)

the model learning⁸ nor the evaluation. We performed 5-fold cross-validation, following

previous work. Different from previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)-(e). Then, the label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same holds for parts

⁸Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.


               Recall               Precision
ORGANIZATION   81.83 (3008/3676)    88.37 (3008/3404)
PERSON         90.05 (3458/3840)    93.87 (3458/3684)
LOCATION       91.38 (4992/5463)    92.44 (4992/5400)
ARTIFACT       46.72 (349/747)      74.89 (349/466)
DATE           93.27 (3327/3567)    93.12 (3327/3573)
TIME           88.25 (443/502)      90.59 (443/489)
MONEY          93.85 (366/390)      97.60 (366/375)
PERCENT        95.33 (469/492)      95.91 (469/489)
ALL-SLOT       87.87                91.79
F-measure      89.79

Table 3: Experimental result

(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.

In this experiment, the Japanese morphological analyzer JUMAN⁹ and the Japanese parser KNP¹⁰ were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.

7 Discussion

7.1 Comparison with Previous Work

Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") using some patterns over a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for model learning. Therefore, by incorporating such knowledge and/or analysis results, the performance of our method could be further improved.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about

⁹http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
¹⁰http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html

the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled ARTIFACT.

In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", the information of "hou" can be utilized.

7.2 Error Analysis

There were some errors in analyzing Katakana alphabet words. In the following example, although the correct analysis is that "Batistuta" is labeled PERSON, the system labeled it OTHERs:

(4) Italy-de (Italy LOC) katsuyaku-suru (active) Batistuta-wo (Batistuta ACC) kuwaeta (call) Argentine (Argentine)
    'Argentina called up Batistuta, who was active in Italy'

There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, so only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were also errors in applying the label comparison model even though the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis is that "HongKong-seityou" is labeled ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we plan to adjust the value of β in the sigmoid function and to refine the


                                 F1      Analysis unit    Distinctive features
(Fukushima et al., 2008)         89.29   character        Web
(Kazama and Torisawa, 2008)      88.93   character        Wikipedia, Web
(Sasano and Kurohashi, 2008)     89.40   morpheme         structural information
(Nakano and Hirai, 2004)         89.03   character        bunsetsu feature
(Masayuki and Matsumoto, 2003)   87.21   character
(Isozaki and Kazawa, 2003)       86.77   morpheme
proposed method                  89.79   compound noun    Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on CRL NE data using cross-validation)

(Figure 8: An example of the error in the label comparison model. (a) Initial state: "HongKong" LOCATION 0.266, "HongKong-seityou" ORGANIZATION 0.205, "seityou" OTHERe 0.184. (b) Final output: "HongKong + seityou" (LOC+OTHERe, 0.266+0.184) is chosen over "HongKong-seityou" (ORGANIZATION).)

features for the label comparison model.

8 Conclusion

This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. This method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79.

We are planning to integrate this method with the syntactic and case analysis method (Kawahara and Kurohashi, 2007) and perform syntactic, case, and Named Entity analysis simultaneously to improve the overall accuracy.

References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970-979. (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304-311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407-415.

Erik F. Tjong Kim and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173-179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. pages 1121-1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282-289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8-15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934-941. (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607-612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171-178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.


Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 63-70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

(Pages 63-70: "Abbreviation Generation for Japanese Multi-Word Expressions", attributed per the author index to Hiroko Fujii, Mika Fukui, Kazuo Sumita, Masaru Suzuki, and Hiromi Wakaki. The body of this paper did not survive extraction; among the few surviving fragments is the example abbreviation ファミリーレストラン (family restaurant) → ファミレス.)

Author Index

Bhutada, Pravin, 17
Cao, Jie, 47
Caseli, Helena, 1
Diab, Mona, 17
Finatto, Maria Jose, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55
Hoang, Hung Huu, 31
Huang, Yun, 47
Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55
Liu, Qun, 47
Lu, Yajuan, 47
Machado, Andre, 1
Ren, Zhixiang, 47
Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63
Villavicencio, Aline, 1
Wakaki, Hiromi, 63
Zarrieß, Sina, 23

  • Workshop Program
  • Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
  • Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
  • Verb Noun Construction MWE Token Classification
  • Exploiting Translational Correspondences for Pattern-Independent MWE Identification
  • A re-examination of lexical association measures
  • Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
  • Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
  • Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
  • Abbreviation Generation for Japanese Multi-Word Expressions
Page 3: MWE 2009 2009 Workshop on Multiword Expressions

Introduction

The ACL 2009 Workshop on Multiword Expressions Identification Interpretation Disambiguation andApplications (MWErsquo09) took place on August 6 2009 in Singapore immediately following the annualmeeting of the Association for Computational Linguistics (ACL) This is the fifth time this workshop hasbeen held in conjunction with ACL following the meetings in 2003 2004 2006 and 2007

The workshop focused on Multi-Word Expressions (MWEs) which represent an indispensable part ofnatural languages and appear steadily on a daily basis both novel and already existing but paraphrasedwhich makes them important for many natural language applications Unfortunately while easilymastered by native speakers MWEs are often non-compositional which poses a major challenge forboth foreign language learners and automatic analysis

The growing interest in MWEs in the NLP community has led to many specialized workshops heldevery year since 2001 in conjunction with ACL EACL and LREC there have been also two recentspecial issues on MWEs published by leading journals the International Journal of Language Resourcesand Evaluation and the Journal of Computer Speech and Language

As a result of the overall progress in the field the time has come to move from basic preliminary researchto actual applications in real-world NLP tasks Thus in MWErsquo09 we were interested in the overallprocess of dealing with MWEs asking for original research on the following four fundamental topics

Identification Identifying MWEs in free text is a very challenging problem Due to the variability ofexpression it does not suffice to collect and use a static list of known MWEs complex rules andmachine learning are typically needed as well

Interpretation Semantically interpreting MWEs is a central issue For some kinds of MWEs egnoun compounds it could mean specifying their semantics using a static inventory of semanticrelations eg WordNet-derived In other cases MWErsquos semantics could be expressible by asuitable paraphrase

Disambiguation Most MWEs are ambiguous in various ways A typical disambiguation task is todetermine whether an MWE is used non-compositionally (ie figuratively) or compositionally(ie literally) in a particular context

Applications Identifying MWEs in context and understanding their syntax and semantics is importantfor many natural language applications including but not limited to question answering machinetranslation information retrieval information extraction and textual entailment Still despite thegrowing research interest there are not enough successful applications in real NLP problemswhich we believe is the key for the advancement of the field

Of course, the above topics largely overlap. For example, identification can require disambiguating between literal and idiomatic uses, since MWEs are typically required to be non-compositional by definition. Similarly, interpreting three-word noun compounds like morning flight ticket and plastic water bottle requires disambiguation between a left and a right syntactic structure, while interpreting two-word compounds like English teacher requires disambiguating between (a) 'teacher who teaches English' and (b) 'teacher coming from England (who could teach any subject, e.g. math)'.

We received 18 submissions, and given our limited capacity as a one-day workshop, we were only able to accept 9 full papers for oral presentation: an acceptance rate of 50%.

We would like to thank the members of the Program Committee for their timely reviews. We would also like to thank the authors for their valuable contributions.

Dimitra Anastasiou, Chikara Hashimoto, Preslav Nakov and Su Nam Kim
Co-Organizers


Organizers

Dimitra Anastasiou, Localisation Research Centre, Limerick University, Ireland
Chikara Hashimoto, National Institute of Information and Communications Technology, Japan
Preslav Nakov, National University of Singapore, Singapore
Su Nam Kim, University of Melbourne, Australia

Program Committee

Iñaki Alegria, University of the Basque Country (Spain)
Timothy Baldwin, University of Melbourne (Australia)
Colin Bannard, Max Planck Institute (Germany)
Francis Bond, National Institute of Information and Communications Technology (Japan)
Gaël Dias, Beira Interior University (Portugal)
Ulrich Heid, Stuttgart University (Germany)
Stefan Evert, University of Osnabrück (Germany)
Afsaneh Fazly, University of Toronto (Canada)
Nicole Grégoire, University of Utrecht (The Netherlands)
Roxana Girju, University of Illinois at Urbana-Champaign (USA)
Kyo Kageura, University of Tokyo (Japan)
Brigitte Krenn, Austrian Research Institute for Artificial Intelligence (Austria)
Éric Laporte, University of Marne-la-Vallée (France)
Rosamund Moon, University of Birmingham (UK)
Diana McCarthy, University of Sussex (UK)
Jan Odijk, University of Utrecht (The Netherlands)
Stephan Oepen, University of Oslo (Norway)
Darren Pearce, London Knowledge Lab (UK)
Pavel Pecina, Charles University (Czech Republic)
Scott Piao, University of Manchester (UK)
Violeta Seretan, University of Geneva (Switzerland)
Stan Szpakowicz, University of Ottawa (Canada)
Beata Trawiński, University of Tübingen (Germany)
Peter Turney, National Research Council of Canada (Canada)
Kiyoko Uchiyama, Keio University (Japan)
Begoña Villada Moirón, University of Groningen (The Netherlands)
Aline Villavicencio, Federal University of Rio Grande do Sul (Brazil)


Table of Contents

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto ... 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan ... 9

Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada ... 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn ... 23

A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan ... 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha ... 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang ... 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi ... 55

Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita ... 63


Workshop Program

Friday, August 6, 2009

8:30–8:45 Welcome and Introduction to the Workshop

Session 1 (8:45–10:00): MWE Identification and Disambiguation

8:45–9:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto

9:10–9:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

9:35–10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00–10:30 BREAK

Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55–11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45–13:50 LUNCH

Session 3 (13:50–15:30): Applications

13:50–14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang

14:15–14:40 Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00 BREAK

16:00–17:00 General Discussion

17:00–17:15 Closing Remarks



Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, André Machado♣, Maria José Finatto♥

♦ Department of Computer Science, Federal University of São Carlos (Brazil)
♣ Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠ Department of Computer Sciences, Bath University (UK)
♥ Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. Particularly, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.

1 Introduction

A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine) and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).

Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.

In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.

The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.

2 Related Work

The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g. to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.

A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g. (Zhang et al., 2006; Villavicencio et al., 2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g. (Pearce, 2002; Baldwin, 2005) for English and (Piao et al., 2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g. (Villada Moirón and Tiedemann, 2006; Caseli et al., 2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kinds of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).

Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that, for MWE identification, the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature over the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).

Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).

Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.

Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.

3 The Corpus and Reference Lists

The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students and resulted in 2,407 terms: 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
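The subsumption step can be pictured with a short sketch (ours, not part of the original glossary tooling; it assumes each n-gram is a plain string of space-separated words):

    def drop_subsumed(ngrams):
        """Keep only n-grams not contained in a longer kept n-gram, e.g.
        'aleitamento materno' is dropped in favour of
        'aleitamento materno exclusivo'."""
        kept = []
        for ng in sorted(ngrams, key=lambda s: len(s.split()), reverse=True):
            # pad with spaces so the containment test respects word boundaries
            if not any(f" {ng} " in f" {longer} " for longer in kept):
                kept.append(ng)
        return kept

    # drop_subsumed(["aleitamento materno", "aleitamento materno exclusivo"])
    # -> ["aleitamento materno exclusivo"]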

4 Statistically-Driven and Alignment-Based Methods

4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language and type independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992) and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

1 Available at the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
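To make the ranking concrete, here is a minimal sketch of this procedure (our illustration, not the authors' code: it assumes tokenized sentences, covers bigrams only, and uses the usual contingency-table definitions of PMI and MI rather than the Ngram Statistics Package):

    import math
    from collections import Counter

    def bigram_stats(sentences):
        """Count unigrams and adjacent bigrams over tokenized sentences."""
        uni, bi = Counter(), Counter()
        for sent in sentences:
            uni.update(sent)
            bi.update(zip(sent, sent[1:]))
        return uni, bi

    def pmi(bg, uni, bi, n):
        """Pointwise mutual information: log2 of p(w1,w2) / (p(w1) p(w2))."""
        w1, w2 = bg
        return math.log2((bi[bg] / n) / ((uni[w1] / n) * (uni[w2] / n)))

    def mi(bg, uni, bi, n):
        """Mutual information summed over the bigram's 2x2 contingency table."""
        w1, w2 = bg
        n11 = bi[bg]
        cells = {(1, 1): n11, (1, 0): uni[w1] - n11, (0, 1): uni[w2] - n11,
                 (0, 0): n - uni[w1] - uni[w2] + n11}
        total = 0.0
        for (e1, e2), nxy in cells.items():
            if nxy <= 0:
                continue
            p1 = uni[w1] / n if e1 else 1 - uni[w1] / n
            p2 = uni[w2] / n if e2 else 1 - uni[w2] / n
            total += (nxy / n) * math.log2((nxy / n) / (p1 * p2))
        return total

    def combined_ranking(cands, uni, bi, n):
        """Rank candidates by each measure; the final score is the average rank."""
        avg = Counter()
        for fn in (pmi, mi):
            ranked = sorted(cands, key=lambda bg: fn(bg, uni, bi, n), reverse=True)
            for rank, bg in enumerate(ranked, start=1):
                avg[bg] += rank / 2
        return sorted(cands, key=lambda bg: avg[bg])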

4.2 Alignment-Based method

The second of the MWE extraction approaches to be investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, i.e. a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n >= 2) and a target word sequence T (T = t1 ... tm, with m >= 1), that is, S <-> T, the alignment-based MWE extraction method assumes that: (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that a sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n >= 2 and m >= 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
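A minimal sketch of this extraction step (our reading, not the authors' code; it assumes alignments are available as (source index, target index) pairs, e.g. read from GIZA++ output, and that only consecutive source words are grouped):

    from collections import defaultdict

    def mwe_candidates(src_tokens, links):
        """Return source sequences of length >= 2 whose words all align to the
        same target unit, i.e. an n:m alignment with n >= 2 and m >= 1.
        links: iterable of (src_idx, tgt_idx) word-alignment pairs."""
        tgt_of = defaultdict(set)
        for s, t in links:
            tgt_of[s].add(t)
        candidates, i = [], 0
        while i < len(src_tokens):
            j = i + 1
            # extend the span while the next word shares the exact target set
            while j < len(src_tokens) and tgt_of[j] and tgt_of[j] == tgt_of[i]:
                j += 1
            if tgt_of[i] and j - i >= 2:
                candidates.append(" ".join(src_tokens[i:j]))
            i = j
        return candidates

    # e.g. 'aleitamento materno' both aligned to target word 3 ('breastfeeding'):
    # mwe_candidates(["o", "aleitamento", "materno", "e", "bom"],
    #                [(0, 0), (1, 3), (2, 3), (3, 1), (4, 2)])
    # -> ["aleitamento materno"]

Aggregating the candidates produced for every sentence pair (e.g. with a Counter) yields the per-candidate frequencies used below.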

Thus, notice that the alignment-based MWE extraction method does not rely on the conceptual asymmetries between languages, since it does not expect that a source sequence of words be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, despite the number of target words involved. These features indicate that the method prioritizes precision in spite of recall.

It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an alignment n : m (n >= 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.

In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as the gold standard.

To perform the extraction of MWE candidates following the alignment-based approach, first the original corpus had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps were used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).

2 GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html

3 Apertium is an open-source machine translation engine and toolbox available at http://www.apertium.org

From the preprocessed corpus, the MWE candidates are extracted as those in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that: (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.

Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in (Caseli et al., 2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as (Caseli et al., 2009) plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral, and (b) a minimum frequency threshold of 2.
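In code, such a filter might look as follows; this is a hypothetical simplification, and the tag inventory BAD_EDGE_TAGS merely stands in for the paper's full pattern lists:

    # Edge tags that an F3-style filter rejects at either end of a candidate
    # (an assumed, simplified tag set, not the authors' exact inventory).
    BAD_EDGE_TAGS = {"DET", "ADV", "CONJ", "PREP", "VERB", "PRON", "NUM"}

    def apply_f3(candidates, min_freq=2):
        """candidates: dict mapping a tuple of (word, POS) pairs to its corpus
        frequency. Keeps candidates passing both the POS-pattern filter and
        the minimum frequency threshold."""
        kept = {}
        for cand, freq in candidates.items():
            tags = [pos for _, pos in cand]
            if freq < min_freq:
                continue  # frequency filter
            if tags[0] in BAD_EDGE_TAGS or tags[-1] in BAD_EDGE_TAGS:
                continue  # pattern filter: bad tag at either edge
            kept[cand] = freq
        return kept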

5 Experiments and Results

Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. 'play video game'). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 * precision * recall) / (precision + recall)) figures for the association measures, using all the candidates (on the first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.

PMI                              alignment-based
Online Mendelian Inheritance     faixa etária
Beta Technology Incorporated     aleitamento materno
Lange Beta Technology            estados unidos
Óxido Nítrico Inalatório         hipertensão arterial
jogar video game                 leite materno
e um de                          couro cabeludo
e a do                           bloqueio lactíferos
se que de                        emocional anatomia
e a da                           neonato a termo
e de não                         duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach

pt MWE candidates            PMI       MI
# proposed bigrams           64,839    64,839
# correct MWEs               1,403     1,403
precision                    2.16%     2.16%
recall                       98.73%    98.73%
F                            4.23%     4.23%
# proposed trigrams          54,548    54,548
# correct MWEs               701       701
precision                    1.29%     1.29%
recall                       96.03%    96.03%
F                            2.55%     2.55%
# proposed bigrams (top)     1,421     1,421
# correct MWEs               155       261
precision                    10.91%    18.37%
recall                       10.91%    18.37%
F                            10.91%    18.37%
# proposed trigrams (top)    730       730
# correct MWEs               44        20
precision                    6.03%     2.74%
recall                       6.03%     2.74%
F                            6.03%     2.74%

Table 2: Evaluation of MWE candidates (PMI and MI)
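These three figures reduce to a few lines of code; a small helper (the usage check below reproduces the MI column for the top 1,421 bigrams in Table 2):

    def prf(n_proposed, n_reference, n_correct):
        """Precision, recall and F-measure as defined above."""
        p = n_correct / n_proposed
        r = n_correct / n_reference
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # MI, top 1,421 bigrams against the 1,421 glossary bigrams (cf. Table 2):
    # prf(1421, 1421, 261) -> (0.1836..., 0.1836..., 0.1836...)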

On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered following the three filters described in Section 4.2: F1, F2 and F3.

To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic comparison we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

pt MWE candidates             F1       F2       F3
# filtered by POS patterns    24,996   21,544   32,644
# filtered by frequency       9,012    11,855   1,442
final set                     269      878      191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates        F1       F2       F3
# proposed bigrams       250      754      169
# correct MWEs           48       95       65
precision                19.20%   12.60%   38.46%
recall                   3.38%    6.69%    4.57%
F                        5.75%    8.74%    8.18%
# proposed trigrams      19       110      20
# correct MWEs           1        9        4
precision                5.26%    8.18%    20.00%
recall                   0.14%    1.23%    0.55%
F                        0.27%    2.14%    1.07%
# proposed bi+trigrams   269      864      189
# correct MWEs           49       104      69
precision                18.22%   12.04%   36.51%
recall                   2.28%    4.83%    3.21%
F                        4.05%    6.90%    5.90%

Table 4: Evaluation of MWE candidates

The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), extracted by the alignment-based method, are not being considered at the moment.

After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: that with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in (Caseli et al., 2009), the coverage of reference lists may be low, and it is likely that a lot of MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.

From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total amount) were found among the bigrams or trigrams in the Glossary (see Table 4). Then, the remaining 122 candidates (63.9%) were analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was: (1) a proper name, or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.

The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with K = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of K between 0.67 and 0.8 indicates good agreement.
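For reference, the two-judge kappa computation can be sketched as follows (assuming each judge's decisions are stored as a list of booleans, with expected agreement estimated from the judges' marginal true/false rates):

    def cohen_kappa(judge_a, judge_b):
        """K = (Po - Pe) / (1 - Pe): observed agreement corrected for the
        agreement expected by chance from each judge's true/false rates."""
        n = len(judge_a)
        po = sum(a == b for a, b in zip(judge_a, judge_b)) / n
        pa, pb = sum(judge_a) / n, sum(judge_b) / n
        pe = pa * pb + (1 - pa) * (1 - pb)
        return (po - pe) / (1 - pe)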

In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating against a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.

6 Conclusions and Future Work

MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.

In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.

Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both, or to at least one, of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to say that in this manual analysis the human experts classified the MWEs as true independently of them being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.

In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.

Acknowledgments

We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for its financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.

References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50–59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047–2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49–56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117–134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.

Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165–178, Oxford. Oxford University Press.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534–559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33–40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440–447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1–7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17–24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1–15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034–1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415–432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia, July. Association for Computational Linguistics.


Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles

Su Nam Kim
CSSE dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.

1 Introduction

Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of the document. Keyphrases can serve as a representative summary of the document and also serve as high quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications, such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).

In the past, various attempts have been made to boost automatic keyphrase extraction performance based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.

Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g. (Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007)), most of them do not effectively deal with various keyphrase forms, which results in the ignorance of some keyphrases as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.

Secondly, previous studies have shown the effectiveness of their own features, but not many compared their features with other existing features. That leads to a redundancy in studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, this approach involves a large amount of manual labor, thus reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthy study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.

This paper targets to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method which improves the coverage of candidates that occur in various forms. Our second contribution re-examines existing keyphrase extraction features reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.

In the remaining sections, we describe an overview of related work in Section 2, our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. Then we evaluate our proposals, discuss outcomes, and conclude our work in Sections 7, 8 and 9, respectively.

2 Related Work

The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). Features used can be categorized into three broad groups: (1) document cohesion features (i.e. relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent, (2) keyphrase cohesion features (i.e. relationship among keyphrases) (Turney, 2003), and (3) term cohesion features (i.e. relationship among components in a keyphrase) (Park et al., 2004).

The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TF·IDF (i.e. term frequency, inverse document frequency) and first occurrence in the document. TF·IDF measures the document cohesion, and the first occurrence implies the importance of the abstract or introduction, which indicates that keyphrases have a locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. Textract (Park et al., 2004) also ranks the candidate keyphrases by its judgment of keyphrases' degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extract keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of mutual influences of other documents in the same cluster.

3 Keyphrase Analysis

In previous studies, KEA employed the indexing words as candidates, whereas others, such as (Park et al., 2004; Nguyen and Kan, 2007), generated handcrafted regular expression rules. However, none carefully undertook the analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves, by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for details of the data.

Syntactically, keyphrases can be formed by either simplex nouns (e.g. algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words, such as adjectives and adverbs (e.g. mobile network, fast computing, partially observable Markov decision process), despite few incidences. They can also incorporate a prepositional phrase (PP) (e.g. quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e. NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g. incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well to part-of-speech (POS) patterns used in modern keyphrase extraction systems.

However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we also found that keyphrases occur as simple conjunctions (e.g. search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g. history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g. pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g. belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

Criteria     Rules
Frequency    (Rule1) Frequency heuristic, i.e. frequency >= 2 for simplex words vs. frequency >= 1 for NPs
Length       (Rule2) Length heuristic, i.e. up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
             (e.g. synchronous concurrent program vs. model of multiagent interaction)
Alternation  (Rule3) of-PP form alternation
             (e.g. number of sensor = sensor number, history of past encounter = past encounter history)
             (Rule4) Possessive alternation
             (e.g. agent's goal = goal of agent, security's value = value of security)
Extraction   (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
             (e.g. complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
             (Rule6) Simplex Word/NP IN Simplex Word/NP
             (e.g. quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
             simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules

A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when forming the of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g. distributed system != distributing system, dynamical caching != dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g. agent's goal = goal of agent, security's value = value of security).
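The of-PP alternation can be made concrete with a small sketch (ours; the input format, a space-separated phrase, is an assumption):

    def of_pp_alternation(phrase):
        """Generate the word-order alternation for an NP in of-PP form,
        e.g. 'quality of service' -> 'service quality' (Rule3 in Table 1)."""
        words = phrase.split()
        if "of" in words[1:-1]:
            i = words.index("of")
            return [" ".join(words[i + 1:] + words[:i])]
        return []

    # of_pp_alternation("history of past encounter") -> ["past encounter history"]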

4 Candidate Selection

We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated to the study of term extraction, since the top-N ranked terms become the keyphrases of documents. In previous studies, KEA employed the indexing words as candidates, whereas others, such as (Park et al., 2004; Nguyen and Kan, 2007), generated handcrafted regular expression rules. However, none carefully undertook the analysis of keyphrases. In this section, before we present our method, we first describe the details of our keyphrase analysis.

In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e. either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.

Another observation is that keyphrases containing more than three words are rare (i.e. 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g. performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.

Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule6 for NPs in of-PP form broadens the coverage of possible candidates: i.e. for a given NP in of-PP form, not only do we collect the simplex word(s), but we also extract non-of-PP forms of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing, as well as effective algorithm and grid computing, as candidates, while the previous works' rules do not.
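As an illustration, here is a sketch of Rule5 over a POS-tagged sentence, represented as a list of (word, tag) pairs; this is our reading of Table 1, not the authors' code, and Rule6 would additionally pair two such NPs across a preposition tagged IN:

    NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
    ADJ_TAGS = {"JJ", "JJR", "JJS"}

    def rule5_candidates(tagged):
        """Extract maximal spans matching
        (NN|NNS|NNP|NNPS|JJ|JJR|JJS)* (NN|NNS|NNP|NNPS)  (Rule5 in Table 1).
        tagged: list of (word, POS) pairs."""
        cands, i = [], 0
        while i < len(tagged):
            j = i
            while j < len(tagged) and tagged[j][1] in NOUN_TAGS | ADJ_TAGS:
                j += 1
            k = j
            while k > i and tagged[k - 1][1] in ADJ_TAGS:
                k -= 1  # the span must end in a noun: trim trailing adjectives
            if k > i:
                cands.append(" ".join(w for w, _ in tagged[i:k]))
            i = max(j, i + 1)
        return cands

    # rule5_candidates([("effective", "JJ"), ("algorithm", "NN"), ("of", "IN"),
    #                   ("grid", "NN"), ("computing", "NN")])
    # -> ["effective algorithm", "grid computing"]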

5 Feature Engineering

With wider candidate selection criteria, the onus of filtering out irrelevant candidates becomes the responsibility of careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S,U).

5.1 Document Cohesion

Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TF·IDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TF·IDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.

F1: TF·IDF (S,U). TF·IDF indicates document cohesion by looking at the frequency of terms in the documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is in requiring a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF in substrings as well as in exact matches. For example, grid computing is often a substring of other phrases, such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e. we separately treat simplex words and NPs to compute TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TF·IDF employed as features in our system (a sketch of the TF and IDF variants follows the list).

• (F1a) TFIDF
• (F1b) TF including counts of substrings
• (F1c) TF of substrings as a separate feature
• (F1d) TF normalized by candidate type (i.e., simplex words vs. NPs)
• (F1e) TF normalized by candidate type, as a separate feature
• (F1f) IDF using Google n-grams
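As an illustration of the substring-aware TF counting referred to above (F1b), here is a minimal sketch assuming lemmatized candidate and phrase strings; the names are illustrative, not the authors' implementation:

    from collections import Counter

    def tf_with_substrings(candidates, doc_phrases):
        """Count a candidate when it matches a phrase exactly or occurs inside it."""
        tf = Counter()
        for phrase in doc_phrases:
            for cand in candidates:
                # e.g. "grid computing" inside "grid computing algorithm";
                # padding with spaces enforces word boundaries
                if cand == phrase or f" {cand} " in f" {phrase} ":
                    tf[cand] += 1
        return tf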

F2: First Occurrence (S, U). KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.

F3: Section Information (S, U). Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section head, title, and/or references.

F4: Additional Section Information (S, U). We first added the related work or previous work section as a piece of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section head contains many fewer, due to the variation in size.

• (F4a) section 'related/previous work'
• (F4b) counting substrings occurring in key sections
• (F4c) section TF across all key sections
• (F4d) weighting key sections according to the portion of keyphrases found

F5: Last Occurrence (S, U). Similar to distance in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.

5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.

F6: Co-occurrence of Another Candidate in Section (S, U). When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of key sections in which candidates co-occur.

F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer collection (1.3M titles from articles, papers, and reports) to create this feature.

• (F7a) co-occurrence (Boolean) in title collection
• (F7b) co-occurrence (TF) in title collection

F8: Keyphrase Cohesion (S, U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top-N ranked candidates and the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs, and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.
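The following small sketch (our own illustration) builds such a candidate-by-context-word matrix and compares rows with the cosine measure; neighbors is an assumed precomputed map from each candidate word to its nearby nouns, verbs, and adjectives:

    import numpy as np

    def context_matrix(candidates, neighbors, vocab):
        """Rows: candidate words; columns: neighboring content words."""
        col = {w: j for j, w in enumerate(vocab)}
        M = np.zeros((len(candidates), len(vocab)))
        for i, cand in enumerate(candidates):
            for w in neighbors[cand]:   # nouns/verbs/adjectives near cand
                if w in col:            # ignore out-of-vocabulary context
                    M[i, col[w]] += 1
        return M

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0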

5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).

F9: Term Cohesion (S, U). Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier (a small sketch follows the list below).

• (F9a) term cohesion as in Park et al. (2004)
• (F9b) TF normalized by candidate type (i.e., simplex words vs. NPs)
• (F9c) applying different weights by candidate type
• (F9d) normalized TF and different weighting by candidate type
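A minimal sketch of this scoring, assuming a generalized Dice coefficient over an n-word candidate; the exact formula and the reading of the 10% discount are our assumptions, not Park et al.'s (2004) published definition:

    def dice_cohesion(term_tf, word_tfs):
        """n * TF(term) / sum(TF(word_i)) for an n-word candidate."""
        n = len(word_tfs)
        score = n * term_tf / sum(word_tfs)
        # read the "discounted by 10%" remark as scaling simplex words by 0.9
        return 0.9 * score if n == 1 else score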

5.4 Other Features

F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K, to attest our candidate selection method.

F11: POS sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like the acronym feature, this is also subject to the data set.


F12: Suffix sequence (S). Similar to acronyms, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data-set dependent and is thus used in the supervised approach only.

F13: Length of Keyphrases (S, U). Barker and Cornacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.

6 System and Data

To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections such as the title and body sections via heuristic rules, and applied a sentence segmenter (http://www.eng.ritsumei.ac.jp/asao/resources/sentseg), ParsCit (Councill et al., 2008; http://wing.comp.nus.edu.sg/parsCit) for reference collection, a part-of-speech tagger (http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm), and a lemmatizer (Minnen et al., 2001; http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy (http://maxent.sourceforge.net/index.html), and simple weighting.

In evaluation, we collected 250 papers from four different categories of the ACM Digital Library: C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence: Multiagent Systems), and J.4 (Social and Behavioral Sciences: Economics). Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found that many were missing or found only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking about 15 minutes per paper on average. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the number of author-assigned and reader-assigned keyphrases found in the documents.

          Author        Reader          Combined
Total     1252 (53)     3110 (111)      3816 (146)
NPs       904           2537            3027
Average   3.85 (4.01)   12.44 (12.88)   15.26 (15.85)
Found     769           2509            2864

Table 2: Statistics of keyphrases.

7 Evaluation

The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information, and additional section information (i.e., F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.

Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10, and 15 candidates.
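As a concrete reading of this evaluation scheme, the following minimal sketch (our own illustration, not the authors' code) computes exact-match precision, recall, and F-score over the top-k ranked candidates, assuming ranked and gold hold lemmatized keyphrase strings:

    def exact_match_at_k(ranked, gold, k):
        """Exact-match P/R/F over the top-k candidates against gold keyphrases."""
        matched = sum(1 for c in ranked[:k] if c in gold)
        precision = matched / k
        recall = matched / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
        return matched, precision, recall, f1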

BestFeatures includes F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion as in Park et al. (2004)), and F13 (length of keyphrases). Best-TFIDF means using all best features except TFIDF.

In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S). Due to page limits, we present only the best performance.

In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and ~ indicates no effect, or an effect unconfirmed due to only small changes in performance. Again, Supervised denotes Maximum Entropy training and Unsupervised is our unsupervised approach.

8 Discussion

We compared the performance of our candidate selection and feature engineering against simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer



(Each group of four columns gives Match, Precision, Recall, and F-score at the top 5 / 10 / 15 candidates.)

All Candidates
  KEA       U   0.03  0.64  0.21  0.32 |  0.09  0.92  0.60  0.73 |  0.13  0.88  0.86  0.87
  KEA       S   0.79 15.84  5.19  7.82 |  1.39 13.88  9.09 10.99 |  1.84 12.24 12.03 12.13
  N&K       S   1.32 26.48  8.67 13.06 |  2.04 20.36 13.34 16.12 |  2.54 16.93 16.64 16.78
  baseline  U   0.92 18.32  6.00  9.04 |  1.57 15.68 10.27 12.41 |  2.20 14.64 14.39 14.51
  baseline  S   1.15 23.04  7.55 11.37 |  1.90 18.96 12.42 15.01 |  2.44 16.24 15.96 16.10

Length<=3 Candidates
  KEA       U   0.03  0.64  0.21  0.32 |  0.09  0.92  0.60  0.73 |  0.13  0.88  0.86  0.87
  KEA       S   0.81 16.16  5.29  7.97 |  1.40 14.00  9.17 11.08 |  1.84 12.24 12.03 12.13
  N&K       S   1.40 27.92  9.15 13.78 |  2.10 21.04 13.78 16.65 |  2.62 17.49 17.19 17.34
  baseline  U   0.92 18.40  6.03  9.08 |  1.58 15.76 10.32 12.47 |  2.20 14.64 14.39 14.51
  baseline  S   1.18 23.68  7.76 11.69 |  1.90 19.00 12.45 15.04 |  2.40 16.00 15.72 15.86

Length<=3 Candidates + Alternation
  KEA       U   0.01  0.24  0.08  0.12 |  0.05  0.52  0.34  0.41 |  0.07  0.48  0.47  0.47
  KEA       S   0.83 16.64  5.45  8.21 |  1.42 14.24  9.33 11.27 |  1.87 12.45 12.24 12.34
  N&K       S   1.53 30.64 10.04 15.12 |  2.31 23.08 15.12 18.27 |  2.88 19.20 18.87 19.03
  baseline  U   0.98 19.68  6.45  9.72 |  1.72 17.24 11.29 13.64 |  2.37 15.79 15.51 15.65
  baseline  S   1.33 26.56  8.70 13.11 |  2.09 20.88 13.68 16.53 |  2.69 17.92 17.61 17.76

Table 3: Performance on proposed candidate selection.

                        Five                      Ten                       Fifteen
Features      C    Match Prec  Rec   F       Match Prec  Rec   F       Match Prec  Rec   F
Best          U    1.14  22.8  7.47  11.3    1.92  19.2  12.6  15.2    2.61  17.4  17.1  17.3
Best          S    1.56  31.2  10.2  15.4    2.50  25.0  16.4  19.8    3.15  21.0  20.6  20.8
Best w/o      U    1.14  22.8  7.40  11.3    1.92  19.2  12.6  15.2    2.61  17.4  17.1  17.3
  TFIDF       S    1.56  31.1  10.2  15.4    2.46  24.6  16.1  19.4    3.12  20.8  20.4  20.6

Table 4: Performance on feature engineering.

        Method   Features
+       S        F1a, F2, F3, F4a, F4d, F9a
        U        F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       S        F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
        U        F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
~       S        F1e, F10, F11, F12
        U        F1b

Table 5: Performance effect of each feature.

candidates tend to act as noise and thus decreased the overall performance. We also confirmed that candidate alternation offers flexibility in keyphrase form, leading to higher candidate coverage as well as better performance.

To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that the performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting to TF and/or IDF did not affect performance. We surmise the cause is that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.

As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.


9 Conclusion

We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works, while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.

10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.

References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10–17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information and lexicography. In Proceedings of ACL, 1989, pp. 76–83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28–30.

Ernesto D'Avanzo and Bernardo Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433–447.

Lee Dice. Measures of the amount of ecologic association between species. Ecology, 1945, 26.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668–673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81–104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beata Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537–544.

Mario Jarmasz and Caroline Barriere. Using semantic similarity over a tera-byte corpus to compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349–357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157–169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296–297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207–223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317–326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48–55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä, and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434–439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254–256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17–22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu

Abstract

We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification, combining different linguistically motivated features: the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.

1 Introduction

In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more often than chance. A MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.

For our purposes, a MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as being a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.

To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by Cook et al. (2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning and that over 40% of their usages in text were actually literal. Thus, it would be important for an NLP application, such as machine translation, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as it could potentially have detrimental effects on the quality of the translation.


In this paper, we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.

This paper is organized as follows. In Section 2, we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5, we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.

2 Multi-word Expressions

MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come and ad hoc, and non-fixed expressions such as break new ground and speak of the devil. The VNCs which we are focusing on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living, and Verb Particle Constructions (VPC) such as write-up and call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.

3 Related Work

Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by Katz and Giesbrecht (2006) and work by Hashimoto and Kawahara (2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine whether the expression is idiomatic or not. KG06 is similar in intuition to work proposed by Fazly and Stevenson (2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one MWE idiomatic expression.

Hashimoto and Kawahara (2008) (HK08) is the first large-scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features, such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.

4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying a MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by Cook et al. (2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system (http://www.tado-chasen.com/yamcha). YamCha is based on Support Vector Machines technology, using degree-2 polynomial kernels.

We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside of a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.

We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma form (LEMMA) or citation form of the word, and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
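To make the column representation concrete, here is a small illustrative sketch (not the authors' code) that emits YamCha-style feature columns for the running example: token, POS tag, last-3-character n-gram, and IOB tag. The POS tags are given by hand, as in the text:

    def to_columns(tokens, pos_tags, iob_tags):
        """Build (token, POS, last-3-chars, IOB) rows for sequence labeling."""
        rows = []
        for tok, pos, tag in zip(tokens, pos_tags, iob_tags):
            ngram = tok[-3:]            # last 3 characters (NGRAM feature)
            rows.append((tok, pos, ngram, tag))
        return rows

    sent = ["John", "kicked", "the", "bucket", "last", "Friday"]
    pos  = ["NN", "VBD", "Det", "NN", "ADV", "NN"]
    iob  = ["O", "B-I", "I-I", "I-I", "O", "O"]
    for row in to_columns(sent, pos, iob):
        print(" ".join(row))   # e.g. "kicked VBD ked B-I"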

With the NE feature, we followed the same representation as for the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN IdentiFinder software (http://www.bbn.com/identifinder), which identifies 19 NE tags. We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other where we have a binary feature indicating whether a word is a NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class, Named Entity InText (NEI). For example, for the NEI condition, our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER".

5 Experiments and Results

5.1 Data

We use the manually annotated standard data set identified in Cook et al. (2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC, http://www.natcorp.ox.ac.uk). The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations (a sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences), corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.

5.2 Evaluation Metrics

We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results. (We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.) We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
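For reference, the balanced F-measure used throughout is the standard harmonic mean of precision and recall:

\[
F_{\beta=1} = \frac{2 \cdot P \cdot R}{P + R}
\]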

5.3 Results

We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE



tokens that are not in the original data set. (We could have easily identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal score baseline; however, for this investigation we decided to work strictly with the gold standard data set.)

In Table 2, we present the results yielded per feature and per condition. We experimented with different context sizes initially, to decide on the optimal window size for our learning framework; results are presented in Table 1. Once that was determined, we proceeded to add features.

Noting that a window size of ±3 yields the best results, we use that as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04%.

Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence the significant contribution to performance.

6 Discussion

The overall results strongly suggest that using linguistically interesting features explicitly has a positive impact on performance. NE features help the most, and combining them with other features


yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones (23 times) and the opposite (4 times).

Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal are for the MWE types hit the road, blow trumpet, blow whistle, hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

Also, in As the ball hit the post, the referee blew the whistle, the system identified blew the whistle as a literal VNC in this context, and it identified hit the post as another literal VNC.

7 Conclusion

In this study, we explore a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system. Using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58%. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.


       IDM-F   LIT-F   Overall F   Overall Acc
±1     77.93   48.57   71.78       96.22
±2     85.38   55.61   79.71       97.06
±3     86.99   55.68   81.25       96.93
±4     86.22   55.81   80.75       97.06
±5     83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying context window size.

                  IDM-P  IDM-R  IDM-F  LIT-P  LIT-R  LIT-F  Overall F
FREQ              70.02  89.16  78.44  0      0      0      69.68
TOK               81.78  83.33  82.55  71.79  43.75  54.37  77.04
(L)EMMA           83.10  84.29  83.69  69.77  46.88  56.07  78.11
(N)GRAM           83.17  82.38  82.78  70.00  43.75  53.85  77.01
(P)OS             83.33  83.33  83.33  77.78  43.75  56.00  78.08
L+N+P             86.95  83.33  85.38  72.22  45.61  55.91  79.71
NES-Full          85.20  87.93  86.55  79.07  58.62  67.33  82.77
NES-Bin           84.97  82.41  83.67  73.49  52.59  61.31  79.15
NEI               89.92  85.18  87.48  81.33  52.59  63.87  82.82
NES-Full+L+N+P    89.89  84.92  87.34  76.32  50.00  60.42  81.99
NES-Bin+L+N+P     90.86  84.92  87.79  76.32  50.00  60.42  82.33
NEI+L+N+P         91.35  88.42  89.86  81.69  50.00  62.03  84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations.

8 Acknowledgement

The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, who helped make this publication more concise and better presented.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLING.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

Dan I. Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23–30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de

Abstract

Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.

1 Introduction

Parallel corpora have proved to be a valuable resource not only for statistical machine translation, but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:

(1) Der Rat  sollte unsere Position berücksichtigen
    The Council should our  position consider

(2) The Council should take account of our position

This sentence pair has been taken from the German–English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that we are able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for this item in a target language (and whether these


one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches, which need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.

In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondences that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, but is based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Therefore, Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.

2 Multiword Translations as MWEs

The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                 1-1    1-n    n-1    n-n    No
anheben (v1)         53.5   21.2   9.2    1.6    325
bezwecken (v2)       16.7   51.3   0.6    31.3   150
riskieren (v3)       46.7   35.7   0.5    1.7    182
verschlimmern (v4)   30.2   21.5   28.6   44.5   275

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object, and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas, along with their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) and verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.

A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations. This type of one-to-many translation is mainly due to non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.


(3) Sie  verschlimmern die Übel
    They aggravate     the misfortunes

(4) Their action is aggravating the misfortunes

Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der Ausschuss bezweckt den Institutionen  ein politisches Instrument  an die Hand zu geben
    The committee intends  the institutions   a   political   instrument  at the hand to give

(6) The committee set out to equip the institutions with a political instrument

Verb preposition combinations. While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items; Sag et al. (2002) propose an analysis within an MWE framework.

(7) Sie  werden den Treibhauseffekt   verschlimmern
    They will   the greenhouse effect aggravate

(8) They will add to the greenhouse effect

Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).

(9) Ich werde die Sache  nur  noch verschlimmern
    I   will  the thing  only just aggravate

(10) I am just making things more difficult

Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to the gold standard's limited size and the source items we chose.

(11) Sie  bezwecken die Umgestaltung in   eine zivile Nation
     They intend    the conversion   into a    civil  nation

(12) They have in mind the conversion into a civil nation

         v1        v2        v3        v4
Ntype    22 (26)   41 (47)   26 (35)   17 (24)
V Part   22.7      4.9       0.0       0.0
V Prep   36.4      41.5      3.9       5.9
LVC      18.2      29.3      88.5      88.2
Idiom    0.0       2.4       0.0       0.0
Para     36.4      24.3      11.5      23.5

Table 2: Proportions (%) of MWE types per lemma.

Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.

(13) Wir brauchen bessere Zusammenarbeit um die Rückzahlungen   anzuheben
     We  need     better  cooperation    to  the repayments.OBJ increase

(14) We need greater cooperation in this respect to ensure that repayments increase

Table 2 displays the proportions of the MWE categories for the number of types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variation only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g., they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) and bezwecken (en. intend) are frequently translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).

In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.

3 Multiword Translation Detection

This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.

3.1 One-to-many Alignments

To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney 2003). We evaluate, on the type level, the rate of translational correspondences that the system discovers against the one-to-many translations in our gold standard. By type we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the

verb   n > 0          n > 1          n > 3
       FPR    FNR     FPR    FNR     FPR    FNR
v1     0.97   0.93    1.0    1.0     1.0    1.0
v2     0.93   0.9     0.5    0.96    0.0    0.98
v3     0.88   0.83    0.8    0.97    0.67   0.97
v4     0.98   0.92    0.8    0.92    0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.

translation types is so low that already at a threshold of 3 almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that corresponds to the source token is rare.

This low precision of one-to-many alignments is not very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that does not necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en: impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die Korruption behindert die Entwicklung
     The corruption impedes the development

(16) Corruption constitutes an obstacle to development.

Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.

3.2 Aligning Syntactic Configurations

High-precision detection of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found; 2) how to align target items that have a low occurrence correlation.

We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering of the configurations; c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.

[Figure 1 (diagram not reproduced): Example of a typical syntactic MWE configuration. The German verb behindert, with arguments X1 and Y1, is aligned to an English configuration rooted in create, whose dependents include the argument X2 and a path through an, obstacle and to leading down to Y2.]

Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (D_G, D_E, A_GE). D_G is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. D_E is the set of dependency triples of the target sentence. A_GE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
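For concreteness, this triple representation can be written down as a small data structure. The following is a minimal sketch under our own naming (DepTriple and AlignedSentence are illustrative, not from the paper):

from dataclasses import dataclass
from typing import Set, Tuple

# A dependency triple (head, relation, dependent), e.g. ("behindert", "SB", "Korruption").
DepTriple = Tuple[str, str, str]

@dataclass
class AlignedSentence:
    d_g: Set[DepTriple]           # D_G: source (German) dependency triples
    d_e: Set[DepTriple]           # D_E: target (English) dependency triples
    a_ge: Set[Tuple[str, str]]    # A_GE: word-alignment pairs (s1, t1)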

Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB, *), (verb, OA, *)) for our German verbs.

Filtering: Given our source candidates, a valid parallel configuration (D_G, D_E, A_GE) is then defined by the following conditions:
1. The source configuration D_G is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn ∈ D_G, there is a tuple (sn, tn) ∈ A_GE, i.e. every dependent has an alignment.
3. There is a target item t1 ∈ D_E such that for each tn there is a p ⊂ D_E such that p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus the target dependents have a common root.

To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering does not exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
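A minimal sketch of conditions 2 and 3 plus the noise filters, assuming the triple representation above; the path-length cutoff and the 'COORD' relation label are our own assumptions, as the paper does not report the exact settings:

from collections import defaultdict, deque

def find_path(dep_triples, root, goal, max_len):
    # Breadth-first search along head -> dependent edges; returns the list of
    # relation labels on a path from root to goal within max_len steps, or None.
    children = defaultdict(list)
    for head, rel, dep in dep_triples:
        children[head].append((rel, dep))
    queue = deque([(root, [])])
    while queue:
        node, rels = queue.popleft()
        if node == goal:
            return rels
        if len(rels) < max_len:
            for rel, dep in children[node]:
                queue.append((dep, rels + [rel]))
    return None

def is_valid_configuration(dependents, d_e, a_ge, max_len=3):
    align = dict(a_ge)
    # Condition 2: every source dependent must have an alignment.
    if any(dep not in align for dep in dependents):
        return False
    targets = [align[dep] for dep in dependents]
    # Condition 3: some target item t1 must reach all aligned dependents.
    for t1 in {head for head, _, _ in d_e}:
        paths = [find_path(d_e, t1, tn, max_len) for tn in targets]
        if all(p is not None for p in paths):
            # Coordination filter: discard configurations whose connecting
            # paths contain a coordination relation.
            if not any('COORD' in p for p in paths):
                return True
    return False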

Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in Section 3.2 was that all


the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.

Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) ∈ D_E such that tx is an element of our target configuration, we need to decide whether (s1, ti) ∈ A_GE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in Smadja and McKeown (1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in Smadja and McKeown (1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti ∈ T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.

2. For each ti, tx such that there is a (ti, rel, tx) ∈ D_E, compute corr(s1, T + tx).

3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.

As the Dice coefficient is often reported to give the best results, e.g. in Smadja and McKeown (1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:

corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti ∈ T, and freq(s1) accordingly. The observation frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
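The post-editing algorithm with Dice can be sketched as follows (our own code; sentences is assumed to be the list of lemmatized sentence pairs from the word-aligned resource):

def dice(s1, T, sentences):
    # corr(s1, T) with freq counted over sentence pairs, as defined above.
    f_s = sum(1 for src, tgt in sentences if s1 in src)
    f_t = sum(1 for src, tgt in sentences if T <= set(tgt))
    f_joint = sum(1 for src, tgt in sentences if s1 in src and T <= set(tgt))
    return 2 * f_joint / (f_s + f_t) if f_s + f_t else 0.0

def post_edit(s1, T, d_e, sentences):
    # Greedily add dependents tx of items already in T whenever doing so
    # does not lower the correlation with the source item s1.
    T = set(T)
    for head, rel, tx in d_e:
        if head in T and tx not in T:
            if dice(s1, T | {tx}, sentences) >= dice(s1, T, sentences):
                T.add(tx)
    return T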

The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation for the English MWE run the risk of.

3.3 Evaluation

Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We will first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma do not figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that does not constrain the syntactic type of the parallel argument relations. Compared with the relaxed filter, the strict filter decreases the FPR significantly while leaving the FNR largely unchanged. This means that the syntactic filters mainly fire on noisy configurations and do not decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down


the number of candidate configurations that our algorithm detects. As the dependency parses do not provide deep analyses for tense or control phenomena, very often a verb's arguments do not figure as its syntactic dependents and no configuration is found. Future work will explore the impact of deep syntactic analysis for the detection of translational correspondences.

MWE evaluation: In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. To this end, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en: appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.

4 Related Work

The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al 2003). Syntax-based translation that specifies formal relations between bilingual parses was established by Wu (1997). Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

      Strict Filter    Relaxed Filter
      FPR     FNR      FPR     FNR
v1    0.0     0.96     0.5     0.96
v2    0.25    0.88     0.47    0.79
v3    0.25    0.74     0.56    0.63
v4    0.0     0.875    0.56    0.84

Table 4: False positive and false negative rate of one-to-many translations.

Trans. type    Proportion    MWE type    Proportion
MWEs           57.5%         V Part      8.2%
                             V Prep      51.8%
                             LVC         32.4%
                             Idiom       10.6%
Paraphrases    24.4%
Alternations   1.0%
Noise          17.1%

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.


Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown 1994; Anastasiou 2008). In Villada Moiron and Tiedemann (2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches do not rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.

By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In Bannard and Callison-Burch (2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado 2009).

The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already been exploited in, e.g., morphology projection (Yarowsky et al 2001) or word sense disambiguation (Dyvik 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of a simplex realization in a source language to an idiosyncratic multiword realization in the target language.

5 Conclusion

We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for the monolingual interpretation of MWEs.

Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.

References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs three commercial systems. In Proceedings of the EAMT, pp. 12-20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1-7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597-604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240-1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 13:311-326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57-65.

Joao de Almeida Varelas Graca, Joana Paulo Pardal, Luísa Coheur and Diamantino Antonio Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical report, Tech. Rep. 38/2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48-54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79-86.

Joakim Nivre, Johan Hall and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216-2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54-57.

Ivan A. Sag, Timothy Baldwin, Francis Bond and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1-15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152-156.

Begona Villada Moiron and Jorg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33-40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403.

David Yarowsky, Grace Ngai and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1-8.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 31-39, Suntec, Singapore, 6 August 2009. © 2009 ACL and AFNLP

A Re-examination of Lexical Association Measures

Hung Huu Hoang, Dept. of Computer Science, National University of Singapore, hoanghuu@comp.nus.edu.sg
Su Nam Kim, Dept. of Computer Science and Software Engineering, University of Melbourne, snkim@csse.unimelb.edu.au
Min-Yen Kan, Dept. of Computer Science, National University of Singapore, kanmy@comp.nus.edu.sg

Abstract

We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which shows significant improvement over the original AMs.

1 Introduction

Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between the constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ² and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood

ratio and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact test, and various coefficients such as Dice and Jaccard. Later, Ramisch et al (2008) evaluated MI, Pearson's χ² and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions and stock phrases. In their work, the best combination of AMs was selected using machine learning.

While these previous works have evaluated AMs, there have been few details on why the AMs perform as they do. A detailed analysis of why these AMs perform as they do is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.

We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g. put on, bolster up), adjectives (cut short) or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the mobility of noun phrase complements in transitive VPCs: they can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.

2 Classification of Association Measures

Although different AMs take different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e. the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure the association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score and Pearson's χ² test.

Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then

follows that the contexts in which the phrase and its constituents appear would be different (Zhai 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different: for instance, phrases such as hot dog and Dutch courage are composed of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norm, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.

Table 1 lists all the AMs used in our discussion. The legend in its lower left defines the variables a, b, c and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g. [M2] Mutual Information) for the reader's convenience.

3 Evaluation

We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.

3.1 Evaluation Data

In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.

For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.


Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put and take. Next, we search to the right of the light verbs for the nearest noun,

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work. Notation: for a bigram (x y), the contingency table records a = f11 = f(x y), b = f12 = f(x ~y), c = f21 = f(~x y) and d = f22 = f(~x ~y), where ~w stands for all words except w and * stands for all words; f(x*) and f(*y) are the marginal frequencies, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*)·f(*y)/N.

[M1] Joint Probability: f(xy)/N
[M2] Mutual Information (MI): (1/N) Σ_ij f_ij · log(f_ij / f̂_ij)
[M3] Log-likelihood ratio: 2 Σ_ij f_ij · log(f_ij / f̂_ij)
[M4] Pointwise MI (PMI): log( P(xy) / (P(x*)·P(*y)) )
[M5] Local-PMI: f(xy) · PMI
[M6] PMI^k: log( N·f(xy)^k / (f(x*)·f(*y)) )
[M7] PMI²: log( N·f(xy)² / (f(x*)·f(*y)) )
[M8] Mutual Dependency: log( P(xy)² / (P(x*)·P(*y)) )
[M9] Driver-Kroeber: a / √((a+b)(a+c))
[M10] Normalized expectation: 2a / (2a + b + c)
[M11] Jaccard: a / (a + b + c)
[M12] First Kulczynski: a / (b + c)
[M13] Second Sokal-Sneath: a / (a + 2(b + c))
[M14] Third Sokal-Sneath: (a + d) / (b + c)
[M15] Sokal-Michiner: (a + d) / (a + b + c + d)
[M16] Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
[M17] Hamann: ((a + d) − (b + c)) / (a + b + c + d)
[M18] Odds ratio: ad / bc
[M19] Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
[M20] Yule's Q: (ad − bc) / (ad + bc)
[M21] Brawn-Blanquet: a / max(a+b, a+c)
[M22] Simpson: a / min(a+b, a+c)
[M23] S cost: log(1 + min(b, c)/(a + 1))^(−1/2)
[M24] Adjusted S Cost*: log(1 + max(b, c)/(a + 1))^(−1/2)
[M25] Laplace: (a + 1) / (a + min(b, c) + 2)
[M26] Adjusted Laplace*: (a + 1) / (a + max(b, c) + 2)
[M27] Fager: [M9] − (1/2)·max(b, c)
[M28] Adjusted Fager*: [M9] − max(b, c)/√(aN)
[M29] Normalized PMIs*: PMI/NF(α), PMI/NFmax
[M30] Simplified normalized PMI for VPCs*: log(ad) / (α·b + (1−α)·c)
[M31] Normalized MIs*: MI/NF(α), MI/NFmax

where NF(α) = α·P(x*) + (1−α)·P(*y), with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)).
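To make the notation concrete, a few of the AMs in Table 1 can be computed from the contingency counts as follows (our own sketch; it assumes all counts are positive):

import math

def ams(a, b, c, d):
    N = a + b + c + d
    fx, fy = a + b, a + c   # marginal frequencies f(x*) and f(*y)
    return {
        'joint_prob': a / N,                  # [M1]
        'pmi': math.log(N * a / (fx * fy)),   # [M4]
        'jaccard': a / (a + b + c),           # [M11]
        'odds_ratio': (a * d) / (b * c),      # [M18]
        'brawn_blanquet': a / max(fx, fy),    # [M21]
        'simpson': a / min(fx, fy),           # [M22]
    }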


permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g. the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.
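The two search windows can be sketched as follows (our own illustration: the particle inventory is abbreviated, lemmatization is omitted, and the passive retry for LVCs is left out; POS tags are Penn Treebank tags):

PARTICLES = {'up', 'on', 'off', 'out', 'down'}   # subset of the 38 particles
LIGHT_VERBS = {'do', 'get', 'give', 'have', 'make', 'put', 'take'}

def vpc_candidates(tokens, pos):
    # Anchor on a particle; search left for the nearest verb,
    # allowing an intervening NP of up to 5 words.
    for i, tok in enumerate(tokens):
        if tok in PARTICLES:
            for j in range(i - 1, max(i - 7, -1), -1):
                if pos[j].startswith('VB'):
                    yield (tokens[j], tok)
                    break

def lvc_candidates(tokens, pos):
    # Anchor on a light verb; search right for the nearest noun,
    # allowing up to 4 intervening words.
    for i, tok in enumerate(tokens):
        if tok in LIGHT_VERBS:
            for j in range(i + 1, min(i + 6, len(tokens))):
                if pos[j].startswith('NN'):
                    yield (tok, tokens[j])
                    break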

Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al (2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data       LVC data    Mixed
Total (freq ≥ 6)     413            100         513
Positive instances   117 (28.33%)   28 (28%)    145 (28.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.

3.2 Evaluation Metric

To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
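Computed over a ranked candidate list, AP is the mean of the precision values at each rank where a gold-standard positive is retrieved; a minimal sketch (ours, not from the paper):

def average_precision(ranked, gold):
    # ranked: candidates in descending AM-score order; gold: set of positives.
    hits, precisions = 0, []
    for i, cand in enumerate(ranked, start=1):
        if cand in gold:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(gold) if gold else 0.0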

3.3 Evaluation Results

Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that hypothesis tests rely on.

4 Rank Equivalence

We note that some AMs, although not mathematically equivalent (i.e. assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡_C^r M2, if and only if M1(cj) > M1(ck) ⇒ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇒ M2(cj) = M2(ck), for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:


Corollary: If M1 ≡_C^r M2, then AP_C(M1) = AP_C(M2), where AP_C(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the following 3 AMs: Odds ratio, Yule's ω and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with their (simplest) representatives from each rank-equivalent group.

¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types. Hence, we still consider such AM pairs to be rank equivalent.

1) [M2] Mutual Information; [M3] Log-likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (aka Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (aka Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
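The definition can also be tested empirically over a candidate set; the following sketch (our own code) checks pairwise agreement on strict order and ties, illustrated with the rank-equivalent pair Odds ratio and Yule's Q from row 5:

from itertools import combinations

def rank_equivalent(m1, m2, candidates):
    # Pairwise check of the definition over a finite candidate set.
    for cj, ck in combinations(candidates, 2):
        if (m1(cj) > m1(ck)) != (m2(cj) > m2(ck)):
            return False
        if (m1(cj) == m1(ck)) != (m2(cj) == m2(ck)):
            return False
    return True

# Both scores are monotone in ad/bc, so they rank any sample identically.
odds = lambda t: (t[0] * t[3]) / (t[1] * t[2])
yule_q = lambda t: (t[0] * t[3] - t[1] * t[2]) / (t[0] * t[3] + t[1] * t[2])
sample = [(5, 2, 3, 90), (8, 1, 4, 87), (2, 6, 6, 86)]
assert rank_equivalent(odds, yule_q, sample)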

5 Examination of Association Measures

We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.

5.1 Mutual Information and Pointwise Mutual Information

In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in identifying LVCs. In this section we show how their performances can be improved significantly.

[Figure 1 (plots not reproduced): Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Figure 1a: AP profile of AMs examined over our VPC data set; Figure 1b: AP profile of AMs examined over our LVC data set (y-axis: AP, 0.1 to 0.6). Bold points indicate AMs discussed in this paper. Markers distinguish hypothesis-test AMs, Class I AMs excluding hypothesis tests, and context-based AMs.]

Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U, V) = Σ_{u,v} p(uv) · log( p(uv) / (p(u*)·p(*v)) )

In the context of bigrams, the above formula can be simplified to [M2] MI = (1/N) Σ_ij f_ij · log(f_ij / f̂_ij). While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log( P(xy) / (P(x*)·P(*y)) ) = log( N·f(xy) / (f(x*)·f(*y)) )

It has long

been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille 1994) or multiplying PMI with f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying f(xy) directly with PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs            LVCs            Mixed
Best [M6] PMI^k    54.7 (k=0.13)   57.3 (k=0.85)   54.4 (k=0.32)
[M4] PMI           51.0            56.6            51.5
[M5] Local-PMI     25.9            39.3            27.2
[M1] Joint Prob.   17.0            2.8             17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.

Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze 1999, p. 66). In particular, we can show that MI(X, Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and PMI(x, y) ≤ min(−log P(x*), −log P(*y)).

These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α·P(x*) + (1−α)·P(*y), with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on alpha, can be optimized by setting an appropriate alpha value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of alpha and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC task, respectively. If one rewrites MI as (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.

AM VPCs LVCs MixedMI NF( )α 508

(α = 48) 583

(α = 47) 516

(α = 5)

MI NFmax 508 584 518 [M2] MI 273 435 289

PMI NF( )α 592 (α = 8)

554 (α = 48)

588 (α = 77)

PMI NFmax 565 517 556 [M4] PMI 510 566 515

Table 5 AP performance of normalized (P)MI versus standard (P)MI Best alpha settings shown in parentheses
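A minimal sketch of the normalized PMI (our own code, following the definitions of NF(α) and NFmax above):

import math

def pmi(a, b, c, d):
    N = a + b + c + d
    return math.log(N * a / ((a + b) * (a + c)))

def normalized_pmi(a, b, c, d, alpha=None):
    # Divide PMI by NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y),
    # or by NFmax = max(P(x*), P(*y)) when alpha is None.
    N = a + b + c + d
    px, py = (a + b) / N, (a + c) / N
    nf = max(px, py) if alpha is None else alpha * px + (1 - alpha) * py
    return pmi(a, b, c, d) / nf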

5.2 Penalization Terms

It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulae whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Brawn-Blanquet (aka Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S Cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                       VPCs   LVCs   Mixed
[M21] Brawn-Blanquet     47.8   57.8   48.6
[M22] Simpson            24.9   38.2   26.0
[M24] Adjusted S Cost    48.5   57.7   49.2
[M23] S cost             24.9   38.8   26.0
[M26] Adjusted Laplace   48.6   57.7   49.3
[M25] Laplace            24.1   38.8   25.4

Table 6: Replacing min() with max() in selected AMs.

In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is none other than [M9] Driver-Kroeber, rank equivalent to [M7] PMI². In our application this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½·max(b, c). In order to make use of the first term, we need to replace the constant ½ by a scaled-down version of max(b, c). We have approximately derived √(aN) as a lower-bound estimate of max(b, c) using the independence assumption, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                     VPCs   LVCs   Mixed
[M28] Adjusted Fager   56.4   54.3   55.4
[M27] Fager            55.2   43.9   52.5

Table 7: Performance of Fager and its adjusted version.

The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary that we scale b and c properly, as in [M14]: −α×b − (1−α)×c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14] = PMI / (α×b + (1−α)×c). The denominator of [MR14] is obtained by removing the minus sign from the scaled [M14], so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α·P(x*) + (1−α)·P(*y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically tried to simplify [MR14] to the following AM: [M30] = log(ad) / (α×b + (1−α)×c). The setting of alpha in Table 8 below is taken from the best alpha setting obtained in the experiment on the normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.

AM                              VPCs           LVCs            Mixed
PMI/NF(α)                       59.2 (α=0.8)   55.4 (α=0.48)   58.8 (α=0.77)
[M30] log(ad)/(α·b + (1−α)·c)   60.0 (α=0.8)   48.4 (α=0.48)   58.8 (α=0.77)
[M14] Third Sokal-Sneath        56.5           45.3            54.6

Table 8: AP performance of suggested VPCs' penalization terms and AMs.
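A one-line sketch of [M30] (our own code; α = 0.8 is the VPC setting carried over from the normalized-PMI tuning):

import math

def m30(a, b, c, d, alpha=0.8):
    # log(ad) divided by the weighted penalization term alpha*b + (1-alpha)*c.
    return math.log(a * d) / (alpha * b + (1 - alpha) * c)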

With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α·c^(1−α). The best alpha value, also taken from the normalized PMI experiments (Table 5), is nearly 0.5. Under this setting, this penalization term is exactly the denominator of [M18] Odds Ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.

AM                 VPCs   LVCs   Mixed
−b − c             56.5   45.3   54.6
1/(bc)             50.2   53.2   50.2
[M18] Odds ratio   44.3   56.7   45.6

Table 9: AP performance of suggested LVCs' penalization terms and AMs.

6 Conclusions

We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.

We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α·c^(1−α) is suitable for LVCs. Our introduced alpha tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.

Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).

References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98-104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association of Computational Linguistics, pages 188-195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39-46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proc. of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530-1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651-658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50-53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143-177.

Tan, Y. Fan, Kan, M. Yen and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, pages 49-56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119-129, Rio de Janeiro, Brazil.

Appendix A. List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 40-46, Suntec, Singapore, 6 August 2009. © 2009 ACL and AFNLP

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

R. Mahesh K. Sinha, Department of Computer Science & Engineering, Indian Institute of Technology Kanpur, Kanpur 208016, India. rmk@iitk.ac.in

Abstract

A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of the light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.

1 Introduction

Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook 1974; Abbi 1992; Verma 1993; Mohanan 1994; Singh 1994; Butt 1995; Butt and Geuder 2001; Butt and Ramchand 2001; Butt et al 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness, but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al 2007) and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi 2005), with limited success. Chakrabarti et al (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina 2006). It uses the GIZA++ word alignment tool to align the projected POS information; a precision of 83% and a recall of 46% have been reported.

In this paper, we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation and results.

2 Complex Predicates in Hindi

A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, completion, or denotes an expression of forcefulness, suddenness, etc. (Singh 1994; Butt 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.

Given below are some examples.

(1) CP = noun+LV: noun = ashirwad (blessings), LV = denaa (to give)
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP:
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, with the noun ashirwad (blessings) makes a complex predicate verb form ashirwad diyaa (blessed) in the past tense form. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2), the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV: adjective = khush (happy), LV = karanaa (to do)
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV: verb = paRhnaa (to read), LV = lenaa (to take)
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual, a modal or an intensifier.

(5) CP = verb+LV: verb = phaadanaa (to tear), LV = denaa (to give)
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV: verb = denaa (to give), LV = maaranaa (to hit, to kill)
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning; this may also be considered a semi-idiomatic construct by some.

(7) CP = adverb+LV1+LV2: adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV: CP = vaapas karanaa (to return), LV = denaa (to give)
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another is to consider these a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretation in the two cases remains the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.

From all the above examples the complexity of the task of mining the CPs is evident Howev-er it is also observed that in the translated text the meaning of the light verb does not appear in case of CPs Our methodology for mining CPs is based on this observation and is outlined in the following section

3 System Design

As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows (a sketch in code is given after the list):

1) Align the sentences of the Hindi-English corpus.

2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).

3) For each Hindi light verb, generate all the morphological forms (Figure 1).

4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).

5) For each Hindi-English aligned sentence, execute the following steps:

a) For each light verb of Hindi (Table 1), execute the following steps:

i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).

ii) If the LV or a morphological derivative of it is found, then search for the equivalent English meanings, in any of their morphological forms (Figure 2), in the corresponding aligned English sentence.

iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, then exit, i.e. go to step (a).

iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.

v) Stop the scan when the word is not a 'stop word' and collect that Hindi word (W).

vi) If W is an 'exit word' (Figure 4), then exit, i.e. go to step (a); else the identified CP is W+LV.
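The following minimal Python sketch mirrors this loop. It assumes the corpus is already sentence-aligned and tokenized, and that the morphological forms of each light verb and of its English meanings (steps 2-4) have been generated beforehand; all names are illustrative and not part of the authors' implementation.

```python
def mine_cps(aligned_corpus, light_verbs, stop_words, exit_words):
    """light_verbs maps each light verb to a pair of sets:
    (its Hindi morphological forms, forms of its English meanings)."""
    mined = []
    for hindi_sent, english_sent in aligned_corpus:            # step 5
        english_words = set(english_sent)
        for lv, (h_forms, e_forms) in light_verbs.items():     # step 5a
            for k, word in enumerate(hindi_sent):              # step i
                if word not in h_forms:
                    continue
                if english_words & e_forms:                    # step ii
                    break              # LV meaning surfaces: not a CP
                j = k - 1                                      # step iii
                while j >= 0 and hindi_sent[j] in stop_words:  # step iv
                    j -= 1                                     # step v
                if j < 0 or hindi_sent[j] in exit_words:       # step vi
                    break
                mined.append((hindi_sent[j], lv))              # CP = W+LV
                break
    return mined
```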

Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings are shown within parentheses and are not used for matching.

light verb base form    root verb meaning
baithanaa  बैठना        sit
bananaa    बनना         make/become/build/construct/manufacture/prepare
banaanaa   बनाना        make/build/construct/manufacture/prepare
denaa      देना         give
lenaa      लेना         take
paanaa     पाना         obtain/get
uthanaa    उठना         rise/arise/get up
uthaanaa   उठाना        raise/lift/wake up
laganaa    लगना         feel/appear/look/seem
lagaanaa   लगाना        fix/install/apply
cukanaa    चुकना        (finish)
cukaanaa   चुकाना       pay
karanaa    करना         (do)
honaa      होना         happen/become/be
aanaa      आना          come
jaanaa     जाना         go
khaanaa    खाना         eat
rakhanaa   रखना         keep/put
maaranaa   मारना        kill/beat/hit
daalanaa   डालना        put
haankanaa  हाँकना       drive

Table 1: Some of the common light verbs in Hindi

For each Hindi light verb, all morphological forms are generated. A few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all of its morphological derivatives are generated. Figure 2 shows a few illustrations of the same.

There are a number of words that can appear in between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.

Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa जाना (to go)

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa लेना (to take)

Figure 2: Morphological derivatives of sample English meanings

We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list. However, this list can be augmented based on an analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.

English word: sit. Morphological derivations: sit, sits, sat, sitting.
English word: give. Morphological derivations: give, gives, gave, given, giving. ...

LV: jaanaa जाना (to go). Morphological derivatives: jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgA, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgI, jaakara. (In the figure each form is glossed for tense, person, gender and number, e.g. jaa = go stem, jaayegaa = go future masculine third-person singular, jaakara = go completion.) ...

LV: lenaa लेना (to take). Morphological derivatives: le, lii, leM, lo, letaa, letii, lete, liiM, luuM, legaa, legii, lene, lenaa, lenii, liyaa, leMge, loge, letiiM, luuMgaa, luuMgii, lekara. (In the figure each form is glossed for tense, person, gender and number, e.g. le = take stem, liyaa = take past masculine singular, lekara = take completion.) ...

Figure 3: Stop words in Hindi used by the system

Figure 4: A few exit words in Hindi used by the system

The inner loop of the procedure identifies multiple CPs that may be present in a sentence. The outer loop mines the CPs in the entire corpus. The experimentation and results are discussed in the following section.

4 Experimentation and Results

The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE English-Hindi parallel corpus (McEnery, Baker, Gaizauskas and Cunningham 2000). A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%. The F-measure is 88% to 97%. This is a remarkable and somewhat surprising result for such a simple methodology without much linguistic or statistical analysis. It is much higher than the result reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; theirs is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                 File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                    112     193     102      43     133     107
Total no. of CPs (N)                200     298     150      46     188     151
Correctly identified CPs (TP)       195     296     149      46     175     135
V-V CPs                              56      63       9       6      15      20
Incorrectly identified CPs (FP)      17      44       7      11      16      20
Unidentified CPs (FN)                 5       2       1       0      13      16
Accuracy (%)                      97.50   99.33   99.33  100.00   93.08   89.40
Precision, TP/(TP+FP) (%)         91.98   87.05   95.51   80.70   91.62   87.09
Recall, TP/(TP+FN) (%)            97.50   98.33   99.33  100.00   93.08   89.40
F-measure, 2PR/(P+R) (%)           94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation
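As a quick check of the table's arithmetic, the metrics for File 1 can be recomputed from its TP/FP/FN counts (a small illustrative snippet, not part of the original system):

```python
tp, fp, fn = 195, 17, 5                  # File 1 counts from Table 2
precision = tp / (tp + fp)               # 0.9198 -> 91.98%
recall = tp / (tp + fn)                  # 0.9750 -> 97.50%
f_measure = 2 * precision * recall / (precision + recall)   # ~0.946
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```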

Exit words (Figure 4) include: ने (ergative case marker), को (accusative case marker), का / के / की (possessive case markers), से (from/by/with), में (in/into), पर (on/but), और (and, Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that, Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा / मेरे / मेरी (my), तुम्हारा / तुम्हारे / तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना / अपनी / अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका / जिसकी / जिसके (whose), जिनको / जिनके (to whom).

Stop words (Figure 3) include: नहीं (no/not), न (no/not, Hindi particle), भी (also, Hindi particle), ही (only, Hindi particle), तो (then, Hindi particle), क्यों (why), क्या (what, Hindi particle), कहाँ (where, Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अंत में (in the end), आखिर में (in the end).

Given below are some sample outputs.

(1) English sentence: "I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help."
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence: (i) काम करना (to work), (ii) अच्छा लगना (to feel good, enjoy), (iii) सलाह लेना (to seek advice), (iv) खुशी होना (to feel happy, good), (v) मदद करना (to help).

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al. 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: "Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career."
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।

The CPs identified in the sentence: (i) आदर्श बनना (to be a role model), (ii) कद्र करना (to respect).

Here also, the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not being checked. However, the mining success rate obtained indicates that such cases are few in practice. The use of the 'stop words', which allows intervening words within the CPs, helps a lot in improving the performance. Similarly, the use of the 'exit words' avoids a lot of incorrect identifications.

5 Conclusions

The simple empirical method for mining CPs outlined in this work yields an average of 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used in word alignment tools, where the identified components of CPs are grouped together before the word alignment process. This will increase both the alignment accuracy and the speed.

The methodology presented in this work is equally applicable to other languages within the Indo-Aryan family.

References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek: A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing - 2005, Jeju Island, Korea, 553–564.


Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2

1Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
2Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590

{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn, huangyun@comp.nus.edu.sg

Abstract

Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs could improve translation performance significantly.

1 Introduction

The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.

A multiword expression (MWE) can be considered a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e. the property that the whole expression cannot be derived from its component words. Stanford University launched a MWE project1, in which different qualities of MWEs were presented. For a bilingual multiword expression (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language, and (2) the source phrase and the target phrase are translated to each other exactly, i.e. there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.

For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as

1 http://mwe.stanford.edu

one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in a further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system so as to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are as follows:

• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics only lead to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).

Besides the above differences, our approaches have some advantages:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would otherwise further propagate to the SMT system and hurt SMT performance.

• We conduct experiments on domain-specific corpora. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e. a domain-specific corpus has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since such MWEs are flexible and concise. Take the Chinese traditional-medicine term meaning "soften hard mass and dispel pathogenic accumulation" as an example: every word of this term carries a special meaning and cannot be understood literally or out of this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.

• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus it benefits from full exploitation of the available resources without greatly increasing the time and space complexity of the SMT system.

2 http://www.statmt.org/moses

The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction to monolingual and bilingual MWE extraction.

2 Bilingual Multiword Expression Extraction

In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.

2.1 Automatic Extraction of MWEs

In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches fall into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et al., 2005), (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007), and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.

Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".

Definition 1 (Unit): A unit is any substring of the given sentence. For example, "A B", "C", "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score): The score function is defined only on two adjacent units and returns the LLR between the last word of the first unit and the first word of the second unit.3 For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select): The selecting operator finds the two adjacent units with the maximum score in a list.

Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce the unit "B C" and the unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".

Initially, every word in the sentence is considered as one unit, and all these units form an initial list L. If the sentence is of length N, the list of course contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with the maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. The algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.

3 We use a stoplist to eliminate the units containing function words by setting their score to 0.
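A compact Python sketch of the HRA loop is given below. Here `llr` stands in for a precomputed LLR function over adjacent words (returning 0 for pairs involving function words, per the stoplist footnote); it is an assumption of this illustration rather than code from the paper.

```python
def hra(sentence, llr, threshold):
    """LLR-based Hierarchical Reducing Algorithm (sketch).
    sentence: list of words; returns the set of extracted MWEs."""
    units = list(sentence)        # initial list: one unit per word
    mwes = set()
    while len(units) > 1:
        # select: adjacent pair with the maximum score, where the score
        # is the LLR between the last word of the left unit and the
        # first word of the right unit
        scores = [llr(units[i].split()[-1], units[i + 1].split()[0])
                  for i in range(len(units) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:      # termination condition (1)
            break
        # reduce: concatenate the two units and record the result in S
        units[best:best + 2] = [units[best] + " " + units[best + 1]]
        mwes.add(units[best])
    return mwes                    # condition (2): only one unit left
```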

Figure 1: Example of the Hierarchical Reducing Algorithm (the hierarchical structure built over the example sentence, with LLR scores between adjacent words)

Let us make the algorithm clearer with an example. Assume the threshold of the score is 20 and the given sentence is a seven-word Chinese phrase meaning "healthy tea for preventing hyperlipidemia".4 Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs (the words for "hyper-lipid", "hyperlipidemia", "healthy tea" and "preventing hyperlipidemia") are extracted, in that order, and the substrings above the dotted line in Figure 1 are not extracted.

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are scored with the minimum LLR of the adjacent words within them, which gives lexical confidence to the extracted MWEs. Finally, for a given sentence of length N, the HRA terminates within N − 1 iterations, which is very efficient.

However, the HRA has a problem: it extracts substrings before extracting the whole string, even if a substring appears only in that particular whole string, which we consider useless. To solve this problem, we use contextual features,

4 The whole sentence means "healthy tea for preventing hyperlipidemia"; word by word, the glosses are: preventing, hyper-, -lipid-, -emia, for, healthy, tea.

namely contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.

2.2 Automatic Extraction of MWE's Translation

In subsection 2.1 we described the algorithm used to obtain MWEs; in this subsection we introduce the procedure for finding their translations in the parallel corpus.

For mining the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:

1. Run GIZA++5 to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs whose source language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in the traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.

5 http://www.fjoch.com/GIZA++.html

We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
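A minimal sketch of such a perceptron classifier is shown below, assuming each candidate phrase pair has already been converted into a feature dictionary built from the ten translation/language features above; the feature extraction itself is omitted, and all names are illustrative.

```python
def train_perceptron(training_data, epochs=10):
    """training_data: list of (features, label) pairs, where features
    maps a feature name to its value and label is +1 (right
    translation) or -1 (wrong translation)."""
    weights = {}
    for _ in range(epochs):
        for feats, label in training_data:
            score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
            if label * score <= 0:        # misclassified: update weights
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + label * v
    return weights
```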

3 Application of Bilingual MWEs

Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs; we compare their performance in Section 4.

3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect the alignment to be modified and the probability estimation to become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance was proved by many researchers (Wu et al., 2008; Eck et al., 2004); our method, however, shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.

3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) once pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source language phrase contains a MWE (as a substring) and the target language phrase contains the translation of that MWE (as a substring), the feature value is 1; otherwise the feature value is 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
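Conceptually, the marking step can be as simple as the following sketch (names are illustrative, not the authors' code):

```python
def mwe_feature(src_phrase, tgt_phrase, bilingual_mwes):
    """Return 1 if the phrase pair contains some bilingual MWE
    (source side and its translation both present as substrings)."""
    return 1 if any(s in src_phrase and t in tgt_phrase
                    for s, t in bilingual_mwes) else 0
```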

3.3 Additional Phrase Table of Bilingual MWEs

Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As for probability assignment, we simply assign 1 to the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in the input sentence, the decoder searches all candidate translation phrases in both phrase tables during translation. We use "Baseline+NewBP" to denote this method.
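Assuming the plain-text phrase-table layout used by Moses (source ||| target ||| scores), building the additional table amounts to emitting one line per bilingual MWE with all four translation probabilities set to 1, roughly as follows (a sketch under that assumption, not a definitive implementation):

```python
def write_mwe_phrase_table(bilingual_mwes, path):
    """Write extracted bilingual MWEs as an extra Moses phrase table,
    assigning 1 to the four translation probabilities for simplicity."""
    with open(path, "w", encoding="utf-8") as out:
        for src, tgt in bilingual_mwes:
            out.write(f"{src} ||| {tgt} ||| 1 1 1 1\n")
```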

4 Experiments

4.1 Data

We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

For the traditional medicine domain, Table 1 shows the data statistics. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.

                      Chinese     English
Training  Sentences          120,355
          Words       4,688,873   4,737,843
Dev       Sentences            1,000
          Words          31,722      32,390
Test      Sentences            1,000
          Words          41,643      40,551

Table 1: Traditional medicine corpus

6 http://www.speech.sri.com/projects/srilm

For the chemical industry domain, Table 2 gives the details of the data. In this experiment, 59,466 bilingual MWEs are extracted.

                      Chinese     English
Training  Sentences          120,856
          Words       4,532,503   4,311,682
Dev       Sentences            1,099
          Words          42,122      40,521
Test      Sentences            1,099
          Words          41,069      39,210

Table 2: Chemical industry corpus

We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform a statistical significance test between two translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.

4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in a log-linear model. To obtain the best translation $\hat{e}$ of the source sentence $f$, the log-linear model uses the following equation:

$$\hat{e} = \arg\max_{e} p(e|f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f) \qquad (1)$$

in which $h_m$ and $\lambda_m$ denote the $m$-th feature and weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
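A toy illustration of Equation (1): the decoder prefers the candidate whose weighted feature sum is maximal. The feature values and weights below are invented purely for illustration.

```python
# Two hypothetical candidates with two log-domain features each.
candidates = {
    "tea":           {"translation": -2.1, "language_model": -4.0},
    "medicated tea": {"translation": -2.3, "language_model": -3.2},
}
weights = {"translation": 1.0, "language_model": 0.7}

best = max(candidates,
           key=lambda e: sum(weights[h] * v
                             for h, v in candidates[e].items()))
print(best)  # -> "medicated tea" (score -4.54 beats -4.9)
```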

7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm

4.3 Results

Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement. And Baseline+BiMWE achieves slightly higher translation quality than the baseline system.

To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs that were labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio $x$ of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming $x$ follows a Gaussian distribution, the ranking score of a phrase pair $(s, t)$ is defined as:

$$Score(s, t) = \log(LLR(s, t)) \times \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (2)$$

Here the mean $\mu$ and variance $\sigma^2$ are estimated from the training set. After ranking by score, we select the top 50,000, 60,000 and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
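Equation (2) translates directly into code; a small sketch follows, with μ and σ assumed to have been estimated elsewhere from the training set:

```python
from math import exp, log, pi, sqrt

def ranking_score(llr, x, mu, sigma):
    """Eq. (2): LLR confidence weighted by a Gaussian over the length
    ratio x of the source phrase to the target phrase."""
    gaussian = exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)
    return log(llr) * gaussian
```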

From this table, we can conclude: (1) all three methods, on all settings, improve the BLEU score; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Table 4 with Table 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as in (Lambert and Banchs, 2005).

Methods           50,000   60,000   70,000
Baseline                   0.2658
Baseline+BiMWE    0.2671   0.2686   0.2715
Baseline+Feat     0.2728   0.2734   0.2712
Baseline+NewBP    0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain

To verify that bilingual MWEs do indeed improve SMT performance, and not only on one particular domain, we also performed experiments on the chemical industry domain. Table 5 shows the results. From this table, we can see that the three methods improve translation performance on the chemical industry domain as well as on the traditional medicine domain.

Methods           BLEU
Baseline          0.1882
Baseline+BiMWE    0.1928
Baseline+Feat     0.1917
Baseline+NewBP    0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain

4.4 Discussion

In order to know in what respects our methods improve translation performance, we manually analyzed some test sentences; this subsection gives some examples.

(1) For the first example in Table 6, the Chinese term meaning "dredging meridians" is aligned to other words and not correctly translated by the baseline system, while it is aligned to the correct target phrase "dredging meridians" by Baseline+BiMWE, since the bilingual MWE pairing this term with "dredging meridians" has been added into the training corpus and then aligned by GIZA++.

(2) For the second example in Table 6, the Chinese term for "medicated tea" has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as its translation, while the Baseline+Feat system chooses "medicated tea", because the additional feature gives a high probability to the correct translation "medicated tea".

Example 1.
Src: (Chinese source sentence)
Ref: the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE: the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Example 2.
Src: (Chinese source sentence)
Ref: the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat: may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples


5 Conclusion and Future Work

This paper presented the LLR-based Hierarchical Reducing Algorithm for automatically extracting bilingual MWEs and investigated the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies can also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are still rather simple and coarse. More sophisticated approaches should be taken into account when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in a phrase-based SMT system, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.

Acknowledgments

This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson and Wolfgang Teubert. 2002. Extraction of translation units from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward and Weijing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.


Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp

Abstract

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F measure of 89.79, which is higher than previous work.

1 Introduction

Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).

In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.

When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano and Hirai (2004) and Sasano and Kurohashi (2008) utilized information from the head of the bunsetsu1. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.

This paper proposes Japanese bottom-up named

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.

(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama TOP)
'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences

entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, the label assignment is decided by bottom-up dynamic programming, as in the CKY parsing algorithm. We conducted an experiment on CRL NE data and achieved an F measure of 89.79, which is higher than previous work.

This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 gives an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.

2 Related Work

In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY and PERCENT. Table 1 shows example instances of each class.

class           example
PERSON          Kimura Syonosuke
LOCATION        Taiheiyou (Pacific Ocean)
ORGANIZATION    Jimin-tou (Liberal Democratic Party)
ARTIFACT        PL-houan (PL bill)
DATE            21-seiki (21st century)
TIME            gozen-7-ji (7 a.m.)
MONEY           500-oku-en (50 billion yen)
PERCENT         20 percent

Table 1: NE classes and their examples

NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.

In general, a machine learning method is formalized as sequential labeling, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I and E are assigned to the beginning, middle and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.2 The labels S, B, I and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
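For concreteness, SE-style encoding of NE spans over a morpheme sequence can be sketched as follows (an illustration of the labeling scheme, not the authors' code):

```python
def se_encode(num_morphemes, ne_spans):
    """ne_spans: list of (start, end, ne_class) with end exclusive.
    Returns one S-/B-/I-/E-/O label per morpheme."""
    labels = ["O"] * num_morphemes
    for start, end, cls in ne_spans:
        if end - start == 1:
            labels[start] = f"S-{cls}"          # single-morpheme NE
        else:
            labels[start] = f"B-{cls}"          # beginning
            for i in range(start + 1, end - 1):
                labels[i] = f"I-{cls}"          # middle
            labels[end - 1] = f"E-{cls}"        # end
    return labels

print(se_encode(4, [(0, 2, "PERSON")]))
# ['B-PERSON', 'E-PERSON', 'O', 'O']
```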

The model for label estimation is learned by machine learning. The following features are generally utilized: the characters, character types, POS, etc. of the morpheme and of the two preceding and two following morphemes. Methods utilizing SVMs or CRFs have been proposed.

2 Besides, there are the IOB1 and IOB2 algorithms using only IOB, and the IOE1 and IOE2 algorithms using only IOE (Kim and Veenstra, 1999).

Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin"). (a) The initial state: every chunk in the bunsetsu has been assigned a label with a score, e.g., Habu / PERSON 0.111, Habu-Yoshiharu / PERSON 0.438, Habu-Yoshiharu-Meijin / ORGANIZATION 0.083, Yoshiharu-Meijin / MONEY 0.075, Meijin / OTHERe 0.245. (b) The final output: the best assignment, Habu-Yoshiharu (PERSON) + Meijin (OTHERe), with score 0.438 + 0.245, is chosen.

Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano and Hirai (2004) utilized as features the word subclass of the NE in the analysis direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu. Sasano and Kurohashi (2008) utilized a cache feature, coreference results, a syntactic feature, and a case-frame feature as structural features.

Some work acquires knowledge from a large unannotated corpus and applies it to NER. Kazama and Torisawa (2008) utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of dependency relation pairs. Fukushima et al. (2008) acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus.

In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F measure of about 0.89 using these data.

3 Overview of Proposed Method

Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a) initial statewhere the labels of all the chunks have been es-timated From this state the best label assign-ment in the bunsetsu is determined This pro-cedure is performed from the lower left (corre-sponds to each morpheme) to the upper right likethe CKY parsing algorithm as shown in Figure 2(b) For example when the label assignment forldquoHabu-Yoshiharurdquo is determined the label assign-ment ldquoHabu-Yoshiharurdquo (PERSON) and the labelassignment ldquoHaburdquo (PERSON) and ldquoYoshiharurdquo(OTHER) are compared and the better one is cho-sen While grammatical rules are utilized in ageneral CKY algorithm this method chooses bet-ter label assignment for each cell using a machinelearning method
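The bottom-up procedure can be sketched as follows (our reconstruction of Figure 2, not the authors' code; score_label stands in for the label estimation model and better for the label comparison model):

# Sketch of the CKY-style bottom-up label assignment over one bunsetsu.
def best_assignment(morphemes, score_label, better):
    # cell[(i, j)] holds the best label assignment (a list of
    # (chunk, label, score) triples) for morphemes i..j.
    cell = {}
    n = len(morphemes)
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            chunk = "-".join(morphemes[i:j + 1])
            best = [score_label(chunk)]         # the whole span as a single chunk
            for k in range(i, j):               # every way of splitting the span
                candidate = cell[(i, k)] + cell[(k + 1, j)]
                best = better(best, candidate)  # label comparison model decides
            cell[(i, j)] = best
    return cell[(0, n - 1)]                     # assignment for the whole bunsetsu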

The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.


Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"

4 Model Learning

4.1 Label Estimation Model

This model estimates the label for each chunk. The unit of analysis is basically the bunsetsu, because 93.5% of named entities in the CRL NE data are located within a single bunsetsu. Exceptionally, the following expressions located in multiple bunsetsus tend to be NEs:

• expressions enclosed in parentheses (e.g., "Himeyuri-no tou" (the Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, a bunsetsu is expanded when one of the above conditions is met. With this expansion, 98.6% of named entities are located within a bunsetsu.3

For each bunsetsu, head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted; in the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.
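This preprocessing step might look as follows (a minimal sketch with a hypothetical tag inventory; JUMAN's actual part-of-speech system is richer):

# Sketch: delete function words from the head and tail of a bunsetsu,
# e.g. "Habu-Yoshiharu-Meijin-wa" -> "Habu-Yoshiharu-Meijin" and
# "yaku-san-bai" -> "san-bai".
FUNCTION_TAGS = {"particle", "prefix", "suffix"}  # hypothetical tag names

def strip_function_words(morphemes, tags):
    start, end = 0, len(morphemes)
    while start < end and tags[start] in FUNCTION_TAGS:
        start += 1
    while end > start and tags[end - 1] in FUNCTION_TAGS:
        end -= 1
    return morphemes[start:end]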

Next, to learn the label estimation model, all the chunks in a bunsetsu are assigned their correct labels based on a hand-annotated corpus. The label set comprises 13 classes: the eight NE classes (as shown in Table 1) and five additional classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

A chunk that corresponds to a whole bunsetsu and does not contain any NE is labeled OTHERs; a head, middle, or tail chunk that does not correspond to an NE is labeled OTHERb, OTHERi, or OTHERe, respectively.4

3 As examples of NEs that are not covered even by an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4 Each OTHER label is assigned to the longest chunk that satisfies its condition.

1. # of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type5

4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana-Kanji"

5. word class, word subclass, and several features provided by the morphological analyzer JUMAN

6. several features6 provided by the parser KNP

7. string of the morpheme in the chunk

8. IPADIC7 feature
   - If the string of the chunk is registered in the following categories of IPADIC: "person", "location", "organization", and "general", this feature fires

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")

10. cache feature
    - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used as this feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE-category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
    - For example, in the ga (NOM) case of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
    - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model

A chunk that is neither one of the eight NE classes nor one of the above four OTHER classes is labeled invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and the other chunks are labeled invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned

5 The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6 When a morpheme is ambiguous, all the corresponding features fire.

7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja


to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of the morpheme in the chunk are distinguished: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from pairs of label and feature vector for each chunk. To handle the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1 + exp(-βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.
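In code, the per-chunk score computation might look like this sketch (ours; svm_outputs is assumed to hold the 13 one-vs-rest margins for one chunk):

import math

def label_scores(svm_outputs, beta=1.0):
    # Squash each one-vs-rest SVM margin with the sigmoid
    # 1 / (1 + exp(-beta * x)), then normalize so that the scores of
    # the 13 labels of the chunk sum to one.
    squashed = [1.0 / (1.0 + math.exp(-beta * x)) for x in svm_outputs]
    total = sum(squashed)
    return [s / total for s in squashed]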

The purpose of introducing the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk in which the label invalid has the highest score, the label with the second highest score is adopted.
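The fallback rule for invalid can be stated compactly (a sketch of our reading of this paragraph):

def chunk_label(scores):
    # scores: mapping from label to normalized score for one chunk.
    # If "invalid" has the highest score, adopt the second-best label.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[1] if ranked[0] == "invalid" else ranked[0]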

4.2 Label Comparison Model

This model compares two label assignments for a given string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up on either side of "vs." (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks per set is five, and thus examples in which the first or second set has more than five chunks are not used for model learning.

Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used for each label estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining positions.


Figure 4: Assignment of positive/negative examples

Figure 5 illustrates the feature vector for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus; thus, for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin", we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.
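Concretely, the fixed-width input of the comparison model can be assembled as in the following sketch (our reconstruction of Figure 5, with hypothetical helper names):

MAX_CHUNKS = 5
DIM = 13  # 12 label dimensions plus the SVM score

def encode_side(chunk_vectors):
    # chunk_vectors: one DIM-dimensional feature vector per chunk;
    # pad with zero vectors up to MAX_CHUNKS and flatten (65 values).
    assert len(chunk_vectors) <= MAX_CHUNKS
    padded = chunk_vectors + [[0.0] * DIM] * (MAX_CHUNKS - len(chunk_vectors))
    return [x for vec in padded for x in vec]

def encode_pair(first_set, second_set):
    # Feature vector for "first_set vs. second_set"; the SVM learns
    # whether the first set is the correct assignment (positive).
    return encode_side(first_set) + encode_side(second_set)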

5 Analysis

First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell "Yoshiharu-Meijin", the model compares the


chunk:   Habu-Yoshiharu    Habu   Yoshiharu
label:   PERSON            PERSON MONEY
vector:  V11 0 0 0 0       V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu"; V11, V21, V22, and 0 are vectors of dimension 13)

Figure 6: The label comparison model: (a) label assignment for the cell "Habu-Yoshiharu"; (b) label assignment for the cell "Habu-Yoshiharu-Meijin"

label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated; in this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, which constitutes the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparison step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination, the comparison is not performed.

6 Experiment

To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled OPTIONAL and were used for neither

Figure 7: 5-fold cross-validation

the model learning8 nor the evaluation. We performed 5-fold cross-validation following previous work. Unlike previous work, our method has to learn the SVM models twice; therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, label estimation model 1 is learned from parts (b)-(e). Then, label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same holds for parts

8 Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.


                Recall               Precision
ORGANIZATION    81.83 (3008/3676)    88.37 (3008/3404)
PERSON          90.05 (3458/3840)    93.87 (3458/3684)
LOCATION        91.38 (4992/5463)    92.44 (4992/5400)
ARTIFACT        46.72 (349/747)      74.89 (349/466)
DATE            93.27 (3327/3567)    93.12 (3327/3573)
TIME            88.25 (443/502)      90.59 (443/489)
MONEY           93.85 (366/390)      97.60 (366/375)
PERCENT         95.33 (469/492)      95.91 (469/489)
ALL-SLOT        87.87                91.79
F-measure       89.79

Table 3: Experimental result

(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both label estimation model 1 and the label comparison model.
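The division of Figure 7 can be sketched as follows (our reading; train_estimator, train_comparator, and apply are hypothetical stand-ins for SVM training and application):

# Sketch of the nested 5-fold scheme: to analyze one part, model 1 is
# trained on the other four parts; models 2x are each trained on three
# of those parts and applied to the held-out one, yielding the training
# features for the label comparison model.
def models_for(target, parts, train_estimator, train_comparator):
    rest = [p for p in parts if p != target]   # e.g. b, c, d, e for part a
    model1 = train_estimator(rest)             # label estimation model 1
    features = []
    for held_out in rest:                      # label estimation models 2b..2e
        model2 = train_estimator([p for p in rest if p != held_out])
        features += model2.apply(held_out)     # features for the comparison model
    return model1, train_comparator(features)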

In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.

7 Discussion

7.1 Comparison with Previous Work

Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with patterns over a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for model learning. Therefore, incorporating such knowledge and/or analysis results into our method would improve its performance further.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, for the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou" (law), which is useful for label estimation, since the head of this bunsetsu is "ihan"; in contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", the information about "hou" can be utilized.

9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html

7.2 Error Analysis

There were some errors in analyzing Katakana and alphabetic words. In the following example, although the correct analysis labels "Batistuta" as PERSON, the system labeled it as OTHERs:

(4) Italy-de    katsuyaku-suru    Batistuta-wo     kuwaeta    Argentine
    Italy-LOC   active            Batistuta-ACC    call       Argentine
    'Argentine called Batistuta, who was active in Italy'

There is no entry for "Batistuta" in the JUMAN dictionary or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were some errors in applying the label comparison model although the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis labels "HongKong-seityou" as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we plan to adjust the value of β in the sigmoid function and to refine the features for the label comparison model.


                                 F1      analysis unit    distinctive features
(Fukushima et al., 2008)         89.29   character        Web
(Kazama and Torisawa, 2008)      88.93   character        Wikipedia, Web
(Sasano and Kurohashi, 2008)     89.40   morpheme         structural information
(Nakano and Hirai, 2004)         89.03   character        bunsetsu feature
(Masayuki and Matsumoto, 2003)   87.21   character
(Isozaki and Kazawa, 2003)       86.77   morpheme
proposed method                  89.79   compound noun    Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on the CRL NE data using cross-validation)

Figure 8: An example of the error in the label comparison model: (a) initial state; (b) final output

8 Conclusion

This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.

We are planning to integrate this method with the syntactic and case analysis method of Kawahara and Kurohashi (2007) and to perform syntactic, case, and Named Entity analysis simultaneously to improve overall accuracy.

References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3 (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970–979 (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304–311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407–415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173–179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of COLING/ACL 2006, pages 1121–1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282–289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8–15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934–941 (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607–612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171–178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Abbreviation Generation for Japanese Multi-Word Expressions

Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

Author Index

Bhutada, Pravin, 17

Cao, Jie, 47
Caseli, Helena, 1

Diab, Mona, 17

Finatto, Maria José, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55

Hoang, Hung Huu, 31
Huang, Yun, 47

Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55

Liu, Qun, 47
Lu, Yajuan, 47

Machado, Andre, 1

Ren, Zhixiang, 47

Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63

Villavicencio, Aline, 1

Wakaki, Hiromi, 63

Zarrieß, Sina, 23



Organizers

Dimitra Anastasiou, Localisation Research Centre, Limerick University, Ireland
Chikara Hashimoto, National Institute of Information and Communications Technology, Japan
Preslav Nakov, National University of Singapore, Singapore
Su Nam Kim, University of Melbourne, Australia

Program Committee

Iñaki Alegria, University of the Basque Country (Spain)
Timothy Baldwin, University of Melbourne (Australia)
Colin Bannard, Max Planck Institute (Germany)
Francis Bond, National Institute of Information and Communications Technology (Japan)
Gaël Dias, Beira Interior University (Portugal)
Ulrich Heid, Stuttgart University (Germany)
Stefan Evert, University of Osnabrück (Germany)
Afsaneh Fazly, University of Toronto (Canada)
Nicole Grégoire, University of Utrecht (The Netherlands)
Roxana Girju, University of Illinois at Urbana-Champaign (USA)
Kyo Kageura, University of Tokyo (Japan)
Brigitte Krenn, Austrian Research Institute for Artificial Intelligence (Austria)
Eric Laporte, University of Marne-la-Vallée (France)
Rosamund Moon, University of Birmingham (UK)
Diana McCarthy, University of Sussex (UK)
Jan Odijk, University of Utrecht (The Netherlands)
Stephan Oepen, University of Oslo (Norway)
Darren Pearce, London Knowledge Lab (UK)
Pavel Pecina, Charles University (Czech Republic)
Scott Piao, University of Manchester (UK)
Violeta Seretan, University of Geneva (Switzerland)
Stan Szpakowicz, University of Ottawa (Canada)
Beata Trawinski, University of Tübingen (Germany)
Peter Turney, National Research Council of Canada (Canada)
Kiyoko Uchiyama, Keio University (Japan)
Begoña Villada Moirón, University of Groningen (The Netherlands)
Aline Villavicencio, Federal University of Rio Grande do Sul (Brazil)


Table of Contents

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria José Finatto . . . . . 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan . . . . . 9

Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada . . . . . 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn . . . . . 23

A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan . . . . . 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha . . . . . 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang . . . . . 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi . . . . . 55

Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita . . . . . 63


Workshop Program

Friday, August 6, 2009

8:30–8:45 Welcome and Introduction to the Workshop

Session 1 (8:45–10:00): MWE Identification and Disambiguation

8:45–9:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria José Finatto

9:10–9:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

9:35–10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00–10:30 BREAK

Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55–11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45–13:50 LUNCH

Session 3 (13:50–15:30): Applications

13:50–14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang

14:15–14:40 Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00 BREAK

16:00–17:00 General Discussion

17:00–17:15 Closing Remarks


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 1–8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, Andre Machado♣, Maria José Finatto♥

♦ Department of Computer Science, Federal University of São Carlos (Brazil)
♣ Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠ Department of Computer Sciences, Bath University (UK)
♥ Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.

1 Introduction

A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).

Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.

In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including


parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or candidates with some semantic information for monolingual resources.

The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.

2 Related Work

The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket meaning to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.

A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type-independent (e.g., Zhang et al. (2006); Villavicencio et al. (2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003), and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g., Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries between languages, using information from one language to help deal with MWEs in the other (e.g., Villada Moirón and Tiedemann (2006); Caseli et al. (2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g., ad hoc vs. the small boy), some of these works employ linguistic knowledge (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).

Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy, and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that, for MWE identification, the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the


corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).

Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head-dependence heuristic (on parsed data).

Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.

Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.

3 The Corpus and Reference Lists

The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify, and support translation students in the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 ngrams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS-tag patterns (e.g., removing prepositions from terms that began or ended with them). In addition, if an ngram was part of a larger ngram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 ngrams, which were manually checked by translation students, resulting in 2,407 terms with 1,421 bigrams, 730 trigrams, and 339 ngrams with n larger than 3 (not considered in the experiments presented in this paper).

4 Statistically-Driven and Alignment-Based methods

4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992), and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed

1 Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim


to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering out ngrams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the four measures was then calculated for these ngrams, and we ranked each ngram according to each measure. The average of all the rankings is used as the combined measure for the MWE candidates.
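As an illustration of these measures (a standalone sketch, not the Ngram Statistics Package implementation actually used), PMI for a bigram and the average-rank combination can be computed as follows:

import math

def pmi(bigram, unigrams, bigrams, n):
    # Pointwise mutual information of (w1, w2):
    # log2( p(w1 w2) / (p(w1) * p(w2)) ), with counts over n tokens.
    w1, w2 = bigram
    return math.log2((bigrams[bigram] / n) /
                     ((unigrams[w1] / n) * (unigrams[w2] / n)))

def combined_ranking(candidates, measures):
    # Rank the candidates under each measure separately; the combined
    # score of a candidate is the average of its per-measure ranks.
    ranks = {c: [] for c in candidates}
    for measure in measures:
        for r, c in enumerate(sorted(candidates, key=measure, reverse=True)):
            ranks[c].append(r)
    return sorted(candidates, key=lambda c: sum(ranks[c]) / len(ranks[c]))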

4.2 Alignment-Based method

The second of the MWE extraction approaches investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, i.e., a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that a sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
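A minimal sketch of this extraction criterion, assuming the aligner's output is available as, for each source word, the set of target indices it is linked to (a hypothetical data layout, not the actual TCA/GIZA++ interface):

from collections import Counter

def alignment_mwe_candidates(sentences):
    # sentences: iterable of (src_words, links) pairs, where links[i] is
    # the frozenset of target indices aligned to source word i. Every
    # run of 2+ consecutive source words sharing the same non-empty
    # alignment (an n:m link with n >= 2) yields one candidate.
    counts = Counter()
    for src_words, links in sentences:
        i = 0
        while i < len(src_words):
            j = i + 1
            while j < len(src_words) and links[i] and links[j] == links[i]:
                j += 1
            if j - i >= 2:
                counts[" ".join(src_words[i:j])] += 1
            i = j
    return counts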

Thus, notice that the alignment-based MWE extraction method does not rely on conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.

It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are in fact a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained on the same corpus used in our experiments, and the numbers of phrases extracted following the two approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.

In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as the gold standard.

To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps we used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).

2 GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html

3 Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org


From the preprocessed corpus, the MWE candidates are extracted as those sequences in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.

Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here; a sketch of the filtering step follows this list. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in Caseli et al. (2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why), and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as in Caseli et al. (2009) plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun, and numeral, and (b) a minimum frequency threshold of 2.
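The filtering step itself can be sketched as follows (a simplified sketch that checks only the first and last POS tags of a candidate; the actual pattern sets F1-F3 are those listed above):

def apply_filter(candidates, pos_tags, bad_start, bad_end, min_freq):
    # candidates: mapping from candidate string to corpus frequency;
    # pos_tags: mapping from candidate string to its POS-tag sequence.
    kept = {}
    for cand, freq in candidates.items():
        tags = pos_tags[cand]
        if freq < min_freq:                       # frequency threshold (2 or 5)
            continue
        if tags[0] in bad_start or tags[-1] in bad_end:
            continue                              # matches a forbidden pattern
        kept[cand] = freq
    return kept

# e.g., for F3: min_freq=2 and bad_start/bad_end both containing
# determiner, adverb, conjunction, preposition, verb, pronoun, numeral.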

5 Experiments and Results

Table 1 shows the top 5 and bottom 5 ranked candidates returned by PMI and by the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists), and F-measure ((2 × precision × recall)/(precision + recall)) figures for the association measures, using all the candidates (on the

PMI                             alignment-based
Online Mendelian Inheritance    faixa etária
Beta Technology Incorporated    aleitamento materno
Lange Beta Technology           estados unidos
Oxido Nítrico Inalatório        hipertensão arterial
jogar video game                leite materno
e um de                         couro cabeludo
e a do                          bloqueio lactíferos
se que de                       emocional anatomia
e a da                          neonato a termo
e de não                        duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach

pt MWE candidates      PMI      MI
# proposed bigrams     64,839   64,839
# correct MWEs         1,403    1,403
precision (%)          2.16     2.16
recall (%)             98.73    98.73
F (%)                  4.23     4.23
# proposed trigrams    54,548   54,548
# correct MWEs         701      701
precision (%)          1.29     1.29
recall (%)             96.03    96.03
F (%)                  2.55     2.55
# proposed bigrams     1,421    1,421
# correct MWEs         155      261
precision (%)          10.91    18.37
recall (%)             10.91    18.37
F (%)                  10.91    18.37
# proposed trigrams    730      730
# correct MWEs         44       20
precision (%)          6.03     2.74
recall (%)             6.03     2.74
F (%)                  6.03     2.74

Table 2: Evaluation of MWE candidates (PMI and MI)

first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.

On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered by the three filters described in Section 4.2: F1, F2, and F3.

To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic


pt MWE candidates            F1       F2       F3
# filtered by POS patterns   24,996   21,544   32,644
# filtered by frequency      9,012    11,855   1,442
final set                    269      878      191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates        F1      F2      F3
# proposed bigrams       250     754     169
# correct MWEs           48      95      65
precision (%)            19.20   12.60   38.46
recall (%)               3.38    6.69    4.57
F (%)                    5.75    8.74    8.18
# proposed trigrams      19      110     20
# correct MWEs           1       9       4
precision (%)            5.26    8.18    20.00
recall (%)               0.14    1.23    0.55
F (%)                    0.27    2.14    1.07
# proposed bi+trigrams   269     864     189
# correct MWEs           49      104     69
precision (%)            18.22   12.04   36.51
recall (%)               2.28    4.83    3.21
F (%)                    4.05    6.90    5.90

Table 4: Evaluation of MWE candidates

comparison, we considered the final lists of MWE candidates generated by each filter, as given in Table 3. The number of matching entries and the values for precision, recall, and F-measure are shown in Table 4.

The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not considered at the moment.

After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists, the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.

From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see Table 4). The remaining 122 candidates (63.9%) were analysed by two native-speaker human judges, who classified each of the 122 candidates as true if it was a multiword expression, and false otherwise, independently of its being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name, or (2) a sequence of words whose meaning cannot be obtained by compounding the meanings of its component words.

The judgments of the two judges were compared, and a disagreement of approximately 12% on multiwords was verified. The agreement was also measured by the kappa (K) measure (Carletta, 1996), with K = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of K between 0.67 and 0.8 indicates good agreement.
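For reference, the kappa statistic of Carletta (1996) is K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed agreement and P(E) the agreement expected by chance; a sketch over a hypothetical 2x2 contingency table:

def kappa(table):
    # table[i][j]: number of candidates judge 1 classified as i and
    # judge 2 classified as j (i, j in {0, 1} for false/true MWE).
    n = sum(sum(row) for row in table)
    p_a = (table[0][0] + table[1][1]) / n            # observed agreement
    p_e = sum((sum(table[i]) / n) *                  # chance agreement
              (sum(row[i] for row in table) / n) for i in range(2))
    return (p_a - p_e) / (1 - p_e)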

In order to calculate the percentage of true candidates among these 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating against a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, of the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.

6 Conclusions and Future Work

MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.

In this paper we investigated the identification of MWEs in a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained

6

show that in comparison with the manual extrac-tion of MWEs this approach can provide also ageneral set of MWE candidates in addition to themanually selected technical terms

Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of candidates considered true MWEs by both judges or by at least one of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to note that in this manual analysis the human experts classified the MWEs as true independently of whether they are Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.

In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.

Acknowledgments

We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for the financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.

References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50–59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047–2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49–56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117–134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.

Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165–178. Oxford University Press, Oxford.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534–559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33–40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440–447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1–7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17–24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1–15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034–1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415–432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia, July. Association for Computational Linguistics.



Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles

Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.

1 Introduction

Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).

In the past, various attempts have been made to boost automatic keyphrase extraction performance based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.

Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g., (Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007)), most of them do not effectively deal with the various keyphrase forms, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.

Secondly, previous studies have shown the effectiveness of their own features, but not many compared their features with other existing features. That leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, the supervised approach involves a large amount of manual labor, which reduces its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthy study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.

This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method which improves the coverage of candidates that occur in various forms. Our second contribution re-examines existing keyphrase extraction features reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.

In the remaining sections, we give an overview of related work in Section 2, our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. Then we evaluate our proposals, discuss the outcomes and conclude our work in Sections 7, 8 and 9, respectively.

2 Related Work

The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e., the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent (2) keyphrase cohesion features (i.e., the relationship among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e., the relationship among components in a keyphrase) (Park et al., 2004).

The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TFIDF (i.e., term frequency x inverse document frequency) and first occurrence in the document. TFIDF measures document cohesion, and first occurrence implies the importance of the abstract or introduction, indicating that keyphrases have a locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequences. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. Textract (Park et al., 2004) also ranks the candidate keyphrases by its judgment of the keyphrases' degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of other documents in the same cluster.

3 Keyphrase Analysis

In previous studies, KEA employed the indexing words as candidates, whereas others such as (Park et al., 2004; Nguyen and Kan, 2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used the articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for the data in detail.

Syntactically, keyphrases can be formed by either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g., mobile network, fast computing, partially observable Markov decision process), despite few incidences. They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.

However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked.


Criteria      Rules

Frequency     (Rule 1) Frequency heuristic, i.e., frequency ≥ 2 for simplex words vs. frequency ≥ 1 for NPs.

Length        (Rule 2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form (e.g., synchronous concurrent program vs. model of multiagent interaction).

Alternation   (Rule 3) of-PP form alternation (e.g., number of sensor = sensor number, history of past encounter = past encounter history).
              (Rule 4) Possessive alternation (e.g., agent's goal = goal of agent, security's value = value of security).

Extraction    (Rule 5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS) (e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture).
              (Rule 6) Simplex Word/NP IN Simplex Word/NP (e.g., quality of service; sensitivity of VOIP traffic (VOIP traffic extracted); simplified instantiation of zebroid (simplified instantiation extracted)).

Table 1: Candidate Selection Rules

In our study, we also found that keyphrases occur as simple conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when they occur in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).
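These alternations can be generated mechanically; the following minimal Python sketch handles the two patterns just described (the phrases are illustrative, and a real system would operate on POS-tagged candidates rather than raw strings).

    def alternations(phrase):
        """Generate word-order variants of a candidate keyphrase:
        of-PP form <-> compound form, and possessive -> of-PP form."""
        variants = set()
        words = phrase.split()
        if " of " in phrase:
            # of-PP -> compound: "quality of service" -> "service quality"
            head, modifier = phrase.split(" of ", 1)
            variants.add(modifier + " " + head)
        if words and words[0].endswith("'s"):
            # possessive -> of-PP: "agent's goal" -> "goal of agent"
            owner = words[0][:-2]
            variants.add(" ".join(words[1:]) + " of " + owner)
        return variants

    print(alternations("quality of service"))  # {'service quality'}
    print(alternations("agent's goal"))        # {'goal of agent'}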

4 Candidate Selection

We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated with term extraction studies, since the top Nth ranked terms become the keyphrases of documents. In previous studies, KEA employed the indexing words as candidates, whereas others such as (Park et al., 2004; Nguyen and Kan, 2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe the details of our keyphrase analysis.

In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.

Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (≥ 2) and NPs (≥ 1). Note that 30% of NPs occurred only once in our data.

Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule 6 for NPs in of-PP form broadens the coverage of possible candidates: i.e., given an NP in of-PP form, not only do we collect the simplex word(s), but we also extract the non-of-PP forms of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
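A minimal sketch of how Rules 5 and 6 can be applied to a POS-tagged sentence follows. The tag sets follow Table 1; the sentence and the simple substring check for the of-PP pattern are our own illustrative simplifications, not the exact implementation.

    NOUN = {"NN", "NNS", "NNP", "NNPS"}
    NP_TAGS = NOUN | {"JJ", "JJR", "JJS"}

    def base_nps(tagged):
        """Rule 5: maximal runs of adjectives/nouns, trimmed to end in a noun."""
        spans, start = [], None
        for i, (_, tag) in enumerate(tagged + [("", "EOS")]):
            if tag in NP_TAGS:
                start = i if start is None else start
            elif start is not None:
                end = i
                while end > start and tagged[end - 1][1] not in NOUN:
                    end -= 1            # drop trailing adjectives
                if end > start:
                    spans.append(" ".join(w for w, _ in tagged[start:end]))
                start = None
        return spans

    def candidates(tagged):
        """Rule 6: for 'NP1 of NP2', emit the whole phrase and both parts."""
        nps = base_nps(tagged)
        text = " ".join(w for w, _ in tagged)
        cands = list(nps)
        for np1 in nps:
            for np2 in nps:
                if np1 + " of " + np2 in text:
                    cands.append(np1 + " of " + np2)
        return cands

    sent = [("effective", "JJ"), ("algorithm", "NN"), ("of", "IN"),
            ("grid", "NN"), ("computing", "NN")]
    print(candidates(sent))
    # ['effective algorithm', 'grid computing',
    #  'effective algorithm of grid computing']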

5 Feature Engineering

With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls on careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S/U).

5.1 Document Cohesion

Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.

F1: TFIDF (S/U). TFIDF indicates document cohesion by looking at the frequency of terms in the documents and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF in substrings as well as in exact matches. For example, grid computing is often a substring of other phrases, such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e., we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system (a small code sketch of the substring-aware TF follows the list):

• (F1a) TFIDF
• (F1b) TF including counts of substrings
• (F1c) TF of substrings as a separate feature
• (F1d) TF normalized by candidate type (i.e., simplex words vs. NPs)
• (F1e) normalized TF by candidate type as a separate feature
• (F1f) IDF using Google n-grams
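As a rough illustration of F1b/F1a with F1f, here is a minimal sketch of the substring-aware TF, with IDF taken from a generic n-gram count table; the count dictionary is a hypothetical stand-in for the Google n-gram data.

    import math

    def substring_tf(candidate, doc_phrases):
        """Count exact occurrences of the candidate plus its
        token-boundary occurrences inside longer phrases (F1b)."""
        cand = candidate.split()
        n = len(cand)
        tf = 0
        for phrase in doc_phrases:
            toks = phrase.split()
            tf += sum(toks[i:i + n] == cand
                      for i in range(len(toks) - n + 1))
        return tf

    def tf_idf(candidate, doc_phrases, ngram_count, total):
        df = ngram_count.get(candidate, 1)   # unseen -> smallest count
        return substring_tf(candidate, doc_phrases) * math.log(total / df)

    phrases = ["grid computing", "grid computing algorithm",
               "efficient grid computing algorithm"]
    counts = {"grid computing": 120000}      # hypothetical n-gram count
    print(substring_tf("grid computing", phrases))            # 3
    print(tf_idf("grid computing", phrases, counts, 10 ** 9)) # ~27.1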

F2: First Occurrence (S/U). KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.

F3: Section Information (S/U). Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section head, title and/or references.

F4: Additional Section Information (S/U). We first added the related work or previous work section as section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section head contains many fewer due to the variation in size.

• (F4a) section 'related/previous work'
• (F4b) counting substrings occurring in key sections
• (F4c) section TF across all key sections
• (F4d) weighting key sections according to the portion of keyphrases found

F5: Last Occurrence (S/U). Similar to the distance feature in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.

5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.

F6: Co-occurrence of Another Candidate in Section (S/U). When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.

F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of what words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer[1] collection to create this feature.

• (F7a) co-occurrence (Boolean) in title collection
• (F7b) co-occurrence (TF) in title collection

F8: Keyphrase Cohesion (S/U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top N ranked candidates against the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.

[1] It contains 13M titles from articles, papers and reports.
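A minimal sketch of this approximation: rows are candidate words, columns are their neighboring words, and candidates are compared by cosine. The toy sentences, window size and candidate set are illustrative assumptions.

    import math
    from collections import Counter, defaultdict

    def context_vectors(sentences, candidates, window=2):
        """Co-occurrence counts of each candidate with its neighbors
        (a real system would keep nouns, verbs and adjectives only)."""
        vectors = defaultdict(Counter)
        for sent in sentences:
            toks = sent.split()
            for i, tok in enumerate(toks):
                if tok in candidates:
                    for j in range(max(0, i - window),
                                   min(len(toks), i + window + 1)):
                        if j != i:
                            vectors[tok][toks[j]] += 1
        return vectors

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    sents = ["the keyphrase extraction system ranks candidates",
             "the extraction system scores keyphrase candidates"]
    vecs = context_vectors(sents, {"keyphrase", "extraction"})
    print(cosine(vecs["keyphrase"], vecs["extraction"]))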

5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values for internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).

F9: Term Cohesion (S/U). Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our earlier discussion (a small code sketch follows the list below):

• (F9a) term cohesion as in (Park et al., 2004)
• (F9b) TF normalized by candidate type (i.e., simplex words vs. NPs)
• (F9c) applying different weights by candidate type
• (F9d) normalized TF and different weighting by candidate type
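As an illustration of F9a, here is a minimal sketch using a generalized Dice coefficient for n-word terms, with the 10% simplex-word discount described above; the frequency tables are hypothetical, and the exact formulation in (Park et al., 2004) may differ in detail.

    def dice_cohesion(term, word_freq, phrase_freq):
        """Generalized Dice coefficient:
             n * f(w1..wn) / (f(w1) + ... + f(wn)).
        For a simplex word this degenerates to 1.0, which is
        discounted by 10% following Park et al. (2004)."""
        words = term.split()
        n = len(words)
        if n == 1:
            return 0.9
        denom = sum(word_freq.get(w, 1) for w in words)
        return n * phrase_freq.get(term, 0) / denom

    word_freq = {"grid": 40, "computing": 60}    # hypothetical counts
    phrase_freq = {"grid computing": 25}
    print(dice_cohesion("grid computing", word_freq, phrase_freq))  # 0.5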

5.4 Other Features

F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K, to attest our candidate selection method.

F11: POS sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like the acronym feature, this is also subject to the data set.


F12: Suffix sequence (S). Similar to acronyms, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data-dependent and is thus used in the supervised approach only.

F13: Length of Keyphrases (S/U). Barker and Corrnacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.

6 System and Data

To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. Then we partitioned the text into sections, such as the title and body sections, via heuristic rules, and applied a sentence segmenter,[2] ParsCit[3] (Councill et al., 2008) for reference collection, a part-of-speech tagger[4] and a lemmatizer[5] (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy[6] and simple weighting.

In evaluation, we collected 250 papers from four different categories[7] of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found that many were missing or appeared only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the number of author-assigned and reader-assigned keyphrases found in the documents.

[2] http://www.eng.ritsumei.ac.jp/asao/resources/sentseg
[3] http://wing.comp.nus.edu.sg/parsCit
[4] http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
[5] http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
[6] http://maxent.sourceforge.net/index.html
[7] C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence – Multiagent Systems) and J.4 (Social and Behavioral Sciences – Economics)

           Author        Reader         Combined
Total      1252 (53)     3110 (111)     3816 (146)
NPs        904           2537           3027
Average    3.85 (4.01)   12.44 (12.88)  15.26 (15.85)
Found      769           2509           2864

Table 2: Statistics of Keyphrases

7 Evaluation

The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information and additional section information (i.e., F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence and suffix sequence.

Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10 and 15 candidates.

BestFeatures includes F1c (TF of substring as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion by (Park et al., 2004)) and F13 (length of keyphrases). Best-TFIDF means using all the best features except TFIDF.

In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).[8]

In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and "no effect" indicates no effect or an unconfirmed effect due to small changes in performance. Again, Supervised denotes Maximum Entropy training and Unsupervised is our unsupervised approach.

8 Discussion

We compared the performance of our candidate selection and feature engineering with simple KEA, N&K and our baseline system.

[8] Due to the page limits, we present the best performance.


Method / Features / C, followed by Match, Precision, Recall and F-score over the top Five | Ten | Fifteen candidates:

All Candidates
  KEA      U   0.03  0.64  0.21  0.32 |  0.09  0.92  0.60  0.73 |  0.13  0.88  0.86  0.87
  KEA      S   0.79 15.84  5.19  7.82 |  1.39 13.88  9.09 10.99 |  1.84 12.24 12.03 12.13
  N&K      S   1.32 26.48  8.67 13.06 |  2.04 20.36 13.34 16.12 |  2.54 16.93 16.64 16.78
  baseline U   0.92 18.32  6.00  9.04 |  1.57 15.68 10.27 12.41 |  2.20 14.64 14.39 14.51
  baseline S   1.15 23.04  7.55 11.37 |  1.90 18.96 12.42 15.01 |  2.44 16.24 15.96 16.10

Length ≤ 3 Candidates
  KEA      U   0.03  0.64  0.21  0.32 |  0.09  0.92  0.60  0.73 |  0.13  0.88  0.86  0.87
  KEA      S   0.81 16.16  5.29  7.97 |  1.40 14.00  9.17 11.08 |  1.84 12.24 12.03 12.13
  N&K      S   1.40 27.92  9.15 13.78 |  2.10 21.04 13.78 16.65 |  2.62 17.49 17.19 17.34
  baseline U   0.92 18.40  6.03  9.08 |  1.58 15.76 10.32 12.47 |  2.20 14.64 14.39 14.51
  baseline S   1.18 23.68  7.76 11.69 |  1.90 19.00 12.45 15.04 |  2.40 16.00 15.72 15.86

Length ≤ 3 Candidates + Alternation
  KEA      U   0.01  0.24  0.08  0.12 |  0.05  0.52  0.34  0.41 |  0.07  0.48  0.47  0.47
  KEA      S   0.83 16.64  5.45  8.21 |  1.42 14.24  9.33 11.27 |  1.87 12.45 12.24 12.34
  N&K      S   1.53 30.64 10.04 15.12 |  2.31 23.08 15.12 18.27 |  2.88 19.20 18.87 19.03
  baseline U   0.98 19.68  6.45  9.72 |  1.72 17.24 11.29 13.64 |  2.37 15.79 15.51 15.65
  baseline S   1.33 26.56  8.70 13.11 |  2.09 20.88 13.68 16.53 |  2.69 17.92 17.61 17.76

Table 3: Performance on Proposed Candidate Selection

Features / C, followed by Match, Precision, Recall and F-score over the top Five | Ten | Fifteen candidates:

Best            U   1.14 22.8  7.47 11.3 |  1.92 19.2 12.6 15.2 |  2.61 17.4 17.1 17.3
Best            S   1.56 31.2 10.2  15.4 |  2.50 25.0 16.4 19.8 |  3.15 21.0 20.6 20.8
Best w/o TFIDF  U   1.14 22.8  7.4  11.3 |  1.92 19.2 12.6 15.2 |  2.61 17.4 17.1 17.3
Best w/o TFIDF  S   1.56 31.1 10.2  15.4 |  2.46 24.6 16.1 19.4 |  3.12 20.8 20.4 20.6

Table 4: Performance on Feature Engineering

Effect      Method        Features
+           Supervised    F1a, F2, F3, F4a, F4d, F9a
            Unsupervised  F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-           Supervised    F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
            Unsupervised  F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
no effect   Supervised    F1e, F10, F11, F12
            Unsupervised  F1b

Table 5: Performance on Each Feature

In evaluating candidate selection, we found that longer-length candidates act as noise and thus decreased the overall performance. We also confirmed that candidate alternation captures the flexibility of keyphrases, leading to higher candidate coverage as well as better performance.

To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not impact performance. We attribute this to the fact that many keyphrases are substrings of candidates and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence are helpful, as they imply the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.

As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 315, the unsupervised classifier obtains a count of 261. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.


9 Conclusion

We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.

10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.

References

Ken Barker and Nadia Corrnacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10–17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information and lexicography. In Proceedings of ACL, 1989, pp. 76–83.

Isaac Councill, C. Lee Giles and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28–30.

Ernesto D'Avanzo and Bernardo Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433–447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 26.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668–673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81–104.

Khaled Hammouda, Diego Matute and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beáta Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537–544.

Mario Jarmasz and Caroline Barrière. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLiNE, 2004.

Dawn Lawrie, W. Bruce Croft and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349–357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157–169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296–297.

Guido Minnen, John Carroll and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207–223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317–326.

Youngja Park, Roy Byrd and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48–55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434–439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254–256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.



Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu

Abstract

We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.

1 Introduction

In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.

For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE meaning is transparent, i.e., predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, whereas make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both expressions belong to a class of idiomatic MWEs known as verb-noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.

To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning and that over 40% of their usages in text were actually literal. Thus, it would be important for an NLP application such as machine translation, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as misclassification could potentially have detrimental effects on the quality of the translation.


In this paper, we address the problem of MWE classification for verb-noun construction (VNC) tokens in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.

This paper is organized as follows. In Section 2, we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5, we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.

2 Multi-word Expressions

MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional, fixed expressions such as kingdom come, ad hoc, and non-fixed expressions such as break new ground, speak of the devil. The VNCs on which we focus in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living and Verb Particle Constructions (VPC) such as write-up, call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.

3 Related Work

Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine if the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.

Hashimoto and Kawahara (2008) (HK08) is the first large-scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features such as POS, lemma information and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features such as voice, negativity and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.

4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.[1] YamCha is based on Support Vector Machines technology using degree-2 polynomial kernels.

We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside of a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.

We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features, such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma form (LEMMA), or citation form, of the word, and named entity (NE) information. The latter feature was shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets, explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
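The column format fed to a labeler like YamCha can be produced mechanically. Below is a minimal sketch using the running example from the text; the helper function is our own illustration, not part of YamCha.

    def to_columns(tokens, pos_tags, iob_tags, use_ngram=True):
        """One token per line: word, POS, optional last-3-character
        NGRAM feature, and the gold IOB label in the last column."""
        lines = []
        for word, pos, tag in zip(tokens, pos_tags, iob_tags):
            cols = [word, pos]
            if use_ngram:
                cols.append(word[-3:])       # character n-gram feature
            cols.append(tag)
            lines.append(" ".join(cols))
        return "\n".join(lines)

    tokens = ["John", "kicked", "the", "bucket", "last", "Friday"]
    pos    = ["NN", "VBD", "Det", "NN", "ADV", "NN"]
    iob    = ["O", "B-I", "I-I", "I-I", "O", "O"]
    print(to_columns(tokens, pos, iob))
    # John NN ohn O
    # kicked VBD ked B-I
    # ... and so on for the remaining tokens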

With the NE feature, we followed the same representation as for the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN IdentiFinder software, which identifies 19 NE tags.[2] We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other where we have a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class: Named Entity In-Text (NEI). For example, for the NEI condition, our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER".

[1] http://www.tado-chasen.com/yamcha
[2] http://www.bbn.com/identifinder

5 Experiments and Results

5.1 Data

We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC).[3] The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations[4] corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing and 10% for development. The data used is the tokenized version of the BNC.

5.2 Evaluation Metrics

We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results.[5] We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
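For concreteness, a minimal sketch of the per-class score computation; the counts in the example are illustrative, not taken from our experiments.

    def precision_recall_f(tp, fp, fn):
        """F(beta=1): harmonic mean of precision and recall."""
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # e.g., idiomatic chunks: 180 correct, 20 spurious, 25 missed
    print(precision_recall_f(180, 20, 25))  # (0.9, ~0.878, ~0.889)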

5.3 Results

We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.[6]

[3] http://www.natcorp.ox.ac.uk
[4] A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
[5] We do not think that accuracy should be reported in general, since it is an inflated result: it is not a measure of error, as all words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.

In Table 2, we present the results yielded per feature and per condition. We first experimented with different context sizes to decide on the optimal window size for our learning framework; the results are presented in Table 1. Once that was determined, we proceeded to add features.

Noting that a window size of ±3 yields the best results, we proceed to use that as our context size for the following experimental conditions. We will not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04%.

Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence the significant contribution to performance.

6 Discussion

The overall results strongly suggest that using linguistically interesting features explicitly has a positive impact on performance. NE features help the most, and combining them with other features

6 We could have easily identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal score baseline; however, for this investigation we decided to work strictly with the gold standard data set exclusively.

yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially for performance on literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors confusing idiomatic cases with literal ones 23 times, and the opposite 4 times.

Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, and make top. Cases of idiomatic expressions erroneously classified as literal are the MWE types hit the road, blow trumpet, blow whistle, and hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

Also, in the sentence As the ball hit the post, the referee blew the whistle, the system identified blew the whistle as a literal VNC in this context, and it identified hit the post as another literal VNC.

7 Conclusion

In this study we explore a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system. Using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance with an overall F-measure of 84.58. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.


Window   IDM-F   LIT-F   Overall F   Overall Acc
-/+1     77.93   48.57   71.78       96.22
-/+2     85.38   55.61   79.71       97.06
-/+3     86.99   55.68   81.25       96.93
-/+4     86.22   55.81   80.75       97.06
-/+5     83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying context window size.

                 IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ             70.02   89.16   78.44   0       0       0       69.68
TOK              81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA          83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM          83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS            83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P            86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full         85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin          84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI              89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P   89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P    90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P        91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data using different features and their combinations.

8 Acknowledgement

The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, who helped in making this publication more concise and better presented.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLING.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.



Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de

Abstract

Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.

1 Introduction

Parallel corpora have proved to be a valuable resource not only for statistical machine translation, but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:

(1) Der Rat sollte unsere Position berücksichtigen
    The Council should our position consider

(2) The Council should take account of our position

This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual

correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems to automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that the method is able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g., (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moiron and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for this item in a target language (and whether these


one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem. As the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches that need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.

In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondences that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, but is based on the intuition that monolingual idiosyncrasies like the MWE realization of an entity are not likely to be mirrored in another language (see Section 4 for discussion).

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Therefore, Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.

2 Multiword Translations as MWEs

The idea to exploit one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover linguistic pat-

Verb                 1-1    1-n    n-1    n-n    No
anheben (v1)         53.5   21.2   9.2    1.6    32.5
bezwecken (v2)       16.7   51.3   0.6    31.3   15.0
riskieren (v3)       46.7   35.7   0.5    1.7    18.2
verschlimmern (v4)   30.2   21.5   28.6   44.5   27.5

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

terns exhibited by one-to-many translations.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by (Graca et al., 2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.

A closer look at the one-to-many translations reveals that these cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as an MWE. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations. This type of one-to-many translation is mainly due to non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.


(3) Sie verschlimmern die Übel
    They aggravate the misfortunes

(4) Their action is aggravating the misfortunes

Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der Ausschuss bezweckt den Institutionen ein politisches Instrument an die Hand zu geben
    The committee intends the institutions a political instrument at the hand to give

(6) The committee set out to equip the institutions with a political instrument

Verb preposition combinations. While this class isn't discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.

(7) Sie werden den Treibhauseffekt verschlimmern
    They will the green house effect aggravate

(8) They will add to the green house effect

Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. However, it can be considered an idiosyncratic combination, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).

(9) Ich werde die Sache nur noch verschlimmern
    I will the thing only just aggravate

(10) I am just making things more difficult

Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to its limited size and the source items we chose.

(11) Sie bezwecken die Umgestaltung in eine zivile Nation
     They intend the conversion into a civil nation

(12) They have in mind the conversion into a civil nation

          v1        v2        v3        v4
N type    22 (26)   41 (47)   26 (35)   17 (24)
V Part    22.7      4.9       0.0       0.0
V Prep    36.4      41.5      3.9       5.9
LVC       18.2      29.3      88.5      88.2
Idiom     0.0       2.4       0.0       0.0
Para.     36.4      24.3      11.5      23.5

Table 2: Proportions of MWE types per lemma.

Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, like in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.

(13) Wir brauchen bessere Zusammenarbeit um die Rückzahlungen anzuheben
     We need better cooperation to the repayments(OBJ) increase

(14) We need greater cooperation in this respect to ensure that repayments increase

Table 2 displays the proportions of the MWE categories for the number of types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variations only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g., they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) or bezwecken (en. intend) are frequently


translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).

In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.

3 Multiword Translation Detection

This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.

3.1 One-to-many Alignments

To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, which uses the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the rate of translational correspondences that the system discovers, on the type level, against the one-to-many translations in our gold standard. By type, we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the

        n > 0           n > 1           n > 3
verb    FPR    FNR      FPR    FNR      FPR    FNR
v1      0.97   0.93     1.0    1.0      1.0    1.0
v2      0.93   0.9      0.5    0.96     0.0    0.98
v3      0.88   0.83     0.8    0.97     0.67   0.97
v4      0.98   0.92     0.8    0.92     0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.

translation types is so low that already at a threshold of 3, almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that corresponds to the source token is rare.

This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g., particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die Korruption behindert die Entwicklung
     The corruption impedes the development

(16) Corruption constitutes an obstacle to development

Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.

3.2 Aligning Syntactic Configurations

High-precision extraction of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found; 2) how to align target items that have a low occurrence correlation.

We argue that both of these problems can be addressed by taking syntactic information into ac-


count. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering of the configurations; c) alignment post-editing, i.e., assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.

[Figure not reproducible: the German verb behindert with its arguments X1 and Y1 aligned to an English dependency configuration rooted in create, where the path via an obstacle to dominates the arguments X2 and Y2.]

Figure 1: Example of a typical syntactic MWE configuration.

Data. We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (DG, DE, AGE). DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. DE is the set of dependency triples of the target sentence. AGE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
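To make this representation concrete, the following is a minimal sketch in Python of the triple-based encoding; the class names are our own illustration, not the authors' implementation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DepTriple:
        head: str   # s1: the governing word
        rel: str    # dependency type, e.g. "SB" (subject) or "OA" (object)
        dep: str    # s2: the dependent word

    @dataclass
    class ParallelSentence:
        d_source: set   # D_G: dependency triples of the German sentence
        d_target: set   # D_E: dependency triples of the English sentence
        align: set      # A_GE: alignment pairs (source word, target word)

    # sentence pair (15)/(16), reduced to lemmas:
    sent = ParallelSentence(
        d_source={DepTriple("behindern", "SB", "Korruption"),
                  DepTriple("behindern", "OA", "Entwicklung")},
        d_target={DepTriple("constitute", "SBJ", "corruption"),
                  DepTriple("constitute", "OBJ", "obstacle"),
                  DepTriple("obstacle", "NMOD", "to"),
                  DepTriple("to", "PMOD", "development")},
        align={("Korruption", "corruption"), ("Entwicklung", "development")},
    )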

Candidate Generation. This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g., its arguments). An example input would be the configuration {(verb, SB, ·), (verb, OA, ·)} for our German verbs.

Filtering. Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn ∈ DG there is a tuple (sn, tn) ∈ AGE, i.e., every dependent has an alignment.
3. There is a target item t1 ∈ DE such that for each tn there is a p ⊂ DE such that p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus, the target dependents have a common root.

To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, the target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb. A sketch of the filtering conditions is given below.
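A minimal sketch of how conditions 2 and 3 might be checked, reusing the encoding from the sketch above; the helper names and the path-length cutoff are illustrative assumptions, not the authors' code.

    def aligned_targets(sent, source_dep):
        # condition 2: collect the target words aligned to a source dependent
        return {t for (s, t) in sent.align if s == source_dep}

    def connects(d_target, root, goal, max_len=3, seen=None):
        # condition 3: is there a dependency path root -> ... -> goal in D_E?
        # max_len is an assumed noise filter on path length
        if max_len < 0:
            return False
        seen = seen or set()
        for tri in d_target:
            if tri.head == root and tri.dep not in seen:
                if tri.dep == goal or connects(d_target, tri.dep, goal,
                                               max_len - 1, seen | {tri.dep}):
                    return True
        return False

    def valid_configuration(sent, source_verb):
        deps = [t.dep for t in sent.d_source if t.head == source_verb]
        target_sets = [aligned_targets(sent, d) for d in deps]
        if not all(target_sets):              # condition 2 fails
            return None
        # look for a common target root t1 dominating one aligned item
        # per source dependent
        for t1 in {t.head for t in sent.d_target}:
            if all(any(connects(sent.d_target, t1, tn) for tn in ts)
                   for ts in target_sets):
                return t1
        return None

On the example above, valid_configuration(sent, "behindern") returns "constitute", the common root dominating both aligned arguments.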

Alignment Post-editing. In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in Section 3.2 was that all


the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.

Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) ∈ DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) ∈ AGE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti ∈ T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.

2. For each ti, tx such that there is a (ti, rel, tx) ∈ DE, compute corr(s1, T + tx).

3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.

As the Dice coefficient is often reported to give the best results, e.g., in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:

$corr(s_1, T) = \frac{2\,freq(s_1 \wedge T)}{freq(s_1) + freq(T)}$

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti ∈ T, and freq(s1) accordingly. The observed frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
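A compact sketch of this greedy Dice-based post-editing loop (our own rendering; the corpus is abstracted as a list of (source lemmas, target lemmas) sentence pairs):

    def dice(corpus, s1, T):
        # corr(s1, T) = 2 freq(s1 AND T) / (freq(s1) + freq(T))
        f_s = sum(1 for src, _ in corpus if s1 in src)
        f_t = sum(1 for _, tgt in corpus if T <= set(tgt))
        f_both = sum(1 for src, tgt in corpus
                     if s1 in src and T <= set(tgt))
        return 2 * f_both / (f_s + f_t) if (f_s + f_t) else 0.0

    def post_edit(corpus, d_target, s1, T):
        # try to grow T with dependents of items already in the configuration
        candidates = {tri.dep for tri in d_target if tri.head in T} - T
        for tx in candidates:
            if dice(corpus, s1, T | {tx}) >= dice(corpus, s1, T):
                T = T | {tx}
        return T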

The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD, ·), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ, ·)), which is the syntactic representation for the English MWE run the risk of.

3.3 Evaluation

Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard, to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation. In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma don't figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that doesn't constrain the syntactic type of the parallel argument relations. While not decreasing the FNR, the FPR decreases significantly. This means that the syntactic filters mainly fire on noisy configurations and don't decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down


the number of candidate configurations that our algorithm detects. As the dependency parses don't provide deep analyses for tense or control phenomena, very often a verb's arguments don't figure as its syntactic dependents, and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.

MWE evaluation. In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. We selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.

4 Related Work

The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation, which specifies formal relations between bilingual parses, was

      Strict Filter      Relaxed Filter
      FPR     FNR        FPR     FNR
v1    0.0     0.96       0.5     0.96
v2    0.25    0.88       0.47    0.79
v3    0.25    0.74       0.56    0.63
v4    0.0     0.875      0.56    0.84

Table 4: False positive and false negative rate of one-to-many translations.

Trans. type     Proportion    MWE type   Proportion
MWEs            57.5          V Part     8.2
                              V Prep     51.8
                              LVC        32.4
                              Idiom      10.6
Paraphrases     24.4
Alternations    1.0
Noise           17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.

established by (Wu, 1997). Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moiron and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches don't rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.

By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado, 2009).

The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already


been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of simplex realization in a source language to an idiosyncratic multiword realization in the target language.

5 Conclusion

We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further exploit these for monolingual interpretation of MWEs.

Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.

References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 311–326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38/2008, INESC-ID, Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 Workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Comput. Linguist., 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.



A Re-examination of Lexical Association Measures

Abstract

We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which shows significant improvement over the original AMs.

1 Introduction

Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures for extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood

ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.

While previous works have evaluated AMs, there have been few details on why the AMs perform as they do. A detailed analysis of why these AMs perform as they do is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.

We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the

Hung Huu Hoang
Dept. of Computer Science
National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg


mobility of noun phrase complements in transitive VPCs. They can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.

2 Classification of Association Measures

Although different AMs have different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure this association between constituents, using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ² test.

Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then

follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence. A sketch of both strategies is given below.
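As an illustration of both Class II strategies, the following small sketch (ours, not from the paper) scores a phrase by the entropy of its context distribution and by the cosine between the context vectors of the phrase and of one of its constituents:

    import math
    from collections import Counter

    def context_entropy(context_words):
        # Diversity of contexts: under the Class II assumption, a
        # non-compositional phrase has a more restricted context set,
        # hence lower entropy.
        counts = Counter(context_words)
        total = sum(counts.values())
        return -sum((n / total) * math.log(n / total)
                    for n in counts.values())

    def cosine(v1, v2):
        # Context similarity between a phrase and one of its constituents:
        # low similarity hints that the phrase is non-compositional.
        dot = sum(v1[w] * v2.get(w, 0) for w in v1)
        n1 = math.sqrt(sum(x * x for x in v1.values()))
        n2 = math.sqrt(sum(x * x for x in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    # toy context counts for "carry out" vs. the bare verb "carry"
    phrase_ctx = Counter({"task": 4, "order": 3, "plan": 3})
    verb_ctx = Counter({"bag": 5, "weight": 3, "task": 1})
    print(context_entropy(["task"] * 4 + ["order"] * 3 + ["plan"] * 3))
    print(cosine(phrase_ctx, verb_ctx))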

Table 1 lists all AMs used in our discussion.

The legend accompanying the table defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.

3 Evaluation

We first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we discuss how the performance of AMs is measured in our experiments.

3.1 Evaluation Data

In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus, as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.

For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.


Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,

M1  Joint Probability:        $f(xy)/N$
M2  Mutual Information:       $\frac{1}{N}\sum_{i,j} f_{ij}\log\frac{f_{ij}}{\hat{f}_{ij}}$
M3  Log likelihood ratio:     $2\sum_{i,j} f_{ij}\log\frac{f_{ij}}{\hat{f}_{ij}}$
M4  Pointwise MI (PMI):       $\log\frac{P(xy)}{P(x*)P(*y)}$
M5  Local-PMI:                $f(xy)\times\mathrm{PMI}$
M6  PMI^k:                    $\log\frac{N f(xy)^k}{f(x*)f(*y)}$
M7  PMI^2:                    $\log\frac{N f(xy)^2}{f(x*)f(*y)}$
M8  Mutual Dependency:        $\log\frac{P(xy)^2}{P(x*)P(*y)}$
M9  Driver-Kroeber:           $\frac{a}{\sqrt{(a+b)(a+c)}}$
M10 Normalized expectation:   $\frac{2a}{2a+b+c}$
M11 Jaccard:                  $\frac{a}{a+b+c}$
M12 First Kulczynski:         $\frac{a}{b+c}$
M13 Second Sokal-Sneath:      $\frac{a}{a+2(b+c)}$
M14 Third Sokal-Sneath:       $\frac{a+d}{b+c}$
M15 Sokal-Michiner:           $\frac{a+d}{a+b+c+d}$
M16 Rogers-Tanimoto:          $\frac{a+d}{a+2b+2c+d}$
M17 Hamann:                   $\frac{(a+d)-(b+c)}{a+b+c+d}$
M18 Odds ratio:               $\frac{ad}{bc}$
M19 Yule's ω:                 $\frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}$
M20 Yule's Q:                 $\frac{ad-bc}{ad+bc}$
M21 Brawn-Blanquet:           $\frac{a}{\max(a+b,\,a+c)}$
M22 Simpson:                  $\frac{a}{\min(a+b,\,a+c)}$
M23 S cost:                   $\log\!\left(1+\frac{\min(b,c)}{a+1}\right)^{-1/2}$
M24 Adjusted S cost (*):      $\log\!\left(1+\frac{\max(b,c)}{a+1}\right)^{-1/2}$
M25 Laplace:                  $\frac{a+1}{a+\min(b,c)+2}$
M26 Adjusted Laplace (*):     $\frac{a+1}{a+\max(b,c)+2}$
M27 Fager:                    $[M9]-\frac{1}{2}\max(b,c)$
M28 Adjusted Fager (*):       $[M9]-\frac{\max(b,c)}{\sqrt{aN}}$
M29 Normalized PMIs (*):      $\frac{\mathrm{PMI}}{NF(\alpha)}$, $\frac{\mathrm{PMI}}{NFMax}$
M30 Simplified normalized PMI for VPCs (*):  $\log\frac{ad}{\alpha b+(1-\alpha)c}$
M31 Normalized MIs (*):       $\frac{\mathrm{MI}}{NF(\alpha)}$, $\frac{\mathrm{MI}}{NFMax}$

where $NF(\alpha) = \alpha P(x*) + (1-\alpha)P(*y)$, $\alpha\in[0,1]$, and $NFMax = \max(P(x*), P(*y))$.

Legend: contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies ($\bar{w}$ stands for all words except w; * stands for all words; N is the total number of bigrams): $a = f_{11} = f(xy)$, $b = f_{12} = f(x\bar{y})$, $c = f_{21} = f(\bar{x}y)$, $d = f_{22} = f(\bar{x}\bar{y})$; $f(x*) = a+b$, $f(*y) = a+c$. The expected frequency under the independence assumption is $\hat{f}(xy) = f(x*)f(*y)/N$.

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.
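As a concrete reading of the legend, the following sketch (ours, not the paper's) computes a few of the Table 1 measures from the raw contingency counts of a bigram:

    import math

    def measures(a, b, c, d):
        # a, b, c, d: contingency counts of the bigram (x, y) as in Table 1
        N = a + b + c + d
        fx, fy = a + b, a + c            # marginal frequencies f(x*), f(*y)
        return {
            "joint_prob": a / N,                           # M1
            "pmi": math.log(N * a / (fx * fy)),            # M4
            "local_pmi": a * math.log(N * a / (fx * fy)),  # M5
            "jaccard": a / (a + b + c),                    # M11
            "odds_ratio": (a * d) / (b * c),               # M18
        }

    print(measures(a=30, b=5, c=8, d=1000))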


permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.
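A schematic rendering of this windowed search (our own simplification, not the authors' code; it assumes lemmatized tokens, so "made" maps to "make", and Penn Treebank-style POS tags, and it omits the analogous VPC case):

    LIGHT_VERBS = {"do", "get", "give", "have", "make", "put", "take"}

    def find_lvc_candidates(lemmas, pos_tags, max_gap=4):
        # lemmas/pos_tags: parallel lists for one sentence
        candidates = []
        for i, (lem, pos) in enumerate(zip(lemmas, pos_tags)):
            if pos.startswith("VB") and lem in LIGHT_VERBS:
                # search right for the nearest noun, allowing at most
                # max_gap intervening words
                right = range(i + 1, min(i + 2 + max_gap, len(lemmas)))
                noun = next((j for j in right
                             if pos_tags[j].startswith("NN")), None)
                if noun is None:
                    # passive use ("the presentation was made"): fall back
                    # to searching on the other side of the light verb
                    left = range(i - 1, max(i - 2 - max_gap, -1), -1)
                    noun = next((j for j in left
                                 if pos_tags[j].startswith("NN")), None)
                if noun is not None:
                    candidates.append((lem, lemmas[noun]))
        return candidates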

Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in (Baldwin, 2005), and 464 annotated LVC candidates used in (Tan et al., 2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                      VPC data       LVC data    Mixed
Total (freq >= 6)     413            100         513
Positive instances    117 (28.33%)   28 (28%)    145 (23.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary

goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.

3.2 Evaluation Metric

To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates

(the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
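For concreteness, a minimal sketch of average precision over a ranked candidate list (our own formulation of the standard metric):

    def average_precision(ranked_labels):
        # ranked_labels: candidates sorted by AM score (best first),
        # True for a gold-positive MWE, False otherwise.
        hits, precisions = 0, []
        for i, is_positive in enumerate(ranked_labels, start=1):
            if is_positive:
                hits += 1
                precisions.append(hits / i)  # precision at each recall point
        return sum(precisions) / len(precisions) if precisions else 0.0

    # e.g. a toy ranking with positives at ranks 1, 3 and 4:
    print(average_precision([True, False, True, True, False]))  # ~0.806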

3.3 Evaluation Results

Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that is relied on in hypothesis tests.

4 Rank Equivalence

We note that some AMs, although not mathematically equivalent (i.e., assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡rC M2, if and only if M1(cj) > M1(ck) ⇒ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇒ M2(cj) = M2(ck) for all cj, ck belonging to C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:

34

Corollary If M1 r

Cequiv M2 then APC(M1) = APC(M2)

where APC(Mi) stands for the average precision of the AM Mi over the data set C Essentially M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same ignoring the actual calculated scores1 As an example the following 3 AMs Odds ratio Yulersquos ω and Yulersquos Q (Table 3 row 5) though not mathematically equivalent can be shown to be rank equivalent Five groups of rank equivalent AMs that we have found are listed in Table 3 This allows us to replace the below 15 AMs with their (most simple) representatives from each rank equivalent group

¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence, we still consider such AM pairs to be rank equivalent.

1) [M2] Mutual Information; [M3] Log likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (aka Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (aka Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
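Rank equivalence can also be verified empirically. The following minimal sketch (our code; m1 and m2 are assumed score functions) tests the definition directly over a candidate set C:

    # Two AMs are rank equivalent over C iff their score differences agree
    # in sign (>, =, <) on every pair of candidates from C.
    def rank_equivalent(m1, m2, C, eps=1e-12):
        for cj in C:
            for ck in C:
                d1, d2 = m1(cj) - m1(ck), m2(cj) - m2(ck)
                if (d1 > eps) != (d2 > eps) or (abs(d1) <= eps) != (abs(d2) <= eps):
                    return False
        return True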

5 Examination of Association Measures

We highlight two important findings in our analysis of the AMs over our English datasets: Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.

5.1 Mutual Information and Pointwise Mutual Information

In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in identifying LVCs. In this section we show how their performance can be improved significantly.

[Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Figure 1a gives the AP profile of the AMs over our VPC data set and Figure 1b the profile over our LVC data set, with AP plotted from .1 to .6. Bold points indicate AMs discussed in this paper. Legend: hypothesis-test AMs; ◊ = Class I AMs excluding hypothesis-test AMs; + = context-based AMs.]

Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U; V) = Σ_{u,v} p(u, v) log [ p(u, v) / (p(u) × p(v)) ].

In the context of bigrams, the above formula can be simplified to

[M2] MI = (1/N) Σ_{i,j} f_ij log (f_ij / f̂_ij),

where f_ij and f̂_ij are the observed and expected frequencies of the contingency table. While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log [ P(xy) / (P(x∗) × P(∗y)) ] = log [ N × f(xy) / (f(x∗) × f(∗y)) ].

It has long been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of the two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI by f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying PMI directly by f(xy) ([M5] Local-PMI) reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs           LVCs           Mixed
Best [M6] PMI^k    .547 (k=.13)   .573 (k=.85)   .544 (k=.32)
[M4] PMI           .510           .566           .515
[M5] Local-PMI     .259           .393           .272
[M1] Joint Prob.   .170           .28            .175

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.

Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X; Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x∗), −log P(∗y)).

These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α × P(x∗) + (1−α) × P(∗y), with α ∈ [0, 1], and NFmax = max(P(x∗), P(∗y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC task, respectively. If one rewrites MI as MI = (1/N) Σ_{i,j} f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, which explains why normalization is a pressing need for MI.
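For concreteness, a minimal sketch (our function names, not the paper's) of [M4] PMI, PMI^k and the proposed normalizations, computed from a bigram contingency table where a = f(xy), a+b = f(x∗), a+c = f(∗y) and N = a+b+c+d:

    import math

    def pmi(a, b, c, d):
        N = a + b + c + d
        return math.log(N * a / ((a + b) * (a + c)))            # [M4]

    def pmi_k(a, b, c, d, k):
        N = a + b + c + d
        p_xy, p_x, p_y = a / N, (a + b) / N, (a + c) / N
        return math.log(p_xy ** k / (p_x * p_y))                # [M6]; k = 1 gives PMI

    def pmi_nf_alpha(a, b, c, d, alpha):
        N = a + b + c + d
        nf = alpha * (a + b) / N + (1 - alpha) * (a + c) / N    # NF(alpha)
        return pmi(a, b, c, d) / nf

    def pmi_nf_max(a, b, c, d):
        N = a + b + c + d
        return pmi(a, b, c, d) / max((a + b) / N, (a + c) / N)  # NF_max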

AM           VPCs          LVCs          Mixed
MI/NF(α)     .508 (α=.48)  .583 (α=.47)  .516 (α=.5)
MI/NFmax     .508          .584          .518
[M2] MI      .273          .435          .289
PMI/NF(α)    .592 (α=.8)   .554 (α=.48)  .588 (α=.77)
PMI/NFmax    .565          .517          .556
[M4] PMI     .510          .566          .515

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.

5.2 Penalization Terms

It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of a candidate being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms: formulae whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Brawn-Blanquet (aka Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S Cost and [M25] Laplace, where we tried replacing the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                       VPCs   LVCs   Mixed
[M21] Brawn-Blanquet     .478   .578   .486
[M22] Simpson            .249   .382   .260
[M24] Adjusted S Cost    .485   .577   .492
[M23] S Cost             .249   .388   .260
[M26] Adjusted Laplace   .486   .577   .493
[M25] Laplace            .241   .388   .254

Table 6: Replacing min() with max() in selected AMs.
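A minimal sketch of this min()/max() contrast (our code, assuming the standard definitions of these two coefficients over a contingency table with a = f(xy) and b, c the complementary marginal counts):

    def brawn_blanquet(a, b, c):
        return a / (a + max(b, c))   # [M21]: penalizes the more frequent constituent

    def simpson(a, b, c):
        return a / (a + min(b, c))   # [M22]: penalizes the less frequent constituent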

In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is rank equivalent to [M7] PMI² (cf. group 2 of Table 3). In our application this AM is not good, since the second term is far larger than the first term, which is less than 1; as such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ with a scaled-down version of max(b, c). Using the independence assumption, we have approximately derived 1/(aN) as a lower-bound estimate of max(b, c), producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                     VPCs   LVCs   Mixed
[M28] Adjusted Fager   .564   .543   .554
[M27] Fager            .552   .439   .525

Table 7: Performance of Fager and its adjusted version.

The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs; thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary to scale b and c properly, as in the modified [M14]: −α × b − (1−α) × c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14]: PMI / (α × b + (1−α) × c). The denominator of [MR14] is obtained by removing the minus signs from the scaled [M14], so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α × P(x∗) + (1−α) × P(∗y), which was successfully used to divide PMI in the normalized PMI experiment. We heuristically simplified [MR14] to the following AM: [M30] = log(ad) / (α × b + (1−α) × c). The setting of α in Table 8 below is taken from the best α setting obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30] (the simplified [MR14]), while computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.

AM                              VPCs         LVCs          Mixed
PMI/NF(α)                       .592 (α=.8)  .554 (α=.48)  .588 (α=.77)
[M30] log(ad)/(α·b + (1−α)·c)   .600 (α=.8)  .484 (α=.48)  .588 (α=.77)
[M14] Third Sokal-Sneath        .565         .453          .546

Table 8: AP performance of suggested VPC penalization terms and AMs.
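A minimal sketch of the suggested measure (our function name; a, b, c, d are the usual contingency-table counts):

    import math

    def m30(a, b, c, d, alpha):
        # log(ad) divided by the weighted penalization term alpha*b + (1-alpha)*c
        return math.log(a * d) / (alpha * b + (1 - alpha) * c)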

With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α × c^(1−α). The best α value is again taken from the normalized PMI experiments (Table 5), and is nearly .5; under this setting, this penalization term is exactly the denominator of [M18] Odds ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.

AM                 VPCs   LVCs   Mixed
−b − c             .565   .453   .546
1/(bc)             .502   .532   .502
[M18] Odds ratio   .443   .567   .456

Table 9: AP performance of suggested LVC penalization terms and AMs.

6 Conclusions

We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and we find that the latter are not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also reveals the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of .588 compared to .515 when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of .583.

We have also shown that marginal frequencies can be used to form effective penalization terms. In particular, we find that α × b + (1−α) × c is a good penalization term for VPCs, while b^α × c^(1−α) is suitable for LVCs. The introduced tuning parameter α should be set so as to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.

Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).

References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, M. Yen and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.

Appendix A. List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.



Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

R. Mahesh K. Sinha

Department of Computer Science & Engineering, Indian Institute of Technology Kanpur, Kanpur 208016, India
rmk@iitk.ac.in

Abstract

A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb, the whole behaving as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitute an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of the light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.

1 Introduction

Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family, and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007) and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for the automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006), using the GIZA++ word alignment tool to align the projected POS information; a success of 83% precision and 46% recall has been reported.

In this paper we present a simple strategy for mining CPs in Hindi using the projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation and results.

2 Complex Predicates in Hindi

A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV or an adverb+LV. Further, it is also possible for a CP to be followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.

Given below are some examples.

(1) CP = noun+LV; noun = ashirwad 'blessings', LV = denaa 'to give'
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP:
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings), makes a complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2) the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2); whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush 'happy', LV = karanaa 'to do'
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa 'to read', LV = lenaa 'to take'
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual, a modal or an intensifier.

(5) CP = verb+LV; verb = phaadanaa 'to tear', LV = denaa 'to give'
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa 'to give', LV = maaranaa 'to hit, to kill'
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning; this may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas 'back', LV1 = karanaa 'to do', LV2 = denaa 'to give'; or CP = CP+LV; CP = vaapas karanaa 'to return', LV = denaa 'to give'
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act); the semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, in which case the CP is a noun+LV.

From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.

3 System Design

As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows (a sketch in code is given after the list):

1) Align the sentences of the Hindi-English corpus.

2) Create a list of Hindi light verbs and their common English meanings as simple verbs (Table 1).

3) For each Hindi light verb, generate all the morphological forms (Figure 1).

4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).

5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) or one of its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or a morphological derivative is found, then search for the equivalent English meanings, in any of their morphological forms (Figure 2), in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit, i.e., go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when the word is not a 'stop word' and collect that Hindi word (W).
      vi) If W is an 'exit word' (Figure 4), then exit, i.e., go to step (a); else the identified CP is W+LV.
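A minimal sketch of this procedure in code (all names are ours, not the author's; lv_forms maps each light verb to its Hindi morphological forms, lv_meanings maps it to all morphological forms of its English meanings, and STOP and EXIT are the stop-word and exit-word lists of Figures 3 and 4):

    def mine_cps(hindi_tokens, english_tokens, lv_forms, lv_meanings, STOP, EXIT):
        cps = []
        eng = set(english_tokens)
        for lv, forms in lv_forms.items():
            if not any(t in forms for t in hindi_tokens):
                continue                           # step (i): LV absent from sentence
            if eng & lv_meanings[lv]:
                continue                           # step (ii): literal LV meaning found
            for k, tok in enumerate(hindi_tokens):
                if tok not in forms:
                    continue
                j = k - 1
                while j >= 0 and hindi_tokens[j] in STOP:
                    j -= 1                         # steps (iii)-(v): skip stop words
                if j >= 0 and hindi_tokens[j] not in EXIT:
                    cps.append((hindi_tokens[j], tok))   # step (vi): CP = W + LV
        return cps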

Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as simple verbs, is given in Table 1. The light verb kar (do) is the most frequently used. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts; such meanings are shown within parentheses and are not used for matching.

light verb (base form)   root verb meaning
baithanaa (बैठना)         sit
bananaa (बनना)           make/become/build/construct/manufacture/prepare
banaanaa (बनाना)         make/build/construct/manufacture/prepare
denaa (देना)             give
lenaa (लेना)             take
paanaa (पाना)            obtain/get
uthanaa (उठना)           rise/arise/get up
uthaanaa (उठाना)         raise/lift/wake up
laganaa (लगना)           feel/appear/look/seem
lagaanaa (लगाना)         fix/install/apply
cukanaa (चुकना)          (finish)
cukaanaa (चुकाना)        pay
karanaa (करना)           (do)
honaa (होना)             happen/become/be
aanaa (आना)              come
jaanaa (जाना)            go
khaanaa (खाना)           eat
rakhanaa (रखना)          keep/put
maaranaa (मारना)         kill/beat/hit
daalanaa (डालना)         put
haankanaa (हाँकना)        drive

Table 1: Some of the common light verbs in Hindi.

For each Hindi light verb, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations.

There are a number of words that can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.

Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa (जाना, 'to go'): jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgaa, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgii, jaakara. Each form is glossed in the figure for tense, aspect, gender, number and person, e.g. jaataa (go, indefinite, masculine, singular), gayaa (go, past, masculine, singular), jaakara (go, completion).

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa (लेना, 'to take'): le, lii, leM, lo, letaa, letii, lete, liiM, luuM, legaa, legii, lene, lenaa, lenii, liyaa, leMge, loge, letiiM, luuMgaa, luuMgii, lekara. Each form is similarly glossed, e.g. letaa (take, indefinite, masculine, singular), liyaa (take, past, masculine, singular), lekara (take, completion).

Figure 2: Morphological derivatives of sample English meanings: sit → sit, sits, sat, sitting; give → give, gives, gave, given, giving.

We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list of those used. This list can be augmented based on an analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.

Figure 3: Stop words in Hindi used by the system (the list is given after Table 2).

Figure 4: A few exit words in Hindi used by the system (the list is given after Table 2).

The inner loop of the procedure identifies the multiple CPs that may be present in a sentence; the outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.

4 Experimentation and Results

The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%; the F-measure is 88% to 97%. This is a remarkable and somewhat surprising result for such a simple methodology, without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al., 2006 (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; that is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                File 1   File 2   File 3   File 4   File 5   File 6
No. of sentences                112      193      102      43       133      107
Total no. of CPs (N)            200      298      150      46       188      151
Correctly identified CPs (TP)   195      296      149      46       175      135
V-V CPs                         56       63       9        6        15       20
Incorrectly identified (FP)     17       44       7        11       16       20
Unidentified CPs (FN)           5        2        1        0        13       16
Accuracy (%)                    97.50    99.33    99.33    100.0    93.08    89.40
Precision (TP/(TP+FP)) (%)      91.98    87.05    95.51    80.70    91.62    87.09
Recall (TP/(TP+FN)) (%)         97.50    98.33    99.33    100.0    93.08    89.40
F-measure (2PR/(P+R)) (%)       94.6     92.3     97.4     89.3     92.3     88.2

Table 2: Results of the experimentation.

Figure 4 (exit words): ने (ergative case marker), को (accusative case marker), का, के, की (possessive case markers), से (from/by/with), में (in/into), पर (on/but), और (and, Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that, Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा, मेरे, मेरी (my), तुम्हारा, तुम्हारे, तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना, अपनी, अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका, जिसकी, जिसके (whose), जिनको, जिनके (to whom).

Figure 3 (stop words): नहीं (no/not), न (no/not, Hindi particle), भी (also, Hindi particle), ही (only, Hindi particle), तो (then, Hindi particle), क्यों (why), क्या (what, Hindi particle), कहाँ (where, Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अंत में (in the end), आखिर में (in the end).


Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर ख़ुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. ख़ुशी होना (to feel happy, good)
v. मदद करना (to help)

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and ख़ुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।
The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. कद्र करना (to respect)

Here also, the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, since the POS of the preceding word is not checked. However, the mining success rate obtained shows that such cases are few in practice. Use of the 'stop words', allowing intervening words within CPs, helps a lot in improving the performance; similarly, use of the 'exit words' avoids many incorrect identifications.

5 Conclusions

The simple empirical method for mining CPs outlined in this work yields an average 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi, with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both the alignment accuracy and the speed.

The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family.

References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553–564.


Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren¹  Yajuan Lü¹  Jie Cao¹  Qun Liu¹  Yun Huang²

¹ Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn

² Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
huangyun@comp.nus.edu.sg

Abstract

Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, a state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.

1 Introduction

Phrase-based machine translation models have been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To utilize the robustness of phrases and to make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into an existing phrase-based model.

A multiword expression (MWE) can be considered a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e., the property that the whole expression cannot be derived from its component words. Stanford University launched a MWE project¹ in which different qualities of MWEs were presented. For bilingual multiword expressions (BiMWEs), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language, and (2) the source phrase and the target phrase are exact translations of each other, i.e., there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.

¹ http://mwe.stanford.edu

For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as

one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in a further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system, and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are as follows:

• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses², which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving word alignment quality, we directly target translation quality. Some researchers have argued that large gains in alignment performance under many metrics lead to only small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).

Besides the above differences, there are some advantages to our approaches:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would otherwise further propagate into the SMT system and hurt SMT performance.

• We conduct experiments on domain-specific corpora. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e., a domain-specific corpus has high MWE coverage. For another, upon investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are flexible and concise. Take the Chinese term "软坚散结" for example. The meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or without this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.

• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus it benefits from the full exploitation of available resources without greatly increasing the time and space complexity of the SMT system.

The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we give only a brief introduction to monolingual and bilingual MWE extraction.

² http://www.statmt.org/moses

2 Bilingual Multiword ExpressionExtraction

In this section we describe our approach of bilin-gual MWE extraction In the first step we obtainmonolingual MWEs from the Chinese part of par-allel corpus After that we look for the translationof the extracted MWEs from parallel corpus

21 Automatic Extraction of MWEs

In the past two decades many different ap-proaches on automatic MWE identification werereported In general those approaches can beclassified into three main trends (1) statisti-cal approaches (Pantel and Lin 2001 Piao et

48

al 2005) (2) syntactic approaches (Fazly andStevenson 2006 Bannard 2007) and (3) seman-tic approaches (Baldwin et al 2003 Cruys andMoiron 2007) Syntax-based and semantic-basedmethods achieve high precision but syntax or se-mantic analysis has to be introduced as preparingstep so it is difficult to apply them to domains withfew syntactical or semantic annotation Statisticalapproaches only consider frequency informationso they can be used to obtain MWEs from bilin-gual corpora without deeper syntactic or semanticanalysis Most statistical measures only take twowords into account so it not easy to extract MWEscontaining three or more than three words

Log Likelihood Ratio (LLR)has been proved agood statistical measurement of the association oftwo random variables (Chang et al 2002) Weadopt the idea of statistical approaches and pro-pose a new algorithm named LLR-based Hierar-chical Reducing Algorithm (HRA for short) to ex-tract MWEs with arbitrary lengths To illustrateour algorithm firstly we define some useful itemsIn the following definitions we assume the givensentence is ldquoA B C D Erdquo

Definition 1 Unit A unit is any sub-string of thegiven sentence For example ldquoA Brdquo ldquo Crdquo ldquo C D Erdquoare all units but ldquoA B Drdquo is not a unit

Definition 2 List A list is an ordered sequence ofunits which exactly cover the given sentence ForexampleldquoArdquoldquo B C Drdquoldquo Erdquo forms a list

Definition 3 Score The score function only de-fines on two adjacent units and return the LLRbetween the last word of first unit and the firstword of the second unit3 For example the scoreof adjacent unit ldquoB Crdquo and ldquoD Erdquo is defined asLLR(ldquo Crdquoldquo Drdquo)

Definition 4 Select The selecting operator is tofind the two adjacent units with maximum scorein a list

Definition 5 ReduceThe reducing operator is toremove two specific adjacent units concatenatethem and put back the result unit to the removedposition For example if we want to reduce unitldquoB Crdquo and unit ldquoDrdquo in list ldquoArdquoldquo B Crdquoldquo Drdquoldquo Erdquowe will get the listldquoArdquoldquo B C Drdquoldquo Erdquo

Initially every word in the sentence is consid-ered as one unit and all these units form a initiallist L If the sentence is of lengthN then the

3we use a stoplist to eliminate the units containing func-tion words by setting their score to 0

list containsN units of course The final set ofMWEsS is initialized to empty set After initial-ization the algorithm will enter an iterating loopwith two steps (1) select the two adjacent unitswith maximum score inL namingU1 andU2 and(2) reduceU1 andU2 in L and insert the reducingresult into the final setS Our algorithm termi-nates on two conditions (1) if the maximum scoreafter selection is less than a given threshold or (2)if L contains only one unit
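A minimal sketch of the HRA (our code; llr is an assumed scoring function over two adjacent words, returning 0 for pairs involving stoplisted function words):

    def hra(sentence, llr, threshold):
        units = [[w] for w in sentence]            # initial list: one unit per word
        mwes = []
        while len(units) > 1:
            # select: the adjacent pair with maximum score, where the score of
            # (U1, U2) is llr(last word of U1, first word of U2)
            scores = [llr(units[i][-1], units[i + 1][0]) for i in range(len(units) - 1)]
            i = max(range(len(scores)), key=scores.__getitem__)
            if scores[i] < threshold:
                break                              # termination condition (1)
            units[i:i + 2] = [units[i] + units[i + 1]]   # reduce: concatenate
            mwes.append(' '.join(units[i]))        # collect the new unit as a MWE
        return mwes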

[Figure 1: Example of the Hierarchical Reducing Algorithm: the hierarchical structure of the seven-word example sentence, glossed word-by-word as 'preventing', 'hyper-', '-lipid-', '-emia', 'for', 'healthy', 'tea', with the LLR scores of adjacent words shown at the bottom.]

Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is the Chinese phrase meaning "healthy tea for preventing hyperlipidemia".⁴ Figure 1 shows the hierarchical structure of the given sentence (based on the LLRs of adjacent words). In this example, four MWEs ('hyper-lipid', 'hyper-lipid-emia', 'healthy tea' and 'preventing hyper-lipid-emia') are extracted, in that order, and the substrings above the dotted line in Figure 1 are not extracted.

From the above example we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are scored with the minimum LLR of the adjacent words within them, which gives a lexical confidence for the extracted MWEs. Finally, for a given sentence of length N, the HRA is guaranteed to terminate within N − 1 iterations, which is very efficient.

However, the HRA has a problem: it extracts substrings before extracting the whole string, even if the substrings appear only in that particular whole string, which we consider useless. To solve this problem, we use the contextual features contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996) to filter out those substrings which exist in only a few MWEs.

⁴ The whole sentence means "healthy tea for preventing hyperlipidemia", with the word-by-word glosses 'preventing', 'hyper-', '-lipid-', '-emia', 'for', 'healthy', 'tea'.

2.2 Automatic Extraction of MWE's Translation

In subsection 2.1 we described the algorithm used to obtain MWEs; in this subsection we introduce the procedure for finding their translations in the parallel corpus.

For mining the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:

1. Run GIZA++⁵ to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs where the source-language sentence includes the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs, according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability, (2) the logarithm of the target-source translation probability, (3) the logarithm of the source-target lexical weighting, (4) the logarithm of the target-source lexical weighting, and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003), (2) the right entropy of the target phrase, (3) the first word of the target phrase, (4) the last word of the target phrase, and (5) all words in the target phrase.

We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.

⁵ http://www.fjoch.com/GIZA++.html
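A minimal sketch of the perceptron training loop over these features (our code, not the authors'; phi is an assumed feature extractor returning a sparse feature-value dict for a candidate phrase pair):

    def train_perceptron(data, phi, epochs=10):
        w = {}                                     # feature weights
        for _ in range(epochs):
            for pair, label in data:               # label in {+1, -1}
                feats = phi(pair)
                score = sum(w.get(f, 0.0) * v for f, v in feats.items())
                if label * score <= 0:             # misclassified: update weights
                    for f, v in feats.items():
                        w[f] = w.get(f, 0.0) + label * v
        return w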

3 Application of Bilingual MWEs

Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section we propose three methods to utilize bilingual MWEs; we compare their performance in Section 4.

3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect the alignment to be modified and the probability estimation to be more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Unlike their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. The fact that additional resources can improve domain-specific SMT performance has been shown by many researchers (Wu et al., 2008; Eck et al., 2004); our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.

3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source-language phrase contains a MWE (as a substring) and the target-language phrase contains the translation of that MWE (as a substring), the feature value is 1; otherwise the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.

3.3 Additional Phrase Table of Bilingual MWEs

Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing automatically extracted bilingual MWEs. As to probability assignment, we simply assign 1 to each of the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table; for each phrase in the input sentence during translation, the decoder searches all candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to denote this method.

4 Experiments

41 Data

We run experiments on two domain-specific patentcorpora one is for traditional medicine domainand the other is for chemical industry domain Ourtranslation tasks are Chinese-to-English

In the traditional medicine domain table 1shows the data statistics For language model weuse SRI Language Modeling Toolkit6 to train a tri-gram model with modified Kneser-Ney smoothing(Chen and Goodman 1998) on the target side oftraining corpus Using our bilingual MWE ex-tracting algorithm 80287 bilingual MWEs are ex-tracted from the training set

                         Chinese     English
Training   Sentences         120,355
           Words         4,688,873   4,737,843
Dev        Sentences           1,000
           Words            31,722      32,390
Test       Sentences           1,000
           Words            41,643      40,551

Table 1: Traditional medicine corpus

6 http://www.speech.sri.com/projects/srilm/

Table 2 gives detailed information about the data in the chemical industry domain. In this experiment, 59,466 bilingual MWEs are extracted.

                         Chinese     English
Training   Sentences         120,856
           Words         4,532,503   4,311,682
Dev        Sentences           1,099
           Words            42,122      40,521
Test       Sentences           1,099
           Words            41,069      39,210

Table 2: Chemical industry corpus

We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform a statistical significance test between pairs of translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.

4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in a log-linear model. To obtain the best translation ê of the source sentence f, the log-linear model uses the following equation:

\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f) \qquad (1)

in which h_m and λ_m denote the m-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
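As a minimal illustration of Equation (1), decoding picks the candidate with the highest weighted feature sum (the feature functions and candidate list here are placeholders, not the actual Moses search):

```python
# Sketch: log-linear selection of the best translation candidate.

def best_translation(candidates, feature_fns, weights, f):
    """candidates: hypotheses e; feature_fns: h_m(e, f); weights: lambda_m."""
    def score(e):
        return sum(w * h(e, f) for w, h in zip(weights, feature_fns))
    return max(candidates, key=score)
```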

7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm


4.3 Results

Methods           BLEU
Baseline          0.2658
Baseline+BiMWE    0.2661
Baseline+Feat     0.2675
Baseline+NewBP    0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement. The Baseline+BiMWE method achieves slightly higher translation quality than the baseline system.

To our disappointment, however, none of these improvements are statistically significant. We manually examine the extracted bilingual MWEs that are labeled positive by the perceptron algorithm and find that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined by the following formula:

\mathrm{Score}(s, t) = \log(\mathrm{LLR}(s, t)) \times \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \qquad (2)

Here the mean μ and variance σ² are estimated from the training set. After ranking by score, we select the top 50,000, 60,000 and 70,000 bilingual MWEs to perform the three methods mentioned in Section 3. The results are shown in Table 4.
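A small sketch of Equation (2), assuming whitespace-tokenized phrases and an `llr` function for the log-likelihood-ratio score (both assumptions):

```python
import math

# Sketch: rank positively-classified MWE pairs by LLR weighted with a
# Gaussian over the source/target length ratio x (Equation (2)).

def ranking_score(s, t, llr, mu, sigma):
    x = len(s.split()) / len(t.split())   # length ratio of the phrase pair
    gauss = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return math.log(llr(s, t)) * gauss
```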

From this table, we can conclude the following: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better

Methods           50,000   60,000   70,000
Baseline                   0.2658
Baseline+BiMWE    0.2671   0.2686   0.2715
Baseline+Feat     0.2728   0.2734   0.2712
Baseline+NewBP    0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain

than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as that of Lambert and Banchs (2005).

To verify the assumption that bilingual MWEs indeed improve SMT performance not only in one particular domain, we also perform experiments in the chemical industry domain. Table 5 shows the results. From this table, we can see that the three methods improve translation performance in the chemical industry domain as well as in the traditional medicine domain.

Methods           BLEU
Baseline          0.1882
Baseline+BiMWE    0.1928
Baseline+Feat     0.1917
Baseline+NewBP    0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain

4.4 Discussion

In order to understand in what respects our methods improve translation performance, we manually analyze some test sentences and give some examples in this subsection.

(1) For the first example in Table 6, the Chinese source phrase is aligned to other words and not correctly translated by the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since the bilingual MWE pairing it with "dredging meridians" has been added into the training corpus and then aligned by GIZA++.

(2) For the second example in Table 6, the Chinese source phrase has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as the translation, while the Baseline+Feat system chooses "medicated tea", because the additional feature gives a high probability to the correct translation "medicated tea".


Src: [Chinese source sentence]
Ref: the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water and tranquilizing and tonic effects and making nutritious health
+BiMWE: the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src: [Chinese source sentence]
Ref: the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat: may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples


5 Conclusion and Future Work

This paper presents an LLR-based hierarchical reducing algorithm to automatically extract bilingual MWEs, and investigates the performance of three different strategies for applying bilingual MWEs to an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies can also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are still rough, simple and coarse. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in phrase-based SMT systems, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will conduct further studies on improving statistical machine translation using domain bilingual MWEs.

Acknowledgments

This work is supported by National Natural Science Foundation of China Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward and Wei-jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp

Abstract

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

1 Introduction

Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods, such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001), with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).

In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.

When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized as S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized as B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes apart from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information about the head of the bunsetsu1. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized as B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.


(1) kikoku-shita (returned home) Kazama-san-wa (Mr. Kazama TOP)
    'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on methods based on sequential labeling. Section 3 gives an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.

2 Related Work

In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX Committee, 1999) is usually used.

class          example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21st century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples

In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY and PERCENT. Table 1 shows example instances of each class.

NER methods are divided into two approaches: rule-based approaches and machine learning approaches. According to previous work, machine learning approaches achieve better performance than rule-based approaches.

In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I and E are assigned to the beginning, middle and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.2 The labels S, B, I and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
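For concreteness, a minimal sketch of producing such labels from gold NE spans (the span format is an assumption):

```python
# Sketch: convert NE spans over morphemes into S/B/I/E/O labels,
# following the SE-algorithm.

def to_sbieo(morphemes, spans):
    """spans: list of (start, end, ne_class) with end exclusive."""
    labels = ["O"] * len(morphemes)
    for start, end, ne in spans:
        if end - start == 1:
            labels[start] = "S-" + ne
        else:
            labels[start] = "B-" + ne
            for i in range(start + 1, end - 1):
                labels[i] = "I-" + ne
            labels[end - 1] = "E-" + ne
    return labels

# e.g. to_sbieo(["Kimura", "Syonosuke", "san"], [(0, 2, "PERSON")])
# -> ["B-PERSON", "E-PERSON", "O"]
```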

The model for label estimation is learned by machine learning. The following features are generally utilized: characters, type of

2 Besides, there are the IOB1 and IOB2 algorithms, using only I, O and B, and the IOE1 and IOE2 algorithms, using only I, O and E (Kim and Veenstra, 1999).


[Figure 2: An overview of our proposed method for the bunsetsu "Habu-Yoshiharu-Meijin": (a) the initial state, in which every chunk has an estimated label and score (e.g., "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Meijin" OTHERe 0.245); (b) the final output, determined in the analysis direction toward the upper-right cell.]

character, POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVMs or CRFs have been proposed.

Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word subclass of the NE in the analysis direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized cache features, coreference results, syntactic features, and case-frame features as structural features (Sasano and Kurohashi, 2008).

Some work acquires knowledge from large unannotated corpora and applies it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained from a huge number of dependency relation pairs (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") using patterns over a large Web corpus (Fukushima et al., 2008).

In Japanese NER research, the CRL NE data are usually used for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 on these data.

3 Overview of Proposed Method

Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu.

Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the label assignment "Habu-Yoshiharu" (PERSON) and the label assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in a general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.

The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.


[Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin(-ga)": "Habu-Yoshiharu" is PERSON, "Meijin" is OTHERe, and the remaining chunks are invalid.]

4 Model Learning

4.1 Label Estimation Model

This model estimates the label of each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions, which span multiple bunsetsus, tend to be NEs:

• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, the bunsetsu is expanded when one of the above conditions is met. With this expansion, 98.6% of named entities are located within a bunsetsu.3

For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.

Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned their correct labels from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (shown in Table 1) and five further classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

The chunk that corresponds to a whole bunsetsu and does not contain any NE is labeled OTHERs, and a head/middle/tail chunk that does not correspond to an NE is labeled OTHERb/OTHERi/OTHERe, respectively.4

3 As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4 Each OTHER label is assigned to the longest chunk that satisfies its condition within a chunk.

1. the number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type5

4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word subclass, and several features provided by the morphological analyzer JUMAN

6. several features6 provided by the parser KNP

7. string of the morpheme in the chunk

8. IPADIC7 feature
   - If the string of the chunk is registered in the following categories of IPADIC: "person", "location", "organization" and "general", this feature fires

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")

10. cache feature
   - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used as the feature

11. particles that the bunsetsu includes

12. the morphemes, particles and head morpheme in the parent bunsetsu

13. the NE-category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
   - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
   - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model

A chunk that is neither one of the eight NE classes nor one of the above four OTHER classes is labeled invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and the other chunks are labeled invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned

5 The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6 When a morpheme has an ambiguity, all the corresponding features fire.

7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja


to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)–(8), three categories according to the position of the morpheme in the chunk are prepared: "head", "tail" and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features for each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). Each SVM output is transformed by the sigmoid function 1/(1 + exp(−βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.
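A minimal sketch of this transformation for one chunk (the raw SVM margins are placeholders):

```python
import math

# Sketch: map the 13 one-vs-rest SVM outputs of a chunk to scores that
# sum to one, via the sigmoid 1 / (1 + exp(-beta * x)).

def normalized_label_scores(svm_outputs, beta=1.0):
    """svm_outputs: dict label -> raw SVM margin for this chunk."""
    sig = {lab: 1.0 / (1.0 + math.exp(-beta * x))
           for lab, x in svm_outputs.items()}
    total = sum(sig.values())
    return {lab: v / total for lab, v in sig.items()}
```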

The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. In a chunk where the label invalid has the highest score, the label with the second highest score is adopted.

4.2 Label Comparison Model

This model compares two label assignments for a given string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up, sandwiching "vs." (the left one is called the first set, the right one the second set). When the first set is correct, the example is positive; otherwise, the example is negative. The maximum number of chunks per set is five, and thus examples in which the first or second set has more than five chunks are not used for model learning.

Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used

[Figure 4: Assignment of positive/negative examples, e.g., positive: "Habu-Yoshiharu" (PSN) vs. "Habu + Yoshiharu" (PSN + MNY), and "Habu-Yoshiharu + Meijin" (PSN + OTHERe) vs. "Habu + Yoshiharu + Meijin" (PSN + MONEY + OTHERe); negative: "Habu-Yoshiharu-Meijin" (ORG) vs. "Habu-Yoshiharu + Meijin" (PSN + OTHERe).]

for each label, as estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining positions.

Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using an SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.

5 Analysis

First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the


chunk    Habu-Yoshiharu    Habu    Yoshiharu
label    PERSON            PERSON  MONEY
vector   V11 0 0 0 0       V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu", and V11, V21, V22 and 0 are vectors of dimension 13)

[Figure 6: The label comparison model: (a) label assignment for the cell "Habu-Yoshiharu", comparing the assignment "Habu-Yoshiharu" against "Habu" + "Yoshiharu"; (b) label assignment for the cell "Habu-Yoshiharu-Meijin", comparing "Habu-Yoshiharu-Meijin", "Habu-Yoshiharu" + "Meijin", and "Habu" + "Yoshiharu" + "Meijin".]

label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F" and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparison step, a label assignment in which an OTHER* label follows another OTHER* label (OTHER* - OTHER*) is not allowed, since each OTHER label is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
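To make the control flow concrete, here is a minimal sketch of the bottom-up procedure (not the authors' implementation; `estimate` and `compare` stand in for the label estimation and label comparison models):

```python
# Sketch: CKY-style bottom-up label assignment over a bunsetsu.
# cell[(i, j)] holds the current best labeling (a list of labeled chunks)
# of morphemes i..j; cells are filled from short spans to the whole span.

def best_assignment(n, estimate, compare):
    """n: number of morphemes; estimate(i, j): single-chunk labeling of
    span i..j (inclusive); compare(a, b): the preferred of two labelings."""
    cell = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = [estimate(i, j)]          # whole span as one chunk
            for k in range(i, j):            # every split into two parts
                split = cell[(i, k)] + cell[(k + 1, j)]
                best = compare(best, split)
            cell[(i, j)] = best
    return cell[(0, n - 1)]                  # labeling of the whole bunsetsu
```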

6 Experiment

To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled OPTIONAL and were used for neither

[Figure 7: 5-fold cross validation: the corpus (CRL NE data) is divided into parts (a)–(e); label estimation model 1 is learned from (b)–(e); label estimation models 2b–2e are each learned from three of those parts and applied to the remaining one to obtain features, from which the label comparison model is learned.]

the model learning8 nor the evaluation. We performed 5-fold cross validation, following previous work. Different from previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, label estimation model 1 is learned from parts (b)–(e). Then, label estimation model 2b is learned from parts (c)–(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, label estimation model 2c is learned from parts (b), (d) and (e), and by applying it to part (c), features are obtained. The same holds for parts

8 Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.


               Recall               Precision
ORGANIZATION   81.83 (3008/3676)    88.37 (3008/3404)
PERSON         90.05 (3458/3840)    93.87 (3458/3684)
LOCATION       91.38 (4992/5463)    92.44 (4992/5400)
ARTIFACT       46.72 (349/747)      74.89 (349/466)
DATE           93.27 (3327/3567)    93.12 (3327/3573)
TIME           88.25 (443/502)      90.59 (443/489)
MONEY          93.85 (366/390)      97.60 (366/375)
PERCENT        95.33 (469/492)      95.91 (469/489)
ALL-SLOT       87.87                91.79
F-measure      89.79

Table 3: Experimental result

(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both label estimation model 1 and the label comparison model.

In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.

7 Discussion

7.1 Comparison with Previous Work

Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") using patterns over a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for model learning. Therefore, by incorporating such knowledge and/or analysis results into our method, its performance could be improved further.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about

9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html

the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", our method can utilize the information of "hou".

7.2 Error Analysis

There were some errors in analyzing katakana words. In the following example, although the correct analysis labels "Batistuta" as PERSON, the system labeled it OTHERs.

(4) Italy-de (Italy LOC) katsuyaku-suru (active) Batistuta-wo (Batistuta ACC) kuwaeta (call) Argentine (Argentine)
    'Argentina called Batistuta, who was active in Italy'

There is no entry for "Batistuta" in the dictionary of JUMAN nor in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were also errors in applying the label comparison model even when the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis labels "HongKong-seityou" as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning to adjust the value of β in the sigmoid function and to refine the


                                F1     analysis unit   distinctive features
(Fukushima et al., 2008)        89.29  character       Web
(Kazama and Torisawa, 2008)     88.93  character       Wikipedia, Web
(Sasano and Kurohashi, 2008)    89.40  morpheme        structural information
(Nakano and Hirai, 2004)        89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)  87.21  character
(Isozaki and Kazawa, 2003)      86.77  morpheme
proposed method                 89.79  compound noun   Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on the CRL NE data using cross validation)

[Figure 8: An example of an error in the label comparison model: (a) the initial state, in which "HongKong-seityou" is correctly labeled ORGANIZATION; (b) the final output, in which "HongKong + seityou" (LOC + OTHERe) is incorrectly chosen.]

features for the label comparison model.

8 Conclusion

This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then determines the best label assignment by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.

We are planning to integrate this method with a syntactic and case analysis method (Kawahara and Kurohashi, 2007) and to perform syntactic, case, and Named Entity analysis simultaneously in order to improve overall accuracy.

ReferencesKenrsquoichi Fukushima Nobuhiro Kaji and Masaru

Kitsuregawa 2008 Use of massive amountsof web text in Japanese named entity recogni-tion In Proceedings of Data Engineering Workshop(DEWS2008) A3-3 (in Japanese)

IREX Committee editor 1999 Proceedings of theIREX Workshop

Hideki Isozaki and Hideto Kazawa 2003 Speeding upsupport vector machines for named entity recogni-tion Transaction of Information Processing Societyof Japan 44(3)970ndash979 (in Japanese)

Daisuke Kawahara and Sadao Kurohashi 2007 Prob-abilistic coordination disambiguation in a fully-lexicalized Japanese parser In Proceedings of the

2007 Joint Conference on Empirical Methods inNatural Language Processing and ComputationalNatural Language Learning (EMNLP-CoNLL2007)pages 304ndash311

Junrsquoichi Kazama and Kentaro Torisawa 2008 In-ducing gazetteers for named entity recognition bylarge-scale clustering of dependency relations InProceedings of ACL-08 HLT pages 407ndash415

Erik F Tjong Kim and Jorn Veenstra 1999 Repre-senting text chunks In Proceedings of EACL rsquo99pages 173ndash179

Vajay Krishnan and Christopher DManning 2006An effective two-stage model for exploiting non-local dependencies in named entity recognitionpages 1121ndash1128

John Lafferty Andrew McCallun and FernandoPereira 2001 Conditional random fields Prob-abilistic models for segmenting and labeling se-quence data In Proceedings of the Eighteenth In-ternational Conference (ICMLrsquo01) pages 282ndash289

Asahara Masayuki and Yuji Matsumoto 2003Japanese named entity extraction with redundantmorphological analysis In Proceeding of HLT-NAACL 2003 pages 8ndash15

Keigo Nakano and Yuzo Hirai 2004 Japanese namedentity extraction with bunsetsu features Transac-tion of Information Processing Society of Japan45(3)934ndash941 (in Japanese)

Ryohei Sasano and Sadao Kurohashi 2008 Japanesenamed entity recognition using structural naturallanguage processing In Proceeding of Third In-ternational Joint Conference on Natural LanguageProcessing pages 607ndash612

Satoshi Sekine Ralph Grishman and Hiroyuki Shin-nou 1998 A decision tree method for finding andclassifying names in japanese texts In Proceed-ings of the Sixth Workshop on Very Large Corpora(WVLC-6) pages 171ndash178

Vladimir Vapnik 1995 The Nature of StatisticalLearning Theory Springer


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

[Abbreviation Generation for Japanese Multi-Word Expressions — Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita, pages 63–70. The body of this paper is not recoverable from the extracted text.]

Author Index

Bhutada, Pravin 17
Cao, Jie 47
Caseli, Helena 1
Diab, Mona 17
Finatto, Maria Jose 1
Fujii, Hiroko 63
Fukui, Mika 63
Funayama, Hirotaka 55
Hoang, Hung Huu 31
Huang, Yun 47
Kan, Min-Yen 9, 31
Kim, Su Nam 9, 31
Kuhn, Jonas 23
Kurohashi, Sadao 55
Liu, Qun 47
Lu, Yajuan 47
Machado, Andre 1
Ren, Zhixiang 47
Shibata, Tomohide 55
Sinha, R. Mahesh K. 40
Sumita, Kazuo 63
Suzuki, Masaru 63
Villavicencio, Aline 1
Wakaki, Hiromi 63
Zarrieß, Sina 23



Table of Contents

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria Jose Finatto . . . . . 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan . . . . . 9

Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada . . . . . 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn . . . . . 23

A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan . . . . . 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha . . . . . 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang . . . . . 47

Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi . . . . . 55

Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita . . . . . 63


Workshop Program

Friday, August 6, 2009

8:30–8:45 Welcome and Introduction to the Workshop

Session 1 (8:45–10:00): MWE Identification and Disambiguation

8:45–9:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria Jose Finatto

9:10–9:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

9:35–10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00–10:30 BREAK

Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55–11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45–13:50 LUNCH

Session 3 (13:50–15:30): Applications

13:50–14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang

14:15–14:40 Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00 BREAK

16:00–17:00 General Discussion

17:00–17:15 Closing Remarks


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 1–8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, Andre Machado♣, Maria Jose Finatto♥

♦Department of Computer Science, Federal University of Sao Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.

1 Introduction

A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary consists largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).

Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.

In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.

The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and section 6 finishes this paper with some conclusions and proposals for future work.

2 Related Work

The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter) and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g. to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.

A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g. Zhang et al. (2006); Villavicencio et al. (2007)), given the heterogeneity of MWEs much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g. Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g. Villada Moiron and Tiedemann (2006); Caseli et al. (2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kinds of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moiron, 2007) or automatic word alignment (Villada Moiron and Tiedemann, 2006).

Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that for MWE identification the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of corpus size and nature over the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).

Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moiron and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moiron and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).

Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), in which missing lexical entries for MWEs and related constructions are detected via error mining methods; this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus by a time-demanding process.

Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.

3 The Corpus and Reference Lists

The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students on the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms with 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).

4 Statistically-Driven and Alignment-Based Methods

4.1 Statistically-Driven Method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language and type independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2 and log-likelihood (Press et al., 1992), among others, have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others.

1Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim


In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of these measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
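To make the ranking step concrete, the following is a minimal sketch in Python of PMI-based bigram ranking over a tokenized corpus; the function and variable names are ours, not those of the Ngram Statistics Package actually used in the experiments:

    import math
    from collections import Counter

    def pmi_ranking(tokens, min_freq=2):
        """Rank bigram candidates by Pointwise Mutual Information (PMI)."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n_uni = sum(unigrams.values())
        n_bi = sum(bigrams.values())
        scores = {}
        for (w1, w2), freq in bigrams.items():
            if freq < min_freq:  # frequency cut-off of 2, as in the experiments
                continue
            p_xy = freq / n_bi
            p_x = unigrams[w1] / n_uni
            p_y = unigrams[w2] / n_uni
            scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
        # highest-scoring candidates first
        return sorted(scores, key=scores.get, reverse=True)

The combined measure then averages, for each n-gram, its rank positions under the individual measures.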

4.2 Alignment-Based Method

The second of the MWE extraction approaches to be investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, i.e. a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that: (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.

Thus, notice that the alignment-based MWE extraction method does not rely on the conceptual asymmetries between languages, since it does not expect that a source sequence of words be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision at the expense of recall.

It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.
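A minimal sketch of this candidate extraction step, assuming word alignments are available as (source position, target position) pairs per sentence pair (e.g. parsed from GIZA++ output); the representation and function names are illustrative, not the authors' actual implementation:

    from collections import defaultdict

    def extract_candidates(src_tokens, links):
        """Collect sequences of two or more consecutive source tokens whose
        words are all aligned to the same target unit (an n:m link, n >= 2).

        links: iterable of (src_pos, tgt_pos) word-alignment pairs."""
        tgt_sets = defaultdict(set)
        for s, t in links:
            tgt_sets[s].add(t)
        targets = {s: frozenset(ts) for s, ts in tgt_sets.items()}

        candidates = []
        i, n = 0, len(src_tokens)
        while i < n:
            j = i
            # extend the span while the next token shares the same target set
            while (j + 1 < n and i in targets and (j + 1) in targets
                   and targets[j + 1] == targets[i]):
                j += 1
            if j > i:  # at least two source words joined in the alignment
                candidates.append(" ".join(src_tokens[i:j + 1]))
            i = j + 1
        return candidates

    # e.g. "aleitamento materno" aligned as a 2:1 link to "breastfeeding":
    print(extract_candidates(["o", "aleitamento", "materno", "protege"],
                             [(0, 0), (1, 1), (2, 1), (3, 2)]))
    # ['aleitamento materno']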

In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as gold standard.

To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence and word aligned and Part-of-Speech (POS) tagged. These preprocessing steps were performed, respectively, by a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).

2GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html

3Apertium is an open-source machine translation engine and toolbox available at http://www.apertium.org


From the preprocessed corpus, the MWE candidates are extracted as those in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.

Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same as used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same as used in Caseli et al. (2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as Caseli et al. (2009) plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral, and (b) a minimum frequency threshold of 2.
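As a rough illustration, a filter in the spirit of F3 could be sketched as follows; the tag names here are hypothetical stand-ins, not Apertium's actual tag set:

    # Tags assumed to mark closed-class words; illustrative only.
    BAD_EDGE_TAGS = {"det", "adv", "conj", "prep", "verb", "pron", "num"}

    def passes_f3(tags, freq, min_freq=2):
        """Keep a candidate only if it is frequent enough and neither
        begins nor ends with a closed-class word (filter F3)."""
        if freq < min_freq:
            return False
        return tags[0] not in BAD_EDGE_TAGS and tags[-1] not in BAD_EDGE_TAGS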

5 Experiments and Results

Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. play video game). From table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).

In table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 × precision × recall)/(precision + recall)) figures for the association measures, using all the candidates (on the

PMI                            alignment-based
Online Mendelian Inheritance   faixa etaria
Beta Technology Incorporated   aleitamento materno
Lange Beta Technology          estados unidos
Oxido Nitrico Inalatorio       hipertensao arterial
jogar video game               leite materno
e um de                        couro cabeludo
e a do                         bloqueio lactiferos
se que de                      emocional anatomia
e a da                         neonato a termo
e de nao                       duplas maes bebes

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and the alignment-based approach

pt MWE candidates      PMI       MI
All candidates:
# proposed bigrams     64,839    64,839
# correct MWEs         1,403     1,403
precision              2.16%     2.16%
recall                 98.73%    98.73%
F                      4.23      4.23
# proposed trigrams    54,548    54,548
# correct MWEs         701       701
precision              1.29%     1.29%
recall                 96.03%    96.03%
F                      2.55      2.55
Top candidates:
# proposed bigrams     1,421     1,421
# correct MWEs         155       261
precision              10.91%    18.37%
recall                 10.91%    18.37%
F                      10.91     18.37
# proposed trigrams    730       730
# correct MWEs         44        20
precision              6.03%     2.74%
recall                 6.03%     2.74%
F                      6.03      2.74

Table 2: Evaluation of MWE candidates - PMI and MI

first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are at most only 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.
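For concreteness, the exact-match evaluation against the Glossary amounts to the following computation (a sketch; the function name is ours):

    def evaluate(proposed, gold):
        """Exact-match precision, recall and F-measure against a reference list."""
        correct = len(set(proposed) & set(gold))
        precision = correct / len(proposed) if proposed else 0.0
        recall = correct / len(gold) if gold else 0.0
        f = 2 * precision * recall / (precision + recall) if correct else 0.0
        return precision, recall, f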

On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered following the three filters described in section 4.2: F1, F2 and F3.

To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary.


pt MWE candidates            F1       F2       F3
# filtered by POS patterns   24,996   21,544   32,644
# filtered by frequency      9,012    11,855   1,442
final set                    269      878      191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates        F1       F2       F3
# proposed bigrams       250      754      169
# correct MWEs           48       95       65
precision                19.20%   12.60%   38.46%
recall                   3.38%    6.69%    4.57%
F                        5.75     8.74     8.18
# proposed trigrams      19       110      20
# correct MWEs           1        9        4
precision                5.26%    8.18%    20.00%
recall                   0.14%    1.23%    0.55%
F                        0.27     2.14     1.07
# proposed bi+trigrams   269      864      189
# correct MWEs           49       104      69
precision                18.22%   12.04%   36.51%
recall                   2.28%    4.83%    3.21%
F                        4.05     6.90     5.90

Table 4: Evaluation of MWE candidates

In this automatic comparison, we considered the final lists of MWE candidates generated by each filter in table 3. The number of matching entries and the values for precision, recall and F-measure are shown in table 4.

The different values of extracted MWEs (in table 3) and evaluated ones (in table 4) are due to the restriction of considering only bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doenca arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not being considered at the moment.

After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in section 4.2) were manually evaluated.

From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of it being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was: (1) a proper name, or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.

The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates good agreement.
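A minimal sketch of this agreement computation (Cohen's kappa for two judges over boolean decisions; the data layout is an assumption, not the authors' script):

    def cohen_kappa(judge_a, judge_b):
        """Cohen's kappa for two judges' true/false decisions."""
        n = len(judge_a)
        observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
        # chance agreement from each judge's marginal probability of "true"
        pa = sum(judge_a) / n
        pb = sum(judge_b) / n
        expected = pa * pb + (1 - pa) * (1 - pb)
        return (observed - expected) / (1 - expected) if expected < 1 else 1.0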

In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating it with regard to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.

6 Conclusions and Future Work

MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.

In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs from a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained


show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.

Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both, or at least one, of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to say that in this manual analysis the human experts classified the MWEs as true independently of whether they are Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.

In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.

Acknowledgments

We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.

References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50-59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98-104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047-2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370-381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistics. Computational Linguistics, 22(2):249-254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49-56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117-134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450-466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28.

Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165-178, Oxford. Oxford University Press.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534-559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459-484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33-40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440-447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1-7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17-24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1-15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034-1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415-432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36-44, Sydney, Australia, July. Association for Computational Linguistics.


Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles

Su Nam Kim
CSSE dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.

1 Introduction

Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also serve as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).

In the past, various attempts have been made to boost automatic keyphrase extraction performance based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.

Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g. Frank et al. (1999); Park et al. (2004); Nguyen and Kan (2007)), most of them do not effectively deal with various keyphrase forms, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.

Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. That leads to redundancy in studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, this approach involves a large amount of manual labor, thus reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthy study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up tentative guidelines for applications.

This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method which improves the coverage of candidates that occur in various forms. Our second contribution re-examines existing keyphrase extraction features


reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.

In the remaining sections, we describe an overview of related work in Section 2, our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. Then we evaluate our proposals, discuss outcomes and conclude our work in Sections 7, 8 and 9, respectively.

2 Related Work

The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). Features used can be categorized into three broad groups: (1) document cohesion features (i.e. the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent (2) keyphrase cohesion features (i.e. the relationship among keyphrases) (Turney, 2003), and (3) term cohesion features (i.e. the relationship among components in a keyphrase) (Park et al., 2004).

The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TFIDF (i.e. term frequency, inverse document frequency) and first occurrence in the document. TFIDF measures document cohesion, and the first occurrence implies the importance of the abstract or introduction, which indicates that keyphrases have a locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. Textract (Park et al., 2004) also ranks the candidate keyphrases by its judgment of keyphrases' degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of other documents in the same cluster.

3 Keyphrase Analysis

In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used the articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for details of the data.

Syntactically, keyphrases can be formed by either simplex nouns (e.g. algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g. mobile network, fast computing, partially observable Markov decision process), despite few incidences. They can also incorporate a prepositional phrase (PP) (e.g. quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e. NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g. incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well to part-of-speech (POS) patterns used in modern keyphrase extraction systems.

However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked.


Criteria      Rules
Frequency     (Rule1) Frequency heuristic, i.e. frequency ≥ 2 for simplex words vs. frequency ≥ 1 for NPs
Length        (Rule2) Length heuristic, i.e. up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
              (e.g. synchronous concurrent program vs. model of multiagent interaction)
Alternation   (Rule3) of-PP form alternation
              (e.g. number of sensor = sensor number, history of past encounter = past encounter history)
              (Rule4) Possessive alternation
              (e.g. agent's goal = goal of agent, security's value = value of security)
Extraction    (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
              (e.g. complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
              (Rule6) Simplex Word/NP IN Simplex Word/NP
              (e.g. quality of service; sensitivity of VOIP traffic (VOIP traffic extracted); simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules

In our study, we also found that keyphrases occur as simple conjunctions (e.g. search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g. history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g. pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g. belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when forming in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g. distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with of-PP form (e.g. agent's goal = goal of agent, security's value = value of security).

4 Candidate Selection

We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated to term extraction study, since the top-N ranked terms become the keyphrases of documents. In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe the details of our keyphrase analysis.

In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e. either without a PP, or with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.

Another observation is that keyphrases containing more than three words are rare (i.e. 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g. performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (≥ 2) and NPs (≥ 1). Note that 30% of NPs occurred only once in our data.

Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule6 for NPs in of-PP form broadens the coverage of possible candidates: given an NP in of-PP form, not only do we collect simplex word(s), but we also extract the non-of-PP forms of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
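A minimal sketch of the Rule5 extraction over a POS-tagged token list (the helper names are ours; the actual system compiles these patterns as regular expressions):

    NOUN = {"NN", "NNS", "NNP", "NNPS"}
    ADJ = {"JJ", "JJR", "JJS"}

    def rule5_candidates(tagged):
        """Collect maximal adjective/noun spans that end in a noun (Rule5)."""
        candidates, span = [], []
        for word, tag in tagged + [("", "EOS")]:  # sentinel flushes last span
            if tag in NOUN or tag in ADJ:
                span.append((word, tag))
            else:
                while span and span[-1][1] not in NOUN:
                    span.pop()  # trim trailing adjectives
                if span:
                    candidates.append(" ".join(w for w, _ in span))
                span = []
        return candidates

    print(rule5_candidates([("effective", "JJ"), ("algorithm", "NN"),
                            ("of", "IN"), ("grid", "NN"), ("computing", "NN")]))
    # ['effective algorithm', 'grid computing']

Rule6 would additionally combine two such spans around a preposition to also yield, e.g., effective algorithm of grid computing.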

5 Feature Engineering

With wider candidate selection criteria, the onus of filtering out irrelevant candidates becomes the responsibility of careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S/U).

5.1 Document Cohesion

Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.

F1: TFIDF (S/U): TFIDF indicates document cohesion by looking at the frequency of terms in the documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is in requiring a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF in substrings as well as in exact matches (see the sketch after the list below). For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e. we separately treat simplex words and NPs to compute TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system.

• (F1a) TFIDF
• (F1b) TF including counts of substrings
• (F1c) TF of substring as a separate feature
• (F1d) normalized TF by candidate types (i.e. simplex words vs. NPs)
• (F1e) normalized TF by candidate types as a separate feature
• (F1f) IDF using Google n-grams
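The substring-aware TF variant (F1b) can be sketched as follows, assuming candidates are whitespace-tokenized strings; the counting scheme illustrates the idea, not the authors' exact implementation:

    from collections import Counter

    def tf_with_substrings(occurrences):
        """F1b: add to each candidate's TF the counts of longer candidates
        that contain it as a token-level substring, e.g. 'grid computing'
        inside 'efficient grid computing algorithm'."""
        tf = Counter(occurrences)
        total = Counter(tf)
        for short in tf:
            for longer in tf:
                if short != longer and f" {short} " in f" {longer} ":
                    total[short] += tf[longer]
        return total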

F2: First Occurrence (S/U): KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur in the beginning of documents, especially in structured reports (e.g. in abstract and introduction sections) and newswire.

F3: Section Information (S/U): Nguyen and Kan (2007) used the identity of the specific document section a candidate occurs in. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title and/or references.

F4: Additional Section Information (S/U): We first added related work or previous work as a key section not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e. the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.

• (F4a) section 'related/previous work'
• (F4b) counting substrings occurring in key sections
• (F4c) section TF across all key sections
• (F4d) weighting key sections according to the portion of keyphrases found

F5: Last Occurrence (S/U): Similar to distance in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of the document, such as the conclusion and discussion.

5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.

F6: Co-occurrence of Another Candidate in Section (S/U): When candidates co-occur in several key sections together, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.

F7: Title Overlap (S): In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of what words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.

• (F7a) co-occurrence (Boolean) in title collection
• (F7b) co-occurrence (TF) in title collection

F8: Keyphrase Cohesion (S/U): Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top-N ranked candidates and the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.

1It contains 1.3M titles from articles, papers and reports.
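The F8 similarity computation reduces to a cosine between context-word count vectors, as in this sketch (the Counter-based representation is our assumption):

    import math
    from collections import Counter

    def context_cosine(ctx_a, ctx_b):
        """Cosine similarity between two candidates' neighboring-word
        count vectors (Counters mapping context word -> count)."""
        dot = sum(ctx_a[w] * ctx_b[w] for w in ctx_a.keys() & ctx_b.keys())
        norm_a = math.sqrt(sum(v * v for v in ctx_a.values()))
        norm_b = math.sqrt(sum(v * v for v in ctx_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0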

5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values for internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).

F9: Term Cohesion (S/U): Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier.

• (F9a) term cohesion by Park et al. (2004)
• (F9b) normalized TF by candidate types (i.e. simplex words vs. NPs)
• (F9c) applying different weights by candidate types
• (F9d) normalized TF and different weighting by candidate types
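For a two-word term, the Dice coefficient underlying F9 is 2·f(xy)/(f(x)+f(y)); a sketch, with Park et al.'s 10% discount for simplex words noted in a comment:

    def dice_cohesion(pair_tf, tf, x, y):
        """Dice coefficient for the two-word term 'x y'.
        pair_tf: Counter of bigram frequencies; tf: Counter of word frequencies.
        (For simplex words, Park et al. (2004) instead discount cohesion by 10%.)"""
        return 2 * pair_tf[(x, y)] / (tf[x] + tf[y])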

5.4 Other Features

F10: Acronym (S): Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K, to attest our candidate selection method.

F11: POS sequence (S): Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like acronyms, this is also subject to the data set.


F12: Suffix sequence (S): Similar to acronyms, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data dependent, and thus used in the supervised approach only.

F13: Length of Keyphrases (S/U): Barker and Corrnacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.

6 System and Data

To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. Then we partitioned the text into sections such as title and body sections via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4 and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.

In evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found many were missing or found only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the

2http://www.eng.ritsumei.ac.jp/asao/resources/sentseg
3http://wing.comp.nus.edu.sg/parsCit
4http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6http://maxent.sourceforge.net/index.html
7C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence-Multiagent Systems) and J.4 (Social and Behavioral Sciences-Economics)

number of author-assigned and reader-assigned keyphrases found in the documents.

          Author        Reader          Combined
Total     1252 (53)     3110 (111)      3816 (146)
NPs       904           2537            3027
Average   3.85 (4.01)   12.44 (12.88)   15.26 (15.85)
Found     769           2509            2864

Table 2: Statistics of Keyphrases

7 Evaluation

The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information and additional section information (i.e. F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence and suffix sequence.

Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e. the exact matching scheme) over the top 5th, 10th and 15th candidates.

BestFeatures includes F1c (TF of substring as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion by Park et al. (2004)) and F13 (length of keyphrases). Best-TFIDF means using all best features but TFIDF.

In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).8

In Table 5, the performance of each feature is measured using the N&K system and the target feature: + indicates an improvement, - indicates a performance decline, and * indicates no effect or an unconfirmed effect due to small changes in performance. Again, Supervised denotes Maximum Entropy training, and Unsupervised is our unsupervised approach.

8 Discussion

We compared the performance of our candidate selection and feature engineering with simple KEA, N&K and our baseline system. In evaluating candidate selection, we found that longer-

8Due to page limits, we present only the best performance.


Method / System / C, then Match, Precision, Recall and F-score at the top 5, top 10 and top 15 candidates:

All Candidates
  KEA       U   0.03   0.64   0.21   0.32  |  0.09   0.92   0.60   0.73  |  0.13   0.88   0.86   0.87
  KEA       S   0.79  15.84   5.19   7.82  |  1.39  13.88   9.09  10.99  |  1.84  12.24  12.03  12.13
  N&K       S   1.32  26.48   8.67  13.06  |  2.04  20.36  13.34  16.12  |  2.54  16.93  16.64  16.78
  baseline  U   0.92  18.32   6.00   9.04  |  1.57  15.68  10.27  12.41  |  2.20  14.64  14.39  14.51
  baseline  S   1.15  23.04   7.55  11.37  |  1.90  18.96  12.42  15.01  |  2.44  16.24  15.96  16.10

Length<=3 Candidates
  KEA       U   0.03   0.64   0.21   0.32  |  0.09   0.92   0.60   0.73  |  0.13   0.88   0.86   0.87
  KEA       S   0.81  16.16   5.29   7.97  |  1.40  14.00   9.17  11.08  |  1.84  12.24  12.03  12.13
  N&K       S   1.40  27.92   9.15  13.78  |  2.10  21.04  13.78  16.65  |  2.62  17.49  17.19  17.34
  baseline  U   0.92  18.40   6.03   9.08  |  1.58  15.76  10.32  12.47  |  2.20  14.64  14.39  14.51
  baseline  S   1.18  23.68   7.76  11.69  |  1.90  19.00  12.45  15.04  |  2.40  16.00  15.72  15.86

Length<=3 Candidates + Alternation
  KEA       U   0.01   0.24   0.08   0.12  |  0.05   0.52   0.34   0.41  |  0.07   0.48   0.47   0.47
  KEA       S   0.83  16.64   5.45   8.21  |  1.42  14.24   9.33  11.27  |  1.87  12.45  12.24  12.34
  N&K       S   1.53  30.64  10.04  15.12  |  2.31  23.08  15.12  18.27  |  2.88  19.20  18.87  19.03
  baseline  U   0.98  19.68   6.45   9.72  |  1.72  17.24  11.29  13.64  |  2.37  15.79  15.51  15.65
  baseline  S   1.33  26.56   8.70  13.11  |  2.09  20.88  13.68  16.53  |  2.69  17.92  17.61  17.76

Table 3: Performance on Proposed Candidate Selection

Features / C, then Match, Precision, Recall and F-score at the top 5, top 10 and top 15 candidates:

Best            U   1.14  22.8   7.47  11.3  |  1.92  19.2  12.6  15.2  |  2.61  17.4  17.1  17.3
Best            S   1.56  31.2  10.2   15.4  |  2.50  25.0  16.4  19.8  |  3.15  21.0  20.6  20.8
Best w/o TFIDF  U   1.14  22.8   7.4   11.3  |  1.92  19.2  12.6  15.2  |  2.61  17.4  17.1  17.3
Best w/o TFIDF  S   1.56  31.1  10.2   15.4  |  2.46  24.6  16.1  19.4  |  3.12  20.8  20.4  20.6

Table 4: Performance on Feature Engineering

Effect  Method        Features
+       Supervised    F1a, F2, F3, F4a, F4d, F9a
+       Unsupervised  F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       Supervised    F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
-       Unsupervised  F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
*       Supervised    F1e, F10, F11, F12
*       Unsupervised  F1b

Table 5: Performance on Each Feature

length candidates act as noise, decreasing the overall performance. We also confirmed that candidate alternation offers flexibility in keyphrase form, leading to higher candidate coverage as well as better performance.

To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not affect performance (a sketch of such substring counting is given after this paragraph); we attribute this to the fact that many keyphrases are substrings of candidates and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it relies on a heuristic factor that reduces the weight by 10 for simplex words. Normally, term cohesion applies to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.
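To make the substring counting concrete, here is a minimal Python sketch; it is our illustration, not the authors' implementation, and the function name substring_tf and the whitespace-joined document representation are simplifying assumptions.

    # Illustrative sketch: a candidate's TF also counts occurrences in
    # which the candidate appears as a substring of the running text,
    # so keyphrases that are substrings of longer candidates still score.
    from collections import Counter

    def substring_tf(candidates, document_tokens):
        text = " ".join(document_tokens).lower()
        tf = Counter()
        for cand in candidates:
            needle, start, count = cand.lower(), 0, 0
            while True:
                idx = text.find(needle, start)
                if idx == -1:
                    break
                count += 1
                start = idx + 1
            tf[cand] = count
        return tf

    print(substring_tf(["keyphrase extraction", "extraction"],
                       "keyphrase extraction improves extraction quality".split()))
    # Counter({'extraction': 2, 'keyphrase extraction': 1})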

As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.


9 Conclusion

We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous work while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.

10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.

References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10–17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of ACL, 1989, pp. 76–83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28–30.

Ernesto D'Avanzo and Bernado Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433–447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668–673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81–104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: Keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beáta Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537–544.

Mario Jarmasz and Caroline Barriere. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLiNE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349–357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157–169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296–297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207–223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317–326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48–55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä, and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434–439.

Xiaojun Wan and Jianguo Xiao. CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254–256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17–22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu

Abstract

We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.86% for idiomaticity identification and classification and 62.03% for literal identification and classification.

1 Introduction

In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more often than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.

For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE's meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs but have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.

To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning, and over 40% of their usages in text were actually literal. Thus, it would be important for an NLP application such as machine translation, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as a misclassification could potentially have detrimental effects on the quality of the translation.


In this paper, we address the problem of MWE classification for verb-noun construction (VNC) tokens in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.

This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.

2 Multi-word Expressions

MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come, ad hoc, and non-fixed expressions such as break new ground, speak of the devil. The VNCs which we focus on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living, and Verb Particle Constructions (VPC) such as write-up, call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.

3 Related Work

Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine whether the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.

Hashimoto and Kawahara (2008) (HK08) is the first large-scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.

4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions. We use the annotated data by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.1 YamCha is based on Support Vector Machines technology using degree 2 polynomial kernels.

We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside of a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
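As a concrete illustration of this labeling scheme, the following minimal Python sketch (ours, not the authors' code; the helper name iob_label and the (start, end) span convention are assumptions) reproduces the annotation of the running example.

    def iob_label(tokens, span, cls):
        # cls is "I" (idiomatic) or "L" (literal); span is (start, end)
        # over token indices, end exclusive.
        tags = ["O"] * len(tokens)
        start, end = span
        tags[start] = "B-" + cls
        for i in range(start + 1, end):
            tags[i] = "I-" + cls
        return list(zip(tokens, tags))

    sent = "John kicked the bucket last Friday".split()
    print(iob_label(sent, (1, 4), "I"))
    # [('John', 'O'), ('kicked', 'B-I'), ('the', 'I-I'),
    #  ('bucket', 'I-I'), ('last', 'O'), ('Friday', 'O')]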

We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features, such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma form (LEMMA) or citation form of the word, and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets, explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.

With the NE feature, we followed the same representation as the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN IdentiFinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other where we have a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class, Named Entity InText (NEI). For example, for the NEI condition, our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER".

1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder
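The column-based input can be pictured with the following sketch (our own simplification; feature_columns and the pre-supplied tag lists are hypothetical). With nei=True, the surface word is replaced by its NE class while the NGRAM column still comes from the original token, matching the example above.

    def feature_columns(tokens, pos_tags, ne_tags, iob_tags, nei=False):
        rows = []
        for tok, pos, ne, iob in zip(tokens, pos_tags, ne_tags, iob_tags):
            word = ne if (nei and ne != "O") else tok  # NEI substitution
            rows.append((word, pos, tok[-3:], iob))    # last-3-char NGRAM
        return rows

    sent = "John kicked the bucket last Friday".split()
    pos = ["NN", "VBD", "Det", "NN", "ADV", "NN"]
    ne = ["PER", "O", "O", "O", "O", "DAY"]
    iob = ["O", "B-I", "I-I", "I-I", "O", "O"]
    for row in feature_columns(sent, pos, ne, iob, nei=True):
        print("\t".join(row))
    # PER     NN   ohn  O
    # kicked  VBD  ked  B-I
    # ... etc.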

5 Experiments and Results

5.1 Data

We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC).3 The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4 corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.

5.2 Evaluation Metrics

We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
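For reference, with β = 1 this is the standard balanced F-measure:

    Fβ=1 = 2 · P · R / (P + R)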

5.3 Results

We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.6

3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only because it is the metric used by previous work.



In Table 2, we present the results yielded per feature and per condition. We experimented with different context sizes initially to decide on the optimal window size for our learning framework; results are presented in Table 1. Once that is determined, we proceed to add features.

Noting that a window size of ±3 yields the best results, we proceed to use that as our context size for the following experimental conditions. We will not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04%.

Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence the significant contribution to performance.

6 Discussion

The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features yields the best results.

6 We could have easily identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set.

In general, performance on the classification and identification of idiomatic expressions is much better. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors where idiomatic cases are confused with literal ones (23 times) and the opposite (4 times).

Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal are for the MWE types hit the road, blow trumpet, blow whistle, hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

Also, the system identified hit the post as a literal MWE VNC token in As the ball hit the post, the referee blew the whistle, where blew the whistle is a literal VNC in this context; it identified hit the post as another literal VNC.

7 Conclusion

In this study, we explored a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system; using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58%. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.


Window   IDM-F   LIT-F   Overall F   Overall Acc.
±1       77.93   48.57   71.78       96.22
±2       85.38   55.61   79.71       97.06
±3       86.99   55.68   81.25       96.93
±4       86.22   55.81   80.75       97.06
±5       83.38   50.0    77.63       96.61

Table 1: Results (in %) of varying context window size.

                  IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ              70.02   89.16   78.44   0       0       0       69.68
TOK               81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA           83.1    84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM           83.17   82.38   82.78   70      43.75   53.85   77.01
(P)OS             83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P             86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full          85.2    87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin           84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI               89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P    89.89   84.92   87.34   76.32   50.0    60.42   81.99
NES-Bin+L+N+P     90.86   84.92   87.79   76.32   50.0    60.42   82.33
NEI+L+N+P         91.35   88.42   89.86   81.69   50.0    62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data using different features and their combinations.

8 Acknowledgement

The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, which helped make this publication more concise and better presented.

References

Timothy Baldwin, Collin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

Dan I. Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23–30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de

Abstract

Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.

1 Introduction

Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:

(1) Der Rat sollte unsere Position berücksichtigen.
    'The Council should consider our position.'

(2) The Council should take account of our position.

This sentence pair has been taken from the German–English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that we are able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for each item in a target language (and whether these one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches that need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.

In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, but is based on the intuition that monolingual idiosyncrasies like the MWE realization of an entity are not likely to be mirrored in another language (see Section 4 for discussion).

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Therefore, Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.

2 Multiword Translations as MWEs

The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                 1-1    1-n    n-1    n-n    No.
anheben (v1)         53.5   21.2   9.2    16.0   325
bezwecken (v2)       16.7   51.3   0.6    31.3   150
riskieren (v3)       46.7   35.7   0.5    17.0   182
verschlimmern (v4)   30.2   21.5   28.6   44.5   275

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas, extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas and their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by (Graça et al., 2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en: increase) seems to have a frequent parallel realization, the verbs bezwecken (en: intend to) or verschlimmern (en: aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.

A closer look at the one-to-many translations reveals that these cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations: This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.


(3) Sie verschlimmern die Übel.
    'They aggravate the misfortunes.'

(4) Their action is aggravating the misfortunes.

Verb particle combinations: A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der Ausschuss bezweckt den Institutionen ein politisches Instrument an die Hand zu geben.
    'The committee intends to give the institutions a political instrument.'

(6) The committee set out to equip the institutions with a political instrument.

Verb preposition combinations: While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.

(7) Sie werden den Treibhauseffekt verschlimmern.
    'They will aggravate the greenhouse effect.'

(8) They will add to the greenhouse effect.

Light verb constructions (LVCs): This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).

(9) Ich werde die Sache nur noch verschlimmern.
    'I will only aggravate the matter.'

(10) I am just making things more difficult.

Idioms: This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to the gold standard's limited size and the source items we chose.

(11) Sie bezwecken die Umgestaltung in eine zivile Nation.
    'They intend the conversion into a civil nation.'

(12) They have in mind the conversion into a civil nation.

         v1        v2        v3        v4
Ntype    22 (26)   41 (47)   26 (35)   17 (24)
V Part   22.7      4.9       0.0       0.0
V Prep   36.4      41.5      3.9       5.9
LVC      18.2      29.3      88.5      88.2
Idiom    0.0       2.4       0.0       0.0
Para     36.4      24.3      11.5      23.5

Table 2: Proportions of MWE types per lemma.

Paraphrases: From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, like in the example below: ensure that something increases. However, semantically these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.

(13) Wir brauchen bessere Zusammenarbeit um die Rückzahlungen anzuheben.
    'We need better cooperation to increase the repayments.'

(14) We need greater cooperation in this respect to ensure that repayments increase.

Table 2 displays the proportions of the MWE categories over the types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variations only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en: increase) or bezwecken (en: intend) are frequently translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en: aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en: risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).

In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.

3 Multiword Translation Detection

This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.

3.1 One-to-many Alignments

To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the translational correspondences that the system discovers at the type level against the one-to-many translations in our gold standard. By type we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the translation types is so low that already at a threshold of 3 almost all types get filtered out.

        n > 0           n > 1           n > 3
verb    FPR    FNR      FPR    FNR      FPR    FNR
v1      0.97   0.93     1.0    1.0      1.0    1.0
v2      0.93   0.9      0.5    0.96     0.0    0.98
v3      0.88   0.83     0.8    0.97     0.67   0.97
v4      0.98   0.92     0.8    0.92     0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.

This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that corresponds to the source token is rare.
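The type-level comparison can be made concrete with a small sketch (our formulation; the paper does not spell one out). Here each translation type is a set of lemmatized English tokens, FPR is taken as the fraction of predicted types absent from the gold standard, and FNR as the fraction of gold types missed:

    def type_level_rates(predicted, gold):
        pred = {frozenset(t) for t in predicted}
        gold = {frozenset(t) for t in gold}
        fpr = len(pred - gold) / len(pred) if pred else 0.0
        fnr = len(gold - pred) / len(gold) if gold else 0.0
        return fpr, fnr

    print(type_level_rates([{"take", "account", "of"}],
                           [{"take", "account", "of"}, {"consider"}]))
    # (0.0, 0.5)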

This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en: impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die Korruption behindert die Entwicklung.
    'The corruption impedes the development.'

(16) Corruption constitutes an obstacle to development.

Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.

3.2 Aligning Syntactic Configurations

High-precision one-to-many translation detection thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found, and 2) how to align target items that have a low occurrence correlation.

We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations, b) filtering of the configurations, and c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.

[Figure 1: Example of a typical syntactic MWE configuration: the German verb behindert, with argument dependents X1 and Y1, aligned to the English configuration create an obstacle to, rooted in create, with arguments X2 and Y2.]

Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (DG, DE, AGE). DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. DE is the set of dependency triples of the target sentence. AGE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.

Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.

Filtering: Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in DG there is a tuple (sn, tn) in AGE, i.e. every dependent has an alignment.
3. There is a target item t1 in DE such that for each tn there is a path p in DE, (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn), that connects t1 and tn. Thus, the target dependents have a common root.

To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering does not exclude configurations that exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, the target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb, as sketched below.
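A simplified Python rendering of this generate-and-filter step follows; it is our sketch under the stated definitions, with dependency triples as (head, relation, dependent) tuples, a bounded path length standing in for the path filter, and all names our own.

    # Sketch (ours) of the syntactic filter over a word-aligned dependency
    # pair. Dependency triples are (head, rel, dependent); alignments are
    # (source_word, target_word) pairs, as in the definitions above.
    def valid_configuration(D_G, D_E, A_GE, source_verb, max_path_len=3):
        heads = {h for (h, _, _) in D_E}
        aligned = dict(A_GE)

        def reaches(root, node, depth=0):
            # Walk head-ward from `node`; True if `root` dominates `node`
            # within `max_path_len` steps (condition 3's path requirement).
            if root == node:
                return True
            if depth >= max_path_len:
                return False
            return any(reaches(root, h, depth + 1)
                       for (h, _, d) in D_E if d == node)

        deps = [d for (h, _, d) in D_G if h == source_verb]   # condition 1
        if not deps or not all(d in aligned for d in deps):   # condition 2
            return False
        targets = [aligned[d] for d in deps]
        return any(all(reaches(t1, tn) for tn in targets)     # condition 3
                   for t1 in heads)

    # The aligned configuration for sentence pair (15)/(16):
    D_G = [("behindert", "SB", "Korruption"), ("behindert", "OA", "Entwicklung")]
    D_E = [("constitutes", "SBJ", "Corruption"), ("constitutes", "OBJ", "obstacle"),
           ("obstacle", "NMOD", "an"), ("obstacle", "NMOD", "to"),
           ("to", "PMOD", "development")]
    A_GE = [("Korruption", "Corruption"), ("Entwicklung", "development")]
    print(valid_configuration(D_G, D_E, A_GE, "behindert"))   # True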

Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in Section 3.2 was that all the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.

Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) in DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) belongs in AGE. This translation problem largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti in T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.

2. For each ti, tx such that there is a (ti, rel, tx) in DE, compute corr(s1, T + tx).

3. If corr(s1, T + tx) >= corr(s1, T), add tx to T.

As the Dice coefficient has often been reported to give the best results, e.g. in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:

corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti in T, and freq(s1) accordingly. The joint frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence. A sketch of this post-editing loop is given below.
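A compact sketch of the resulting post-editing loop (our code; sentence pairs are represented as token lists and the helper names are hypothetical):

    def dice(s1, T, sentences):
        # freq counts over (source_tokens, target_tokens) sentence pairs,
        # following the definitions above.
        f_s = sum(1 for src, _ in sentences if s1 in src)
        f_T = sum(1 for _, tgt in sentences if all(t in tgt for t in T))
        f_joint = sum(1 for src, tgt in sentences
                      if s1 in src and all(t in tgt for t in T))
        return 2.0 * f_joint / (f_s + f_T) if (f_s + f_T) else 0.0

    def post_edit(s1, T, D_E, sentences):
        # Greedily add dependents of current target items whenever this
        # does not lower the Dice correlation with the source item.
        T = list(T)
        for head, _, dep in D_E:
            if head in T and dep not in T:
                if dice(s1, T + [dep], sentences) >= dice(s1, T, sentences):
                    T.append(dep)
        return T

Since the syntactic filter has already narrowed T down to a handful of candidate dependents, this greedy loop stays cheap.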

The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation of the English MWE run the risk of.

3.3 Evaluation

Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma do not figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that does not constrain the syntactic type of the parallel argument relations. Compared to the relaxed filter, the strict filter decreases the FPR significantly while leaving the FNR largely unchanged. This means that the syntactic filters mainly fire on noisy configurations and do not decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down the number of candidate configurations that our algorithm detects. As the dependency parses do not provide deep analyses for tense or control phenomena, very often a verb's arguments do not figure as its syntactic dependents, and no configuration is found. Future work will explore the impact of deep syntactic analysis for the detection of translational correspondences.

MWE evaluation: In a second experiment, we evaluated the patterns of correspondence found by our extraction method for use in an MWE context. We selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en: appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the portion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.

4 Related Work

The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was

       Strict Filter      Relaxed Filter
       FPR      FNR       FPR      FNR
v1     0.0      0.96      0.5      0.96
v2     0.25     0.88      0.47     0.79
v3     0.25     0.74      0.56     0.63
v4     0.0      0.875     0.56     0.84

Table 4: False positive and false negative rate of one-to-many translations.

Trans. type    Proportion    MWE type    Proportion
MWEs           57.5%         V Part      8.2%
                             V Prep      51.8%
                             LVC         32.4%
                             Idiom       10.6%
Paraphrases    24.4%
Alternations   1.0%
Noise          17.1%

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.

established by Wu (1997). Our way to use syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In Villada Moirón and Tiedemann (2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches don't rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.

By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In Bannard and Callison-Burch (2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Padó, 2009).

The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already


been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of simplex realization in a source language to an idiosyncratic multiword realization in the target language.

5 Conclusion

We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for monolingual interpretation of MWEs.

Our extraction method that is based on syntactic filters identifies MWE types with a much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to other items than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.

References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 311–326.

Katrin Erk and Sebastian Padó. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proceedings of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38/2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 Workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.



A Re-examination of Lexical Association Measures

Abstract

We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions Specifically we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms We empirically validate the effectiveness of these modified AMs in detection tasks in English performed on the Penn Treebank which shows significant improvement over the original AMs

1 Introduction

Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter, AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², log-likelihood

ratio, and the T score. In Pearce (2002), AMs such as Z score, Pointwise MI, cost reduction, left and right context entropy, and odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.

While previous works have evaluated AMs, there have been few details on why the AMs perform as they do. A detailed analysis of why these AMs perform as they do is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.

We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the

Hung Huu Hoang
Dept. of Computer Science, National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering, University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science, National University of Singapore
kanmy@comp.nus.edu.sg


mobility of noun phrase complements in transitive VPCs. They can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.

2 Classification of Association Measures

Although different AMs have different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure this association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI and their variants, as well as most of the association coefficients such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as T score, Z score, and Pearson's χ² test.

Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then

follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity; distance metrics such as the Euclidean and Manhattan norms; and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.

Table 1 lists all AMs used in our discussion. The lower left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.

3 Evaluation

We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performances of AMs are measured in our experiments.

3.1 Evaluation Data

In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.

For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here, we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.


Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,

AM Name: Formula

M1 Joint Probability: f(xy) / N
M2 Mutual Information (MI): (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3 Log-likelihood ratio: 2 Σ_ij f_ij log(f_ij / f̂_ij)
M4 Pointwise MI (PMI): log( P(xy) / (P(x*) P(*y)) )
M5 Local-PMI: f(xy) × PMI
M6 PMI^k: log( N f(xy)^k / (f(x*) f(*y)) )
M7 PMI²: log( N f(xy)² / (f(x*) f(*y)) )
M8 Mutual Dependency: log( P(xy)² / (P(x*) P(*y)) )
M9 Driver-Kroeber: a / √((a+b)(a+c))
M10 Normalized expectation: 2a / (2a + b + c)
M11 Jaccard: a / (a + b + c)
M12 First Kulczynski: a / (b + c)
M13 Second Sokal-Sneath: a / (a + 2(b + c))
M14 Third Sokal-Sneath: (a + d) / (b + c)
M15 Sokal-Michiner: (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
M17 Hamann: ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds ratio: ad / bc
M19 Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q: (ad − bc) / (ad + bc)
M21 Brawn-Blanquet: a / max(a + b, a + c)
M22 Simpson: a / min(a + b, a + c)
M23 S cost: log(1 + min(b, c) / (a + 1))^(−1/2)
M24 Adjusted S cost (*): log(1 + max(b, c) / (a + 1))^(−1/2)
M25 Laplace: (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace (*): (a + 1) / (a + max(b, c) + 2)
M27 Fager: [M9] − (1/2) max(b, c)
M28 Adjusted Fager (*): [M9] − max(b, c) / √(aN)
M29 Normalized PMIs (*): PMI / NF(α); PMI / NFmax
M30 Simplified normalized PMI for VPCs (*): log(ad) / (α×b + (1−α)×c)
M31 Normalized MIs (*): MI / NF(α); MI / NFmax

where NF(α) = α P(x*) + (1−α) P(*y), α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). The contingency table of a bigram (x, y) records co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(xȳ), c = f21 = f(x̄y), d = f22 = f(x̄ȳ), with marginals f(x*) = a + b and f(*y) = a + c; x̄ stands for all words except x, * stands for all words, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.
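To fix the notation, here is a minimal sketch (ours, not from the paper; the counts are toy values) computing a few of the Table 1 measures directly from the contingency counts a, b, c, d:

import math

def measures(a, b, c, d):
    """A few Table 1 AMs from contingency counts: a = f(xy), b = f(x,~y),
    c = f(~x,y), d = f(~x,~y); N is the total number of bigrams."""
    n = a + b + c + d
    fx, fy = a + b, a + c                         # marginals f(x*), f(*y)
    return {
        "joint_prob":     a / n,                  # M1
        "pmi":            math.log(n * a / (fx * fy)),  # M4
        "driver_kroeber": a / math.sqrt(fx * fy), # M9
        "jaccard":        a / (a + b + c),        # M11
        "odds_ratio":     (a * d) / (b * c),      # M18
        "brawn_blanquet": a / max(fx, fy),        # M21
        "simpson":        a / min(fx, fy),        # M22
    }

print(measures(a=30, b=15, c=60, d=895))  # toy counts, N = 1000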


permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.

Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al. (2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                      VPC data       LVC data    Mixed
Total (freq ≥ 6)      413            100         513
Positive instances    117 (28.33%)   28 (28%)    145 (23.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.

3.2 Evaluation Metric

To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates

(the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
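Concretely, AP over a ranked candidate list can be computed as the mean of the precision values at the ranks where true positives occur; a minimal sketch (ours; the label sequence is toy data):

def average_precision(ranked_labels):
    """AP for a candidate list ranked by a given AM (True = gold MWE):
    average of precision@rank over every rank holding a true positive."""
    hits, precisions = 0, []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([True, False, True, False]))  # (1/1 + 2/3) / 2 ≈ 0.833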

3.3 Evaluation Results

Figures 1(a, b) give the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performances of most context-based and hypothesis test AMs are very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that is relied on in hypothesis tests.

4 Rank Equivalence

We note that some AMs, although not mathematically equivalent (i.e., assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r,C M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck belonging to C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:


Corollary: If M1 ≡r,C M2, then APC(M1) = APC(M2), where APC(Mi) stands for the average precision of the AM Mi over the data set C.

Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the following 3 AMs: Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with their (most simple) representatives from each rank equivalent group.

¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types. Hence, we still consider such AM pairs to be rank equivalent.

1) [M2] Mutual Information, [M3] Log-likelihood ratio
2) [M7] PMI², [M8] Mutual Dependency, [M9] Driver-Kroeber (aka Ochiai)
3) [M10] Normalized expectation, [M11] Jaccard, [M12] First Kulczynski, [M13] Second Sokal-Sneath (aka Anderberg)
4) [M14] Third Sokal-Sneath, [M15] Sokal-Michiner, [M16] Rogers-Tanimoto, [M17] Hamann
5) [M18] Odds ratio, [M19] Yule's ω, [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
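To make the notion concrete, here is a minimal sketch (ours; the candidate tuples are toy contingency counts) that empirically tests rank equivalence of two AMs over a candidate set, e.g. [M18] Odds ratio versus [M20] Yule's Q, which is a monotone transform of the odds ratio:

import itertools
import math

def rank_equivalent(m1, m2, candidates, eps=1e-12):
    """True if every pair of candidates is ordered (or tied) identically
    by the two measures, i.e. the ranked lists coincide."""
    def cmp(u, v):
        return 0 if math.isclose(u, v, abs_tol=eps) else (1 if u > v else -1)
    return all(
        cmp(m1(cj), m1(ck)) == cmp(m2(cj), m2(ck))
        for cj, ck in itertools.combinations(candidates, 2)
    )

# candidates given as (a, b, c, d) contingency tuples (toy values)
odds = lambda t: (t[0] * t[3]) / (t[1] * t[2])                          # M18
yule_q = lambda t: (t[0] * t[3] - t[1] * t[2]) / (t[0] * t[3] + t[1] * t[2])  # M20
cands = [(30, 15, 60, 895), (10, 40, 20, 930), (5, 5, 5, 985)]
print(rank_equivalent(odds, yule_q, cands))  # True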

5 Examination of Association Measures

We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.

5.1 Mutual Information and Pointwise Mutual Information

In Figure 1, over 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in

Figure 1a: AP profile of AMs examined over our VPC data set.
Figure 1b: AP profile of AMs examined over our LVC data set.

Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Bold points indicate AMs discussed in this paper. Legend: hypothesis test AMs; ◊ Class I AMs excluding hypothesis test AMs; + context-based AMs.


identifying LVCs. In this section, we show how their performances can be improved significantly.

Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

    MI(U, V) = Σ_{u,v} p(uv) log( p(uv) / (p(u*) p(*v)) )

In the context of bigrams, the above formula can be simplified to [M2] MI = (1/N) Σ_ij f_ij log(f_ij / f̂_ij). While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values: PMI(x, y) = log( P(xy) / (P(x*) P(*y)) ) = log( N f(xy) / (f(x*) f(*y)) ). It has long

been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI with f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying f(xy) directly with PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                VPCs           LVCs           Mixed
Best [M6] PMI^k   54.7 (k=.13)   57.3 (k=.85)   54.4 (k=.32)
[M4] PMI          51.0           56.6           51.5
[M5] Local-PMI    25.9           39.3           27.2
[M1] Joint Prob.  17.0           2.8            17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.

Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X, Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and PMI(x, y) ≤ min(−log P(x*), −log P(*y)).

These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α P(x*) + (1−α) P(*y) with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on alpha, can be optimized by setting an appropriate alpha value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of alpha and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks, respectively. If one re-writes MI as (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.

AM            VPCs           LVCs           Mixed
MI / NF(α)    50.8 (α=.48)   58.3 (α=.47)   51.6 (α=.5)
MI / NFmax    50.8           58.4           51.8
[M2] MI       27.3           43.5           28.9
PMI / NF(α)   59.2 (α=.8)    55.4 (α=.48)   58.8 (α=.77)
PMI / NFmax   56.5           51.7           55.6
[M4] PMI      51.0           56.6           51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best alpha settings shown in parentheses.
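A minimal sketch of the normalization just described (ours; the contingency counts are toy values), covering both the tuned NF(α) and the tuning-free NFmax variants of [M29]:

import math

def normalized_pmi(a, b, c, d, alpha=None):
    """[M29]: PMI divided by NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y),
    or by NFmax = max(P(x*), P(*y)) when alpha is None (no tuning data)."""
    n = a + b + c + d
    px, py = (a + b) / n, (a + c) / n        # marginal probabilities
    pmi = math.log((a / n) / (px * py))
    nf = max(px, py) if alpha is None else alpha * px + (1 - alpha) * py
    return pmi / nf

# toy contingency counts for a verb-particle bigram
print(normalized_pmi(30, 15, 60, 895))             # PMI / NFmax
print(normalized_pmi(30, 15, 60, 895, alpha=0.8))  # PMI / NF(0.8)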

5.2 Penalization Terms

It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being MWEs. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulae whose values are inversely proportional to the likelihood of being MWEs. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Brawn-Blanquet (aka Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S Cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                      VPCs    LVCs    Mixed
[M21] Brawn-Blanquet    47.8    57.8    48.6
[M22] Simpson           24.9    38.2    26.0
[M24] Adjusted S Cost   48.5    57.7    49.2
[M23] S Cost            24.9    38.8    26.0
[M26] Adjusted Laplace  48.6    57.7    49.3
[M25] Laplace           24.1    38.8    25.4

Table 6: Replacing min() with max() in selected AMs.

In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is no stranger but rank equivalent to [M7] PMI². In our application, this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ by a scaled-down version of max(b, c). We have approximately derived √(aN) as a lower bound estimate of max(b, c) using the independence assumption, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                    VPCs    LVCs    Mixed
[M28] Adjusted Fager  56.4    54.3    55.4
[M27] Fager           55.2    43.9    52.5

Table 7: Performance of Fager and its adjusted version.

The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary that we scale b and c properly, as in the modified [M14]: −α×b − (1−α)×c. Having scaled the constituents properly, we still see that this measure by itself is not a good one, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14] = PMI / (α×b + (1−α)×c). The denominator of [MR14] is obtained by removing the minus sign from the modified [M14] so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α P(x*) + (1−α) P(*y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically tried to simplify [MR14] to the following AM:

[M30] = log(ad) / (α×b + (1−α)×c)

The setting of alpha in Table 8 below is taken from the best alpha setting obtained in the experiment on the normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI, and better than Third Sokal-Sneath, over the VPC data set.

AM                               VPCs          LVCs           Mixed
PMI / NF(α)                      59.2 (α=.8)   55.4 (α=.48)   58.8 (α=.77)
[M30] log(ad)/(α×b + (1−α)×c)    60.0 (α=.8)   48.4 (α=.48)   58.8 (α=.77)
[M14] Third Sokal-Sneath         56.5          45.3           54.6

Table 8: AP performance of suggested VPCs' penalization terms and AMs.
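For completeness, a one-line rendering of [M30] (our sketch; the counts are toy values):

import math

def m30(a, b, c, d, alpha):
    """[M30]: log(ad) over the linear penalization term alpha*b + (1-alpha)*c,
    the simplified normalized PMI suggested for VPCs."""
    return math.log(a * d) / (alpha * b + (1 - alpha) * c)

print(m30(30, 15, 60, 895, alpha=0.8))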

With the same intention and method, we have found that, while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α c^(1−α). The best alpha value, also taken from the normalized PMI experiments (Table 5), is nearly .5. Under this setting, this penalization term is exactly the denominator of the [M18] Odds Ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.


AM                 VPCs    LVCs    Mixed
−b − c             56.5    45.3    54.6
1/(bc)             50.2    53.2    50.2
[M18] Odds ratio   44.3    56.7    45.6

Table 9: AP performance of suggested LVCs' penalization terms and AMs.

6 Conclusions

We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and we find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired T test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.

We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α c^(1−α) is suitable for LVCs. Our introduced alpha tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger

web-based datasets of English to verify the performance gains of our modified AMs.

Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).

References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association of Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proc. of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco, and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, M. Yen, and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.

Appendix A: List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.



Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

R. Mahesh K. Sinha
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in

Abstract

A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitute an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of the light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.

1 Introduction

Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb, or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007), and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features. They report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.

In this paper, we present a simple strategy for mining CPs in Hindi using the projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation, and results.

2 Complex Predicates in Hindi

A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective, or an adverb as the main predicator, followed by a light verb (LV). Thus, a CP can be a noun+LV, an adjective+LV, a verb+LV, or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning


distinct from that of the LV. CPs are also referred to as complex or compound verbs.

Given below are some examples.

(1) CP = noun+LV; noun = ashirwad (blessings), LV = denaa (to give)
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings), makes a complex predicate verb form ashirwad diyaa (blessed) in the past tense form. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2), the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush (happy), LV = karanaa (to do)
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here, the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa (to read), LV = lenaa (to take)
usane pustak paRh liyaa

उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here, the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases, the light verb acts as an aspectual, a modal, or an intensifier.

(5) CP = verb+LV; verb = phaadanaa (to tear), LV = denaa (to give)
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here, the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa (to give), LV = maaranaa (to hit, to kill)
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here, the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning. This may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV; CP = vaapas karanaa (to return), LV = denaa (to give)
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave


'he returned the book'

Here, two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One way of interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs: vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.

From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that, in the translated text, the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.

3 System Design

As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows (a code sketch follows the list):

1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as simple verbs (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence, and mark its position in the sentence (K).
      ii) If the LV or its morphological derivative is found, then search for the equivalent English meanings of any of the morphological forms (Figure 2) in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit, i.e., go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e., go to step (a); else, the identified CP is W+LV.
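A minimal sketch of this procedure (our rendering of the steps above; the data structures and the toy example are illustrative, and the morphological generators are replaced by precomputed form lists):

def mine_cps(aligned_pairs, lv_forms, lv_meanings, stop_words, exit_words):
    """For each aligned (hindi_tokens, english_tokens) pair: if a light-verb
    form occurs in the Hindi sentence but none of its English meaning forms
    occurs in the aligned English sentence, scan left past stop words and
    pair the first content word W with the light verb as a candidate CP W+LV."""
    cps = []
    for hindi, english in aligned_pairs:
        eng = set(english)
        for lv, forms in lv_forms.items():
            for k, tok in enumerate(hindi):
                if tok not in forms:
                    continue                     # not this light verb
                if eng & lv_meanings[lv]:
                    continue                     # literal meaning present -> not a CP
                j = k - 1
                while j >= 0 and hindi[j] in stop_words:
                    j -= 1                       # ignore intervening stop words
                if j >= 0 and hindi[j] not in exit_words:
                    cps.append((hindi[j], lv))   # identified CP: W + LV
    return cps

# toy example built on example (1) above
pairs = [(["usane", "mujhe", "ashirwad", "diyaa"], ["he", "blessed", "me"])]
lv_forms = {"denaa": {"diyaa", "dii", "de", "denaa"}}
lv_meanings = {"denaa": {"give", "gives", "gave", "given", "giving"}}
print(mine_cps(pairs, lv_forms, lv_meanings, stop_words=set(), exit_words=set()))
# [('ashirwad', 'denaa')]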

Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as simple verbs, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings have been shown within parentheses and are not used for matching.

Light verb (base form)    Root verb meaning
baithanaa बैठना            sit
bananaa बनना               make/become/build/construct/manufacture/prepare
banaanaa बनाना             make/build/construct/manufacture/prepare
denaa देना                 give
lenaa लेना                 take
paanaa पाना                obtain/get
uthanaa उठना               rise/arise/get up
uthaanaa उठाना             raise/lift/wake up
laganaa लगना               feel/appear/look/seem
lagaanaa लगाना             fix/install/apply
cukanaa चुकना              (finish)
cukaanaa चुकाना            pay
karanaa करना               (do)
honaa होना                 happen/become/be
aanaa आना                  come
jaanaa जाना                go
khaanaa खाना               eat
rakhanaa रखना              keep/put
maaranaa मारना             kill/beat/hit
daalanaa डालना             put
haankanaa हाँकना            drive

Table 1: Some of the common light verbs in Hindi.


For each of the Hindi light verbs, all morphological forms are generated. A few illustrations are given in Figures 1(a) and 1(b). Similarly, for each of the English meanings of the light verb, all of its morphological derivatives are generated. Figure 2 shows a few illustrations of the same.

There are a number of words that can appear in between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. These are words that denote negation or are emphasizers, intensifiers, interrogative pronouns, or particles. A list of stop words used in the experimentation is given in Figure 3.


We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions, and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list used. However, this list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform parts-of-speech (POS) tagging, and so the nature of the word preceding the LV is unknown to the system.

Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa (जाना, to go): jaa जा (stem), jaae जाए (imperative), jaao जाओ (imperative), jaaeM जाएँ (imperative), jaauuM जाऊँ (first-person), jaane जाने (infinitive oblique), jaanaa जाना (infinitive masculine singular), jaanii जानी (infinitive feminine singular), jaataa जाता (indefinite masculine singular), jaatii जाती (indefinite feminine singular), jaate जाते (indefinite masculine plural/oblique), jaaniiM जानीं (infinitive feminine plural), jaatiiM जातीं (indefinite feminine plural), jaaoge जाओगे (future masculine singular), jaaogii जाओगी (future feminine singular), gaii गई (past feminine singular), jaauuMgaa जाऊँगा (future masculine first-person singular), jaayegaa जायेगा (future masculine third-person singular), jaauuMgii जाऊँगी (future feminine first-person singular), jaayegii जायेगी (future feminine third-person singular), gaye गये (past masculine plural/oblique), gaiiM गईं (past feminine plural), gayaa गया (past masculine singular), gayii गयी (past feminine singular), jaayeMge जायेंगे (future masculine plural), jaayeMgii जायेंगी (future feminine plural), jaakara जाकर (completion).

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa (लेना, to take): le ले (stem), lii ली (past), leM लें (imperative), lo लो (imperative), letaa लेता (indefinite masculine singular), letii लेती (indefinite feminine singular), lete लेते (indefinite masculine plural/oblique), liiM लीं (past feminine plural), luuM लूँ (first-person), legaa लेगा (future masculine third-person singular), legii लेगी (future feminine third-person singular), lene लेने (infinitive oblique), lenaa लेना (infinitive masculine singular), lenii लेनी (infinitive feminine singular), liyaa लिया (past masculine singular), leMge लेंगे (future masculine plural), loge लोगे (future masculine singular), letiiM लेतीं (indefinite feminine plural), luuMgaa लूँगा (future masculine first-person singular), luuMgii लूँगी (future feminine first-person singular), lekara लेकर (completion).

Figure 2: Morphological derivatives of sample English meanings. English word sit: sit, sits, sat, sitting. English word give: give, gives, gave, given, giving.



The inner loop of the procedure identifies multiple CPs that may be present in a sentence. The outer loop is for mining the CPs in the entire corpus. The experimentation and results are discussed in the following section.

4 Experimentation and Results

The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92%, and the recall is between 89% and 100%. The F-measure is 88% to 97%. This is a remarkable and somewhat surprising result from the simple methodology, without much linguistic or statistical analysis. This is much higher than what has been reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification. This is the only other work that uses a parallel corpus and covers all kinds of CPs. The results as reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                File 1   File 2   File 3   File 4   File 5   File 6
No. of sentences                112      193      102      43       133      107
Total no. of CPs (N)            200      298      150      46       188      151
Correctly identified CPs (TP)   195      296      149      46       175      135
V-V CPs                         56       63       9        6        15       20
Incorrectly identified CPs (FP) 17       44       7        11       16       20
Unidentified CPs (FN)           5        2        1        0        13       16
Accuracy (%)                    97.50    99.33    99.33    100.0    93.08    89.40
Precision (TP/(TP+FP)) (%)      91.98    87.05    95.51    80.70    91.62    87.09
Recall (TP/(TP+FN)) (%)         97.50    98.33    99.33    100.0    93.08    89.40
F-measure (2PR/(P+R)) (%)       94.6     92.3     97.4     89.3     92.3     88.2

Table 2: Results of the experimentation.
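As a concrete reading of the last three rows of Table 2, a minimal sketch (ours) that reproduces the File 1 column from its TP/FP/FN counts:

def prf(tp, fp, fn):
    """Precision, recall and F-measure as defined in Table 2."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(prf(tp=195, fp=17, fn=5))  # File 1: (0.9198..., 0.975, 0.9466...)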

Figure 3: Stop words in Hindi used by the system: नहीं (no/not), न (no/not, Hindi particle), भी (also, Hindi particle), ही (only, Hindi particle), तो (then, Hindi particle), क्यों (why), क्या (what, Hindi particle), कहाँ (where, Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अंत में (in the end), आखिर में (in the end).

Figure 4: A few exit words in Hindi used by the system: ने (ergative case marker), को (accusative case marker), का (possessive case marker), के (possessive case marker), की (possessive case marker), से (from/by/with), में (in/into), पर (on/but), और (and, Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that, Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा (my), मेरे (my), मेरी (my), तुम्हारा (your), तुम्हारे (your), तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना (own), अपनी (own), अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका (whose), जिसकी (whose), जिसके (whose), जिनको (to whom), जिनके (to whom).


Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.

Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।

The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. खुशी होना (to feel happy, good)
v. मदद करना (to help)

Here, the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.

Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।

The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. कद्र करना (to respect)

Here also, the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not being checked. However, the mining success rate obtained speaks of these being small in number in practice. Use of the 'stop words' in allowing intervening words within the CPs helps a lot in improving the performance. Similarly, use of the 'exit words' avoids a lot of incorrect identification.

5 Conclusions

The simple empirical method for mining CPs outlined in this work yields an average of 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs; this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. It can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both the alignment accuracy and the speed.

The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family.

References

Anthony McEnery, Paul Baker, Rob Gaizauskas, and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni, and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. In Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King, and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553–564.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 47–54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren¹, Yajuan Lü¹, Jie Cao¹, Qun Liu¹, Yun Huang²

¹Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn

²Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
huangyun@comp.nus.edu.sg

Abstract

Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, a state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.

1 Introduction

The phrase-based machine translation model has proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.

A multiword expression (MWE) can be considered as a word sequence with relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e., the property that the meaning of the whole expression cannot be derived from its component words. Stanford University launched an MWE project¹, in which different qualities of MWEs were presented. For bilingual multiword expressions (BiMWEs), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is an MWE in the source language; and (2) the source phrase and the target phrase are exact translations of each other, i.e., there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since an MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications, such as information retrieval, question answering, word sense disambiguation and so on.

For machine translation, Piao et al. (2005) noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as

¹http://mwe.stanford.edu


one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in a further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are as follows:

• Instead of using a bilingual n-gram translation model, we choose the phrase-based SMT system Moses², which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics lead to only small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).

Besides the above differences, there are some advantages to our approaches:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to the noisy automatic word alignment, errors in the word alignment would otherwise propagate to the SMT system and hurt SMT performance.

• We conduct experiments on domain-specific corpora. For one thing, a domain-specific corpus potentially includes a large number of technical terms as well as more general fixed expressions and idioms, i.e., a domain-specific corpus has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are flexible and concise. Take the Chinese term "软坚散结" for example. The meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or out of this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.

²http://www.statmt.org/moses/

• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus, it allows full exploitation of the available resources without greatly increasing the time and space complexity of the SMT system.

The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we give only a brief introduction to monolingual and bilingual MWE extraction.

2 Bilingual Multiword Expression Extraction

In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.

2.1 Automatic Extraction of MWEs

In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et al., 2005); (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007); and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparation step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.

The Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary length. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".

Definition 1 (Unit): A unit is any sub-string of the given sentence. For example, "A B", "C", "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score): The score function is defined only on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit³. For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select): The selecting operator finds the two adjacent units with the maximum score in a list.

Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce the unit "B C" and the unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".

Initially, every word in the sentence is considered as one unit, and all these units form an initial list L. If the sentence is of length N, the list contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with the maximum score in L, naming them U1 and U2; and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. Our algorithm terminates on two conditions: (1) the maximum score after selection is less than a given threshold, or (2) L contains only one unit.

³We use a stoplist to eliminate the units containing function words by setting their score to 0.
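To make the select/reduce loop concrete, here is a minimal Python sketch of HRA under our reading of the definitions above; llr is an assumed word-association function (the stoplist handling of footnote 3 can be folded into it by returning 0 for function words):

```python
def hra(sentence, llr, threshold=20.0):
    """LLR-based Hierarchical Reducing Algorithm (sketch).

    sentence: list of words.  llr(w1, w2): association score of two
    adjacent words.  Returns the extracted MWE candidates, in order."""
    units = [[w] for w in sentence]   # initial list L: one unit per word
    mwes = []                         # final set S
    while len(units) > 1:
        # Select: adjacent pair with the maximum score, i.e. the LLR of
        # the last word of the left unit and the first word of the right.
        scores = [llr(units[i][-1], units[i + 1][0])
                  for i in range(len(units) - 1)]
        i = max(range(len(scores)), key=scores.__getitem__)
        if scores[i] < threshold:     # termination condition (1)
            break
        # Reduce: concatenate the two units and put the result back.
        units[i:i + 2] = [units[i] + units[i + 1]]
        mwes.append(" ".join(units[i]))
    return mwes
```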

[Figure 1 shows the hierarchical structure built by the algorithm over the example sentence, with the LLR scores of adjacent words (14.71, 675.52, 105.96, 0, 0, 80.96) at the bottom level and the reduced units above them.]

Figure 1: Example of the Hierarchical Reducing Algorithm

Let us make the algorithm clearer with an example. Assume the score threshold is 20, and the given sentence is "防 高 血脂 症 用 保健 茶"⁴. Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs ("高 血脂", "高 血脂 症", "保健 茶", "防 高 血脂 症") are extracted, in that order, and the sub-strings above the dotted line in Figure 1 are not extracted.

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in HRA, MWEs are scored with the minimum LLR of adjacent words within them, which gives the lexical confidence of the extracted MWEs. Finally, for a given sentence of length N, HRA is guaranteed to terminate within N - 1 iterations, which is very efficient.

However, HRA has a problem: it extracts substrings before extracting the whole string, even if the substrings appear only in that particular whole string, which we consider useless. To solve this problem, we use contextual features, contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.

⁴The whole sentence means "healthy tea for preventing hyperlipidemia", and we give the meaning of each Chinese word: 防 (preventing), 高 (hyper-), 血脂 (-lipid-), 症 (-emia), 用 (for), 保健 (healthy), 茶 (tea).

2.2 Automatic Extraction of MWE's Translations

In subsection 2.1, we described the algorithm to obtain MWEs; in this subsection, we introduce the procedure to find their translations in the parallel corpus.

To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:

1. Run GIZA++⁵ to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs whose source-language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in the traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.

⁵http://www.fjoch.com/GIZA++.html

We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
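The paper trains a Collins-style perceptron over the ten features above; as a minimal sketch (the helper names and the feature-extraction step are ours, not the paper's), a plain binary perceptron could look like this:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Plain binary perceptron for filtering candidate translations.
    X: (n_examples, n_features) array built from the five translation
    features and five language features described above;
    y: labels in {-1, +1} (wrong / correct translation)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:  # misclassified: update weights
                w += y_i * x_i
    return w

def classify(w, x):
    """Label a candidate phrase pair with the learned weight vector."""
    return 1 if np.dot(w, x) > 0 else -1
```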

3 Application of Bilingual MWEs

Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.

3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will be more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. The fact that additional resources can improve domain-specific SMT performance was proved by many researchers (Wu et al., 2008; Eck et al., 2004). However, our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.

3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) once pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source language phrase contains an MWE (as a substring) and the target language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.

3.3 Additional Phrase Table of Bilingual MWEs

Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As for probability assignment, we simply assign 1 to the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in the input sentence during translation, the decoder searches all candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to denote this method.

4 Experiments

4.1 Data

We ran experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit⁶ to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.

                      Chinese      English
Training  Sentences          120,355
          Words       4,688,873    4,737,843
Dev       Sentences            1,000
          Words          31,722       32,390
Test      Sentences            1,000
          Words          41,643       40,551

Table 1: Traditional medicine corpus

⁶http://www.speech.sri.com/projects/srilm/

Table 2 gives the details of the data in the chemical industry domain. In this experiment, 59,466 bilingual MWEs are extracted.

                      Chinese      English
Training  Sentences          120,856
          Words       4,532,503    4,311,682
Dev       Sentences            1,099
          Words          42,122       40,521
Test      Sentences            1,099
          Words          41,069       39,210

Table 2: Chemical industry corpus

We test translation quality on the test set and use the open-source tool mteval-v11b.pl⁷ to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform a statistical significance test between two translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.

4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in a log-linear model. To obtain the best translation $\hat{e}$ of the source sentence $f$, the log-linear model uses the following equation:

$$\hat{e} = \arg\max_{e} p(e|f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f) \qquad (1)$$

in which $h_m$ and $\lambda_m$ denote the $m$-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
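Moses performs the actual decoding; purely to illustrate the scoring of Eq. (1), the following sketch picks the best of a fixed candidate list under given weights (names are ours, not part of Moses):

```python
def best_translation(candidates, weights):
    """Pick the candidate e maximizing sum_m lambda_m * h_m(e, f), Eq. (1).
    candidates: iterable of (translation, features) pairs, where features
    is the list of feature values h_m for that candidate."""
    def score(pair):
        _, feats = pair
        return sum(lam * h for lam, h in zip(weights, feats))
    return max(candidates, key=score)[0]
```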

⁷http://www.nist.gov/speech/tests/mt/resources/scoring.htm


4.3 Results

Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, of 0.61 BLEU points, over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement, and Baseline+BiMWE achieves slightly higher translation quality than the baseline system.

To our disappointment, however, none of these improvements are statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assume x follows a Gaussian distribution; then the ranking score of a phrase pair (s, t) is defined by the following formula:

$$\mathrm{Score}(s,t) = \log(\mathrm{LLR}(s,t)) \times \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (2)$$

Here the mean $\mu$ and variance $\sigma^2$ are estimated from the training set. After ranking by score, we select the top 50,000, 60,000 and 70,000 bilingual MWEs to perform the three methods mentioned in Section 3. The results are shown in Table 4.
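A minimal sketch of the ranking score in Eq. (2), assuming the pair's LLR and length ratio x are given and μ, σ have been estimated from the training set (the function name is ours):

```python
import math

def rank_score(llr, x, mu, sigma):
    """Ranking score of a phrase pair (Eq. 2): the log-LLR of the pair,
    weighted by a Gaussian density over the source/target length ratio x."""
    gauss = (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
             / (math.sqrt(2 * math.pi) * sigma))
    return math.log(llr) * gauss
```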

Methods            50,000    60,000    70,000
Baseline                     0.2658
Baseline+BiMWE     0.2671    0.2686    0.2715
Baseline+Feat      0.2728    0.2734    0.2712
Baseline+NewBP     0.2662    0.2706    0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain

From this table, we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant BLEU improvements (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion agrees with Lambert and Banchs (2005).

To verify the assumption that bilingual MWEs do indeed improve SMT performance not only on one particular domain, we also performed experiments on the chemical industry domain. Table 5 shows the results. From this table, we can see that the three methods improve translation performance on the chemical industry domain as well as on the traditional medicine domain.

Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain

4.4 Discussion

In order to know in what respects our methods improve translation performance, we manually analyzed some test sentences and give some examples in this subsection.

(1) For the first example in Table 6, "通络" is aligned to other words and not correctly translated in the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since the bilingual MWE ("通络", "dredging meridians") has been added into the training corpus and then aligned by GIZA++.

(2) For the second example in Table 6, "药茶" has two candidate translations in the phrase table, "tea" and "medicated tea". The baseline system chooses "tea" as the translation of "药茶", while the Baseline+Feat system chooses "medicated tea", because the additional feature gives a high probability to the correct translation.


Src       (Chinese source sentence)
Ref       the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline  the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE    the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src       (Chinese source sentence)
Ref       the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline  may also be made into tablet, pill, powder, tea or injection
+Feat     may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples


5 Conclusion and Future Work

This paper presented the LLR-based Hierarchical Reducing Algorithm to automatically extract bilingual MWEs, and investigated the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies can also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are rather simple and coarse. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in phrase-based SMT systems, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.

Acknowledgments

This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward, and Weijing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp

Abstract

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit: it first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method, and then determines the best label assignment in the bunsetsu bottom-up, as in the CKY parsing algorithm, again using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

1 Introduction

Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).

In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.

When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized; the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano and Hirai (2004) and Sasano and Kurohashi (2008) utilized information from the head of the bunsetsu¹. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.

This paper proposes Japanese bottom-up named

¹Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.


(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama TOP)
    'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences

entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun (we call it a chunk hereafter) as the labeling unit and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen by a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 gives an overview of our proposed method. Section 4 presents two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives the experimental result.

2 Related Work

In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.

class           example
PERSON          Kimura Syonosuke
LOCATION        Taiheiyou (Pacific Ocean)
ORGANIZATION    Jimin-tou (Liberal Democratic Party)
ARTIFACT        PL-houan (PL bill)
DATE            21-seiki (21st century)
TIME            gozen-7-ji (7 a.m.)
MONEY           500-oku-en (50 billion yen)
PERCENT         20 percent

Table 1: NE classes and their examples

NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.

In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE². The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
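For concreteness, the 33-label set can be enumerated as in the following sketch (the label spellings follow the S/B/I/E scheme described above, with O for non-NE morphemes):

```python
# The 33 labels of the SE-algorithm: S/B/I/E for each of the
# eight NE classes, plus O for morphemes outside any NE.
NE_CLASSES = ["PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
              "DATE", "TIME", "MONEY", "PERCENT"]
LABELS = [f"{pos}-{cls}" for cls in NE_CLASSES for pos in "SBIE"] + ["O"]
assert len(LABELS) == 8 * 4 + 1  # 33 labels in total
```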

The model for label estimation is learned by a machine learning method. The following features are generally utilized: characters, type of

²Besides, there are the IOB1 and IOB2 algorithms, using only I, O, B, and the IOE1 and IOE2 algorithms, using only I, O, E (Kim and Veenstra, 1999).


[Figure 2 shows the analysis of the bunsetsu "Habu-Yoshiharu-Meijin": (a) the initial state, in which every chunk has an estimated label and score (e.g., "Habu": PERSON 0.111, "Habu-Yoshiharu": PERSON 0.438, "Habu-Yoshiharu-Meijin": ORGANIZATION 0.083, "Yoshiharu-Meijin": MONEY 0.075, "Meijin": OTHERe 0.245); (b) the final output, in which the assignment "Habu-Yoshiharu" (PERSON) + "Meijin" (OTHERe) is chosen bottom-up.]

Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin")

character, POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVM or CRF have been proposed.

Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano and Hirai (2004) utilized as features the word sub-class of the NE in the analysis direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu. Sasano and Kurohashi (2008) utilized a cache feature, coreference results, a syntactic feature, and a case-frame feature as structural features.

Some work acquired knowledge from a large unannotated corpus and applied it to NER. Kazama and Torisawa (2008) utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained from a huge number of dependency relation pairs. Fukushima et al. (2008) acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus.

In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.

3 Overview of Proposed Method

Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are analyzed using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the assignment "Habu-Yoshiharu" (PERSON) and the assignment "Habu" (PERSON) + "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.

The learned models are the following:

• a model that estimates the label of a chunk (label estimation model)

• a model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.


[Figure 3 shows the bunsetsu "Habu Yoshiharu Meijin(-ga)": "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and all the other chunks are labeled invalid.]

Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"

4 Model Learning

4.1 Label Estimation Model

This model estimates the label of each chunk. The unit of analysis is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions located over multiple bunsetsus tend to be an NE:

• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu³.

For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted; in the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.

Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned the correct label from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (as shown in Table 1) and five further classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

The chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled as OTHERb, OTHERi, or OTHERe, respectively⁴.

³As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

⁴Each OTHER* is assigned to the longest chunk that satisfies its condition.

1. number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type⁵

4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word sub-class, and several features provided by the morphological analyzer JUMAN

6. several features⁶ provided by the parser KNP

7. string of the morpheme in the chunk

8. IPADIC⁷ feature
   - If the string of the chunk is registered in the following categories of IPADIC: "person", "location", "organization" and "general", this feature fires

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is a political party)

10. cache feature
    - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used as the feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of the predicate/noun case frame (Sasano and Kurohashi, 2008)
    - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu".

14. parenthesis feature
    - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model

A chunk that is neither one of the eight NE classes nor one of the above four OTHER classes is labeled as invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned

⁵The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

⁶When a morpheme has an ambiguity, all the corresponding features fire.

⁷http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja


to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of the morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features of each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1+exp(-βx)), and the transformed values are normalized so that the values of the 13 labels of a chunk sum to one.
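A minimal sketch of this squashing and normalization step (β and the 13 raw SVM outputs of a chunk are inputs; the function name is ours):

```python
import numpy as np

def label_distribution(svm_outputs, beta=1.0):
    """Map the 13 raw SVM outputs of a chunk through the sigmoid
    1 / (1 + exp(-beta * x)) and normalize so they sum to one."""
    probs = 1.0 / (1.0 + np.exp(-beta * np.asarray(svm_outputs, dtype=float)))
    return probs / probs.sum()
```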

The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. In a chunk where the label invalid has the highest score, the label with the second highest score is adopted.

4.2 Label Comparison Model

This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up around "vs." (the left one and the right one are called the first set and the second set, respectively). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks in each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for model learning.

Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used

positive:
+1  "Habu-Yoshiharu" (PSN)  vs.  "Habu" + "Yoshiharu" (PSN + MNY)
+1  "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe)  vs.  "Habu" + "Yoshiharu" + "Meijin" (PSN + MONEY + OTHERe)

negative:
-1  "Habu-Yoshiharu-Meijin" (ORG)  vs.  "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe)

Figure 4: Assignment of positive/negative examples

for the labels estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remainder.
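A sketch of this feature layout, assuming each chunk already carries its 13-dimensional vector (12 label scores plus the SVM score); the helper name is ours:

```python
import numpy as np

def comparison_features(first_set, second_set, dim=13, max_chunks=5):
    """Feature vector for the label comparison model: each set contributes
    up to five 13-dim chunk vectors, left-aligned and zero-padded
    (cf. Figure 5).  Sets with more than five chunks are discarded in
    the paper, hence the assertion."""
    def pad(chunk_vectors):
        assert len(chunk_vectors) <= max_chunks
        vecs = [np.asarray(v, dtype=float) for v in chunk_vectors]
        vecs += [np.zeros(dim)] * (max_chunks - len(vecs))
        return np.concatenate(vecs)
    return np.concatenate([pad(first_set), pad(second_set)])
```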

Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus, for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin", we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.

5 Analysis

First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2(a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6(a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the


chunk    Habu-Yoshiharu       Habu    Yoshiharu
label    PERSON               PERSON  MONEY
vector   V11 0 0 0 0          V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu", and V11, V21, V22 and 0 are vectors whose dimension is 13)

[Figure 6 shows the label comparison model operating on the CKY-style table: (a) label assignment for the cell "Habu-Yoshiharu", comparing the single-chunk assignment "Habu-Yoshiharu" (B) with "Habu" + "Yoshiharu" (A+D); (b) label assignment for the cell "Habu-Yoshiharu-Meijin", comparing "Habu-Yoshiharu-Meijin" (C), "Habu-Yoshiharu" + "Meijin" (B+F), and "Habu" + "Yoshiharu" + "Meijin" (A+D+F).]

Figure 6: The label comparison model

label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated; in this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6(b), in determining the best label assignment for the upper-right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparison step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER* is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
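Under our reading, the bottom-up search can be sketched as follows; single_chunk and compare stand in for the label estimation and label comparison models, and an assignment is a list of (span, label) chunks:

```python
def best_assignment(n, single_chunk, compare):
    """Bottom-up, CKY-style search for the best label assignment of a
    bunsetsu of n morphemes (sketch).

    single_chunk(i, j): the best single-chunk labeling of morphemes
    i..j, as a one-element assignment [((i, j), label)].
    compare(a, b): the label comparison model; returns the preferred
    of two candidate assignments."""
    best = {}
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            choice = single_chunk(i, j)        # the whole span as one chunk
            for k in range(i + 1, j):          # or combine two smaller cells
                cand = best[(i, k)] + best[(k, j)]
                choice = compare(choice, cand) # pairwise model comparison
            best[(i, j)] = choice
    return best[(0, n)]
```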

6 Experiment

To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In this data set, 10,718 sentences in 1,174 news articles are annotated with the eight NE types. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were not used for either the model learning⁸ or the evaluation.

[Figure 7: 5-fold cross validation. The corpus (CRL NE data) is divided into five parts (a)-(e). The label estimation model 1 is learned from four parts; the label estimation models 2b-2e are each learned from three of those parts and applied to the remaining one to obtain features for learning the label comparison model.]

We performed 5-fold cross validation, following

previous work. Unlike previous work, our method has to learn the SVM models twice; therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)-(e). Then, the label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and applying it to part (c) yields further features. The same holds for parts

⁸ Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.


                Recall              Precision
ORGANIZATION    81.83 (3008/3676)   88.37 (3008/3404)
PERSON          90.05 (3458/3840)   93.87 (3458/3684)
LOCATION        91.38 (4992/5463)   92.44 (4992/5400)
ARTIFACT        46.72 ( 349/ 747)   74.89 ( 349/ 466)
DATE            93.27 (3327/3567)   93.12 (3327/3573)
TIME            88.25 ( 443/ 502)   90.59 ( 443/ 489)
MONEY           93.85 ( 366/ 390)   97.60 ( 366/ 375)
PERCENT         95.33 ( 469/ 492)   95.91 ( 469/ 489)
ALL-SLOT        87.87               91.79
F-measure       89.79

Table 3: Experimental results.

(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.
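A sketch of this nested split, with train_estimation, extract_features, and train_comparison as hypothetical stand-ins for the actual SVM training and feature extraction steps:

    # Nested 5-fold setup of Figure 7, for one test fold. The three
    # helper functions are assumptions, not code from the paper.
    def analyse_fold(folds, test_idx, train_estimation,
                     extract_features, train_comparison):
        train_folds = [f for i, f in enumerate(folds) if i != test_idx]
        # Label estimation model 1: trained on all four training folds.
        model1 = train_estimation([s for f in train_folds for s in f])
        # Label estimation models 2x: each trained on three folds and
        # applied to the held-out fourth to collect training features
        # for the label comparison model.
        comparison_features = []
        for x, held_out in enumerate(train_folds):
            inner = [f for i, f in enumerate(train_folds) if i != x]
            model2 = train_estimation([s for f in inner for s in f])
            comparison_features += extract_features(model2, held_out)
        comparison_model = train_comparison(comparison_features)
        return model1, comparison_model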

In this experiment, the Japanese morphological analyzer JUMAN⁹ and the Japanese parser KNP¹⁰ were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental result: the F-measure over all NE classes is 89.79.

7 Discussion

7.1 Comparison with Previous Work

Table 4 presents the comparison with previous work; our method outperformed the previous methods. Among the previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") using patterns over a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for model learning. Incorporating such knowledge and analysis results into our method could therefore improve its performance further.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about the head of the bunsetsu, "jyouyaku" (treaty).

⁹ http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
¹⁰ http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html

In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION relatively drops. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. could not utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", our method can utilize the information of "hou".

7.2 Error Analysis

There were some errors in analyzing katakana words. In the following example, although the correct analysis is that "Batistuta" should be labeled as PERSON, the system labeled it as OTHERs.

(4) Italy-de      katsuyaku-suru   Batistuta-wo     kuwaeta   Argentine
    Italy-LOC     active           Batistuta-ACC    call      Argentine
    'Argentine called Batistuta, who was active in Italy.'

There is no entry for "Batistuta" in the JUMAN dictionary or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were also some errors in applying the label comparison model even though the analysis of each chunk was correct. For example, in the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis is that "HongKong-seityou" is labeled as ORGANIZATION. As shown in Figure 8(b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8(a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we plan to adjust the value of β in the sigmoid function and to refine the features for the label comparison model.


                                 F1     analysis unit   distinctive features
(Fukushima et al., 2008)         89.29  character       Web
(Kazama and Torisawa, 2008)      88.93  character       Wikipedia/Web
(Sasano and Kurohashi, 2008)     89.40  morpheme        structural information
(Nakano and Hirai, 2004)         89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)   87.21  character
(Isozaki and Kazawa, 2003)       86.77  morpheme
proposed method                  89.79  compound noun   Wikipedia/structural information

Table 4: Comparison with previous work. (All work was evaluated on CRL NE data using cross validation.)

[Figure 8: An example of an error made by the label comparison model. (a) Initial state: "HongKong-seityou" is correctly labeled ORGANIZATION (0.266, vs. LOCATION 0.205), with HongKong/LOCATION and seityou/OTHERe (0.184). (b) Final output: the assignment "HongKong + seityou" (LOC+OTHERe, 0.266+0.184) is incorrectly chosen.]


8 Conclusion

This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then determines the best label assignment by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79.

We plan to integrate this method with the syntactic and case analysis method of Kawahara and Kurohashi (2007) and to perform syntactic, case, and named entity analysis simultaneously to improve overall accuracy.

References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese)

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970–979. (in Japanese)

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304–311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407–415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173–179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of COLING-ACL 2006, pages 1121–1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282–289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8–15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934–941. (in Japanese)

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607–612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171–178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.


Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

[Pages 63–70: Abbreviation Generation for Japanese Multi-Word Expressions, by Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui, and Kazuo Sumita. The body of this paper did not survive text extraction; only scattered Japanese fragments are recoverable, e.g. the abbreviation example ファミリーレストラン → ファミレス ("family restaurant"), component words such as 日本, 東京, 女子, 医科, 大学, and 研究所, list headings 欧文略語一覧 / カタカナ略語一覧 / 漢字略語一覧 / 大学の略称, and TV-program abbreviations such as 朝はビタミン → 朝ビタ and ミュージックステーション.]

Author Index

Bhutada Pravin 17

Cao Jie 47
Caseli Helena 1

Diab Mona 17

Finatto Maria Jose 1
Fujii Hiroko 63
Fukui Mika 63
Funayama Hirotaka 55

Hoang Hung Huu 31
Huang Yun 47

Kan Min-Yen 9, 31
Kim Su Nam 9, 31
Kuhn Jonas 23
Kurohashi Sadao 55

Liu Qun 47
Lu Yajuan 47

Machado Andre 1

Ren Zhixiang 47

Shibata Tomohide 55
Sinha R Mahesh K 40
Sumita Kazuo 63
Suzuki Masaru 63

Villavicencio Aline 1

Wakaki Hiromi 63

Zarrieß Sina 23


  • Workshop Program
  • Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
  • Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
  • Verb Noun Construction MWE Token Classification
  • Exploiting Translational Correspondences for Pattern-Independent MWE Identification
  • A re-examination of lexical association measures
  • Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
  • Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
  • Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
  • Abbreviation Generation for Japanese Multi-Word Expressions

Workshop Program

Friday August 6 2009

08:30–08:45 Welcome and Introduction to the Workshop

Session 1 (08:45–10:00): MWE Identification and Disambiguation

08:45–09:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria Jose Finatto

09:10–09:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

09:35–10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00–10:30 BREAK

Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55–11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45–13:50 LUNCH

Session 3 (13:50–15:30): Applications

13:50–14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang

14:15–14:40 Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00 BREAK

16:00–17:00 General Discussion

17:00–17:15 Closing Remarks


Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 1–8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, Andre Machado♣, Maria Jose Finatto♥

♦ Department of Computer Science, Federal University of Sao Carlos (Brazil)
♣ Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠ Department of Computer Sciences, Bath University (UK)
♥ Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br,
ammachado@inf.ufrgs.br, mfinatto@terra.com.br

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examine how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.

1 Introduction

A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21%

of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these figures are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).

Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.

In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including


parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.

The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.

2 Related Work

The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g. to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods

for the automatic identification and treatment ofthese phenomena is a real challenge for NLP sys-tems

A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g. (Zhang et al., 2006; Villavicencio et al., 2007)), given the heterogeneity of MWEs much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g. (Pearce, 2002; Baldwin, 2005) for English and (Piao et al., 2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g. (Villada Moirón and Tiedemann, 2006; Caseli et al., 2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).

Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ²) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ². In addition, Evert and Krenn (2005) found that for MWE identification the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain, and the size of the


corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).

Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).

Our approach, described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods; this paper focuses on the extraction of generic MWEs as a by-product of automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.

Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.

3 The Corpus and Reference Lists

The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total

of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.¹ The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms with 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).

4 Statistically-Driven and Alignment-Based Methods

4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ², and log-likelihood (Press et al., 1992) have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed

¹ Available at the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim


to differentiate MWEs from non-MWEs, but the same was not true of χ² (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).
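For reference, the standard definitions behind these two measures for a bigram (x, y) are (these are the textbook formulas, not reproduced from this paper):

    \mathrm{PMI}(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}

    \mathrm{MI}(X;Y) = \sum_{x}\sum_{y} P(x,y)\,\log_2 \frac{P(x,y)}{P(x)\,P(y)}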

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the measures was then calculated for these n-grams, and we ranked each n-gram according to each measure. The average of all the rankings is used as the combined measure of the MWE candidates.
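A minimal Python sketch of this ranking step, counting bigrams in a tokenised corpus and combining measures by average rank (an illustration of the procedure described above, not the NSP implementation):

    import math
    from collections import Counter

    def pmi_scores(tokens, min_freq=2):
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        total = sum(bigrams.values())
        scores = {}
        for (w1, w2), freq in bigrams.items():
            if freq < min_freq:          # frequency cut-off of 2
                continue
            p_xy = freq / total
            p_x = unigrams[w1] / len(tokens)
            p_y = unigrams[w2] / len(tokens)
            scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
        return scores

    def combined_ranking(*measure_scores):
        # Rank candidates under each measure, then order them by the
        # average of their ranks.
        avg_rank = Counter()
        for scores in measure_scores:
            ordered = sorted(scores, key=scores.get, reverse=True)
            for rank, cand in enumerate(ordered, start=1):
                avg_rank[cand] += rank / len(measure_scores)
        return sorted(avg_rank, key=avg_rank.get)  # best candidates first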

4.2 Alignment-Based method

The second MWE extraction approach investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts (a text written in one (source) language and its translation into another (target) language) is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, given a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that (a) S and T share some semantic features and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that a sequence S is an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno (which occurs 202 times in the corpus used in our experiments) is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.

Thus, notice that the alignment-based MWE extraction method does not rely on the conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.
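A sketch of this extraction criterion in Python, assuming the word alignment is available as 0-based (source, target) position pairs in the style a GIZA++ run produces; the input format and function names are assumptions:

    from collections import defaultdict

    def mwe_candidates(source_tokens, alignment_pairs):
        # alignment_pairs: iterable of (source_pos, target_pos) links.
        links = defaultdict(set)
        for s, t in alignment_pairs:
            links[s].add(t)
        candidates = []
        i = 0
        while i < len(source_tokens):
            j = i + 1
            # Grow the span while consecutive source words are linked
            # to exactly the same set of target words.
            while j < len(source_tokens) and links[j] and links[j] == links[i]:
                j += 1
            if j - i >= 2 and links[i]:   # n >= 2 source words, m >= 1
                candidates.append(" ".join(source_tokens[i:j]))
            i = j
        return candidates

On the example above, the two tokens of aleitamento materno linked to the single target word breastfeeding would come out as one 2 : 1 candidate.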

It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are in fact a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++² (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted by the two approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.

In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as the gold standard.

To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps we used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium³ (Armentano-Oller et al., 2006).

² GIZA++ is a well-known statistical word aligner, available at http://www.fjoch.com/GIZA++.html

³ Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org


From the preprocessed corpus, the MWE candidates are extracted as those sequences in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.

Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in Caseli et al. (2009), mainly: (a) patterns beginning with a determiner, auxiliary verb, pronoun, adverb, or conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of); and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as Caseli et al. (2009) plus: (a) patterns beginning or finishing with a determiner, adverb, conjunction, preposition, verb, pronoun, or numeral; and (b) a minimum frequency threshold of 2.
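The filtering itself reduces to a pattern check plus a threshold; below is a sketch in the spirit of F3, where the banned POS tag names are illustrative rather than the exact Apertium tagset used in the paper:

    # Sketch of pattern-plus-frequency filtering in the spirit of F3.
    # The tag names below are illustrative, not the exact tagset used.
    BANNED_EDGE_TAGS = {"det", "adv", "conj", "prep", "verb", "prn", "num"}

    def apply_filter(candidates, min_freq=2):
        # candidates: dict mapping a tuple of (word, pos) pairs to the
        # candidate's corpus frequency.
        kept = []
        for cand, freq in candidates.items():
            if freq < min_freq:
                continue
            if cand[0][1] in BANNED_EDGE_TAGS or cand[-1][1] in BANNED_EDGE_TAGS:
                continue   # pattern begins or ends with a banned POS
            kept.append(" ".join(word for word, _ in cand))
        return kept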

5 Experiments and Results

Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 × precision × recall)/(precision + recall)) figures for the association measures, using all the candidates (on the first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half).

PMI                             alignment-based
Online Mendelian Inheritance    faixa etaria
Beta Technology Incorporated    aleitamento materno
Lange Beta Technology           estados unidos
Oxido Nitrico Inalatorio        hipertensao arterial
jogar video game                leite materno
e um                            de couro cabeludo
e a do                          bloqueio lactiferos
se que de                       emocional anatomia
e a da                          neonato a termo
e de nao                        duplas maes bebes

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and the alignment-based approach.

pt MWE candidates       PMI      MI
# proposed bigrams      64,839   64,839
# correct MWEs          1,403    1,403
precision               2.16     2.16
recall                  98.73    98.73
F                       4.23     4.23
# proposed trigrams     54,548   54,548
# correct MWEs          701      701
precision               1.29     1.29
recall                  96.03    96.03
F                       2.55     2.55
# proposed bigrams      1,421    1,421
# correct MWEs          155      261
precision               10.91    18.37
recall                  10.91    18.37
F                       10.91    18.37
# proposed trigrams     730      730
# correct MWEs          44       20
precision               6.03     2.74
recall                  6.03     2.74
F                       6.03     2.74

Table 2: Evaluation of MWE candidates with PMI and MI (first half: all candidates; second half: top candidates).

From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are at most only 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.

On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered by the three filters described in Section 4.2: F1, F2 and F3.

To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary.


pt MWE candidates            F1      F2      F3
# filtered by POS patterns   24,996  21,544  32,644
# filtered by frequency      9,012   11,855  1,442
final set                    269     878     191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach.

pt MWE candidates        F1     F2     F3
# proposed bigrams       250    754    169
# correct MWEs           48     95     65
precision                19.20  12.60  38.46
recall                   3.38   6.69   4.57
F                        5.75   8.74   8.18
# proposed trigrams      19     110    20
# correct MWEs           1      9      4
precision                5.26   8.18   20.00
recall                   0.14   1.23   0.55
F                        0.27   2.14   1.07
# proposed bi+trigrams   269    864    189
# correct MWEs           49     104    69
precision                18.22  12.04  36.51
recall                   2.28   4.83   3.21
F                        4.05   6.90   5.90

Table 4: Evaluation of MWE candidates against the Pediatrics Glossary.

In this automatic comparison, we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only the bigrams and trigrams in the Pediatrics Glossary. Longer MWEs extracted by the alignment-based method, such as doenca arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are therefore not considered at the moment.

After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.

From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the

Glossary (see Table 4). The remaining 122 candidates (63.9%) were analysed by two native-speaker human judges, who classified each of the 122 candidates as true if it is a multiword expression or false otherwise, independently of it being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.

The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with K = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of K between 0.67 and 0.8 indicates good agreement.
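The kappa value itself is straightforward to compute in this two-judge, binary setting; a standard Cohen's kappa implementation (not code from the paper):

    def cohen_kappa(judge1, judge2):
        # judge1, judge2: equal-length lists of boolean true/false
        # decisions, one pair per candidate.
        n = len(judge1)
        observed = sum(a == b for a, b in zip(judge1, judge2)) / n
        p1 = sum(judge1) / n
        p2 = sum(judge2) / n
        # Chance agreement: both say true or both say false.
        expected = p1 * p2 + (1 - p1) * (1 - p2)
        return (observed - expected) / (1 - expected)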

In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating it with respect to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.

6 Conclusions and Future Work

MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.

In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained


show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.

Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both or at least one of them were, respectively, 65.97% and 75.92%. This is a significant improvement, but it is important to note that in this manual analysis the human experts classified the MWEs as true independently of them being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.

In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.

Acknowledgments

We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and Pediatrics Glossary available to us. We also thank FAPESP for financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.

References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50–59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047–2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistics. Computational Linguistics, 22(2):249–254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49–56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117–134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.

Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165–178, Oxford. Oxford University Press.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534–59.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33–40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440–447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1–7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17–24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1–15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034–1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415–432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia, July. Association for Computational Linguistics.


Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 9–16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles

Su Nam Kim
CSSE dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance as evaluation accounts for candidate coverage. Our work also attests that many existing features are usable in unsupervised extraction.

1 Introduction

Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also serve as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).

In the past, various attempts have been made to boost automatic keyphrase extraction performance based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Cornacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.

Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g. (Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007)), most of them do not effectively deal with the various keyphrase forms, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have undertaken a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.

Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. This leads to redundancy across studies and hinders direct comparison. In addition, existing features are, with few exceptions, specifically designed for supervised approaches. However, the supervised approach involves a large amount of manual labor, thus reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is worthwhile to assess the reliability and re-usability of features for the unsupervised approach, in order to set up tentative guidelines for applications.

This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method which improves the coverage of candidates that occur in various forms. Our second contribution re-examines existing keyphrase extraction features


reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.

In the remaining sections, we give an overview of related work in Section 2 and present our proposals on candidate selection and feature engineering in Sections 4 and 5, followed by our system architecture and data in Section 6. We then evaluate our proposals, discuss outcomes, and conclude our work in Sections 7, 8 and 9, respectively.

2 Related Work

The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Cornacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e. the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and, to a lesser extent, (2) keyphrase cohesion features (i.e. the relationship among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e. the relationship among components in a keyphrase) (Park et al., 2004).

The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TF·IDF (i.e. term frequency × inverse document frequency) and first occurrence in the document. TF·IDF measures document cohesion, and the first occurrence implies the importance of the abstract or introduction, indicating that keyphrases have locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequences. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Cornacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source.

The Textract system (Park et al., 2004) also ranks candidate keyphrases by its judgment of the keyphrases' degree of domain specificity, based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of the other documents in the same cluster.
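The two core KEA features mentioned above reduce to a few lines; a sketch follows (one simple TF·IDF variant, treating a candidate as a single token for brevity; the function name is ours, not KEA's API):

    import math

    def kea_features(candidate, doc_tokens, doc_freq, num_docs):
        # TF.IDF measures document cohesion; the relative position of
        # the first occurrence captures locality. doc_freq maps a
        # candidate to the number of training documents containing it.
        n = len(doc_tokens)
        tf = doc_tokens.count(candidate) / n
        idf = math.log((num_docs + 1) / (doc_freq.get(candidate, 0) + 1))
        first_occ = (doc_tokens.index(candidate) / n
                     if candidate in doc_tokens else 1.0)
        return tf * idf, first_occ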

3 Keyphrase Analysis

In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used the articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods; see Section 6 for details of the data.

Syntactically, keyphrases can be formed by either simplex nouns (e.g. algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g. mobile network, fast computing, partially observable Markov decision process), though such cases are infrequent. They can also incorporate a prepositional phrase (PP) (e.g. quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e. NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g. incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.

However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we also found that keyphrases occur as simple


Criteria     Rules
Frequency    (Rule 1) Frequency heuristic, i.e. frequency >= 2 for simplex words vs. frequency >= 1 for NPs
Length       (Rule 2) Length heuristic, i.e. up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
             (e.g. synchronous concurrent program vs. model of multiagent interaction)
Alternation  (Rule 3) of-PP form alternation
             (e.g. number of sensor = sensor number, history of past encounter = past encounter history)
             (Rule 4) Possessive alternation
             (e.g. agent's goal = goal of agent, security's value = value of security)
Extraction   (Rule 5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
             (e.g. complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
             (Rule 6) Simplex Word/NP + IN + Simplex Word/NP
             (e.g. quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
             simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules

conjunctions (e.g. search and rescue, propagation and delivery), and much more rarely as conjunctions of more complex NPs (e.g. history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g. pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g. belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when they occur in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g. distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g. agent's goal = goal of agent, security's value = value of security).

4 Candidate Selection

We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is closely related to term extraction, since the top-N ranked terms become the keyphrases of documents. In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In

this section, before presenting our method, we first describe our keyphrase analysis in detail.

In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e. either without a PP, with only a PP, or with a conjunction; few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to exclude NPs containing adverbs and verbs from our candidates in this study, since they tend to produce more errors and to require more complexity.

Another observation is that keyphrases containing more than three words are rare (i.e. 6% in our data set), validating what Paukkeri et al (2008) observed. Hence, we apply a length heuristic: our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g. performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to select NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.

Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule 6 for NPs in of-PP form broadens the coverage of possible candidates: given an NP in of-PP form, we not only collect the simplex word(s), but also extract the non-of-PP forms of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the rules of previous works do not.
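To make the extraction rules concrete, the following is a minimal sketch of Rules 5 and 6 from Table 1 as regular expressions over POS-tagged text. It is our illustration, not the authors' implementation: it assumes NLTK's Penn Treebank tagger, and the tag-encoding trick and helper names are ours.

```python
import re
import nltk  # assumes the punkt and POS tagger models are installed

# Encode each token as "TAG:word" so POS patterns become plain string regexes.
NOUN = r"(?:NN|NNS|NNP|NNPS)"
NOUN_OR_ADJ = r"(?:NN|NNS|NNP|NNPS|JJ|JJR|JJS)"
WORD = r":[^ ]+"
# Rule 5: zero or more adjectives/nouns followed by a head noun.
BASE_NP = rf"(?:{NOUN_OR_ADJ}{WORD} )*{NOUN}{WORD}"
# Rule 6: Simplex Word/NP + preposition (IN) + Simplex Word/NP.
PP_NP = rf"{BASE_NP} IN{WORD} {BASE_NP}"

def decode(match):
    """Strip the POS tags off an encoded match, keeping the words."""
    return " ".join(tok.split(":", 1)[1] for tok in match.split())

def extract_candidates(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    encoded = " ".join(f"{tag}:{word.lower()}" for word, tag in tagged)
    candidates = set()
    # Rule 6: keep the full of-PP candidate plus its component NPs,
    # mirroring the broadened coverage described above.
    for m in re.finditer(PP_NP, encoded):
        candidates.add(decode(m.group()))
        candidates.update(decode(np) for np in re.findall(BASE_NP, m.group()))
    # Rule 5: plain noun-phrase candidates.
    candidates.update(decode(m.group()) for m in re.finditer(BASE_NP, encoded))
    return candidates

# Tagger permitting, a sentence containing "effective algorithm of grid
# computing" yields that full phrase plus "effective algorithm" and
# "grid computing" as candidates.
```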

5 Feature Engineering

With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls on careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing features and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we also indicate which features are suitable only for the supervised setting (S) or applicable to both (S, U).

5.1 Document Cohesion

Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality; that is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.

F1: TFIDF (S, U). TFIDF indicates document cohesion by looking at the frequency of terms in the documents, and is broadly used in previous work (Frank et al 1999; Witten et al 1999; Nguyen and Kan 2007). However, a disadvantage of this feature is that it requires a large corpus to compute useful IDF values. As an alternative, context words (Matsuo and Ishizuka 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted; as such, our method measures TF over substrings as well as exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e. we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large, generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system (a sketch of two of these variants follows the list):

• (F1a) TFIDF
• (F1b) TF including counts of substrings
• (F1c) TF of substrings as a separate feature
• (F1d) TF normalized by candidate types (i.e. simplex words vs. NPs)
• (F1e) TF normalized by candidate types, as a separate feature
• (F1f) IDF using Google n-grams
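For concreteness, here is a minimal sketch of two of the variants above: substring-aware TF (F1b) and an IDF computed from external n-gram counts (F1f). The `ngram_count` table stands in for the Google n-gram data, and the log-scaled IDF formula and corpus-size constant are our assumptions, not the authors' exact formulation.

```python
import math

def tf_with_substrings(candidate, doc_candidates):
    # F1b: credit a candidate each time it appears inside another extracted
    # candidate (e.g. "grid computing" inside "grid computing algorithm"),
    # not just on exact matches.
    return sum(other.count(candidate) for other in doc_candidates)

def idf_from_ngrams(candidate, ngram_count, corpus_size=1e12):
    # F1f: corpus-independent IDF looked up in a generic n-gram table
    # (a stand-in for the Google n-gram counts); add-one smoothing keeps
    # unseen phrases finite.
    return math.log(corpus_size / (1 + ngram_count.get(candidate, 0)))

doc_candidates = ["grid computing", "grid computing algorithm",
                  "efficient grid computing algorithm"]
assert tf_with_substrings("grid computing", doc_candidates) == 3
```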

F2: First Occurrence (S, U). KEA used the first appearance of the word in the document (Frank et al 1999; Witten et al 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g. in the abstract and introduction sections) and newswire.

F3: Section Information (S, U). Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title and/or references.

F4: Additional Section Information (S, U). We first added related work or previous work as section information not included in Nguyen and Kan (2007). We also propose and test a number of variations: we used the substrings that occur in section headers and reference titles as keyphrases; we counted the co-occurrence of candidates (i.e. the section TF) across all key sections, which indicates the correlation among key sections; and we assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.

• (F4a) section 'related/previous work'
• (F4b) counting substrings occurring in key sections
• (F4c) section TF across all key sections
• (F4d) weighting key sections according to the portion of keyphrases found

F5: Last Occurrence (S, U). Similar to the distance feature in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.

5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.

F6: Co-occurrence of Another Candidate in Section (S, U). When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.

F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior over which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.

• (F7a) co-occurrence (Boolean) in title collection
• (F7b) co-occurrence (TF) in title collection

F8: Keyphrase Cohesion (S, U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top N ranked candidates and the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.

1 It contains 13M titles from articles, papers and reports.
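The following sketch illustrates the contextual-similarity approximation described for F8: candidate words as rows, neighboring words as columns, and cosine similarity between the row vectors. The window size and the omission of the noun/verb/adjective filter are simplifications of ours, not the authors' exact configuration.

```python
import math
from collections import Counter, defaultdict

def build_context_vectors(sentences, candidate_words, window=3):
    # Rows: candidate words; columns: co-occurrence counts of neighbors.
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in candidate_words:
                neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                vectors[tok].update(neighbors)
    return vectors

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```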

5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks 1989).

F9: Term Cohesion (S, U). Park et al (2004) used the Dice coefficient (Dice 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier (a sketch of the basic measure follows the list below):

• (F9a) term cohesion by Park et al (2004)
• (F9b) TF normalized by candidate types (i.e. simplex words vs. NPs)
• (F9c) applying different weights by candidate types
• (F9d) normalized TF and different weighting by candidate types
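For reference, a minimal sketch of the Dice-based cohesion (F9a), using the common generalization of the Dice coefficient to n-word terms; the flat treatment of simplex words is our simplified rendering of the 10% discount described above, and the F9b-d variants would plug in at the `tf` lookup.

```python
def dice_cohesion(term, tf):
    # Dice coefficient generalized to n-word terms:
    # n * tf(term) / sum of the component word frequencies.
    words = term.split()
    if len(words) == 1:
        # Simplex words have no internal association to measure; their
        # cohesion is simply discounted by 10% (our simplification).
        return 0.9
    return len(words) * tf(term) / sum(tf(w) for w in words)
```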

5.4 Other Features

F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set; hence, we used it only with N&K to test our candidate selection method.

F11: POS Sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. Their work showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like the acronym feature, this is also subject to the data set.


F12: Suffix Sequence (S). Similar to the acronym feature, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data-dependent and thus used in the supervised approach only.

F13: Length of Keyphrases (S, U). Barker and Cornacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.

6 System and Data

To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections such as title and section bodies via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al 2008) for reference collection, a part-of-speech tagger4, and a lemmatizer5 (Minnen et al 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank 2005), Maximum Entropy6, and simple weighting.

For evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found that many were missing from the documents or appeared only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline; on average, they took about 15 minutes to annotate each paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the number of author-assigned and reader-assigned keyphrases found in the documents.

2 http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3 http://wing.comp.nus.edu.sg/parsCit/
4 http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5 http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6 http://maxent.sourceforge.net/index.html
7 C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence - Multiagent Systems) and J.4 (Social and Behavioral Sciences - Economics)

          Author         Reader           Combined
Total     1252 (53)      3110 (111)       3816 (146)
NPs       904            2537             3027
Average   3.85 (4.01)    12.44 (12.88)    15.26 (15.85)
Found     769            2509             2864

Table 2: Statistics of Keyphrases

7 Evaluation

The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information and additional section information (i.e. F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence and suffix sequence.

Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e. the exact matching scheme) over the top 5, 10 and 15 candidates.

BestFeatures includes F1c (TF of substring as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion by Park et al (2004)) and F13 (length of keyphrases). Best w/o TFIDF means using all the best features except TFIDF.

In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).8

In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and the remaining features showed no effect, or an effect unconfirmed due to only small changes in performance. Again, Supervised denotes Maximum Entropy training and Unsupervised our unsupervised approach.

8 Discussion

We compared the performance of our candidate selection and feature engineering against simple KEA, N&K and our baseline system. In evaluating candidate selection, we found that longer

8 Due to the page limits, we present the best performance.


Method          Features  C   Five (Match/Prec/Rec/F)   Ten (Match/Prec/Rec/F)    Fifteen (Match/Prec/Rec/F)
All             KEA       U   0.03/0.64/0.21/0.32       0.09/0.92/0.60/0.73       0.13/0.88/0.86/0.87
Candidates      KEA       S   0.79/15.84/5.19/7.82      1.39/13.88/9.09/10.99     1.84/12.24/12.03/12.13
                N&K       S   1.32/26.48/8.67/13.06     2.04/20.36/13.34/16.12    2.54/16.93/16.64/16.78
                baseline  U   0.92/18.32/6.00/9.04      1.57/15.68/10.27/12.41    2.20/14.64/14.39/14.51
                baseline  S   1.15/23.04/7.55/11.37     1.90/18.96/12.42/15.01    2.44/16.24/15.96/16.10
Length<=3       KEA       U   0.03/0.64/0.21/0.32       0.09/0.92/0.60/0.73       0.13/0.88/0.86/0.87
Candidates      KEA       S   0.81/16.16/5.29/7.97      1.40/14.00/9.17/11.08     1.84/12.24/12.03/12.13
                N&K       S   1.40/27.92/9.15/13.78     2.10/21.04/13.78/16.65    2.62/17.49/17.19/17.34
                baseline  U   0.92/18.40/6.03/9.08      1.58/15.76/10.32/12.47    2.20/14.64/14.39/14.51
                baseline  S   1.18/23.68/7.76/11.69     1.90/19.00/12.45/15.04    2.40/16.00/15.72/15.86
Length<=3       KEA       U   0.01/0.24/0.08/0.12       0.05/0.52/0.34/0.41       0.07/0.48/0.47/0.47
+ Alternation   KEA       S   0.83/16.64/5.45/8.21      1.42/14.24/9.33/11.27     1.87/12.45/12.24/12.34
                N&K       S   1.53/30.64/10.04/15.12    2.31/23.08/15.12/18.27    2.88/19.20/18.87/19.03
                baseline  U   0.98/19.68/6.45/9.72      1.72/17.24/11.29/13.64    2.37/15.79/15.51/15.65
                baseline  S   1.33/26.56/8.70/13.11     2.09/20.88/13.68/16.53    2.69/17.92/17.61/17.76

Table 3: Performance of the Proposed Candidate Selection (Match / Precision / Recall / F-score over the top 5, 10 and 15 candidates)

Features        C   Five (Match/Prec/Rec/F)   Ten (Match/Prec/Rec/F)    Fifteen (Match/Prec/Rec/F)
Best            U   1.14/22.8/7.47/11.3       1.92/19.2/12.6/15.2       2.61/17.4/17.1/17.3
Best            S   1.56/31.2/10.2/15.4       2.50/25.0/16.4/19.8       3.15/21.0/20.6/20.8
Best w/o TFIDF  U   1.14/22.8/7.4/11.3        1.92/19.2/12.6/15.2       2.61/17.4/17.1/17.3
Best w/o TFIDF  S   1.56/31.1/10.2/15.4       2.46/24.6/16.1/19.4       3.12/20.8/20.4/20.6

Table 4: Performance of Feature Engineering

Effect       Method  Features
+            S       F1a, F2, F3, F4a, F4d, F9a
             U       F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-            S       F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
             U       F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
no effect    S       F1e, F10, F11, F12
             U       F1b

Table 5: Performance on Each Feature

length candidates tend to act as noise, and thus decreased the overall performance. We also confirmed that candidate alternation accommodates the flexibility of keyphrases, leading to higher candidate coverage as well as better performance.

To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not affect performance; we attribute this to the fact that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections, proposed in our work, was found to be empirically useful. Term cohesion (Park et al 2004) is a useful feature, although it includes a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion is applied to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the findings on each feature.

As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a match score of 3.15, the unsupervised classifier obtains 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.


9 Conclusion

We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.

10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.

References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10-17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information and lexicography. In Proceedings of ACL, 1989, pp. 76-83.

Isaac Councill, C. Lee Giles and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28-30.

Ernesto D'Avanzo and Bernado Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433-447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81-104.

Khaled Hammouda, Diego Matute and Mohamed Kamel. CorePhrase: Keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beáta Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537-544.

Mario Jarmasz and Caroline Barrière. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349-357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157-169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296-297.

Guido Minnen, John Carroll and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207-223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317-326.

Youngja Park, Roy Byrd and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434-439.

Xiaojun Wan and Jianguo Xiao. CollabRank: Towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254-256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.



Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu

Abstract

We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem, experimenting with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.

1 Introduction

In the literature, a multiword expression (MWE) generally refers to a multiword unit or a collocation of words that co-occur together statistically more often than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al 2002) refers to this problem as a key issue for the development of high-quality NLP applications.

For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE's meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as being a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.

To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed 1997; Lin 1999; Baldwin et al 2003; Villada Moirón and Tiedemann 2006; Fazly and Stevenson 2007; Van de Cruys and Villada Moirón 2007; McCarthy et al 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by Cook et al (2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning, and over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as a wrong decision could potentially have detrimental effects on the quality of the translation.


In this paper, we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar 2006; Diab and Krishna 2009b; Diab and Krishna 2009a; Sporleder and Li 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.

This paper is organized as follows: in Section 2, we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5, we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.

2 Multi-word Expressions

MWEs are typically not productive, though they allow for inflectional variation (Sag et al 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: this category includes expressions that are semantically non-compositional, fixed expressions such as kingdom come, ad hoc, and non-fixed expressions such as break new ground, speak of the devil. The VNCs on which we focus in this paper fall into this category. Semi-idiomatic: this class includes expressions that seem semantically non-compositional, yet whose semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living and Verb Particle Constructions (VPC) such as write-up, call-up. Non-Idiomatic: this category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic; for instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.

3 Related Work

Several researchers have addressed the problem of MWE classification (Baldwin et al 2003; Katz and Giesbrecht 2006; Schone and Jurafsky 2001; Hashimoto et al 2006; Hashimoto and Kawahara 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson 2007; Cook et al 2007). We are aware of two supervised approaches to the problem: work by Katz and Giesbrecht (2006) and work by Hashimoto and Kawahara (2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and those of its constituent words, using LSA to determine whether the expression is idiomatic or not. KG06 is similar in intuition to the work proposed by Fazly and Stevenson (2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.

Hashimoto and Kawahara (2008) (HK08) is, to our knowledge, the first large-scale study to address token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data, identifying 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprise basic syntactic features such as POS and lemma information, token n-gram features, as well as hypernymy information on words and domain information. The idiom features are mostly inflectional, such as voice, negativity and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.

4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context, focusing specifically on VNC MWE expressions. We use the data annotated by Cook et al (2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework to perform the identification of MWE VNC tokens and classify them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.1 YamCha is based on Support Vector Machine technology using degree-2 polynomial kernels.

We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, the sentence John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones, as sketched below.
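A minimal sketch of this encoding; the function and its span-based input format are our illustration of the scheme, not the authors' code.

```python
def iob_encode(tokens, vnc_spans):
    """Tag tokens with the five-label scheme: B-I/I-I for idiomatic chunks,
    B-L/I-L for literal chunks, O elsewhere. 'vnc_spans' maps (start, end)
    token offsets to 'I' (idiomatic) or 'L' (literal)."""
    tags = ["O"] * len(tokens)
    for (start, end), label in vnc_spans.items():
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(tokens, tags))

tokens = "John kicked the bucket last Friday".split()
print(iob_encode(tokens, {(1, 4): "I"}))
# [('John', 'O'), ('kicked', 'B-I'), ('the', 'I-I'),
#  ('bucket', 'I-I'), ('last', 'O'), ('Friday', 'O')]
```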

We experiment with different context window sizes, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma form (LEMMA), i.e. the citation form of the word, and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna 2009b; Diab and Krishna 2009a). In general, all the linguistic features are represented as separate feature sets, explicitly modeled in the input data. Hence, if we model the POS tag feature for our running example, the training data is annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature is represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.

With the NE feature, we followed the same representation as the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN IdentiFinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other with a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition in which we changed the words' representation in the input to their NE class, Named Entity InText (NEI). For example, in the NEI condition our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER".

1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder

5 Experiments and Results

5.1 Data

We use the manually annotated standard data set identified in Cook et al (2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC)3, which contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4, corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments; the results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.

5.2 Evaluation Metrics

We use Fβ=1 (F-measure), the harmonic mean of (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
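For reference, the balanced F-measure combines precision and recall as

\[ F_{\beta=1} = \frac{2 \cdot P \cdot R}{P + R} \]

so, for instance, the idiomatic F-score of the best condition in Table 2 follows from its precision and recall: 2 · 91.35 · 88.42 / (91.35 + 88.42) ≈ 89.86.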

5.3 Results

We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline using no features, just the tokenized words (TOK). The FREQ baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baselines have the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.6

3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.

In Table 2, we present the results yielded per feature and per condition. We first experimented with different context sizes to decide on the optimal window size for our learning framework; the results are presented in Table 1. Once that was determined, we proceeded to add features.

Noting that a window size of ±3 yields the best results, we use it as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the FREQ baseline. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic FREQ baseline, with an overall F-measure of 77.04%.

Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; combining the three features, however, yields a significant boost in overall performance of 2.67 absolute F points over the TOK baseline.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.

6 Discussion

The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features yields the best results. In general, performance on the classification and identification of idiomatic expressions is much better, which may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

6 We could have easily identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set exclusively.

We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors result from identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors confusing idiomatic cases with literal ones 23 times, and the opposite 4 times.

Some of the errors where the VNC should have been classified as literal, but the system classified it as idiomatic, involve kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal involve the MWE types hit the road, blow trumpet, blow whistle, hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand, Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

Also, in As the ball hit the post, the referee blew the whistle, the system identified blew the whistle as a literal VNC in this context, and it identified hit the post as another literal VNC token.

7 Conclusion

In this study, we explored a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system; using NER as a means of dimensionality reduction yields the best results. We achieve a state-of-the-art performance of an overall F-measure of 84.58%. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.


Window   IDM-F   LIT-F   Overall F   Overall Acc
±1       77.93   48.57   71.78       96.22
±2       85.38   55.61   79.71       97.06
±3       86.99   55.68   81.25       96.93
±4       86.22   55.81   80.75       97.06
±5       83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying the context window size

                  IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ              70.02   89.16   78.44   0       0       0       69.68
TOK               81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA           83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM           83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS             83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P             86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full          85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin           84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI               89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P    89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P     90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P         91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations

8 Acknowledgement

The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, who helped make this publication more concise and better presented.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89-96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329-336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41-48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96-103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992-1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353-360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317-324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369-379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97-108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33-40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1-15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100-108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754-762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, Czech Republic, June. Association for Computational Linguistics.



Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de

Abstract

Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.

1 Introduction

Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for the crosslingual induction of morphological, syntactic and semantic analyses (Yarowsky et al 2001; Dyvik 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:

(1) Der Rat sollte unsere Position berücksichtigen
    the Council should our position consider

(2) The Council should take account of our position

This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that they are able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina 2008). This approach generally involves two main steps: 1) the extraction of a list of candidate MWEs, often constrained by a particular target pattern of the detection method, such as verb particle constructions (Baldwin and Villavicencio 2002) or verb PP combinations (Villada Moirón and Tiedemann 2006); 2) the ranking of this candidate list by an appropriate association measure.
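As an illustration of this two-step scheme under assumptions of ours, the sketch below ranks two-word candidates (as word pairs) with pointwise mutual information, one simple instance of the association measures surveyed in this line of work, not Pecina's (2008) full inventory:

```python
import math
from collections import Counter

def rank_by_pmi(corpus_tokens, candidates):
    # PMI(w1, w2) = log [ p(w1, w2) / (p(w1) p(w2)) ], estimated from
    # unigram and adjacent-bigram counts over the corpus.
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n = len(corpus_tokens)

    def pmi(pair):
        w1, w2 = pair
        joint = bigrams[(w1, w2)] / max(n - 1, 1)
        indep = (unigrams[w1] / n) * (unigrams[w2] / n)
        return math.log(joint / indep) if joint > 0 and indep > 0 else float("-inf")

    return sorted(candidates, key=pmi, reverse=True)
```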

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider as MWEs those tokens in a target language which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for each item in a target language (and whether these one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches that need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realizations is already very useful.

In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on the frequency distributions in a given corpus, and is based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).
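To make the intuition concrete, here is a minimal sketch of such a syntax-filtered extraction, under simplifying assumptions of ours: the target group aligned to a single source item is kept only if it is dependency-connected. The paper's actual filters are more articulated than this single head-relation test.

```python
def one_to_many_translation(target_tokens, alignments, source_index, head_of):
    """'alignments' is a set of (source_i, target_j) index pairs;
    'head_of' maps a target index to its dependency head index."""
    aligned = sorted(j for i, j in alignments if i == source_index)
    if len(aligned) < 2:
        return None  # one-to-one: not a one-to-many correspondence
    group = set(aligned)
    # Syntactic filter: apart from the topmost token, every aligned target
    # token must attach to another token inside the group.
    inside = sum(1 for j in aligned if head_of.get(j) in group)
    if inside >= len(aligned) - 1:
        return [target_tokens[j] for j in aligned]
    return None

en = "The Council should take account of our position".split()
# berücksichtigen (source index 5) aligned to take/account/of (3, 4, 5);
# 'account' and 'of' both attach to 'take' in the dependency parse.
print(one_to_many_translation(en, {(5, 3), (5, 4), (5, 5)}, 5, {4: 3, 5: 3}))
# -> ['take', 'account', 'of']
```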

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Accordingly, Section 3 evaluates automatic word alignments against our gold standard and presents a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.

2 Multiword Translations as MWEs

The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover linguistic patterns exhibited by one-to-many translations.

Verb                 1-1    1-n    n-1    n-n    No
anheben (v1)         53.5   21.2   9.2    1.6    32.5
bezwecken (v2)       16.7   51.3   0.6    31.3   15.0
riskieren (v3)       46.7   35.7   0.5    1.7    18.2
verschlimmern (v4)   30.2   21.5   28.6   44.5   27.5

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl Corpus. These verbs subcategorize for a nominative subject and an accusative object, and belong to the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.

A closer look at the one-to-many translations reveals that these cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations: This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.

(3) Sie verschlimmern die Übel
    They aggravate the misfortunes
(4) Their action is aggravating the misfortunes

Verb particle combinations: A typical MWE pattern, treated for instance in (Baldwin and Villavicencio 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der Ausschuss bezweckt, den Institutionen ein politisches Instrument an die Hand zu geben
    The committee intends the institutions a political instrument at the hand to give
(6) The committee set out to equip the institutions with a political instrument

Verb preposition combinations: While this class isn't discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items; Sag et al. (2002) propose an analysis within an MWE framework.

(7) Sie werden den Treibhauseffekt verschlimmern
    They will the greenhouse effect aggravate
(8) They will add to the greenhouse effect

Light verb constructions (LVCs): This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al. 2002).

(9) Ich werde die Sache nur noch verschlimmern
    I will the thing only just aggravate
(10) I am just making things more difficult

Idioms: This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to its limited size and the source items we chose.

(11) Sie bezwecken die Umgestaltung in eine zivile Nation
     They intend the conversion into a civil nation
(12) They have in mind the conversion into a civil nation

          v1        v2        v3        v4
Ntype     22 (26)   41 (47)   26 (35)   17 (24)
V Part    22.7      4.9       0.0       0.0
V Prep    36.4      41.5      3.9       5.9
LVC       18.2      29.3      88.5      88.2
Idiom     0.0       2.4       0.0       0.0
Para      36.4      24.3      11.5      23.5

Table 2: Proportions of MWE types per lemma.

Paraphrases: From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus 2006). These shifts are hard to identify automatically and present a general challenge for the semantic processing of parallel corpora. An example is given below.

(13) Wir brauchen bessere Zusammenarbeit, um die Rückzahlungen anzuheben
     We need better cooperation to the repayments-OBJ increase
(14) We need greater cooperation in this respect to ensure that repayments increase

Table 2 displays the proportions of the MWE categories for the number of types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variations only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) or bezwecken (en. intend) are frequently translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).

In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.

3 Multiword Translation Detection

This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.

3.1 One-to-many Alignments

To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney 2003). We evaluate, on the type level, the rate of translational correspondences that the system discovers against the one-to-many translations in our gold standard. By type, we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the translation types is so low that already at a threshold of 3, almost all types get filtered.

verb   n > 0         n > 1         n > 3
       FPR    FNR    FPR    FNR    FPR    FNR
v1     0.97   0.93   1.0    1.0    1.0    1.0
v2     0.93   0.9    0.5    0.96   0.0    0.98
v3     0.88   0.83   0.8    0.97   0.67   0.97
v4     0.98   0.92   0.8    0.92   0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.

This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that correspond to the source token is rare.
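For concreteness, the type-level comparison can be sketched as follows. This is our reading of the evaluation, not code from the paper, and all names are illustrative: a "type" is modeled as the set of lemmatized English tokens aligned to the German source lemma.

```python
# Sketch of the type-level alignment evaluation: FPR is the share of
# system types absent from the gold standard; FNR is the share of gold
# types the system misses. Types are frozensets of lemmas.

def type_error_rates(system_types, gold_types):
    sys_set, gold_set = set(system_types), set(gold_types)
    fpr = len(sys_set - gold_set) / len(sys_set) if sys_set else 0.0
    fnr = len(gold_set - sys_set) / len(gold_set) if gold_set else 0.0
    return fpr, fnr

fpr, fnr = type_error_rates(
    [frozenset({"take", "account", "of"})],
    [frozenset({"take", "account", "of"}), frozenset({"consider"})],
)
print(fpr, fnr)  # 0.0 0.5
```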

This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die Korruption behindert die Entwicklung
     The corruption impedes the development
(16) Corruption constitutes an obstacle to development

Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.

3.2 Aligning Syntactic Configurations

High-precision detection of one-to-many translations thus involves two major problems: 1) How to identify sentences or configurations where reliable lexical correspondences can be found? 2) How to align target items that have a low occurrence correlation?

We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verbs' arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering of the configurations; c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.

[Figure 1: Example of a typical syntactic MWE configuration: the German verb behindert with its arguments X1 and Y1, aligned to an English configuration rooted in create, whose arguments X2 and Y2 are connected via an obstacle to.]

Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of MaltParser (Nivre et al. 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (D_G, D_E, A_GE): D_G is the set of dependency triples (s1, rel, s2), such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language; D_E is the set of dependency triples of the target sentence; A_GE corresponds to the set of pairs (s1, t1), such that s1 and t1 are aligned.

Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.

Filtering: Given our source candidates, a valid parallel configuration (D_G, D_E, A_GE) is then defined by the following conditions:
1. The source configuration D_G is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn ∈ D_G, there is a tuple (sn, tn) ∈ A_GE, i.e. every dependent has an alignment.
3. There is a target item t1 ∈ D_E such that for each tn there is a p ⊂ D_E, where p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus, the target dependents have a common root.

To filter out noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment to an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
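As a minimal sketch, the filtering conditions above can be operationalized roughly as follows, assuming toy data structures for dependency triples and alignment pairs. The helper names and the path-length cap are illustrative, and the head-switching and coordination filters are omitted here.

```python
MAX_PATH_LEN = 3  # assumed cap on the target path length (noise filter)

def dependents(s1, deps_src, relations=("SB", "OA")):
    # Condition 1: the source verb's dependents of the required types.
    return [d for (h, rel, d) in deps_src if h == s1 and rel in relations]

def alignments_of(s, alignment):
    # Condition 2: every source dependent needs an alignment link.
    return [t for (src, t) in alignment if src == s]

def path_exists(root, node, deps_tgt, max_len=MAX_PATH_LEN):
    # Condition 3: breadth-first search for a path root -> ... -> node.
    frontier, seen = [root], {root}
    for _ in range(max_len):
        nxt = []
        for h in frontier:
            for (h2, rel, d) in deps_tgt:
                if h2 == h and d not in seen:
                    if d == node:
                        return True
                    seen.add(d)
                    nxt.append(d)
        frontier = nxt
    return False

def valid_configuration(s1, deps_src, deps_tgt, alignment):
    """Return a target root dominating the alignments of all source
    dependents, or None if this sentence pair is filtered out."""
    tgts = []
    for s in dependents(s1, deps_src):
        links = alignments_of(s, alignment)
        if not links:
            return None
        tgts.append(links[0])
    if not tgts:
        return None
    for t1 in {h for (h, rel, d) in deps_tgt}:
        if all(t == t1 or path_exists(t1, t, deps_tgt) for t in tgts):
            return t1
    return None
```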

Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in Section 3.2 was that all the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.

Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) ∈ D_E such that tx is an element of our target configuration, we need to decide whether (s1, ti) ∈ A_GE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown 1994). We therefore propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti ∈ T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.

2. For each ti, tx such that there is a (ti, rel, tx) ∈ D_E, compute corr(s1, T + tx).

3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.

As the Dice coefficient is often reported to give the best results, e.g. in (Smadja and McKeown 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:

corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti ∈ T, and freq(s1) accordingly. The observation frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
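A compact sketch of this post-editing step, with the frequencies computed from a list of lemmatized sentence pairs, might look as follows. All names are illustrative and the corpus scan is deliberately naive.

```python
def freq_T(pairs, T):
    # freq(T): sentence pairs whose target side contains all items of T.
    return sum(1 for (_, tgt) in pairs if T <= set(tgt))

def freq_s(pairs, s1):
    # freq(s1): sentence pairs whose source side contains s1.
    return sum(1 for (src, _) in pairs if s1 in set(src))

def corr(pairs, s1, T):
    # Dice: 2 * freq(s1 & T) / (freq(s1) + freq(T)).
    joint = sum(1 for (src, tgt) in pairs if s1 in set(src) and T <= set(tgt))
    denom = freq_s(pairs, s1) + freq_T(pairs, T)
    return 2.0 * joint / denom if denom else 0.0

def post_edit(pairs, s1, T, deps_tgt):
    # Add a dependent tx of some ti already in T whenever this does
    # not decrease the correlation with the source item s1.
    T = set(T)
    for (ti, rel, tx) in deps_tgt:
        if ti in T and tx not in T:
            if corr(pairs, s1, T | {tx}) >= corr(pairs, s1, T):
                T.add(tx)
    return T
```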

The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation of the English MWE run the risk of.

3.3 Evaluation

Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard, to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma don't figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that doesn't constrain the syntactic type of the parallel argument relations. Compared to the relaxed filter, the strict filter decreases the FPR significantly while leaving the FNR largely unchanged. This means that the syntactic filters mainly fire on noisy configurations and don't decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down the number of candidate configurations that our algorithm detects. As the dependency parses don't provide deep analyses for tense or control phenomena, very often a verb's arguments don't figure as its syntactic dependents and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.

MWE evaluation: In a second experiment, we evaluated the patterns of correspondence found by our extraction method for use in an MWE context. We selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondence. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.

4 Related Work

The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily resort to linguistically motivated phrase correspondences (Koehn et al. 2003). Syntax-based translation, which specifies formal relations between bilingual parses, was established by Wu (1997).

     Strict Filter    Relaxed Filter
     FPR    FNR       FPR    FNR
v1   0.0    0.96      0.5    0.96
v2   0.25   0.88      0.47   0.79
v3   0.25   0.74      0.56   0.63
v4   0.0    0.875     0.56   0.84

Table 4: False positive and false negative rate of one-to-many translations.

Trans. type    Proportion    MWE type   Proportion
MWEs           57.5          V Part     8.2
                             V Prep     51.8
                             LVC        32.4
                             Idiom      10.6
Paraphrases    24.4
Alternations   1.0
Noise          17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.

Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown 1994; Anastasiou 2008). In (Villada Moirón and Tiedemann 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches don't rely on lexical semantic knowledge about MWEs in the form of one-to-many translations.

By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Padó 2009).

The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already been exploited in, e.g., morphology projection (Yarowsky et al. 2001) or word sense disambiguation (Dyvik 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of simplex realization in a source language onto an idiosyncratic multiword realization in the target language.

5 Conclusion

We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for the monolingual interpretation of MWEs.

Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment, and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.

References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 13:311–326.

Katrin Erk and Sebastian Padó. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38, 2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

A Re-examination of Lexical Association Measures

Abstract

We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which shows significant improvement over the original AMs.

1 Introduction

Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures for extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.

While previous works have evaluated AMs, there have been few details on why the AMs perform as they do. A detailed analysis of this is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.

Hung Huu Hoang, Dept. of Computer Science, National University of Singapore, hoanghuu@comp.nus.edu.sg
Su Nam Kim, Dept. of Computer Science and Software Engineering, University of Melbourne, snkim@csse.unimelb.edu.au
Min-Yen Kan, Dept. of Computer Science, National University of Singapore, kanmy@comp.nus.edu.sg

We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g. put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the mobility of noun phrase complements in transitive VPCs: they can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.

2 Classification of Association Measures

Although different AMs take different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e. the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure the association between constituents, using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score and Pearson's χ² test.

Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then follows that the contexts in which the phrase and its constituents appear would be different (Zhai 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different; for instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.

Table 1 lists all AMs used in our discussion. The legend below the table defines the variables a, b, c and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g. [M2] Mutual Information) for the reader's convenience.
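To make this notation concrete, here is a small illustrative sketch (not from the paper) computing a few of the Table 1 measures directly from a bigram's contingency counts a, b, c, d; the counts in the final line are made up.

```python
from math import log, sqrt

def pmi(a, b, c, d):                 # [M4] Pointwise MI
    N = a + b + c + d
    return log(N * a / ((a + b) * (a + c)))

def joint_probability(a, b, c, d):   # [M1]
    return a / (a + b + c + d)

def driver_kroeber(a, b, c, d):      # [M9], aka Ochiai
    return a / sqrt((a + b) * (a + c))

def odds_ratio(a, b, c, d):          # [M18]
    return (a * d) / (b * c)

# Hypothetical counts: a = 30 co-occurrences out of N = 1000 bigrams.
print(pmi(30, 5, 10, 955), driver_kroeber(30, 5, 10, 955))
```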

3 Evaluation

We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.

3.1 Evaluation Data

In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus, as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.

For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.


Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located, based on the following set of seven frequently used English light verbs: do, get, give, have, make, put and take. Next, we search to the right of the light verbs for the nearest noun,

M1 Joint Probability: f(xy) / N
M2 Mutual Information (MI): (1/N) Σ_{i,j} f_ij · log(f_ij / f̂_ij)
M3 Log likelihood ratio: 2 Σ_{i,j} f_ij · log(f_ij / f̂_ij)
M4 Pointwise MI (PMI): log [ P(xy) / (P(x*) · P(*y)) ]
M5 Local-PMI: f(xy) × PMI
M6 PMI^k: log [ N · f(xy)^k / (f(x*) · f(*y)) ]
M7 PMI²: log [ N · f(xy)² / (f(x*) · f(*y)) ]
M8 Mutual Dependency: log [ P(xy)² / (P(x*) · P(*y)) ]
M9 Driver-Kroeber: a / √((a + b)(a + c))
M10 Normalized expectation: 2a / (2a + b + c)
M11 Jaccard: a / (a + b + c)
M12 First Kulczynski: a / (b + c)
M13 Second Sokal-Sneath: a / (a + 2(b + c))
M14 Third Sokal-Sneath: (a + d) / (b + c)
M15 Sokal-Michiner: (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
M17 Hamann: ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds ratio: ad / bc
M19 Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q: (ad − bc) / (ad + bc)
M21 Brawn-Blanquet: a / max(a + b, a + c)
M22 Simpson: a / min(a + b, a + c)
M23 S cost: log(1 + min(b, c) / (a + 1))^(−1/2)
M24 Adjusted S cost (*): log(1 + max(b, c) / (a + 1))^(−1/2)
M25 Laplace: (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace (*): (a + 1) / (a + max(b, c) + 2)
M27 Fager: [M9] − (1/2) · max(b, c)
M28 Adjusted Fager (*): [M9] − max(b, c) / √(aN)
M29 Normalized PMIs (*): PMI / NF(α), PMI / NFmax
M30 Simplified normalized PMI for VPCs (*): log(ad) / (α×b + (1−α)×c)
M31 Normalized MIs (*): MI / NF(α), MI / NFmax

where NF(α) = α·P(x*) + (1−α)·P(*y) with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)).

Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(x¬y), c = f21 = f(¬xy), d = f22 = f(¬x¬y); the marginals are f(x*) = a + b and f(*y) = a + c. ¬w stands for all words except w, * stands for all words, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) · f(*y) / N.

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.


permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g. the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.

Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in (Baldwin 2005), and 464 annotated LVC candidates used in (Tan et al. 2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data       LVC data    Mixed
Total (freq ≥ 6)     413            100         513
Positive instances   117 (28.33%)   28 (28%)    145 (28.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.

3.2 Evaluation Metric

To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of the precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
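As a quick sketch, AP in this sense can be computed directly from a ranked candidate list; the candidate strings below are made up.

```python
def average_precision(ranked, gold):
    # Mean of the precision values at each rank where a true MWE occurs.
    hits, precisions = 0, []
    for rank, cand in enumerate(ranked, start=1):
        if cand in gold:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# A ranking that places both true MWEs first scores 1.0:
print(average_precision(["give up", "carry out", "red car"],
                        {"give up", "carry out"}))  # 1.0
```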

3.3 Evaluation Results

Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006), when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that is relied on in hypothesis tests.

4 Rank Equivalence

We note that some AMs, although not mathematically equivalent (i.e. assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r_C M2, if and only if M1(cj) > M1(ck) ⟺ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⟺ M2(cj) = M2(ck), for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:


Corollary: If M1 ≡r_C M2, then AP_C(M1) = AP_C(M2), where AP_C(Mi) stands for the average precision of the AM Mi over the data set C.

Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the following 3 AMs: Odds ratio, Yule's ω and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with their (simplest) representatives from each rank equivalent group.

¹ Two AMs may be rank equivalent with the exception of some candidates for which one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types. Hence, we still consider such AM pairs to be rank equivalent.

1) [M2] Mutual Information; [M3] Log likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (aka Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (aka Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
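The definition can also be checked empirically, as in this illustrative sketch comparing [M18] Odds ratio and [M20] Yule's Q over a toy candidate set (counts made up); both induce the same ordering, as group 5 predicts.

```python
def odds_ratio(a, b, c, d):   # [M18]
    return (a * d) / (b * c)

def yules_q(a, b, c, d):      # [M20]
    return (a * d - b * c) / (a * d + b * c)

def rank_equivalent(m1, m2, C, eps=1e-12):
    # M1(cj) > M1(ck) <=> M2(cj) > M2(ck), with ties matching ties.
    for cj in C:
        for ck in C:
            d1, d2 = m1(*cj) - m1(*ck), m2(*cj) - m2(*ck)
            if (d1 > eps) != (d2 > eps) or (abs(d1) <= eps) != (abs(d2) <= eps):
                return False
    return True

C = [(10, 2, 3, 85), (4, 6, 1, 89), (7, 7, 7, 79)]
print(rank_equivalent(odds_ratio, yules_q, C))  # True
```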

5 Examination of Association Measures

We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.

5.1 Mutual Information and Pointwise Mutual Information

In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in identifying LVCs. In this section, we show how their performance can be improved significantly.

[Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets; bold points indicate AMs discussed in this paper. Figure 1a shows the AP profile of the AMs over our VPC data set; Figure 1b shows the profile over our LVC data set. Legend: hypothesis-test AMs; Class I AMs excluding hypothesis-test AMs; context-based AMs.]

Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U, V) = Σ_{u,v} p(uv) · log [ p(uv) / (p(u*) · p(*v)) ]

In the context of bigrams, the above formula can be simplified to [M2] MI = (1/N) Σ_{i,j} f_ij · log(f_ij / f̂_ij). While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log [ P(xy) / (P(x*) · P(*y)) ] = log [ N · f(xy) / (f(x*) · f(*y)) ]

It has long been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of the two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, either by raising it to some power k (Daille 1994) or by multiplying PMI with f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying f(xy) directly with PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs           LVCs           Mixed
Best [M6] PMI^k    54.7 (k=.13)   57.3 (k=.85)   54.4 (k=.32)
[M4] PMI           51.0           56.6           51.5
[M5] Local-PMI     25.9           39.3           27.2
[M1] Joint Prob.   17.0           28.0           17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.

Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze 1999, p. 66). In particular, we can show that MI(X, Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α·P(x*) + (1−α)·P(*y) with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks respectively. If one re-writes MI as MI = (1/N) Σ_{i,j} f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.

AM            VPCs          LVCs          Mixed
MI / NF(α)    50.8 (α=.48)  58.3 (α=.47)  51.6 (α=.5)
MI / NFmax    50.8          58.4          51.8
[M2] MI       27.3          43.5          28.9
PMI / NF(α)   59.2 (α=.8)   55.4 (α=.48)  58.8 (α=.77)
PMI / NFmax   56.5          51.7          55.6
[M4] PMI      51.0          56.6          51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
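A minimal sketch of the two normalizations, computed from contingency counts, follows; NFmax needs no tuning, while the α for NF(α) would be optimized per MWE type on held-out data (the counts and the α value below are made up).

```python
from math import log

def normalized_pmi(a, b, c, d, alpha=None):
    N = a + b + c + d
    px, py = (a + b) / N, (a + c) / N          # P(x*), P(*y)
    pmi = log((a / N) / (px * py))
    nf = max(px, py) if alpha is None else alpha * px + (1 - alpha) * py
    return pmi / nf

print(normalized_pmi(30, 5, 10, 955))            # PMI / NFmax
print(normalized_pmi(30, 5, 10, 955, alpha=0.8)) # PMI / NF(alpha)
```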

5.2 Penalization Terms

It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulae whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Brawn-Blanquet (aka Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                      VPCs   LVCs   Mixed
[M21] Brawn-Blanquet    47.8   57.8   48.6
[M22] Simpson           24.9   38.2   26.0
[M24] Adjusted S cost   48.5   57.7   49.2
[M23] S cost            24.9   38.8   26.0
[M26] Adjusted Laplace  48.6   57.7   49.3
[M25] Laplace           24.1   38.8   25.4

Table 6: Replacing min() with max() in selected AMs.

In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is none other than a rank equivalent of [M7] PMI². In our application, this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ with a scaled-down version of max(b, c). Using the independence assumption, we have approximately derived √(aN) as a lower-bound estimate of max(b, c), producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                     VPCs   LVCs   Mixed
[M28] Adjusted Fager   56.4   54.3   55.4
[M27] Fager            55.2   43.9   52.5

Table 7: Performance of Fager and its adjusted version.

The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary that we scale b and c properly, as in the scaled version of [M14]: −α×b − (1−α)×c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14] = PMI / (α×b + (1−α)×c). The denominator of [MR14] is obtained by removing the minus signs from the scaled version of [M14], so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α·P(x*) + (1−α)·P(*y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically tried to simplify [MR14] to the following AM: [M30] = log(ad) / (α×b + (1−α)×c). The α settings in Table 8 below are taken from the best α settings obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI, and better than Third Sokal-Sneath, over the VPC data set.

AM                               VPCs         LVCs         Mixed
PMI / NF(α)                      59.2 (α=.8)  55.4 (α=.48) 58.8 (α=.77)
[M30] log(ad) / (α·b + (1−α)·c)  60.0 (α=.8)  48.4 (α=.48) 58.8 (α=.77)
[M14] Third Sokal-Sneath         56.5         45.3         54.6

Table 8: AP performance of suggested VPC penalization terms and AMs.

With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α × c^(1−α). The best α value is also taken from the normalized PMI experiments (Table 5), which is nearly .5. Under this setting, this penalization term is exactly the denominator of the [M18] Odds ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.

AM                 VPCs   LVCs   Mixed
−b − c             56.5   45.3   54.6
1/(bc)             50.2   53.2   50.2
[M18] Odds ratio   44.3   56.7   45.6

Table 9: AP performance of suggested LVC penalization terms and AMs.

6 Conclusions

We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and we find that the latter are not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.

We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α × c^(1−α) is suitable for LVCs. Our introduced α tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.

Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).

References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association of Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proc. of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco, and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Yee Fan, Kan, Min-Yen, and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.

Appendix A. List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.



Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

R. Mahesh K. Sinha
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in

Abstract

A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of the light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.

1 Introduction

Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb or an adjective is followed by a light verb (LV) and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007) and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.

In this paper, we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation and results.

2 Complex Predicates in Hindi

A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.

Given below are some examples.

(1) CP = noun+LV; noun = ashirwad 'blessings', LV = denaa 'to give':
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP:
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings) makes a complex predicate verb form ashirwad diyaa (blessed) in the past tense form. The CP here is ashirwad denaa and its corresponding English translation is 'to bless'. On the other hand, in example (2), the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush 'happy', LV = karanaa 'to do':
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa 'to read', LV = lenaa 'to take':
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual modal or as an intensifier.

(5) CP = verb+LV; verb = phaadanaa 'to tear', LV = denaa 'to give':
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa 'to give', LV = maaranaa 'to hit, to kill':
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning. This may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas 'back', LV1 = karanaa 'to do', LV2 = denaa 'to give'; or CP = CP+LV, with CP = vaapas karanaa 'to return' and LV = denaa 'to give':
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV denaa (to give) signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretation in the two cases remains the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.

From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.

3 System Design

As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows (a sketch of the matching loop is given below):

1) Align the sentences of the Hindi-English corpus.

2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).

3) For each Hindi light verb, generate all the morphological forms (Figure 1).

4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).

5) For each Hindi-English aligned sentence, execute the following steps:

a) For each light verb of Hindi (Table 1), execute the following steps:

i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).

ii) If the LV or its morphological derivative is found, then search for the equivalent English meanings of any of the morphological forms (Figure 2) in the corresponding aligned English sentence.

iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, then exit, i.e. go to step (a).

iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.

v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).

vi) If W is an 'exit word', then exit, i.e. go to step (a); else the identified CP is W+LV.
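The following is a minimal sketch (ours, not the author's code) of steps 5(a)(i)-(vi). The helper structures are assumptions: lv_forms maps each light verb to its Hindi morphological forms, en_forms maps it to all morphological forms of its English meanings, and stop_words and exit_words are the lists of Figures 3 and 4:

    def mine_cps(hindi_words, english_words, lv_forms, en_forms,
                 stop_words, exit_words):
        cps = []
        english = set(english_words)
        for lv, forms in lv_forms.items():
            positions = [k for k, w in enumerate(hindi_words) if w in forms]
            if not positions:
                continue                      # step (i): LV not in this sentence
            if en_forms[lv] & english:
                continue                      # steps (ii)-(iii): LV meaning found
            for k in positions:               # in English side, so no CP here
                j = k - 1
                while j >= 0 and hindi_words[j] in stop_words:
                    j -= 1                    # step (iv): skip stop words
                if j >= 0 and hindi_words[j] not in exit_words:
                    cps.append((hindi_words[j], lv))   # step (vi): CP = W + LV
        return cps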

Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings are shown within parentheses and are not used for matching.

light verb base form     root verb meaning
baithanaa बैठना          sit
bananaa बनना             make/become/build/construct/manufacture/prepare
banaanaa बनाना           make/build/construct/manufacture/prepare
denaa देना               give
lenaa लेना               take
paanaa पाना              obtain/get
uthanaa उठना             rise/arise/get up
uthaanaa उठाना           raise/lift/wake up
laganaa लगना             feel/appear/look/seem
lagaanaa लगाना           fix/install/apply
cukanaa चुकना            (finish)
cukaanaa चुकाना          pay
karanaa करना             (do)
honaa होना               happen/become/be
aanaa आना                come
jaanaa जाना              go
khaanaa खाना             eat
rakhanaa रखना            keep/put
maaranaa मारना           kill/beat/hit
daalanaa डालना           put
haankanaa हांकना         drive

Table 1: Some of the common light verbs in Hindi


For each of the Hindi light verbs, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations of the same.

There are a number of words that can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.

Figure 1(a): Morphological derivatives of the sample Hindi light verb 'jaanaa' जाना (to go): jaa (stem); jaae, jaao, jaaeM (imperative); jaauuM (first person); jaane (infinitive oblique); jaanaa, jaanii, jaaniiM (infinitive, by gender and number); jaataa, jaatii, jaate, jaatiiM (indefinite, by gender and number); jaaoge, jaaogii, jaauuMgaa, jaauuMgii, jaayegaa, jaayegii, jaayeMge, jaayeMgii (future, by gender, number and person); gayaa, gayii, gaye, gaii, gaiiM (past, by gender and number); jaakara (completive).

Figure 1(b): Morphological derivatives of the sample Hindi light verb 'lenaa' लेना (to take): le (stem); lii, liiM, liyaa (past); leM, lo (imperative); luuM (first person); letaa, letii, lete, letiiM (indefinite); lene, lenaa, lenii (infinitive); legaa, legii, leMge, loge, luuMgaa, luuMgii (future); lekara (completive).

Figure 2: Morphological derivatives of sample English meanings: sit: sit, sits, sat, sitting; give: give, gives, gave, given, giving.

We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the exit words in our implementation. Figure 4 shows a partial list; it can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.


Figure 3: Stop words in Hindi used by the system: नहीं (no/not), न (no/not, particle), भी (also, particle), ही (only, particle), तो (then, particle), क्यों (why), क्या (what, particle), कहाँ (where, particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरंभ में (in the beginning), अंत में (in the end), आखिर में (in the end).

Figure 4: A few exit words in Hindi used by the system: ने (ergative case marker), को (accusative case marker), का / के / की (possessive case markers), से (from/by/with), में (in/into), पर (on/but), और (and, particle), तथा (and), या (or), लेकिन (but), परंतु (but), कि (that, particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा / मेरे / मेरी (my), तुम्हारा / तुम्हारे / तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना / अपनी / अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका / जिसकी / जिसके (whose), जिनको (to whom), जिनके (to whom).

The inner loop of the procedure identifies the multiple CPs that may be present in a sentence; the outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.

4 Experimentation and Results

The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%. The F-measure is 88% to 97%. This is a remarkable and somewhat surprising result for such a simple methodology, without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. 2006 (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; that is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                 File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                    112     193     102      43     133     107
Total no. of CPs (N)                200     298     150      46     188     151
Correctly identified CPs (TP)       195     296     149      46     175     135
V-V CPs                              56      63       9       6      15      20
Incorrectly identified CPs (FP)      17      44       7      11      16      20
Unidentified CPs (FN)                 5       2       1       0      13      16
Accuracy (%)                      97.50   99.33   99.33  100.00   93.08   89.40
Precision (TP/(TP+FP)) (%)        91.98   87.05   95.51   80.70   91.62   87.09
Recall (TP/(TP+FN)) (%)           97.50   98.33   99.33  100.00   93.08   89.40
F-measure (2PR/(P+R)) (%)          94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation



Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।

The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. खुशी होना (to feel happy, good)
v. मदद करना (to help)

Here the system identified five different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुंचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।

The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. कद्र करना (to respect)

Here also, the two CPs identified are correct. It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not checked. However, the mining success rate obtained shows that such cases are few in practice. The use of 'stop words' to allow intervening words within CPs helps considerably in improving the performance; similarly, the use of 'exit words' avoids a lot of incorrect identifications.

5 Conclusions

The simple empirical method for mining CPs outlined in this work yields an average of 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. It can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both the alignment accuracy and the speed.

The methodology presented in this work is equally applicable to the other languages of the Indo-Aryan family.

References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553–564.



Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2

1Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn

2Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
huangyun@comp.nus.edu.sg

Abstract

Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.

1 Introduction

The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, the translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.

A multiword expression (MWE) can be considered as a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e. the property that the whole expression cannot be derived from its component words. Stanford University launched an MWE project (http://mwe.stanford.edu), in which different qualities of MWEs were presented. For a bilingual multiword expression (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is an MWE in the source language, and (2) the source phrase and the target phrase are exact translations of each other, i.e. there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since an MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.

For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as



one unique token before training alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are listed as follows:

• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses (http://www.statmt.org/moses/), which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics lead only to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).

Besides the above differences, there are some advantages to our approaches:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would otherwise propagate to the SMT system and hurt SMT performance.

• We conduct experiments on domain-specific corpora. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e. it has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are flexible and concise. Take for example the Chinese traditional medicine term whose meaning is "soften hard mass and dispel pathogenic accumulation". Every word of this term carries a special meaning and cannot be understood literally or out of this context. Such terms are difficult to translate even for humans, let alone machine translation systems. So treating these terms as MWEs and applying them in an SMT system has practical significance.

• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus itself and adopt suitable methods to apply them. Thus the available resources are fully exploited without greatly increasing the time and space complexity of the SMT system.

The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction to monolingual and bilingual MWE extraction.

2 Bilingual Multiword Expression Extraction

In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.

2.1 Automatic Extraction of MWEs

In the past two decades, many different approaches to automatic MWE identification have been reported. In general, these approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et al., 2005), (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007), and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.

The Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate the algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".

Definition 1 (Unit): A unit is any substring of the given sentence. For example, "A B", "C", "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score): The score function is defined only on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit (we use a stoplist to eliminate units containing function words, by setting their score to 0). For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select): The selecting operator finds the two adjacent units with the maximum score in a list.

Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce the unit "B C" and the unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".

Initially, every word in the sentence is considered as one unit, and all these units form an initial list L; if the sentence is of length N, the list of course contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with the maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L, and insert the reducing result into the final set S. The algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.
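The loop above can be written down directly; the following is a minimal sketch (ours, not the authors' implementation), where llr is assumed to be a function giving the LLR of two adjacent words (0 for stoplisted function words):

    def hra(words, llr, threshold):
        units = [[w] for w in words]   # initially, every word is its own unit
        mwes = set()
        while len(units) > 1:
            # select: adjacent pair with maximum score, where the score is the
            # LLR of the last word of the left unit and the first word of the
            # right unit
            scores = [llr(units[i][-1], units[i + 1][0])
                      for i in range(len(units) - 1)]
            i = max(range(len(scores)), key=scores.__getitem__)
            if scores[i] < threshold:
                break                  # termination condition (1)
            # reduce: concatenate the pair and record the result as an MWE
            units[i:i + 2] = [units[i] + units[i + 1]]
            mwes.add(" ".join(units[i]))
        return mwes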

Figure 1: Example of the Hierarchical Reducing Algorithm (the hierarchical structure built over the example sentence, with LLR scores of adjacent words 1471, 67552, 10596, 0, 0, 8096)

Let us make the algorithm clearer with an example. Assume the threshold of the score is 20, and the given sentence is "[preventing] [hyper-] [-lipid-] [-emia] [for] [healthy] [tea]", a Chinese sentence meaning "healthy tea for preventing hyperlipidemia", where each bracketed gloss stands for one Chinese word. Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs ("[hyper-][-lipid-]", "[hyper-][-lipid-][-emia]", "[healthy][tea]", "[preventing][hyper-][-lipid-][-emia]") are extracted, in that order, and the substrings above the dotted line in Figure 1 are not extracted.

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are scored by the minimum LLR of adjacent words within them, which gives lexical confidence to the extracted MWEs. Finally, for a given sentence of length N, HRA is guaranteed to terminate within N−1 iterations, which is very efficient.

However, HRA has a problem: it extracts substrings before extracting the whole string, even if a substring appears only in that particular whole string, which we consider useless. To solve this problem, we use contextual features, contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.

2.2 Automatic Extraction of MWE Translations

In subsection 2.1 we described the algorithm to obtain MWEs; in this subsection we introduce the procedure to find their translations in the parallel corpus.

To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:

1. Run GIZA++ (http://www.fjoch.com/GIZA++.html) to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs where the source language sentence includes the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which reflect how likely a candidate is to be a reasonable phrase. The translation features include: (1) the logarithm of the source-target translation probability, (2) the logarithm of the target-source translation probability, (3) the logarithm of the source-target lexical weighting, (4) the logarithm of the target-source lexical weighting, and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003), (2) the right entropy of the target phrase, (3) the first word of the target phrase, (4) the last word of the target phrase, and (5) all words in the target phrase.


We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
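As an illustration of this step, a binary perceptron over the features above can be trained as in the following toy sketch (ours, under our own assumptions, not the authors' implementation); features(pair) is assumed to map a phrase pair to its fixed-length feature vector:

    def train_perceptron(data, features, epochs=10):
        # data: list of (phrase_pair, label) pairs with label in {+1, -1}
        w = [0.0] * len(features(data[0][0]))
        for _ in range(epochs):
            for pair, y in data:
                x = features(pair)
                score = sum(wi * xi for wi, xi in zip(w, x))
                if y * score <= 0:          # misclassified: perceptron update
                    w = [wi + y * xi for wi, xi in zip(w, x)]
        return w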

3 Application of Bilingual MWEs

Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.

3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect the alignment to be modified and the probability estimation to become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs were extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance was shown by many researchers (Wu et al., 2008; Eck et al., 2004); our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
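Concretely, the data preparation for this method can be as simple as the following sketch (file names and the bilingual_mwes list are illustrative assumptions): each extracted bilingual MWE is appended to the parallel corpus as if it were a sentence pair, before rerunning GIZA++ and the phrase extraction:

    bilingual_mwes = [("source mwe", "target mwe")]  # extracted MWE pairs
    with open("corpus.zh", "a") as zh, open("corpus.en", "a") as en:
        for src, tgt in bilingual_mwes:
            zh.write(src + "\n")
            en.write(tgt + "\n")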

3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) once pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs: if the source language phrase contains an MWE (as a substring) and the target language phrase contains the translation of that MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
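A sketch of this feature annotation (our illustration, not the authors' code) is shown below. It assumes the plain-text "src ||| tgt ||| scores" layout of the Moses phrase table (the exact column set depends on the Moses version) and an assumed bimwe_dict mapping source MWEs to their translations:

    bimwe_dict = {"source mwe": "target mwe"}  # assumed MWE-translation map

    def tag_entry(src, tgt, bimwe_dict):
        # feature value 1 if some source MWE and its translation are
        # substrings of the source and target phrases respectively, else 0
        for mwe, trans in bimwe_dict.items():
            if mwe in src and trans in tgt:
                return 1
        return 0

    with open("phrase-table") as fin, open("phrase-table.feat", "w") as fout:
        for line in fin:
            src, tgt, scores = line.rstrip("\n").split(" ||| ")[:3]
            feat = tag_entry(src, tgt, bimwe_dict)
            fout.write("%s ||| %s ||| %s %d\n" % (src, tgt, scores, feat))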

3.3 Additional Phrase Table of Bilingual MWEs

Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As to probability assignment, we simply assign 1 to each of the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in an input sentence, the decoder searches for candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to denote this method.
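The additional table can be generated in a few lines (again a sketch under the same assumptions as above; the four translation probabilities are all set to 1 as described, and the number of score columns Moses expects depends on its configuration):

    bilingual_mwes = [("source mwe", "target mwe")]  # extracted MWE pairs
    with open("mwe-phrase-table", "w") as out:
        for src, tgt in bilingual_mwes:
            out.write("%s ||| %s ||| 1 1 1 1\n" % (src, tgt))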

4 Experiments

4.1 Data

We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

For the traditional medicine domain, Table 1 shows the data statistics. For the language model, we use the SRI Language Modeling Toolkit (http://www.speech.sri.com/projects/srilm/) to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.

                      Chinese     English
Training  Sentences        120,355
          Words        4,688,873   4,737,843
Dev       Sentences          1,000
          Words           31,722      32,390
Test      Sentences          1,000
          Words           41,643      40,551

Table 1: Traditional medicine corpus


For the chemical industry domain, Table 2 gives detailed information on the data. In this experiment, 59,466 bilingual MWEs are extracted.

                      Chinese     English
Training  Sentences        120,856
          Words        4,532,503   4,311,682
Dev       Sentences          1,099
          Words           42,122      40,521
Test      Sentences          1,099
          Words           41,069      39,210

Table 2: Chemical industry corpus

We test translation quality on the test set and use the open-source tool mteval-v11b.pl (http://www.nist.gov/speech/tests/mt/resources/scoring.htm) to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation there is only one reference per test sentence. We also perform a statistical significance test between translation results (Collins et al., 2005): the mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.

4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features, (2) one language model feature, (3) distance-based and lexicalized distortion model features, (4) word penalty, and (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus, and as a result a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in a log-linear model. To obtain the best translation e* of the source sentence f, the log-linear model uses the following equation:

e* = argmax_e p(e|f) = argmax_e Σ_{m=1}^{M} λ_m h_m(e, f)    (1)

in which h_m and λ_m denote the mth feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.



4.3 Results

Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experimental results. From this table we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, of 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement. And Baseline+BiMWE achieves slightly higher translation quality than the baseline system.

To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined as:

Score(s, t) = log(LLR(s, t)) × (1 / (√(2π) σ)) × e^(−(x−μ)² / (2σ²))    (2)

Here the mean μ and variance σ² are estimated from the training set. After ranking by score, we select the top 50,000, 60,000 and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
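Equation (2) can be computed directly; a minimal sketch follows (llr_st, x, mu and sigma as defined above):

    import math

    def rank_score(llr_st, x, mu, sigma):
        # Gaussian weight on the length ratio x, times log LLR (Eq. 2)
        gauss = (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
                 / (math.sqrt(2 * math.pi) * sigma))
        return math.log(llr_st) * gauss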

From this table we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT; this conclusion is the same as that of Lambert and Banchs (2005).

Methods            50,000   60,000   70,000
Baseline                   0.2658
Baseline+BiMWE     0.2671   0.2686   0.2715
Baseline+Feat      0.2728   0.2734   0.2712
Baseline+NewBP     0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain

To verify that bilingual MWEs do indeed improve SMT performance beyond one particular domain, we also performed experiments on the chemical industry domain. Table 5 shows the results. From this table, we can see that the three methods improve translation performance on the chemical industry domain as well as on the traditional medicine domain.

Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain

4.4 Discussion

In order to understand in what respects our methods improve translation performance, we manually analyzed some test sentences and give examples in this subsection.

(1) For the first example in Table 6, the Chinese term meaning "dredging meridians" is aligned to other words and not correctly translated by the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since this bilingual MWE has been added into the training corpus and then aligned by GIZA++.

(2) For the second example in Table 6, the Chinese term for "medicated tea" has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as its translation, while the Baseline+Feat system chooses "medicated tea", because the additional feature gives a high probability to the correct translation "medicated tea".


Example 1
Src: (Chinese source sentence)
Ref: the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE: the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Example 2
Src: (Chinese source sentence)
Ref: the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat: may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples


5 Conclusion and Future Work

This paper presented the LLR-based Hierarchical Reducing Algorithm for automatically extracting bilingual MWEs, and investigated the performance of three strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are rough and simple. More sophisticated approaches should be taken into account when applying bilingual MWEs; for example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in a phrase-based SMT system, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.

Acknowledgments

This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson and Wolfgang Teubert. 2002. Extraction of translation units from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a multilingual context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward and Weijing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.



Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp

Abstract

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than that of previous work.

1 Introduction

Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).

In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.

When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (return home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information about the head of the bunsetsu.1 In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method.

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.


(1) kikoku-shita  Kazama-san-wa
    return home   Mr. Kazama TOP
    'Mr. Kazama, who returned home'

(2) hatsudou-shita  shinyou-kumiai-kyusai-ginkou-no  setsuritsu-mo
    invoke          credit union relief bank GEN     establishment
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to                 gaikoku-jin-touroku-hou-ihan-no          utagai-de
    private document falsification and foreigner registration law violation GEN suspicion INS
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences

Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than that of previous work.

This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on methods based on sequential labeling. Section 3 gives an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.

2 Related Work

In Japanese Named Entity Recognition, the definition of Named Entity of the IREX Workshop (IREX Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.

class          example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21st century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples

NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.

In general, a machine learning method is formalized as a sequential labeling problem, in which each token (character or morpheme) is first assigned one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, and E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.2 The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 x 4 + 1).
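As a concrete illustration (a sketch we add for clarity, not code from the paper), the SE label inventory can be enumerated as follows, using the class names of Table 1:

    # SE-algorithm label set: S/B/I/E per NE class plus a single O label.
    NE_CLASSES = ["PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
                  "DATE", "TIME", "MONEY", "PERCENT"]
    LABELS = [f"{pos}-{c}" for c in NE_CLASSES for pos in "SBIE"] + ["O"]
    assert len(LABELS) == 33  # 8 classes x 4 positions + 1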

The model for label estimation is learned by machine learning. The following features are generally utilized: the characters, character types, POS, etc. of the morpheme and of the surrounding two morphemes. Methods utilizing SVM or CRF have been proposed.

2 Besides these, there are the IOB1 and IOB2 algorithms, using only I, O, and B, and the IOE1 and IOE2 algorithms, using only I, O, and E (Tjong Kim Sang and Veenstra, 1999).


[Figure 2 (not reproducible here) shows the triangular, CKY-style table for the bunsetsu "Habu-Yoshiharu-Meijin". (a) Initial state: every chunk has an estimated label and score, e.g. "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Yoshiharu" MONEY 0.075, "Yoshiharu-Meijin" OTHERe 0.092, "Meijin" OTHERe 0.245. (b) Final output: the assignment "Habu-Yoshiharu" (PERSON) + "Meijin" (OTHERe), with score 0.438 + 0.245, is chosen; the analysis proceeds from the lower left to the upper right.]

Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin")


Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word subclass of the NE in the analysis direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized a cache feature, coreference results, a syntactic feature, and a case frame feature as structural features (Sasano and Kurohashi, 2008).

Some work has acquired knowledge from large unannotated corpora and applied it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge amount of pairs of dependency relations (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge amount of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al., 2008).

In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.

3 Overview of Proposed Method

Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), in which the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the assignment "Habu-Yoshiharu" (PERSON) and the assignment "Habu" (PERSON) + "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in a general CKY algorithm, our method chooses the better label assignment for each cell using a machine learning method.

The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.


[Figure 3 (not reproducible here) shows the labels for all the chunks of the bunsetsu "Habu-Yoshiharu-Meijin(-ga)": "Habu-Yoshiharu" is PERSON, "Meijin" is OTHERe, and all the other chunks are invalid.]

Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"

4 Model Learning

4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions located across multiple bunsetsus tend to be an NE:

• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, the bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.3

For each bunsetsu, head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted; in the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.
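Since every contiguous morpheme sequence of the (expanded, function-word-stripped) bunsetsu becomes a labeling unit, the chunks can be enumerated as in the following sketch (our illustration; the names are not from the authors' implementation):

    def enumerate_chunks(morphemes):
        """Return all contiguous subsequences of a bunsetsu's morphemes."""
        chunks = []
        for i in range(len(morphemes)):
            for j in range(i + 1, len(morphemes) + 1):
                chunks.append("-".join(morphemes[i:j]))
        return chunks

    print(enumerate_chunks(["Habu", "Yoshiharu", "Meijin"]))
    # ['Habu', 'Habu-Yoshiharu', 'Habu-Yoshiharu-Meijin',
    #  'Yoshiharu', 'Yoshiharu-Meijin', 'Meijin']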

Next, for learning the label estimation model, all the chunks in a bunsetsu are given their correct labels from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (shown in Table 1) and the five classes OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

A chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled as OTHERb, OTHERi, or OTHERe, respectively.4

3 As examples in which an NE is not covered by an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4 Each OTHER label is assigned to the longest chunk that satisfies its condition.

1. the number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type5

4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word subclass, and several features provided by the morphological analyzer JUMAN

6. several features6 provided by the parser KNP

7. the string of the morpheme in the chunk

8. IPADIC7 feature
   - If the string of the chunk is registered in one of the following categories of IPADIC: "person", "location", "organization", or "general", this feature fires

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")

10. cache feature
    - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used for the feature

11. the particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of the predicate/noun case frame (Sasano and Kurohashi, 2008)
    - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
    - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model

A chunk that is neither one of the eight NE classes nor one of the above four OTHERs is labeled as invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.

Next, the label estimation model is learned from the data in which the above labels have been assigned to all the chunks.

5 The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6 When a morpheme has ambiguity, all the corresponding features fire.

7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja


The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of the morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features for each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1 + exp(-βx)), and the transformed value is normalized so that the values of the 13 labels in a chunk sum to one.
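The conversion from raw one-vs-rest SVM outputs to normalized label scores can be sketched as follows (our illustration; β is set to 1 in the experiments of Section 6):

    import math

    def label_scores(svm_outputs, beta=1.0):
        # sigmoid 1 / (1 + exp(-beta * x)) applied to each one-vs-rest output
        sig = [1.0 / (1.0 + math.exp(-beta * x)) for x in svm_outputs]
        # normalize so that the scores of the 13 labels sum to one
        total = sum(sig)
        return [s / total for s in sig]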

The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk in which the label invalid has the highest score, the label with the second highest score is adopted.

4.2 Label Comparison Model

This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up, separated by "vs." (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not used for the model learning.

Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used for the label estimated by the label estimation model, and the last, 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features of each chunk are arranged from the left, and zero vectors are placed in the remainder.

positive:
  +1  Habu-Yoshiharu (PSN)  vs.  Habu (PSN) + Yoshiharu (MNY)
  +1  Habu-Yoshiharu (PSN) + Meijin (OTHERe)  vs.  Habu (PSN) + Yoshiharu (MONEY) + Meijin (OTHERe)

negative:
  -1  Habu-Yoshiharu-Meijin (ORG)  vs.  Habu-Yoshiharu (PSN) + Meijin (OTHERe)

Figure 4: Assignment of positive/negative examples


Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for the model learning.

5 Analysis

First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B"; that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the label assignment "E" with the label assignment "D+F".


first set:   chunk "Habu-Yoshiharu" (PERSON);                     vector  V11 0 0 0 0
second set:  chunks "Habu" (PERSON) + "Yoshiharu" (MONEY);        vector  V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu", and V11, V21, V22, and 0 are vectors whose dimension is 13)
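A sketch of this feature layout (our illustration; we read the 12 label dimensions as a one-hot indicator of the estimated label, which is one plausible interpretation of the description above, and the label indices used below are arbitrary):

    MAX_CHUNKS = 5       # sets with more than five chunks are discarded
    DIMS_PER_CHUNK = 13  # 12 label dimensions + 1 SVM score dimension

    def chunk_features(label_index, svm_score):
        v = [0.0] * 12
        v[label_index] = 1.0    # the label estimated for this chunk
        return v + [svm_score]  # 13th dimension: the SVM output score

    def set_features(chunks):
        """chunks: list of (label_index, svm_score), len <= MAX_CHUNKS."""
        vec = []
        for label_index, score in chunks:
            vec += chunk_features(label_index, score)
        # zero vectors fill the remainder, as in Figure 5
        vec += [0.0] * (DIMS_PER_CHUNK * (MAX_CHUNKS - len(chunks)))
        return vec

    # "Habu-Yoshiharu" (PERSON) vs. "Habu" (PERSON) + "Yoshiharu" (MONEY)
    x = set_features([(0, 0.438)]) + set_features([(0, 0.111), (6, 0.075)])
    assert len(x) == 2 * MAX_CHUNKS * DIMS_PER_CHUNK  # 130 dimensions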

[Figure 6 (not reproducible here) shows the CKY-style table with cells A ("Habu"), B ("Habu-Yoshiharu"), C ("Habu-Yoshiharu-Meijin"), D ("Yoshiharu"), E ("Yoshiharu-Meijin"), and F ("Meijin"). (a) For the cell "Habu-Yoshiharu", the label assignment "Habu-Yoshiharu" (B) is compared with "Habu" + "Yoshiharu" (A+D). (b) For the cell "Habu-Yoshiharu-Meijin", the label assignments "Habu-Yoshiharu-Meijin" (C), "Habu-Yoshiharu" + "Meijin" (B+F), and "Habu" + "Yoshiharu" + "Meijin" (A+D+F) are compared; by this point the cell E has been updated to "Yoshiharu + Meijin" (MONEY + OTHERe, 0.075 + 0.245).]

Figure 6: The label comparison model

In this case, the model chooses the label assignment "D+F"; that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated: here, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparison step, a label assignment in which one OTHER* follows another OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination, the comparison is not performed.
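The whole procedure can be sketched as CKY-style dynamic programming over morpheme spans (our illustration; `estimate` and `compare` stand in for the two learned models, and the pairwise tournament below abstracts the all-pairs comparison described above):

    def analyze_bunsetsu(morphemes, estimate, compare):
        """estimate(span) -> a labeled chunk covering the span;
        compare(a, b) -> the better of two label assignments."""
        n = len(morphemes)
        best = {}  # (i, j) -> best label assignment for morphemes i..j-1
        for length in range(1, n + 1):
            for i in range(n - length + 1):
                j = i + length
                # the whole span as a single labeled chunk ...
                cands = [[estimate(morphemes[i:j])]]
                # ... or any combination of already-solved sub-spans
                for k in range(i + 1, j):
                    cands.append(best[(i, k)] + best[(k, j)])
                winner = cands[0]
                for c in cands[1:]:
                    winner = compare(winner, c)
                best[(i, j)] = winner
        return best[(0, n)]  # the upper right cell: the final output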

6 Experiment

To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were used for neither the model learning8 nor the evaluation.

[Figure 7 (not reproducible here) shows the corpus split into five parts (a)-(e). Label estimation model 1 is learned from parts (b)-(e). Each label estimation model 2x (x = b, c, d, e) is learned from the three parts other than x and applied to part x to obtain features for the label comparison model, which is then learned from those features.]

Figure 7: 5-fold cross-validation

We performed 5-fold cross-validation, following previous work. Different from previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, label estimation model 1 is learned from parts (b)-(e). Then, label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same goes for parts

8 Exceptionally, "OPTIONAL" expressions are used when the label estimation model for OTHER* and invalid is learned.


                Recall               Precision
ORGANIZATION    81.83 (3008/3676)    88.37 (3008/3404)
PERSON          90.05 (3458/3840)    93.87 (3458/3684)
LOCATION        91.38 (4992/5463)    92.44 (4992/5400)
ARTIFACT        46.72 (349/747)      74.89 (349/466)
DATE            93.27 (3327/3567)    93.12 (3327/3573)
TIME            88.25 (443/502)      90.59 (443/489)
MONEY           93.85 (366/390)      97.60 (366/375)
PERCENT         95.33 (469/492)      95.91 (469/489)
ALL-SLOT        87.87                91.79
F-measure       89.79

Table 3: Experimental result

(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both label estimation model 1 and the label comparison model.
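In other words, the features for the label comparison model are produced stacking-style, from predictions on held-out parts only; a sketch for one fold (with hypothetical helper names):

    def comparison_model_features(parts, train_label_estimation):
        """parts: the four training parts, e.g. [b, c, d, e];
        train_label_estimation: hypothetical trainer returning a model
        with a .predict() method."""
        features = []
        for held_out in parts:
            train = [p for p in parts if p is not held_out]
            model2 = train_label_estimation(train)  # model 2b, 2c, 2d, 2e
            # predictions on a part never seen by the model that made them
            features.extend(model2.predict(held_out))
        return features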

In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental results. The F-measure over all NE classes is 89.79.

7 Discussion

7.1 Comparison with Previous Work

Table 4 presents the comparison with previous work; our method outperformed the previous methods. Among previous work, Fukushima et al. acquired a huge amount of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus, and Sasano et al. utilized the results of coreference resolution as a feature for the model learning. Therefore, by incorporating such knowledge and/or analysis results into our method, the performance could be improved further.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, for the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about

9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html

the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION becomes relatively low. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

For the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for the label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", our method can utilize the information of "hou".

7.2 Error Analysis

There were some errors in analyzing Katakana or alphabet words. In the following example, although the correct analysis labels "Batistuta" as PERSON, the system labeled it as OTHERs:

(4) Italy-de   katsuyaku-suru  Batistuta-wo   kuwaeta  Argentine
    Italy LOC  active          Batistuta ACC  call     Argentine
    'Argentine called Batistuta, who was active in Italy'

There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, and thus only the surrounding information can be utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were also some errors in applying the label comparison model although the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis labels "HongKong-seityou" as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we plan to adjust the value of β in the sigmoid function and to refine the features for the label comparison model.


                                  F1      analysis unit   distinctive features
(Fukushima et al., 2008)          89.29   character       Web
(Kazama and Torisawa, 2008)       88.93   character       Wikipedia, Web
(Sasano and Kurohashi, 2008)      89.40   morpheme        structural information
(Nakano and Hirai, 2004)          89.03   character       bunsetsu feature
(Masayuki and Matsumoto, 2003)    87.21   character
(Isozaki and Kazawa, 2003)        86.77   morpheme
proposed method                   89.79   compound noun   Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on the CRL NE data using cross-validation)

[Figure 8 (not reproducible here): (a) initial state, with "HongKong" LOCATION 0.266, "seityou" OTHERe 0.184, and "HongKong-seityou" ORGANIZATION 0.205; (b) final output, in which "HongKong + seityou" (LOC + OTHERe, 0.266 + 0.184) is chosen over "HongKong-seityou" (ORGANIZATION).]

Figure 8: An example of an error in the label comparison model


8 Conclusion

This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.

We plan to integrate this method with the syntactic and case analysis method of Kawahara and Kurohashi (2007), performing syntactic, case, and Named Entity analysis simultaneously to improve the overall accuracy.

References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3 (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970–979 (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304–311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407–415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173–179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. Pages 1121–1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282–289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8–15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934–941 (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607–612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171–178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.


[Pages 63-70: "Abbreviation Generation for Japanese Multi-Word Expressions", by Hiroko Fujii, Mika Fukui, Kazuo Sumita, Masaru Suzuki, and Hiromi Wakaki (title and authors recovered from the workshop program and the author index). The body of this paper did not survive text extraction and is omitted here.]

Author Index

Bhutada, Pravin, 17
Cao, Jie, 47
Caseli, Helena, 1
Diab, Mona, 17
Finatto, Maria Jose, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55
Hoang, Hung Huu, 31
Huang, Yun, 47
Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55
Liu, Qun, 47
Lu, Yajuan, 47
Machado, Andre, 1
Ren, Zhixiang, 47
Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63
Villavicencio, Aline, 1
Wakaki, Hiromi, 63
Zarrieß, Sina, 23

  • Workshop Program
  • Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
  • Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
  • Verb Noun Construction MWE Token Classification
  • Exploiting Translational Correspondences for Pattern-Independent MWE Identification
  • A re-examination of lexical association measures
  • Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
  • Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
  • Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
  • Abbreviation Generation for Japanese Multi-Word Expressions

Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, Andre Machado♣, Maria Jose Finatto♥

♦Department of Computer Science, Federal University of Sao Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br

Abstract

Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.

1 Introduction

A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing), and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).

Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.

In this paper, we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task.


In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or, with some semantic information, for monolingual resources.

The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes the paper with some conclusions and proposals for future work.

2 Related Work

The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.

A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g., Zhang et al. (2006), Villavicencio et al. (2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003), and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g., Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g., Villada Moirón and Tiedemann (2006), Caseli et al. (2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g., ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).

Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy, and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that, for MWE identification, the efficacy of a given measure depends on factors like the type of MWEs being targeted, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).

Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis of the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).

Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.

Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.

3 The Corpus and Reference Lists

The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify, and support translation students in the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g., removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding); see the sketch below. This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms with 1,421 bigrams, 730 trigrams, and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
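A sketch of this subsumption step (our illustration, not the original scripts):

    def drop_subsumed(ngrams):
        """Keep an n-gram only if it is not contained in a longer kept one,
        e.g. drop "aleitamento materno" in favour of
        "aleitamento materno exclusivo"."""
        kept = []
        for ng in sorted(ngrams, key=len, reverse=True):  # longest first
            if not any(ng in longer for longer in kept):
                kept.append(ng)
        return kept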

4 Statistically-Driven and Alignment-Based methods

4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992), and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work, we use two measures commonly employed for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

1 Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering out n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the four measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
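For reference, PMI for a bigram can be computed from corpus counts as in the sketch below (the standard formula; the paper relies on the Ngram Statistics Package rather than custom code):

    import math

    def pmi(c_xy, c_x, c_y, n):
        """c_xy: bigram count; c_x, c_y: unigram counts; n: total bigrams."""
        return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))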

4.2 Alignment-Based method

The second of the MWE extraction approaches investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that: (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that a sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n:m alignment with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2:1 alignment), 8 times with the word breastfed (a 2:1 alignment), 2 times with breastfeeding practice (a 2:2 alignment), and so on.

Thus, notice that the alignment-based MWE extraction method does not rely on conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for sequences of source words that are frequently joined together during the alignment, whatever the number of target words involved. These features indicate that the method prioritizes precision over recall.

It is also important to say that although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n:m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained on the same corpus used in our experiments, and the numbers of phrases extracted by the two approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.

In this paper, we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as the gold standard.

To extract MWE candidates following the alignment-based approach, the original corpus first had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps we used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).

2 GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html

3 Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org


From the preprocessed corpus, the MWE candidates are extracted as those sequences in which two or more words share the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) have a frequency below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
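One simple reading of this extraction step is sketched below (illustrative data structures, not the authors' code): a candidate is any run of two or more consecutive source words linked to the same target positions.

    from collections import Counter

    def extract_candidates(aligned_corpus):
        """aligned_corpus: iterable of (source_words, links) pairs, where
        links[i] is the frozenset of target positions aligned to word i."""
        candidates = Counter()
        for words, links in aligned_corpus:
            i = 0
            while i < len(words):
                j = i + 1
                # extend the run while the alignment stays the same
                while j < len(words) and links[j] == links[i] and links[i]:
                    j += 1
                if j - i >= 2:  # an n:m alignment with n >= 2
                    candidates[" ".join(words[i:j])] += 1
                i = j
        return candidates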

Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first (F1) is the same as that used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun or beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second (F2) is the same as that used in Caseli et al. (2009), mainly: (a) patterns beginning with determiners, auxiliary verbs, pronouns, adverbs, conjunctions, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why), and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.

The third (F3) is the same as in Caseli et al. (2009) plus: (a) patterns beginning or finishing with determiners, adverbs, conjunctions, prepositions, verbs, pronouns, or numerals, and (b) a minimum frequency threshold of 2.

5 Experiments and Results

Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and by the alignment-based approach. Although some of the results are good, especially among the top candidates, there is still considerable noise, as for instance jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), but also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists), and F-measure (2 x precision x recall / (precision + recall)) figures for the association measures, using all the candidates (in the first half of the table) and using only the top 1,421 bigram and 730 trigram candidates (in the second half).

PMI                            alignment-based
Online Mendelian Inheritance   faixa etária
Beta Technology Incorporated   aleitamento materno
Lange Beta Technology          estados unidos
Óxido Nítrico Inalatório       hipertensão arterial
jogar video game               leite materno
e um de                        couro cabeludo
e a do                         bloqueio lactíferos
se que de                      emocional anatomia
e a da                         neonato a termo
e de não                       duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach

pt MWE candidates     PMI      MI
All candidates:
proposed bigrams      64,839   64,839
correct MWEs          1,403    1,403
precision (%)         2.16     2.16
recall (%)            98.73    98.73
F (%)                 4.23     4.23
proposed trigrams     54,548   54,548
correct MWEs          701      701
precision (%)         1.29     1.29
recall (%)            96.03    96.03
F (%)                 2.55     2.55
Top candidates:
proposed bigrams      1,421    1,421
correct MWEs          155      261
precision (%)         10.91    18.37
recall (%)            10.91    18.37
F (%)                 10.91    18.37
proposed trigrams     730      730
correct MWEs          44       20
precision (%)         6.03     2.74
recall (%)            6.03     2.74
F (%)                 6.03     2.74

Table 2: Evaluation of MWE candidates (PMI and MI)

From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.
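For concreteness, the figures in Tables 2 and 4 follow the usual set-overlap definitions, which can be reproduced as follows (a sketch, not the authors' code; candidates and reference are assumed to be non-empty sets of n-grams):

    def evaluate(candidates, reference):
        # precision, recall and F-measure of a candidate set against a
        # gold-standard reference list such as the Pediatrics Glossary
        correct = len(candidates & reference)
        precision = correct / len(candidates)
        recall = correct / len(reference)
        f = (2 * precision * recall / (precision + recall)
             if correct else 0.0)
        return precision, recall, f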

On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered by the three filters described in Section 4.2: F1, F2 and F3.

To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary.


pt MWE candidates          F1      F2      F3
filtered by POS patterns   24,996  21,544  32,644
filtered by frequency      9,012   11,855  1,442
final set                  269     878     191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates      F1     F2     F3
proposed bigrams       250    754    169
correct MWEs           48     95     65
precision (%)          19.20  12.60  38.46
recall (%)             3.38   6.69   4.57
F (%)                  5.75   8.74   8.18
proposed trigrams      19     110    20
correct MWEs           1      9      4
precision (%)          5.26   8.18   20.00
recall (%)             0.14   1.23   0.55
F (%)                  0.27   2.14   1.07
proposed bi+trigrams   269    864    189
correct MWEs           49     104    69
precision (%)          18.22  12.04  36.51
recall (%)             2.28   4.83   3.21
F (%)                  4.05   6.90   5.90

Table 4: Evaluation of MWE candidates

In this automatic comparison we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not considered at the moment.

After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in (Caseli et al., 2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.

From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Pediatrics Glossary (see Table 4).

The remaining 122 candidates (63.9%) were analysed by two native-speaker human judges, who classified each of the 122 candidates as true if it is a multiword expression, or false otherwise, independently of it being a Pediatrics term. For the judges, a sequence of words was considered a MWE mainly if it was (1) a proper name or (2) a sequence of words whose meaning cannot be obtained by compounding the meanings of its component words.

The judgments of the two judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates good agreement.
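The kappa value corrects raw agreement for chance; a minimal sketch of the computation for two judges follows (illustrative code, not tied to any particular toolkit; judge1 and judge2 are assumed to be equal-length lists of True/False labels):

    def cohen_kappa(judge1, judge2):
        n = len(judge1)
        # observed agreement: fraction of items labeled identically
        observed = sum(a == b for a, b in zip(judge1, judge2)) / n
        # expected (chance) agreement from each judge's label distribution
        expected = 0.0
        for label in set(judge1) | set(judge2):
            expected += (judge1.count(label) / n) * (judge2.count(label) / n)
        if expected == 1:
            return 1.0  # degenerate case: only one label in use
        return (observed - expected) / (1 - expected)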

In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating it with respect to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.

6 Conclusions and Future Work

MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.

In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work.


Moreover, the results obtained show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.

Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both or to at least one of them were, respectively, 65.97% and 75.92%. This is a significant improvement, but it is important to note that in this manual analysis the human experts classified the MWEs as true independently of them being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.

In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.

Acknowledgments

We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.

References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50-59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98-104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047-2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370-381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistics. Computational Linguistics, 22(2):249-254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49-56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117-134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450-466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28.

Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165-178. Oxford University Press, Oxford.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534-559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459-484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33-40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440-447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1-7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17-24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1-15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034-1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415-432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36-44, Sydney, Australia, July. Association for Computational Linguistics.



Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles

Su Nam Kim
CSSE Department
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.

1 Introduction

Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications, such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).

In the past, various attempts have been made to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and on rich sets of heuristic features (Barker and Cornacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007).

In Section 2 we give a more comprehensive overview of these previous attempts.

Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g., (Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007)), most of them do not effectively deal with the various forms keyphrases take, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.

Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. That leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, this approach involves a large amount of manual labor, thus reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthy study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.

This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method which improves the coverage of candidates occurring in various forms.


Our second contribution re-examines existing keyphrase extraction features reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.

In the remaining sections, we give an overview of related work in Section 2 and present our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. We then evaluate our proposals, discuss the outcomes and conclude our work in Sections 7, 8 and 9, respectively.

2 Related Work

The majority of related work has been carried out using statistical approaches, rich sets of symbolic resources and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Cornacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). Features used can be categorized into three broad groups: (1) document cohesion features (i.e., the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and, to a lesser extent, (2) keyphrase cohesion features (i.e., the relationship among keyphrases) (Turney, 2003), and (3) term cohesion features (i.e., the relationship among components in a keyphrase) (Park et al., 2004).

The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TF*IDF (i.e., term frequency * inverse document frequency) and first occurrence in the document. TF*IDF measures document cohesion, and the first occurrence implies the importance of the abstract or introduction, which indicates that keyphrases have a locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Cornacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source.

The Textract system (Park et al., 2004) also ranks candidate keyphrases by its judgment of the keyphrases' degree of domain specificity, based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of other documents in the same cluster.

3 Keyphrase Analysis

In previous studies, KEA employed the indexing words as candidates, whereas others such as (Park et al., 2004; Nguyen and Kan, 2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for details on the data.

Syntactically, keyphrases can be formed by either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words, such as adjectives and adverbs (e.g., mobile network, fast computing, partially observable Markov decision process), although these are less frequent. They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.

However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked.


Criteria     Rules
Frequency    (Rule1) Frequency heuristic, i.e., frequency >= 2 for simplex words vs. frequency >= 1 for NPs
Length       (Rule2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
             (e.g., synchronous concurrent program vs. model of multiagent interaction)
Alternation  (Rule3) of-PP form alternation
             (e.g., number of sensor = sensor number, history of past encounter = past encounter history)
             (Rule4) Possessive alternation
             (e.g., agent's goal = goal of agent, security's value = value of security)
Extraction   (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
             (e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
             (Rule6) Simplex Word/NP IN Simplex Word/NP
             (e.g., quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
             simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules

In our study, we also found that keyphrases occur as simple conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word-order alternation. Previous studies have used the altered keyphrases when they form an of-PP. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).

4 Candidate Selection

We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated with term extraction studies, since the top-N ranked terms become the keyphrases of a document. In previous studies, KEA employed the indexing words as candidates, whereas others such as (Park et al., 2004; Nguyen and Kan, 2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases.

In this section, before presenting our method, we first describe our keyphrase analysis in detail.

In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.

Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.

Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007).


However, our Rule6 for NPs in of-PP form broadens the coverage of possible candidates: given an NP in of-PP form, not only do we collect the simplex word(s), but we also extract the non-of-PP form of the NP, from both the noun phrase governing the PP and the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
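Rule5 and Rule6 of Table 1 amount to regular expressions over POS-tagged text. The following sketch (hypothetical code, not the authors' implementation) shows Rule5 applied to a "word/TAG" string:

    import re

    # Rule5: (NN|NNS|NNP|NNPS|JJ|JJR|JJS)* (NN|NNS|NNP|NNPS)
    NOUN = r"NNPS|NNP|NNS|NN"
    ADJ = r"JJR|JJS|JJ"
    NP = re.compile(r"(?:\S+/(?:%s|%s)\s+)*\S+/(?:%s)" % (NOUN, ADJ, NOUN))

    def rule5_candidates(tagged):
        # tagged: "effective/JJ algorithm/NN of/IN grid/NN computing/NN"
        matches = NP.findall(tagged)
        # strip the POS tags, keeping only the word forms
        return [" ".join(tok.rsplit("/", 1)[0] for tok in m.split())
                for m in matches]

    # -> ["effective algorithm", "grid computing"]

Rule6 can then be layered on top by allowing a preposition (IN) between two such spans, emitting both the full of-PP candidate and its component NPs, as in the effective algorithm of grid computing example above.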

5 Feature Engineering

With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls on careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing features and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S, U).

5.1 Document Cohesion

Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TF*IDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TF*IDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.

F1: TF*IDF (S, U). TF*IDF indicates document cohesion by looking at the frequency of terms in the documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF over substrings as well as exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid

computing algorithm. We also normalize TF with respect to candidate types, i.e., we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TF*IDF employed as features in our system (a minimal sketch of these variants is given after the list):

• (F1a) TF*IDF

• (F1b) TF including counts of substrings

• (F1c) TF of substrings as a separate feature

• (F1d) TF normalized by candidate type (i.e., simplex words vs. NPs)

• (F1e) normalized TF by candidate type as a separate feature

• (F1f) IDF using Google n-grams
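A minimal sketch of how the substring-aware TF (F1b) and a generic n-gram IDF (F1f) might be combined follows; google_counts and total_ngrams are hypothetical stand-ins for a precomputed Google n-gram count table, not an actual API:

    import math

    def tf_with_substrings(candidate, all_candidates):
        # exact occurrences plus occurrences as a substring of longer
        # candidates, e.g. "grid computing" inside
        # "grid computing algorithm" (feature F1b)
        return sum(1 for c in all_candidates if candidate in c)

    def tf_idf(candidate, all_candidates, google_counts, total_ngrams):
        tf = tf_with_substrings(candidate, all_candidates)
        # IDF from a large generic n-gram count table (feature F1f),
        # avoiding corpus-dependent document frequencies
        idf = math.log(total_ngrams / (1.0 + google_counts.get(candidate, 0)))
        return tf * idf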

F2: First Occurrence (S, U). KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur in the beginning of documents, especially in structured reports (e.g., in abstract and introduction sections) and newswire.

F3: Section Information (S, U). Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section head, title and/or references.

F4: Additional Section Information (S, U). We first added the related work or previous work section as a type of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases.


For example, the introduction contains the majority of keyphrases, while the title or section head contains many fewer, due to the variation in size.

• (F4a) section 'related/previous work'

• (F4b) counting substrings occurring in key sections

• (F4c) section TF across all key sections

• (F4d) weighting key sections according to the proportion of keyphrases found

F5: Last Occurrence (S, U). Similar to distance in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.

5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.

F6: Co-occurrence of Another Candidate in Section (S, U). When candidates co-occur in several key sections together, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.

F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.

• (F7a) co-occurrence (Boolean) in the title collection

• (F7b) co-occurrence (TF) in the title collection

F8: Keyphrase Cohesion (S, U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top N ranked candidates and the remainder.

1 It contains 1.3M titles from articles, papers and reports.

In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.
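A sketch of this approximation follows (illustrative names only; context_vectors is assumed to map each candidate word to a dictionary of co-occurrence counts with its neighboring nouns, verbs and adjectives):

    import math

    def cosine(v1, v2):
        # cosine similarity between two sparse co-occurrence vectors
        dot = sum(v1[w] * v2.get(w, 0) for w in v1)
        n1 = math.sqrt(sum(x * x for x in v1.values()))
        n2 = math.sqrt(sum(x * x for x in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def cohesion(word, top_candidates, context_vectors):
        # average contextual similarity of one candidate word to the
        # other top-ranked candidates (feature F8)
        others = [c for c in top_candidates if c != word]
        if not others:
            return 0.0
        return sum(cosine(context_vectors[word], context_vectors[c])
                   for c in others) / len(others)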

5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).

F9: Term Cohesion (S, U). Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similarly to our earlier discussion (a small sketch of the coefficient follows the list below):

• (F9a) term cohesion as in (Park et al., 2004)

• (F9b) normalized TF by candidate type (i.e., simplex words vs. NPs)

• (F9c) applying different weights by candidate type

• (F9d) normalized TF and different weighting by candidate type
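For a two-word candidate, the underlying association measure reduces to a one-liner (a sketch, with the freq_* arguments standing for whichever corpus frequencies are in use):

    def dice(freq_pair, freq_w1, freq_w2):
        # Dice coefficient for a two-word candidate (feature F9a):
        # 2 * freq(w1 w2) / (freq(w1) + freq(w2))
        return 2.0 * freq_pair / (freq_w1 + freq_w2)

The F9b-F9d variants swap in the type-normalized TFs and weights discussed above; Park et al. (2004) additionally discount the score of simplex words by 10%.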

5.4 Other Features

F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only in N&K to attest our candidate selection method.

F11: POS sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. Their study showed a distinctive distribution of POS sequences among keyphrases and used them as a feature. Like the acronym feature, this is also subject to the data set.


F12: Suffix sequence (S). Similar to acronym, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data-dependent, and thus used in the supervised approach only.

F13: Length of Keyphrases (S, U). Barker and Cornacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.

6 System and Data

To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections (such as the title and section headers) via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4 and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.

For the evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found that many were missing or present only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form.

2 http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3 http://wing.comp.nus.edu.sg/parsCit/
4 http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5 http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6 http://maxent.sourceforge.net/index.html
7 C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence - Multiagent Systems) and J.4 (Social and Behavioral Sciences - Economics)

Found means the number of author-assigned and reader-assigned keyphrases actually found in the documents.

          Author       Reader         Combined
Total     1252 (53)    3110 (111)     3816 (146)
NPs       904          2537           3027
Average   3.85 (4.01)  12.44 (12.88)  15.26 (15.85)
Found     769          2509           2864

Table 2: Statistics of the keyphrases

7 Evaluation

The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TF*IDF, distance, section information and additional section information (i.e., F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence and suffix sequence.

Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10 and 15 candidates.

BestFeatures includes F1c (TF of substring as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion as in (Park et al., 2004)) and F13 (length of keyphrases). Best-TFIDF means using all best features except TF*IDF.

In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S)8.

In Table 5, the performance of each feature is measured using the N&K system plus the target feature: '+' indicates an improvement, '-' indicates a performance decline, and '~' indicates no effect or an unconfirmed effect due to small changes in performance. Again, Supervised denotes Maximum Entropy training and Unsupervised is our unsupervised approach.

8 Discussion

We compared the performance of our candidate selection and feature engineering with simple KEA, N&K and our baseline system.

8 Due to page limits, we present only the best performance.


Method         Features  C      Five                        Ten                         Fifteen
                                Match Prec  Recall F        Match Prec  Recall F        Match Prec  Recall F
All            KEA       U      0.03  0.64  0.21   0.32     0.09  0.92  0.60   0.73     0.13  0.88  0.86   0.87
Candidates               S      0.79  15.84 5.19   7.82     1.39  13.88 9.09   10.99    1.84  12.24 12.03  12.13
               N&K       S      1.32  26.48 8.67   13.06    2.04  20.36 13.34  16.12    2.54  16.93 16.64  16.78
               baseline  U      0.92  18.32 6.00   9.04     1.57  15.68 10.27  12.41    2.20  14.64 14.39  14.51
                         S      1.15  23.04 7.55   11.37    1.90  18.96 12.42  15.01    2.44  16.24 15.96  16.10
Length<=3      KEA       U      0.03  0.64  0.21   0.32     0.09  0.92  0.60   0.73     0.13  0.88  0.86   0.87
Candidates               S      0.81  16.16 5.29   7.97     1.40  14.00 9.17   11.08    1.84  12.24 12.03  12.13
               N&K       S      1.40  27.92 9.15   13.78    2.10  21.04 13.78  16.65    2.62  17.49 17.19  17.34
               baseline  U      0.92  18.40 6.03   9.08     1.58  15.76 10.32  12.47    2.20  14.64 14.39  14.51
                         S      1.18  23.68 7.76   11.69    1.90  19.00 12.45  15.04    2.40  16.00 15.72  15.86
Length<=3      KEA       U      0.01  0.24  0.08   0.12     0.05  0.52  0.34   0.41     0.07  0.48  0.47   0.47
Candidates               S      0.83  16.64 5.45   8.21     1.42  14.24 9.33   11.27    1.87  12.45 12.24  12.34
+ Alternation  N&K       S      1.53  30.64 10.04  15.12    2.31  23.08 15.12  18.27    2.88  19.20 18.87  19.03
               baseline  U      0.98  19.68 6.45   9.72     1.72  17.24 11.29  13.64    2.37  15.79 15.51  15.65
                         S      1.33  26.56 8.70   13.11    2.09  20.88 13.68  16.53    2.69  17.92 17.61  17.76

Table 3: Performance on Proposed Candidate Selection

Features    C   Five                       Ten                        Fifteen
                Match Prec  Recall F       Match Prec  Recall F       Match Prec  Recall F
Best        U   1.14  22.8  7.47   11.3    1.92  19.2  12.6   15.2    2.61  17.4  17.1   17.3
            S   1.56  31.2  10.2   15.4    2.50  25.0  16.4   19.8    3.15  21.0  20.6   20.8
Best        U   1.14  22.8  7.47   11.3    1.92  19.2  12.6   15.2    2.61  17.4  17.1   17.3
w/o TFIDF   S   1.56  31.1  10.2   15.4    2.46  24.6  16.1   19.4    3.12  20.8  20.4   20.6

Table 4: Performance on Feature Engineering

Effect  Method  Features
+       S       F1a, F2, F3, F4a, F4d, F9a
        U       F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       S       F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
        U       F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
~       S       F1e, F10, F11, F12
        U       F1b

Table 5: Performance effect of each feature

In evaluating candidate selection, we found that longer candidates tend to act as noise and decrease the overall performance. We also confirmed that candidate alternation accommodates the flexibility of keyphrases, leading to higher candidate coverage as well as better performance.

To re-examine the features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that the performance with and without TF*IDF did not differ greatly, which indicates that the impact of TF*IDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not affect performance. We estimate the cause to be that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas.

In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion is subject to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.

As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of the supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.


9 Conclusion

We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.

10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.

References

Ken Barker and Nadia Cornacchia. 2000. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence.

Regina Barzilay and Michael Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, pages 10-17.

Kenneth Church and Patrick Hanks. 1989. Word association norms, mutual information, and lexicography. In Proceedings of ACL, pages 76-83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. 2008. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, pages 28-30.

Ernesto D'Avanzo and Bernado Magnini. 2005. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC.

F. Damerau. 1993. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 29, pages 433-447.

Lee Dice. 1945. Measures of the amount of ecologic association between species. Journal of Ecology, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. 1999. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, pages 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. 1999. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 27, pages 81-104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. 2005. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. 2001. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing.

Annette Hulth and Beata Megyesi. 2006. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, pages 537-544.

Mario Jarmasz and Caroline Barriere. 2004. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. 2001. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, pages 349-357.

Y. Matsuo and M. Ishizuka. 2004. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 13(1), pages 157-169.

Olena Medelyan and Ian Witten. 2006. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, pages 296-297.

Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3), pages 207-223.

Thuy Dung Nguyen and Min-Yen Kan. 2007. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, pages 317-326.

Youngja Park, Roy Byrd, and Branimir Boguraev. 2004. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, pages 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Polla, and Timo Honkela. 2008. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING.

Peter Turney. 1999. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057.

Peter Turney. 2003. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, pages 434-439.

Xiaojun Wan and Jianguo Xiao. 2008. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. 1999. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, pages 254-256.

Ian Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Yongzheng Zhang, Nur Zinchir-Heywood, and Evangelos Milios. 2004. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence.



Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu

Abstract

We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.

1 Introduction

In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.

For our purposes, a MWE is defined as a collocation of words that refers to a single concept, for example kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE's meaning is transparent, i.e., predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity.

An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.

To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could literally have been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning, and over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as a wrong decision could potentially have detrimental effects on the quality of the translation.


In this paper we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study we carry out a supervised learning investigation using support vector machines, making use of some of the features which have been shown to help in unsupervised approaches to the problem.

This paper is organized as follows: In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.

2 Multi-word Expressions

MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional, fixed expressions such as kingdom come, ad hoc, and non-fixed expressions such as break new ground, speak of the devil. The VNCs which we focus on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living, and Verb Particle Constructions (VPC) such as write-up, call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.

3 Related Work

Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008).

The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: the work of (Katz and Giesbrecht, 2006) and that of (Hashimoto and Kawahara, 2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of its constituent words, using LSA to determine whether the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one MWE idiomatic expression.

Hashimoto and Kawahara (2008) (HK08) is, to our knowledge, the first large-scale study that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K annotated sentences for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprise some basic syntactic features, such as POS, lemma information and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features are mostly inflectional features, such as voice, negativity and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.

4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying a MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the

18

Cha sequence labeling system1 YamCha is basedon Support Vector Machines technology using de-gree 2 polynomial kernels

We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (Beginning of a literal chunk), I-L (Inside of a literal chunk), B-I (Beginning of an Idiomatic chunk), I-I (Inside an Idiomatic chunk), and O (Outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.

We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma or citation form of the word (LEMMA), and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.

With the NE feature, we followed the same representation as for the other features, as a separate column as described above, referred to as Named Entity Separate (NES). For named entity recognition (NER) we use the BBN IdentiFinder software, which identifies 19 NE tags.² We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and one where we have a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class: Named Entity In-Text (NEI). For example, in the NEI condition our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER".

¹http://www.tado-chasen.com/yamcha
²http://www.bbn.com/identifinder
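To make the column layout concrete, here is a minimal sketch (our own illustration, not the authors' code) of how such YamCha-style training rows could be emitted; the helper name yamcha_rows and the toy POS/NE values are assumptions:

```python
# Sketch: emit one YamCha-style row per token, with POS, NGRAM, NE, and IOB columns.
def yamcha_rows(tokens, pos_tags, ne_tags, iob_tags, nei=False):
    rows = []
    for tok, pos, ne, iob in zip(tokens, pos_tags, ne_tags, iob_tags):
        # NEI condition: replace the surface word with its NE class (e.g. John -> PER)
        surface = ne if (nei and ne != "O") else tok
        rows.append(" ".join([surface, pos, tok[-3:], ne, iob]))
    return rows

sent = ["John", "kicked", "the", "bucket", "last", "Friday"]
pos  = ["NN", "VBD", "Det", "NN", "ADV", "NN"]
ne   = ["PER", "O", "O", "O", "O", "DAY"]   # toy BBN-style tags
iob  = ["O", "B-I", "I-I", "I-I", "O", "O"]
print("\n".join(yamcha_rows(sent, pos, ne, iob, nei=True)))
```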

5 Experiments and Results

5.1 Data

We use the manually annotated standard data set identified in Cook et al. (2008). This data comprises 2,920 unique VNC-token expressions drawn from the entire British National Corpus (BNC).³ The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech portion of the data, leaving us with a total of 2,432 sentences corresponding to 53 VNC MWE types. This data has 2,571 annotations,⁴ corresponding to 2,020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.

5.2 Evaluation Metrics

We use Fβ=1 (F-measure), the harmonic mean of (P)recision and (R)ecall, as well as accuracy, to report the results.⁵ We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
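For reference, the balanced F-measure used here is the standard harmonic mean of precision and recall:

```latex
F_{\beta=1} = \frac{2 \cdot P \cdot R}{P + R}
```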

5.3 Results

We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The FREQ baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baselines have the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.⁶

³http://www.natcorp.ox.ac.uk
⁴A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
⁵We do not think that accuracy should be reported in general, since it is an inflated result: it is not a measure of error, and all words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.

In Table 2 we present the results yielded per feature and per condition. We first experimented with different context sizes to decide on the optimal window size for our learning framework; these results are presented in Table 1. Once that was determined, we proceeded to add features.

Noting that a window size of ±3 yields the best results, we use that as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the FREQ baseline. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic FREQ baseline, with an overall F-measure of 77.04%.

Adding lemma, POS, or NGRAM features each independently contributes to a better solution; combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.

6 Discussion

The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

⁶We could have easily identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly and exclusively with the gold-standard data set.

We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones (23 times) and the opposite (4 times).

Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal are for the MWE types hit the road, blow trumpet, blow whistle, hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand, Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

Also, in the sentence As the ball hit the post, the referee blew the whistle, the system identified blew the whistle as a literal MWE VNC token, which is correct in this context, and it identified hit the post as another literal VNC.

7 Conclusion

In this study we explore a set of features that contribute to binary supervised classification of VNC token expressions. The use of NER significantly improves the performance of the system; using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58%. In the future, we will look at adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.


       IDM-F   LIT-F   Overall F   Overall Acc
±1     77.93   48.57   71.78       96.22
±2     85.38   55.61   79.71       97.06
±3     86.99   55.68   81.25       96.93
±4     86.22   55.81   80.75       97.06
±5     83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying the context window size.

                 IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ             70.02   89.16   78.44   0       0       0       69.68
TOK              81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA          83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM          83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS            83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P            86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full         85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin          84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI              89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P   89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P    90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P        91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations.

8 Acknowledgement

The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, who helped in making this publication more concise and better presented.

References

Timothy Baldwin, Collin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP '97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.



Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß, Department of Linguistics, University of Potsdam, Germany, sina@ling.uni-potsdam.de
Jonas Kuhn, Department of Linguistics, University of Potsdam, Germany, kuhn@ling.uni-potsdam.de

Abstract

Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.

1 Introduction

Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:

(1) Der Rat sollte unsere Position berücksichtigen
    the Council should our position consider

(2) The Council should take account of our position.

This sentence pair has been taken from the German–English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that we are able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g., (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, such as verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); and 2) the ranking of this candidate list by an appropriate association measure.

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for each item in a target language (and whether these one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches, which need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realizations is already very useful.

In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondences that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus and is based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 therefore evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.

2 Multiword Translations as MWEs

The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                  1-1    1-n    n-1    n-n    No.
anheben (v1)          53.5   21.2   9.2    1.6    325
bezwecken (v2)        16.7   51.3   0.6    31.3   150
riskieren (v3)        46.7   35.7   0.5    1.7    182
verschlimmern (v4)    30.2   21.5   28.6   44.5   275

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas, extracted from the Europarl Corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) and verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.

A closer look at the one-to-many translations reveals that these cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations. This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.


(3) Sie verschlimmern die Übel
    they aggravate the misfortunes

(4) Their action is aggravating the misfortunes.

Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der Ausschuss bezweckt, den Institutionen ein politisches Instrument an die Hand zu geben
    the committee intends the institutions a political instrument at the hand to give

(6) The committee set out to equip the institutions with a political instrument.

Verb preposition combinations. While this class isn't discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.

(7) Sie werden den Treibhauseffekt verschlimmern
    they will the greenhouse effect aggravate

(8) They will add to the greenhouse effect.

Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).

(9) Ich werde die Sache nur noch verschlimmern
    I will the thing only just aggravate

(10) I am just making things more difficult.

Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to its limited size and the source items we chose.

(11) Sie bezwecken die Umgestaltung in eine zivile Nation
     they intend the conversion into a civil nation

(12) They have in mind the conversion into a civil nation.

          v1        v2        v3        v4
Ntype     22 (26)   41 (47)   26 (35)   17 (24)
V Part    22.7      4.9       0.0       0.0
V Prep    36.4      41.5      3.9       5.9
LVC       18.2      29.3      88.5      88.2
Idiom     0.0       2.4       0.0       0.0
Para      36.4      24.3      11.5      23.5

Table 2: Proportions of MWE types per lemma.

Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, like ensure that something increases in the example below. However, semantically, these transparent combinations can also be rendered by an atomic expression, increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.

(13) Wir brauchen bessere Zusammenarbeit, um die Rückzahlungen anzuheben
     we need better cooperation to the repayments(OBJ) increase

(14) We need greater cooperation in this respect to ensure that repayments increase.

Table 2 displays the proportions of the MWE categories over the types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variation only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g., they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of the general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) and bezwecken (en. intend) are frequently translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).

In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.

3 Multiword Translation Detection

This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.

3.1 One-to-many Alignments

To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate, at the type level, the translational correspondences that the system discovers against the one-to-many translations in our gold standard. By type we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases.

verb    n > 0          n > 1          n > 3
        FPR    FNR     FPR    FNR     FPR    FNR
v1      0.97   0.93    1.0    1.0     1.0    1.0
v2      0.93   0.9     0.5    0.96    0.0    0.98
v3      0.88   0.83    0.8    0.97    0.67   0.97
v4      0.98   0.92    0.8    0.92    0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.

However, the frequency of the translation types is so low that already at a threshold of 3, almost all types get filtered out. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that detection of the exact set of tokens that corresponds to the source token is rare.
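The type-level comparison implied here can be made concrete with a small sketch, assuming gold and system translations are represented as sets of lemmatized English token sets per source lemma (the function name and toy data are ours):

```python
# Sketch: type-level FPR/FNR of one-to-many alignments against a gold standard.
# A translation type is a frozenset of lemmatized English tokens.
def fpr_fnr(gold_types, system_types):
    false_pos = len(system_types - gold_types)   # system types not in the gold standard
    false_neg = len(gold_types - system_types)   # gold types the system missed
    fpr = false_pos / len(system_types) if system_types else 0.0
    fnr = false_neg / len(gold_types) if gold_types else 0.0
    return fpr, fnr

gold = {frozenset({"take", "account", "of"}), frozenset({"consider"})}
system = {frozenset({"take", "account"}), frozenset({"consider"})}
print(fpr_fnr(gold, system))  # (0.5, 0.5): near-miss token sets count as errors
```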

This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g., particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die Korruption behindert die Entwicklung
     the corruption impedes the development

(16) Corruption constitutes an obstacle to development.

Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.

3.2 Aligning Syntactic Configurations

High-precision one-to-many translation detection thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found, and 2) how to align target items that have a low occurrence correlation.

We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering of the configurations; c) alignment post-editing, i.e., assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.

Figure 1: Example of a typical syntactic MWE configuration (German behindert with arguments X1 and Y1; English create with dependents X2 and obstacle, where obstacle dominates an and to, and to dominates Y2).

Data. We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (D_G, D_E, A_GE). D_G is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. D_E is the set of dependency triples of the target sentence. A_GE corresponds to the set of pairs (s1, t1) such that s1 and t1 are aligned.
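As a concrete (hypothetical) rendering of this representation, the triple can be modeled directly as sets of tuples; the type name SentencePair and the toy dependency labels below are our own assumptions:

```python
# Sketch: one word-aligned, dependency-parsed sentence pair (D_G, D_E, A_GE).
# A dependency triple (head, rel, dep) says `dep` depends on `head` with type `rel`.
from typing import NamedTuple, Set, Tuple

class SentencePair(NamedTuple):
    d_g: Set[Tuple[str, str, str]]  # source (German) dependency triples
    d_e: Set[Tuple[str, str, str]]  # target (English) dependency triples
    a_ge: Set[Tuple[str, str]]      # word alignment pairs (source, target)

pair = SentencePair(
    d_g={("behindert", "SB", "Korruption"), ("behindert", "OA", "Entwicklung")},
    d_e={("constitutes", "SBJ", "Corruption"), ("constitutes", "OBJ", "obstacle"),
         ("obstacle", "NMOD", "an"), ("obstacle", "NMOD", "to"),
         ("to", "PMOD", "development")},
    a_ge={("Korruption", "Corruption"), ("Entwicklung", "development")},
)
```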

Candidate Generation. This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g., its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.

Filtering. Given our source candidates, a valid parallel configuration (D_G, D_E, A_GE) is then defined by the following conditions:
1. The source configuration D_G is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in D_G there is a tuple (sn, tn) in A_GE, i.e., every dependent has an alignment.
3. There is a target item t1 in D_E such that for each tn there is a p ⊂ D_E such that p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus, the target dependents have a common root.

To filter out noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, the target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
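The core of the filter, condition 3 above, checks that all aligned target dependents are dominated by a common root within a bounded path length. A minimal sketch under the tuple representation sketched earlier (path-length filter included; head-switching and coordination checks omitted):

```python
# Sketch: check condition 3 - all aligned target dependents share a common root t1.
def has_common_root(d_e, targets, max_path_len=3):
    heads = {}  # dep -> head, from target dependency triples
    for head, _rel, dep in d_e:
        heads[dep] = head

    def ancestors(tok):
        chain, steps = [], 0
        while tok in heads and steps < max_path_len:
            tok = heads[tok]
            chain.append(tok)
            steps += 1
        return chain

    candidate_roots = None
    for t in targets:
        anc = set(ancestors(t)) | {t}
        candidate_roots = anc if candidate_roots is None else candidate_roots & anc
    return bool(candidate_roots)

d_e = {("constitutes", "SBJ", "Corruption"), ("constitutes", "OBJ", "obstacle"),
       ("obstacle", "NMOD", "an"), ("obstacle", "NMOD", "to"),
       ("to", "PMOD", "development")}
# Dependents of the source verb align to "Corruption" and "development";
# both are dominated by "constitutes", so the configuration passes.
print(has_common_root(d_e, ["Corruption", "development"]))  # True
```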

Alignment Post-editing. In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in 3.2 was that all the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.

Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) in D_E such that tx is an element of our target configuration, we need to decide whether (s1, ti) belongs in A_GE. This translation problem now largely parallels the collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). We therefore propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti in T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.
2. For each ti, tx such that there is a (ti, rel, tx) in D_E, compute corr(s1, T + tx).
3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.

As the Dice coefficient is often reported to give the best results, e.g., in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work we will experiment with other association measures. Our correlation scores are thus defined by the formula:

corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti in T, and freq(s1) accordingly. The observed frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
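A compact sketch of this post-editing loop, with the Dice-based correlation abstracted behind frequency callbacks (all names are illustrative, not from the paper):

```python
# Sketch: greedily grow the target set T with dependents that raise the Dice score.
from typing import Callable, FrozenSet

def post_edit(s1: str,
              T: FrozenSet[str],
              d_e,                                         # target dependency triples
              freq_s: Callable[[str], int],                # pairs whose source contains s1
              freq_t: Callable[[FrozenSet[str]], int],     # pairs whose target contains all of T
              freq_joint: Callable[[str, FrozenSet[str]], int]) -> FrozenSet[str]:
    def dice(ts: FrozenSet[str]) -> float:
        denom = freq_s(s1) + freq_t(ts)
        return 2 * freq_joint(s1, ts) / denom if denom else 0.0

    for head, _rel, dep in sorted(d_e):
        if head in T and dep not in T:
            if dice(T | {dep}) >= dice(T):   # step 3 of the algorithm above
                T = T | {dep}
    return T
```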

The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation for the English MWE run the risk of.

3.3 Evaluation

Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard, to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation. In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma don't figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that doesn't constrain the syntactic type of the parallel argument relations. While the strict filter does not decrease the FNR, it decreases the FPR significantly. This means that the syntactic filters mainly fire on noisy configurations and don't decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down the number of candidate configurations that our algorithm detects. As the dependency parses don't provide deep analyses for tense or control phenomena, very often a verb's arguments don't figure as its syntactic dependents, and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.

MWE evaluation. In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. We selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1,592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1,302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.

4 Related Work

The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003).

       Strict Filter      Relaxed Filter
       FPR     FNR        FPR     FNR
v1     0.0     0.96       0.5     0.96
v2     0.25    0.88       0.47    0.79
v3     0.25    0.74       0.56    0.63
v4     0.0     0.875      0.56    0.84

Table 4: False positive and false negative rate of one-to-many translations.

Trans. type     Proportion    MWE type    Proportion
MWEs            57.5          V Part      8.2
                              V Prep      51.8
                              LVC         32.4
                              Idiom       10.6
Paraphrases     24.4
Alternations    1.0
Noise           17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.

Syntax-based translation, which specifies formal relations between bilingual parses, was established by Wu (1997). Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

Work on MWEs in a crosslingual context has almost exclusively focused on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches don't rely on the lexical semantic knowledge about MWEs that is encoded in one-to-many translations.

By contrast, previous approaches to paraphrase extraction have made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado, 2009).

The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already been exploited in, e.g., morphology projection (Yarowsky et al., 2001) and word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of a simplex realization in a source language onto an idiosyncratic multiword realization in the target language.

5 Conclusion

We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and a fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further exploit these for the monolingual interpretation of MWEs.

Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research should study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.

References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pages 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of COLING-02, pages 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pages 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pages 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, pages 311–326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proceedings of the EACL GEMS Workshop, pages 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38/2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pages 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pages 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC-2006, pages 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pages 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of CICLing-2002, pages 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 Workshop, pages 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL MWE 2006 Workshop, pages 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pages 1–8.



A Re-examination of Lexical Association Measures

Hung Huu Hoang, Dept. of Computer Science, National University of Singapore, hoanghuu@comp.nus.edu.sg
Su Nam Kim, Dept. of Computer Science and Software Engineering, University of Melbourne, snkim@csse.unimelb.edu.au
Min-Yen Kan, Dept. of Computer Science, National University of Singapore, kanmy@comp.nus.edu.sg

Abstract

We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which show significant improvement over the original AMs.

1 Introduction

Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter, AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.

While previous works have evaluated AMs, they have given few details on why the AMs perform as they do. A detailed analysis of why these AMs perform as they do is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.

We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the mobility of noun phrase complements in transitive VPCs: they can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.

2 Classification of Association Measures

Although different AMs take different approaches to measuring association, we observe that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure the association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ² test.

Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then follows that the contexts in which the phrase and its constituents appear will be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different; for instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as the Kullback-Leibler divergence and the Jensen-Shannon divergence.
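As an illustration of these two Class II strategies, the following sketch (ours, with toy context counts) computes a context-entropy score and a cosine context similarity from bag-of-words context distributions:

```python
# Sketch: two Class II-style scores over bag-of-words context distributions.
import math
from collections import Counter

def context_entropy(ctx: Counter) -> float:
    # Diversity of contexts: low entropy suggests a restricted context set.
    total = sum(ctx.values())
    return -sum((n / total) * math.log(n / total) for n in ctx.values())

def cosine(ctx_a: Counter, ctx_b: Counter) -> float:
    # Context similarity between a phrase and one of its constituents.
    dot = sum(ctx_a[w] * ctx_b[w] for w in ctx_a.keys() & ctx_b.keys())
    norm = math.sqrt(sum(v * v for v in ctx_a.values())) * \
           math.sqrt(sum(v * v for v in ctx_b.values()))
    return dot / norm if norm else 0.0

phrase_ctx = Counter({"plan": 3, "attack": 2})      # toy contexts of "carry out"
constituent_ctx = Counter({"bag": 4, "heavy": 2})   # toy contexts of "carry"
print(context_entropy(phrase_ctx), cosine(phrase_ctx, constituent_ctx))
```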

Table 1 lists all AMs used in our discussion. The legend beneath the table defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.

3 Evaluation

We first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we discuss how the performance of AMs is measured in our experiments.

3.1 Evaluation Data

In this study, we employ the one-million-word Wall Street Journal (WSJ) section of the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus, as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.

For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
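A rough sketch of this extraction heuristic over a POS-tagged sentence; the five-item particle list and the tag conventions are illustrative stand-ins for the paper's actual resources:

```python
# Sketch: find VPC candidates as (verb, particle) with <= 5 intervening NP words.
PARTICLES = {"up", "on", "off", "out", "down"}  # stand-in for the 38-particle list

def vpc_candidates(tagged):  # tagged: list of (word, POS) pairs
    cands = []
    for i, (word, _pos) in enumerate(tagged):
        if word not in PARTICLES:
            continue
        # search left for the nearest verb, allowing <= 5 intervening NP words
        for j in range(i - 1, max(i - 7, -1), -1):
            if tagged[j][1].startswith("VB"):
                cands.append((tagged[j][0], word))
                break
    return cands

sent = [("take", "VB"), ("your", "PRP$"), ("hat", "NN"), ("off", "RP")]
print(vpc_candidates(sent))  # [('take', 'off')]
```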


Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,

M1  Joint Probability         f(xy) / N
M2  Mutual Information        (1/N) Σ_ij f_ij log( f_ij / f̂_ij )
M3  Log likelihood ratio      2 Σ_ij f_ij log( f_ij / f̂_ij )
M4  Pointwise MI (PMI)        log( P(xy) / (P(x*) P(*y)) )
M5  Local-PMI                 f(xy) × PMI
M6  PMI^k                     log( N f(xy)^k / (f(x*) f(*y)) )
M7  PMI^2                     log( N f(xy)^2 / (f(x*) f(*y)) )
M8  Mutual Dependency         log( P(xy)^2 / (P(x*) P(*y)) )
M9  Driver-Kroeber            a / sqrt( (a+b)(a+c) )
M10 Normalized expectation    2a / (2a + b + c)
M11 Jaccard                   a / (a + b + c)
M12 First Kulczynski          a / (b + c)
M13 Second Sokal-Sneath       a / (a + 2(b + c))
M14 Third Sokal-Sneath        (a + d) / (b + c)
M15 Sokal-Michiner            (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto           (a + d) / (a + 2b + 2c + d)
M17 Hamann                    ((a + d) - (b + c)) / (a + b + c + d)
M18 Odds ratio                ad / bc
M19 Yule's ω                  (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))
M20 Yule's Q                  (ad - bc) / (ad + bc)
M21 Brawn-Blanquet            a / max(a+b, a+c)
M22 Simpson                   a / min(a+b, a+c)
M23 S cost                    log( 1 + min(b, c) / (a + 1) )^(-1/2)
M24 Adjusted S cost*          log( 1 + max(b, c) / (a + 1) )^(-1/2)
M25 Laplace                   (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace*         (a + 1) / (a + max(b, c) + 2)
M27 Fager                     [M9] - (1/2) max(b, c)
M28 Adjusted Fager*           [M9] - max(b, c) / sqrt(aN)
M29 Normalized PMIs*          PMI / NF(α) and PMI / NFmax
M30 Simplified normalized PMI for VPCs*   log(ad) / (α b + (1 - α) c)
M31 Normalized MIs*           MI / NF(α) and MI / NFmax

NF(α) = α P(x*) + (1 - α) P(*y), with α ∈ [0, 1]; NFmax = max( P(x*), P(*y) ).

Legend: a = f11 = f(xy), b = f12 = f(x¬y), c = f21 = f(¬x y), d = f22 = f(¬x ¬y), where ¬w stands for all words except w and * stands for all words; f(x*) and f(*y) are the marginal frequencies, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.


permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, again allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.
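Both candidate searches are bounded scans around an anchor token. As an illustration, here is a minimal sketch of the VPC case (searching left from a particle for the nearest verb); the LVC case mirrors it to the right of a light verb. The particle set shown is a small subset of the 38-particle list in Appendix A, and the (token, Penn Treebank tag) input format and all names are our own assumptions, not the authors' code.

    # Illustrative subset of the particle inventory (see Appendix A)
    PARTICLES = {"up", "out", "off", "down", "in", "away"}

    def extract_vpc_candidates(tagged_sentence, max_np_len=5):
        """Search left of each particle for the nearest verb, allowing an
        intervening NP of up to max_np_len words, as described above."""
        candidates = []
        for i, (token, tag) in enumerate(tagged_sentence):
            if token.lower() not in PARTICLES:
                continue
            # scan left: adjacent position plus up to max_np_len intervening words
            for j in range(i - 1, max(i - 2 - max_np_len, -1), -1):
                verb, vtag = tagged_sentence[j]
                if vtag.startswith("VB"):   # Penn Treebank verb tags
                    candidates.append((verb.lower(), token.lower()))
                    break
        return candidates

    # e.g. extract_vpc_candidates([("He","PRP"), ("carried","VBD"),
    #                              ("it","PRP"), ("out","RP")])
    # -> [("carried", "out")]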

Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al. (2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data       LVC data    Mixed
Total (freq ≥ 6)     413            100         513
Positive instances   117 (28.33%)   28 (28%)    145 (23.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.

3.2 Evaluation Metric

To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
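As an illustration of the metric, here is a minimal sketch of average precision over a ranked candidate list (our own helper, not from the paper): precision is taken at each rank where a gold-positive candidate occurs and then averaged.

    def average_precision(labels):
        """labels: booleans for candidates sorted by descending AM score;
        True marks a gold-positive candidate."""
        hits, precisions = 0, []
        for rank, is_positive in enumerate(labels, start=1):
            if is_positive:
                hits += 1
                precisions.append(hits / rank)  # precision at this recall point
        return sum(precisions) / len(precisions) if precisions else 0.0

    # e.g. average_precision([True, False, True]) == (1/1 + 2/3) / 2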

3.3 Evaluation Results

Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex; this can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest; their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that hypothesis tests rely on.

4 Rank Equivalence

We note that some AMs, although not mathematically equivalent (i.e., not assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets; hence they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r_C M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:


Corollary: If M1 ≡r_C M2, then APC(M1) = APC(M2), where APC(Mi) stands for the average precision of the AM Mi over the data set C.

Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the three AMs Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. The five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with the (simplest) representative from each rank equivalent group.

¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence we still consider such AM pairs to be rank equivalent.
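Operationally, rank equivalence over a finite candidate set can be checked by brute force. The sketch below is our own illustration, not the authors' code: it tests whether two AMs' score lists induce the same ordering, ties included, over all candidate pairs.

    from itertools import combinations

    def rank_equivalent(m1_scores, m2_scores):
        """m1_scores, m2_scores: scores the two AMs assign to the same
        candidates, in the same order."""
        for j, k in combinations(range(len(m1_scores)), 2):
            d1 = m1_scores[j] - m1_scores[k]
            d2 = m2_scores[j] - m2_scores[k]
            # orderings must agree: same strict sign, and ties exactly together
            if (d1 > 0) != (d2 > 0) or (d1 == 0) != (d2 == 0):
                return False
        return True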

1) [M2] Mutual Information; [M3] Log likelihood ratio
2) [M7] PMI^2; [M8] Mutual Dependency; [M9] Driver-Kroeber (a.k.a. Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (a.k.a. Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.

5 Examination of Association Measures

We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.

5.1 Mutual Information and Pointwise Mutual Information

In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in identifying LVCs.

[Figure 1a: AP profile of the AMs examined over our VPC data set (y-axis: AP, from 0.1 to 0.6).]

[Figure 1b: AP profile of the AMs examined over our LVC data set (y-axis: AP, from 0.1 to 0.6).]

Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Bold points indicate AMs discussed in this paper. Markers distinguish hypothesis-test AMs, Class I AMs excluding hypothesis tests (◊), and context-based AMs (+).


In this section we show how their performance can be improved significantly.

Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

    MI(U, V) = Σ_{u,v} p(uv) log( p(uv) / (p(u*) p(*v)) )

In the context of bigrams, the above formula can be simplified to

    [M2] MI = (1/N) Σ_ij f_ij log( f_ij / f̂_ij )

While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

    PMI(x, y) = log( P(xy) / (P(x*) P(*y)) ) = log( N f(xy) / (f(x*) f(*y)) )

It has long been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of the two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille 1994) or multiplying PMI by f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k

is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying PMI directly by f(xy) reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs             LVCs             Mixed
Best [M6] PMI^k    54.7 (k = 0.13)  57.3 (k = 0.85)  54.4 (k = 0.32)
[M4] PMI           51.0             56.6             51.5
[M5] Local-PMI     25.9             39.3             27.2
[M1] Joint Prob.   17.0             2.8              17.5

Table 4: AP performance (%) of PMI and its variants. Best k settings are shown in parentheses.

Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze 1999, p. 66). In particular, we can show that MI(X, Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and PMI(x, y) ≤ min(-log P(x*), -log P(*y)).

These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α P(x*) + (1 - α) P(*y) with α ∈ [0, 1], and NFmax = max( P(x*), P(*y) ). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α, and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks, respectively. If one re-writes MI as (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.

AM            VPCs             LVCs             Mixed
MI / NF(α)    50.8 (α = 0.48)  58.3 (α = 0.47)  51.6 (α = 0.5)
MI / NFmax    50.8             58.4             51.8
[M2] MI       27.3             43.5             28.9
PMI / NF(α)   59.2 (α = 0.8)   55.4 (α = 0.48)  58.8 (α = 0.77)
PMI / NFmax   56.5             51.7             55.6
[M4] PMI      51.0             56.6             51.5

Table 5: AP performance (%) of normalized (P)MI versus standard (P)MI. Best α settings are shown in parentheses.
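To make the normalization concrete, here is a minimal Python sketch (our own helper names, not the paper's code), computing [M4] PMI and its normalized variants from the Table 1 contingency counts a = f(xy), b = f(x¬y), c = f(¬x y) and the total bigram count N.

    import math

    def pmi(a, b, c, N):
        # [M4] in frequency form: log( N f(xy) / (f(x*) f(*y)) ),
        # with f(x*) = a + b and f(*y) = a + c
        return math.log(N * a / ((a + b) * (a + c)))

    def normalized_pmi(a, b, c, N, alpha=None):
        """Divide PMI by NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y),
        or by NFmax = max(P(x*), P(*y)) when alpha is None."""
        px_star, py_star = (a + b) / N, (a + c) / N
        nf = max(px_star, py_star) if alpha is None else \
            alpha * px_star + (1 - alpha) * py_star
        return pmi(a, b, c, N) / nf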

5.2 Penalization Terms

It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms: formulae whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the detection AP of the respective AMs.

Take as an example the AMs [M21] Brawn-Blanquet (a.k.a. Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S cost and [M25] Laplace, where we tried replacing the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                      VPCs   LVCs   Mixed
[M21] Brawn-Blanquet    47.8   57.8   48.6
[M22] Simpson           24.9   38.2   26.0
[M24] Adjusted S cost   48.5   57.7   49.2
[M23] S cost            24.9   38.8   26.0
[M26] Adjusted Laplace  48.6   57.7   49.3
[M25] Laplace           24.1   38.8   25.4

Table 6: AP performance (%) when replacing min() with max() in selected AMs.

In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is no stranger but rank equivalent to [M7] PMI^2. In our application this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just -(1/2) max(b, c). In order to make use of the first term, we need to replace the constant 1/2 by a scaled-down version of max(b, c). Using the independence assumption, we have approximately derived 1/sqrt(aN) as a lower-bound-based scaling factor for max(b, c), producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                     VPCs   LVCs   Mixed
[M28] Adjusted Fager   56.4   54.3   55.4
[M27] Fager            55.2   43.9   52.5

Table 7: Performance of Fager and its adjusted version (AP, %).

The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to -b - c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs; thus this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary that we scale b and c properly, as in -α b - (1 - α) c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with

    [MR14] = PMI / (α b + (1 - α) c)

The denominator of [MR14] is obtained by removing the minus sign from the scaled [M14] so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α P(x*) + (1 - α) P(*y), which was successfully used to divide PMI in the normalized PMI experiment. We heuristically simplified [MR14] to the following AM:

    [M30] = log(ad) / (α b + (1 - α) c)

The α settings in Table 8 below are taken from the best α settings obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], while computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.

AM                              VPCs            LVCs             Mixed
PMI / NF(α)                     59.2 (α = 0.8)  55.4 (α = 0.48)  58.8 (α = 0.77)
[M30] log(ad) / (αb + (1-α)c)   60.0 (α = 0.8)  48.4 (α = 0.48)  58.8 (α = 0.77)
[M14] Third Sokal-Sneath        56.5            45.3             54.6

Table 8: AP performance (%) of the suggested VPC penalization terms and AMs.

With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α c^(1-α). The best α value is also taken from the normalized PMI experiments (Table 5), and is nearly 0.5. Under this setting, this penalization term is essentially the denominator of [M18] Odds Ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.


AM                VPCs   LVCs   Mixed
-b - c            56.5   45.3   54.6
1 / bc            50.2   53.2   50.2
[M18] Odds ratio  44.3   56.7   45.6

Table 9: AP performance (%) of the suggested LVC penalization terms and AMs.
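For concreteness, the two penalization-term scores above can be written directly from the Table 1 counts. The sketch below uses our own function names, with α defaults taken from the best settings reported in Tables 5 and 8; it is an illustration, not the authors' code.

    import math

    def vpc_score_m30(a, b, c, d, alpha=0.8):
        # [M30]: log(ad) divided by the linear penalization term
        return math.log(a * d) / (alpha * b + (1 - alpha) * c)

    def lvc_score(a, b, c, d, alpha=0.5):
        # ad divided by the weighted product b^alpha * c^(1-alpha); at
        # alpha = 0.5 the denominator behaves like the bc of [M18] Odds ratio
        return a * d / (b ** alpha * c ** (1 - alpha))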

6 Conclusions

We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II). We find that the latter are not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.

We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α b + (1 - α) c is a good penalization term for VPCs, while b^α c^(1-α) is suitable for LVCs. Our introduced tuning parameter α should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.

Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).

References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco, and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, Min-Yen, and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a Multilingual Context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.

Appendix A. List of particles used in identifying verb-particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.



Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

R. Mahesh K. Sinha

Department of Computer Science & Engineering, Indian Institute of Technology Kanpur, Kanpur 208016, India
rmk@iitk.ac.in

Abstract

A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitute an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning distinct from the meaning of its light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.

1 Introduction

Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook 1974; Abbi 1992; Verma 1993; Mohanan 1994; Singh 1994; Butt 1995; Butt and Geuder 2001; Butt and Ramchand 2001; Butt et al. 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb, or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al. 2007), and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina 2006), using the GIZA++ word alignment tool to align the projected POS information; a precision of 83% and a recall of 46% have been reported.

In this paper we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of CPs in Hindi is outlined; this is followed by the system design, experimentation, and results.

2 Complex Predicates in Hindi

A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective, or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV, or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh 1994; Butt 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.

Given below are some examples.

(1) CP = noun+LV; noun = ashirwad (blessings), LV = denaa (to give)
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, combines with the noun ashirwad (blessings) to form the complex predicate verb ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2), the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2); whether it acts as a light verb depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush (happy), LV = karanaa (to do)
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa (to read), LV = lenaa (to take)
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual modal or as an intensifier.

(5) CP = verb+LV; verb = phaadanaa (to tear), LV = denaa (to give)
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa (to give), LV = maaranaa (to hit, to kill)
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning; this may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV, with CP = vaapas karanaa (to return) and LV = denaa (to give)
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another is to consider these as a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act); the semantic interpretation remains the same in both cases. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.

From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text, the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.

3 System Design

As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows (a code sketch of the core loop is given at the end of this section):

1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or a morphological derivative is found, then search for the equivalent English meanings of any of its morphological forms (Figure 2) in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); if a match is found, then exit, i.e., go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when the word is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e., go to step (a); else the identified CP is W+LV.

Hindi has a large number of light verbs. A list of some of the commonly used light verbs along with their common English meanings as a simple verb is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' is used in several other contexts in English. Such meanings are shown within parentheses and are not used for matching.

light verb (base form)   root verb meaning
baithanaa (बैठना)        sit
bananaa (बनना)           make/become/build/construct/manufacture/prepare
banaanaa (बनाना)         make/build/construct/manufacture/prepare
denaa (देना)             give
lenaa (लेना)             take
paanaa (पाना)            obtain/get
uthanaa (उठना)           rise/arise/get-up
uthaanaa (उठाना)         raise/lift/wake-up
laganaa (लगना)           feel/appear/look/seem
lagaanaa (लगाना)         fix/install/apply
cukanaa (चुकना)          (finish)
cukaanaa (चुकाना)        pay
karanaa (करना)           (do)
honaa (होना)             happen/become/be
aanaa (आना)              come
jaanaa (जाना)            go
khaanaa (खाना)           eat
rakhanaa (रखना)          keep/put
maaranaa (मारना)         kill/beat/hit
daalanaa (डालना)         put
haankanaa (हाँकना)        drive

Table 1: Some of the common light verbs in Hindi.


For each of the Hindi light verbs, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of a light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations.

There are a number of words that can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation, or are emphasizers, intensifiers, interrogative pronouns, or particles. A list of the stop words used in the experimentation is given in Figure 3.

Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa (जाना, to go): jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgaa, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgii, jaakara (each form is glossed in the original figure for tense, gender, number, and person, e.g., jaataa 'go, indefinite masculine singular'; gaii 'go, past feminine singular'; jaakara 'go, completion').

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa (लेना, to take): le, lii, leM, lo, letaa, letii, lete, liiM, luuM, legaa, legii, lene, lenaa, lenii, liyaa, leMge, loge, letiiM, luuMgaa, luuMgii, lekara (glossed analogously in the original figure).

Figure 2: Morphological derivatives of sample English meanings:
  sit  -> sit, sits, sat, sitting
  give -> give, gives, gave, given, giving

We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions, and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list; this list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.


Figure 3: Stop words in Hindi used by the system: नहीं (no/not), न (no/not, Hindi particle), भी (also, Hindi particle), ही (only, Hindi particle), तो (then, Hindi particle), क्यों (why), क्या (what, Hindi particle), कहाँ (where, Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अन्त में (in the end), आख़िर में (in the end).

Figure 4: A few exit words in Hindi used by the system: ने (ergative case marker), को (accusative case marker), का / के / की (possessive case markers), से (from/by/with), में (in/into), पर (on/but), और (and, Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that, Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा / मेरे / मेरी (my), तुम्हारा / तुम्हारे / तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना / अपनी / अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका / जिसकी / जिसके (whose), जिनको (to whom), जिनके (to whom).

The inner loop of the procedure identifies the multiple CPs that may be present in a sentence; the outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.
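As referenced in the step list above, here is a minimal sketch of the core mining loop for one aligned sentence pair. The dictionaries lv_forms and lv_meaning_forms and the sets stop_words and exit_words stand for the resources of Table 1 and Figures 1-4; the names and data layout are illustrative assumptions, not the author's code.

    def mine_cps(hindi_tokens, english_tokens,
                 lv_forms, lv_meaning_forms, stop_words, exit_words):
        """lv_forms[lv]: morphological forms of light verb lv (Figure 1);
        lv_meaning_forms[lv]: morphological forms of its English meanings
        (Figure 2)."""
        cps = []
        english = set(english_tokens)
        for lv in lv_forms:
            # steps (ii)-(iii): if any English meaning of this LV appears in
            # the aligned sentence, no CP is hypothesized for this LV
            if english & lv_meaning_forms[lv]:
                continue
            for k, token in enumerate(hindi_tokens):
                if token not in lv_forms[lv]:
                    continue          # step (i): locate the LV at position K
                j = k - 1             # steps (iv)-(v): scan left, skipping stop words
                while j >= 0 and hindi_tokens[j] in stop_words:
                    j -= 1
                if j >= 0 and hindi_tokens[j] not in exit_words:
                    cps.append((hindi_tokens[j], token))  # step (vi): CP = W + LV
        return cps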

4 Experimentation and Results

The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92%, and the recall is between 89% and 100%. The F-measure is 88% to 97%. This is a remarkable and somewhat surprising result from the simple methodology, without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. 2006 (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; this is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                              File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences              112     193     102     43      133     107
Total no. of CPs (N)          200     298     150     46      188     151
Correctly identified (TP)     195     296     149     46      175     135
V-V CPs                       56      63      9       6       15      20
Incorrectly identified (FP)   17      44      7       11      16      20
Unidentified CPs (FN)         5       2       1       0       13      16
Accuracy (%)                  97.50   99.33   99.33   100.0   93.08   89.40
Precision (TP/(TP+FP), %)     91.98   87.05   95.51   80.70   91.62   87.09
Recall (TP/(TP+FN), %)        97.50   98.33   99.33   100.0   93.08   89.40
F-measure (2PR/(P+R), %)      94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation.



Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर ख़ुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence: (i) काम करना (to work), (ii) अच्छा लगना (to feel good, enjoy), (iii) सलाह लेना (to seek advice), (iv) ख़ुशी होना (to feel happy/good), (v) मदद करना (to help).

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al. 2006) would fail to identify the CPs सलाह लेना (to seek advice) and ख़ुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की क़द्र करते हैं।
The CPs identified in the sentence: (i) आदर्श बनना (to be a role model), (ii) क़द्र करना (to respect).

Here also the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not checked. However, the mining success rate obtained speaks of these cases being small in number in practice. Use of the 'stop words' to allow intervening words within CPs helps a lot in improving the performance; similarly, use of the 'exit words' avoids a lot of incorrect identifications.

5 Conclusions

The simple empirical method for mining CPs outlined in this work yields an average 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi, with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both alignment accuracy and speed.

The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family.

References

Anthony McEnery, Paul Baker, Rob Gaizauskas, and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek: A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni, and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, pages 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3–4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. In Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen, and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King, and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words, Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, pages 553–564.



Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren¹, Yajuan Lü¹, Jie Cao¹, Qun Liu¹, Yun Huang²

¹Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
²Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn, huangyun@comp.nus.edu.sg

Abstract

Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.

1 Introduction

Phrase-based machine translation models have been proved a great improvement over the initial word-based approaches (Brown et al. 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.

A multiword expression (MWE) can be considered as a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWE. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e., the property that the whole expression's meaning cannot be derived from its component words. Stanford University launched an MWE project¹, in which different qualities of MWEs were presented. For a bilingual multiword expression (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is an MWE in the source language, and (2) the source phrase and the target phrase are exact translations of each other, i.e., there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al. 2005; Bannard 2007; Fazly and Stevenson 2006). Since an MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation, and so on.

For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as one unique token before training the alignment models.

¹ http://mwe.stanford.edu


They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system, and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are as follows:

• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses², which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics lead only to small gains in translation performance (Ayan and Dorr 2006; Fraser and Marcu 2007).

Besides the above differences, there are some advantages to our approaches:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would otherwise further propagate into the SMT system and hurt SMT performance.

• We conduct experiments on a domain-specific corpus. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e., a domain-specific corpus has high MWE coverage. For another, upon investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are more flexible and concise. Take the Chinese term "软坚散结" for example: the meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term carries a special meaning and cannot be understood literally or outside this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.

² http://www.statmt.org/moses

• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus, the available resources are fully exploited without greatly increasing the time and space complexity of the SMT system.

The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we give only a brief introduction to monolingual and bilingual MWE extraction.

2 Bilingual Multiword Expression Extraction

In this section we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.

2.1 Automatic Extraction of MWEs

In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin 2001; Piao et al. 2005), (2) syntactic approaches (Fazly and Stevenson 2006; Bannard 2007), and (3) semantic approaches (Baldwin et al. 2003; Cruys and Moiron 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches consider only frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.

Log Likelihood Ratio (LLR) has been proved a good statistical measurement of the association of two random variables (Chang et al. 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".

Definition 1 (Unit): A unit is any substring of the given sentence. For example, "A B", "C", and "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score): The score function is defined only on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit.³ For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select): The selecting operator finds the two adjacent units with the maximum score in a list.

Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce unit "B C" and unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".

Initially, every word in the sentence is considered as one unit, and all these units form an initial list L. If the sentence is of length N, then the list contains N units, of course. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with maximum score in L, naming them U1 and U2; and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. Our algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.

³ We use a stoplist to eliminate units containing function words by setting their score to 0.

[Figure 1: Example of the Hierarchical Reducing Algorithm: the hierarchical grouping built bottom-up over the example sentence, with the LLR score of each adjacent word pair shown at the lowest level.]

Let us make the algorithm clearer with an example. Assume the threshold score is 20 and the given sentence is the Chinese phrase meaning "healthy tea for preventing hyperlipidemia".⁴ Figure 1 shows the hierarchical structure of the given sentence, based on the LLR of adjacent words. In this example, four MWEs (glossed "hyper-lipid", "hyper-lipid-emia", "healthy tea", and "preventing hyper-lipid-emia") are extracted, in that order, and the substrings above the dotted line in Figure 1 are not extracted.

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are scored with the minimum LLR of adjacent words within them, which gives a lexical confidence for the extracted MWEs. Finally, for a given sentence of length N, the HRA terminates within N-1 iterations, which is very efficient.
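A compact sketch of the HRA loop follows. The score function (the pairwise LLR of Definition 3, with stoplisted pairs scored 0) is passed in as a parameter, and all names are our own illustration rather than the authors' code.

    def hra(sentence_words, score, threshold):
        """score(u1, u2): LLR between the last word of unit u1 and the
        first word of unit u2 (Definition 3)."""
        units = [[w] for w in sentence_words]  # initial list L: one unit per word
        mwes = []                              # the final MWE set S
        while len(units) > 1:
            # Select: the two adjacent units with the maximum score
            i = max(range(len(units) - 1),
                    key=lambda k: score(units[k], units[k + 1]))
            if score(units[i], units[i + 1]) < threshold:
                break                          # termination condition (1)
            # Reduce: concatenate the pair in place and record it as an MWE
            units[i:i + 2] = [units[i] + units[i + 1]]
            mwes.append(tuple(units[i]))
        return mwes                            # at most N-1 iterations (condition (2))

With the example sentence above and a threshold of 20, this loop would emit the four recorded units in the same order as in Figure 1.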

However, the HRA has a problem: it extracts substrings before extracting the whole string, even if the substrings only appear in that particular whole string, which we consider useless. To solve this problem, we use contextual features,

4 The whole sentence means "healthy tea for preventing hyperlipidemia"; the word-by-word glosses of the Chinese words are (preventing), (hyper-), (-lipid-), (-emia), (for), (healthy), (tea). (The Chinese characters did not survive extraction.)


contextual entropy (Luo and Sun, 2003), and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.

2.2 Automatic Extraction of MWE's Translation

In Subsection 2.1 we described the algorithm for obtaining MWEs; in this subsection we introduce the procedure for finding their translations in a parallel corpus.

To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:

1. Run GIZA++5 to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs whose source-language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.
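As an illustration, a minimal sketch of such a binary perceptron classifier is given below; extract_features is a hypothetical function that maps a candidate phrase pair to the feature values described above:

```python
# A minimal sketch of the binary perceptron (Collins, 2002) used to separate
# correct from wrong candidate translations. extract_features(pair) is a
# hypothetical function returning the translation/language feature values
# described above as a list of floats.

def train_perceptron(pairs, labels, extract_features, epochs=10):
    dim = len(extract_features(pairs[0]))
    w = [0.0] * dim
    for _ in range(epochs):
        for pair, y in zip(pairs, labels):       # y is +1 (right) or -1 (wrong)
            x = extract_features(pair)
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                    # misclassified: perceptron update
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w
```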

5 http://www.fjoch.com/GIZA++.html

We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.

3 Application of Bilingual MWEs

Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.

3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect the alignment to be modified and the probability estimation to become more reasonable. Wu et al. (2008) also used this method, to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance was shown by many researchers (Wu et al., 2008; Eck et al., 2004); our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to represent this method.
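A minimal sketch of this retraining step, with illustrative file names, might look as follows:

```python
# A minimal sketch of the "Baseline+BiMWE" preparation: the extracted
# bilingual MWEs are appended to the parallel training corpus as extra
# "sentence" pairs before rerunning GIZA++. File names are illustrative.

def append_bimwes(src_corpus, tgt_corpus, bimwes):
    """bimwes is a list of (chinese_mwe, english_translation) string pairs."""
    with open(src_corpus, "a", encoding="utf-8") as fs, \
         open(tgt_corpus, "a", encoding="utf-8") as ft:
        for zh, en in bimwes:
            fs.write(zh + "\n")   # each MWE pair becomes one sentence pair,
            ft.write(en + "\n")   # raising its co-occurrence counts in GIZA++
```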

3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) once pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source-language phrase contains an MWE (as a substring) and


the target-language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to represent this method.
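A minimal sketch of this annotation, assuming Moses-style "src ||| tgt ||| scores" phrase-table entries and a hypothetical MWE-to-translation dictionary:

```python
# A minimal sketch of the "Baseline+Feat" annotation: a 0/1 feature is
# appended to every phrase-table entry, set to 1 when the source phrase
# contains an MWE as a substring and the target phrase contains its
# translation. The "src ||| tgt ||| scores" entry format is an assumption.

def add_mwe_feature(entry, bimwe_dict):
    src, tgt, scores = [f.strip() for f in entry.split("|||")][:3]
    flag = any(zh in src and en in tgt for zh, en in bimwe_dict.items())
    return "%s ||| %s ||| %s %d" % (src, tgt, scores, 1 if flag else 0)
```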

3.3 Additional Phrase Table of Bilingual MWEs

Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As for probability assignment, we simply assign 1 to each of the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in the input sentence, during translation the decoder searches all candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to represent this method.
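A minimal sketch of writing such a table, again assuming the Moses "src ||| tgt ||| scores" format:

```python
# A minimal sketch of the "Baseline+NewBP" table: every extracted bilingual
# MWE becomes a phrase-table entry whose four translation probabilities are
# all set to 1; Moses is then pointed at this table in addition to the
# original one.

def write_mwe_phrase_table(bimwes, path):
    with open(path, "w", encoding="utf-8") as f:
        for zh, en in bimwes:
            f.write("%s ||| %s ||| 1 1 1 1\n" % (zh, en))
```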

4 Experiments

4.1 Data

We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.

                          Chinese     English
Training   Sentences           120,355
           Words         4,688,873   4,737,843
Dev        Sentences             1,000
           Words            31,722      32,390
Test       Sentences             1,000
           Words            41,643      40,551

Table 1: Traditional medicine corpus

6 http://www.speech.sri.com/projects/srilm/

Table 2 gives detailed information on the data for the chemical industry domain. In this experiment, 59,466 bilingual MWEs are extracted.

                          Chinese     English
Training   Sentences           120,856
           Words         4,532,503   4,311,682
Dev        Sentences             1,099
           Words            42,122      40,521
Test       Sentences             1,099
           Words            41,069      39,210

Table 2: Chemical industry corpus

We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation there is only one reference per test sentence. We also perform a statistical significance test between two translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.

4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; and (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in the log-linear model. To obtain the best translation e of the source sentence f, the log-linear model uses the following equation:

\[ \hat{e} = \arg\max_{e} \; p(e \mid f) = \arg\max_{e} \; \sum_{m=1}^{M} \lambda_m h_m(e, f) \qquad (1) \]

in which h_m and λ_m denote the m-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
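The following toy function illustrates Equation (1): among candidate translations with precomputed feature values, it picks the one with the highest weighted feature sum (candidates and weights here are made up):

```python
# A toy illustration of the log-linear decision rule in Equation (1).

def best_translation(candidates, weights):
    """candidates: list of (translation, [h_1, ..., h_M]) pairs."""
    return max(candidates,
               key=lambda c: sum(w * h for w, h in zip(weights, c[1])))[0]
```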

7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm


4.3 Results

Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experimental results. From this table we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement, and Baseline+BiMWE achieves slightly higher translation quality than the baseline system.

To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined by the following formula:

\[ \mathrm{Score}(s, t) = \log(\mathrm{LLR}(s, t)) \times \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \qquad (2) \]

Here the mean μ and the variance σ² are estimated from the training set. After ranking by score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs and perform the three methods mentioned in Section 3. The results are shown in Table 4.
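A minimal sketch of Equation (2) in code, assuming μ and σ have already been estimated from the training set:

```python
# A minimal sketch of Equation (2): rank a candidate MWE pair by its LLR,
# penalized by a Gaussian over the source/target length ratio x.

import math

def rank_score(llr_st, x, mu, sigma):
    gauss = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return math.log(llr_st) * gauss
```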

From this table we can conclude that: (1) all three methods improve the BLEU score in all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better

Methods            50,000   60,000   70,000
Baseline                   0.2658
Baseline+BiMWE     0.2671   0.2686   0.2715
Baseline+Feat      0.2728   0.2734   0.2712
Baseline+NewBP     0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain

than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as that of (Lambert and Banchs, 2005).

To verify the assumption that bilingual MWEs improve SMT performance not only in one particular domain, we also perform experiments in the chemical industry domain. Table 5 shows the results. From this table we can see that the three methods improve translation performance in the chemical industry domain as well as in the traditional medicine domain.

Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain

4.4 Discussion

In order to understand in what respects our methods improve translation, we manually analyzed some test sentences; this subsection gives some examples.

(1) For the first example in Table 6, the Chinese phrase glossed "dredging meridians" is aligned to other words and not correctly translated in the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since the corresponding bilingual MWE has been added into the training corpus and then aligned by GIZA++.

(2) For the second example in Table 6, the Chinese source phrase has two candidate translations in the phrase table, "tea" and "medicated tea". The baseline system chooses "tea" as its translation, while the Baseline+Feat system chooses "medicated tea",


Src       [Chinese source sentence; characters lost in extraction]
Ref       the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline  the food has effects in tonifying blood, dispelling cold, promoting salivation and water and tranquilizing and tonic effects and making nutritious health
+BiMWE    the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src       [Chinese source sentence; characters lost in extraction]
Ref       the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline  may also be made into tablet, pill, powder, tea or injection
+Feat     may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples

because the additional feature gives a high probability to the correct translation "medicated tea".

5 Conclusion and Future Works

This paper has presented the LLR-based Hierarchical Reducing Algorithm for automatically extracting bilingual MWEs, and investigated the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to represent whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are rather simple and coarse. More sophisticated approaches should be considered when applying bilingual MWEs; for example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in a phrase-based SMT system, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further study on improving statistical machine translation using domain bilingual MWEs.

Acknowledgments

This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp

Abstract

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

1 Introduction

Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001), using a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).

In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.

When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information from the head of the bunsetsu1. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.

This paper proposes Japanese bottom-up named

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.


(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama TOP)
    'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences

entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 gives an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives the experimental results.

2 Related Work

In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX

class          example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21 century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples

Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.

NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.

In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, and E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.2 The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
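In code, this label set can be enumerated as follows (a small sketch; the label-string format is illustrative):

```python
# The SE-algorithm label set: S/B/I/E for each of the eight NE classes,
# plus O, giving the 33 (= 8 x 4 + 1) labels.

NE_CLASSES = ["PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
              "DATE", "TIME", "MONEY", "PERCENT"]
LABELS = ["%s-%s" % (p, c) for c in NE_CLASSES for p in "SBIE"] + ["O"]
assert len(LABELS) == 33
```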

The model for label estimation is learned by machine learning. The following features are generally utilized: the characters, the type of

2 Besides these, there are the IOB1 and IOB2 algorithms, using only I, O, B, and the IOE1 and IOE2 algorithms, using only I, O, E (Kim and Veenstra, 1999).


Figure 2: An overview of our proposed method for the bunsetsu "Habu-Yoshiharu-Meijin": (a) the initial state, in which every chunk in the bunsetsu has an estimated label and score (e.g., "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Meijin" OTHERe 0.245); (b) the final output, chosen by comparing label assignments bottom-up. (The original diagram did not survive extraction.)

character, the POS, etc. of the morpheme and the two surrounding morphemes on each side. Methods utilizing SVMs or CRFs have been proposed.

Most NER methods based on sequential labeling use only local information; therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word sub-class of the NE in the analyzing direction within the bunsetsu, the noun at the end of the bunsetsu adjacent to the analyzing direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized a cache feature, coreference results, a syntactic feature, and a case-frame feature as structural features (Sasano and Kurohashi, 2008).

Some work has acquired knowledge from unannotated large corpora and applied it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of dependency-relation pairs (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al., 2008).

In Japanese NER research, the CRL NE data are usually used for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.

3 Overview of Proposed Method

Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are analyzed using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the label assignment "Habu-Yoshiharu" (PERSON) and the label assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.

The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.


Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin(-ga)": "Habu-Yoshiharu" is PERSON, "Meijin" is OTHERe, and the remaining chunks are invalid. (The original diagram did not survive extraction.)

4 Model Learning

4.1 Label Estimation Model

This model estimates the label of each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions, spanning multiple bunsetsus, tend to be an NE:

• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, the bunsetsu is expanded when one of the above conditions is met. With this expansion, 98.6% of named entities are located within a (possibly expanded) bunsetsu.3

For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.

Next, for learning the label estimation model, all the chunks in a bunsetsu are attached to the correct label from a hand-annotated corpus. The label set has 13 classes: the eight NE classes (shown in Table 1) and the five classes OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

The chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled OTHERb, OTHERi, or OTHERe, respectively.4

3 As examples in which an NE is not included within an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4 Each OTHER label is assigned to the longest chunk that satisfies its condition within a chunk.

1. number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type5

4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word sub-class, and several features provided by the morphological analyzer JUMAN

6. several features6 provided by the parser KNP

7. string of the morpheme in the chunk

8. IPADIC7 feature
   - If the string of the chunk is registered in the following categories of IPADIC, this feature fires: "person", "location", "organization", and "general"

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")

10. cache feature
   - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used for the feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of the predicate/noun case frame (Sasano and Kurohashi, 2008)
   - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
   - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model

A chunk that is neither one of the eight NE classes nor one of the above four OTHER classes is labeled invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and the other chunks are labeled invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned

5 The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6 When a morpheme has an ambiguity, all the corresponding features fire.

7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja


to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of a morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features of each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1 + exp(−βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.
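A minimal sketch of this score computation (β is set to 1 in the experiments reported below):

```python
# A minimal sketch of the label-score computation: each of the 13 one-vs-rest
# SVM margins is squashed with the sigmoid 1 / (1 + exp(-beta * x)) and the
# 13 values are renormalized to sum to one within the chunk.

import math

def label_scores(svm_margins, beta=1.0):
    probs = [1.0 / (1.0 + math.exp(-beta * x)) for x in svm_margins]
    total = sum(probs)
    return [p / total for p in probs]
```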

The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. In a chunk where the label invalid has the highest score, the label with the second highest score is adopted.

4.2 Label Comparison Model

This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled PERSON

• "Habu" is labeled PERSON and "Yoshiharu" is labeled MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up, separated by "vs." (the left one and the right one are called the first set and the second set, respectively). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks in each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for model learning.

Then, features are assigned to each example. The feature (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used

positive:
+1  Habu-Yoshiharu (PSN)  vs.  Habu (PSN) + Yoshiharu (MNY)
+1  Habu-Yoshiharu (PSN) + Meijin (OTHERe)  vs.  Habu (PSN) + Yoshiharu (MONEY) + Meijin (OTHERe)
negative:
−1  Habu-Yoshiharu-Meijin (ORG)  vs.  Habu-Yoshiharu (PSN) + Meijin (OTHERe)

Figure 4: Assignment of positive/negative examples

for each label, as estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining part.

Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using an SVM. Note that only the fact that "Habu-Yoshiharu" is a PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.
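A minimal sketch of how one side of a comparison example could be encoded, assuming a hypothetical chunk_vector() that returns the 13-dimensional vector per chunk:

```python
# A minimal sketch of the comparison-model input: each side contributes up to
# five chunks, each encoded as a 13-dimensional vector (12 label scores from
# the label estimation model plus the raw SVM score), zero-padded to a fixed
# width. chunk_vector() is a hypothetical 13-dim encoder.

def side_features(chunks, chunk_vector, max_chunks=5):
    vecs = [chunk_vector(c) for c in chunks[:max_chunks]]
    vecs += [[0.0] * 13] * (max_chunks - len(vecs))   # zero vectors as padding
    return [v for vec in vecs for v in vec]           # 5 x 13 = 65 dimensions
```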

5 Analysis

First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the


chunk    Habu-Yoshiharu          Habu    Yoshiharu
label    PERSON                  PERSON  MONEY
vector   V11 0 0 0 0             V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu", and V11, V21, V22, and 0 are vectors of dimension 13)

Figure 6: The label comparison model: (a) the label assignment for the cell "Habu-Yoshiharu", comparing the assignment "Habu-Yoshiharu" (PERSON 0.438) against "Habu" + "Yoshiharu"; (b) the label assignment for the cell "Habu-Yoshiharu-Meijin", comparing the assignments "Habu-Yoshiharu-Meijin", "Habu" + "Yoshiharu" + "Meijin", and "Habu-Yoshiharu" + "Meijin". (The original diagram did not survive extraction.)

label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled MONEY and "Meijin" is labeled OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated: in this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.
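A minimal sketch of this bottom-up search, with compare() standing in for the learned label comparison model and initial[i][j] for the label estimation model's single-chunk assignment of span i..j (both are assumptions of this sketch):

```python
# A minimal sketch of the CKY-style bottom-up search over a bunsetsu of n
# morphemes: cell[(i, j)] holds the current best label assignment for the
# span i..j; an assignment is a list of (chunk_span, label) pairs, and
# compare(a, b) returns the better of two assignments.

def best_assignment(n, initial, compare):
    cell = {(i, i): initial[i][i] for i in range(n)}
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            best = initial[i][j]              # the whole span as one chunk
            for k in range(i, j):             # or any split into two parts
                best = compare(best, cell[(i, k)] + cell[(k + 1, j)])
            cell[(i, j)] = best
    return cell[(0, n - 1)]                   # upper right cell: final output
```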

In the label comparison step, a label assignment in which an OTHER* follows an OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER label is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.

6 Experiment

To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled OPTIONAL and were used for neither

Figure 7: 5-fold cross validation. (The original diagram, showing how the corpus parts (a)-(e) are rotated to learn the label estimation models 1 and 2b-2e and to obtain features for the label comparison model, did not survive extraction.)

the model learning8 nor the evaluation.

We performed 5-fold cross validation, following previous work. Different from previous work, our method has to learn the SVM models twice; therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)-(e). Then, the label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), (e), and by applying it to part (c), features are obtained. The same goes for parts

8 Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.


                 Recall               Precision
ORGANIZATION     81.83 (3008/3676)    88.37 (3008/3404)
PERSON           90.05 (3458/3840)    93.87 (3458/3684)
LOCATION         91.38 (4992/5463)    92.44 (4992/5400)
ARTIFACT         46.72 (349/747)      74.89 (349/466)
DATE             93.27 (3327/3567)    93.12 (3327/3573)
TIME             88.25 (443/502)      90.59 (443/489)
MONEY            93.85 (366/390)      97.60 (366/375)
PERCENT          95.33 (469/492)      95.91 (469/489)
ALL-SLOT         87.87                91.79
F-measure        89.79

Table 3: Experimental results

(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.

In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental results. The F-measure over all NE classes is 89.79.

7 Discussion

7.1 Comparison with Previous Work

Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus, and Sasano et al. utilized the results of coreference resolution as a feature for model learning. Therefore, by incorporating such knowledge and/or analysis results into our method, the performance would improve further.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about

9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html

the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled ARTIFACT.

In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", the information of "hou" can be utilized.

7.2 Error Analysis

There were some errors in analyzing katakana (alphabet) words. In the following example, although the correct analysis labels "Batistuta" PERSON, the system labeled it OTHERs.

(4) Italy-de (Italy LOC) katsuyaku-suru (active) Batistuta-wo (Batistuta ACC) kuwaeta (call) Argentine (Argentine)
    'Argentine called Batistuta, who was active in Italy'

There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were also some errors in applying the label comparison model even when the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis labels "HongKong-seityou" ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" LOCATION: as shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we plan to adjust the value of β in the sigmoid function and refine the


                                 F1     analysis unit   distinctive features
(Fukushima et al., 2008)         89.29  character       Web
(Kazama and Torisawa, 2008)      88.93  character       Wikipedia/Web
(Sasano and Kurohashi, 2008)     89.40  morpheme        structural information
(Nakano and Hirai, 2004)         89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)   87.21  character
(Isozaki and Kazawa, 2003)       86.77  morpheme
proposed method                  89.79  compound noun   Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on CRL NE data using cross validation)

Figure 8: An example of an error in the label comparison model: (a) the initial state, with "HongKong" LOCATION 0.266, "HongKong-seityou" ORGANIZATION 0.205, and "seityou" OTHERe 0.184; (b) the final output, in which "HongKong + seityou" (LOC + OTHERe, 0.266 + 0.184) is chosen. (The original diagram did not survive extraction.)

features of the label comparison model.

8 Conclusion

This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. This method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79.

We are planning to integrate this method with a syntactic and case analysis method (Kawahara and Kurohashi, 2007) and to perform syntactic, case, and Named Entity analysis simultaneously to improve overall accuracy.

References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3 (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transaction of Information Processing Society of Japan, 44(3):970–979 (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304–311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407–415.

Erik F. Tjong Kim and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173–179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. pages 1121–1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), pages 282–289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8–15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transaction of Information Processing Society of Japan, 45(3):934–941 (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607–612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171–178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Abbreviation Generation for Japanese Multi-Word Expressions
Hiroko Fujii, Mika Fukui, Kazuo Sumita, Masaru Suzuki, Hiromi Wakaki
[The body of this paper (pages 63–70) did not survive text extraction; only encoding debris remained.]

Author Index

Bhutada, Pravin, 17
Cao, Jie, 47
Caseli, Helena, 1
Diab, Mona, 17
Finatto, Maria Jose, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55
Hoang, Hung Huu, 31
Huang, Yun, 47
Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55
Liu, Qun, 47
Lu, Yajuan, 47
Machado, Andre, 1
Ren, Zhixiang, 47
Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63
Villavicencio, Aline, 1
Wakaki, Hiromi, 63
Zarrieß, Sina, 23

  • Workshop Program
  • Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
  • Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
  • Verb Noun Construction MWE Token Classification
  • Exploiting Translational Correspondences for Pattern-Independent MWE Identification
  • A re-examination of lexical association measures
  • Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
  • Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
  • Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
  • Abbreviation Generation for Japanese Multi-Word Expressions

...parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.

The remainder of this paper is structured as follows: Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.

2 Related Work

The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter) and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.

However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g. to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.

A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and the language to which they apply, and the sources of information they use. Although some work on MWEs is type independent (e.g. Zhang et al. (2006), Villavicencio et al. (2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g. Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g. Villada Moirón and Tiedemann (2006), Caseli et al. (2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kinds of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).

Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that for MWE identification the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).

Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).

Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), in which missing lexical entries for MWEs and related constructions are detected via error mining methods; this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.

Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.

3 The Corpus and Reference Lists

The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms: 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
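The subsumption step described above can be pictured with a short sketch (our illustration, not the authors' code; n-grams are assumed to be whitespace-joined strings):

```python
def drop_subsumed(ngrams):
    """Keep only n-grams that are not contained in a longer n-gram,
    e.g. 'aleitamento materno' is dropped when the list also holds
    'aleitamento materno exclusivo'."""
    return [g for g in ngrams
            if not any(g != other and g in other for other in ngrams)]

# drop_subsumed(["aleitamento materno", "aleitamento materno exclusivo"])
# -> ["aleitamento materno exclusivo"]
```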

4 Statistically-Driven and Alignment-Based Methods

4.1 Statistically-Driven Method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992) and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two measures commonly employed for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

1 Available at the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the four measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
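For illustration, PMI for a bigram candidate can be computed as follows. This is a minimal sketch of the measure itself, not of the Ngram Statistics Package implementation actually used; raw corpus counts are assumed as inputs.

```python
import math

def pmi(bigram_freq, freq_w1, freq_w2, n_bigrams, n_words):
    """Pointwise Mutual Information of a bigram (w1, w2):
    log2( P(w1 w2) / (P(w1) * P(w2)) )."""
    p12 = bigram_freq / n_bigrams
    p1 = freq_w1 / n_words
    p2 = freq_w2 / n_words
    return math.log2(p12 / (p1 * p2))
```

Candidates would then be ranked separately by each measure, and the rank positions averaged, as described above.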

4.2 Alignment-Based Method

The second of the MWE extraction approaches investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that: (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breast-fed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
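A minimal sketch of this extraction step is given below. The (source_tokens, links) input format, with each link holding the source and target positions joined by the aligner, is our assumption for illustration; it is not GIZA++'s native output format.

```python
from collections import Counter

def alignment_mwe_candidates(aligned_sentences):
    """Count source sequences of two or more consecutive words that the
    aligner linked, as a whole, to one or more target words
    (an n:m alignment with n >= 2 and m >= 1)."""
    candidates = Counter()
    for src_tokens, links in aligned_sentences:
        for src_positions, tgt_positions in links:
            if len(src_positions) >= 2 and len(tgt_positions) >= 1:
                span = sorted(src_positions)
                # only consecutive source words form a single candidate
                if span == list(range(span[0], span[-1] + 1)):
                    candidates[" ".join(src_tokens[i] for i in span)] += 1
    return candidates
```

Run over the corpus above, the 2:1 alignments of aleitamento materno with breastfeeding, for example, would each add one count for that candidate.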

Thus, notice that the alignment-based MWE extraction method does not rely on the conceptual asymmetries between languages, since it does not expect that a source sequence of words be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, whatever the number of target words involved. These features indicate that the method prioritizes precision over recall.

It is also important to say that although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines in some way the phrases of SMT systems.

In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as the gold standard.

To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps we used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).

2 GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html

3 Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org


From the preprocessed corpus, the MWE candidates are extracted as those units in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) have a frequency below a certain threshold. The remaining units in the candidate list are considered to be MWEs.

Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun, and beginning or finishing with verbs; and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in Caseli et al. (2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of); and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as Caseli et al. (2009), plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral; and (b) a minimum frequency threshold of 2.
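A rough sketch of how such a filter could be applied is shown below; the tag names are placeholders rather than the actual Apertium tagset, and the candidate representation is our assumption.

```python
BAD_EDGE_TAGS = {"det", "adv", "conj", "prep", "verb", "pron", "num"}  # F3-style

def apply_filter(candidates, min_freq=2, bad_edge=BAD_EDGE_TAGS):
    """Remove candidates whose POS pattern begins or ends with a
    forbidden tag, or whose frequency is below the threshold.
    `candidates` maps (words, pos_tags) tuples to corpus frequencies."""
    return {
        (words, tags): freq
        for (words, tags), freq in candidates.items()
        if freq >= min_freq
        and tags[0] not in bad_edge
        and tags[-1] not in bad_edge
    }
```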

5 Experiments and Results

Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and by the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), but also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).

PMI                            alignment-based
Online Mendelian Inheritance   faixa etária
Beta Technology Incorporated   aleitamento materno
Lange Beta Technology          estados unidos
Óxido Nítrico Inalatório       hipertensão arterial
jogar video game               leite materno
e um de                        couro cabeludo
e a do bloqueio                lactíferos
se que de emocional            anatomia
e a da                         neonato a termo
e de não duplas                mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach.

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure (2 * precision * recall / (precision + recall)) figures for the association measures, using all the candidates (on the first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are at most only 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.

pt MWE candidates     PMI     MI
# proposed bigrams    64,839  64,839
# correct MWEs        1,403   1,403
precision (%)         2.16    2.16
recall (%)            98.73   98.73
F (%)                 4.23    4.23
# proposed trigrams   54,548  54,548
# correct MWEs        701     701
precision (%)         1.29    1.29
recall (%)            96.03   96.03
F (%)                 2.55    2.55
# proposed bigrams    1,421   1,421
# correct MWEs        155     261
precision (%)         10.91   18.37
recall (%)            10.91   18.37
F (%)                 10.91   18.37
# proposed trigrams   730     730
# correct MWEs        44      20
precision (%)         6.03    2.74
recall (%)            6.03    2.74
F (%)                 6.03    2.74

Table 2: Evaluation of MWE candidates - PMI and MI.

On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered by the three filters described in Section 4.2: F1, F2 and F3.

To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic comparison we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

pt MWE candidates            F1      F2      F3
# filtered by POS patterns   24,996  21,544  32,644
# filtered by frequency      9,012   11,855  1,442
# final set                  269     878     191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach.

pt MWE candidates       F1     F2     F3
# proposed bigrams      250    754    169
# correct MWEs          48     95     65
precision (%)           19.20  12.60  38.46
recall (%)              3.38   6.69   4.57
F (%)                   5.75   8.74   8.18
# proposed trigrams     19     110    20
# correct MWEs          1      9      4
precision (%)           5.26   8.18   20.00
recall (%)              0.14   1.23   0.55
F (%)                   0.27   2.14   1.07
# proposed bi+trigrams  269    864    189
# correct MWEs          49     104    69
precision (%)           18.22  12.04  36.51
recall (%)              2.28   4.83   3.21
F (%)                   4.05   6.90   5.90

Table 4: Evaluation of MWE candidates.

The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams from the Pediatrics Glossary. Thus, longer MWEs, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), extracted by the alignment-based method, are not being considered at the moment.

After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists, the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.

From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see Table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name, or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.

The judgments of the two judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates good agreement.
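For reference, the two-judge kappa can be reproduced with the standard computation (a generic sketch; true/false labels over the 122 candidates are assumed):

```python
def kappa(judge_a, judge_b):
    """Cohen's kappa for two judges labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    chance = sum(
        (judge_a.count(c) / n) * (judge_b.count(c) / n)
        for c in set(judge_a) | set(judge_b)
    )
    return (observed - chance) / (1 - chance)
```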

In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating against a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.

6 Conclusions and Future Work

MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.

In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.

Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both, or to at least one, of the judges were respectively 65.97% and 75.92%. This was a significant improvement, but it is important to say that in this manual analysis the human experts classified the MWEs as true independently of them being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.

In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.

Acknowledgments

We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.

References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50–59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047–2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistics. Computational Linguistics, 22(2):249–254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49–56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117–134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G.V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.

Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165–178, Oxford. Oxford University Press.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534–559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33–40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440–447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1–7, Las Palmas, Canary Islands, Spain.

Scott S.L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17–24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1–15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034–1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415–432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia, July. Association for Computational Linguistics.

Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 9–16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles

Su Nam Kim
CSSE dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg

Abstract

We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction as well.

1 Introduction

Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of the document. Keyphrases can serve as a representative summary of the document and also serve as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications, such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).

In the past, various attempts have been made to boost automatic keyphrase extraction performance based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and rich sets of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.

Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g. Frank et al. (1999), Park et al. (2004), Nguyen and Kan (2007)), most of them do not effectively deal with the various forms keyphrases take, which results in some keyphrases being missed as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.

Secondly, previous studies have shown the effectiveness of their own features, but not many compared their features with other existing features. That leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, this approach involves a large amount of manual labor, thus reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthy study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.

This paper targets these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases, with the purpose of proposing a candidate selection method which improves the coverage of candidates that occur in various forms. Our second contribution re-examines existing keyphrase extraction features reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.

In the remaining sections, we give an overview of related work in Section 2, our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. We then evaluate our proposals, discuss outcomes and conclude our work in Sections 7, 8 and 9, respectively.

2 Related Work

The majority of related work has been carried out using statistical approaches, rich sets of symbolic resources and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e. the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent, (2) keyphrase cohesion features (i.e. the relationship among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e. the relationship among components in a keyphrase) (Park et al., 2004).

The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TF·IDF (i.e. term frequency, inverse document frequency) and first occurrence in the document. TF·IDF measures document cohesion, and the first occurrence implies the importance of the abstract or introduction, which indicates that keyphrases have a locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. Textract (Park et al., 2004) also ranks the candidate keyphrases by its judgment of the keyphrases' degree of domain specificity, based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document, making use of the mutual influences of the other documents in the same cluster.

3 Keyphrase Analysis

In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves, by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for details of the data.

Syntactically, keyphrases can be formed by either simplex nouns (e.g. algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words, such as adjectives and adverbs (e.g. mobile network, fast computing, partially observable Markov decision process), despite few incidences. They can also incorporate a prepositional phrase (PP) (e.g. quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e. NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g. incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well to the part-of-speech (POS) patterns used in modern keyphrase extraction systems.

However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we also found that keyphrases occur as simple conjunctions (e.g. search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g. history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g. pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g. belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

Criteria     Rules
Frequency    (Rule1) Frequency heuristic, i.e. frequency >= 2 for simplex words vs. frequency >= 1 for NPs
Length       (Rule2) Length heuristic, i.e. up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
             (e.g. synchronous concurrent program vs. model of multiagent interaction)
Alternation  (Rule3) of-PP form alternation
             (e.g. number of sensor = sensor number, history of past encounter = past encounter history)
             (Rule4) Possessive alternation
             (e.g. agent's goal = goal of agent, security's value = value of security)
Extraction   (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
             (e.g. complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
             (Rule6) Simplex Word/NP IN Simplex Word/NP
             (e.g. quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
             simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules

A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when they form in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g. distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g. agent's goal = goal of agent, security's value = value of security).

4 Candidate Selection

We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated with term extraction studies, since the top-N ranked terms become the keyphrases of a document. In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe our keyphrase analysis in detail.

In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e. either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.

Another observation is that keyphrases containing more than three words are rare (i.e. 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g. performance of distributed system = distributed system performance). In previous studies, words occurring at least twice were selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.

Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule 6 for NPs in of-PP form broadens the coverage of possible candidates, i.e. given an NP in of-PP form, not only do we collect simplex word(s), but we also extract the non-of-PP form of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the rules in previous works do not. A sketch of these extraction rules is given below.
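The sketch below illustrates Rules 5 and 6 over Penn Treebank POS tags (our rendering of the table's patterns, assuming tag sequences joined with spaces):

```python
import re

# Rule 5: zero or more adjectives/nouns followed by a final noun
NP = r"(?:(?:JJ[RS]?|NNP?S?) )*NNP?S?"
RULE5 = re.compile(NP + r"$")
# Rule 6: NP IN NP, e.g. 'quality of service' tagged 'NN IN NN'
RULE6 = re.compile(NP + r" IN " + NP + r"$")

def is_candidate(pos_tags):
    seq = " ".join(pos_tags)
    return bool(RULE5.match(seq) or RULE6.match(seq))

# is_candidate(["JJ", "NN"])        -> True  (e.g. effective algorithm)
# is_candidate(["NN", "IN", "NN"])  -> True  (e.g. quality of service)
# is_candidate(["RB", "NN"])        -> False (adverbs are excluded)
```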

5 Feature Engineering

With wider candidate selection criteria, the onus of filtering out irrelevant candidates becomes the responsibility of careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated whether features are suitable only for the supervised setting (S) or applicable to both (S, U).

5.1 Document Cohesion

Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TF·IDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information and so on. We note that the listed features other than TF·IDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.

F1 TF·IDF (S, U): TF·IDF indicates document cohesion by looking at the frequency of terms in the documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF in substrings as well as in exact matches. For example, grid computing is often a substring of other phrases, such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e. we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TF·IDF employed as features in our system (a sketch of the basic computation follows the list):

• (F1a) TF·IDF

• (F1b) TF including counts of substrings

• (F1c) TF of substrings as a separate feature

• (F1d) TF normalized by candidate type (i.e. simplex words vs. NPs)

• (F1e) normalized TF by candidate type as a separate feature

• (F1f) IDF using Google n-grams
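As indicated above, a sketch of the basic TF·IDF computation follows; the exact normalization is not spelled out in the text, so the choices here are assumptions.

```python
import math

def tf_idf(candidate_tf, doc_length, ngram_count, total_count):
    """TF from the document; IDF approximated from a large generic
    n-gram table (standing in for the Google n-gram counts), so no
    domain corpus is needed to estimate document frequencies."""
    tf = candidate_tf / doc_length
    idf = math.log(total_count / (1 + ngram_count))
    return tf * idf
```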

F2 First Occurrence (S, U): KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g. in the abstract and introduction sections) and newswire.

F3 Section Information (S, U): Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section head, title and/or references.

F4 Additional Section Information (S, U): We first added the related work or previous work section as a piece of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e. the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section head contains many fewer, due to the variation in size.

• (F4a) section 'related/previous work'

• (F4b) counting substrings occurring in key sections

• (F4c) section TF across all key sections

• (F4d) weighting key sections according to the portion of keyphrases found

F5 Last Occurrence (S, U): Similar to distance in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.

5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.

F6 Co-occurrence of Another Candidate in Section (S, U): When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.

F7 Title Overlap (S): In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.

• (F7a) co-occurrence (Boolean) in the title collection

• (F7b) co-occurrence (TF) in the title collection

F8 Keyphrase Cohesion (S, U): Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top N ranked candidates and the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.

1 It contains 1.3M titles from articles, papers and reports.

5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).

F9 Term Cohesion (S, U): Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier (a sketch of the coefficient follows the list below):

• (F9a) term cohesion as in Park et al. (2004)

• (F9b) TF normalized by candidate type (i.e. simplex words vs. NPs)

• (F9c) applying different weights by candidate type

• (F9d) normalized TF and different weighting by candidate type
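For concreteness, the classical Dice coefficient with a simple generalization to n-word terms is sketched below; the exact weighting used in Textract (Park et al., 2004) may differ.

```python
def dice(term_freq, constituent_freqs):
    """Dice coefficient generalized to an n-word term:
    n * freq(term) / sum of constituent word frequencies.
    For two words this is the classical 2*f(xy) / (f(x) + f(y))."""
    n = len(constituent_freqs)
    return n * term_freq / sum(constituent_freqs)
```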

5.4 Other Features

F10 Acronym (S): Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K, to attest our candidate selection method.

F11 POS sequence (S): Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like acronym, this is also subject to the data set.


F12 Suffix sequence (S): Similar to acronym, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This is also a data-dependent feature, and is thus used in the supervised approach only.

F13 Length of Keyphrases (S, U): Barker and Corrnacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.

6 System and Data

To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections, such as the title and section bodies, via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4, and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.

In evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found that many were missing or occurred only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking about 15 minutes per paper on average. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the number of author-assigned and reader-assigned keyphrases found in the documents.

2 http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3 http://wing.comp.nus.edu.sg/parsCit/
4 http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5 http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6 http://maxent.sourceforge.net/index.html
7 C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence - Multiagent Systems) and J.4 (Social and Behavioral Sciences - Economics)

         Author       Reader         Combined
Total    1252 (53)    3110 (111)     3816 (146)
NPs      904          2537           3027
Average  3.85 (4.01)  12.44 (12.88)  15.26 (15.85)
Found    769          2509           2864

Table 2: Statistics of keyphrases

7 Evaluation

The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TF·IDF, distance, section information and additional section information (i.e. F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence and suffix sequence.

Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e. the exact matching scheme) over the top 5, 10 and 15 candidates.
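Per document, the exact matching scheme amounts to the following sketch (our illustration; the tables report averages over all documents):

```python
def exact_match_at_n(ranked_candidates, gold_keyphrases, n):
    """Exact-match evaluation over the top-n ranked candidates."""
    matched = [c for c in ranked_candidates[:n] if c in gold_keyphrases]
    p = len(matched) / n
    r = len(matched) / len(gold_keyphrases)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return len(matched), p, r, f
```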

BestFeatures includes F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion as in Park et al. (2004)) and F13 (length of keyphrases). Best-TFIDF means using all best features except TF·IDF.

In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S)8.

In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and ~ indicates no effect, or an effect unconfirmed due to only small changes in performance. Again, Supervised denotes Maximum Entropy training, and Unsupervised is our unsupervised approach.

8 Due to the page limits, we present only the best performance.

Method          Features  C    Five                          Ten                           Fifteen
                               Match Prec.  Recall F-score   Match Prec.  Recall F-score   Match Prec.  Recall F-score
All             KEA       U    0.03  0.64   0.21   0.32      0.09  0.92   0.60   0.73      0.13  0.88   0.86   0.87
Candidates      KEA       S    0.79  15.84  5.19   7.82      1.39  13.88  9.09   10.99     1.84  12.24  12.03  12.13
                N&K       S    1.32  26.48  8.67   13.06     2.04  20.36  13.34  16.12     2.54  16.93  16.64  16.78
                baseline  U    0.92  18.32  6.00   9.04      1.57  15.68  10.27  12.41     2.20  14.64  14.39  14.51
                baseline  S    1.15  23.04  7.55   11.37     1.90  18.96  12.42  15.01     2.44  16.24  15.96  16.10
Length<=3       KEA       U    0.03  0.64   0.21   0.32      0.09  0.92   0.60   0.73      0.13  0.88   0.86   0.87
Candidates      KEA       S    0.81  16.16  5.29   7.97      1.40  14.00  9.17   11.08     1.84  12.24  12.03  12.13
                N&K       S    1.40  27.92  9.15   13.78     2.10  21.04  13.78  16.65     2.62  17.49  17.19  17.34
                baseline  U    0.92  18.40  6.03   9.08      1.58  15.76  10.32  12.47     2.20  14.64  14.39  14.51
                baseline  S    1.18  23.68  7.76   11.69     1.90  19.00  12.45  15.04     2.40  16.00  15.72  15.86
Length<=3       KEA       U    0.01  0.24   0.08   0.12      0.05  0.52   0.34   0.41      0.07  0.48   0.47   0.47
Candidates      KEA       S    0.83  16.64  5.45   8.21      1.42  14.24  9.33   11.27     1.87  12.45  12.24  12.34
+ Alternation   N&K       S    1.53  30.64  10.04  15.12     2.31  23.08  15.12  18.27     2.88  19.20  18.87  19.03
                baseline  U    0.98  19.68  6.45   9.72      1.72  17.24  11.29  13.64     2.37  15.79  15.51  15.65
                baseline  S    1.33  26.56  8.70   13.11     2.09  20.88  13.68  16.53     2.69  17.92  17.61  17.76

Table 3: Performance on Proposed Candidate Selection

Features    C    Five                          Ten                           Fifteen
                 Match Prec.  Recall F-score   Match Prec.  Recall F-score   Match Prec.  Recall F-score
Best        U    1.14  22.8   7.47   11.3      1.92  19.2   12.6   15.2      2.61  17.4   17.1   17.3
            S    1.56  31.2   10.2   15.4      2.50  25.0   16.4   19.8      3.15  21.0   20.6   20.8
Best        U    1.14  22.8   7.4    11.3      1.92  19.2   12.6   15.2      2.61  17.4   17.1   17.3
w/o TFIDF   S    1.56  31.1   10.2   15.4      2.46  24.6   16.1   19.4      3.12  20.8   20.4   20.6

Table 4: Performance on Feature Engineering

   Method        Features
+  Supervised    F1a, F2, F3, F4a, F4d, F9a
   Unsupervised  F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-  Supervised    F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
   Unsupervised  F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
~  Supervised    F1e, F10, F11, F12
   Unsupervised  F1b

Table 5: Performance on Each Feature

8 Discussion

We compared the performance of our candidate selection and feature engineering with simple KEA, N&K and our baseline system. In evaluating candidate selection, we found that longer candidates tend to act as noise and so decreased the overall performance. We also confirmed that candidate alternation offers flexibility in keyphrase form, leading to higher candidate coverage as well as better performance.

To re-examine features we analyzed the impactof existing and new features and their variationsFirst of all unlike previous studies we found thatthe performance with and without TFIDF did notlead to a large difference which indicates the im-pact of TFIDF was minor as long as other fea-tures are incorporated Secondly counting sub-strings for TF improved performance while ap-plying term weighting for TF andor IDF did notimpact on the performance We estimated thecause that many of keyphrases are substrings ofcandidates and vice versa Thirdly section in-formation was also validated to improve perfor-mance as in Nguyen and Kan (2007) Extend-ing this logic modeling additional section infor-mation (related work) and weighting sections bothturned out to be useful features Other localityfeatures were also validated as helpful both firstoccurrence and last occurrence are helpful as itimplies the locality of the key ideas In addi-

In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence, it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the findings for each feature. A sketch of the substring-based TF counting appears below.
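As an illustration of the substring counting mentioned above, here is a minimal sketch (our own; the function name is hypothetical) of a TF that counts a candidate phrase's occurrences as a token-level substring of the document, rather than as exact standalone phrase matches:

```python
def substring_tf(candidate, document_tokens):
    """Count occurrences of a candidate phrase anywhere in the token
    stream, including inside longer phrases that contain it."""
    n = len(candidate)
    return sum(1 for i in range(len(document_tokens) - n + 1)
               if document_tokens[i:i + n] == candidate)

doc = ("we evaluate keyphrase extraction and automatic "
       "keyphrase extraction systems").split()
print(substring_tf(["keyphrase", "extraction"], doc))  # -> 2
```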

As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a match count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.


9 Conclusion

We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous work while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.

10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.

References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10-17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information and lexicography. In Proceedings of ACL, 1989, pp. 76-83.

Isaac Councill, C. Lee Giles and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28-30.

Ernesto D'Avanzo and Bernardo Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433-447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81-104.

Khaled Hammouda, Diego Matute and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beáta Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537-544.

Mario Jarmasz and Caroline Barrière. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLiNE, 2004.

Dawn Lawrie, W. Bruce Croft and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349-357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157-169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296-297.

Guido Minnen, John Carroll and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207-223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317-326.

Youngja Park, Roy Byrd and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434-439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254-256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.



Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems, Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department, Columbia University
pb2351@columbia.edu

Abstract

We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem, experimenting with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.

1 Introduction

In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more often than chance. An MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.

For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE's meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning intended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity.

An MWE is compositional if its meaning as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.

To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could literally have been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by Cook et al. (2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning, and over 40% of their usages in text were actually literal. Thus it is important for an NLP application such as machine translation, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as a wrong decision could potentially have detrimental effects on the quality of the translation.


In this paper we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.

This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.

2 Multi-word Expressions

MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come and ad hoc, and non-fixed expressions such as break new ground and speak of the devil. The VNCs which we are focusing on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living, and Verb Particle Constructions (VPC) such as write-up and call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic: for instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.

3 Related Work

Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by Katz and Giesbrecht (2006) and work by Hashimoto and Kawahara (2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine if the expression is idiomatic or not. KG06 is similar in intuition to work proposed by Fazly and Stevenson (2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.

Hashimoto and Kawahara (2008) (HK08) is, to our knowledge, the first large-scale study that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data, identifying 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features, such as POS and lemma information and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.

4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by Cook et al. (2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for identifying MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.1


YamCha is based on Support Vector Machines technology, using degree-2 polynomial kernels.

We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside of a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.

We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma, or citation, form of the word (LEMMA), and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
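The sketch below illustrates this column-style encoding (a minimal sketch of our own; the helper name is hypothetical, while the token, POS, and IOB values are taken from the running example above):

```python
# One token per row, one column per linguistic feature, as in the
# column-style input format described in the text.
def encode_rows(tokens, pos_tags, iob_tags):
    rows = []
    for token, pos, tag in zip(tokens, pos_tags, iob_tags):
        ngram = token[-3:]  # NGRAM feature: last 3 characters of the token
        rows.append((token, pos, ngram, tag))
    return rows

tokens = ["John", "kicked", "the", "bucket", "last", "Friday"]
pos    = ["NN", "VBD", "Det", "NN", "ADV", "NN"]
iob    = ["O", "B-I", "I-I", "I-I", "O", "O"]

for row in encode_rows(tokens, pos, iob):
    print("\t".join(row))
# John    NN   ohn  O
# kicked  VBD  ked  B-I
# ...
```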

With the NE feature, we followed the same representation as for the other features, as a separate column as described above, referred to as Named Entity Separate (NES). For named entity recognition (NER) we use the BBN IdentiFinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other with a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class: Named Entity InText (NEI). For example, in the NEI condition our running example is represented as follows:

1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder

PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE tag "PER".

5 Experiments and Results

5.1 Data

We use the manually annotated standard data set identified in Cook et al. (2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC),3 which contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4 corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.

5.2 Evaluation Metrics

We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
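Spelled out for completeness (the paper leaves the formula implicit), the measure is:

```latex
F_{\beta=1} = \frac{2 \cdot P \cdot R}{P + R}
```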

5.3 Results

We present the results for the different feature sets and their combination. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The FREQ baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baselines have the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.6

3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.

5 We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only because it is the metric used by previous work.



In Table 2 we present the results yielded per feature and per condition. We experimented with different context sizes initially, to decide on the optimal window size for our learning framework; results are presented in Table 1. Once that was determined, we proceeded to add features.

Noting that a window size of ±3 yields the best results, we proceed to use that as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04%.

Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in overall performance of 2.67 absolute F points over the TOK baseline.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance for NES-Full+L+N+P compared with NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%: a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.

6 Discussion

The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features yields the best results.

6 We could have easily identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set.

In general, performance on the classification and identification of idiomatic expressions is much better. This may be due to the fact that the data has many more idiomatic token examples for training. We also note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

We analyzed some of the errors produced in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors confusing idiomatic cases with literal ones (23 times) and the opposite (4 times).

Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, and make top. Cases of idiomatic expressions erroneously classified as literal occur for the MWE types hit the road, blow trumpet, blow whistle, and hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

Also, in As the ball hit the post, the referee blew the whistle, the system identified blew the whistle as a literal VNC in this context, and it identified hit the post as another literal VNC MWE token.

7 Conclusion

In this study we explore a set of features that contribute to binary supervised classification of VNC token expressions. The use of NER significantly improves the performance of the system; using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58%. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.


      IDM-F  LIT-F  Overall F  Overall Acc
±1    77.93  48.57  71.78      96.22
±2    85.38  55.61  79.71      97.06
±3    86.99  55.68  81.25      96.93
±4    86.22  55.81  80.75      97.06
±5    83.38  50.00  77.63      96.61

Table 1: Results (in %) of varying context window size

                  IDM-P  IDM-R  IDM-F  LIT-P  LIT-R  LIT-F  Overall F
FREQ              70.02  89.16  78.44  0      0      0      69.68
TOK               81.78  83.33  82.55  71.79  43.75  54.37  77.04
(L)EMMA           83.10  84.29  83.69  69.77  46.88  56.07  78.11
(N)GRAM           83.17  82.38  82.78  70.00  43.75  53.85  77.01
(P)OS             83.33  83.33  83.33  77.78  43.75  56.00  78.08
L+N+P             86.95  83.33  85.38  72.22  45.61  55.91  79.71
NES-Full          85.20  87.93  86.55  79.07  58.62  67.33  82.77
NES-Bin           84.97  82.41  83.67  73.49  52.59  61.31  79.15
NEI               89.92  85.18  87.48  81.33  52.59  63.87  82.82
NES-Full+L+N+P    89.89  84.92  87.34  76.32  50.00  60.42  81.99
NES-Bin+L+N+P     90.86  84.92  87.79  76.32  50.00  60.42  82.33
NEI+L+N+P         91.35  88.42  89.86  81.69  50.00  62.03  84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations

8 Acknowledgement

The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, who helped make this publication more concise and better presented.

References

Timothy Baldwin, Collin Bannard, Takakki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89-96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329-336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41-48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96-103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992-1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353-360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317-324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369-379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97-108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33-40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1-15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100-108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754-762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, Czech Republic, June. Association for Computational Linguistics.



Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics, University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics, University of Potsdam, Germany
kuhn@ling.uni-potsdam.de

Abstract

Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.

1 Introduction

Parallel corpora have proved to be a valuable resource not only for statistical machine translation, but also for the crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:

(1) Der Rat sollte unsere Position berücksichtigen
    The Council should our position consider

(2) The Council should take account of our position

This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language.

It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that our method is able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider as MWEs those tokens in a target language which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for each item in a target language (and whether these one-to-many translations correspond to MWEs).


This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item, and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches, which need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.

In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, and is based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 therefore evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.

2 Multiword Translations as MWEs

The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                 1-1    1-n    n-1    n-n    No
anheben (v1)         53.5   21.2   9.2    1.6    32.5
bezwecken (v2)       16.7   51.3   0.6    31.3   15.0
riskieren (v3)       46.7   35.7   0.5    1.7    18.2
verschlimmern (v4)   30.2   21.5   28.6   44.5   27.5

Table 1: Proportions of types of translational correspondence (token-level) in our gold standard

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl Corpus.

These verbs subcategorize for a nominative subject and an accusative object, and lie in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl containing occurrences of these lemmas, along with their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions for the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.

A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations: This type of one-to-many translation is mainly due to non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.


(3) Sie verschlimmern die Übel
    They aggravate the misfortunes

(4) Their action is aggravating the misfortunes

Verb particle combinations: A typical MWE pattern, treated for instance in Baldwin and Villavicencio (2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der Ausschuss bezweckt, den Institutionen ein politisches Instrument an die Hand zu geben
    The committee intends the institutions a political instrument at the hand to give

(6) The committee set out to equip the institutions with a political instrument

Verb preposition combinations: While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.

(7) Sie werden den Treibhauseffekt verschlimmern
    They will the greenhouse effect aggravate

(8) They will add to the greenhouse effect

Light verb constructions (LVCs): This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).

(9) Ich werde die Sache nur noch verschlimmern
    I will the thing only just aggravate

(10) I am just making things more difficult

Idioms: This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to the standard's limited size and the source items we chose.

(11) Sie bezwecken die Umgestaltung in eine zivile Nation
     They intend the conversion into a civil nation

(12) They have in mind the conversion into a civil nation

         v1        v2        v3        v4
Ntype    22 (26)   41 (47)   26 (35)   17 (24)
V Part   22.7      4.9       0.0       0.0
V Prep   36.4      41.5      3.9       5.9
LVC      18.2      29.3      88.5      88.2
Idiom    0.0       2.4       0.0       0.0
Para     36.4      24.3      11.5      23.5

Table 2: Proportions of MWE types per lemma

Paraphrases: From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.

(13) Wir brauchen bessere Zusammenarbeit, um die Rückzahlungen anzuheben
     We need better cooperation to the repayments (OBJ) increase

(14) We need greater cooperation in this respect to ensure that repayments increase

Table 2 displays the proportions of the MWE categories over the types of one-to-many correspondences in our gold standard. We filtered out the types due only to morphological variation (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of the general translation categories, the different verb lemmas again show striking differences with respect to their realization in English translations.


For instance, anheben (en. increase) and bezwecken (en. intend) are frequently translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. The more specific LVC patterns also differ largely among the verbs: while verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs, which do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).

In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.

3 Multiword Translation Detection

This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.

3.1 One-to-many Alignments

To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the translational correspondences that the system discovers at the type level against the one-to-many translations in our gold standard. By type, we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases.

        n > 0          n > 1          n > 3
verb    FPR    FNR     FPR    FNR     FPR    FNR
v1      0.97   0.93    1.0    1.0     1.0    1.0
v2      0.93   0.90    0.5    0.96    0.0    0.98
v3      0.88   0.83    0.8    0.97    0.67   0.97
v4      0.98   0.92    0.8    0.92    0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments

However, the frequency of the translation types is so low that, already at a threshold of 3, almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that detection of the exact set of tokens that corresponds to the source token is rare.
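The paper does not spell out formal definitions of these rates; under the natural reading (our assumption), with E the set of extracted translation types for a verb and G the gold-standard types, they would be:

```latex
\mathrm{FPR} = \frac{|E \setminus G|}{|E|},
\qquad
\mathrm{FNR} = \frac{|G \setminus E|}{|G|}
```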

This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die Korruption behindert die Entwicklung
     The corruption impedes the development

(16) Corruption constitutes an obstacle to development

Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.

3.2 Aligning Syntactic Configurations

High-precision extraction of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found; 2) how to align target items that have a low occurrence correlation.

We argue that both of these problems can be addressed by taking syntactic information into account.


As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By allowing target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting create and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering of the configurations; c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.

[Figure 1 shows the dependency trees for the sentence pair (15)/(16): the German verb behindert with its arguments X1 and Y1, and the English tree rooted in create, whose path through obstacle, an, and to connects the corresponding arguments X2 and Y2.]

Figure 1: Example of a typical syntactic MWE configuration

Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (DG, DE, AGE). DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. DE is the set of dependency triples of the target sentence. AGE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.

Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.

Filtering: Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in DG, there is a tuple (sn, tn) in AGE, i.e. every dependent has an alignment.
3. There is a target item t1 in DE such that for each tn there is a path p in DE, (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn), that connects t1 and tn. Thus, the target dependents have a common root.

To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, the target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb. A sketch of the common-root check appears below.
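As an illustration, here is a minimal sketch (our own, not the authors' implementation) of condition 3 with the path-length filter folded in: find a target item from which every aligned target dependent is reachable via dependency edges. Tokens are assumed unique within a sentence; a real implementation would use token indices.

```python
def common_root(de, target_deps, max_path_len=3):
    """Return a target item from which every item in target_deps is
    reachable via dependency edges of de, or None if there is none."""
    children = {}
    for head, rel, dep in de:
        children.setdefault(head, set()).add(dep)

    def reachable(node, goal, limit):
        if node == goal:
            return True
        if limit == 0:
            return False
        return any(reachable(child, goal, limit - 1)
                   for child in children.get(node, ()))

    for root in children:  # candidate common roots
        if all(reachable(root, t, max_path_len) for t in target_deps):
            return root
    return None
```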

Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item.


The intuition discussed in Section 3.2 was that all the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.

Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) in DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) is in AGE. This translation problem largely parallels collocation translation problems discussed in the literature, as in Smadja and McKeown (1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in Smadja and McKeown (1994). We therefore propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti in T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.
2. For each ti, tx such that there is a (ti, rel, tx) in DE, compute corr(s1, T + tx).
3. If corr(s1, T + tx) >= corr(s1, T), add tx to T.

As the Dice coefficient is often reported to give the best results, e.g. in Smadja and McKeown (1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:

corr(s1, T) = 2 * freq(s1 and T) / (freq(s1) + freq(T))

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti in T, and freq(s1) accordingly. The joint frequency freq(s1 and T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
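A minimal sketch (our own illustration) of this Dice-based post-editing loop; count_src, count_tgt, and count_joint are assumed corpus-level counters over sentence pairs, as defined above:

```python
def dice(s1, t_set, count_src, count_tgt, count_joint):
    """Dice correlation between source item s1 and a target item set."""
    return 2.0 * count_joint(s1, t_set) / (count_src(s1) + count_tgt(t_set))

def post_edit(s1, t_set, dependents_of, counters):
    """Greedily add target dependents that do not lower the correlation
    (steps 2 and 3 of the algorithm above)."""
    score = dice(s1, t_set, *counters)
    for ti in list(t_set):
        for tx in dependents_of(ti):
            if dice(s1, t_set | {tx}, *counters) >= score:
                t_set = t_set | {tx}
                score = dice(s1, t_set, *counters)
    return t_set
```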

The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation of the English MWE run the risk of.

3.3 Evaluation

Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard, to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than cooccurrence-based alignments alone. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma don't figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that doesn't constrain the syntactic type of the parallel argument relations. While the FNR stays roughly the same, the strict filter decreases the FPR significantly. This means that the syntactic filters mainly fire on noisy configurations and don't decrease recall.


A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down the number of candidate configurations that our algorithm detects. As the dependency parses don't provide deep analyses for tense or control phenomena, very often a verb's arguments don't figure as its syntactic dependents, and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.

MWE evaluation: In a second experiment, we evaluated the patterns of correspondence found by our extraction method for use in an MWE context. We selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondence. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.

4 Related Work

The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily rely on linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was

     Strict Filter      Relaxed Filter
     FPR     FNR        FPR     FNR
v1   0.0     0.96       0.5     0.96
v2   0.25    0.88       0.47    0.79
v3   0.25    0.74       0.56    0.63
v4   0.0     0.875      0.56    0.84

Table 4: False positive and false negative rates of one-to-many translations.

Trans. type    Proportion    MWE type    Proportion
MWEs           57.5%         V Part       8.2%
                             V Prep      51.8%
                             LVC         32.4%
                             Idiom       10.6%
Paraphrases    24.4%
Alternations    1.0%
Noise          17.1%

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.

established by Wu (1997). Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In Villada Moirón and Tiedemann (2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches don't rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.

By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In Bannard and Callison-Burch (2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado, 2009).

The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of a simplex realization in a source language to an idiosyncratic multiword realization in the target language.

5 Conclusion

We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for monolingual interpretation of MWEs.

Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment, and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.

References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 13:311–326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

Joao de Almeida Varelas Graca, Joana Paulo Pardal, Luísa Coheur, and Diamantino Antonio Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38/2008, INESC-ID, Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.



A Re-examination of Lexical Association Measures

Hung Huu Hoang (Dept. of Computer Science, National University of Singapore; hoanghuu@comp.nus.edu.sg)
Su Nam Kim (Dept. of Computer Science and Software Engineering, University of Melbourne; snkim@csse.unimelb.edu.au)
Min-Yen Kan (Dept. of Computer Science, National University of Singapore; kanmy@comp.nus.edu.sg)

Abstract

We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions Specifically we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms We empirically validate the effectiveness of these modified AMs in detection tasks in English performed on the Penn Treebank which shows significant improvement over the original AMs

1 Introduction

Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood

ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact test, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ² and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions and stock phrases. In their work, the best combination of AMs was selected using machine learning.

While these previous works have evaluated AMs, there have been few details on why the AMs perform as they do. A detailed analysis of why these AMs perform as they do is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.

We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short) or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the mobility of noun phrase complements in transitive VPCs: they can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.

2 Classification of Association Measures

Although different AMs have different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure the association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score and Pearson's χ² test.

Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different; for instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity; distance metrics such as the Euclidean and Manhattan norms; and probability distribution measures such as the Kullback-Leibler divergence and Jensen-Shannon divergence.
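The context-diversity idea can be sketched as follows (our illustration, not a specific implementation from the literature): collect the words around each occurrence of a phrase and compute the entropy of that distribution; non-compositional phrases are expected to occur in fewer, more repetitive contexts and thus show lower entropy.

```python
import math
from collections import Counter

def context_entropy(occurrence_contexts):
    # occurrence_contexts: one list of surrounding words per occurrence
    counts = Counter(w for ctx in occurrence_contexts for w in ctx)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```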

Table 1 lists all AMs used in our discussion

The lower left legend defines the variables a, b, c and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
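The counts a, b, c and d can be read off a corpus of bigrams directly; the sketch below (our illustration) derives them for a bigram (x, y) and evaluates two of the listed AMs.

```python
def contingency(bigrams, x, y):
    # a = f(xy), b = f(x, not-y), c = f(not-x, y), d = remainder;
    # N = len(bigrams) is the total number of bigrams.
    a = sum(1 for u, v in bigrams if u == x and v == y)
    b = sum(1 for u, v in bigrams if u == x and v != y)
    c = sum(1 for u, v in bigrams if u != x and v == y)
    d = len(bigrams) - a - b - c
    return a, b, c, d

def jaccard(a, b, c, d):      # [M11]
    return a / (a + b + c)

def odds_ratio(a, b, c, d):   # [M18]
    return (a * d) / (b * c)
```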

3 Evaluation

We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performances of AMs are measured in our experiments.

3.1 Evaluation Data

In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.

For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid the possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.


Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put and take. Next, we search to the right of the light verb for the nearest noun, permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc.

M1 Joint Probability: f(xy) / N
M2 Mutual Information (MI): (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3 Log likelihood ratio: 2 Σ_ij f_ij log(f_ij / f̂_ij)
M4 Pointwise MI (PMI): log [ P(xy) / (P(x*) P(*y)) ]
M5 Local-PMI: f(xy) × PMI
M6 PMI^k: log [ N f(xy)^k / (f(x*) f(*y)) ]
M7 PMI²: log [ N f(xy)² / (f(x*) f(*y)) ]
M8 Mutual Dependency: log [ P(xy)² / (P(x*) P(*y)) ]
M9 Driver-Kroeber: a / √((a+b)(a+c))
M10 Normalized expectation: 2a / (2a+b+c)
M11 Jaccard: a / (a+b+c)
M12 First Kulczynski: a / (b+c)
M13 Second Sokal-Sneath: a / (a+2(b+c))
M14 Third Sokal-Sneath: (a+d) / (b+c)
M15 Sokal-Michiner: (a+d) / (a+b+c+d)
M16 Rogers-Tanimoto: (a+d) / (a+2b+2c+d)
M17 Hamann: ((a+d) − (b+c)) / (a+b+c+d)
M18 Odds ratio: ad / bc
M19 Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q: (ad − bc) / (ad + bc)
M21 Brawn-Blanquet: a / max(a+b, a+c)
M22 Simpson: a / min(a+b, a+c)
M23 S cost: log(1 + min(b, c)/(a+1))^(−1/2)
M24 Adjusted S cost*: log(1 + max(b, c)/(a+1))^(−1/2)
M25 Laplace: (a+1) / (a + min(b, c) + 2)
M26 Adjusted Laplace*: (a+1) / (a + max(b, c) + 2)
M27 Fager: [M9] − (1/2) max(b, c)
M28 Adjusted Fager*: [M9] − max(b, c) / √(aN)
M29 Normalized PMIs*: PMI / NF(α), PMI / NFmax
M30 Simplified normalized PMI for VPCs*: log(ad) / (α·b + (1−α)·c)
M31 Normalized MIs*: MI / NF(α), MI / NFmax

where NF(α) = α·P(x*) + (1−α)·P(*y), α ∈ [0, 1], and NFmax = max(P(x*), P(*y)).

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work. Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(x ȳ), c = f21 = f(x̄ y), d = f22 = f(x̄ ȳ), where x̄ stands for all words except x and ȳ for all words except y; f(x*) and f(*y) are the marginal frequencies, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.


If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.
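The particle-anchored search for VPC candidates can be sketched as follows (our rendering, assuming a POS-tagged sentence given as a list of (word, tag) pairs with Penn Treebank verb tags; the particle set is the pre-compiled list of Appendix A):

```python
def vpc_candidates(tagged, particles, max_np=5):
    # For each particle, search left for the nearest verb, allowing an
    # intervening NP of up to max_np words (cf. Section 3.1).
    for i, (word, _) in enumerate(tagged):
        if word in particles:
            for j in range(i - 1, max(i - max_np - 2, -1), -1):
                if tagged[j][1].startswith("VB"):
                    yield (tagged[j][0], word)
                    break
```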

Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al. (2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                      VPC data       LVC data    Mixed
Total (freq ≥ 6)      413            100         513
Positive instances    117 (28.33%)   28 (28%)    145 (28.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary

goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work

3.2 Evaluation Metric

To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
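For reference, AP over a ranked candidate list can be computed as below (a standard formulation, sketched by us; `ranked` is assumed sorted by descending AM score and `gold` is the set of true MWEs):

```python
def average_precision(ranked, gold):
    hits, precisions = 0, []
    for rank, cand in enumerate(ranked, start=1):
        if cand in gold:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0
```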

3.3 Evaluation Results

Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performances of most context-based and hypothesis test AMs are very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that are relied on in hypothesis tests.

4 Rank Equivalence

We note that some AMs, although not mathematically equivalent (i.e., assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡_C^r M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck in C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:

34

Corollary: If M1 ≡_C^r M2, then AP_C(M1) = AP_C(M2), where AP_C(Mi) stands for the average precision of the AM Mi over the data set C.

Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the three AMs Odds ratio, Yule's ω and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with their simplest representatives from each rank equivalent group.

¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence we still consider such AM pairs to be rank equivalent.

1) [M2] Mutual Information, [M3] Log likelihood ratio
2) [M7] PMI², [M8] Mutual Dependency, [M9] Driver-Kroeber (a.k.a. Ochiai)
3) [M10] Normalized expectation, [M11] Jaccard, [M12] First Kulczynski, [M13] Second Sokal-Sneath (a.k.a. Anderberg)
4) [M14] Third Sokal-Sneath, [M15] Sokal-Michiner, [M16] Rogers-Tanimoto, [M17] Hamann
5) [M18] Odds ratio, [M19] Yule's ω, [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
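The definition can also be checked empirically on a candidate set; the sketch below (ours, not the authors' proof technique) verifies that two scoring functions order, and tie, every candidate pair identically:

```python
from itertools import combinations

def rank_equivalent(am1, am2, candidates):
    # am1, am2: functions mapping a candidate to its score
    for cj, ck in combinations(candidates, 2):
        d1, d2 = am1(cj) - am1(ck), am2(cj) - am2(ck)
        if (d1 > 0) != (d2 > 0) or (d1 == 0) != (d2 == 0):
            return False
    return True
```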

5 Examination of Association Measures

We highlight two important findings in our analysis of the AMs over our English datasets: Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.

5.1 Mutual Information and Pointwise Mutual Information

In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in

[Figure 1: Average precision (AP) profiles of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC (Figure 1a) and LVC (Figure 1b) datasets; the AP axis runs from 0.1 to 0.6. Bold points indicate AMs discussed in this paper. Legend: hypothesis test AMs; Class I AMs excluding hypothesis test AMs; context-based AMs.]

identifying LVCs. In this section we show how their performances can be improved significantly.

Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U; V) = Σ_{u,v} p(uv) log [ p(uv) / (p(u*) p(*v)) ].

In the context of bigrams, the above formula can be simplified to [M2] MI = (1/N) Σ_ij f_ij log(f_ij / f̂_ij). While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log [ P(xy) / (P(x*) P(*y)) ] = log [ N f(xy) / (f(x*) f(*y)) ].

It has long

been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI with f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying f(xy) directly with PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs           LVCs           Mixed
Best [M6] PMI^k    54.7 (k=.13)   57.3 (k=.85)   54.4 (k=.32)
[M4] PMI           51.0           56.6           51.5
[M5] Local-PMI     25.9           39.3           27.2
[M1] Joint Prob.   17.0           28.0           17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.

Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X; Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary and, consequently, MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among score

ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α·P(x*) + (1−α)·P(*y), with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on alpha, can be optimized by setting an appropriate alpha value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of alpha and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks, respectively. If one re-writes MI as (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.
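Expressed in code, the proposed normalization looks roughly as follows (formulas as in the text; the frequency-based arguments and function names are our own):

```python
import math

def pmi(fxy, fx, fy, n):
    # [M4]: log( N f(xy) / (f(x*) f(*y)) )
    return math.log(n * fxy / (fx * fy))

def pmi_nf_alpha(fxy, fx, fy, n, alpha):
    # PMI / NF(alpha), with NF(alpha) = alpha P(x*) + (1 - alpha) P(*y)
    nf = alpha * (fx / n) + (1 - alpha) * (fy / n)
    return pmi(fxy, fx, fy, n) / nf

def pmi_nf_max(fxy, fx, fy, n):
    # PMI / NFmax, with NFmax = max(P(x*), P(*y))
    return pmi(fxy, fx, fy, n) / max(fx / n, fy / n)
```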

AM             VPCs           LVCs           Mixed
MI / NF(α)     50.8 (α=.48)   58.3 (α=.47)   51.6 (α=.5)
MI / NFmax     50.8           58.4           51.8
[M2] MI        27.3           43.5           28.9
PMI / NF(α)    59.2 (α=.8)    55.4 (α=.48)   58.8 (α=.77)
PMI / NFmax    56.5           51.7           55.6
[M4] PMI       51.0           56.6           51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best alpha settings shown in parentheses.

5.2 Penalization Terms

It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulae whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Brawn-Blanquet (a.k.a. Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                       VPCs   LVCs   Mixed
[M21] Brawn-Blanquet     47.8   57.8   48.6
[M22] Simpson            24.9   38.2   26.0
[M24] Adjusted S cost    48.5   57.7   49.2
[M23] S cost             24.9   38.8   26.0
[M26] Adjusted Laplace   48.6   57.7   49.3
[M25] Laplace            24.1   38.8   25.4

Table 6: Replacing min() with max() in selected AMs.

In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is none other than an AM rank equivalent to [M7] PMI². In our application this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ by a scaled-down version of max(b, c). We have approximately derived √(aN) as a lower-bound estimate of max(b, c) using the independence assumption, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                     VPCs   LVCs   Mixed
[M28] Adjusted Fager   56.4   54.3   55.4
[M27] Fager            55.2   43.9   52.5

Table 7: Performance of Fager and its adjusted version.

The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus this AM runs the risk of ranking VPC candidates based only on the frequencies of particles, so it is necessary that we scale b and c properly, as in the weighted variant of [M14]: −α×b − (1−α)×c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14]: PMI / (α×b + (1−α)×c). The denominator of [MR14] is obtained by removing the minus sign from the weighted [M14], so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α·P(x*) + (1−α)·P(*y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically tried to simplify [MR14] to the following AM: [M30] log(ad) / (α×b + (1−α)×c). The setting of alpha in Table 8 below is taken from the best alpha setting obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.

AM                              VPCs          LVCs           Mixed
PMI / NF(α)                     59.2 (α=.8)   55.4 (α=.48)   58.8 (α=.77)
[M30] log(ad)/(α·b+(1−α)·c)     60.0 (α=.8)   48.4 (α=.48)   58.8 (α=.77)
[M14] Third Sokal-Sneath        56.5          45.3           54.6

Table 8: AP performance of suggested penalization terms and AMs for VPCs.

With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α · c^(1−α). The best alpha value is also taken from the normalized PMI experiments (Table 5), which is nearly 0.5. Under this setting, this penalization term is exactly the denominator of the [M18] Odds Ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.


AM                 VPCs   LVCs   Mixed
−b − c             56.5   45.3   54.6
1/(bc)             50.2   53.2   50.2
[M18] Odds ratio   44.3   56.7   45.6

Table 9: AP performance of suggested penalization terms and AMs for LVCs.

6 Conclusions

We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and we find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired T test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.

We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α · c^(1−α) is suitable for LVCs. Our introduced alpha tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger

web-based datasets of English to verify the performance gains of our modified AMs

Acknowledgement: This work was partially supported by a National Research Foundation grant, "Interactive Media Search" (grant R 252 000 325 279).

References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de/, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proc. of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, M. Yen and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.

Appendix A: List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.



Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus

R. Mahesh K. Sinha

Department of Computer Science & Engineering, Indian Institute of Technology Kanpur

Kanpur 208016, India. rmk@iitk.ac.in

Abstract

A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multi-word expression with a meaning that is distinct from the meaning of the light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.

1 Introduction

Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multi-word expression (MWE) where a noun, a verb or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007) and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features. They report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs using projection of POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.

In this paper we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of CPs in Hindi is outlined; this is followed by the system design, experimentation and results.

2 Complex Predicates in Hindi

A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective or an adverb as main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.

Given below are some examples.

(1) CP = noun+LV; noun = ashirwad (blessings), LV = denaa (to give):
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP:
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form with the noun ashirwad (blessings), makes a complex predicate verb form ashirwad diyaa (blessed) in the past tense form. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2), the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush (happy), LV = karanaa (to do):
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa (to read), LV = lenaa (to take):
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual, a modal or an intensifier.

(5) CP = verb+LV; verb = phaadanaa (to tear), LV = denaa (to give):
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa (to give), LV = maaranaa (to hit, to kill):
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning; this may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV, with CP = vaapas karanaa (to return), LV = denaa (to give):
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One way of interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs: vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretation in the two cases remains the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.

From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.

3 System Design

As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows (a code sketch of the matching loop is given after the list):

1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or its morphological derivative is found, then search for the equivalent English meanings of any of the morphological forms (Figure 2) in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit, i.e., go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e., go to step (a); else the identified CP is W+LV.
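The following sketch condenses steps 5(a)(i)-(vi) into code (our rendering, not the author's implementation; the dictionaries lv_forms and en_meanings and the stop/exit word sets are the resources built in steps 1-4):

```python
def mine_cps(hindi_tokens, english_tokens, lv_forms, en_meanings,
             stop_words, exit_words):
    cps, en_set = [], set(english_tokens)
    for k, token in enumerate(hindi_tokens):
        for lv, forms in lv_forms.items():
            if token not in forms:
                continue                 # step (i): locate a light-verb form
            if en_set & en_meanings[lv]:
                continue                 # step (ii): LV meaning found in English -> not a CP
            j = k - 1                    # steps (iii)-(v): scan left past stop words
            while j >= 0 and hindi_tokens[j] in stop_words:
                j -= 1
            if j >= 0 and hindi_tokens[j] not in exit_words:
                cps.append((hindi_tokens[j], lv))   # step (vi): CP = W + LV
    return cps
```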

Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings have been shown within parentheses and are not used for matching.

light verb (base form)        root verb meaning
baithanaa   बैठना     sit
bananaa     बनना      make/become/build/construct/manufacture/prepare
banaanaa    बनाना     make/build/construct/manufacture/prepare
denaa       देना      give
lenaa       लेना      take
paanaa      पाना      obtain/get
uthanaa     उठना      rise/arise/get up
uthaanaa    उठाना     raise/lift/wake up
laganaa     लगना      feel/appear/look/seem
lagaanaa    लगाना     fix/install/apply
cukanaa     चुकना     (finish)
cukaanaa    चुकाना    pay
karanaa     करना      (do)
honaa       होना      happen/become/be
aanaa       आना       come
jaanaa      जाना      go
khaanaa     खाना      eat
rakhanaa    रखना      keep/put
maaranaa    मारना     kill/beat/hit
daalanaa    डालना     put
haankanaa   हाँकना    drive

Table 1: Some of the common light verbs in Hindi.


For each of the Hindi light verbs, all morphological forms are generated. A few illustrations are given in Figures 1(a) and 1(b). Similarly, for each of the English meanings of the light verb, all of its morphological derivatives are generated. Figure 2 shows a few illustrations of the same.
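A sketch of the English-side form generation (our illustration; the irregular-form table here is a small stand-in for the full resource the system would need):

```python
IRREGULAR = {"give": {"give", "gives", "gave", "given", "giving"},
             "sit": {"sit", "sits", "sat", "sitting"}}

def english_forms(verb):
    # Expand a light-verb meaning into surface forms for matching
    # against the aligned English sentence.
    if verb in IRREGULAR:
        return IRREGULAR[verb]
    forms = {verb, verb + "s", verb + "ed", verb + "ing"}
    if verb.endswith("e"):               # e.g. take -> taking
        forms.add(verb[:-1] + "ing")
    return forms
```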

There are a number of words that can appear in between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. These are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.

Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa (जाना, 'to go').

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa (लेना, 'to take').

Figure 2: Morphological derivatives of sample English meanings.

We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list used. However, this list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform parts-of-speech (POS) tagging, and so the nature of the word preceding the LV is unknown to the system.

[Figure 2 content: sit → sit, sits, sat, sitting; give → give, gives, gave, given, giving.]

[Figure 1(a) content: derivatives of jaanaa — jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgaa, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgii, jaakara — each given in Devanagari with a gloss of its tense, gender, number and person.]

[Figure 1(b) content: derivatives of lenaa — le, lii, leM, lo, letaa, letii, lete, liiM, luuM, legaa, legii, lene, lenaa, lenii, liyaa, leMge, loge, letiiM, luuMgaa, luuMgii, lekara — each given in Devanagari with a gloss of its tense, gender, number and person.]


Figure 3: Stop words in Hindi used by the system.

Figure 4: A few exit words in Hindi used by the system.

The inner loop of the procedure identifies multiple CPs that may be present in a sentence. The outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.

4 Experimentation and Results

The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92%, and the recall is between 89% and 100%. The F-measure is 88% to 97%. This is a remarkable and somewhat surprising result from the simple methodology, without much linguistic or statistical analysis. This is much higher than what has been reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification. This is the only other work that uses a parallel corpus and covers all kinds of CPs. The results as reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                File 1   File 2   File 3   File 4   File 5   File 6
No. of sentences                  112      193      102       43      133      107
Total no. of CPs (N)              200      298      150       46      188      151
Correctly identified CPs (TP)     195      296      149       46      175      135
V-V CPs                            56       63        9        6       15       20
Incorrectly identified CPs (FP)    17       44        7       11       16       20
Unidentified CPs (FN)               5        2        1        0       13       16
Accuracy (%)                    97.50    99.33    99.33   100.00    93.08    89.40
Precision (TP/(TP+FP)) (%)      91.98    87.05    95.51    80.70    91.62    87.09
Recall (TP/(TP+FN)) (%)         97.50    98.33    99.33   100.00    93.08    89.40
F-measure (2PR/(P+R)) (%)        94.6     92.3     97.4     89.3     92.3     88.2

Table 2: Results of the experimentation.

[Figure 4 content (exit words): ने (ergative case marker), को (accusative case marker), का/के/की (possessive case markers), से (from/by/with), में (in/into), पर (on/but), और (and), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that), and personal, possessive and relative pronouns such as मैं, तुम, आप, वह, मेरा/मेरे/मेरी, तुम्हारा/तुम्हारे/तुम्हारी, उसका/उसकी/उसके, अपना/अपनी/अपने, उनके, मैंने, तुम्हें, आपको, उसको, उनको, उन्हें, मुझको, मुझे, जिसका/जिसकी/जिसके, जिनको, जिनके.]

[Figure 3 content (stop words): नहीं (no/not), न (no/not), भी (also), ही (only), तो (then), क्यों (why), क्या (what), कहाँ (where), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अंत में (in the end), आखिर में (in the end).]


Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. खुशी होना (to feel happy, good)
v. मदद करना (to help)

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।
The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. कद्र करना (to respect)

Here also the two CPs identified are correct. It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not being checked. However, the mining success rate obtained suggests that these cases are small in number in practice. Use of the 'stop words' in allowing the intervening words within CPs helps a lot in improving the performance. Similarly, use of the 'exit words' avoids a lot of incorrect identification.

5 Conclusions

The simple empirical method for mining CPs outlined in this work yields an average of 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs. This list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used in word alignment tools, where the identified components of CPs are grouped together before the word alignment process. This will increase both the alignment accuracy and the speed.

The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family

References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing - 2005, Jeju Island, Korea, 553–564.



Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2

1Key Lab of Intelligent Info. Processing, Institute of Computing Technology,
Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn

2Department of Computer Science, School of Computing,
National University of Singapore, Computing 1, Law Link, Singapore 117590
huangyun@comp.nus.edu.sg

Abstract

Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.

1 Introduction

Phrase-based machine translation models have been proved a great improvement over the initial word-based approaches (Brown et al. 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, the translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.

A multiword expression (MWE) can be considered as a word sequence with relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWE. Sag et al. (2002) roughly defined MWE as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWE, i.e. the property that the whole expression cannot be derived from its component words. Stanford University launched an MWE project1 in which different qualities of MWE were presented. For bilingual multiword expressions (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is an MWE in the source language; (2) the source phrase and the target phrase must be translated to each other exactly, i.e. there is no additional (boundary) word in the target phrase which cannot find the corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al. 2005; Bannard 2007; Fazly and Stevenson 2006). Since an MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.

For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve the SMT quality. In their work, a bilingual MWE in the training corpus was grouped as

1 http://mwe.stanford.edu


one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving the SMT performance rather than improving the word alignment quality. In detail, the differences are listed as follows:

• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving the word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics only lead to small gains in translation performance (Ayan and Dorr 2006; Fraser and Marcu 2007).

Besides the above differences, there are some advantages to our approaches:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would otherwise further propagate to the SMT system and hurt SMT performance.

• We conduct experiments on domain-specific corpora. For one thing, a domain-specific

2 http://www.statmt.org/moses

corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e. a domain-specific corpus has high MWE coverage. For another, after investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are more flexible and concise. Take the Chinese term "软坚散结" for example. The meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or without this context. These terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.

• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus it allows full exploitation of the available resources without greatly increasing the time and space complexity of the SMT system.

The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction on monolingual and bilingual MWE extraction.

2 Bilingual Multiword Expression Extraction

In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.

2.1 Automatic Extraction of MWEs

In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin 2001; Piao et al. 2005), (2) syntactic approaches (Fazly and Stevenson 2006; Bannard 2007), and (3) semantic approaches (Baldwin et al. 2003; Cruys and Moiron 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparation step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.

Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al. 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate our algorithm, we first define some useful items. In the following definitions, we assume the given sentence is "A B C D E".

Definition 1 (Unit): A unit is any substring of the given sentence. For example, "A B", "C", "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score): The score function is defined only on two adjacent units and returns the LLR between the last word of the first unit and the first word of the second unit.3 For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select): The selecting operator is to find the two adjacent units with maximum score in a list.

Definition 5 (Reduce): The reducing operator is to remove two specific adjacent units, concatenate them, and put the resulting unit back in the removed position. For example, if we reduce unit "B C" and unit "D" in the list "A", "B C", "D", "E", we will get the list "A", "B C D", "E".

Initially, every word in the sentence is considered as one unit, and all these units form an initial list L. If the sentence is of length N, then the

3 We use a stoplist to eliminate the units containing function words by setting their score to 0.

list contains N units, of course. The final set of MWEs S is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. Our algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.

[Figure 1: Example of the Hierarchical Reducing Algorithm: the hierarchical structure built over the sentence "防 高 脂 血 用 保健 茶", whose adjacent-word LLRs are 14.71, 675.52, 105.96, 0, 0, 80.96]

Let us make the algorithm clearer with an example. Assume the threshold of the score is 20 and the given sentence is "防 高 脂 血 用 保健 茶".4 Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs ("高 脂", "高 脂 血", "保健 茶", "防 高 脂 血") are extracted in this order, and the substrings above the dotted line in Figure 1 are not extracted.

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are scored with the minimum LLR of adjacent words in them, which gives the lexical confidence of the extracted MWEs. Finally, supposing the given sentence has length N, HRA will definitely terminate within N − 1 iterations, which is very efficient.
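To make the procedure concrete, here is a minimal Python sketch of the HRA loop, assuming a precomputed function llr(w1, w2) and a stoplist of function words; both names are hypothetical stand-ins for the statistics described above, and joining merged units without a separator is purely illustrative:

    def hra(sentence, llr, stopwords, threshold=20.0):
        """Extract MWE candidates of arbitrary length from a tokenized sentence."""
        units = [[w] for w in sentence]   # initial list L: one unit per word
        mwes = []                         # final set S

        def score(i):
            # LLR between the last word of unit i and the first word of unit i+1;
            # units touching function words get score 0 (the stoplist trick)
            a, b = units[i][-1], units[i + 1][0]
            if a in stopwords or b in stopwords:
                return 0.0
            return llr(a, b)

        while len(units) > 1:
            best = max(range(len(units) - 1), key=score)   # Select
            if score(best) < threshold:                    # termination condition (1)
                break
            units[best:best + 2] = [units[best] + units[best + 1]]  # Reduce
            mwes.append("".join(units[best]))
        return mwes

Because each iteration removes one unit from L, the loop runs at most N − 1 times, matching the bound stated above.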

However, HRA has a problem: it will extract substrings before extracting the whole string, even if the substrings only appear in that particular whole string, which we consider useless. To solve this problem, we use contextual features,

4 The whole sentence means "healthy tea for preventing hyperlipidemia", and we give the meaning of each Chinese word: 防 (preventing), 高 (hyper-), 脂 (-lipid-), 血 (-emia), 用 (for), 保健 (healthy), 茶 (tea).


namely contextual entropy (Luo and Sun 2003) and C-value (Frantzi and Ananiadou 1996), to filter out those substrings which exist only in few MWEs.

2.2 Automatic Extraction of MWE's Translation

In subsection 2.1, we described the algorithm to obtain MWEs; in this subsection, we introduce the procedure to find their translations in the parallel corpus.

For mining the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are listed as follows:

1. Run GIZA++5 to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs whose source language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al. 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.

5 http://www.fjoch.com/GIZA++.html

We select and annotate 33,000 phrase pairs randomly, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
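For illustration, a minimal sketch of such a perceptron classifier is given below; it assumes the ten translation and language features above have already been turned into numeric vectors (feature extraction itself is not shown), and the function names are ours:

    import numpy as np

    def train_perceptron(X, y, epochs=10):
        """X: (n, d) feature matrix; y: labels in {+1, -1} (accept/reject)."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                if y_i * (np.dot(w, x_i) + b) <= 0:  # misclassified: update
                    w += y_i * x_i
                    b += y_i
        return w, b

    def classify(w, b, x):
        # +1 means: accept the candidate as a correct translation of the MWE
        return 1 if np.dot(w, x) + b > 0 else -1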

3 Application of Bilingual MWEs

Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.

3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will be more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. The fact that additional resources can improve domain-specific SMT performance has been proved by many researchers (Wu et al. 2008; Eck et al. 2004). However, our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to represent this method.

3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) once pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source language phrase contains an MWE (as a substring) and the target language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect that this feature can help the SMT system to select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to represent this method.
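A bare-bones sketch of this tagging step might look as follows; bimwes is assumed to be a set of (source MWE, target translation) string pairs produced by the extraction of Section 2, and the substring test is a simplification of the matching described above:

    def add_mwe_feature(phrase_pairs, bimwes):
        """phrase_pairs: iterable of (src, tgt) strings; yields (src, tgt, feat)."""
        for src, tgt in phrase_pairs:
            feat = 0
            for mwe_src, mwe_tgt in bimwes:
                # the feature fires only when the source phrase contains the MWE
                # and the target phrase contains its translation
                if mwe_src in src and mwe_tgt in tgt:
                    feat = 1
                    break
            yield src, tgt, feat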

3.3 Additional Phrase Table of Bilingual MWEs

Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As to probability assignment, we just assign 1 to the four translation probabilities for simplicity. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in the input sentence during translation, the decoder searches all candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to represent this method.
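As an illustration, constructing this additional table could be as simple as the sketch below, assuming Moses' plain-text phrase-table layout with "|||"-separated fields; any further fields (e.g. alignments or counts) are omitted here for brevity:

    def write_mwe_phrase_table(bimwes, path):
        # bimwes: iterable of (source MWE, target translation) string pairs;
        # all four translation probabilities are set to 1, as described above
        with open(path, "w", encoding="utf-8") as f:
            for src, tgt in bimwes:
                f.write("%s ||| %s ||| 1 1 1 1\n" % (src, tgt))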

4 Experiments

4.1 Data

We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.

                        Chinese     English
Training  Sentences           120,355
          Words       4,688,873   4,737,843
Dev       Sentences             1,000
          Words          31,722      32,390
Test      Sentences             1,000
          Words          41,643      40,551

Table 1: Traditional medicine corpus

6 http://www.speech.sri.com/projects/srilm

For the chemical industry domain, Table 2 gives the detailed information of the data. In this experiment, 59,466 bilingual MWEs are extracted.

                        Chinese     English
Training  Sentences           120,856
          Words       4,532,503   4,311,682
Dev       Sentences             1,099
          Words          42,122      40,521
Test      Sentences             1,099
          Words          41,069      39,210

Table 2: Chemical industry corpus

We test translation quality on the test set and use the open source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al. 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform a statistical significance test between two translation results (Collins et al. 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.

4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in the log-linear model. To obtain the best translation e of the source sentence f, the log-linear model uses the following equation:

$\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)$  (1)

in which $h_m$ and $\lambda_m$ denote the m-th feature and weight. The weights are automatically tuned by minimum error rate training (Och 2002) on the development set.

7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm


4.3 Results

Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement of 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next with a 0.17 BLEU point improvement. And Baseline+BiMWE achieves slightly higher translation quality than the baseline system.

To our disappointment, however, none of these improvements are statistically significant. We manually examined the extracted bilingual MWEs which were labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assume x follows a Gaussian distribution; then the ranking score of a phrase pair (s, t) is defined by the following formula:

$\mathrm{Score}(s, t) = \log(\mathrm{LLR}(s, t)) \times \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$  (2)

Here the mean $\mu$ and variance $\sigma^2$ are estimated from the training set. After ranking by score, we select the top 50,000, 60,000 and 70,000 bilingual MWEs to perform the three methods mentioned in Section 3. The results are shown in Table 4.
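A small sketch of Equation (2) in code form follows; llr is assumed to be a precomputed LLR lookup, mu and sigma are the mean and standard deviation of the length ratio estimated from training data, and measuring phrase length with len() is a simplification:

    import math

    def rank_score(src, tgt, llr, mu, sigma):
        # Equation (2): log(LLR(s, t)) weighted by a Gaussian density over
        # the source/target length ratio x
        x = len(src) / len(tgt)
        gauss = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
        return math.log(llr(src, tgt)) * gauss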

From this table, we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better

Methods            50,000   60,000   70,000
Baseline                    0.2658
Baseline+BiMWE     0.2671   0.2686   0.2715
Baseline+Feat      0.2728   0.2734   0.2712
Baseline+NewBP     0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain

than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as that of (Lambert and Banchs 2005).

To verify the assumption that bilingual MWEs do indeed improve SMT performance, and not only on a particular domain, we also performed experiments on the chemical industry domain. Table 5 shows the results. From this table, we can see that these three methods improve the translation performance on the chemical industry domain as well as on the traditional medicine domain.

Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain

4.4 Discussion

In order to know in what respects our methods improve the performance of translation, we manually analyzed some test sentences and give some examples in this subsection.

(1) For the first example in table 6 ldquoIuml oacuterdquois aligned to other words and not correctly trans-lated in baseline system while it is aligned to cor-rect target phrase ldquodredging meridiansrdquo in Base-line+BiMWE since the bilingual MWE (ldquoIuml oacuterdquoldquodredging meridiansrdquo) has been added into train-ing corpus and then aligned by GIZA++

(2) For the second example in Table 6, "药茶" has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as the translation of "药茶", while the Baseline+Feat system chooses "medicated tea", because the additional feature gives a high probability to the correct translation "medicated tea".


Src       [Chinese source sentence]
Ref       the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline  the food has effects in tonifying blood, dispelling cold, promoting salivation and water and tranquilizing and tonic effects and making nutritious health
+BiMWE    the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src       [Chinese source sentence]
Ref       the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline  may also be made into tablet, pill, powder, tea or injection
+Feat     may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples


5 Conclusion and Future Works

This paper presented the LLR-based Hierarchical Reducing Algorithm to automatically extract bilingual MWEs and investigated the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to represent whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are rough and coarse. More complicated approaches should be taken into account when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in a phrase-based SMT system, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.

Acknowledgments

This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.

References

Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward and Weijing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.



Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp

Abstract

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as a labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then the best label assignment in the bunsetsu is determined from the bottom up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

1 Introduction

Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used for several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik 1995) and Conditional Random Fields (CRFs) (Lafferty et al. 2001) with a hand-annotated corpus (Krishnan and Manning 2006; Kazama and Torisawa 2008; Sasano and Kurohashi 2008; Fukushima et al. 2008; Nakano and Hirai 2004; Masayuki and Matsumoto 2003).

In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as a basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.

When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (return home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai 2004) and Sasano et al. (Sasano and Kurohashi 2008) utilized information from the head of the bunsetsu1. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.

This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method.

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.


(1) kikoku-shita    Kazama-san-wa
    return home     Mr. Kazama TOP
    'Mr. Kazama, who returned home'

(2) hatsudou-shita  shinyou-kumiai-kyusai-ginkou-no    setsuritsu-mo
    invoke          credit union relief bank GEN       establishment
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to                  gaikoku-jin-touroku-hou-ihan-no            utagai-de
    private document falsification and  foreigner registration law violation GEN   suspicion INS
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences

Different from previous work, this method regards a compound noun (chunk) as a labeling unit and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, within the bunsetsu, the best label assignment is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.

This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on methods based on sequential labeling. Section 3 describes an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives an experimental result.

2 Related Work

In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX Committee 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.

class          example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21st century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples

NER methods are divided into two approaches: rule-based approaches and machine learning approaches. According to previous work, machine learning approaches achieve better performance than rule-based approaches.

In general, a machine learning method is formalized as a sequential labeling problem. This problem first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al. 1998), S is assigned to an NE composed of one morpheme; B, I, E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.2 The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
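As an illustration of this labeling scheme, the following sketch converts NE spans over a morpheme sequence into S/B/I/E/O labels; the span representation (start, end, class) is a hypothetical input format of our own choosing:

    def to_se_labels(n_morphemes, ne_spans):
        # ne_spans: list of (start, end, ne_class) with end exclusive
        labels = ["O"] * n_morphemes
        for start, end, ne_class in ne_spans:
            if end - start == 1:
                labels[start] = "S-" + ne_class        # one-morpheme NE
            else:
                labels[start] = "B-" + ne_class        # beginning
                for i in range(start + 1, end - 1):
                    labels[i] = "I-" + ne_class        # middle
                labels[end - 1] = "E-" + ne_class      # end
        return labels

    # e.g. a two-morpheme PERSON followed by two non-NE morphemes:
    # to_se_labels(4, [(0, 2, "PERSON")]) -> ['B-PERSON', 'E-PERSON', 'O', 'O']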

The model for label estimation is learned by machine learning. The following features are generally utilized: characters, type of

2 Besides, there are the IOB1 and IOB2 algorithms using only I, O, B, and the IOE1 and IOE2 algorithms using only I, O, E (Kim and Veenstra 1999).


[Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin"): (a) the initial state, in which every chunk is scored for every label (e.g., "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Meijin" OTHERe 0.245); (b) the final output, in which the assignment "Habu-Yoshiharu" (PERSON, 0.438) + "Meijin" (OTHERe, 0.245) is chosen bottom-up in the analysis direction]

character, POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVMs or CRFs have been proposed.

Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word subclass of the NE in the analyzing direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analyzing direction, and the head of each bunsetsu (Nakano and Hirai 2004). Sasano et al. utilized a cache feature, coreference results, a syntactic feature, and a caseframe feature as structural features (Sasano and Kurohashi 2008).

Some work acquires knowledge from large unannotated corpora and applies it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of dependency relation pairs (Kazama and Torisawa 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al. 2008).

In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.

3 Overview of Proposed Method

Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are analyzed using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the label assignment "Habu-Yoshiharu" (PERSON) and the label assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in a general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.

The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.


[Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin": "Habu-Yoshiharu" is PERSON, "Meijin" is OTHERe, and the remaining chunks are invalid]

4 Model Learning

4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions located across multiple bunsetsus tend to be an NE:

• expressions enclosed in parentheses (e.g., "Himeyuri-no tou" (The tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.3

For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.

Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned the correct label from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (as shown in Table 1) and five classes OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

The chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled as OTHERb, OTHERi, OTHERe, respectively.4

3 As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4 Each OTHER is assigned to the longest chunk that satisfies its condition in a chunk.

1. the number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type5

4. the combination of the character types of adjoining morphemes
   - for the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word subclass, and several features provided by the morphological analyzer JUMAN

6. several features6 provided by the parser KNP

7. the string of the morpheme in the chunk

8. IPADIC7 feature
   - if the string of the chunk is registered in the following categories of IPADIC: "person", "location", "organization", and "general", this feature fires

9. Wikipedia feature
   - if the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")

10. cache feature
    - when the same string as the chunk appears in the preceding context, the label of the preceding chunk is used for the feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of the predicate/noun case frame (Sasano and Kurohashi 2008)
    - for example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot; hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
    - when the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model

A chunk that is neither any of the eight NE classes nor the above four OTHERs is labeled as invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned

5 The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6 When a morpheme has an ambiguity, all the corresponding features fire.

7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja


to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of a morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features of each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1 + exp(−βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.
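A minimal sketch of this transformation, assuming the raw one-vs-rest SVM outputs for one chunk are already available, is given below (the function name and dict representation are our own):

    import math

    def label_distribution(svm_outputs, beta=1.0):
        # svm_outputs: dict mapping each of the 13 labels to its one-vs-rest
        # SVM output x; each is squashed by 1/(1+exp(-beta*x)) and the values
        # are normalized to sum to one over the chunk's labels
        squashed = {lab: 1.0 / (1.0 + math.exp(-beta * x)) for lab, x in svm_outputs.items()}
        total = sum(squashed.values())
        return {lab: v / total for lab, v in squashed.items()}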

The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively higher score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk where the label invalid has the highest score, the label that has the second highest score is adopted.

4.2 Label Comparison Model

This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up on either side of "vs." (the left one and the right one are called the first set and the second set, respectively). When the first set is correct, the example is positive; otherwise, the example is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for the model learning.

Then, features are assigned to each example. The feature (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used

[Figure 4: Assignment of positive/negative examples.
positive: +1 Habu-Yoshiharu (PSN) vs. Habu (PSN) + Yoshiharu (MNY); +1 Habu-Yoshiharu (PSN) + Meijin (OTHERe) vs. Habu (PSN) + Yoshiharu (MONEY) + Meijin (OTHERe).
negative: -1 Habu-Yoshiharu-Meijin (ORG) vs. Habu-Yoshiharu (PSN) + Meijin (OTHERe)]

for each label, as estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remainder.

Figure 5 illustrates the feature for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using an SVM. Note that only the fact that "Habu-Yoshiharu" is a PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin", we cannot determine which one is correct. Therefore, such examples cannot be used for the model learning.

5 Analysis

First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined from the bottom up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell "Yoshiharu-Meijin", the model compares the


chunk    Habu-Yoshiharu         Habu, Yoshiharu
label    PERSON                 PERSON, MONEY
vector   V11, 0, 0, 0, 0        V21, V22, 0, 0, 0

Figure 5: An example of the feature for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu", and V11, V21, V22, and 0 are vectors of dimension 13)

[Figure 6: The label comparison model: (a) label assignment for the cell "Habu-Yoshiharu", comparing the assignment "Habu-Yoshiharu" (PERSON, 0.438) against "Habu" (PERSON, 0.111) + "Yoshiharu"; (b) label assignment for the cell "Habu-Yoshiharu-Meijin", comparing the assignments "Habu-Yoshiharu-Meijin" (ORGANIZATION, 0.083), "Habu-Yoshiharu" + "Meijin" (PSN 0.438 + OTHERe 0.245), and "Habu" + "Yoshiharu" + "Meijin"]

label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparing step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
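A much-simplified sketch of this bottom-up search is shown below; estimate(chunk) and prefer(a, b) are stand-ins for the two learned models, and the real method's scoring details and OTHER*-adjacency constraints are omitted:

    def best_assignment(morphemes, estimate, prefer):
        # estimate(chunk) -> best (non-invalid) label for a chunk string;
        # prefer(a, b) -> True if assignment a beats b under the comparison model
        n = len(morphemes)
        cell = {}  # (i, j) -> best list of (chunk, label) covering morphemes i..j
        for length in range(1, n + 1):
            for i in range(n - length + 1):
                j = i + length - 1
                chunk = "".join(morphemes[i:j + 1])
                best = [(chunk, estimate(chunk))]    # the unsplit chunk itself
                for k in range(i, j):                # every way to split the span
                    cand = cell[(i, k)] + cell[(k + 1, j)]
                    if prefer(cand, best):           # pairwise comparison
                        best = cand
                cell[(i, j)] = best
        return cell[(0, n - 1)]                      # assignment for the bunsetsu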

6 Experiment

To demonstrate the effectiveness of our proposed method, we conducted an experiment on CRL NE data. In this data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were used for neither

Figure 7: 5-fold cross validation. The corpus (CRL NE data) is divided into five parts (a)-(e). For the analysis of part (a), the label estimation model 1 is learned from parts (b)-(e); each of the label estimation models 2b-2e is learned from three of those parts and applied to the remaining part to obtain features for the label comparison model, which is then learned from the collected features.

the model learning8 nor the evaluation.

We performed 5-fold cross validation following

previous work. Different from previous work, our method has to learn the SVM models twice. Therefore the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)-(e). Then the label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and features are obtained by applying it to part (c). The same holds for parts

8 Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.


               Recall               Precision
ORGANIZATION   81.83 (3008/3676)    88.37 (3008/3404)
PERSON         90.05 (3458/3840)    93.87 (3458/3684)
LOCATION       91.38 (4992/5463)    92.44 (4992/5400)
ARTIFACT       46.72 ( 349/ 747)    74.89 ( 349/ 466)
DATE           93.27 (3327/3567)    93.12 (3327/3573)
TIME           88.25 ( 443/ 502)    90.59 ( 443/ 489)
MONEY          93.85 ( 366/ 390)    97.60 ( 366/ 375)
PERCENT        95.33 ( 469/ 492)    95.91 ( 469/ 489)
ALL-SLOT       87.87                91.79
F-measure      89.79

Table 3: Experimental results.

(d) and (e). Then the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.
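The nested splitting can be sketched as follows, with learn_estimation, learn_comparison, extract_features, and analyze as hypothetical stand-ins for the steps of Figure 7:

def two_stage_cross_validation(parts, learn_estimation, learn_comparison,
                               extract_features, analyze):
    # parts: the corpus divided into five parts, e.g. [a, b, c, d, e].
    results = []
    for test_part in parts:                               # e.g. part (a)
        train = [p for p in parts if p is not test_part]  # parts (b)-(e)
        model1 = learn_estimation(train)                  # estimation model 1
        features = []
        for held_out in train:                            # e.g. part (b)
            inner = [p for p in train if p is not held_out]   # (c)-(e)
            model2 = learn_estimation(inner)              # model 2b, 2c, ...
            features += extract_features(model2, held_out)
        comparison = learn_comparison(features)           # comparison model
        results.append(analyze(model1, comparison, test_part))
    return results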

In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.
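The paper does not name its SVM implementation; for reproduction, an equivalent setup in scikit-learn might look as follows (a sketch, not the authors' code):

import numpy as np
from sklearn.svm import SVC

# Polynomial kernel of degree 2, as used for both SVM models.
model = SVC(kernel="poly", degree=2)
# model.fit(X, y)                      # features from Sections 4.1 / 4.2
# margin = model.decision_function(X)  # signed distance from the hyperplane

def sigmoid_score(margin, beta=1.0):
    # Maps the SVM output to a (0, 1) score; beta = 1 in the experiments.
    return 1.0 / (1.0 + np.exp(-beta * margin))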

Table 3 shows the experimental results. The F-measure over all NE classes is 89.79.

7 Discussion

7.1 Comparison with Previous Work

Table 4 presents the comparison with previous work; our method outperformed the previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") using patterns over a large Web corpus, and Sasano et al. utilized the results of coreference resolution as features for model learning. Therefore, by incorporating such knowledge and analysis results, the performance of our method could be improved further.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about

9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html

the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, our method can utilize the information of "hou" in estimating the label of the chunk "gaikoku-jin-touroku-hou".

7.2 Error Analysis

There were some errors in analyzing Katakana words. In the following example, although the correct analysis labels "Batistuta" as PERSON, the system labeled it as OTHERs.

(4) Italy-de    katsuyaku-suru   Batistuta-wo    kuwaeta   Argentine
    Italy-LOC   active           Batistuta-ACC   call      Argentine

    'Argentine called Batistuta, who was active in Italy'

There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were some errors in applying the label comparison model even when the analysis of each chunk was correct. For example, in the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis labels "HongKong-seityou" as ORGANIZATION. As shown in Figure 8(b), the system incorrectly labeled "HongKong" as LOCATION: although in the initial state of Figure 8(a) "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning to adjust the value of β in the sigmoid function and to refine the


                                 F1     analysis unit   distinctive features
(Fukushima et al., 2008)        89.29   character       Web
(Kazama and Torisawa, 2008)     88.93   character       Wikipedia, Web
(Sasano and Kurohashi, 2008)    89.40   morpheme        structural information
(Nakano and Hirai, 2004)        89.03   character       bunsetsu feature
(Masayuki and Matsumoto, 2003)  87.21   character
(Isozaki and Kazawa, 2003)      86.77   morpheme
proposed method                 89.79   compound noun   Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on CRL NE data using cross validation).

(a) initial state: HongKong/LOCATION (0.266), seityou/OTHERe (0.184), HongKong-seityou/ORGANIZATION (0.205)

(b) final output: the cell for "HongKong-seityou" is overwritten with "HongKong + seityou" (LOC+OTHERe, 0.266 + 0.184)

Figure 8: An example of the error in the label comparison model.

features for the label comparison model.

8 Conclusion

This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then determines the best label assignment by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79.

We are planning to integrate this method with the syntactic and case analysis method of Kawahara and Kurohashi (2007), performing syntactic, case, and Named Entity analysis simultaneously to improve the overall accuracy.

References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese)

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of Information Processing Society of Japan, 44(3):970–979. (in Japanese)

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304–311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407–415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173–179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of COLING/ACL 2006, pages 1121–1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), pages 282–289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8–15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of Information Processing Society of Japan, 45(3):934–941. (in Japanese)

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607–612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171–178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.


Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP

Abbreviation Generation for Japanese Multi-Word Expressions

[pages 63–70: the body of this paper did not survive text extraction; only scattered glyphs and a few Japanese fragments, such as the abbreviation pair ファミリーレストラン → ファミレス, are recoverable]

Author Index

Bhutada, Pravin, 17
Cao, Jie, 47
Caseli, Helena, 1
Diab, Mona, 17
Finatto, Maria Jose, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55
Hoang, Hung Huu, 31
Huang, Yun, 47
Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55
Liu, Qun, 47
Lu, Yajuan, 47
Machado, Andre, 1
Ren, Zhixiang, 47
Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63
Villavicencio, Aline, 1
Wakaki, Hiromi, 63
Zarrieß, Sina, 23

  • Workshop Program
  • Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
  • Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
  • Verb Noun Construction MWE Token Classification
  • Exploiting Translational Correspondences for Pattern-Independent MWE Identification
  • A re-examination of lexical association measures
  • Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
  • Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
  • Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
  • Abbreviation Generation for Japanese Multi-Word Expressions