ACL-IJCNLP 2009
MWE 2009
2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation, Applications
Proceedings of the Workshop
6 August 2009, Suntec, Singapore

Production and Manufacturing by World Scientific Publishing Co Pte Ltd, 5 Toh Tuck Link, Singapore 596224

©2009 The Association for Computational Linguistics and The Asian Federation of Natural Language Processing

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360, USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org

ISBN 978-1-932432-60-2 / 1-932432-60-4
Introduction
The ACL 2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE'09) took place on August 6, 2009 in Singapore, immediately following the annual meeting of the Association for Computational Linguistics (ACL). This is the fifth time this workshop has been held in conjunction with ACL, following the meetings in 2003, 2004, 2006 and 2007.
The workshop focused on Multi-Word Expressions (MWEs), which represent an indispensable part of natural languages and appear steadily on a daily basis, both novel and already existing but paraphrased, which makes them important for many natural language applications. Unfortunately, while easily mastered by native speakers, MWEs are often non-compositional, which poses a major challenge for both foreign language learners and automatic analysis.
The growing interest in MWEs in the NLP community has led to many specialized workshops held every year since 2001 in conjunction with ACL, EACL and LREC; there have also been two recent special issues on MWEs published by leading journals: the International Journal of Language Resources and Evaluation and the Journal of Computer Speech and Language.
As a result of the overall progress in the field, the time has come to move from basic preliminary research to actual applications in real-world NLP tasks. Thus, in MWE'09 we were interested in the overall process of dealing with MWEs, asking for original research on the following four fundamental topics:
Identification. Identifying MWEs in free text is a very challenging problem. Due to the variability of expression, it does not suffice to collect and use a static list of known MWEs; complex rules and machine learning are typically needed as well.

Interpretation. Semantically interpreting MWEs is a central issue. For some kinds of MWEs, e.g., noun compounds, it could mean specifying their semantics using a static inventory of semantic relations, e.g., WordNet-derived. In other cases, an MWE's semantics could be expressible by a suitable paraphrase.

Disambiguation. Most MWEs are ambiguous in various ways. A typical disambiguation task is to determine whether an MWE is used non-compositionally (i.e., figuratively) or compositionally (i.e., literally) in a particular context.

Applications. Identifying MWEs in context and understanding their syntax and semantics is important for many natural language applications, including but not limited to question answering, machine translation, information retrieval, information extraction and textual entailment. Still, despite the growing research interest, there are not enough successful applications in real NLP problems, which we believe is the key for the advancement of the field.
Of course, the above topics largely overlap. For example, identification can require disambiguating between literal and idiomatic uses, since MWEs are typically required to be non-compositional by definition. Similarly, interpreting three-word noun compounds like morning flight ticket and plastic water bottle requires disambiguation between a left and a right syntactic structure, while interpreting two-word compounds like English teacher requires disambiguating between (a) 'teacher who teaches English' and (b) 'teacher coming from England (who could teach any subject, e.g., math)'.
We received 18 submissions and, given our limited capacity as a one-day workshop, we were only able to accept 9 full papers for oral presentation: an acceptance rate of 50%.
We would like to thank the members of the Program Committee for their timely reviews. We would also like to thank the authors for their valuable contributions.
Dimitra Anastasiou, Chikara Hashimoto, Preslav Nakov and Su Nam Kim
Co-Organizers
Organizers
Dimitra Anastasiou, Localisation Research Centre, Limerick University (Ireland)
Chikara Hashimoto, National Institute of Information and Communications Technology (Japan)
Preslav Nakov, National University of Singapore (Singapore)
Su Nam Kim, University of Melbourne (Australia)
Program Committee
Inaki Alegria, University of the Basque Country (Spain)
Timothy Baldwin, University of Melbourne (Australia)
Colin Bannard, Max Planck Institute (Germany)
Francis Bond, National Institute of Information and Communications Technology (Japan)
Gael Dias, Beira Interior University (Portugal)
Ulrich Heid, Stuttgart University (Germany)
Stefan Evert, University of Osnabruck (Germany)
Afsaneh Fazly, University of Toronto (Canada)
Nicole Gregoire, University of Utrecht (The Netherlands)
Roxana Girju, University of Illinois at Urbana-Champaign (USA)
Kyo Kageura, University of Tokyo (Japan)
Brigitte Krenn, Austrian Research Institute for Artificial Intelligence (Austria)
Eric Laporte, University of Marne-la-Vallee (France)
Rosamund Moon, University of Birmingham (UK)
Diana McCarthy, University of Sussex (UK)
Jan Odijk, University of Utrecht (The Netherlands)
Stephan Oepen, University of Oslo (Norway)
Darren Pearce, London Knowledge Lab (UK)
Pavel Pecina, Charles University (Czech Republic)
Scott Piao, University of Manchester (UK)
Violeta Seretan, University of Geneva (Switzerland)
Stan Szpakowicz, University of Ottawa (Canada)
Beata Trawinski, University of Tubingen (Germany)
Peter Turney, National Research Council of Canada (Canada)
Kiyoko Uchiyama, Keio University (Japan)
Begona Villada Moiron, University of Groningen (The Netherlands)
Aline Villavicencio, Federal University of Rio Grande do Sul (Brazil)
Table of Contents
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria Jose Finatto ... 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan ... 9

Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada ... 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn ... 23

A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan ... 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha ... 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang ... 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi ... 55

Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita ... 63
Workshop Program
Friday August 6 2009
8:30–8:45 Welcome and Introduction to the Workshop

Session 1 (8:45–10:00): MWE Identification and Disambiguation

8:45–9:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, Andre Machado and Maria Jose Finatto

9:10–9:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

9:35–10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00–10:30 BREAK
Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55–11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45–13:50 LUNCH
Session 3 (13:50–15:30): Applications

13:50–14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lu, Jie Cao, Qun Liu and Yun Huang

14:15–14:40 Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00 BREAK

16:00–17:00 General Discussion

17:00–17:15 Closing Remarks
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 1–8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, Andre Machado♣, Maria Jose Finatto♥

♦Department of Computer Science, Federal University of Sao Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br
Abstract
Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. Particularly, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.
1 Introduction
A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine) and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).
Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.
In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.
The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and section 6 finishes this paper with some conclusions and proposals for future work.
2 Related Work
The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter) and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.
A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g., Zhang et al. (2006); Villavicencio et al. (2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g., Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g., Villada Moirón and Tiedemann (2006); Caseli et al. (2009)).
As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g., ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).
Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that for MWE identification the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature over the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).
Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).
Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.
Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates regardless of whether they are translated as one or more target words.
3 The Corpus and Reference Lists
The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.¹ The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students on the domain of pediatrics texts.
The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g., removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms, with 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
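The containment filter described above can be sketched as follows; the function name and the toy list are illustrative assumptions, not the actual glossary-building code.

```python
def drop_subsumed(ngrams):
    """Keep an n-gram only if it is not contained, as a whole-word
    subsequence, in a longer n-gram from the same list."""
    kept = []
    for ng in sorted(ngrams, key=len, reverse=True):
        # Padding with spaces makes the containment test word-based,
        # so 'leite materno' does not match inside 'aleitamento materno'.
        if not any(f" {ng} " in f" {k} " for k in kept):
            kept.append(ng)
    return sorted(kept)

# 'aleitamento materno' is dropped because it is contained in
# 'aleitamento materno exclusivo', mirroring the example above.
print(drop_subsumed(["aleitamento materno",
                     "aleitamento materno exclusivo",
                     "leite materno"]))
```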
4 Statistically-Driven and Alignment-Based Methods
4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language and type independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992) and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

¹Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim
From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
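As an illustration of how such a ranking works, a minimal PMI scorer for bigram candidates might look as follows. This is a sketch with an invented toy corpus, not the Ngram Statistics Package implementation used in the paper.

```python
from collections import Counter
from math import log2

def pmi_rank(tokens, min_freq=2):
    """Rank bigram candidates by pointwise mutual information:
    PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), with a frequency
    cut-off discarding rare, unreliable pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (x, y), f in bigrams.items():
        if f < min_freq:
            continue
        p_xy = f / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = log2(p_xy / (p_x * p_y))
    return sorted(scores, key=scores.get, reverse=True)

corpus = ("o aleitamento materno exclusivo reduz o risco e "
          "o aleitamento materno prolongado protege o aleitamento materno").split()
# The cohesive pair outranks the frequent-but-loose article pair.
print(pmi_rank(corpus))  # [('aleitamento', 'materno'), ('o', 'aleitamento')]
```

In practice the paper ranks candidates under each measure separately and then averages the rank positions to obtain the combined score.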
4.2 Alignment-Based method
The second of the MWE extraction approaches to be investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts (a text written in one (source) language and its translation into another (target) language) is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that: (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
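A minimal sketch of this extraction criterion is given below. The data structures are assumptions for illustration (a hand-built alignment dictionary); the paper's pipeline works on GIZA++ output for whole parallel corpora.

```python
def mwe_candidates(src_tokens, links):
    """Collect source n-grams (n >= 2) whose words all share the same
    target word(s) in a word alignment.

    `links` maps each source position to the set of target positions
    it is aligned to; consecutive source words with an identical,
    non-empty target set are joined into one candidate (an n:m link,
    n >= 2, m >= 1)."""
    candidates = []
    i = 0
    while i < len(src_tokens):
        j = i + 1
        while (j < len(src_tokens) and links.get(j) and
               links.get(j) == links.get(i)):
            j += 1
        if j - i >= 2:
            candidates.append(" ".join(src_tokens[i:j]))
        i = j
    return candidates

# Toy sentence: 'aleitamento materno' is aligned as a unit to
# target word 0 ('breastfeeding'), a 2:1 alignment.
src = ["o", "aleitamento", "materno", "exclusivo"]
links = {0: {1}, 1: {0}, 2: {0}, 3: {2}}
print(mwe_candidates(src, links))  # ['aleitamento materno']
```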
Thus, notice that the alignment-based MWE extraction method does not rely on conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision at the expense of recall.

It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++² (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.
In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as a gold standard.
To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence and word aligned and Part-of-Speech (POS) tagged. For these preprocessing steps we used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium³ (Armentano-Oller et al., 2006).
²GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html

³Apertium is an open-source machine translation engine and toolbox available at http://www.apertium.org
4
From the preprocessed corpus, the MWE candidates are extracted as those in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that: (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) have a frequency below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used by Caseli et al. (2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as Caseli et al. (2009) plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral, and (b) a minimum frequency threshold of 2.
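An F3-style filter can be sketched as below. The tag names and the threshold are assumptions for illustration, not the exact Apertium tagset or pattern lists used in the paper.

```python
# Hypothetical closed-class tags that may not open or close a candidate
# under an F3-like filter (determiner, adverb, conjunction, preposition,
# verb, pronoun, numeral).
BAD_EDGE_TAGS = {"det", "adv", "cnjcoo", "pr", "vblex", "prn", "num"}

def keep(candidate, freq, min_freq=2):
    """candidate: list of (word, pos) pairs for one MWE candidate.
    Discard it if it begins or ends with a closed-class tag, or if
    its corpus frequency falls below the threshold."""
    first_tag, last_tag = candidate[0][1], candidate[-1][1]
    if first_tag in BAD_EDGE_TAGS or last_tag in BAD_EDGE_TAGS:
        return False
    return freq >= min_freq

print(keep([("aleitamento", "n"), ("materno", "adj")], freq=202))  # True
print(keep([("de", "pr"), ("couro", "n")], freq=10))               # False
print(keep([("faixa", "n"), ("etaria", "adj")], freq=1))           # False
```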
5 Experiments and Results
Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as, for instance, jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).
PMI                              alignment-based
Online Mendelian Inheritance     faixa etaria
Beta Technology Incorporated     aleitamento materno
Lange Beta Technology            estados unidos
Oxido Nitrico Inalatorio         hipertensao arterial
jogar video game                 leite materno
e um de                          couro cabeludo
e a do                           bloqueio lactíferos
se que de                        emocional anatomia
e a da                           neonato a termo
e de nao                         duplas maes bebes

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and the alignment-based approach

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure (2 * precision * recall / (precision + recall)) figures for the association measures, using all the candidates (on the first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.

pt MWE candidates        PMI       MI
# proposed bigrams       64,839    64,839
# correct MWEs           1,403     1,403
precision                2.16%     2.16%
recall                   98.73%    98.73%
F                        4.23%     4.23%
# proposed trigrams      54,548    54,548
# correct MWEs           701       701
precision                1.29%     1.29%
recall                   96.03%    96.03%
F                        2.55%     2.55%
# proposed bigrams       1,421     1,421
# correct MWEs           155       261
precision                10.91%    18.37%
recall                   10.91%    18.37%
F                        10.91%    18.37%
# proposed trigrams      730       730
# correct MWEs           44        20
precision                6.03%     2.74%
recall                   6.03%     2.74%
F                        6.03%     2.74%

Table 2: Evaluation of MWE candidates: PMI and MI
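The precision, recall and F-measure figures can be reproduced with a few lines; the numbers plugged in below are from the bigram evaluation against the 1,421-entry bigram gold standard, as a worked check of the formula.

```python
def prf(proposed, correct, gold_size):
    """Precision, recall and F-measure as defined in the text:
    precision = correct / proposed, recall = correct / gold_size,
    F = 2 * P * R / (P + R)."""
    p = correct / proposed
    r = correct / gold_size
    f = 2 * p * r / (p + r)
    return p, r, f

# 64,839 proposed bigram candidates, 1,403 of them among the
# 1,421 gold-standard bigrams:
p, r, f = prf(64839, 1403, 1421)
print(f"{100*p:.2f} {100*r:.2f} {100*f:.2f}")  # 2.16 98.73 4.23
```

High recall with very low precision, as the text notes for the unfiltered statistical candidates.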
On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered following the three filters described in section 4.2: F1, F2 and F3.
To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic comparison we considered the final lists of MWE candidates generated by each filter, shown in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

pt MWE candidates            F1       F2       F3
# filtered by POS patterns   24,996   21,544   32,644
# filtered by frequency      9,012    11,855   1,442
final set                    269      878      191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates        F1       F2       F3
# proposed bigrams       250      754      169
# correct MWEs           48       95       65
precision                19.20%   12.60%   38.46%
recall                   3.38%    6.69%    4.57%
F                        5.75%    8.74%    8.18%
# proposed trigrams      19       110      20
# correct MWEs           1        9        4
precision                5.26%    8.18%    20.00%
recall                   0.14%    1.23%    0.55%
F                        0.27%    2.14%    1.07%
# proposed bi+trigrams   269      864      189
# correct MWEs           49       104      69
precision                18.22%   12.04%   36.51%
recall                   2.28%    4.83%    3.21%
F                        4.05%    6.90%    5.90%

Table 4: Evaluation of MWE candidates
The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doenca arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not being considered at the moment.
After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated by Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in section 4.2) were manually evaluated.
From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total amount) were found among the bigrams or trigrams in the Glossary (see Table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of it being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was: (1) a proper name, or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.
The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates good agreement.
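The agreement figure can be reproduced with a small kappa computation; the two label lists below are illustrative, not the judges' actual annotations:

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each judge's marginal label distribution.
    expected = sum((labels_a.count(v) / n) * (labels_b.count(v) / n)
                   for v in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Toy judgments: 8 candidates, 2 disagreements (positions 3 and 7).
judge_a = [True, True, True, True, False, False, False, False]
judge_b = [True, True, True, False, False, False, False, True]
kappa = cohen_kappa(judge_a, judge_b)
```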
In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not computing it against a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, of the 191 MWE candidates, 126 (65.97%) were classified as true by both judges, and 145 (75.92%) by at least one of them.
6 Conclusions and Future Work
MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.
In this paper, we investigated the identification of MWEs in a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained
show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.
Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of candidates considered true MWEs by both judges or by at least one of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to note that in this manual analysis the human experts classified the MWEs as true regardless of whether they are Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.
In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.
Acknowledgments
We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for the financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.
References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50-59. Springer-Verlag, May.
Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98-104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047-2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370-381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249-254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49-56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117-134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450-466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28.
Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165-178. Oxford University Press, Oxford.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534-559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459-484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33-40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440-447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1-7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17-24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1-15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034-1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415-432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36-44, Sydney, Australia, July. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 9-16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
Abstract
We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.
1 Introduction
Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also serve as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarizers (Davanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).
In the past, various attempts have been made to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Cornacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.
Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g. (Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007)), most of them do not effectively deal with the various forms keyphrases take, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.
Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. That leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, the supervised approach involves a large amount of manual labor, reducing its utility for real-world applications. Hence, an unsupervised approach is needed in order to minimize manual work and to encourage utilization. Assessing the reliability and re-usability of features for the unsupervised approach is therefore a worthwhile study, in order to set up a tentative guideline for applications.
This paper targets these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases, with the purpose of proposing a candidate selection method that improves the coverage of candidates occurring in various forms. Our second contribution re-examines existing keyphrase extraction features
reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.
In the remaining sections, we describe an overview of related work in Section 2, our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. Then we evaluate our proposals, discuss the outcomes, and conclude our work in Sections 7, 8 and 9, respectively.
2 Related Work
The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Cornacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e. the relationship between the document and its keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and, to a lesser extent, (2) keyphrase cohesion features (i.e. relationships among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e. relationships among the components in a keyphrase) (Park et al., 2004).
The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TF-IDF (i.e. term frequency, inverse document frequency) and first occurrence in the document. TF-IDF measures document cohesion, and the first occurrence implies the importance of the abstract or introduction, indicating that keyphrases have locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequences. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Cornacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. The
Textract system (Park et al., 2004) also ranks candidate keyphrases by its judgment of the keyphrases' degree of domain specificity, based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases from single documents automatically, utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of other documents in the same cluster.
3 Keyphrase Analysis
In previous studies, KEA employed the indexing words as candidates, whereas others, such as (Park et al., 2004; Nguyen and Kan, 2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves, by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for details of the data.
Syntactically, keyphrases can be formed by either simplex nouns (e.g. algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g. mobile network, fast computing, partially observable Markov decision process), though with few incidences. They can also incorporate a prepositional phrase (PP) (e.g. quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e. NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g. incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.
However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we found that keyphrases also occur as simple conjunctions (e.g. search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g. history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g. pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g. belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

Criteria      Rules
Frequency     (Rule1) Frequency heuristic, i.e. frequency ≥ 2 for simplex words vs. frequency ≥ 1 for NPs
Length        (Rule2) Length heuristic, i.e. up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form (e.g. synchronous concurrent program vs. model of multiagent interaction)
Alternation   (Rule3) of-PP form alternation (e.g. number of sensor = sensor number, history of past encounter = past encounter history)
              (Rule4) Possessive alternation (e.g. agent's goal = goal of agent, security's value = value of security)
Extraction    (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS) (e.g. complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
              (Rule6) Simplex Word/NP IN Simplex Word/NP (e.g. quality of service, sensitivity of VOIP traffic (VOIP traffic extracted), simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules
A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when they occur in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g. distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g. agent's goal = goal of agent, security's value = value of security).
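Such alternations can be generated mechanically. A naive sketch (the function names are ours; only the first preposition or possessive marker is handled):

```python
def of_pp_alternation(phrase, prep="of"):
    """Turn 'X of Y' into 'Y X', e.g. 'quality of service' -> 'service quality'."""
    head, sep, modifier = phrase.partition(f" {prep} ")
    return f"{modifier} {head}" if sep else None

def possessive_alternation(phrase):
    """Turn "X's Y" into 'Y of X', e.g. "agent's goal" -> 'goal of agent'."""
    owner, sep, owned = phrase.partition("'s ")
    return f"{owned} of {owner}" if sep else None
```

Both functions return None when the phrase does not exhibit the pattern, so they can double as pattern tests during candidate generation.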
4 Candidate Selection
We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is closely related to term extraction, since the top-N ranked terms become the keyphrases of documents. As noted above, in previous studies KEA employed the indexing words as candidates, whereas others, such as (Park et al., 2004; Nguyen and Kan, 2007), generated handcrafted regular expression rules, but none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe our keyphrase analysis in detail.
In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e. either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.
Another observation is that keyphrases containing more than three words are rare (i.e. 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g. performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (≥ 2) and NPs (≥ 1). Note that 30% of NPs occurred only once in our data.
Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule 6 for NPs in of-PP form broadens the coverage of
possible candidates: i.e. given an NP in of-PP form, not only do we collect simplex word(s), but we also extract the non-of-PP form of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
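Rule 5 can be approximated by scanning POS-tagged tokens for maximal adjective/noun runs that end in a noun. A sketch assuming Penn Treebank tags from an upstream tagger; unlike the paper's full rule set, it emits only the maximal match and omits Rule 6's of-PP handling:

```python
NOUNS = {"NN", "NNS", "NNP", "NNPS"}
ADJS = {"JJ", "JJR", "JJS"}

def extract_np_candidates(tagged):
    """Rule 5 of Table 1: (JJ*|NN*)* ending in a noun, over (token, POS) pairs."""
    candidates, run = [], []
    for token, pos in tagged + [("", "END")]:  # sentinel closes the last run
        if pos in NOUNS or pos in ADJS:
            run.append((token, pos))
            continue
        # Close the current run: trim trailing adjectives so it ends in a noun.
        while run and run[-1][1] in ADJS:
            run.pop()
        if run:
            candidates.append(" ".join(t for t, _ in run))
        run = []
    return candidates

# Toy tagged sentence (any POS tagger's output would work here):
tagged = [("efficient", "JJ"), ("grid", "NN"), ("computing", "NN"),
          ("is", "VBZ"), ("fast", "JJ")]
```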
5 Feature Engineering
With wider candidate selection criteria, the onus of filtering out irrelevant candidates becomes the responsibility of careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S, U).
5.1 Document Cohesion
Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TF-IDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TF-IDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.
F1: TF-IDF (S, U). TF-IDF indicates document cohesion by looking at the frequency of terms in the documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF over substrings as well as exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e. we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TF-IDF employed as features in our system:
• (F1a) TF-IDF

• (F1b) TF including counts of substrings

• (F1c) TF of substrings as a separate feature

• (F1d) TF normalized by candidate type (i.e. simplex words vs. NPs)

• (F1e) TF normalized by candidate type, as a separate feature

• (F1f) IDF using Google n-grams
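Features F1b and F1f can be sketched as follows; the candidate counts and document frequencies below are hypothetical inputs, with IDF imagined as coming from a large external resource such as the Google n-gram counts:

```python
import math

def tf_with_substrings(candidate, candidate_counts):
    """F1b: TF of a candidate, also counting its occurrences inside longer
    candidates (e.g. 'grid computing' within 'grid computing algorithm')."""
    return sum(c for phrase, c in candidate_counts.items() if candidate in phrase)

def tf_idf(candidate, candidate_counts, doc_freq, n_docs):
    """TF-IDF using the substring-aware TF; doc_freq maps a phrase to the
    number of documents (or n-gram corpus entries) containing it."""
    idf = math.log(n_docs / (1 + doc_freq.get(candidate, 0)))
    return tf_with_substrings(candidate, candidate_counts) * idf

# Toy in-document counts and a hypothetical external document frequency:
counts = {"grid computing": 3, "grid computing algorithm": 2, "algorithm": 5}
score = tf_idf("grid computing", counts, {"grid computing": 9}, 1000)
```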
F2: First Occurrence (S, U). KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g. in the abstract and introduction sections) and newswire.
F3: Section Information (S, U). Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title and/or references.
F4: Additional Section Information (S, U). We first added the related work or previous work section as a piece of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e. the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.
• (F4a) section 'related/previous work'

• (F4b) counting substrings occurring in key sections

• (F4c) section TF across all key sections

• (F4d) weighting key sections according to the proportion of keyphrases found
F5: Last Occurrence (S, U). Similar to the distance feature in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of the document, such as the conclusion and discussion.
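F2 and F5 reduce to normalized token positions of the first and last match. A sketch over a toy token list:

```python
def occurrence_features(candidate, doc_tokens):
    """F2/F5: normalized first and last occurrence positions of a candidate
    (0.0 = document start, values near 1.0 = document end)."""
    m, n = len(candidate), len(doc_tokens)
    hits = [i for i in range(n - m + 1) if doc_tokens[i:i + m] == candidate]
    if not hits:
        return None
    return hits[0] / n, hits[-1] / n

# Toy document of 10 tokens; the candidate appears at positions 0 and 7.
doc = "grid computing is studied here we conclude grid computing wins".split()
first, last = occurrence_features(["grid", "computing"], doc)
```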
5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.
F6: Co-occurrence of Another Candidate in a Section (S, U). When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.
F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.
• (F7a) co-occurrence (Boolean) in title collection

• (F7b) co-occurrence (TF) in title collection
F8: Keyphrase Cohesion (S, U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top-N ranked candidates and the remainder. In the
1 It contains 1.3M titles from articles, papers and reports.
original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.
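A crude stand-in for this contextual-similarity approximation: each candidate word is represented by counts of its neighbouring words, and vectors are compared with cosine. The content-word filter (nouns, verbs and adjectives only) is omitted for brevity:

```python
from collections import Counter

def context_vector(word, sentences, window=2):
    """Counts of words appearing within `window` tokens of `word`."""
    vec = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == word:
                vec.update(sent[max(0, i - window):i] + sent[i + 1:i + 1 + window])
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of word -> count)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = lambda x: sum(c * c for c in x.values()) ** 0.5
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Two toy candidates sharing identical contexts -> similarity near 1.
sents = [["fast", "grid", "computing"], ["fast", "cloud", "computing"]]
sim = cosine(context_vector("grid", sents), context_vector("cloud", sents))
```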
5.3 Term Cohesion
Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).
F9: Term Cohesion (S, U). Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier.
• (F9a) term cohesion as in (Park et al., 2004)

• (F9b) TF normalized by candidate type (i.e. simplex words vs. NPs)

• (F9c) applying different weights by candidate type

• (F9d) normalized TF and different weighting by candidate type
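A sketch of F9 in this spirit (the generalized Dice numerator n·TF(term) over summed component TFs reduces to the classic 2f(xy)/(f(x)+f(y)) for bigrams; the 10% simplex discount and all counts below are illustrative assumptions):

```python
def term_cohesion(words, ngram_tf, word_tf, simplex_weight=0.1):
    """Dice-style cohesion for a candidate given as a list of words.
    Multi-word: n * TF(term) / sum of component TFs; simplex words are
    down-weighted instead, mirroring the discount described above."""
    if len(words) == 1:
        return simplex_weight * word_tf[words[0]]
    joint = ngram_tf.get(" ".join(words), 0)
    return len(words) * joint / sum(word_tf[w] for w in words)

# Hypothetical corpus counts:
word_tf = {"grid": 10, "computing": 30, "algorithm": 40}
ngram_tf = {"grid computing": 8}
```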
5.4 Other Features
F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only with N&K, to assess our candidate selection method.
F11: POS sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like the acronym feature, this is also subject to the data set.
F12: Suffix sequence (S). Similar to the acronym feature, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This is also a data-dependent feature, and is thus used in the supervised approach only.
F13: Length of Keyphrases (S, U). Barker and Cornacchia (2000) showed that candidate length is also a useful feature, in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.
6 System and Data
To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. Then we partitioned the text into sections, such as title and section heads, via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4 and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.
For evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found many were missing or found only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes to annotate each paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the
2 http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3 http://wing.comp.nus.edu.sg/parsCit/
4 http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5 http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6 http://maxent.sourceforge.net/index.html
7 C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence - Multiagent Systems), and J.4 (Social and Behavioral Sciences - Economics)
number of author-assigned and reader-assigned keyphrases found in the documents.
          Author        Reader         Combined
Total     1252 (53)     3110 (111)     3816 (146)
NPs       904           2537           3027
Average   3.85 (4.01)   12.44 (12.88)  15.26 (15.85)
Found     769           2509           2864

Table 2: Statistics of keyphrases
7 Evaluation
The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information, and additional section information (i.e., F1-4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.
Tables 3 and 4 show the performance of our candidate selection method and features, with respect to supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10, and 15 candidates.
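A minimal sketch of this exact-match scoring at top N follows; it is our illustration, and the lowercasing normalization is an assumption rather than a detail taken from the paper.

```python
# Sketch of exact-match evaluation at top N: a ranked candidate counts as
# correct only if its string exactly matches a gold keyphrase.
def exact_match_at(ranked, gold, n):
    gold_set = {g.lower() for g in gold}          # lowercasing is our assumption
    matched = sum(1 for c in ranked[:n] if c.lower() in gold_set)
    precision = matched / n
    recall = matched / len(gold_set) if gold_set else 0.0
    fscore = (2 * precision * recall / (precision + recall)) if matched else 0.0
    return matched, precision, recall, fscore

ranked = ["machine translation", "decoder", "phrase table"]
gold = ["machine translation", "statistical models"]
print(exact_match_at(ranked, gold, 3))
```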
BestFeatures includes F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in a section), F7b (title overlap), F9a (term cohesion by Park et al. (2004)), and F13 (length of keyphrases). Best-TFIDF means using all best features except TFIDF.
In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S)8.
In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and ± indicates no effect, or an effect unconfirmed due to only small changes in performance. Again, S denotes Maximum Entropy training and U is our unsupervised approach.
8 Discussion
We compared the performance of our candidate selection and feature engineering against simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer
8 Due to the page limits, we present only the best performance.
Method          Features  C       Five                            Ten                             Fifteen
                                  Match Prec   Recall Fscore      Match Prec   Recall Fscore      Match Prec   Recall Fscore
All             KEA       U       0.03   0.64   0.21   0.32       0.09   0.92   0.60   0.73       0.13   0.88   0.86   0.87
Candidates                S       0.79  15.84   5.19   7.82       1.39  13.88   9.09  10.99       1.84  12.24  12.03  12.13
                N&K       S       1.32  26.48   8.67  13.06       2.04  20.36  13.34  16.12       2.54  16.93  16.64  16.78
                baseline  U       0.92  18.32   6.00   9.04       1.57  15.68  10.27  12.41       2.20  14.64  14.39  14.51
                          S       1.15  23.04   7.55  11.37       1.90  18.96  12.42  15.01       2.44  16.24  15.96  16.10
Length <= 3     KEA       U       0.03   0.64   0.21   0.32       0.09   0.92   0.60   0.73       0.13   0.88   0.86   0.87
Candidates                S       0.81  16.16   5.29   7.97       1.40  14.00   9.17  11.08       1.84  12.24  12.03  12.13
                N&K       S       1.40  27.92   9.15  13.78       2.10  21.04  13.78  16.65       2.62  17.49  17.19  17.34
                baseline  U       0.92  18.40   6.03   9.08       1.58  15.76  10.32  12.47       2.20  14.64  14.39  14.51
                          S       1.18  23.68   7.76  11.69       1.90  19.00  12.45  15.04       2.40  16.00  15.72  15.86
Length <= 3     KEA       U       0.01   0.24   0.08   0.12       0.05   0.52   0.34   0.41       0.07   0.48   0.47   0.47
Candidates                S       0.83  16.64   5.45   8.21       1.42  14.24   9.33  11.27       1.87  12.45  12.24  12.34
+ Alternation   N&K       S       1.53  30.64  10.04  15.12       2.31  23.08  15.12  18.27       2.88  19.20  18.87  19.03
                baseline  U       0.98  19.68   6.45   9.72       1.72  17.24  11.29  13.64       2.37  15.79  15.51  15.65
                          S       1.33  26.56   8.70  13.11       2.09  20.88  13.68  16.53       2.69  17.92  17.61  17.76

Table 3: Performance on Proposed Candidate Selection
Features        C       Five                            Ten                             Fifteen
                        Match Prec   Recall Fscore      Match Prec   Recall Fscore      Match Prec   Recall Fscore
Best            U       1.14  22.8    7.47  11.3        1.92  19.2   12.6   15.2        2.61  17.4   17.1   17.3
                S       1.56  31.2   10.2   15.4        2.50  25.0   16.4   19.8        3.15  21.0   20.6   20.8
Best            U       1.14  22.8    7.4   11.3        1.92  19.2   12.6   15.2        2.61  17.4   17.1   17.3
w/o TFIDF       S       1.56  31.1   10.2   15.4        2.46  24.6   16.1   19.4        3.12  20.8   20.4   20.6

Table 4: Performance on Feature Engineering
A   Method  Features
+   S       F1a, F2, F3, F4a, F4d, F9a
    U       F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-   S       F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
    U       F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
±   S       F1e, F10, F11, F12
    U       F1b

Table 5: Performance on Each Feature
length candidates tend to act as noise and thus decreased the overall performance. We also confirmed that candidate alternation accommodates the variability of keyphrases, leading to higher candidate coverage as well as better performance.
To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that the performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not impact performance. We attribute this to the fact that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10 for simplex words. Normally, term cohesion applies to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the findings on each feature.
As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high. While the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.
9 Conclusion
We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous work, while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.
10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R-252-000-325-279), while the first author was a postdoctoral fellow at the National University of Singapore.
References

Ken Barker and Nadia Corrnacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10–17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of ACL, 1989, pp. 76–83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28–30.

Ernesto D'Avanzo and Bernado Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433–447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668–673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81–104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beata Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537–544.

Mario Jarmasz and Caroline Barriere. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349–357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157–169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296–297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. NLE, 2001, 7(3), pp. 207–223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317–326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48–55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pölla, and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434–439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254–256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17–22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Verb Noun Construction MWE Token Supervised Classification
Mona T. Diab
Center for Computational Learning Systems, Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department, Columbia University
pb2351@columbia.edu
Abstract
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.
1 Introduction
In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more often than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.
For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of an
MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as being a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.
To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning and that over 40% of their usages in text were actually literal. Thus it would be important for an NLP application, such as machine translation, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as misclassification could potentially have detrimental effects on the quality of the translation.
In this paper, we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.
This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.
2 Multi-word Expressions
MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional, both fixed expressions such as kingdom come and ad hoc, and non-fixed expressions such as break new ground and speak of the devil. The VNCs which we focus on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living and Verb Particle Constructions (VPC) such as write-up and call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.
3 Related Work
Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).
In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine if the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.
Hashimoto and Kawahara (2008) (HK08) is the first large-scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features, such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.
4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system1. YamCha is based on Support Vector Machines technology using degree 2 polynomial kernels.
We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (Beginning of a Literal chunk), I-L (Inside of a Literal chunk), B-I (Beginning of an Idiomatic chunk), I-I (Inside an Idiomatic chunk), and O (Outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
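The tagging above can be generated mechanically from an annotated span; the following is a minimal sketch of ours, not the authors' code.

```python
# Sketch: converting a sentence with one annotated VNC span into the
# five-tag IOB scheme used here (B-I/I-I idiomatic, B-L/I-L literal, O outside).
def to_iob(tokens, span, idiomatic):
    start, end = span               # half-open token-index span of the VNC chunk
    cls = "I" if idiomatic else "L"
    tags = []
    for i, _ in enumerate(tokens):
        if i == start:
            tags.append("B-" + cls)
        elif start < i < end:
            tags.append("I-" + cls)
        else:
            tags.append("O")
    return tags

sent = "John kicked the bucket last Friday".split()
print(to_iob(sent, (1, 4), idiomatic=True))
# ['O', 'B-I', 'I-I', 'I-I', 'O', 'O']
```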
We experiment with different window sizes for context, ranging from -/+1 to -/+5 tokens before and after the token of interest. We also employ linguistic features, such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, lemma form (LEMMA) or the citation form of the word, and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets, explicitly modeled in the input data. Hence, if we are modeling the POS tag feature, for our running example the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
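The column-style instances described above can be sketched as follows; this is our illustration, and the POS tags are illustrative rather than taken from any particular tagger.

```python
# Sketch of the column-style rows fed to the sequence labeler: one row per
# token carrying the word, a POS tag, the last-3-character NGRAM feature,
# and the gold IOB tag.
def feature_rows(tokens, pos_tags, iob_tags):
    return [(w, p, w[-3:], t) for w, p, t in zip(tokens, pos_tags, iob_tags)]

tokens = "John kicked the bucket last Friday".split()
pos = ["NN", "VBD", "Det", "NN", "ADV", "NN"]
iob = ["O", "B-I", "I-I", "I-I", "O", "O"]
for row in feature_rows(tokens, pos, iob):
    print(" ".join(row))
```

Each additional feature (LEMMA, NE, etc.) would simply add another column to every row.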
With the NE feature, we followed the same representation as the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN IdentiFinder software, which identifies 19 NE tags2. We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other where we have a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class, Named Entity InText (NEI). For example, under the NEI condition our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV
1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder
ast O, DAY NN day O, where John is replaced by the NE "PER".
5 Experiments and Results
5.1 Data
We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC)3. The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4, corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.
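The per-fold split can be sketched as follows; shuffling with a fixed seed is our assumption, not a detail from the paper.

```python
import random

# Sketch of the 80/10/10 train/test/dev split applied to the 2432 sentences
# in each cross-validation fold. The seed and shuffling policy are assumptions.
def split_80_10_10(items, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    a, b = int(0.8 * n), int(0.9 * n)
    return items[:a], items[a:b], items[b:]

train, test, dev = split_80_10_10(range(2432))
print(len(train), len(test), len(dev))
```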
5.2 Evaluation Metrics
We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results5. We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
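Per-class precision, recall, and F-measure over predicted versus gold chunks can be sketched as follows; this is our illustration of the metric, not the evaluation script actually used.

```python
# Sketch of per-class scoring: P, R, and F1 computed over predicted vs. gold
# (span, class) chunks for one class, e.g. "IDM" or "LIT".
def chunk_prf(pred, gold, cls):
    p_set = {c for c in pred if c[1] == cls}
    g_set = {c for c in gold if c[1] == cls}
    tp = len(p_set & g_set)                       # exact span-and-class matches
    prec = tp / len(p_set) if p_set else 0.0
    rec = tp / len(g_set) if g_set else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

pred = {((1, 4), "IDM"), ((6, 8), "LIT")}
gold = {((1, 4), "IDM"), ((9, 11), "LIT")}
print(chunk_prf(pred, gold, "IDM"))
print(chunk_prf(pred, gold, "LIT"))
```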
5.3 Results
We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE
3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only because it is the metric used by previous work.
tokens that are not in the original data set6
In Table 2 we present the results yielded per feature and per condition. We experimented with different context sizes initially, to decide on the optimal window size for our learning framework; the results are presented in Table 1. Once that was determined, we proceeded to add features.
Noting that a window size of -/+3 yields the best results, we proceed to use that as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.
All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of -/+3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04%.
Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.
Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.
Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.
6 Discussion
The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features
6 We could have easily identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set exclusively.
yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.
We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors confusing idiomatic cases with literal ones 23 times, and the opposite 4 times.
Some of the errors where the VNC should have been classified as literal, but the system classified it as idiomatic, are kick heel, find feet, and make top. Cases of idiomatic expressions erroneously classified as literal are for the MWE types hit the road, blow trumpet, blow whistle, and hit a wall.
The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand, Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.
Also, in As the ball hit the post, the referee blew the whistle, the system identified blew the whistle as a literal VNC in this context, and it identified hit the post as another literal VNC.
7 Conclusion
In this study, we explore a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system, and using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58%. In the future, we plan to add more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.
        IDM-F   LIT-F   Overall F   Overall Acc
-/+1    77.93   48.57   71.78       96.22
-/+2    85.38   55.61   79.71       97.06
-/+3    86.99   55.68   81.25       96.93
-/+4    86.22   55.81   80.75       97.06
-/+5    83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying context window size
                  IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ              70.02   89.16   78.44   0       0       0       69.68
TOK               81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA           83.1    84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM           83.17   82.38   82.78   70.0    43.75   53.85   77.01
(P)OS             83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P             86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full          85.2    87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin           84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI               89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P    89.89   84.92   87.34   76.32   50.0    60.42   81.99
NES-Bin+L+N+P     90.86   84.92   87.79   76.32   50.0    60.42   82.33
NEI+L+N+P         91.35   88.42   89.86   81.69   50.0    62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data using different features and their combinations
8 Acknowledgement
The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, which helped make this publication more concise and better presented.
References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

Julia Birke and Anoop Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP '97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

Caroline Sporleder and Linlin Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23–30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de
Abstract
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.
1 Introduction
Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for crosslingual induction of morphological, syntactic and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:
(1) Der  Rat      sollte  unsere  Position  berücksichtigen
    The  Council  should  our     position  consider

(2) The Council should take account of our position
This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE take account of and a German simplex verb berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems to automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences and are able to identify various MWE patterns.
In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.
The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking for a given list of candidates whether these are MWEs or not. Instead, we will ask for a given list of lexical items in a source language whether there exists a one-to-many translation for this item in a target language (and whether these one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem. As the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches that need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.
In this work we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondences that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, but is based on the intuition that monolingual idiosyncrasies like the MWE realization of an entity are not likely to be mirrored in another language (see Section 4 for discussion).
Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences for a given parallel corpus. Therefore, Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.
2 Multiword Translations as MWEs
The idea to exploit one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover linguistic patterns exhibited by one-to-many translations.

Verb                 1-1    1-n    n-1    n-n    No
anheben (v1)         53.5   21.2    9.2    1.6   32.5
bezwecken (v2)       16.7   51.3    0.6   31.3   15.0
riskieren (v3)       46.7   35.7    0.5    1.7   18.2
verschlimmern (v4)   30.2   21.5   28.6   44.5   27.5

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas and their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).
For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.
A closer look at the one-to-many translations reveals that these cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as an MWE. Below we will briefly illustrate the different classes of one-to-many translations we found in our gold standard.
Morphological variations: This type of one-to-many translation is mainly due to non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically:
(3) Sie   verschlimmern  die  Übel
    They  aggravate      the  misfortunes

(4) Their action is aggravating the misfortunes
Verb particle combinations: A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below:
(5) Der  Ausschuss  bezweckt  den  Institutionen  ein  politisches  Instrument  an  die  Hand  zu  geben
    The  committee  intends   the  institutions   a    political    instrument  at  the  hand  to  give

(6) The committee set out to equip the institutions with a political instrument
Verb preposition combinations: While this class is not discussed very often in the MWE literature, it can nevertheless be considered as an idiosyncratic combination of lexical items; Sag et al. (2002) propose an analysis within an MWE framework:
(7) Sie   werden  den  Treibhauseffekt     verschlimmern
    They  will    the  green house effect  aggravate

(8) They will add to the green house effect
Light verb constructions (LVCs): This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations since the LVCs exhibit specific lexical restrictions (Sag et al., 2002):
(9) Ich  werde  die  Sache  nur   noch  verschlimmern
    I    will   the  thing  only  just  aggravate

(10) I am just making things more difficult
Idioms: This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to its limited size and the source items we chose:
(11) Sie   bezwecken  die  Umgestaltung  in    eine  zivile  Nation
     They  intend     the  conversion    into  a     civil   nation

(12) They have in mind the conversion into a civil nation
          v1        v2        v3        v4
N type    22 (26)   41 (47)   26 (35)   17 (24)
V Part    22.7       4.9       0.0       0.0
V Prep    36.4      41.5       3.9       5.9
LVC       18.2      29.3      88.5      88.2
Idiom      0.0       2.4       0.0       0.0
Para      36.4      24.3      11.5      23.5

Table 2: Proportions of MWE types per lemma.
Paraphrases: From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, like in the example below: ensure that something increases. However, semantically these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below:
(13) Wir  brauchen  bessere  Zusammenarbeit  um  die  Rückzahlungen   anzuheben
     We   need      better   cooperation     to  the  repayments.OBJ  increase

(14) We need greater cooperation in this respect to ensure that repayments increase
Table 2 displays the proportions of the MWE categories for the number of types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variation only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) or bezwecken (en. intend) are frequently translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may be simply related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).
In general, one-to-many translational correspondences seem to provide a very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.
3 Multiword Translation Detection
This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2 we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.
3.1 One-to-many Alignments
To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the rate of translational correspondences on the type level that the system discovers against the one-to-many translations in our gold standard. By type we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the translation types is so low that already at a threshold of 3 almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that correspond to the source token is rare.

        n > 0          n > 1          n > 3
        FPR    FNR     FPR    FNR     FPR    FNR
v1      0.97   0.93    1.0    1.0     1.0    1.0
v2      0.93   0.9     0.5    0.96    0.0    0.98
v3      0.88   0.83    0.8    0.97    0.67   0.97
v4      0.98   0.92    0.8    0.92    0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.
This low precision of one-to-many alignments is not very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that does not necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle:
(15) Die  Korruption  behindert  die  Entwicklung
     The  corruption  impedes    the  development

(16) Corruption constitutes an obstacle to development
Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.
3.2 Aligning Syntactic Configurations
High-precision one-to-many translation detection thus involves two major problems: 1) How to identify sentences or configurations where reliable lexical correspondences can be found? 2) How to align target items that have a low occurrence correlation?
We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel. The verb's arguments directly correspond to each other and are all dominated by a verbal root node.
Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.
Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering the configurations; c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs will briefly characterize these steps.
[Figure 1: Example of a typical syntactic MWE configuration: the German verb behindert with arguments X1 and Y1, aligned to an English configuration rooted in create, with the argument X2 and a dependency path through an obstacle to connecting the root to the argument Y2.]
Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (DG, DE, AGE). DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. DE is the set of dependency triples of the target sentence. AGE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
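The triple representation (DG, DE, AGE) can be sketched directly as Python sets of tuples. This is a minimal illustration of the data model only: the toy sentence pair mirrors examples (15)/(16), and the relation labels (SB, OA, SBJ, OBJ, NMOD, PMOD) are our assumptions, not the authors' actual parser output.

```python
# Sketch of one sentence pair in the (DG, DE, AGE) representation.
# Toy data; relation labels and parses are illustrative assumptions.

# DG: source (German) dependency triples (head, relation, dependent)
DG = {
    ("behindert", "SB", "Korruption"),    # subject
    ("behindert", "OA", "Entwicklung"),   # accusative object
}

# DE: target (English) dependency triples
DE = {
    ("constitutes", "SBJ", "Corruption"),
    ("constitutes", "OBJ", "obstacle"),
    ("obstacle", "NMOD", "an"),
    ("obstacle", "NMOD", "to"),
    ("to", "PMOD", "development"),
}

# AGE: word alignment as pairs (source_word, target_word)
AGE = {
    ("Korruption", "Corruption"),
    ("Entwicklung", "development"),
}

def dependents(deps, head):
    """All (relation, dependent) pairs of a given head."""
    return {(rel, dep) for (h, rel, dep) in deps if h == head}
```

Representing each parse as a flat set of triples makes the later filtering conditions simple set and path queries, which is presumably why the paper adopts it.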
Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.
Filtering: Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in DG there is a tuple (sn, tn) in AGE, i.e. every dependent has an alignment.
3. There is a target item t1 in DE such that for each tn there is a path p in DE, p = (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn), that connects t1 and tn. Thus the target dependents have a common root.

To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering does not exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition on the configuration not to contain a coordination relation. This Generate-and-Filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in 3.2 was that all the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.
Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) in DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) is in AGE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti in T is an element of our target configuration:
1. Compute corr(s1, T), the correlation between s1 and T.

2. For each ti, tx such that there is a (ti, rel, tx) in DE, compute corr(s1, T + tx).

3. If corr(s1, T + tx) >= corr(s1, T), add tx to T.
As the Dice coefficient is often reported to give the best results, e.g. in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work we will experiment with other association measures. Our correlation scores are thus defined by the formula:
corr(s1, T) = 2 * freq(s1 ∧ T) / (freq(s1) + freq(T))
We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti in T, and freq(s1) accordingly. The observed frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
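The post-editing loop (steps 1-3) with the Dice score can be sketched as follows. The toy "corpus" of lemmatized sentence pairs is invented, and for simplicity the candidate dependents are passed in as a list rather than read off the dependency triples in DE, so this is an illustration of the scoring logic rather than the authors' implementation.

```python
# Sketch of Dice-based alignment post-editing (steps 1-3 above).
# Each corpus entry is a (source_lemmas, target_lemmas) pair; the
# data is an invented toy example.

corpus = [
    ({"behindert"}, {"constitute", "an", "obstacle", "to"}),
    ({"behindert"}, {"constitute", "an", "obstacle", "to"}),
    ({"behindert"}, {"impede"}),
    ({"riskieren"}, {"run", "the", "risk"}),
]

def freq_joint(s1, T):
    """Sentence pairs with s1 on the source side and all of T on the target side."""
    return sum(1 for src, tgt in corpus if s1 in src and T <= tgt)

def dice(s1, T):
    freq_s = sum(1 for src, _ in corpus if s1 in src)
    freq_T = sum(1 for _, tgt in corpus if T <= tgt)
    denom = freq_s + freq_T
    return 2 * freq_joint(s1, T) / denom if denom else 0.0

def post_edit(s1, T, candidate_dependents):
    """Greedily add dependents that do not lower the correlation (step 3)."""
    T = set(T)
    for tx in candidate_dependents:
        if dice(s1, T | {tx}) >= dice(s1, T):   # steps 2-3
            T.add(tx)
    return T
```

On this toy corpus, post-editing the pair (behindert, {constitute, obstacle}) with candidates an, to and risk adds the function words of the LVC while rejecting the unrelated risk, whose inclusion would drop the Dice score to zero.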
The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation for the English MWE run the risk of.
3.3 Evaluation
Our translational approach to MWE extraction bears the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.
We present two experiments in this evaluation section. We will first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.
Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma do not figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that does not constrain the syntactic type of the parallel argument relations. While this does not substantially decrease the FNR, the FPR increases significantly. This means that the syntactic filters mainly fire on noisy configurations and do not decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down the number of candidate configurations that our algorithm detects. As the dependency parses do not provide deep analyses for tense or control phenomena, very often a verb's arguments do not figure as its syntactic dependents and no configuration is found. Future work will explore the impact of deep syntactic analysis for the detection of translational correspondences.
MWE evaluation: In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. Therefore, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.
The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.
4 Related Work
The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was established by Wu (1997). Our way to use syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

        Strict Filter       Relaxed Filter
        FPR     FNR         FPR     FNR
v1      0.0     0.96        0.5     0.96
v2      0.25    0.88        0.47    0.79
v3      0.25    0.74        0.56    0.63
v4      0.0     0.875       0.56    0.84

Table 4: False positive and false negative rate of one-to-many translations.

Trans. type       Proportion
MWEs                 57.5
  V Part              8.2
  V Prep             51.8
  LVC                32.4
  Idiom              10.6
Paraphrases          24.4
Alternations          1.0
Noise                17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.
Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches do not rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.
By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Padó, 2009).
The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already been exploited in, e.g., morphology projection (Yarowsky et al., 2001) and word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of simplex realization in a source language to an idiosyncratic multiword realization in the target language.
5 Conclusion
We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for monolingual interpretation of MWEs.
Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.
References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 311–326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

Joao de Almeida Varelas Graca, Joana Paulo Pardal, Luísa Coheur, and Diamantino Antonio Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38/2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
A Re-examination of Lexical Association Measures
Abstract
We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English performed on the Penn Treebank, which show significant improvements over the original AMs.
1 Introduction
Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter, AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², log-likelihood
ratio, and the T score. In Pearce (2002), AMs such as Z score, Pointwise MI, cost reduction, left and right context entropy, and odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.
While previous works have evaluated AMs, few have detailed why the AMs perform as they do. Such an analysis is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.
We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the
Hung Huu Hoang
Dept. of Computer Science
National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
mobility of noun phrase complements in transitive VPCs They can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off) However a pronominal complement can only appear in the latter configuration (Take it off)
In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.
2 Classification of Association Measures
Although different AMs have different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure the association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as T score, Z score, and Pearson's χ² test.
Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity; distance metrics such as the Euclidean and Manhattan norms; and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.
Table 1 lists all AMs used in our discussion. The lower left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
3 Evaluation
We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.
3.1 Evaluation Data
In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.
For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,
[M1] Joint Probability: f(xy) / N
[M2] Mutual Information (MI): (1/N) Σ_ij f_ij log( f_ij / f̂_ij )
[M3] Log likelihood ratio: 2 Σ_ij f_ij log( f_ij / f̂_ij )
[M4] Pointwise MI (PMI): log( P(xy) / (P(x*) P(*y)) )
[M5] Local-PMI: f(xy) × PMI
[M6] PMI^k: log( N f(xy)^k / (f(x*) f(*y)) )
[M7] PMI^2: log( N f(xy)^2 / (f(x*) f(*y)) )
[M8] Mutual Dependency: log( P(xy)^2 / (P(x*) P(*y)) )
[M9] Driver-Kroeber: a / sqrt((a+b)(a+c))
[M10] Normalized expectation: 2a / (2a + b + c)
[M11] Jaccard: a / (a + b + c)
[M12] First Kulczynski: a / (b + c)
[M13] Second Sokal-Sneath: a / (a + 2(b + c))
[M14] Third Sokal-Sneath: (a + d) / (b + c)
[M15] Sokal-Michiner: (a + d) / (a + b + c + d)
[M16] Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
[M17] Hamann: ((a + d) − (b + c)) / (a + b + c + d)
[M18] Odds ratio: ad / bc
[M19] Yule's ω: (sqrt(ad) − sqrt(bc)) / (sqrt(ad) + sqrt(bc))
[M20] Yule's Q: (ad − bc) / (ad + bc)
[M21] Brawn-Blanquet: a / max(a+b, a+c)
[M22] Simpson: a / min(a+b, a+c)
[M23] S cost: log(1 + min(b, c) / (a + 1))^(−1/2)
[M24] Adjusted S cost*: log(1 + max(b, c) / (a + 1))^(−1/2)
[M25] Laplace: (a + 1) / (a + min(b, c) + 2)
[M26] Adjusted Laplace*: (a + 1) / (a + max(b, c) + 2)
[M27] Fager: [M9] − (1/2) max(b, c)
[M28] Adjusted Fager*: [M9] − max(b, c) / sqrt(aN)
[M29] Normalized PMIs*: PMI / NF(α); PMI / NFmax
[M30] Simplified normalized PMI for VPCs*: log(ad) / (α·b + (1−α)·c)
[M31] Normalized MIs*: MI / NF(α); MI / NFmax

where NF(α) = α P(x*) + (1−α) P(*y), α ∈ [0, 1], and NFmax = max(P(x*), P(*y)).

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.

Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy); b = f12 (x occurs, y does not); c = f21 (y occurs, x does not); d = f22 (neither occurs); marginals f(x*) = a + b and f(*y) = a + c; N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.
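To make the contingency-table notation concrete, a few of the measures above can be computed directly from the four cell counts. The following is a minimal sketch (function names and the toy counts are illustrative, not from the paper):

```python
import math

# a = f(xy), b = f(x, not-y), c = f(not-x, y), d = f(not-x, not-y)
# Marginals: f(x*) = a + b, f(*y) = a + c; N = a + b + c + d.

def pmi(a, b, c, d):
    # [M4] Pointwise MI: log( N f(xy) / (f(x*) f(*y)) )
    n = a + b + c + d
    return math.log(n * a / ((a + b) * (a + c)))

def jaccard(a, b, c, d):
    # [M11] Jaccard: a / (a + b + c)
    return a / (a + b + c)

def odds_ratio(a, b, c, d):
    # [M18] Odds ratio: ad / bc
    return (a * d) / (b * c)

# Toy counts for a single hypothetical bigram (not corpus data):
a, b, c, d = 30, 15, 45, 910
print(pmi(a, b, c, d), jaccard(a, b, c, d), odds_ratio(a, b, c, d))
```

Every AM in Table 1 reduces to such a closed-form function of (a, b, c, d), which is what makes ranking thousands of candidates cheap.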
permitting a maximum of 4 intervening words, to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.
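The windowed search described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the particle inventory is a small subset of their 38-particle set, and the tag conventions (Penn Treebank `RP` for particles, `VB*` for verbs) are assumptions about the tagged input.

```python
# Sketch of VPC candidate extraction: find a particle from a fixed
# inventory, then search left for the nearest verb, allowing an
# intervening NP of up to max_gap words.
PARTICLES = {"up", "out", "off", "on", "down"}  # small illustrative subset

def vpc_candidates(tagged, max_gap=5):
    """tagged: list of (word, pos) pairs; yields (verb, particle) pairs."""
    for i, (word, pos) in enumerate(tagged):
        if word in PARTICLES and pos == "RP":
            # scan left, permitting up to max_gap intervening words
            for j in range(i - 1, max(i - max_gap - 2, -1), -1):
                if tagged[j][1].startswith("VB"):
                    yield (tagged[j][0], word)
                    break

sent = [("take", "VB"), ("your", "PRP$"), ("hat", "NN"), ("off", "RP")]
print(list(vpc_candidates(sent)))  # [('take', 'off')]
```

The LVC search works analogously, scanning right from a light verb for the nearest noun within the 4-word window.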
Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in (Baldwin, 2005), and 464 annotated LVC candidates used in (Tan et al., 2006). Both sets of annotations give both positive and negative examples.

Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data       LVC data    Mixed
Total (freq ≥ 6)     413            100         513
Positive instances   117 (28.33%)   28 (28%)    145 (28.27%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.
3.2 Evaluation Metric
To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
3.3 Evaluation Results
Figure 1 (a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that hypothesis tests rely on.
4 Rank Equivalence
We note that some AMs, although not mathematically equivalent (i.e., not assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r_C M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:
34
Corollary: If M1 ≡r_C M2, then AP_C(M1) = AP_C(M2), where AP_C(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the three AMs Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with the simplest representative from each rank equivalent group.
¹ Two AMs may be rank equivalent with the exception of some candidates on which one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence, we still consider such AM pairs to be rank equivalent.
1) [M2] Mutual Information; [M3] Log likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (a.k.a. Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (a.k.a. Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
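The rank-equivalence definition lends itself to a direct empirical check: two score lists are rank equivalent iff they order (and tie) every candidate pair identically. A minimal sketch (our own helper, not from the paper), using Odds ratio and Yule's Q from row 5:

```python
from itertools import combinations

def rank_equivalent(scores1, scores2):
    """True iff the two score lists order (and tie) all candidate pairs identically."""
    for (a1, a2), (b1, b2) in combinations(list(zip(scores1, scores2)), 2):
        if (a1 > b1) != (a2 > b2) or (a1 == b1) != (a2 == b2):
            return False
    return True

# Toy contingency counts (a, b, c, d) for three hypothetical candidates:
cands = [(30, 15, 45, 910), (10, 40, 20, 930), (25, 5, 70, 900)]
odds = [a * d / (b * c) for a, b, c, d in cands]                    # [M18]
yule_q = [(a * d - b * c) / (a * d + b * c) for a, b, c, d in cands]  # [M20]
print(rank_equivalent(odds, yule_q))  # True
```

The equivalence of [M18] and [M20] is also easy to see algebraically: Yule's Q equals (OR − 1)/(OR + 1), a monotonically increasing function of the odds ratio OR.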
5 Examination of Association Measures
We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.
5.1 Mutual Information and Pointwise Mutual Information
In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in
[Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC (1a) and LVC (1b) datasets. Bold points indicate AMs discussed in this paper. Legend: hypothesis test AMs; ◊ Class I AMs excluding hypothesis test AMs; + context-based AMs.]
identifying LVCs. In this section, we show how their performances can be improved significantly.
Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U; V) = Σ_{u,v} P(uv) log( P(uv) / (P(u*) P(*v)) ).

In the context of bigrams, the above formula can be simplified to

[M2] MI = (1/N) Σ_{i,j} f_ij log( f_ij / f̂_ij ).

While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log( P(xy) / (P(x*) P(*y)) ) = log( N f(xy) / (f(x*) f(*y)) ).

It has long
been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of the two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI with f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying PMI directly by f(xy) reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs          LVCs          Mixed
Best [M6] PMI^k    54.7 (k=.13)  57.3 (k=.85)  54.4 (k=.32)
[M4] PMI           51.0          56.6          51.5
[M5] Local-PMI     25.9          39.3          27.2
[M1] Joint Prob.   17.0           2.8          17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.
Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X; Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently, MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α P(x*) + (1−α) P(*y), with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on alpha, can be optimized by setting an appropriate alpha value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of alpha and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC task, respectively. If one rewrites MI as (1/N) Σ_{i,j} f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.
AM            VPCs          LVCs          Mixed
MI / NF(α)    50.8 (α=.48)  58.3 (α=.47)  51.6 (α=.5)
MI / NFmax    50.8          58.4          51.8
[M2] MI       27.3          43.5          28.9
PMI / NF(α)   59.2 (α=.8)   55.4 (α=.48)  58.8 (α=.77)
PMI / NFmax   56.5          51.7          55.6
[M4] PMI      51.0          56.6          51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best alpha settings shown in parentheses.
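The two normalization factors can be sketched directly from the definitions above (a minimal illustration; the function names are ours, and the counts are toy values, not corpus statistics):

```python
import math

def pmi(a, b, c, d):
    # [M4]: log( N f(xy) / (f(x*) f(*y)) )
    n = a + b + c + d
    return math.log(n * a / ((a + b) * (a + c)))

def normalized_pmi(a, b, c, d, alpha=None):
    """alpha=None selects NFmax; otherwise NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y)."""
    n = a + b + c + d
    px, py = (a + b) / n, (a + c) / n  # P(x*), P(*y)
    nf = max(px, py) if alpha is None else alpha * px + (1 - alpha) * py
    return pmi(a, b, c, d) / nf

# Toy candidate: a frequent second constituent (large c) enlarges P(*y),
# so the normalizer shrinks the score range of this candidate.
print(normalized_pmi(30, 15, 45, 910), normalized_pmi(30, 15, 45, 910, alpha=0.5))
```

Dividing by NF(α) or NFmax rescales each candidate's score by its own marginal probabilities, which is what makes scores of candidates with very different marginals comparable in a single ranking.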
5.2 Penalization Terms
It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being MWEs. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulae whose values are inversely proportional to the likelihood of being MWEs. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.
Take as an example the AMs [M21] Brawn-Blanquet (a.k.a. Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S Cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.
AM                       VPCs   LVCs   Mixed
[M21] Brawn-Blanquet     47.8   57.8   48.6
[M22] Simpson            24.9   38.2   26.0
[M24] Adjusted S Cost    48.5   57.7   49.2
[M23] S cost             24.9   38.8   26.0
[M26] Adjusted Laplace   48.6   57.7   49.3
[M25] Laplace            24.1   38.8   25.4

Table 6: Replacing min() with max() in selected AMs.
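The min-versus-max contrast can be seen on a toy candidate (the counts below are illustrative, not corpus data): for a VPC-like pair with a rare verb but a very frequent particle, only the max() denominator penalizes the frequent constituent.

```python
def brawn_blanquet(a, b, c):
    # [M21]: a / max(a+b, a+c) -- penalizes the MORE frequent constituent
    return a / max(a + b, a + c)

def simpson(a, b, c):
    # [M22]: a / min(a+b, a+c) -- the frequent constituent is ignored
    return a / min(a + b, a + c)

# Rare verb (small b), very frequent particle (large c):
a, b, c = 20, 5, 400
print(brawn_blanquet(a, b, c))  # 20/420: low score, frequent particle counts against it
print(simpson(a, b, c))         # 20/25: high score, frequent particle has no effect
```

A free combination with a high-frequency particle thus scores near the top under Simpson but is correctly demoted under Brawn-Blanquet, matching the AP gap in Table 6.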
In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is none other than an AM rank equivalent to [M7] PMI². In our application, this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ by a scaled-down version of max(b, c). Using the independence assumption, we have approximately derived √(aN) as a lower-bound estimate of max(b, c), producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                     VPCs   LVCs   Mixed
[M28] Adjusted Fager   56.4   54.3   55.4
[M27] Fager            55.2   43.9   52.5

Table 7: Performance of Fager and its adjusted version.
The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles, so it is necessary that we scale b and c properly, as in [M14']: −α×b − (1−α)×c. Having scaled the constituents properly, we still see that [M14'] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with

[MR14] = PMI / (α×b + (1−α)×c).

The denominator of [MR14] is obtained by removing the minus signs from [M14'] so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α P(x*) + (1−α) P(*y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically tried to simplify [MR14] to the following AM:

[M30] = log(ad) / (α×b + (1−α)×c).

The setting of alpha in Table 8 below is taken from the best alpha settings obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.
AM                               VPCs         LVCs          Mixed
PMI / NF(α)                      59.2 (α=.8)  55.4 (α=.48)  58.8 (α=.77)
[M30] log(ad)/(α·b + (1−α)·c)    60.0 (α=.8)  48.4 (α=.48)  58.8 (α=.77)
[M14] Third Sokal-Sneath         56.5         45.3          54.6

Table 8: AP performance of suggested VPCs' penalization terms and AMs.
With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α × c^(1−α). The best alpha value is also taken from the normalized PMI experiments (Table 5), which is nearly .5. Under this setting, this penalization term is exactly the denominator of the [M18] Odds Ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.
AM                 VPCs   LVCs   Mixed
−b − c             56.5   45.3   54.6
1/(bc)             50.2   53.2   50.2
[M18] Odds ratio   44.3   56.7   45.6

Table 9: AP performance of suggested LVCs' penalization terms and AMs.
6 Conclusions
We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.
We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.
We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α × c^(1−α) is suitable for LVCs. Our introduced alpha tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger
web-based datasets of English to verify the performance gains of our modified AMs
Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).
References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.
Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.
Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.
Evert, Stefan and Krenn, Brigitte (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.
Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.
Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.
Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.
Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.
Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.
Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.
Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
Tan, Yee Fan, Kan, Min-Yen and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word Expressions in a Multilingual Context, pages 49–56, Trento, Italy.
Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.
Appendix A. List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R Mahesh K Sinha
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur, Kanpur 208016, India
rmk@iitk.ac.in
Abstract
A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitute an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of its light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.
1 Introduction
Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb or an adjective is followed by a light verb (LV) and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.
CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007) and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only, from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006); it uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.
In this paper we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation and results.
2 Complex Predicates in Hindi
A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.
Given below are some examples.

(1) CP = noun+LV; noun = ashirwad 'blessings', LV = denaa 'to give'
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1) the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings) makes a complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa and its corresponding English translation is 'to bless'. On the other hand, in example (2) the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2); whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush 'happy', LV = karanaa 'to do'
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do) and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa 'to read', LV = lenaa 'to take'
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take) and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual, a modal or an intensifier.

(5) CP = verb+LV; verb = phaadanaa 'to tear', LV = denaa 'to give'
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give) and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa 'to give', LV = maaranaa 'to hit, to kill'
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill) and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning; this may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas 'back', LV1 = karanaa 'to do', LV2 = denaa 'to give'; or CP = CP+LV, with CP = vaapas karanaa 'to return', LV = denaa 'to give'
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give); the preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.
From all the above examples the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.
3 System Design
As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows:
1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence pair, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for the Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or a morphological derivative is found, then search for the equivalent English meanings, in any of their morphological forms (Figure 2), in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit, i.e. go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when the word is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e. go to step (a); else the identified CP is W+LV.
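The matching procedure above can be sketched as follows. This is a minimal sketch: the dictionaries passed in are stand-ins for the light-verb list (Table 1), the morphology lists (Figures 1 and 2), and the stop/exit word lists (Figures 3 and 4), and it collects every LV occurrence rather than stopping at the first.

```python
def find_cps(hi_sent, en_sent, lv_forms, lv_meanings, stop_words, exit_words):
    """Hypothesize complex predicates (CPs) in an aligned sentence pair.

    hi_sent, en_sent: tokenized Hindi / English sentences (lists of str)
    lv_forms:    light verb -> set of its Hindi morphological forms
    lv_meanings: light verb -> set of English forms of its literal meaning
    """
    cps = []
    en_words = {w.lower() for w in en_sent}
    for lv, forms in lv_forms.items():
        # If the LV's literal English meaning appears in the aligned
        # sentence, it is probably a main verb here, not a light verb.
        if en_words & lv_meanings[lv]:
            continue
        for k, tok in enumerate(hi_sent):
            if tok not in forms:
                continue
            # Scan left from position k, ignoring stop words.
            j = k - 1
            while j >= 0 and hi_sent[j] in stop_words:
                j -= 1
            # Give up on this occurrence at an exit word or sentence start.
            if j < 0 or hi_sent[j] in exit_words:
                continue
            cps.append((hi_sent[j], tok))  # identified CP: W + LV
    return cps
```

On example (1), with denaa's forms and meanings supplied, the sketch hypothesizes the CP (ashirwad, diyaa); on example (2), the literal meaning "gave" is present in the English side, so no CP is proposed.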
Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings are shown within parentheses and are not used for matching.
light verb (base form) – root verb meaning
baithanaa बैठना – sit
bananaa बनना – make/become/build/construct/manufacture/prepare
banaanaa बनाना – make/build/construct/manufacture/prepare
denaa देना – give
lenaa लेना – take
paanaa पाना – obtain/get
uthanaa उठना – rise/arise/get up
uthaanaa उठाना – raise/lift/wake up
laganaa लगना – feel/appear/look/seem
lagaanaa लगाना – fix/install/apply
cukanaa चुकना – (finish)
cukaanaa चुकाना – pay
karanaa करना – (do)
honaa होना – happen/become/be
aanaa आना – come
jaanaa जाना – go
khaanaa खाना – eat
rakhanaa रखना – keep/put
maaranaa मारना – kill/beat/hit
daalanaa डालना – put
haankanaa हाँकना – drive

Table 1: Some of the common light verbs in Hindi
For each of the Hindi light verbs, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all morphological derivatives are generated; Figure 2 shows a few illustrations.
There are a number of words that can appear in between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. These are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.
Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa जाना 'to go'
Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa लेना 'to take'
Figure 2: Morphological derivatives of sample English meanings
We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list; this list can be augmented based on an analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.
English word: sit. Morphological derivations: sit, sits, sat, sitting.
English word: give. Morphological derivations: give, gives, gave, given, giving.
...
LV: jaanaa जाना 'to go'. Morphological derivatives: jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgaa, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgii, jaakara. (The figure lists each form in Devanagari with its gloss, e.g. jaataa 'go, indefinite, masculine, singular'; jaakara 'go, completion'.)
LV: lenaa लेना 'to take'. Morphological derivatives: le, lii, leM, lo, letaa, letii, lete, liiM, luuM, legaa, legii, lene, lenaa, lenii, liyaa, leMge, loge, letiiM, luuMgaa, luuMgii, lekara. (The figure lists each form in Devanagari with its gloss, e.g. letaa 'take, indefinite, masculine, singular'; lekara 'take, completion'.)
Figure 3: Stop words in Hindi used by the system
Figure 4: A few exit words in Hindi used by the system
The inner loop of the procedure identifies multiple CPs that may be present in a sentence; the outer loop mines the CPs in the entire corpus. The experimentation and results are discussed in the following section.
4 Experimentation and Results
The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%; the F-measure is 88% to 97%. This is a remarkable and somewhat surprising result from a simple methodology without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; that is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.
                                File 1   File 2   File 3   File 4   File 5   File 6
No. of sentences                   112      193      102       43      133      107
Total no. of CPs (N)               200      298      150       46      188      151
Correctly identified CPs (TP)      195      296      149       46      175      135
V-V CPs                             56       63        9        6       15       20
Incorrectly identified CPs (FP)     17       44        7       11       16       20
Unidentified CPs (FN)                5        2        1        0       13       16
Accuracy (%)                     97.50    99.33    99.33   100.00    93.08    89.40
Precision (TP/(TP+FP)) (%)       91.98    87.05    95.51    80.70    91.62    87.09
Recall (TP/(TP+FN)) (%)          97.50    98.33    99.33   100.00    93.08    89.40
F-measure (2PR/(P+R)) (%)         94.6     92.3     97.4     89.3     92.3     88.2

Table 2: Results of the experimentation
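As a quick arithmetic check of the derived rows, the File 1 scores follow directly from the counts and the formulas given in the table:

```python
def prf(tp, fp, fn):
    # Precision, recall and F-measure as percentages,
    # using the formulas stated in Table 2.
    p = 100.0 * tp / (tp + fp)
    r = 100.0 * tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

# File 1 counts from Table 2: TP = 195, FP = 17, FN = 5
p, r, f = prf(tp=195, fp=17, fn=5)
```

This reproduces the tabulated 91.98% precision, 97.50% recall and 94.6 F-measure for File 1.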
ने (ergative case marker), को (accusative case marker), का (possessive case marker), के (possessive case marker), की (possessive case marker), से (from/by/with), में (in/into), पर (on/but), और (and; Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that; Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा (my), मेरे (my), मेरी (my), तुम्हारा (your), तुम्हारे (your), तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना (own), अपनी (own), अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका (whose), जिसकी (whose), जिसके (whose), जिनको (to whom), जिनके (to whom)
नहीं (no/not), न (no/not; Hindi particle), भी (also; Hindi particle), ही (only; Hindi particle), तो (then; Hindi particle), क्यों (why), क्या (what; Hindi particle), कहाँ (where; Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अंत में (in the end), आखिर में (in the end)
Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. खुशी होना (to feel happy, good)
v. मदद करना (to help)

Here the system identified five different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।
The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. कद्र करना (to respect)
Here also, the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not checked. However, the mining success rate obtained shows that these cases are few in practice. The use of 'stop words', which allows intervening words within CPs, helps a lot in improving the performance; similarly, the use of 'exit words' avoids a lot of incorrect identifications.
5 Conclusions
The simple empirical method for mining CPs outlined in this work yields an average 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used in word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both the alignment accuracy and the speed.
The methodology presented in this work is equally applicable to the other languages of the Indo-Aryan family.
References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.
Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.
Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.
Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.
Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3–4).
Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. In Computational Linguistics (COLING 2008), Manchester, UK.
Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.
Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.
Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.
Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.
Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.
Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.
Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.
Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.
Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553–564.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 47–54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2

1 Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
2 Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590

{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn; huangyun@comp.nus.edu.sg
Abstract
Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.
1 Introduction
The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, the translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into an existing phrase-based model.
A multiword expression (MWE) can be considered a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and different researchers emphasize different properties. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e. the property that the whole expression cannot be derived from its component words. Stanford University launched an MWE project1, in which different qualities of MWEs were presented. For a bilingual multiword expression (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is an MWE in the source language, and (2) the source phrase and the target phrase are exact translations of each other, i.e. there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since an MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.
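Condition (2) of the BiMWE definition can be checked mechanically given a word alignment of the phrase pair. A small sketch (the function name and representation are illustrative, not from the paper):

```python
def exact_translation(n_src, n_tgt, links):
    """Check that every word on each side of a phrase pair is covered
    by at least one alignment link, i.e. there is no additional
    unaligned (boundary) word on either side.

    n_src, n_tgt: lengths of the source and target phrases
    links: set of (i, j) alignment pairs, i in [0, n_src), j in [0, n_tgt)
    """
    src_covered = {i for i, _ in links}
    tgt_covered = {j for _, j in links}
    return len(src_covered) == n_src and len(tgt_covered) == n_tgt
```

A 2-to-1 pair where both source words align to the single target word passes the check, while a pair with an unaligned target word fails it.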
For machine translation, Piao et al. (2005) noted that the issue of MWE identification and accurate interpretation from source to target language remained an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as
1 http://mwe.stanford.edu
one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.
In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are as follows:
• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.
• Instead of improving translation indirectly by improving word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics lead only to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).
Besides the above differences, there are some advantages to our approaches:
• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted from noisy automatic word alignment, errors in the word alignment would otherwise further propagate to the SMT system and hurt SMT performance.
• We conduct experiments on a domain-specific corpus. For one thing, a domain-specific
2 http://www.statmt.org/moses
corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e. a domain-specific corpus has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are more flexible and concise. Take the Chinese term "软坚散结" for example. The meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or out of this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.
• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus, our approach fully exploits the available resources without greatly increasing the time and space complexity of the SMT system.
The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction to monolingual and bilingual MWE extraction.
2 Bilingual Multiword Expression Extraction
In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.
2.1 Automatic Extraction of MWEs
In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches fall into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et
al., 2005), (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007), and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.
The Log-Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary length. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".
Definition 1 (Unit): A unit is any sub-string of the given sentence. For example, "A B", "C", "C D E" are all units, but "A B D" is not a unit.
Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.
Definition 3 (Score): The score function is defined only on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit3. For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").
Definition 4 (Select): The selecting operator finds the two adjacent units with the maximum score in a list.
Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce unit "B C" and unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".
Initially, every word in the sentence is considered as one unit, and all these units form an initial list L. If the sentence is of length N, then the
3 We use a stoplist to eliminate units containing function words by setting their score to 0.
list contains N units, of course. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterative loop with two steps: (1) select the two adjacent units with the maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. Our algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.
[Figure 1: Example of the Hierarchical Reducing Algorithm — the hierarchical reduction tree over the example sentence, with adjacent-word LLR values 14.71, 675.52, 105.96, 0, 0, and 80.96.]
Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is the Chinese phrase meaning "healthy tea for preventing hyperlipidemia"4. Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs (glossed as "hyper-lipid-", "hyperlipidemia", "healthy tea", and "preventing hyperlipidemia") are extracted, in that order, and the sub-strings above the dotted line in Figure 1 are not extracted.
From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structure of natural language. Furthermore, in the HRA, MWEs are scored by the minimum LLR of the adjacent words within them, which gives a lexical confidence for the extracted MWEs. Finally, if the given sentence has length N, the HRA is guaranteed to terminate within N − 1 iterations, which is very efficient.
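The select/reduce loop described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; `llr` is assumed to be a precomputed table mapping adjacent word pairs to their log-likelihood-ratio scores.

```python
def hra(sentence, llr, threshold=20.0):
    """LLR-based Hierarchical Reducing Algorithm (sketch).

    sentence: list of words; llr: dict mapping (word, word) -> LLR score.
    Returns the extracted MWEs in the order they are reduced.
    """
    units = [[w] for w in sentence]   # initially, every word is one unit
    mwes = []
    while len(units) > 1:
        # Select: adjacent unit pair with the maximum score, where the score
        # is LLR(last word of the left unit, first word of the right unit).
        best_i, best_score = None, float("-inf")
        for i in range(len(units) - 1):
            s = llr.get((units[i][-1], units[i + 1][0]), 0.0)
            if s > best_score:
                best_i, best_score = i, s
        if best_score < threshold:    # termination condition (1)
            break
        # Reduce: concatenate the pair, put it back, and record it as an MWE.
        merged = units[best_i] + units[best_i + 1]
        units[best_i:best_i + 2] = [merged]
        mwes.append(" ".join(merged))
    return mwes
```

With the toy sentence "A B C D E" and a high score only between B-C and C-D, the algorithm first extracts "B C" and then "B C D", then stops once all remaining adjacent scores fall below the threshold.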
However, the HRA has a problem: it extracts substrings before extracting the whole string, even if the substrings only appear within that particular whole string, which we consider useless. To solve this problem, we use contextual features,
4 The whole sentence means "healthy tea for preventing hyperlipidemia"; the glosses of the Chinese words are "preventing", "hyper-", "-lipid-", "-emia", "for", "healthy", and "tea".
namely contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.
2.2 Automatic Extraction of MWEs' Translations
Subsection 2.1 described the algorithm for obtaining MWEs; in this subsection, we introduce the procedure for finding their translations in the parallel corpus.
To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:
1. Run GIZA++5 to align words in the training parallel corpus.
2. For a given MWE, find the bilingual sentence pairs whose source-language sentences include the MWE.
3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).
After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how likely a candidate is to be a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.
5 http://www.fjoch.com/GIZA++.html
We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
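As a rough sketch of the classifier stage, the following shows a plain binary perceptron over numeric feature vectors. It assumes each candidate phrase pair has already been encoded as a vector of the ten features above; the function names and the simple averaging-free update are illustrative, not the paper's exact implementation.

```python
def train_perceptron(data, epochs=10):
    """Train a binary perceptron.

    data: list of (feature_vector, label) pairs with label in {+1, -1}.
    Returns the weight vector and bias.
    """
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:          # misclassified: additive update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def classify(w, b, x):
    """Return +1 (right translation candidate) or -1 (wrong candidate)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

In the paper's setting, the +1 class corresponds to a correct MWE translation candidate and the feature vector holds the five translation and five language features.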
3 Application of Bilingual MWEs
Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.
3.1 Model Retraining with Bilingual MWEs
The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance has been shown by many researchers (Wu et al., 2008; Eck et al., 2004); our method, however, shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
3.2 New Feature for Bilingual MWEs
Lopez and Resnik (2006) pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source-language phrase contains an MWE (as a substring) and
the target-language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
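The binary feature can be computed with a simple pass over the phrase table. The sketch below uses hypothetical in-memory data structures (a list of `(src, tgt, features)` tuples and a set of MWE pairs) rather than the actual Moses phrase-table file format.

```python
def add_mwe_feature(phrase_table, bimwes):
    """Append a 0/1 MWE-containment feature to each phrase-table entry.

    phrase_table: list of (src, tgt, features) with src/tgt as strings.
    bimwes: set of (src_mwe, tgt_mwe) string pairs.
    """
    augmented = []
    for src, tgt, feats in phrase_table:
        # Flag is 1.0 iff some bilingual MWE appears as a substring pair.
        flag = 1.0 if any(s in src and t in tgt for s, t in bimwes) else 0.0
        augmented.append((src, tgt, feats + [flag]))
    return augmented
```

In a real system this feature would be written as an extra score column of the phrase table, and its log-linear weight tuned together with the other features.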
3.3 Additional Phrase Table of Bilingual MWEs
Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing automatically extracted bilingual MWEs. As for probability assignment, we simply assign 1 to the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in the input sentence, the decoder searches for candidate translation phrases in both phrase tables during translation. We use "Baseline+NewBP" to denote this method.
4 Experiments
4.1 Data
We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation task is Chinese-to-English.
Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.
                        Chinese      English
Training   Sentences         120,355
           Words        4,688,873    4,737,843
Dev        Sentences           1,000
           Words           31,722       32,390
Test       Sentences           1,000
           Words           41,643       40,551

Table 1: Traditional medicine corpus
6 http://www.speech.sri.com/projects/srilm/
Table 2 gives detailed information on the chemical industry domain data. In this experiment, 59,466 bilingual MWEs are extracted.
                        Chinese      English
Training   Sentences         120,856
           Words        4,532,503    4,311,682
Dev        Sentences           1,099
           Words           42,122       40,521
Test       Sentences           1,099
           Words           41,069       39,210

Table 2: Chemical industry corpus
We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform a statistical significance test between two translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.
4.2 MT Systems
We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; and (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added to the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.
Features are combined in a log-linear model. To obtain the best translation $\hat{e}$ of the source sentence $f$, the log-linear model uses the following equation:

$$\hat{e} = \arg\max_{e} p(e|f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f) \quad (1)$$

in which $h_m$ and $\lambda_m$ denote the $m$-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
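Equation (1) can be illustrated with a toy decoder-free search over a fixed candidate list. The feature functions here take only the candidate `e` (the source sentence `f` is held fixed), and the candidates, features, and weights are placeholders, not the actual Moses features.

```python
def best_translation(candidates, feature_fns, weights):
    """Pick the candidate e maximizing the log-linear score
    sum_m lambda_m * h_m(e), as in Equation (1) with f fixed.

    candidates: list of candidate translations e.
    feature_fns: list of feature functions h_m(e) -> float.
    weights: list of weights lambda_m, one per feature function.
    """
    def score(e):
        return sum(lam * h(e) for lam, h in zip(weights, feature_fns))
    return max(candidates, key=score)
```

A real decoder searches an exponentially large candidate space with dynamic programming and pruning rather than enumerating candidates, but the scoring rule is the same weighted sum of feature values.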
7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm
4.3 Results
Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain
Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement. The Baseline+BiMWE method achieves slightly higher translation quality than the baseline system.
To our disappointment, however, none of these improvements are statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio $x$ of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming $x$ follows a Gaussian distribution, the ranking score of a phrase pair $(s, t)$ is defined as:

$$Score(s, t) = \log(LLR(s, t)) \times \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad (2)$$

Here the mean $\mu$ and variance $\sigma^2$ are estimated from the training set. After ranking by score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
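Equation (2) is straightforward to compute once the LLR of the pair and the length-ratio statistics are known. In this sketch, `mu` and `sigma` (which the paper estimates from the training set) are passed in as plain arguments.

```python
import math

def ranking_score(llr, x, mu, sigma):
    """Equation (2): LLR weighted by a Gaussian density over the
    source/target length ratio x, with mean mu and std dev sigma."""
    gauss = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return math.log(llr) * gauss
```

Pairs whose length ratio is far from the typical ratio receive a small Gaussian weight and so rank low, even when their LLR is large.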
From this table, we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better
Methods            50,000    60,000    70,000
Baseline                     0.2658
Baseline+BiMWE     0.2671    0.2686    0.2715
Baseline+Feat      0.2728    0.2734    0.2712
Baseline+NewBP     0.2662    0.2706    0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain
than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as that of Lambert and Banchs (2005).
To verify the assumption that bilingual MWEs improve SMT performance not only on one particular domain, we also perform experiments on the chemical industry domain. Table 5 shows the results. From this table, we can see that these three methods improve translation performance on the chemical industry domain as well as on the traditional medicine domain.
Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain
4.4 Discussion
In order to understand in what respects our methods improve translation performance, we manually analyzed some test sentences; this subsection gives some examples.
(1) For the first example in Table 6, the source MWE meaning "dredging meridians" is aligned to other words and not correctly translated by the baseline system, while it is aligned to the correct target phrase "dredging meridians" by Baseline+BiMWE, since the corresponding bilingual MWE has been added into the training corpus and then aligned by GIZA++.
(2) For the second example in Table 6, the source word has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as its translation, while the Baseline+Feat system chooses "medicated tea", because the additional feature assigns high probability to the correct translation "medicated tea".

Src:      [Chinese source sentence]
Ref:      the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE:   the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src:      [Chinese source sentence]
Ref:      the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat:    may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples
5 Conclusions and Future Work
This paper presents the LLR-based Hierarchical Reducing Algorithm for automatically extracting bilingual MWEs, and investigates the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.
The strategies for applying bilingual MWEs in this paper are still rough and simple. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in phrase-based SMT systems, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.
References
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9-16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89-96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1-8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1-5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531-540.
Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1-8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792-798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337-344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41-46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293-303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48-54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396-403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9-16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90-99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24-30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36-46.

Kishore Papineni, Salim Roukos, Todd Ward, and Weijing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378-397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1-15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17-24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993-1000.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55-62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp
Abstract
This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
1 Introduction
Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).
In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.
When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.
In sentence (1), the label of "Kazama" can be recognized as S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).
On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized as B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is separated from "shinyou" by three morphemes, cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized the information of the head of the bunsetsu1. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.
However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized as B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.
This paper proposes Japanese bottom-up named

1 A bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.
(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama, TOP)
    'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank, GEN) setsuritsu-mo (establishment)
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation, GEN) utagai-de (suspicion, INS)
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences
entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter), and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combinations "gaikoku-jin-touroku-hou" (ARTIFACT) with "ihan" (OTHER), "gaikoku-jin" (PERSON) with "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) with "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on methods based on sequential labeling. Section 3 describes an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives the experimental results.
2 Related Work
In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX
Class          Example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21st century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples
Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.
NER methods are divided into two approaches: rule-based approaches and machine learning approaches. According to previous work, the machine learning approach achieves better performance than the rule-based approach.
In general a machine learning method is for-malized as a sequential labeling problem Thisproblem is first assigning each token (character ormorpheme) to several labels In an SE-algorithm(Sekine et al 1998) S is assigned to NE com-posed of one morpheme B I E is assigned to thebeginning middle end of NE respectively and Ois assigned to the morpheme that is not an NE2The labels S B I and E are prepared for each NEclasses and thus the total number of labels is 33(= 8 4 + 1)
The model for label estimation is learned with machine learning. The following features are generally utilized: characters, type of

2Besides, there are the IOB1 and IOB2 algorithms, which use only I, O, and B, and the IOE1 and IOE2 algorithms, which use only I, O, and E (Kim and Veenstra, 1999).
[Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin"). (a) Initial state: every chunk has an estimated label and score, i.e., "Habu" (PERSON, 0.111), "Habu-Yoshiharu" (PERSON, 0.438), "Habu-Yoshiharu-Meijin" (ORGANIZATION, 0.083), "Yoshiharu" (MONEY, 0.075), "Yoshiharu-Meijin" (OTHERe, 0.092), "Meijin" (OTHERe, 0.245). (b) Final output, determined along the analysis direction: "Habu-Yoshiharu" (PERSON, 0.438) + "Meijin" (OTHERe, 0.245).]
character, POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVMs or CRFs have been proposed.

Most NER methods based on sequential labeling use only local information, and thus methods utilizing global information have been proposed. Nakano et al. utilized as features the word sub-class of the NE in the analysis direction within the bunsetsu, the noun at the end of the bunsetsu adjacent in the analysis direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized cache features, coreference results, syntactic features, and case-frame features as structural features (Sasano and Kurohashi, 2008).

Some work has acquired knowledge from large unannotated corpora and applied it to NER. Kazama et al. utilized a named entity dictionary constructed from Wikipedia and a noun clustering result obtained from a huge number of dependency-relation pairs (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al., 2008).

In Japanese NER research, the CRL NE data set is usually used for evaluation. This data set contains approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 on this data.
3 Overview of Proposed Method
Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks in the bunsetsu ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) are estimated using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), in which the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure proceeds from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the assignment "Habu-Yoshiharu" (PERSON) and the assignment "Habu" (PERSON) + "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, our method chooses the better label assignment for each cell using a machine learning method.

The following two models are learned:
• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)
The two models are described in detail in thenext section
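To make the bottom-up procedure concrete, the following sketch fills the cells in the same order. Note this is a simplification: where the actual method applies the learned label comparison model to each cell, the sketch simply keeps whichever candidate assignment has the higher summed chunk score; the scores mirror Figure 2 and are illustrative only.

```python
# CKY-style bottom-up search over a bunsetsu: every cell spans a chunk of
# morphemes; a cell keeps either its whole-chunk label or the best
# combination of two sub-cells. Labels and scores below follow Figure 2.
CHUNK_SCORES = {
    ("Habu",): ("PERSON", 0.111),
    ("Habu", "Yoshiharu"): ("PERSON", 0.438),
    ("Habu", "Yoshiharu", "Meijin"): ("ORGANIZATION", 0.083),
    ("Yoshiharu",): ("MONEY", 0.075),
    ("Yoshiharu", "Meijin"): ("OTHERe", 0.092),
    ("Meijin",): ("OTHERe", 0.245),
}

def best_assignment(morphemes):
    """Return (score, [(chunk, label), ...]) for the best segmentation."""
    def solve(i, j):
        chunk = tuple(morphemes[i:j])
        label, s = CHUNK_SCORES[chunk]
        best = (s, [(chunk, label)])
        for k in range(i + 1, j):  # compare against every binary split
            ls, la = solve(i, k)
            rs, ra = solve(k, j)
            if ls + rs > best[0]:
                best = (ls + rs, la + ra)
        return best
    return solve(0, len(morphemes))

score, assignment = best_assignment(["Habu", "Yoshiharu", "Meijin"])
# -> "Habu-Yoshiharu" (PERSON) + "Meijin" (OTHERe), as in Figure 2 (b)
```

Memoizing `solve` over `(i, j)` turns the naive recursion into the usual CKY-style dynamic program over cells.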
[Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin(-ga)": "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and the remaining chunks are labeled invalid.]
4 Model Learning
4.1 Label Estimation Model

This model estimates the label of each chunk. The analysis unit is basically the bunsetsu, because 93.5% of the named entities in the CRL NE data are located within a single bunsetsu. Exceptionally, the following expressions located in multiple bunsetsu tend to be NEs:

• expressions enclosed in parentheses (e.g., "Himeyuri-no tou" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a (possibly expanded) bunsetsu.3

For each bunsetsu, head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted; in the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.

Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned their correct labels based on a hand-annotated corpus. The label set comprises 13 classes: the eight NE classes (shown in Table 1) and five further classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

The chunk that corresponds to a whole bunsetsu and does not contain any NE is labeled OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled OTHERb, OTHERi, or OTHERe, respectively.4

3As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4Each OTHER label is assigned to the longest chunk that satisfies its condition.
1. # of morphemes in the chunk
2. the position of the chunk in its bunsetsu
3. character type5
4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army" this feature is "Katakana/Kanji"
5. word class, word sub-class, and several features provided by the morphological analyzer JUMAN
6. several features6 provided by the parser KNP
7. string of the morpheme in the chunk
8. IPADIC7 feature
   - If the string of the chunk is registered in the IPADIC categories "person", "location", "organization", or "general", this feature fires
9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")
10. cache feature
   - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used as a feature
11. particles that the bunsetsu includes
12. the morphemes, particles, and head morpheme in the parent bunsetsu
13. the NE/category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
   - For example, in the ga (NOM) case of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu gave an interview), the feature "PERSON:0.245" is utilized for the chunk "Habu"
14. parenthesis feature
   - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model
A chunk that is neither one of the eight NE classes nor one of the above four OTHER classes is labeled invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and the other chunks are labeled invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned
5The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6When a morpheme has an ambiguity, all the corresponding features fire.

7http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja
to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for (3) and (5)-(8), three categories according to the position of the morpheme in the chunk are distinguished: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu" feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu" feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features of each chunk. To handle the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed by the sigmoid function 1/(1 + exp(-βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.

The purpose of setting up the label invalid is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label invalid has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk in which invalid has the highest score, the label with the second highest score is adopted.
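The score calibration just described can be sketched as follows (β = 1 as in the experiments of Section 6; the SVM margin values here are invented for illustration):

```python
import math

def label_distribution(margins, beta=1.0):
    """Map raw one-vs-rest SVM outputs through the sigmoid 1/(1+exp(-beta*x))
    and renormalize so that the values for one chunk sum to one."""
    sig = {lab: 1.0 / (1.0 + math.exp(-beta * x)) for lab, x in margins.items()}
    z = sum(sig.values())
    return {lab: v / z for lab, v in sig.items()}

# Dummy margins for three of the 13 labels of one chunk.
dist = label_distribution({"PERSON": 1.2, "invalid": 0.4, "OTHERe": -2.0})
```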
4.2 Label Comparison Model
This model compares two label assignments for a given string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up sandwiching "vs." (the left set is called the first set and the right set the second set). When the first set is correct, the example is positive; otherwise it is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not used for model learning.

Then features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used
[Figure 4: Assignment of positive/negative examples:
positive: "Habu-Yoshiharu" (PSN) vs. "Habu" + "Yoshiharu" (PSN + MNY)
positive: "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe) vs. "Habu" + "Yoshiharu" + "Meijin" (PSN + MONEY + OTHERe)
negative: "Habu-Yoshiharu-Meijin" (ORG) vs. "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe)]
for each label, as estimated by the label estimation model, and the last (13th) dimension is used for the score of the SVM output. Then, for the first and second sets, the features of each chunk are arranged from the left, and zero vectors are placed in the remaining part.

Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using an SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.
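The feature layout of Figure 5 can be sketched as follows (the vector contents are dummies; only the left-to-right arrangement and zero-padding are the point):

```python
DIM, MAX_CHUNKS = 13, 5  # 13 dims per chunk, at most 5 chunks per set

def set_features(chunk_vectors):
    """Concatenate up to five 13-dim chunk vectors from the left,
    zero-padding the remaining slots (longer sets are discarded)."""
    assert len(chunk_vectors) <= MAX_CHUNKS
    padded = list(chunk_vectors) + [[0.0] * DIM] * (MAX_CHUNKS - len(chunk_vectors))
    return [x for vec in padded for x in vec]

v11 = [0.1] * DIM                     # "Habu-Yoshiharu" (first set)
v21, v22 = [0.2] * DIM, [0.3] * DIM   # "Habu" + "Yoshiharu" (second set)
example = set_features([v11]) + set_features([v21, v22])
# 2 sets * 5 slots * 13 dims = 130 feature values per comparison example
```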
5 Analysis
First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case the model chooses the label assignment "B"; that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell "Yoshiharu-Meijin", the model compares the
[Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu" vs. "Habu + Yoshiharu"; V11, V21, V22, and 0 are vectors whose dimension is 13):

chunk:  Habu-Yoshiharu | Habu, Yoshiharu
label:  PERSON         | PERSON, MONEY
vector: V11 0 0 0 0    | V21 V22 0 0 0]
[Figure 6: The label comparison model. The cells are A = "Habu" (PERSON, 0.111), B = "Habu-Yoshiharu" (PERSON, 0.438), C = "Habu-Yoshiharu-Meijin" (ORGANIZATION, 0.083), D = "Yoshiharu" (MONEY, 0.075), E = "Yoshiharu-Meijin" (OTHERe, 0.092), F = "Meijin" (OTHERe, 0.245). (a) Label assignment for the cell "Habu-Yoshiharu": the assignment "Habu-Yoshiharu" (B) is compared with "Habu" + "Yoshiharu" (A+D). (b) Label assignment for the cell "Habu-Yoshiharu-Meijin": the assignments "Habu-Yoshiharu-Meijin" (C), "Habu" + "Yoshiharu" + "Meijin" (A+D+F), and "Habu-Yoshiharu" + "Meijin" (B+F) are compared; cell E has been updated to "Yoshiharu" + "Meijin" (MNY+OTHERe, 0.075+0.245).]
label assignment "E" with the label assignment "D+F". In this case the model chooses the label assignment "D+F"; that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated: here, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper-right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparison step, a label assignment in which an OTHER* label follows an OTHER* label (OTHER* - OTHER*) is not allowed, since each OTHER label is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination, the comparison is not performed.
6 Experiment
To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In this data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled OPTIONAL and were used for neither
[Figure 7: 5-fold cross validation. The CRL NE data is divided into parts (a)-(e). To analyze part (a): label estimation model 1 is learned from parts (b)-(e); label estimation model 2b is learned from parts (c)-(e) and applied to part (b) to obtain features for the label comparison model, and likewise for models 2c, 2d, and 2e; the label comparison model is then learned from the obtained features.]
the model learning8 nor the evaluation.

We performed 5-fold cross validation, following previous work. Different from previous work, our method has to learn the SVM models twice; therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, label estimation model 1 is learned from parts (b)-(e). Then label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same holds for parts

8Exceptionally, OPTIONAL expressions are used when the label estimation model for the OTHER* and invalid labels is learned.
              Recall              Precision
ORGANIZATION  81.83 (3008/3676)   88.37 (3008/3404)
PERSON        90.05 (3458/3840)   93.87 (3458/3684)
LOCATION      91.38 (4992/5463)   92.44 (4992/5400)
ARTIFACT      46.72 (349/747)     74.89 (349/466)
DATE          93.27 (3327/3567)   93.12 (3327/3573)
TIME          88.25 (443/502)     90.59 (443/489)
MONEY         93.85 (366/390)     97.60 (366/375)
PERCENT       95.33 (469/492)     95.91 (469/489)
ALL-SLOT      87.87               91.79
F-measure     89.79

Table 3: Experimental result
(d) and (e). Then the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both label estimation model 1 and the label comparison model.

In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.
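The nested training scheme of Figure 7 can be sketched as a per-fold plan (fold names follow the figure; this illustrates only the data splits, not the actual SVM training):

```python
FOLDS = ["a", "b", "c", "d", "e"]

def training_plan(test_fold):
    """For one test fold: model 1 trains on the other four folds; for each of
    those, a model 2 trains on the remaining three and is applied back to it
    to produce unbiased features for the label comparison model."""
    rest = [f for f in FOLDS if f != test_fold]
    return {
        "model_1": rest,
        "model_2": {held: [f for f in rest if f != held] for held in rest},
    }

plan = training_plan("a")
# e.g. model 2b trains on folds c, d, e and is applied back to fold b
```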
7 Discussion
7.1 Comparison with Previous Work
Table 4 presents the comparison with previous work; our method outperformed the previous methods. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for model learning. Therefore, by incorporating such knowledge and/or analysis results into our method, the performance could be improved further.

Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively; similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou" (law), which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", our method can utilize the information of "hou".

9http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
7.2 Error Analysis
There were some errors in analyzing Katakana words. In the following example, although the correct analysis is that "Batistuta" is labeled as PERSON, the system labeled it as OTHERs:
(4) Italy-de katsuyaku-suru Batistuta-wo kuwaeta Argentine
    Italy-LOC  active  Batistuta-ACC  called  Argentina
    'Argentina called up Batistuta, who was active in Italy'
There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were also some errors in applying the label comparison model although the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis is that "HongKong-seityou" is labeled as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning to adjust the value of β in the sigmoid function and to refine the
                                 F1     analysis unit   distinctive features
(Fukushima et al., 2008)         89.29   character       Web
(Kazama and Torisawa, 2008)      88.93   character       Wikipedia, Web
(Sasano and Kurohashi, 2008)     89.40   morpheme        structural information
(Nakano and Hirai, 2004)         89.03   character       bunsetsu features
(Masayuki and Matsumoto, 2003)   87.21   character
(Isozaki and Kazawa, 2003)       86.77   morpheme
proposed method                  89.79   compound noun   Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on the CRL NE data using cross validation)
[Figure 8: An example of the error in the label comparison model. (a) Initial state: "HongKong" (LOCATION, 0.266), "HongKong-seityou" (ORGANIZATION, 0.205), "seityou" (OTHERe, 0.184). (b) Final output: the assignment "HongKong + seityou" (LOC+OTHERe, 0.266+0.184) is incorrectly chosen, so "HongKong" is labeled LOCATION.]
features for the label comparison model
8 Conclusion
This paper proposed a bottom-up named entity recognition method using two-stage machine learning. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.

We are planning to integrate this method with a syntactic and case analysis method (Kawahara and Kurohashi, 2007) to perform syntactic, case, and named entity analysis simultaneously and improve the overall accuracy.
References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3 (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970-979 (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304-311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407-415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173-179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition, pages 1121-1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282-289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8-15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934-941 (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607-612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171-178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63-70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
[Pages 63-70: the body of this paper did not survive text extraction; only scattered fragments remain. The recoverable fragments suggest it concerns the processing of Japanese abbreviations, e.g., ファミレス as an abbreviation of ファミリーレストラン ('family restaurant').]
Author Index

Bhutada, Pravin, 17
Cao, Jie, 47
Caseli, Helena, 1
Diab, Mona, 17
Finatto, Maria Jose, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55
Hoang, Hung Huu, 31
Huang, Yun, 47
Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55
Liu, Qun, 47
Lu, Yajuan, 47
Machado, Andre, 1
Ren, Zhixiang, 47
Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63
Villavicencio, Aline, 1
Wakaki, Hiromi, 63
Zarrieß, Sina, 23
Identification: Identifying MWEs in free text is a very challenging problem. Due to the variability of expression, it does not suffice to collect and use a static list of known MWEs; complex rules and machine learning are typically needed as well.

Interpretation: Semantically interpreting MWEs is a central issue. For some kinds of MWEs, e.g., noun compounds, it could mean specifying their semantics using a static inventory of semantic relations, e.g., WordNet-derived. In other cases, an MWE's semantics could be expressible by a suitable paraphrase.

Disambiguation: Most MWEs are ambiguous in various ways. A typical disambiguation task is to determine whether an MWE is used non-compositionally (i.e., figuratively) or compositionally (i.e., literally) in a particular context.

Applications: Identifying MWEs in context and understanding their syntax and semantics is important for many natural language applications, including but not limited to question answering, machine translation, information retrieval, information extraction, and textual entailment. Still, despite the growing research interest, there are not enough successful applications in real NLP problems, which we believe is the key to the advancement of the field.

Of course, the above topics largely overlap. For example, identification can require disambiguating between literal and idiomatic uses, since MWEs are typically required to be non-compositional by definition. Similarly, interpreting three-word noun compounds like morning flight ticket and plastic water bottle requires disambiguation between a left and a right syntactic structure, while interpreting two-word compounds like English teacher requires disambiguating between (a) 'teacher who teaches English' and (b) 'teacher coming from England (who could teach any subject, e.g., math)'.
We received 18 submissions and, given our limited capacity as a one-day workshop, we were only able to accept 9 full papers for oral presentation, an acceptance rate of 50%.
We would like to thank the members of the Program Committee for their timely reviews. We would also like to thank the authors for their valuable contributions.
Dimitra Anastasiou, Chikara Hashimoto, Preslav Nakov, and Su Nam Kim
Co-Organizers
Organizers
Dimitra Anastasiou, Localisation Research Centre, Limerick University, Ireland
Chikara Hashimoto, National Institute of Information and Communications Technology, Japan
Preslav Nakov, National University of Singapore, Singapore
Su Nam Kim, University of Melbourne, Australia
Program Committee
Iñaki Alegria, University of the Basque Country (Spain)
Timothy Baldwin, University of Melbourne (Australia)
Colin Bannard, Max Planck Institute (Germany)
Francis Bond, National Institute of Information and Communications Technology (Japan)
Gaël Dias, Beira Interior University (Portugal)
Ulrich Heid, Stuttgart University (Germany)
Stefan Evert, University of Osnabrück (Germany)
Afsaneh Fazly, University of Toronto (Canada)
Nicole Grégoire, University of Utrecht (The Netherlands)
Roxana Girju, University of Illinois at Urbana-Champaign (USA)
Kyo Kageura, University of Tokyo (Japan)
Brigitte Krenn, Austrian Research Institute for Artificial Intelligence (Austria)
Éric Laporte, University of Marne-la-Vallée (France)
Rosamund Moon, University of Birmingham (UK)
Diana McCarthy, University of Sussex (UK)
Jan Odijk, University of Utrecht (The Netherlands)
Stephan Oepen, University of Oslo (Norway)
Darren Pearce, London Knowledge Lab (UK)
Pavel Pecina, Charles University (Czech Republic)
Scott Piao, University of Manchester (UK)
Violeta Seretan, University of Geneva (Switzerland)
Stan Szpakowicz, University of Ottawa (Canada)
Beata Trawiński, University of Tübingen (Germany)
Peter Turney, National Research Council of Canada (Canada)
Kiyoko Uchiyama, Keio University (Japan)
Begoña Villada Moirón, University of Groningen (The Netherlands)
Aline Villavicencio, Federal University of Rio Grande do Sul (Brazil)
Table of Contents
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto .......... 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan .......... 9

Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada .......... 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn .......... 23

A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan .......... 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha .......... 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang .......... 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi .......... 55

Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita .......... 63
Workshop Program
Friday August 6 2009
8:30–8:45 Welcome and Introduction to the Workshop

Session 1 (8:45–10:00): MWE Identification and Disambiguation

8:45–9:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto

9:10–9:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

9:35–10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00–10:30 BREAK

Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55–11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45–13:50 LUNCH

Session 3 (13:50–15:30): Applications

13:50–14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang

14:15–14:40 Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00 BREAK

16:00–17:00 General Discussion

17:00–17:15 Closing Remarks
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 1–8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, André Machado♣, Maria José Finatto♥

♦Department of Computer Science, Federal University of São Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br
Abstract
Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. Particularly, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.
1 Introduction
A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing), and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).
Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.
In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.
The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.
2 Related Work
The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.
A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g., Zhang et al. (2006); Villavicencio et al. (2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g., Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g., Villada Moirón and Tiedemann (2006); Caseli et al. (2009)).

As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g., ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kinds of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).
Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ2) for the type-independent detection of MWEs, and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that, for MWE identification, the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).
Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).
Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.
Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.
3 The Corpus and Reference Lists
The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of Pediatrics texts.

The Pediatrics Glossary was built from the 36741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g., removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3645 n-grams, which were manually checked by translation students, resulting in 2407 terms: 1421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
4 Statistically-Driven and Alignment-Based methods
4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992) and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed
1Available at the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim
to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).
From the Portuguese portion of the Corpus of Pediatrics, 196105 bigram and 362663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64839 bigrams and 54548 trigrams. Each of the measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
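As an illustration of this scoring and ranking step, the sketch below computes PMI from raw counts and ranks a handful of bigrams. This is a minimal sketch, not the Ngram Statistics Package itself: the unigram frequencies and the bigrams other than aleitamento materno are invented toy values (only the 202 occurrences of aleitamento materno and the 64839-candidate total come from the paper), and the total candidate count is used as a simplified normalizer.

```python
import math

def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise Mutual Information: log2( p(x,y) / (p(x) * p(y)) )."""
    return math.log2((pair_count / total) /
                     ((w1_count / total) * (w2_count / total)))

# Toy counts: (bigram frequency, word-1 frequency, word-2 frequency).
counts = {
    ("aleitamento", "materno"): (202, 210, 205),   # unigram counts assumed
    ("couro", "cabeludo"): (40, 42, 41),           # invented
    ("se", "que"): (150, 9000, 12000),             # invented
}
total = 64839  # bigram candidates after the frequency cut-off

scores = {bg: pmi(c, f1, f2, total) for bg, (c, f1, f2) in counts.items()}
# Rank by PMI; with several measures, the per-measure ranks would be
# averaged to give the combined score described above.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note how the frequent but independent pair ("se", "que") gets a negative PMI and ranks last, while the rare but strongly associated pairs rank on top.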
4.2 Alignment-Based method
The second of the MWE extraction approaches to be investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts (a text written in one (source) language and its translation into another (target) language) is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that: (a) S and T share some semantic features, and (b) S may be an MWE.

In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
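A minimal sketch of this candidate extraction, assuming alignment links are given as (source position, target position) pairs in the style a GIZA++-like aligner produces; the sentence and the links are invented for illustration. Consecutive source words whose links point to exactly the same set of target positions are joined into one n : m candidate with n ≥ 2:

```python
from collections import defaultdict

def mwe_candidates(links, src_tokens):
    """Extract n:m (n >= 2) candidates: maximal runs of consecutive source
    words whose alignment links cover exactly the same target positions."""
    targets_of = defaultdict(set)
    for s, t in links:                     # (source_pos, target_pos) pairs
        targets_of[s].add(t)
    positions = sorted(targets_of)
    candidates = []
    i = 0
    while i < len(positions):
        j = i
        while (j + 1 < len(positions)
               and positions[j + 1] == positions[j] + 1
               and targets_of[positions[j + 1]] == targets_of[positions[i]]):
            j += 1
        if j > i:                          # two or more source words joined
            candidates.append(
                " ".join(src_tokens[p] for p in positions[i:j + 1]))
        i = j + 1
    return candidates

# Hypothetical 2:1 alignment: 'aleitamento materno' -> 'breastfeeding'.
src = ["o", "aleitamento", "materno", "exclusivo"]
links = [(0, 0), (1, 1), (2, 1), (3, 2)]
print(mwe_candidates(links, src))  # -> ['aleitamento materno']
```

Run over the whole corpus, the candidates returned per sentence pair would then be counted to obtain frequencies like the 202 occurrences mentioned above.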
Thus, notice that the alignment-based MWE extraction method does not rely on the conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, despite the number of target words involved. These features indicate that the method prioritizes precision over recall.

It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.
In this paper, we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as gold standard.
To perform the extraction of MWE candidates following the alignment-based approach, first the original corpus had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. These preprocessing steps were performed, respectively, by a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).
2GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html
3Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org
From the preprocessed corpus, the MWE candidates are extracted as those in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in Caseli et al. (2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of); and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as Caseli et al. (2009), plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral, and (b) a minimum frequency threshold of 2.
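The filtering step might be sketched as follows. The tag names and the example tag sequences are simplified stand-ins for illustration, not the actual Apertium tagset or the full F3 pattern list:

```python
# Simplified F3-style filter: discard candidates whose POS sequence begins
# or ends with a closed-class tag, and those below the frequency threshold.
BAD_EDGE_TAGS = {"DET", "ADV", "CONJ", "PREP", "VERB", "PRON", "NUM"}
MIN_FREQ = 2  # the F2/F3 threshold; F1 uses 5 instead

def keep(tags, freq):
    """tags: POS tags of the candidate's words (hypothetical tag names)."""
    return (freq >= MIN_FREQ
            and tags[0] not in BAD_EDGE_TAGS
            and tags[-1] not in BAD_EDGE_TAGS)

# 'aleitamento materno' (NOUN ADJ) survives; 'e de nao' (CONJ PREP ADV)
# is filtered out because of its edge tags.
print(keep(["NOUN", "ADJ"], 202), keep(["CONJ", "PREP", "ADV"], 7))
```

Applying such edge-tag patterns plus the frequency cut-off is what reduces the initial 34277 candidates to the final sets reported in Table 3.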
5 Experiments and Results
Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as, for instance, jogar video game (lit. 'play video game'). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).
In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 × precision × recall) / (precision + recall)) figures for the association measures, using all the candidates (on the
PMI                            alignment-based
Online Mendelian Inheritance   faixa etária
Beta Technology Incorporated   aleitamento materno
Lange Beta Technology          estados unidos
Óxido Nítrico Inalatório       hipertensão arterial
jogar video game               leite materno
e um de                        couro cabeludo
e a do                         bloqueio lactíferos
se que de                      emocional anatomia
e a da                         neonato a termo
e de não                       duplas mães bebês
Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach
pt MWE candidates       PMI      MI
# proposed bigrams      64839    64839
# correct MWEs          1403     1403
precision (%)           2.16     2.16
recall (%)              98.73    98.73
F (%)                   4.23     4.23
# proposed trigrams     54548    54548
# correct MWEs          701      701
precision (%)           1.29     1.29
recall (%)              96.03    96.03
F (%)                   2.55     2.55
# proposed bigrams      1421     1421
# correct MWEs          155      261
precision (%)           10.91    18.37
recall (%)              10.91    18.37
F (%)                   10.91    18.37
# proposed trigrams     730      730
# correct MWEs          44       20
precision (%)           6.03     2.74
recall (%)              6.03     2.74
F (%)                   6.03     2.74
Table 2: Evaluation of MWE candidates - PMI and MI
first half of the table) and using the top 1421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.
On the other hand, looking at the alignment-based method, 34277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered following the three filters described in Section 4.2: F1, F2 and F3.
To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary.
pt MWE candidates            F1       F2       F3
# filtered by POS patterns   24996    21544    32644
# filtered by frequency      9012     11855    1442
final set                    269      878      191
Table 3: Number of pt MWE candidates filtered in the alignment-based approach
pt MWE candidates        F1       F2       F3
# proposed bigrams       250      754      169
# correct MWEs           48       95       65
precision (%)            19.20    12.60    38.46
recall (%)               3.38     6.69     4.57
F (%)                    5.75     8.74     8.18
# proposed trigrams      19       110      20
# correct MWEs           1        9        4
precision (%)            5.26     8.18     20.00
recall (%)               0.14     1.23     0.55
F (%)                    0.27     2.14     1.07
# proposed bi+trigrams   269      864      189
# correct MWEs           49       104      69
precision (%)            18.22    12.04    36.51
recall (%)               2.28     4.83     3.21
F (%)                    4.05     6.90     5.90
Table 4: Evaluation of MWE candidates
In this automatic comparison, we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.
The different values of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only the bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), extracted by the alignment-based method, are not being considered at the moment.
After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists, the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in (Caseli et al., 2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.
From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see Table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of it being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name, or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.
The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates good agreement.
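Cohen's kappa compares the observed agreement with the agreement expected by chance given each judge's label distribution. A sketch with invented judgments (not the actual 122 candidates):

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p_e = sum((a.count(c) / n) * (b.count(c) / n)    # chance agreement
              for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical true/false MWE judgments by the two judges:
judge1 = [True, True, True, False, True, False]
judge2 = [True, True, False, False, True, False]
print(round(cohen_kappa(judge1, judge2), 2))  # -> 0.67
```

With these toy labels the observed agreement is 5/6 and the chance agreement 0.5, giving k = 0.67, which sits exactly at the lower bound of the "good agreement" band cited above.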
In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating with regard to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.
6 Conclusions and Future Work
MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but, due to their role in communication, they need to be properly accounted for in NLP applications and tasks.
In this paper, we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.
Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both or to at least one of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to say that, in this manual analysis, the human experts classified the MWEs as true independently of them being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.
In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.
Acknowledgments
We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank the financial support of FAPESP in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 9-16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim
CSSE Dept., University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science, National University of Singapore
kanmy@comp.nus.edu.sg
Abstract
We tackle two major issues in automatic keyphrase extraction from scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used in the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research shows that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many existing features are usable in unsupervised extraction.
1 Introduction
Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999), and document clustering (Hammouda et al., 2005).
In the past, various attempts have been made to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and on rich sets of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2, we give a more comprehensive overview of previous attempts.
Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g., Frank et al. (1999), Park et al. (2004), Nguyen and Kan (2007)), most of them do not effectively deal with the various forms keyphrases take, with the result that some keyphrases are never even considered as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.
Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. This leads to redundancy across studies and hinders direct comparison. In addition, existing features are, with few exceptions, specifically designed for supervised approaches. However, the supervised approach involves a large amount of manual labor, reducing its utility for real-world applications. Hence, an unsupervised approach is needed in order to minimize manual effort and to encourage wider use. Attesting the reliability and re-usability of features under the unsupervised approach is thus a worthy study, as it can establish tentative guidelines for applications.
This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases in order to propose a candidate selection method that improves the coverage of candidates occurring in various forms. Our second contribution re-examines existing keyphrase extraction features
reported in the literature, in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.
In the remaining sections, we give an overview of related work in Section 2, present our proposals on candidate selection and feature engineering in Sections 4 and 5, and describe our system architecture and data in Section 6. We then evaluate our proposals, discuss the outcomes, and conclude our work in Sections 7, 8, and 9, respectively.
2 Related Work
The majority of related work has been carried out using statistical approaches, rich sets of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e., the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent (2) keyphrase cohesion features (i.e., relationships among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e., relationships among the components of a keyphrase) (Park et al., 2004).
The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TFIDF (i.e., term frequency x inverse document frequency) and first occurrence in the document. TFIDF measures document cohesion, while first occurrence reflects the importance of the abstract or introduction, indicating that keyphrases exhibit locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head-noun heuristics that took three features: length of candidate, frequency, and head-noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. The
Textract system (Park et al., 2004) also ranks candidate keyphrases by its judgment of the keyphrases' degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases from single documents by utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influence of the other documents in the same cluster.
3 Keyphrase Analysis
In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of the keyphrases themselves. We believe there is more to be learned from the reference keyphrases by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods; see Section 6 for details of the data.
Syntactically, keyphrases can be formed by either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and, in a few instances, adverbs (e.g., mobile network, fast computing, partially observable Markov decision process). They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, and via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.
However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. We also found that keyphrases occur as simple conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases are more complex still (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

Criteria      Rules
Frequency     (Rule 1) Frequency heuristic, i.e., frequency >= 2 for simplex words vs. frequency >= 1 for NPs
Length        (Rule 2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
              (e.g., synchronous concurrent program vs. model of multiagent interaction)
Alternation   (Rule 3) of-PP form alternation
              (e.g., number of sensor = sensor number, history of past encounter = past encounter history)
              (Rule 4) Possessive alternation
              (e.g., agent's goal = goal of agent, security's value = value of security)
Extraction    (Rule 5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)* (NN|NNS|NNP|NNPS)
              (e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
              (Rule 6) Simplex Word/NP IN Simplex Word/NP
              (e.g., quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
              simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules
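To make Rule 5 concrete, the following is a minimal sketch of how such a POS-based pattern could be applied. The function names are ours, and we assume the input is already tagged with Penn Treebank POS tags; this is an illustration, not the paper's implementation.

```python
# Sketch of Rule 5: collect maximal runs of adjectives/nouns that end in a
# noun. Input: (token, Penn Treebank POS tag) pairs; tagging is assumed done.
ADJ_NOUN = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS"}
NOUN = {"NN", "NNS", "NNP", "NNPS"}

def trim_to_noun(run):
    # Drop trailing adjectives so the candidate ends in a noun (Rule 5).
    while run and run[-1][1] not in NOUN:
        run.pop()
    return [" ".join(tok for tok, _ in run)] if run else []

def extract_np_candidates(tagged):
    """Return NP candidates matching (adj|noun)* noun over the tag stream."""
    candidates, run = [], []
    for tok, tag in tagged:
        if tag in ADJ_NOUN:
            run.append((tok, tag))
        else:
            candidates.extend(trim_to_noun(run))
            run = []
    candidates.extend(trim_to_noun(run))
    return candidates
```

Rule 6 would additionally join two such NPs across an intervening preposition; we omit that here for brevity.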
A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word-order alternation. Previous studies have used such altered keyphrases in their of-PP form. For example, quality of service can be altered to service quality, often with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system is not distributing system, dynamical caching is not dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).
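The alternations just described (Rules 3 and 4 in Table 1) can be sketched in a few lines. The helper names are hypothetical, and only the simple "X of Y" and "X's Y" cases discussed above are handled:

```python
def of_pp_alternation(phrase):
    # Rule 3 sketch: 'X of Y' -> compound 'Y X'
    # (e.g. 'quality of service' -> 'service quality').
    head, sep, modifier = phrase.partition(" of ")
    return f"{modifier} {head}" if sep else phrase

def possessive_alternation(phrase):
    # Rule 4 sketch: "X's Y" -> 'Y of X'
    # (e.g. "agent's goal" -> 'goal of agent').
    owner, sep, owned = phrase.partition("'s ")
    return f"{owned} of {owner}" if sep else phrase
```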
4 Candidate Selection
We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction, and is closely related to term extraction, since the top-N ranked terms become the keyphrases of a document. As noted above, KEA employed indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules, but none carefully undertook an analysis of keyphrases. In this section, before presenting our method, we first describe our keyphrase analysis in detail.
In our keyphrase analysis, we observed that most author-assigned and reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction; few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to exclude NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.
Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic: our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of the NPs occurred only once in our data.
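The frequency and length heuristics above can be sketched as a single filter. The thresholds are the paper's (Rules 1 and 2); the function itself, and the simple " of " test for of-PP form, are our illustration:

```python
def keep_candidate(phrase, freq):
    """Sketch of Rules 1-2: simplex words need frequency >= 2, NPs only
    frequency >= 1; NPs may be up to 3 words, or 4 when in of-PP form."""
    words = phrase.split()
    min_freq = 1 if len(words) > 1 else 2  # Rule 1: NPs vs. simplex words
    max_len = 4 if " of " in phrase else 3  # Rule 2: of-PP form gets one more
    return freq >= min_freq and len(words) <= max_len
```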
Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule 6 for NPs in of-PP form broadens the coverage of
possible candidates: given an NP in of-PP form, we collect not only the simplex word(s), but also the non-of-PP form of NPs, from both the noun phrase governing the PP and the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
5 Feature Engineering
With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls to careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing features and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we also indicate which features are suitable only for the supervised setting (S) and which are applicable to both (S, U).
5.1 Document Cohesion
Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality; that is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.
F1: TFIDF (S, U). TFIDF indicates document cohesion by looking at the frequency of terms in documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF over substrings as well as exact matches. For example, grid computing often occurs as a substring of other phrases, such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e., we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system:
• (F1a) TFIDF
• (F1b) TF including counts of substrings
• (F1c) TF of substrings as a separate feature
• (F1d) TF normalized by candidate type (i.e., simplex words vs. NPs)
• (F1e) normalized TF by candidate type as a separate feature
• (F1f) IDF using Google n-grams
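As an illustration of F1b (substring-aware TF) and F1f (IDF from a large generic count), here is a hedged sketch. The function names are ours, and the n-gram statistics are passed in as plain numbers, since we make no assumption about how the Google n-gram data would actually be accessed:

```python
import math
from collections import Counter

def tf_with_substrings(candidates):
    """F1b sketch: a candidate's TF also counts its occurrences as a
    substring of longer candidates (e.g. 'grid computing' inside
    'grid computing algorithm')."""
    counts = Counter(candidates)
    tf = {}
    for cand in counts:
        tf[cand] = sum(
            n for other, n in counts.items()
            if cand == other or f" {cand} " in f" {other} "
        )
    return tf

def tf_idf(tf, ngram_count, total_ngrams):
    """F1f sketch: IDF estimated from a large generic n-gram count;
    the add-one smoothing is our choice, not the paper's."""
    return tf * math.log(total_ngrams / (1 + ngram_count))
```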
F2: First Occurrence (S, U). KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.
F3: Section Information (S, U). Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title, and/or references.
F4: Additional Section Information (S, U). We first added the related work (or previous work) section as section information not included in Nguyen and Kan (2007). We also propose and test a number of variations: we used the substrings that occur in section headers and reference titles as keyphrases; we counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections; and we assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in their sizes.
• (F4a) section 'related/previous work'
• (F4b) counting substrings occurring in key sections
• (F4c) section TF across all key sections
• (F4d) weighting key sections according to the proportion of keyphrases found
F5: Last Occurrence (S, U). Similar to the distance feature in KEA, the position of the last occurrence of a candidate may also signal the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.
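F2 and F5 can be computed together as relative positions, as in this sketch; the normalization to the range [0, 1] and the exact-token match are our choices, not necessarily the paper's:

```python
def position_features(candidate, tokens):
    """F2/F5 sketch: relative positions (in [0, 1]) of the first and
    last occurrence of a candidate in the token stream."""
    hits = [i for i, tok in enumerate(tokens) if tok == candidate]
    if not hits:
        return None  # candidate never occurs
    return hits[0] / len(tokens), hits[-1] / len(tokens)
```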
5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.
F6: Co-occurrence with Another Candidate in a Section (S, U). When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we use the number of sections in which candidates co-occur.
F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior on which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.
• (F7a) co-occurrence (Boolean) in title collection
• (F7b) co-occurrence (TF) in title collection
F8: Keyphrase Cohesion (S, U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top-N ranked candidates and the remainder. In the
1It contains 13M titles from articles, papers, and reports.
original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent two-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs, and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.
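A minimal sketch of this contextual-similarity approximation: each candidate word is represented by a sparse vector of neighboring-word counts, and candidates are compared by cosine. The window size, the use of raw counts, and the absence of the part-of-speech filter mentioned above are simplifying assumptions of ours:

```python
import math
from collections import Counter

def context_vector(word, tokens, window=5):
    """Profile a candidate word by the words occurring within a window
    around each of its occurrences."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            nearby = tokens[max(0, i - window):i + window + 1]
            vec.update(w for w in nearby if w != word)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```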
5.3 Term Cohesion
Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).
F9: Term Cohesion (S, U). Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex-word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our earlier discussion.
• (F9a) term cohesion as in Park et al. (2004)
• (F9b) TF normalized by candidate type (i.e., simplex words vs. NPs)
• (F9c) applying different weights by candidate type
• (F9d) normalized TF and different weighting by candidate type
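A sketch of F9a as we understand it: the Dice coefficient generalized to an n-word candidate, with the 10% discount for simplex words mentioned above. The exact formulation used by Park et al. (2004) may differ; this version is our illustration:

```python
def dice_cohesion(phrase_freq, word_freqs):
    """Dice coefficient sketch for an n-word candidate:
    n * f(phrase) / sum of component-word frequencies.
    Simplex-word cohesion is discounted by 10%, following the
    description of Park et al. (2004)."""
    n = len(word_freqs)
    if n == 1:
        return 0.9 * phrase_freq / word_freqs[0]  # 10% simplex-word discount
    return n * phrase_freq / sum(word_freqs)
```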
5.4 Other Features
F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set; hence, we used it only in the N&K configuration to attest our candidate selection method.
F11: POS Sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences over keyphrases and used them as a feature. Like the acronym feature, this is also subject to the data set.
F12: Suffix Sequence (S). Similar to acronyms, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latinate derivational morphology for technical keyphrases. This feature is also data-dependent, and is thus used in the supervised approach only.
F13: Length of Keyphrases (S, U). Barker and Corrnacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.
6 System and Data
To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections, such as title and section heads, via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4, and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers, using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.
For evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found many were missing from the documents or appeared only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the
2http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3http://wing.comp.nus.edu.sg/parsCit/
4http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6http://maxent.sourceforge.net/index.html
7C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence: Multiagent Systems), and J.4 (Social and Behavioral Sciences: Economics)
number of author-assigned and reader-assigned keyphrases found in the documents.
          Author       Reader         Combined
Total     1252 (53)    3110 (111)     3816 (146)
NPs       904          2537           3027
Average   3.85 (4.01)  12.44 (12.88)  15.26 (15.85)
Found     769          2509           2864

Table 2: Statistics of Keyphrases
7 Evaluation
The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information, and additional section information (i.e., F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.
Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10, and 15 candidates.
BestFeatures includes F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in a section), F7b (title overlap), F9a (term cohesion as in Park et al. (2004)), and F13 (length of keyphrases). Best-TFIDF means using all the best features except TFIDF.
In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).8
In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and the remaining features showed no effect, or an effect unconfirmed due to only small changes in performance. Again, Supervised denotes Maximum Entropy training and Unsupervised is our unsupervised approach.
8 Discussion
We compared the performance of our candidate selection and feature engineering against simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer
8Due to the page limits, we present only the best performance.
Candidates     System    C   Top 5 (Match/P/R/F)            Top 10 (Match/P/R/F)           Top 15 (Match/P/R/F)
All            KEA       U   0.03 /  0.64 /  0.21 /  0.32   0.09 /  0.92 /  0.60 /  0.73   0.13 /  0.88 /  0.86 /  0.87
All            KEA       S   0.79 / 15.84 /  5.19 /  7.82   1.39 / 13.88 /  9.09 / 10.99   1.84 / 12.24 / 12.03 / 12.13
All            N&K       S   1.32 / 26.48 /  8.67 / 13.06   2.04 / 20.36 / 13.34 / 16.12   2.54 / 16.93 / 16.64 / 16.78
All            baseline  U   0.92 / 18.32 /  6.00 /  9.04   1.57 / 15.68 / 10.27 / 12.41   2.20 / 14.64 / 14.39 / 14.51
All            baseline  S   1.15 / 23.04 /  7.55 / 11.37   1.90 / 18.96 / 12.42 / 15.01   2.44 / 16.24 / 15.96 / 16.10
Length<=3      KEA       U   0.03 /  0.64 /  0.21 /  0.32   0.09 /  0.92 /  0.60 /  0.73   0.13 /  0.88 /  0.86 /  0.87
Length<=3      KEA       S   0.81 / 16.16 /  5.29 /  7.97   1.40 / 14.00 /  9.17 / 11.08   1.84 / 12.24 / 12.03 / 12.13
Length<=3      N&K       S   1.40 / 27.92 /  9.15 / 13.78   2.10 / 21.04 / 13.78 / 16.65   2.62 / 17.49 / 17.19 / 17.34
Length<=3      baseline  U   0.92 / 18.40 /  6.03 /  9.08   1.58 / 15.76 / 10.32 / 12.47   2.20 / 14.64 / 14.39 / 14.51
Length<=3      baseline  S   1.18 / 23.68 /  7.76 / 11.69   1.90 / 19.00 / 12.45 / 15.04   2.40 / 16.00 / 15.72 / 15.86
Len<=3 + Alt.  KEA       U   0.01 /  0.24 /  0.08 /  0.12   0.05 /  0.52 /  0.34 /  0.41   0.07 /  0.48 /  0.47 /  0.47
Len<=3 + Alt.  KEA       S   0.83 / 16.64 /  5.45 /  8.21   1.42 / 14.24 /  9.33 / 11.27   1.87 / 12.45 / 12.24 / 12.34
Len<=3 + Alt.  N&K       S   1.53 / 30.64 / 10.04 / 15.12   2.31 / 23.08 / 15.12 / 18.27   2.88 / 19.20 / 18.87 / 19.03
Len<=3 + Alt.  baseline  U   0.98 / 19.68 /  6.45 /  9.72   1.72 / 17.24 / 11.29 / 13.64   2.37 / 15.79 / 15.51 / 15.65
Len<=3 + Alt.  baseline  S   1.33 / 26.56 /  8.70 / 13.11   2.09 / 20.88 / 13.68 / 16.53   2.69 / 17.92 / 17.61 / 17.76

Table 3: Performance on Proposed Candidate Selection (Match is the matching count; Precision, Recall, and F-score in %)
Features        C   Top 5 (Match/P/R/F)          Top 10 (Match/P/R/F)         Top 15 (Match/P/R/F)
Best            U   1.14 / 22.8 /  7.47 / 11.3   1.92 / 19.2 / 12.6 / 15.2    2.61 / 17.4 / 17.1 / 17.3
Best            S   1.56 / 31.2 / 10.2  / 15.4   2.50 / 25.0 / 16.4 / 19.8    3.15 / 21.0 / 20.6 / 20.8
Best w/o TFIDF  U   1.14 / 22.8 /  7.4  / 11.3   1.92 / 19.2 / 12.6 / 15.2    2.61 / 17.4 / 17.1 / 17.3
Best w/o TFIDF  S   1.56 / 31.1 / 10.2  / 15.4   2.46 / 24.6 / 16.1 / 19.4    3.12 / 20.8 / 20.4 / 20.6

Table 4: Performance on Feature Engineering
Effect      Method        Features
+           Supervised    F1a, F2, F3, F4a, F4d, F9a
+           Unsupervised  F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-           Supervised    F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
-           Unsupervised  F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
no effect   Supervised    F1e, F10, F11, F12
no effect   Unsupervised  F1b

Table 5: Performance on Each Feature
length candidates tend to act as noise and thus decreased the overall performance. We also confirmed that candidate alternation offered flexibility over keyphrase forms, leading to higher candidate coverage as well as better performance.
To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not impact performance. We attribute this to the fact that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence are helpful, as they imply the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.
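As an illustration of the substring-based TF counting discussed above, the following is a minimal sketch; the function name and tokenization are our own, not taken from the paper:

```python
def substring_tf(candidate, doc_tokens):
    """Count how often a candidate phrase occurs in a document,
    including occurrences embedded inside longer phrases."""
    n = len(candidate)
    # Slide an n-token window over the document; every exact match
    # (even one inside a longer noun phrase) counts toward TF.
    return sum(doc_tokens[i:i + n] == candidate
               for i in range(len(doc_tokens) - n + 1))

doc = "grid computing is a form of distributed grid computing".split()
print(substring_tf(["grid", "computing"], doc))  # -> 2
```

The second occurrence here sits inside the longer phrase "distributed grid computing", yet still counts toward the shorter candidate's frequency.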
As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high. While the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.
9 Conclusion
We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous work while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.
10 Acknowledgement
This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.
References
Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10-17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of ACL, 1989, pp. 76-83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28-30.

Ernesto D'Avanzo and Bernardo Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 43-447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81-104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beata Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537-544.

Mario Jarmasz and Caroline Barriere. Using semantic similarity over a tera-byte corpus to compute the performance of keyphrase extraction. In Proceedings of CLiNE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349-357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157-169.

Olena Medelyan and Ian Witten. Thesaurus-based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296-297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207-223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317-326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä, and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434-439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254-256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17-22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Verb Noun Construction MWE Token Supervised Classification
Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu
Abstract
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.
1 Introduction
In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more often than by chance. A MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) applications that aim at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.
For our purposes, a MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as being a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs but have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb-noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.
To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning, and over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as a wrong decision could potentially have detrimental effects on the quality of the translation.
In this paper we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.
This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.
2 Multi-word Expressions
MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come, ad hoc, and non-fixed expressions such as break new ground, speak of the devil. The VNCs which we are focusing on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living and Verb Particle Constructions (VPC) such as write-up, call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.
3 Related Work
Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).
In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine whether the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one MWE idiomatic expression.
Hashimoto and Kawahara (2008) (HK08) is, to our knowledge, the first large-scale study that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with 1000 corresponding examples; hence they had a corpus of 102K annotated sentences for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised basic syntactic features such as POS and lemma information and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.
4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.1 YamCha is based on Support Vector Machines technology using degree-2 polynomial kernels.
We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (Beginning of a literal chunk), I-L (Inside of a literal chunk), B-I (Beginning of an Idiomatic chunk), I-I (Inside an Idiomatic chunk), and O (Outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
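The tag assignment just described can be sketched as follows; the helper name and the (start, end) span representation are ours, for illustration only:

```python
def iob_tags(tokens, span, cls):
    """Assign IOB tags for one VNC chunk.

    span: (start, end) token indices of the chunk, end exclusive;
    cls:  "I" for an idiomatic chunk or "L" for a literal one.
    """
    tags = ["O"] * len(tokens)       # everything outside the chunk
    start, end = span
    tags[start] = "B-" + cls         # chunk-initial token
    for i in range(start + 1, end):  # remaining chunk tokens
        tags[i] = "I-" + cls
    return tags

sent = "John kicked the bucket last Friday".split()
print(iob_tags(sent, (1, 4), "I"))
# -> ['O', 'B-I', 'I-I', 'I-I', 'O', 'O']
```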
We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing word inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma form (LEMMA) or citation form of the word, and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets, explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
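The column-per-feature representation above can be sketched with a hypothetical helper (YamCha-style input is one token per row, tab-separated, with the label last):

```python
def to_feature_rows(tokens, pos_tags, labels):
    """Render one token per row with its feature columns:
    WORD, POS, last-3-character NGRAM, IOB label."""
    return "\n".join(
        "\t".join([w, p, w[-3:], y])  # w[-3:] is the NGRAM feature
        for w, p, y in zip(tokens, pos_tags, labels))

rows = to_feature_rows(
    ["John", "kicked", "the", "bucket", "last", "Friday"],
    ["NN", "VBD", "Det", "NN", "ADV", "NN"],
    ["O", "B-I", "I-I", "I-I", "O", "O"])
print(rows)
```

Adding a further feature (e.g. LEMMA or NE) would simply add another column per row.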
With the NE feature, we followed the same representation as for the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN IdentiFinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other with a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class: Named Entity InText (NEI). For example, for the NEI condition, our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER".

1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder
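The NEI substitution can be sketched as below (the function name is ours; the NE tags come from the recognizer):

```python
def nei_rewrite(tokens, ne_tags):
    """Replace each named-entity token with its NE class label
    (e.g. "John" -> "PER"); non-entity tokens (tag "O") are kept."""
    return [tag if tag != "O" else tok
            for tok, tag in zip(tokens, ne_tags)]

print(nei_rewrite(["John", "kicked", "the", "bucket", "last", "Friday"],
                  ["PER", "O", "O", "O", "O", "DAY"]))
# -> ['PER', 'kicked', 'the', 'bucket', 'last', 'DAY']
```

Collapsing all person names to PER (and all dates to DAY, etc.) is what makes NEI act as a form of dimensionality reduction over the vocabulary.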
5 Experiments and Results
5.1 Data
We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC).3 The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4 corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.
5.2 Evaluation Metrics
We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
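For concreteness, the Fβ measure can be computed as below; as a sanity check, the idiomatic-class precision and recall of the best condition in Table 2 (91.35 and 88.42, reading the extracted digits with two decimal places) reproduce its reported F of 89.86:

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta: weighted harmonic mean of precision and recall;
    beta=1 gives the standard F1 used in this paper."""
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Idiomatic-class scores of the NEI+L+N+P condition (Table 2):
print(round(f_measure(91.35, 88.42), 2))  # -> 89.86
```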
5.3 Results
We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE
3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result: it is not a measure of error. All words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.
tokens that are not in the original data set.6
In Table 2 we present the results yielded per feature and per condition. We first experimented with different context sizes to decide on the optimal window size for our learning framework; the results are presented in Table 1. Once that was determined, we proceeded to add features.
Noting that a window size of ±3 yields the best results, we use it as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.
All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK) with no added features and a context size of ±3 shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04.
Adding lemma, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.
Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P to NES-Bin+L+N+P.
Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.
6 Discussion
The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features
6 We could have easily identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set exclusively.
yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has many more idiomatic token examples for training. We also note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.
We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors result from identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones 23 times, and the opposite 4 times.
Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal occur for the MWE types hit the road, blow trumpet, blow whistle, hit a wall.
The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.
The system also identified hit the post as a literal MWE VNC token in As the ball hit the post, the referee blew the whistle, where blew the whistle is a literal VNC in this context, and it identified hit the post as another literal VNC.
7 Conclusion
In this study we explore a set of features that contribute to binary supervised classification of VNC token expressions. The use of NER significantly improves the performance of the system. Using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.
Window  IDM-F  LIT-F  Overall F  Overall Acc
±1      77.93  48.57  71.78      96.22
±2      85.38  55.61  79.71      97.06
±3      86.99  55.68  81.25      96.93
±4      86.22  55.81  80.75      97.06
±5      83.38  50.00  77.63      96.61

Table 1: Results (in %) for varying context window sizes
Condition        IDM-P  IDM-R  IDM-F  LIT-P  LIT-R  LIT-F  Overall F
FREQ             70.02  89.16  78.44  0      0      0      69.68
TOK              81.78  83.33  82.55  71.79  43.75  54.37  77.04
(L)EMMA          83.1   84.29  83.69  69.77  46.88  56.07  78.11
(N)GRAM          83.17  82.38  82.78  70     43.75  53.85  77.01
(P)OS            83.33  83.33  83.33  77.78  43.75  56.00  78.08
L+N+P            86.95  83.33  85.38  72.22  45.61  55.91  79.71
NES-Full         85.2   87.93  86.55  79.07  58.62  67.33  82.77
NES-Bin          84.97  82.41  83.67  73.49  52.59  61.31  79.15
NEI              89.92  85.18  87.48  81.33  52.59  63.87  82.82
NES-Full+L+N+P   89.89  84.92  87.34  76.32  50     60.42  81.99
NES-Bin+L+N+P    90.86  84.92  87.79  76.32  50     60.42  82.33
NEI+L+N+P        91.35  88.42  89.86  81.69  50     62.03  84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations
8 Acknowledgement
The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, who helped in making this publication more concise and better presented.
References
Timothy Baldwin, Collin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89-96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329-336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41-48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96-103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992-1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353-360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317-324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369-379, Prague, Czech Republic, June. Association for Computational Linguistics.

Dan I. Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97-108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33-40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1-15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100-108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754-762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, Czech Republic, June. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23-30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de
Abstract
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.
1 Introduction
Parallel corpora have proved to be a valuable resource not only for statistical machine translation, but also for crosslingual induction of morphological, syntactic and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:
(1) Der  Rat      sollte  unsere  Position  berücksichtigen
    The  Council  should  our     position  consider

(2) The Council should take account of our position.
This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE take account of and a German simplex verb berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems to automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences and are able to identify various MWE patterns.
In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.
The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for this item in a target language (and whether these
one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem. As the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches that need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.
In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondences that can be found in parallel corpora. Our experiments described in Section 3 show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, but is based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).
Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences for a given parallel corpus. Therefore, Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.
2 Multiword Translations as MWEs
The idea to exploit one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover linguistic patterns exhibited by one-to-many translations.

Verb                  1-1   1-n   n-1   n-n   No
anheben (v1)          535   212   92    16    325
bezwecken (v2)        167   513   06    313   150
riskieren (v3)        467   357   05    17    182
verschlimmern (v4)    302   215   286   445   275

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas and their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by (Graça et al., 2008).
For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.
A closer look at the one-to-many translations reveals that these cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as an MWE. Below, we will shortly illustrate the different classes of one-to-many translations we found in our gold standard.
Morphological variations. This type of one-to-many translation is mainly due to non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.
(3) Sie   verschlimmern  die  Übel
    They  aggravate      the  misfortunes

(4) Their action is aggravating the misfortunes.
Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.
(5) Der  Ausschuss  bezweckt  den  Institutionen  ein  politisches  Instrument  an  die  Hand  zu  geben
    The  committee  intends   the  institutions   a    political    instrument  at  the  hand  to  give

(6) The committee set out to equip the institutions with a political instrument.
Verb preposition combinations. While this class isn't discussed very often in the MWE literature, it can nevertheless be considered as an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.
(7) Sie   werden  den  Treibhauseffekt     verschlimmern
    They  will    the  green house effect  aggravate

(8) They will add to the green house effect.
Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).
(9) Ich  werde  die  Sache  nur   noch  verschlimmern
    I    will   the  thing  only  just  aggravate

(10) I am just making things more difficult.
Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to its limited size and the source items we chose.
(11) Sie   bezwecken  die  Umgestaltung  in    eine  zivile  Nation
     They  intend     the  conversion    into  a     civil   nation

(12) They have in mind the conversion into a civil nation.
          v1       v2       v3       v4
Ntype     22 (26)  41 (47)  26 (35)  17 (24)
V Part    22.7     4.9      0.0      0.0
V Prep    36.4     41.5     3.9      5.9
LVC       18.2     29.3     88.5     88.2
Idiom     0.0      2.4      0.0      0.0
Para      36.4     24.3     11.5     23.5

Table 2: Proportions of MWE types per lemma.
Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.
(13) Wir  brauchen  bessere  Zusammenarbeit  um  die  Rückzahlungen    anzuheben
     We   need      better   cooperation     to  the  repayments(OBJ)  increase

(14) We need greater cooperation in this respect to ensure that repayments increase.
Table 2 displays the proportions of the MWE categories for the number of types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variations only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) or bezwecken (en. intend) are frequently
translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).
In general, one-to-many translational correspondences seem to provide a very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.
3 Multiword Translation Detection
This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.
3.1 One-to-many Alignments
To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, which uses the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the rate of translational correspondences on the type level that the system discovers against the one-to-many translations in our gold standard. By type, we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the
verb   n > 0         n > 1         n > 3
       FPR   FNR     FPR   FNR     FPR   FNR
v1     0.97  0.93    1.0   1.0     1.0   1.0
v2     0.93  0.9     0.5   0.96    0.0   0.98
v3     0.88  0.83    0.8   0.97    0.67  0.97
v4     0.98  0.92    0.8   0.92    0.34  0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.
translation types is so low that already at a threshold of 3, almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that correspond to the source token is rare.
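The type-level evaluation described above can be sketched as follows. This is an illustrative reading of the setup, not the authors' evaluation script; the function name `type_level_rates` is ours. A type is modeled as a frozen set of lemmatized English tokens that together form one translation of the German source lemma.

```python
def type_level_rates(predicted_types, gold_types):
    """Compute (FPR, FNR) over translation types.

    A type is a frozenset of lemmatized English tokens aligned to
    the German source lemma."""
    predicted, gold = set(predicted_types), set(gold_types)
    false_positives = predicted - gold   # extracted, but not in the gold standard
    false_negatives = gold - predicted   # in the gold standard, but missed
    fpr = len(false_positives) / len(predicted) if predicted else 0.0
    fnr = len(false_negatives) / len(gold) if gold else 0.0
    return fpr, fnr
```

For example, if the system proposes {take, account, of} and {consider} for berücksichtigen, but the gold standard only lists {take, account, of}, the FPR is 0.5 and the FNR is 0.0.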
This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.
(15) Die  Korruption  behindert  die  Entwicklung
     The  corruption  impedes    the  development

(16) Corruption constitutes an obstacle to development.
Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.
3.2 Aligning Syntactic Configurations
High-precision detection of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found; 2) how to align target items that have a low occurrence correlation.
We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel. The verb's arguments directly correspond to each other and are all dominated by a verbal root node.
Based on these intuitions, we propose a Generate-and-Filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.
Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering the configurations; c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.
[Figure: a German dependency tree rooted in behindert, with arguments X1 and Y1, aligned to an English tree rooted in create, with argument X2 and the path via an obstacle to leading to Y2.]

Figure 1: Example of a typical syntactic MWE configuration.
Data. We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (DG, DE, AGE). DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. DE is the set of dependency triples of the target sentence. AGE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
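The triple (DG, DE, AGE) can be sketched in code as follows. This is an illustrative data model, not the authors' implementation; real pipelines would use token indices rather than word forms, and the type names are ours. The example encodes sentence pair (15)/(16):

```python
from typing import NamedTuple, Set, Tuple

class Dep(NamedTuple):
    head: str   # governing word (s1 / t1)
    rel: str    # dependency relation type
    dep: str    # dependent word (s2 / t2)

class SentencePair(NamedTuple):
    d_g: Set[Dep]               # German dependency triples (s1, rel, s2)
    d_e: Set[Dep]               # English dependency triples
    a_ge: Set[Tuple[str, str]]  # word alignment links (s1, t1)

# "Die Korruption behindert die Entwicklung" /
# "Corruption constitutes an obstacle to development"
pair = SentencePair(
    d_g={Dep("behindert", "SB", "Korruption"),
         Dep("behindert", "OA", "Entwicklung")},
    d_e={Dep("constitutes", "SBJ", "Corruption"),
         Dep("constitutes", "OBJ", "obstacle"),
         Dep("obstacle", "NMOD", "to"),
         Dep("to", "PMOD", "development")},
    a_ge={("Korruption", "Corruption"),
          ("Entwicklung", "development")},
)
```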
Candidate Generation. This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.
Filtering. Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn ∈ DG, there is a tuple (sn, tn) ∈ AGE, i.e. every dependent has an alignment.
3. There is a target item t1 ∈ DE such that for each tn there is a p ⊂ DE such that p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus, the target dependents have a common root.
To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration must not contain a coordination relation. This Generate-and-Filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
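The core filtering conditions can be sketched as follows. This is our reading of conditions 1-3 plus the path-length filter, not the authors' code; the head-switching and coordination filters are omitted, and the function name is illustrative. Dependencies are (head, rel, dependent) triples and alignments are (source, target) pairs, as above.

```python
def find_target_root(source_deps, target_deps, alignments, max_path_len=3):
    """Return a target item t1 from which every aligned target dependent
    is reachable within max_path_len steps, or None if the candidate
    configuration is filtered out."""
    # Condition 2: every source dependent must have an alignment.
    src_dependents = [s2 for (_, _, s2) in source_deps]
    aligned = {s: t for (s, t) in alignments}
    if any(s not in aligned for s in src_dependents):
        return None
    targets = [aligned[s] for s in src_dependents]

    # Map each target word to its head (a tree: one head per word).
    parent = {d: h for (h, _, d) in target_deps}

    def ancestors(t):
        seen, steps = [], 0
        while t in parent and steps < max_path_len:
            t = parent[t]
            seen.append(t)
            steps += 1
        return seen

    # Condition 3: the aligned targets must share a common root t1.
    candidate_sets = [set(ancestors(t)) | {t} for t in targets]
    common = set.intersection(*candidate_sets) if candidate_sets else set()
    return next(iter(common)) if common else None
```

On the configuration of example (15)/(16), the two aligned arguments Corruption and development both lead up to constitutes, so constitutes is returned as the target root; if one dependent lacks an alignment, the configuration is discarded.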
Alignment Post-editing. In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in 3.2 was that all
the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.
Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) ∈ DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) ∈ AGE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti ∈ T is an element of our target configuration:
1. Compute corr(s1, T), the correlation between s1 and T.

2. For each ti, tx such that there is a (ti, rel, tx) ∈ DE, compute corr(s1, T + tx).

3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.
As the Dice coefficient is often reported to give the best results, e.g. in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:
corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))
We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti ∈ T, and freq(s1) accordingly. The observed frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
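The Dice computation and the greedy post-editing loop can be sketched as follows. This is a minimal reading of the algorithm above, not the authors' code; `corpus` is assumed to be a list of (source_lemmas, target_lemmas) sentence pairs, and the function names are ours.

```python
def dice(s1, T, corpus):
    """corr(s1, T) = 2 * freq(s1 and T) / (freq(s1) + freq(T))."""
    f_s = sum(1 for src, _ in corpus if s1 in src)
    f_t = sum(1 for _, tgt in corpus if all(t in tgt for t in T))
    f_joint = sum(1 for src, tgt in corpus
                  if s1 in src and all(t in tgt for t in T))
    return 2.0 * f_joint / (f_s + f_t) if (f_s + f_t) else 0.0

def post_edit(s1, T, target_deps, corpus):
    """Greedily add dependents tx of items in T as long as doing so
    does not lower the Dice correlation with the source item s1."""
    T = set(T)
    changed = True
    while changed:
        changed = False
        # Candidate items: dependents of the current target configuration.
        candidates = {d for (h, _, d) in target_deps if h in T} - T
        for tx in candidates:
            if dice(s1, T | {tx}, corpus) >= dice(s1, T, corpus):
                T.add(tx)
                changed = True
    return T
```

For the configuration in Figure 1, the article an is a dependent of obstacle; since adding it does not lower the Dice score of the pair (behindern, {create/constitute, obstacle, to}), it is attached to the translation.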
The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation for the English MWE run the risk of.
3.3 Evaluation
Our translational approach to MWE extraction bears the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.
We present two experiments in this evaluation section. We will first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.
Translation evaluation. In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma don't figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that doesn't constrain the syntactic type of the parallel argument relations. While the FNR hardly changes, the FPR decreases significantly with the stricter filter. This means that the syntactic filters mainly fire on noisy configurations and don't decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down
the number of candidate configurations that our algorithm detects. As the dependency parses don't provide deep analyses for tense or control phenomena, very often a verb's arguments don't figure as its syntactic dependents, and no configuration is found. Future work will explore the impact of deep syntactic analysis for the detection of translational correspondences.
MWE evaluation. In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. Therefore, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.
The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the portion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.
4 Related Work
The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was
       Strict Filter    Relaxed Filter
       FPR    FNR       FPR    FNR
v1     0.0    0.96      0.5    0.96
v2     0.25   0.88      0.47   0.79
v3     0.25   0.74      0.56   0.63
v4     0.0    0.875     0.56   0.84

Table 4: False positive and false negative rate of one-to-many translations.
Trans. type       Proportion
MWEs              57.5
  V Part          8.2
  V Prep          51.8
  LVC             32.4
  Idiom           10.6
Paraphrases       24.4
Alternations      1.0
Noise             17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.
established by (Wu, 1997). Our way to use syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.
Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches don't rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.
By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Padó, 2009).
The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already
been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of simplex realization in a source language to an idiosyncratic multiword realization in the target language.
5 Conclusion
We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and a fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for the monolingual interpretation of MWEs.
Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to other items than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.
References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 311–326.

Katrin Erk and Sebastian Padó. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical report, Tech. Rep. 38 2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
A Re-examination of Lexical Association Measures
Abstract
We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which shows significant improvement over the original AMs.
1 Introduction
Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures for extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact test, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.
While these previous works have evaluated AMs, they offer few details on why the AMs perform as they do. A detailed analysis of this question is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.
We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the
Hung Huu Hoang
Dept. of Computer Science
National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
mobility of noun phrase complements in transitive VPCs. They can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).

In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.
2 Classification of Association Measures
Although different AMs have different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure this association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ² test.
Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.
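The context-diversity idea behind the first kind of Class II AM can be illustrated with a small sketch (toy data and the `context_entropy` helper are our own illustrative assumptions, not an implementation from the paper): the entropy of the distribution of words surrounding a phrase is lower when the phrase occurs in a restricted set of contexts.

```python
import math
from collections import Counter

def context_entropy(context_words):
    """Entropy (in bits) of the distribution of context words observed
    around a phrase's occurrences. Lower entropy = more restricted
    contexts, which this family of AMs takes as evidence of
    non-compositionality."""
    counts = Counter(context_words)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical contexts: a non-compositional phrase recurs in few contexts,
# a free combination appears in many different ones.
restricted = ["contract", "contract", "order", "contract"]  # e.g. "carry out"
free = ["box", "idea", "dog", "tune"]                       # free combination
assert context_entropy(restricted) < context_entropy(free)
```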
Table 1 lists all AMs used in our discussion
The lower left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
3 Evaluation
We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.
3.1 Evaluation Data
In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.
For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid the possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,
M1  Joint Probability: f(xy) / N
M2  Mutual Information (MI): (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3  Log-likelihood ratio: 2 Σ_ij f_ij log(f_ij / f̂_ij)
M4  Pointwise MI (PMI): log [P(xy) / (P(x*) P(*y))]
M5  Local-PMI: f(xy) × PMI
M6  PMI^k: log [N f(xy)^k / (f(x*) f(*y))]
M7  PMI²: log [P(xy)² / (P(x*) P(*y))]
M8  Mutual Dependency: log [P(xy)² / (P(x*) P(*y))]
M9  Driver-Kroeber: a / √((a+b)(a+c))
M10 Normalized expectation: 2a / (2a + b + c)
M11 Jaccard: a / (a + b + c)
M12 First Kulczynski: a / (b + c)
M13 Second Sokal-Sneath: a / (a + 2(b + c))
M14 Third Sokal-Sneath: (a + d) / (b + c)
M15 Sokal-Michiner: (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
M17 Hamann: ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds ratio: ad / bc
M19 Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q: (ad − bc) / (ad + bc)
M21 Brawn-Blanquet: a / max(a + b, a + c)
M22 Simpson: a / min(a + b, a + c)
M23 S cost: log(1 + min(b, c) / (a + 1))^(−1/2)
M24 Adjusted S cost*: log(1 + max(b, c) / (a + 1))^(−1/2)
M25 Laplace: (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace*: (a + 1) / (a + max(b, c) + 2)
M27 Fager: [M9] − (1/2) max(b, c)
M28 Adjusted Fager*: [M9] − max(b, c) / √(aN)
M29 Normalized PMIs*: PMI / NF(α), PMI / NFmax
M30 Simplified normalized PMI for VPCs*: log(ad) / (α×b + (1 − α)×c)
M31 Normalized MIs*: MI / NF(α), MI / NFmax

where NF(α) = α P(x*) + (1 − α) P(*y) with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)).

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work. Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(x¬y), c = f21 = f(¬xy), d = f22 = f(¬x¬y); f(x*) = a + b and f(*y) = a + c, where ¬w stands for all words except w and * stands for all words; N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.
permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, again allowing a maximum of 4 intervening words. The above extraction process produced a total of 8,652 VPC and 11,465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1,629 VPCs and 1,254 LVCs.
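The windowed candidate search described above can be sketched roughly as follows (the simplified tag set and the `extract_vpc_candidates` helper are our own illustrative assumptions, not the authors' code):

```python
def extract_vpc_candidates(tokens, tags, particles, max_np=5):
    """Sketch of the VPC candidate search: locate a particle, then scan
    leftward for the nearest verb, allowing an intervening NP of up to
    max_np words between verb and particle. Tags are simplified to
    'V' (verb), 'P' (particle) and 'O' (other); illustrative only."""
    candidates = []
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag == 'P' and tok in particles:
            # nearest verb, at most max_np intervening tokens
            for j in range(i - 1, max(i - max_np - 2, -1), -1):
                if tags[j] == 'V':
                    candidates.append((tokens[j], tok))
                    break
    return candidates

toks = ['take', 'your', 'hat', 'off']
tags = ['V',    'O',    'O',   'P']
assert extract_vpc_candidates(toks, tags, {'off'}) == [('take', 'off')]
```

The LVC search mirrors this, scanning rightward from a light verb for the nearest noun within a 4-word window.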
Separately, we use the following two available sources of annotations: 3,078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al. (2006). Both sets of annotations give both positive and negative examples.
Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data      LVC data   Mixed
Total (freq ≥ 6)     413           100        513
Positive instances   117 (28.33%)  28 (28%)   145 (28.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.
3.2 Evaluation Metric
To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall over only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of the precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
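Average precision over a ranked candidate list can be computed as the mean of the precision values taken at each rank where a true positive occurs; a minimal sketch:

```python
def average_precision(ranked_labels):
    """AP of a ranked list: ranked_labels[i] is 1 if the candidate at
    rank i+1 is a true MWE, else 0. Returns the mean of the precisions
    at the ranks of the positives."""
    hits, precisions = 0, []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2 = 5/6
assert abs(average_precision([1, 0, 1, 0]) - 5 / 6) < 1e-9
```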
3.3 Evaluation Results
Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that is relied on in hypothesis tests.
4 Rank Equivalence
We note that some AMs, although not mathematically equivalent (i.e., not assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r_C M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:

Corollary: If M1 ≡r_C M2 then APC(M1) = APC(M2), where APC(Mi) stands for the average precision of the AM Mi over the data set C.

Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the following three AMs, Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with their (most simple) representatives from each rank equivalent group.

¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types. Hence, we still consider such AM pairs to be rank equivalent.

1) [M2] Mutual Information; [M3] Log-likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (aka Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (aka Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
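Rank equivalence is easy to check computationally. For example, Yule's Q equals (OR − 1)/(OR + 1), a monotone increasing transform of the odds ratio OR, so the two measures order any candidate set identically even though their scores differ; a small check on toy contingency tables (our own example data):

```python
def ranking(score, candidates):
    """Order candidates (here: (a, b, c, d) contingency tuples) by score."""
    return sorted(candidates, key=score, reverse=True)

odds_ratio = lambda t: (t[0] * t[3]) / (t[1] * t[2])
yules_q = lambda t: (t[0] * t[3] - t[1] * t[2]) / (t[0] * t[3] + t[1] * t[2])

tables = [(10, 2, 3, 50), (4, 4, 4, 4), (30, 1, 2, 70), (5, 10, 8, 20)]
# Same ranking (rank equivalence), despite different score values.
assert ranking(odds_ratio, tables) == ranking(yules_q, tables)
assert any(odds_ratio(t) != yules_q(t) for t in tables)
```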
5 Examination of Association Measures
We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.
5.1 Mutual Information and Pointwise Mutual Information
In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs while MI ranks 35th in identifying LVCs. In this section, we show how their performances can be improved significantly.

[Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Figure 1a shows the AP profile of the AMs examined over our VPC data set; Figure 1b shows the AP profile over our LVC data set. Bold points indicate AMs discussed in this paper. Legend: hypothesis test AMs; ◊ Class I AMs excluding hypothesis test AMs; + context-based AMs.]
Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U; V) = Σ_{u,v} p(uv) log [p(uv) / (p(u*) p(*v))]

In the context of bigrams, the above formula can be simplified to:

[M2] MI = (1/N) Σ_ij f_ij log (f_ij / f̂_ij)

While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log [P(xy) / (P(x*) P(*y))] = log [N f(xy) / (f(x*) f(*y))]

It has long been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of the two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, either by raising it to some power k (Daille, 1994) or by multiplying PMI with f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying f(xy) directly with PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs           LVCs           Mixed
Best [M6] PMI^k    .547 (k=.13)   .573 (k=.85)   .544 (k=.32)
[M4] PMI           .510           .566           .515
[M5] Local-PMI     .259           .393           .272
[M1] Joint Prob.   .170           .28            .175

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.
Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X; Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α P(x*) + (1 − α) P(*y), with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks, respectively. If one rewrites MI as (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.

AM            VPCs          LVCs          Mixed
MI / NF(α)    .508 (α=.48)  .583 (α=.47)  .516 (α=.5)
MI / NFmax    .508          .584          .518
[M2] MI       .273          .435          .289
PMI / NF(α)   .592 (α=.8)   .554 (α=.48)  .588 (α=.77)
PMI / NFmax   .565          .517          .556
[M4] PMI      .510          .566          .515

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
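A minimal sketch of the NFmax normalization on contingency counts (toy counts and function names are our own illustrative assumptions):

```python
import math

def pmi(a, b, c, d):
    """Pointwise MI from contingency counts: a = f(xy),
    a + b = f(x*), a + c = f(*y), N = a + b + c + d."""
    n = a + b + c + d
    return math.log(n * a / ((a + b) * (a + c)))

def normalized_pmi_max(a, b, c, d):
    """PMI divided by NFmax = max(P(x*), P(*y)), so that candidates with
    very frequent constituents (light verbs, particles) are rescaled
    onto a comparable range."""
    n = a + b + c + d
    return pmi(a, b, c, d) / (max(a + b, a + c) / n)

# Toy candidate whose second constituent (e.g. a particle) is very frequent.
raw = pmi(20, 5, 975, 9000)
norm = normalized_pmi_max(20, 5, 975, 9000)
assert norm > raw > 0
```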
5.2 Penalization Terms
It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulas whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.
Take as an example the AMs [M21] Brawn-Blanquet (aka Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c); Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S cost and [M25] Laplace, where we tried replacing the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.

AM                      VPCs   LVCs   Mixed
[M21] Brawn-Blanquet    .478   .578   .486
[M22] Simpson           .249   .382   .260
[M24] Adjusted S cost   .485   .577   .492
[M23] S cost            .249   .388   .260
[M26] Adjusted Laplace  .486   .577   .493
[M25] Laplace           .241   .388   .254

Table 6: Replacing min() with max() in selected AMs.
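The effect of the min-to-max substitution can be seen on a VPC-like toy example where the particle's marginal frequency dwarfs the verb's (counts are our own illustration):

```python
def simpson(a, b, c):
    """[M22]: denominator uses min(b, c), penalizing the rarer constituent."""
    return a / (a + min(b, c))

def brawn_blanquet(a, b, c):
    """[M21]: denominator uses max(b, c), penalizing the more frequent one."""
    return a / (a + max(b, c))

# The particle is far more frequent than the verb (c >> b): Simpson all but
# ignores the frequent particle, while Brawn-Blanquet penalizes it, matching
# the min -> max improvements reported in Table 6.
a, b, c = 50, 5, 2000
assert brawn_blanquet(a, b, c) < simpson(a, b, c)
```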
In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is rank equivalent to [M7] PMI². In our application, this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ with a scaled-down version of max(b, c). Using the independence assumption, we have approximately derived 1/√(aN) as a lower-bound-based scaling factor, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                    VPCs   LVCs   Mixed
[M28] Adjusted Fager  .564   .543   .554
[M27] Fager           .552   .439   .525

Table 7: Performance of Fager and its adjusted version.
The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary that we scale b and c properly, as in the weighted variant −α×b − (1 − α)×c. Having scaled the constituents properly, we still see that this variant by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14]: PMI / (α×b + (1 − α)×c). The denominator of [MR14] is obtained by removing the minus signs from the weighted variant of [M14] so that it can be used as a penalization term. The choice of PMI for the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α P(x*) + (1 − α) P(*y), which was successfully used to divide PMI in the normalized PMI experiment. We heuristically simplified [MR14] to the following AM: [M30] log(ad) / (α×b + (1 − α)×c). The settings of α in Table 8 below are taken from the best α settings obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.

AM                                 VPCs          LVCs          Mixed
PMI / NF(α)                        .592 (α=.8)   .554 (α=.48)  .588 (α=.77)
[M30] log(ad) / (α×b + (1−α)×c)    .600 (α=.8)   .484 (α=.48)  .588 (α=.77)
[M14] Third Sokal-Sneath           .565          .453          .546

Table 8: AP performance of suggested VPC penalization terms and AMs.
With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α × c^(1−α). The best α value is also taken from the normalized PMI experiments (Table 5), which is nearly .5. Under this setting, this penalization term is exactly the denominator of the [M18] Odds ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.

AM                 VPCs   LVCs   Mixed
−b − c             .565   .453   .546
1/(bc)             .502   .532   .502
[M18] Odds ratio   .443   .567   .456

Table 9: AP performance of suggested LVC penalization terms and AMs.
6 Conclusions
We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.

We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of .588 compared to .515 when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of .583.

We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1 − α)×c is a good penalization term for VPCs, while b^α × c^(1−α) is suitable for LVCs. Our introduced tuning parameter α should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.
Acknowledgement: This work was partially supported by a National Research Foundation grant, "Interactive Media Search" (grant R-252-000-325-279).
References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco, and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, M. Yen, and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a Multilingual Context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.
Appendix A: List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in
Abstract
A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multi-word expression with a meaning distinct from the meaning of its light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.
1 Introduction
Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multi-word expression (MWE) in which a noun, a verb, or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.
CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007), and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.
In this paper, we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of CPs in Hindi is outlined; this is followed by the system design, experimentation, and results.
2 Complex Predicates in Hindi
A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective, or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV, or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.
Given below are some examples.

(1) CP = noun+LV; noun = ashirwad (blessings), LV = denaa (to give)
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings), makes the complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2) the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush (happy), LV = karanaa (to do)
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa (to read), LV = lenaa (to take)
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases, the light verb acts as an aspectual, a modal, or an intensifier.

(5) CP = verb+LV; verb = phaadanaa (to tear), LV = denaa (to give)
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa (to give), LV = maaranaa (to hit, to kill)
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning. This may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV; CP = vaapas karanaa (to return), LV = denaa (to give)
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs: vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.
From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text, the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.
3 System Design
As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows:

1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence pair, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or its morphological derivative is found, then search for the equivalent English meanings of any of the morphological forms (Figure 2) in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit, i.e., go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e., go to step (a); else the identified CP is W+LV.
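The procedure above can be sketched in Python roughly as follows. This is a minimal illustration of the search loop, not the author's implementation; the miniature light-verb table, stop-word set, and exit-word set are invented transliterated samples standing in for Table 1 and Figures 1-4:

```python
import re

# Invented miniature resources standing in for Table 1 and Figures 1-4.
LIGHT_VERBS = {
    "denaa": {"forms": {"denaa", "diyaa", "dii", "de", "dete", "detii"},
              "meanings": {"give", "gives", "gave", "given", "giving"}},
}
STOP_WORDS = {"nahiin", "bhii", "hii", "to"}   # negation, emphasizers, ... (cf. Figure 3)
EXIT_WORDS = {"ne", "ko", "kaa", "ke", "kii"}  # case markers, pronouns, ... (cf. Figure 4)

def mine_cps(hindi_tokens, english_sentence):
    """Hypothesize CPs in one aligned sentence pair (steps 5.a.i-vi)."""
    english_words = set(re.findall(r"[a-z]+", english_sentence.lower()))
    cps = []
    for lv, entry in LIGHT_VERBS.items():
        for k, word in enumerate(hindi_tokens):
            if word not in entry["forms"]:
                continue                      # step i: find the LV at position K
            if entry["meanings"] & english_words:
                break                         # steps ii-iii: literal meaning present -> no CP
            j = k - 1
            while j >= 0 and hindi_tokens[j] in STOP_WORDS:
                j -= 1                        # step iv: skip stop words
            if j >= 0 and hindi_tokens[j] not in EXIT_WORDS:
                cps.append((hindi_tokens[j], lv))   # step vi: CP = W + LV
    return cps
```

For example (1) above, mine_cps(["usane", "mujhe", "ashirwad", "diyaa"], "he blessed me") yields [("ashirwad", "denaa")], while for example (2) the literal meaning "gave" is found in the English sentence and no CP is hypothesized.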
Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings are shown within parentheses and are not used for matching.
light verb (base form)    root verb meaning
baithanaa बैठना           sit
bananaa बनना              make/become/build/construct/manufacture/prepare
banaanaa बनाना            make/build/construct/manufacture/prepare
denaa देना                give
lenaa लेना                take
paanaa पाना               obtain/get
uthanaa उठना              rise/arise/get up
uthaanaa उठाना            raise/lift/wake up
laganaa लगना              feel/appear/look/seem
lagaanaa लगाना            fix/install/apply
cukanaa चुकना             (finish)
cukaanaa चुकाना           pay
karanaa करना              (do)
honaa होना                happen/become/be
aanaa आना                 come
jaanaa जाना               go
khaanaa खाना              eat
rakhanaa रखना             keep/put
maaranaa मारना            kill/beat/hit
daalanaa डालना            put
haankanaa हाँकना           drive

Table 1: Some of the common light verbs in Hindi
For each Hindi light verb, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations of the same.
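The English side of this expansion step can be illustrated with a small rule plus an exception list for irregular verbs; the function and tables below are our sketch, not the paper's code:

```python
# Hypothetical illustration of generating English surface forms to search for
# in the aligned sentence (cf. Figure 2). Irregular verbs need an exception table.
IRREGULAR = {
    "sit": ["sit", "sits", "sat", "sitting"],
    "give": ["give", "gives", "gave", "given", "giving"],
}

def english_forms(verb):
    """Return surface forms of an English verb (naive regular morphology)."""
    if verb in IRREGULAR:
        return list(IRREGULAR[verb])
    if verb.endswith("e"):               # e.g. "raise" -> "raised", "raising"
        return [verb, verb + "s", verb + "d", verb[:-1] + "ing"]
    return [verb, verb + "s", verb + "ed", verb + "ing"]
```

A real system would need a fuller set of spelling rules (consonant doubling, "-y" to "-ies", etc.); the point is only that each Table 1 meaning expands into a small search set.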
There are a number of words that can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. These are words that denote negation or are emphasizers, intensifiers, interrogative pronouns, or particles. A list of the stop words used in the experimentation is given in Figure 3.
Figure 1(a): Morphological derivatives of the sample Hindi light verb 'jaanaa' जाना (to go)

Figure 1(b): Morphological derivatives of the sample Hindi light verb 'lenaa' लेना (to take)

Figure 2: Morphological derivatives of sample English meanings
We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions, and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list; however, this list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, and so the nature of the word preceding the LV is unknown to the system.
Figure 2 content: English word sit, morphological derivations: sit, sits, sat, sitting. English word give, morphological derivations: give, gives, gave, given, giving.
Figure 1(a) content: LV jaanaa जाना (to go); morphological derivatives: jaa (stem), jaae / jaao / jaaeM (imperative), jaauuM (first-person), jaane (infinitive oblique), jaanaa (infinitive masculine singular), jaanii (infinitive feminine singular), jaataa (indefinite masculine singular), jaatii (indefinite feminine singular), jaate (indefinite masculine plural/oblique), jaaniiM (infinitive feminine plural), jaatiiM (indefinite feminine plural), jaaoge (future masculine singular), jaaogii (future feminine singular), gaii (past feminine singular), jaauuMgaa (future masculine first-person singular), jaayegaa (future masculine third-person singular), jaauuMgii (future feminine first-person singular), jaayegii (future feminine third-person singular), gaye (past masculine plural/oblique), gaiiM (past feminine plural), gayaa (past masculine singular), gayii (past feminine singular), jaayeMge (future masculine plural), jaayeMgii (future feminine plural), jaakara (completion)
Figure 1(b) content: LV lenaa लेना (to take); morphological derivatives: le (stem), lii (past), leM / lo (imperative), letaa (indefinite masculine singular), letii (indefinite feminine singular), lete (indefinite masculine plural/oblique), liiM (past feminine plural), luuM (first-person), legaa (future masculine third-person singular), legii (future feminine third-person singular), lene (infinitive oblique), lenaa (infinitive masculine singular), lenii (infinitive feminine singular), liyaa (past masculine singular), leMge (future masculine plural), loge (future masculine singular), letiiM (indefinite feminine plural), luuMgaa (future masculine first-person singular), luuMgii (future feminine first-person singular), lekara (completion)
Figure 3: Stop words in Hindi used by the system

Figure 4: A few exit words in Hindi used by the system
The inner loop of the procedure identifies multiple CPs that may be present in a sentence; the outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.
4 Experimentation and Results
The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92%, the recall is between 89% and 100%, and the F-measure is 88 to 97. This is a remarkable and somewhat surprising result for such a simple methodology, without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; this is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                 File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                    112     193     102      43     133     107
Total no. of CPs (N)                200     298     150      46     188     151
Correctly identified CPs (TP)       195     296     149      46     175     135
  of which V-V CPs                   56      63       9       6      15      20
Incorrectly identified CPs (FP)      17      44       7      11      16      20
Unidentified CPs (FN)                 5       2       1       0      13      16
Accuracy (%)                      97.50   99.33   99.33  100.00   93.08   89.40
Precision (%) = TP/(TP+FP)        91.98   87.05   95.51   80.70   91.62   87.09
Recall (%) = TP/(TP+FN)           97.50   98.33   99.33  100.00   93.08   89.40
F-measure = 2PR/(P+R)              94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation
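The last three rows of Table 2 follow directly from the TP/FP/FN counts. A quick check in Python, using the File 1 column (agreement with the printed table is up to rounding):

```python
def metrics(tp, fp, fn):
    """Precision, recall and F-measure as defined in Table 2."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# File 1 column of Table 2: TP=195, FP=17, FN=5
p, r, f = metrics(195, 17, 5)
print(round(100 * p, 2), round(100 * r, 2), round(100 * f, 2))  # 91.98 97.5 94.66
```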
Figure 4 content (exit words): ने (ergative case marker), को (accusative case marker), का / के / की (possessive case markers), से (from/by/with), में (in/into), पर (on/but), और (and, Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that, Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा / मेरे / मेरी (my), तुम्हारा / तुम्हारे / तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना / अपनी / अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका / जिसकी / जिसके (whose), जिनको / जिनके (to whom)

Figure 3 content (stop words): नहीं (no/not), न (no/not, Hindi particle), भी (also, Hindi particle), ही (only, Hindi particle), तो (then, Hindi particle), क्यों (why), क्या (what, Hindi particle), कहाँ (where, Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरंभ में (in the beginning), अंत में (in the end), आखिर में (in the end)
Given below are some sample outputs.

(1) English sentence: "I also enjoy working with the children's parents, who often come to me for advice - it's good to know you can help."
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. खुशी होना (to feel happy, good)
v. मदद करना (to help)

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: "Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career."
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।
The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. कद्र करना (to respect)
Here also, the two CPs identified are correct. It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not checked. However, the mining success rate obtained shows that such cases are few in practice. The use of the 'stop words' in allowing intervening words within the CPs helps a lot in improving the performance; similarly, the use of the 'exit words' avoids a lot of incorrect identifications.
5 Conclusions
The simple empirical method for mining CPs outlined in this work yields an average of 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. It can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both the alignment accuracy and the speed.
The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family
References

Anthony McEnery, Paul Baker, Rob Gaizauskas, and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23-32.

Amitabh Mukerjee, Ankit Soni, and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11-18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27-46.

Debasri Chakrabarti, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. In Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen, and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King, and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323-370.

Mona Singh. 1994. Perfectivity, Definiteness, and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing - 2005, Jeju Island, Korea, 553-564.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 47-54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2

1Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn

2Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
huangyun@comp.nus.edu.sg
Abstract
Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs could improve translation performance significantly.
1 Introduction
The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, the translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.
A multiword expression (MWE) can be considered as a word sequence with a relatively fixed structure representing special meanings. There is no uniform definition of MWE, and many researchers give different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e., the property that the whole expression cannot be derived from its component words. Stanford University launched a MWE project1, in which different qualities of MWEs were presented. For a bilingual multiword expression (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language; and (2) the source phrase and the target phrase must be translated to each other exactly, i.e., there is no additional (boundary) word in the target phrase which cannot find the corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications, such as information retrieval, question answering, word sense disambiguation, and so on.
For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve the SMT quality. In their work, a bilingual MWE in the training corpus was grouped as
1 http://mwe.stanford.edu
one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.
In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving the SMT performance rather than improving the word alignment quality. In detail, the differences are listed as follows.
• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving the word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics only lead to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).
Besides the above differences, there are some advantages to our approaches.

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would further propagate to the SMT and hurt SMT performance.
• We conduct experiments on a domain-specific corpus. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e., a domain-specific corpus has high MWE coverage. For another, after investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are more flexible and concise. Take the Chinese term "软坚散结" for example. The meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or without this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.

2 http://www.statmt.org/moses
• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus, it benefits from the full exploitation of available resources, without greatly increasing the time and space complexities of the SMT system.
The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in the SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction to monolingual and bilingual MWE extraction.
2 Bilingual Multiword Expression Extraction
In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.
2.1 Automatic Extraction of MWEs
In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et al., 2005), (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007), and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.
The Log Likelihood Ratio (LLR) has been proved a good statistical measurement of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".
Definition 1 (Unit). A unit is any sub-string of the given sentence. For example, "A B", "C", "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List). A list is an ordered sequence of units which exactly cover the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score). The score function is only defined on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit.3 For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select). The selecting operator is to find the two adjacent units with maximum score in a list.

Definition 5 (Reduce). The reducing operator is to remove two specific adjacent units, concatenate them, and put the resulting unit back at the removed position. For example, if we reduce the unit "B C" and the unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".
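Definition 3 relies on the LLR of an adjacent word pair. A standard Dunning-style log-likelihood ratio for a bigram can be sketched as follows; the counting convention (bigram count c12, unigram counts c1 and c2, corpus size n) and the function names are our illustration, as the paper does not spell them out:

```python
from math import log

def log_l(k, n, x):
    """k*ln(x) + (n-k)*ln(1-x), with the 0*ln(0) = 0 convention."""
    def term(count, p):
        return count * log(p) if count > 0 else 0.0
    return term(k, x) + term(n - k, 1.0 - x)

def llr(c12, c1, c2, n):
    """Log-likelihood ratio of association between word1 and word2.

    c12: count of the bigram (word1, word2); c1, c2: unigram counts; n: corpus size.
    """
    p = c2 / n                    # null hypothesis: P(w2|w1) = P(w2|not w1)
    p1 = c12 / c1                 # observed P(w2|w1)
    p2 = (c2 - c12) / (n - c1)    # observed P(w2|not w1)
    return 2.0 * (log_l(c12, c1, p1) + log_l(c2 - c12, n - c1, p2)
                  - log_l(c12, c1, p) - log_l(c2 - c12, n - c1, p))
```

By construction the score is 0 when the bigram occurs exactly at its chance rate and grows with the strength of association.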
Initially, every word in the sentence is considered as one unit, and all these units form an initial list L. If the sentence is of length N, then the
3 We use a stoplist to eliminate the units containing function words by setting their score to 0.
list contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterative loop with two steps: (1) select the two adjacent units with maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L and insert the reduced result into the final set S. Our algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.
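The select/reduce loop above can be sketched as follows. This is a simplified illustration: `llr` is assumed to be a precomputed table of adjacent-word LLR scores, and the stoplist handling of footnote 3 is omitted.

```python
# A minimal sketch of the LLR-based Hierarchical Reducing Algorithm (HRA).
def hra(sentence, llr, threshold):
    """Extract MWE candidates of arbitrary length from a tokenized sentence.

    llr: dict mapping a (word, word) pair to its LLR score (illustrative).
    """
    units = [[w] for w in sentence]   # initially, every word is one unit
    mwes = []
    while len(units) > 1:
        # Select: the score of adjacent units is the LLR between the last
        # word of the first unit and the first word of the second unit.
        best_i, best_score = None, float("-inf")
        for i in range(len(units) - 1):
            s = llr.get((units[i][-1], units[i + 1][0]), 0.0)
            if s > best_score:
                best_i, best_score = i, s
        if best_score < threshold:    # termination condition (1)
            break
        # Reduce: concatenate the two units, record the result as an MWE.
        merged = units[best_i] + units[best_i + 1]
        units[best_i:best_i + 2] = [merged]
        mwes.append(tuple(merged))
    return mwes
```

Since each iteration removes one unit from the list, the loop runs at most N - 1 times, matching the efficiency claim below.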
Figure 1: Example of the Hierarchical Reducing Algorithm.
Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is "pound p Eacute w egrave".4 Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs ("p Eacute", "p Eacute w", "egrave", "pound p Eacute w") are extracted in that order, and the substrings above the dotted line in Figure 1 are not extracted.
From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of HRA is to reflect the hierarchical structural patterns of natural language. Furthermore, in the HRA, MWEs are scored with the minimum LLR of adjacent words within them, which gives lexical confidence to the extracted MWEs. Finally, supposing the given sentence has length N, HRA is guaranteed to terminate within N - 1 iterations, which is very efficient.
However, HRA has the problem that it extracts substrings before extracting the whole string, even if the substrings appear only in that particular whole string, which we consider useless. To solve this problem, we use contextual features:
4 The whole sentence means "healthy tea for preventing hyperlipidemia"; the meaning of each Chinese word is: pound (preventing), p (hyper-), Eacute (-lipid-), w (-emia), (for), egrave (healthy), (tea).
contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.
2.2 Automatic Extraction of MWE's Translation
In subsection 2.1 we described the algorithm to obtain MWEs; in this subsection we introduce the procedure for finding their translations in a parallel corpus.
To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:
1. Run GIZA++5 to align words in the training parallel corpus.
2. For a given MWE, find the bilingual sentence pairs whose source-language sentences include the MWE.
3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).
After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translation likelihood between the source phrase and the target phrase, and language features, which describe how likely a candidate is to be a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.
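The binary perceptron of Collins (2002) used for this filtering step can be sketched as follows. Feature extraction (the ten features listed above) is abstracted into fixed-length vectors; all names and data here are illustrative, not the paper's implementation.

```python
# A hedged sketch of a binary perceptron classifier for filtering candidate
# translations. Each example is (feature_vector, label), label in {+1, -1}.
def train_perceptron(examples, epochs=10):
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                  # misclassified: update weights
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    # +1: accept the candidate translation; -1: reject it
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```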
5 http://www.fjoch.com/GIZA++.html
We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. The experiments reveal that the classification precision of this model is 91.67%.
3 Application of Bilingual MWEs
Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still requires further research. In this section, we propose three methods for utilizing bilingual MWEs, and we compare their performance in Section 4.
3.1 Model Retraining with Bilingual MWEs
The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance has been shown by many researchers (Wu et al., 2008; Eck et al., 2004). However, our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
3.2 New Feature for Bilingual MWEs
Lopez and Resnik (2006) pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains a bilingual MWE. In other words, if the source-language phrase contains an MWE (as a substring) and
the target-language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect that this feature will help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
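The 0/1 feature described above can be sketched with a hypothetical helper (the actual feature is computed offline when the phrase table is built; names here are illustrative):

```python
# Sketch of the binary MWE feature: 1 if the source phrase contains a known
# MWE as a substring AND the target phrase contains that MWE's translation,
# else 0.
def mwe_feature(src_phrase, tgt_phrase, bilingual_mwes):
    """bilingual_mwes: dict mapping a source MWE string to its translation."""
    for mwe, translation in bilingual_mwes.items():
        if mwe in src_phrase and translation in tgt_phrase:
            return 1
    return 0
```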
3.3 Additional Phrase Table of Bilingual MWEs
Wu et al. (2008) proposed a method to construct a phrase table from a manually made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As for probability assignment, we simply assign 1 to each of the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in the input sentence during translation, the decoder searches all candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to denote this method.
4 Experiments
4.1 Data
We run experiments on two domain-specific patent corpora: one for the traditional medicine domain and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.
Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.
                        Chinese      English
Training  Sentences     120,355
          Words         4,688,873    4,737,843
Dev       Sentences     1,000
          Words         31,722       32,390
Test      Sentences     1,000
          Words         41,643       40,551

Table 1: Traditional medicine corpus.
6 http://www.speech.sri.com/projects/srilm/
For the chemical industry domain, Table 2 gives detailed information on the data. In this experiment, 59,466 bilingual MWEs are extracted.
                        Chinese      English
Training  Sentences     120,856
          Words         4,532,503    4,311,682
Dev       Sentences     1,099
          Words         42,122       40,521
Test      Sentences     1,099
          Words         41,069       39,210

Table 2: Chemical industry corpus.
We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform statistical significance tests between translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.
4.2 MT Systems
We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; and (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.
Features are combined in the log-linear model. To obtain the best translation \hat{e} of the source sentence f, the log-linear model uses the following equation:

\hat{e} = \arg\max_{e} p(e|f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)   (1)

in which h_m and \lambda_m denote the m-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
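As a sketch of Equation (1): among candidate translations, the decoder picks the one maximizing the weighted feature sum. Candidate generation itself is the decoder's job and is abstracted away here; names and scores are illustrative.

```python
# Illustrative log-linear scoring: each candidate carries its feature vector
# [h_1, ..., h_M]; the best translation maximizes sum_m lambda_m * h_m.
def decode(candidates, weights):
    """candidates: list of (translation_string, feature_vector) pairs."""
    def score(feats):
        return sum(lam * h for lam, h in zip(weights, feats))
    return max(candidates, key=lambda c: score(c[1]))[0]
```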
7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm
4.3 Results
Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain.
Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement. The Baseline+BiMWE method achieves slightly higher translation quality than the baseline system.
To our disappointment, however, none of these improvements are statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined by the following formula:

Score(s, t) = \log(LLR(s, t)) \times \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}   (2)

Here the mean \mu and variance \sigma^2 are estimated from the training set. After ranking by score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
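Equation (2) can be computed directly; a minimal sketch (all values are illustrative):

```python
import math

# Ranking score of a phrase pair: the log of its LLR weighted by a Gaussian
# density over the source/target length ratio x, with mean mu and standard
# deviation sigma estimated from the training set.
def ranking_score(llr_value, length_ratio, mu, sigma):
    gauss = (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
            math.exp(-((length_ratio - mu) ** 2) / (2 * sigma ** 2))
    return math.log(llr_value) * gauss
```

Pairs whose length ratio is close to the typical ratio mu are ranked higher, which is how noisy pairs get pushed down the list.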
From this table, we can conclude that: (1) all three methods improve the BLEU score in all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better
Methods            50,000    60,000    70,000
Baseline           0.2658
Baseline+BiMWE     0.2671    0.2686    0.2715
Baseline+Feat      0.2728    0.2734    0.2712
Baseline+NewBP     0.2662    0.2706    0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain.
than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as that of Lambert and Banchs (2005).
To verify the assumption that bilingual MWEs indeed improve SMT performance beyond one particular domain, we also perform experiments in the chemical industry domain. Table 5 shows the results. From this table, we can see that the three methods improve translation performance in the chemical industry domain as well as in the traditional medicine domain.
Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain.
4.4 Discussion
In order to understand in what respects our methods improve translation performance, we manually analyze some test sentences and give some examples in this subsection.
(1) For the first example in Table 6, "Iuml oacute" is aligned to other words and not correctly translated by the baseline system, while it is aligned to the correct target phrase "dredging meridians" by Baseline+BiMWE, since the bilingual MWE ("Iuml oacute", "dredging meridians") has been added into the training corpus and then aligned by GIZA++.
(2) For the second example in Table 6, "" has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as the translation of "", while the Baseline+Feat system chooses "medicated tea", because the additional feature gives a high probability to the correct translation "medicated tea".

Src:      T not auml k Ouml Eacute Aring raquo Iuml oacute ) 9 | Y S _ Ouml otilde egraveE 8
Ref:      the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE:   the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src:      curren iexclJ J NtildeJ 5J
Ref:      the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat:    may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples.
5 Conclusion and Future Work
This paper presents the LLR-based Hierarchical Reducing Algorithm for automatically extracting bilingual MWEs and investigates the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains a bilingual MWE performs best in most cases. The other two strategies can also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.
The strategies for applying bilingual MWEs in this paper are rather simple and coarse. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of bilingual MWEs and examine their effect on SMT performance. Besides application in phrase-based SMT systems, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will conduct further studies on improving statistical machine translation using domain bilingual MWEs.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.
References
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp
Abstract
This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as a labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F measure of 89.79, which is higher than previous work.
1 Introduction
Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) trained on a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).
In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as a basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.
When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.
In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (return home).
On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes apart from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information from the head of the bunsetsu.1 In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.
However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.
This paper proposes Japanese bottom-up named

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.
(1) kikoku-shita Kazama-san-wa
    return-home  Mr. Kazama TOP
    'Mr. Kazama, who returned home'

(2) hatsudou-shita shinyou-kumiai-kyusai-ginkou-no setsuritsu-mo
    invoke          credit union relief bank GEN    establishment
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to                 gaikoku-jin-touroku-hou-ihan-no            utagai-de
    private document falsification and foreigner registration law violation GEN  suspicion INS
    'on suspicion of the private document falsification and the violation of the foreigner registration law'
Figure 1: Example sentences.
entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as a labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F measure of 89.79, which is higher than previous work.
This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on methods based on sequential labeling. Section 3 gives an overview of our proposed method. Section 4 presents two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.
2 Related Work
In Japanese Named Entity Recognition, the definition of Named Entity of the IREX Workshop (IREX
Class           Example
PERSON          Kimura Syonosuke
LOCATION        Taiheiyou (Pacific Ocean)
ORGANIZATION    Jimin-tou (Liberal Democratic Party)
ARTIFACT        PL-houan (PL bill)
DATE            21-seiki (21st century)
TIME            gozen-7-ji (7 a.m.)
MONEY           500-oku-en (50 billion yen)
PERCENT         20 percent

Table 1: NE classes and their examples.
Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.
NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.
In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.2 The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 x 4 + 1).
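The 33-label set of the SE-algorithm can be enumerated directly:

```python
# The eight IREX NE classes, crossed with the four positional prefixes
# (S/B/I/E), plus the single O label for non-NE morphemes: 8 * 4 + 1 = 33.
NE_CLASSES = ["PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
              "DATE", "TIME", "MONEY", "PERCENT"]
LABELS = [p + "-" + c for c in NE_CLASSES for p in ("S", "B", "I", "E")] + ["O"]
```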
The model for the label estimation is learnedbased on machine learning The following fea-tures are generally utilized characters type of
2 Besides, there are the IOB1 and IOB2 algorithms using only I, O, B, and the IOE1 and IOE2 algorithms using only I, O, E (Kim and Veenstra, 1999).
Figure 2: An overview of our proposed method (for the bunsetsu "Habu-Yoshiharu-Meijin").
character, POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVMs or CRFs have been proposed.

Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word subclass of the NE in the analysis direction within the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized a cache feature, coreference results, syntactic features, and a caseframe feature as structural features (Sasano and Kurohashi, 2008).
Some work acquires knowledge from large unannotated corpora and applies it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of dependency relation pairs (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al., 2008).
In Japanese NER research, the CRL NE data are usually used for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F measure of about 0.89 on these data.
3 Overview of Proposed Method
Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are analyzed using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), in which the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the label assignment "Habu-Yoshiharu" (PERSON) and the label assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.
The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)
The two models are described in detail in thenext section
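To make the bottom-up procedure concrete, the following sketch shows the CKY-style dynamic program with toy stand-ins for the two learned models; the scores and both helper functions are illustrative assumptions (taken from the running example's figures), not the paper's actual SVM models:

```python
from functools import lru_cache

def label_score(chunk):
    # Stand-in for the label estimation model (toy scores from the running example).
    toy = {"Habu": ("PERSON", 0.111), "Yoshiharu": ("MONEY", 0.075),
           "Meijin": ("OTHERe", 0.245),
           "Habu-Yoshiharu": ("PERSON", 0.438),
           "Yoshiharu-Meijin": ("OTHERe", 0.092),
           "Habu-Yoshiharu-Meijin": ("ORGANIZATION", 0.083)}
    return toy.get(chunk, ("invalid", 0.0))

def prefer_first(first, second):
    # Stand-in for the label comparison model: here, compare summed scores.
    return sum(s for _, _, s in first) >= sum(s for _, _, s in second)

def best_assignment(morphemes):
    morphemes = tuple(morphemes)

    @lru_cache(maxsize=None)
    def cell(i, j):
        # Best label assignment for morphemes[i:j], CKY-style: compare the
        # whole-span chunk against every binary split of the span.
        chunk = "-".join(morphemes[i:j])
        label, score = label_score(chunk)
        best = ((chunk, label, score),)
        for k in range(i + 1, j):
            cand = cell(i, k) + cell(k, j)
            if not prefer_first(best, cand):
                best = cand
        return best

    return list(cell(0, len(morphemes)))
```

With these toy scores the program reproduces the example above: "Habu-Yoshiharu" (PERSON) + "Meijin" (OTHERe) beats both the whole-bunsetsu ORGANIZATION reading and the fully split one.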
Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"
4 Model Learning
4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions located in multiple bunsetsus tend to be an NE:
• expressions enclosed in parentheses (e.g., "Himeyuri-no tou" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))
Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.3
For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.
Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned the correct label from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (as shown in Table 1) and five classes OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

The chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled as OTHERb, OTHERi, or OTHERe, respectively.4
3As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4Each OTHER is assigned to the longest chunk that satisfies its condition.
1. # of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type5

4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army" this feature is "Katakana/Kanji"

5. word class, word sub-class, and several features provided by the morphological analyzer JUMAN

6. several features6 provided by the parser KNP

7. string of the morpheme in the chunk

8. IPADIC7 feature
   - If the string of the chunk is registered in the following categories of IPADIC, this feature fires: "person", "location", "organization", and "general"

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is a political party)

10. cache feature
    - When the same string of the chunk appears in the preceding context, the label of the preceding chunk is used for the feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
    - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
    - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model
A chunk that is neither one of the eight NE classes nor one of the above four OTHER classes is labeled as invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned
5The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6When a morpheme has ambiguity, all the corresponding features fire.

7http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja
to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of a morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".
The label estimation model is learned from the pairs of label and features of each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output x is transformed using the sigmoid function 1/(1 + exp(-βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.
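As a sketch, this score computation can be written as follows (the raw SVM margins used in the example are made-up numbers, and only three of the 13 labels are shown):

```python
import math

def normalized_label_scores(svm_outputs, beta=1.0):
    """Squash raw one-vs-rest SVM outputs with 1/(1+exp(-beta*x)) and
    normalize so that a chunk's label scores sum to one."""
    squashed = {label: 1.0 / (1.0 + math.exp(-beta * x))
                for label, x in svm_outputs.items()}
    total = sum(squashed.values())
    return {label: v / total for label, v in squashed.items()}

# Hypothetical raw margins for three of the 13 labels of one chunk.
scores = normalized_label_scores({"PERSON": 1.2, "invalid": -0.3, "OTHERs": -2.0})
```

The normalization makes scores of different chunks comparable when assignments are summed and compared in the next step.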
The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. In a chunk where the label invalid has the highest score, the label that has the second highest score is adopted.
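The fallback to the second-highest label when invalid wins can be sketched as (helper name ours):

```python
def chunk_label(scores):
    """Label for a chunk given its normalized scores (label -> score);
    if "invalid" ranks first, adopt the second-highest label instead."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[1] if ranked[0] == "invalid" else ranked[0]
```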
4.2 Label Comparison Model
This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY
First, as shown in Figure 4, the two compared sets of chunks are lined up, sandwiching "vs." (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise, the example is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for the model learning.
Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used for the labels estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining part.

Figure 4: Assignment of positive/negative examples
Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus in the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin", we cannot determine which one is correct. Therefore, such examples cannot be used for the model learning.
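The zero-padded layout of Figure 5 can be sketched as follows; the helper name and dimension constants are ours, and each per-chunk vector stands for the 13 scores described above:

```python
CHUNK_DIM = 13   # 12 label dimensions + 1 SVM-score dimension per chunk
MAX_CHUNKS = 5   # sets with more than five chunks are discarded

def comparison_features(first_set, second_set):
    """Feature vector for one "first vs. second" example: each set's
    per-chunk vectors are arranged from the left and the remaining slots
    are zero vectors, giving 2 * 5 * 13 = 130 dimensions in total."""
    def pad(chunk_vectors):
        if len(chunk_vectors) > MAX_CHUNKS:
            raise ValueError("example discarded: more than five chunks")
        flat = []
        for i in range(MAX_CHUNKS):
            flat.extend(chunk_vectors[i] if i < len(chunk_vectors)
                        else [0.0] * CHUNK_DIM)
        return flat
    return pad(first_set) + pad(second_set)

# "Habu-Yoshiharu" vs. "Habu" + "Yoshiharu" (V11 vs. V21, V22 in Figure 5).
v11, v21, v22 = [0.1] * CHUNK_DIM, [0.2] * CHUNK_DIM, [0.3] * CHUNK_DIM
x = comparison_features([v11], [v21, v22])
```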
5 Analysis
First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the
chunk:  "Habu-Yoshiharu"  |  "Habu", "Yoshiharu"
label:  PERSON            |  PERSON, MONEY
vector: V11 0 0 0 0       |  V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu"; V11, V21, V22, and 0 are vectors whose dimension is 13)
Figure 6: The label comparison model: (a) label assignment for the cell "Habu-Yoshiharu"; (b) label assignment for the cell "Habu-Yoshiharu-Meijin"
label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper-right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparison step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
6 Experiment
To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In this data, 10,718 sentences in 1,174 news articles are annotated with the eight NEs. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were used for neither the model learning8 nor the evaluation.

Figure 7: 5-fold cross validation

We performed 5-fold cross validation following previous work. Different from previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, label estimation model 1 is learned from parts (b)-(e). Then, label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, label estimation model 2c is learned from parts (b), (d), and (e), and applying it to part (c) yields features. The same holds for parts
8Exceptionally, "OPTIONAL" is used when the label estimation models for OTHER* and invalid are learned.
              Recall              Precision
ORGANIZATION  81.83 (3008/3676)   88.37 (3008/3404)
PERSON        90.05 (3458/3840)   93.87 (3458/3684)
LOCATION      91.38 (4992/5463)   92.44 (4992/5400)
ARTIFACT      46.72 ( 349/ 747)   74.89 ( 349/ 466)
DATE          93.27 (3327/3567)   93.12 (3327/3573)
TIME          88.25 ( 443/ 502)   90.59 ( 443/ 489)
MONEY         93.85 ( 366/ 390)   97.60 ( 366/ 375)
PERCENT       95.33 ( 469/ 492)   95.91 ( 469/ 489)
ALL-SLOT      87.87               91.79
F-measure     89.79

Table 3: Experimental result
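The summary rows of Table 3 follow from the per-slot counts by the standard definitions; as a quick check:

```python
def f_measure(recall, precision):
    # Harmonic mean of recall and precision (both in percent).
    return 2 * recall * precision / (recall + precision)

# ORGANIZATION recall in Table 3: 3008 correct out of 3676 gold slots.
org_recall = 100 * 3008 / 3676        # 81.83
# Overall F-measure from the ALL-SLOT recall/precision row.
overall_f = f_measure(87.87, 91.79)   # 89.79
```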
(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both label estimation model 1 and the label comparison model.
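The division scheme of Figure 7 can be sketched as follows (the part names and the helper are illustrative):

```python
def nested_cv_plan(parts=("a", "b", "c", "d", "e")):
    """For each held-out test part: model 1 trains on the other four
    parts; for each of those four, an auxiliary label estimation model
    trains on the remaining three and is applied to it, producing the
    features for the label comparison model."""
    plan = {}
    for test in parts:
        train = [p for p in parts if p != test]
        aux_runs = {held_out: [p for p in train if p != held_out]
                    for held_out in train}
        plan[test] = {"model1_train": train, "aux_runs": aux_runs}
    return plan
```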
In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.
Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.
7 Discussion
7.1 Comparison with Previous Work
Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge amount of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") using some patterns over a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for the model learning. Therefore, by incorporating such knowledge and/or analysis results into our method, the performance would be improved.
Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION relatively drops. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

9http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for the label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", the information of "hou" can be utilized.
7.2 Error Analysis
There were some errors in analyzing Katakana alphabet words. In the following example, although the correct analysis is that "Batistuta" is labeled as PERSON, the system labeled it as OTHERs.
(4) Italy-de    katsuyaku-suru  Batistuta-wo   kuwaeta  Argentine
    Italy-LOC   active          Batistuta-ACC  call     Argentine
    'Argentine called Batistuta, who was active in Italy'
There is no entry for "Batistuta" in the JUMAN dictionary or Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".
There were also some errors in applying the label comparison model although the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis is that "HongKong-seityou" is labeled as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we plan to adjust the value of β in the sigmoid function and to refine the
                                F1     analysis unit   distinctive features
(Fukushima et al., 2008)        89.29  character       Web
(Kazama and Torisawa, 2008)     88.93  character       Wikipedia/Web
(Sasano and Kurohashi, 2008)    89.40  morpheme        structural information
(Nakano and Hirai, 2004)        89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)  87.21  character
(Isozaki and Kazawa, 2003)      86.77  morpheme
proposed method                 89.79  compound noun   Wikipedia/structural information

Table 4: Comparison with previous work (all work was evaluated on CRL NE data using cross validation)
Figure 8: An example of the error in the label comparison model: (a) initial state; (b) the final output
features for the label comparison model.
8 Conclusion
This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. This method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.
We plan to integrate this method with the syntactic and case analysis method (Kawahara and Kurohashi, 2007), and to perform syntactic, case, and Named Entity analysis simultaneously to improve the overall accuracy.
References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970-979. (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304-311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407-415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173-179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of COLING/ACL 2006, pages 1121-1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282-289.

Masayuki Asahara and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8-15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934-941. (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607-612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171-178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63-70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Abbreviation Generation for Japanese Multi-Word Expressions

Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

[The body of this paper (pages 63-70) is not recoverable from the text extraction; only fragments survive, such as the abbreviation examples ファミリーレストラン → ファミレス ("family restaurant") and ミュージックステーション → ステ.]
Author Index
Bhutada, Pravin, 17

Cao, Jie, 47
Caseli, Helena, 1

Diab, Mona, 17

Finatto, Maria José, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55

Hoang, Hung Huu, 31
Huang, Yun, 47

Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55

Liu, Qun, 47
Lü, Yajuan, 47

Machado, André, 1

Ren, Zhixiang, 47

Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63

Villavicencio, Aline, 1

Wakaki, Hiromi, 63

Zarrieß, Sina, 23
Introduction
The ACL 2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE'09) took place on August 6, 2009 in Singapore, immediately following the annual meeting of the Association for Computational Linguistics (ACL). This is the fifth time this workshop has been held in conjunction with ACL, following the meetings in 2003, 2004, 2006 and 2007.

The workshop focused on Multi-Word Expressions (MWEs), which represent an indispensable part of natural languages and appear steadily on a daily basis, both novel and already existing but paraphrased, which makes them important for many natural language applications. Unfortunately, while easily mastered by native speakers, MWEs are often non-compositional, which poses a major challenge for both foreign language learners and automatic analysis.

The growing interest in MWEs in the NLP community has led to many specialized workshops held every year since 2001 in conjunction with ACL, EACL and LREC; there have also been two recent special issues on MWEs published by leading journals: the International Journal of Language Resources and Evaluation and the Journal of Computer Speech and Language.

As a result of the overall progress in the field, the time has come to move from basic preliminary research to actual applications in real-world NLP tasks. Thus, in MWE'09 we were interested in the overall process of dealing with MWEs, asking for original research on the following four fundamental topics:
Identification: Identifying MWEs in free text is a very challenging problem. Due to the variability of expression, it does not suffice to collect and use a static list of known MWEs; complex rules and machine learning are typically needed as well.

Interpretation: Semantically interpreting MWEs is a central issue. For some kinds of MWEs, e.g., noun compounds, it could mean specifying their semantics using a static inventory of semantic relations, e.g., WordNet-derived. In other cases, an MWE's semantics could be expressible by a suitable paraphrase.

Disambiguation: Most MWEs are ambiguous in various ways. A typical disambiguation task is to determine whether an MWE is used non-compositionally (i.e., figuratively) or compositionally (i.e., literally) in a particular context.

Applications: Identifying MWEs in context and understanding their syntax and semantics is important for many natural language applications, including but not limited to question answering, machine translation, information retrieval, information extraction, and textual entailment. Still, despite the growing research interest, there are not enough successful applications in real NLP problems, which we believe is the key for the advancement of the field.

Of course, the above topics largely overlap. For example, identification can require disambiguating between literal and idiomatic uses, since MWEs are typically required to be non-compositional by definition. Similarly, interpreting three-word noun compounds like morning flight ticket and plastic water bottle requires disambiguation between a left and a right syntactic structure, while interpreting two-word compounds like English teacher requires disambiguating between (a) 'teacher who teaches English' and (b) 'teacher coming from England (who could teach any subject, e.g., math)'.

We received 18 submissions, and given our limited capacity as a one-day workshop, we were only able to accept 9 full papers for oral presentation: an acceptance rate of 50%.

We would like to thank the members of the Program Committee for their timely reviews. We would also like to thank the authors for their valuable contributions.

Dimitra Anastasiou, Chikara Hashimoto, Preslav Nakov and Su Nam Kim
Co-Organizers
Organizers
Dimitra Anastasiou, Localisation Research Centre, Limerick University, Ireland
Chikara Hashimoto, National Institute of Information and Communications Technology, Japan
Preslav Nakov, National University of Singapore, Singapore
Su Nam Kim, University of Melbourne, Australia
Program Committee
Iñaki Alegria, University of the Basque Country (Spain)
Timothy Baldwin, University of Melbourne (Australia)
Colin Bannard, Max Planck Institute (Germany)
Francis Bond, National Institute of Information and Communications Technology (Japan)
Gaël Dias, Beira Interior University (Portugal)
Ulrich Heid, Stuttgart University (Germany)
Stefan Evert, University of Osnabrück (Germany)
Afsaneh Fazly, University of Toronto (Canada)
Nicole Grégoire, University of Utrecht (The Netherlands)
Roxana Girju, University of Illinois at Urbana-Champaign (USA)
Kyo Kageura, University of Tokyo (Japan)
Brigitte Krenn, Austrian Research Institute for Artificial Intelligence (Austria)
Éric Laporte, University of Marne-la-Vallée (France)
Rosamund Moon, University of Birmingham (UK)
Diana McCarthy, University of Sussex (UK)
Jan Odijk, University of Utrecht (The Netherlands)
Stephan Oepen, University of Oslo (Norway)
Darren Pearce, London Knowledge Lab (UK)
Pavel Pecina, Charles University (Czech Republic)
Scott Piao, University of Manchester (UK)
Violeta Seretan, University of Geneva (Switzerland)
Stan Szpakowicz, University of Ottawa (Canada)
Beata Trawiński, University of Tübingen (Germany)
Peter Turney, National Research Council of Canada (Canada)
Kiyoko Uchiyama, Keio University (Japan)
Begoña Villada Moirón, University of Groningen (The Netherlands)
Aline Villavicencio, Federal University of Rio Grande do Sul (Brazil)
Table of Contents
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan 9

Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn 23

A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi 55

Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita 63
Workshop Program
Thursday, August 6, 2009
8:30-8:45 Welcome and Introduction to the Workshop

Session 1 (08:45-10:00): MWE Identification and Disambiguation

08:45-09:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto

09:10-09:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

09:35-10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00-10:30 BREAK

Session 2 (10:30-12:10): Identification, Interpretation and Disambiguation

10:30-10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55-11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20-11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45-13:50 LUNCH

Session 3 (13:50-15:30): Applications

13:50-14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang

14:15-14:40 Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40-15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05-15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30-16:00 BREAK

16:00-17:00 General Discussion

17:00-17:15 Closing Remarks
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 1–8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, André Machado♣, Maria José Finatto♥

♦Department of Computer Science, Federal University of São Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br
Abstract
Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.
1 Introduction
A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine) and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).
Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.
In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.
The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and section 6 finishes this paper with some conclusions and proposals for future work.
2 Related Work
The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g. to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.
A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g. Zhang et al. (2006), Villavicencio et al. (2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g. Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g. Villada Moirón and Tiedemann (2006), Caseli et al. (2009)).
As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).
Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that, for MWE identification, the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).
Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).
Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.
Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.
3 The Corpus and Reference Lists
The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36741 ngrams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an ngram was part of a larger ngram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3645 ngrams, which were manually checked by translation students and resulted in 2407 terms, with 1421 bigrams, 730 trigrams and 339 ngrams with n larger than 3 (not considered in the experiments presented in this paper).
4 Statistically-Driven and Alignment-Based Methods
4.1 Statistically-Driven Method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992) and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task: PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

1 Available in the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim
From the Portuguese portion of the Corpus of Pediatrics, 196105 bigram and 362663 trigram MWE candidates were generated, after filtering ngrams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64839 bigrams and 54548 trigrams. Each of the measures was then calculated for these ngrams, and we ranked each ngram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
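As a rough illustration of this setup, the sketch below scores bigrams by PMI with the frequency cut-off of 2 mentioned above, and averages the rank positions a candidate receives under several measures to form a combined ranking. It is a minimal stand-in for the Ngram Statistics Package; all function names are our own:

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score bigrams by Pointwise Mutual Information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ),
    with probabilities estimated from corpus frequencies."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), f in bigrams.items():
        if f < 2:  # frequency cut-off of 2 occurrences, as in the paper
            continue
        p_xy = f / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

def combined_rank(*rankings):
    """Average the rank position (0 = best) a candidate receives
    under several association measures: the combined measure."""
    positions = {}
    for scores in rankings:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for pos, cand in enumerate(ordered):
            positions.setdefault(cand, []).append(pos)
    return {c: sum(p) / len(p) for c, p in positions.items()}
```

A second measure (e.g. an MI scorer) would simply be passed to `combined_rank` alongside `pmi_scores`.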
4.2 Alignment-Based Method
The second of the MWE extraction approaches to be investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, i.e. a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that: (a) S and T share some semantic features, and (b) S may be an MWE.
In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
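A minimal sketch of this candidate-extraction criterion, assuming the word alignment is given as (source index, target index) links of the kind a word aligner such as GIZA++ produces; the function name and data layout are illustrative, not the authors' implementation:

```python
from collections import defaultdict

def mwe_candidates(src_tokens, alignment):
    """Extract source-side MWE candidates from one aligned sentence pair.

    `alignment` is a set of (src_index, tgt_index) links. A candidate is
    a run of two or more consecutive source words all linked to the same
    target unit (an n : m alignment with n >= 2, m >= 1)."""
    links = defaultdict(set)
    for s, t in alignment:
        links[s].add(t)
    candidates = []
    i = 0
    while i < len(src_tokens):
        j = i + 1
        # extend the span while the next word shares the same target set
        while j < len(src_tokens) and links[j] and links[j] == links[i]:
            j += 1
        if j - i >= 2 and links[i]:
            candidates.append(tuple(src_tokens[i:j]))
        i = j
    return candidates
```

Running this over every sentence pair and counting the resulting tuples yields the raw candidate list that the filters below then prune.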
Thus, notice that the alignment-based MWE extraction method does not rely on the conceptual asymmetries between languages, since it does not expect that a source sequence of words be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, despite the number of target words involved. These features indicate that the method prioritizes precision over recall.
It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.
In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as gold standard.
To perform the extraction of MWE candidates following the alignment-based approach, first the original corpus had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps we used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).
2GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html
3Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org
From the preprocessed corpus, the MWE candidates are extracted as those in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in (Caseli et al., 2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as (Caseli et al., 2009), plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral, and (b) a minimum frequency threshold of 2.
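The shared shape of these three filters can be sketched as below. The tag names and the argument layout are illustrative (they are not Apertium's actual tagset); only the filtering logic follows the description above:

```python
def apply_filter(candidates, pos_tags, freq, bad_first, bad_last, min_freq):
    """Filter MWE candidates the way F1-F3 do: discard a candidate if its
    first or last POS tag matches a forbidden pattern, or if its corpus
    frequency falls below the threshold. `pos_tags` maps a candidate
    tuple to its tag sequence; `freq` maps it to its corpus frequency."""
    kept = []
    for cand in candidates:
        tags = pos_tags[cand]
        if tags[0] in bad_first or tags[-1] in bad_last:
            continue  # matches a forbidden POS pattern
        if freq[cand] < min_freq:
            continue  # below the minimum frequency threshold
        kept.append(cand)
    return kept
```

F1, F2 and F3 then differ only in the `bad_first`/`bad_last` sets and in `min_freq` (5 for F1, 2 for F2 and F3).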
5 Experiments and Results
Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. play video game). From table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).
In table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 × precision × recall) / (precision + recall)) figures for the association measures, using all the candidates (on the
PMI                            alignment-based
Online Mendelian Inheritance   faixa etária
Beta Technology Incorporated   aleitamento materno
Lange Beta Technology          estados unidos
Óxido Nítrico Inalatório       hipertensão arterial
jogar video game               leite materno
e um de                        couro cabeludo
e a do                         bloqueio lactíferos
se que de                      emocional anatomia
e a da                         neonato a termo
e de não                       duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and alignment-based approach
pt MWE candidates     PMI     MI
# proposed bigrams    64839   64839
# correct MWEs        1403    1403
precision             2.16%   2.16%
recall                98.73%  98.73%
F                     4.23%   4.23%
# proposed trigrams   54548   54548
# correct MWEs        701     701
precision             1.29%   1.29%
recall                96.03%  96.03%
F                     2.55%   2.55%
# proposed bigrams    1421    1421
# correct MWEs        155     261
precision             10.91%  18.37%
recall                10.91%  18.37%
F                     10.91%  18.37%
# proposed trigrams   730     730
# correct MWEs        44      20
precision             6.03%   2.74%
recall                6.03%   2.74%
F                     6.03%   2.74%

Table 2: Evaluation of MWE candidates - PMI and MI
first half of the table) and using the top 1421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.
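The precision, recall and F-measure figures reported in these tables follow the standard definitions and can be computed as in this small sketch (the function name is ours):

```python
def evaluate(candidates, gold):
    """Precision, recall and F-measure of a candidate list against a
    gold-standard reference list (here, the Pediatrics Glossary)."""
    correct = len(set(candidates) & set(gold))
    precision = correct / len(candidates) if candidates else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Note that when the number of proposed candidates equals the size of the gold list (as with the top-1421 bigram cut-off), precision and recall coincide, which is why those rows in table 2 repeat the same value.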
On the other hand, looking at the alignment-based method, 34277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered following the three filters described in section 4.2: F1, F2 and F3.
To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic comparison we considered the final lists of MWE candidates generated by each filter in table 3. The number of matching entries and the values for precision, recall and F-measure are shown in table 4.

pt MWE candidates            F1     F2     F3
# filtered by POS patterns   24996  21544  32644
# filtered by frequency      9012   11855  1442
# final set                  269    878    191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates       F1      F2      F3
# proposed bigrams      250     754     169
# correct MWEs          48      95      65
precision               19.20%  12.60%  38.46%
recall                  3.38%   6.69%   4.57%
F                       5.75%   8.74%   8.18%
# proposed trigrams     19      110     20
# correct MWEs          1       9       4
precision               5.26%   8.18%   20.00%
recall                  0.14%   1.23%   0.55%
F                       0.27%   2.14%   1.07%
# proposed bi+trigrams  269     864     189
# correct MWEs          49      104     69
precision               18.22%  12.04%  36.51%
recall                  2.28%   4.83%   3.21%
F                       4.05%   6.90%   5.90%

Table 4: Evaluation of MWE candidates
The different numbers of extracted MWEs (in table 3) and evaluated ones (in table 4) are due to the restriction of considering only the bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not being considered at the moment.
After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in (Caseli et al., 2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in section 4.2) were manually evaluated.
From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.
The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates a good agreement.
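Cohen's kappa as used here (observed agreement corrected for chance agreement) can be computed for two judges' boolean verdicts as in this sketch; the function name is ours:

```python
def cohen_kappa(judge1, judge2):
    """Cohen's kappa for two judges' boolean MWE judgments
    (Carletta, 1996): k = (p_o - p_e) / (1 - p_e)."""
    n = len(judge1)
    # observed agreement: fraction of items both judges label the same
    p_o = sum(a == b for a, b in zip(judge1, judge2)) / n
    # expected (chance) agreement from each judge's marginal rates
    p1_true = sum(judge1) / n
    p2_true = sum(judge2) / n
    p_e = p1_true * p2_true + (1 - p1_true) * (1 - p2_true)
    return (p_o - p_e) / (1 - p_e)
```

With 12% disagreement, p_o is 0.88; the reported k = 0.73 then reflects the judges' particular true/false marginals.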
In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating it with respect to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges, and 145 (75.92%) by at least one of them.
6 Conclusions and Future Work
MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.
In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.
Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both or to at least one of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to say that in this manual analysis the human experts classified the MWEs as true independently of them being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.
In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.
Acknowledgments
We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for the financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.
References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50–59. Springer-Verlag, May.
Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.
Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047–2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381.
Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistics. Computational Linguistics, 22(2):249–254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49–56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117–134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.
Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.
Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165–178. Oxford: Oxford University Press.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534–559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33–40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440–447, Hong Kong, China, October.
Darren Pearce 2002 A Comparative Evaluation ofCollocation Extraction Techniques In Proceedingsof the Third International Conference on LanguageResources and Evaluation pages 1ndash7 Las PalmasCanary Islands Spain
Scott S L Piao Guangfan Sun Paul Rayson andQi Yuan 2006 Automatic Extraction of Chi-nese Multiword Expressions with a Statistical ToolIn Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006)pages 17ndash24 Trento Italy April
William H Press Saul A Teukolsky William T Vet-terling and Brian P Flannery 1992 NumericalRecipes in C The Art of Scientific Computing Sec-ond edition Cambridge University Press
Ivan A Sag Timothy Baldwin Francis Bond AnnCopestake and Dan Flickinger 2002 MultiwordExpressions A Pain in the Neck for NLP In Pro-ceedings of the Third International Conference onComputational Linguistics and Intelligent Text Pro-cessing (CICLing-2002) volume 2276 of (LectureNotes in Computer Science) pages 1ndash15 LondonUK Springer-Verlag
Tim Van de Cruys and Bego na Villada Moiron 2007Semantics-based Multiword Expression ExtractionIn Proceedings of the Workshop on A Broader Pre-spective on Multiword Expressions pages 25ndash32Prague June
Aline Villavicencio Valia Kordoni Yi Zhang MarcoIdiart and Carlos Ramisch 2007 Validation andEvaluation of Automatically Acquired Multiword
Expressions for Grammar Engineering In Pro-ceedings of the 2007 Joint Conference on Empiri-cal Methods in Natural Language Processing andComputational Natural Language Learning pages1034ndash1043 Prague June
Aline Villavicencio 2005 The Availability of Verb-Particle Constructions in Lexical Resources HowMuch is Enough Journal of Computer Speech andLanguage Processing 19(4)415ndash432
Yi Zhang Valia Kordoni Aline Villavicencio andMarco Idiart 2006 Automated Multiword Ex-pression Prediction for Grammar Engineering InProceedings of the Workshop on Multiword Expres-sions Identifying and Exploiting Underlying Prop-erties pages 36ndash44 Sydney Australia July Asso-ciation for Computational Linguistics
8
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 9–16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim
CSSE Dept., University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science, National University of Singapore
kanmy@comp.nus.edu.sg
Abstract
We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used in the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many existing features are usable in unsupervised extraction.
1 Introduction
Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also serve as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).
In the past, various attempts have been made to boost automatic keyphrase extraction performance based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and on rich sets of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.
Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g., Frank et al. (1999); Park et al. (2004); Nguyen and Kan (2007)), most of them do not effectively deal with the various forms keyphrases take, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.
Secondly, previous studies have shown the effectiveness of their own features, but few compared their features against other existing features. This leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, the supervised approach involves a large amount of manual labor, which reduces its utility for real-world applications. Hence, an unsupervised approach is needed to minimize manual effort and to encourage wider use. It is thus worthwhile to verify the reliability and re-usability of existing features in an unsupervised approach, in order to set up tentative guidelines for applications.
This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method that improves the coverage of candidates occurring in various forms. Our second contribution re-examines existing keyphrase extraction features reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.
In the remaining sections, we give an overview of related work in Section 2, our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. Then we evaluate our proposals, discuss the outcomes and conclude our work in Sections 7, 8 and 9, respectively.
2 Related Work
The majority of related work has been carried out using statistical approaches, rich sets of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e., the relationship between the document and its keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and, to a lesser extent, (2) keyphrase cohesion features (i.e., relationships among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e., relationships among the components of a keyphrase) (Park et al., 2004).
The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TFIDF (i.e., term frequency, inverse document frequency) and first occurrence in the document. TFIDF measures document cohesion, and first occurrence implies the importance of the abstract or introduction, indicating that keyphrases exhibit locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head-noun heuristics that took three features: candidate length, frequency, and head-noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. Textract (Park et al., 2004) also ranks candidate keyphrases by its judgment of the keyphrases' degree of domain specificity, based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of other documents in the same cluster.
3 Keyphrase Analysis
In previous studies, KEA employed the indexing words as candidates, whereas others such as Park et al. (2004) and Nguyen and Kan (2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for details of the data.
Syntactically, keyphrases can be formed from either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and, in rare cases, adverbs (e.g., mobile network, fast computing, partially observable Markov decision process). They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.
However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we found that keyphrases also occur as simple conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

Criteria       Rules
Frequency      (Rule1) Frequency heuristic, i.e., frequency >= 2 for simplex words vs. frequency >= 1 for NPs.
Length         (Rule2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
               (e.g., synchronous concurrent program vs. model of multiagent interaction).
Alternation    (Rule3) of-PP form alternation
               (e.g., number of sensor = sensor number, history of past encounter = past encounter history).
               (Rule4) Possessive alternation
               (e.g., agent's goal = goal of agent, security's value = value of security).
Extraction     (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
               (e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture).
               (Rule6) Simplex Word/NP IN Simplex Word/NP
               (e.g., quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
               simplified instantiation of zebroid (simplified instantiation extracted)).

Table 1: Candidate Selection Rules
A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word-order alternation. Previous studies have used the altered keyphrases when they occur in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).
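The two alternations described above (Rule3 and Rule4 in Table 1) are simple enough to sketch in code. The following Python fragment is an illustrative reconstruction, not the authors' implementation; the function names are ours, and a real system would operate on lemmatized phrases:

```python
def of_pp_alternation(phrase):
    """Rule3-style alternation: 'X of Y' -> 'Y X'
    (e.g. 'number of sensor' -> 'sensor number')."""
    if " of " in phrase:
        head, _, modifier = phrase.partition(" of ")
        return f"{modifier} {head}"
    return phrase

def possessive_alternation(phrase):
    """Rule4-style alternation: "X's Y" -> 'Y of X'
    (e.g. "agent's goal" -> 'goal of agent')."""
    if "'s " in phrase:
        owner, _, thing = phrase.partition("'s ")
        return f"{thing} of {owner}"
    return phrase

print(of_pp_alternation("number of sensor"))   # sensor number
print(possessive_alternation("agent's goal"))  # goal of agent
```

Treating the alternated form as equivalent to the original is what lets a system match, e.g., a gold keyphrase service quality against an extracted quality of service.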
4 Candidate Selection
We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated with term extraction research, since the top-N ranked terms become the keyphrases of a document. In previous studies, KEA employed the indexing words as candidates, whereas others such as Park et al. (2004) and Nguyen and Kan (2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe our keyphrase analysis in detail.
In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.
Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.
Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule6 for NPs in of-PP form broadens the coverage of possible candidates: given an NP in of-PP form, not only do we collect the simplex word(s), but we also extract the non-of-PP form of the NPs, from both the noun phrase governing the PP and the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, whereas previous works' rules do not.
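A Rule5/Rule6-style extractor can be sketched over POS-tagged tokens (Penn Treebank tags) as follows. This is our own minimal sketch of the kind of pattern matching Table 1 describes, not the authors' code; the function names and the span-based matching strategy are assumptions:

```python
NOUN = {"NN", "NNS", "NNP", "NNPS"}
ADJ = {"JJ", "JJR", "JJS"}

def base_nps(tagged):
    """Rule5: maximal (ADJ|NOUN)* NOUN spans, as (start, end) indices."""
    spans, i = [], 0
    while i < len(tagged):
        if tagged[i][1] in NOUN | ADJ:
            j = i
            while j < len(tagged) and tagged[j][1] in NOUN | ADJ:
                j += 1
            # trim trailing adjectives so every span ends in a noun
            while j > i and tagged[j - 1][1] in ADJ:
                j -= 1
            if j > i:
                spans.append((i, j))
            i = max(j, i + 1)
        else:
            i += 1
    return spans

def candidates(tagged):
    """Rule5 NPs, plus Rule6 'NP IN NP' candidates built from two base
    NPs separated by exactly one preposition (IN) token."""
    spans = base_nps(tagged)
    cands = [" ".join(w for w, _ in tagged[i:j]) for i, j in spans]
    for (i1, j1), (i2, j2) in zip(spans, spans[1:]):
        if j1 == i2 - 1 and tagged[j1][1] == "IN":
            cands.append(" ".join(w for w, _ in tagged[i1:j2]))
    return cands

tagged = [("effective", "JJ"), ("algorithm", "NN"), ("of", "IN"),
          ("grid", "NN"), ("computing", "NN")]
print(candidates(tagged))
# ['effective algorithm', 'grid computing', 'effective algorithm of grid computing']
```

Note how the of-PP candidate and both of its constituent NPs are all collected, mirroring the broadened Rule6 behaviour described above.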
5 Feature Engineering
With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls to careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing features and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S, U).
5.1 Document Cohesion
Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality; the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.
F1: TFIDF (S, U). TFIDF indicates document cohesion by looking at the frequency of terms in documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF in substrings as well as in exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e., we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system:
- (F1a) TFIDF
- (F1b) TF including counts of substrings
- (F1c) TF of substrings as a separate feature
- (F1d) normalized TF by candidate type (i.e., simplex words vs. NPs)
- (F1e) normalized TF by candidate type as a separate feature
- (F1f) IDF using Google n-grams
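The substring-aware TF counting behind F1b/F1c can be illustrated with a small sketch. This is our own toy code under assumed data structures (a flat list of extracted candidate strings), not the paper's implementation:

```python
import math
from collections import Counter

def tf_with_substrings(candidate_occurrences):
    """Count each candidate, crediting it also when it appears as a
    word-level substring of a longer candidate (e.g. 'grid computing'
    inside 'grid computing algorithm')."""
    counts = Counter(candidate_occurrences)
    tf = {}
    for cand in counts:
        tf[cand] = sum(n for other, n in counts.items()
                       if f" {cand} " in f" {other} ")
    return tf

def tfidf(tf, df, n_docs):
    """Plain TF*IDF; in the unsupervised setting the document
    frequencies df could come from a large generic source such as
    web n-gram counts rather than a task-specific corpus."""
    return {c: tf[c] * math.log(n_docs / (1 + df.get(c, 0)))
            for c in tf}

occurrences = ["grid computing", "grid computing algorithm",
               "grid computing"]
print(tf_with_substrings(occurrences))
# {'grid computing': 3, 'grid computing algorithm': 1}
```

Here grid computing receives credit for its two exact occurrences plus one occurrence inside the longer candidate, matching the intuition described for F1b above.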
F2: First Occurrence (S, U). KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.
F3: Section Information (S, U). Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title, and/or references.
F4: Additional Section Information (S, U). We first added the related work (or previous work) section as a section type not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.
- (F4a) section 'related/previous work'
- (F4b) counting substrings occurring in key sections
- (F4c) section TF across all key sections
- (F4d) weighting key sections according to the portion of keyphrases found
F5: Last Occurrence (S, U). Similar to distance in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.
5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic – a document that represents a collection may first need to be segmented into its constituent topics.
F6: Co-occurrence of Another Candidate in a Section (S, U). When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we use the number of sections in which candidates co-occur as a feature.
F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior on which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer[1] collection to create this feature.
- (F7a) co-occurrence (Boolean) in title collection
- (F7b) co-occurrence (TF) in title collection
F8: Keyphrase Cohesion (S, U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top-N ranked candidates and the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent two-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.

[1] It contains 13M titles from articles, papers and reports.
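The contextual-similarity approximation for F8 can be illustrated as follows. This is a toy reconstruction of the idea, not the authors' code: the tokenized sentences and function names are invented, and a real system would restrict neighbors to nouns, verbs and adjectives via a POS tagger:

```python
import math
from collections import Counter

def context_vector(candidate_words, sentences):
    """Bag of words co-occurring in sentences with the candidate,
    excluding the candidate's own words (one row of the matrix)."""
    vec = Counter()
    for sent in sentences:
        if any(w in sent for w in candidate_words):
            vec.update(w for w in sent if w not in candidate_words)
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

sentences = [["keyphrase", "extraction", "uses", "candidate", "features"],
             ["candidate", "selection", "affects", "extraction"]]
u = context_vector({"keyphrase"}, sentences)
v = context_vector({"candidate"}, sentences)
print(round(cosine(u, v), 2))  # 0.67
```

Candidates whose neighbor-word vectors point in similar directions are treated as mutually cohesive, approximating the corpus-based similarity judgments of the original work.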
5.3 Term Cohesion
Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).
F9: Term Cohesion (S, U). Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our earlier discussion:
- (F9a) term cohesion by Park et al. (2004)
- (F9b) normalized TF by candidate type (i.e., simplex words vs. NPs)
- (F9c) applying different weights by candidate type
- (F9d) normalized TF and different weighting by candidate type
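The core of F9a is a generalized Dice coefficient over a phrase and its component words. The function below is an illustrative sketch under our own naming, with invented example counts; the discounting for simplex words and the TF variants (F9b–d) described above would wrap around it:

```python
def dice_cohesion(phrase_tf, component_tfs):
    """Generalized Dice coefficient for a multi-word candidate:
    n * TF(phrase) / sum(TF of each component word)."""
    n = len(component_tfs)
    return n * phrase_tf / sum(component_tfs)

# Hypothetical document counts: 'grid computing' occurs 8 times,
# 'grid' 10 times and 'computing' 12 times.
print(round(dice_cohesion(8, [10, 12]), 3))  # 0.727
```

The score approaches 1 when the component words almost always appear together as the phrase, and falls toward 0 when they mostly occur independently.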
5.4 Other Features
F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only in the N&K configuration, to validate our candidate selection method.
F11: POS Sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. Their work showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like acronyms, this feature is also data-set dependent.
F12: Suffix Sequence (S). Similar to acronyms, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data-dependent, and thus used in the supervised approach only.
F13: Length of Keyphrases (S, U). Barker and Corrnacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.
6 System and Data
To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections, such as the title and section bodies, via heuristic rules, and applied a sentence segmenter[2], ParsCit[3] (Councill et al., 2008) for reference collection, a part-of-speech tagger[4] and a lemmatizer[5] (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy[6], and simple weighting.
For evaluation, we collected 250 papers from four different categories[7] of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found many were missing or present only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the number of author-assigned and reader-assigned keyphrases found in the documents.

[2] http://www.eng.ritsumei.ac.jp/asao/resources/sentseg
[3] http://wing.comp.nus.edu.sg/parsCit
[4] http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
[5] http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
[6] http://maxent.sourceforge.net/index.html
[7] C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence–Multiagent Systems) and J.4 (Social and Behavioral Sciences–Economics)
          Author        Reader         Combined
Total     1252 (53)     3110 (111)     3816 (146)
NPs       904           2537           3027
Average   3.85 (4.01)   12.44 (12.88)  15.26 (15.85)
Found     769           2509           2864

Table 2: Statistics of keyphrases
7 Evaluation
The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information, and additional section information (i.e., F1–F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.
Tables 3 and 4 show the performance of our candidate selection method and features for the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10, and 15 candidates.
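The exact matching scheme mentioned here can be made precise with a few lines of code. The following is our own sketch (the gold set and ranked list in the usage example are invented, not values from the tables):

```python
def exact_match_prf(extracted_topn, gold):
    """Exact matching scheme: an extracted phrase counts only if it
    string-matches a gold keyphrase exactly. Returns the matching
    count, precision, recall and F-score."""
    matched = sum(1 for k in extracted_topn if k in gold)
    p = matched / len(extracted_topn)
    r = matched / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return matched, p, r, f

gold = {"grid computing", "candidate selection", "keyphrase extraction"}
top5 = ["grid computing", "algorithm", "keyphrase extraction",
        "feature", "section"]
print(exact_match_prf(top5, gold))
```

Since precision is computed over a fixed cutoff (top 5, 10 or 15), the matching count, precision and recall in the tables are directly linked: precision at N is simply the match count divided by N.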
BestFeatures includes F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in a section), F7b (title overlap), F9a (term cohesion by Park et al. (2004)), and F13 (length of keyphrases). Best w/o TFIDF means using all best features except TFIDF.
In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).[8]
In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, − indicates a performance decline, and ∗ indicates no effect, or an unconfirmed effect due to only small changes in performance. Again, S denotes Maximum Entropy training and U is our unsupervised approach.
8 Discussion
We compared the performance of our candidate selection and feature engineering against simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer-length candidates act as noise and so decreased the overall performance. We also confirmed that candidate alternation accommodates the flexibility of keyphrases, leading to higher candidate coverage as well as better performance.

[8] Due to the page limits, we present the best performance.

Candidates               Method    C   Top 5 (M/P/R/F)          Top 10 (M/P/R/F)          Top 15 (M/P/R/F)
All Candidates           KEA       U   0.03/0.64/0.21/0.32      0.09/0.92/0.60/0.73       0.13/0.88/0.86/0.87
                         KEA       S   0.79/15.84/5.19/7.82     1.39/13.88/9.09/10.99     1.84/12.24/12.03/12.13
                         N&K       S   1.32/26.48/8.67/13.06    2.04/20.36/13.34/16.12    2.54/16.93/16.64/16.78
                         baseline  U   0.92/18.32/6.00/9.04     1.57/15.68/10.27/12.41    2.20/14.64/14.39/14.51
                         baseline  S   1.15/23.04/7.55/11.37    1.90/18.96/12.42/15.01    2.44/16.24/15.96/16.10
Length<=3 Candidates     KEA       U   0.03/0.64/0.21/0.32      0.09/0.92/0.60/0.73       0.13/0.88/0.86/0.87
                         KEA       S   0.81/16.16/5.29/7.97     1.40/14.00/9.17/11.08     1.84/12.24/12.03/12.13
                         N&K       S   1.40/27.92/9.15/13.78    2.10/21.04/13.78/16.65    2.62/17.49/17.19/17.34
                         baseline  U   0.92/18.40/6.03/9.08     1.58/15.76/10.32/12.47    2.20/14.64/14.39/14.51
                         baseline  S   1.18/23.68/7.76/11.69    1.90/19.00/12.45/15.04    2.40/16.00/15.72/15.86
Length<=3 Candidates     KEA       U   0.01/0.24/0.08/0.12      0.05/0.52/0.34/0.41       0.07/0.48/0.47/0.47
+ Alternation            KEA       S   0.83/16.64/5.45/8.21     1.42/14.24/9.33/11.27     1.87/12.45/12.24/12.34
                         N&K       S   1.53/30.64/10.04/15.12   2.31/23.08/15.12/18.27    2.88/19.20/18.87/19.03
                         baseline  U   0.98/19.68/6.45/9.72     1.72/17.24/11.29/13.64    2.37/15.79/15.51/15.65
                         baseline  S   1.33/26.56/8.70/13.11    2.09/20.88/13.68/16.53    2.69/17.92/17.61/17.76

Table 3: Performance on Proposed Candidate Selection (Match / Precision / Recall / F-score)

Features        C   Top 5 (M/P/R/F)        Top 10 (M/P/R/F)       Top 15 (M/P/R/F)
Best            U   1.14/22.8/7.47/11.3    1.92/19.2/12.6/15.2    2.61/17.4/17.1/17.3
Best            S   1.56/31.2/10.2/15.4    2.50/25.0/16.4/19.8    3.15/21.0/20.6/20.8
Best w/o TFIDF  U   1.14/22.8/7.4/11.3     1.92/19.2/12.6/15.2    2.61/17.4/17.1/17.3
Best w/o TFIDF  S   1.56/31.1/10.2/15.4    2.46/24.6/16.1/19.4    3.12/20.8/20.4/20.6

Table 4: Performance on Feature Engineering (Match / Precision / Recall / F-score)

A   Method   Features
+   S        F1a, F2, F3, F4a, F4d, F9a
    U        F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
−   S        F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
    U        F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
∗   S        F1e, F10, F11, F12
    U        F1b

Table 5: Performance on Each Feature
To re-examine the features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting to TF and/or IDF did not affect performance. We surmise the cause is that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.
As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtained a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same – section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way in both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.
9 Conclusion
We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.
10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.
References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10–17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of ACL, 1989, pp. 76–83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28–30.

Ernesto D'Avanzo and Bernardo Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433–447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668–673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81–104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beáta Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537–544.

Mario Jarmasz and Caroline Barrière. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLiNE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349–357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157–169.

Olena Medelyan and Ian Witten. Thesaurus-based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296–297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207–223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317–326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48–55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä, and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434–439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254–256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17–22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Verb Noun Construction MWE Token Supervised Classification
Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu
Abstract
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.
1 Introduction
In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more than chance. A MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) applications that aim at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.
For our purposes, a MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of an MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as being a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket; however, make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but have varying degrees of compositionality and predictability. Both of these expressions belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.
To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning, and over 40% of their usages in text were actually literal. Thus it would be important for an NLP application, such as machine translation for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as it could potentially have detrimental effects on the quality of the translation.
In this paper, we address the problem of MWE classification for verb-noun construction (VNC) tokens in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.
This paper is organized as follows. In Section 2, we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5, we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.
2 Multi-word Expressions
MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come, ad hoc, and non-fixed expressions such as break new ground, speak of the devil. The VNCs which we are focusing on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living, and Verb Particle Constructions (VPC) such as write-up, call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.
3 Related Work
Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).
In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine if the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences, corresponding to one MWE idiomatic expression.
Hashimoto and Kawahara (2008) (HK08) is the first large-scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with 90 idiom types only, for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.
4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying a MWE expression token in context. We specifically focus on VNC MWE expressions. We use the annotated data by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.[1] YamCha is based on Support Vector Machines technology using degree 2 polynomial kernels.
We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (Beginning of a Literal chunk), I-L (Inside of a Literal chunk), B-I (Beginning of an Idiomatic chunk), I-I (Inside an Idiomatic chunk), and O (Outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, lemma form (LEMMA), or the citation form of the word, and named entity (NE) information. The latter feature is shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
With the NE feature, we followed the same representation as the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER) we use the BBN IdentiFinder software, which identifies 19 NE tags.[2] We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other where we have a binary feature indicating whether a word is a NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class, Named Entity InText (NEI). For example, for the NEI condition, our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER".

[1] http://www.tado-chasen.com/yamcha
[2] http://www.bbn.com/identifinder
5 Experiments and Results
5.1 Data
We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC).[3] The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the Speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations[4] corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.
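The 5-fold setup with an 80/10/10 split can be sketched as follows. The paper states the ratios and the averaging over folds; the exact rotation scheme below is our own assumption.

```python
import random

def five_fold_splits(sentences, seed=0):
    """Yield five (train, dev, test) splits with an 80/10/10 ratio.
    Each of the 5 folds is held out once and divided into dev and test;
    the remaining 4 folds form the training set."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)
    fold_size = len(data) // 5
    folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(5)]
    for k in range(5):
        held_out = folds[k]
        dev = held_out[:len(held_out) // 2]    # 10% development
        test = held_out[len(held_out) // 2:]   # 10% test
        train = [s for j in range(5) if j != k for s in folds[j]]  # 80% train
        yield train, dev, test
```

Reported scores would then be averaged over the five test portions.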
5.2 Evaluation Metrics
We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results.[5] We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
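A minimal sketch of the metric (function name ours); with β = 1 this is the plain harmonic mean of precision and recall. Plugging in the idiomatic precision and recall of the best condition in Table 2 (91.35 and 88.42) recovers its reported F of 89.86.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta score; beta = 1 gives the harmonic mean of P and R."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# IDM scores of the NEI+L+N+P condition (Table 2):
print(round(100 * f_beta(0.9135, 0.8842), 2))  # -> 89.86
```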
5.3 Results
We present the results for the different feature sets and their combinations. We also present results on a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The baseline is basically tagging all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.[6]

[3] http://www.natcorp.ox.ac.uk
[4] A sentence can have more than one MWE expression, hence the number of annotations exceeds the number of sentences.
[5] We do not think that accuracy should be reported in general, since it is an inflated result, as it is not a measure of error. All words identified as O factor into the accuracy, which results in exaggerated values for accuracy. We report it only since it is the metric used by previous work.
In Table 2, we present the results yielded per feature and per condition. We experimented with different context sizes initially to decide on the optimal window size for our learning framework; results are presented in Table 1. Once that was determined, we proceeded to add features.

Noting that a window size of ±3 yields the best results, we proceed to use that as our context size for the following experimental conditions. We will not include accuracy, since it is above 96% for all our experimental conditions.

All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04%.

Adding lemma information, POS, or NGRAM features all independently contribute to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.

Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.

Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%: a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.
6 Discussion
The overall results strongly suggest that using linguistically interesting features explicitly has a positive impact on performance. NE features help the most, and combining them with other features yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has a lot more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially with performance on literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

[6] We could have easily identified all VNC syntactic configurations corresponding to verb-object as a potential MWE VNC, assuming that they are literal by default. This would have boosted our literal score baseline; however, for this investigation we decided to strictly work with the gold standard data set exclusively.
We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones (23 times) and the opposite (4 times).

Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal are for the MWE types hit the road, blow trumpet, blow whistle, hit a wall.

The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.

Also, the system identified hit the post as a literal MWE VNC token in As the ball hit the post, the referee blew the whistle, where blew the whistle is a literal VNC in this context; it identified hit the post as another literal VNC.
7 Conclusion
In this study, we explore a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system. Using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance with an overall F-measure of 84.58%. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given the fact that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.
        IDM-F   LIT-F   Overall F   Overall Acc
±1      77.93   48.57   71.78       96.22
±2      85.38   55.61   79.71       97.06
±3      86.99   55.68   81.25       96.93
±4      86.22   55.81   80.75       97.06
±5      83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying the context window size.
                 IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ             70.02   89.16   78.44    0       0       0      69.68
TOK              81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA          83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM          83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS            83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P            86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full         85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin          84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI              89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P   89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P    90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P        91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations.
8 Acknowledgement
The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, who helped in making this publication more concise and better presented.
References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLING.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

Dan I. Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23–30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de
Abstract
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.
1 Introduction
Parallel corpora have proved to be a valuable resource not only for statistical machine translation, but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:
(1) Der  Rat      sollte  unsere  Position  berücksichtigen
    The  Council  should  our     position  consider

(2) The Council should take account of our position
This sentence pair has been taken from the German–English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and are able to identify various MWE patterns.
In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.
The crosslingual MWE identification wepresent in this paper is a priori independentof any specific association measure or syntacticpattern The translation scenario allows us toadopt a completely data-driven definition of whatconstitutes an MWE Given a parallel corpuswe propose to consider those tokens in a targetlanguage as MWEs which correspond to a singlelexical item in the source language The intuitionis that if a group of lexical items in one lan-guage can be realized as a single item in anotherlanguage it can be considered as some kind oflexically fixed entity By this means we willnot approach the MWE identification problemby asking for a given list of candidates whetherthese are MWEs or not Instead we will ask fora given list of lexical items in a source languagewhether there exists a one-to-many translation forthis item in a target language (and whether these
one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem. As the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches that need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.
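The data-driven definition above — target tokens that align to a single source item — can be sketched in a few lines: group alignment links by source token and keep the one-to-many cases as MWE candidates. This is a toy illustration over a hypothetical alignment-pair format, not the authors' actual pipeline.

```python
from collections import defaultdict

def one_to_many_candidates(alignments, min_targets=2):
    """Group alignment links by source token; keep one-to-many cases.

    alignments: iterable of (source_token, target_token) link pairs for a
    sentence pair (toy format assumed here, not raw GIZA++ output).
    """
    links = defaultdict(set)
    for src, tgt in alignments:
        links[src].add(tgt)
    return {src: tgts for src, tgts in links.items() if len(tgts) >= min_targets}

# The German simplex verb aligns to the English MWE "take account of":
links = [("beruecksichtigen", "take"), ("beruecksichtigen", "account"),
         ("beruecksichtigen", "of"), ("wir", "we")]
print(one_to_many_candidates(links))  # {'beruecksichtigen': {'take', 'account', 'of'}}
```

In practice the candidate sets would of course be noisy, which is exactly the problem the syntactic filtering in Section 3 addresses.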
In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, but is based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).
Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Therefore, Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.
2 Multiword Translations as MWEs
The idea to exploit one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover linguistic patterns exhibited by one-to-many translations.

Verb                1-1    1-n    n-1    n-n    No
anheben (v1)        53.5   21.2   9.2    1.6    32.5
bezwecken (v2)      16.7   51.3   0.6    31.3   15.0
riskieren (v3)      46.7   35.7   0.5    1.7    18.2
verschlimmern (v4)  30.2   21.5   28.6   44.5   27.5

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas and their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).
For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.
A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as an MWE. Below, we shortly illustrate the different classes of one-to-many translations we found in our gold standard.
Morphological variations: This type of one-to-many translation is mainly due to non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.
(3) Sie verschlimmern die Übel. (gloss: 'They aggravate the misfortunes.')
(4) Their action is aggravating the misfortunes.
Verb particle combinations: A typical MWE pattern, treated for instance in Baldwin and Villavicencio (2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.
(5) Der Ausschuss bezweckt, den Institutionen ein politisches Instrument an die Hand zu geben. (gloss: 'The committee intends to give the institutions a political instrument at the hand.')
(6) The committee set out to equip the institutions with a political instrument.
Verb preposition combinations: While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items; Sag et al. (2002) propose an analysis within an MWE framework.
(7) Sie werden den Treibhauseffekt verschlimmern. (gloss: 'They will aggravate the greenhouse effect.')
(8) They will add to the greenhouse effect.
Light verb constructions (LVCs): This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified adjectives or nouns, as in the example below. However, an LVC can be considered an idiosyncratic combination, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).
(9) Ich werde die Sache nur noch verschlimmern. (gloss: 'I will only aggravate the thing.')
(10) I am just making things more difficult.
Idioms: This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to its limited size and the source items we chose.
(11) Sie bezwecken die Umgestaltung in eine zivile Nation. (gloss: 'They intend the conversion into a civil nation.')
(12) They have in mind the conversion into a civil nation.
        v1       v2       v3       v4
Ntype   22 (26)  41 (47)  26 (35)  17 (24)
V Part  22.7     4.9      0.0      0.0
V Prep  36.4     41.5     3.9      5.9
LVC     18.2     29.3     88.5     88.2
Idiom   0.0      2.4      0.0      0.0
Para    36.4     24.3     11.5     23.5

Table 2: Proportions of MWE types per lemma.
Paraphrases: From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.
(13) Wir brauchen bessere Zusammenarbeit, um die Rückzahlungen anzuheben. (gloss: 'We need better cooperation to increase the repayments (OBJ).')
(14) We need greater cooperation in this respect to ensure that repayments increase.
Table 2 displays the proportions of the MWE categories over the number of types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variation only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g., they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) or bezwecken (en. intend) are frequently
translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs, which do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).
In general, one-to-many translational correspondences seem to provide a very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.
3 Multiword Translation Detection
This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.
3.1 One-to-many Alignments
To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate, at the type level, the translational correspondences the system discovers against the one-to-many translations in our gold standard. By type, we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the
verb   n > 0         n > 1         n > 3
       FPR    FNR    FPR    FNR    FPR    FNR
v1     0.97   0.93   1.0    1.0    1.0    1.0
v2     0.93   0.9    0.5    0.96   0.0    0.98
v3     0.88   0.83   0.8    0.97   0.67   0.97
v4     0.98   0.92   0.8    0.92   0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.
translation types is so low that already at a threshold of 3, almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that corresponds to the source token is rare.
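The type-level comparison behind Table 3 can be read as set operations over translation types, each type being the set of lemmatized target tokens. The following is a minimal sketch of one plausible reading of FPR/FNR over types, with toy data; it is not the paper's evaluation code.

```python
def type_level_rates(system_types, gold_types):
    """False positive rate and false negative rate over translation types.

    Each type is a frozenset of lemmatized target tokens, following the
    bag-of-words comparison described in the paper (toy re-implementation).
    """
    system_types, gold_types = set(system_types), set(gold_types)
    fp = len(system_types - gold_types)   # system types absent from gold
    fn = len(gold_types - system_types)   # gold types the system missed
    fpr = fp / len(system_types) if system_types else 0.0
    fnr = fn / len(gold_types) if gold_types else 0.0
    return fpr, fnr

gold = {frozenset({"take", "account", "of"}), frozenset({"consider"})}
system = {frozenset({"take", "account", "of"}), frozenset({"take", "of"})}
print(type_level_rates(system, gold))  # (0.5, 0.5)
```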
This low precision of one-to-many alignments is not very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g., particles, light verbs). These semantically "light" items have a distribution that does not necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.
(15) Die Korruption behindert die Entwicklung. (gloss: 'The corruption impedes the development.')
(16) Corruption constitutes an obstacle to development.
Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.
3.2 Aligning Syntactic Configurations
High-precision detection of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found; 2) how to align target items that have a low occurrence correlation.
We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.
Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.
Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering of the configurations; c) alignment post-editing, i.e., assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.
[Figure 1: Example of a typical syntactic MWE configuration — the German verb behindert with arguments X1, Y1, aligned to an English dependency tree in which create dominates an obstacle to and the arguments X2, Y2.]
Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (D_G, D_E, A_GE): D_G is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language; D_E is the set of dependency triples of the target sentence; A_GE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
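The triple (D_G, D_E, A_GE) can be modeled directly as sets of tuples. Below is a toy encoding of the sentence pair in (15)/(16); the German relation labels SB/OA follow the candidate configuration given in the paper, while the English relation labels are assumptions for illustration.

```python
# Toy encoding of one word-aligned, dependency-parsed sentence pair,
# following the (DG, DE, AGE) triple of the paper (English labels invented).
DG = {("behindert", "SB", "Korruption"), ("behindert", "OA", "Entwicklung")}
DE = {("constitutes", "SBJ", "corruption"), ("constitutes", "OBJ", "obstacle"),
      ("obstacle", "NMOD", "an"), ("obstacle", "NMOD", "to"),
      ("to", "PMOD", "development")}
AGE = {("Korruption", "corruption"), ("Entwicklung", "development")}

def dependents(dep_set, head):
    """All (relation, dependent) pairs for a given head word."""
    return {(rel, dep) for h, rel, dep in dep_set if h == head}

print(dependents(DG, "behindert"))  # {('SB', 'Korruption'), ('OA', 'Entwicklung')}
```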
Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g., its arguments). An example input would be the configuration ((verb, SB, *), (verb, OA, *)) for our German verbs.
Filtering: Given our source candidates, a valid parallel configuration (D_G, D_E, A_GE) is then defined by the following conditions:
1. The source configuration D_G is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn ∈ D_G there is a tuple (sn, tn) ∈ A_GE, i.e., every dependent has an alignment.
3. There is a target item t1 ∈ D_E such that for each tn there is a p ⊆ D_E such that p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus, the target dependents have a common root.
To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering does not exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
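The common-root condition with a bounded path length can be sketched as follows. This is a simplified reading of the filter that ignores relation labels and assumes each token has at most one head; it is an illustration, not the authors' implementation.

```python
def has_common_root(DE, targets, max_len=3):
    """Check whether some target item reaches every aligned target
    dependent via a head chain of bounded length (simplified filter)."""
    heads = {dep: head for head, rel, dep in DE}  # dependent -> head

    def ancestors(t):
        chain = [t]
        while chain[-1] in heads and len(chain) <= max_len:
            chain.append(heads[chain[-1]])
        return set(chain)

    # A common root exists iff the ancestor sets of all targets intersect.
    common = set.intersection(*(ancestors(t) for t in targets))
    return bool(common)

# Toy target dependency set for "constitutes an obstacle to development":
DE = {("constitutes", "SBJ", "corruption"), ("constitutes", "OBJ", "obstacle"),
      ("obstacle", "NMOD", "an"), ("obstacle", "NMOD", "to"),
      ("to", "PMOD", "development")}
print(has_common_root(DE, {"corruption", "development"}))  # True
```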
Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in Section 3.2 was that all
the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.
Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) ∈ D_E such that tx is an element of our target configuration, we need to decide whether (s1, ti) ∈ A_GE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti ∈ T is an element of our target configuration:
1. Compute corr(s1, T), the correlation between s1 and T.
2. For each ti, tx such that there is a (ti, rel, tx) ∈ D_E, compute corr(s1, T + tx).
3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.
As the Dice coefficient is often reported to give the best results, e.g., in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:
corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))
We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti ∈ T, and freq(s1) accordingly. The observed joint frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
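Putting the Dice score and the post-editing loop together, a minimal sketch over a toy corpus of lemmatized sentence pairs might look like this; the corpus format and candidate list are invented for illustration and the real system works over the filtered configurations described above.

```python
def dice(s1, T, pairs):
    """Dice correlation between source item s1 and target item set T,
    computed over a corpus of (source_tokens, target_tokens) pairs."""
    f_s = sum(1 for src, _ in pairs if s1 in src)
    f_T = sum(1 for _, tgt in pairs if T <= tgt)
    f_joint = sum(1 for src, tgt in pairs if s1 in src and T <= tgt)
    return 2 * f_joint / (f_s + f_T) if (f_s + f_T) else 0.0

def post_edit(s1, T, candidates, pairs):
    """Greedily add candidate dependents tx whose inclusion does not
    decrease the Dice score (sketch of the paper's algorithm)."""
    T = set(T)
    for tx in candidates:
        if dice(s1, T | {tx}, pairs) >= dice(s1, T, pairs):
            T.add(tx)
    return T

# Toy lemmatized corpus (invented): the article "an" always co-occurs with
# the core LVC items, the noun "development" does not.
pairs = [
    ({"korruption", "behindert"},
     {"corruption", "constitutes", "an", "obstacle", "to", "development"}),
    ({"das", "behindert"},
     {"this", "constitutes", "an", "obstacle", "to", "progress"}),
    ({"wir", "gehen"}, {"we", "go"}),
]
T = post_edit("behindert", {"constitutes", "obstacle", "to"},
              ["an", "development"], pairs)
print(T)  # {'constitutes', 'obstacle', 'to', 'an'}
```

The article "an" survives because adding it leaves the Dice score unchanged, while "development" lowers it and is rejected.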
The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD, *), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ, *)), which is the syntactic representation for the English MWE run the risk of.
3.3 Evaluation
Our translational approach to MWE extraction bears the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.
We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns the system detects, we further evaluate the translational correspondences for a larger set of verbs.
Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma do not figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that does not constrain the syntactic type of the parallel argument relations. While the FNR barely changes, the strict filter decreases the FPR significantly. This means that the syntactic filters mainly fire on noisy configurations and do not decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down
the number of candidate configurations that our algorithm detects. As the dependency parses do not provide deep analyses for tense or control phenomena, very often a verb's arguments do not figure as its syntactic dependents, and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.
MWE evaluation: In a second experiment, we evaluated the patterns of correspondence found by our extraction method for use in an MWE context. To this end, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondence. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.
The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.
4 Related Work
The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation, which specifies formal relations between bilingual parses, was
     Strict Filter    Relaxed Filter
     FPR    FNR       FPR    FNR
v1   0.0    0.96      0.5    0.96
v2   0.25   0.88      0.47   0.79
v3   0.25   0.74      0.56   0.63
v4   0.0    0.875     0.56   0.84

Table 4: False positive and false negative rate of one-to-many translations.
Trans. type    MWE type   Proportion
MWEs                      57.5
               V Part     8.2
               V Prep     51.8
               LVC        32.4
               Idiom      10.6
Paraphrases               24.4
Alternations              1.0
Noise                     17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.
established by (Wu, 1997). Our way to use syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.
Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches do not rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.
By contrast, previous approaches to paraphrase extraction have made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Padó, 2009).
The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already
been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of simplex realization in a source language to an idiosyncratic multiword realization in the target language.
5 Conclusion
We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for the monolingual interpretation of MWEs.
Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment, and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the extension to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.
References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 13:311–326.

Katrin Erk and Sebastian Padó. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proceedings of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38/2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 Workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
A Re-examination of Lexical Association Measures
Abstract
We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English performed on the Penn Treebank, which shows significant improvement over the original AMs.
1 Introduction
Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood
ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.
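As a concrete illustration, two of the AMs named above, PMI and Dice, can be computed for a bigram (x, y) directly from corpus counts. This is a generic sketch with invented toy counts, not figures from the paper.

```python
import math

def pmi(f_xy, f_x, f_y, N):
    """Pointwise Mutual Information for a bigram (x, y) from corpus counts:
    PMI = log2( P(x, y) / (P(x) * P(y)) ), with probabilities estimated
    by relative frequency over N bigram tokens."""
    return math.log2((f_xy / N) / ((f_x / N) * (f_y / N)))

def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)

# Toy counts (hypothetical): "put on" occurs 30 times among 10,000 bigram
# tokens, "put" 50 times as first word, "on" 200 times as second word.
print(round(pmi(30, 50, 200, 10000), 3))  # 4.907
print(round(dice(30, 50, 200), 3))        # 0.24
```

Since both measures rank candidates rather than classify them, two AMs that are monotone transformations of each other produce identical rankings, which is the notion of rank equivalence the paper builds on.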
While these previous works have evaluated AMs, they offer few details on why the AMs perform as they do. Such an analysis is needed to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous work motivated us to address the issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict, a priori, what types of AMs are likely to perform well for a particular MWE class.
We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the
Hung Huu Hoang Dept of Computer Science
National University of Singapore
hoanghuu@comp.nus.edu.sg
Su Nam Kim Dept of Computer Science and Software Engineering University of Melbourne
snkim@csse.unimelb.edu.au
Min-Yen Kan Dept of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
mobility of noun phrase complements in transitive VPCs: they can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).
In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.
2 Classification of Association Measures
Although different AMs take different approaches to measuring association, we observe that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure the association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the t-score, the z-score, and Pearson's χ² test.
Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then follows that the contexts in which the phrase and its constituents appear would differ (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents also differ from each other. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics commonly used to compute context similarity include cosine and Dice similarity; distance metrics such as the Euclidean and Manhattan norms; and probability distribution measures such as the Kullback-Leibler divergence and the Jensen-Shannon divergence.
Table 1 lists all AMs used in our discussion. The lower left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
3 Evaluation
We first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we discuss how the performance of AMs is measured in our experiments.
3.1 Evaluation Data
In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.
For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid the possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
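The left-search procedure just described can be sketched as follows (our illustration, not the authors' code; PARTICLES here is only a small subset of the 38-particle inventory in Appendix A, and tokens are assumed to carry Penn Treebank POS tags):

```python
PARTICLES = {"up", "out", "off", "on", "down", "in", "away", "back"}  # subset of Appendix A

def vpc_candidates(tagged, max_np=5):
    """For each particle occurrence, search left for the nearest verb,
    allowing an intervening NP of up to `max_np` tokens (transitive VPCs).
    `tagged` is a list of (word, Penn-Treebank-POS) pairs."""
    found = []
    for i, (word, _) in enumerate(tagged):
        if word.lower() not in PARTICLES:
            continue
        # scan leftward: verb may be up to max_np + 1 positions away
        for j in range(i - 1, max(i - max_np - 2, -1), -1):
            if tagged[j][1].startswith("VB"):  # nearest verb to the left
                found.append((tagged[j][0].lower(), word.lower()))
                break
    return found
```

For instance, "He took his hat off" yields the candidate (took, off); candidates are filtered against the gold standard later, so over-generation here is expected.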
Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located, based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,
AM and formula:

M1  Joint Probability:        f(xy) / N
M2  Mutual Information (MI):  (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3  Log-likelihood ratio:     2 Σ_ij f_ij log(f_ij / f̂_ij)
M4  Pointwise MI (PMI):       log [P(xy) / (P(x*) P(*y))]
M5  Local-PMI:                f(xy) × PMI
M6  PMI^k:                    log [N f(xy)^k / (f(x*) f(*y))]
M7  PMI²:                     log [N f(xy)² / (f(x*) f(*y))]
M8  Mutual Dependency:        log [P(xy)² / (P(x*) P(*y))]
M9  Driver-Kroeber:           a / √((a + b)(a + c))
M10 Normalized expectation:   2a / (2a + b + c)
M11 Jaccard:                  a / (a + b + c)
M12 First Kulczynski:         a / (b + c)
M13 Second Sokal-Sneath:      a / (a + 2(b + c))
M14 Third Sokal-Sneath:       (a + d) / (b + c)
M15 Sokal-Michiner:           (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto:          (a + d) / (a + 2b + 2c + d)
M17 Hamann:                   ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds ratio:               ad / bc
M19 Yule's ω:                 (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q:                 (ad − bc) / (ad + bc)
M21 Brawn-Blanquet:           a / max(a + b, a + c)
M22 Simpson:                  a / min(a + b, a + c)
M23 S cost:                   log(1 + min(b, c) / (a + 1))^(−1/2)
M24 Adjusted S cost*:         log(1 + max(b, c) / (a + 1))^(−1/2)
M25 Laplace:                  (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace*:        (a + 1) / (a + max(b, c) + 2)
M27 Fager:                    [M9] − (1/2) max(b, c)
M28 Adjusted Fager*:          [M9] − max(b, c) / √(aN)
M29 Normalized PMIs*:         PMI / NF(α); PMI / NFmax
M30 Simplified normalized PMI for VPCs*:  log(ad) / (α×b + (1−α)×c)
M31 Normalized MIs*:          MI / NF(α); MI / NFmax

where NF(α) = α P(x*) + (1−α) P(*y), α ∈ [0, 1], and NFmax = max(P(x*), P(*y)).

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work. The contingency table of a bigram (x, y) records its co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(x¬y), c = f21 = f(¬xy), d = f22 = f(¬x¬y), with marginals f(x*) = a + b and f(*y) = a + c. ¬w stands for all words except w, * stands for all words, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.
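To make the legend concrete, here is a small sketch (our illustration, not code from the paper; the function name and dictionary keys are our own) that computes a handful of the Class I measures in Table 1 from the four contingency counts:

```python
import math

def ams(a, b, c, d):
    """Selected Table 1 measures computed from the contingency counts
    of a bigram (x, y): a = f(xy), b = f(x,~y), c = f(~x,y), d = f(~x,~y)."""
    N = a + b + c + d
    fx, fy = a + b, a + c  # marginal frequencies f(x*), f(*y)
    return {
        "joint_prob": a / N,                           # M1
        "pmi": math.log(N * a / (fx * fy)),            # M4
        "jaccard": a / (a + b + c),                    # M11
        "odds_ratio": (a * d) / (b * c),               # M18
        "yules_q": (a * d - b * c) / (a * d + b * c),  # M20
        "brawn_blanquet": a / max(a + b, a + c),       # M21
        "simpson": a / min(a + b, a + c),              # M22
    }
```

For example, `ams(30, 10, 60, 900)` describes a bigram observed 30 times out of N = 1,000 bigrams, with marginals f(x*) = 40 and f(*y) = 90.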
permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, again allowing a maximum of 4 intervening words. The above extraction process produced a total of 8,652 VPC and 11,465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies of less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1,629 VPCs and 1,254 LVCs.
Separately, we use the following two available sources of annotations: 3,078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al. (2006). Both sets of annotations give both positive and negative examples.
Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data      LVC data   Mixed
Total (freq ≥ 6)     413           100        513
Positive instances   117 (28.33%)  28 (28%)   145 (23.26%)

Table 2: Evaluation data sizes (type count, not token). While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.
3.2 Evaluation Metric
To evaluate the performance of AMs, we could use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. One way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
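Average precision over a ranked candidate list can be sketched as follows (our illustration; `ranked_labels` is assumed to hold gold-standard MWE judgments sorted by descending AM score):

```python
def average_precision(ranked_labels):
    """Average precision over a ranked candidate list: the mean of the
    precision values at each rank where a true MWE appears."""
    hits, precisions = 0, []
    for rank, is_mwe in enumerate(ranked_labels, start=1):
        if is_mwe:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Unlike n-best precision, this summary needs neither a score threshold nor a choice of n.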
3.3 Evaluation Results
Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that hypothesis tests rely on.
4 Rank Equivalence
We note that some AMs, although not mathematically equivalent (i.e., assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡_C^r M2, if and only if M1(cj) > M1(ck) ⟺ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⟺ M2(cj) = M2(ck) for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs.
Corollary: If M1 ≡_C^r M2, then AP_C(M1) = AP_C(M2), where AP_C(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the three AMs Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. The five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with the simplest representative from each rank equivalent group.
¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence, we still consider such AM pairs to be rank equivalent.
1) [M2] Mutual Information; [M3] Log-likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (aka Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (aka Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
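Rank equivalence can be checked empirically with a small sketch (our illustration; note that Yule's Q is the monotone transform (OR − 1)/(OR + 1) of the odds ratio, so the two always order candidates identically):

```python
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)  # [M18]

def yules_q(a, b, c, d):
    ad, bc = a * d, b * c
    return (ad - bc) / (ad + bc)  # [M20]

def rank_equivalent(m1, m2, candidates, eps=1e-12):
    """True if m1 and m2 order every pair of candidates identically."""
    s1 = [m1(*cand) for cand in candidates]
    s2 = [m2(*cand) for cand in candidates]
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            d1, d2 = s1[i] - s1[j], s2[i] - s2[j]
            # the comparison outcome (<, =, >) must agree for both AMs
            if (d1 > eps) != (d2 > eps) or (d1 < -eps) != (d2 < -eps):
                return False
    return True
```

By contrast, raw co-occurrence frequency and the odds ratio are easily shown not to be rank equivalent on the same candidates.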
5 Examination of Association Measures
We highlight two important findings from our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI; Section 5.2 discusses penalization terms.
5.1 Mutual Information and Pointwise Mutual Information
In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in
[Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC (a) and LVC (b) datasets. Bold points indicate AMs discussed in this paper. ♦ hypothesis-test AMs; ◊ Class I AMs excluding hypothesis-test AMs; + context-based AMs.]
identifying LVCs. In this section, we show how their performance can be improved significantly.
Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U; V) = Σ_{u,v} P(uv) log [P(uv) / (P(u*) P(*v))]

In the context of bigrams, the above formula can be simplified to:

[M2] MI = (1/N) Σ_ij f_ij log(f_ij / f̂_ij)

While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log [P(xy) / (P(x*) P(*y))] = log [N f(xy) / (f(x*) f(*y))]

It has long
been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of the two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI by f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying PMI directly by f(xy) reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                VPCs           LVCs           Mixed
Best [M6] PMI^k   54.7 (k=0.13)  57.3 (k=0.85)  54.4 (k=0.32)
[M4] PMI          51.0           56.6           51.5
[M5] Local-PMI    25.9           39.3           27.2
[M1] Joint Prob.  17.0           2.8            17.5

Table 4: AP performance (%) of PMI and its variants. Best k settings shown in parentheses.
Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X; Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary; consequently, MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α P(x*) + (1 − α) P(*y), with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks, respectively. If one rewrites MI as (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavier dependence of MI on direct frequencies compared with PMI; this explains why normalization is a pressing need for MI.
AM           VPCs          LVCs          Mixed
MI / NF(α)   50.8 (α=0.48) 58.3 (α=0.47) 51.6 (α=0.5)
MI / NFmax   50.8          58.4          51.8
[M2] MI      27.3          43.5          28.9
PMI / NF(α)  59.2 (α=0.8)  55.4 (α=0.48) 58.8 (α=0.77)
PMI / NFmax  56.5          51.7          55.6
[M4] PMI     51.0          56.6          51.5

Table 5: AP performance (%) of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
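The normalization just described can be sketched as follows (our illustration; the function signature is our own, with NFmax used when no α is supplied):

```python
import math

def normalized_pmi(fxy, fx, fy, N, alpha=None):
    """PMI divided by NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y),
    or by NFmax = max(P(x*), P(*y)) when alpha is None."""
    pmi = math.log(N * fxy / (fx * fy))
    px, py = fx / N, fy / N  # marginal probabilities P(x*), P(*y)
    nf = max(px, py) if alpha is None else alpha * px + (1 - alpha) * py
    return pmi / nf
```

Only the induced ranking matters for AP, so the absolute magnitude of the normalized scores is irrelevant; what changes is how candidates with very different marginal frequencies compare to one another.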
5.2 Penalization Terms
It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood that a phrase is an MWE. This motivates us to use marginal frequencies to synthesize penalization terms: formulas whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.
Take as an example the AMs [M21] Brawn-Blanquet (aka Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except
for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S cost and [M25] Laplace, where we replaced the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.
AM                     VPCs  LVCs  Mixed
[M21] Brawn-Blanquet   47.8  57.8  48.6
[M22] Simpson          24.9  38.2  26.0
[M24] Adjusted S cost  48.5  57.7  49.2
[M23] S cost           24.9  38.8  26.0
[M26] Adjusted Laplace 48.6  57.7  49.3
[M25] Laplace          24.1  38.8  25.4

Table 6: Replacing min() with max() in selected AMs (AP, %).
In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is rank equivalent to [M7] PMI². In our application this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ with a scaled-down weight. Using the independence assumption, we approximately derived √(aN) as a lower-bound estimate of max(b, c), producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                    VPCs  LVCs  Mixed
[M28] Adjusted Fager  56.4  54.3  55.4
[M27] Fager           55.2  43.9  52.5

Table 7: AP performance (%) of Fager and its adjusted version.
The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles, so it is necessary that we scale b and c properly, as in −α×b − (1−α)×c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with

[MR14] = PMI / (α×b + (1−α)×c)

The denominator of [MR14] is obtained by removing the minus signs from the scaled [M14], so that it can be used as a penalization term. The choice of PMI for the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α P(x*) + (1−α) P(*y), which was successfully used to divide PMI in the normalized-PMI experiment. We heuristically simplified [MR14] to the following AM:

[M30] = log(ad) / (α×b + (1−α)×c)

The settings of α in Table 8 below are taken from the best α settings obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.
AM                            VPCs          LVCs          Mixed
PMI / NF(α)                   59.2 (α=0.8)  55.4 (α=0.48) 58.8 (α=0.77)
[M30] log(ad)/(α×b + (1−α)×c) 60.0 (α=0.8)  48.4 (α=0.48) 58.8 (α=0.77)
[M14] Third Sokal-Sneath      56.5          45.3          54.6

Table 8: AP performance (%) of suggested VPC penalization terms and AMs. Best α settings shown in parentheses.
With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α × c^(1−α). The best α value is again taken from the normalized-PMI experiments (Table 5), and is nearly 0.5. Under this setting, this penalization term is exactly the denominator of the [M18] Odds ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.
AM                VPCs  LVCs  Mixed
−b − c            56.5  45.3  54.6
1/(bc)            50.2  53.2  50.2
[M18] Odds ratio  44.3  56.7  45.6

Table 9: AP performance (%) of suggested LVC penalization terms and AMs.
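The two penalization-term AMs can be sketched as follows (our illustration; `lvc_score` is our own name for an odds-ratio-style score with the weighted product denominator b^α × c^(1−α)):

```python
import math

def m30(a, b, c, d, alpha):
    """[M30]: log(ad) over the linear penalization term
    alpha*b + (1-alpha)*c, suggested above for VPCs."""
    return math.log(a * d) / (alpha * b + (1 - alpha) * c)

def lvc_score(a, b, c, d, alpha):
    """Odds-ratio-style score with the weighted product penalization
    term b**alpha * c**(1-alpha), suggested above for LVCs."""
    return (a * d) / (b ** alpha * c ** (1 - alpha))
```

At α = 0.5 the denominator of `lvc_score` is √(bc), which is rank equivalent to the bc denominator of the odds ratio; α shifts the penalty toward the verb's or the complement's marginal frequency.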
6 Conclusions
We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II). We find that the latter are not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.
We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8%, compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.
We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α × c^(1−α) is suitable for LVCs. The introduced tuning parameter α, which properly scales the values b and c, should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.
Acknowledgement: This work was partially supported by a National Research Foundation grant, "Interactive Media Search" (grant R 252 000 325 279).
References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco, and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, M. Yen, and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a Multilingual Context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.
Appendix A. List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R Mahesh K Sinha
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in
Abstract
A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multi-word expression with a meaning that is distinct from the meaning of its light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.
1 Introduction
Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family, and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multi-word expression (MWE) in which a noun, a verb, or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.
CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007), and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for the automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006); it uses the Giza++ word alignment tool to align the projected POS information. A precision of 83% and a recall of 46% have been reported.
In this paper, we present a simple strategy for mining CPs in Hindi using the projection of the light verb's meaning in a parallel corpus. In the following section, the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation, and results.
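The core decision rule of this strategy can be sketched as follows (an illustration only; the meaning lexicon below is an assumed toy stand-in for whatever English glosses of the light verbs are available):

```python
# Conventional English meanings (inflected forms) of two Hindi light
# verbs -- an assumed toy lexicon, for illustration only.
LIGHT_VERB_MEANINGS = {
    "denaa": {"give", "gives", "gave", "given", "giving"},
    "lenaa": {"take", "takes", "took", "taken", "taking"},
}

def hypothesize_cp(light_verb_lemma, english_tokens):
    """Hypothesize a CP when none of the light verb's conventional
    English meanings appears in the aligned English sentence."""
    meanings = LIGHT_VERB_MEANINGS[light_verb_lemma]
    return not any(tok.lower() in meanings for tok in english_tokens)
```

For a sentence with light verb denaa aligned to "he blessed me", no form of "give" appears, so a CP is hypothesized; aligned to "he gave me a book", the conventional meaning is present and no CP is flagged.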
2 Complex Predicates in Hindi
A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective, or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV, or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.
Given below are some examples.

(1) CP = noun+LV; noun = ashirwad 'blessings', LV = denaa 'to give'
    usane mujhe ashirwad diyaa
    उसने मुझे आशीर्वाद दिया
    he me blessings gave
    'he blessed me'

(2) No CP
    usane mujhe ek pustak dii
    उसने मुझे एक पुस्तक दी
    he me one book gave
    'he gave me a book'

In (1) the light verb diyaa (gave), in its past tense form, combines with the noun ashirwad (blessings) to make the complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa and its corresponding English translation is 'to bless'. On the other hand, in example (2) the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb depends upon the semantics of the preceding noun. However, it is observed that in the case of a complex predicate the English meaning is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush 'happy', LV = karanaa 'to do'
    usane mujhe khush kiyaa
    उसने मुझे खुश किया
    he me happy did
    'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do) and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa 'to read', LV = lenaa 'to take'
    usane pustak paRh liyaa
    उसने पुस्तक पढ़ लिया
    he book read took
    'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take) and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual modal or as an intensifier.

(5) CP = verb+LV; verb = phaadanaa 'to tear', LV = denaa 'to give'
    usane pustak phaad diyaa
    उसने पुस्तक फाड़ दिया
    he book tear gave
    'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give) and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa 'to give', LV = maaranaa 'to hit, to kill'
    usane pustak de maaraa
    उसने पुस्तक दे मारा
    he book give hit
    'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill) and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning; some may also consider this a semi-idiomatic construct.

(7) CP = adverb+LV1+LV2; adverb = vaapas 'back', LV1 = karanaa 'to do', LV2 = denaa 'to give'; or CP = CP+LV, with CP = vaapas karanaa 'to return' and LV = denaa 'to give'
    usane pustak vaapas kar diyaa
    उसने पुस्तक वापस कर दिया
    he book back do gave
    'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do) and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another is to consider this a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretation is the same in both cases. It may be noted that the word vaapas (return) is also a noun, in which case the CP is a noun+LV.
All the above examples make evident the complexity of the task of mining CPs. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.
3 System Design
As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb's meaning in the aligned English sentence. The steps involved are as follows:

1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as simple verbs (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for the Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or a morphological derivative of it is found, then search the corresponding aligned English sentence for any of the morphological forms of its English meanings (Figure 2).
      iii) If a match is found, then exit, i.e. go to step (a); if no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when the word is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e. go to step (a); else the identified CP is W+LV.
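The per-sentence loop above can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' implementation; the resource names (light_verbs, stop_words, exit_words) are hypothetical stand-ins for the lists and morphological tables the paper builds.

```python
# Illustrative sketch of steps (i)-(vi) for one aligned sentence pair.
def mine_cps(hindi_tokens, english_tokens, light_verbs,
             stop_words, exit_words):
    """Return candidate CPs (W + LV) found in one aligned sentence pair.

    light_verbs maps each light-verb surface form to the set of English
    surface forms (all morphological variants) of its literal meaning.
    """
    found = []
    english = set(english_tokens)
    for k, token in enumerate(hindi_tokens):
        if token not in light_verbs:
            continue                      # step (i): locate an LV at position k
        if english & light_verbs[token]:
            continue                      # step (ii)-(iii): literal meaning
                                          # appears in English -> not a CP
        j = k - 1                         # steps (iii)-(v): scan leftwards,
        while j >= 0 and hindi_tokens[j] in stop_words:
            j -= 1                        # skipping stop words
        if j >= 0 and hindi_tokens[j] not in exit_words:
            found.append((hindi_tokens[j], token))   # step (vi): CP = W + LV
    return found

# Toy run with transliterated example (1) from the paper:
lv = {"diyaa": {"give", "gives", "gave", "given", "giving"}}
cps = mine_cps(["usane", "mujhe", "ashirwad", "diyaa"],
               ["he", "blessed", "me"], lv, set(), set())
print(cps)   # [('ashirwad', 'diyaa')] -- 'gave' is absent, so a CP is flagged
```

In example (2), by contrast, the English side contains "gave", so the match in step (ii) fires and no CP is reported.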
Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as simple verbs, is given in Table 1. The light verb kar (do) is the most frequently used one. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' is used in several other contexts in English. Such meanings are shown within parentheses and are not used for matching.
light verb base form : root verb meaning
baithanaa (बैठना) : sit
bananaa (बनना) : make, become, build, construct, manufacture, prepare
banaanaa (बनाना) : make, build, construct, manufacture, prepare
denaa (देना) : give
lenaa (लेना) : take
paanaa (पाना) : obtain, get
uthanaa (उठना) : rise, arise, get up
uthaanaa (उठाना) : raise, lift, wake up
laganaa (लगना) : feel, appear, look, seem
lagaanaa (लगाना) : fix, install, apply
cukanaa (चुकना) : (finish)
cukaanaa (चुकाना) : pay
karanaa (करना) : (do)
honaa (होना) : happen, become, be
aanaa (आना) : come
jaanaa (जाना) : go
khaanaa (खाना) : eat
rakhanaa (रखना) : keep, put
maaranaa (मारना) : kill, beat, hit
daalanaa (डालना) : put
haankanaa (हाँकना) : drive

Table 1: Some of the commonly used light verbs in Hindi
For each Hindi light verb, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations.
A number of words can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.
Figure 1(a): Morphological derivatives of the sample Hindi light verb 'jaanaa' (जाना, 'to go')
Figure 1(b): Morphological derivatives of the sample Hindi light verb 'lenaa' (लेना, 'to take')
Figure 2: Morphological derivatives of sample English meanings
We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the 'exit words' in our implementation; Figure 4 shows a partial list. This list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.
Contents of Figure 2:
English word: sit; morphological derivations: sit, sits, sat, sitting
English word: give; morphological derivations: give, gives, gave, given, giving
...
Contents of Figure 1(a) — LV jaanaa (जाना, 'to go'); morphological derivatives:
jaa (stem); jaae, jaao, jaaeM (imperative); jaauuM (first person); jaane (infinitive oblique); jaanaa (infinitive masculine singular); jaanii (infinitive feminine singular); jaaniiM (infinitive feminine plural); jaataa (indefinite masculine singular); jaatii (indefinite feminine singular); jaate (indefinite masculine plural/oblique); jaatiiM (indefinite feminine plural); jaaoge (future masculine singular); jaaogii (future feminine singular); jaauuMgaa (future masculine first-person singular); jaauuMgii (future feminine first-person singular); jaayegaa (future masculine third-person singular); jaayegii (future feminine third-person singular); jaayeMge (future masculine plural); jaayeMgii (future feminine plural); gayaa (past masculine singular); gayii, gaii (past feminine singular); gaye (past masculine plural/oblique); gaiiM (past feminine plural); jaakara (completive)
...
Contents of Figure 1(b) — LV lenaa (लेना, 'to take'); morphological derivatives:
le (stem); lii (past feminine singular); leM, lo (imperative); luuM (first person); letaa (indefinite masculine singular); letii (indefinite feminine singular); lete (indefinite masculine plural/oblique); letiiM (indefinite feminine plural); liiM (past feminine plural); lene (infinitive oblique); lenaa (infinitive masculine singular); lenii (infinitive feminine singular); liyaa (past masculine singular); legaa (future masculine third-person singular); legii (future feminine third-person singular); leMge (future masculine plural); loge (future masculine singular); luuMgaa (future masculine first-person singular); luuMgii (future feminine first-person singular); lekara (completive)
...
Figure 3: Stop words in Hindi used by the system
Figure 4: A few exit words in Hindi used by the system
The inner loop of the procedure identifies multiple CPs that may be present in a sentence; the outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.
4 Experimentation and Results
The CP mining methodology outlined earlier has been implemented and tested on multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%; the F-measure is 88% to 97%. This is a remarkable and somewhat surprising result for such a simple methodology without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; theirs is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.
                           File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences              112     193     102      43     133     107
Total no. of CPs (N)          200     298     150      46     188     151
Correctly identified
  CPs (TP)                    195     296     149      46     175     135
V-V CPs                        56      63       9       6      15      20
Incorrectly identified
  CPs (FP)                     17      44       7      11      16      20
Unidentified CPs (FN)           5       2       1       0      13      16
Accuracy (%)                97.50   99.33   99.33  100.00   93.08   89.40
Precision (%)
  (TP/(TP+FP))              91.98   87.05   95.51   80.70   91.62   87.09
Recall (%)
  (TP/(TP+FN))              97.50   98.33   99.33  100.00   93.08   89.40
F-measure (%)
  (2PR/(P+R))                94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation
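The precision, recall and F-measure rows follow directly from the raw counts; as a quick check, the values for File 1 (TP = 195, FP = 17, FN = 5) can be recomputed as follows (a small verification sketch, matching the tabulated figures up to rounding):

```python
# Recompute the evaluation measures of Table 2 from raw counts (File 1).
def prf(tp, fp, fn):
    precision = tp / (tp + fp)          # TP / (TP + FP)
    recall = tp / (tp + fn)             # TP / (TP + FN)
    f_measure = 2 * precision * recall / (precision + recall)  # 2PR/(P+R)
    return precision, recall, f_measure

p, r, f = prf(195, 17, 5)
print(f"{100*p:.2f} {100*r:.2f} {100*f:.2f}")   # 91.98 97.50 94.66
```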
Contents of Figure 4 (exit words): ने (ergative case marker), को (accusative case marker), का, के, की (possessive case markers), से (from/by/with), में (in/into), पर (on/but), और (and; Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that; Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा, मेरे, मेरी (my), तुम्हारा, तुम्हारे, तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना, अपनी, अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका, जिसकी, जिसके (whose), जिनको (to whom), जिनके (to whom)
Contents of Figure 3 (stop words): नहीं (no/not), न (no/not; Hindi particle), भी (also; Hindi particle), ही (only; Hindi particle), तो (then; Hindi particle), क्यों (why), क्या (what; Hindi particle), कहाँ (where; Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अन्त में (in the end), आखिर में (in the end)
Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. खुशी होना (to feel happy, good)
v. मदद करना (to help)
Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।
The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. कद्र करना (to respect)
Here also the two CPs identified are correct.

Obviously, this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not checked. However, the mining success rate obtained shows that in practice these cases are few in number. Using the 'stop words' to allow intervening words within the CPs helps a lot in improving the performance; similarly, using the 'exit words' avoids a lot of incorrect identifications.
5 Conclusions
The simple empirical method for mining CPs outlined in this work yields on average 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. It can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this would increase both the alignment accuracy and the speed.
The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family
References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek: A Quarterly in Artificial Intelligence, 13(3):23-32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11-18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27-46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.
Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323-370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553-564.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 47-54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren¹, Yajuan Lü¹, Jie Cao¹, Qun Liu¹, Yun Huang²
¹Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
²Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn, huangyun@comp.nus.edu.sg
Abstract
Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.
1 Introduction
The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models; however, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.
A multiword expression (MWE) can be considered a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers describe different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e. the property that the whole expression cannot be derived from its component words. Stanford University launched a MWE project1 in which different qualities of MWEs were presented. For bilingual multiword expressions (BiMWEs), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language, and (2) the source phrase and the target phrase are exact translations of each other, i.e. there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since an MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.
For machine translation, Piao et al. (2005) noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as
1 http://mwe.stanford.edu
one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.
In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system, and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving the SMT performance rather than improving the word alignment quality. In detail, the differences are as follows:
- Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

- Instead of improving translation indirectly by improving the word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics lead only to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).
Besides the above differences, there are some advantages to our approaches:

- In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would otherwise propagate further into the SMT system and hurt SMT performance.

- We conduct experiments on a domain-specific
2 http://www.statmt.org/moses
corpus. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e. a domain-specific corpus has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are more flexible and concise. Take the Chinese term "软坚散结" for example. The meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or out of this context. These terms are difficult to translate even for humans, let alone for machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.
- In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus the available resources are fully exploited without greatly increasing the time and space complexity of the SMT system.
The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we give only a brief introduction to monolingual and bilingual MWE extraction.
2 Bilingual Multiword Expression Extraction
In this section we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.
2.1 Automatic Extraction of MWEs
In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et
al., 2005), (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007), and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. Most statistical measures take only two words into account, so it is not easy to extract MWEs containing three or more words.
Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary length. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".
Definition 1 (Unit): A unit is any sub-string of the given sentence. For example, "A B", "C" and "C D E" are all units, but "A B D" is not a unit.
Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.
Definition 3 (Score): The score function is defined only on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit3. For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").
Definition 4 (Select): The selecting operator finds the two adjacent units with maximum score in a list.
Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce the unit "B C" and the unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".
Initially, every word in the sentence is considered one unit, and all these units form an initial list L; if the sentence is of length N, the list contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters a loop with two steps: (1) select the two adjacent units with maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. The algorithm terminates on two conditions: (1) the maximum score after selection is less than a given threshold, or (2) L contains only one unit.

3 We use a stoplist to eliminate the units containing function words by setting their score to 0.
Figure 1: Example of the Hierarchical Reducing Algorithm (the hierarchy of units built over the example sentence, with the LLR scores of adjacent words shown at the bottom)
Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is the seven-word Chinese phrase meaning "healthy tea for preventing hyperlipidemia"4. Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs (glossed "hyper-lipid", "hyper-lipid-emia", "healthy tea" and "preventing hyper-lipid-emia") are extracted, in that order, and the sub-strings above the dotted line in Figure 1 are not extracted.
From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are measured by the minimum LLR of adjacent words in them, which gives lexical confidence to extracted MWEs. Finally, for a given sentence of length N, HRA definitely terminates within N - 1 iterations, which is very efficient.
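The select/reduce loop can be sketched as follows. This is a minimal illustration of HRA, assuming a precomputed LLR function over adjacent words; the llr table in the usage example is invented for illustration and does not come from the paper.

```python
# Sketch of the LLR-based Hierarchical Reducing Algorithm (HRA).
def hra(words, llr, threshold):
    """Extract MWEs from a sentence by repeated select/reduce steps.

    llr(a, b) scores two adjacent words; two adjacent units are scored
    by the last word of the left unit and the first word of the right.
    """
    units = [[w] for w in words]          # initial list L: one unit per word
    mwes = []                             # final set S
    while len(units) > 1:
        # select: adjacent pair with maximum score
        scores = [llr(units[i][-1], units[i + 1][0])
                  for i in range(len(units) - 1)]
        i = max(range(len(scores)), key=scores.__getitem__)
        if scores[i] < threshold:         # termination condition (1)
            break
        # reduce: concatenate the pair and record the resulting MWE
        merged = units[i] + units[i + 1]
        units[i:i + 2] = [merged]
        mwes.append(" ".join(merged))
    return mwes

def llr(a, b):
    # Toy LLR values for adjacent words (hypothetical, for illustration).
    return {("a", "b"): 5.0, ("b", "c"): 50.0, ("c", "d"): 30.0}.get((a, b), 0.0)

print(hra(["a", "b", "c", "d"], llr, 20))   # ['b c', 'b c d']
```

With threshold 20, "b c" is reduced first (score 50), then "b c d" (score 30), and the loop stops when the best remaining score (5) falls below the threshold, mirroring the dotted-line cutoff in Figure 1.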
However, HRA has the problem that it would extract substrings before extracting the whole string, even if the substrings appear only in that particular whole string, which we consider useless. To solve this problem, we use the contextual features
4 The whole sentence means "healthy tea for preventing hyperlipidemia"; the meanings of its Chinese words are, in order: preventing, hyper-, -lipid-, -emia, for, healthy, tea.
contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996) to filter out those substrings which exist only in a few MWEs.
2.2 Automatic Extraction of MWE's Translation
In subsection 2.1 we described the algorithm for obtaining MWEs; in this subsection we introduce the procedure for finding their translations in the parallel corpus.
To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:
1. Run GIZA++5 to align the words in the training parallel corpus.
2. For a given MWE, find the bilingual sentence pairs whose source language sentences include the MWE.
3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).
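The span extraction in step 3 can be sketched as follows. This is a simplified, hypothetical version of Och-style consistent-phrase extraction (returning only the tightest consistent target span), not the authors' exact algorithm.

```python
# Simplified consistent-span extraction for one MWE occurrence.
def candidate_translation(alignment, src_start, src_end):
    """Tightest target span consistent with the word alignment for the
    source span [src_start, src_end], or None if no consistent span.

    alignment is a set of (src_pos, tgt_pos) links (0-based).
    """
    tgt = [t for s, t in alignment if src_start <= s <= src_end]
    if not tgt:
        return None
    lo, hi = min(tgt), max(tgt)
    # Consistency: no word inside the target span may be linked to a
    # source word outside the MWE span.
    for s, t in alignment:
        if lo <= t <= hi and not (src_start <= s <= src_end):
            return None
    return lo, hi

# Toy alignment: source words 1-2 link to target words 2 and 1.
alignment = {(0, 0), (1, 2), (2, 1), (3, 3)}
print(candidate_translation(alignment, 1, 2))   # (1, 2)
```

If some target word inside the span were also linked outside the MWE, the occurrence yields no candidate; in practice candidates are collected over all occurrences before the classification step below.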
After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between source phrase and target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features are: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features are: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.
⁵http://www.fjoch.com/GIZA++.html
We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
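A minimal sketch of such a binary perceptron classifier (Collins-style update), with a hypothetical dense feature layout standing in for the translation and language features described above:

```python
def train_perceptron(data, epochs=10):
    """Vanilla binary perceptron over dense feature vectors.
    `data` is a list of (features, label) with label in {+1, -1};
    the feature vector is assumed to hold the translation features
    (log probabilities, log LLR) followed by the language features."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified: additive update
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    """+1 = right candidate translation, -1 = wrong candidate."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

In practice an averaged perceptron is often preferred for better generalization; the vanilla update is shown here for brevity.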
3 Application of Bilingual MWEs
Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as we described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.
3.1 Model Retraining with Bilingual MWEs
The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance has been shown by many researchers (Wu et al., 2008; Eck et al., 2004). However, our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
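The retraining step amounts to appending each extracted MWE pair to the parallel corpus as an extra "sentence pair" before rerunning GIZA++. A minimal sketch with hypothetical in-memory corpora:

```python
def augment_corpus(src_sents, tgt_sents, bilingual_mwes):
    """Append each extracted bilingual MWE as an extra sentence pair,
    so that retraining the aligner sees more occurrences of good
    phrases. Inputs are lists of token lists; returns new corpora."""
    src_aug = list(src_sents)
    tgt_aug = list(tgt_sents)
    for src_mwe, tgt_mwe in bilingual_mwes:
        src_aug.append(list(src_mwe))
        tgt_aug.append(list(tgt_mwe))
    return src_aug, tgt_aug
```

The augmented corpora would then be written out and passed through the usual GIZA++ / phrase-extraction pipeline.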
3.2 New Feature for Bilingual MWEs
Lopez and Resnik (2006) pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains a bilingual MWE. In other words, if the source language phrase contains an MWE (as a substring) and
the target language phrase contains the translation of that MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect that this feature helps the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
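The feature computation described above can be sketched as follows (function names are hypothetical):

```python
def contains_sublist(seq, sub):
    """True if `sub` occurs as a contiguous substring of `seq`."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))

def mwe_feature(src_phrase, tgt_phrase, bilingual_mwes):
    """Value of the extra phrase-table feature: 1 if the source phrase
    contains an MWE as a substring AND the target phrase contains that
    MWE's translation, else 0."""
    for src_mwe, tgt_mwe in bilingual_mwes:
        if contains_sublist(src_phrase, src_mwe) and \
           contains_sublist(tgt_phrase, tgt_mwe):
            return 1
    return 0
```
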
3.3 Additional Phrase Table of Bilingual MWEs
Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As for probability assignment, we simply assign 1 to each of the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in an input sentence, the decoder searches all candidate translation phrases in both phrase tables during translation. We use "Baseline+NewBP" to denote this method.
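Constructing the additional table can be sketched as follows. The `src ||| tgt ||| scores` line layout follows the common Moses text phrase-table format, though the exact column set depends on the Moses version; this is an illustrative sketch, not the paper's exact tooling.

```python
def mwe_phrase_table(bilingual_mwes):
    """Build entries for an additional Moses-style phrase table,
    assigning 1 to each of the four translation probabilities as
    described above. Returns one text line per bilingual MWE."""
    lines = []
    for src_mwe, tgt_mwe in bilingual_mwes:
        scores = "1 1 1 1"  # the four translation probabilities
        lines.append(f"{' '.join(src_mwe)} ||| {' '.join(tgt_mwe)} ||| {scores}")
    return lines
```
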
4 Experiments
4.1 Data
We run experiments on two domain-specific patent corpora: one for the traditional medicine domain and the other for the chemical industry domain. Our translation task is Chinese-to-English.
In the traditional medicine domain, Table 1 shows the data statistics. For the language model, we use the SRI Language Modeling Toolkit⁶ to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.
                       Chinese     English
Training  Sentences    120,355
          Words        4,688,873   4,737,843
Dev       Sentences    1,000
          Words        31,722      32,390
Test      Sentences    1,000
          Words        41,643      40,551

Table 1: Traditional medicine corpus
⁶http://www.speech.sri.com/projects/srilm/
In the chemical industry domain, Table 2 gives detailed information on the data. In this experiment, 59,466 bilingual MWEs are extracted.
                       Chinese     English
Training  Sentences    120,856
          Words        4,532,503   4,311,682
Dev       Sentences    1,099
          Words        42,122      40,521
Test      Sentences    1,099
          Words        41,069      39,210

Table 2: Chemical industry corpus
We test translation quality on the test set and use the open-source tool mteval-v11b.pl⁷ to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation there is only one reference per test sentence. We also perform a statistical significance test between two translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.
4.2 MT Systems
We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.
Features are combined in the log-linear model. To obtain the best translation $\hat{e}$ of the source sentence $f$, the log-linear model uses the following equation:

$$\hat{e} = \arg\max_{e} p(e|f) = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f) \quad (1)$$

in which $h_m$ and $\lambda_m$ denote the $m$-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
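Decoding under Eq. (1) can be illustrated with a toy argmax over candidate translations; the feature functions $h_m$ passed in here are hypothetical stand-ins for the real model components.

```python
def best_translation(candidates, weights, feature_funcs, f):
    """Log-linear model of Eq. (1): pick the candidate e maximizing
    sum_m lambda_m * h_m(e, f). `feature_funcs` are the h_m and
    `weights` are the lambda_m."""
    def score(e):
        return sum(lam * h(e, f) for lam, h in zip(weights, feature_funcs))
    return max(candidates, key=score)
```

A real decoder searches over an exponential candidate space with dynamic programming rather than enumerating candidates, but the scoring rule is the same.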
⁷http://www.nist.gov/speech/tests/mt/resources/scoring.htm
4.3 Results
Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain
Table 3 gives our experimental results. From this table we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, of 0.61 BLEU points, compared with the baseline system. The Baseline+Feat method comes next with a 0.17 BLEU point improvement, and Baseline+BiMWE achieves slightly higher translation quality than the baseline system.
To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs which were labeled positive by the perceptron algorithm and found that although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined by the following formula:
$$\mathrm{Score}(s, t) = \log(\mathrm{LLR}(s, t)) \times \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad (2)$$

Here the mean $\mu$ and variance $\sigma^2$ are estimated from the training set. After ranking by this score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs to perform the three methods mentioned in Section 3. The results are shown in Table 4.
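Eq. (2) can be implemented directly. The helpers below are a sketch: `llr` must be positive, and the top-k selection mirrors the 50,000/60,000/70,000 settings used above.

```python
import math

def rank_score(llr, x, mu, sigma):
    """Eq. (2): weight the phrase pair's log LLR by a Gaussian density
    over the source/target length ratio x (mean mu, std dev sigma)."""
    gauss = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return math.log(llr) * gauss

def top_k(pairs, k, mu, sigma):
    """pairs: list of (src, tgt, llr, length_ratio); keep the k best
    phrase pairs under the ranking score."""
    ranked = sorted(pairs, key=lambda p: rank_score(p[2], p[3], mu, sigma),
                    reverse=True)
    return ranked[:k]
```

At x = mu the Gaussian factor peaks, so pairs whose length ratio matches the typical ratio are preferred, all else being equal.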
From this table we can conclude that: (1) All three methods improve the BLEU score in all settings. (2) Except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658). (3) When the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better
Methods            50,000   60,000   70,000
Baseline           0.2658
Baseline+BiMWE     0.2671   0.2686   0.2715
Baseline+Feat      0.2728   0.2734   0.2712
Baseline+NewBP     0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain
than the others. (4) As the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method. (5) Comparing Tables 4 and 3, we can see that it is not true that more bilingual MWEs always mean better phrase-based SMT performance. This conclusion agrees with Lambert and Banchs (2005).
To verify the assumption that bilingual MWEs improve SMT performance not only in one particular domain, we also perform experiments in the chemical industry domain. Table 5 shows the results. From this table we can see that the three methods improve translation performance in the chemical industry domain as well as in the traditional medicine domain.
Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain
4.4 Discussion
In order to know in what respects our methods improve the translation, we manually analyzed some test sentences and give some examples in this subsection.
(1) For the first example in Table 6, the Chinese MWE meaning "dredging meridians" (its characters are lost in extraction) is aligned to other words and not correctly translated in the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since the bilingual MWE has been added into the training corpus and then aligned by GIZA++.
(2) For the second example in Table 6, the Chinese phrase (characters lost in extraction) has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as the translation, while the Baseline+Feat system chooses "med-
Src:      [Chinese source sentence; characters lost in extraction]
Ref:      the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood dispelling cold promoting salivation and water and tranquilizing and tonic effects and making nutritious health
+BiMWE:   the food has effects in tonifying blood dispelling cold dredging meridians promoting salivation promoting urination and tranquilizing tonic nutritious pulverizing

Src:      [Chinese source sentence; characters lost in extraction]
Ref:      the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet pill powder tea or injection
+Feat:    may also be made into tablet pill powder medicated tea or injection

Table 6: Translation examples
icated tea", because the additional feature gives a high probability to the correct translation "medicated tea".
5 Conclusion and Future Work
This paper presents an LLR-based hierarchical reducing algorithm to automatically extract bilingual MWEs and investigates the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to represent whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies can also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.
The strategies for applying bilingual MWEs in this paper are rather simple and coarse. More sophisticated approaches should be taken into account when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in phrase-based SMT systems, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will conduct further studies on improving statistical machine translation using domain bilingual MWEs.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.
References
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: An extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward, and Weijing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp
Abstract
This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as a labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
1 Introduction
Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods, such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001), with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).
In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as a basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON,
S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.
When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.
In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).
On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes apart from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information of the head of the bunsetsu.¹ In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.
However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.
This paper proposes Japanese bottom-up named

¹Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.
(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama TOP)
    'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences
entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as a labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, in the bunsetsu, the best label assignment is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 describes an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.
2 Related Work
In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX
class           example
PERSON          Kimura Syonosuke
LOCATION        Taiheiyou (Pacific Ocean)
ORGANIZATION    Jimin-tou (Liberal Democratic Party)
ARTIFACT        PL-houan (PL bill)
DATE            21-seiki (21st century)
TIME            gozen-7-ji (7 a.m.)
MONEY           500-oku-en (50 billion yen)
PERCENT         20 percent

Table 1: NE classes and their examples
Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.
NER methods are divided into two approaches: rule-based approaches and machine learning approaches. According to previous work, machine learning approaches achieve better performance than rule-based approaches.
In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, and E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.² The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
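The SE labeling scheme can be sketched as a conversion from NE spans to per-morpheme tags; the span format used here is hypothetical.

```python
def sbie_tags(n_tokens, entities):
    """Convert NE spans to the S/B/I/E-plus-O scheme described above.
    `entities` is a list of (start, end, cls) with `end` exclusive."""
    tags = ["O"] * n_tokens
    for start, end, cls in entities:
        if end - start == 1:
            tags[start] = f"S-{cls}"       # single-morpheme NE
        else:
            tags[start] = f"B-{cls}"       # beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{cls}"       # middle
            tags[end - 1] = f"E-{cls}"     # end
    return tags
```
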
The model for label estimation is learned by machine learning. The following features are generally utilized: characters, type of
²Besides, there are the IOB1 and IOB2 algorithms using only I, O, and B, and the IOE1 and IOE2 algorithms using only I, O, and E (Kim and Veenstra, 1999).
[Figure content: (a) initial state — label scores for each chunk, e.g., "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Yoshiharu-Meijin" MONEY 0.075, "Meijin" OTHERe 0.245; (b) final output — the assignment "Habu-Yoshiharu" (PERSON) + "Meijin" (OTHERe), with score 0.438 + 0.245, is chosen.]

Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin")
character, POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVMs or CRFs have been proposed.
Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word sub-class of the NE on the analyzing direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analyzing direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized a cache feature, coreference results, a syntactic feature, and a case-frame feature as structural features (Sasano and Kurohashi, 2008).
Some work acquired knowledge from a large unannotated corpus and applied it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of pairs of dependency relations (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al., 2008).
In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.
3 Overview of Proposed Method
Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu.
Then the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).
We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the label assignment "Habu-Yoshiharu" (PERSON) and the label assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in a general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.
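The bottom-up search can be sketched as a CKY-style dynamic program. Here a single chunk scorer and summed scores stand in for the paper's two learned models (label estimation and label comparison), which is a simplification; the demo scores follow the running "Habu-Yoshiharu-Meijin" example.

```python
def best_assignment(morphemes, label_score):
    """CKY-style bottom-up search for the best label assignment in a
    bunsetsu. `label_score(chunk)` returns (label, score) for the best
    single label of a chunk; competing assignments for a span are
    compared by their summed chunk scores."""
    n = len(morphemes)
    best = {}  # (i, j) -> (score, assignment as [(chunk, label), ...])
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            chunk = tuple(morphemes[i:j])
            label, score = label_score(chunk)
            best[i, j] = (score, [(chunk, label)])
            for k in range(i + 1, j):  # try splitting the span at k
                s = best[i, k][0] + best[k, j][0]
                if s > best[i, j][0]:
                    best[i, j] = (s, best[i, k][1] + best[k, j][1])
    return best[0, n][1]

# hypothetical scores for the running example; unlisted chunks
# get a low "invalid" score
def demo_score(chunk, _table={("Habu", "Yoshiharu"): ("PERSON", 0.438),
                              ("Meijin",): ("OTHERe", 0.245)}):
    return _table.get(chunk, ("invalid", 0.01))
```
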
The learned models are the following:
• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)
The two models are described in detail in the next section.
[Figure content: for "Habu Yoshiharu Meijin ga", the chunk "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and the remaining chunks are labeled invalid.]

Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"
4 Model Learning
4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions located across multiple bunsetsus tend to be an NE:
• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))
Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.³
For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.
Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned the correct label from a hand-annotated corpus. The label set consists of 13 classes, which include the eight NE classes (shown in Table 1) and five classes: OTHERs, OTHERb, OTHERi, OTHERe, and invalid.
The chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled OTHERb, OTHERi, or OTHERe, respectively.⁴
³As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).
⁴Each OTHER label is assigned to the longest chunk that satisfies its condition in a chunk.
1. number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type⁵

4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word sub-class, and several features provided by the morphological analyzer JUMAN

6. several features⁶ provided by the parser KNP

7. string of the morpheme in the chunk

8. IPADIC⁷ feature
   - If the string of the chunk is registered in the following categories of IPADIC, "person", "location", "organization", and "general", this feature fires

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is a political party)

10. cache feature
    - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used for the feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
    - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
    - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model
The chunk that is neither any of the eight NE classes nor the above four OTHER classes is labeled invalid.
In the example shown in Figure 3, "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and the other chunks are labeled invalid.
Next, the label estimation model is learned from the data in which the above label set is assigned
5 The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.
6 When a morpheme is ambiguous, all the corresponding features fire.
7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja
to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of the morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".
The label estimation model is learned from pairs of label and features for each chunk. To handle the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1 + exp(-βx)), and the transformed values are normalized so that the values of the 13 labels in a chunk sum to one.
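The sigmoid transformation and per-chunk normalization can be sketched as follows (a minimal sketch assuming β = 1, the paper's setting; in practice the input is the 13 raw SVM outputs for one chunk):

```python
import math

def normalized_label_scores(svm_outputs, beta=1.0):
    """Map raw SVM outputs for the labels of one chunk to scores that
    sum to one, via the sigmoid 1 / (1 + exp(-beta * x))."""
    sig = [1.0 / (1.0 + math.exp(-beta * x)) for x in svm_outputs]
    total = sum(sig)
    return [s / total for s in sig]

# Three labels shown for brevity; the model uses 13.
scores = normalized_label_scores([1.4, -0.8, -2.3])
```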
The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk in which the label invalid has the highest score, the label with the second highest score is adopted.
4.2 Label Comparison Model
This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:
- "Habu-Yoshiharu" is labeled as PERSON

- "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY
First, as shown in Figure 4, the two compared sets of chunks are lined up, separated by "vs." (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for model learning.
Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used
positive: +1  Habu-Yoshiharu (PSN)  vs.  Habu + Yoshiharu (PSN + MNY)
          +1  Habu-Yoshiharu + Meijin (PSN + OTHERe)  vs.  Habu + Yoshiharu + Meijin (PSN + MONEY + OTHERe)
negative: -1  Habu-Yoshiharu-Meijin (ORG)  vs.  Habu-Yoshiharu + Meijin (PSN + OTHERe)

Figure 4: Assignment of positive/negative examples
for each label, as estimated by the label estimation model, and the last (13th) dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining part.
Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.
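The feature layout just described — 13 dimensions per chunk, up to five chunks per set, zero-padded — can be sketched as follows (an illustrative sketch; names are assumptions, not the authors' code):

```python
MAX_CHUNKS = 5  # maximum number of chunks per compared set
DIMS = 13       # 12 label scores + 1 raw SVM output score per chunk

def comparison_features(first_set, second_set):
    """Concatenate per-chunk feature vectors for the two compared label
    assignments, padding each set to MAX_CHUNKS with zero vectors.

    Each element of a set is a 13-dimensional list (a V_ij in Figure 5)."""
    def pad(chunk_vectors):
        padded = list(chunk_vectors) + [[0.0] * DIMS] * (MAX_CHUNKS - len(chunk_vectors))
        return [x for vec in padded for x in vec]
    return pad(first_set) + pad(second_set)

# "Habu-Yoshiharu" (one chunk) vs. "Habu + Yoshiharu" (two chunks),
# with placeholder vectors standing in for V11, V21, V22:
v11, v21, v22 = [0.1] * DIMS, [0.2] * DIMS, [0.3] * DIMS
feats = comparison_features([v11], [v21, v22])
# total length: 2 sets x 5 chunks x 13 dimensions = 130
```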
5 Analysis
First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2(b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.
For example, the initial state shown in Figure 2(a) is obtained using the label estimation model. Then the label assignment is determined using the label comparison model from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell "Habu-Yoshiharu", as shown in Figure 6(a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell "Yoshiharu-Meijin", the model compares the
chunk:   Habu-Yoshiharu       Habu      Yoshiharu
label:   PERSON               PERSON    MONEY
vector:  V11 0 0 0 0          V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu"; V11, V21, V22 and 0 are vectors of dimension 13)
[Figure 6: The label comparison model. (a) Label assignment for the cell "Habu-Yoshiharu": the single-chunk assignment "Habu-Yoshiharu" (PERSON, 0.438) is compared with "Habu" (PERSON, 0.111) + "Yoshiharu" (MONEY, 0.075). (b) Label assignment for the cell "Habu-Yoshiharu-Meijin": "Habu-Yoshiharu-Meijin" (ORGANIZATION, 0.083), "Habu-Yoshiharu" + "Meijin" (OTHERe, 0.245), and "Habu" + "Yoshiharu" + "Meijin" (MNY + OTHERe, 0.075 + 0.245) are compared.]
label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).
As shown in Figure 6(b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.
In the label comparison step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
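The bottom-up procedure of Sections 4.1 and 4.2 can be sketched as CKY-style dynamic programming over a chart of spans. Here `best_label` and `compare` are illustrative stand-ins for the two learned models (this is a sketch under those assumptions, not the authors' implementation):

```python
def analyze_bunsetsu(morphemes, best_label, compare):
    """CKY-style bottom-up label assignment.

    chart[(i, j)] holds the current best label assignment (a list of
    (chunk, label) pairs) for the span morphemes[i:j]. best_label(chunk)
    plays the role of the label estimation model; compare(a, b) plays the
    role of the label comparison model, returning the better of two
    candidate assignments for the same span."""
    n = len(morphemes)
    chart = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            span = "-".join(morphemes[i:j])
            best = [(span, best_label(span))]   # the single-chunk candidate
            for k in range(i + 1, j):           # every binary split point
                best = compare(best, chart[(i, k)] + chart[(k, j)])
            chart[(i, j)] = best
    return chart[(0, n)]

# Toy stand-ins for the two models (illustrative only):
label = lambda s: "PERSON" if s == "Habu-Yoshiharu" else "invalid"
prefer_fewer_chunks = lambda a, b: min(a, b, key=len)
result = analyze_bunsetsu(["Habu", "Yoshiharu"], label, prefer_fewer_chunks)
```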
6 Experiment
To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In this data, 10,718 sentences in 1,174 news articles are annotated with the eight NEs. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were not used for both
[Figure 7: 5-fold cross-validation. The CRL NE data is divided into parts (a)-(e). For each held-out part, the label estimation model 1 is trained on the remaining four parts; auxiliary label estimation models (2b-2e) are each trained on three parts and applied to the remaining one to obtain features, from which the label comparison model is learned.]
the model learning8 and the evaluation. We performed 5-fold cross-validation following
previous work. Different from previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)-(e). Then, the label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same is done with parts
8 Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.
              Recall              Precision
ORGANIZATION  81.83 (3008/3676)   88.37 (3008/3404)
PERSON        90.05 (3458/3840)   93.87 (3458/3684)
LOCATION      91.38 (4992/5463)   92.44 (4992/5400)
ARTIFACT      46.72 ( 349/ 747)   74.89 ( 349/ 466)
DATE          93.27 (3327/3567)   93.12 (3327/3573)
TIME          88.25 ( 443/ 502)   90.59 ( 443/ 489)
MONEY         93.85 ( 366/ 390)   97.60 ( 366/ 375)
PERCENT       95.33 ( 469/ 492)   95.91 ( 469/ 489)
ALL-SLOT      87.87               91.79
F-measure     89.79

Table 3: Experimental result
(d) and (e). Then the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.
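The two-stage split of Figure 7 can be sketched as follows (an illustrative sketch; the part labels follow the figure and the function name is an assumption):

```python
def two_stage_folds(parts):
    """For each held-out test part, yield the training setup of Figure 7:
    the label estimation model 1 is trained on all other parts, and for
    each of those parts p, an auxiliary model is trained on the rest and
    applied to p to produce features for the label comparison model."""
    for test in parts:
        rest = [p for p in parts if p != test]
        aux = [(held, [p for p in rest if p != held]) for held in rest]
        yield test, rest, aux

splits = list(two_stage_folds(["a", "b", "c", "d", "e"]))
# 5 outer folds; in each, model 1 uses 4 parts, and there are 4
# auxiliary trainings (models 2b-2e) on 3 parts each.
```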
In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.
Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.
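The overall figure can be checked from the ALL-SLOT precision and recall in Table 3, since F = 2PR/(P + R):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# ALL-SLOT precision 91.79 and recall 87.87 from Table 3:
print(round(f_measure(91.79, 87.87), 2))  # 89.79, matching the table
```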
7 Discussion
7.1 Comparison with Previous Work
Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") using some patterns from a large Web corpus, and Sasano et al. utilized the results of coreference resolution as features for model learning. Therefore, by incorporating such knowledge and/or such analysis results into our method, the performance would be improved.
Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about
9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION is relatively low. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.
In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, our method can utilize the information of "hou" in estimating the label of the chunk "gaikoku-jin-touroku-hou".
7.2 Error Analysis
There were some errors in analyzing Katakana/alphabet words. In the following example, although the correct analysis is that "Batistuta" is labeled as PERSON, the system labeled it as OTHERs.
(4) Italy-de      katsuyaku-suru  Batistuta-wo    kuwaeta  Argentine
    Italy-LOC     active          Batistuta-ACC   call     Argentine
    'Argentine called Batistuta, who was active in Italy'
There is no entry for "Batistuta" in the JUMAN dictionary or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".
There were some errors in applying the label comparison model although the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of HongKong), the correct analysis is that "HongKong-seityou" is labeled as ORGANIZATION. As shown in Figure 8(b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8(a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning to adjust the value of β in the sigmoid function and to refine the
                                F1     analysis unit   distinctive features
(Fukushima et al., 2008)        89.29  character       Web
(Kazama and Torisawa, 2008)     88.93  character       Wikipedia/Web
(Sasano and Kurohashi, 2008)    89.40  morpheme        structural information
(Nakano and Hirai, 2004)        89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)  87.21  character       -
(Isozaki and Kazawa, 2003)      86.77  morpheme        -
proposed method                 89.79  compound noun   Wikipedia/structural information

Table 4: Comparison with previous work (all work was evaluated on the CRL NE data using cross-validation)
[Figure 8: An example of the error in the label comparison model. (a) Initial state: "HongKong-seityou" (ORGANIZATION 0.266, LOCATION 0.205), "HongKong" (LOCATION 0.266), "seityou" (OTHERe 0.184). (b) The final output: "HongKong + seityou" (LOC + OTHERe, 0.266 + 0.184) is chosen over "HongKong-seityou" (LOCATION).]
features for the label comparison model.
8 Conclusion
This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.
We are planning to integrate this method with the syntactic and case analysis method (Kawahara and Kurohashi, 2007) and to perform syntactic, case, and named entity analysis simultaneously to improve the overall accuracy.
References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970-979. (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304-311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407-415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173-179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of COLING-ACL 2006, pages 1121-1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), pages 282-289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8-15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934-941. (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607-612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171-178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63-70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
[The body of "Abbreviation Generation for Japanese Multi-Word Expressions" by Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita (pages 63-70) was garbled beyond recovery during text extraction. Only scattered fragments survive, such as the Japanese abbreviation examples ファミリーレストラン → ファミレス ("family restaurant") and 朝はビタミン → 朝ビタ, and parts of a feature table listing POS (Noun/Particle) and head-of-word reading features.]
Author Index
Bhutada, Pravin, 17
Cao, Jie, 47
Caseli, Helena, 1
Diab, Mona, 17
Finatto, Maria José, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55
Hoang, Hung Huu, 31
Huang, Yun, 47
Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55
Liu, Qun, 47
Lü, Yajuan, 47
Machado, André, 1
Ren, Zhixiang, 47
Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63
Villavicencio, Aline, 1
Wakaki, Hiromi, 63
Zarrieß, Sina, 23
Organizers
Dimitra Anastasiou, Localisation Research Centre, Limerick University, Ireland
Chikara Hashimoto, National Institute of Information and Communications Technology, Japan
Preslav Nakov, National University of Singapore, Singapore
Su Nam Kim, University of Melbourne, Australia
Program Committee
Iñaki Alegria, University of the Basque Country (Spain)
Timothy Baldwin, University of Melbourne (Australia)
Colin Bannard, Max Planck Institute (Germany)
Francis Bond, National Institute of Information and Communications Technology (Japan)
Gaël Dias, Beira Interior University (Portugal)
Ulrich Heid, Stuttgart University (Germany)
Stefan Evert, University of Osnabrück (Germany)
Afsaneh Fazly, University of Toronto (Canada)
Nicole Grégoire, University of Utrecht (The Netherlands)
Roxana Girju, University of Illinois at Urbana-Champaign (USA)
Kyo Kageura, University of Tokyo (Japan)
Brigitte Krenn, Austrian Research Institute for Artificial Intelligence (Austria)
Éric Laporte, University of Marne-la-Vallée (France)
Rosamund Moon, University of Birmingham (UK)
Diana McCarthy, University of Sussex (UK)
Jan Odijk, University of Utrecht (The Netherlands)
Stephan Oepen, University of Oslo (Norway)
Darren Pearce, London Knowledge Lab (UK)
Pavel Pecina, Charles University (Czech Republic)
Scott Piao, University of Manchester (UK)
Violeta Seretan, University of Geneva (Switzerland)
Stan Szpakowicz, University of Ottawa (Canada)
Beata Trawiński, University of Tübingen (Germany)
Peter Turney, National Research Council of Canada (Canada)
Kiyoko Uchiyama, Keio University (Japan)
Begoña Villada Moirón, University of Groningen (The Netherlands)
Aline Villavicencio, Federal University of Rio Grande do Sul (Brazil)
Table of Contents
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
    Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto . . . 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
    Su Nam Kim and Min-Yen Kan . . . 9

Verb Noun Construction MWE Token Classification
    Mona Diab and Pravin Bhutada . . . 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
    Sina Zarrieß and Jonas Kuhn . . . 23

A re-examination of lexical association measures
    Hung Huu Hoang, Su Nam Kim and Min-Yen Kan . . . 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
    R. Mahesh K. Sinha . . . 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
    Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang . . . 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
    Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi . . . 55

Abbreviation Generation for Japanese Multi-Word Expressions
    Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita . . . 63
Workshop Program
Friday, August 6, 2009

8:30-8:45    Welcome and Introduction to the Workshop

Session 1 (08:45-10:00): MWE Identification and Disambiguation

08:45-09:10  Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
             Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto

09:10-09:35  Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
             Su Nam Kim and Min-Yen Kan

09:35-10:00  Verb Noun Construction MWE Token Classification
             Mona Diab and Pravin Bhutada

10:00-10:30  BREAK

Session 2 (10:30-12:10): Identification, Interpretation and Disambiguation

10:30-10:55  Exploiting Translational Correspondences for Pattern-Independent MWE Identification
             Sina Zarrieß and Jonas Kuhn

10:55-11:20  A re-examination of lexical association measures
             Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20-11:45  Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
             R. Mahesh K. Sinha

11:45-13:50  LUNCH

Session 3 (13:50-15:30): Applications

13:50-14:15  Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
             Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang

14:15-14:40  Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
             Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40-15:05  Abbreviation Generation for Japanese Multi-Word Expressions
             Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05-15:30  Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30-16:00  BREAK

16:00-17:00  General Discussion

17:00-17:15  Closing Remarks
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 1-8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, André Machado♣, Maria José Finatto♥

♦Department of Computer Science, Federal University of São Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br
Abstract
Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. Particularly, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.
1 Introduction
A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).
Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.
In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources and/or with some semantic information for monolingual resources.
The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.
2 Related Work
The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.
A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g., Zhang et al., 2006; Villavicencio et al., 2007), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003), and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g., Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g., Villada Moirón and Tiedemann, 2006; Caseli et al., 2009).
As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).
Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that for MWE identification the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).
Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis of the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).
Our approach, described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), in which missing lexical entries for MWEs and related constructions are detected via error mining methods; this paper focuses on the extraction of generic MWEs as a by-product of automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.
Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.
3 The Corpus and Reference Lists
The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of pediatrics texts.
The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms: 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
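The subsumption rule just described can be sketched as follows (a minimal illustration with a token-level containment test; the example terms are taken from the text above, but the code is ours, not the Glossary tooling):

```python
def subsumption_filter(ngrams):
    """Keep only n-grams not contained in a longer retained n-gram,
    mirroring the Glossary rule that drops 'aleitamento materno'
    because 'aleitamento materno exclusivo' is already listed."""
    def contains(longer, shorter):
        # Token-level containment, so 'materno' does not match 'maternos'.
        n, m = len(longer), len(shorter)
        return any(longer[i:i + m] == shorter for i in range(n - m + 1))

    # Longest first, so containers are admitted before their parts.
    tokenised = sorted((ng.split() for ng in ngrams), key=len, reverse=True)
    kept = []
    for ng in tokenised:
        if not any(contains(longer, ng) for longer in kept):
            kept.append(ng)
    return [" ".join(ng) for ng in kept]

terms = ["aleitamento materno", "aleitamento materno exclusivo", "leite materno"]
print(subsumption_filter(terms))
# ['aleitamento materno exclusivo', 'leite materno']
```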
4 Statistically-Driven and Alignment-Based Methods
4.1 Statistically-Driven Method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps; since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2 and log-likelihood (Press et al., 1992), among others, have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed
1Available at the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim
to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).
From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the four measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
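As an illustration of this statistical step, the sketch below computes PMI over adjacent bigrams and combines several rankings by averaging rank positions (a toy example on invented text; the actual experiments used the Ngram Statistics Package, not this code):

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise Mutual Information of each adjacent bigram:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        (x, y): math.log2((f / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), f in bigrams.items()
    }

def combined_rank(candidates, *rankings):
    """Average the rank positions a candidate receives from several
    measures; the final candidate list is ordered by that average."""
    avg = {c: sum(r.index(c) for r in rankings) / len(rankings) for c in candidates}
    return sorted(candidates, key=lambda c: avg[c])

# Illustrative toy text, not the Pediatrics corpus.
tokens = "o aleitamento materno exclusivo protege o aleitamento materno".split()
scores = pmi_scores(tokens)
print(round(scores[("aleitamento", "materno")], 2))  # ≈ 2.19
```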
4.2 Alignment-Based Method
The second of the MWE extraction approaches investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that (a) S and T share some semantic features, and (b) S may be an MWE.
In other words, the alignment-based MWE extraction method states that a sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
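The candidate identification step can be sketched as follows (the alignment data structure is an assumption made for illustration, not the actual GIZA++ output format):

```python
from collections import Counter

# Hypothetical aligner output: one (source_phrase, target_phrase) pair per
# alignment link, with phrases given as tuples of words.
alignments = [
    (("aleitamento", "materno"), ("breastfeeding",)),
    (("aleitamento", "materno"), ("breastfeeding",)),
    (("aleitamento", "materno"), ("breastfed",)),
    (("leite",), ("milk",)),
]

def mwe_candidates(alignments):
    """A source sequence S is a candidate when n >= 2 of its words were
    joined and aligned to a target sequence T of m >= 1 words (an n:m
    alignment, n >= 2), regardless of how many target words are involved."""
    counts = Counter()
    for source, target in alignments:
        if len(source) >= 2 and len(target) >= 1:
            counts[" ".join(source)] += 1
    return counts

print(mwe_candidates(alignments))  # Counter({'aleitamento materno': 3})
```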
Thus, notice that the alignment-based MWE extraction method does not rely on conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.
It is also important to note that although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are in fact a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained on the same corpus used in our experiments, and the numbers of phrases extracted following the two approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.
In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as the gold standard.
To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. These preprocessing steps were performed, respectively, by a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).
2GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html
3Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org
From the preprocessed corpus, the MWE candidates are extracted as those units in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same as that used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun, or beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.
The second one (F2) is the same as that used in Caseli et al. (2009), namely: (a) patterns beginning with a determiner, auxiliary verb, pronoun, adverb or conjunction, or with surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of); and (b) a minimum frequency threshold of 2.
The third one (F3) is the same as in Caseli et al. (2009), plus: (a) patterns beginning or finishing with a determiner, adverb, conjunction, preposition, verb, pronoun or numeral; and (b) a minimum frequency threshold of 2.
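An F3-style filter can be sketched as follows (the POS tag names, the candidate representation and the example entries are our assumptions for illustration, not the actual Apertium tagset or corpus data):

```python
# Illustrative F3-style filter: discard candidates that begin or end with a
# determiner, adverb, conjunction, preposition, verb, pronoun or numeral,
# or that fall below the minimum frequency threshold of 2.
BAD_EDGE_TAGS = {"det", "adv", "cnj", "pr", "vb", "prn", "num"}  # assumed names

def apply_filter(candidates, min_freq=2):
    """candidates: {phrase: (pos_tag_sequence, frequency)}"""
    kept = {}
    for phrase, (tags, freq) in candidates.items():
        if freq < min_freq:
            continue  # frequency threshold (criterion b)
        if tags[0] in BAD_EDGE_TAGS or tags[-1] in BAD_EDGE_TAGS:
            continue  # POS pattern at either edge (criterion a)
        kept[phrase] = freq
    return kept

cands = {
    "aleitamento materno": (("n", "adj"), 202),
    "de acordo": (("pr", "n"), 50),   # begins with a preposition: filtered
    "caso raro": (("n", "adj"), 1),   # below the threshold: filtered
}
print(apply_filter(cands))  # {'aleitamento materno': 202}
```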
5 Experiments and Results
Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and by the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), as well as other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).
In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure (2 × precision × recall / (precision + recall)) figures for the association measures, using all the candidates (on the
PMI                             alignment-based
Online Mendelian Inheritance    faixa etaria
Beta Technology Incorporated    aleitamento materno
Lange Beta Technology           estados unidos
Oxido Nitrico Inalatorio        hipertensao arterial
jogar video game                leite materno
e um de                         couro cabeludo
e a do                          bloqueio lactíferos
se que de                       emocional anatomia
e a da                          neonato a termo
e de nao                        duplas maes bebes

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach
pt MWE candidates     PMI      MI
proposed bigrams      64,839   64,839
correct MWEs          1,403    1,403
precision (%)         2.16     2.16
recall (%)            98.73    98.73
F (%)                 4.23     4.23
proposed trigrams     54,548   54,548
correct MWEs          701      701
precision (%)         1.29     1.29
recall (%)            96.03    96.03
F (%)                 2.55     2.55
proposed bigrams      1,421    1,421
correct MWEs          155      261
precision (%)         10.91    18.37
recall (%)            10.91    18.37
F (%)                 10.91    18.37
proposed trigrams     730      730
correct MWEs          44       20
precision (%)         6.03     2.74
recall (%)            6.03     2.74
F (%)                 6.03     2.74

Table 2: Evaluation of MWE candidates - PMI and MI
first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree well with the Pediatrics Glossary, since there are at most only 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.
On the other hand, with the alignment-based method, 34,277 pt MWE candidates were extracted; Table 3 summarizes the number of candidates filtered by the three filters described in Section 4.2: F1, F2 and F3.
To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic comparison we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

pt MWE candidates          F1      F2      F3
filtered by POS patterns   24,996  21,544  32,644
filtered by frequency      9,012   11,855  1,442
final set                  269     878     191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates      F1     F2     F3
proposed bigrams       250    754    169
correct MWEs           48     95     65
precision (%)          19.20  12.60  38.46
recall (%)             3.38   6.69   4.57
F (%)                  5.75   8.74   8.18
proposed trigrams      19     110    20
correct MWEs           1      9      4
precision (%)          5.26   8.18   20.00
recall (%)             0.14   1.23   0.55
F (%)                  0.27   2.14   1.07
proposed bi+trigrams   269    864    189
correct MWEs           49     104    69
precision (%)          18.22  12.04  36.51
recall (%)             2.28   4.83   3.21
F (%)                  4.05   6.90   5.90

Table 4: Evaluation of MWE candidates
The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only the bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doenca arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not considered at the moment.
After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists, the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.
From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see Table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true if it is a multiword expression or false otherwise, independently of whether it is a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.
The judgments of the two judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with K = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of K between 0.67 and 0.8 indicates good agreement.
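Agreement figures of this kind come from a standard Cohen's kappa computation, sketched below on toy annotations (the judgments shown are illustrative, not the actual ones):

```python
def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two annotations of the same items:
    K = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e the agreement expected by chance from each judge's label rates."""
    assert len(judge_a) == len(judge_b)
    n = len(judge_a)
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    labels = set(judge_a) | set(judge_b)
    p_e = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Toy true/false judgments for six candidates (illustrative only).
judge_a = [1, 1, 1, 0, 0, 0]
judge_b = [1, 1, 0, 0, 0, 1]
print(round(cohen_kappa(judge_a, judge_b), 2))  # 0.33
```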
In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating it with respect to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. Thus, of the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.
6 Conclusions and Future Work
MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.
In this paper we investigated the identification of MWEs in a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained
show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.
Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of MWEs considered true by both judges or by at least one of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to note that in this manual analysis the human experts classified the MWEs as true independently of whether they were Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.
In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.
Acknowledgments
We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.
References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50–59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047–2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistics. Computational Linguistics, 22(2):249–254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49–56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117–134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.
Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165–178, Oxford. Oxford University Press.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534–559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33–40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440–447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1–7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17–24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1–15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034–1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415–432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia, July. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 9–16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
Abstract
We tackle two major issues in automatic keyphrase extraction from scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used in the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research shows that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many existing features are usable in unsupervised extraction.
1 Introduction
Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications, such as document summarizers (Davanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).
In the past, various attempts have been made to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and on rich sets of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.
Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g. Frank et al. (1999); Park et al. (2004); Nguyen and Kan (2007)), most of them do not effectively deal with the various keyphrase forms, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.
Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. This leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, the supervised approach involves a large amount of manual labor, reducing its utility for real-world applications. Hence, an unsupervised approach is needed in order to minimize manual tasks and to encourage utilization. Attesting the reliability and re-usability of features for the unsupervised approach is thus a worthy study, in order to set up tentative guidelines for applications.
This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method that improves the coverage of candidates occurring in various forms. Our second contribution re-examines existing keyphrase extraction features
reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.
In the remaining sections, we give an overview of related work in Section 2, present our proposals on candidate selection and feature engineering in Sections 4 and 5, and describe our system architecture and data in Section 6. We then evaluate our proposals, discuss the outcomes and conclude in Sections 7, 8 and 9, respectively.
2 Related Work
The majority of related work has been carried out using statistical approaches, rich sets of symbolic resources and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e. the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent (2) keyphrase cohesion features (i.e. the relationship among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e. the relationship among components in a keyphrase) (Park et al., 2004).
The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TF×IDF (i.e. term frequency × inverse document frequency) and first occurrence in the document. TF×IDF measures document cohesion, and the first occurrence reflects the importance of the abstract or introduction, indicating that keyphrases have locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequences. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head-noun heuristics that took three features: length of candidate, frequency, and head-noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. Textract (Park et al., 2004) also ranks the candidate keyphrases by its judgment of the keyphrases' degree of domain specificity, based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of the other documents in the same cluster.
3 Keyphrase Analysis
In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of the keyphrases themselves. We believe there is more to be learned from the reference keyphrases by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods; see Section 6 for details of the data.
Syntactically, keyphrases can be formed from either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g., mobile network, fast computing, partially observable Markov decision process), though such forms are less frequent. They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, and via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.
However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we also found that keyphrases occur as simple conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases are more complex still (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

Criteria      Rules
Frequency     (Rule1) Frequency heuristic, i.e., frequency ≥ 2 for simplex words vs. frequency ≥ 1 for NPs
Length        (Rule2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
              (e.g., synchronous concurrent program vs. model of multiagent interaction)
Alternation   (Rule3) of-PP form alternation
              (e.g., number of sensor = sensor number, history of past encounter = past encounter history)
              (Rule4) Possessive alternation
              (e.g., agent's goal = goal of agent, security's value = value of security)
Extraction    (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
              (e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
              (Rule6) Simplex Word/NP IN Simplex Word/NP
              (e.g., quality of service, sensitivity of VOIP traffic (VOIP traffic extracted), simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules
A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word-order alternation. Previous studies have made use of such altered keyphrases when they occur in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).
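The of-PP and possessive alternations described above can be sketched as a small string-rewriting helper. This is an illustrative Python sketch only, not the authors' implementation; the function name and the bare-bones whitespace tokenization are our own assumptions.

```python
def alternate(phrase):
    """Generate word-order variants of a candidate keyphrase (sketch)."""
    variants = {phrase}
    words = phrase.split()
    # of-PP alternation (Rule3): "number of sensor" -> "sensor number"
    if "of" in words[1:-1]:
        i = words.index("of")
        head, modifier = words[:i], words[i + 1:]
        variants.add(" ".join(modifier + head))
    # possessive alternation (Rule4): "agent's goal" -> "goal of agent"
    if words and words[0].endswith("'s"):
        owner = words[0][:-2]
        variants.add(" ".join(words[1:] + ["of", owner]))
    return variants
```

For example, `alternate("number of sensor")` contains "sensor number", and `alternate("agent's goal")` contains "goal of agent", matching the alternations in Table 1.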
4 Candidate Selection
We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is closely related to term extraction, since the top-N ranked terms become the keyphrases of a document. In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe our keyphrase analysis in more detail.
In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction; few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, and rarely adverbs or verbs. As a result, we decided to exclude NPs containing adverbs and verbs from our candidates in this study, since they tend to produce more errors and require more complexity.
Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic: our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to select NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (≥ 2) and NPs (≥ 1). Note that 30% of NPs occurred only once in our data.
Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule6 for NPs in of-PP form broadens the coverage of
possible candidates: given an NP in of-PP form, we collect not only the simplex word(s) but also the non-of-PP form of the NP, from both the noun phrase governing the PP and the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the rules of previous works do not.
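As a concrete illustration, Rules 5 and 6 can be approximated over (word, POS-tag) pairs as follows. This is a hedged sketch: the function names are ours, and the adjacency check for the of-PP case is a simplification of the actual rules.

```python
NOUNS = {"NN", "NNS", "NNP", "NNPS"}
ADJS = {"JJ", "JJR", "JJS"}

def base_nps(tagged):
    """Rule5: maximal (JJ|NN)* NN runs over (word, POS) pairs."""
    spans, cur = [], []
    for word, pos in tagged:
        if pos in NOUNS or pos in ADJS:
            cur.append((word, pos))
        else:
            spans.append(cur)
            cur = []
    spans.append(cur)
    nps = []
    for span in spans:
        while span and span[-1][1] in ADJS:  # phrase must end in a noun
            span = span[:-1]
        if span and span[-1][1] in NOUNS:
            nps.append(" ".join(w for w, _ in span))
    return nps

def candidates(tagged):
    """Rules 5 and 6: base NPs plus NP-IN-NP candidates."""
    cands = set(base_nps(tagged))
    for i, (word, pos) in enumerate(tagged):
        # Rule6: an NP immediately before and after a preposition
        if pos == "IN" and 0 < i < len(tagged) - 1 \
                and tagged[i - 1][1] in NOUNS:
            left = base_nps(tagged[:i])
            right = base_nps(tagged[i + 1:])
            if left and right:
                cands.add(f"{left[-1]} {word} {right[0]}")
    return cands
```

For the tagged phrase effective/JJ algorithm/NN of/IN grid/NN computing/NN, `candidates` returns effective algorithm, grid computing, and effective algorithm of grid computing, mirroring the example above.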
5 Feature Engineering
With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls to careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated whether each feature is suitable only for the supervised setting (S) or applicable to both (S, U).
5.1 Document Cohesion
Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF relate to locality; that is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.
F1: TFIDF (S, U): TFIDF indicates document cohesion by looking at the frequency of terms in documents and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF over substrings as well as exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate type, i.e., we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system.
• (F1a) TFIDF
• (F1b) TF including counts of substrings
• (F1c) TF of substrings as a separate feature
• (F1d) normalized TF by candidate type (i.e., simplex words vs. NPs)
• (F1e) normalized TF by candidate type as a separate feature
• (F1f) IDF using Google n-grams
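A minimal sketch of the substring-aware TF (F1b) and an external-count IDF (F1f). The count table and totals below are stand-ins for real Google n-gram statistics; all names and numbers are illustrative assumptions, not the authors' code.

```python
import math

# Stand-in for Google n-gram counts (illustrative values, not real data).
NGRAM_COUNTS = {"grid computing": 1_200_000, "algorithm": 45_000_000}
TOTAL = 1_000_000_000  # assumed total count for the IDF denominator

def tf_with_substrings(candidate, doc_candidates):
    """F1b: exact occurrences plus occurrences inside longer candidates."""
    return sum(phrase.count(candidate) for phrase in doc_candidates)

def idf(candidate):
    """F1f: IDF from external n-gram counts (no corpus-dependent pass)."""
    return math.log(TOTAL / (1 + NGRAM_COUNTS.get(candidate, 0)))

def tfidf(candidate, doc_candidates):
    return tf_with_substrings(candidate, doc_candidates) * idf(candidate)
```

With `doc_candidates = ["grid computing", "grid computing algorithm", "efficient grid computing algorithm"]`, `tf_with_substrings("grid computing", ...)` is 3, since the two longer candidates each contain it as a substring.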
F2: First Occurrence (S, U): KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.
F3: Section Information (S, U): Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section head, title, and/or references.
F4: Additional Section Information (S, U): We first added the related work or previous work section as a piece of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction
contains the majority of keyphrases, while the title or section head contains many fewer, due to the variation in their sizes.
• (F4a) section 'related/previous work'
• (F4b) counting substrings occurring in key sections
• (F4c) section TF across all key sections
• (F4d) weighting key sections according to the proportion of keyphrases found
F5: Last Occurrence (S, U): Similar to the distance feature in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.
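Both locality features (F2 and F5) reduce to token positions. Below is a sketch, with our own choice of normalizing by document length so values are comparable across documents; the original features may differ in detail.

```python
def occurrence_features(candidate, words):
    """Return (first, last) occurrence positions of `candidate`,
    normalized to [0, 1] over the tokenized document `words`."""
    cand = candidate.split()
    n = len(cand)
    positions = [i for i in range(len(words) - n + 1)
                 if words[i:i + n] == cand]
    if not positions:
        return None  # candidate does not occur in the document
    return positions[0] / len(words), positions[-1] / len(words)
```

A small first (F2) value suggests the candidate appears early (e.g., in the abstract), while a large last (F5) value suggests it recurs in the conclusion.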
5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.
F6: Co-occurrence of Another Candidate in Section (S, U): When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.
F7: Title Overlap (S): In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior on which words could serve as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer¹ collection to create this feature.
• (F7a) co-occurrence (Boolean) in title collection
• (F7b) co-occurrence (TF) in title collection
F8: Keyphrase Cohesion (S, U): Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity of the top-N ranked candidates against the remainder. In the
¹ It contains 1.3M titles from articles, papers, and reports.
original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs, and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.
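The contextual-similarity approximation can be sketched as follows: build a vector of neighboring words per candidate and compare vectors with the cosine measure. The window size and the Counter-based representation are our assumptions; in the described setup, neighbors would further be restricted to nouns, verbs, and adjectives.

```python
import math
from collections import Counter

def context_vector(candidate_words, doc_words, window=5):
    """Counts of words occurring within `window` tokens of the candidate."""
    vec = Counter()
    for i, w in enumerate(doc_words):
        if w in candidate_words:
            lo, hi = max(0, i - window), i + window + 1
            vec.update(x for x in doc_words[lo:hi]
                       if x not in candidate_words)
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    na, nb = norm(a), norm(b)
    return dot / (na * nb) if na and nb else 0.0
```

Candidates that occur in similar contexts (shared neighboring words) receive a high cosine score, which serves as the cohesion estimate between them.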
5.3 Term Cohesion
Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).
F9: Term Cohesion (S, U): Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex-word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier.
• (F9a) term cohesion as in Park et al. (2004)
• (F9b) normalized TF by candidate type (i.e., simplex words vs. NPs)
• (F9c) applying different weights by candidate type
• (F9d) normalized TF and different weighting by candidate type
5.4 Other Features
F10: Acronym (S): Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set; hence, we used it only with N&K to test our candidate selection method.
F11: POS Sequence (S): Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. Their work showed the distinctive distribution of the POS sequences of keyphrases and used them as a feature. Like the acronym feature, this is also dependent on the data set.
F12: Suffix Sequence (S): Similar to acronyms, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data dependent and is thus used in the supervised approach only.
F13: Length of Keyphrases (S, U): Barker and Cornacchia (2000) showed that candidate length is also a useful feature in extraction, as well as in candidate selection, as the majority of keyphrases are one or two terms in length.
6 System and Data
To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections such as the title and body sections via heuristic rules, and applied a sentence segmenter², ParsCit³ (Councill et al., 2008) for reference collection, a part-of-speech tagger⁴, and a lemmatizer⁵ (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy⁶, and simple weighting.
For evaluation, we collected 250 papers from four different categories⁷ of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found that many were missing or appeared only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the
² http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
³ http://wing.comp.nus.edu.sg/parsCit/
⁴ http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
⁵ http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
⁶ http://maxent.sourceforge.net/index.html
⁷ C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence–Multiagent Systems), and J.4 (Social and Behavioral Sciences–Economics)
number of author-assigned and reader-assigned keyphrases found in the documents.
          Author        Reader          Combined
Total     1252 (53)     3110 (111)      3816 (146)
NPs       904           2537            3027
Average   3.85 (4.01)   12.44 (12.88)   15.26 (15.85)
Found     769           2509            2864

Table 2: Statistics of Keyphrases
7 Evaluation
The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information, and additional section information (i.e., F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.
Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10, and 15 candidates.
Best Features includes F1c (TF of substring as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion by Park et al. (2004)), and F13 (length of keyphrases). Best-TFIDF means using all best features except TFIDF.
In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).⁸
In Table 5, the performance of each feature is measured using the N&K system plus the target feature; + indicates an improvement, − indicates a performance decline, and an empty marker indicates no effect or an unconfirmed effect due to only small changes in performance. Again, Supervised denotes Maximum Entropy training, and Unsupervised is our unsupervised approach.
8 Discussion
We compared the performance of our candidate selection and feature engineering against simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer
⁸ Due to the page limits, we present the best performance.
Method         Features  C   Five                                Ten                                 Fifteen
                             Match Precision Recall Fscore      Match Precision Recall Fscore      Match Precision Recall Fscore
All            KEA       U   0.03  0.64      0.21   0.32        0.09  0.92      0.60   0.73        0.13  0.88      0.86   0.87
Candidates               S   0.79  15.84     5.19   7.82        1.39  13.88     9.09   10.99       1.84  12.24     12.03  12.13
               N&K       S   1.32  26.48     8.67   13.06       2.04  20.36     13.34  16.12       2.54  16.93     16.64  16.78
               baseline  U   0.92  18.32     6.00   9.04        1.57  15.68     10.27  12.41       2.20  14.64     14.39  14.51
                         S   1.15  23.04     7.55   11.37       1.90  18.96     12.42  15.01       2.44  16.24     15.96  16.10
Length≤3       KEA       U   0.03  0.64      0.21   0.32        0.09  0.92      0.60   0.73        0.13  0.88      0.86   0.87
Candidates               S   0.81  16.16     5.29   7.97        1.40  14.00     9.17   11.08       1.84  12.24     12.03  12.13
               N&K       S   1.40  27.92     9.15   13.78       2.10  21.04     13.78  16.65       2.62  17.49     17.19  17.34
               baseline  U   0.92  18.4      6.03   9.08        1.58  15.76     10.32  12.47       2.20  14.64     14.39  14.51
                         S   1.18  23.68     7.76   11.69       1.90  19.00     12.45  15.04       2.40  16.00     15.72  15.86
Length≤3       KEA       U   0.01  0.24      0.08   0.12        0.05  0.52      0.34   0.41        0.07  0.48      0.47   0.47
Candidates               S   0.83  16.64     5.45   8.21        1.42  14.24     9.33   11.27       1.87  12.45     12.24  12.34
+ Alternation  N&K       S   1.53  30.64     10.04  15.12       2.31  23.08     15.12  18.27       2.88  19.20     18.87  19.03
               baseline  U   0.98  19.68     6.45   9.72        1.72  17.24     11.29  13.64       2.37  15.79     15.51  15.65
                         S   1.33  26.56     8.70   13.11       2.09  20.88     13.68  16.53       2.69  17.92     17.61  17.76

Table 3: Performance on Proposed Candidate Selection
Features    C   Five                          Ten                           Fifteen
                Match Prec  Recall Fscore    Match Prec  Recall Fscore    Match Prec  Recall Fscore
Best        U   1.14  22.8  7.47   11.3      1.92  19.2  12.6   15.2      2.61  17.4  17.1   17.3
            S   1.56  31.2  10.2   15.4      2.50  25.0  16.4   19.8      3.15  21.0  20.6   20.8
Best        U   1.14  22.8  7.4    11.3      1.92  19.2  12.6   15.2      2.61  17.4  17.1   17.3
w/o TFIDF   S   1.56  31.1  10.2   15.4      2.46  24.6  16.1   19.4      3.12  20.8  20.4   20.6

Table 4: Performance on Feature Engineering
A   Method  Features
+   S       F1a, F2, F3, F4a, F4d, F9a
    U       F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
−   S       F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
    U       F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
    S       F1e, F10, F11, F12
    U       F1b

Table 5: Performance on Each Feature
length candidates tend to act as noise and thus decreased the overall performance. We also confirmed that candidate alternation captures the flexibility of keyphrases, leading to higher candidate coverage as well as better performance.
To re-examine the features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not impact performance. We surmise that the cause is that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections, proposed in our work, was found to be empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it includes a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion is subject to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the findings on each feature.
As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, the unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.
9 Conclusion
We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works, while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.
10 Acknowledgements

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.
References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10–17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of ACL, 1989, pp. 76–83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28–30.

Ernesto D'Avanzo and Bernado Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433–447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 26.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668–673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81–104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beata Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537–544.

Mario Jarmasz and Caroline Barriere. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349–357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157–169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296–297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207–223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317–326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48–55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Polla, and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434–439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254–256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17–22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Verb Noun Construction MWE Token Supervised Classification
Mona T DiabCenter for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu
Pravin BhutadaComputer Science Department
Columbia University
pb2351@columbia.edu
Abstract
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem, and we experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.
1 Introduction
In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more often than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.
For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE meaning is transparent, i.e., predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of an
MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as being a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket; however, make a decision is very clearly related to to decide. Both of these expressions are considered MWEs but have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.
To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by Cook et al. (2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning and that over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as a wrong reading could potentially have detrimental effects on the quality of the translation.
In this paper, we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.
This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.
2 Multi-word Expressions
MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional, both fixed expressions such as kingdom come and ad hoc, and non-fixed expressions such as break new ground and speak of the devil. The VNCs on which we focus in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living, and Verb Particle Constructions (VPC) such as write-up and call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and occurs more often in text than, say, traffic regulator or vehicle light.
3 Related Work
Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: the work of Katz and Giesbrecht (2006) and that of Hashimoto and Kawahara (2008).
Katz and Giesbrecht (2006) (KG06) carried out a vector similarity comparison between the context of an MWE and that of its constituent words, using LSA to determine whether the expression is idiomatic or not. KG06 is similar in intuition to the work of Fazly and Stevenson (2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.
Hashimoto and Kawahara (2008) (HK08) is, to our knowledge, the first large-scale study to address token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data, identifying 101 idiom types, each with roughly 1,000 examples, yielding a corpus of 102K annotated sentences for their experiments. They experiment only with the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprise basic syntactic features such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features are mostly inflectional, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction; their overall accuracy is 89.25% using all the features.
4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by Cook et al. (2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for identifying MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging we use the YamCha sequence labeling system.1 YamCha is based on Support Vector Machines, using degree-2 polynomial kernels.
We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside any chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma, or citation form, of the word (LEMMA), and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we model the POS tag feature for our running example, the training data is annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature yields: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
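The column-per-feature representation above can be sketched as follows; this is a minimal illustration of how the running example might be turned into YamCha-style training rows (the exact column layout used by the authors is an assumption):

```python
# Sketch: build feature-column rows (token, POS, NGRAM, IOB label) for the
# running example "John kicked the bucket last Friday".

def ngram_feature(token):
    """Last 3 characters of the token, a rough proxy for inflectional
    and derivational morphology."""
    return token[-3:]

def to_rows(tokens, pos_tags, labels):
    """One (token, POS, NGRAM, label) tuple per input token."""
    return [(tok, pos, ngram_feature(tok), lab)
            for tok, pos, lab in zip(tokens, pos_tags, labels)]

tokens = ["John", "kicked", "the", "bucket", "last", "Friday"]
pos    = ["NN", "VBD", "Det", "NN", "ADV", "NN"]
labels = ["O", "B-I", "I-I", "I-I", "O", "O"]

for row in to_rows(tokens, pos, labels):
    print("\t".join(row))
```

Each printed line corresponds to one token with its feature columns, matching the NGRAM example in the text (e.g. kicked VBD ked B-I).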
With the NE feature we followed the same representation as for the other features, as a separate column as shown above, referred to as Named Entity Separate (NES). For named entity recognition (NER) we use the BBN IdentiFinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and one with a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition in which we changed the words' representation in the input to their NE class, Named Entity InText (NEI). For example, in the NEI condition our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE class "PER".

1http://www.tado-chasen.com/yamcha
2http://www.bbn.com/identifinder
5 Experiments and Results
5.1 Data
We use the manually annotated standard data set identified in Cook et al. (2008). This data comprises 2,920 unique VNC-token expressions drawn from the entire British National Corpus (BNC),3 which contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown as well as those pertaining to the speech portion of the data, leaving us with a total of 2,432 sentences corresponding to 53 VNC MWE types. This data has 2,571 annotations,4 corresponding to 2,020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments; the results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. We use the tokenized version of the BNC.
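The 80/10/10 split repeated over 5 folds can be sketched as follows; the exact fold boundaries and shuffling strategy the authors used are assumptions:

```python
# Sketch: 5 folds, each with 10% test, 10% development, 80% training,
# produced by rotating a window over the item indices.

def five_fold_splits(n_items, n_folds=5):
    idx = list(range(n_items))
    tenth = n_items // 10  # each of test and dev takes 10%
    splits = []
    for k in range(n_folds):
        start = k * tenth
        test = idx[start:start + tenth]
        dev = idx[start + tenth:start + 2 * tenth]
        held_out = set(test) | set(dev)
        train = [i for i in idx if i not in held_out]
        splits.append((train, dev, test))
    return splits

# Example with a round number of sentences:
splits = five_fold_splits(100)
```

Per-condition scores would then be averaged over the 5 (train, dev, test) triples, as described in the text.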
5.2 Evaluation Metrics
We use Fβ=1 (F-measure), the harmonic mean of (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
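For concreteness, the per-class metrics can be computed as below; representing chunks as (span, label) items is our simplifying assumption:

```python
# Sketch: precision, recall, and F(beta=1) over gold vs. predicted chunks
# for one class (IDM or LIT), with chunks modeled as hashable items.

def prf(gold, pred):
    """gold, pred: sets of (span, label) chunk items."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {(("kicked", "the", "bucket"), "IDM")}
pred = {(("kicked", "the", "bucket"), "IDM"), (("make", "top"), "IDM")}
p, r, f = prf(gold, pred)
```

Here the spurious prediction make top lowers precision to 0.5 while recall stays at 1.0, mirroring how the system is penalized for identifying VNC tokens outside the gold set.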
5.3 Results
We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline that uses no features, just the tokenized words (TOK). The FREQ baseline simply tags all identified VNC tokens in the data set as idiomatic. It is worth noting that this baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying VNC MWE tokens that are not in the original data set.6

3http://www.natcorp.ox.ac.uk
4A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only because it is the metric used by previous work.
In Table 2 we present the results per feature and per condition. We first experimented with different context sizes to determine the optimal window size for our learning framework; these results are presented in Table 1. Once the window size was determined, we proceeded to add features.

Noting that a window size of ±3 yields the best results, we use it as our context size for the following experimental conditions. We do not further discuss accuracy, since it is above 96% for all our experimental conditions.
All the results yielded by our experiments outperform the FREQ baseline. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic FREQ baseline, with an overall F-measure of 77.04%.
Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost over the TOK baseline of 2.67 absolute F points in overall performance.
Confirming previous observations in the literature, the overall best results are obtained using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.
Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.
6 Discussion
The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features yields the best results. In general, performance on the classification and identification of idiomatic expressions is much better, which may be due to the fact that the data has many more idiomatic token examples for training. We also note that precision scores are significantly higher than recall scores, especially on literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

6We could easily have identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set.
We analyzed some of the errors produced in our best condition, NEI+L+N+P. The biggest source of error is the identification of other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see idiomatic cases confused with literal ones 23 times, and the opposite 4 times.
Some of the errors are cases where the VNC should have been classified as literal but the system classified it as idiomatic: kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal occur for the MWE types hit the road, blow trumpet, blow whistle, and hit a wall.
The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data, while making his mark is newly identified as idiomatic in this context.
The system also identified hit the post as a new literal MWE VNC token in As the ball hit the post the referee blew the whistle, where blew the whistle is likewise a literal VNC in this context.
7 Conclusion
In this study we explored a set of features that contribute to the binary supervised classification of VNC token expressions. The use of NER significantly improves the performance of the system, and using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58%. In the future we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.
     IDM-F  LIT-F  Overall F  Overall Acc
±1   77.93  48.57  71.78      96.22
±2   85.38  55.61  79.71      97.06
±3   86.99  55.68  81.25      96.93
±4   86.22  55.81  80.75      97.06
±5   83.38  50.00  77.63      96.61

Table 1: Results (in %) for varying context window sizes.
                IDM-P  IDM-R  IDM-F  LIT-P  LIT-R  LIT-F  Overall F
FREQ            70.02  89.16  78.44  0      0      0      69.68
TOK             81.78  83.33  82.55  71.79  43.75  54.37  77.04
(L)EMMA         83.10  84.29  83.69  69.77  46.88  56.07  78.11
(N)GRAM         83.17  82.38  82.78  70.00  43.75  53.85  77.01
(P)OS           83.33  83.33  83.33  77.78  43.75  56.00  78.08
L+N+P           86.95  83.33  85.38  72.22  45.61  55.91  79.71
NES-Full        85.20  87.93  86.55  79.07  58.62  67.33  82.77
NES-Bin         84.97  82.41  83.67  73.49  52.59  61.31  79.15
NEI             89.92  85.18  87.48  81.33  52.59  63.87  82.82
NES-Full+L+N+P  89.89  84.92  87.34  76.32  50.00  60.42  81.99
NES-Bin+L+N+P   90.86  84.92  87.79  76.32  50.00  60.42  82.33
NEI+L+N+P       91.35  88.42  89.86  81.69  50.00  62.03  84.58

Table 2: Final results (in %) averaged over the 5 folds of test data, using different features and their combinations.
8 Acknowledgement
The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments of two anonymous reviewers, which helped make this publication more concise and better presented.
References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89-96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329-336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41-48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96-103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expression tokens. In CICLING.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992-1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353-360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317-324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369-379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97-108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33-40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1-15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100-108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754-762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, Czech Republic, June. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23-30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de
Abstract
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.
1 Introduction
Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:
(1) Der Rat sollte unsere Position berücksichtigen.
    the Council should our position consider

(2) The Council should take account of our position.
This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that they identify various MWE patterns.
In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.
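The two-step monolingual pipeline contrasted here can be sketched as follows; the choice of pointwise mutual information (PMI) as the association measure and the toy verb-object pairs are our assumptions, not the cited authors' setup:

```python
# Sketch: extract verb-object candidate pairs and rank them by PMI,
# one of many possible association measures.
import math
from collections import Counter

def pmi_ranking(pairs):
    """pairs: list of (verb, noun) co-occurrence tokens.
    Returns the distinct pairs sorted by PMI, highest first."""
    pair_c = Counter(pairs)
    verb_c = Counter(v for v, _ in pairs)
    noun_c = Counter(n for _, n in pairs)
    total = len(pairs)

    def pmi(pair):
        v, n = pair
        # log2( P(v, n) / (P(v) * P(n)) )
        return math.log2(pair_c[pair] * total / (verb_c[v] * noun_c[n]))

    return sorted(pair_c, key=pmi, reverse=True)

pairs = [("kick", "bucket"), ("kick", "bucket"), ("kick", "ball"),
         ("make", "decision"), ("make", "ball")]
ranking = pmi_ranking(pairs)
```

Pairs whose components rarely occur apart (here make decision) float to the top, while free combinations like kick ball sink; the crosslingual method proposed in this paper avoids committing to any such measure.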
The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider as MWEs those tokens in a target language which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered some kind of lexically fixed entity. By this means, we do not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for each item in a target language (and whether these one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches, which need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.
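The core data-driven definition can be made concrete with a small sketch: given word-alignment links as (source index, target index) pairs, a source token linked to two or more target tokens yields an MWE candidate. The alignment representation and tokenization below are simplifying assumptions:

```python
# Sketch: collect one-to-many translation candidates from alignment links.
from collections import defaultdict

def one_to_many(alignment, src_tokens, tgt_tokens):
    """alignment: iterable of (src_idx, tgt_idx) links.
    Returns {source token: list of aligned target tokens} for source
    tokens aligned to more than one target token."""
    links = defaultdict(list)
    for s, t in alignment:
        links[s].append(t)
    return {src_tokens[s]: [tgt_tokens[t] for t in sorted(ts)]
            for s, ts in links.items() if len(ts) > 1}

# Example sentence pair (1)/(2) from the text:
src = ["Der", "Rat", "sollte", "unsere", "Position", "beruecksichtigen"]
tgt = ["The", "Council", "should", "take", "account", "of", "our", "position"]
alignment = [(0, 0), (1, 1), (2, 2), (5, 3), (5, 4), (5, 5), (3, 6), (4, 7)]
candidates = one_to_many(alignment, src, tgt)
```

On this example, the only candidate is berücksichtigen mapping to take account of, exactly the correspondence the paper targets; the later sections add syntactic filters on top of such raw links.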
In this work we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, being based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in the other language (see Section 4 for discussion).
Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.
2 Multiword Translations as MWEs
The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                1-1   1-n   n-1   n-n   No
anheben (v1)        53.5  21.2  9.2   1.6   32.5
bezwecken (v2)      16.7  51.3  0.6   31.3  15.0
riskieren (v3)      46.7  35.7  0.5   1.7   18.2
verschlimmern (v4)  30.2  21.5  28.6  44.5  27.5

Table 1: Proportions of types of translational correspondences (token level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas, extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object, and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).
For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions of the types of their translational correspondences. While the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) and verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.
A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations found in our gold standard.
Morphological variations. This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically:
(3) Sie verschlimmern die Übel.
    they aggravate the misfortunes

(4) Their action is aggravating the misfortunes.
Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below:
(5) Der Ausschuss bezweckt, den Institutionen ein politisches Instrument an die Hand zu geben.
    the committee intends the institutions a political instrument at the hand to give

(6) The committee set out to equip the institutions with a political instrument.
Verb preposition combinations. While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework:
(7) Sie werden den Treibhauseffekt verschlimmern.
    they will the greenhouse effect aggravate

(8) They will add to the greenhouse effect.
Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. An LVC can nevertheless be considered an idiosyncratic combination, since LVCs exhibit specific lexical restrictions (Sag et al., 2002):
(9) Ich werde die Sache nur noch verschlimmern.
    I will the thing only just aggravate

(10) I am just making things more difficult.
Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to the gold standard's limited size and the source items we chose:
(11) Sie bezwecken die Umgestaltung in eine zivile Nation.
     they intend the conversion into a civil nation

(12) They have in mind the conversion into a civil nation.
        v1       v2       v3       v4
Ntype   22 (26)  41 (47)  26 (35)  17 (24)
V Part  22.7     4.9      0.0      0.0
V Prep  36.4     41.5     3.9      5.9
LVC     18.2     29.3     88.5     88.2
Idiom   0.0      2.4      0.0      0.0
Para    36.4     24.3     11.5     23.5

Table 2: Proportions (%) of MWE types per lemma.
Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below:
(13) Wir brauchen bessere Zusammenarbeit, um die Rückzahlungen anzuheben.
     we need better cooperation to the repayments(OBJ) increase

(14) We need greater cooperation in this respect to ensure that repayments increase.
Table 2 displays the proportions of the MWE categories over the number of types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variation only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, the different verb lemmas again show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) and bezwecken (en. intend) are frequently translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. The more specific LVC patterns also differ largely among the verbs: while verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs, which do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).
In general, one-to-many translational correspondences seem to provide a very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.
3 Multiword Translation Detection
This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.
3.1 One-to-many Alignments
To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, which uses the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the rate of translational correspondences on the type level that the system discovers against the one-to-many translations in our gold standard. By type, we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the
verb    n > 0           n > 1           n > 3
        FPR     FNR     FPR     FNR     FPR     FNR
v1      0.97    0.93    1.0     1.0     1.0     1.0
v2      0.93    0.9     0.5     0.96    0.0     0.98
v3      0.88    0.83    0.8     0.97    0.67    0.97
v4      0.98    0.92    0.8     0.92    0.34    0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.
translation types is so low that already at a threshold of 3, almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that correspond to the source token is rare.
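The type-level evaluation described above can be sketched as follows. The helper name and the toy data are ours, not the paper's code; a "type" is modeled as the frozenset of lemmatized English tokens aligned to the German source lemma in one sentence pair.

```python
# Sketch of type-level FPR/FNR evaluation (illustrative, not the authors' code).
# A "type" is the frozenset of lemmatized English tokens aligned to the
# German source lemma in one sentence pair.

def type_level_rates(predicted_types, gold_types):
    """Return (false_positive_rate, false_negative_rate) over translation types."""
    predicted, gold = set(predicted_types), set(gold_types)
    false_pos = predicted - gold   # extracted types not in the gold standard
    false_neg = gold - predicted   # gold types the system never finds
    fpr = len(false_pos) / len(predicted) if predicted else 0.0
    fnr = len(false_neg) / len(gold) if gold else 0.0
    return fpr, fnr

# Toy example: the aligner proposes 4 types for a verb, 1 of which is gold.
pred = [frozenset({"increase"}), frozenset({"raise"}),
        frozenset({"go", "up"}), frozenset({"lift"})]
gold = [frozenset({"increase"}), frozenset({"ensure", "increase"})]
print(type_level_rates(pred, gold))  # (0.75, 0.5)
```

The frozenset representation makes the comparison insensitive to token order, matching the bag-of-words conversion used in the paper's evaluation.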
This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g., particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.
(15) Die  Korruption  behindert  die  Entwicklung
     The  corruption  impedes    the  development
(16) Corruption constitutes an obstacle to development.
Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.
3.2 Aligning Syntactic Configurations
High-precision detection of one-to-many translations thus involves two major problems: 1) How to identify sentences or configurations where reliable lexical correspondences can be found? 2) How to align target items that have a low occurrence correlation?
We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel. The verb's arguments directly correspond to each other and are all dominated by a verbal root node.
Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.
Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations, b) filtering the configurations, c) alignment post-editing, i.e., assembling the target tokens corresponding to the source item. The following paragraphs will briefly characterize these steps.
Figure 1: Example of a typical syntactic MWE configuration.
Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (DG, DE, AGE). DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. DE is the set of dependency triples of the target sentence. AGE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g., its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.
Filtering: Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in DG, there is a tuple (sn, tn) in AGE, i.e., every dependent has an alignment.
3. There is a target item t1 in DE such that for each tn there is a p ⊂ DE such that p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus, the target dependents have a common root.
To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition on the configuration not to contain a coordination relation. This Generate-and-Filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
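The filtering conditions above, together with the path-length restriction, can be sketched as follows. The data structures and helper names are our assumptions, not the authors' implementation: DG and DE are sets of (head, relation, dependent) triples, and AGE is a set of (source, target) alignment links.

```python
# Illustrative sketch of the Generate-and-Filter conditions (data structures
# and names are assumptions, not the authors' implementation).

from collections import defaultdict

def path_len(D_E, root, node, max_len):
    """Length of the dependency path root -> ... -> node in D_E, or None."""
    children = defaultdict(list)
    for head, _rel, dep in D_E:
        children[head].append(dep)
    frontier, depth = [root], 0
    while frontier and depth <= max_len:
        if node in frontier:
            return depth
        frontier = [c for h in frontier for c in children[h]]
        depth += 1
    return None

def valid_configuration(s1, D_G, D_E, A_GE, max_path=3):
    deps = [d for h, _rel, d in D_G if h == s1]          # condition 1
    if not deps:
        return None
    aligned = {}
    for sn in deps:                                      # condition 2
        targets = [t for s, t in A_GE if s == sn]
        if not targets:
            return None
        aligned[sn] = targets[0]
    heads = {h for h, _rel, _d in D_E}
    for t1 in heads:                                     # condition 3: common root
        if all(path_len(D_E, t1, aligned[sn], max_path) is not None
               for sn in deps):
            return t1
    return None

# Example (15)/(16): "behindert" with its subject and object arguments.
D_G = {("behindert", "SB", "Korruption"), ("behindert", "OA", "Entwicklung")}
D_E = {("constitutes", "SBJ", "corruption"), ("constitutes", "OBJ", "obstacle"),
       ("obstacle", "NMOD", "to"), ("to", "PMOD", "development")}
A_GE = {("Korruption", "corruption"), ("Entwicklung", "development")}
print(valid_configuration("behindert", D_G, D_E, A_GE))  # constitutes
```

In the example, only "constitutes" dominates both aligned arguments within the allowed path length, so it is returned as the common target root; the `max_path` cutoff plays the role of the path-length filter described above.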
Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in 3.2 was that all the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.
Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) in DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) is in AGE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti in T is an element of our target configuration:
1. Compute corr(s1, T), the correlation between s1 and T.
2. For each ti, tx such that there is a (ti, rel, tx) in DE, compute corr(s1, T + tx).
3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.
As the Dice coefficient is often found to give the best results, e.g., in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:
corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))
We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti in T, and freq(s1) accordingly. The observation frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
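The Dice-based post-editing step can be sketched as follows. Corpus access is mocked as a list of (source word set, target word set) sentence pairs, and the function names are ours; the frequencies follow the definitions just given.

```python
# A minimal sketch of Dice-based alignment post-editing (corpus access is
# mocked; names are illustrative, not the authors' code).

def dice(s1, T, corpus):
    """corr(s1, T) over a corpus of (source_words, target_words) pairs."""
    f_s = sum(1 for src, _tgt in corpus if s1 in src)
    f_T = sum(1 for _src, tgt in corpus if T <= tgt)       # T fully contained
    f_joint = sum(1 for src, tgt in corpus if s1 in src and T <= tgt)
    return 2 * f_joint / (f_s + f_T) if f_s + f_T else 0.0

def post_edit(s1, T, dep_triples, corpus):
    """Greedily absorb dependents of current target items that keep or raise the score."""
    T = set(T)
    for ti, _rel, tx in sorted(dep_triples):
        if ti in T and tx not in T:
            if dice(s1, T | {tx}, corpus) >= dice(s1, T, corpus):
                T.add(tx)
    return T

corpus = [({"behindert"}, {"constitutes", "an", "obstacle", "to"}),
          ({"behindert"}, {"impedes"}),
          ({"läuft"}, {"an", "runs"})]
result = post_edit("behindert", {"constitutes", "obstacle", "to"},
                   {("obstacle", "NMOD", "an")}, corpus)
print(sorted(result))  # ['an', 'constitutes', 'obstacle', 'to']
```

Here the article an is absorbed into the translation because adding it does not lower the Dice score of the pair, mirroring step 3 of the algorithm above.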
The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation for the English MWE run the risk of.
3.3 Evaluation
Our translational approach to MWE extraction bears the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.
We present two experiments in this evaluation section. We will first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.
Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma don't figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that doesn't constrain the syntactic type of the parallel argument relations. While not decreasing the FNR, the FPR decreases significantly. This means that the syntactic filters mainly fire on noisy configurations and don't decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down
the number of candidate configurations that our algorithm detects. As the dependency parses don't provide deep analyses for tense or control phenomena, very often a verb's arguments don't figure as its syntactic dependents, and no configuration is found. Future work will explore the impact of deep syntactic analysis for the detection of translational correspondences.
MWE evaluation: In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. Therefore, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.
The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.
4 Related Work
The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was
        Strict Filter       Relaxed Filter
        FPR     FNR         FPR     FNR
v1      0.0     0.96        0.5     0.96
v2      0.25    0.88        0.47    0.79
v3      0.25    0.74        0.56    0.63
v4      0.0     0.875       0.56    0.84

Table 4: False positive and false negative rate of one-to-many translations.
Trans. type       Proportion
MWEs              57.5%
  V. Part.        8.2%
  V. Prep.        51.8%
  LVC             32.4%
  Idiom           10.6%
Paraphrases       24.4%
Alternations      1.0%
Noise             17.1%

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.
established by (Wu, 1997). Our way to use syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.
Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches don't rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.
By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Padó, 2009).
The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already
been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of a simplex realization in a source language to an idiosyncratic multiword realization in the target language.
5 Conclusion
We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for monolingual interpretation of MWEs.
Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.
References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, pp. 311–326.

Katrin Erk and Sebastian Padó. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38, INESC-ID, Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
A Re-examination of Lexical Association Measures
Abstract
We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions Specifically we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms We empirically validate the effectiveness of these modified AMs in detection tasks in English performed on the Penn Treebank which shows significant improvement over the original AMs
1 Introduction
Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ2, and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ2, log-likelihood
ratio, and the T score. In Pearce (2002), AMs such as Z score, Pointwise MI, cost reduction, left and right context entropy, and odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ2, and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.
While the previous works have evaluated AMs, they offer few details on why the AMs perform as they do. Such an analysis is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.
We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the
Hung Huu Hoang
Dept. of Computer Science
National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
mobility of noun phrase complements in transitive VPCs. They can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).
In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.
2 Classification of Association Measures
Although different AMs have different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure this association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ2 test.
Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then
follows that contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out, give up. A close approximation stipulates that contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.
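The two Class II ideas above, context diversity via entropy and phrase-vs-constituent context similarity, can be sketched with toy counts. The helper names and the example context vectors are ours, purely for illustration.

```python
# Sketch of the two Class II ideas: context-diversity entropy and context
# similarity (toy counts and helper names are illustrative, not from the paper).

import math
from collections import Counter

def context_entropy(context_counts):
    """Entropy of the context distribution; higher = more diverse contexts,
    which under the Class II assumption suggests compositionality."""
    total = sum(context_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in context_counts.values())

def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

phrase_ctx = Counter({"menu": 4, "mustard": 3})          # contexts of "hot dog"
head_ctx = Counter({"bark": 5, "leash": 2, "menu": 1})   # contexts of "dog"
print(round(cosine(phrase_ctx, head_ctx), 2))  # low similarity: non-compositional
```

A low cosine between the phrase's contexts and its head constituent's contexts is exactly the signal the second Class II approach looks for.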
Table 1 lists all AMs used in our discussion
The lower left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
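To make the contingency-cell notation concrete, here is a small sketch computing a few of the Table 1 measures from the cells a, b, c, d of a bigram (x, y). The helper and the toy counts are ours, not the authors' code.

```python
# Sketch: computing a few Table 1 measures from contingency cells a, b, c, d
# (illustrative helper, not the authors' code).

import math

def measures(a, b, c, d):
    N = a + b + c + d
    fx, fy = a + b, a + c              # marginals f(x*), f(*y)
    pmi = math.log(N * a / (fx * fy))  # [M4] Pointwise MI
    return {
        "joint_prob": a / N,               # [M1]
        "pmi": pmi,                        # [M4]
        "local_pmi": a * pmi,              # [M5]
        "jaccard": a / (a + b + c),        # [M11]
        "odds_ratio": (a * d) / (b * c),   # [M18]
        # [M29]-style normalization by NFMax = max(P(x*), P(*y))
        "pmi_nfmax": pmi / (max(fx, fy) / N),
    }

m = measures(a=30, b=10, c=5, d=955)
print(round(m["jaccard"], 3))   # 0.667
print(m["odds_ratio"])          # 573.0
```

Note how the normalized variant divides PMI by the larger marginal probability, which is the paper's strategy for damping the boost that PMI gives to low-frequency constituents.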
3 Evaluation
We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.
3.1 Evaluation Data
In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.
For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,
Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.

M1  Joint Probability: f(xy) / N
M2  Mutual Information (MI): (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3  Log-likelihood ratio: 2 Σ_ij f_ij log(f_ij / f̂_ij)
M4  Pointwise MI (PMI): log [P(xy) / (P(x*) P(*y))]
M5  Local-PMI: f(xy) × PMI
M6  PMI^k: log [N f(xy)^k / (f(x*) f(*y))]
M7  PMI²: log [N f(xy)² / (f(x*) f(*y))]
M8  Mutual Dependency: log [P(xy)² / (P(x*) P(*y))]
M9  Driver-Kroeber: a / √((a+b)(a+c))
M10 Normalized expectation: 2a / (2a + b + c)
M11 Jaccard: a / (a + b + c)
M12 First Kulczynski: a / (b + c)
M13 Second Sokal-Sneath: a / (a + 2(b + c))
M14 Third Sokal-Sneath: (a + d) / (b + c)
M15 Sokal-Michiner: (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
M17 Hamann: ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds ratio: ad / bc
M19 Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q: (ad − bc) / (ad + bc)
M21 Brawn-Blanquet: a / max(a+b, a+c)
M22 Simpson: a / min(a+b, a+c)
M23 S cost: log(1 + min(b, c)/(a + 1))^(−1/2)
M24 Adjusted S cost (*): log(1 + max(b, c)/(a + 1))^(−1/2)
M25 Laplace: (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace (*): (a + 1) / (a + max(b, c) + 2)
M27 Fager: [M9] − (1/2) max(b, c)
M28 Adjusted Fager (*): [M9] − max(b, c) / √(aN)
M29 Normalized PMIs (*): PMI / NF(α), PMI / NFMax
M30 Simplified normalized PMI for VPCs (*): log(ad) / (α·b + (1−α)·c)
M31 Normalized MIs (*): MI / NF(α), MI / NFMax

NF(α) = α·P(x*) + (1−α)·P(*y), α ∈ [0, 1]; NFMax = max(P(x*), P(*y)).

Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(x ȳ), c = f21 = f(x̄ y), d = f22 = f(x̄ ȳ); the marginals are f(x*) = a + b and f(*y) = a + c. Here, w̄ stands for all words except w, * stands for all words, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.
permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.
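The VPC candidate search described above, locating a particle and then scanning left for the nearest verb within a bounded gap, can be sketched as follows. The particle list and POS tags are illustrative stand-ins for the paper's 38-particle set and the WSJ tagset.

```python
# Sketch of the VPC candidate search: find a particle, then scan left for the
# nearest verb, allowing an intervening NP of up to max_gap words.
# PARTICLES is a small illustrative subset, not the paper's full set of 38.

PARTICLES = {"up", "out", "on", "off", "down", "in"}

def vpc_candidates(tagged, max_gap=5):
    """tagged: list of (token, POS) pairs; returns (verb, particle) pairs."""
    pairs = []
    for i, (tok, _pos) in enumerate(tagged):
        if tok not in PARTICLES:
            continue
        # nearest verb within positions i-1 .. i-(max_gap+1), clipped at 0
        for j in range(i - 1, max(-1, i - 2 - max_gap), -1):
            if tagged[j][1].startswith("VB"):
                pairs.append((tagged[j][0], tok))
                break
    return pairs

sent = [("take", "VB"), ("your", "PRP$"), ("hat", "NN"), ("off", "RP")]
print(vpc_candidates(sent))  # [('take', 'off')]
```

The LVC search would mirror this with a rightward (and fallback leftward) scan for the nearest noun, using the gap of 4 words stated above.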
Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in (Baldwin, 2005), and 464 annotated LVC candidates used in (Tan et al., 2006). Both sets of annotations give both positive and negative examples.
Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                      VPC data      LVC data    Mixed
Total (freq ≥ 6)      413           100         513
Positive instances    117 (28.33%)  28 (28%)    145 (23.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.
3.2 Evaluation Metric
To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
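The AP computation described above can be sketched as follows (an illustration, not the authors' code; the input is assumed to be gold labels sorted by descending AM score):

```python
def average_precision(ranked_labels):
    """AP: the mean of the precision values measured at each true positive,
    scanning the AM-ranked candidate list from best score to worst."""
    hits, precisions = 0, []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Unlike n-best precision, AP needs no threshold or choice of n, since it averages over every recall level at once.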
3.3 Evaluation Results
Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006), when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that hypothesis tests rely on.
4 Rank Equivalence
We note that some AMs, although not mathematically equivalent (i.e., assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡ᵣC M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:
Corollary: If M1 ≡ᵣC M2, then APC(M1) = APC(M2),
where APC(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the following three AMs: Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with the simplest representative from each rank equivalent group.
¹ Two AMs may be rank equivalent with the exception of some candidates for which one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence, we still consider such AM pairs to be rank equivalent.
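A brute-force check of the definition over a finite candidate set can be sketched as follows (illustrative only; `m1` and `m2` are scoring functions over contingency tuples, and the names are our own):

```python
def rank_equivalent(m1, m2, candidates):
    """Check M1 and M2 are rank equivalent over C: every candidate pair
    must be ordered (or tied) identically by both measures."""
    s1 = [m1(c) for c in candidates]
    s2 = [m2(c) for c in candidates]
    return all(
        (s1[i] > s1[j]) == (s2[i] > s2[j]) and (s1[i] == s1[j]) == (s2[i] == s2[j])
        for i in range(len(candidates)) for j in range(i + 1, len(candidates))
    )

# Odds ratio ad/(bc) and Yule's Q = (ad - bc)/(ad + bc) are monotone
# transforms of one another, hence rank equivalent (row 5 of Table 3).
odds = lambda t: (t[0] * t[3]) / (t[1] * t[2])
yule_q = lambda t: (t[0] * t[3] - t[1] * t[2]) / (t[0] * t[3] + t[1] * t[2])
```

Since Q = (OR − 1)/(OR + 1) is increasing in OR, the pairwise orderings never disagree, which is exactly what the check verifies.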
1) [M2] Mutual Information; [M3] Log likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (a.k.a. Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (a.k.a. Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
5 Examination of Association Measures
We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.
5.1 Mutual Information and Pointwise Mutual Information
In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in
[Figure 1a: AP profile of AMs examined over our VPC data set]
[Figure 1b: AP profile of AMs examined over our LVC data set]

Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets (y-axis: AP, from .1 to .6). Bold points indicate AMs discussed in this paper. △: hypothesis test AMs; ◇: Class I AMs excluding hypothesis test AMs; +: context-based AMs.
identifying LVCs. In this section, we show how their performances can be improved significantly.
Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U, V) = Σ_{u,v} p(uv) · log [ p(uv) / (p(u∗) · p(∗v)) ]

In the context of bigrams, the above formula can be simplified to [M2] MI = (1/N) · Σ_{i,j} f_ij · log(f_ij / f̂_ij). While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log [ P(xy) / (P(x∗) · P(∗y)) ] = log [ N · f(xy) / (f(x∗) · f(∗y)) ]

It has long
been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, either by raising it to some power k (Daille, 1994) or by multiplying PMI with f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying PMI directly by f(xy) reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs            LVCs            Mixed
Best [M6] PMI^k    54.7 (k = .13)  57.3 (k = .85)  54.4 (k = .32)
[M4] PMI           51.0            56.6            51.5
[M5] Local-PMI     25.9            39.3            27.2
[M1] Joint Prob.   17.0            28.0            17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.
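Under the contingency-table notation of Table 1, PMI and the two variants compared in Table 4 can be sketched as follows (an illustrative reading of the formulas, not the authors' implementation):

```python
import math

def pmi(a, b, c, d):
    """[M4] PMI = log(N * f(xy) / (f(x*) * f(*y))) from contingency counts."""
    n = a + b + c + d
    return math.log(n * a / ((a + b) * (a + c)))

def pmi_k(a, b, c, d, k):
    """[M6] PMI^k (Daille, 1994): raise the co-occurrence count to the power k."""
    n = a + b + c + d
    return math.log(n * a**k / ((a + b) * (a + c)))

def local_pmi(a, b, c, d):
    """[M5] Local-PMI: weight PMI by the co-occurrence frequency f(xy)."""
    return a * pmi(a, b, c, d)
```

With k = 1, PMI^k reduces to plain PMI; values of k below 1 damp the influence of f(xy), which is the direction Table 4 favors.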
Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X, Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and PMI(x, y) ≤ min(−log P(x∗), −log P(∗y)).
These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α·P(x∗) + (1−α)·P(∗y) with α ∈ [0, 1], and NFmax = max(P(x∗), P(∗y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks respectively. If one re-writes MI as MI = (1/N) · Σ_{i,j} f_ij · PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.
AM            VPCs            LVCs            Mixed
MI / NF(α)    50.8 (α = .48)  58.3 (α = .47)  51.6 (α = .5)
MI / NFmax    50.8            58.4            51.8
[M2] MI       27.3            43.5            28.9
PMI / NF(α)   59.2 (α = .8)   55.4 (α = .48)  58.8 (α = .77)
PMI / NFmax   56.5            51.7            55.6
[M4] PMI      51.0            56.6            51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
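The two normalizations compared in Table 5 can be sketched in a few lines (argument names are illustrative; here `alpha=None` selects the NFmax variant):

```python
import math

def normalized_pmi(a, b, c, d, alpha=None):
    """PMI / NF(alpha), where NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y);
    when alpha is None, divide by NFmax = max(P(x*), P(*y)) instead."""
    n = a + b + c + d
    px, py = (a + b) / n, (a + c) / n          # P(x*), P(*y)
    pmi = math.log((a / n) / (px * py))
    nf = max(px, py) if alpha is None else alpha * px + (1 - alpha) * py
    return pmi / nf
```

Dividing by a marginal probability stretches the scores of candidates with frequent constituents (small PMI ceiling) relative to those with rare ones, which is what makes the scores comparable across candidates.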
5.2 Penalization Terms
It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulae whose values are inversely related to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.
Take as an example the AMs [M21] Brawn-Blanquet (a.k.a. Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except
for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S Cost and [M25] Laplace, where we tried replacing the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.
AM                      VPCs  LVCs  Mixed
[M21] Brawn-Blanquet    47.8  57.8  48.6
[M22] Simpson           24.9  38.2  26.0
[M24] Adjusted S Cost   48.5  57.7  49.2
[M23] S Cost            24.9  38.8  26.0
[M26] Adjusted Laplace  48.6  57.7  49.3
[M25] Laplace           24.1  38.8  25.4

Table 6: Replacing min() with max() in selected AMs.
In the [M27] Fager AM, the penalization term ½·max(b, c) is subtracted from the first term, which is none other than an AM rank equivalent to [M7] PMI². In our application, this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½·max(b, c). In order to make use of the first term, we need to replace the constant ½ by a scaled-down version of max(b, c). We have approximately derived 1/(aN) as a lower bound estimate of max(b, c) using the independence assumption, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                    VPCs  LVCs  Mixed
[M28] Adjusted Fager  56.4  54.3  55.4
[M27] Fager           55.2  43.9  52.5

Table 7: Performance of Fager and its adjusted version.
The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. It is therefore necessary to scale b and c properly, as in the weighted variant −α·b − (1−α)·c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14] PMI / (α·b + (1−α)·c). The denominator of [MR14] is obtained by removing the minus signs from the scaled [M14], so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α·P(x∗) + (1−α)·P(∗y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically simplified [MR14] to the following AM: [M30] log(ad) / (α·b + (1−α)·c). The setting of α in Table 8 below is taken from the best α setting obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI, and better than Third Sokal-Sneath, over the VPC data set.
AM                               VPCs           LVCs            Mixed
PMI / NF(α)                      59.2 (α = .8)  55.4 (α = .48)  58.8 (α = .77)
[M30] log(ad) / (α·b + (1−α)·c)  60.0 (α = .8)  48.4 (α = .48)  58.8 (α = .77)
[M14] Third Sokal-Sneath         56.5           45.3            54.6

Table 8: AP performance of suggested VPCs' penalization terms and AMs.
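The [M30] measure from Table 8 reduces to a single line over the contingency counts (a sketch of the formula as stated, not the authors' code):

```python
import math

def m30(a, b, c, d, alpha):
    """[M30]: log(a*d) over the weighted penalization alpha*b + (1-alpha)*c."""
    return math.log(a * d) / (alpha * b + (1 - alpha) * c)
```

The denominator penalizes frequent constituents directly, with α controlling how much of the penalty falls on the verb frequency b versus the particle frequency c.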
With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α·c^(1−α). The best α value is also taken from the normalized PMI experiments (Table 5), and is nearly .5. Under this setting, this penalization term is exactly the denominator of [M18] Odds ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.
AM                VPCs  LVCs  Mixed
−b − c            56.5  45.3  54.6
1/(bc)            50.2  53.2  50.2
[M18] Odds ratio  44.3  56.7  45.6

Table 9: AP performance of suggested LVCs' penalization terms and AMs.
6 Conclusions
We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II). We find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.
We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.
We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α·b + (1−α)·c is a good penalization term for VPCs, while b^α·c^(1−α) is suitable for LVCs. Our introduced tuning parameter α should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.
Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).
References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, M. Yen and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a Multilingual Context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.
Appendix A List of particles used in identifying verb particle constructions about aback aboard above abroad across adrift ahead along apart around aside astray away back backward backwards behind by down forth forward forwards in into off on out over past round through to together under up upon without
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R Mahesh K Sinha
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India. rmk@iitk.ac.in
Abstract
A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multi-word expression with a meaning that is distinct from the meaning of its light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.
1 Introduction
Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multi-word expression (MWE) where a noun, a verb, or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, and summarization. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007), and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.
In this paper, we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation, and results.
2 Complex Predicates in Hindi
A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective, or an adverb as the main predicator, followed by a light verb (LV). Thus, a CP can be a noun+LV, an adjective+LV, a verb+LV, or an adverb+LV. Further, it is also possible that a CP is followed by a LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.
Given below are some examples.

(1) CP = noun+LV; noun = ashirwad (blessings), LV = denaa (to give)
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
he blessed me

(2) No CP
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
he gave me a book

In (1), the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings), makes a complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2), the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush (happy), LV = karanaa (to do)
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
he pleased me

Here, the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa (to read), LV = lenaa (to take)
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
he has read the book

Here, the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases, the light verb acts as an aspectual, a modal, or an intensifier.

(5) CP = verb+LV; verb = phaadanaa (to tear), LV = denaa (to give)
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
he has torn the book

Here, the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa (to give), LV = maaranaa (to hit, to kill)
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
he threw the book

Here, the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning. This may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV; CP = vaapas karanaa (to return), LV = denaa (to give)
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
he returned the book

Here, two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs: vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretation in the two cases remains the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.
From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that, in the translated text, the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.
3 System Design
As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb's meaning in the aligned English sentence. The steps involved are as follows:
1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or its morphological derivative is found, then search for the equivalent English meanings of any of its morphological forms (Figure 2) in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit, i.e., go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), ignore it and continue scanning.
      v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', exit, i.e., go to step (a); else the identified CP is W+LV.
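The per-sentence loop (steps i–vi) can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it matches surface tokens only, whereas the actual system matches all morphological forms on both sides, and all names here are our own.

```python
def mine_cps(hi_sentence, en_sentence, light_verbs, stop_words, exit_words):
    """Hypothesize CPs: a light verb whose literal English meaning is absent
    from the aligned English sentence, paired with the nearest non-stop word
    to its left (provided that word is not an 'exit word')."""
    cps = []
    hi_tokens = hi_sentence.split()
    en_tokens = set(en_sentence.lower().split())
    for lv, en_forms in light_verbs.items():
        for k, token in enumerate(hi_tokens):
            if token != lv:
                continue                          # step (i): locate the LV at position K
            if en_tokens & en_forms:
                continue                          # steps (ii)-(iii): meaning present -> not a CP
            j = k - 1
            while j >= 0 and hi_tokens[j] in stop_words:
                j -= 1                            # step (iv): skip stop words
            if j >= 0 and hi_tokens[j] not in exit_words:
                cps.append((hi_tokens[j], lv))    # step (vi): CP = W + LV
    return cps
```

On example (1), the English side "he blessed me" contains no form of "give", so (ashirwad, diyaa) is hypothesized; on example (2), "gave" appears, so no CP is proposed.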
Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' is used in several other contexts in English. Such meanings are shown within parentheses and are not used for matching.
light verb (base form)   meaning of the root verb
baithanaa बैठना          sit
bananaa बनना             make/become/build/construct/manufacture/prepare
banaanaa बनाना           make/build/construct/manufacture/prepare
denaa देना               give
lenaa लेना               take
paanaa पाना              obtain/get
uthanaa उठना             rise/arise/get up
uthaanaa उठाना           raise/lift/wake up
laganaa लगना             feel/appear/look/seem
lagaanaa लगाना           fix/install/apply
cukanaa चुकना            (finish)
cukaanaa चुकाना          pay
karanaa करना             (do)
honaa होना               happen/become/be
aanaa आना                come
jaanaa जाना              go
khaanaa खाना             eat
rakhanaa रखना            keep/put
maaranaa मारना           kill/beat/hit
daalanaa डालना           put
haankanaa हाँकना         drive

Table 1: Some of the common light verbs in Hindi.
For each Hindi light verb, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations of the same.

There are a number of words that can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. These are words that denote negation or are emphasizers, intensifiers, interrogative pronouns, or particles. A list of the stop words used in the experimentation is given in Figure 3.
Figure 1(a): Morphological derivatives of the sample Hindi light verb 'jaanaa' जाना (to go)

Figure 1(b): Morphological derivatives of the sample Hindi light verb 'lenaa' लेना (to take)

Figure 2: Morphological derivatives of sample English meanings
We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions, and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list. However, this list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform parts-of-speech (POS) tagging, and so the nature of the word preceding the LV is unknown to the system.
English word: sit. Morphological derivations: sit, sits, sat, sitting.
English word: give. Morphological derivations: give, gives, gave, given, giving.
…
LV: jaanaa जाना (to go). Morphological derivatives: jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgaa, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgii, jaakara:
जा (go, stem); जाए (go, imperative); जाओ (go, imperative); जाएँ (go, imperative); जाऊँ (go, first person); जाने (go, infinitive oblique); जाना (go, infinitive masculine singular); जानी (go, infinitive feminine singular); जाता (go, indefinite masculine singular); जाती (go, indefinite feminine singular); जाते (go, indefinite masculine plural/oblique); जानीं (go, infinitive feminine plural); जातीं (go, indefinite feminine plural); जाओगे (go, future masculine singular); जाओगी (go, future feminine singular); गई (go, past feminine singular); जाऊँगा (go, future masculine first person singular); जायेगा (go, future masculine third person singular); जाऊँगी (go, future feminine first person singular); जायेगी (go, future feminine third person singular); गये (go, past masculine plural/oblique); गईं (go, past feminine plural); गया (go, past masculine singular); गयी (go, past feminine singular); जायेंगे (go, future masculine plural); जायेंगी (go, future feminine plural); जाकर (go, completion); …
LV lenaa लना to take Morphological derivatives le lii leM lo letaa letii lete liiM luuM legaa legii lene lenaa lenii liyaa leMge loge letiiM luuMgaa luuMgii lekara ल (take stem) ल (take past)
ल (take imperative) लो (take imperative)
लता (take indefinite masculine singular)
लती (take indefinite feminine singular)
लत (take indefinite masculine pluraloblique)
ल (takepastfeminineplural) ल(take first-person)
लगा(take future masculine third-personsingular)
लगी(take future feminine third-person singular)
लन (take infinitive oblique)
लना (take infinitive masculine singular)
लनी (take infinitive feminine singular)
लया (take past masculine singular)
लग (take future masculine plural)
लोग (take future masculine singular)
लती (take indefinite feminine plural)
लगा (take future masculinefirst-personsingular)
लगी (take future feminine first-person singular)
लकर (take completion) degdegdegdegdeg
Figure 3 Stop words in Hindi used by the system
Figure 4 A few exit words in Hindi used by the system
The inner loop of the procedure identifies the multiple CPs that may be present in a sentence; the outer loop mines the CPs in the entire corpus. The experimentation and results are discussed in the following section.
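The two-loop mining procedure described above can be sketched as follows. The lists here are tiny stand-ins (the real system uses the full light-verb form lists of Figure 1 and the stop/exit word lists of Figures 3 and 4), and all function and variable names are our own, not the paper's:

```python
# Sketch of the CP mining procedure: outer loop over the corpus,
# inner loop over light-verb occurrences in each sentence.
# The three word lists below are illustrative stand-ins only.
LIGHT_VERB_FORMS = {"jaataa", "jaatii", "letaa", "letii"}  # forms of jaanaa/lenaa
STOP_WORDS = {"nahiiM", "bhii", "hii"}   # negation/emphasizer particles: skipped over
EXIT_WORDS = {"ne", "ko", "kaa", "aur"}  # case markers/conjunctions: block the CP

def mine_cps(corpus):
    mined = []
    for sentence in corpus:                  # outer loop: whole corpus
        tokens = sentence.split()
        for i, tok in enumerate(tokens):     # inner loop: each light verb in sentence
            if tok not in LIGHT_VERB_FORMS:
                continue
            # scan left past any stop words to find the nominal part of the CP
            j = i - 1
            while j >= 0 and tokens[j] in STOP_WORDS:
                j -= 1
            # an exit word before the LV means this is not a CP
            if j >= 0 and tokens[j] not in EXIT_WORDS:
                mined.append((tokens[j], tok))   # (nominal, light verb)
    return mined

print(mine_cps(["kaam bhii jaataa", "ne jaataa"]))  # → [('kaam', 'jaataa')]
```

Note that, as the paper says, no POS check is made on the word preceding the LV; only the exit-word list filters bad candidates.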
4 Experimentation and Results
The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE English-Hindi parallel corpus (McEnery, Baker, Gaizauskas and Cunningham, 2000). A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%; the F-measure lies between 0.88 and 0.97. This is a remarkable and somewhat surprising result for such a simple methodology without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; theirs is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                 File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                    112     193     102      43     133     107
Total no. of CPs (N)                200     298     150      46     188     151
Correctly identified CPs (TP)       195     296     149      46     175     135
  of which V-V CPs                   56      63       9       6      15      20
Incorrectly identified CPs (FP)      17      44       7      11      16      20
Unidentified CPs (FN)                 5       2       1       0      13      16
Accuracy (%)                      97.50   99.33   99.33  100.00   93.08   89.40
Precision TP/(TP+FP) (%)          91.98   87.05   95.51   80.70   91.62   87.09
Recall TP/(TP+FN) (%)             97.50   98.33   99.33  100.00   93.08   89.40
F-measure 2PR/(P+R)               0.946   0.923   0.974   0.893   0.923   0.882

Table 2: Results of the experimentation
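The table's derived rows follow directly from the raw counts; as a check, recomputing the File 1 metrics from its TP/FP/FN counts:

```python
# Recomputing Table 2's metrics for File 1 from its raw counts
# (TP = 195, FP = 17, FN = 5) to make the formulas concrete.
tp, fp, fn = 195, 17, 5
precision = tp / (tp + fp)     # TP / (TP + FP) = 195/212
recall    = tp / (tp + fn)     # TP / (TP + FN) = 195/200
f_measure = 2 * precision * recall / (precision + recall)
print(round(100 * precision, 2), round(100 * recall, 2))  # → 91.98 97.5
```

This reproduces the reported 91.98% precision and 97.50% recall for File 1 (the F-measure comes out at about 0.947, matching the reported 0.946 up to rounding).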
Figure 4 content (exit words): ने (ergative case marker), को (accusative case marker), का (possessive case marker), के (possessive case marker), की (possessive case marker), से (from/by/with), में (in/into), पर (on/but), और (and, Hindi particle), तथा (and), या (or), लेकिन (but), परंतु (but), कि (that, Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा (my), मेरे (my), मेरी (my), तुम्हारा (your), तुम्हारे (your), तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना (own), अपनी (own), अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका (whose), जिसकी (whose), जिसके (whose), जिनको (to whom), जिनके (to whom)

Figure 3 content (stop words): नहीं (no/not), न (no/not, Hindi particle), भी (also, Hindi particle), ही (only, Hindi particle), तो (then, Hindi particle), क्यों (why), क्या (what, Hindi particle), कहाँ (where, Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरंभ में (in the beginning), अंत में (in the end), आखिर में (in the end)
Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मझ ब च क माता - पताओ क साथ काम करना भी अ छा लगता ह जो क अ सर सलाह लन आत ह - यह जानकार खशी होती ह क आप कसी की मदद कर सकत ह |
The CPs identified in the sentence: (i) काम करना (to work); (ii) अच्छा लगना (to feel good, enjoy); (iii) सलाह लेना (to seek advice); (iv) खुशी होना (to feel happy, good); (v) मदद करना (to help).

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जस लोग जो क ब च और उनक भ व य क बार म सोचत ह - इस समय भी हज़ार ब च को लाभ पहचा रह ह | अ छ आदश बनन क लए ऐस लोग म तब ता उ सराह और लगन ह और व एक समथन - यो य यवसाय की क करत ह |
The CPs identified in the sentence: (i) आदर्श बनना (to be a role model); (ii) क करना (to respect).
Here also, the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not being checked. However, the mining success rate obtained shows that these cases are few in practice. Use of the 'stop words' in allowing intervening words within the CPs helps a lot in improving the performance; similarly, use of the 'exit words' avoids a lot of incorrect identifications.
5 Conclusions
The simple empirical method for mining CPs outlined in this work yields on average 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. It can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both the alignment accuracy and the speed.
The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family
References

Anthony McEnery, Paul Baker, Rob Gaizauskas, and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek: A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni, and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3–4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen, and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King, and John T. Maxwell III. 2003. Complex Predicates via Restriction. Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553–564.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 47–54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions

Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2
1 Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
2 Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn, huangyun@comp.nus.edu.sg
Abstract
Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs could improve translation performance significantly.
1 Introduction
The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, the translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntax or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.
A multiword expression (MWE) can be considered as a word sequence with relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and different researchers give different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e. the property that the whole expression cannot be derived from its component words. Stanford University launched a MWE project1, in which different qualities of MWEs were presented. For bilingual multiword expressions (BiMWEs), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language, and (2) the source phrase and the target phrase must be translated to each other exactly, i.e. there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.
For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve the SMT quality. In their work, a bilingual MWE in the training corpus was grouped as

1 http://mwe.stanford.edu
one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving the SMT performance rather than improving the word alignment quality. In detail, the differences are listed as follows:
• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving the word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics only lead to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).
Besides the above differences there are someadvantages of our approaches
bull In our method automatically extractedMWEs are used as additional resources ratherthan as phrase-table filter Since bilingualMWEs are extracted according to noisy au-tomatic word alignment errors in word align-ment would further propagate to the SMT andhurt SMT performance
bull We conduct experiments on domain-specificcorpus For one thing domain-specific
2httpwwwstatmtorgmoses
corpus potentially includes a large numberof technical terminologies as well as moregeneral fixed expressions and idioms iedomain-specific corpus has high MWE cov-erage For another after the investigationcurrent SMT system could not effectivelydeal with these domain-specific MWEs es-pecially for Chinese since these MWEs aremore flexible and concise Take the Chi-nese term ldquo j Ntilde (rdquo for example Themeaning of this term is ldquosoften hard massand dispel pathogenic accumulationrdquo Ev-ery word of this term represents a specialmeaning and cannot be understood literallyor without this context These terms are dif-ficult to be translated even for humans letalone machine translation So treating theseterms as MWEs and applying them in SMTsystem have practical significance
bull In our approach no additional corpus is intro-duced We attempt to extract useful MWEsfrom the training corpus and adopt suitablemethods to apply them Thus it benefitsfor the full exploitation of available resourceswithout increasing great time and space com-plexities of SMT system
The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction to monolingual and bilingual MWE extraction.
2 Bilingual Multiword Expression Extraction

In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.
2.1 Automatic Extraction of MWEs
In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et al., 2005); (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007); and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.
Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".

Definition 1 (Unit): A unit is any sub-string of the given sentence. For example, "A B", "C", "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score): The score function is defined only on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit.3 For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select): The selecting operator finds the two adjacent units with maximum score in a list.

Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce unit "B C" and unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".

Initially, every word in the sentence is considered as one unit, and all these units form an initial list L; if the sentence is of length N, the list contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with maximum score in L, naming them U1 and U2; and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. The algorithm terminates on two conditions: (1) the maximum score after selection is less than a given threshold, or (2) L contains only one unit.

3 We use a stoplist to eliminate the units containing function words by setting their score to 0.
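The select/reduce loop can be sketched directly from these definitions. In this minimal sketch (our own naming, not the paper's code), the LLR values between adjacent words are assumed precomputed in a dictionary; in the real system they come from corpus counts, and a stoplist zeroes out scores around function words:

```python
# Minimal sketch of the LLR-based Hierarchical Reducing Algorithm (HRA).
# `llr` maps (left_word, right_word) -> precomputed LLR score (assumed given).
def hra(sentence, llr, threshold):
    units = sentence.split()          # initially, every word is a unit
    mwes = []                         # the final set S of extracted MWEs
    while len(units) > 1:
        # score of adjacent units = LLR(last word of left, first word of right)
        scores = [llr.get((units[i].split()[-1], units[i + 1].split()[0]), 0.0)
                  for i in range(len(units) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)   # select
        if scores[best] < threshold:  # termination condition (1)
            break
        merged = units[best] + " " + units[best + 1]             # reduce
        units[best:best + 2] = [merged]
        mwes.append(merged)
    return mwes

llr = {("B", "C"): 50.0, ("A", "B"): 30.0, ("C", "D"): 10.0}
print(hra("A B C D", llr, 20.0))  # → ['B C', 'A B C']
```

With these toy scores, "B C" is reduced first, then "A B C"; "C D" scores below the threshold, so the loop stops, mirroring how sub-strings above the cut-off in Figure 1 are extracted while the rest are not.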
Figure 1: Example of the Hierarchical Reducing Algorithm (the hierarchical structure of the example sentence, with the LLR values of adjacent words shown along the bottom: 1471, 67552, 10596, 0, 0, 8096)
Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is "防 高 脂 血 的 保健 茶"4. Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs ("高 脂", "高 脂 血", "保健 茶", "防 高 脂 血") are extracted, in that order, and the sub-strings above the dotted line in Figure 1 are not extracted.

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structure pattern of natural language. Furthermore, in the HRA, MWEs are measured with the minimum LLR of adjacent words in them, which gives lexical confidence to the extracted MWEs. Finally, if the given sentence has length N, the HRA terminates within N − 1 iterations, which is very efficient.

However, the HRA has a problem: it extracts substrings before extracting the whole string, even if the substrings appear only in that particular whole string, which we consider useless. To solve this problem, we use contextual features,

4 The whole sentence means "healthy tea for preventing hyperlipidemia"; word by word: 防 (preventing), 高 (hyper-), 脂 (-lipid-), 血 (-emia), 的 (for), 保健 (healthy), 茶 (tea).
contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.
2.2 Automatic Extraction of MWE Translations

In subsection 2.1 we described the algorithm for obtaining MWEs; in this subsection we introduce the procedure for finding their translations in the parallel corpus.

To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:
1. Run GIZA++5 to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs whose source language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).

After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features are: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features are: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.

5 http://www.fjoch.com/GIZA++.html
We select and annotate 33,000 phrase pairs randomly, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
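The classifier itself is a standard binary perceptron over the ten features above. A minimal sketch of the training loop (feature extraction omitted; the packed feature vectors and toy data here are our own illustration, not the paper's):

```python
# Sketch of perceptron training for accepting/rejecting candidate translations.
# Each example is (feature_vector, label) with label in {+1, -1}.
def train_perceptron(data, epochs=10):
    dim = len(data[0][0])
    w = [0.0] * dim   # one weight per feature
    b = 0.0           # bias term
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                         # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Toy data: first feature indicates a good pair, second a bad one.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
w, b = train_perceptron(data)
```

At test time a candidate pair is kept as a bilingual MWE when its weighted feature sum is positive.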
3 Application of Bilingual MWEs

Intuitively, bilingual MWEs should be useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.
3.1 Model Retraining with Bilingual MWEs

The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance was shown by many researchers (Wu et al., 2008; Eck et al., 2004); our method shows that making better use of the resources in hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
3.2 New Feature for Bilingual MWEs

Lopez and Resnik (2006) once pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source language phrase contains a MWE (as a substring) and the target language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
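The feature computation is a simple containment check over the extracted bilingual MWE list. A sketch, where `bimwe` is a hypothetical dictionary from source MWEs to their mined translations (pinyin placeholders stand in for the Chinese):

```python
# Sketch of the Baseline+Feat 0/1 feature: a phrase pair gets 1 iff its
# source side contains a known MWE and its target side contains that
# MWE's translation. `bimwe` is an assumed {source_mwe: translation} dict.
def mwe_feature(src_phrase, tgt_phrase, bimwe):
    for s, t in bimwe.items():
        if s in src_phrase and t in tgt_phrase:
            return 1
    return 0

bimwe = {"yao cha": "medicated tea"}  # placeholder entry
print(mwe_feature("zhi cheng yao cha", "made into medicated tea", bimwe))  # → 1
print(mwe_feature("zhi cheng yao cha", "made into tea", bimwe))            # → 0
```

In the log-linear model this feature receives its own weight, tuned together with the other features.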
3.3 Additional Phrase Table of Bilingual MWEs

Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As to probability assignment, we simply assign 1 to each of the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table: for each phrase in the input sentence, the decoder searches for candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to denote this method.
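Generating such a table is mechanical: each MWE pair becomes one entry with all translation scores set to 1. A sketch using the plain-text Moses phrase-table layout of "source ||| target ||| scores" (the exact number and order of score columns varies by Moses version, so treat the line format as an assumption; the function name is ours):

```python
# Sketch of Baseline+NewBP: turn extracted bilingual MWEs into phrase-table
# lines, assigning 1 to the four translation probabilities as the paper does.
def mwe_phrase_table_lines(bimwes):
    return [f"{src} ||| {tgt} ||| 1 1 1 1" for src, tgt in bimwes]

lines = mwe_phrase_table_lines([("yao cha", "medicated tea")])
print(lines[0])  # → yao cha ||| medicated tea ||| 1 1 1 1
```

The resulting file would then be listed as a second phrase table in the Moses configuration, so the decoder consults it alongside the original table.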
4 Experiments

4.1 Data

We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

For the traditional medicine domain, Table 1 shows the data statistics. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.

                          Chinese    English
Training  Sentences            120,355
          Words         4,688,873  4,737,843
Dev       Sentences              1,000
          Words            31,722     32,390
Test      Sentences              1,000
          Words            41,643     40,551

Table 1: Traditional medicine corpus

6 http://www.speech.sri.com/projects/srilm
For the chemical industry domain, Table 2 gives the details of the data. In this experiment, 59,466 bilingual MWEs are extracted.

                          Chinese    English
Training  Sentences            120,856
          Words         4,532,503  4,311,682
Dev       Sentences              1,099
          Words            42,122     40,521
Test      Sentences              1,099
          Words            41,069     39,210

Table 2: Chemical industry corpus
We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation there is only one reference per test sentence. We also perform statistical significance tests between pairs of translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.
4.2 MT Systems

We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system are: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.

Features are combined in the log-linear model. To obtain the best translation \hat{e} of the source sentence f, the log-linear model uses the following equation:

  \hat{e} = \arg\max_e p(e|f) = \arg\max_e \sum_{m=1}^{M} \lambda_m h_m(e, f)   (1)

in which h_m and \lambda_m denote the m-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.

7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm
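Equation (1) in miniature: the decoder keeps the candidate whose weighted feature sum is highest. The candidates, feature values and weights below are toy illustrations, not values from the paper:

```python
import math

# Pick the candidate translation e maximizing sum_m lambda_m * h_m(e, f).
def best_translation(candidates, weights):
    # candidates: {translation: [h_1(e,f), ..., h_M(e,f)]}
    def score(e):
        return sum(l * h for l, h in zip(weights, candidates[e]))
    return max(candidates, key=score)

# Two toy features: a log translation probability and a phrase-count penalty.
cands = {"tea": [math.log(0.5), -2.0],
         "medicated tea": [math.log(0.4), -1.0]}
print(best_translation(cands, [1.0, 0.5]))  # → medicated tea
```

Here "tea" has the higher translation probability, but the second feature's weight tips the decision, which is exactly the kind of effect the added MWE feature of Section 3.2 is meant to have.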
4.3 Results

Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain

Table 3 gives our experimental results. From this table we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next with a 0.17 BLEU point improvement, and Baseline+BiMWE achieves slightly higher translation quality than the baseline system.
To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs that were labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined as:

  Score(s, t) = \log(\mathrm{LLR}(s, t)) \times \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2 / (2\sigma^2)}   (2)

Here the mean \mu and variance \sigma^2 are estimated from the training set. After ranking by score, we select the top 50,000, 60,000 and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
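Equation (2) can be computed directly; the LLR value and the Gaussian parameters below are toy stand-ins (in the paper, mu and sigma are estimated from the training set):

```python
import math

# Equation (2): weight log(LLR) by a Gaussian density over the
# source/target length ratio x, so pairs with atypical length ratios sink.
def rank_score(llr, x, mu, sigma):
    gauss = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) \
            / (math.sqrt(2 * math.pi) * sigma)
    return math.log(llr) * gauss

# A pair whose length ratio sits at the mean outranks one far from it.
print(rank_score(100.0, 1.0, 1.0, 0.5) > rank_score(100.0, 2.0, 1.0, 0.5))  # → True
```

The design choice is that the Gaussian factor only rescales, never flips, the LLR-based confidence, so the ranking stays monotone in LLR for pairs with equal length ratios.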
From this table we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, it is not true that more bilingual MWEs always mean better phrase-based SMT performance. This conclusion is the same as that of Lambert and Banchs (2005).

Methods            50,000   60,000   70,000
Baseline                    0.2658
Baseline+BiMWE     0.2671   0.2686   0.2715
Baseline+Feat      0.2728   0.2734   0.2712
Baseline+NewBP     0.2662   0.2706   0.2724

Table 4: Translation results of using ranked bilingual MWEs in the traditional medicine domain
To verify the assumption that bilingual MWEsdo indeed improve the SMT performance not onlyon particular domain we also perform some ex-periments on chemical industry domain Table 5shows the results From this table we can see thatthese three methods can improve the translationperformance on chemical industry domain as wellas on the traditional medicine domain
Methods           BLEU
Baseline          0.1882
Baseline+BiMWE    0.1928
Baseline+Feat     0.1917
Baseline+NewBP    0.1914
Table 5: Translation results of using bilingual MWEs in the chemical industry domain
4.4 Discussion
In order to understand in what respects our methods improve translation performance, we manually analyzed some test sentences and give some examples in this subsection.
(1) For the first example in Table 6, the Chinese source phrase meaning "dredging meridians" is aligned to other words and not correctly translated in the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since the bilingual MWE pairing that phrase with "dredging meridians" has been added into the training corpus and then aligned by GIZA++.
(2) For the second example in Table 6, the Chinese source phrase has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as the translation, while the Baseline+Feat system chooses "med-
Src: [Chinese source sentence]
Ref: the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE: the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src: [Chinese source sentence]
Ref: the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat: may also be made into tablet, pill, powder, medicated tea or injection
Table 6: Translation examples
icated tea" because the additional feature gives a high probability to the correct translation "medicated tea".
5 Conclusion and Future Work
This paper presents the LLR-based hierarchical reducing algorithm for automatically extracting bilingual MWEs, and investigates the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies can also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.

The strategies for applying bilingual MWEs in this paper are rough and simple. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in a phrase-based SMT system, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.
References
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.
Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward and Wei-jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp
Abstract
This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
1 Introduction
Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names and locations. It is used in several NLP applications such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).
In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.
When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.
In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (return home).
On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized the information of the head of the bunsetsu.1 In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.
However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.
This paper proposes Japanese bottom-up named

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.
(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama TOP)
'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences
entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
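The chunk inventory described above is simply every contiguous morpheme sequence inside the bunsetsu; a minimal sketch (the morpheme segmentation is illustrative):

```python
# Enumerate all candidate chunks of a bunsetsu: every contiguous
# morpheme sequence becomes one chunk to be labeled.
def chunks(morphemes):
    n = len(morphemes)
    return ["-".join(morphemes[i:j])
            for i in range(n) for j in range(i + 1, n + 1)]

assert chunks(["Habu", "Yoshiharu", "Meijin"]) == [
    "Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin",
    "Yoshiharu", "Yoshiharu-Meijin", "Meijin"]
```

A bunsetsu of n morphemes thus yields n(n+1)/2 candidate chunks, each of which receives a label score from the estimation model.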
This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 describes an overview of our proposed method. Section 4 presents two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.
2 Related Work
In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX
class: example
PERSON: Kimura Syonosuke
LOCATION: Taiheiyou (Pacific Ocean)
ORGANIZATION: Jimin-tou (Liberal Democratic Party)
ARTIFACT: PL-houan (PL bill)
DATE: 21-seiki (21st century)
TIME: gozen-7-ji (7 a.m.)
MONEY: 500-oku-en (50 billion yen)
PERCENT: 20 percent

Table 1: NE classes and their examples
Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY and PERCENT. Table 1 shows example instances of each class.
NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.
In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I and E are assigned to the beginning, middle and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.2 The labels S, B, I and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
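The label inventory of the SE-algorithm can be sketched directly from this description (the helper `labels_for` is an illustrative name, not from the paper):

```python
# The SE-algorithm label set: S/B/I/E for each NE class plus a single
# O label, giving 8 * 4 + 1 = 33 labels in total.
NE_CLASSES = ["PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
              "DATE", "TIME", "MONEY", "PERCENT"]

labels = [f"{pos}-{cls}" for cls in NE_CLASSES for pos in "SBIE"] + ["O"]
assert len(labels) == 8 * 4 + 1  # 33 labels

def labels_for(ne_class, n_morphemes):
    """Gold label sequence for an NE spanning n_morphemes morphemes."""
    if n_morphemes == 1:
        return [f"S-{ne_class}"]          # single-morpheme NE
    return ([f"B-{ne_class}"] +            # beginning
            [f"I-{ne_class}"] * (n_morphemes - 2) +  # middle
            [f"E-{ne_class}"])             # end

# A two-morpheme PERSON such as "Habu-Yoshiharu":
assert labels_for("PERSON", 2) == ["B-PERSON", "E-PERSON"]
```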
The model for label estimation is learned by machine learning. The following features are generally utilized: characters, type of

2 Besides, there are the IOB1 and IOB2 algorithms using only IOB, and the IOE1 and IOE2 algorithms using only IOE (Kim and Veenstra, 1999).
[Figure 2 content: CKY-style triangular tables over the bunsetsu "Habu-Yoshiharu-Meijin". (a) Initial state: every chunk has an estimated label and score, e.g., "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Yoshiharu" MONEY 0.075, "Meijin" OTHERe 0.245. (b) Final output: the best label assignment "Habu-Yoshiharu" (PERSON) + "Meijin" (OTHERe) is chosen bottom-up.]

Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin")
character, POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVM or CRF have been proposed.
Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word sub-class of the NE in the analysis direction within the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized cache features, coreference results, syntactic features and case-frame features as structural features (Sasano and Kurohashi, 2008).
Some work acquired knowledge from a large unannotated corpus and applied it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of pairs of dependency relations (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al., 2008).
In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.
3 Overview of Proposed Method
Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).
We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the assignment "Habu-Yoshiharu" (PERSON) and the assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in a general CKY algorithm, our method chooses the better label assignment for each cell using a machine learning method.
The learned models are the following:

• the model that estimates the label of a chunk (label estimation model)
• the model that compares two label assignments (label comparison model)

The two models are described in detail in the next section.
[Figure 3 content: in the bunsetsu "Habu-Yoshiharu-Meijin-ga", the chunk "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and all the other chunks are labeled invalid.]

Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"
4 Model Learning
4.1 Label Estimation Model

This model estimates the label for each chunk. The unit of analysis is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions located across multiple bunsetsus tend to be an NE:
• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))
• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, a bunsetsu is expanded when one of the above conditions is met. With this expansion, 98.6% of named entities are located within a bunsetsu.3
For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.
Next, for learning the label estimation model, all the chunks in a bunsetsu are attached to the correct label from a hand-annotated corpus. The label set comprises 13 classes: the eight NE classes (shown in Table 1) and five additional classes: OTHERs, OTHERb, OTHERi, OTHERe and invalid.

The chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled OTHERs, and a head, middle or tail chunk that does not correspond to an NE is labeled OTHERb, OTHERi or OTHERe, respectively.4
3 As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4 Each OTHER label is assigned to the longest chunk that satisfies its condition.
1. number of morphemes in the chunk
2. the position of the chunk in its bunsetsu
3. character type5
4. the combination of the character types of adjoining morphemes (for the chunk "Russian Army", this feature is "Katakana/Kanji")
5. word class, word sub-class and several features provided by the morphological analyzer JUMAN
6. several features6 provided by the parser KNP
7. string of the morphemes in the chunk
8. IPADIC7 feature: if the string of the chunk is registered in the following categories of IPADIC: "person", "location", "organization" and "general", this feature fires
9. Wikipedia feature: if the string of the chunk has an entry in Wikipedia, this feature fires; the hypernym extracted from its definition sentence using some patterns is also used (e.g., the hypernym of "the Liberal Democratic Party" is "political party")
10. cache feature: when the same string as the chunk appears in the preceding context, the label of the preceding chunk is used as the feature
11. particles that the bunsetsu includes
12. the morphemes, particles and head morpheme in the parent bunsetsu
13. the NE/category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008): for example, in the ga (NOM) case of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot; hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"
14. parenthesis feature: when the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model
The chunk that is neither one of the eight NE classes nor one of the above four OTHER classes is labeled invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and all the other chunks are labeled invalid.
Next, the label estimation model is learned from the data in which the above label set is assigned
5 The following five character types are considered: Kanji, Hiragana, Katakana, Number and Alphabet.

6 When a morpheme has an ambiguity, all the corresponding features fire.

7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja
to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of a morpheme in the chunk are prepared: "head", "tail" and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".
The label estimation model is learned from pairs of label and features for each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1/(1 + exp(−βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.
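The sigmoid transformation and per-chunk normalization can be sketched as follows; the raw SVM margins and labels below are illustrative values, not the paper's.

```python
import math

def normalize_scores(svm_outputs, beta=1.0):
    """Squash raw SVM outputs with the sigmoid 1/(1 + exp(-beta * x)),
    then renormalize so that the label scores of one chunk sum to one."""
    squashed = {label: 1.0 / (1.0 + math.exp(-beta * x))
                for label, x in svm_outputs.items()}
    total = sum(squashed.values())
    return {label: v / total for label, v in squashed.items()}

# Example raw one-vs-rest margins for a single chunk (made up):
scores = normalize_scores({"PERSON": 1.2, "OTHERe": -0.3, "invalid": 0.4})
assert abs(sum(scores.values()) - 1.0) < 1e-9
assert scores["PERSON"] > scores["OTHERe"]
```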
The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk where the label invalid has the highest score, the label with the second highest score is adopted.
4.2 Label Comparison Model
This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON
• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY
First, as shown in Figure 4, the two compared sets of chunks are lined up, separated by "vs." (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise it is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for model learning.

Then features are assigned to each example. The feature (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used
[Figure 4 content: positive examples (+1): "Habu-Yoshiharu" (PSN) vs. "Habu" + "Yoshiharu" (PSN + MNY); "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe) vs. "Habu" + "Yoshiharu" + "Meijin" (PSN + MONEY + OTHERe). Negative example (−1): "Habu-Yoshiharu-Meijin" (ORG) vs. "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe).]

Figure 4: Assignment of positive/negative examples
for each label, which is estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining part.

Figure 5 illustrates the feature for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.
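The feature layout for one comparison example can be sketched as below. This is a hypothetical reading of the description: a one-hot label indicator is assumed for the 12 label dimensions (with "invalid" omitted to make the count 12), and the helper names are illustrative.

```python
# Sketch of the label-comparison feature layout: 13 dimensions per chunk
# (12 label indicators + the label's score), up to 5 chunks per side,
# with unused chunk slots filled by zero vectors.
LABELS = ["PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT", "DATE",
          "TIME", "MONEY", "PERCENT", "OTHERs", "OTHERb", "OTHERi",
          "OTHERe"]  # assumed 12 indicator dimensions
MAX_CHUNKS = 5

def chunk_vector(label, score):
    vec = [0.0] * (len(LABELS) + 1)
    vec[LABELS.index(label)] = 1.0
    vec[-1] = score                 # 13th dimension: the label's score
    return vec

def side_vector(chunks):
    """chunks: list of (label, score); pad remaining slots with zeros."""
    vec = []
    for label, score in chunks:
        vec += chunk_vector(label, score)
    vec += [0.0] * (len(LABELS) + 1) * (MAX_CHUNKS - len(chunks))
    return vec

first = side_vector([("PERSON", 0.438)])                     # "Habu-Yoshiharu"
second = side_vector([("PERSON", 0.111), ("MONEY", 0.075)])  # "Habu"+"Yoshiharu"
example = first + second    # one positive/negative training example
assert len(example) == 2 * MAX_CHUNKS * 13
```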
5 Analysis
First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.
For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell "Yoshiharu-Meijin", the model compares the
chunk:  Habu-Yoshiharu      | Habu, Yoshiharu
label:  PERSON              | PERSON, MONEY
vector: V11, 0, 0, 0, 0     | V21, V22, 0, 0, 0

Figure 5: An example of the feature for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu", and V11, V21, V22 and 0 are vectors whose dimension is 13)
[Figure 6 content: (a) determining the label assignment for the cell "Habu-Yoshiharu": the assignment "B" ("Habu-Yoshiharu", PERSON 0.438) is compared with "A+D" ("Habu" PERSON 0.111 + "Yoshiharu" MONEY 0.075); (b) determining the label assignment for the cell "Habu-Yoshiharu-Meijin" (the final output): the assignments "A+D+F", "B+F" and "C" are compared, where "F" is "Meijin" OTHERe 0.245 and "C" is "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083.]

Figure 6: The label comparison model
label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).
As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F" and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.
In the label comparison step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
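The bottom-up procedure above can be sketched as a CKY-style dynamic program. This toy version replaces the paper's pairwise SVM comparison with a simple summed-score comparison, and the `label_score` callback stands in for the label estimation model; the scores reuse the illustrative values from Figure 2.

```python
# Minimal CKY-style sketch of the bottom-up label assignment.
def best_assignment(morphemes, label_score):
    """label_score(i, j) -> (label, score) for the chunk morphemes[i:j]."""
    n = len(morphemes)
    cell = {}  # (i, j) -> (assignment, score); assignment: [(chunk, label)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            label, score = label_score(i, j)
            best = ([("".join(morphemes[i:j]), label)], score)
            for k in range(i + 1, j):   # compare against every split point
                left, ls = cell[(i, k)]
                right, rs = cell[(k, j)]
                if ls + rs > best[1]:
                    best = (left + right, ls + rs)
            cell[(i, j)] = best
    return cell[(0, n)][0]

scores = {(0, 1): ("PERSON", 0.111), (1, 2): ("MONEY", 0.075),
          (2, 3): ("OTHERe", 0.245), (0, 2): ("PERSON", 0.438),
          (1, 3): ("OTHERe", 0.092), (0, 3): ("ORGANIZATION", 0.083)}
result = best_assignment(["Habu", "Yoshiharu", "Meijin"],
                         lambda i, j: scores[(i, j)])
assert result == [("HabuYoshiharu", "PERSON"), ("Meijin", "OTHERe")]
```

Under these scores the sketch reproduces the final output of Figure 2 (b): "Habu-Yoshiharu" (PERSON) plus "Meijin" (OTHERe).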
6 Experiment
To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NEs. Expressions to which it is difficult to annotate manually are labeled OPTIONAL and were not used for both
[Figure 7 content: the corpus (CRL NE data) is divided into five parts (a)-(e). Label estimation model 1 is learned from parts (b)-(e); label estimation models 2b-2e are each learned from three of those parts and applied to the remaining part to obtain features for the label comparison model, from which the label comparison model is learned.]

Figure 7: 5-fold cross validation
the model learning8 and the evaluation.

We performed 5-fold cross validation, following previous work. Different from previous work, our work has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, label estimation model 1 is learned from parts (b)-(e). Then label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, label estimation model 2c is learned from parts (b), (d) and (e), and by applying it to part (c), features are obtained. It is the same for parts
8 Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.
Recall PrecisionORGANIZATION 8183 (30083676) 8837 (30083404)PERSON 9005 (34583840) 9387 (34583684)LOCATION 9138 (49925463) 9244 (49925400)ARTIFACT 4672 ( 349 747) 7489 ( 349 466)DATE 9327 (33273567) 9312 (33273573)TIME 8825 ( 443 502) 9059 ( 443 489)MONEY 9385 ( 366 390) 9760 ( 366 375)PERCENT 9533 ( 469 492) 9591 ( 469 489)ALL-SLOT 8787 9179F-measure 8979
Table 3: Experimental result
(d) and (e). Then the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both label estimation model 1 and the label comparison model.
In this experiment, the Japanese morphological analyzer JUMAN⁹ and the Japanese parser KNP¹⁰ were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.
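The sigmoid conversion of SVM outputs is presumably the standard logistic squashing of the margin; a minimal sketch, with the exact formula assumed rather than taken from the paper:

```python
import math

def sigmoid(x, beta=1.0):
    # squash an SVM output x into (0, 1); beta controls the slope
    # (beta = 1 in the experiment above)
    return 1.0 / (1.0 + math.exp(-beta * x))

p = sigmoid(1.2)   # a positive SVM margin maps above 0.5
```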
Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.
7 Discussion
7.1 Comparison with Previous Work
Table 4 presents the comparison with previous work; our method outperformed all of them. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") with patterns applied to a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for model learning. Therefore, by incorporating such knowledge and/or analysis results into our method, the performance could be improved further.
Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about
⁹http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
¹⁰http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION relatively drops. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.
In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", the information of "hou" can be utilized.
7.2 Error Analysis
There were some errors in analyzing katakana words. In the following example, although the correct analysis labels "Batistuta" as PERSON, the system labeled it as OTHERs.
(4) Italy-de katsuyaku-suru Batistuta-wo kuwaeta Argentine
    Italy-LOC active         Batistuta-ACC call    Argentine
    'Argentine called Batistuta, who was active in Italy'
There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".
There were some errors in applying the label comparison model although the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of HongKong), the correct analysis is that "HongKong-seityou" is labeled as ORGANIZATION. As shown in Figure 8(b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8(a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning the adjustment of the value β in the sigmoid function and the refinement of the
                                F1     analysis unit   distinctive features
(Fukushima et al., 2008)        89.29  character       Web
(Kazama and Torisawa, 2008)     88.93  character       Wikipedia/Web
(Sasano and Kurohashi, 2008)    89.40  morpheme        structural information
(Nakano and Hirai, 2004)        89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)  87.21  character
(Isozaki and Kazawa, 2003)      86.77  morpheme
proposed method                 89.79  compound noun   Wikipedia/structural information

Table 4: Comparison with previous work (all work was evaluated on CRL NE data using cross validation)
[Figure 8 contrasts (a) the initial state, in which "HongKong-seityou" is correctly labeled ORGANIZATION, with (b) the final output, in which the label comparison model chooses the split "HongKong + seityou" (LOC + OTHERe, score 0.266 + 0.184), so that "HongKong" is labeled LOCATION.]
Figure 8: An example of the error in the label comparison model
features for the label comparison model.
8 Conclusion
This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. This method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79.
We are planning to integrate this method with the syntactic and case analysis method (Kawahara and Kurohashi, 2007) and to perform syntactic, case, and named entity analysis simultaneously to improve the overall accuracy.
References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970-979. (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304-311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407-415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173-179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of COLING-ACL 2006, pages 1121-1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), pages 282-289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8-15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934-941. (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607-612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171-178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 63-70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
[Pages 63-70: "Abbreviation Generation for Japanese Multi-Word Expressions" by Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui, and Kazuo Sumita. The body of this paper was corrupted during text extraction and is unrecoverable; only scattered fragments survive, such as the abbreviation example ファミリーレストラン → ファミレス ("family restaurant").]
Author Index
Bhutada, Pravin 17

Cao, Jie 47
Caseli, Helena 1

Diab, Mona 17

Finatto, Maria José 1
Fujii, Hiroko 63
Fukui, Mika 63
Funayama, Hirotaka 55

Hoang, Hung Huu 31
Huang, Yun 47

Kan, Min-Yen 9, 31
Kim, Su Nam 9, 31
Kuhn, Jonas 23
Kurohashi, Sadao 55

Liu, Qun 47
Lü, Yajuan 47

Machado, André 1

Ren, Zhixiang 47

Shibata, Tomohide 55
Sinha, R. Mahesh K. 40
Sumita, Kazuo 63
Suzuki, Masaru 63

Villavicencio, Aline 1

Wakaki, Hiromi 63

Zarrieß, Sina 23
Table of Contents
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
    Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto ........ 1

Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
    Su Nam Kim and Min-Yen Kan ........ 9

Verb Noun Construction MWE Token Classification
    Mona Diab and Pravin Bhutada ........ 17

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
    Sina Zarrieß and Jonas Kuhn ........ 23

A re-examination of lexical association measures
    Hung Huu Hoang, Su Nam Kim and Min-Yen Kan ........ 31

Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
    R. Mahesh K. Sinha ........ 40

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
    Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang ........ 47

Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
    Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi ........ 55

Abbreviation Generation for Japanese Multi-Word Expressions
    Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita ........ 63
Workshop Program
Friday, August 6, 2009

08:30-08:45  Welcome and Introduction to the Workshop

Session 1 (08:45-10:00): MWE Identification and Disambiguation

08:45-09:10  Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
             Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto

09:10-09:35  Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
             Su Nam Kim and Min-Yen Kan

09:35-10:00  Verb Noun Construction MWE Token Classification
             Mona Diab and Pravin Bhutada

10:00-10:30  BREAK

Session 2 (10:30-12:10): Identification, Interpretation and Disambiguation

10:30-10:55  Exploiting Translational Correspondences for Pattern-Independent MWE Identification
             Sina Zarrieß and Jonas Kuhn

10:55-11:20  A re-examination of lexical association measures
             Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20-11:45  Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
             R. Mahesh K. Sinha

11:45-13:50  LUNCH

Session 3 (13:50-15:30): Applications

13:50-14:15  Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
             Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang

14:15-14:40  Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
             Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40-15:05  Abbreviation Generation for Japanese Multi-Word Expressions
             Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05-15:30  Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30-16:00  BREAK

16:00-17:00  General Discussion

17:00-17:15  Closing Remarks
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 1-8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, André Machado♣, Maria José Finatto♥

♦Department of Computer Science, Federal University of São Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br
Abstract
Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to results reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.
1 Introduction
A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).
Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.
In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.
The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 concludes the paper with some conclusions and proposals for future work.
2 Related Work
The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods
for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.
A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g., Zhang et al. (2006), Villavicencio et al. (2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003), and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g., Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries in languages, using information from one language to help deal with MWEs in the other (e.g., Villada Moirón and Tiedemann (2006); Caseli et al. (2009)).
As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g., ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).
Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy, and χ²) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ². In addition, Evert and Krenn (2005) found that, for MWE identification, the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).
Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).
Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), in which missing lexical entries for MWEs and related constructions are detected via error mining methods; this paper focuses on the extraction of generic MWEs as a by-product of automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.
Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.
3 The Corpus and Reference Lists
The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.¹ The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify, and support translation students in the domain of pediatrics texts.
The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g., removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms: 1,421 bigrams, 730 trigrams, and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
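The subsumption rule just described (an n-gram contained in a larger candidate is dropped) can be sketched as follows. `drop_subsumed` is a hypothetical helper, not the glossary-building tool, and the plain substring test is a simplification of a proper token-level containment check.

```python
def drop_subsumed(ngrams):
    # keep an n-gram only if no other candidate contains it
    kept = []
    for ng in ngrams:
        if not any(ng != other and ng in other for other in ngrams):
            kept.append(ng)
    return kept

cands = ["aleitamento materno", "aleitamento materno exclusivo", "leite materno"]
result = drop_subsumed(cands)   # "aleitamento materno" is dropped
```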
4 Statistically-Driven and Alignment-Based Methods
4.1 Statistically-Driven Method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ², log-likelihood (Press et al., 1992), and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed
¹Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim
to differentiate MWEs from non-MWEs, but the same was not true of χ² (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of these measures was then calculated for the n-grams, and we ranked each n-gram according to each measure. The average of all the rankings is used as the combined measure of the MWE candidates.
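The paper computes these scores with the Ngram Statistics Package; as a self-contained illustration of the underlying measure, here is bigram PMI over a toy token list (the formula is the standard pointwise mutual information, not NSP's code, and the tokens are invented):

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    # PMI(w1, w2) = log2( P(w1 w2) / (P(w1) * P(w2)) )
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {(w1, w2): math.log2((c / (n - 1)) /
                                ((unigrams[w1] / n) * (unigrams[w2] / n)))
            for (w1, w2), c in bigrams.items()}

toks = "aleitamento materno reduz risco , aleitamento materno exclusivo".split()
pmi = bigram_pmi(toks)
# the recurring pair ("aleitamento", "materno") scores well above 0
```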
4.2 Alignment-Based Method
The second of the MWE extraction approaches investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, given a word alignment between a source word sequence S (S = s₁ ... sₙ, with n ≥ 2) and a target word sequence T (T = t₁ ... tₘ, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that (a) S and T share some semantic features, and (b) S may be an MWE.
In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
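The criterion above (two or more source words sharing one alignment) can be sketched as follows. The alignment representation, one set of source indices per aligned target unit, is hypothetical; real GIZA++ output would need converting into this shape, and the sentence is a made-up example.

```python
from collections import Counter

def extract_candidates(sentences):
    # count contiguous source sequences of >= 2 words that are linked,
    # as a whole, to a single target unit (an n : m alignment, n >= 2)
    counts = Counter()
    for src_words, alignments in sentences:
        for src_idx in alignments:
            idx = sorted(src_idx)
            if len(idx) >= 2 and idx == list(range(idx[0], idx[-1] + 1)):
                counts[" ".join(src_words[i] for i in idx)] += 1
    return counts

sent = (["o", "aleitamento", "materno", "reduz", "o", "risco"],
        [{0}, {1, 2}, {3}, {4}, {5}])   # {1, 2} -> "breastfeeding" (2 : 1)
counts = extract_candidates([sent])
```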
Thus, notice that the alignment-based MWE extraction method does not rely on conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.
It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++² (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained with the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines in some way the phrases of SMT systems.
In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as the gold standard.
To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps we used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium³ (Armentano-Oller et al., 2006).
2GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html
3Apertium is an open-source machine translation engine and toolbox available at http://www.apertium.org
From the preprocessed corpus, the MWE candidates are extracted as those in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009) or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same as that used during the manual building of the reference lists of MWEs: (a) patterns beginning with article + noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.
The second one (F2) is the same as that used in Caseli et al. (2009), namely: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.
The third one (F3) is the same as in Caseli et al. (2009) plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral, and (b) a minimum frequency threshold of 2.
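A minimal sketch of this filtering stage, under the F3 settings, might look as follows. The tag names are hypothetical coarse stand-ins (the paper's actual tags come from Apertium), and each candidate is assumed to carry its words, edge POS tags and corpus frequency.

```python
# Hypothetical coarse tags standing in for Apertium's tagset.
BAD_EDGE_TAGS = {"det", "adv", "cnj", "prep", "vb", "prn", "num"}

def f3_filter(candidates, min_freq=2):
    """Apply an F3-style filter: discard candidates that begin or finish
    with a closed-class tag, or whose frequency is below min_freq."""
    kept = []
    for words, tags, freq in candidates:
        if tags[0] in BAD_EDGE_TAGS or tags[-1] in BAD_EDGE_TAGS:
            continue                      # bad pattern at an edge
        if freq < min_freq:
            continue                      # too rare
        kept.append(" ".join(words))
    return kept
```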
5 Experiments and Results
Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and by the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), as well as other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).
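The paper does not define PMI, since it is the standard pointwise mutual information of an n-gram; a minimal sketch for bigrams, assuming raw unigram and bigram counts over the corpus, is:

```python
import math

def pmi(bigram_count, w1_count, w2_count, n_tokens):
    """Pointwise mutual information of a bigram:
    PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))."""
    p_xy = bigram_count / n_tokens
    p_x = w1_count / n_tokens
    p_y = w2_count / n_tokens
    return math.log2(p_xy / (p_x * p_y))
```

Rare word pairs that almost always occur together get high PMI, which is why proper names such as Online Mendelian Inheritance rank at the top while frequent function-word sequences rank at the bottom.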
In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 * precision * recall)/(precision + recall)) figures for the association measures, using all the candidates (on the
PMI                            alignment-based
Online Mendelian Inheritance   faixa etária
Beta Technology Incorporated   aleitamento materno
Lange Beta Technology          estados unidos
Óxido Nítrico Inalatório       hipertensão arterial
jogar video game               leite materno
e um de                        couro cabeludo
e a do                         bloqueio lactíferos
se que de                      emocional anatomia
e a da                         neonato a termo
e de não                       duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach
pt MWE candidates        PMI      MI
proposed bigrams         64839    64839
correct MWEs             1403     1403
precision (%)            2.16     2.16
recall (%)               98.73    98.73
F (%)                    4.23     4.23
proposed trigrams        54548    54548
correct MWEs             701      701
precision (%)            1.29     1.29
recall (%)               96.03    96.03
F (%)                    2.55     2.55
proposed bigrams (top)   1421     1421
correct MWEs             155      261
precision (%)            10.91    18.37
recall (%)               10.91    18.37
F (%)                    10.91    18.37
proposed trigrams (top)  730      730
correct MWEs             44       20
precision (%)            6.03     2.74
recall (%)               6.03     2.74
F (%)                    6.03     2.74

Table 2: Evaluation of MWE candidates ranked by PMI and MI
first half of the table) and using the top 1421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.
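The evaluation measures defined above are straightforward to compute; as a check, the snippet below reproduces the all-candidates bigram row of Table 2 (1403 correct out of 64839 proposed, against 1421 reference bigrams):

```python
def prf(n_correct, n_proposed, n_reference):
    """Precision, recall and F-measure as defined in the text."""
    precision = n_correct / n_proposed
    recall = n_correct / n_reference
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Bigram figures from the first half of Table 2 (all candidates).
p, r, f = prf(1403, 64839, 1421)
```

These counts give roughly 2.16%, 98.73% and 4.23%, matching the table.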
On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered out by each of the three filters described in Section 4.2: F1, F2 and F3.
To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this auto-
pt MWE candidates          F1      F2      F3
filtered by POS patterns   24996   21544   32644
filtered by frequency      9012    11855   1442
final set                  269     878     191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach
pt MWE candidates      F1      F2      F3
proposed bigrams       250     754     169
correct MWEs           48      95      65
precision (%)          19.20   12.60   38.46
recall (%)             3.38    6.69    4.57
F (%)                  5.75    8.74    8.18
proposed trigrams      19      110     20
correct MWEs           1       9       4
precision (%)          5.26    8.18    20.00
recall (%)             0.14    1.23    0.55
F (%)                  0.27    2.14    1.07
proposed bi+trigrams   269     864     189
correct MWEs           49      104     69
precision (%)          18.22   12.04   36.51
recall (%)             2.28    4.83    3.21
F (%)                  4.05    6.90    5.90

Table 4: Evaluation of MWE candidates
matic comparison, we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.
The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams from the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not considered at the moment.
After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.
From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see Table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true if it is a multiword expression or false otherwise, independently of its being a Pediatrics term. For the judges, a sequence of words was considered a MWE mainly if it was (1) a proper name or (2) a sequence of words whose meaning cannot be obtained by compounding the meanings of its component words.
The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates good agreement.
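Cohen's kappa compares observed agreement against the agreement expected by chance from each judge's label distribution. An illustrative implementation for two judges' parallel true/false label lists (the toy labels in the usage below are ours, not the actual annotation data):

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label lists."""
    assert len(a) == len(b) and a
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items labelled identically.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent annotators.
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)
```

For example, cohen_kappa(["T", "T", "F", "F"], ["T", "F", "F", "F"]) gives 0.5: three of four items agree, against an expected chance agreement of 0.5.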
To calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating against a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; to emphasize coverage, one should also consider those candidates classified as true by just one of them. Thus, of the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.
6 Conclusions and Future Work
MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.
In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained
show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.
Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of candidates judged to be true MWEs by both judges or by at least one of them were, respectively, 65.97% and 75.92%. This is a significant improvement, but it is important to note that in this manual analysis the human experts classified the MWEs as true independently of their being Pediatrics terms. As future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.
In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.
Acknowledgments
We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for its financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.
References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50-59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98-104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047-2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370-381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistics. Computational Linguistics, 22(2):249-254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49-56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117-134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450-466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28.
Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165-178. Oxford University Press, Oxford.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534-559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459-484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33-40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440-447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1-7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17-24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1-15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034-1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415-432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36-44, Sydney, Australia, July. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 9-16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
Abstract
We tackle two major issues in automatic keyphrase extraction from scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.
1 Introduction
Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).
In the past, various attempts have been made to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and on rich sets of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.
Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g. Frank et al. (1999), Park et al. (2004), Nguyen and Kan (2007)), most of them do not effectively deal with the various forms keyphrases take, which results in some keyphrases being missed as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.
Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. This leads to redundancy across studies and hinders direct comparison. In addition, existing features are, with few exceptions, specifically designed for supervised approaches. However, the supervised approach involves a large amount of manual labor, which reduces its utility for real-world applications. Hence, an unsupervised approach is desirable in order to minimize manual effort and to encourage utilization. It is therefore worthwhile to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.
This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method that improves the coverage of candidates occurring in various forms. Our second contribution re-examines existing keyphrase extraction features
reported in the literature, in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.
In the remaining sections, we give an overview of related work in Section 2, present our proposals on candidate selection and feature engineering in Sections 4 and 5, and describe our system architecture and data in Section 6. We then evaluate our proposals, discuss outcomes, and conclude our work in Sections 7, 8 and 9, respectively.
2 Related Work
The majority of related work has been carried out using statistical approaches, rich sets of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e. the relationship between the document and its keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent (2) keyphrase cohesion features (i.e. relationships among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e. relationships among the components of a keyphrase) (Park et al., 2004).
The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TFIDF (i.e. term frequency x inverse document frequency) and first occurrence in the document. TFIDF measures document cohesion, while first occurrence reflects the importance of the abstract and introduction, indicating that keyphrases have locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequences. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head-noun heuristics that took three features: length of candidate, frequency, and head-noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. Textract (Park et al., 2004) also ranks candidate keyphrases by its judgment of their degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion computed with the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of the other documents in the same cluster.
3 Keyphrase Analysis
In previous studies, KEA employed the indexing words as candidates, whereas others such as Park et al. (2004) and Nguyen and Kan (2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM Digital Library both for analyzing keyphrases and for evaluating methods. See Section 6 for details of the data.
Syntactically, keyphrases can be formed by either simplex nouns (e.g. algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and, despite few incidences, adverbs (e.g. mobile network, fast computing, partially observable Markov decision process). They can also incorporate a prepositional phrase (PP) (e.g. quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e. NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g. incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.
However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we also found that keyphrases occur as simple con-
Criteria      Rules
Frequency     (Rule1) Frequency heuristic, i.e. frequency >= 2 for simplex words vs. frequency >= 1 for NPs
Length        (Rule2) Length heuristic, i.e. up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
              (e.g. synchronous concurrent program vs. model of multiagent interaction)
Alternation   (Rule3) of-PP form alternation
              (e.g. number of sensor = sensor number, history of past encounter = past encounter history)
              (Rule4) Possessive alternation
              (e.g. agent's goal = goal of agent, security's value = value of security)
Extraction    (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
              (e.g. complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
              (Rule6) Simplex Word/NP IN Simplex Word/NP
              (e.g. quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
              simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules
junctions (e.g. search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g. history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g. pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g. belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).
A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word-order alternation. Previous studies have used the altered keyphrases when they occur in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g. distributed system != distributing system, dynamical caching != dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g. agent's goal = goal of agent, security's value = value of security).
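The of-PP and possessive alternations described above can be generated mechanically. The following is an illustrative sketch (the helper names are ours, and real candidate generation would need POS information to avoid spurious rewrites); the examples are those given in the text:

```python
def of_pp_alternations(phrase):
    """Rewrite an NP in of-PP form as its compound alternation,
    e.g. 'quality of service' -> 'service quality'."""
    words = phrase.split()
    if "of" in words[1:-1]:
        i = words.index("of")
        head, modifier = words[:i], words[i + 1:]
        return [" ".join(modifier + head)]      # modifier fronted, head last
    return []

def possessive_alternation(phrase):
    """Rewrite \"X's Y\" as 'Y of X', e.g. \"agent's goal\" -> 'goal of agent'."""
    words = phrase.split()
    if len(words) == 2 and words[0].endswith("'s"):
        return [f"{words[1]} of {words[0][:-2]}"]
    return []
```

Applying both directions when matching candidates against reference keyphrases lets variants such as history of past encounter and past encounter history be counted as the same keyphrase.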
4 Candidate Selection
We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated with term extraction research, since the top N ranked terms become the keyphrases of a document. In previous studies, KEA employed the indexing words as candidates, whereas others such as Park et al. (2004) and Nguyen and Kan (2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe our keyphrase analysis in detail.
In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e. either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.
Another observation is that keyphrases containing more than three words are rare (i.e. 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g. performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.
Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule6 for NPs in of-PP form broadens the coverage of
possible candidates: i.e. given an NP in of-PP form, not only do we collect simplex word(s), but we also extract the non-of-PP form of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
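Rule5 of Table 1 describes runs of adjectives and nouns that end in a noun over Penn Treebank tags. A sketch of such a chunker (an illustration, not the authors' implementation; it keeps only maximal runs, whereas candidate generation may additionally enumerate sub-phrases) is:

```python
NOUN = {"NN", "NNS", "NNP", "NNPS"}
ADJ = {"JJ", "JJR", "JJS"}

def rule5_nps(tagged):
    """Rule5 sketch: maximal runs of adjectives/nouns that end in a noun.
    `tagged` is a list of (word, Penn POS tag) pairs."""
    nps, run = [], []
    for word, tag in tagged + [("", "EOS")]:   # sentinel flushes the last run
        if tag in NOUN or tag in ADJ:
            run.append((word, tag))
        else:
            while run and run[-1][1] in ADJ:   # a run must end in a noun
                run.pop()
            if run:
                nps.append(" ".join(w for w, _ in run))
            run = []
    return nps
```

Rule6 would then be layered on top, joining two such chunks across a preposition (IN) and also emitting each side separately, as in the effective algorithm of grid computing example.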
5 Feature Engineering
With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls on careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S, U).
5.1 Document Cohesion
Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality; that is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.
F1: TFIDF (S, U): TFIDF indicates document cohesion by looking at the frequency of terms in the documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute useful IDF values. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF over substrings as well as exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e. we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system:
• (F1a) TFIDF

• (F1b) TF including counts of substrings

• (F1c) TF of substrings as a separate feature

• (F1d) TF normalized by candidate type (i.e. simplex words vs. NPs)

• (F1e) TF normalized by candidate type, as a separate feature

• (F1f) IDF using Google n-grams
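As a rough sketch of the F1b substring counting and a plain TFIDF combination (an illustration under our own assumptions: the Google n-gram lookup is abstracted away into a document-frequency argument, and the candidate lists are toy data):

```python
import math

def tf_with_substrings(candidate, all_candidates):
    """F1b-style TF: count occurrences of `candidate`, including as a
    substring of longer candidates (e.g. 'grid computing' inside
    'grid computing algorithm')."""
    return sum(1 for c in all_candidates if candidate in c)

def tfidf(tf, df, n_docs):
    """Plain TFIDF given a document frequency; in the paper the IDF side
    comes from Google n-gram counts rather than a domain corpus."""
    return tf * math.log(n_docs / df)
```

The other variations (F1c-F1e) then amount to exposing the substring count and type-normalized TF as separate feature columns rather than folding them into a single score.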
F2: First Occurrence (S, U): KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g. in the abstract and introduction sections) and newswire.
F3: Section Information (S, U): Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title and/or references.
F4: Additional Section Information (S,U). We first added the related work (or previous work) section as a type of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We also assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.
• (F4a) the 'related/previous work' section

• (F4b) counting substrings occurring in key sections

• (F4c) section TF across all key sections

• (F4d) weighting key sections according to the proportion of keyphrases found
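A minimal sketch of section TF with per-section weights (F4c/F4d); the weight values here are illustrative assumptions, not numbers from the paper:

```python
# Hypothetical section weights (F4d): introductions yield many keyphrases,
# titles far fewer, so they receive different weights.
SECTION_WEIGHTS = {"title": 0.5, "abstract": 1.0, "introduction": 2.0,
                   "related work": 1.5, "conclusion": 1.0}

def weighted_section_tf(occurrences):
    """occurrences: {section_name: raw count} for one candidate.
    Returns the candidate's section TF (F4c) combined with
    section-specific weights (F4d); unknown sections default to 1.0."""
    return sum(SECTION_WEIGHTS.get(sec, 1.0) * n
               for sec, n in occurrences.items())

score = weighted_section_tf({"introduction": 3, "title": 1})
print(score)  # 2.0*3 + 0.5*1 = 6.5
```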
F5: Last Occurrence (S,U). Similar to distance in KEA, the position of the last occurrence of a candidate may also indicate keyphrase importance, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion sections.
5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.
F6: Co-occurrence of Another Candidate in Section (S,U). When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of key sections in which candidates co-occur.
F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior on which words can stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.
• (F7a) co-occurrence (Boolean) in the title collection

• (F7b) co-occurrence (TF) in the title collection
F8: Keyphrase Cohesion (S,U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top N ranked candidates and the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent two-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs, and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.

1 It contains 1.3M titles from articles, papers, and reports.
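The LSA-like contextual-similarity approximation described above can be sketched as follows; the candidate/neighbor observations are toy data of our own, not from the paper:

```python
import math
from collections import defaultdict

def build_context_matrix(candidates_with_neighbors):
    """Rows: candidate words; columns: neighboring content words
    (nouns, verbs, adjectives), mimicking the paper's latent matrix."""
    matrix = defaultdict(lambda: defaultdict(int))
    for cand, neighbors in candidates_with_neighbors:
        for w in neighbors:
            matrix[cand][w] += 1
    return matrix

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors (dicts)."""
    shared = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in shared)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy neighbor observations (illustrative only).
m = build_context_matrix([
    ("grid", ["computing", "cluster"]),
    ("cloud", ["computing", "storage"]),
])
print(round(cosine(m["grid"], m["cloud"]), 2))  # 0.5
```

Two candidates sharing one of two context words score 0.5; identical context rows score 1.0.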
5.3 Term Cohesion

Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).
F9: Term Cohesion (S,U). Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier.
• (F9a) term cohesion as in Park et al. (2004)

• (F9b) TF normalized by candidate type (i.e., simplex words vs. NPs)

• (F9c) applying different weights by candidate type

• (F9d) normalized TF and different weighting by candidate type
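For reference, the generalized Dice coefficient underlying F9 can be computed as below; the counts are hypothetical, and Park et al.'s 10% simplex-word discount is omitted:

```python
def dice_cohesion(ngram_freq, word_freqs):
    """Generalized Dice coefficient for a multi-word candidate:
    n * freq(w1..wn) / (freq(w1) + ... + freq(wn)).
    For two words this reduces to the classic 2*f(xy)/(f(x)+f(y))."""
    n = len(word_freqs)
    denom = sum(word_freqs)
    return (n * ngram_freq) / denom if denom else 0.0

# Hypothetical counts: "machine translation" seen 30 times,
# "machine" 100 times, "translation" 50 times.
print(dice_cohesion(30, [100, 50]))  # 2*30/150 = 0.4
```

Varying which TF is fed in (raw, substring-inclusive, or type-normalized) yields the F9a-F9d variants.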
5.4 Other Features
F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K, to attest our candidate selection method.
F11: POS Sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like the acronym feature, this feature is also dependent on the data set.

F12: Suffix Sequence (S). Similar to the acronym feature, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data dependent, and thus used in the supervised approach only.
F13: Length of Keyphrases (S,U). Barker and Cornacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.
6 System and Data
To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections, such as the title and body sections, via heuristic rules, and applied a sentence segmenter,2 ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger,4 and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy,6 and simple weighting.
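A rough sketch of the heuristic section-partitioning stage (our own illustration; the paper does not specify its actual rules, so the header patterns below are assumptions):

```python
import re

# Hypothetical header patterns; a real system would handle numbering,
# capitalization variants, and more section names.
SECTION_RE = re.compile(
    r"^(abstract|introduction|related work|conclusion|references)\b", re.I)

def split_sections(lines):
    """Partition raw text lines into sections keyed by header name.
    Lines before the first recognized header go into 'front'."""
    sections = {"front": []}
    current = "front"
    for line in lines:
        m = SECTION_RE.match(line.strip())
        if m:
            current = m.group(1).lower()
            sections[current] = []
        else:
            sections[current].append(line)
    return sections

secs = split_sections(["A Title", "Introduction",
                       "We study keyphrases.", "Conclusion", "We conclude."])
print(sorted(secs))  # ['conclusion', 'front', 'introduction']
```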
For evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found that many were missing or appeared only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the number of author-assigned and reader-assigned keyphrases found in the documents.

2 http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3 http://wing.comp.nus.edu.sg/parsCit/
4 http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5 http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6 http://maxent.sourceforge.net/index.html
7 C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence: Multiagent Systems), and J.4 (Social and Behavioral Sciences: Economics)
          Author        Reader          Combined
Total     1252 (53)     3110 (111)      3816 (146)
NPs       904           2537            3027
Average   3.85 (4.01)   12.44 (12.88)   15.26 (15.85)
Found     769           2509            2864

Table 2: Statistics on keyphrases
7 Evaluation
The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information, and additional section information (i.e., F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.
Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10, and 15 candidates.
BestFeatures comprises F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion by Park et al. (2004)), and F13 (length of keyphrases). Best w/o TFIDF means using all best features except TFIDF.
In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).8
In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and * indicates no effect, or an effect unconfirmed due to only small changes in performance. Again, Supervised denotes Maximum Entropy training, and Unsupervised is our unsupervised approach.
8 Discussion
We compared the performance of our candidate selection and feature engineering against simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer

8 Due to the page limits, we present the best performance.
                                  Five                       Ten                        Fifteen
Method         Features  C   M     P      R      F      M     P      R      F      M     P      R      F
All            KEA       U   0.03  0.64   0.21   0.32   0.09  0.92   0.60   0.73   0.13  0.88   0.86   0.87
Candidates               S   0.79  15.84  5.19   7.82   1.39  13.88  9.09   10.99  1.84  12.24  12.03  12.13
               N&K       S   1.32  26.48  8.67   13.06  2.04  20.36  13.34  16.12  2.54  16.93  16.64  16.78
               baseline  U   0.92  18.32  6.00   9.04   1.57  15.68  10.27  12.41  2.20  14.64  14.39  14.51
                         S   1.15  23.04  7.55   11.37  1.90  18.96  12.42  15.01  2.44  16.24  15.96  16.10
Length<=3      KEA       U   0.03  0.64   0.21   0.32   0.09  0.92   0.60   0.73   0.13  0.88   0.86   0.87
Candidates               S   0.81  16.16  5.29   7.97   1.40  14.00  9.17   11.08  1.84  12.24  12.03  12.13
               N&K       S   1.40  27.92  9.15   13.78  2.10  21.04  13.78  16.65  2.62  17.49  17.19  17.34
               baseline  U   0.92  18.40  6.03   9.08   1.58  15.76  10.32  12.47  2.20  14.64  14.39  14.51
                         S   1.18  23.68  7.76   11.69  1.90  19.00  12.45  15.04  2.40  16.00  15.72  15.86
Length<=3      KEA       U   0.01  0.24   0.08   0.12   0.05  0.52   0.34   0.41   0.07  0.48   0.47   0.47
Candidates               S   0.83  16.64  5.45   8.21   1.42  14.24  9.33   11.27  1.87  12.45  12.24  12.34
+ Alternation  N&K       S   1.53  30.64  10.04  15.12  2.31  23.08  15.12  18.27  2.88  19.20  18.87  19.03
               baseline  U   0.98  19.68  6.45   9.72   1.72  17.24  11.29  13.64  2.37  15.79  15.51  15.65
                         S   1.33  26.56  8.70   13.11  2.09  20.88  13.68  16.53  2.69  17.92  17.61  17.76

Table 3: Performance on proposed candidate selection (M = match, P = precision, R = recall, F = F-score)
                 Five                      Ten                       Fifteen
Features    C    M     P      R      F     M     P      R      F     M     P      R      F
Best        U    1.14  22.8   7.47   11.3  1.92  19.2   12.6   15.2  2.61  17.4   17.1   17.3
            S    1.56  31.2   10.2   15.4  2.50  25.0   16.4   19.8  3.15  21.0   20.6   20.8
Best        U    1.14  22.8   7.4    11.3  1.92  19.2   12.6   15.2  2.61  17.4   17.1   17.3
w/o TFIDF   S    1.56  31.1   10.2   15.4  2.46  24.6   16.1   19.4  3.12  20.8   20.4   20.6

Table 4: Performance on feature engineering (M = match, P = precision, R = recall, F = F-score)
Effect  Method  Features
+       S       F1a, F2, F3, F4a, F4d, F9a
        U       F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       S       F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
        U       F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
*       S       F1e, F10, F11, F12
        U       F1b

Table 5: Effect of each feature (+ improvement, - decline, * no or unconfirmed effect)
candidates tend to act as noise and thus decreased the overall performance. We also confirmed that candidate alternation accommodated the flexibility of keyphrases, leading to higher candidate coverage as well as better performance.
To re-examine the features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting to TF and/or IDF did not affect performance. We attribute this to the fact that many keyphrases are substrings of candidates and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it includes a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence, it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the findings on each feature.
As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of the supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.
9 Conclusion
We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we handle variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage without overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous work, while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.
10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.
References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10-17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of ACL, 1989, pp. 76-83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28-30.

Ernesto D'Avanzo and Bernado Magnini. A keyphrase-based approach to summarization: the LAKE system at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433-447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-specific keyphrase extraction. In Proceedings of IJCAI, 1999, pp. 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81-104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic keyword extraction using domain knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beata Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537-544.

Mario Jarmasz and Caroline Barriere. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding topic words for hierarchical summarization. In Proceedings of SIGIR, 2001, pp. 349-357.

Y. Matsuo and M. Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157-169.

Olena Medelyan and Ian Witten. Thesaurus-based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296-297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207-223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase extraction in scientific publications. In Proceedings of ICADL, 2007, pp. 317-326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic glossary extraction: beyond terminology identification. In Proceedings of COLING, 2004, pp. 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä, and Timo Honkela. A language-independent approach to keyphrase extraction and evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to extract keyphrases from text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434-439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical automatic keyphrase extraction. In Proceedings of ACM DL, 1999, pp. 254-256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based clustering and summarization of web page collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17-22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Verb Noun Construction MWE Token Supervised Classification

Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu
Abstract
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58, corresponding to an F-measure of 89.96 for idiomaticity identification and classification, and 62.03 for literal identification and classification.
1 Introduction
In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more than by chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.
For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE meaning is transparent, i.e., predictable, in as much as the component words in the expression compositionally relay the meaning portended by the speaker. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket; however, make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.
To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could literally have been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning, and over 40% of their usages in text were actually literal. Thus it would be important for an NLP application, such as machine translation, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as misclassification could potentially have detrimental effects on the quality of the translation.
In this paper, we address the problem of MWE classification for verb-noun construction (VNC) tokens in running text. We investigate the binary classification of an unseen VNC token expression as either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.
This paper is organized as follows. In Section 2, we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5, we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.
2 Multi-word Expressions
MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come and ad hoc, and non-fixed expressions such as break new ground and speak of the devil. The VNCs on which we focus in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living and Verb Particle Constructions (VPC) such as write-up and call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.
3 Related Work
Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).
In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of its constituent words, using LSA, to determine whether the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.
Hashimoto and Kawahara (2008) (HK08) is, to our knowledge, the first large-scale study that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K annotated sentences for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprise some basic syntactic features such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features are mostly inflectional features such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.
4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for identifying MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.1 YamCha is based on Support Vector Machines technology using degree-2 polynomial kernels.
We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
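The 5-tag scheme can be sketched as follows (our own illustration; here the VNC span and its class are supplied by hand, whereas the paper's system predicts them):

```python
def iob_tags(tokens, span, idiomatic):
    """Label a sentence with the paper's 5-tag IOB scheme:
    B-I/I-I for an idiomatic VNC chunk, B-L/I-L for a literal one,
    O elsewhere. span = (start, end) token indices, end exclusive."""
    cls = "I" if idiomatic else "L"
    tags = []
    for i, _tok in enumerate(tokens):
        if i == span[0]:
            tags.append("B-" + cls)
        elif span[0] < i < span[1]:
            tags.append("I-" + cls)
        else:
            tags.append("O")
    return tags

sent = ["John", "kicked", "the", "bucket", "last", "Friday"]
print(iob_tags(sent, (1, 4), idiomatic=True))
# ['O', 'B-I', 'I-I', 'I-I', 'O', 'O']
```

This reproduces the running example above; with idiomatic=False the same span would be tagged B-L, I-L, I-L.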
We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing word inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma form (LEMMA), or citation form, of the word, and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets, explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
With the NE feature, we followed the same representation as the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN Identifinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other with a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class: Named Entity InText (NEI). For example, in the NEI condition, our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE "PER".

1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder
5 Experiments and Results
5.1 Data
We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC).3 The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations,4 corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.
5.2 Evaluation Metrics
We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
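For reference, the Fβ measure used here is the weighted harmonic mean of precision and recall:

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: (1 + b^2) * P * R / (b^2 * P + R).
    With beta=1 this is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(0.8, 0.6), 4))  # 0.6857
```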
5.3 Results
We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline using no features, just the tokenized words (TOK). The FREQ baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that this baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE
3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.
tokens that are not in the original data set.6
In Table 2, we present the results yielded per feature and per condition. We first experimented with different context sizes to decide on the optimal window size for our learning framework; the results are presented in Table 1. Once that was determined, we proceeded to add features.
Noting that a window size of ±3 yields the best results, we proceed to use that as our context size for the following experimental conditions. We do not include accuracy since it is above 96% for all our experimental conditions.
All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic baseline FREQ, with an overall F measure of 77.04.
Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.
Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.
Combining NEI+L+N+P yields the highest results, with an overall F measure of 84.58, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence the significant contribution to performance.
6 Discussion
The overall results strongly suggest that using linguistically interesting features explicitly has a positive impact on performance. NE features help the most, and combining them with other features
6 We could have easily identified all VNC syntactic configurations corresponding to verb object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal score baseline; however, for this investigation we decided to work strictly with the gold standard data set exclusively.
yields the best results. In general, performance on the classification and identification of idiomatic expressions was much better. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.
We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones (23 times) and the opposite (4 times).
Some of the VNC types that should have been classified as literal but were classified as idiomatic by the system are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal are the MWE types hit the road, blow trumpet, blow whistle, hit a wall.
The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.
Also, the system identified hit the post as a literal MWE VNC token in As the ball hit the post, the referee blew the whistle, where blew the whistle is a literal VNC in this context; hit the post is identified as another literal VNC.
7 Conclusion
In this study we explore a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system; using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance with an overall F measure of 84.58. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.
Window  IDM-F  LIT-F  Overall F  Overall Acc
±1      77.93  48.57  71.78      96.22
±2      85.38  55.61  79.71      97.06
±3      86.99  55.68  81.25      96.93
±4      86.22  55.81  80.75      97.06
±5      83.38  50.00  77.63      96.61

Table 1: Results (in %) for varying context window sizes.
                 IDM-P  IDM-R  IDM-F  LIT-P  LIT-R  LIT-F  Overall F
FREQ             70.02  89.16  78.44   0      0      0     69.68
TOK              81.78  83.33  82.55  71.79  43.75  54.37  77.04
(L)EMMA          83.10  84.29  83.69  69.77  46.88  56.07  78.11
(N)GRAM          83.17  82.38  82.78  70.00  43.75  53.85  77.01
(P)OS            83.33  83.33  83.33  77.78  43.75  56.00  78.08
L+N+P            86.95  83.33  85.38  72.22  45.61  55.91  79.71
NES-Full         85.20  87.93  86.55  79.07  58.62  67.33  82.77
NES-Bin          84.97  82.41  83.67  73.49  52.59  61.31  79.15
NEI              89.92  85.18  87.48  81.33  52.59  63.87  82.82
NES-Full+L+N+P   89.89  84.92  87.34  76.32  50.00  60.42  81.99
NES-Bin+L+N+P    90.86  84.92  87.79  76.32  50.00  60.42  82.33
NEI+L+N+P        91.35  88.42  89.86  81.69  50.00  62.03  84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations.
8 Acknowledgement
The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, which helped make this publication more concise and better presented.
References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89-96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329-336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41-48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96-103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLING.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992-1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353-360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317-324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369-379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97-108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33-40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1-15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100-108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754-762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, Czech Republic, June. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23-30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de
Abstract
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.
1 Introduction
Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al 2001; Dyvik 2004). In this paper we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:
(1) Der Rat     sollte unsere Position berücksichtigen
    The Council should our    position consider

(2) The Council should take account of our position
This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences and are able to identify various MWE patterns.
In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, eg (Pecina 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio 2002) or verb PP combinations (Villada Moirón and Tiedemann 2006); 2) the ranking of this candidate list by an appropriate association measure.
The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for this item in a target language (and whether these
one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches that need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.
In this work we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, but is based on the intuition that monolingual idiosyncrasies like the MWE realization of an entity are not likely to be mirrored in another language (see Section 4 for discussion).
Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.
2 Multiword Translations as MWEs
The idea to exploit one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                 1-1   1-n   n-1   n-n   No.
anheben (v1)        53.5  21.2   9.2   1.6   325
bezwecken (v2)      16.7  51.3   0.6  31.3   150
riskieren (v3)      46.7  35.7   0.5   1.7   182
verschlimmern (v4)  30.2  21.5  28.6  44.5   275

Table 1: Proportions of types of translational correspondence (token level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).
For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.
A closer look at the one-to-many translations reveals that these cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as an MWE. Below we shortly illustrate the different classes of one-to-many translations we found in our gold standard.
Morphological variations. This type of one-to-many translation is mainly due to non-parallel realization of tense. It's rather irrelevant from an MWE perspective, but easy to discover and filter automatically.
(3) Sie  verschlimmern die Übel
    They aggravate     the misfortunes

(4) Their action is aggravating the misfortunes
Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.
(5) Der Ausschuss bezweckt den Institutionen ein politisches Instrument an die Hand zu geben
    The committee intends  the institutions  a   political   instrument at the hand to give

(6) The committee set out to equip the institutions with a political instrument
Verb preposition combinations. While this class isn't discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.
(7) Sie  werden den Treibhauseffekt    verschlimmern
    They will   the green house effect aggravate
(8) They will add to the green house effect
Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations since LVCs exhibit specific lexical restrictions (Sag et al 2002).
(9) Ich werde die Sache nur  noch verschlimmern
    I   will  the thing only just aggravate

(10) I am just making things more difficult
Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It's not very frequent in our gold standard, which may be mainly due to its limited size and the source items we chose.
(11) Sie  bezwecken die Umgestaltung in   eine zivile Nation
     They intend    the conversion   into a    civil  nation

(12) They have in mind the conversion into a civil nation
         v1       v2       v3       v4
Ntype    22 (26)  41 (47)  26 (35)  17 (24)
V Part   22.7      4.9      0.0      0.0
V Prep   36.4     41.5      3.9      5.9
LVC      18.2     29.3     88.5     88.2
Idiom     0.0      2.4      0.0      0.0
Para     36.4     24.3     11.5     23.5

Table 2: Proportions of MWE types per lemma.
Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, like in the example below: ensure that something increases. However, semantically these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.
(13) Wir brauchen bessere Zusammenarbeit um die Rückzahlungen   anzuheben
     We  need     better  cooperation    to  the repayments.OBJ increase

(14) We need greater cooperation in this respect to ensure that repayments increase
Table 2 displays the proportions of the MWE categories over the number of types of one-to-many correspondence in our gold standard. We filtered out the types due to morphological variations only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, eg they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) or bezwecken (en. intend) are frequently
translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs, which do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).
In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.
3 Multiword Translation Detection
This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2 we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.
3.1 One-to-many Alignments
To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, which uses the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney 2003). We evaluate, at the type level, the translational correspondences that the system discovers against the one-to-many translations in our gold standard. By type we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the
verb   n > 0         n > 1         n > 3
       FPR    FNR    FPR    FNR    FPR    FNR
v1     0.97   0.93   1.0    1.0    1.0    1.0
v2     0.93   0.9    0.5    0.96   0.0    0.98
v3     0.88   0.83   0.8    0.97   0.67   0.97
v4     0.98   0.92   0.8    0.92   0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.
translation types is so low that already at a threshold of 3 almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that corresponds to the source token is rare.
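The type-level FPR and FNR used here can be sketched as follows; `type_error_rates` is a hypothetical helper, and representing each translation type as a frozenset of lemmatized target tokens is our assumption based on the "set of lemmatized English tokens" definition above:

```python
def type_error_rates(system_types, gold_types):
    """FPR: fraction of system translation types absent from the gold
    standard; FNR: fraction of gold types the system never produced.
    Types are compared as lemmatized sets of target tokens."""
    system_types, gold_types = set(system_types), set(gold_types)
    fpr = len(system_types - gold_types) / len(system_types) if system_types else 0.0
    fnr = len(gold_types - system_types) / len(gold_types) if gold_types else 0.0
    return fpr, fnr
```

For instance, if the gold standard for a lemma contains the types {increase} and {take, account, of} and the system outputs {increase} and a spurious type, both FPR and FNR are 0.5.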
This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (eg particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.
(15) Die Korruption behindert die Entwicklung
     The corruption impedes   the development

(16) Corruption constitutes an obstacle to development
Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much more unlikely to be found.
3.2 Aligning Syntactic Configurations
High-precision detection of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found; 2) how to align target items that have a low occurrence correlation.
We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.
Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.
Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering the configurations; c) alignment post-editing, ie assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.
Figure 1: Example of a typical syntactic MWE configuration.
Data. We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence pair in our resource can be represented by the triple (DG, DE, AGE): DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel and s1, s2 are words of the source language; DE is the set of dependency triples of the target sentence; AGE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
Candidate Generation. This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (eg its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.
Filtering. Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in DG there is a tuple (sn, tn) in AGE, ie every dependent has an alignment.
3. There is a target item t1 in DE such that for each tn there is a path p in DE, p = (t1, rel, tx)(tx, rel, ty)...(tz, rel, tn), that connects t1 and tn. Thus the target dependents have a common root.

To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, the target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
Alignment Post-editing. In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in 3.2 was that all
the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.
Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) in DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) is in AGE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti in T is an element of our target configuration:
1. Compute corr(s1, T), the correlation between s1 and T.
2. For each ti, tx such that there is a (ti, rel, tx) in DE, compute corr(s1, T + tx).
3. If corr(s1, T + tx) >= corr(s1, T), add tx to T.
As the Dice coefficient is often reported to give the best results, e.g. in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula
corr(s_1, T) = 2 · freq(s_1 ∧ T) / (freq(s_1) + freq(T))
We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all t_i ∈ T, and freq(s_1) accordingly. The observed frequency freq(s_1 ∧ T) is the number of sentence pairs where s_1 occurs in the source sentence and T in the target sentence.
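The post-editing step above can be sketched in a few lines. The following is a minimal, simplified illustration (function and variable names are ours, not the paper's): the corpus is a list of (source-lemma-set, target-lemma-set) sentence pairs, and candidate dependents are passed in as a flat list rather than read off dependency parses.

```python
# Sketch of the alignment post-editing algorithm with the Dice coefficient.
# corpus: list of (source_lemmas, target_lemmas) set pairs per sentence pair.

def dice(s1, T, corpus):
    """corr(s1, T): Dice correlation between source item s1 and target set T."""
    freq_s1 = sum(1 for src, tgt in corpus if s1 in src)
    freq_T = sum(1 for src, tgt in corpus if T <= tgt)  # all items of T present
    joint = sum(1 for src, tgt in corpus if s1 in src and T <= tgt)
    if freq_s1 + freq_T == 0:
        return 0.0
    return 2 * joint / (freq_s1 + freq_T)

def post_edit(s1, T, candidate_dependents, corpus):
    """Greedily add dependents tx that do not lower corr(s1, T)."""
    T = set(T)
    for tx in candidate_dependents:  # items tx with (ti, rel, tx) in D_E
        if dice(s1, T | {tx}, corpus) >= dice(s1, T, corpus):
            T.add(tx)
    return T
```

On the running example, a function word such as an that always co-occurs with the partial translation is absorbed into T, while an unrelated word is rejected.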
The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation of the English MWE run the risk of.
3.3 Evaluation
Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.
We present two experiments in this evaluation section. We will first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.
Translation evaluation. In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma do not figure in the prototypical syntactic configuration. The last two columns of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that does not constrain the syntactic type of the parallel argument relations. While the relaxed filter does not decrease the FNR, it increases the FPR significantly. This means that the strict syntactic filters mainly fire on noisy configurations and do not decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down
the number of candidate configurations that our algorithm detects. As the dependency parses do not provide deep analyses for tense or control phenomena, very often a verb's arguments do not figure as its syntactic dependents and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.
MWE evaluation. In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. To this end, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.
The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.
4 Related Work
The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily rely on linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was
     Strict Filter      Relaxed Filter
     FPR     FNR        FPR     FNR
v1   0.0     0.96       0.5     0.96
v2   0.25    0.88       0.47    0.79
v3   0.25    0.74       0.56    0.63
v4   0.0     0.875      0.56    0.84

Table 4: False positive and false negative rates of one-to-many translations.
Trans. type    Proportion    MWE type    Proportion
MWEs           57.5%         V+Part       8.2%
                             V+Prep      51.8%
                             LVC         32.4%
                             Idiom       10.6%
Paraphrases    24.4%
Alternations    1.0%
Noise          17.1%

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.
established by (Wu, 1997). Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.
Work on MWEs in a crosslingual context has almost exclusively focused on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches do not rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.
By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado, 2009).
The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already
been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of simplex realization in a source language onto an idiosyncratic multiword realization in the target language.
5 Conclusion
We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and a fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for the monolingual interpretation of MWEs.
Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment, and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.
References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 311–326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proceedings of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38/2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 Workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
A Re-examination of Lexical Association Measures
Abstract
We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which shows significant improvement over the original AMs.
1 Introduction
Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures in extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood
ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.
While previous works have evaluated AMs, they offer few details on why the AMs perform as they do. Such an analysis is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to the further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.
We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the
Hung Huu Hoang
Dept. of Computer Science
National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
mobility of noun phrase complements in transitive VPCs. They can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).
In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.
2 Classification of Association Measures
Although different AMs take different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure this association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ² test.
Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then
follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage comprise constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.
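The context-similarity idea above can be made concrete with a small sketch: represent contexts as bag-of-words count vectors and compare them with cosine similarity (the function and the toy vectors are ours, purely for illustration).

```python
# Cosine similarity between two bag-of-words context vectors,
# as used by Class II (context-based) association measures.
from collections import Counter
from math import sqrt

def cosine(v1: Counter, v2: Counter) -> float:
    dot = sum(v1[w] * v2[w] for w in v1)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

A low similarity between the context vector of a phrase (e.g., hot dog) and those of its constituents (hot, dog) is then taken as evidence of non-compositionality.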
Table 1 lists all AMs used in our discussion. The lower-left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
3 Evaluation
We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.
3.1 Evaluation Data
In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.
For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
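The VPC candidate search just described can be sketched as follows, assuming POS-tagged sentences as (token, tag) lists with Penn Treebank verb tags; the particle set here is a small illustrative subset, not the paper's full list of 38.

```python
# Sketch of VPC candidate extraction: for each particle, search left for the
# nearest verb, allowing an intervening NP of up to MAX_NP_GAP words.

PARTICLES = {"up", "out", "on", "off", "down", "in"}  # illustrative subset
MAX_NP_GAP = 5

def vpc_candidates(tagged_sentence):
    cands = []
    for i, (tok, tag) in enumerate(tagged_sentence):
        if tok.lower() in PARTICLES:
            # scan left: positions i-1 down to i-1-MAX_NP_GAP-1 (verb + NP gap)
            for j in range(i - 1, max(-1, i - MAX_NP_GAP - 2), -1):
                if tagged_sentence[j][1].startswith("VB"):
                    cands.append((tagged_sentence[j][0].lower(), tok.lower()))
                    break
    return cands
```

For example, both configurations of a transitive VPC (Take off your hat / Take your hat off) yield the same (take, off) candidate pair.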
Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,
M1  Joint Probability: f(xy)/N
M2  Mutual Information (MI): (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3  Log-likelihood ratio: 2 Σ_ij f_ij log(f_ij / f̂_ij)
M4  Pointwise MI (PMI): log( P(xy) / (P(x*) P(*y)) )
M5  Local-PMI: f(xy) × PMI
M6  PMI^k: log( N f(xy)^k / (f(x*) f(*y)) )
M7  PMI²: log( N f(xy)² / (f(x*) f(*y)) )
M8  Mutual Dependency: log( P(xy)² / (P(x*) P(*y)) )
M9  Driver-Kroeber: a / √((a+b)(a+c))
M10 Normalized expectation: 2a / (2a + b + c)
M11 Jaccard: a / (a + b + c)
M12 First Kulczynski: a / (b + c)
M13 Second Sokal-Sneath: a / (a + 2(b + c))
M14 Third Sokal-Sneath: (a + d) / (b + c)
M15 Sokal-Michiner: (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
M17 Hamann: ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds ratio: ad / bc
M19 Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q: (ad − bc) / (ad + bc)
M21 Brawn-Blanquet: a / max(a + b, a + c)
M22 Simpson: a / min(a + b, a + c)
M23 S cost: log(1 + min(b, c)/(a + 1))^(−1/2)
M24 Adjusted S cost (*): log(1 + max(b, c)/(a + 1))^(−1/2)
M25 Laplace: (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace (*): (a + 1) / (a + max(b, c) + 2)
M27 Fager: [M9] − (1/2) max(b, c)
M28 Adjusted Fager (*): [M9] − max(b, c) / √(aN)
M29 Normalized PMIs (*): PMI/NF(α), PMI/NFmax
M30 Simplified normalized PMI for VPCs (*): log(ad) / (α×b + (1−α)×c)
M31 Normalized MIs (*): MI/NF(α), MI/NFmax

where NF(α) = α·P(x*) + (1−α)·P(*y), α ∈ [0, 1], and NFmax = max(P(x*), P(*y)).

Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(xȳ), c = f21 = f(x̄y), d = f22 = f(x̄ȳ), with marginals f(x*), f(x̄*), f(*y), f(*ȳ); w̄ stands for all words except w, * stands for all words, and N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.
permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.
Separately, we use the following two available sources of annotations: 3078 VPC candidates extracted and annotated in (Baldwin, 2005), and 464 annotated LVC candidates used in (Tan et al., 2006). Both sets of annotations give both positive and negative examples.
Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data       LVC data    Mixed
Total (freq ≥ 6)     413            100         513
Positive instances   117 (28.33%)   28 (28%)    145 (23.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary
goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.
3.2 Evaluation Metric
To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure the precision and recall of only the n most likely candidates
(the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
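Average precision over a ranked candidate list amounts to averaging the precision obtained at the rank of each true positive; a small self-contained sketch (the function name is ours):

```python
# Average precision (AP) of a ranked candidate list: the mean of the
# precision values computed at the rank of each positive candidate.

def average_precision(ranked_labels):
    """ranked_labels: gold labels (True/False), best-scored candidate first."""
    hits, precisions = 0, []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

For example, a ranking [True, False, True, False] has precision 1/1 at the first positive and 2/3 at the second, giving AP = (1 + 2/3)/2 ≈ 0.83.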
3.3 Evaluation Results
Figure 1 (a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that the vast majority of Class I AMs, including PMI, its variants, and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that hypothesis tests rely on.
4 Rank Equivalence
We note that some AMs, although not mathematically equivalent (i.e., assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition. Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r_C M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs.
Corollary. If M1 ≡r_C M2, then APC(M1) = APC(M2), where APC(Mi) stands for the average precision of the AM Mi over the data set C.

Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the following three AMs, Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with their (most simple) representatives from each rank equivalent group.
¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined due to a zero in the denominator while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types. Hence, we still consider such AM pairs to be rank equivalent.
1) [M2] Mutual Information, [M3] Log-likelihood ratio
2) [M7] PMI², [M8] Mutual Dependency, [M9] Driver-Kroeber (a.k.a. Ochiai)
3) [M10] Normalized expectation, [M11] Jaccard, [M12] First Kulczynski, [M13] Second Sokal-Sneath (a.k.a. Anderberg)
4) [M14] Third Sokal-Sneath, [M15] Sokal-Michiner, [M16] Rogers-Tanimoto, [M17] Hamann
5) [M18] Odds ratio, [M19] Yule's ω, [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
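Rank equivalence of group 5 is easy to check empirically: Yule's ω and Yule's Q are both monotone functions of the odds ratio ad/bc, so all three induce the same ranking. A small sketch over random contingency tables (the helper names are ours):

```python
# Empirical check that odds ratio, Yule's omega, and Yule's Q are rank
# equivalent: monotone transforms of ad/bc induce identical rankings.
import random
from math import sqrt

def rank(score_fn, tables):
    scores = [score_fn(*t) for t in tables]
    return sorted(range(len(tables)), key=lambda i: scores[i])

odds  = lambda a, b, c, d: (a * d) / (b * c)                              # M18
omega = lambda a, b, c, d: (sqrt(a*d) - sqrt(b*c)) / (sqrt(a*d) + sqrt(b*c))  # M19
q     = lambda a, b, c, d: (a*d - b*c) / (a*d + b*c)                      # M20

random.seed(0)
tables = [tuple(random.randint(1, 50) for _ in range(4)) for _ in range(200)]
assert rank(odds, tables) == rank(omega, tables) == rank(q, tables)
```

Writing θ = ad/bc, Q = (θ − 1)/(θ + 1) and ω = (√θ − 1)/(√θ + 1), both strictly increasing in θ, which is why the sorted orders coincide.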
5 Examination of Association Measures
We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.
5.1 Mutual Information and Pointwise Mutual Information
In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in
[Figure 1a: AP profile of AMs examined over our VPC data set]
[Figure 1b: AP profile of AMs examined over our LVC data set]

Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Bold points indicate AMs discussed in this paper. Hypothesis test AMs; ◊ Class I AMs excluding hypothesis test AMs; + Context-based AMs.
identifying LVCs. In this section, we show how their performance can be improved significantly.
Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U; V) = Σ_{u,v} p(uv) log( p(uv) / (p(u*) p(*v)) )

In the context of bigrams, the above formula can be simplified to

[M2] MI = (1/N) Σ_{i,j} f_ij log( f_ij / f̂_ij )

While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log( P(xy) / (P(x*) P(*y)) ) = log( N f(xy) / (f(x*) f(*y)) )

It has long
been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI by f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k
is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying
PMI by f(xy) directly reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM               VPCs          LVCs          Mixed
Best [M6] PMI^k  54.7 (k=.13)  57.3 (k=.85)  54.4 (k=.32)
[M4] PMI         51.0          56.6          51.5
[M5] Local-PMI   25.9          39.3          27.2
[M1] Joint Prob  17.0          28.0          17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.
Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X; Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and PMI(x, y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score
ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α·P(x*) + (1−α)·P(*y) with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC task, respectively. If one re-writes MI as (1/N) Σ_{i,j} f_ij × PMI_ij, it is easy to see the heavy
dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.
AM          VPCs          LVCs          Mixed
MI/NF(α)    50.8 (α=.48)  58.3 (α=.47)  51.6 (α=.5)
MI/NFmax    50.8          58.4          51.8
[M2] MI     27.3          43.5          28.9
PMI/NF(α)   59.2 (α=.8)   55.4 (α=.48)  58.8 (α=.77)
PMI/NFmax   56.5          51.7          55.6
[M4] PMI    51.0          56.6          51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
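The proposed normalization is a one-line change to PMI; the following sketch (function names ours) shows PMI divided by NF(α) and by NFmax, with P(xy), P(x*), P(*y) as inputs:

```python
# Sketch of the normalized PMI variants [M29]: PMI / NF(alpha) and PMI / NFmax,
# where NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y), NFmax = max(P(x*), P(*y)).
from math import log

def pmi(pxy, px, py):
    return log(pxy / (px * py))

def pmi_nf_alpha(pxy, px, py, alpha):
    return pmi(pxy, px, py) / (alpha * px + (1 - alpha) * py)

def pmi_nf_max(pxy, px, py):
    return pmi(pxy, px, py) / max(px, py)
```

NFmax needs no tuning, which matches the paper's recommendation for mixed MWE types or small datasets; NF(α) exposes α for per-type optimization.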
5.2 Penalization Terms
It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, i.e., formulas whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.
Take as an example the AMs [M21] Braun-Blanquet (a.k.a. Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Braun-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in AMs [M23] S Cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.
AM                     | VPCs | LVCs | Mixed
[M21] Braun-Blanquet   | .478 | .578 | .486
[M22] Simpson          | .249 | .382 | .260
[M24] Adjusted S Cost  | .485 | .577 | .492
[M23] S Cost           | .249 | .388 | .260
[M26] Adjusted Laplace | .486 | .577 | .493
[M25] Laplace          | .241 | .388 | .254

Table 6: Replacing min() with max() in selected AMs.
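The min/max contrast is easiest to see in code. The sketch below, with our own function names, uses the same contingency notation (a = joint frequency; b, c = the marginal-only frequencies of the two constituents):

```python
def braun_blanquet(a, b, c):
    # penalizes against the MORE frequent constituent
    return a / (a + max(b, c))

def simpson(a, b, c):
    # penalizes against the LESS frequent constituent
    return a / (a + min(b, c))

# A VPC-like candidate: modest joint count, a very frequent particle (c)
print(braun_blanquet(10, 2, 50))  # low: the frequent particle drags the score down
print(simpson(10, 2, 50))         # high: the frequent particle is ignored
```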
In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is none other than a rank equivalent of [M7] PMI². In our application this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½·max(b, c). In order to make use of the first term, we need to replace the constant ½ by a scaled-down version of max(b, c). We have approximately derived 1 aN as a lower-bound estimate of max(b, c) using the independence assumption, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                   | VPCs | LVCs | Mixed
[M28] Adjusted Fager | .564 | .543 | .554
[M27] Fager          | .552 | .439 | .525

Table 7: Performance of Fager and its adjusted version.
The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus this AM runs the risk of ranking VPC candidates based only on the frequencies of particles, so it is necessary that we scale b and c properly, as in [M14]: −α·b − (1−α)·c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14] = PMI / (α·b + (1−α)·c). The denominator of [MR14] is obtained by removing the minus sign from [M14], so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α·P(x∗) + (1−α)·P(∗y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically tried to simplify [MR14] to the following AM: [M30] = log(ad) / (α·b + (1−α)·c). The α setting in Table 8 below is taken from the best α setting obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI, and better than Third Sokal-Sneath, over the VPC dataset.
AM                              | VPCs          | LVCs           | Mixed
PMI / NF(α)                     | .592 (α = .8) | .554 (α = .48) | .588 (α = .77)
[M30] log(ad) / (α·b + (1−α)·c) | .600 (α = .8) | .484 (α = .48) | .588 (α = .77)
[M14] Third Sokal-Sneath        | .565          | .453           | .546

Table 8: AP performance of suggested VPCs' penalization terms and AMs.
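Under our reading of the formula above, [M30] can be written out directly. This is an illustrative sketch, not the authors' code:

```python
import math

def m30(a, b, c, d, alpha=0.8):
    """[M30]: log(ad) divided by the linear penalization term
    alpha*b + (1-alpha)*c (our transcription of the formula above)."""
    return math.log(a * d) / (alpha * b + (1 - alpha) * c)
```

With α close to 1, the verb frequency b dominates the penalty, so a very frequent particle (large c) cannot swamp the ranking of VPC candidates.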
With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α·c^(1−α). The best α value is also taken from the normalized PMI experiments (Table 5), and is nearly .5. Under this setting, this penalization term is exactly the denominator of [M18] Odds Ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.
AM               | VPCs | LVCs | Mixed
−b − c           | .565 | .453 | .546
1/(b·c)          | .502 | .532 | .502
[M18] Odds Ratio | .443 | .567 | .456

Table 9: AP performance of suggested LVCs' penalization terms and AMs.
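The two penalization terms can be contrasted in a few lines; the function names are ours:

```python
def linear_penalty(b, c, alpha):
    """Weighted sum of marginal frequencies: suited to VPCs (Table 8)."""
    return alpha * b + (1 - alpha) * c

def product_penalty(b, c, alpha):
    """Weighted product b^alpha * c^(1-alpha): suited to LVCs (Table 9).
    At alpha = .5 this is sqrt(b*c), rank equivalent to the product b*c,
    i.e. the denominator of [M18] Odds Ratio."""
    return b ** alpha * c ** (1 - alpha)
```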
6 Conclusions
We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.
We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of .588 compared to .515 when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of .583.
We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α·b + (1−α)·c is a good penalization term for VPCs, while b^α·c^(1−α) is suitable for LVCs. Our introduced α tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.
Acknowledgement: This work was partially supported by a National Research Foundation grant "Interactive Media Search" (grant R 252 000 325 279).
References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco and Villavicencio, Aline (2008). An Evaluation of Methods for the Extraction of Multiword Expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Yee Fan, Kan, Min-Yen and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word Expressions in a Multilingual Context, pages 49–56, Trento, Italy.

Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.
Appendix A: List of particles used in identifying verb-particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in
Abstract
A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitute an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of its light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.
1 Introduction
Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family, and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.

CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007) and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features. They report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.
In this paper, we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of CPs in Hindi is outlined; this is followed by the system design, experimentation and results.
2 Complex Predicates in Hindi
A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.
Given below are some examples.

(1) CP = noun+LV (noun = ashirwad 'blessings', LV = denaa 'to give'):
usane mujhe ashirwad diyaa
उसन मझ आश वाद दया
he me blessings gave
'he blessed me'

(2) No CP:
usane mujhe ek pustak dii
उसन मझ एक प तक द
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings) makes the complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2) the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV (adjective = khush 'happy', LV = karanaa 'to do'):
usane mujhe khush kiyaa
उसन मझ खश कया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV (verb = paRhnaa 'to read', LV = lenaa 'to take'):
usane pustak paRh liyaa
उसन प तक पढ़ लया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual, a modal or an intensifier.

(5) CP = verb+LV (verb = phaadanaa 'to tear', LV = denaa 'to give'):
usane pustak phaad diyaa
उसन प तक फाड़ दया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV (verb = denaa 'to give', LV = maaranaa 'to hit, to kill'):
usane pustak de maaraa
उसन प तक द मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning. This may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2 (adverb = vaapas 'back', LV1 = karanaa 'to do', LV2 = denaa 'to give'), or CP = CP+LV (CP = vaapas karanaa 'to return', LV = denaa 'to give'):
usane pustak vaapas kar diyaa
उसन प तक वापस कर दया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.
From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.
3 System Design
As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows:

1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as simple verbs (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) or any of its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or a morphological derivative is found, then search for the equivalent English meanings, in any of their morphological forms (Figure 2), in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit, i.e. go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e. go to step (a); else the identified CP is W+LV.
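The steps above can be sketched as follows. This is our own illustrative rendering of the procedure, with hypothetical data-structure names: lv_forms maps each light verb to the set of its Hindi morphological forms, and lv_meanings maps it to all morphological forms of its English meanings.

```python
def find_cps(hindi_tokens, english_tokens, lv_forms, lv_meanings,
             stop_words, exit_words):
    """Mine CP candidates from one aligned sentence pair (a sketch)."""
    cps = []
    english = set(english_tokens)
    for lv, forms in lv_forms.items():
        for k, tok in enumerate(hindi_tokens):
            if tok not in forms:
                continue                    # step (i): locate the LV at position K
            if lv_meanings[lv] & english:
                continue                    # step (ii)/(iii): meaning found -> main verb
            i = k - 1                       # step (iii): scan left of K
            while i >= 0 and hindi_tokens[i] in stop_words:
                i -= 1                      # step (iv): skip stop words
            if i >= 0 and hindi_tokens[i] not in exit_words:
                cps.append((hindi_tokens[i], tok))  # steps (v)-(vi): CP = W + LV
    return cps
```

On example (1) above (romanized), no form of 'give' appears in the English side, so ashirwad diyaa is emitted; on example (2), 'gave' is present and dii is treated as a main verb.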
Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as simple verbs, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings have been shown within parentheses and are not used for matching.
light verb (base form) | root verb meaning
baithanaa बैठना  | sit
bananaa बनना    | make/become/build/construct/manufacture/prepare
banaanaa बनाना  | make/build/construct/manufacture/prepare
denaa देना      | give
lenaa लेना      | take
paanaa पाना     | obtain/get
uthanaa उठना    | rise/arise/get up
uthaanaa उठाना  | raise/lift/wake up
laganaa लगना    | feel/appear/look/seem
lagaanaa लगाना  | fix/install/apply
cukanaa चुकना   | (finish)
cukaanaa चुकाना | pay
karanaa करना    | (do)
honaa होना      | happen/become/be
aanaa आना       | come
jaanaa जाना     | go
khaanaa खाना    | eat
rakhanaa रखना   | keep/put
maaranaa मारना  | kill/beat/hit
daalanaa डालना  | put
haankanaa हाँकना | drive

Table 1: Some of the common light verbs in Hindi.
For each Hindi light verb, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations of the same.

There are a number of words that can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.
We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list. However, this list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform parts-of-speech (POS) tagging, and so the nature of the word preceding the LV is unknown to the system.

LV: jaanaa जाना (to go). Morphological derivatives: jaa jaae jaao jaaeM jaauuM jaane jaanaa jaanii jaataa jaatii jaate jaaniiM jaatiiM jaaoge jaaogii gaii jaauuMgA jaayegaa jaauuMgii jaayegii gaye gaiiM gayaa gayii jaayeMge jaayeMgI jaakara
जा (go, stem); जाए (go, imperative); जाओ (go, imperative); जाए (go, imperative); जाऊ (go, first-person); जान (go, infinitive oblique); जाना (go, infinitive masculine singular); जानी (go, infinitive feminine singular); जाता (go, indefinite masculine singular); जाती (go, indefinite feminine singular); जात (go, indefinite masculine plural/oblique); जानी (go, infinitive feminine plural); जाती (go, indefinite feminine plural); जाओग (go, future masculine singular); जाओगी (go, future feminine singular); गई (go, past feminine singular); जाऊगा (go, future masculine first-person singular); जायगा (go, future masculine third-person singular); जाऊगी (go, future feminine first-person singular); जायगी (go, future feminine third-person singular); गय (go, past masculine plural/oblique); गई (go, past feminine plural); गया (go, past masculine singular); गयी (go, past feminine singular); जायग (go, future masculine plural); जायगी (go, future feminine plural); जाकर (go, completion); ...

Figure 1(a): Morphological derivatives of the sample Hindi light verb 'jaanaa' जाना (to go)

LV: lenaa लेना (to take). Morphological derivatives: le lii leM lo letaa letii lete liiM luuM legaa legii lene lenaa lenii liyaa leMge loge letiiM luuMgaa luuMgii lekara
ल (take, stem); ल (take, past); ल (take, imperative); लो (take, imperative); लता (take, indefinite masculine singular); लती (take, indefinite feminine singular); लत (take, indefinite masculine plural/oblique); ल (take, past feminine plural); ल (take, first-person); लगा (take, future masculine third-person singular); लगी (take, future feminine third-person singular); लन (take, infinitive oblique); लना (take, infinitive masculine singular); लनी (take, infinitive feminine singular); लया (take, past masculine singular); लग (take, future masculine plural); लोग (take, future masculine singular); लती (take, indefinite feminine plural); लगा (take, future masculine first-person singular); लगी (take, future feminine first-person singular); लकर (take, completion); ...

Figure 1(b): Morphological derivatives of the sample Hindi light verb 'lenaa' लेना (to take)

English word: sit. Morphological derivations: sit, sits, sat, sitting
English word: give. Morphological derivations: give, gives, gave, given, giving
...

Figure 2: Morphological derivatives of sample English meanings
Figure 3: Stop words in Hindi used by the system
Figure 4: A few exit words in Hindi used by the system
The inner loop of the procedure identifies the multiple CPs that may be present in a sentence; the outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.
4 Experimentation and Results
The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%. The F-measure is 88% to 97%. This is a remarkable and somewhat surprising result from the simple methodology, without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al., 2006 (83% precision and 46% recall), who use projection of POS and word alignment for CP identification. That is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                | File 1 | File 2 | File 3 | File 4 | File 5 | File 6
No. of sentences                | 112    | 193    | 102    | 43     | 133    | 107
Total no. of CPs (N)            | 200    | 298    | 150    | 46     | 188    | 151
Correctly identified CPs (TP)   | 195    | 296    | 149    | 46     | 175    | 135
V-V CPs                         | 56     | 63     | 9      | 6      | 15     | 20
Incorrectly identified CPs (FP) | 17     | 44     | 7      | 11     | 16     | 20
Unidentified CPs (FN)           | 5      | 2      | 1      | 0      | 13     | 16
Accuracy (%)                    | 97.50  | 99.33  | 99.33  | 100.0  | 93.08  | 89.40
Precision (TP/(TP+FP), %)       | 91.98  | 87.05  | 95.51  | 80.70  | 91.62  | 87.09
Recall (TP/(TP+FN), %)          | 97.50  | 98.33  | 99.33  | 100.0  | 93.08  | 89.40
F-measure (2PR/(P+R), %)        | 94.6   | 92.3   | 97.4   | 89.3   | 92.3   | 88.2

Table 2: Results of the experimentation
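The evaluation measures in Table 2 reduce to a small helper function (ours, for illustration):

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure as defined in Table 2."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# File 1 of Table 2: TP = 195, FP = 17, FN = 5
p, r, f = prf(195, 17, 5)
print(round(p * 100, 2), round(r * 100, 2), round(f * 100, 1))
```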
Exit words: न (ergative case marker), को (accusative case marker), का (possessive case marker), क (possessive case marker), की (possessive case marker), स (from/by/with), म (in/into), पर (on/but), और (and, Hindi particle), तथा (and), या (or), ल कन (but), पर त (but), क (that, Hindi particle), म (I), तम (you), आप (you), वह (he/she), मरा (my), मर (my), मर (my), त हारा (your), त हार (your), त हार (your), उसका (his), उसकी (her), उसक (his/her), अपना (own), अपनी (own), अपन (own), उनक (their), मन (I, ergative), त ह (to you), आपको (to you), उसको (to him/her), उनको (to them), उ ह (to them), मझको (to me), मझ (to me), िजसका (whose), िजसकी (whose), िजसक (whose), िजनको (to whom), िजनक (to whom)

Stop words: नह (no/not), न (no/not, Hindi particle), भी (also, Hindi particle), ह (only, Hindi particle), तो (then, Hindi particle), य (why), या (what, Hindi particle), कहा (where, Hindi particle), कब (when), यहा (here), वहा (there), जहा (where), पहल (before), बाद म (after), श म (beginning), आर भ म (beginning), अत म (in the end), आ खर म (in the end)
Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मझ ब च क माता - पताओ क साथ काम करना भी अ छा लगता ह जो क अ सर सलाह लन आत ह - यह जानकार खशी होती ह क आप कसी की मदद कर सकत ह |
The CPs identified in the sentence:
i. काम करना (to work)
ii. अ छा लगना (to feel good, enjoy)
iii. सलाह लना (to seek advice)
iv. खशी होना (to feel happy, good)
v. मदद करना (to help)

Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लना (to seek advice) and खशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जस लोग जो क ब च और उनक भ व य क बार म सोचत ह - इस समय भी हज़ार ब च को लाभ पहचा रह ह | अ छ आदश बनन क लए ऐस लोग म तब ता उ सराह और लगन ह और व एक समथन - यो य यवसाय की क करत ह |
The CPs identified in the sentence:
i. आदश बनना (to be a role model)
ii. क करना (to respect)
Here also the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not being checked. However, the mining success rate obtained shows that these cases are small in number in practice. Use of the 'stop words' in allowing intervening words within CPs helps a lot in improving the performance. Similarly, use of the 'exit words' avoids a lot of incorrect identification.
5 Conclusions
The simple empirical method for mining CPs outlined in this work yields an average of 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi, with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both the alignment accuracy and the speed.

The methodology presented in this work is equally applicable to the other languages of the Indo-Aryan family.
References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the Content of Function Words and the Function of Content Words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553–564.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 47–54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2
1 Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
2 Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn, huangyun@comp.nus.edu.sg
Abstract
Multiword expressions (MWEs) have been proven useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, a state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.
1 Introduction
The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To exploit the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into an existing phrase-based model.
A multiword expression (MWE) can be considered a word sequence with relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and different researchers emphasize different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e., the property that the meaning of the whole expression cannot be derived from their
component words. Stanford University launched a MWE project1, in which different qualities of MWEs were presented. For bilingual multiword expressions (BiMWEs), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language; and (2) the source phrase and the target phrase are exact translations of each other, i.e., there is no additional (boundary) word in the target phrase that has no corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation, and so on.
For machine translation, Piao et al. (2005) noted that MWE identification and accurate interpretation from source to target language remain an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to build this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as
1 http://mwe.stanford.edu
one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in a further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.
In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are as follows:
• Instead of using a bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.
• Instead of improving translation indirectly by improving word alignment quality, we directly target translation quality. Some researchers have argued that large gains in alignment performance under many metrics only lead to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).
Besides the above differences, there are some advantages to our approaches:
• In our method, automatically extracted MWEs are used as an additional resource rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would further propagate to the SMT system and hurt SMT performance.
• We conduct experiments on domain-specific corpora. For one thing, a domain-specific
2 http://www.statmt.org/moses
corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e., a domain-specific corpus has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are more flexible and concise. Take, for example, a Chinese traditional medicine term meaning "soften hard mass and dispel pathogenic accumulation". Every word of this term carries a special meaning and cannot be understood literally or out of this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.
• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus, it allows full exploitation of the available resources without greatly increasing the time and space complexity of the SMT system.
The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction to monolingual and bilingual MWE extraction.
2 Bilingual Multiword Expression Extraction
In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for translations of the extracted MWEs in the parallel corpus.
2.1 Automatic Extraction of MWEs
In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches can be classified into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et
al., 2005); (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007); and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.
The Log-Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary length. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".
Definition 1 (Unit): A unit is any sub-string of the given sentence. For example, "A B", "C", and "C D E" are all units, but "A B D" is not a unit.
Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.
Definition 3 (Score): The score function is defined only on two adjacent units and returns the LLR between the last word of the first unit and the first word of the second unit3. For example, the score of adjacent units "B C" and "D E" is defined as LLR("C", "D").
Definition 4 (Select): The selecting operator finds the two adjacent units with maximum score in a list.
Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce unit "B C" and unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".
Initially, every word in the sentence is considered one unit, and all these units form an initial list L. If the sentence is of length N, then the
3 We use a stoplist to eliminate units containing function words by setting their score to 0.
list contains N units, of course. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with maximum score in L, naming them U1 and U2; and (2) reduce U1 and U2 in L and insert the reduced result into the final set S. The algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.
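The select/reduce loop can be sketched as follows. This is a hypothetical rendering, not the authors' implementation: it assumes the pairwise LLR scores have been precomputed into a dictionary, and function and variable names are illustrative.

```python
# Sketch of the LLR-based Hierarchical Reducing Algorithm (HRA).
# Assumes an LLR table for adjacent word pairs is computed elsewhere
# and passed in as a dict; names are illustrative.

def hra_extract(sentence, llr, threshold):
    """sentence: list of words; llr: dict mapping (w1, w2) -> LLR score."""
    units = [(w,) for w in sentence]   # initial list L: one unit per word
    mwes = []                          # final set S, kept in extraction order
    while len(units) > 1:
        # Select: adjacent unit pair with maximum score, where the score is
        # the LLR between the last word of the first unit and the first
        # word of the second unit.
        best_i, best_score = None, float("-inf")
        for i in range(len(units) - 1):
            s = llr.get((units[i][-1], units[i + 1][0]), 0.0)
            if s > best_score:
                best_i, best_score = i, s
        if best_score < threshold:     # termination condition (1)
            break
        # Reduce: concatenate the two units and put the result back.
        merged = units[best_i] + units[best_i + 1]
        units[best_i:best_i + 2] = [merged]
        mwes.append(" ".join(merged))
    return mwes
```

Because one reduce happens per iteration, the loop runs at most N − 1 times, matching the termination bound stated above.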
Figure 1: Example of the Hierarchical Reducing Algorithm
Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is the Chinese phrase meaning "healthy tea for preventing hyperlipidemia"4. Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs (glossed "hyper-lipid", "hyperlipidemia", "healthy tea", and "preventing hyperlipidemia") are extracted, in that order, and the sub-strings above the dotted line in Figure 1 are not extracted.
From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structural pattern of natural language. Furthermore, in the HRA, MWEs are measured with the minimum LLR of the adjacent words within them, which gives lexical confidence to the extracted MWEs. Finally, for a given sentence of length N, the HRA definitely terminates within N − 1 iterations, which is very efficient.
However, the HRA has a problem: it extracts substrings before extracting the whole string, even if the substrings only appear in that particular whole string, which we consider useless. To solve this problem, we use contextual features,
4 The whole sentence means "healthy tea for preventing hyperlipidemia"; the Chinese words gloss as "preventing", "hyper-", "-lipid-", "-emia", "for", "healthy", and "tea".
contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.
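Contextual (left or right) entropy can be sketched as the entropy of the distribution of words observed next to a candidate: a value near zero means the candidate almost always occurs inside the same larger string, so it can be filtered out. The helper below is a hypothetical illustration of that idea, not the exact formulation of Luo and Sun (2003).

```python
import math
from collections import Counter

def contextual_entropy(contexts):
    """contexts: list of the words observed immediately to one side of a
    candidate MWE across the corpus. Returns the entropy (in bits) of
    their distribution; low entropy suggests the candidate is only a
    fragment of a larger fixed string."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A candidate whose left and right entropies both fall below a threshold would then be discarded before the translation-mining step.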
2.2 Automatic Extraction of MWE Translations
Subsection 2.1 described the algorithm for obtaining MWEs; in this subsection, we introduce the procedure for finding their translations in the parallel corpus.
To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:
1. Run GIZA++5 to align words in the training parallel corpus.
2. For a given MWE, find the bilingual sentence pairs whose source language sentences include the MWE.
3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).
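Och (2002)'s extraction keeps target spans that are consistent with the word alignment: no alignment link may cross the boundary of the phrase pair, and the span may be extended over unaligned target words. A minimal sketch of step 3, assuming 0-based word positions and a set of alignment links; names are illustrative:

```python
def candidate_translations(src_span, alignment, tgt_len):
    """src_span: (begin, end) inclusive source positions of the MWE;
    alignment: set of (src_pos, tgt_pos) links; tgt_len: target length.
    Returns target spans (i, j) consistent with the word alignment."""
    s0, s1 = src_span
    linked = [t for s, t in alignment if s0 <= s <= s1]
    if not linked:
        return []
    t0, t1 = min(linked), max(linked)
    # Consistency: no target word inside [t0, t1] may align outside the span.
    if any(t0 <= t <= t1 and not (s0 <= s <= s1) for s, t in alignment):
        return []
    aligned_tgt = {t for _, t in alignment}
    spans = []
    lo = t0
    while True:  # optionally extend over unaligned target words on each side
        hi = t1
        while True:
            spans.append((lo, hi))
            if hi + 1 < tgt_len and hi + 1 not in aligned_tgt:
                hi += 1
            else:
                break
        if lo - 1 >= 0 and lo - 1 not in aligned_tgt:
            lo -= 1
        else:
            break
    return spans
```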
After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between source phrase and target phrase, and language features, which refer to how well a candidate is a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.
5 http://www.fjoch.com/GIZA++.html
We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
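With the features fixed, the classifier of Collins (2002) reduces here to a standard binary perceptron over the feature values. A minimal sketch with feature extraction omitted; function names and the dense feature representation are illustrative, not the authors' implementation:

```python
def train_perceptron(data, epochs=10):
    """data: list of (feature_vector, label) pairs with label in {+1, -1}
    (+1 = correct translation candidate). Returns the weight vector
    learned by the perceptron update rule."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                 # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    """Label a candidate phrase pair by the sign of the weighted score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

In practice the averaged-perceptron variant from the same paper is often preferred for better generalization.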
3 Application of Bilingual MWEs
Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.
3.1 Model Retraining with Bilingual MWEs
The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs were extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance was shown by many researchers (Wu et al., 2008; Eck et al., 2004). However, our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
3.2 New Feature for Bilingual MWEs
Lopez and Resnik (2006) pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source language phrase contains a MWE (as a substring) and
the target language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
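The 0/1 feature can be appended to each phrase-table entry with a scan like the following sketch. The in-memory data layout is illustrative, not the Moses phrase-table format; a real pass would rewrite the table file line by line.

```python
def add_mwe_feature(phrase_table, bimwes):
    """phrase_table: list of (src, tgt, scores) entries; bimwes: dict
    mapping a source MWE to its target translation. Appends a 0/1
    feature: 1 iff src contains a MWE as a substring and tgt contains
    that MWE's translation as a substring."""
    out = []
    for src, tgt, scores in phrase_table:
        feat = 0.0
        for mwe_src, mwe_tgt in bimwes.items():
            if mwe_src in src and mwe_tgt in tgt:
                feat = 1.0
                break
        out.append((src, tgt, scores + [feat]))
    return out
```

The new feature then receives its own weight during minimum error rate training, like any other log-linear feature.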
3.3 Additional Phrase Table of Bilingual MWEs
Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing automatically extracted bilingual MWEs. As to probability assignment, we simply assign 1 to the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in an input sentence, the decoder searches all candidate translation phrases in both phrase tables. We use "Baseline+NewBP" to denote this method.
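Building the additional table is then a one-liner per MWE pair. The sketch below assumes the common Moses text format `src ||| tgt ||| scores`; the exact field layout varies by Moses version, so treat it as illustrative:

```python
def mwe_phrase_table(bimwes):
    """bimwes: list of (src_mwe, tgt_translation) pairs. Returns lines in
    a Moses-style phrase-table format with all four translation
    probabilities set to 1, as the paper does for simplicity."""
    return ["%s ||| %s ||| 1 1 1 1" % (s, t) for s, t in bimwes]
```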
4 Experiments
4.1 Data
We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.
Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.
                         Chinese     English
Training   Sentences          120,355
           Words       4,688,873   4,737,843
Dev        Sentences            1,000
           Words          31,722      32,390
Test       Sentences            1,000
           Words          41,643      40,551

Table 1: Traditional medicine corpus
6 http://www.speech.sri.com/projects/srilm
Table 2 gives detailed information on the chemical industry domain data. In this experiment, 59,466 bilingual MWEs are extracted.
                         Chinese     English
Training   Sentences          120,856
           Words       4,532,503   4,311,682
Dev        Sentences            1,099
           Words          42,122      40,521
Test       Sentences            1,099
           Words          41,069      39,210

Table 2: Chemical industry corpus
We test translation quality on the test set, using the open source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform statistical significance tests between translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.
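For reference, BLEU-4 combines modified n-gram precisions (n = 1..4) with a brevity penalty (Papineni et al., 2002). The sketch below is a simplified single-reference, sentence-level version; the NIST script scores at corpus level, so this is an illustration of the metric, not a reimplementation of the tool.

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """candidate, reference: lists of tokens. Returns the geometric mean
    of modified 1..4-gram precisions times the brevity penalty."""
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0.0:
        return 0.0   # corpus-level scoring or smoothing avoids this zeroing
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```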
4.2 MT Systems
We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; and (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus, and as a result a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced to each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.
Features are combined in the log-linear model. To obtain the best translation e of the source sentence f, the log-linear model uses the following equation:

    e* = argmax_e p(e|f) = argmax_e Σ_{m=1}^{M} λ_m h_m(e, f)    (1)

in which h_m and λ_m denote the m-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
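Concretely, equation (1) amounts to picking the candidate with the highest weighted feature sum. A minimal sketch (the function and the candidate-to-feature-vector layout are illustrative, not Moses internals):

```python
def best_translation(candidates, weights):
    """candidates: dict mapping a candidate translation e to its feature
    vector h(e, f); weights: the lambda vector of equation (1).
    Returns the candidate maximizing the weighted feature sum."""
    def score(e):
        return sum(lam * h for lam, h in zip(weights, candidates[e]))
    return max(candidates, key=score)
```

Since the features are typically log-probabilities, maximizing this weighted sum is equivalent to maximizing the log-linear model probability p(e|f).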
7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm
4.3 Results
Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain
Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement. And Baseline+BiMWE achieves slightly higher translation quality than the baseline system.
To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined as:

    Score(s, t) = log(LLR(s, t)) × (1 / (√(2π) σ)) × exp(−(x − μ)² / (2σ²))    (2)

Here the mean μ and variance σ² are estimated from the training set. After ranking by score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
From this table we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the two other methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better
Methods            50,000    60,000    70,000
Baseline                    0.2658
Baseline+BiMWE     0.2671    0.2686    0.2715
Baseline+Feat      0.2728    0.2734    0.2712
Baseline+NewBP     0.2662    0.2706    0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain
than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that more bilingual MWEs always mean better phrase-based SMT performance. This conclusion is the same as that of Lambert and Banchs (2005).
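The ranking score of equation (2) is straightforward to compute once μ and σ have been estimated from the training data; a minimal sketch with illustrative names:

```python
import math

def rank_score(llr, x, mu, sigma):
    """Equation (2): Score(s, t) = log(LLR(s, t)) * N(x; mu, sigma^2),
    where llr is the phrase pair's LLR, x is the source/target length
    ratio, and mu, sigma are estimated from the training set."""
    gauss = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return math.log(llr) * gauss
```

Pairs whose length ratio lies far from the typical ratio μ are thus pushed down the ranking even when their LLR is high.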
To verify the assumption that bilingual MWEs improve SMT performance not only on a particular domain, we also perform experiments on the chemical industry domain. Table 5 shows the results. From this table, we can see that the three methods improve translation performance on the chemical industry domain as well as on the traditional medicine domain.
Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain
4.4 Discussion
In order to see in what respects our methods improve translation performance, we manually analyze some test sentences and give examples in this subsection.
(1) For the first example in Table 6, the Chinese phrase meaning "dredging meridians" is aligned to other words and not correctly translated in the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since this bilingual MWE has been added into the training corpus and then aligned by GIZA++.
(2) For the second example in Table 6, the Chinese term for "medicated tea" has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as the translation, while the Baseline+Feat system chooses "medicated tea", because the additional feature gives high probability to the correct translation "medicated tea".

Src:      (Chinese source sentence)
Ref:      the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE:   the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src:      (Chinese source sentence)
Ref:      the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat:    may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples
5 Conclusion and Future Work
This paper presented the LLR-based Hierarchical Reducing Algorithm for automatically extracting bilingual MWEs and investigated the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to represent whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.
The strategies for applying bilingual MWEs in this paper are rough and simple. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of bilingual MWEs and examine their effect on SMT performance. Besides application in phrase-based SMT systems, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.
References
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp
Abstract
This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for the label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as a labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined from the bottom up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
1 Introduction
Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) trained on a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).
In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as a basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.
When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.

In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (return home).

On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano and Hirai (2004) and Sasano and Kurohashi (2008) utilized information of the head of the bunsetsu.¹ In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.

However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.
This paper proposes Japanese bottom-up named

¹ Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.
(1) kikoku-shita   Kazama-san-wa
    return home    Mr. Kazama TOP
    'Mr. Kazama, who returned home'

(2) hatsudou-shita   shinyou-kumiai-kyusai-ginkou-no   setsuritsu-mo
    invoke           credit union relief bank GEN      establishment
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to                 gaikoku-jin-touroku-hou-ihan-no           utagai-de
    private document falsification and foreigner registration law violation GEN  suspicion INS
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences
entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as a labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 describes an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives the experimental results.
2 Related Work
In Japanese Named Entity Recognition, the definition of Named Entity in the IREX Workshop (IREX Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.

class          example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21st century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples
NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.

In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, and E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.² The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).

The model for the label estimation is learned based on machine learning. The following features are generally utilized: characters, type of

² Besides these, there are the IOB1 and IOB2 algorithms using only I, O, B, and the IOE1 and IOE2 algorithms using only I, O, E (Kim and Veenstra, 1999).
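The SE labeling scheme described above can be sketched as follows. This is illustrative code, not from the paper; the function names are ours.

```python
# Sketch of the SE labeling scheme: S for a single-morpheme NE, B/I/E for the
# beginning/middle/end of a multi-morpheme NE, O for morphemes outside any NE.
NE_CLASSES = ["PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
              "DATE", "TIME", "MONEY", "PERCENT"]

def label_set():
    """All SE-scheme labels: 8 classes x {S, B, I, E} plus O = 33 labels."""
    return [f"{p}-{c}" for c in NE_CLASSES for p in "SBIE"] + ["O"]

def encode(n_morphemes, spans):
    """spans: list of (start, end_inclusive, ne_class) over morpheme indices."""
    labels = ["O"] * n_morphemes
    for s, e, c in spans:
        if s == e:
            labels[s] = f"S-{c}"           # one-morpheme NE
        else:
            labels[s] = f"B-{c}"           # beginning
            labels[e] = f"E-{c}"           # end
            for i in range(s + 1, e):
                labels[i] = f"I-{c}"       # middle
    return labels

assert len(label_set()) == 33  # 8 * 4 + 1, as in the text
# "Kazama / san / wa": a single-morpheme PERSON followed by two non-NE morphemes
print(encode(3, [(0, 0, "PERSON")]))  # ['S-PERSON', 'O', 'O']
```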
Habu                      PERSON 0.111
Habu-Yoshiharu            PERSON 0.438
Habu-Yoshiharu-Meijin     ORGANIZATION 0.083
Yoshiharu                 MONEY 0.075
Yoshiharu-Meijin          OTHERe 0.092
Meijin                    OTHERe 0.245

(a) initial state

Habu                      PERSON 0.111
Habu-Yoshiharu            PERSON 0.438
Habu-Yoshiharu + Meijin   PSN+OTHERe 0.438+0.245  (final output)
Yoshiharu                 MONEY 0.075
Yoshiharu + Meijin        MNY+OTHERe 0.075+0.245
Meijin                    OTHERe 0.245

(b) final output

Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin")
character, POS, etc. of the morpheme and the two surrounding morphemes. Methods utilizing SVMs or CRFs have been proposed.
Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano and Hirai (2004) utilized as features the word sub class of the NE in the analyzing direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analyzing direction, and the head of each bunsetsu. Sasano and Kurohashi (2008) utilized a cache feature, coreference results, a syntactic feature, and a caseframe feature as structural features.
Some work acquired knowledge from a large unannotated corpus and applied it to NER. Kazama and Torisawa (2008) utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of pairs of dependency relations. Fukushima et al. (2008) acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus.
In Japanese NER research, CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.
3 Overview of Proposed Method
Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu.

Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, as in the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the label assignment "Habu-Yoshiharu" (PERSON) and the label assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.
The learned models are the following:

• the model that estimates the label of a chunk (the label estimation model)

• the model that compares two label assignments (the label comparison model)

The two models are described in detail in the next section.
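The CKY-style bottom-up pass can be sketched as follows. Note one simplification: the paper chooses between competing assignments with a learned comparison model, whereas this sketch simply compares summed chunk scores; the function names are ours.

```python
# Hypothetical sketch of the bottom-up dynamic program over chunk cells.
# Assumption: assignments are compared by summed chunk scores, standing in
# for the paper's learned label comparison model.
def best_assignment(morphemes, chunk_scores):
    """chunk_scores[(i, j)] -> (label, score) for the chunk morphemes[i..j]."""
    n = len(morphemes)
    best = {}  # (i, j) -> (score, [(i, j, label), ...])
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            label, score = chunk_scores[(i, j)]
            best[(i, j)] = (score, [(i, j, label)])    # whole cell as one chunk
            for k in range(i, j):                      # or split into two parts
                s = best[(i, k)][0] + best[(k + 1, j)][0]
                if s > best[(i, j)][0]:
                    best[(i, j)] = (s, best[(i, k)][1] + best[(k + 1, j)][1])
    return best[(0, n - 1)]

# Scores from Figure 2 (decimal points restored); morphemes are 0-indexed
scores = {
    (0, 0): ("PERSON", 0.111), (0, 1): ("PERSON", 0.438),
    (0, 2): ("ORGANIZATION", 0.083), (1, 1): ("MONEY", 0.075),
    (1, 2): ("OTHERe", 0.092), (2, 2): ("OTHERe", 0.245),
}
score, chunks = best_assignment(["Habu", "Yoshiharu", "Meijin"], scores)
print(chunks)  # [(0, 1, 'PERSON'), (2, 2, 'OTHERe')]
```

Under these toy scores the procedure reproduces the final output of Figure 2: "Habu-Yoshiharu" (PERSON) plus "Meijin" (OTHERe).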
[Figure: the chunks of "Habu Yoshiharu Meijin (ga)"; "Habu-Yoshiharu" is labeled PERSON, "Meijin" is labeled OTHERe, and all other chunks are labeled invalid.]

Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"
4 Model Learning
4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in CRL NE data. Exceptionally, the following expressions located in multiple bunsetsus tend to be an NE:

• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))

• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))

Hereafter, the bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.³

For each bunsetsu, head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.

Next, for learning the label estimation model, all the chunks in a bunsetsu are attached to the correct label from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (as shown in Table 1) and five other classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.

The chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled as OTHERb, OTHERi, or OTHERe, respectively.⁴

³ As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

⁴ Each OTHER label is assigned to the longest chunk that satisfies its condition.
1. number of morphemes in the chunk
2. the position of the chunk in its bunsetsu
3. character type⁵
4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"
5. word class, word sub class, and several features provided by the morphological analyzer JUMAN
6. several features⁶ provided by the parser KNP
7. string of the morpheme in the chunk
8. IPADIC⁷ feature
   - If the string of the chunk is registered in the following categories of IPADIC, this feature fires: "person", "location", "organization", and "general"
9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is a political party)
10. cache feature
   - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used for the feature
11. particles that the bunsetsu includes
12. the morphemes, particles, and head morpheme in the parent bunsetsu
13. the NE/category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
   - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"
14. parenthesis feature
   - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model
A chunk that is neither any of the eight NE classes nor the above four OTHER classes is labeled as invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.

Next, the label estimation model is learned from the data in which the above label set is assigned
⁵ The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

⁶ When a morpheme has an ambiguity, all the corresponding features fire.

⁷ http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja
to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)-(8), three categories according to the position of a morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".

The label estimation model is learned from the pairs of label and features for each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed by the sigmoid function 1/(1 + exp(−βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.

The purpose of setting up the label "invalid" is as follows. In the chunks "Habu" and "Yoshiharu" in Figure 3, the label "invalid" has a relatively high score, so the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. In a chunk where the label invalid has the highest score, the label with the second highest score is adopted.
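The sigmoid transformation and normalization step can be sketched as follows (illustrative code with three labels instead of the full 13; the function name is ours).

```python
import math

def normalized_label_scores(svm_outputs, beta=1.0):
    """Map raw one-vs-rest SVM outputs (one per label) through the sigmoid
    1 / (1 + exp(-beta * x)), then normalize so the values sum to one."""
    sig = {lab: 1.0 / (1.0 + math.exp(-beta * x)) for lab, x in svm_outputs.items()}
    z = sum(sig.values())
    return {lab: v / z for lab, v in sig.items()}

# Toy raw SVM margins for a chunk (hypothetical values)
scores = normalized_label_scores({"PERSON": 1.2, "invalid": -0.3, "OTHERe": -2.0})
assert abs(sum(scores.values()) - 1.0) < 1e-9   # scores form a distribution
assert scores["PERSON"] > scores["invalid"] > scores["OTHERe"]
```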
4.2 Label Comparison Model

This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:

• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY

First, as shown in Figure 4, the two compared sets of chunks are lined up, separated by "vs" (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for the model learning.
Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used
positive
+1  Habu-Yoshiharu vs Habu + Yoshiharu
    PSN               PSN + MNY
+1  Habu-Yoshiharu + Meijin vs Habu + Yoshiharu + Meijin
    PSN + OTHERe               PSN + MONEY + OTHERe

negative
-1  Habu-Yoshiharu-Meijin vs Habu-Yoshiharu + Meijin
    ORG                      PSN + OTHERe

Figure 4: Assignment of positive/negative examples
for the labels estimated by the label estimation model, and the last (13th) dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining part.

Figure 5 illustrates the features for "Habu-Yoshiharu" vs "Habu + Yoshiharu". The label comparison model is learned from such data using an SVM. Note that only the fact that "Habu-Yoshiharu" is a PERSON can be found from the hand-annotated corpus, and thus in the example "Habu-Yoshiharu-Meijin" vs "Habu + Yoshiharu-Meijin", we cannot determine which one is correct. Therefore, such examples cannot be used for the model learning.
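The feature layout just described can be sketched as follows. This is an assumed layout, not the authors' code: the text does not specify which 12 labels fill the first dimensions, so excluding "invalid" below is our assumption, and all names are ours.

```python
# Sketch of the comparison-model feature vector: each chunk contributes
# 13 values (12 label scores + 1 raw SVM score), each set is padded with
# zero vectors up to the 5-chunk maximum, and the two sets are concatenated.
LABELS = 12  # assumption: 8 NE classes + 4 OTHER* labels, excluding "invalid"

def chunk_vector(label_scores, svm_score):
    return list(label_scores) + [svm_score]              # 13 dimensions

def pair_features(first_set, second_set, max_chunks=5):
    """first_set / second_set: lists of (label_scores, svm_score) per chunk."""
    def set_features(chunks):
        vecs = [chunk_vector(ls, s) for ls, s in chunks]
        vecs += [[0.0] * (LABELS + 1)] * (max_chunks - len(chunks))  # zero pad
        return [x for v in vecs for x in v]
    return set_features(first_set) + set_features(second_set)

# "Habu-Yoshiharu" (one chunk) vs "Habu" + "Yoshiharu" (two chunks), toy scores
v = pair_features([([0.1] * 12, 0.4)],
                  [([0.05] * 12, 0.2), ([0.2] * 12, 0.1)])
assert len(v) == 2 * 5 * 13   # 130 dimensions in total
```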
5 Analysis
First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined from the bottom up, as in the CKY parsing algorithm.

For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the
chunk    Habu-Yoshiharu       Habu    Yoshiharu
label    PERSON               PERSON  MONEY
vector   V1,1  0  0  0  0     V2,1  V2,2  0  0  0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs Habu + Yoshiharu"; V1,1, V2,1, V2,2, and 0 are vectors whose dimension is 13)
(a) label assignment for the cell "Habu-Yoshiharu": the cells are A ("Habu", PERSON 0.111), B ("Habu-Yoshiharu", PERSON 0.438), C ("Habu-Yoshiharu-Meijin", ORGANIZATION 0.083), D ("Yoshiharu", MONEY 0.075), E ("Yoshiharu-Meijin", OTHERe 0.092), and F ("Meijin", OTHERe 0.245); the assignment "Habu-Yoshiharu" (B) is compared with "Habu" + "Yoshiharu" (A+D).

(b) label assignment for the cell "Habu-Yoshiharu-Meijin": the assignments "Habu-Yoshiharu-Meijin" (C), "Habu" + "Yoshiharu" + "Meijin" (A+D+F), and "Habu-Yoshiharu" + "Meijin" (B+F) are compared; the cell E now holds "Yoshiharu + Meijin" (MNY+OTHERe, 0.075+0.245).

Figure 6: The label comparison model
label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).

As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.

In the label comparing step, a label assignment in which an OTHER* label follows an OTHER* label (OTHER* - OTHER*) is not allowed, since each OTHER label is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
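The pairwise selection among more than two candidates can be sketched as follows. The comparison function below is a hypothetical placeholder for the learned SVM comparison model, and the tallying scheme (summing signed pairwise scores) is our assumption.

```python
# Sketch: candidates for a cell are compared pairwise; the candidate with the
# highest accumulated score is adopted. `compare(a, b)` stands in for the
# learned comparison model and returns a positive score if `a` (the first
# set) is judged better than `b`.
def choose_assignment(candidates, compare):
    totals = [0.0] * len(candidates)
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if i < j:
                s = compare(a, b)
                totals[i] += s      # positive score favors the first set
                totals[j] -= s      # and counts against the second set
    best = max(range(len(candidates)), key=lambda k: totals[k])
    return candidates[best]

# Toy stand-in model: prefer assignments with fewer chunks
best = choose_assignment([["A", "D", "F"], ["B", "F"], ["C"]],
                         lambda a, b: len(b) - len(a))
print(best)  # ['C']
```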
6 Experiment
To demonstrate the effectiveness of our proposed method, we conducted an experiment on CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NEs. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were not used for
[Figure: the corpus is divided into five parts (a)-(e); to analyze part (a), the label estimation model 1 is learned from parts (b)-(e), the label estimation models 2b-2e are each learned from three of those parts and applied to the remaining one to obtain features, and the label comparison model is learned from the obtained features.]

Figure 7: 5-fold cross validation
the model learning⁸ or the evaluation.

We performed 5-fold cross validation, following previous work. Different from previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)-(e). Then, the label estimation model 2b is learned from parts (c)-(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same applies to parts

⁸ Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.
              Recall              Precision
ORGANIZATION  81.83 (3008/3676)   88.37 (3008/3404)
PERSON        90.05 (3458/3840)   93.87 (3458/3684)
LOCATION      91.38 (4992/5463)   92.44 (4992/5400)
ARTIFACT      46.72 (349/747)     74.89 (349/466)
DATE          93.27 (3327/3567)   93.12 (3327/3573)
TIME          88.25 (443/502)     90.59 (443/489)
MONEY         93.85 (366/390)     97.60 (366/375)
PERCENT       95.33 (469/492)     95.91 (469/489)
ALL-SLOT      87.87               91.79
F-measure     89.79

Table 3: Experimental result
(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.

In this experiment, the Japanese morphological analyzer JUMAN⁹ and the Japanese parser KNP¹⁰ were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.

Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.
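The overall F-measure follows from the ALL-SLOT recall and precision in Table 3 as their harmonic mean:

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    return 2 * recall * precision / (recall + precision)

# ALL-SLOT figures from Table 3
print(round(f_measure(87.87, 91.79), 2))  # 89.79
```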
7 Discussion
7.1 Comparison with Previous Work
Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. (2008) acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus, and Sasano and Kurohashi (2008) utilized the results of coreference resolution as features for the model learning. Therefore, by incorporating such knowledge and/or analysis results into our method, the performance could be further improved.

Compared with Sasano and Kurohashi (2008), our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano and Kurohashi (2008) labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano and Kurohashi (2008) incorrectly labeled "Oushu" as LOCATION although they utilized the information about
⁹ http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
¹⁰ http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION relatively drops. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.

In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano and Kurohashi (2008) labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano and Kurohashi (2008) cannot utilize the information about "hou", which is useful for the label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", the information of "hou" can be utilized.
7.2 Error Analysis
There were some errors in analyzing Katakana alphabet words. In the following example, although the correct analysis is that "Batistuta" is labeled as PERSON, the system labeled it as OTHERs.
(4) Italy-de    katsuyaku-suru   Batistuta-wo    kuwaeta   Argentine
    Italy LOC   active           Batistuta ACC   call      Argentine
    'Argentine called Batistuta, who was active in Italy'
There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".

There were some errors in applying the label comparison model even though the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis is that "HongKong-seityou" is labeled as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning to adjust the value of β in the sigmoid function and to refine the
                                 F1     analysis unit   distinctive features
(Fukushima et al., 2008)         89.29  character       Web
(Kazama and Torisawa, 2008)      88.93  character       Wikipedia, Web
(Sasano and Kurohashi, 2008)     89.40  morpheme        structural information
(Nakano and Hirai, 2004)         89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)   87.21  character
(Isozaki and Kazawa, 2003)       86.77  morpheme
proposed method                  89.79  compound noun   Wikipedia, structural information

Table 4: Comparison with previous work (All work was evaluated on CRL NE data using cross validation)
(a) initial state:
HongKong            LOCATION 0.266
HongKong-seityou    ORGANIZATION 0.205
seityou             OTHERe 0.184

(b) final output:
HongKong            LOCATION 0.266
HongKong + seityou  LOC+OTHERe 0.266+0.184
seityou             OTHERe 0.184

Figure 8: An example of the error in the label comparison model
features for the label comparison model.
8 Conclusion
This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. This method first estimates the labels of all the chunks in a bunsetsu using a machine learning method, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79.

We are planning to integrate this method with the syntactic and case analysis method (Kawahara and Kurohashi, 2007) and to perform syntactic, case, and Named Entity analysis simultaneously to improve the overall accuracy.
References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970-979. (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304-311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407-415.

Erik F. Tjong Kim and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173-179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. pages 1121-1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference (ICML '01), pages 282-289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8-15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934-941. (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607-612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171-178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Abbreviation Generation for Japanese Multi-Word Expressions

Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

[The body of this paper did not survive text extraction; only fragments remain, including Japanese abbreviation examples such as ファミリーレストラン → ファミレス and 朝はビタミン → 朝ビタ.]
Author Index

Bhutada, Pravin, 17
Cao, Jie, 47
Caseli, Helena, 1
Diab, Mona, 17
Finatto, Maria José, 1
Fujii, Hiroko, 63
Fukui, Mika, 63
Funayama, Hirotaka, 55
Hoang, Hung Huu, 31
Huang, Yun, 47
Kan, Min-Yen, 9, 31
Kim, Su Nam, 9, 31
Kuhn, Jonas, 23
Kurohashi, Sadao, 55
Liu, Qun, 47
Lü, Yajuan, 47
Machado, André, 1
Ren, Zhixiang, 47
Shibata, Tomohide, 55
Sinha, R. Mahesh K., 40
Sumita, Kazuo, 63
Suzuki, Masaru, 63
Villavicencio, Aline, 1
Wakaki, Hiromi, 63
Zarrieß, Sina, 23
Workshop Program
Friday, August 6, 2009

8:30–8:45 Welcome and Introduction to the Workshop

Session 1 (08:45–10:00): MWE Identification and Disambiguation

08:45–09:10 Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena Caseli, Aline Villavicencio, André Machado and Maria José Finatto

09:10–09:35 Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim and Min-Yen Kan

09:35–10:00 Verb Noun Construction MWE Token Classification
Mona Diab and Pravin Bhutada

10:00–10:30 BREAK

Session 2 (10:30–12:10): Identification, Interpretation and Disambiguation

10:30–10:55 Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß and Jonas Kuhn

10:55–11:20 A re-examination of lexical association measures
Hung Huu Hoang, Su Nam Kim and Min-Yen Kan

11:20–11:45 Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

11:45–13:50 LUNCH

Session 3 (13:50–15:30): Applications

13:50–14:15 Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu and Yun Huang

14:15–14:40 Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata and Sadao Kurohashi

14:40–15:05 Abbreviation Generation for Japanese Multi-Word Expressions
Hiromi Wakaki, Hiroko Fujii, Masaru Suzuki, Mika Fukui and Kazuo Sumita

15:05–15:30 Discussion of Sessions 1, 2, 3 (Creating an Agenda for the general discussion)

15:30–16:00 BREAK

16:00–17:00 General Discussion

17:00–17:15 Closing Remarks
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 1–8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains

Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, André Machado♣, Maria José Finatto♥

♦Department of Computer Science, Federal University of São Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br
Abstract
Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.
1 Introduction
A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine) and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21% of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).
Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify the MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.
In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.
The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.
2 Related Work
The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g. to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.
A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply and the sources of information they use. Although some work on MWEs is type independent (e.g. Zhang et al. (2006); Villavicencio et al. (2007)), given the heterogeneity of MWEs much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g. Pearce (2002) and Baldwin (2005) for English and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries between languages, using information from one language to help deal with MWEs in the other (e.g. Villada Moirón and Tiedemann (2006); Caseli et al. (2009)).
As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kind of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).
Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ2) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ2. In addition, Evert and Krenn (2005) found that for MWE identification the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).
Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).
Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods; this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.
Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.
3 The Corpus and Reference Lists
The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.¹ The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of Pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students and resulted in 2,407 terms, with 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
4 Statistically-Driven and Alignment-Based methods
4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language and type independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ2, log-likelihood (Press et al., 1992) and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed
¹Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim
to differentiate MWEs from non-MWEs, but the same was not true of χ2 (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).

From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the four measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
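The scoring step can be sketched as follows. This is not the authors' implementation (they use the Ngram Statistics Package); it is a minimal illustration of PMI ranking of bigram candidates with a frequency cut-off, where the toy corpus and the function name `pmi_rank` are our own.

```python
import math
from collections import Counter

def pmi_rank(tokens, min_freq=2):
    """Rank bigram candidates by Pointwise Mutual Information,
    keeping only bigrams at or above a frequency cut-off."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), f in bigrams.items():
        if f < min_freq:  # frequency threshold, as in the paper
            continue
        # PMI = log2( P(w1 w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(
            (f / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: the candidate that always co-occurs ranks first,
# while bigrams involving the frequent function word score low.
ranked = pmi_rank("aleitamento materno de de de de aleitamento materno de de".split())
```

In a real setting the rankings produced by PMI, MI and the other measures would each be computed this way and then averaged into the combined score.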
4.2 Alignment-Based method
The second of the MWE extraction approaches to be investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, a text written in one (source) language and its translation in another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s₁ … sₙ, with n ≥ 2) and a target word sequence T (T = t₁ … tₘ, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that (a) S and T share some semantic features, and (b) S may be an MWE.
In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
Thus, notice that the alignment-based MWE extraction method does not rely on the conceptual asymmetries between languages, since it does not expect that a source sequence of words be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.
It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++² (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained on the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines in some way the phrases of SMT systems.
In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as a gold standard.
To perform the extraction of MWE candidates following the alignment-based approach, first the original corpus had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps were used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium³ (Armentano-Oller et al., 2006).
²GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html
³Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org
From the preprocessed corpus, the MWE candidates are extracted as those in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.

The second one (F2) is the same used in Caseli et al. (2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of), and (b) a minimum frequency threshold of 2.

And the third one (F3) is the same as Caseli et al. (2009) plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral, and (b) a minimum frequency threshold of 2.
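The core extraction step, before the POS-pattern and frequency filters, can be sketched as follows. The representation of alignments as (source index, target index) links and the function name are our assumptions; the actual pipeline works on GIZA++ output over the POS-tagged corpus.

```python
from collections import defaultdict

def extract_mwe_candidates(source_tokens, alignment_links):
    """Sketch: treat as an MWE candidate every run of >= 2 consecutive
    source words whose positions are all linked to the same target unit
    (an n:m alignment with n >= 2, m >= 1).
    alignment_links: iterable of (source_index, target_index) pairs."""
    targets_of = defaultdict(set)  # source position -> aligned target positions
    for s, t in alignment_links:
        targets_of[s].add(t)
    candidates = []
    i = 0
    while i < len(source_tokens):
        j = i + 1
        # extend the run while the next source word shares the same target unit
        while (j < len(source_tokens)
               and targets_of[j] and targets_of[j] == targets_of[i]):
            j += 1
        if j - i >= 2 and targets_of[i]:
            candidates.append(" ".join(source_tokens[i:j]))
        i = j
    return candidates

# "aleitamento" and "materno" are both linked to target word 1
# ("breastfeeding"), so they form one candidate (a 2:1 alignment).
cands = extract_mwe_candidates(
    ["o", "aleitamento", "materno", "protege"],
    {(0, 0), (1, 1), (2, 1), (3, 2)})
```

The F1/F2/F3 filters described above would then discard candidates matching the forbidden POS/word patterns or falling below the frequency threshold.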
5 Experiments and Results
Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as for instance jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).
PMI                          | alignment-based
-----------------------------|---------------------
Online Mendelian Inheritance | faixa etária
Beta Technology Incorporated | aleitamento materno
Lange Beta Technology        | estados unidos
Óxido Nítrico Inalatório     | hipertensão arterial
jogar video game             | leite materno
e um de                      | couro cabeludo
e a do                       | bloqueio lactíferos
se que de                    | emocional anatomia
e a da                       | neonato a termo
e de não                     | duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach

In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 × precision × recall) / (precision + recall)) figures for the association measures, using all the candidates (on the first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.

pt MWE candidates   | PMI    | MI
--------------------|--------|-------
# proposed bigrams  | 64,839 | 64,839
# correct MWEs      | 1,403  | 1,403
precision           | 2.16%  | 2.16%
recall              | 98.73% | 98.73%
F                   | 4.23%  | 4.23%
# proposed trigrams | 54,548 | 54,548
# correct MWEs      | 701    | 701
precision           | 1.29%  | 1.29%
recall              | 96.03% | 96.03%
F                   | 2.55%  | 2.55%
# proposed bigrams  | 1,421  | 1,421
# correct MWEs      | 155    | 261
precision           | 10.91% | 18.37%
recall              | 10.91% | 18.37%
F                   | 10.91% | 18.37%
# proposed trigrams | 730    | 730
# correct MWEs      | 44     | 20
precision           | 6.03%  | 2.74%
recall              | 6.03%  | 2.74%
F                   | 6.03%  | 2.74%

Table 2: Evaluation of MWE candidates (PMI and MI)
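The evaluation figures above follow the standard definitions; a minimal sketch (the function name and toy candidate/gold lists are ours, not the paper's data):

```python
def evaluate(candidates, gold):
    """Precision, recall and F-measure of an extracted candidate list
    against a gold-standard reference list (here, glossary entries)."""
    candidates, gold = set(candidates), set(gold)
    correct = len(candidates & gold)
    precision = correct / len(candidates) if candidates else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Two of three proposed candidates appear in a four-entry gold list.
p, r, f = evaluate(
    ["aleitamento materno", "faixa etaria", "jogar video game"],
    ["aleitamento materno", "faixa etaria", "leite materno",
     "hipertensao arterial"])
```

Applied to the real lists, this yields the percentages reported in Tables 2 and 4.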
On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered following the three filters described in Section 4.2: F1, F2 and F3.
To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic comparison we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

pt MWE candidates          | F1     | F2     | F3
---------------------------|--------|--------|-------
# filtered by POS patterns | 24,996 | 21,544 | 32,644
# filtered by frequency    | 9,012  | 11,855 | 1,442
# final set                | 269    | 878    | 191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates      | F1     | F2     | F3
-----------------------|--------|--------|-------
# proposed bigrams     | 250    | 754    | 169
# correct MWEs         | 48     | 95     | 65
precision              | 19.20% | 12.60% | 38.46%
recall                 | 3.38%  | 6.69%  | 4.57%
F                      | 5.75%  | 8.74%  | 8.18%
# proposed trigrams    | 19     | 110    | 20
# correct MWEs         | 1      | 9      | 4
precision              | 5.26%  | 8.18%  | 20.00%
recall                 | 0.14%  | 1.23%  | 0.55%
F                      | 0.27%  | 2.14%  | 1.07%
# proposed bi+trigrams | 269    | 864    | 189
# correct MWEs         | 49     | 104    | 69
precision              | 18.22% | 12.04% | 36.51%
recall                 | 2.28%  | 4.83%  | 3.21%
F                      | 4.05%  | 6.90%  | 5.90%

Table 4: Evaluation of MWE candidates
The different values of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), extracted by the alignment-based method, are not being considered at the moment.
After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists, the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.
From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total amount) were found among the bigrams or trigrams in the Glossary (see Table 4). Then, the remaining 122 candidates (63.9%) were analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name, or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.
The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates a good agreement.
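The agreement score can be reproduced with a small function. This is a generic implementation of Cohen's kappa computed from the two judges' label lists, not the authors' code, and the example labels are invented.

```python
def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who labelled the same items (see Carletta, 1996)."""
    assert len(judge_a) == len(judge_b) and judge_a
    n = len(judge_a)
    # observed agreement: fraction of items labelled identically
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # expected chance agreement from each judge's marginal distribution
    labels = set(judge_a) | set(judge_b)
    expected = sum((judge_a.count(lab) / n) * (judge_b.count(lab) / n)
                   for lab in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical true/false judgments over four candidates:
k = cohen_kappa([True, True, False, False], [True, True, False, True])
```

With 122 candidates and roughly 12% disagreement, this formula yields values in the range the paper reports once the judges' marginals are taken into account.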
In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating it against a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should consider also those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.
6 Conclusions and Future Work
MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.
In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.
Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both or to at least one of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to say that in this manual analysis the human experts classified the MWEs as true independently of them being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.
In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.
Acknowledgments
We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 9-16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
Abstract
We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many existing features are usable in unsupervised extraction.
1 Introduction
Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarization (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).
In the past, various attempts have been made to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Cornacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2, we give a more comprehensive overview of previous attempts.
Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g., Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007), most of them do not effectively deal with the various forms keyphrases take, with the result that some keyphrases are never considered as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.
Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. This leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, the supervised approach involves a large amount of manual labor, reducing its utility for real-world applications. Hence, an unsupervised approach is needed in order to minimize manual effort and to encourage wider use. Attesting the reliability and re-usability of features for the unsupervised approach is therefore a worthy study, in order to establish a tentative guideline for applications.
This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method that improves the coverage of candidates occurring in various forms. Our second contribution re-examines existing keyphrase extraction features
reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.
In the remaining sections, we give an overview of related work in Section 2, our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. Then we evaluate our proposals, discuss outcomes and conclude our work in Sections 7, 8 and 9, respectively.
2 Related Work
The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Cornacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e., the relationship between the document and its keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and to a lesser extent (2) keyphrase cohesion features (i.e., relationships among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e., relationships among the components of a keyphrase) (Park et al., 2004).
The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TFIDF (i.e., term frequency x inverse document frequency) and first occurrence in the document. TFIDF measures document cohesion, and the first occurrence reflects the importance of the abstract or introduction, indicating that keyphrases have locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequences. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Cornacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. The
Textract system (Park et al., 2004) also ranks candidate keyphrases by its judgment of the keyphrases' degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of their words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influence of other documents in the same cluster.
3 Keyphrase Analysis
In previous studies, KEA employed the indexing words as candidates, whereas others such as Park et al. (2004) and Nguyen and Kan (2007) generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating our methods. See Section 6 for details of the data.
Syntactically, keyphrases can be formed by either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g., mobile network, fast computing, partially observable Markov decision process), though the latter are rare. They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.
However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we also found that keyphrases occur as simple conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

Criteria     Rules
Frequency    (Rule1) Frequency heuristic, i.e., frequency >= 2 for simplex words vs. frequency >= 1 for NPs
Length       (Rule2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
             (e.g., synchronous concurrent program vs. model of multiagent interaction)
Alternation  (Rule3) of-PP form alternation
             (e.g., number of sensor = sensor number, history of past encounter = past encounter history)
             (Rule4) Possessive alternation
             (e.g., agent's goal = goal of agent, security's value = value of security)
Extraction   (Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
             (e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
             (Rule6) Simplex Word/NP IN Simplex Word/NP
             (e.g., quality of service, sensitivity of VOIP traffic (VOIP traffic extracted),
             simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules
A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word order alternation. Previous studies have used the altered keyphrases when they form an of-PP. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).
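The two alternations described above (Rules 3-4 of Table 1) can be sketched with simple string rewriting. This is an illustrative sketch over lemmatized phrases, not the authors' implementation, and it handles only the plain "X of Y" and "X's Y" shapes:

```python
import re

def alternations(phrase):
    """Generate of-PP and possessive word-order alternations of a phrase."""
    variants = {phrase}
    m = re.match(r"^(.+?) of (.+)$", phrase)
    if m:
        # of-PP alternation: "number of sensor" -> "sensor number"
        variants.add(f"{m.group(2)} {m.group(1)}")
    m = re.match(r"^(.+?)'s (.+)$", phrase)
    if m:
        # possessive alternation: "agent's goal" -> "goal of agent"
        variants.add(f"{m.group(2)} of {m.group(1)}")
    return variants
```

Treating a candidate and its alternations as one equivalence class is what lets extraction match keyphrases regardless of the surface order an author happened to use.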
4 Candidate Selection
We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is closely related to term extraction, since the top-N ranked terms become the keyphrases of a document. In previous studies, KEA employed the indexing words as candidates, whereas others such as Park et al. (2004) and Nguyen and Kan (2007) generated handcrafted regular expression rules; however, none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe our keyphrase analysis in detail.
In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and require more complexity.
Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic. Our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.
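The frequency and length heuristics just described (Rules 1-2 of Table 1) amount to a small filter. The sketch below is an illustrative reading of those thresholds, with function and parameter names of our own choosing:

```python
def passes_heuristics(candidate, freq, is_of_pp=False):
    """Rules 1-2 (sketch): frequency and length thresholds on a candidate.

    Simplex words must occur at least twice; NPs may occur once.
    Candidates may be up to 3 words long, or 4 words for of-PP NPs,
    whose non-genitive alternation reduces them to length 3.
    """
    n = len(candidate.split())
    min_freq = 2 if n == 1 else 1   # Rule 1: stricter threshold for simplex words
    max_len = 4 if is_of_pp else 3  # Rule 2: of-PP NPs may reach length 4
    return freq >= min_freq and n <= max_len
```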
Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule6 for NPs in of-PP form broadens the coverage of
possible candidates: i.e., given an NP in of-PP form, we not only collect its simplex word(s), but also extract the non-of-PP form of the NP from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the previous works' rules do not.
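The extraction rules above can be sketched as a small chunker over POS-tagged tokens. This is a simplified illustration of Rules 5-6 under the Penn Treebank tag set, not the authors' exact rules; in particular, Rule 6 is approximated here by joining the NPs immediately around a preposition:

```python
NOUN = {"NN", "NNS", "NNP", "NNPS"}
ADJ = {"JJ", "JJR", "JJS"}

def base_nps(tagged):
    """Rule 5: maximal runs of adjectives/nouns that end in a noun."""
    nps, run = [], []
    for tok, tag in tagged + [("", "END")]:
        if tag in NOUN | ADJ:
            run.append((tok, tag))
        else:
            while run and run[-1][1] not in NOUN:  # an NP must end in a noun
                run.pop()
            if run:
                nps.append(" ".join(t for t, _ in run))
            run = []
    return nps

def candidates(tagged):
    """Rules 5-6 (sketch): base NPs plus 'NP IN NP' forms such as
    'effective algorithm of grid computing', together with both parts."""
    cands = list(base_nps(tagged))
    for i, (tok, tag) in enumerate(tagged):
        if tag == "IN":  # join the NP before and after the preposition
            left = base_nps(tagged[:i])
            right = base_nps(tagged[i + 1:])
            if left and right:
                cands.append(f"{left[-1]} {tok} {right[0]}")
    return cands
```

On the tagged phrase "effective/JJ algorithm/NN of/IN grid/NN computing/NN", this yields effective algorithm, grid computing, and effective algorithm of grid computing, mirroring the example in the text.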
5 Feature Engineering
With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls on careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S, U).
5.1 Document Cohesion
Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality: the intuition behind them is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.
F1: TFIDF (S, U) TFIDF indicates document cohesion by looking at the frequency of terms in documents and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is that it requires a large corpus to compute useful IDF values. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF over substrings as well as exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate type, i.e., we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system:
- (F1a) TFIDF
- (F1b) TF including counts of substrings
- (F1c) TF of substrings as a separate feature
- (F1d) TF normalized by candidate type (i.e., simplex words vs. NPs)
- (F1e) TF normalized by candidate type, as a separate feature
- (F1f) IDF using Google n-grams
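The substring-aware TF (F1b) and n-gram-based IDF (F1f) can be sketched as follows. This is an illustrative reading of the feature descriptions, not the authors' code; in particular, the normalizing constant in the IDF is an assumption, not a value from the paper:

```python
import math
from collections import Counter

def tf_with_substrings(candidate, all_candidates):
    """F1b (sketch): count a candidate both on exact matches and when it
    appears as a substring of longer candidates, e.g. 'grid computing'
    inside 'grid computing algorithm'."""
    counts = Counter(all_candidates)
    exact = counts[candidate]
    substr = sum(n for c, n in counts.items()
                 if c != candidate and candidate in c)
    return exact + substr

def idf_from_ngram_counts(candidate, ngram_count, total=1e12):
    """F1f (sketch): corpus-independent IDF from web-scale n-gram counts.
    'total' is a hypothetical normalizer standing in for the n-gram
    collection size."""
    return math.log(total / (1 + ngram_count.get(candidate, 0)))
```

Because the IDF side needs no document collection of its own, this variant is usable in the unsupervised setting, as the text notes.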
F2: First Occurrence (S, U) KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.
F3: Section Information (S, U) Nguyen and Kan (2007) used the identity of the specific document section a candidate occurs in. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title and/or references.
F4: Additional Section Information (S, U) We first added the related work or previous work section to the section information, which was not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used substrings occurring in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction
contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.
- (F4a) section 'related/previous work'
- (F4b) counting substrings occurring in key sections
- (F4c) section TF across all key sections
- (F4d) weighting key sections according to the proportion of keyphrases found
F5: Last Occurrence (S, U) Similar to the distance feature in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of the document, such as the conclusion and discussion.
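The two locality features F2 and F5 reduce to the normalized positions of a candidate's first and last occurrence. A minimal sketch, with illustrative names and exact token matching assumed:

```python
def occurrence_features(candidate, tokens):
    """F2/F5 (sketch): first and last occurrence positions of a candidate,
    normalized by document length; returns None if it never occurs."""
    words = candidate.split()
    n = len(words)
    positions = [i for i in range(len(tokens) - n + 1)
                 if tokens[i:i + n] == words]
    if not positions:
        return None
    return positions[0] / len(tokens), positions[-1] / len(tokens)
```

A low first-occurrence value points at the abstract or introduction; a high last-occurrence value points at the conclusion or discussion.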
5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.
F6: Co-occurrence of Another Candidate in a Section (S, U) When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.
F7: Title Overlap (S) In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior on which words can stand as constituent words of keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.
- (F7a) co-occurrence (Boolean) in the title collection
- (F7b) co-occurrence (TF) in the title collection
F8: Keyphrase Cohesion (S, U) Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top-N ranked candidates and the remainder. In the
1 It contains 1.3M titles from articles, papers and reports.
original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.
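The contextual-similarity approximation above can be sketched as neighboring-word count vectors compared with the cosine measure. This sketch skips the paper's POS filtering of neighbors and uses an assumed window size:

```python
import math
from collections import Counter

def context_vector(word, tokens, window=5):
    """Counts of words appearing within +/- window of each occurrence of
    'word' (one row of the simulated word-by-context matrix)."""
    vec = Counter()
    for i, t in enumerate(tokens):
        if t == word:
            lo, hi = max(0, i - window), i + window + 1
            vec.update(w for w in tokens[lo:hi] if w != word)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Candidate words sharing many neighbors, and hence likely the same topic, receive a high cohesion score.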
5.3 Term Cohesion
Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).
F9: Term Cohesion (S, U) Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier:
- (F9a) term cohesion as in Park et al. (2004)
- (F9b) TF normalized by candidate type (i.e., simplex words vs. NPs)
- (F9c) applying different weights by candidate type
- (F9d) normalized TF and different weighting by candidate type
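The Dice-based cohesion (F9a) can be sketched as below. The Textract formula may differ in detail; this uses the classic Dice coefficient generalized to n component words, and the simplex-word discount factor is our reading of the "10%" in the text, labeled as an assumption:

```python
def dice(phrase_freq, word_freqs):
    """Generalized Dice coefficient: n * f(w1..wn) / (f(w1) + ... + f(wn))."""
    denom = sum(word_freqs)
    return len(word_freqs) * phrase_freq / denom if denom else 0.0

def term_cohesion(phrase_freq, word_freqs, simplex_discount=0.1):
    """F9a (sketch): Dice cohesion with simplex words discounted.
    The 0.1 factor is an assumed interpretation of Park et al.'s
    10% discount, not a value confirmed by the paper."""
    score = dice(phrase_freq, word_freqs)
    return simplex_discount * score if len(word_freqs) == 1 else score
```

For a bigram this reduces to the familiar 2 f(xy) / (f(x) + f(y)); components that rarely occur apart from each other score near 1.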
5.4 Other Features
F10: Acronym (S) Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K, to attest our candidate selection method.
F11: POS Sequence (S) Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like the acronym feature, this is also subject to the data set.
F12: Suffix Sequence (S) Similar to the acronym feature, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latinate derivational morphology for technical keyphrases. This feature is also data-dependent and is thus used in the supervised approach only.
F13: Length of Keyphrases (S, U) Barker and Cornacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.
6 System and Data
To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections such as title and section heads via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4 and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.
For evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found many were missing or present only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking on average about 15 minutes per paper. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the
2 http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3 http://wing.comp.nus.edu.sg/parsCit/
4 http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5 http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6 http://maxent.sourceforge.net/index.html
7 C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence - Multiagent Systems) and J.4 (Social and Behavioral Sciences - Economics)
number of author-assigned and reader-assigned keyphrases actually found in the documents.
         Author       Reader         Combined
Total    1252 (53)    3110 (111)     3816 (146)
NPs      904          2537           3027
Average  3.85 (4.01)  12.44 (12.88)  15.26 (15.85)
Found    769          2509           2864

Table 2: Statistics of Keyphrases
7 Evaluation
The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information and additional section information (i.e., F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence and suffix sequence.
Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10 and 15 candidates.
BestFeatures includes F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in a section), F7b (title overlap), F9a (term cohesion as in Park et al. (2004)) and F13 (length of keyphrases). Best-TFIDF means using all best features except TFIDF.
In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).8
In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and features listed under neither sign had no effect or an unconfirmed effect due to only small changes in performance. Again, Supervised denotes Maximum Entropy training and Unsupervised our unsupervised approach.
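The exact-matching evaluation used in Tables 3 and 4 reduces to counting top-N candidates that exactly match a gold keyphrase. A minimal sketch of the metrics (illustrative names; matching here is plain string equality, ignoring stemming or alternations):

```python
def evaluate_topn(ranked, gold, n):
    """Exact-match evaluation over the top-n ranked candidates.

    Returns (matched, precision, recall, f_score), where precision is
    matched/n and recall is matched over the total gold keyphrases.
    """
    matched = sum(1 for c in ranked[:n] if c in gold)
    precision = matched / n
    recall = matched / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return matched, precision, recall, f_score
```

Note the consistency this implies in Table 3: each Match value divided by the cutoff (5, 10 or 15) reproduces the corresponding precision percentage.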
8 Discussion
We compared the performance of our candidate selection and feature engineering against simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer
8 Due to the page limits, we present the best performance.
Method         Features  C   --------- Five ---------   --------- Ten ----------   ------- Fifteen --------
                             Match Prec  Recall Fscore  Match Prec  Recall Fscore  Match Prec  Recall Fscore
All            KEA       U   0.03  0.64  0.21   0.32    0.09  0.92  0.60   0.73    0.13  0.88  0.86   0.87
Candidates               S   0.79  15.84 5.19   7.82    1.39  13.88 9.09   10.99   1.84  12.24 12.03  12.13
               N&K       S   1.32  26.48 8.67   13.06   2.04  20.36 13.34  16.12   2.54  16.93 16.64  16.78
               baseline  U   0.92  18.32 6.00   9.04    1.57  15.68 10.27  12.41   2.20  14.64 14.39  14.51
                         S   1.15  23.04 7.55   11.37   1.90  18.96 12.42  15.01   2.44  16.24 15.96  16.10
Length<=3      KEA       U   0.03  0.64  0.21   0.32    0.09  0.92  0.60   0.73    0.13  0.88  0.86   0.87
Candidates               S   0.81  16.16 5.29   7.97    1.40  14.00 9.17   11.08   1.84  12.24 12.03  12.13
               N&K       S   1.40  27.92 9.15   13.78   2.10  21.04 13.78  16.65   2.62  17.49 17.19  17.34
               baseline  U   0.92  18.40 6.03   9.08    1.58  15.76 10.32  12.47   2.20  14.64 14.39  14.51
                         S   1.18  23.68 7.76   11.69   1.90  19.00 12.45  15.04   2.40  16.00 15.72  15.86
Length<=3      KEA       U   0.01  0.24  0.08   0.12    0.05  0.52  0.34   0.41    0.07  0.48  0.47   0.47
Candidates               S   0.83  16.64 5.45   8.21    1.42  14.24 9.33   11.27   1.87  12.45 12.24  12.34
+ Alternation  N&K       S   1.53  30.64 10.04  15.12   2.31  23.08 15.12  18.27   2.88  19.20 18.87  19.03
               baseline  U   0.98  19.68 6.45   9.72    1.72  17.24 11.29  13.64   2.37  15.79 15.51  15.65
                         S   1.33  26.56 8.70   13.11   2.09  20.88 13.68  16.53   2.69  17.92 17.61  17.76

Table 3: Performance on Proposed Candidate Selection
Features   C   -------- Five ---------   --------- Ten ---------   ------- Fifteen -------
               Match Prec  Recall Fscore  Match Prec  Recall Fscore  Match Prec  Recall Fscore
Best       U   1.14  22.8  7.47   11.3    1.92  19.2  12.6   15.2    2.61  17.4  17.1   17.3
           S   1.56  31.2  10.2   15.4    2.50  25.0  16.4   19.8    3.15  21.0  20.6   20.8
Best       U   1.14  22.8  7.4    11.3    1.92  19.2  12.6   15.2    2.61  17.4  17.1   17.3
w/o TFIDF  S   1.56  31.1  10.2   15.4    2.46  24.6  16.1   19.4    3.12  20.8  20.4   20.6

Table 4: Performance on Feature Engineering
Effect  Method  Features
+       S       F1a, F2, F3, F4a, F4d, F9a
+       U       F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       S       F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
-       U       F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
~       S       F1e, F10, F11, F12
~       U       F1b

Table 5: Performance on Each Feature
length candidates tend to act as noise and thus decreased the overall performance. We also confirmed that candidate alternation offered flexibility in keyphrases, leading to higher candidate coverage as well as better performance.
To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that the performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF was minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not impact performance. We surmise the cause is that many keyphrases are substrings of candidates and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence are helpful, as they imply the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.
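The substring-based TF counting discussed above can be sketched as follows. String-level matching naturally credits a candidate phrase when it occurs inside a longer candidate, which is the intuition behind feature F1c; the candidate names below are illustrative and the exact counting scheme of the paper is not reproduced here.

```python
from collections import Counter

def substring_tf(candidates, text):
    """Term frequency where a candidate phrase is also credited when it
    occurs inside a longer span (string-level counting) -- a sketch of
    the F1c substring-counting idea, not the authors' implementation."""
    text = text.lower()
    return Counter({c: text.count(c.lower()) for c in candidates})

tf = substring_tf(
    ["language model", "statistical language model"],
    "A statistical language model is a language model.",
)
# 'language model' is credited twice: once on its own and once inside
# 'statistical language model'
```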
As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.
9 Conclusion
We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous work while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.
10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.
References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10–17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information, and lexicography. In Proceedings of ACL, 1989, pp. 76–83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28–30.

Ernesto D'Avanzo and Bernardo Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433–447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945, 2.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668–673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81–104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beáta Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537–544.

Mario Jarmasz and Caroline Barrière. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349–357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157–169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296–297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. NLE, 2001, 7(3), pp. 207–223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317–326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48–55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä, and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434–439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254–256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17–22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Verb Noun Construction MWE Token Supervised Classification
Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu
Abstract
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification, combining different linguistically motivated features; the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.
1 Introduction
In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) applications that aim at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.
For our purposes, a MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than or different from the meaning of its component words. An MWE meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as being a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket; however, make a decision is very clearly related to to decide. Both of these expressions are considered MWEs but have varying degrees of compositionality and predictability. Both of these expressions belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.
To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning and over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as it could potentially have detrimental effects on the quality of the translation.
In this paper we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.
This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.
2 Multi-word Expressions
MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional, fixed expressions such as kingdom come, ad hoc, and non-fixed expressions such as break new ground, speak of the devil. The VNCs which we are focusing on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living and Verb Particle Constructions (VPC) such as write-up, call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.
3 Related Work
Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification, irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).
In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA to determine if the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one MWE idiomatic expression.
Hashimoto and Kawahara (2008) (HK08) is the first large scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines based on TinySVM with a quadratic kernel. They annotate a web based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment with 90 idiom types only, for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.
4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying a MWE expression token in context. We specifically focus on VNC MWE expressions. We use the annotated data by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for performing the identification of MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system.1 YamCha is based on Support Vector Machines technology using degree 2 polynomial kernels.
We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (Beginning of a literal chunk), I-L (Inside of a literal chunk), B-I (Beginning of an Idiomatic chunk), I-I (Inside an Idiomatic chunk), and O (Outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, lemma form (LEMMA) or the citation form of the word, and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
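The construction of these feature columns is mechanical; a minimal sketch of how the running example's rows might be generated (the function name is ours, not part of the YamCha toolkit):

```python
def to_columns(tokens, pos_tags, chunk_tags, ngram_len=3):
    """Build YamCha-style training rows: token, POS tag, last-3-character
    suffix (the NGRAM feature), and the IOB chunk tag -- a sketch."""
    rows = []
    for tok, pos, tag in zip(tokens, pos_tags, chunk_tags):
        rows.append((tok, pos, tok[-ngram_len:], tag))
    return rows

# The paper's running example:
sent = ["John", "kicked", "the", "bucket", "last", "Friday"]
pos  = ["NN", "VBD", "Det", "NN", "ADV", "NN"]
tags = ["O", "B-I", "I-I", "I-I", "O", "O"]
for row in to_columns(sent, pos, tags):
    print("\t".join(row))  # e.g. "kicked  VBD  ked  B-I"
```

Each additional feature (LEMMA, NE) would simply add one more column per row in the same fashion.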
With the NE feature, we followed the same representation as the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN IdentiFinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full) and the other where we have a binary feature indicating whether a word is a NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class: Named Entity InText (NEI). For example, for the NEI condition, our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV
1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder
ast O, DAY NN day O, where John is replaced by the NE "PER".
5 Experiments and Results
5.1 Data
We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-Token expressions drawn from the entire British National Corpus (BNC).3 The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the Speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4 corresponding to 2020 Idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.
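The 5-fold setup with 80/10/10 splits can be sketched as follows; the rotation scheme below is an assumption for illustration, since the paper does not specify how folds were assigned.

```python
import random

def five_fold_splits(items, seed=0):
    """Yield 5 folds of 80% train / 10% dev / 10% test by rotating a
    shuffled list -- a sketch of the experimental setup, not the
    authors' exact fold assignment."""
    rng = random.Random(seed)
    items = items[:]
    rng.shuffle(items)
    tenth = len(items) // 10
    for k in range(5):
        # rotate by two tenths so each fold gets a fresh dev/test slice
        rot = items[k * 2 * tenth:] + items[:k * 2 * tenth]
        yield rot[:8 * tenth], rot[8 * tenth:9 * tenth], rot[9 * tenth:]

for train, dev, test in five_fold_splits(list(range(2432))):
    pass  # train/evaluate one fold, then average the 5 scores
```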
5.2 Evaluation Metrics
We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
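The metric reduces to a one-line computation; for instance, it reproduces the IDM-F value in Table 2 from the corresponding precision and recall.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: the weighted harmonic mean of precision and recall.
    With beta=1 this is the standard F1 used in the paper."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# IDM precision 91.35% and recall 88.42% give IDM-F of about 89.86%
f_beta(0.9135, 0.8842)
```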
5.3 Results
We present the results for the different feature sets and their combination. We also present results on a simple most frequent tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The baseline is basically tagging all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE
3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression, hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result, as it is not a measure of error. All words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.
tokens that are not in the original data set6
In Table 2 we present the results yielded per feature and per condition. We experimented with different context sizes initially to decide on the optimal window size for our learning framework; the results are presented in Table 1. Once that is determined, we proceed to add features.
Noting that a window size of ±3 yields the best results, we proceed to use that as our context size for the following experimental conditions. We will not include accuracy, since it is above 96% for all our experimental conditions.
All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized words baseline (TOK) with no added features, with a context size of ±3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04%.
Adding lemma information, POS, or NGRAM features all independently contribute to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.
Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.
Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence the significant contribution to performance.
6 Discussion
The overall results strongly suggest that using linguistically interesting features explicitly has a positive impact on performance. NE features help the most, and combining them with other features
6 We could have easily identified all VNC syntactic configurations corresponding to verb object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal score baseline; however, for this investigation we decided to strictly work with the gold standard data set exclusively.
yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially with performance on literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.
We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones 23 times, and the opposite 4 times.
Some of the errors where the VNC should have been classified as literal, but the system classified it as idiomatic, are kick heel, find feet, make top. Cases of idiomatic expressions erroneously classified as literal are for the MWE types hit the road, blow trumpet, blow whistle, hit a wall.
The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE lost his head is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.
Also, in As the ball hit the post, the referee blew the whistle, the system identified blew the whistle as a literal VNC in this context, and it identified hit the post as another literal VNC.
7 Conclusion
In this study we explore a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system. Using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance with an overall F-measure of 84.58%. In the future, we are looking at ways of adding more sophisticated syntactic and semantic features from WSD. Given the fact that we were able to get more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.
       IDM-F   LIT-F   Overall F   Overall Acc
±1     77.93   48.57   71.78       96.22
±2     85.38   55.61   79.71       97.06
±3     86.99   55.68   81.25       96.93
±4     86.22   55.81   80.75       97.06
±5     83.38   50.00   77.63       96.61

Table 1: Results in % for varying context window size
                 IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ             70.02   89.16   78.44   0       0       0       69.68
TOK              81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA          83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM          83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS            83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P            86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full         85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin          84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI              89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P   89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P    90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P        91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results in %, averaged over 5 folds of test data, using different features and their combinations
8 Acknowledgement
The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers who helped in making this publication more concise and better presented.
References

Timothy Baldwin, Collin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens Dataset. In Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLING.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

Dan I. Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23–30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de
Abstract
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.
1 Introduction
Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for the crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:
(1) Der Rat sollte unsere Position berücksichtigen
    The Council should our position consider

(2) The Council should take account of our position
This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that we are able to identify various MWE patterns.
In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.
The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for this item in a target language (and whether these
one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches that need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realization is already very useful.
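The data-driven definition above can be made concrete in a few lines: for every occurrence of a source lemma, collect the set of target tokens that the word alignment links to it, and keep the sets of size greater than one as one-to-many translation candidates. This is an illustrative sketch, not the authors' implementation; the corpus format (token lists plus index-pair alignment links) and all names are assumptions.

```python
# Collect one-to-many translation candidates for a source lemma from
# word-aligned sentence pairs. Each pair is (source_tokens, target_tokens,
# links), where links is a set of (source_index, target_index) alignments.
from collections import Counter

def one_to_many_types(pairs, source_lemma):
    """Count the target-token sets that alignment links to source_lemma."""
    types = Counter()
    for src, tgt, links in pairs:
        for i, tok in enumerate(src):
            if tok != source_lemma:
                continue
            aligned = tuple(sorted(tgt[j] for (k, j) in links if k == i))
            if len(aligned) > 1:          # keep one-to-many cases only
                types[aligned] += 1
    return types

# Toy version of example (1)/(2): berücksichtigen -> take account of
pairs = [
    (["der", "rat", "sollte", "unsere", "position", "berücksichtigen"],
     ["the", "council", "should", "take", "account", "of", "our", "position"],
     {(0, 0), (1, 1), (2, 2), (3, 6), (4, 7), (5, 3), (5, 4), (5, 5)}),
]
print(one_to_many_types(pairs, "berücksichtigen"))
```

On the toy pair, the function returns a single candidate type, the lemma set {take, account, of}, with frequency 1.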
In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, and is based on the intuition that monolingual idiosyncrasies, like the MWE realization of an entity, are not likely to be mirrored in another language (see Section 4 for discussion).
Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.
2 Multiword Translations as MWEs
The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                  1-1   1-n   n-1   n-n   No.
anheben (v1)         53.5  21.2   9.2   1.6   325
bezwecken (v2)       16.7  51.3   0.6  31.3   150
riskieren (v3)       46.7  35.7   0.5   1.7   182
verschlimmern (v4)   30.2  21.5  28.6  44.5   275

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas and their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).
For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.
A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below we briefly illustrate the different classes of one-to-many translations we found in our gold standard.
Morphological variations. This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.
(3) Sie verschlimmern die Übel
    They aggravate the misfortunes

(4) Their action is aggravating the misfortunes
Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.
(5) Der Ausschuss bezweckt den Institutionen ein politisches Instrument an die Hand zu geben
    The committee intends the institutions a political instrument at the hand to give

(6) The committee set out to equip the institutions with a political instrument
Verb preposition combinations. While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.
(7) Sie werden den Treibhauseffekt verschlimmern
    They will the greenhouse effect aggravate

(8) They will add to the greenhouse effect
Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).
(9) Ich werde die Sache nur noch verschlimmern
    I will the thing only just aggravate

(10) I am just making things more difficult
Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to its limited size and the source items we chose.
(11) Sie bezwecken die Umgestaltung in eine zivile Nation
     They intend the conversion into a civil nation

(12) They have in mind the conversion into a civil nation
          v1       v2       v3       v4
Ntype     22 (26)  41 (47)  26 (35)  17 (24)
V Part    22.7      4.9      0.0      0.0
V Prep    36.4     41.5      3.9      5.9
LVC       18.2     29.3     88.5     88.2
Idiom      0.0      2.4      0.0      0.0
Para      36.4     24.3     11.5     23.5

Table 2: Proportions of MWE types per lemma
Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, such transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for the semantic processing of parallel corpora. An example is given below.
(13) Wir brauchen bessere Zusammenarbeit um die Rückzahlungen anzuheben
     We need better cooperation to the repayments(OBJ) increase

(14) We need greater cooperation in this respect to ensure that repayments increase
Table 2 displays the proportions of the MWE categories over the types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variation only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of the general translation categories, here again the different verb lemmas show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) or bezwecken (en. intend) are frequently
translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. The more specific LVC patterns also differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).
In general, one-to-many translational correspondences seem to provide a very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.
3 Multiword Translation Detection
This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.
3.1 One-to-many Alignments
To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the rate of translational correspondences that the system discovers on the type level against the one-to-many translations in our gold standard. By type we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the
verb   n > 0         n > 1         n > 3
       FPR    FNR    FPR    FNR    FPR    FNR
v1     0.97   0.93   1.0    1.0    1.0    1.0
v2     0.93   0.90   0.5    0.96   0.0    0.98
v3     0.88   0.83   0.8    0.97   0.67   0.97
v4     0.98   0.92   0.8    0.92   0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments
translation types is so low that already at a threshold of 3 almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that corresponds to the source token is rare.
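The type-level error rates used here can be stated precisely: a system type and a gold type are each a set of lemmatized English tokens; the FPR is the fraction of system types absent from the gold standard, and the FNR is the fraction of gold types the system never produces. A hypothetical sketch, assuming the type inventories are sets of frozensets:

```python
# Type-level evaluation of one-to-many translation detection: compare the
# set of translation types a system outputs against a gold-standard set.
def type_error_rates(system_types, gold_types):
    """Return (false positive rate, false negative rate) over types."""
    sys_s, gold_s = set(system_types), set(gold_types)
    fpr = len(sys_s - gold_s) / len(sys_s) if sys_s else 0.0
    fnr = len(gold_s - sys_s) / len(gold_s) if gold_s else 0.0
    return fpr, fnr

# Illustrative types: each is the lemmatized bag of aligned English tokens.
gold = {frozenset({"take", "account", "of"}), frozenset({"consider"})}
system = {frozenset({"take", "account", "of"}), frozenset({"take", "of"})}
print(type_error_rates(system, gold))  # (0.5, 0.5)
```

Note that a partial alignment such as {take, of} counts as a false positive even though it overlaps the gold type, which is exactly why GIZA++'s near-miss alignments score so poorly at the type level.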
This low precision of one-to-many alignments is not very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that does not necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.
(15) Die Korruption behindert die Entwicklung
     The corruption impedes the development

(16) Corruption constitutes an obstacle to development
Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations, and in these cases direct lexical correspondences are much more unlikely to be found.
3.2 Aligning Syntactic Configurations
High-precision extraction of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found, and 2) how to align target items that have a low occurrence correlation.
We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel. The verb's arguments directly correspond to each other and are all dominated by a verbal root node.
Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.
Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations, b) filtering of the configurations, and c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.
[Figure 1: Example of a typical syntactic MWE configuration — the German verb behindert with arguments X1 and Y1, aligned to the English tree rooted in create with the dependents an, obstacle, to, X2, and Y2]
Data. We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence pair in our resource can be represented by the triple (DG, DE, AGE): DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel and s1, s2 are words of the source language; DE is the set of dependency triples of the target sentence; AGE is the set of pairs (s1, t1) such that s1 and t1 are aligned.
Candidate Generation. This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.
Filtering. Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:

1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn ∈ DG there is a tuple (sn, tn) ∈ AGE, i.e. every dependent has an alignment.
3. There is a target item t1 ∈ DE such that for each tn there is a p ⊂ DE such that p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus the target dependents have a common root.
To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering does not exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if an item in our target configuration has a reliable alignment with an item not contained in our source configuration, the target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
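A minimal sketch of the filtering conditions, under the (DG, DE, AGE) representation introduced above: every dependent of the source verb must be aligned (condition 2), a coordination relation disqualifies the configuration, and some target item must reach all aligned dependents via dependency paths (condition 3). The head-switching and path-length filters are omitted for brevity, and all function and variable names are illustrative, not the authors' code:

```python
# Filter one (DG, DE, AGE) sentence pair: dependency sets hold
# (head, rel, dependent) triples, AGE holds (source_word, target_word) links.
def find_target_root(source_verb, deps, DE, AGE):
    """Return a target item from which every aligned dependent is reachable."""
    aligned = [t for d in deps for (s, t) in AGE if s == d]
    if len(aligned) != len(deps):        # condition 2: all dependents aligned
        return None
    children = {}
    for head, rel, dep in DE:
        if rel == "COORD":               # coordination filter
            return None
        children.setdefault(head, set()).add(dep)

    def reaches(root, goal, seen=()):    # is there a path root -> goal?
        if root == goal:
            return True
        return any(reaches(c, goal, seen + (root,))
                   for c in children.get(root, ()) if c not in seen)

    for t1 in {h for h, _, _ in DE}:     # condition 3: common root
        if all(reaches(t1, tn) for tn in aligned):
            return t1
    return None

# Toy version of example (15)/(16): behindert -> constitutes an obstacle to
DG = [("behindert", "SB", "Korruption"), ("behindert", "OA", "Entwicklung")]
DE = [("constitutes", "SBJ", "corruption"), ("constitutes", "OBJ", "obstacle"),
      ("obstacle", "NMOD", "an"), ("obstacle", "NMOD", "to"),
      ("to", "PMOD", "development")]
AGE = [("behindert", "obstacle"), ("Korruption", "corruption"),
       ("Entwicklung", "development")]
deps = [dep for (_, _, dep) in DG]       # the source verb's dependents
print(find_target_root("behindert", deps, DE, AGE))  # prints "constitutes"
```

On the toy pair, the verb constitutes is returned as the common target root, even though GIZA++ only aligned behindert to the noun obstacle.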
Alignment Post-editing. In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in 3.2 was that all
the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.
Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) ∈ DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) ∈ AGE. This translation problem largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). We therefore propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti ∈ T is an element of our target configuration:
1. Compute corr(s1, T), the correlation between s1 and T.
2. For each ti, tx such that there is a (ti, rel, tx) ∈ DE, compute corr(s1, T + tx).
3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.
As the Dice coefficient is often reported to give the best results, e.g. in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:
corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))
We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti ∈ T, and freq(s1) accordingly. The joint frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
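The post-editing loop together with the Dice formula can be sketched as follows, treating the corpus as a list of (source-lemma-set, target-lemma-set) sentence pairs. The greedy loop over candidate dependents and all names are illustrative assumptions, not the authors' implementation:

```python
# Dice-based alignment post-editing: greedily add dependents of items
# already in the translation set T when doing so does not lower the Dice
# correlation with the source item s1.
def dice(corpus, s1, T):
    """Dice coefficient between source lemma s1 and target lemma set T."""
    co = sum(1 for src, tgt in corpus if s1 in src and T <= tgt)
    f_s = sum(1 for src, _ in corpus if s1 in src)
    f_T = sum(1 for _, tgt in corpus if T <= tgt)
    return 2 * co / (f_s + f_T) if f_s + f_T else 0.0

def post_edit(corpus, s1, T, candidates):
    """Extend T with candidate dependents that keep the correlation high."""
    for tx in candidates:                 # dependents of items already in T
        if dice(corpus, s1, T | {tx}) >= dice(corpus, s1, T):
            T = T | {tx}
    return T

# Toy corpus: behindert cooccurs with "constitute an obstacle to".
corpus = [({"korruption", "behindert"}, {"constitute", "an", "obstacle", "to"}),
          ({"behindert"}, {"constitute", "an", "obstacle", "to"}),
          ({"fortschritt"}, {"an", "the", "progress"})]
T = post_edit(corpus, "behindert", {"constitute", "obstacle", "to"}, ["an"])
print(sorted(T))  # ['an', 'constitute', 'obstacle', 'to']
```

Here the article an is added to the translation set because it always cooccurs with the rest of the LVC, mirroring the create an obstacle to example above.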
The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation of the English MWE run the risk of.
3.3 Evaluation
Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.
We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.
Translation evaluation. In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma do not figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that does not constrain the syntactic type of the parallel argument relations. While the FNR decreases only slightly, the FPR increases significantly. This means that the strict syntactic filters mainly fire on noisy configurations and hardly decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down
the number of candidate configurations that our algorithm detects. As the dependency parses do not provide deep analyses of tense or control phenomena, very often a verb's arguments do not figure as its syntactic dependents and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.
MWE evaluation. In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. To this end, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondence. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.
The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.
4 Related Work
The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was
      Strict Filter    Relaxed Filter
      FPR    FNR       FPR    FNR
v1    0.0    0.96      0.5    0.96
v2    0.25   0.88      0.47   0.79
v3    0.25   0.74      0.56   0.63
v4    0.0    0.875     0.56   0.84

Table 4: False positive and false negative rate of one-to-many translations
Trans. type       Proportion
MWEs                57.5
  V Part             8.2
  V Prep            51.8
  LVC               32.4
  Idiom             10.6
Paraphrases         24.4
Alternations         1.0
Noise               17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs
established by Wu (1997). Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.
Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches do not rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.
By contrast, previous approaches to paraphrase extraction have made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado, 2009).
The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already
been exploited in e.g. morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of a simplex realization in a source language onto an idiosyncratic multiword realization in the target language.
5 Conclusion
We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and a fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further exploit them for the monolingual interpretation of MWEs.
Our extraction method, which is based on syntactic filters, identifies MWE types with much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.
References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 13:11–326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38/2008, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
A Re-examination of Lexical Association Measures
Abstract
We review the lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, showing significant improvements over the original AMs.
1 Introduction
Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures for extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood ratio and the T score. In Pearce (2002), AMs such as Z score, Pointwise MI, cost reduction, left and right context entropy, and odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests, such as the binomial test and Fisher's exact test, and various coefficients, such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ² and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions and stock phrases. In their work, the best combination of AMs was selected using machine learning.
While these previous works have evaluated AMs, they offer few details on why the AMs perform as they do. A detailed analysis of this question is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address the issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict, a priori, what types of AMs are likely to perform well for a particular MWE class.
We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short) or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the
Hung Huu Hoang
Dept. of Computer Science, National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering, University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science, National University of Singapore
kanmy@comp.nus.edu.sg
mobility of noun phrase complements in transitive VPCs: they can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).
In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.
2 Classification of Association Measures
Although different AMs take different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure the association between constituents, using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score and Pearson's χ² test.
Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents; it then follows that the contexts in which the phrase and its constituents appear will differ (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different from each other. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.
Table 1 lists all AMs used in our discussion. The lower left legend defines the variables a, b, c and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
3 Evaluation
We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.
3.1 Evaluation Data
In this study, we employ the one-million-word Wall Street Journal (WSJ) section of the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus, as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.
For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here, we do not use the WSJ particle tag, to avoid the possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
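As an illustration, the particle-anchored left search can be sketched in a few lines. This is only a sketch with made-up POS tag names and a subset of the particle list, not the authors' implementation:

```python
PARTICLES = {"up", "out", "off", "on", "down"}   # subset of the 38 in Appendix A

def vpc_candidates(tagged, max_np_len=5):
    """tagged: list of (word, POS) pairs.
    For each particle, search leftward for the nearest verb, allowing
    an intervening NP of up to max_np_len words."""
    found = []
    for i, (word, _) in enumerate(tagged):
        if word.lower() not in PARTICLES:
            continue
        for j in range(i - 1, max(-1, i - 2 - max_np_len), -1):
            if tagged[j][1].startswith("V"):     # nearest verb to the left
                found.append((tagged[j][0].lower(), word.lower()))
                break
    return found

sent = [("She", "PRP"), ("took", "VBD"), ("her", "PRP$"),
        ("hat", "NN"), ("off", "RP")]
print(vpc_candidates(sent))   # [('took', 'off')]
```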
Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put and take. Next, we search to the right of the light verb for the nearest noun,
AM Name Formula AM Name Formula M1 Joint Probability ( ) f xy N M2 Mutual Information
1log ˆ
ijij
i j ij
ff
N fsum
M3 Log likelihood ratio
2 log ˆ
ijij
i j ij
ff
fsum
M4 Pointwise MI (PMI) ( )log
( ) ( )
P xy
P x P ylowast lowast
M5 Local-PMI ( ) PMIf xy times M6 PMIk ( )log
( ) ( )
kNf xy
f x f ylowast lowast
M7 PMI2 2( )log
( ) ( )
Nf xy
f x f ylowast lowast
M8 Mutual Dependency 2( )log( ) ( )P xy
P x P y
M9 Driver-Kroeber
( )( )
a
a b a c+ +
M10 Normalized expectation
2
2
a
a b c+ +
M11 Jaccard a
a b c+ +
M12 First Kulczynski a
b c+
M13 Second Sokal-Sneath 2( )
a
a b c+ +
M14 Third Sokal-Sneath
a d
b c
+
+
M15 Sokal-Michiner a d
a b c d
+
+ + +
M16 Rogers-Tanimoto
2 2
a d
a b c d
+
+ + +
M17 Hamann ( ) ( )a d b c
a b c d
+ minus +
+ + +
M18 Odds ratio adbc
M19 Yulersquos ω ad bc
ad bc
minus
+
M20 Yulersquos Q ad bcad bc
minus+
M21 Brawn- Blanquet max( )
a
a b a c+ +
M22 Simpson
min( )
a
a b a c+ +
M23 S cost 12min( )
log(1 )1
b c
a
minus+
+
M24 Adjusted S Cost 12max( )
log(1 )1
b c
a
minus+
+
M25 Laplace 1
min( ) 2
a
a b c
+
+ +
M26 Adjusted Laplace 1
max( ) 2
a
a b c
+
+ +
M27 Fager [M9]
1max( )
2b cminus
M28 Adjusted Fager [M9]
1max( )b c
aNminus
M29 Normalized PMIs
PMI NF( )α PMI NFMax
M30 Simplified normalized PMI for VPCs
log( )
(1 )
ad
b cα αtimes + minus times
M31 Normalized MIs
MI NF( )α MI NFMax
NF( )α = ( )P xα lowast + (1 ) ( )P yαminus lowast [0 1]α isin NFMax = max( ( ) ( ))P x P ylowast lowast
11 ( )a f f xy= = 12 ( )b f f xy= =
21 ( )c f f xy= = 22 ( )d f f xy= =
( )f xlowast ( )f xlowast
( )f ylowast ( )f ylowast N
Table 1 Association measures discussed in this paper Starred AMs () are developed in this work
Contingency table of a bigram (x y) recording co-occurrence and marginal frequencies w stands for all words except w stands for all words N is total number of bigrams The expected frequency under the independence assumption is ˆ ( ) ( ) ( ) f xy f x f y N= lowast lowast
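For concreteness, several of the Table 1 measures can be computed directly from the contingency counts a, b, c, d. The helper below is a sketch (function and key names are mine, not from the paper):

```python
from math import log, sqrt

def am_scores(a, b, c, d):
    """A few Table 1 measures from contingency counts
    a = f(xy), b = f(x,~y), c = f(~x,y), d = f(~x,~y)."""
    N = a + b + c + d
    fx, fy = a + b, a + c                            # marginals f(x*), f(*y)
    return {
        "joint_prob":     a / N,                     # M1
        "pmi":            log(N * a / (fx * fy)),    # M4
        "driver_kroeber": a / sqrt(fx * fy),         # M9
        "jaccard":        a / (a + b + c),           # M11
        "odds_ratio":     a * d / (b * c),           # M18
        "yules_q":        (a*d - b*c) / (a*d + b*c), # M20
    }

print(am_scores(a=30, b=10, c=20, d=940)["jaccard"])   # 0.5
```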
permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, and so on. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, again allowing a maximum of 4 intervening words. The above extraction process produced a total of 8,652 VPC and 11,465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1,629 VPCs and 1,254 LVCs.
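The right-search for LVC candidates can be sketched analogously. The tags below are illustrative, and lemmatization of the light verb as well as the passive fallback search are omitted:

```python
LIGHT_VERBS = {"do", "get", "give", "have", "make", "put", "take"}

def lvc_candidates(tagged, max_gap=4):
    """tagged: list of (word, POS) pairs.
    Search right of each light verb for the nearest noun, allowing up to
    max_gap intervening words (quantifiers, adjectives, adverbs, ...)."""
    found = []
    for i, (word, pos) in enumerate(tagged):
        if word.lower() in LIGHT_VERBS and pos.startswith("V"):
            for j in range(i + 1, min(len(tagged), i + 2 + max_gap)):
                if tagged[j][1].startswith("N"):   # nearest noun to the right
                    found.append((word.lower(), tagged[j][0].lower()))
                    break
    return found

sent = [("They", "PRP"), ("make", "VBP"), ("a", "DT"),
        ("serious", "JJ"), ("mistake", "NN")]
print(lvc_candidates(sent))   # [('make', 'mistake')]
```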
Separately, we use the following two available sources of annotations: 3,078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al. (2006). Both sets of annotations give both positive and negative examples.
Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                        VPC data       LVC data    Mixed
Total (freq ≥ 6)        413            100         513
Positive instances      117 (28.33%)   28 (28%)    145 (28.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.
3.2 Evaluation Metric
To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
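One common way to operationalize AP is as the mean of precision@k over the ranks k at which true positives occur. The sketch below assumes that formulation; the exact interpolation details used in the cited work may differ:

```python
def average_precision(ranked_labels):
    """ranked_labels: gold labels (True = genuine MWE) of the candidates,
    best-scored first.  AP = mean of precision@k at each positive's rank."""
    precisions, hits = [], 0
    for k, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# a ranking that places positives 1st and 3rd among four candidates:
print(average_precision([True, False, True, False]))   # (1/1 + 2/3) / 2 ≈ 0.833
```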
3.3 Evaluation Results
Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006) when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis-test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between observed and expected frequencies that is relied on in hypothesis tests.
4 Rank Equivalence
We note that some AMs, although not mathematically equivalent (i.e., not assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r_C M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck) for all cj, ck belonging to C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:
34
Corollary: If M1 ≡r_C M2, then AP_C(M1) = AP_C(M2), where AP_C(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the three AMs Odds ratio, Yule's ω and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. The five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with the (simplest) representative of each rank equivalent group.
¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types. Hence, we still consider such AM pairs to be rank equivalent.
1) [M2] Mutual Information; [M3] Log likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (a.k.a. Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (a.k.a. Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
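Rank equivalence is easy to verify numerically. For instance, Yule's Q = (ad − bc)/(ad + bc) is a monotone transform of the odds ratio OR = ad/bc, namely Q = (OR − 1)/(OR + 1), so both must order any candidate set identically wherever both are defined. A quick check (helper names are mine):

```python
import random

def odds_ratio(a, b, c, d):          # [M18]
    return (a * d) / (b * c)

def yules_q(a, b, c, d):             # [M20]
    return (a * d - b * c) / (a * d + b * c)

def ranking(am, candidates):
    """Indices of candidates, best-scored first, under measure am."""
    return sorted(range(len(candidates)), key=lambda i: -am(*candidates[i]))

random.seed(0)
cands = [tuple(random.randint(1, 50) for _ in range(4)) for _ in range(200)]
assert ranking(odds_ratio, cands) == ranking(yules_q, cands)
print("odds ratio and Yule's Q are rank equivalent on this sample")
```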
5 Examination of Association Measures
We highlight two important findings from our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.
5.1 Mutual Information and Pointwise Mutual Information
In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in
[Figure 1a: AP profile of AMs examined over our VPC data set (y-axis: AP, 0.1–0.6)]

[Figure 1b: AP profile of AMs examined over our LVC data set (y-axis: AP, 0.1–0.6)]

Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Bold points indicate AMs discussed in this paper. Markers distinguish hypothesis-test AMs, Class I AMs excluding hypothesis-test AMs (◊), and context-based AMs (+).
identifying LVCs. In this section, we show how their performance can be improved significantly.
Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

    MI(U; V) = Σ_{u,v} p(uv) log [ p(uv) / (p(u*) p(*v)) ]

In the context of bigrams, the above formula can be simplified to:

    [M2] MI = (1/N) Σ_ij f_ij log (f_ij / f̂_ij)

While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

    PMI(x, y) = log [ P(xy) / (P(x*) P(*y)) ] = log [ N f(xy) / (f(x*) f(*y)) ]

It has long
been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI by f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying PMI directly by f(xy) reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs           LVCs           Mixed
Best [M6] PMI^k    54.7 (k=.13)   57.3 (k=.85)   54.4 (k=.32)
[M4] PMI           51.0           56.6           51.5
[M5] Local-PMI     25.9           39.3           27.2
[M1] Joint Prob.   17.0           2.8            17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.
Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X; Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α P(x*) + (1−α) P(*y), with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks, respectively. If one re-writes MI as (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.
AM            VPCs           LVCs           Mixed
MI / NF(α)    50.8 (α=.48)   58.3 (α=.47)   51.6 (α=.5)
MI / NFmax    50.8           58.4           51.8
[M2] MI       27.3           43.5           28.9
PMI / NF(α)   59.2 (α=.8)    55.4 (α=.48)   58.8 (α=.77)
PMI / NFmax   56.5           51.7           55.6
[M4] PMI      51.0           56.6           51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
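A minimal sketch of the proposed normalizations (counts and α here are invented for illustration):

```python
from math import log

def pmi(a, b, c, d):                                  # [M4]
    N = a + b + c + d
    return log(N * a / ((a + b) * (a + c)))

def pmi_nf_alpha(a, b, c, d, alpha):
    """PMI / NF(alpha), with NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y)."""
    N = a + b + c + d
    nf = (alpha * (a + b) + (1 - alpha) * (a + c)) / N
    return pmi(a, b, c, d) / nf

def pmi_nf_max(a, b, c, d):
    """PMI / NFmax, with NFmax = max(P(x*), P(*y))."""
    N = a + b + c + d
    return pmi(a, b, c, d) / (max(a + b, a + c) / N)

# since NFmax < 1, normalization rescales every positive PMI score upward,
# but candidates with large marginals (e.g., frequent particles) are boosted less
print(pmi_nf_max(a=30, b=10, c=2000, d=7960) > pmi(30, 10, 2000, 7960))   # True
```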
5.2 Penalization Terms
It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of a candidate being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulas whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.
Take as an example the AMs [M21] Brawn-Blanquet (a.k.a. Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S Cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.
AM                       VPCs   LVCs   Mixed
[M21] Brawn-Blanquet     47.8   57.8   48.6
[M22] Simpson            24.9   38.2   26.0
[M24] Adjusted S Cost    48.5   57.7   49.2
[M23] S cost             24.9   38.8   26.0
[M26] Adjusted Laplace   48.6   57.7   49.3
[M25] Laplace            24.1   38.8   25.4

Table 6: Replacing min() with max() in selected AMs.
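The effect of the min→max swap is visible on a single skewed candidate, e.g., a free combination involving a high-frequency particle (counts invented): Brawn-Blanquet's max(·) denominator penalizes the frequent constituent, while Simpson's min(·) barely does.

```python
def brawn_blanquet(a, b, c):   # [M21]: a / max(a+b, a+c), penalizes the frequent constituent
    return a / max(a + b, a + c)

def simpson(a, b, c):          # [M22]: a / min(a+b, a+c)
    return a / min(a + b, a + c)

# invented counts: verb-only b = 5, particle-only c = 2000 (a frequent particle)
print(simpson(20, 5, 2000))          # 0.8
print(brawn_blanquet(20, 5, 2000))   # ~0.0099
```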
In the [M27] Fager AM, the penalization term max(b, c) is subtracted from a first term which is no stranger, but rank equivalent to [M7] PMI². In our application, this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ with a factor that scales max(b, c) down appropriately. Using the independence assumption, we have approximately derived √(aN) as a lower-bound estimate of the larger marginal, producing [M28] Adjusted Fager, which subtracts max(b, c)/√(aN) instead. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                     VPCs   LVCs   Mixed
[M28] Adjusted Fager   56.4   54.3   55.4
[M27] Fager            55.2   43.9   52.5

Table 7: Performance of Fager and its adjusted version.
The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary that we scale b and c properly, as in −α×b − (1−α)×c. Having scaled the constituents properly, we still see that [M14] by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with:

    [MR14] = PMI / (α×b + (1−α)×c)

The denominator of [MR14] is obtained by removing the minus sign from the scaled [M14], so that it can be used as a penalization term. The choice of PMI for the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α P(x*) + (1−α) P(*y), which was successfully used to divide PMI in the normalized-PMI experiment. We heuristically simplified [MR14] to the following AM:

    [M30] = log(ad) / (α×b + (1−α)×c)

The settings of α in Table 8 below are taken from the best α settings obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], while computationally simpler than normalized PMI, performs as well as normalized PMI, and better than Third Sokal-Sneath over the VPC data set.

AM                                 VPCs          LVCs           Mixed
PMI / NF(α)                        59.2 (α=.8)   55.4 (α=.48)   58.8 (α=.77)
[M30] log(ad)/(α×b + (1−α)×c)      60.0 (α=.8)   48.4 (α=.48)   58.8 (α=.77)
[M14] Third Sokal-Sneath           56.5          45.3           54.6

Table 8: AP performance of suggested VPC penalization terms and AMs.
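[M30] is cheap to compute from the contingency counts alone; a sketch (α as above, counts invented):

```python
from math import log

def m30(a, b, c, d, alpha=0.8):
    """[M30]: log(ad) / (alpha*b + (1-alpha)*c); alpha = .8 was best for VPCs."""
    return log(a * d) / (alpha * b + (1 - alpha) * c)

# a larger particle-only count c enlarges the penalization term, lowering the score
assert m30(30, 10, 2000, 7960) < m30(30, 10, 200, 7960)
print("frequent particle penalized")
```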
With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α × c^(1−α). The best α value is also taken from the normalized PMI experiments (Table 5), which is nearly .5. Under this setting, this penalization term is exactly the denominator of the [M18] Odds Ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.

AM                 VPCs   LVCs   Mixed
−b − c             56.5   45.3   54.6
1/(bc)             50.2   53.2   50.2
[M18] Odds ratio   44.3   56.7   45.6

Table 9: AP performance of suggested LVC penalization terms and AMs.
6 Conclusions
We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and we find that the latter are not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.
We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.
We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α × c^(1−α) is suitable for LVCs. Our introduced tuning parameter α should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.
Acknowledgement: This work was partially supported by a National Research Foundation grant, "Interactive Media Search" (grant R-252-000-325-279).
References

Baldwin, Timothy. (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Baldwin, Timothy and Villavicencio, Aline. (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Daille, Béatrice. (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Evert, Stefan. (2004). Online repository of association measures, http://www.collocations.de/, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.

Evert, Stefan and Krenn, Brigitte. (2001). Methods for qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.

Katz, Graham and Giesbrecht, Eugenie. (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.

Krenn, Brigitte and Evert, Stefan. (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.

Manning, Christopher D. and Schütze, Hinrich. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Pearce, Darren. (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.

Pecina, Pavel and Schlesinger, Pavel. (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.

Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan. (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.

Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco, and Villavicencio, Aline. (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.

Smadja, Frank. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

Tan, Y. Fan, Kan, M. Yen, and Cui, Hang. (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context, pages 49–56, Trento, Italy.

Zhai, Chengxiang. (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.
Appendix A. List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R Mahesh K Sinha
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in
Abstract
A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb that behaves as a single verbal unit. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded only limited success in mining this data. In this paper we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression whose meaning is distinct from the meaning of its light verb. Although the methodology has several shortcomings, this empirical method surprisingly yields CPs with an average precision of 89% and a recall of 90%.
1 Introduction
Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook 1974; Abbi 1992; Verma 1993; Mohanan 1994; Singh 1994; Butt 1995; Butt and Geuder 2001; Butt and Ramchand 2001; Butt et al. 2003). A complex predicate is a multiword expression (MWE) in which a noun, a verb, or an adjective is followed by a light verb (LV), and the MWE behaves as a single verbal unit. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.
CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks such as machine translation, information retrieval, summarization, etc. A mere listing of CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al. 2007), and other NLP system designers. Computational methods using a Hindi corpus have been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina 2006), using the Giza++ word alignment tool to align the projected POS information; a precision of 83% and a recall of 46% have been reported.

In this paper we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section the nature of CPs in Hindi is outlined, followed by the system design, experimentation, and results.
2 Complex Predicates in Hindi
A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective, or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV, or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh 1994; Butt 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.
Given below are some examples.

(1) CP = noun+LV; noun = ashirwad (blessings), LV = denaa (to give)
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
'he blessed me'

(2) No CP
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
'he gave me a book'

In (1) the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings) makes the complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2) the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2); whether it acts as a light verb depends upon the semantics of the preceding noun. However, it is observed that in the case of the complex predicate the English meaning is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach to mining CPs.

(3) CP = adjective+LV; adjective = khush (happy), LV = karanaa (to do)
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa (to read), LV = lenaa (to take)
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases the light verb acts as an aspectual, a modal, or an intensifier.

(5) CP = verb+LV; verb = phaadanaa (to tear), LV = denaa (to give)
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa (to give), LV = maaranaa (to hit, to kill)
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw); the verb combination yields a new meaning. This may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV; CP = vaapas karanaa (to return), LV = denaa (to give)
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give); the preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another is to consider this a combination of two CPs, vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.
From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.
3 System Design
As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows:

1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as simple verbs (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) or its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or its morphological derivative is found, then search for the equivalent English meanings in any of their morphological forms (Figure 2) in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit (i.e., go to step (a)).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when the word is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit (i.e., go to step (a)); else the identified CP is W+LV.
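As an illustrative sketch (not the authors' code), steps (i)-(vi) can be written as a short scan over each aligned sentence pair. The tiny word lists below are romanized stand-ins for Table 1 and Figures 1-4, and the example pairs are examples (1) and (2) from Section 2.

```python
# Minimal stand-ins for Table 1 and Figures 1-4 (illustrative, romanized).
LIGHT_VERB_FORMS = {"diyaa": "denaa", "dii": "denaa"}   # LV surface form -> base
LV_ENGLISH_FORMS = {"denaa": {"give", "gives", "gave", "given", "giving"}}
STOP_WORDS = {"nahiin", "bhii", "hii"}                  # negation/emphasis particles
EXIT_WORDS = {"ne", "ko", "kaa", "ke", "kii", "aur"}    # case markers, conjunctions

def mine_cps(hindi_tokens, english_tokens):
    """Return CPs (W, LV base form) hypothesized in one aligned sentence pair."""
    cps = []
    english = {t.lower() for t in english_tokens}
    for k, token in enumerate(hindi_tokens):
        base = LIGHT_VERB_FORMS.get(token)
        if base is None:
            continue                          # step (i): not a light verb form
        if LV_ENGLISH_FORMS[base] & english:
            continue                          # steps (ii)-(iii): LV meaning present
        j = k - 1                             # steps (iii)-(iv): scan left of K
        while j >= 0 and hindi_tokens[j] in STOP_WORDS:
            j -= 1                            # ignore stop words
        if j >= 0 and hindi_tokens[j] not in EXIT_WORDS:
            cps.append((hindi_tokens[j], base))   # step (vi): CP is W + LV
    return cps

# Example (1): "usane mujhe ashirwad diyaa" / "he blessed me" -> CP found
print(mine_cps(["usane", "mujhe", "ashirwad", "diyaa"], ["he", "blessed", "me"]))
# -> [('ashirwad', 'denaa')]
# Example (2): "usane mujhe ek pustak dii" / "he gave me a book" -> no CP
print(mine_cps(["usane", "mujhe", "ek", "pustak", "dii"],
               ["he", "gave", "me", "a", "book"]))
# -> []
```

A real run would generate the full morphological form sets of Figures 1 and 2 rather than listing inflected forms by hand, as the paper describes.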
Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as simple verbs, is given in Table 1. The light verb kar (do) is the most frequently used. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts; such meanings are shown within parentheses and are not used for matching.
light verb (base form)    root verb meaning
baithanaa बैठना           sit
bananaa बनना              make/become/build/construct/manufacture/prepare
banaanaa बनाना            make/build/construct/manufacture/prepare
denaa देना                give
lenaa लेना                take
paanaa पाना               obtain/get
uthanaa उठना              rise/arise/get-up
uthaanaa उठाना            raise/lift/wake-up
laganaa लगना              feel/appear/look/seem
lagaanaa लगाना            fix/install/apply
cukanaa चुकना             (finish)
cukaanaa चुकाना           pay
karanaa करना              (do)
honaa होना                happen/become/be
aanaa आना                 come
jaanaa जाना               go
khaanaa खाना              eat
rakhanaa रखना             keep/put
maaranaa मारना            kill/beat/hit
daalanaa डालना            put
haankanaa हाँकना          drive

Table 1: Some of the common light verbs in Hindi
For each Hindi light verb, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of a light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations.

A number of words can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation or are emphasizers, intensifiers, interrogative pronouns, or particles. A list of the stop words used in the experimentation is given in Figure 3.
Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa (जाना, to go)

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa (लेना, to take)

Figure 2: Morphological derivatives of sample English meanings
We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions, and pronouns as the 'exit words' in our implementation; Figure 4 shows a partial list. This list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform part-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.
English word: sit. Morphological derivations: sit, sits, sat, sitting.
English word: give. Morphological derivations: give, gives, gave, given, giving.
...
LV: jaanaa (जाना, to go). Morphological derivatives: jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgaa, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgii, jaakara:
जा (go, stem), जाए (go, imperative),
जाओ (go, imperative), जाएँ (go, imperative),
जाऊँ (go, first-person), जाने (go, infinitive, oblique),
जाना (go, infinitive, masculine, singular),
जानी (go, infinitive, feminine, singular),
जाता (go, indefinite, masculine, singular),
जाती (go, indefinite, feminine, singular),
जाते (go, indefinite, masculine, plural/oblique),
जानीं (go, infinitive, feminine, plural),
जातीं (go, indefinite, feminine, plural),
जाओगे (go, future, masculine, singular),
जाओगी (go, future, feminine, singular),
गई (go, past, feminine, singular),
जाऊँगा (go, future, masculine, first-person, singular),
जायेगा (go, future, masculine, third-person, singular),
जाऊँगी (go, future, feminine, first-person, singular),
जायेगी (go, future, feminine, third-person, singular),
गये (go, past, masculine, plural/oblique),
गईं (go, past, feminine, plural),
गया (go, past, masculine, singular),
गयी (go, past, feminine, singular),
जायेंगे (go, future, masculine, plural),
जायेंगी (go, future, feminine, plural),
जाकर (go, completion), ...
LV: lenaa (लेना, to take). Morphological derivatives: le, lii, leM, lo, letaa, letii, lete, liiM, luuM, legaa, legii, lene, lenaa, lenii, liyaa, leMge, loge, letiiM, luuMgaa, luuMgii, lekara:
ले (take, stem), ली (take, past),
लें (take, imperative), लो (take, imperative),
लेता (take, indefinite, masculine, singular),
लेती (take, indefinite, feminine, singular),
लेते (take, indefinite, masculine, plural/oblique),
लीं (take, past, feminine, plural), लूँ (take, first-person),
लेगा (take, future, masculine, third-person, singular),
लेगी (take, future, feminine, third-person, singular),
लेने (take, infinitive, oblique),
लेना (take, infinitive, masculine, singular),
लेनी (take, infinitive, feminine, singular),
लिया (take, past, masculine, singular),
लेंगे (take, future, masculine, plural),
लोगे (take, future, masculine, singular),
लेतीं (take, indefinite, feminine, plural),
लूँगा (take, future, masculine, first-person, singular),
लूँगी (take, future, feminine, first-person, singular),
लेकर (take, completion), ...
Figure 3: Stop words in Hindi used by the system

Figure 4: A few exit words in Hindi used by the system
The inner loop of the procedure identifies multiple CPs that may be present in a sentence; the outer loop mines the CPs over the entire corpus. The experimentation and results are discussed in the following section.
4 Experimentation and Results
The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham 2000) English-Hindi parallel corpus. A summary of the results is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%; the F-measure is 88% to 97%. This is a remarkable and somewhat surprising result for a simple methodology without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. 2006 (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; that is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                 File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                    112     193     102      43     133     107
Total no. of CPs (N)                200     298     150      46     188     151
Correctly identified CPs (TP)       195     296     149      46     175     135
V-V CPs                              56      63       9       6      15      20
Incorrectly identified CPs (FP)      17      44       7      11      16      20
Unidentified CPs (FN)                 5       2       1       0      13      16
Accuracy (%)                      97.50   99.33   99.33  100.00   93.08   89.40
Precision (TP/(TP+FP)) (%)        91.98   87.05   95.51   80.70   91.62   87.09
Recall (TP/(TP+FN)) (%)           97.50   98.33   99.33  100.00   93.08   89.40
F-measure (2PR/(P+R)) (%)          94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation
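As a quick sanity check on Table 2 (not part of the original system), each column follows from its TP/FP/FN counts via the formulas in the row labels; for instance, for File 1:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from true positives, false positives
    and misses, as defined in the row labels of Table 2."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# File 1: TP = 195, FP = 17, FN = 5
p, r, f = prf(195, 17, 5)
print(round(100 * p, 2), round(100 * r, 2))   # 91.98 97.5, as reported for File 1
```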
Content of Figure 4 (exit words): ने (ergative case marker), को (accusative case marker), का (possessive case marker), के (possessive case marker), की (possessive case marker), से (from/by/with), में (in/into), पर (on/but), और (and; particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that; particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा (my), मेरे (my), मेरी (my), तुम्हारा (your), तुम्हारे (your), तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना (own), अपनी (own), अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका (whose), जिसकी (whose), जिसके (whose), जिनको (to whom), जिनके (to whom).

Content of Figure 3 (stop words): नहीं (no/not), न (no/not; particle), भी (also; particle), ही (only; particle), तो (then; particle), क्यों (why), क्या (what; particle), कहाँ (where; particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अंत में (in the end), आखिर में (in the end).
Given below are some sample outputs.

(1) English sentence: "I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help."
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence:
i. काम करना (to work)
ii. अच्छा लगना (to feel good, enjoy)
iii. सलाह लेना (to seek advice)
iv. खुशी होना (to feel happy, good)
v. मदद करना (to help)

Here the system identified five different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al. 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: "Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career."
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।
The CPs identified in the sentence:
i. आदर्श बनना (to be a role model)
ii. कद्र करना (to respect)
Here also, the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the word preceding the LV is not checked. However, the mining success rate obtained suggests that these cases are few in practice. Use of the 'stop words', which allows intervening words within a CP, helps considerably in improving performance; similarly, use of the 'exit words' avoids much incorrect identification.
5 Conclusions
The simple empirical method for mining CPs outlined in this work yields an average of 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. It can also be used in word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this will increase both alignment accuracy and speed.

The methodology presented in this work is equally applicable to all other languages of the Indo-Aryan family.
References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32.

Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.

Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.
Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.

Mona Singh. 1994. Perfectivity, Definiteness, and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of International Joint Conference on Natural Language Processing - 2005, Jeju Island, Korea, 553–564.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 47–54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2

1Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
2Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590

{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn, huangyun@comp.nus.edu.sg
Abstract
Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, a state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.
1 Introduction
The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into an existing phrase-based model.
A multiword expression (MWE) can be considered a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e., the property that the whole expression cannot be derived from its component words. Stanford University launched a MWE project1, in which different qualities of MWEs were presented. For a bilingual multiword expression (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language, and (2) the source phrase and the target phrase are exact translations of each other, i.e., there is no additional (boundary) word in the target phrase without a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation, and so on.
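Condition (2) can be made concrete against a word alignment. The sketch below is an illustration (the set-of-links data format is an assumption, not from the paper): it treats an alignment as a set of (source index, target index) links and checks that no link crosses the candidate phrase pair's boundary. Unaligned words inside the spans are not checked here.

```python
def exact_translation(alignment, src_span, tgt_span):
    """True if every alignment link is either entirely inside or entirely
    outside the candidate (source span, target span) pair; a crossing link
    means some boundary word lacks a counterpart inside the other span."""
    s_lo, s_hi = src_span   # inclusive source index range
    t_lo, t_hi = tgt_span   # inclusive target index range
    for s, t in alignment:
        in_src = s_lo <= s <= s_hi
        in_tgt = t_lo <= t <= t_hi
        if in_src != in_tgt:      # link crosses the phrase-pair boundary
            return False
    return True

links = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(exact_translation(links, (0, 2), (0, 2)))   # True: spans match exactly
print(exact_translation(links, (0, 2), (0, 3)))   # False: target word 3 aligns outside
```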
For machine translation, Piao et al. (2005) noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as one unique token before training alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in a further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.

1 http://mwe.stanford.edu
In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are as follows:
• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.
• Instead of improving translation indirectly by improving word alignment quality, we directly target translation quality. Some researchers have argued that large gains in alignment performance under many metrics lead only to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).
Besides the above differences, our approaches have some advantages:
• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in the word alignment would otherwise propagate to the SMT system and hurt SMT performance.
• We conduct experiments on a domain-specific corpus. For one thing, a domain-specific corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e., a domain-specific corpus has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are flexible and concise. Take the Chinese term "软坚散结" for example. The meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or without this context. Such terms are difficult to translate even for humans, let alone machine translation, so treating them as MWEs and applying them in an SMT system has practical significance.

2 http://www.statmt.org/moses
• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. This allows full exploitation of the available resources without greatly increasing the time and space complexity of the SMT system.
The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we give only a brief introduction to monolingual and bilingual MWE extraction.
2 Bilingual Multiword Expression Extraction
In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.
2.1 Automatic Extraction of MWEs
In the past two decades, many different approaches to automatic MWE identification have been reported. In general, those approaches fall into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et al., 2005), (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007), and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparation step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.
The Log-Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary lengths. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".
Definition 1 Unit A unit is any sub-string of thegiven sentence For example ldquoA Brdquo ldquo Crdquo ldquo C D Erdquoare all units but ldquoA B Drdquo is not a unit
Definition 2 List A list is an ordered sequence ofunits which exactly cover the given sentence ForexampleldquoArdquoldquo B C Drdquoldquo Erdquo forms a list
Definition 3 Score The score function only de-fines on two adjacent units and return the LLRbetween the last word of first unit and the firstword of the second unit3 For example the scoreof adjacent unit ldquoB Crdquo and ldquoD Erdquo is defined asLLR(ldquo Crdquoldquo Drdquo)
Definition 4 (Select) The selecting operator finds the two adjacent units with the maximum score in a list.
Definition 5 (Reduce) The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce the unit "B C" and the unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".
Initially, every word in the sentence is considered one unit, and all these units form an initial list L. If the sentence is of length N, then the list of course contains N units.

³We use a stoplist to eliminate units containing function words by setting their score to 0.

The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterative loop with two steps: (1) select the two adjacent units with the maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L, inserting the reduction result into the final set S. The algorithm terminates on two conditions: (1) the maximum score after selection is less than a given threshold, or (2) L contains only one unit.
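The select/reduce loop above can be sketched as follows. This is a minimal sketch, not the authors' implementation: `llr` stands in for the actual LLR computation over the corpus, and the stoplist of footnote 3 can be emulated by having `llr` return 0 for pairs containing function words.

```python
from typing import Callable, List, Set

def hra(words: List[str],
        llr: Callable[[str, str], float],
        threshold: float) -> Set[str]:
    """LLR-based Hierarchical Reducing Algorithm (sketch).

    Repeatedly merges the adjacent pair of units with the highest
    boundary LLR, collecting every merged unit as an MWE, until the
    best score drops below the threshold or one unit remains.
    """
    units = [(w,) for w in words]        # initial list L: one unit per word
    mwes: Set[str] = set()               # final set S
    while len(units) > 1:
        # Select: score of adjacent units = LLR between the last word
        # of the first unit and the first word of the second unit.
        best_i, best_score = -1, float("-inf")
        for i in range(len(units) - 1):
            score = llr(units[i][-1], units[i + 1][0])
            if score > best_score:
                best_i, best_score = i, score
        if best_score < threshold:       # termination condition (1)
            break
        # Reduce: concatenate the two units, put the result back.
        merged = units[best_i] + units[best_i + 1]
        units[best_i:best_i + 2] = [merged]
        mwes.add(" ".join(merged))
    return mwes
```

On a toy sentence "A B C D" with high LLR for the pairs (A, B) and (C, D), the algorithm extracts "A B" and "C D" and then stops, since the remaining boundary falls below the threshold.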
[Figure 1 here: the hierarchical reduction tree of the example sentence, showing how adjacent units are merged bottom-up according to the LLR scores of the adjacent word pairs; the Chinese characters and score values were garbled in extraction.]
Figure 1: Example of the Hierarchical Reducing Algorithm
Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is the Chinese phrase meaning "healthy tea for preventing hyperlipidemia".⁴ Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs (glossed "hyper-lipid", "hyperlipidemia", "healthy tea", and "preventing hyperlipidemia"; the original Chinese characters were garbled in extraction) are extracted in that order, and the sub-strings above the dotted line in Figure 1 are not extracted.
From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structure of natural language. Furthermore, in the HRA, MWEs are scored with the minimum LLR of the adjacent words within them, which gives a lexical confidence for the extracted MWEs. Finally, if the given sentence has length N, the HRA is guaranteed to terminate within N − 1 iterations, which is very efficient.
However, the HRA has the problem that it extracts substrings before extracting the whole string, even if the substrings appear only in that particular whole string, which we consider useless. To solve this problem, we use contextual features,
⁴The whole sentence means "healthy tea for preventing hyperlipidemia"; the glosses of the Chinese words (garbled in extraction) are "preventing", "hyper-", "-lipid-", "-emia", "for", "healthy", and "tea".
namely contextual entropy (Luo and Sun, 2003) and the C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.
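To illustrate the kind of contextual filter involved, here is a minimal sketch of contextual entropy: the entropy of the distribution of words adjacent to a candidate string. A substring that occurs only inside one longer MWE sees almost the same neighbors every time and thus has near-zero entropy, so it can be filtered out. The function name and interface are ours, not from the cited papers.

```python
import math
from collections import Counter
from typing import Iterable

def contextual_entropy(neighbors: Iterable[str]) -> float:
    """Entropy of the observed left (or right) neighbors of a
    candidate string; low entropy suggests the string is not an
    independent unit (sketch)."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A candidate seen with varied neighbors (["a", "b", "a", "b"]) scores 1 bit, while one always preceded by the same word (["a", "a"]) scores 0 and would be filtered.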
2.2 Automatic Extraction of MWEs' Translations
In subsection 2.1 we described the algorithm for obtaining MWEs; in this subsection, we introduce the procedure for finding their translations in the parallel corpus.
To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:
1. Run GIZA++⁵ to align words in the training parallel corpus.
2. For a given MWE, find the bilingual sentence pairs whose source-language sentences include the MWE.
3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).
After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which refer to how likely a candidate is to be a reasonable phrase. The translation features include: (1) the logarithm of the source-target translation probability, (2) the logarithm of the target-source translation probability, (3) the logarithm of the source-target lexical weighting, (4) the logarithm of the target-source lexical weighting, and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003), (2) the right entropy of the target phrase, (3) the first word of the target phrase, (4) the last word of the target phrase, and (5) all words in the target phrase.
⁵http://www.fjoch.com/GIZA++.html
We randomly select and annotate 33,000 phrase pairs, of which 30,000 are used as the training set and 3,000 as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
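The classifier can be sketched as a plain binary perceptron over the ten features above. This is a minimal sketch with a toy interface of our own, not the authors' code; feature extraction is omitted, and labels are +1 (right translation) / −1 (wrong).

```python
from typing import List, Sequence

def train_perceptron(data: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     epochs: int = 10) -> List[float]:
    """Binary perceptron training (Collins-style sketch).

    Each example is a feature vector (e.g., the five translation
    features plus five language features described above); on each
    misclassified example the weight vector gets an additive update.
    """
    w = [0.0] * len(data[0])
    for _ in range(epochs):
        for x, y in zip(data, labels):
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:           # misclassified (or on boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w
```

On linearly separable data the loop converges in a bounded number of updates; at classification time a candidate is accepted if its weighted feature sum is positive.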
3 Application of Bilingual MWEs
Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs, and we compare their performance in section 4.
3.1 Model Retraining with Bilingual MWEs
The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them to the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect the alignment to be modified and the probability estimation to become more reasonable. Wu et al. (2008) also used this method, to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance has been shown by many researchers (Wu et al., 2008; Eck et al., 2004); our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
3.2 New Feature for Bilingual MWEs
Lopez and Resnik (2006) pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains a bilingual MWE. In other words, if the source-language phrase contains an MWE (as a substring) and the target-language phrase contains the translation of that MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
3.3 Additional Phrase Table of Bilingual MWEs
Wu et al. (2008) proposed a method to construct a phrase table from a manually made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As for probability assignment, we simply assign 1 to each of the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table with the newly constructed bilingual MWE table. For each phrase in the input sentence, the decoder searches for candidate translation phrases in both phrase tables during translation. We use "Baseline+NewBP" to denote this method.
4 Experiments
4.1 Data
We run experiments on two domain-specific patent corpora: one for the traditional medicine domain and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.
Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit⁶ to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.
                        Chinese      English
Training   Sentences          120,355
           Words       4,688,873    4,737,843
Dev        Sentences            1,000
           Words          31,722       32,390
Test       Sentences            1,000
           Words          41,643       40,551

Table 1: Traditional medicine corpus
⁶http://www.speech.sri.com/projects/srilm/
For the chemical industry domain, Table 2 gives detailed information on the data. In this experiment, 59,466 bilingual MWEs are extracted.
                        Chinese      English
Training   Sentences          120,856
           Words       4,532,503    4,311,682
Dev        Sentences            1,099
           Words          42,122       40,521
Test       Sentences            1,099
           Words          41,069       39,210

Table 2: Chemical industry corpus
We test translation quality on the test set and use the open-source tool mteval-v11b.pl⁷ to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform statistical significance tests between translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.
4.2 MT Systems
We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features, (2) one language model feature, (3) distance-based and lexicalized distortion model features, (4) word penalty, and (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added to the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.
Features are combined in a log-linear model. To obtain the best translation ê of the source sentence f, the log-linear model uses the following equation:

    ê = argmax_e p(e|f) = argmax_e Σ_{m=1}^{M} λ_m h_m(e, f)    (1)

in which h_m and λ_m denote the m-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
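Equation (1) amounts to scoring each candidate translation by a weighted sum of its feature values and taking the argmax. A minimal sketch, with illustrative feature names of our own rather than the actual Moses feature set:

```python
from typing import Dict, List

def best_translation(candidates: List[Dict[str, float]],
                     weights: Dict[str, float]) -> int:
    """Return the index of the candidate e maximizing
    sum_m lambda_m * h_m(e, f), as in equation (1)."""
    def score(feats: Dict[str, float]) -> float:
        return sum(weights[name] * value for name, value in feats.items())
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))
```

In a real decoder the candidate set is built incrementally rather than enumerated, but the scoring rule is the same weighted sum; tuning the λ weights shifts which candidate wins.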
⁷http://www.nist.gov/speech/tests/mt/resources/scoring.htm
4.3 Results
Method            BLEU
Baseline          0.2658
Baseline+BiMWE    0.2661
Baseline+Feat     0.2675
Baseline+NewBP    0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain
Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement, and Baseline+BiMWE achieves slightly higher translation quality than the baseline system.
To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances were wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase to the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined by the following formula:
    Score(s, t) = log(LLR(s, t)) × (1 / (√(2π) σ)) e^{−(x−μ)² / (2σ²)}    (2)

Here, the mean μ and the variance σ² are estimated from the training set. After ranking by this score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs and apply the three methods described in section 3. The results are shown in Table 4.
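Equation (2) can be computed directly once LLR(s, t) and the length ratio x are given; a minimal sketch:

```python
import math

def ranking_score(llr: float, x: float, mu: float, sigma: float) -> float:
    """Score(s, t) of equation (2): the phrase pair's log-LLR weighted
    by a Gaussian density over the source/target length ratio x, so
    pairs with atypical length ratios are pushed down the ranking."""
    gauss = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) \
            / (math.sqrt(2 * math.pi) * sigma)
    return math.log(llr) * gauss
```

The score is maximized (for a fixed LLR) when x equals the estimated mean μ, and decays symmetrically as the length ratio deviates from it.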
From this table we can conclude: (1) all three methods improve the BLEU score in all settings; (2) except for the Baseline+BiMWE method, the methods obtain significant improvements in BLEU (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the set of bilingual MWEs is relatively small (50,000 or 60,000), the Baseline+Feat method performs better
Method            50,000   60,000   70,000
Baseline                   0.2658
Baseline+BiMWE    0.2671   0.2686   0.2715
Baseline+Feat     0.2728   0.2734   0.2712
Baseline+NewBP    0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain
than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing tables 4 and 3, we can see that more bilingual MWEs do not necessarily mean better phrase-based SMT performance. This conclusion is the same as that of (Lambert and Banchs, 2005).
To verify that bilingual MWEs improve SMT performance not only in one particular domain, we also performed experiments in the chemical industry domain. Table 5 shows the results. From this table, we can see that the three methods improve translation performance in the chemical industry domain as well as in the traditional medicine domain.
Method            BLEU
Baseline          0.1882
Baseline+BiMWE    0.1928
Baseline+Feat     0.1917
Baseline+NewBP    0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain
4.4 Discussion
In order to understand in what respects our methods improve translation performance, we manually analyzed some test sentences and give some examples in this subsection.
(1) For the first example in Table 6, the Chinese phrase meaning "dredging meridians" (characters garbled in extraction) is aligned to other words and not correctly translated in the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since the corresponding bilingual MWE has been added to the training corpus and then aligned by GIZA++.
(2) For the second example in Table 6, the Chinese source phrase (characters garbled in extraction) has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as its translation, while the Baseline+Feat system chooses "med-
Src       [Chinese sentence; characters garbled in extraction]
Ref       the obtained product is effective in tonifying blood expelling cold dredging meridians promoting production of body fluid promoting urination and tranquilizing mind and can be used for supplementing nutrition and protecting health
Baseline  the food has effects in tonifying blood dispelling cold promoting salivation and water and tranquilizing and tonic effects and making nutritious health
+BiMWE    the food has effects in tonifying blood dispelling cold dredging meridians promoting salivation promoting urination and tranquilizing tonic nutritious pulverizing

Src       [Chinese sentence; characters garbled in extraction]
Ref       the product can also be made into tablet pill powder medicated tea or injection
Baseline  may also be made into tablet pill powder tea or injection
+Feat     may also be made into tablet pill powder medicated tea or injection

Table 6: Translation examples
icated tea", because the additional feature gives a high probability to the correct translation "medicated tea".
5 Conclusion and Future Work
This paper has presented the LLR-based Hierarchical Reducing Algorithm for automatically extracting bilingual MWEs, and has investigated the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to represent whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.
The strategies for applying bilingual MWEs in this paper are rather simple and coarse. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in a phrase-based SMT system, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.
References
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.
Proceedings of the 2009 Workshop on Multiword Expressions (ACL-IJCNLP 2009), pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp
Abstract
This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F measure of 89.79, which is higher than previous work.
1 Introduction
Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used in several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).
In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.
When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.
In sentence (1), the label of "Kazama" can be recognized to be S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).
On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized to be B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information from the head of the bunsetsu.¹ In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.
However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized to be B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.
This paper proposes Japanese bottom-up named

¹Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.
(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama TOP)
    'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences
entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combination "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F measure of 89.79, which is higher than previous work.
This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 gives an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.
2 Related Work
In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX
class          example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21st century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples
Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.
NER methods are divided into two approaches: rule-based approaches and machine learning approaches. According to previous work, machine learning approaches achieve better performance than rule-based approaches.
In general, a machine learning method is formalized as a sequential labeling problem, which assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, and E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE.² The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
The model for label estimation is learned by machine learning. The following features are generally utilized: characters, type of
²Besides these, there are the IOB1 and IOB2 algorithms, using only I, O, and B, and the IOE1 and IOE2 algorithms, using only I, O, and E (Kim and Veenstra, 1999).
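For concreteness, the SE-algorithm encoding described above can be sketched as follows; the span format and function name are ours, not from Sekine et al.:

```python
from typing import List, Tuple

def se_encode(n_tokens: int,
              spans: List[Tuple[int, int, str]]) -> List[str]:
    """Encode NE spans as SE-algorithm labels: S for a one-morpheme
    NE, B/I/E for the beginning/middle/end of a longer NE, and O
    for morphemes outside any NE.  Spans are (start, end, class)
    with `end` exclusive."""
    labels = ["O"] * n_tokens
    for start, end, cls in spans:
        if end - start == 1:
            labels[start] = f"S-{cls}"
        else:
            labels[start] = f"B-{cls}"
            labels[end - 1] = f"E-{cls}"
            for i in range(start + 1, end - 1):
                labels[i] = f"I-{cls}"
    return labels
```

With 8 NE classes this yields the 4 × 8 class-specific labels plus O, i.e. the 33 labels mentioned above.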
[Figure 2 here: a CKY-style chart for the bunsetsu "Habu-Yoshiharu-Meijin". In the initial state (a), every chunk has an estimated label and score, e.g., "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Yoshiharu" OTHER 0.092, "Yoshiharu-Meijin" MONEY 0.075, "Meijin" OTHER 0.245. In the final output (b), the assignment "Habu-Yoshiharu" (PERSON) plus "Meijin" (OTHER), with score 0.438 + 0.245, is chosen.]
Figure 2: An overview of our proposed method (for the bunsetsu "Habu-Yoshiharu-Meijin")
character, POS, etc. of the morpheme and the surrounding two morphemes. Methods utilizing SVMs or CRFs have been proposed.
Most NER methods based on sequential labeling use only local information; therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word sub-class of the NE in the analysis direction within the bunsetsu, the noun at the end of the bunsetsu adjacent to the analysis direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized cache features, coreference results, syntactic features, and caseframe features as structural features (Sasano and Kurohashi, 2008).
Some work acquires knowledge from large unannotated corpora and applies it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of dependency relation pairs (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with patterns from a large Web corpus (Fukushima et al., 2008).
In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F measure of about 0.89 using these data.
3 Overview of Proposed Method
Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are estimated using a machine learning method, as shown in Figure 2 (a).
We call the state in Figure 2 (a) the initial state, in which the labels of all the chunks have been estimated. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, as in the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the assignment "Habu-Yoshiharu" (PERSON) and the assignment "Habu" (PERSON) + "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.
The learned models are the following:
• the model that estimates the label of a chunk (label estimation model)

• the model that compares two label assignments (label comparison model)
The two models are described in detail in thenext section
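The bottom-up procedure described above can be sketched as a small CKY-style dynamic program. The sketch below is our own illustrative reading of the method, not the authors' implementation: `estimate` and `compare` stand in for the two learned models.

```python
def best_assignment(morphemes, estimate, compare):
    """CKY-style search over label assignments for one bunsetsu.

    estimate(chunk) -> label for a single chunk (label estimation model).
    compare(a, b)   -> True if assignment a is preferred over b
                       (label comparison model).  Both are hypothetical
                       stand-ins for the paper's learned models.
    """
    n = len(morphemes)
    # chart[(i, j)] holds the current best label assignment (a list of
    # (chunk, label) pairs) for the span of morphemes i..j inclusive.
    chart = {}
    for i in range(n):  # diagonal cells: single morphemes
        chart[(i, i)] = [(morphemes[i], estimate(morphemes[i]))]
    for length in range(2, n + 1):          # widen spans bottom-up
        for i in range(0, n - length + 1):
            j = i + length - 1
            whole = "".join(morphemes[i:j + 1])
            candidates = [[(whole, estimate(whole))]]  # span as one chunk
            for k in range(i, j):           # every binary split
                candidates.append(chart[(i, k)] + chart[(k + 1, j)])
            # pairwise comparison: keep the assignment the comparison
            # model prefers against every rival
            best = candidates[0]
            for cand in candidates[1:]:
                if compare(cand, best):
                    best = cand
            chart[(i, j)] = best
    return chart[(0, n - 1)]
```

With a toy `estimate` that labels both "Habu" and the merged "HabuYoshiharu" as PERSON, and a `compare` that prefers fewer chunks, the whole span comes out as one PERSON chunk, mirroring the Figure 2 example.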
Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin" ("Habu-Yoshiharu" PERSON, "Meijin" OTHERe, all other chunks invalid).
4 Model Learning
4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu, because 93.5% of named entities in the CRL NE data are located within a single bunsetsu. Exceptionally, the following expressions spanning multiple bunsetsus tend to be an NE:
• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))
• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))
Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.³
For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.
Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned the correct label from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (as shown in Table 1) and five further classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.
A chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled as OTHERb, OTHERi, or OTHERe, respectively.⁴
³As examples in which an NE is not covered even by an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).
⁴Each OTHER label is assigned to the longest chunk that satisfies its condition.
1. number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type⁵

4. the combination of the character types of adjoining morphemes
   - for the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word sub-class, and several features provided by the morphological analyzer JUMAN

6. several features⁶ provided by the parser KNP

7. string of the morpheme in the chunk

8. IPADIC⁷ feature
   - fires if the string of the chunk is registered in one of the following IPADIC categories: "person", "location", "organization", and "general"

9. Wikipedia feature
   - fires if the string of the chunk has an entry in Wikipedia
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")

10. cache feature
    - when the same string as the chunk appears in the preceding context, the label of the preceding chunk is used as the feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
    - for example, in the ga (NOM) case of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot; hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
    - fires when the chunk is in a parenthesis

Table 2: Features for the label estimation model
A chunk that is labeled neither as one of the eight NE classes nor as one of the above four OTHERs is labeled as invalid.
In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.
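Under this scheme, training labels for all chunks of a bunsetsu can be derived mechanically from the gold NE spans. The following is a simplified sketch of one possible reading (it handles OTHERs, OTHERb, and OTHERe; OTHERi is omitted for brevity, and the function names are ours, not the authors'):

```python
def gold_chunk_labels(n, gold_spans):
    """Assign training labels to every chunk (i, j) of an n-morpheme
    bunsetsu.  gold_spans maps (start, end) morpheme spans to NE classes.

    Exact NE spans keep their class; the longest NE-free whole / head /
    tail chunks get OTHERs / OTHERb / OTHERe; everything else is
    'invalid'.  A simplified reading of the labeling scheme.
    """
    def ne_free(i, j):
        # chunk (i, j) overlaps no gold NE span
        return all(j < s or e < i for (s, e) in gold_spans)

    labels = {(i, j): gold_spans.get((i, j), "invalid")
              for i in range(n) for j in range(i, n)}
    if ne_free(0, n - 1):
        labels[(0, n - 1)] = "OTHERs"
        return labels
    for j in range(n - 2, -1, -1):      # longest NE-free head chunk
        if ne_free(0, j):
            labels[(0, j)] = "OTHERb"
            break
    for i in range(1, n):               # longest NE-free tail chunk
        if ne_free(i, n - 1):
            labels[(i, n - 1)] = "OTHERe"
            break
    return labels
```

For the Figure 3 bunsetsu (three morphemes, gold PERSON on the first two), this reproduces PERSON for "Habu-Yoshiharu", OTHERe for "Meijin", and invalid elsewhere.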
Next, the label estimation model is learned from the data in which the above label set is assigned
⁵The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.
⁶When a morpheme is ambiguous, all the corresponding features fire.
⁷http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja
to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)–(8), three categories according to the position of a morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".
The label estimation model is learned from pairs of label and features for each chunk. To handle the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output x is transformed using the sigmoid function 1/(1 + exp(−βx)), and the transformed value is normalized so that the values of the 13 labels in a chunk sum to one.
The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk where the label invalid has the highest score, the label with the second highest score is adopted.
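The score calibration just described can be sketched in a few lines. This is a minimal illustration, assuming the paper's setup (one-vs-rest margins, sigmoid squashing, per-chunk normalization; β was set to 1 in the experiments):

```python
import math

def label_scores(svm_outputs, beta=1.0):
    """Map raw one-vs-rest SVM margins to normalized label scores.

    Each margin x is squashed with the sigmoid 1 / (1 + exp(-beta * x)),
    and the resulting values are rescaled so that the scores of all
    labels for the chunk sum to one.
    """
    squashed = {label: 1.0 / (1.0 + math.exp(-beta * x))
                for label, x in svm_outputs.items()}
    total = sum(squashed.values())
    return {label: v / total for label, v in squashed.items()}
```

Because the sigmoid is monotone, normalization preserves the ranking of the margins; a chunk whose "invalid" margin dominates will, as the text notes, depress the relative scores of all other labels.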
4.2 Label Comparison Model
This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:
• "Habu-Yoshiharu" is labeled as PERSON

• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY
First, as shown in Figure 4, the two compared sets of chunks are lined up, sandwiching "vs." (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for the model learning.
Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used
Figure 4: Assignment of positive/negative examples (positive: +1 "Habu-Yoshiharu" (PSN) vs. "Habu" + "Yoshiharu" (PSN + MNY); +1 "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe) vs. "Habu" + "Yoshiharu" + "Meijin" (PSN + MONEY + OTHERe); negative: −1 "Habu-Yoshiharu-Meijin" (ORG) vs. "Habu-Yoshiharu" + "Meijin" (PSN + OTHERe)).
for each label, as estimated by the label estimation model, and the 13th dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remaining part.
Figure 5 illustrates the features for "Habu-Yoshiharu" vs. "Habu + Yoshiharu". The label comparison model is learned from such data using an SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus in the example "Habu-Yoshiharu-Meijin" vs. "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for the model learning.
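The feature layout of Figure 5 can be sketched as follows. This is our own reconstruction of the described layout (12 label dimensions plus one SVM-score dimension per chunk, at most five chunks per set, zero vectors in unused slots); the function names are ours:

```python
MAX_CHUNKS = 5
NUM_LABELS = 12            # label dimensions per chunk, as in the text
DIM = NUM_LABELS + 1       # plus the raw SVM score: 13 per chunk

def set_features(chunks):
    """chunks: list of (label_probs, svm_score) pairs, at most five.

    Per-chunk 13-dimensional features are laid out left to right;
    missing chunk slots stay as zero vectors, mirroring Figure 5.
    """
    vec = [0.0] * (MAX_CHUNKS * DIM)
    for i, (label_probs, svm_score) in enumerate(chunks):
        vec[i * DIM:i * DIM + NUM_LABELS] = label_probs
        vec[i * DIM + NUM_LABELS] = svm_score
    return vec

def pair_features(first_set, second_set):
    # one comparison example: first set's features, then the second's
    return set_features(first_set) + set_features(second_set)
```

A single-chunk set thus occupies the first 13 of 65 positions, and a full example ("first set vs. second set") is a 130-dimensional vector.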
5 Analysis
First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.
For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B", that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell "Yoshiharu-Meijin", the model compares the
chunk:  Habu-Yoshiharu       Habu    Yoshiharu
label:  PERSON               PERSON  MONEY
vector: V11 0 0 0 0          V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs. Habu + Yoshiharu"; V11, V21, V22, and 0 are vectors whose dimension is 13).
Figure 6: The label comparison model ((a) label assignment for the cell "Habu-Yoshiharu": the assignment "Habu-Yoshiharu" (B) is compared with "Habu" + "Yoshiharu" (A+D); (b) label assignment for the cell "Habu-Yoshiharu-Meijin": the assignments "Habu-Yoshiharu-Meijin" (C), "Habu-Yoshiharu" + "Meijin" (B+F), and "Habu" + "Yoshiharu" + "Meijin" (A+D+F) are compared. Cell scores include "Habu" PERSON 0.111, "Habu-Yoshiharu" PERSON 0.438, "Habu-Yoshiharu-Meijin" ORGANIZATION 0.083, "Yoshiharu" MONEY 0.075, "Yoshiharu-Meijin" OTHERe 0.092, and "Meijin" OTHERe 0.245.)
label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F", that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).
As shown in Figure 6 (b), in determining the best label assignment for the upper-right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.
In the label comparison step, a label assignment in which an OTHER* follows an OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
6 Experiment
To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NEs. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were not used for both
Figure 7: 5-fold cross validation (the corpus is divided into parts (a)–(e); the label estimation model 1 is learned from parts (b)–(e), and the label estimation models 2b–2e are each learned on three parts and applied to the remaining one to obtain features for the label comparison model).
the model learning⁸ and the evaluation.

We performed 5-fold cross validation, following previous work. Unlike previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)–(e). Then, the label estimation model 2b is learned from parts (c)–(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same is done for parts
⁸Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.
                Recall              Precision
ORGANIZATION    81.83 (3008/3676)   88.37 (3008/3404)
PERSON          90.05 (3458/3840)   93.87 (3458/3684)
LOCATION        91.38 (4992/5463)   92.44 (4992/5400)
ARTIFACT        46.72 ( 349/ 747)   74.89 ( 349/ 466)
DATE            93.27 (3327/3567)   93.12 (3327/3573)
TIME            88.25 ( 443/ 502)   90.59 ( 443/ 489)
MONEY           93.85 ( 366/ 390)   97.60 ( 366/ 375)
PERCENT         95.33 ( 469/ 492)   95.91 ( 469/ 489)
ALL-SLOT        87.87               91.79
F-measure       89.79

Table 3: Experimental result
(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.
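The division just described amounts to a small nested cross-validation schedule. A hypothetical sketch (part names and function names are ours):

```python
def two_stage_folds(parts):
    """Yield the training recipe for each held-out part.

    For each test part, the label estimation model 1 is trained on the
    other four parts; features for the comparison model come from
    applying, to each of those four parts, a model trained on the
    remaining three (so the features are always produced by a model
    that never saw the part they are extracted from).
    """
    schedule = []
    for test in parts:
        rest = [p for p in parts if p != test]
        model1_train = rest
        feature_jobs = [(p, [q for q in rest if q != p]) for p in rest]
        schedule.append((test, model1_train, feature_jobs))
    return schedule
```

For parts (a)–(e), the entry for test part (a) trains model 1 on (b)–(e) and produces four feature jobs, e.g., apply a model trained on (c)–(e) to (b), matching Figure 7.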
In this experiment, the Japanese morphological analyzer JUMAN⁹ and the Japanese parser KNP¹⁰ were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.
Table 3 shows the experimental result. The F-measure over all NE classes is 89.79.
7 Discussion
7.1 Comparison with Previous Work
Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for the model learning. Therefore, by incorporating such knowledge and analysis results into our method, the performance could be further improved.
Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about
⁹http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
¹⁰http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.
In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. could not utilize the information about "hou", which is useful for the label estimation, since the head of this bunsetsu is "ihan". In contrast, in estimating the label of the chunk "gaikoku-jin-touroku-hou", the information of "hou" can be utilized.
7.2 Error Analysis
There were some errors in analyzing katakana words. In the following example, although the correct analysis is that "Batistuta" is labeled as PERSON, the system labeled it as OTHERs.
(4) Italy-de    katsuyaku-suru  Batistuta-wo    kuwaeta  Argentine
    Italy-LOC   active          Batistuta-ACC   call     Argentine
    'Argentine called Batistuta, who was active in Italy'
There is no entry for "Batistuta" in the JUMAN dictionary or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".
There were some errors in applying the label comparison model even though the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis is that "HongKong-seityou" is labeled as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8 (a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning to adjust the value of β in the sigmoid function and to refine the
                                F1     analysis unit   distinctive features
(Fukushima et al., 2008)        89.29  character       Web
(Kazama and Torisawa, 2008)     88.93  character       Wikipedia/Web
(Sasano and Kurohashi, 2008)    89.40  morpheme        structural information
(Nakano and Hirai, 2004)        89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)  87.21  character
(Isozaki and Kazawa, 2003)      86.77  morpheme
proposed method                 89.79  compound noun   Wikipedia/structural information

Table 4: Comparison with previous work (all work was evaluated on the CRL NE data using cross validation).
Figure 8: An example of the error in the label comparison model ((a) initial state: "HongKong" LOCATION 0.266, "seityou" OTHERe 0.184, "HongKong-seityou" ORGANIZATION 0.205; (b) the final output: "HongKong + seityou" (LOCATION + OTHERe, 0.266 + 0.184) incorrectly chosen).
features for the label comparison model.
8 Conclusion
This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. This method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.
We are planning to integrate this method with a syntactic and case analysis method (Kawahara and Kurohashi, 2007) and to perform syntactic, case, and Named Entity analysis simultaneously to improve the overall accuracy.
References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese).

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970–979. (in Japanese).

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304–311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407–415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173–179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. pages 1121–1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), pages 282–289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8–15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934–941. (in Japanese).

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607–612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171–178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
[The paper printed on pages 63–70 did not survive the text extraction; only scattered fragments remain, among them Japanese abbreviation pairs such as ファミリーレストラン / ファミレス, 朝はビタミン / 朝ビタ, and ミュージックステーション / ミュージック / ステ / ステーション.]
Author Index
Bhutada, Pravin 17

Cao, Jie 47
Caseli, Helena 1

Diab, Mona 17

Finatto, Maria José 1
Fujii, Hiroko 63
Fukui, Mika 63
Funayama, Hirotaka 55

Hoang, Hung Huu 31
Huang, Yun 47

Kan, Min-Yen 9, 31
Kim, Su Nam 9, 31
Kuhn, Jonas 23
Kurohashi, Sadao 55

Liu, Qun 47
Lü, Yajuan 47

Machado, André 1

Ren, Zhixiang 47

Shibata, Tomohide 55
Sinha, R. Mahesh K. 40
Sumita, Kazuo 63
Suzuki, Masaru 63

Villavicencio, Aline 1

Wakaki, Hiromi 63

Zarrieß, Sina 23
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 1–8, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains
Helena de Medeiros Caseli♦, Aline Villavicencio♣♠, André Machado♣, Maria José Finatto♥

♦Department of Computer Science, Federal University of São Carlos (Brazil)
♣Institute of Informatics, Federal University of Rio Grande do Sul (Brazil)
♠Department of Computer Sciences, Bath University (UK)
♥Institute of Language and Linguistics, Federal University of Rio Grande do Sul (Brazil)

helenacaseli@dc.ufscar.br, avillavicencio@inf.ufrgs.br, ammachado@inf.ufrgs.br, mfinatto@terra.com.br
Abstract
Multiword Expressions (MWEs) are one of the stumbling blocks for more precise Natural Language Processing (NLP) systems. In particular, the lack of coverage of MWEs in resources can impact negatively on the performance of tasks and applications, and can lead to loss of information or communication errors. This is especially problematic in technical domains, where a significant portion of the vocabulary is composed of MWEs. This paper investigates the use of a statistically-driven alignment-based approach to the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. We report results obtained by a combination of statistical measures and linguistic information, and compare these to those reported in the literature. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates.
1 Introduction
A multiword expression (MWE) can be defined as any word combination for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Examples of MWEs are phrasal verbs (break down, rely on), compounds (police car, coffee machine), and idioms (rock the boat, let the cat out of the bag). They are very numerous in languages: as Biber et al. (1999) note, they account for between 30% and 45% of spoken English and 21%
of academic prose, and for Jackendoff (1997) the number of MWEs in a speaker's lexicon is of the same order of magnitude as the number of single words. However, these estimates are likely to be underestimates if we consider that, for language from a specific domain, the specialized vocabulary is going to consist largely of MWEs (global warming, protein sequencing) and new MWEs are constantly appearing (weapons of mass destruction, axis of evil).
Multiword expressions play an important role in Natural Language Processing (NLP) applications, which should not only identify MWEs but also be able to deal with them when they are found (Fazly and Stevenson, 2007). Failing to identify MWEs may cause serious problems for many NLP tasks, especially those involving some kind of semantic processing. For parsing, for instance, Baldwin et al. (2004) found that, for a random sample of 20,000 strings from the British National Corpus (BNC), even with a broad-coverage grammar for English (Flickinger, 2000), missing MWEs accounted for 8% of total parsing errors. Therefore, there is an enormous need for robust (semi-)automated ways of acquiring lexical information for MWEs (Villavicencio et al., 2007) that can significantly extend the coverage of resources. For example, one can more than double the number of verb-particle construction (VPC) entries in a dictionary such as the Alvey Natural Language Tools (Carroll and Grover, 1989) just by extracting VPCs from a corpus like the BNC (Baldwin, 2005). Furthermore, as MWEs are language dependent and culturally motivated, identifying the adequate translation of MWE occurrences is an important challenge for machine translation methods.
In this paper we investigate experimentally the use of an alignment-based approach for the identification of MWEs in technical corpora. We look at the use of several sources of data, including parallel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information for monolingual resources.
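As a toy illustration of the alignment-based idea (not the authors' actual extraction procedure, which is described later in the paper), a candidate list can be read off word-alignment links whenever several contiguous source words align to a single target word:

```python
def mwe_candidates(src_tokens, tgt_tokens, links):
    """Extract MWE candidates from word-alignment links.

    links is a set of (src_index, tgt_index) pairs.  Whenever a single
    target word is aligned to two or more contiguous source words, the
    source sequence is proposed as an MWE candidate together with its
    translation.  A simplified sketch of the alignment-based approach.
    """
    candidates = []
    for t, tgt in enumerate(tgt_tokens):
        srcs = sorted(s for (s, tt) in links if tt == t)
        contiguous = srcs and srcs == list(range(srcs[0], srcs[-1] + 1))
        if len(srcs) >= 2 and contiguous:
            candidates.append((" ".join(src_tokens[s] for s in srcs), tgt))
    return candidates
```

For example, if the three English words "kick the bucket" are all aligned to the single Portuguese word "morrer", the sketch proposes ("kick the bucket", "morrer") as a candidate pair.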
The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while Section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and Section 6 finishes this paper with some conclusions and proposals for future work.
2 Related Work
The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g., come along), nominal compounds (e.g., frying pan), institutionalised phrases (e.g., bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g., to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g., eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods
for the automatic identification and treatment ofthese phenomena is a real challenge for NLP sys-tems
A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply, and the sources of information they use. Although some work on MWEs is type independent (e.g. Zhang et al., 2006; Villavicencio et al., 2007), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g. Pearce (2002) and Baldwin (2005) for English, and Piao et al. (2006) for Chinese), but some work has also benefitted from asymmetries between languages, using information from one language to help deal with MWEs in the other (e.g. Villada Moirón and Tiedemann, 2006; Caseli et al., 2009).
As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kinds of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).
Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ²) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ². In addition, Evert and Krenn (2005) found that for MWE identification the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of corpus size and nature on the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).
Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).
Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.
Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates regardless of whether they are translated as one or more target words.
3 The Corpus and Reference Lists
The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.¹ The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of pediatrics texts.

The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms, with 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
4 Statistically-Driven and Alignment-Based Methods
4.1 Statistically-Driven Method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language- and type-independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ², log-likelihood (Press et al., 1992) and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed
¹Available at the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim
to differentiate MWEs from non-MWEs, but the same was not true of χ² (Villavicencio et al., 2007). In this work we use two measures commonly employed for this task, PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).
From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of these measures was then calculated for the n-grams, and we ranked each n-gram according to each measure. The average of all the rankings is used as the combined score of the MWE candidates.
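This pipeline can be sketched as follows (our own simplified illustration, not the Ngram Statistics Package itself; only PMI is shown, and the function names are ours):

```python
import math
from collections import Counter

def pmi_rankings(tokens, min_freq=2):
    """Rank bigram candidates by pointwise mutual information (PMI).

    Bigrams below the frequency cut-off are discarded, as in the
    experiments above; the rest are scored and ranked by PMI.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    scores = {}
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:               # frequency cut-off of 2
            continue
        p_xy = freq / total
        p_x = unigrams[w1] / len(tokens)
        p_y = unigrams[w2] / len(tokens)
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores, key=scores.get, reverse=True)

def combined_rank(rankings):
    """Order candidates by their average rank over several measures."""
    positions = {}
    for ranking in rankings:
        for pos, cand in enumerate(ranking):
            positions.setdefault(cand, []).append(pos)
    return sorted(positions, key=lambda c: sum(positions[c]) / len(positions[c]))
```

A candidate that several measures place near the top of their rankings therefore receives a low (good) average rank.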
4.2 Alignment-Based Method
The second of the MWE extraction approaches investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts — a text written in one (source) language and its translation into another (target) language — is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, given a word alignment between a source word sequence S (S = s1 … sn, with n ≥ 2) and a target word sequence T (T = t1 … tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that (a) S and T share some semantic features, and (b) S may be an MWE.
In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno — which occurs 202 times in the corpus used in our experiments — is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
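A minimal sketch of this extraction criterion follows. The alignment representation (one set of target positions per source word) and the function name are our own simplification of GIZA++ output, not the authors' implementation:

```python
def alignment_mwe_candidates(source_tokens, alignment):
    """Extract n:m (n >= 2) MWE candidates from one aligned sentence.

    `alignment[i]` is the frozenset of target positions that source
    word i is linked to. Consecutive source words linked to exactly
    the same (non-empty) target unit are grouped into one candidate.
    """
    candidates = []
    i = 0
    while i < len(source_tokens):
        j = i + 1
        while (j < len(source_tokens)
               and alignment[j] == alignment[i]
               and alignment[i]):          # skip unaligned words
            j += 1
        if j - i >= 2:                     # at least two source words
            candidates.append(tuple(source_tokens[i:j]))
        i = j
    return candidates
```

For instance, if aleitamento and materno are both linked to the single target word breastfeeding, the pair is emitted as a 2 : 1 candidate.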
Thus, notice that the alignment-based MWE extraction method does not rely on conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for the sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.
It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are in fact a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++² (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained on the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach indeed refines the phrases of SMT systems.
In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as a gold standard.
To perform the extraction of MWE candidates following the alignment-based approach, the original corpus first had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. These preprocessing steps were carried out using, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium³ (Armentano-Oller et al., 2006).
²GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html
³Apertium is an open-source machine translation engine and toolbox, available at http://www.apertium.org
From the preprocessed corpus, the MWE candidates are extracted as those units in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009), or (b) whose frequency is below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun, or beginning or finishing with verbs; and (b) a minimum frequency threshold of 5.
The second one (F2) is the same used in Caseli et al. (2009), namely: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of); and (b) a minimum frequency threshold of 2.
And the third one (F3) is the same as Caseli et al. (2009), plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral; and (b) a minimum frequency threshold of 2.
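A filter of this kind could be sketched as follows. The tag names below are hypothetical stand-ins for Apertium's tagset, and the edge-tag set is only an F3-style simplification, not the paper's exact pattern list:

```python
# Hypothetical tag names standing in for Apertium's tagset.
BAD_EDGE_TAGS = {"det", "adv", "cnjcoo", "pr", "vb", "prn", "num"}

def apply_filter(candidates, min_freq=2):
    """Discard candidates that start or end with a forbidden POS tag
    (F3 style) or whose frequency is below the threshold.

    `candidates` maps a tuple of (word, tag) pairs to its corpus frequency.
    """
    kept = {}
    for cand, freq in candidates.items():
        first_tag, last_tag = cand[0][1], cand[-1][1]
        if first_tag in BAD_EDGE_TAGS or last_tag in BAD_EDGE_TAGS:
            continue                      # pattern filter
        if freq < min_freq:
            continue                      # frequency filter
        kept[cand] = freq
    return kept
```

Under this sketch, a candidate such as o risco (determiner-initial) is dropped by the pattern filter, while a singleton candidate is dropped by the frequency threshold.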
5 Experiments and Results
Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially among the top candidates, there is still considerable noise, as, for instance, jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).
In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 × precision × recall)/(precision + recall)) figures for the association measures, using all the candidates (on the
PMI                             alignment-based
Online Mendelian Inheritance    faixa etária
Beta Technology Incorporated    aleitamento materno
Lange Beta Technology           estados unidos
Óxido Nítrico Inalatório        hipertensão arterial
jogar video game                leite materno
e um de                         couro cabeludo
e a do                          bloqueio lactíferos
se que de                       emocional anatomia
e a da                          neonato a termo
e de não                        duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and by the alignment-based approach.
pt MWE candidates       PMI       MI
# proposed bigrams      64,839    64,839
# correct MWEs          1,403     1,403
precision               2.16%     2.16%
recall                  98.73%    98.73%
F                       4.23      4.23
# proposed trigrams     54,548    54,548
# correct MWEs          701       701
precision               1.29%     1.29%
recall                  96.03%    96.03%
F                       2.55      2.55
# proposed bigrams      1,421     1,421
# correct MWEs          155       261
precision               10.91%    18.37%
recall                  10.91%    18.37%
F                       10.91     18.37
# proposed trigrams     730       730
# correct MWEs          44        20
precision               6.03%     2.74%
recall                  6.03%     2.74%
F                       6.03      2.74

Table 2: Evaluation of MWE candidates — PMI and MI.
first half of the table) and using the top 1,421 bigram and 730 trigram candidates (on the second half). From these latter results we can see that the top candidates produced by these measures do not agree well with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.
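The figures above follow the standard definitions; as a minimal sketch (our own code, using the bigram row of Table 2 as a sanity check):

```python
def prf(n_proposed, n_reference, n_correct):
    """Precision, recall and F-measure as defined in the paper:
    P = correct/proposed, R = correct/reference, F = 2PR/(P+R)."""
    p = n_correct / n_proposed
    r = n_correct / n_reference
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For the bigram candidates (64,839 proposed, 1,421 reference entries, 1,403 correct), this reproduces the 2.16% precision, 98.73% recall and 4.23 F shown in Table 2.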
On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates removed by the three filters described in Section 4.2: F1, F2 and F3.
To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic comparison, we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

pt MWE candidates            F1        F2        F3
# filtered by POS patterns   24,996    21,544    32,644
# filtered by frequency      9,012     11,855    1,442
# final set                  269       878       191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach.

pt MWE candidates        F1       F2       F3
# proposed bigrams       250      754      169
# correct MWEs           48       95       65
precision                19.20%   12.60%   38.46%
recall                   3.38%    6.69%    4.57%
F                        5.75     8.74     8.18
# proposed trigrams      19       110      20
# correct MWEs           1        9        4
precision                5.26%    8.18%    20.00%
recall                   0.14%    1.23%    0.55%
F                        0.27     2.14     1.07
# proposed bi+trigrams   269      864      189
# correct MWEs           49       104      69
precision                18.22%   12.04%   36.51%
recall                   2.28%    4.83%    3.21%
F                        4.05     6.90     5.90

Table 4: Evaluation of MWE candidates.
The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only the bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs — such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age) — extracted by the alignment-based method are not being considered at the moment.
After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists — the one with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in Section 4.2) were manually evaluated.
From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see Table 4). The remaining 122 candidates (63.9%) were then analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of its being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name, or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.
The judgments of the two judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with K = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of K between 0.67 and 0.8 indicates good agreement.
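The kappa statistic used here can be sketched as follows (a standard Cohen's kappa implementation for two judges, not the authors' code):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges' labels: (P_o - P_e) / (1 - P_e),
    where P_o is observed agreement and P_e is chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = 0.0
    for cls in set(labels_a) | set(labels_b):
        p_e += (labels_a.count(cls) / n) * (labels_b.count(cls) / n)
    return (p_o - p_e) / (1 - p_e)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over a simple percentage when judging true/false candidate labels.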
In order to calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating it with respect to a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; on the other hand, to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.
6 Conclusions and Future Work
MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but, due to their role in communication, they need to be properly accounted for in NLP applications and tasks.
In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs from a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained
show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.
Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of true MWEs according to both, or to at least one, of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to say that, in this manual analysis, the human experts classified the MWEs as true independently of their being Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.
In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.
Acknowledgments
We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for the financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.
References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50–59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047–2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370–381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49–56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117–134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450–466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.
Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165–178. Oxford University Press, Oxford.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534–559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459–484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33–40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440–447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1–7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17–24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1–15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034–1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415–432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia, July. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 9–16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au
Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
Abstract
We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.
1 Introduction
Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of a document. Keyphrases can serve as a representative summary of the document and also as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications such as document summarizers (D'Avanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).
In the past, various attempts have been made to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and rich sets of heuristic features (Barker and Cornacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.
Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g. Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007), most of them do not effectively deal with the various forms keyphrases take, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.
Secondly, previous studies have shown the effectiveness of their own features, but few compared their features with other existing features. That leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, the supervised approach involves a large amount of manual labor, reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthy study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.
This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method that improves the coverage of candidates occurring in various forms. Our second contribution re-examines existing keyphrase extraction features
reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.
In the remaining sections, we give an overview of related work in Section 2, describe our proposals on candidate selection and feature engineering in Sections 4 and 5, and present our system architecture and data in Section 6. We then evaluate our proposals, discuss the outcomes, and conclude our work in Sections 7, 8 and 9, respectively.
2 Related Work
The majority of related work has been carried out using statistical approaches, rich sets of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Cornacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e. the relationship between the document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and, to a lesser extent, (2) keyphrase cohesion features (i.e. the relationship among keyphrases) (Turney, 2003), and (3) term cohesion features (i.e. the relationship among the components of a keyphrase) (Park et al., 2004).
The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TF·IDF (i.e. term frequency × inverse document frequency) and first occurrence in the document. TF·IDF measures document cohesion, and the first occurrence reflects the importance of the abstract or introduction, indicating that keyphrases have a locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Cornacchia (2000) introduced a method based on head-noun heuristics that took three features: length of candidate, frequency, and head-noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. The
Textract (Park et al., 2004) also ranks candidate keyphrases, judging each keyphrase's degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion measured with the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of the other documents in the same cluster.
3 Keyphrase Analysis
In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods; see Section 6 for details of the data.
Syntactically, keyphrases can be formed by either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g., mobile network, fast computing, partially observable Markov decision process), although the latter are less frequent. They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, and via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.
However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we also found that keyphrases occur as simple conjunctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).

Criteria      Rules
Frequency     (Rule 1) Frequency heuristic, i.e., frequency >= 2 for simplex words vs. frequency >= 1 for NPs
Length        (Rule 2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
              (e.g., synchronous concurrent program vs. model of multiagent interaction)
Alternation   (Rule 3) of-PP form alternation
              (e.g., number of sensor = sensor number; history of past encounter = past encounter history)
              (Rule 4) Possessive alternation
              (e.g., agent's goal = goal of agent; security's value = value of security)
Extraction    (Rule 5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)* (NN|NNS|NNP|NNPS)
              (e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
              (Rule 6) Simplex Word/NP IN Simplex Word/NP
              (e.g., quality of service; sensitivity of VOIP traffic (VOIP traffic extracted); simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules
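As an illustrative sketch (not the authors' code), the extraction patterns of Table 1 can be applied to POS-tagged text roughly as follows: Rule 5 collects maximal adjective/noun runs ending in a noun, and Rule 6 additionally emits the full of-PP form alongside its component NPs. Tag names follow the Penn Treebank; the function names are our own.

```python
NOUN = {"NN", "NNS", "NNP", "NNPS"}
ADJ = {"JJ", "JJR", "JJS"}

def noun_phrases(tagged):
    """Rule 5: maximal (ADJ|NOUN)* NOUN runs. Returns (start, end) spans."""
    spans, i, n = [], 0, len(tagged)
    while i < n:
        if tagged[i][1] in NOUN or tagged[i][1] in ADJ:
            j = i
            while j < n and (tagged[j][1] in NOUN or tagged[j][1] in ADJ):
                j += 1
            k = j
            while k > i and tagged[k - 1][1] not in NOUN:
                k -= 1  # trim trailing adjectives so the run ends in a noun
            if k > i:
                spans.append((i, k))
            i = j
        else:
            i += 1
    return spans

def extract_candidates(tagged):
    """tagged: list of (word, POS) pairs for one sentence."""
    words = [w for w, _ in tagged]
    spans = noun_phrases(tagged)
    cands = [" ".join(words[a:b]) for a, b in spans]
    # Rule 6: NP "of" NP -> also emit the combined of-PP form
    for (a1, b1), (a2, b2) in zip(spans, spans[1:]):
        if b1 + 1 == a2 and words[b1].lower() == "of":
            cands.append(" ".join(words[a1:b2]))
    return cands
```

On the paper's own example, this yields effective algorithm, grid computing, and effective algorithm of grid computing as candidates.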
A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word-order alternation, particularly those in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).
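The of-PP and possessive alternations (Rules 3 and 4 in Table 1) can be sketched as simple string rewrites. This is our own minimal illustration; it ignores the number and inflection variants discussed above.

```python
def of_pp_alternation(phrase):
    """Rule 3: 'X of Y' -> 'Y X', e.g. 'quality of service' -> 'service quality'."""
    head, sep, rest = phrase.partition(" of ")
    return f"{rest} {head}" if sep else phrase

def possessive_alternation(phrase):
    """Rule 4: "X's Y" -> 'Y of X', e.g. "agent's goal" -> 'goal of agent'."""
    owner, sep, owned = phrase.partition("'s ")
    return f"{owned} of {owner}" if sep else phrase
```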
4 Candidate Selection
We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is closely related to term extraction, since the top-N ranked terms become the keyphrases of a document. In previous studies, KEA employed the indexing words as candidates, whereas others, such as Park et al. (2004) and Nguyen and Kan (2007), generated handcrafted regular expression rules. However, none carefully undertook the analysis of keyphrases. In
this section, before presenting our method, we first describe our keyphrase analysis in more detail.
In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be simple NPs, i.e., either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of these forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.
Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence we apply a length heuristic: our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to select NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (>= 2) and NPs (>= 1). Note that 30% of NPs occurred only once in our data.
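The length and frequency heuristics above can be sketched as a single filter; this is our own illustration, with the threshold values stated in the text.

```python
def keep_candidate(phrase, freq):
    """Apply the length (Rule 2) and frequency (Rule 1) heuristics to one candidate."""
    words = phrase.split()
    is_np = len(words) > 1
    # NPs in of-PP form may be kept up to length 4; all others up to length 3
    max_len = 4 if " of " in phrase else 3
    if len(words) > max_len:
        return False
    # simplex words need frequency >= 2; NPs are kept even at frequency 1
    return freq >= (1 if is_np else 2)
```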
Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule 6 for NPs in of-PP form broadens the coverage of
possible candidates: given an NP in of-PP form, we collect not only the simplex word(s), but also the non-of-PP-form NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while the rules of previous works do not.
5 Feature Engineering
With wider candidate selection criteria, the onus of filtering out irrelevant candidates falls on careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we also indicate which features are suitable only for the supervised setting (S) or applicable to both (S, U).
5.1 Document Cohesion
Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TF·IDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TF·IDF are related to locality; that is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.
F1: TF·IDF (S, U). TF·IDF indicates document cohesion by looking at the frequency of terms in documents, and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of this feature is that it requires a large corpus to compute useful IDF values. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF over substrings as well as exact matches. For example, grid computing is often a substring of other phrases such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate type, i.e., we treat simplex words and NPs separately when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large, generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence such features are useful in unsupervised approaches as well. The following list shows the variations of TF·IDF employed as features in our system:
• (F1a) TF·IDF

• (F1b) TF including counts of substrings

• (F1c) TF of substrings as a separate feature

• (F1d) TF normalized by candidate type (i.e., simplex words vs. NPs)

• (F1e) TF normalized by candidate type, as a separate feature

• (F1f) IDF using Google n-grams
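A sketch of how substring-aware TF (F1b) and an n-gram-backed IDF (F1f) might be computed; the Google n-gram lookup is stubbed as a plain frequency dictionary, and all names here are our own illustration rather than the authors' implementation.

```python
import math

def tf_with_substrings(candidates):
    """TF where a candidate's count also includes its occurrences as a
    word-bounded substring of longer candidates (cf. F1b)."""
    exact = {}
    for c in candidates:
        exact[c] = exact.get(c, 0) + 1
    return {c: sum(n for other, n in exact.items()
                   if c == other or f" {c} " in f" {other} ")
            for c in exact}

def tfidf(tf, ngram_count, total_count):
    """Combine TF with an IDF derived from external n-gram counts (cf. F1f),
    avoiding corpus-dependent IDF computation."""
    return {c: f * math.log(total_count / (1 + ngram_count.get(c, 0)))
            for c, f in tf.items()}
```

For instance, grid computing occurring twice on its own and once inside grid computing algorithm receives a substring-aware TF of 3.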
F2: First Occurrence (S, U). KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.
F3: Section Information (S, U). Nguyen and Kan (2007) used the identity of the specific document section in which a candidate occurs. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title, and/or references.
F4: Additional Section Information (S, U). We first added the related work or previous work section as a piece of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates across all key sections (i.e., the section TF), which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction
contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in their sizes.
• (F4a) section 'related/previous work'

• (F4b) counting substrings occurring in key sections

• (F4c) section TF across all key sections

• (F4d) weighting key sections according to the proportion of keyphrases found
F5: Last Occurrence (S, U). Similar to the distance feature in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of a document, such as the conclusion and discussion.
5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.
F6: Co-occurrence of Another Candidate in Section (S, U). When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.
F7: Title Overlap (S). In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior on which words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer1 collection to create this feature.
• (F7a) co-occurrence (Boolean) in the title collection

• (F7b) co-occurrence (TF) in the title collection
F8: Keyphrase Cohesion (S, U). Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top-N ranked candidates and the remainder. In the
1 It contains 1.3M titles from articles, papers, and reports.
original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent two-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs, and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.
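Our approximation of F8 can be sketched as follows (a simplification with our own names and window size): build a matrix of candidate words against neighboring context words, then compare candidates with the cosine measure.

```python
import math
from collections import Counter

def context_vectors(sentences, targets, window=2):
    """Count neighboring words within +/-window positions of each target word
    (rows = candidate words, columns = neighboring words)."""
    vecs = {t: Counter() for t in targets}
    for sent in sentences:
        for i, w in enumerate(sent):
            if w in vecs:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        vecs[w][sent[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

In this toy setting, two candidates that always occur with the same neighbors receive a similarity of 1.0.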
5.3 Term Cohesion
Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word-association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).
F9: Term Cohesion (S, U). Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex-word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our earlier discussion.
• (F9a) term cohesion as in Park et al. (2004)

• (F9b) TF normalized by candidate type (i.e., simplex words vs. NPs)

• (F9c) applying different weights by candidate type

• (F9d) normalized TF and different weighting by candidate type
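A sketch of the term cohesion score: a generalized Dice coefficient over a candidate's constituent words, with the simplex-word discount mentioned above. The exact formula of Park et al. (2004) may differ; this is our own hedged reading, with our own names.

```python
def dice_cohesion(term, freq):
    """Generalized Dice: n * f(term) / sum of component-word frequencies.
    For simplex words the ratio degenerates to f(w)/f(w), so their cohesion
    is discounted by 10% (our reading of the heuristic in the text)."""
    words = term.split()
    denom = sum(freq.get(w, 1) for w in words)
    score = len(words) * freq.get(term, 0) / denom if denom else 0.0
    return 0.9 * score if len(words) == 1 else score
```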
5.4 Other Features
F10: Acronym (S). Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K, to assess our candidate selection method.
F11: POS Sequence (S). Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like the acronym feature, this is also subject to the data set.
13
F12: Suffix Sequence (S). Similar to the acronym feature, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data-dependent and is thus used in the supervised approach only.
F13: Length of Keyphrases (S, U). Barker and Cornacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.
6 System and Data
To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. We then partitioned the text into sections, such as the title and section bodies, via heuristic rules, and applied a sentence segmenter2, ParsCit3 (Councill et al., 2008) for reference collection, a part-of-speech tagger4, and a lemmatizer5 (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy6, and simple weighting.
For evaluation, we collected 250 papers from four different categories7 of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found many were missing or present only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline, taking about 15 minutes per paper on average. The final statistics of the keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the
2 http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
3 http://wing.comp.nus.edu.sg/parsCit/
4 http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm
5 http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
6 http://maxent.sourceforge.net/index.html
7 C.2.4 (Distributed Systems), H.3.3 (Information Search and Retrieval), I.2.11 (Distributed Artificial Intelligence - Multiagent Systems), and J.4 (Social and Behavioral Sciences - Economics)
number of author- and reader-assigned keyphrases found in the documents.
          Author        Reader          Combined
Total     1252 (53)     3110 (111)      3816 (146)
NPs       904           2537            3027
Average   3.85 (4.01)   12.44 (12.88)   15.26 (15.85)
Found     769           2509            2864

Table 2: Statistics of Keyphrases
7 Evaluation
The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TF·IDF, distance, section information, and additional section information (i.e., F1-F4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.
Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10, and 15 candidates.
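The exact-matching evaluation can be sketched as below (names are ours): a ranked candidate counts as correct only if it exactly matches a gold-standard keyphrase, and precision, recall, and F-score are computed over the top-n candidates.

```python
def evaluate_top_n(ranked, gold, n):
    """Exact-match precision/recall/F-score over the top-n ranked candidates."""
    match = sum(1 for c in ranked[:n] if c in gold)
    precision = match / n if n else 0.0
    recall = match / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return match, precision, recall, f
```

Averaging these counts and scores over all documents gives table entries of the kind reported in Tables 3 and 4.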
BestFeatures includes F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence of another candidate in section), F7b (title overlap), F9a (term cohesion as in Park et al. (2004)), and F13 (length of keyphrases). Best-TFIDF means using all best features except TF·IDF.
In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S).8
In Table 5, the performance of each feature is measured using the N&K system plus the target feature: + indicates an improvement, - indicates a performance decline, and ~ indicates no effect, or an effect unconfirmed due to small changes in performance. Again, Supervised denotes Maximum Entropy training, and Unsupervised is our unsupervised approach.
8 Discussion
We compared the performance of our candidate selection and feature engineering against simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer
8 Due to page limits, we present the best performance.
Method                 Features  C    ------- Five -------------    ------- Ten --------------    ------- Fifteen ----------
                                      Match  Prec   Recall Fscore   Match  Prec   Recall Fscore   Match  Prec   Recall Fscore
All Candidates         KEA       U    0.03   0.64   0.21   0.32     0.09   0.92   0.60   0.73     0.13   0.88   0.86   0.87
                       KEA       S    0.79   15.84  5.19   7.82     1.39   13.88  9.09   10.99    1.84   12.24  12.03  12.13
                       N&K       S    1.32   26.48  8.67   13.06    2.04   20.36  13.34  16.12    2.54   16.93  16.64  16.78
                       baseline  U    0.92   18.32  6.00   9.04     1.57   15.68  10.27  12.41    2.20   14.64  14.39  14.51
                       baseline  S    1.15   23.04  7.55   11.37    1.90   18.96  12.42  15.01    2.44   16.24  15.96  16.10
Length<=3 Candidates   KEA       U    0.03   0.64   0.21   0.32     0.09   0.92   0.60   0.73     0.13   0.88   0.86   0.87
                       KEA       S    0.81   16.16  5.29   7.97     1.40   14.00  9.17   11.08    1.84   12.24  12.03  12.13
                       N&K       S    1.40   27.92  9.15   13.78    2.10   21.04  13.78  16.65    2.62   17.49  17.19  17.34
                       baseline  U    0.92   18.40  6.03   9.08     1.58   15.76  10.32  12.47    2.20   14.64  14.39  14.51
                       baseline  S    1.18   23.68  7.76   11.69    1.90   19.00  12.45  15.04    2.40   16.00  15.72  15.86
Length<=3 Candidates   KEA       U    0.01   0.24   0.08   0.12     0.05   0.52   0.34   0.41     0.07   0.48   0.47   0.47
  + Alternation        KEA       S    0.83   16.64  5.45   8.21     1.42   14.24  9.33   11.27    1.87   12.45  12.24  12.34
                       N&K       S    1.53   30.64  10.04  15.12    2.31   23.08  15.12  18.27    2.88   19.20  18.87  19.03
                       baseline  U    0.98   19.68  6.45   9.72     1.72   17.24  11.29  13.64    2.37   15.79  15.51  15.65
                       baseline  S    1.33   26.56  8.70   13.11    2.09   20.88  13.68  16.53    2.69   17.92  17.61  17.76

Table 3: Performance on Proposed Candidate Selection
Features         C    ------ Five ------------    ------ Ten -------------    ------ Fifteen ---------
                      Match  Prec  Recall Fscore  Match  Prec  Recall Fscore  Match  Prec  Recall Fscore
Best             U    1.14   22.8  7.47   11.3    1.92   19.2  12.6   15.2    2.61   17.4  17.1   17.3
                 S    1.56   31.2  10.2   15.4    2.50   25.0  16.4   19.8    3.15   21.0  20.6   20.8
Best w/o TFIDF   U    1.14   22.8  7.4    11.3    1.92   19.2  12.6   15.2    2.61   17.4  17.1   17.3
                 S    1.56   31.1  10.2   15.4    2.46   24.6  16.1   19.4    3.12   20.8  20.4   20.6

Table 4: Performance on Feature Engineering
Effect  Method        Features
+       Supervised    F1a, F2, F3, F4a, F4d, F9a
        Unsupervised  F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       Supervised    F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
        Unsupervised  F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
~       Supervised    F1e, F10, F11, F12
        Unsupervised  F1b

Table 5: Performance on Each Feature
length candidates tend to act as noise and decreased the overall performance. We also confirmed that candidate alternation offers flexibility in matching keyphrases, leading to higher candidate coverage as well as better performance.
To re-examine the features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that the performance with and without TF·IDF did not differ greatly, which indicates that the impact of TF·IDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not affect performance. We surmise the cause is that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence, it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the reflections on each feature.
As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, the unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.
9 Conclusion
We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous works, while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.
10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.
References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10-17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information and lexicography. In Proceedings of ACL, 1989, pp. 76-83.

Isaac Councill, C. Lee Giles and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28-30.

Ernesto D'Avanzo and Bernardo Magnini. A Keyphrase-Based Approach to Summarization: the LAKE System at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433-447.

Lee Dice. Measures of the amount of ecologic association between species. Ecology, 1945, 26.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin and Craig Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of IJCAI, 1999, pp. 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81-104.

Khaled Hammouda, Diego Matute and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström and Lars Asker. Automatic Keyword Extraction using Domain Knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beáta Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537-544.

Mario Jarmasz and Caroline Barrière. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft and Arnold Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of SIGIR, 2001, pp. 349-357.

Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157-169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296-297.

Guido Minnen, John Carroll and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207-223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL, 2007, pp. 317-326.

Youngja Park, Roy Byrd and Branimir Boguraev. Automatic Glossary Extraction: Beyond Terminology Identification. In Proceedings of COLING, 2004, pp. 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Pöllä and Timo Honkela. A Language-Independent Approach to Keyphrase Extraction and Evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434-439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin and Craig Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL, 1999, pp. 254-256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood and Evangelos Milios. Term-based Clustering and Summarization of Web Page Collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17-22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Verb Noun Construction MWE Token Supervised Classification
Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu
Abstract
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem, and experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58, corresponding to an F-measure of 89.96 for idiomaticity identification and classification and 62.03 for literal identification and classification.
1 Introduction
In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more often than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence, they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.
For our purposes, an MWE is defined as a collocation of words that refers to a single concept, for example: kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE's meaning is transparent, i.e., predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of an
MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others: for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, whereas make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb-noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.
To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moirón and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moirón, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could literally have been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic, irrespective of contextual usage. In a recent study by Cook et al. (2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had clear literal meanings and that over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as misclassification could potentially have detrimental effects on the quality of the translation.
In this paper we address the problem of MWE classification for verb-noun construction (VNC) tokens in running text. We investigate the binary classification of an unseen VNC token expression as either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.
This paper is organized as follows. In Section 2 we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5 we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.
2 Multi-word Expressions
MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have become conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: this category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come and ad hoc, and non-fixed expressions such as break new ground and speak of the devil. The VNCs on which we focus in this paper fall into this category. Semi-idiomatic: this class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. It consists of Light Verb Constructions (LVCs) such as make a living and Verb Particle Constructions (VPCs) such as write-up and call-up. Non-idiomatic: this category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the underlying concept and occurs more often in text than, say, traffic regulator or vehicle light.
3 Related Work
Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by Katz and Giesbrecht (2006) and work by Hashimoto and Kawahara (2008).
In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of its constituent words, using LSA, to determine whether the expression is idiomatic or not. KG06 is similar in intuition to the work of Fazly and Stevenson (2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.
Hashimoto and Kawahara (2008) (HK08) is, to our knowledge, the first large-scale study that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data, identifying 101 idiom types, each with a corresponding 1,000 examples; hence they had a corpus of 102K sentences of annotated data for their experiments. They experiment only with the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprise basic syntactic features such as POS and lemma information and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features are mostly inflectional features, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.
4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying an MWE expression token in context. We specifically focus on VNC MWE expressions, using the data annotated by Cook et al. (2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for identifying MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging we use the YamCha sequence labeling system,1 which is based on Support Vector Machines technology using degree-2 polynomial kernels.
We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
We experiment with different context window sizes, ranging from -/+1 to -/+5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, the lemma or citation form of the word (LEMMA), and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets, explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
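To make the column format concrete, the following minimal sketch (our illustration, not the authors' code) emits YamCha-style training columns for the running example, with the word, POS tag, last-3-character NGRAM feature, and gold IOB label as tab-separated columns:

```python
# Sketch (not the authors' code): emit YamCha-style training columns for the
# running example above. Each line holds tab-separated feature columns:
# word, POS tag, last-3-character n-gram (NGRAM), and the gold IOB label.
tokens = [
    ("John", "NN", "O"),
    ("kicked", "VBD", "B-I"),
    ("the", "Det", "I-I"),
    ("bucket", "NN", "I-I"),
    ("last", "ADV", "O"),
    ("Friday", "NN", "O"),
]

def yamcha_columns(tagged):
    lines = []
    for word, pos, label in tagged:
        ngram = word[-3:]  # last 3 characters, a cheap proxy for morphology
        lines.append("\t".join([word, pos, ngram, label]))
    return "\n".join(lines)

print(yamcha_columns(tokens))
```

In YamCha's input format the label is the final column; additional feature sets such as LEMMA or NE would simply add further columns.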
With the NE feature we followed the same representation as for the other features, as a separate column, as described above; we refer to this as Named Entity Separate (NES). For named entity recognition (NER) we use the BBN IdentiFinder software, which identifies 19 NE tags.2 We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and one with a binary feature indicating whether a word is an NE or not (NES-Bin). Moreover, we added another experimental condition in which we changed the words' representation in the input to their NE class: Named Entity InText (NEI). For the NEI condition our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, DAY NN day O, where John is replaced by the NE class "PER" and Friday by "DAY".

1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder
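The NEI substitution itself is a simple token rewrite. The sketch below is our illustration, with hypothetical NER output standing in for what a system such as IdentiFinder would produce:

```python
# Sketch: the NEI condition replaces each named-entity word by its NE class
# in the word column. The NE tags here are hypothetical NER output, standing
# in for what a system such as IdentiFinder would produce.
def apply_nei(tokens, ne_tags):
    # ne_tags[i] is an NE class such as "PER" or "DAY", or None for non-NEs
    return [ne if ne is not None else tok for tok, ne in zip(tokens, ne_tags)]

sent = ["John", "kicked", "the", "bucket", "last", "Friday"]
nes = ["PER", None, None, None, None, "DAY"]
print(apply_nei(sent, nes))  # ['PER', 'kicked', 'the', 'bucket', 'last', 'DAY']
```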
5 Experiments and Results
5.1 Data
We use the manually annotated standard data set described in Cook et al. (2008). This data comprises 2,920 unique VNC-token expressions drawn from the entire British National Corpus (BNC),3 which contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2,432 sentences corresponding to 53 VNC MWE types. This data has 2,571 annotations,4 corresponding to 2,020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross-validation experiments; the results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. We use the tokenized version of the BNC.
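The 80/10/10 split over 5 folds can be sketched as follows. The paper does not specify the exact fold rotation, so the rotation over 10 equal shards below is one plausible realization of the described setup, not the authors' procedure:

```python
# Sketch of an 80/10/10 train/test/dev split over 5 folds. The rotation over
# 10 equal shards is an assumption; the paper only states the split sizes.
def folds(items, n_folds=5, n_shards=10):
    shards = [items[i::n_shards] for i in range(n_shards)]
    for f in range(n_folds):
        test_i, dev_i = (2 * f) % n_shards, (2 * f + 1) % n_shards
        test = shards[test_i]
        dev = shards[dev_i]
        train = [x for i, s in enumerate(shards)
                 if i not in (test_i, dev_i) for x in s]
        yield train, test, dev

train, test, dev = next(folds(list(range(20))))
print(len(train), len(test), len(dev))  # 16 2 2
```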
5.2 Evaluation Metrics
We use Fβ=1 (F-measure), the harmonic mean of (P)recision and (R)ecall, as well as accuracy, to report the results.5 We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
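As a worked check of the metric, the harmonic mean of precision and recall below reproduces one score reported later in Table 2 (the FREQ baseline's IDM F-measure):

```python
# F_{beta=1} as the harmonic mean of precision and recall (values in %).
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# FREQ baseline, IDM class (Table 2): P = 70.02, R = 89.16
print(round(f1(70.02, 89.16), 2))  # 78.44
```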
5.3 Results
We present results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The FREQ baseline amounts to tagging all identified VNC tokens in the data set as idiomatic. It is worth noting that this baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE tokens that are not in the original data set.6

3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result and not a measure of error: all words identified as O factor into the accuracy, which results in exaggerated values. We report it only because it is the metric used by previous work.
In Table 2 we present the results yielded per feature and per condition. We first experimented with different context sizes to determine the optimal window size for our learning framework; these results are presented in Table 1. Once that was determined, we proceeded to add features.
Noting that a window size of -/+3 yields the best results, we use that as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.
All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK) with no added features, with a context size of -/+3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04.
Adding lemma information, POS, or NGRAM features each independently contributes to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.
Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.
Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence its significant contribution to performance.
6 Discussion
The overall results strongly suggest that explicitly using linguistically interesting features has a positive impact on performance. NE features help the most, and combining them with other features yields the best results. In general, performance on the classification and identification of idiomatic expressions is much better, which may be due to the fact that the data has many more idiomatic token examples for training. We also note that precision scores are significantly higher than recall scores, especially for literal token instance classification. This may be an indication that identifying when an MWE is used literally is a difficult task.

6 We could easily have identified all VNC syntactic configurations corresponding to verb-object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set.
We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest source of error is the identification of other VNC constructions, not annotated in the training and test data, as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones (23 times) and the opposite (4 times).
Some of the errors where the VNC should have been classified as literal, but the system classified it as idiomatic, involve kick heel, find feet, and make top. Cases of idiomatic expressions erroneously classified as literal are of the MWE types hit the road, blow trumpet, blow whistle, and hit a wall.
The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand, Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.
Also, in the sentence As the ball hit the post, the referee blew the whistle, the system identified blew the whistle, which is indeed literal in this context, as a literal MWE VNC token, and it identified hit the post as another literal VNC.
7 Conclusion
In this study we explored a set of features that contribute to supervised binary classification of VNC token expressions. The use of NER significantly improves the performance of the system, and using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58. In the future we will look at adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.
Window  IDM-F   LIT-F   Overall F   Overall Acc
-/+1    77.93   48.57   71.78       96.22
-/+2    85.38   55.61   79.71       97.06
-/+3    86.99   55.68   81.25       96.93
-/+4    86.22   55.81   80.75       97.06
-/+5    83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying the context window size.
                 IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ             70.02   89.16   78.44    0.00    0.00    0.00   69.68
TOK              81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA          83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM          83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS            83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P            86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full         85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin          84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI              89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P   89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P    90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P        91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data, using different features and their combinations.
8 Acknowledgement
The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments of two anonymous reviewers, which helped make this publication more concise and better presented.
References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329–336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens dataset. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96–103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLING.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992–1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353–360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317–324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369–379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97–108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33–40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1–15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754–762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32, Prague, Czech Republic, June. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23–30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß
Department of Linguistics, University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics, University of Potsdam, Germany
kuhn@ling.uni-potsdam.de
Abstract
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.
1 Introduction
Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for crosslingual induction of morphological, syntactic, and semantic analyses (Yarowsky et al., 2001; Dyvik, 2004). In this paper, we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:
(1) Der Rat sollte unsere Position berücksichtigen
    the Council should our position consider

(2) The Council should take account of our position.
This sentence pair is taken from the German-English section of the Europarl corpus (Koehn, 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters. We will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences, and that our method is able to identify various MWE patterns.
In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g., (Pecina, 2008). This approach generally involves two main steps: 1) the extraction of a list of candidate MWEs, often constrained by a particular target pattern of the detection method, such as verb particle constructions (Baldwin and Villavicencio, 2002) or verb PP combinations (Villada Moirón and Tiedemann, 2006); 2) the ranking of this candidate list by an appropriate association measure.
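The two-step monolingual pipeline can be sketched as follows; the counts are invented toy values, and pointwise mutual information (PMI) stands in for whichever association measure is chosen:

```python
# Sketch of the monolingual two-step pipeline: (1) a candidate list of
# verb-object pairs, (2) ranking by an association measure, here pointwise
# mutual information (PMI). All counts are invented toy values.
import math

def pmi(pair_count, w1_count, w2_count, total):
    p_xy = pair_count / total
    p_x = w1_count / total
    p_y = w2_count / total
    return math.log2(p_xy / (p_x * p_y))

candidates = {("spill", "beans"): 40, ("spill", "water"): 5}
unigrams = {"spill": 100, "beans": 80, "water": 400}
total = 1_000_000

ranked = sorted(candidates,
                key=lambda p: pmi(candidates[p], unigrams[p[0]],
                                  unigrams[p[1]], total),
                reverse=True)
print(ranked[0])  # the idiosyncratic pair ('spill', 'beans') ranks first
```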
The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider as MWEs those tokens in a target language which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for each item in a target language (and whether these one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches, which need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realizations is already very useful.
In this work, we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on frequency distributions in a given corpus, being based on the intuition that monolingual idiosyncrasies such as the MWE realization of an entity are not likely to be mirrored in another language (see Section 4 for discussion).
Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.
2 Multiword Translations as MWEs
The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover the linguistic patterns exhibited by one-to-many translations.

Verb                 1-1    1-n    n-1    n-n    No.
anheben (v1)        53.5   21.2    9.2    1.6   32.5
bezwecken (v2)      16.7   51.3    0.6   31.3   15.0
riskieren (v3)      46.7   35.7    0.5    1.7   18.2
verschlimmern (v4)  30.2   21.5   28.6   44.5   27.5

Table 1: Proportions of types of translational correspondences (token level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney, 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al. (2008).
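Given such word alignments, collecting a one-to-many correspondence amounts to reading off all target tokens linked to a single source token. A minimal sketch (our illustration; sentence pair (1)-(2) from the introduction, with a hand-made alignment):

```python
# Sketch: collect the target tokens aligned to a single source token from
# GIZA++-style alignments, given as (source index, target index) pairs.
# The alignment below is hand-made for illustration.
def one_to_many(src_tokens, tgt_tokens, alignment, src_index):
    tgt_indices = sorted(j for i, j in alignment if i == src_index)
    return [tgt_tokens[j] for j in tgt_indices]

src = ["Der", "Rat", "sollte", "unsere", "Position", "berücksichtigen"]
tgt = ["The", "Council", "should", "take", "account", "of", "our", "position"]
align = [(0, 0), (1, 1), (2, 2), (5, 3), (5, 4), (5, 5), (3, 6), (4, 7)]

print(one_to_many(src, tgt, align, 5))  # ['take', 'account', 'of']
```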
For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions for the types of their translational correspondences. While the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.
A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.
Morphological variations. This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.
(3) Sie verschlimmern die Übel
    they aggravate the misfortunes

(4) Their action is aggravating the misfortunes.
Verb particle combinations. A typical MWE pattern, treated for instance in (Baldwin and Villavicencio, 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.
(5) Der Ausschuss bezweckt den Institutionen ein politisches Instrument an die Hand zu geben
    the committee intends the institutions a political instrument at the hand to give
(6) The committee set out to equip the institutions with a political instrument.
Verb preposition combinations. While this class is not discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items. Sag et al. (2002) propose an analysis within an MWE framework.
(7) Sie werden den Treibhauseffekt verschlimmern
    they will the green house effect aggravate
(8) They will add to the green house effect.
Light verb constructions (LVCs). This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective, or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified, adjectives or nouns, as in the example below. Nevertheless, the class can be considered an idiosyncratic combination, since LVCs exhibit specific lexical restrictions (Sag et al., 2002).
(9) Ich werde die Sache nur noch verschlimmern
    I will the thing only just aggravate
(10) I am just making things more difficult.
Idioms. This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to the standard's limited size and the source items we chose.
(11) Sie bezwecken die Umgestaltung in eine zivile Nation
     they intend the conversion into a civil nation
(12) They have in mind the conversion into a civil nation.
          v1       v2       v3       v4
Ntype     22 (26)  41 (47)  26 (35)  17 (24)
V Part    22.7      4.9      0.0      0.0
V Prep    36.4     41.5      3.9      5.9
LVC       18.2     29.3     88.5     88.2
Idiom      0.0      2.4      0.0      0.0
Para      36.4     24.3     11.5     23.5

Table 2: Proportions of MWE types per lemma.
Paraphrases. From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus, 2006). These shifts are hard to identify automatically and present a general challenge for the semantic processing of parallel corpora. An example is given below.
(13) Wir brauchen bessere Zusammenarbeit um die Rückzahlungen anzuheben
     we need better cooperation to the repayments(OBJ) increase
(14) We need greater cooperation in this respect to ensure that repayments increase.
Table 2 displays the proportions of the MWEcategories for the number of types of one-to-manycorrespondences in our gold standard We filteredthe types due to morphological variations only (theoverall number of types is indicated in brackets)Note that some types in our gold standard fall intoseveral categories eg they combine a verb prepo-sition with a verb particle construction For allof the verbs the number of types belonging tocore MWE categories largely outweighs the pro-portion of paraphrases As we already observedin our analysis of general translation categorieshere again the different verb lemmas show strik-ing differences with respect to their realization inEnglish translations For instanceanheben(enincrease) or bezwecken(en intend) are frequently
translated with verb-particle or verb-preposition combinations, while the other verbs are much more often translated by means of LVCs. Also, the more specific LVC patterns differ largely among the verbs. While verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs that do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).
In general, one-to-many translational correspondences seem to provide a very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.
3 Multiword Translation Detection
This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.
3.1 One-to-many Alignments
To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++ using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney, 2003). We evaluate the rate of translational correspondences that the system discovers on the type level against the one-to-many translations in our gold standard. By type we mean the set of lemmatized English tokens that makes up the translation of the German source lemma. Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the
verb    n > 0           n > 1           n > 3
        FPR    FNR      FPR    FNR      FPR    FNR
v1      0.97   0.93     1.0    1.0      1.0    1.0
v2      0.93   0.90     0.5    0.96     0.0    0.98
v3      0.88   0.83     0.8    0.97     0.67   0.97
v4      0.98   0.92     0.8    0.92     0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments
translation types is so low that already at a threshold of 3, almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that the detection of the exact set of tokens that correspond to the source token is rare.
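The type-level comparison described above can be sketched in a few lines. This is our own illustrative reading of the evaluation, not the authors' code: translation types are modeled as frozensets of lemmatized English tokens, and FPR/FNR are computed over set differences against the gold standard.

```python
def type_level_rates(system_types, gold_types):
    """Type-level FPR/FNR for one source verb (illustrative sketch).

    FPR: fraction of system-proposed translation types absent from the gold standard.
    FNR: fraction of gold translation types the system never proposes.
    """
    system_types, gold_types = set(system_types), set(gold_types)
    fp = len(system_types - gold_types)
    fn = len(gold_types - system_types)
    fpr = fp / len(system_types) if system_types else 0.0
    fnr = fn / len(gold_types) if gold_types else 0.0
    return fpr, fnr

# toy example: gold has 4 types; the system finds 2, one of them wrong
gold = [frozenset({"increase"}), frozenset({"ensure", "increase"}),
        frozenset({"raise"}), frozenset({"push", "up"})]
system = [frozenset({"increase"}), frozenset({"lift", "out"})]
print(type_level_rates(system, gold))  # (0.5, 0.75)
```

Using frozensets makes a "type" insensitive to token order, which matches the bag-of-words conversion used in the paper's evaluation.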
This low precision of one-to-many alignments is not very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g., particles, light verbs). These semantically "light" items have a distribution that does not necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.
(15) Die  Korruption  behindert  die  Entwicklung
     The  corruption  impedes    the  development

(16) Corruption constitutes an obstacle to development.
Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations. In these cases, direct lexical correspondences are much less likely to be found.
3.2 Aligning Syntactic Configurations
High-precision detection of one-to-many translations thus involves two major problems: 1) how to identify sentences or configurations where reliable lexical correspondences can be found; 2) how to align target items that have a low occurrence correlation.
We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verb's arguments directly correspond to each other and are all dominated by a verbal root node.
Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection, which extracts partial, largely parallel dependency configurations. By admitting target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.
Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering the configurations; c) alignment post-editing, i.e., assembling the target tokens corresponding to the source item. The following paragraphs will briefly characterize these steps.
[Figure: the German verb behindert with its arguments X1 and Y1, aligned to the English configuration create ... an obstacle to with arguments X2 and Y2]

Figure 1: Example of a typical syntactic MWE configuration
Data. We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al., 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence in our resource can be represented by the triple (D_G, D_E, A_GE). D_G is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language. D_E is the set of dependency triples of the target sentence. A_GE corresponds to the set of pairs (s1, t1) such that s1, t1 are aligned.
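The triple representation (D_G, D_E, A_GE) can be encoded directly as sets of tuples. The sketch below is our own illustration of that data model, populated with hypothetical content based on sentence pair (15)/(16); the field names are ours, not the authors'.

```python
from dataclasses import dataclass

@dataclass
class ParallelParse:
    """A word-aligned dependency parse pair (D_G, D_E, A_GE)."""
    dg: set   # German dependency triples (head, rel, dependent)
    de: set   # English dependency triples (head, rel, dependent)
    age: set  # alignment pairs (source_word, target_word)

# hypothetical encoding of sentence pair (15)/(16)
sent = ParallelParse(
    dg={("behindert", "SB", "Korruption"), ("behindert", "OA", "Entwicklung")},
    de={("constitutes", "SBJ", "Corruption"), ("constitutes", "OBJ", "obstacle"),
        ("obstacle", "NMOD", "an"), ("obstacle", "NMOD", "to"),
        ("to", "PMOD", "development")},
    age={("Korruption", "Corruption"), ("Entwicklung", "development"),
         ("behindert", "obstacle")},
)
```

Representing all three components as flat sets of tuples keeps the candidate generation and filtering steps below simple set and graph operations.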
Candidate Generation. This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g., its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.
Filtering. Given our source candidates, a valid parallel configuration (D_G, D_E, A_GE) is then defined by the following conditions:
1. The source configuration D_G is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in D_G, there is a tuple (sn, tn) in A_GE, i.e., every dependent has an alignment.
3. There is a target item t1 in D_E such that for each tn there is a p ⊂ D_E such that p is a path (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn) that connects t1 and tn. Thus, the target dependents have a common root.
To filter noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering does not exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration must not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
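The filtering conditions can be sketched as a small graph check. This is a minimal sketch under our own reading of the conditions: condition (2) requires every source dependent to be aligned, and condition (3) requires some target item to reach all aligned dependents via dependency paths of bounded length (the path-length filter). Function and variable names are ours.

```python
from collections import defaultdict

def is_valid_configuration(source_deps, target_deps, alignment, max_path_len=3):
    """Check conditions (2) and (3) of the filtering step (illustrative sketch)."""
    align = defaultdict(set)
    for s, t in alignment:
        align[s].add(t)
    dependents = [sn for (_s1, _rel, sn) in source_deps]
    if not all(align[sn] for sn in dependents):      # condition (2): all aligned
        return False
    children = defaultdict(set)
    for (head, _rel, dep) in target_deps:
        children[head].add(dep)
    def below(node, depth=0):                        # nodes reachable from `node`
        if depth > max_path_len:
            return set()
        out = set(children[node])
        for child in children[node]:
            out |= below(child, depth + 1)
        return out
    targets = set().union(*(align[sn] for sn in dependents))
    # condition (3): some target item dominates every aligned dependent
    return any(targets <= below(t1) for t1 in list(children))

# toy check on the (15)/(16) sentence pair
src = {("behindert", "SB", "Korruption"), ("behindert", "OA", "Entwicklung")}
tgt = {("constitutes", "SBJ", "Corruption"), ("constitutes", "OBJ", "obstacle"),
       ("obstacle", "NMOD", "an"), ("obstacle", "NMOD", "to"),
       ("to", "PMOD", "development")}
al = {("Korruption", "Corruption"), ("Entwicklung", "development")}
print(is_valid_configuration(src, tgt, al))  # True: 'constitutes' roots both arguments
```

Extra filters from the paper (sentence-boundary crossing, head-switching via extra alignments, coordination relations) would slot in as further boolean checks on the same structures.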
Alignment Post-editing. In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in Section 3.2 was that all
the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.
Thus, for a set of items ti for which there is a dependency relation (tx, rel, ti) in D_E such that tx is an element of our target configuration, we need to decide whether (s1, ti) is in A_GE. This translation problem now largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown, 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown, 1994). Therefore, we propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti in T is an element of our target configuration:
1. Compute corr(s1, T), the correlation between s1 and T.
2. For each ti, tx such that there is a (ti, rel, tx) in D_E, compute corr(s1, T + tx).
3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.
As the Dice coefficient is often reported to give the best results, e.g., in (Smadja and McKeown, 1994), we also chose Dice as our correlation measure. In future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula:
corr(s1, T) = 2 · freq(s1 ∧ T) / (freq(s1) + freq(T))
We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti in T, and freq(s1) accordingly. The observed frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
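The Dice-based correlation and the greedy post-editing loop above can be sketched as follows. This is a minimal sketch under our own encoding: `pairs` is a hypothetical list of (source word set, target word set) sentence pairs, and the function names are ours.

```python
def dice(s1, T, pairs):
    """corr(s1, T): Dice coefficient over sentence pairs, per the paper's formula."""
    f_s = sum(s1 in src for src, _ in pairs)            # freq(s1)
    f_t = sum(T <= tgt for _, tgt in pairs)             # freq(T)
    f_both = sum(s1 in src and T <= tgt for src, tgt in pairs)
    return 2 * f_both / (f_s + f_t) if f_s + f_t else 0.0

def post_edit(s1, T, target_deps, pairs):
    """Greedily add dependents tx of items already in T whenever doing so
    does not lower the correlation with the source item s1."""
    T = set(T)
    for (head, _rel, tx) in target_deps:
        if head in T and tx not in T:
            if dice(s1, T | {tx}, pairs) >= dice(s1, T, pairs):
                T.add(tx)
    return T

# toy corpus: 'behindern' always co-occurs with 'constitutes an obstacle to'
pairs = [({"behindern"}, {"constitutes", "an", "obstacle", "to"})] * 3 \
      + [({"stoppen"}, {"an", "end"})] * 5
T = post_edit("behindern", {"constitutes", "obstacle", "to"},
              [("obstacle", "NMOD", "an")], pairs)
print(sorted(T))  # ['an', 'constitutes', 'obstacle', 'to']
```

In this toy run the article "an" is absorbed into the translation set, mirroring the paper's example where an, a dependent of obstacle, belongs to the LVC but has no alignment link of its own.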
The output translation can then be represented as a dependency configuration of the following kind: ((of, PMOD), (risk, NMOD, of), (risk, NMOD, the), (run, OBJ, risk), (run, SBJ)), which is the syntactic representation for the English MWE run the risk of.
3.3 Evaluation
Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to the manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.
We present two experiments in this evaluation section. We will first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of MWE patterns that the system detects, we further evaluate the translational correspondences for a larger set of verbs.
Translation evaluation. In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than purely cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of the source lemma do not figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that does not constrain the syntactic type of the parallel argument relations. While the relaxed filter does not decrease the FNR, it increases the FPR significantly. This means that the strict syntactic filters mainly fire on noisy configurations and do not decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down
the number of candidate configurations that our algorithm detects. As the dependency parses do not provide deep analyses for tense or control phenomena, very often a verb's arguments do not figure as its syntactic dependents, and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.
MWE evaluation. In a second experiment, we evaluated the patterns of correspondences found by our extraction method for use in an MWE context. To this end, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondences. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.
The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types. Almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.
4 Related Work
The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al., 2003). Syntax-based translation that specifies formal relations between bilingual parses was
        Strict Filter     Relaxed Filter
        FPR     FNR       FPR     FNR
v1      0.0     0.96      0.5     0.96
v2      0.25    0.88      0.47    0.79
v3      0.25    0.74      0.56    0.63
v4      0.0     0.875     0.56    0.84

Table 4: False positive and false negative rate of one-to-many translations
Trans. type / MWE type   Proportion
MWEs                     57.5
  V+Part                 8.2
  V+Prep                 51.8
  LVC                    32.4
  Idiom                  10.6
Paraphrases              24.4
Alternations             1.0
Noise                    17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs
established by (Wu, 1997). Our way of using syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.
Work on MWEs in a crosslingual context has almost exclusively focussed on MWE translation (Smadja and McKeown, 1994; Anastasiou, 2008). In (Villada Moirón and Tiedemann, 2006), the authors make use of alignment information in a parallel corpus to rank MWE candidates. These approaches do not rely on the lexical semantic knowledge about MWEs in the form of one-to-many translations.
By contrast, previous approaches to paraphrase extraction made more explicit use of crosslingual semantic information. In (Bannard and Callison-Burch, 2005), the authors use the target language as a pivot providing contextual features for identifying semantically similar expressions. Paraphrasing is, however, only partially comparable to the crosslingual MWE detection we propose in this paper. Recently, the very pronounced context dependence of monolingual pairs of semantically similar expressions has been recognized as a major challenge in modelling word meaning (Erk and Pado, 2009).
The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already been exploited, e.g., in morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of a simplex realization in a source language to an idiosyncratic multiword realization in the target language.
5 Conclusion
We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard as well as our experiments in automatic translation extraction have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for the monolingual interpretation of MWEs.
Our extraction method, which is based on syntactic filters, identifies MWE types with a much higher precision than purely cooccurrence-based word alignment and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.
References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 311–326.

Katrin Erk and Sebastian Padó. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proc. of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proc. of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proc. of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proc. of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
A Re-examination of Lexical Association Measures
Abstract
We review lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English, performed on the Penn Treebank, which shows significant improvement over the original AMs.
1 Introduction
Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures for extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds Ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood
ratio, and the T score. In Pearce (2002), AMs such as Z score, Pointwise MI, cost reduction, left and right context entropy, and odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact tests, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.
While these previous works have evaluated AMs, they offer few details on why the AMs perform as they do. A detailed analysis of this question is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.
We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the
Hung Huu Hoang
Dept. of Computer Science, National University of Singapore
hoanghuu@comp.nus.edu.sg

Su Nam Kim
Dept. of Computer Science and Software Engineering, University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Dept. of Computer Science, National University of Singapore
kanmy@comp.nus.edu.sg
mobility of noun phrase complements in transitive VPCs. They can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).
In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.
2 Classification of Association Measures
Although different AMs take different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure the association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ² test.
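To make the two Class I subgroups concrete, here is a toy computation of one measure from each: PMI (direct association) and the T score (significance of the observed-vs-expected difference). This is our own illustration from the standard contingency counts a, b, c, d; the counts are invented.

```python
import math

def pmi(a, b, c, d):
    """Direct association: log of observed vs. chance co-occurrence."""
    N = a + b + c + d
    return math.log(a * N / ((a + b) * (a + c)))

def t_score(a, b, c, d):
    """Significance of the difference between observed and expected frequency."""
    N = a + b + c + d
    expected = (a + b) * (a + c) / N
    return (a - expected) / math.sqrt(a)

# toy bigram that co-occurs far more often than chance predicts
print(round(pmi(50, 150, 100, 99700), 2))  # 5.12
```

Both functions read off the same four counts, which is why the paper can compare all Class I AMs on a single contingency-table representation.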
Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then
follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.
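The two Class II strategies can be sketched as follows: context diversity via entropy, and context similarity via cosine over sparse context vectors. This is our own toy illustration, not the authors' implementation; the example context words are invented.

```python
import math
from collections import Counter

def context_entropy(contexts):
    """Diversity of a phrase's observed contexts: low entropy means the
    phrase occurs in a restricted set of contexts (non-compositionality cue)."""
    counts = Counter(contexts)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def cosine(v1, v2):
    """Similarity between the context vectors of a phrase and a constituent."""
    shared = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in shared)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(context_entropy(["deal", "deal", "deal"]))   # 0.0: fully restricted
print(context_entropy(["deal", "plan", "offer"]))  # log2(3) ≈ 1.58: diverse
```

A phrase whose context vector has low cosine similarity with its constituents' vectors is, under the second strategy, a candidate non-compositional MWE.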
Table 1 lists all AMs used in our discussion. The lower-left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
3 Evaluation
We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.
3.1 Evaluation Data
In this study, we employ the Wall Street Journal (WSJ) section of one million words in the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.
For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
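The VPC candidate scan can be sketched as below. This is a minimal sketch under our own reading of the heuristic: `PARTICLES` is a small invented subset standing in for the paper's 38-particle list, and the leftward window of 7 tokens (verb plus up to 5 intervening NP words plus the slot adjacent to the particle) is our interpretation of the gap constraint.

```python
PARTICLES = {"up", "out", "off", "on", "down", "in", "over", "away"}  # illustrative subset

def vpc_candidates(tagged, max_np_gap=5):
    """Scan a POS-tagged sentence [(word, tag), ...]: for each particle,
    search left for the nearest verb, allowing an intervening NP of up to
    max_np_gap words."""
    out = []
    for i, (word, _tag) in enumerate(tagged):
        if word.lower() not in PARTICLES:
            continue
        for j in range(i - 1, max(-1, i - 2 - max_np_gap), -1):
            if tagged[j][1].startswith("VB"):       # Penn Treebank verb tags
                out.append((tagged[j][0].lower(), word.lower()))
                break
    return out

sent = [("He", "PRP"), ("took", "VBD"), ("his", "PRP$"), ("hat", "NN"), ("off", "RP")]
print(vpc_candidates(sent))  # [('took', 'off')]
```

The example deliberately uses the split configuration (Take your hat off) to show that the gap allowance is what recovers non-contiguous candidates.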
Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,
M1 Joint Probability: f(xy) / N
M2 Mutual Information (MI): (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3 Log-likelihood ratio: 2 Σ_ij f_ij log(f_ij / f̂_ij)
M4 Pointwise MI (PMI): log( P(xy) / (P(x*) P(*y)) )
M5 Local-PMI: f(xy) × PMI
M6 PMI^k: log( N f(xy)^k / (f(x*) f(*y)) )
M7 PMI²: log( N f(xy)² / (f(x*) f(*y)) )
M8 Mutual Dependency: log( P(xy)² / (P(x*) P(*y)) )
M9 Driver-Kroeber: a / √((a+b)(a+c))
M10 Normalized Expectation: 2a / (2a + b + c)
M11 Jaccard: a / (a + b + c)
M12 First Kulczynski: a / (b + c)
M13 Second Sokal-Sneath: a / (a + 2(b + c))
M14 Third Sokal-Sneath: (a + d) / (b + c)
M15 Sokal-Michiner: (a + d) / (a + b + c + d)
M16 Rogers-Tanimoto: (a + d) / (a + 2b + 2c + d)
M17 Hamann: ((a + d) − (b + c)) / (a + b + c + d)
M18 Odds Ratio: ad / bc
M19 Yule's ω: (√(ad) − √(bc)) / (√(ad) + √(bc))
M20 Yule's Q: (ad − bc) / (ad + bc)
M21 Brawn-Blanquet: a / max(a+b, a+c)
M22 Simpson: a / min(a+b, a+c)
M23 S Cost: log(1 + min(b, c)/(a + 1))^(−1/2)
M24 Adjusted S Cost (*): log(1 + max(b, c)/(a + 1))^(−1/2)
M25 Laplace: (a + 1) / (a + min(b, c) + 2)
M26 Adjusted Laplace (*): (a + 1) / (a + max(b, c) + 2)
M27 Fager: [M9] − (1/2) max(b, c)
M28 Adjusted Fager (*): [M9] − max(b, c) / √(aN)
M29 Normalized PMIs (*): PMI / NF(α) and PMI / NFMax
M30 Simplified normalized PMI for VPCs (*): log(ad) / (α b + (1 − α) c)
M31 Normalized MIs (*): MI / NF(α) and MI / NFMax

where NF(α) = α P(x*) + (1 − α) P(*y) for α ∈ [0, 1], and NFMax = max(P(x*), P(*y)).

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.

Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(x¬y), c = f21 = f(¬xy), d = f22 = f(¬x¬y); f(x*) = a + b and f(*y) = a + c are the marginal frequencies; ¬w stands for all words except w; * stands for all words; N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.
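A few of Table 1's measures, written out as functions of the contingency counts a, b, c, d, make the shared representation explicit. This is a toy re-implementation for illustration, not the authors' code, and the example counts are invented.

```python
import math

def jaccard(a, b, c, d):            # [M11]
    return a / (a + b + c)

def odds_ratio(a, b, c, d):         # [M18]
    return (a * d) / (b * c)

def pmi(a, b, c, d):                # [M4], using f(x*) = a+b, f(*y) = a+c
    N = a + b + c + d
    return math.log(a * N / ((a + b) * (a + c)))

table = {"Jaccard": jaccard, "Odds ratio": odds_ratio, "PMI": pmi}
for name, am in table.items():
    print(name, round(am(50, 150, 100, 99700), 2))
```

Because every Class I AM in the table is a function of the same four counts, candidates can be scored under all of them in a single pass over the corpus statistics.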
permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, also allowing a maximum of 4 intervening words. The above extraction process produced a total of 8652 VPC and 11465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1629 VPCs and 1254 LVCs.
Separately we use the following two available sources of annotations 3078 VPC candidates extracted and annotated in (Baldwin 2005) and 464 annotated LVC candidates used in (Tan et al 2006) Both sets of annotations give both positive and negative examples
Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                      VPC data       LVC data   Mixed
Total (freq ≥ 6)      413            100        513
Positive instances    117 (28.33%)   28 (28%)   145 (28.26%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.
3.2 Evaluation Metric
To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
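Average precision over a score-ranked candidate list can be computed directly, without choosing a threshold or an n. A small illustrative sketch (names are ours, not the paper's):

```python
def average_precision(ranked_labels):
    """Average precision: the mean of the precision values measured at
    each rank where a true positive occurs.

    ranked_labels: gold labels (True = genuine MWE) for the candidates,
    sorted by descending AM score.
    """
    hits, ap_sum = 0, 0.0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            ap_sum += hits / rank   # precision at this recall point
    return ap_sum / hits if hits else 0.0
```

A perfect ranking (all positives first) scores 1.0; interleaving negatives among the positives lowers the score.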
3.3 Evaluation Results
Figure 1(a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006), when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex; this can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that the vast majority of Class I AMs, including PMI, its variants, and association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis test AMs is very modest; their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between the observed and expected frequencies relied on in hypothesis tests.
4 Rank Equivalence
We note that some AMs, although not mathematically equivalent (i.e., assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets, and hence achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡rC M2, if and only if M1(cj) > M1(ck) ⟺ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⟺ M2(cj) = M2(ck) for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:
Corollary: If M1 ≡rC M2, then AP_C(M1) = AP_C(M2),

where AP_C(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the three AMs Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. Five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with the simplest representative from each rank equivalent group.
¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence we still consider such AM pairs to be rank equivalent.
1) [M2] Mutual Information; [M3] Log likelihood ratio
2) [M7] PMI²; [M8] Mutual Dependency; [M9] Driver-Kroeber (a.k.a. Ochiai)
3) [M10] Normalized expectation; [M11] Jaccard; [M12] First Kulczynski; [M13] Second Sokal-Sneath (a.k.a. Anderberg)
4) [M14] Third Sokal-Sneath; [M15] Sokal-Michiner; [M16] Rogers-Tanimoto; [M17] Hamann
5) [M18] Odds ratio; [M19] Yule's ω; [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
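Rank equivalence over a finite candidate set can be checked empirically by comparing the sign of every pairwise score difference. An illustrative sketch (function names and the tuple-based candidate encoding are ours); the usage below checks group 5, Odds ratio versus Yule's Q:

```python
import itertools

def rank_equivalent(m1, m2, candidates, tol=1e-12):
    """True iff measures m1 and m2 induce the same ordering (ties
    included) over every pair of candidates, per the definition above.

    m1, m2: functions mapping a candidate to a numeric score.
    """
    def sign(x):
        return 0 if abs(x) <= tol else (1 if x > 0 else -1)

    for cj, ck in itertools.combinations(candidates, 2):
        if sign(m1(cj) - m1(ck)) != sign(m2(cj) - m2(ck)):
            return False
    return True

# Candidates encoded as contingency tuples (a, b, c, d).
odds_ratio = lambda t: (t[0] * t[3]) / (t[1] * t[2])
yule_q = lambda t: (t[0] * t[3] - t[1] * t[2]) / (t[0] * t[3] + t[1] * t[2])
```

Yule's Q is the monotone transform (OR − 1)/(OR + 1) of the odds ratio, so the pairwise orderings always agree.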
5 Examination of Association Measures
We highlight two important findings from our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.
5.1 Mutual Information and Pointwise Mutual Information
In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in
[Figure 1a: AP profile of AMs examined over our VPC data set. Figure 1b: AP profile of AMs examined over our LVC data set.]

Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC and LVC datasets. Bold points indicate AMs discussed in this paper. Markers: hypothesis test AMs; ◊ Class I AMs excluding hypothesis test AMs; + context-based AMs.
identifying LVCs. In this section, we show how their performances can be improved significantly.
Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U; V) = Σ_{u,v} p(uv) · log [ p(uv) / (p(u∗) · p(∗v)) ]

In the context of bigrams, the above formula can be simplified to [M2] MI = (1/N) Σ_{i,j} f_ij · log(f_ij / f̂_ij). While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log [ P(xy) / (P(x∗) · P(∗y)) ] = log [ N · f(xy) / (f(x∗) · f(∗y)) ]

It has long
been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, either by raising it to some power k (Daille, 1994) or by multiplying PMI by f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying
f(xy) directly into PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                 VPCs           LVCs           Mixed
Best [M6] PMI^k    54.7 (k=.13)   57.3 (k=.85)   54.4 (k=.32)
[M4] PMI           51.0           56.6           51.5
[M5] Local-PMI     25.9           39.3           27.2
[M1] Joint Prob.   17.0           2.8            17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.
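The PMI^k family can be sketched from the contingency cells; with k = 1 it reduces to standard [M4] PMI, and k < 1 down-weights the co-occurrence frequency f(xy) = a. An illustrative implementation (our naming), assuming the probabilistic form PMI^k = log(P(xy)^k / (P(x∗) · P(∗y))), which for ranking differs from other common parameterizations only by a per-candidate constant:

```python
import math

def pmi_k(a, b, c, d, k=1.0):
    """[M6] PMI^k over contingency cells (a, b, c, d); k=1 gives PMI."""
    n = a + b + c + d
    p_xy = a / n                    # joint probability P(xy)
    p_x, p_y = (a + b) / n, (a + c) / n   # marginals P(x*), P(*y)
    return math.log(p_xy ** k / (p_x * p_y))
```

Since p_xy < 1, lowering k raises every score, but it raises low-frequency candidates by more, which is the intended de-biasing effect.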
Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X; Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and that PMI(x, y) ≤ min(−log P(x∗), −log P(∗y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α·P(x∗) + (1−α)·P(∗y), with α ∈ [0, 1], and NFmax = max(P(x∗), P(∗y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 8.0%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC tasks, respectively. If one rewrites MI as (1/N) Σ_{i,j} f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.
AM            VPCs           LVCs           Mixed
MI / NF(α)    50.8 (α=.48)   58.3 (α=.47)   51.6 (α=.5)
MI / NFmax    50.8           58.4           51.8
[M2] MI       27.3           43.5           28.9
PMI / NF(α)   59.2 (α=.8)    55.4 (α=.48)   58.8 (α=.77)
PMI / NFmax   56.5           51.7           55.6
[M4] PMI      51.0           56.6           51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
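The two normalization schemes can be sketched in a few lines; a hypothetical helper (our naming) that divides PMI by NF(α), or by NFmax when no α is supplied:

```python
import math

def normalized_pmi(a, b, c, d, alpha=None):
    """PMI divided by NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y),
    or by NFmax = max(P(x*), P(*y)) when alpha is None."""
    n = a + b + c + d
    p_x, p_y = (a + b) / n, (a + c) / n     # marginals P(x*), P(*y)
    pmi = math.log((a / n) / (p_x * p_y))
    nf = max(p_x, p_y) if alpha is None else alpha * p_x + (1 - alpha) * p_y
    return pmi / nf
```

Dividing by a factor tied to the larger (or α-weighted) marginal shrinks the score range differences caused by very frequent particles and light verbs, making scores of different candidates comparable.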
5.2 Penalization Terms
It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being MWEs. This motivates us to use marginal frequencies to synthesize penalization terms: formulae whose values are inversely proportional to the likelihood of being MWEs. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Braun-Blanquet (a.k.a. Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except
for one difference in the denominator: Braun-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S cost and [M25] Laplace, where we tried to replace the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.
AM                       VPCs   LVCs   Mixed
[M21] Braun-Blanquet     47.8   57.8   48.6
[M22] Simpson            24.9   38.2   26.0
[M24] Adjusted S cost    48.5   57.7   49.2
[M23] S cost             24.9   38.8   26.0
[M26] Adjusted Laplace   48.6   57.7   49.3
[M25] Laplace            24.1   38.8   25.4

Table 6: Replacing min() with max() in selected AMs.
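The min-versus-max contrast above is a one-token change in code. A minimal sketch of the two measures (our naming), operating on the contingency cells:

```python
def braun_blanquet(a, b, c):
    """[M21]: a / max(a+b, a+c); penalizes the MORE frequent constituent."""
    return a / max(a + b, a + c)

def simpson(a, b, c):
    """[M22]: a / min(a+b, a+c); penalizes only the LESS frequent one."""
    return a / min(a + b, a + c)
```

Because max(a+b, a+c) ≥ min(a+b, a+c), Braun-Blanquet never exceeds Simpson on the same candidate, and it is the version that discounts candidates built on a very frequent particle or light verb.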
In the [M27] Fager AM, the penalization term (1/2)·max(b, c) is subtracted from the first term, which is rank equivalent to [M7] PMI². In our application, this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −(1/2)·max(b, c). In order to make use of the first term, we need to replace the constant 1/2 by a scaled-down version of max(b, c). We have approximately derived √(aN) as a lower-bound estimate of max(b, c) using the independence assumption, producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                     VPCs   LVCs   Mixed
[M28] Adjusted Fager   56.4   54.3   55.4
[M27] Fager            55.2   43.9   52.5

Table 7: Performance of Fager and its adjusted version.
The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles, so it is necessary that we scale b and c properly, as in −α×b − (1−α)×c. Having scaled the constituents properly, we still see that this measure by itself is not a good one, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with

[MR14] = PMI / (α×b + (1−α)×c).

The denominator of [MR14] is obtained by removing the minus sign from the scaled [M14] so that it can be used as a penalization term. The choice of PMI in the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α·P(x∗) + (1−α)·P(∗y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically simplified [MR14] to the following AM:

[M30] = log(ad) / (α×b + (1−α)×c).

The setting of α in Table 8 below is taken from the best α setting obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], being computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.
AM                                 VPCs          LVCs           Mixed
PMI / NF(α)                        59.2 (α=.8)   55.4 (α=.48)   58.8 (α=.77)
[M30] log(ad) / (α·b + (1−α)·c)    60.0 (α=.8)   48.4 (α=.48)   58.8 (α=.77)
[M14] Third Sokal-Sneath           56.5          45.3           54.6

Table 8: AP performance of suggested VPCs' penalization terms and AMs.
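[M30] is a one-line measure; a minimal sketch (our naming) with the α default taken from the best VPC setting reported in Table 8:

```python
import math

def m30(a, b, c, d, alpha=0.8):
    """[M30]: log(a*d) / (alpha*b + (1-alpha)*c).

    The denominator is the weighted marginal-frequency penalization term;
    log(a*d) rewards co-occurrence (a) and joint absence (d).
    """
    return math.log(a * d) / (alpha * b + (1 - alpha) * c)
```

Note that the measure is undefined when both b and c are zero, or when a or d is zero; such candidates would need special-casing in practice.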
With the same intention and method, we have found that while the addition of marginal frequencies is a good penalization term for VPCs, the product of marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α · c^(1−α). The best α value is also taken from the normalized PMI experiments (Table 5), which is nearly .5. Under this setting, this penalization term is exactly the denominator of [M18] Odds ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.
AM                 VPCs   LVCs   Mixed
−b − c             56.5   45.3   54.6
1 / (bc)           50.2   53.2   50.2
[M18] Odds ratio   44.3   56.7   45.6

Table 9: AP performance of suggested LVCs' penalization terms and AMs.
6 Conclusions
We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II), and we find that the latter are not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.
We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8%, compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.
We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α·c^(1−α) is suitable for LVCs. Our introduced α tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger web-based datasets of English to verify the performance gains of our modified AMs.
Acknowledgement: This work was partially supported by a National Research Foundation grant, "Interactive Media Search" (grant R 252 000 325 279).
References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.
Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.
Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.
Evert, Stefan and Krenn, Brigitte (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.
Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.
Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.
Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.
Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.
Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey, and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.
Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco, and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.
Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
Tan, Y. Fan, Kan, M. Yen, and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a Multilingual Context, pages 49–56, Trento, Italy.
Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.
Appendix A: List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha

Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in
Abstract
A complex predicate is a noun, a verb, an adjective, or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitutes an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning that is distinct from the meaning of its light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.
1 Introduction
Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb, or an adjective is followed by a light verb (LV), and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.
CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks, such as machine translation, information retrieval, and summarization. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007), and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for the automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the Giza++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.
In this paper, we present a simple strategy for mining CPs in Hindi using the projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of the CP in Hindi is outlined; this is followed by the system design, experimentation, and results.
2 Complex Predicates in Hindi
A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective, or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV, or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, or completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.
Given below are some examples.

(1) CP = noun+LV; noun = ashirwad 'blessings', LV = denaa 'to give'
usane mujhe ashirwad diyaa
उसन मझ आश वाद दया
he me blessings gave
'he blessed me'

(2) No CP
usane mujhe ek pustak dii
उसन मझ एक प तक द
he me one book gave
'he gave me a book'

In (1), the light verb diyaa (gave), in its past tense form, together with the noun ashirwad (blessings), makes a complex predicate verb form, ashirwad diyaa (blessed), in the past tense form. The CP here is ashirwad denaa, and its corresponding English translation is 'to bless'. On the other hand, in example (2), the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2); whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush 'happy', LV = karanaa 'to do'
usane mujhe khush kiyaa
उसन मझ खश कया
he me happy did
'he pleased me'

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do), and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa 'to read', LV = lenaa 'to take'
usane pustak paRh liyaa
उसन प तक पढ़ लया
he book read took
'he has read the book'

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take), and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases, the light verb acts as an aspectual, a modal, or an intensifier.

(5) CP = verb+LV; verb = phaadanaa 'to tear', LV = denaa 'to give'
usane pustak phaad diyaa
उसन प तक फाड़ दया
he book tear gave
'he has torn the book'

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give), and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa 'to give', LV = maaranaa 'to hit, to kill'
usane pustak de maaraa
उसन प तक द मारा
he book give hit
'he threw the book'

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill), and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning; this may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas 'back', LV1 = karanaa 'to do', LV2 = denaa 'to give'; or CP = CP+LV, with CP = vaapas karanaa 'to return', LV = denaa 'to give'
usane pustak vaapas kar diyaa
उसन प तक वापस कर दया
he book back do gave
'he returned the book'

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs: vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretations in the two cases remain the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.

From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text, the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.
3 System Design
As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb's meaning in the aligned English sentence. The steps involved are as follows:
1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as simple verbs (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb, as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or a morphological derivative of it is found, then search for its equivalent English meanings, in any of their morphological forms (Figure 2), in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, exit, i.e., go to step (a).
      iv) If the scanned word is a 'stop word', then ignore it and continue scanning.
      v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e., go to step (a); else the identified CP is W+LV.
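The loop in step 5 can be sketched in code. A minimal illustrative sketch, assuming tokenized sentence pairs and precomputed form and meaning sets (all names and the data layout are ours, not the paper's):

```python
def find_cps(hindi_tokens, english_tokens, lv_forms, lv_meanings,
             stop_words, exit_words):
    """Hypothesize CPs: a light verb whose English meaning is absent
    from the aligned English sentence signals a complex predicate.

    lv_forms:    {base LV: set of its Hindi morphological forms}
    lv_meanings: {base LV: set of English forms of its simple-verb meanings}
    Returns a list of (W, base LV) candidate complex predicates.
    """
    english = set(english_tokens)
    cps = []
    for base_lv, forms in lv_forms.items():
        for k, token in enumerate(hindi_tokens):
            if token not in forms:
                continue                  # step (i): locate the LV at position K
            if lv_meanings[base_lv] & english:
                continue                  # step (ii): LV meaning present, not a CP
            j = k - 1
            while j >= 0 and hindi_tokens[j] in stop_words:
                j -= 1                    # steps (iii)-(iv): skip stop words leftwards
            if j >= 0 and hindi_tokens[j] not in exit_words:
                cps.append((hindi_tokens[j], base_lv))  # steps (v)-(vi)
    return cps
```

For example (1), transliterated usane mujhe ashirwad diyaa aligned with "he blessed me", no form of give appears on the English side, so (ashirwad, denaa) is proposed; in example (2), "gave" is present and no CP is hypothesized.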
Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as simple verbs, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings are shown within parentheses and are not used for matching.
Light verb (base form, root verb)   Meaning
baithanaa बठना     sit
bananaa बनना       make/become/build/construct/manufacture/prepare
banaanaa बनाना     make/build/construct/manufacture/prepare
denaa दना          give
lenaa लना          take
paanaa पाना        obtain/get
uthanaa उठना       rise/arise/get-up
uthaanaa उठाना     raise/lift/wake-up
laganaa लगना       feel/appear/look/seem
lagaanaa लगाना     fix/install/apply
cukanaa चकना       (finish)
cukaanaa चकाना     pay
karanaa करना       (do)
honaa होना         happen/become/be
aanaa आना          come
jaanaa जाना        go
khaanaa खाना       eat
rakhanaa रखना      keep/put
maaranaa मारना     kill/beat/hit
daalanaa डालना     put
haankanaa हाकना    drive

Table 1: Some of the common light verbs in Hindi.
For each of the Hindi light verbs, all morphological forms are generated; a few illustrations are given in Figures 1(a) and 1(b). Similarly, for each of the English meanings of a light verb, all of its morphological derivatives are generated; Figure 2 shows a few illustrations.

There are a number of words that can appear between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. They are words that denote negation or are emphasizers, intensifiers, interrogative pronouns, or particles. A list of the stop words used in the experimentation is given in Figure 3.
Figure 1(a): Morphological derivatives of the sample Hindi light verb jaanaa (जाना, 'to go').

Figure 1(b): Morphological derivatives of the sample Hindi light verb lenaa (लना, 'to take').

Figure 2: Morphological derivatives of sample English meanings.
We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions, and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list; however, this list can be augmented based on an analysis of errors in LV identification. It should be noted that we do not perform parts-of-speech (POS) tagging, and so the nature of the word preceding the LV is unknown to the system.
English word: sit. Morphological derivations: sit, sits, sat, sitting.
English word: give. Morphological derivations: give, gives, gave, given, giving.
...
LV jaanaa जाना (to go). Morphological derivatives:
jaa जा (stem); jaae जाए (imperative); jaao जाओ (imperative); jaaeM जाएँ (imperative); jaauuM जाऊँ (first person); jaane जाने (infinitive oblique); jaanaa जाना (infinitive masculine singular); jaanii जानी (infinitive feminine singular); jaataa जाता (indefinite masculine singular); jaatii जाती (indefinite feminine singular); jaate जाते (indefinite masculine plural/oblique); jaaniiM जानीं (infinitive feminine plural); jaatiiM जातीं (indefinite feminine plural); jaaoge जाओगे (future masculine singular); jaaogii जाओगी (future feminine singular); gaii गई (past feminine singular); jaauuMgaa जाऊँगा (future masculine first-person singular); jaayegaa जायेगा (future masculine third-person singular); jaauuMgii जाऊँगी (future feminine first-person singular); jaayegii जायेगी (future feminine third-person singular); gaye गये (past masculine plural/oblique); gaiiM गईं (past feminine plural); gayaa गया (past masculine singular); gayii गयी (past feminine singular); jaayeMge जायेंगे (future masculine plural); jaayeMgii जायेंगी (future feminine plural); jaakara जाकर (completion); ...
LV lenaa लेना (to take). Morphological derivatives:
le ले (stem); lii ली (past); leM लें (imperative); lo लो (imperative); letaa लेता (indefinite masculine singular); letii लेती (indefinite feminine singular); lete लेते (indefinite masculine plural/oblique); liiM लीं (past feminine plural); luuM लूँ (first person); legaa लेगा (future masculine third-person singular); legii लेगी (future feminine third-person singular); lene लेने (infinitive oblique); lenaa लेना (infinitive masculine singular); lenii लेनी (infinitive feminine singular); liyaa लिया (past masculine singular); leMge लेंगे (future masculine plural); loge लोगे (future masculine singular); letiiM लेतीं (indefinite feminine plural); luuMgaa लूँगा (future masculine first-person singular); luuMgii लूँगी (future feminine first-person singular); lekara लेकर (completion); ...
Figure 3 Stop words in Hindi used by the system
Figure 4 A few exit words in Hindi used by the system
The inner loop of the procedure identifies the multiple CPs that may be present in a sentence; the outer loop mines CPs over the entire corpus. The experimentation and results are discussed in the following section.
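A minimal sketch of this inner/outer loop, reconstructed from the description above; this is not the authors' implementation, and the transliterated word lists are illustrative stand-ins for the full lists of Figures 1, 3, and 4.

```python
# Hypothetical reconstruction of the CP mining loop: the outer loop scans the
# corpus, the inner loop scans each sentence for a light-verb (LV) form, skips
# stop words to its left, and abandons the match if an exit word is reached.
LIGHT_VERB_FORMS = {"jaa", "gayaa", "le", "lii"}      # subset of Figure 1 forms
STOP_WORDS = {"nahiiM", "bhii", "hii", "to"}          # cf. Figure 3 (illustrative)
EXIT_WORDS = {"ne", "ko", "kaa", "se", "aur"}         # cf. Figure 4 (illustrative)

def mine_cps(corpus):
    mined = []
    for sentence in corpus:                 # outer loop: entire corpus
        words = sentence.split()
        for i, w in enumerate(words):       # inner loop: multiple CPs per sentence
            if w not in LIGHT_VERB_FORMS:
                continue
            j = i - 1
            while j >= 0 and words[j] in STOP_WORDS:   # ignore intervening stop words
                j -= 1
            if j >= 0 and words[j] not in EXIT_WORDS:  # exit words block the CP
                mined.append((words[j], w))            # (nominal, light verb) pair
    return mined
```

For example, `mine_cps(["salaaha to le"])` pairs the nominal "salaaha" with the LV form "le" despite the intervening particle, while `mine_cps(["pustaka ne lii"])` finds nothing because the case marker "ne" is an exit word.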
4 Experimentation and Results
The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92%, the recall is between 89% and 100%, and the F-measure is 88% to 97%. This is a remarkable and somewhat surprising result for such a simple methodology without much linguistic or statistical analysis. It is much higher than what has been reported on the same corpus by Mukerjee et al. (2006) (83% precision and 46% recall), who use projection of POS and word alignment for CP identification; theirs is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report recall.
                                  File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                     112     193     102      43     133     107
Total no. of CPs (N)                 200     298     150      46     188     151
Correctly identified CPs (TP)        195     296     149      46     175     135
  of which V-V CPs                    56      63       9       6      15      20
Incorrectly identified CPs (FP)       17      44       7      11      16      20
Unidentified CPs (FN)                  5       2       1       0      13      16
Accuracy (%)                       97.50   99.33   99.33  100.00   93.08   89.40
Precision (TP/(TP+FP)) (%)         91.98   87.05   95.51   80.70   91.62   87.09
Recall (TP/(TP+FN)) (%)            97.50   98.33   99.33  100.00   93.08   89.40
F-measure (2PR/(P+R)) (%)           94.6    92.3    97.4    89.3    92.3    88.2

Table 2 Results of the experimentation
(Figure 4) ने (ergative case marker), को (accusative case marker), का (possessive case marker), के (possessive case marker), की (possessive case marker), से (from/by/with), में (in/into), पर (on/but), और (and), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that, particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा (my), मेरे (my), मेरी (my), तुम्हारा (your), तुम्हारे (your), तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना (own), अपनी (own), अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका (whose), जिसकी (whose), जिसके (whose), जिनको (to whom), जिनके (to whom)
(Figure 3) नहीं (no/not), न (no/not, particle), भी (also, particle), ही (only, particle), तो (then, particle), क्यों (why), क्या (what, particle), कहाँ (where), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अन्त में (in the end), आखिर में (in the end)
Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence: (i) काम करना (to work), (ii) अच्छा लगना (to feel good, enjoy), (iii) सलाह लेना (to seek advice), (iv) खुशी होना (to feel happy, good), (v) मदद करना (to help)
Here the system identified five different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।
The CPs identified in the sentence: (i) आदर्श बनना (to be a role model), (ii) कद्र करना (to respect)
Here also, the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the word preceding the LV is not checked. However, the mining success rate obtained suggests that such cases are few in practice. Using the 'stop words' to allow intervening words within CPs helps considerably in improving the performance; similarly, using the 'exit words' avoids many incorrect identifications.
5 Conclusions
The simple empirical method for mining CPs outlined in this work yields on average 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. It can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process; this should increase both alignment accuracy and speed.
The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family
References

Anthony McEnery, Paul Baker, Rob Gaizauskas, and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek: A Quarterly in Artificial Intelligence, 13(3):23-32.

Amitabh Mukerjee, Ankit Soni, and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11-18.

Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.

Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27-46.

Debasri Chakrabarti, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).

Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma, and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.

Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.

Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.

Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (eds.), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.

Miriam Butt, Tracy Holloway King, and John T. Maxwell III. 2003. Complex Predicates via Restriction. In Proceedings of the LFG03 Conference.

Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (eds.), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323-370.

Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.

Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.

Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.

Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553-564.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 47-54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2
1 Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
2 Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn, huangyun@comp.nus.edu.sg
Abstract
Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, a state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs can improve translation performance significantly.
1 Introduction
The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, translation performance decreases. To exploit the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into an existing phrase-based model.
A multiword expression (MWE) can be considered a word sequence with relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and different researchers emphasize different properties. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e., that the meaning of the whole expression cannot be derived from its component words. Stanford University launched a MWE project1 in which different qualities of MWEs were presented. For bilingual multiword expressions (BiMWEs), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language, and (2) the source phrase and the target phrase are exact translations of each other, i.e., there is no additional (boundary) word in the target phrase without a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation, and so on.
For machine translation, Piao et al. (2005) noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve SMT quality. In their work, a bilingual MWE in the training corpus was grouped as
1 http://mwe.stanford.edu
one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in a further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system so as to improve SMT performance, especially when translating domain texts.

In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system, and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving SMT performance rather than improving word alignment quality. In detail, the differences are as follows:
• Instead of using a bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.

• Instead of improving translation indirectly by improving word alignment quality, we directly target translation quality. Some researchers have argued that large gains in alignment performance under many metrics lead only to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).
Besides the above differences, our approaches have some advantages:

• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted from noisy automatic word alignment, filtering with them would let alignment errors propagate to the SMT system and hurt SMT performance.

• We conduct experiments on domain-specific corpora. For one thing, domain-specific
2 http://www.statmt.org/moses
corpora potentially include a large number of technical terminologies as well as more general fixed expressions and idioms, i.e., a domain-specific corpus has high MWE coverage. For another, on investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since such MWEs are flexible and concise. Take, for example, a Chinese traditional-medicine term meaning "soften hard mass and dispel pathogenic accumulation": every word of this term carries a special meaning and cannot be understood literally or out of this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.
• In our approach, no additional corpus is introduced. We extract useful MWEs from the training corpus itself and adopt suitable methods to apply them. This fully exploits the resources at hand without greatly increasing the time and space complexity of the SMT system.

The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we give only a brief introduction to monolingual and bilingual MWE extraction.
2 Bilingual Multiword Expression Extraction

In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for translations of the extracted MWEs in the parallel corpus.
2.1 Automatic Extraction of MWEs
In the past two decades, many different approaches to automatic MWE identification have been reported. In general, these approaches fall into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et al., 2005), (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007), and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preparatory step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches consider only frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures take only two words into account, so it is not easy to extract MWEs containing three or more words.
The Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary length. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".
Definition 1 (Unit) A unit is any sub-string of the given sentence. For example, "A B", "C", and "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List) A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score) The score function is defined only on two adjacent units, and returns the LLR between the last word of the first unit and the first word of the second unit.3 For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select) The selecting operator finds the two adjacent units with maximum score in a list.

Definition 5 (Reduce) The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce unit "B C" and unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".
Initially, every word in the sentence is considered one unit, and all these units form an initial list L; if the sentence is of length N, the list contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters a loop with two steps: (1) select the two adjacent units with maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. The algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.

3 We use a stoplist to eliminate units containing function words by setting their score to 0.
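The select/reduce loop just described can be sketched directly; this is our schematic rendering under the stated definitions, not the authors' code, with the pairwise LLR table passed in precomputed.

```python
# Sketch of the LLR-based Hierarchical Reducing Algorithm (HRA). `llr` is a
# precomputed table of LLR scores for word pairs; missing pairs score 0.
def hra(sentence, llr, threshold=20.0):
    units = sentence.split()              # initially, one unit per word
    mwes = []
    while len(units) > 1:
        # Select: score each adjacent pair by LLR(last word of left unit,
        # first word of right unit), then pick the maximum.
        scores = [llr.get((units[i].split()[-1], units[i + 1].split()[0]), 0.0)
                  for i in range(len(units) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:      # termination condition (1)
            break
        # Reduce: concatenate the pair in place and record it as a MWE.
        units[best:best + 2] = [units[best] + " " + units[best + 1]]
        mwes.append(units[best])
    return mwes                           # loop also ends when one unit remains (2)
```

On a toy table where ("new", "york") and ("york", "city") score highly, `hra("in new york city", ...)` yields `["new york", "new york city"]`, mirroring the nested extraction shown in Figure 1.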
[Figure 1 shows the hierarchical reduction of the example sentence; the LLR scores between adjacent words are 1471, 67552, 10596, 0, 0, and 8096.]

Figure 1 Example of the Hierarchical Reducing Algorithm
Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is the seven-word Chinese term for "healthy tea for preventing hyperlipidemia".4 Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs (glossed "hyper- -lipid-", "hyper- -lipid- -emia", "healthy tea", and "preventing hyper- -lipid- -emia") are extracted, in that order; the substrings above the dotted line in Figure 1 are not extracted.

4 The whole sentence means "healthy tea for preventing hyperlipidemia"; its words gloss, in order, as: preventing, hyper-, -lipid-, -emia, for, healthy, tea.

From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of HRA is to reflect the hierarchical structure of natural language. Furthermore, in the HRA, MWEs are scored by the minimum LLR of the adjacent words within them, which gives a lexical confidence for each extracted MWE. Finally, for a given sentence of length N, HRA is guaranteed to terminate within N - 1 iterations, which is very efficient.

However, HRA has a problem: it extracts substrings before extracting the whole string, even when a substring appears only in that particular whole string, in which case we consider the substring useless. To solve this problem, we use two contextual features, contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.
2.2 Automatic Extraction of MWE Translations

In subsection 2.1 we described the algorithm for obtaining MWEs; in this subsection we introduce the procedure for finding their translations in the parallel corpus.

To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:
1. Run GIZA++5 to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs whose source-language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).
After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which measure how well a candidate is a reasonable translation. The translation features are: (1) the logarithm of the source-target translation probability; (2) the logarithm of the target-source translation probability; (3) the logarithm of the source-target lexical weighting; (4) the logarithm of the target-source lexical weighting; and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features are: (1) the left entropy of the target phrase (Luo and Sun, 2003); (2) the right entropy of the target phrase; (3) the first word of the target phrase; (4) the last word of the target phrase; and (5) all words in the target phrase.
5 http://www.fjoch.com/GIZA++.html
We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
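The classifier is a standard binary perceptron over the ten features above; a minimal sketch of perceptron training (feature extraction stubbed out, all values illustrative, not the authors' implementation):

```python
# Minimal binary perceptron, as used to separate correct from incorrect
# candidate translations. Feature vectors would hold the five translation
# and five language features described above; here they are toy values.
def perceptron_train(examples, epochs=10):
    """examples: list of (feature_vector, label) pairs, label in {+1, -1}."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                       # misclassified: update weights
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

On linearly separable data the update rule converges to a separating weight vector, after which `predict` labels each candidate pair as a right (+1) or wrong (-1) translation.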
3 Application of Bilingual MWEs
Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.
3.1 Model Retraining with Bilingual MWEs
The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect the alignment to be modified and the probability estimation to become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Unlike their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance was shown by many researchers (Wu et al., 2008; Eck et al., 2004); our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
3.2 New Feature for Bilingual MWEs
Lopez and Resnik (2006) once pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains bilingual MWEs. In other words, if the source-language phrase contains a MWE (as a substring) and
the target-language phrase contains the translation of that MWE (as a substring), the feature value is 1; otherwise it is 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
3.3 Additional Phrase Table of Bilingual MWEs
Wu et al. (2008) proposed a method to construct a phrase table from a manually-made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As to probability assignment, we simply assign 1 to each of the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in an input sentence, the decoder searches all candidate translation phrases in both phrase tables during translation. We use "Baseline+NewBP" to denote this method.
4 Experiments
4.1 Data
We run experiments on two domain-specific patent corpora: one for the traditional medicine domain and the other for the chemical industry domain. Our translation tasks are Chinese-to-English.

For the traditional medicine domain, Table 1 shows the data statistics. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.
                        Chinese    English
Training  Sentences         120,355
          Words       4,688,873  4,737,843
Dev       Sentences           1,000
          Words          31,722     32,390
Test      Sentences           1,000
          Words          41,643     40,551

Table 1 Traditional medicine corpus
6 http://www.speech.sri.com/projects/srilm
For the chemical industry domain, Table 2 gives the details of the data. In this experiment, 59,466 bilingual MWEs are extracted.
                        Chinese    English
Training  Sentences         120,856
          Words       4,532,503  4,311,682
Dev       Sentences           1,099
          Words          42,122     40,521
Test      Sentences           1,099
          Words          41,069     39,210

Table 2 Chemical industry corpus
We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation there is only one reference per test sentence. We also perform statistical significance tests between translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.
4.2 MT Systems
We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features; (2) one language model feature; (3) distance-based and lexicalized distortion model features; (4) word penalty; (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus, and as a result a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.
Features are combined in the log-linear model. To obtain the best translation $\hat{e}$ of the source sentence $f$, the log-linear model uses the following equation:

$$\hat{e} = \arg\max_e p(e|f) = \arg\max_e \sum_{m=1}^{M} \lambda_m h_m(e, f) \quad (1)$$

in which $h_m$ and $\lambda_m$ denote the $m$-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm
4.3 Results
Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3 Translation results of using bilingual MWEs in the traditional medicine domain
Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement, and Baseline+BiMWE achieves slightly higher translation quality than the baseline system.
To our disappointment, however, none of these improvements is statistically significant. We manually examined the extracted bilingual MWEs labeled positive by the perceptron algorithm and found that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio $x$ of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming $x$ follows a Gaussian distribution, the ranking score of a phrase pair $(s, t)$ is defined as:

$$Score(s, t) = \log(LLR(s, t)) \times \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad (2)$$

Here the mean $\mu$ and variance $\sigma^2$ are estimated from the training set. After ranking by score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
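Equation (2) weights the LLR confidence of a pair by a Gaussian density over its length ratio; a direct sketch follows (the paper estimates mu and sigma from the training set, so the defaults here are illustrative placeholders).

```python
import math

# Ranking score of Eq. (2): log-LLR of the phrase pair, scaled by a Gaussian
# density over the source/target length ratio x. mu and sigma are assumed
# estimated from training data; the defaults below are placeholders.
def rank_score(llr, len_src, len_tgt, mu=1.0, sigma=0.3):
    x = len_src / len_tgt
    gauss = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return math.log(llr) * gauss
```

Pairs whose length ratio is close to the mean keep most of their LLR confidence, while pairs with extreme length ratios are pushed down the ranking even when their LLR is high.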
From this table, we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better
Methods           50,000   60,000   70,000
Baseline                  0.2658
Baseline+BiMWE    0.2671   0.2686   0.2715
Baseline+Feat     0.2728   0.2734   0.2712
Baseline+NewBP    0.2662   0.2706   0.2724

Table 4 Translation results of using bilingual MWEs in the traditional medicine domain
than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that it is not true that the more bilingual MWEs, the better the performance of phrase-based SMT. This conclusion is the same as that of (Lambert and Banchs, 2005).
To verify the assumption that bilingual MWEs do indeed improve SMT performance beyond one particular domain, we also performed some experiments on the chemical industry domain. Table 5 shows the results. From this table we can see that these three methods improve translation performance on the chemical industry domain as well as on the traditional medicine domain.
Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914
Table 5: Translation results of using bilingual MWEs in the chemical industry domain
44 Discussion
In order to understand in what respects our methods improve translation performance, we manually analyzed some test sentences and give some examples in this subsection.
(1) For the first example in Table 6, the source phrase (garbled in extraction) is aligned to other words and not correctly translated in the baseline system, while it is aligned to the correct target phrase "dredging meridians" in Baseline+BiMWE, since the corresponding bilingual MWE has been added into the training corpus and then aligned by GIZA++.
(2) For the second example in Table 6, the source phrase
has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as the translation, while the Baseline+Feat system chooses "med-
Example 1:
Src: [Chinese source, garbled in extraction]
Ref: the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water and tranquilizing and tonic effects and making nutritious health
+BiMWE: the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Example 2:
Src: [Chinese source, garbled in extraction]
Ref: the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat: may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples
icated tea", because the additional feature gives a high probability to the correct translation "medicated tea".
5 Conclusion and Future Work
This paper presents the LLR-based hierarchical reducing algorithm to automatically extract bilingual MWEs and investigates the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains bilingual MWEs performs best in most cases. The other two strategies can also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.
The strategies for applying bilingual MWEs in this paper are rough and simple. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in a phrase-based SMT system, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.
References
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word Expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. Ph.D. thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus-based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method
Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp
Abstract
This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for the label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as a labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then, the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
1 Introduction
Named Entity Recognition (NER) is a task of recognizing named entities such as person names, organization names, and locations. It is used for several NLP applications such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).
In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as a basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.
When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.
In sentence (1), the label of "Kazama" can be recognized as S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).
On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized as B-ORGANIZATION (the beginning of an ORGANIZATION), only information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized, and thus the information of the morpheme "ginkou" (bank), which is apart from "shinyou" by three morphemes, cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information of the head of the bunsetsu1. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.
However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized as B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.
This paper proposes Japanese bottom-up named

1Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.
(1) kikoku-shita (return home) Kazama-san-wa (Mr. Kazama TOP)
'Mr. Kazama, who returned home'

(2) hatsudou-shita (invoke) shinyou-kumiai-kyusai-ginkou-no (credit union relief bank GEN) setsuritsu-mo (establishment)
'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to (private document falsification and) gaikoku-jin-touroku-hou-ihan-no (foreigner registration law violation GEN) utagai-de (suspicion INS)
'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences
entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as a labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and in the case that the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, in the bunsetsu, the best label assignment is determined. For example, among the combination of "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), the combination of "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on sequential-labeling-based methods. Section 3 describes an overview of our proposed method. Section 4 presents two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives an experimental result.
2 Related Work
In Japanese Named Entity Recognition, the definition of Named Entity from the IREX Workshop (IREX
class           example
PERSON          Kimura Syonosuke
LOCATION        Taiheiyou (Pacific Ocean)
ORGANIZATION    Jimin-tou (Liberal Democratic Party)
ARTIFACT        PL-houan (PL bill)
DATE            21-seiki (21st century)
TIME            gozen-7-ji (7 a.m.)
MONEY           500-oku-en (50 billion yen)
PERCENT         20 percent

Table 1: NE classes and their examples
Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.
NER methods are divided into two approaches: the rule-based approach and the machine learning approach. According to previous work, the machine learning approach achieves better performance than the rule-based approach.
In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, and E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to a morpheme that is not part of an NE2. The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
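The SE-algorithm's labeling scheme can be sketched as follows. This is an illustrative sketch, not the paper's code; the `entities` span format is a hypothetical representation of the gold annotation.

```python
def se_tags(morphemes, entities):
    """Assign SE-algorithm labels to a morpheme sequence.
    entities: hypothetical dict mapping (start, end) morpheme spans
    (end exclusive) to NE classes."""
    tags = ["O"] * len(morphemes)  # default: not part of an NE
    for (start, end), ne_class in entities.items():
        if end - start == 1:
            tags[start] = f"S-{ne_class}"      # single-morpheme NE
        else:
            tags[start] = f"B-{ne_class}"      # beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{ne_class}"      # middle
            tags[end - 1] = f"E-{ne_class}"    # end
    return tags
```

With S, B, I, E per class plus a single O, a sequential labeler over these tags recovers the 33-label inventory described above.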
The model for the label estimation is learned based on machine learning. The following features are generally utilized: characters, type of
2Besides, there are the IOB1 and IOB2 algorithms using only I, O, B, and the IOE1 and IOE2 algorithms using only I, O, E (Kim and Veenstra, 1999).
[Figure 2 shows the CKY-style chart for the bunsetsu "Habu-Yoshiharu-Meijin": in the initial state (a), each cell holds a chunk with its estimated label and score, e.g., "Habu" (PERSON, 0.111), "Habu-Yoshiharu" (PERSON, 0.438), "Habu-Yoshiharu-Meijin" (ORGANIZATION, 0.083), "Meijin" (OTHERe, 0.245); in the final output (b), the best assignment "Habu-Yoshiharu" (PERSON) + "Meijin" (OTHERe) is chosen.]

Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin")
character, POS, etc. of the morpheme and the two surrounding morphemes. Methods utilizing SVMs or CRFs have been proposed.
Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word sub-class of the NE in the analyzing direction in the bunsetsu, the noun at the end of the bunsetsu adjacent to the analyzing direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized a cache feature, coreference results, a syntactic feature, and a caseframe feature as structural features (Sasano and Kurohashi, 2008).
Some work acquired knowledge from a large unannotated corpus and applied it to NER. Kazama et al. utilized a Named Entity dictionary constructed from Wikipedia and a noun clustering result obtained using a huge number of pairs of dependency relations (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") using some patterns over a large Web corpus (Fukushima et al., 2008).
In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data include approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 using these data.
3 Overview of Proposed Method
Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu. Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) in the bunsetsu are analyzed using a machine learning method, as shown in Figure 2 (a).

We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the assignment "Habu-Yoshiharu" (PERSON) and the assignment "Habu" (PERSON) + "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.
The learned models are the following:
- the model that estimates the label of a chunk (label estimation model)
- the model that compares two label assignments (label comparison model)
The two models are described in detail in thenext section
[Figure 3 shows the label assignment for all the chunks in "Habu-Yoshiharu-Meijin-ga": "Habu-Yoshiharu" is PERSON, "Meijin" is OTHERe, and the remaining chunks are invalid.]

Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"
4 Model Learning
4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located in a single bunsetsu in the CRL NE data. Exceptionally, the following expressions located across multiple bunsetsus tend to be an NE:
- expressions enclosed in parentheses (e.g., "Himeyuri-no tou" (The Tower of Himeyuri) (ARTIFACT))
- expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))
Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located in a bunsetsu3.
For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.
Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned correct labels from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (as shown in Table 1) and five additional classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.
The chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and a head, middle, or tail chunk that does not correspond to an NE is labeled as OTHERb, OTHERi, or OTHERe, respectively4.
3As examples in which an NE is not included in an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).

4Each OTHER label is assigned to the longest chunk that satisfies its condition in a chunk.
1. number of morphemes in the chunk

2. the position of the chunk in its bunsetsu

3. character type5

4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"

5. word class, word sub-class, and several features provided by the morphological analyzer JUMAN

6. several features6 provided by the parser KNP

7. string of the morphemes in the chunk

8. IPADIC7 feature
   - If the string of the chunk is registered in the following categories of IPADIC: "person", "location", "organization", and "general", this feature fires

9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is a political party)

10. cache feature
    - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used for the feature

11. particles that the bunsetsu includes

12. the morphemes, particles, and head morpheme in the parent bunsetsu

13. the NE/category ratio in a case slot of the predicate/noun case frame (Sasano and Kurohashi, 2008)
    - For example, in the case ga (NOM) of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"

14. parenthesis feature
    - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model
A chunk that is neither any of the eight NE classes nor any of the above four OTHER classes is labeled as invalid.

In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.
Next, the label estimation model is learned from the data in which the above label set is assigned
5The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.

6When a morpheme has ambiguity, all the corresponding features fire.

7http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja
to all the chunks. The features for the label estimation model are shown in Table 2. Among the features, for features (3) and (5)–(8), three categories according to the position of a morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".
The label estimation model is learned from pairs of label and features for each chunk. To classify the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed using the sigmoid function 1 / (1 + exp(−βx)), and the transformed value is normalized so that the sum of the values of the 13 labels in a chunk is one.
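The sigmoid transform and per-chunk normalization described above can be sketched as follows (a minimal sketch; the dict-based `svm_outputs` representation is an assumption, not the paper's data format):

```python
import math

def normalized_label_scores(svm_outputs, beta=1.0):
    """Transform raw SVM margins with the sigmoid 1 / (1 + exp(-beta*x))
    and normalize so the label scores of a chunk sum to one.
    svm_outputs: hypothetical dict mapping label -> raw SVM margin,
    one entry per one-vs-rest model (13 in the paper)."""
    sig = {lab: 1.0 / (1.0 + math.exp(-beta * x))
           for lab, x in svm_outputs.items()}
    total = sum(sig.values())
    return {lab: s / total for lab, s in sig.items()}
```

The normalization makes the scores of different chunks comparable, which matters when assignments are summed and compared in the CKY-style search.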
The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively higher score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. In a chunk where the label invalid has the highest score, the label with the second highest score is adopted.
4.2 Label Comparison Model
This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:
- "Habu-Yoshiharu" is labeled as PERSON
- "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY
First, as shown in Figure 4, the two compared sets of chunks are lined up, sandwiching "vs" (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks for each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for the model learning.
Then, features are assigned to each example. The feature (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used
[Figure 4 shows example assignments: positive examples "Habu-Yoshiharu (PSN) vs Habu + Yoshiharu (PSN + MNY)" and "Habu-Yoshiharu + Meijin (PSN + OTHERe) vs Habu + Yoshiharu + Meijin (PSN + MONEY + OTHERe)"; a negative example "Habu-Yoshiharu-Meijin (ORG) vs Habu-Yoshiharu + Meijin (PSN + OTHERe)".]

Figure 4: Assignment of positive/negative examples
for each label, which is estimated by the label estimation model, and the last, 13th, dimension is used for the score of an SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remainder part.
Figure 5 illustrates the feature for "Habu-Yoshiharu" vs "Habu + Yoshiharu". The label comparison model is learned from such data using an SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs "Habu + Yoshiharu-Meijin", we cannot determine which one is correct. Therefore, such examples cannot be used for the model learning.
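The zero-padded feature layout described above can be sketched as follows (an illustrative sketch; the per-chunk 13-dim vectors are assumed to be precomputed by the label estimation model):

```python
def comparison_features(first_set, second_set, max_chunks=5, dim=13):
    """Build the feature vector for the label comparison model: each
    chunk contributes a 13-dim vector, five chunk slots per set,
    with zero vectors filling the unused slots.
    first_set / second_set: hypothetical lists of per-chunk
    13-dim score vectors."""
    def pad(chunk_vectors):
        vecs = list(chunk_vectors) + \
               [[0.0] * dim] * (max_chunks - len(chunk_vectors))
        return [x for vec in vecs for x in vec]  # flatten slot vectors
    return pad(first_set) + pad(second_set)
```

The fixed layout (5 slots × 13 dims per side, 130 dimensions total) is what lets a single SVM compare assignments with different numbers of chunks.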
5 Analysis
First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.
For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B"; that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the
chunk    Habu-Yoshiharu      Habu    Yoshiharu
label    PERSON              PERSON  MONEY
vector   V11 0 0 0 0         V21 V22 0 0 0

Figure 5: An example of the feature for the label comparison model (the example is "Habu-Yoshiharu vs Habu + Yoshiharu", and V11, V21, V22, and 0 are vectors whose dimension is 13)
[Figure 6 shows the label comparison model applied to the CKY chart, with cells A ("Habu", PERSON, 0.111), B ("Habu-Yoshiharu", PERSON, 0.438), C ("Habu-Yoshiharu-Meijin", ORGANIZATION, 0.083), D ("Yoshiharu"), E ("Yoshiharu-Meijin"), and F ("Meijin", OTHERe, 0.245): (a) the label assignment for the cell "Habu-Yoshiharu" compares "B" with "A+D"; (b) the label assignment for the cell "Habu-Yoshiharu-Meijin", the final output, compares "A+D+F", "B+F", and "C".]

Figure 6: The label comparison model
label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F"; that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).
As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.
In the label comparison step, a label assignment in which OTHER* follows OTHER* (OTHER* - OTHER*) is not allowed, since each OTHER is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination of chunks, the comparison is not performed.
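The bottom-up, CKY-style search described in this section can be sketched as follows. This is a simplified sketch: `estimate` and `compare` are hypothetical stand-ins for the paper's two SVM models, binary splits stand in for the full pairwise comparison, and the OTHER*-adjacency constraint is omitted for brevity.

```python
def best_assignment(morphemes, estimate, compare):
    """CKY-style bottom-up search for the best label assignment.
    estimate(chunk) -> a labeled chunk for the string `chunk`;
    compare(a, b)   -> the better of two candidate assignments
    (each assignment is a list of labeled chunks)."""
    n = len(morphemes)
    chart = {}
    for i in range(n):  # cells for single morphemes (lower left)
        chart[(i, i + 1)] = [estimate(morphemes[i])]
    for length in range(2, n + 1):      # fill longer spans bottom-up
        for i in range(n - length + 1):
            j = i + length
            best = [estimate("".join(morphemes[i:j]))]  # span as one chunk
            for k in range(i + 1, j):   # compare against every split
                split = chart[(i, k)] + chart[(k, j)]
                best = compare(best, split)
            chart[(i, j)] = best        # cell updated with the winner
    return chart[(0, n)]                # upper right cell: final output
```

In the paper, `compare` is the learned label comparison model rather than a score sum, but the chart traversal order (lower left to upper right) is the same.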
6 Experiment
To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions for which manual annotation is difficult are labeled as OPTIONAL and were not used for both
[Figure 7 illustrates the 5-fold cross-validation: the corpus (CRL NE data) is divided into parts (a)–(e); the label estimation model 1 is learned from four parts; the label estimation models 2b–2e are each learned from three parts and applied to the held-out part to obtain features; and the label comparison model is learned from the obtained features.]

Figure 7: 5-fold cross-validation
the model learning8 and the evaluation. We performed 5-fold cross-validation following
previous work. Different from previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)–(e). Then, the label estimation model 2b is learned from parts (c)–(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. It is the same with parts
8 Exceptionally, "OPTIONAL" is used when the label estimation model for OTHER* and invalid is learned.
              Recall              Precision
ORGANIZATION  81.83 (3008/3676)   88.37 (3008/3404)
PERSON        90.05 (3458/3840)   93.87 (3458/3684)
LOCATION      91.38 (4992/5463)   92.44 (4992/5400)
ARTIFACT      46.72 ( 349/ 747)   74.89 ( 349/ 466)
DATE          93.27 (3327/3567)   93.12 (3327/3573)
TIME          88.25 ( 443/ 502)   90.59 ( 443/ 489)
MONEY         93.85 ( 366/ 390)   97.60 ( 366/ 375)
PERCENT       95.33 ( 469/ 492)   95.91 ( 469/ 489)
ALL-SLOT      87.87               91.79
F-measure     89.79

Table 3: Experimental results
(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both label estimation model 1 and the label comparison model.
In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.
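Here the sigmoid maps each SVM output to a (0, 1) score before the comparison step; with β = 1 it is the standard logistic function. A minimal sketch, assuming the usual logistic form (the function name is ours):

```python
import math

def sigmoid_score(margin, beta=1.0):
    """Map an SVM output (margin) to a (0, 1) score via 1 / (1 + exp(-beta * x))."""
    return 1.0 / (1.0 + math.exp(-beta * margin))
```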
Table 3 shows the experimental results. The F-measure over all NE classes is 89.79.
7 Discussion
7.1 Comparison with Previous Work
Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") using patterns over a large Web corpus, and Sasano et al. utilized the results of coreference resolution as features for model learning. Therefore, by incorporating such knowledge and analysis results into our method, its performance could be improved further.
Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), Sasano et al. labeled "Oushu" (Europe) as LOCATION, while our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. labeled "Oushu" incorrectly even though they utilized the information about
9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION relatively drops. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.
In the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), Sasano et al. labeled "touroku-hou" as ARTIFACT, while our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, when estimating the label of the chunk "gaikoku-jin-touroku-hou", our method can utilize the information of "hou".
7.2 Error Analysis
There were some errors in analyzing katakana words. In the following example, although the correct analysis is that "Batistuta" is labeled as PERSON, the system labeled it as OTHERs.
(4) Italy-de katsuyaku-suru Batistuta-wo kuwaeta Argentine
    Italy-LOC  active        Batistuta-ACC called  Argentine
    'Argentine called Batistuta, who was active in Italy'
There is no entry for "Batistuta" in the JUMAN dictionary or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".
There were some errors in applying the label comparison model even though the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis is that "HongKong-seityou" is labeled as ORGANIZATION. As shown in Figure 8(b), the system incorrectly labeled "HongKong" as LOCATION. As shown in Figure 8(a), although in the initial state "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we are planning to adjust the value of β in the sigmoid function and to refine the
                                 F1     analysis unit   distinctive features
(Fukushima et al., 2008)         89.29  character       Web
(Kazama and Torisawa, 2008)      88.93  character       Wikipedia, Web
(Sasano and Kurohashi, 2008)     89.40  morpheme        structural information
(Nakano and Hirai, 2004)         89.03  character       bunsetsu feature
(Masayuki and Matsumoto, 2003)   87.21  character
(Isozaki and Kazawa, 2003)       86.77  morpheme
proposed method                  89.79  compound noun   Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on CRL NE data using cross-validation)
Figure 8: An example of the error in the label comparison model ((a) initial state; (b) final output)
features for the label comparison model.
8 Conclusion
This paper proposed bottom-up Named Entity Recognition using a two-stage machine learning method. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79.
We are planning to integrate this method with the syntactic and case analysis method (Kawahara and Kurohashi, 2007), performing syntactic, case and Named Entity analysis simultaneously to improve the overall accuracy.
References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese)

IREX Committee, editor. 1999. Proceedings of the IREX Workshop.

Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970-979. (in Japanese)

Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304-311.

Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407-415.

Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173-179.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of COLING/ACL 2006, pages 1121-1128.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282-289.

Asahara Masayuki and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8-15.

Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934-941. (in Japanese)

Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607-612.

Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171-178.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63-70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Author Index
Bhutada, Pravin 17

Cao, Jie 47
Caseli, Helena 1

Diab, Mona 17

Finatto, Maria Jose 1
Fujii, Hiroko 63
Fukui, Mika 63
Funayama, Hirotaka 55

Hoang, Hung Huu 31
Huang, Yun 47

Kan, Min-Yen 9, 31
Kim, Su Nam 9, 31
Kuhn, Jonas 23
Kurohashi, Sadao 55

Liu, Qun 47
Lu, Yajuan 47

Machado, Andre 1

Ren, Zhixiang 47

Shibata, Tomohide 55
Sinha, R. Mahesh K. 40
Sumita, Kazuo 63
Suzuki, Masaru 63

Villavicencio, Aline 1

Wakaki, Hiromi 63

Zarrieß, Sina 23
allel corpora, using English and Portuguese data from a corpus of Pediatrics, and examining how a second language can provide relevant cues for this task. In this way, cost-effective tools for the automatic alignment of texts can generate a list of MWE candidates with their appropriate translations. Such an approach to the (semi-)automatic identification of MWEs can considerably speed up lexicographic work, providing a more targeted list of MWE candidates and their translations for the construction of bilingual resources, and/or with some semantic information, for monolingual resources.
The remainder of this paper is structured as follows. Section 2 briefly discusses MWEs and some previous work on methods for automatically extracting them. Section 3 presents the resources used, while section 4 describes the methods proposed to extract MWEs as a statistically-driven by-product of an automatic word alignment process. Section 5 presents the evaluation methodology and analyses the results, and section 6 finishes this paper with some conclusions and proposals for future work.
2 Related Work
The term Multiword Expression has been used to describe a large number of distinct but related phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter), and many others (Sag et al., 2002). They are very frequent in everyday language, and this is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions.
However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g. to kick the bucket as to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat). Therefore, providing appropriate methods for the automatic identification and treatment of these phenomena is a real challenge for NLP systems.
A variety of approaches has been proposed for automatically identifying MWEs, differing basically in terms of the type of MWE and language to which they apply, and the sources of information they use. Although some work on MWEs is type independent (e.g. (Zhang et al., 2006; Villavicencio et al., 2007)), given the heterogeneity of MWEs, much of the work looks instead at specific types of MWE, like collocations (Pearce, 2002), compounds (Keller and Lapata, 2003) and VPCs (Baldwin, 2005; Villavicencio, 2005; Ramisch et al., 2008). Some of these works concentrate on particular languages (e.g. (Pearce, 2002; Baldwin, 2005) for English and (Piao et al., 2006) for Chinese), but some work has also benefitted from asymmetries between languages, using information from one language to help deal with MWEs in the other (e.g. (Villada Moirón and Tiedemann, 2006; Caseli et al., 2009)).
As a basis for helping to determine whether a given sequence of words is in fact an MWE (e.g. ad hoc vs. the small boy), some of these works employ linguistic knowledge for the task (Villavicencio, 2005), while others employ statistical methods (Pearce, 2002; Evert and Krenn, 2005; Zhang et al., 2006; Villavicencio et al., 2007) or combine them with some kinds of linguistic information, such as syntactic and semantic properties (Baldwin and Villavicencio, 2002; Van de Cruys and Villada Moirón, 2007) or automatic word alignment (Villada Moirón and Tiedemann, 2006).
Statistical measures of association have been commonly used for this task, as they can be democratically applied to any language and MWE type. However, there is no consensus about which measure is best suited for identifying MWEs in general. Villavicencio et al. (2007) compared some of these measures (mutual information, permutation entropy and χ²) for the type-independent detection of MWEs and found that Mutual Information seemed to differentiate MWEs from non-MWEs, but the same was not true of χ². In addition, Evert and Krenn (2005) found that for MWE identification the efficacy of a given measure depends on factors like the type of MWEs being targeted for identification, the domain and size of the corpora used, and the amount of low-frequency data excluded by adopting a threshold. Nonetheless, Villavicencio et al. (2007), discussing the influence of the corpus size and nature over the methods, found that these different measures have a high level of agreement about MWEs, whether in carefully constructed corpora or in more heterogeneous web-based ones. They also discuss the results obtained from adopting approaches like these for extending the coverage of resources, arguing that grammar coverage can be significantly increased if MWEs are properly identified and treated (Villavicencio et al., 2007).
Among the methods that use additional information along with statistics to extract MWEs, the one proposed by Villada Moirón and Tiedemann (2006) seems to be the most similar to our approach. The main difference between them is the way in which word alignment is used in the MWE extraction process. In this paper, the word alignment is the basis for the MWE extraction process, while Villada Moirón and Tiedemann's method uses the alignment just for ranking the MWE candidates, which were extracted on the basis of association measures (log-likelihood and salience) and a head dependence heuristic (in parsed data).
Our approach, as described in detail by Caseli et al. (2009), also follows to some extent that of Zhang et al. (2006), as missing lexical entries for MWEs and related constructions are detected via error mining methods, and this paper focuses on the extraction of generic MWEs as a by-product of an automatic word alignment. Another related work is the automatic detection of non-compositional compounds (NCCs) by Melamed (1997), in which NCCs are identified by analyzing statistical translation models trained on a huge corpus in a time-demanding process.
Given this context, our approach proposes the use of alignment techniques for identifying MWEs, looking at sequences detected by the aligner as containing more than one word, which form the MWE candidates. As a result, sequences of two or more consecutive source words are treated as MWE candidates, regardless of whether they are translated as one or more target words.
3 The Corpus and Reference Lists
The Corpus of Pediatrics used in these experiments contains 283 texts in Portuguese, with a total of 785,448 words, extracted from the Jornal de Pediatria. From this corpus, the Pediatrics Glossary, a reference list containing multiword terms and recurring expressions, was semi-automatically constructed and manually checked.1 The primary aim of the Pediatrics Glossary, as an online resource for long-distance education, was to train, qualify and support translation students in the domain of pediatrics texts.
The Pediatrics Glossary was built from the 36,741 n-grams that occurred at least 5 times in the corpus. These were automatically cleaned or removed using some POS tag patterns (e.g. removing prepositions from terms that began or ended with them). In addition, if an n-gram was part of a larger n-gram, only the latter appeared in the Glossary, as is the case of aleitamento materno (maternal breastfeeding), which is excluded as it is contained in aleitamento materno exclusivo (exclusive maternal breastfeeding). This post-processing resulted in 3,645 n-grams, which were manually checked by translation students, resulting in 2,407 terms: 1,421 bigrams, 730 trigrams and 339 n-grams with n larger than 3 (not considered in the experiments presented in this paper).
4 Statistically-Driven and Alignment-Based Methods

4.1 Statistically-Driven method

Statistical measures of association have been widely employed in the identification of MWEs. The idea behind their use is that they are an inexpensive, language and type independent means of detecting recurrent patterns. As Firth famously said, a word is characterized by the company it keeps, and since we expect the component words of an MWE to occur frequently together, these measures can give an indication of MWEness. In this way, if a group of words co-occurs with significantly high frequency when compared to the frequencies of the individual words, then they may form an MWE. Indeed, measures such as Pointwise Mutual Information (PMI), Mutual Information (MI), χ², log-likelihood (Press et al., 1992) and others have been employed for this task, and some of them seem to provide more accurate predictions of MWEness than others. In fact, in a comparison of some measures for the type-independent detection of MWEs, MI seemed
1 Available on the TEXTQUIM/UFRGS website: http://www.ufrgs.br/textquim
to differentiate MWEs from non-MWEs, but the same was not true of χ² (Villavicencio et al., 2007). In this work we use two commonly employed measures for this task: PMI and MI, as implemented in the Ngram Statistics Package (Banerjee and Pedersen, 2003).
From the Portuguese portion of the Corpus of Pediatrics, 196,105 bigram and 362,663 trigram MWE candidates were generated, after filtering n-grams containing punctuation and numbers. In order to evaluate how these methods perform without any linguistic filtering, the only threshold employed was a frequency cut-off of 2 occurrences, resulting in 64,839 bigrams and 54,548 trigrams. Each of the measures was then calculated for these n-grams, and we ranked each n-gram according to each of these measures. The average of all the rankings is used as the combined measure of the MWE candidates.
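As an illustration of this ranking scheme, the sketch below computes PMI from raw counts and averages the per-measure ranks. This is our simplification, not the Ngram Statistics Package implementation; the function names are ours.

```python
import math

def pmi(bigram, unigrams, bigrams, n):
    """Pointwise mutual information of a bigram, from raw corpus counts.

    unigrams / bigrams map tokens / token pairs to counts; n is the
    total number of tokens, used to turn counts into probabilities.
    """
    w1, w2 = bigram
    return math.log2((bigrams[bigram] / n) /
                     ((unigrams[w1] / n) * (unigrams[w2] / n)))

def combined_rank(candidates, measures):
    """Rank candidates under each measure (1 = best), then sort by mean rank."""
    total = {c: 0 for c in candidates}
    for m in measures:
        for r, c in enumerate(sorted(candidates, key=m, reverse=True), 1):
            total[c] += r
    return sorted(candidates, key=lambda c: total[c] / len(measures))
```

A strongly associated pair such as aleitamento materno receives a high PMI because its joint probability far exceeds the product of its parts.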
4.2 Alignment-Based method
The second of the MWE extraction approaches to be investigated in this paper is the alignment-based method. The automatic word alignment of two parallel texts, a text written in one (source) language and its translation into another (target) language, is the process of searching for correspondences between source and target words and sequences of words. For each word in a source sentence, equivalences in the parallel target sentence are looked for. Therefore, taking into account a word alignment between a source word sequence S (S = s1 ... sn, with n ≥ 2) and a target word sequence T (T = t1 ... tm, with m ≥ 1), that is, S ↔ T, the alignment-based MWE extraction method assumes that (a) S and T share some semantic features and (b) S may be an MWE.
In other words, the alignment-based MWE extraction method states that the sequence S will be an MWE candidate if it is aligned with a sequence T composed of one or more words (an n : m alignment, with n ≥ 2 and m ≥ 1). For example, the sequence of two Portuguese words aleitamento materno, which occurs 202 times in the corpus used in our experiments, is an MWE candidate because these two words were joined to be aligned 184 times with the word breastfeeding (a 2 : 1 alignment), 8 times with the word breastfed (a 2 : 1 alignment), 2 times with breastfeeding practice (a 2 : 2 alignment), and so on.
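The core criterion, keeping a source sequence whenever two or more consecutive source words share the same target alignment, can be sketched as follows. This is our simplified illustration over one sentence pair; the data layout (index-pair links) is our assumption, not the GIZA++ output format.

```python
from collections import defaultdict

def mwe_candidates(alignment):
    """Collect contiguous source spans aligned to the same target unit.

    alignment: list of (source_index, target_index) links for one
    sentence pair.  A source span is a candidate when two or more
    consecutive source words are linked to one target unit (n >= 2).
    """
    by_target = defaultdict(set)
    for s, t in alignment:
        by_target[t].add(s)
    candidates = set()
    for positions in by_target.values():
        span = sorted(positions)
        # keep only contiguous source sequences of length >= 2
        if len(span) >= 2 and span[-1] - span[0] == len(span) - 1:
            candidates.add(tuple(span))
    return candidates
```

For instance, if source words 0 and 1 (aleitamento materno) are both linked to target word 0 (breastfeeding), the span (0, 1) becomes a candidate.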
Thus, notice that the alignment-based MWE extraction method does not rely on conceptual asymmetries between languages, since it does not expect a source sequence of words to be aligned with a single target word. The method looks for sequences of source words that are frequently joined together during the alignment, regardless of the number of target words involved. These features indicate that the method prioritizes precision over recall.
It is also important to say that, although the sequences of source and target words resemble the phrases used in phrase-based statistical machine translation (SMT), they are indeed a refinement of them. More specifically, although both approaches rely on word alignments performed by GIZA++2 (Och and Ney, 2000), in the alignment-based approach not all sequences of words are considered as phrases (and MWE candidates), but just those with an n : m alignment (n ≥ 2) with a target sequence. To confirm this assumption, a phrase-based SMT system was trained on the same corpus used in our experiments, and the numbers of phrases extracted following both approaches were compared. While the SMT system extracted 819,208 source phrases, our alignment-based approach (without applying any part-of-speech or frequency filter) extracted only 34,277. These results show that the alignment-based approach refines, in some way, the phrases of SMT systems.
In this paper we investigate experimentally whether MWEs can be identified as a by-product of the automatic word alignment of parallel texts. We focus on Portuguese MWEs from the Corpus of Pediatrics, and the evaluation is performed using the bigrams and trigrams from the Pediatrics Glossary as the gold standard.
To perform the extraction of MWE candidates following the alignment-based approach, first the original corpus had to be sentence- and word-aligned and Part-of-Speech (POS) tagged. For these preprocessing steps were used, respectively, a version of the Translation Corpus Aligner (TCA) (Hofland, 1996), the statistical word aligner GIZA++ (Och and Ney, 2000), and the morphological analysers and POS taggers from Apertium3 (Armentano-Oller et al., 2006).
2 GIZA++ is a well-known statistical word aligner that can be found at http://www.fjoch.com/GIZA++.html
3 Apertium is an open-source machine translation engine and toolbox available at http://www.apertium.org
From the preprocessed corpus, the MWE candidates are extracted as those sequences in which two or more words have the same alignment, that is, they are linked to the same target unit. This initial list of MWE candidates is then filtered to remove those candidates that (a) match some sequences of POS tags or words (patterns) defined in previous experiments (Caseli et al., 2009) or (b) have a frequency below a certain threshold. The remaining units in the candidate list are considered to be MWEs.
Several filtering patterns and minimum frequency thresholds were tested, and three of them are presented in detail here. The first one (F1) is the same used during the manual building of the reference lists of MWEs: (a) patterns beginning with Article + Noun and beginning or finishing with verbs, and (b) a minimum frequency threshold of 5.
The second one (F2) is the same used in Caseli et al. (2009), mainly: (a) patterns beginning with determiner, auxiliary verb, pronoun, adverb, conjunction, and surface forms such as those of the verb to be (are, is, was, were), relatives (that, what, when, which, who, why) and prepositions (from, to, of); and (b) a minimum frequency threshold of 2.
And the third one (F3) is the same as Caseli et al. (2009) plus: (a) patterns beginning or finishing with determiner, adverb, conjunction, preposition, verb, pronoun and numeral; and (b) a minimum frequency threshold of 2.
5 Experiments and Results
Table 1 shows the top 5 and the bottom 5 ranked candidates returned by PMI and the alignment-based approach. Although some of the results are good, especially the top candidates, there is still considerable noise among the candidates, as, for instance, jogar video game (lit. play video game). From Table 1 it is also possible to notice that the alignment-based approach indeed extracts Pediatrics terms, such as aleitamento materno (breastfeeding), and also other possible MWEs that are not Pediatrics terms, such as estados unidos (United States).
In Table 2 we show the precision (number of correct candidates among the proposed ones), recall (number of correct candidates among those in the reference lists) and F-measure ((2 × precision × recall)/(precision + recall)) figures for the association measures, using all the candidates (in the first half of the table) and using the top 1,421 bigram and 730 trigram candidates (in the second half). From these latter results we can see that the top candidates produced by these measures do not agree with the Pediatrics Glossary, since there are only at most 18.37% bigram and 6.03% trigram MWEs among the top candidates, as ranked by MI and PMI respectively. Interestingly, MI had a better performance for bigrams, while for trigrams PMI performed better.

PMI                            alignment-based
Online Mendelian Inheritance   faixa etária
Beta Technology Incorporated   aleitamento materno
Lange Beta Technology          estados unidos
Óxido Nítrico Inalatório       hipertensão arterial
jogar video game               leite materno
e um de                        couro cabeludo
e a do                         bloqueio lactíferos
se que de                      emocional anatomia
e a da                         neonato a termo
e de não                       duplas mães bebês

Table 1: Top 5 and bottom 5 MWE candidates ranked by PMI and the alignment-based approach

pt MWE candidates        PMI       MI
# proposed bigrams       64,839    64,839
# correct MWEs            1,403     1,403
precision                 2.16%     2.16%
recall                   98.73%    98.73%
F                         4.23      4.23
# proposed trigrams      54,548    54,548
# correct MWEs              701       701
precision                 1.29%     1.29%
recall                   96.03%    96.03%
F                         2.55      2.55

# proposed bigrams        1,421     1,421
# correct MWEs              155       261
precision                10.91%    18.37%
recall                   10.91%    18.37%
F                        10.91     18.37
# proposed trigrams         730       730
# correct MWEs               44        20
precision                 6.03%     2.74%
recall                    6.03%     2.74%
F                         6.03      2.74

Table 2: Evaluation of MWE candidates: PMI and MI
On the other hand, looking at the alignment-based method, 34,277 pt MWE candidates were extracted, and Table 3 summarizes the number of candidates filtered by the three filters described in section 4.2: F1, F2 and F3.
To evaluate the efficacy of the alignment-based method in identifying multiword terms of Pediatrics, an automatic comparison was performed using the Pediatrics Glossary. In this automatic comparison we considered the final lists of MWE candidates generated by each filter in Table 3. The number of matching entries and the values for precision, recall and F-measure are shown in Table 4.

pt MWE candidates            F1       F2       F3
# filtered by POS patterns   24,996   21,544   32,644
# filtered by frequency       9,012   11,855    1,442
# final set                     269      878      191

Table 3: Number of pt MWE candidates filtered in the alignment-based approach

pt MWE candidates        F1       F2       F3
# proposed bigrams       250      754      169
# correct MWEs            48       95       65
precision              19.20%   12.60%   38.46%
recall                  3.38%    6.69%    4.57%
F                       5.75     8.74     8.18
# proposed trigrams       19      110       20
# correct MWEs             1        9        4
precision               5.26%    8.18%   20.00%
recall                  0.14%    1.23%    0.55%
F                       0.27     2.14     1.07
# proposed bi+trigrams   269      864      189
# correct MWEs            49      104       69
precision              18.22%   12.04%   36.51%
recall                  2.28%    4.83%    3.21%
F                       4.05     6.90     5.90

Table 4: Evaluation of MWE candidates
The different numbers of extracted MWEs (in Table 3) and evaluated ones (in Table 4) are due to the restriction of considering only bigrams and trigrams in the Pediatrics Glossary. Thus, longer MWEs extracted by the alignment-based method, such as doença arterial coronariana prematura (premature coronary artery disease) and pequenos para idade gestacional (small for gestational age), are not being considered at the moment.
After the automatic comparison using the Pediatrics Glossary, an analysis by human experts was performed on one of the derived lists: that with the best precision values so far (from filter F3). The human analysis was necessary since, as stated in Caseli et al. (2009), the coverage of reference lists may be low, and it is likely that many MWE candidates that were not found in the Pediatrics Glossary are nonetheless true MWEs. In this paper, only the pt MWE candidates extracted using filter F3 (as described in section 4.2) were manually evaluated.
From the 191 pt MWE candidates extracted after F3, 69 candidates (36.1% of the total) were found among the bigrams or trigrams in the Glossary (see Table 4). Then, the remaining 122 candidates (63.9%) were analysed by two native-speaker human judges, who classified each of the 122 candidates as true, if it is a multiword expression, or false otherwise, independently of being a Pediatrics term. For the judges, a sequence of words was considered an MWE mainly if it was (1) a proper name or (2) a sequence of words for which the meaning cannot be obtained by compounding the meanings of its component words.
The judgments of both judges were compared, and a disagreement of approximately 12% on multiwords was verified. This disagreement was also measured by the kappa (K) measure (Carletta, 1996), with k = 0.73, which does not prevent conclusions from being drawn. According to Carletta (1996), among other authors, a value of k between 0.67 and 0.8 indicates good agreement.
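The agreement figure above can be reproduced with a standard Cohen's kappa computation. A minimal sketch; the two judgment lists below are invented toy data, not the paper's actual annotations.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa = (P_o - P_e) / (1 - P_e) for two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each judge's label marginals.
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical true/false MWE judgments by two judges.
judge1 = [True, True, False, True, False, True, False, False]
judge2 = [True, True, False, False, False, True, False, True]
k = cohen_kappa(judge1, judge2)
print(round(k, 2))  # prints 0.5 for this toy sample
```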
To calculate the percentage of true candidates among the 122, two approaches can be followed, depending on which criterion one wants to emphasize: precision or coverage (not recall, because we are not calculating against a reference list). To emphasize precision, one should consider as genuine MWEs only those candidates classified as true by both judges; to emphasize coverage, one should also consider those candidates classified as true by just one of them. So, from the 191 MWE candidates, 126 (65.97%) were classified as true by both judges and 145 (75.92%) by at least one of them.
6 Conclusions and Future Work
MWEs are a complex and heterogeneous set of phenomena that defy attempts to capture them fully, but due to their role in communication they need to be properly accounted for in NLP applications and tasks.
In this paper we investigated the identification of MWEs from a technical domain, testing statistically-driven and alignment-based approaches for identifying MWEs in a Pediatrics parallel corpus. The alignment-based method generates a targeted, precision-oriented list of MWE candidates, while the statistical methods produce recall-oriented results at the expense of precision. Therefore, the combination of these methods can produce a set of MWE candidates that is both more precise than the latter and has more coverage than the former. This can significantly speed up lexicographic work. Moreover, the results obtained
show that, in comparison with the manual extraction of MWEs, this approach can also provide a general set of MWE candidates in addition to the manually selected technical terms.
Using the alignment-based extraction method, we noticed that it is possible to extract MWEs that are Pediatrics terms with a precision of 38% for bigrams and 20% for trigrams, but with very low recall, since only the MWEs in the Pediatrics Glossary were considered correct. However, after a manual analysis carried out by two native speakers of Portuguese, we found that the percentages of candidates considered true MWEs by both or by at least one of them were, respectively, 65.97% and 75.92%. This was a significant improvement, but it is important to say that in this manual analysis the human experts classified the MWEs as true independently of whether they are Pediatrics terms. So, as future work, we intend to carry out a more careful analysis with experts in Pediatrics to evaluate how many MWE candidates are also Pediatrics terms.
In addition, we plan to investigate a weighted combination of these methods, favouring those that have better precision. Finally, we also intend to apply the results obtained to the semi-automatic construction of ontologies.
Acknowledgments
We would like to thank the TEXTQUIM/UFRGS group for making the Corpus of Pediatrics and the Pediatrics Glossary available to us. We also thank FAPESP for the financial support in building the parallel corpus. This research has been partly funded by the FINEP project COMUNICA.
References

Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In R. Vieira, P. Quaresma, M.G.V. Nunes, N.J. Mamede, C. Oliveira, and M.C. Dias, editors, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese (PROPOR 2006), volume 3960 of Lecture Notes in Computer Science, pages 50-59. Springer-Verlag, May.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the Unextractable: A Case Study on Verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98-104, Taipei, Taiwan.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. 2004. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 2047-2050, Lisbon, Portugal.

Timothy Baldwin. 2005. The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398-414.

Satanjeev Banerjee and Ted Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 370-381.

Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan. 1999. Longman Grammar of Spoken and Written English. Longman, Harlow.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249-254.

Carlos Ramisch, Aline Villavicencio, Leonardo Moura, and Marco Idiart. 2008. Picking them up and Figuring them out: Verb-Particle Constructions, Noise and Idiomaticity. In Proceedings of the 12th Conference on Computational Natural Language Learning (CoNLL 2008), pages 49-56.

John Carroll and Claire Grover. 1989. The derivation of a large computational lexicon of English from LDOCE. In Bran Boguraev and Ted Briscoe, editors, Computational Lexicography for Natural Language Processing, pages 117-134. Longman, Harlow, UK.

Helena M. Caseli, Carlos Ramisch, Maria G. V. Nunes, and Aline Villavicencio. 2009. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, to appear.

Stefan Evert and Brigitte Krenn. 2005. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450-466.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing Subtypes of Multiword Expressions Using Linguistically-Motivated Statistical Measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, June.

Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28.
Knut Hofland. 1996. A program for aligning English and Norwegian sentences. In S. Hockey, N. Ide, and G. Perissinotto, editors, Research in Humanities Computing, pages 165-178. Oxford University Press, Oxford.

Ray Jackendoff. 1997. Twistin' the night away. Language, 73:534-559.

Frank Keller and Mirella Lapata. 2003. Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 29(3):459-484.

I. Dan Melamed. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In eprint arXiv:cmp-lg/9706027, June.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 33-40, Trento, Italy.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the ACL, pages 440-447, Hong Kong, China, October.

Darren Pearce. 2002. A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 1-7, Las Palmas, Canary Islands, Spain.

Scott S. L. Piao, Guangfan Sun, Paul Rayson, and Qi Yuan. 2006. Automatic Extraction of Chinese Multiword Expressions with a Statistical Tool. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context (EACL-2006), pages 17-24, Trento, Italy, April.

William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Second edition. Cambridge University Press.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-2002), volume 2276 of Lecture Notes in Computer Science, pages 1-15, London, UK. Springer-Verlag.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based Multiword Expression Extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, June.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1034-1043, Prague, June.

Aline Villavicencio. 2005. The Availability of Verb-Particle Constructions in Lexical Resources: How Much is Enough? Journal of Computer Speech and Language Processing, 19(4):415-432.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. 2006. Automated Multiword Expression Prediction for Grammar Engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36-44, Sydney, Australia, July. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 9-16, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim
CSSE Dept.
University of Melbourne
snkim@csse.unimelb.edu.au

Min-Yen Kan
Department of Computer Science
National University of Singapore
kanmy@comp.nus.edu.sg
Abstract
We tackle two major issues in automatic keyphrase extraction using scientific articles: candidate selection and feature engineering. To develop an efficient candidate selection method, we analyze the nature and variation of keyphrases and then select candidates using regular expressions. Secondly, we re-examine the existing features broadly used for the supervised approach, exploring different ways to enhance their performance. While most other approaches are supervised, we also study the optimal features for unsupervised keyphrase extraction. Our research has shown that effective candidate selection leads to better performance, as evaluation accounts for candidate coverage. Our work also attests that many of the existing features are usable in unsupervised extraction.
1 Introduction
Keyphrases are simplex nouns or noun phrases (NPs) that represent the key ideas of the document. Keyphrases can serve as a representative summary of the document and also serve as high-quality index terms. It is thus no surprise that keyphrases have been utilized to acquire critical information as well as to improve the quality of natural language processing (NLP) applications, such as document summarizers (Davanzo and Magnini, 2005), information retrieval (IR) (Gutwin et al., 1999) and document clustering (Hammouda et al., 2005).
In the past, various attempts have been made to boost automatic keyphrase extraction performance based primarily on statistics (Frank et al., 1999; Turney, 2003; Park et al., 2004; Wan and Xiao, 2008) and a rich set of heuristic features (Barker and Corrnacchia, 2000; Medelyan and Witten, 2006; Nguyen and Kan, 2007). In Section 2 we give a more comprehensive overview of previous attempts.
Current keyphrase technology still has much room for improvement. First of all, although several candidate selection methods have been proposed for automatic keyphrase extraction in the past (e.g., (Frank et al., 1999; Park et al., 2004; Nguyen and Kan, 2007)), most of them do not effectively deal with the various keyphrase forms, which results in some keyphrases being ignored as candidates. Moreover, no studies thus far have done a detailed investigation of the nature and variation of manually-provided keyphrases. As a consequence, the community lacks a standardized list of candidate forms, which leads to difficulties in direct comparison across techniques during evaluation and hinders re-usability.
Secondly, previous studies have shown the effectiveness of their own features, but not many compared their features with other existing features. That leads to redundancy across studies and hinders direct comparison. In addition, existing features are specifically designed for supervised approaches, with few exceptions. However, this approach involves a large amount of manual labor, reducing its utility for real-world applications. Hence, an unsupervised approach is inevitable in order to minimize manual tasks and to encourage utilization. It is a worthwhile study to attest the reliability and re-usability of features for the unsupervised approach, in order to set up a tentative guideline for applications.
This paper aims to resolve these issues of candidate selection and feature engineering. In our work on candidate selection, we analyze the nature and variation of keyphrases with the purpose of proposing a candidate selection method which improves the coverage of candidates that occur in various forms. Our second contribution re-examines existing keyphrase extraction features
reported in the literature in terms of their effectiveness and re-usability. We test and compare the usefulness of each feature for further improvement. In addition, we assess how well these features can be applied in an unsupervised approach.
In the remaining sections, we describe an overview of related work in Section 2, and our proposals on candidate selection and feature engineering in Sections 4 and 5, and our system architecture and data in Section 6. We then evaluate our proposals, discuss the outcomes and conclude our work in Sections 7, 8 and 9, respectively.
2 Related Work
The majority of related work has been carried out using statistical approaches, a rich set of symbolic resources, and linguistically-motivated heuristics (Frank et al., 1999; Turney, 1999; Barker and Corrnacchia, 2000; Matsuo and Ishizuka, 2004; Nguyen and Kan, 2007). The features used can be categorized into three broad groups: (1) document cohesion features (i.e., the relationship between document and keyphrases) (Frank et al., 1999; Matsuo and Ishizuka, 2004; Medelyan and Witten, 2006; Nguyen and Kan, 2007), and, to a lesser extent, (2) keyphrase cohesion features (i.e., the relationship among keyphrases) (Turney, 2003) and (3) term cohesion features (i.e., the relationship among components in a keyphrase) (Park et al., 2004).
The simplest system is KEA (Frank et al., 1999; Witten et al., 1999), which uses TFIDF (i.e., term frequency, inverse document frequency) and first occurrence in the document. TFIDF measures document cohesion, and the first occurrence implies the importance of the abstract or introduction, which indicates that keyphrases have locality. Turney (2003) added the notion of keyphrase cohesion to the KEA features, and Nguyen and Kan (2007) added linguistic features such as section information and suffix sequence. The GenEx system (Turney, 1999) employed an inventory of nine syntactic features, such as length in words and frequency of the stemmed phrase, as a set of parametrized heuristic rules. Barker and Corrnacchia (2000) introduced a method based on head noun heuristics that took three features: length of candidate, frequency, and head noun frequency. To take advantage of domain knowledge, Hulth et al. (2001) used a hierarchically-organized domain-specific thesaurus from the Swedish Parliament as a secondary knowledge source. The
Textract system (Park et al., 2004) also ranks the candidate keyphrases by its judgment of the keyphrases' degree of domain specificity based on subject-specific collocations (Damerau, 1993), in addition to term cohesion using the Dice coefficient (Dice, 1945). Recently, Wan and Xiao (2008) extracted keyphrases automatically from single documents utilizing document clustering information. The assumption behind this work is that documents with the same or similar topics interact with each other in terms of the salience of words. The authors first clustered the documents, then used a graph-based ranking algorithm to rank the candidates in a document by making use of the mutual influences of the other documents in the same cluster.
3 Keyphrase Analysis
In previous studies, KEA employed the indexing words as candidates, whereas others, such as (Park et al., 2004; Nguyen and Kan, 2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. We believe there is more to be learned from the reference keyphrases themselves by doing a fine-grained, careful analysis of their form and composition. Note that we used articles collected from the ACM digital library both for analyzing keyphrases and for evaluating methods. See Section 6 for details of the data.
Syntactically, keyphrases can be formed by either simplex nouns (e.g., algorithm, keyphrase, multi-agent) or noun phrases (NPs), which can be a sequence of nouns and their auxiliary words such as adjectives and adverbs (e.g., mobile network, fast computing, partially observable Markov decision process), although with few incidences. They can also incorporate a prepositional phrase (PP) (e.g., quality of service, policy of distributed caching). When keyphrases take the form of an NP with an attached PP (i.e., NPs in of-PP form), the preposition of is most common, but others such as for, in, via also occur (e.g., incentive for cooperation, inequality in welfare, agent security via approximate policy, trade in financial instrument based on logical formula). The patterns above correlate well with the part-of-speech (POS) patterns used in modern keyphrase extraction systems.
However, our analysis uncovered additional linguistic patterns and alternations which other studies may have overlooked. In our study, we also found that keyphrases occur as simple con-
Frequency
(Rule1) Frequency heuristic, i.e., frequency ≥ 2 for simplex words vs. frequency ≥ 1 for NPs

Length
(Rule2) Length heuristic, i.e., up to length 3 for NPs in non-of-PP form vs. up to length 4 for NPs in of-PP form
(e.g., synchronous concurrent program vs. model of multiagent interaction)

Alternation
(Rule3) of-PP form alternation
(e.g., number of sensor = sensor number, history of past encounter = past encounter history)
(Rule4) Possessive alternation
(e.g., agent's goal = goal of agent, security's value = value of security)

Extraction
(Rule5) Noun Phrase = (NN|NNS|NNP|NNPS|JJ|JJR|JJS)*(NN|NNS|NNP|NNPS)
(e.g., complexity, effective algorithm, grid computing, distributed web-service discovery architecture)
(Rule6) Simplex Word/NP IN Simplex Word/NP
(e.g., quality of service; sensitivity of VOIP traffic (VOIP traffic extracted); simplified instantiation of zebroid (simplified instantiation extracted))

Table 1: Candidate Selection Rules
junctions (e.g., search and rescue, propagation and delivery) and, much more rarely, as conjunctions of more complex NPs (e.g., history of past encounter and transitivity). Some keyphrases appear to be more complex (e.g., pervasive document edit and management system, task and resource allocation in agent system). Similarly, abbreviations and possessive forms figure as common patterns (e.g., belief desire intention = BDI, inverse document frequency = IDF, Bayes' theorem, agent's dominant strategy).
A critical insight of our work is that keyphrases can be morphologically and semantically altered. Keyphrases that incorporate a PP or have an underlying genitive composition are often easily varied by word-order alternation. Previous studies have used the altered keyphrases when they form in of-PP form. For example, quality of service can be altered to service quality, sometimes with little semantic difference. Also, as most morphological variation in English relates to noun number and verb inflection, keyphrases are subject to these rules as well (e.g., distributed system ≠ distributing system, dynamical caching ≠ dynamical cache). In addition, possessives tend to alternate with the of-PP form (e.g., agent's goal = goal of agent, security's value = value of security).
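The of-PP alternation described above can be generated mechanically: an NP of the form "X of Y" alternates with "Y X". A simplified sketch under the assumption that the of-PP is genitive; a real system would need POS checks to filter the non-genitive cases.

```python
def of_pp_alternations(phrase):
    """Return the word-order alternation of an 'X of Y' noun phrase."""
    words = phrase.split()
    # Only alternate when 'of' is strictly internal to the phrase.
    if "of" in words[1:-1]:
        i = words.index("of")
        head, modifier = words[:i], words[i + 1:]
        # "quality of service" -> "service quality"
        return [" ".join(modifier + head)]
    return []

print(of_pp_alternations("quality of service"))        # ['service quality']
print(of_pp_alternations("history of past encounter")) # ['past encounter history']
```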
4 Candidate Selection
We now describe our proposed candidate selection process. Candidate selection is a crucial step for automatic keyphrase extraction. This step is correlated with term extraction studies, since the top-Nth ranked terms become the keyphrases of documents. In previous studies, KEA employed the indexing words as candidates, whereas others, such as (Park et al., 2004; Nguyen and Kan, 2007), generated handcrafted regular expression rules. However, none carefully undertook an analysis of keyphrases. In this section, before we present our method, we first describe our keyphrase analysis in detail.
In our keyphrase analysis, we observed that most author-assigned and/or reader-assigned keyphrases are syntactically more often simplex words and less often NPs. When keyphrases take an NP form, they tend to be a simple form of NP, i.e., either without a PP, with only a PP, or with a conjunction, but few appear as a mixture of such forms. We also noticed that the components of NPs are normally nouns and adjectives, but rarely adverbs and verbs. As a result, we decided to ignore NPs containing adverbs and verbs as candidates in this study, since they tend to produce more errors and to require more complexity.
Another observation is that keyphrases containing more than three words are rare (i.e., 6% in our data set), validating what Paukkeri et al. (2008) observed. Hence, we apply a length heuristic: our candidate selection rule collects candidates up to length 3, but also of length 4 for NPs in of-PP form, since they may have a non-genitive alternation that reduces their length to 3 (e.g., performance of distributed system = distributed system performance). In previous studies, words occurring at least twice are selected as candidates. However, during our acquisition of reader-assigned keyphrases, we observed that readers tend to collect NPs as keyphrases regardless of their frequency. Due to this, we apply different frequency thresholds for simplex words (≥ 2) and NPs (≥ 1). Note that 30% of NPs occurred only once in our data.
Finally, we generated regular expression rules to extract candidates, as presented in Table 1. Our candidate extraction rules are based on those in Nguyen and Kan (2007). However, our Rule6 for NPs in of-PP form broadens the coverage of
possible candidates, i.e., given an NP in of-PP form, not only do we collect simplex word(s), but we also extract the non-of-PP forms of NPs from the noun phrase governing the PP and from the PP itself. For example, our rule extracts effective algorithm of grid computing as well as effective algorithm and grid computing as candidates, while previous works' rules do not.
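Rules 5 and 6 of Table 1 can be sketched over POS-tagged tokens as follows. The helper functions and the tagged sentence are illustrative assumptions, not the authors' implementation; tags follow the Penn Treebank convention.

```python
import re

def rule5(tagged):
    """Rule5: maximal (NN|NNS|NNP|NNPS|JJ|JJR|JJS)* (NN|NNS|NNP|NNPS) runs."""
    candidates, run = [], []
    for word, pos in tagged + [("", "EOS")]:  # sentinel flushes the last run
        if re.fullmatch(r"NN|NNS|NNP|NNPS|JJ|JJR|JJS", pos):
            run.append((word, pos))
        else:
            # A run is a candidate only if it ends in a nominal tag.
            while run and not run[-1][1].startswith("NN"):
                run.pop()
            if run:
                candidates.append(" ".join(w for w, _ in run))
            run = []
    return candidates

def rule6(tagged):
    """Rule6: also keep the full 'NP of NP' candidate spanning the PP."""
    nps = rule5(tagged)
    text = " ".join(w for w, _ in tagged)
    return [f"{a} of {b}" for a in nps for b in nps
            if f"{a} of {b}" in text]

tagged = [("effective", "JJ"), ("algorithm", "NN"), ("of", "IN"),
          ("grid", "NN"), ("computing", "NN")]
print(rule5(tagged))  # ['effective algorithm', 'grid computing']
print(rule6(tagged))  # ['effective algorithm of grid computing']
```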
5 Feature Engineering
With wider candidate selection criteria, the onus of filtering out irrelevant candidates becomes the responsibility of careful feature engineering. We list 25 features that we have found useful in extracting keyphrases, comprising 9 existing and 16 novel and/or modified features that we introduce in our work (marked with *). As one of our goals in feature engineering is to assess the suitability of features in the unsupervised setting, we have also indicated which features are suitable only for the supervised setting (S) or applicable to both (S, U).
5.1 Document Cohesion
Document cohesion indicates how important the candidates are for the given document. The most popular feature for this cohesion is TFIDF, but some works have also used context words to check the correlation between candidates and the given document. Other features for document cohesion are distance, section information, and so on. We note that the listed features other than TFIDF are related to locality. That is, the intuition behind these features is that keyphrases tend to appear in specific areas, such as the beginning and the end of documents.
F1 TFIDF (S, U): TFIDF indicates document cohesion by looking at the frequency of terms in the documents and is broadly used in previous work (Frank et al., 1999; Witten et al., 1999; Nguyen and Kan, 2007). However, a disadvantage of the feature is in requiring a large corpus to compute a useful IDF. As an alternative, context words (Matsuo and Ishizuka, 2004) can also be used to measure document cohesion. From our study of keyphrases, we saw that substrings within longer candidates need to be properly counted, and as such our method measures TF in substrings as well as in exact matches. For example, grid computing is often a substring of other phrases, such as grid computing algorithm and efficient grid computing algorithm. We also normalize TF with respect to candidate types, i.e., we separately treat simplex words and NPs when computing TF. To make our IDFs broadly representative, we employed the Google n-gram counts, which were computed over terabytes of data. Given this large generic source of word counts, IDF can be incorporated without corpus-dependent processing; hence, such features are useful in unsupervised approaches as well. The following list shows the variations of TFIDF employed as features in our system:
• (F1a) TFIDF

• (F1b) TF including counts of substrings

• (F1c) TF of substrings as a separate feature

• (F1d) normalized TF by candidate types (i.e., simplex words vs. NPs)

• (F1e) normalized TF by candidate types as a separate feature

• (F1f) IDF using Google n-grams
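Features F1b and F1f can be sketched as follows: TF that also counts a candidate when it appears inside a longer phrase, combined with an IDF taken from an external n-gram table. The occurrence list and n-gram counts below are invented stand-ins for a real document and the Google n-gram corpus.

```python
import math

def tf_with_substrings(candidate, phrase_occurrences):
    """Count exact occurrences plus occurrences inside longer phrases."""
    return sum(1 for phrase in phrase_occurrences if candidate in phrase)

def tfidf(candidate, phrase_occurrences, ngram_counts, total_ngrams):
    tf = tf_with_substrings(candidate, phrase_occurrences)
    # Add-one smoothing so unseen n-grams do not cause division by zero.
    idf = math.log(total_ngrams / (1 + ngram_counts.get(candidate, 0)))
    return tf * idf

# Hypothetical phrase occurrences in one document.
occurrences = ["grid computing", "grid computing algorithm",
               "efficient grid computing algorithm", "web service"]
# Hypothetical external n-gram counts (stand-in for Google n-grams).
ngram_counts = {"grid computing": 120_000, "web service": 900_000}

print(tf_with_substrings("grid computing", occurrences))  # 3
score = tfidf("grid computing", occurrences, ngram_counts, 10**12)
```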
F2 First Occurrence (S, U): KEA used the first appearance of the word in the document (Frank et al., 1999; Witten et al., 1999). The main idea behind this feature is that keyphrases tend to occur at the beginning of documents, especially in structured reports (e.g., in the abstract and introduction sections) and newswire.
F3 Section Information (S, U): Nguyen and Kan (2007) used the identity of the specific document section a candidate occurs in. This locality feature attempts to identify key sections. For example, in their study of scientific papers, the authors weighted candidates differently depending on whether they occurred in the abstract, introduction, conclusion, section heads, title, and/or references.
F4 Additional Section Information (S, U): We first added the related work or previous work section as a piece of section information not included in Nguyen and Kan (2007). We also propose and test a number of variations. We used the substrings that occur in section headers and reference titles as keyphrases. We counted the co-occurrence of candidates (i.e., the section TF) across all key sections, which indicates the correlation among key sections. We assign section-specific weights, as individual sections exhibit different propensities for generating keyphrases. For example, the introduction
contains the majority of keyphrases, while the title or section heads contain many fewer, due to the variation in size.
• (F4a) section 'related/previous work'

• (F4b) counting substrings occurring in key sections

• (F4c) section TF across all key sections

• (F4d) weighting key sections according to the portion of keyphrases found
F5 Last Occurrence (S, U): Similar to distance in KEA, the position of the last occurrence of a candidate may also imply the importance of keyphrases, as keyphrases tend to appear in the last part of the document, such as the conclusion and discussion.
5.2 Keyphrase Cohesion

The intuition behind using keyphrase cohesion is that actual keyphrases are often associated with each other, since they are semantically related to the topic of the document. Note that this assumption holds only when the document describes a single coherent topic; a document that represents a collection may first need to be segmented into its constituent topics.
F6 Co-occurrence of Another Candidate in Section (S, U): When candidates co-occur together in several key sections, they are more likely to be keyphrases. Hence, we used the number of sections in which candidates co-occur.
F7 Title Overlap (S): In a way, titles also represent the topics of their documents. A large collection of titles in the domain can act as a probabilistic prior of what words could stand as constituent words in keyphrases. In our work, as we examined scientific papers from computer science, we used a collection of titles obtained from the large CiteSeer [1] collection to create this feature.
• (F7a) co-occurrence (Boolean) in title collocation

• (F7b) co-occurrence (TF) in title collection
F8 Keyphrase Cohesion (S, U): Turney (2003) integrated keyphrase cohesion into his system by checking the semantic similarity between the top N ranked candidates and the remainder. In the original work, a large external web corpus was used to obtain the similarity judgments. As we did not have access to the same web corpus, and not all candidates/keyphrases were found in the Google n-gram corpus, we approximated this feature using a similar notion of contextual similarity. We simulated a latent 2-dimensional matrix (similar to latent semantic analysis) by listing all candidate words in rows and their neighboring words (nouns, verbs and adjectives only) in columns. The cosine measure is then used to compute the similarity among keyphrases.

[1] It contains 13M titles from articles, papers and reports.
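The approximated cohesion feature can be sketched as a candidate-by-context-word count matrix compared with cosine similarity. The context vectors below are toy counts, not derived from real documents.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Rows: candidates; columns: neighboring nouns/verbs/adjectives (toy counts).
contexts = {
    "grid computing":  {"cluster": 4, "distributed": 3, "resource": 2},
    "cloud computing": {"cluster": 2, "distributed": 4, "storage": 1},
    "tooth decay":     {"dental": 5, "sugar": 2},
}

sim = cosine(contexts["grid computing"], contexts["cloud computing"])
unrelated = cosine(contexts["grid computing"], contexts["tooth decay"])
print(sim > unrelated)  # True: related candidates share context words
```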
5.3 Term Cohesion
Term cohesion further refines the candidacy judgment by incorporating an internal analysis of the candidate's constituent words. Term cohesion posits that high values of internal word association measures indicate that the candidate is a keyphrase (Church and Hanks, 1989).
F9 Term Cohesion (S, U): Park et al. (2004) used the Dice coefficient (Dice, 1945) to measure term cohesion, particularly for multi-word terms. In their work, as NPs are longer than simplex words, they simply discounted simplex word cohesion by 10%. In our work, we vary the measure of TF used in the Dice coefficient, similar to our discussion earlier.
• (F9a) term cohesion by (Park et al., 2004)

• (F9b) normalized TF by candidate types (i.e., simplex words vs. NPs)

• (F9c) applying different weights by candidate type

• (F9d) normalized TF and different weighting by candidate types
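For a two-word candidate, the Dice coefficient used for term cohesion is 2 * f(w1 w2) / (f(w1) + f(w2)). A minimal sketch with invented frequencies; the paper additionally discounts simplex-word cohesion, which is omitted here.

```python
def dice(pair_freq, freq1, freq2):
    """Dice coefficient for a two-word candidate from raw frequencies."""
    return 2 * pair_freq / (freq1 + freq2)

# Hypothetical counts: f("grid computing")=30, f("grid")=40, f("computing")=80.
print(dice(30, 40, 80))  # prints 0.5
```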
5.4 Other Features
F10 Acronym (S): Nguyen and Kan (2007) accounted for the importance of acronyms as a feature. We found that this feature is heavily dependent on the data set. Hence, we used it only for N&K, to attest our candidate selection method.
F11 POS Sequence (S): Hulth and Megyesi (2006) pointed out that the POS sequences of keyphrases are similar. They showed the distinctive distribution of POS sequences of keyphrases and used them as a feature. Like acronym, this is also subject to the data set.
F12 Suffix Sequence (S): Similar to acronym, Nguyen and Kan (2007) also used a candidate's suffix sequence as a feature, to capture the propensity of English to use certain Latin derivational morphology for technical keyphrases. This feature is also data-dependent, and thus used in the supervised approach only.
F13 Length of Keyphrases (S, U): Barker and Corrnacchia (2000) showed that candidate length is also a useful feature in extraction as well as in candidate selection, as the majority of keyphrases are one or two terms in length.
6 System and Data
To assess the performance of the proposed candidate selection rules and features, we implemented a keyphrase extraction pipeline. We start with the raw text of computer science articles converted from PDF by pdftotext. Then we partitioned the text into sections, such as title and section headers, via heuristic rules, and applied a sentence segmenter [2], ParsCit [3] (Councill et al., 2008) for reference collection, a part-of-speech tagger [4], and a lemmatizer [5] (Minnen et al., 2001) to the input. After preprocessing, we built both supervised and unsupervised classifiers using Naive Bayes from the WEKA machine learning toolkit (Witten and Frank, 2005), Maximum Entropy [6], and simple weighting.
For evaluation, we collected 250 papers from four different categories [7] of the ACM digital library. Each paper was 6 to 8 pages on average. Among the author-assigned keyphrases, we found many were missing or found only as substrings. To remedy this, we collected reader-assigned keyphrases by hiring senior-year undergraduates in computer science, each of whom annotated five of the papers following an annotation guideline and took on average about 15 minutes to annotate each paper. The final statistics of keyphrases are presented in Table 2, where Combined represents the total number of keyphrases. The numbers in parentheses denote the number of keyphrases in of-PP form. Found means the
2httpwwwengritsumeiacjpasaoresourcessentseg3httpwingcompnusedusgparsCit4httpsearchcpanorgdistLingua-EN-
TaggerTaggerpm5httpwwwinformaticssusxacukresearchgroupsnlpcarrollmorphhtml6httpmaxentsourceforgenetindexhtml7C24 (Distributed Systems) H33 (Information Search
and Retrieval) I211 (Distributed Artificial Intelligence-Multiagent Systems) and J4 (Social and Behavioral Sciences-Economics)
number of author assigned keyphrase and readerassigned keyphrase found in the documents
          Author        Reader          Combined
Total     1252 (53)     3110 (111)      3816 (146)
NPs       904           2537            3027
Average   3.85 (4.01)   12.44 (12.88)   15.26 (15.85)
Found     769           2509            2864

Table 2: Statistics of keyphrases.
7 Evaluation
The baseline system for both the supervised and unsupervised approaches is a modified N&K, which uses TFIDF, distance, section information, and additional section information (i.e., F1-4). Apart from the baseline, we also implemented basic KEA and N&K for comparison. Note that N&K is considered a supervised approach, as it utilizes features like acronym, POS sequence, and suffix sequence.
Tables 3 and 4 show the performance of our candidate selection method and features with respect to the supervised and unsupervised approaches, using the current standard evaluation method (i.e., the exact matching scheme) over the top 5, 10, and 15 candidates.
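As an illustration of this exact matching scheme, the following sketch (our own, not the authors' code) scores the top-k ranked candidates against the gold keyphrases: a candidate counts as a match only if it string-matches a gold keyphrase exactly (here after lowercasing).

```python
# Sketch of exact-match evaluation over the top-k candidates.

def evaluate_at_k(ranked_candidates, gold_keyphrases, k):
    top_k = [c.lower() for c in ranked_candidates[:k]]
    gold = {g.lower() for g in gold_keyphrases}
    match = sum(1 for c in top_k if c in gold)
    precision = match / k
    recall = match / len(gold)
    fscore = 2 * precision * recall / (precision + recall) if match else 0.0
    return match, precision, recall, fscore

m, p, r, f = evaluate_at_k(["naive bayes", "keyphrase", "weka"],
                           ["keyphrase", "candidate selection"], k=3)
print(m)  # 1
```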
BestFeatures includes F1c (TF of substrings as a separate feature), F2 (first occurrence), F3 (section information), F4d (weighting key sections), F5 (last occurrence), F6 (co-occurrence with another candidate in a section), F7b (title overlap), F9a (term cohesion by Park et al. (2004)), and F13 (length of keyphrases). Best-TFIDF means using all of the best features except TFIDF.
In Tables 3 and 4, C denotes the classifier technique: unsupervised (U) or supervised using Maximum Entropy (S)8.
In Table 5, the performance of each feature is measured using the N&K system plus the target feature. + indicates an improvement, - indicates a performance decline, and 0 indicates no effect, or an effect unconfirmed because the changes in performance were small. Again, Supervised denotes Maximum Entropy training and Unsupervised is our unsupervised approach.
8 Discussion
We compared the performance of our candidate selection and feature engineering with simple KEA, N&K, and our baseline system. In evaluating candidate selection, we found that longer
8 Due to page limits, we present the best performance.
Method          Features  C        Five                          Ten                           Fifteen
                             Match  Prec  Recall  Fscore   Match  Prec  Recall  Fscore   Match  Prec  Recall  Fscore
All             KEA       U   0.03  0.64   0.21    0.32     0.09  0.92   0.60    0.73     0.13  0.88   0.86    0.87
Candidates                S   0.79 15.84   5.19    7.82     1.39 13.88   9.09   10.99     1.84 12.24  12.03   12.13
                N&K       S   1.32 26.48   8.67   13.06     2.04 20.36  13.34   16.12     2.54 16.93  16.64   16.78
                baseline  U   0.92 18.32   6.00    9.04     1.57 15.68  10.27   12.41     2.20 14.64  14.39   14.51
                          S   1.15 23.04   7.55   11.37     1.90 18.96  12.42   15.01     2.44 16.24  15.96   16.10
Length<=3       KEA       U   0.03  0.64   0.21    0.32     0.09  0.92   0.60    0.73     0.13  0.88   0.86    0.87
Candidates                S   0.81 16.16   5.29    7.97     1.40 14.00   9.17   11.08     1.84 12.24  12.03   12.13
                N&K       S   1.40 27.92   9.15   13.78     2.10 21.04  13.78   16.65     2.62 17.49  17.19   17.34
                baseline  U   0.92 18.40   6.03    9.08     1.58 15.76  10.32   12.47     2.20 14.64  14.39   14.51
                          S   1.18 23.68   7.76   11.69     1.90 19.00  12.45   15.04     2.40 16.00  15.72   15.86
Length<=3       KEA       U   0.01  0.24   0.08    0.12     0.05  0.52   0.34    0.41     0.07  0.48   0.47    0.47
Candidates                S   0.83 16.64   5.45    8.21     1.42 14.24   9.33   11.27     1.87 12.45  12.24   12.34
+ Alternation   N&K       S   1.53 30.64  10.04   15.12     2.31 23.08  15.12   18.27     2.88 19.20  18.87   19.03
                baseline  U   0.98 19.68   6.45    9.72     1.72 17.24  11.29   13.64     2.37 15.79  15.51   15.65
                          S   1.33 26.56   8.70   13.11     2.09 20.88  13.68   16.53     2.69 17.92  17.61   17.76

Table 3: Performance on proposed candidate selection.
Features    C        Five                          Ten                           Fifteen
                Match  Prec  Recall  Fscore   Match  Prec  Recall  Fscore   Match  Prec  Recall  Fscore
Best        U    1.14  22.8   7.47   11.3      1.92  19.2  12.6    15.2      2.61  17.4  17.1    17.3
            S    1.56  31.2  10.2    15.4      2.50  25.0  16.4    19.8      3.15  21.0  20.6    20.8
Best        U    1.14  22.8   7.4    11.3      1.92  19.2  12.6    15.2      2.61  17.4  17.1    17.3
w/o TFIDF   S    1.56  31.1  10.2    15.4      2.46  24.6  16.1    19.4      3.12  20.8  20.4    20.6

Table 4: Performance on feature engineering.
Effect  Method  Features
+       S       F1a, F2, F3, F4a, F4d, F9a
        U       F1a, F1c, F2, F3, F4a, F4d, F5, F7b, F9a
-       S       F1b, F1c, F1d, F1f, F4b, F4c, F7a, F7b, F9b-d, F13
        U       F1d, F1e, F1f, F4b, F4c, F6, F7a, F9b-d
0       S       F1e, F10, F11, F12
        U       F1b

Table 5: Performance of each feature.
length candidates tend to act as noise and thus decreased the overall performance. We also confirmed that candidate alternation accommodates the flexibility of keyphrases, leading to higher candidate coverage as well as better performance.
To re-examine features, we analyzed the impact of existing and new features and their variations. First of all, unlike previous studies, we found that performance with and without TFIDF did not differ greatly, which indicates that the impact of TFIDF is minor as long as other features are incorporated. Secondly, counting substrings for TF improved performance, while applying term weighting for TF and/or IDF did not affect performance. We surmise the cause is that many keyphrases are substrings of candidates, and vice versa. Thirdly, section information was also validated to improve performance, as in Nguyen and Kan (2007). Extending this logic, modeling additional section information (related work) and weighting sections both turned out to be useful features. Other locality features were also validated as helpful: both first occurrence and last occurrence help, as they capture the locality of the key ideas. In addition, keyphrase co-occurrence with selected sections was proposed in our work and found empirically useful. Term cohesion (Park et al., 2004) is a useful feature, although it has a heuristic factor that reduces the weight by 10% for simplex words. Normally, term cohesion applies to NPs only; hence, it needs to be extended to work with multi-word NPs as well. Table 5 summarizes the findings on each feature.
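The substring-counting idea above can be illustrated with a short sketch (our own, not the authors' exact counting scheme): a candidate's term frequency also credits occurrences where it appears inside longer token spans of the text, so "keyphrase" is counted even where it occurs within "keyphrase extraction".

```python
# Sketch of substring-aware term frequency: count every position where the
# candidate's token sequence occurs, including inside longer candidates.

def substring_tf(candidate, text):
    tokens = text.lower().split()
    cand = candidate.lower().split()
    n = len(cand)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == cand)

doc = "keyphrase extraction selects keyphrase candidates"
print(substring_tf("keyphrase", doc))             # 2
print(substring_tf("keyphrase extraction", doc))  # 1
```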
As unsupervised methods have the appeal of not needing to be trained on expensive hand-annotated data, we also compared the performance of the supervised and unsupervised methods. Given that the features were initially introduced for supervised learning, unsupervised performance is surprisingly high: while the supervised classifier produced a matching count of 3.15, the unsupervised classifier obtains a count of 2.61. We feel this indicates that the existing features for supervised methods are also suitable for use in unsupervised methods, with slightly reduced performance. In general, we observed that the best features in both supervised and unsupervised methods are the same: section information and candidate length. In our analysis of the impact of individual features, we observed that most features affect performance in the same way for both supervised and unsupervised approaches, as shown in Table 5. These findings indicate that although these features may have been originally designed for use in a supervised approach, they are stable and can be expected to perform similarly in unsupervised approaches.
9 Conclusion
We have identified and tackled two core issues in automatic keyphrase extraction: candidate selection and feature engineering. In the area of candidate selection, we observe variations and alternations that were previously unaccounted for. Our selection rules expand the scope of possible keyphrase coverage while not overly expanding the total number of candidates to consider. In our re-examination of feature engineering, we compiled a comprehensive feature list from previous work, while exploring the use of substrings in devising new features. Moreover, we also attested to each feature's fitness for use in unsupervised approaches, in order to utilize them in real-world applications with minimal cost.
10 Acknowledgement

This work was partially supported by a National Research Foundation grant, Interactive Media Search (grant R 252 000 325 279), while the first author was a postdoctoral fellow at the National University of Singapore.
References

Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence, 2000.

Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL/EACL 1997 Workshop on Intelligent Scalable Text Summarization, 1997, pp. 10-17.

Kenneth Church and Patrick Hanks. Word association norms, mutual information and lexicography. In Proceedings of ACL, 1989, pp. 76-83.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of LREC, 2008, pp. 28-30.

Ernesto D'Avanzo and Bernardo Magnini. A keyphrase-based approach to summarization: the LAKE system at DUC-2005. In Proceedings of DUC, 2005.

F. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 1993, 29, pp. 433-447.

Lee Dice. Measures of the amount of ecologic association between species. Journal of Ecology, 1945.

Eibe Frank, Gordon Paynter, Ian Witten, Carl Gutwin, and Craig Nevill-Manning. Domain-specific keyphrase extraction. In Proceedings of IJCAI, 1999, pp. 668-673.

Carl Gutwin, Gordon Paynter, Ian Witten, Craig Nevill-Manning, and Eibe Frank. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 1999, 27, pp. 81-104.

Khaled Hammouda, Diego Matute, and Mohamed Kamel. CorePhrase: keyphrase extraction for document clustering. In Proceedings of MLDM, 2005.

Annette Hulth, Jussi Karlgren, Anna Jonsson, Henrik Boström, and Lars Asker. Automatic keyword extraction using domain knowledge. In Proceedings of CICLing, 2001.

Annette Hulth and Beata Megyesi. A study on automatically extracted keywords in text categorization. In Proceedings of ACL/COLING, 2006, pp. 537-544.

Mario Jarmasz and Caroline Barriere. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. In Proceedings of CLINE, 2004.

Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding topic words for hierarchical summarization. In Proceedings of SIGIR, 2001, pp. 349-357.

Y. Matsuo and M. Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 2004, 13(1), pp. 157-169.

Olena Medelyan and Ian Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, 2006, pp. 296-297.

Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 2001, 7(3), pp. 207-223.

Thuy Dung Nguyen and Min-Yen Kan. Keyphrase extraction in scientific publications. In Proceedings of ICADL, 2007, pp. 317-326.

Youngja Park, Roy Byrd, and Branimir Boguraev. Automatic glossary extraction: beyond terminology identification. In Proceedings of COLING, 2004, pp. 48-55.

Mari-Sanna Paukkeri, Ilari Nieminen, Matti Polla, and Timo Honkela. A language-independent approach to keyphrase extraction and evaluation. In Proceedings of COLING, 2008.

Peter Turney. Learning to extract keyphrases from text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

Peter Turney. Coherent keyphrase extraction via Web mining. In Proceedings of IJCAI, 2003, pp. 434-439.

Xiaojun Wan and Jianguo Xiao. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In Proceedings of COLING, 2008.

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin, and Craig Nevill-Manning. KEA: Practical automatic keyphrase extraction. In Proceedings of ACM DL, 1999, pp. 254-256.

Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. Term based clustering and summarization of web page collections. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, 2004.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 17-22, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Verb Noun Construction MWE Token Supervised Classification
Mona T. Diab
Center for Computational Learning Systems
Columbia University
mdiab@ccls.columbia.edu

Pravin Bhutada
Computer Science Department
Columbia University
pb2351@columbia.edu
Abstract
We address the problem of classifying multiword expression tokens in running text. We focus our study on Verb-Noun Constructions (VNC) that vary in their idiomaticity depending on context. VNC tokens are classified as either idiomatic or literal. We present a supervised learning approach to the problem. We experiment with different features. Our approach yields the best results to date on MWE classification: combining different linguistically motivated features, the overall performance yields an F-measure of 84.58%, corresponding to an F-measure of 89.96% for idiomaticity identification and classification and 62.03% for literal identification and classification.
1 Introduction
In the literature, in general, a multiword expression (MWE) refers to a multiword unit or a collocation of words that co-occur together statistically more than chance. MWE is a cover term for different types of collocations, which vary in their transparency and fixedness. MWEs are pervasive in natural language, especially in web-based texts and speech genres. Identifying MWEs and understanding their meaning is essential to language understanding; hence they are of crucial importance for any Natural Language Processing (NLP) application that aims at handling robust language meaning and use. In fact, the seminal paper (Sag et al., 2002) refers to this problem as a key issue for the development of high-quality NLP applications.
For our purposes, a MWE is defined as a collocation of words that refers to a single concept, for example kick the bucket, spill the beans, make a decision, etc. An MWE typically has an idiosyncratic meaning that is more than, or different from, the meaning of its component words. An MWE's meaning is transparent, i.e. predictable, in as much as the component words in the expression relay the meaning portended by the speaker compositionally. Accordingly, MWEs vary in their degree of meaning compositionality; compositionality is correlated with the level of idiomaticity. An MWE is compositional if the meaning of the MWE as a unit can be predicted from the meaning of its component words, such as in make a decision, meaning to decide. If we conceive of idiomaticity as a continuum, the more idiomatic an expression, the less transparent and the more non-compositional it is. Some MWEs are more predictable than others; for instance, kick the bucket, when used idiomatically to mean to die, has nothing in common with the literal meaning of either kick or bucket, while make a decision is very clearly related to to decide. Both of these expressions are considered MWEs, but they have varying degrees of compositionality and predictability. Both belong to a class of idiomatic MWEs known as verb noun constructions (VNC). The first VNC, kick the bucket, is a non-decomposable VNC MWE; the latter, make a decision, is a decomposable VNC MWE. These types of constructions are the object of our study.
To date, most research has addressed the problem of MWE type classification for VNC expressions in English (Melamed, 1997; Lin, 1999; Baldwin et al., 2003; Villada Moiron and Tiedemann, 2006; Fazly and Stevenson, 2007; Van de Cruys and Villada Moiron, 2007; McCarthy et al., 2007), not token classification. For example, he spilt the beans on the kitchen counter is most likely a literal usage. This is given away by the use of the prepositional phrase on the kitchen counter, as it is plausible that beans could have literally been spilt on a location such as a kitchen counter. Most previous research would classify spilt the beans as idiomatic irrespective of contextual usage. In a recent study by (Cook et al., 2008) of 53 idiom MWE types used in different contexts, the authors concluded that almost half of them had a clear literal meaning and over 40% of their usages in text were actually literal. Thus it would be important for an NLP application such as machine translation, for example, when given a new VNC MWE token, to be able to determine whether it is used idiomatically or not, as misclassification could potentially have detrimental effects on the quality of the translation.
In this paper, we address the problem of MWE classification for verb-noun (VNC) token constructions in running text. We investigate the binary classification of an unseen VNC token expression as being either Idiomatic (IDM) or Literal (LIT). An IDM expression is certainly an MWE; however, the converse is not necessarily true. To date, most approaches to the problem of idiomaticity classification on the token level have been unsupervised (Birke and Sarkar, 2006; Diab and Krishna, 2009b; Diab and Krishna, 2009a; Sporleder and Li, 2009). In this study, we carry out a supervised learning investigation using support vector machines that uses some of the features which have been shown to help in unsupervised approaches to the problem.
This paper is organized as follows: In Section 2, we describe our understanding of the various classes of MWEs in general. Section 3 is a summary of previous related research. Section 4 describes our approach. In Section 5, we present the details of our experiments. We discuss the results in Section 6. Finally, we conclude in Section 7.
2 Multi-word Expressions
MWEs are typically not productive, though they allow for inflectional variation (Sag et al., 2002). They have been conventionalized due to persistent use. MWEs can be classified based on their semantic types as follows. Idiomatic: This category includes expressions that are semantically non-compositional: fixed expressions such as kingdom come and ad hoc, and non-fixed expressions such as break new ground and speak of the devil. The VNCs which we are focusing on in this paper fall into this category. Semi-idiomatic: This class includes expressions that seem semantically non-compositional, yet their semantics are more or less transparent. This category consists of Light Verb Constructions (LVC) such as make a living and Verb Particle Constructions (VPC) such as write-up and call-up. Non-Idiomatic: This category includes expressions that are semantically compositional, such as prime minister, proper nouns such as New York Yankees, and collocations such as machine translation. These expressions are statistically idiosyncratic. For instance, traffic light is the most likely lexicalization of the concept and would occur more often in text than, say, traffic regulator or vehicle light.
3 Related Work
Several researchers have addressed the problem of MWE classification (Baldwin et al., 2003; Katz and Giesbrecht, 2006; Schone and Jurafsky, 2001; Hashimoto et al., 2006; Hashimoto and Kawahara, 2008). The majority of the proposed research has used unsupervised approaches and has addressed the problem of MWE type classification irrespective of usage in context (Fazly and Stevenson, 2007; Cook et al., 2007). We are aware of two supervised approaches to the problem: work by (Katz and Giesbrecht, 2006) and work by (Hashimoto and Kawahara, 2008).

In Katz and Giesbrecht (2006) (KG06), the authors carried out a vector similarity comparison between the context of an MWE and that of the constituent words, using LSA, to determine if the expression is idiomatic or not. KG06 is similar in intuition to work proposed by (Fazly and Stevenson, 2007); however, the latter work was unsupervised. KG06 experimented with a tiny data set of only 108 sentences corresponding to one idiomatic MWE expression.

Hashimoto and Kawahara (2008) (HK08) is the first large-scale study, to our knowledge, that addressed token classification into idiomatic versus literal for Japanese MWEs of all types. They apply a supervised learning framework using support vector machines, based on TinySVM with a quadratic kernel. They annotate a web-based corpus for training data. They identify 101 idiom types, each with a corresponding 1000 examples; hence they had a corpus of 102K annotated sentences for their experiments. They experiment with only the 90 idiom types for which they had more than 50 examples. They use two types of features: word sense disambiguation (WSD) features and idiom features. The WSD features comprised some basic syntactic features, such as POS, lemma information, and token n-gram features, in addition to hypernymy information on words as well as domain information. The idiom features were mostly inflectional features, such as voice, negativity, and modality, in addition to adjacency and adnominal features. They report results in terms of accuracy and rate of error reduction. Their overall accuracy is 89.25% using all the features.
4 Our Approach

We apply a supervised learning framework to the problem of both identifying and classifying a MWE expression token in context. We specifically focus on VNC MWE expressions. We use the data annotated by (Cook et al., 2008). We adopt a chunking approach to the problem, using an Inside Outside Beginning (IOB) tagging framework for identifying MWE VNC tokens and classifying them as idiomatic or literal in context. For chunk tagging, we use the YamCha sequence labeling system1. YamCha is based on Support Vector Machines technology using degree-2 polynomial kernels.
We label each sentence with standard IOB tags. Since this is a binary classification task, we have 5 different tags: B-L (beginning of a literal chunk), I-L (inside a literal chunk), B-I (beginning of an idiomatic chunk), I-I (inside an idiomatic chunk), and O (outside a chunk). As an example, a sentence such as John kicked the bucket last Friday will be annotated as follows: John O, kicked B-I, the I-I, bucket I-I, last O, Friday O. We experiment with some basic features and some more linguistically motivated ones.
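The labeling scheme above can be sketched as follows (our own illustration, not the paper's code): given a known MWE span and its class, each token receives one of the five tags.

```python
# Sketch of the five-tag IOB labeling: class "I" for idiomatic chunks,
# "L" for literal chunks, "O" for tokens outside any chunk.

def iob_tags(tokens, mwe_start, mwe_end, mwe_class):
    """mwe_start/mwe_end are inclusive token indices; mwe_class is 'I' or 'L'."""
    tags = []
    for i, _ in enumerate(tokens):
        if i == mwe_start:
            tags.append("B-" + mwe_class)
        elif mwe_start < i <= mwe_end:
            tags.append("I-" + mwe_class)
        else:
            tags.append("O")
    return tags

sent = ["John", "kicked", "the", "bucket", "last", "Friday"]
print(iob_tags(sent, 1, 3, "I"))
# ['O', 'B-I', 'I-I', 'I-I', 'O', 'O']
```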
We experiment with different window sizes for context, ranging from ±1 to ±5 tokens before and after the token of interest. We also employ linguistic features such as character n-gram features, namely the last 3 characters of a token, as a means of indirectly capturing the word's inflectional and derivational morphology (NGRAM). Other features include Part-of-Speech (POS) tags, lemma form (LEMMA), or the citation form of the word, and named entity (NE) information. The latter feature has been shown to help in the unsupervised setting in recent work (Diab and Krishna, 2009b; Diab and Krishna, 2009a). In general, all the linguistic features are represented as separate feature sets explicitly modeled in the input data. Hence, if we are modeling the POS tag feature for our running example, the training data would be annotated as follows: John NN O, kicked VBD B-I, the Det I-I, bucket NN I-I, last ADV O, Friday NN O. Likewise, adding the NGRAM feature would be represented as follows: John NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV ast O, Friday NN day O, and so on.
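The column-based representation above can be generated with a short sketch (our own, not the authors' code): each token row carries the word, its POS tag, its last three characters (the NGRAM feature), and the IOB label, mirroring the column format consumed by a sequence labeler.

```python
# Sketch of building per-token feature columns: word, POS, last-3-character
# suffix (a cheap proxy for inflectional/derivational morphology), and label.

def feature_rows(tokens, pos_tags, labels):
    rows = []
    for word, pos, label in zip(tokens, pos_tags, labels):
        ngram = word[-3:]  # last three characters of the token
        rows.append((word, pos, ngram, label))
    return rows

rows = feature_rows(["John", "kicked", "the"],
                    ["NN", "VBD", "Det"],
                    ["O", "B-I", "I-I"])
print(rows[1])  # ('kicked', 'VBD', 'ked', 'B-I')
```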
For the NE feature, we followed the same representation as the other features, as a separate column as expressed above, referred to as Named Entity Separate (NES). For named entity recognition (NER), we use the BBN IdentiFinder software, which identifies 19 NE tags2. We have two settings for NES: one with the full 19 tags explicitly identified (NES-Full), and the other where we have a binary feature indicating whether a word is a NE or not (NES-Bin). Moreover, we added another experimental condition where we changed the words' representation in the input to their NE class, Named Entity InText (NEI). For example, for the NEI condition, our running example is represented as follows: PER NN ohn O, kicked VBD ked B-I, the Det the I-I, bucket NN ket I-I, last ADV
1 http://www.tado-chasen.com/yamcha
2 http://www.bbn.com/identifinder
ast O, DAY NN day O, where John is replaced by the NE "PER".
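The NEI condition can be sketched as a simple substitution step (our own illustration; the real NE labels come from BBN IdentiFinder): tokens recognized as named entities are replaced in the input by their NE class, collapsing many surface forms into a few classes, which acts as a form of dimensionality reduction.

```python
# Sketch of the Named Entity InText (NEI) condition: replace NE tokens by
# their NE class. The ne_labels here are a toy stand-in for NER output.

def apply_nei(tokens, ne_labels):
    """ne_labels[i] is an NE class such as 'PER' or 'DAY', or None."""
    return [ne if ne else tok for tok, ne in zip(tokens, ne_labels)]

sent = ["John", "kicked", "the", "bucket", "last", "Friday"]
nes = ["PER", None, None, None, None, "DAY"]
print(apply_nei(sent, nes))
# ['PER', 'kicked', 'the', 'bucket', 'last', 'DAY']
```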
5 Experiments and Results
5.1 Data
We use the manually annotated standard data set identified in (Cook et al., 2008). This data comprises 2920 unique VNC-token expressions drawn from the entire British National Corpus (BNC)3. The BNC contains 100M words of multiple genres, including written text and transcribed speech. In this set, VNC token expressions are manually annotated as idiomatic, literal, or unknown. We exclude those annotated as unknown and those pertaining to the speech part of the data, leaving us with a total of 2432 sentences corresponding to 53 VNC MWE types. This data has 2571 annotations4, corresponding to 2020 idiomatic tokens and 551 literal ones. Since the data set is relatively small, we carry out 5-fold cross validation experiments. The results we report are averaged over the 5 folds per condition. We split the data into 80% for training, 10% for testing, and 10% for development. The data used is the tokenized version of the BNC.
5.2 Evaluation Metrics
We use Fβ=1 (F-measure), the harmonic mean between (P)recision and (R)ecall, as well as accuracy, to report the results5. We report the results separately for the two classes, IDM and LIT, averaged over the 5 folds of the TEST data set.
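Written out, the metric is the standard harmonic mean of precision and recall; the sketch below (ours, for illustration) also shows the general Fβ form that the β=1 case specializes.

```python
# F_beta: the beta=1 case is the harmonic mean of precision and recall.

def f_measure(precision, recall, beta=1.0):
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# For example, combining an idiomatic precision of 91.35 and recall of 88.42:
print(round(f_measure(91.35, 88.42), 2))  # 89.86
```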
5.3 Results
We present the results for the different feature sets and their combinations. We also present results for a simple most-frequent-tag baseline (FREQ), as well as a baseline of using no features, just the tokenized words (TOK). The baseline basically tags all identified VNC tokens in the data set as idiomatic. It is worth noting that the baseline has the advantage of gold identification of MWE VNC token expressions. In our experimental conditions, identification of a potential VNC MWE is part of what is discovered automatically; hence our system is penalized for identifying other VNC MWE
3 http://www.natcorp.ox.ac.uk
4 A sentence can have more than one MWE expression; hence the number of annotations exceeds the number of sentences.
5 We do not think that accuracy should be reported in general, since it is an inflated result: it is not a measure of error, and all words identified as O factor into the accuracy, which results in exaggerated values. We report it only since it is the metric used by previous work.
tokens that are not in the original data set6.
In Table 2, we present the results yielded per feature and per condition. We experimented with different context sizes initially to decide on the optimal window size for our learning framework; these results are presented in Table 1. Once that was determined, we proceeded to add features.
Noting that a window size of ±3 yields the best results, we proceed to use that as our context size for the following experimental conditions. We do not include accuracy, since it is above 96% for all our experimental conditions.
All the results yielded by our experiments outperform the baseline FREQ. The simple tokenized-words baseline (TOK), with no added features and a context size of ±3, shows a significant improvement over the very basic baseline FREQ, with an overall F-measure of 77.04%.
Adding lemma information, POS, or NGRAM features all independently contribute to a better solution; however, combining the three features yields a significant boost in performance over the TOK baseline of 2.67 absolute F points in overall performance.
Confirming previous observations in the literature, the overall best results are obtained by using NE features. The NEI condition yields slightly better results than the NES conditions when no other features are being used. NES-Full significantly outperforms NES-Bin when used alone, especially on literal classification, yielding the highest results on this class of phenomena across the board. However, when combined with other features, NES-Bin fares better than NES-Full, as we observe slightly lower performance when comparing NES-Full+L+N+P and NES-Bin+L+N+P.
Combining NEI+L+N+P yields the highest results, with an overall F-measure of 84.58%, a significant improvement over both baselines and over the condition that does not exploit NE features, L+N+P. Using NEI may be considered a form of dimensionality reduction, hence the significant contribution to performance.
6 Discussion
The overall results strongly suggest that using linguistically interesting features explicitly has a positive impact on performance. NE features help the most, and combining them with other features yields the best results. In general, performance on the classification and identification of idiomatic expressions yielded much better results. This may be due to the fact that the data has many more idiomatic token examples for training. Also, we note that precision scores are significantly higher than recall scores, especially on literal token instance classification. This might be an indication that identifying when an MWE is used literally is a difficult task.

6 We could have easily identified all VNC syntactic configurations corresponding to verb object as potential MWE VNCs, assuming that they are literal by default. This would have boosted our literal baseline score; however, for this investigation we decided to work strictly with the gold standard data set exclusively.
We analyzed some of the errors yielded in our best condition, NEI+L+N+P. The biggest errors are a result of identifying other VNC constructions not annotated in the training and test data as VNC MWEs. However, we also see errors of confusing idiomatic cases with literal ones 23 times, and the opposite 4 times.
Some of the errors where the VNC should have been classified as literal but the system classified it as idiomatic are kick heel, find feet, and make top. Cases of idiomatic expressions erroneously classified as literal are for the MWE types hit the road, blow trumpet, blow whistle, and hit a wall.
The system is able to identify new VNC MWE constructions. For instance, in the sentence On the other hand, Pinkie seemed to have lost his head to a certain extent, perhaps some prospects of making his mark by bringing in something novel in the way of business, the first MWE, lost his head, is annotated in the training data; however, making his mark is newly identified as idiomatic in this context.
Also, in As the ball hit the post, the referee blew the whistle, where blew the whistle is a literal VNC in this context, the system identified hit the post as another literal VNC.
7 Conclusion
In this study, we explore a set of features that contribute to VNC token expression binary supervised classification. The use of NER significantly improves the performance of the system. Using NER as a means of dimensionality reduction yields the best results. We achieve state-of-the-art performance, with an overall F-measure of 84.58%. In the future, we will look at ways of adding more sophisticated syntactic and semantic features from WSD. Given that we were able to obtain more interesting VNC data automatically, we are currently looking into adding the new data to the annotated pool after manual checking.
      IDM-F   LIT-F   Overall F   Overall Acc
±1    77.93   48.57   71.78       96.22
±2    85.38   55.61   79.71       97.06
±3    86.99   55.68   81.25       96.93
±4    86.22   55.81   80.75       97.06
±5    83.38   50.00   77.63       96.61

Table 1: Results (in %) of varying context window size.
                  IDM-P   IDM-R   IDM-F   LIT-P   LIT-R   LIT-F   Overall F
FREQ              70.02   89.16   78.44    0       0       0      69.68
TOK               81.78   83.33   82.55   71.79   43.75   54.37   77.04
(L)EMMA           83.10   84.29   83.69   69.77   46.88   56.07   78.11
(N)GRAM           83.17   82.38   82.78   70.00   43.75   53.85   77.01
(P)OS             83.33   83.33   83.33   77.78   43.75   56.00   78.08
L+N+P             86.95   83.33   85.38   72.22   45.61   55.91   79.71
NES-Full          85.20   87.93   86.55   79.07   58.62   67.33   82.77
NES-Bin           84.97   82.41   83.67   73.49   52.59   61.31   79.15
NEI               89.92   85.18   87.48   81.33   52.59   63.87   82.82
NES-Full+L+N+P    89.89   84.92   87.34   76.32   50.00   60.42   81.99
NES-Bin+L+N+P     90.86   84.92   87.79   76.32   50.00   60.42   82.33
NEI+L+N+P         91.35   88.42   89.86   81.69   50.00   62.03   84.58

Table 2: Final results (in %) averaged over 5 folds of test data using different features and their combinations.
8 Acknowledgement
The first author was partially funded by the DARPA GALE and MADCAT projects. The authors would like to acknowledge the useful comments by two anonymous reviewers, which helped in making this publication more concise and better presented.
References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89-96, Morristown, NJ, USA.

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL, volume 6, pages 329-336.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41-48, Prague, Czech Republic, June. Association for Computational Linguistics.

Paul Cook, Afsaneh Fazly, and Suzanne Stevenson. 2008. The VNC-Tokens dataset. In Proceedings of the LREC Workshop on Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June.

Mona Diab and Madhav Krishna. 2009a. Handling sparsity for verb noun MWE token classification. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 96-103, Athens, Greece, March. Association for Computational Linguistics.

Mona Diab and Madhav Krishna. 2009b. Unsupervised classification for VNC multiword expressions tokens. In CICLing.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 9-16, Prague, Czech Republic, June. Association for Computational Linguistics.

Chikara Hashimoto and Daisuke Kawahara. 2008. Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 992-1001, Honolulu, Hawaii, October. Association for Computational Linguistics.

Chikara Hashimoto, Satoshi Sato, and Takehito Utsuro. 2006. Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 353-360, Sydney, Australia, July. Association for Computational Linguistics.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multiword expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12-19, Sydney, Australia, July. Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of ACL-99, pages 317-324, University of Maryland, College Park, Maryland, USA.

Diana McCarthy, Sriram Venkatapathy, and Aravind Joshi. 2007. Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 369-379, Prague, Czech Republic, June. Association for Computational Linguistics.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), pages 97-108, Providence, RI, USA, August.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL-06 Workshop on Multiword Expressions in a Multilingual Context, pages 33-40, Morristown, NJ, USA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, pages 1-15, London, UK. Springer-Verlag.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100-108, Pittsburgh, PA, USA.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 754-762. Association for Computational Linguistics.

Tim Van de Cruys and Begoña Villada Moirón. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25-32, Prague, Czech Republic, June. Association for Computational Linguistics.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 23-30, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Exploiting Translational Correspondences for Pattern-Independent MWE Identification

Sina Zarrieß
Department of Linguistics
University of Potsdam, Germany
sina@ling.uni-potsdam.de

Jonas Kuhn
Department of Linguistics
University of Potsdam, Germany
kuhn@ling.uni-potsdam.de
Abstract
Based on a study of verb translations in the Europarl corpus, we argue that a wide range of MWE patterns can be identified in translations that exhibit a correspondence between a single lexical item in the source language and a group of lexical items in the target language. We show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences. We propose an extraction method that combines word alignment with syntactic filters and is independent of the structural pattern of the translation.
1 Introduction
Parallel corpora have proved to be a valuable resource not only for statistical machine translation but also for crosslingual induction of morphological, syntactic and semantic analyses (Yarowsky et al 2001; Dyvik 2004). In this paper we propose an approach to the identification of multiword expressions (MWEs) that exploits translational correspondences in a parallel corpus. We will consider translations of the following type:
(1) Der  Rat      sollte  unsere  Position  berücksichtigen
    The  Council  should  our     position  consider

(2) The Council should take account of our position.
This sentence pair has been taken from the German-English section of the Europarl corpus (Koehn 2005). It exemplifies a translational correspondence between an English MWE, take account of, and a German simplex verb, berücksichtigen. In the following, we refer to such correspondences as one-to-many translations. Based on a study of verb translations in Europarl, we will explore to what extent one-to-many translations provide evidence for MWE realization in the target language. It will turn out that crosslingual correspondences realize a wide range of different linguistic patterns that are relevant for MWE identification, but that they pose problems for automatic word alignment. We propose an extraction method that combines distributional word alignment with syntactic filters, and we will show that these correspondences can be reliably detected on dependency-parsed, word-aligned sentences and are able to identify various MWE patterns.

In a monolingual setting, the task of MWE extraction is usually conceived of as a lexical association problem, where distributional measures model the syntactic and semantic idiosyncrasy exhibited by MWEs, e.g. (Pecina 2008). This approach generally involves two main steps: 1) the extraction of a candidate list of potential MWEs, often constrained by a particular target pattern of the detection method, like verb particle constructions (Baldwin and Villavicencio 2002) or verb PP combinations (Villada Moirón and Tiedemann 2006); 2) the ranking of this candidate list by an appropriate association measure.

The crosslingual MWE identification we present in this paper is a priori independent of any specific association measure or syntactic pattern. The translation scenario allows us to adopt a completely data-driven definition of what constitutes an MWE. Given a parallel corpus, we propose to consider those tokens in a target language as MWEs which correspond to a single lexical item in the source language. The intuition is that if a group of lexical items in one language can be realized as a single item in another language, it can be considered as some kind of lexically fixed entity. By this means, we will not approach the MWE identification problem by asking, for a given list of candidates, whether these are MWEs or not. Instead, we will ask, for a given list of lexical items in a source language, whether there exists a one-to-many translation for this item in a target language (and whether these
one-to-many translations correspond to MWEs). This strategy offers a straightforward solution to the interpretation problem: as the translation can be related to the meaning of the source item and to its other translations in the target language, the interpretation is independent of the expression's transparency. This solution has its limitations compared to other approaches, which need to automatically establish the degree of compositionality of a given MWE candidate. However, for many NLP applications, coarse-grained knowledge about the semantic relation between a wide range of MWEs and their corresponding atomic realizations is already very useful.

In this work we therefore focus on a general method of MWE identification that captures the various patterns of translational correspondence that can be found in parallel corpora. Our experiments, described in Section 3, show that one-to-many translations should be extracted from syntactic configurations rather than from unstructured sets of aligned words. This syntax-driven method is less dependent on the frequency distributions in a given corpus, and is based on the intuition that monolingual idiosyncrasies like the MWE realization of an entity are not likely to be mirrored in another language (see Section 4 for discussion).

Our goal in this paper is twofold. First, we want to investigate to what extent one-to-many translational correspondences can serve as an empirical basis for MWE identification. To this end, Section 2 presents a corpus-based study of the relation between one-to-many translations and MWEs that we carried out on a translation gold standard. Second, we investigate methods for the automatic detection of complex lexical correspondences in a given parallel corpus. Section 3 therefore evaluates automatic word alignments against our gold standard and gives a method for high-precision one-to-many translation detection that relies on syntactic filters in addition to word alignments.
2 Multiword Translations as MWEs
The idea of exploiting one-to-many translations for the identification of MWE candidates has not received much attention in the literature. Thus, it is not a priori clear what can be expected from translational correspondences with respect to MWE identification. To corroborate the intuitions introduced in the last section, we carried out a corpus-based study that aims to discover linguistic patterns exhibited by one-to-many translations.

Verb                1-1    1-n    n-1    n-n    No
anheben (v1)        53.5   21.2    9.2    1.6   32.5
bezwecken (v2)      16.7   51.3    0.6   31.3   15.0
riskieren (v3)      46.7   35.7    0.5    1.7   18.2
verschlimmern (v4)  30.2   21.5   28.6   44.5   27.5

Table 1: Proportions of types of translational correspondences (token-level) in our gold standard.

We constructed a gold standard covering all English translations of four German verb lemmas extracted from the Europarl corpus. These verbs subcategorize for a nominative subject and an accusative object, and are in the middle frequency layer (around 200 occurrences). We extracted all sentences in Europarl with occurrences of these lemmas, together with their automatic word alignments produced by GIZA++ (Och and Ney 2003). These alignments were manually corrected on the basis of the crosslingual word alignment guidelines developed by Graça et al (2008).

For each of the German source lemmas, our gold standard records four translation categories: one-to-one, one-to-many, many-to-one, and many-to-many translations. Table 1 shows the distribution of these categories for each verb. Strikingly, the four verbs show very different proportions concerning the types of their translational correspondences. Thus, while the German verb anheben (en. increase) seems to have a frequent parallel realization, the verbs bezwecken (en. intend to) or verschlimmern (en. aggravate) tend to be realized by more complex phrasal translations. In any case, the percentage of one-to-many translations is relatively high, which corroborates our hypothesis that parallel corpora constitute a very interesting resource for data-driven MWE discovery.
A closer look at the one-to-many translations reveals that they cover a wide spectrum of MWE phenomena traditionally considered in the literature, as well as constructions that one would usually not regard as MWEs. Below, we briefly illustrate the different classes of one-to-many translations we found in our gold standard.

Morphological variations: This type of one-to-many translation is mainly due to the non-parallel realization of tense. It is rather irrelevant from an MWE perspective, but easy to discover and filter automatically.
(3) Sie   verschlimmern  die  Übel
    They  aggravate      the  misfortunes

(4) Their action is aggravating the misfortunes.
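Such tense-driven variants can be filtered out with a simple lemma-based heuristic: discard a one-to-many type if, after lemmatization and removal of auxiliaries, only a single content lemma remains. The sketch below is our own illustration of this idea, not the authors' implementation; the toy lemma table and auxiliary list are placeholders for a real lemmatizer and a language-specific auxiliary inventory.

```python
# Toy lemma table standing in for a real lemmatizer.
LEMMA = {"is": "be", "has": "have", "aggravating": "aggravate",
         "aggravated": "aggravate"}
# Toy auxiliary inventory (assumed, language-specific in practice).
AUXILIARIES = {"be", "have", "will", "do"}

def is_morphological_variant(target_tokens):
    """True if the one-to-many type reduces to a single content lemma,
    i.e. the correspondence only reflects tense/aspect marking."""
    lemmas = {LEMMA.get(t, t) for t in target_tokens}
    content = lemmas - AUXILIARIES
    return len(content) == 1

# "is aggravating" reduces to the single lemma "aggravate" -> filtered;
# "take account of" keeps several content lemmas -> kept as MWE candidate.
```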
Verb particle combinations: A typical MWE pattern, treated for instance in (Baldwin and Villavicencio 2002). It further divides into transparent and non-transparent combinations; the latter is illustrated below.

(5) Der  Ausschuss  bezweckt  den  Institutionen  ein  politisches  Instrument  an  die  Hand  zu  geben
    The  committee  intends   the  institutions   a    political    instrument  at  the  hand  to  give

(6) The committee set out to equip the institutions with a political instrument.
Verb preposition combinations: While this class isn't discussed very often in the MWE literature, it can nevertheless be considered an idiosyncratic combination of lexical items; Sag et al (2002) propose an analysis within an MWE framework.

(7) Sie   werden  den  Treibhauseffekt    verschlimmern
    They  will    the  greenhouse effect  aggravate

(8) They will add to the greenhouse effect.
Light verb constructions (LVCs): This is the most frequent pattern in our gold standard. It actually subsumes various subpatterns, depending on whether the light verb's complement is realized as a noun, adjective or PP. Generally, LVCs are syntactically and semantically more flexible than other MWE types, such that our gold standard contains variants of LVCs with similar, potentially modified adjectives or nouns, as in the example below. However, they can be considered idiosyncratic combinations since LVCs exhibit specific lexical restrictions (Sag et al 2002).

(9) Ich  werde  die  Sache  nur   noch  verschlimmern
    I    will   the  thing  only  just  aggravate

(10) I am just making things more difficult.
Idioms: This MWE type is probably the most discussed in the literature, due to its semantic and syntactic idiosyncrasy. It is not very frequent in our gold standard, which may be mainly due to the gold standard's limited size and the source items we chose.

(11) Sie   bezwecken  die  Umgestaltung  in    eine  zivile  Nation
     They  intend     the  conversion    into  a     civil   nation

(12) They have in mind the conversion into a civil nation.
         v1       v2       v3       v4
Ntype    22 (26)  41 (47)  26 (35)  17 (24)
V Part   22.7      4.9      0.0      0.0
V Prep   36.4     41.5      3.9      5.9
LVC      18.2     29.3     88.5     88.2
Idiom     0.0      2.4      0.0      0.0
Para     36.4     24.3     11.5     23.5

Table 2: Proportions of MWE types per lemma.
Paraphrases: From an MWE perspective, paraphrases are the most problematic and challenging type of translational correspondence in our gold standard. While the MWE literature typically discusses the distinction between collocations and MWEs, the borderline between paraphrases and MWEs is not really clear. On the one hand, paraphrases as we classified them here are transparent combinations of lexical items, as in the example below: ensure that something increases. However, semantically, these transparent combinations can also be rendered by an atomic expression: increase. A further problem raised by paraphrases is that they often involve translational shifts (Cyrus 2006). These shifts are hard to identify automatically and present a general challenge for semantic processing of parallel corpora. An example is given below.

(13) Wir  brauchen  bessere  Zusammenarbeit  um  die  Rückzahlungen    anzuheben
     We   need      better   cooperation     to  the  repayments(OBJ)  increase

(14) We need greater cooperation in this respect to ensure that repayments increase.
Table 2 displays the proportions of the MWE categories over the types of one-to-many correspondences in our gold standard. We filtered out the types due to morphological variation only (the overall number of types is indicated in brackets). Note that some types in our gold standard fall into several categories, e.g. they combine a verb preposition with a verb particle construction. For all of the verbs, the number of types belonging to core MWE categories largely outweighs the proportion of paraphrases. As we already observed in our analysis of the general translation categories, the different verb lemmas again show striking differences with respect to their realization in English translations. For instance, anheben (en. increase) and bezwecken (en. intend) are frequently translated with verb particle or preposition combinations, while the other verbs are much more often translated by means of LVCs. The more specific LVC patterns also differ largely among the verbs: while verschlimmern (en. aggravate) has many different adjectival LVC correspondences, the translations of riskieren (en. risk) are predominantly nominal LVCs. The fact that we found very few idioms in our gold standard may simply be related to our arbitrary choice of German source verbs, which do not have an English idiom realization (see our experiment on a random set of verbs in Section 3.3).

In general, one-to-many translational correspondences seem to provide very fruitful ground for the large-scale study of MWE phenomena. However, their reliable detection in parallel corpora is far from trivial, as we will show in the next section. Therefore, we will not further investigate the classification of MWE patterns in the rest of the paper, but concentrate on the high-precision detection of one-to-many translations. Such a pattern-independent identification method is crucial for the further data-driven study of one-to-many translations in parallel corpora.
3 Multiword Translation Detection
This section is devoted to the problem of high-precision detection of one-to-many translations. Section 3.1 describes an evaluation of automatic word alignments against our gold standard. In Section 3.2, we describe a method that extracts loosely aligned syntactic configurations, which yields much more promising results.
3.1 One-to-many Alignments
To illustrate the problem of purely distributional one-to-many alignment, Table 3 presents an evaluation of the automatic one-to-many word alignments produced by GIZA++, using the standard heuristics for bidirectional word alignment from phrase-based MT (Och and Ney 2003). We evaluate the rate of translational correspondences, at the type level, that the system discovers against the one-to-many translations in our gold standard. By type, we mean the set of lemmatized English tokens that makes up the translation of the German source lemma.

verb   n > 0          n > 1          n > 3
       FPR    FNR     FPR    FNR     FPR    FNR
v1     0.97   0.93    1.0    1.0     1.0    1.0
v2     0.93   0.9     0.5    0.96    0.0    0.98
v3     0.88   0.83    0.8    0.97    0.67   0.97
v4     0.98   0.92    0.8    0.92    0.34   0.92

Table 3: False positive rate and false negative rate of GIZA++ one-to-many alignments.

Generally, automatic word alignment yields a very high FPR if no frequency threshold is used. Increasing the threshold may help in some cases; however, the frequency of the translation types is so low that already at a threshold of 3 almost all types get filtered. This does not mean that the automatic word alignment does not discover any correct correspondences at all, but it means that detection of the exact set of tokens that corresponds to the source token is rare.
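Since both the system output and the gold standard reduce to sets of lemmatized translation types, the type-level FPR and FNR in Table 3 can be computed by simple set comparison. The following is an illustrative sketch under our own encoding choices (each type as a frozenset of lemmas, FPR as the fraction of wrong predicted types, FNR as the fraction of missed gold types); it is not the authors' evaluation script.

```python
def type_error_rates(predicted, gold):
    """FPR: fraction of predicted one-to-many types absent from the gold
    standard; FNR: fraction of gold types the system misses.
    Each type is a frozenset of lemmatized English tokens."""
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    fpr = fp / len(predicted) if predicted else 0.0
    fnr = fn / len(gold) if gold else 0.0
    return fpr, fnr

# One of two predictions is wrong, one of two gold types is missed.
pred = {frozenset({"take", "account", "of"}), frozenset({"consider"})}
gold = {frozenset({"take", "account", "of"}), frozenset({"heed"})}
```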
This low precision of one-to-many alignments isn't very surprising. Many types of MWEs consist of items that contribute most of the lexical semantic content, while the other items belong to the class of semantically almost "empty" items (e.g. particles, light verbs). These semantically "light" items have a distribution that doesn't necessarily correlate with the source item. For instance, in the following sentence pair taken from Europarl, GIZA++ was not able to capture the correspondence between the German main verb behindern (en. impede) and the LVC constitute an obstacle to, but only finds an alignment link between the verb and the noun obstacle.

(15) Die  Korruption  behindert  die  Entwicklung
     The  corruption  impedes    the  development

(16) Corruption constitutes an obstacle to development.
Another limitation of the word-alignment models is that they are independent of whether the sentences are largely parallel or rather free translations. However, parallel corpora like Europarl are known to contain a very large number of free translations, in which direct lexical correspondences are much more unlikely to be found.
3.2 Aligning Syntactic Configurations
High-precision detection of one-to-many translations thus involves two major problems: 1) How do we identify sentences or configurations in which reliable lexical correspondences can be found? 2) How do we align target items that have a low occurrence correlation?
We argue that both of these problems can be addressed by taking syntactic information into account. As an example, consider the pair of parallel configurations in Figure 1 for the sentence pair given in (15) and (16). Although there is no strict one-to-one alignment for the German verb, the basic predicate-argument structure is parallel: the verbs' arguments directly correspond to each other and are all dominated by a verbal root node.

Based on these intuitions, we propose a generate-and-filter strategy for our one-to-many translation detection which extracts partial, largely parallel dependency configurations. By allowing target dependency paths to be aligned to single source dependency relations, we admit configurations where the source item is translated by more than one word. For instance, given the configuration in Figure 1, we allow the German verb to be aligned to the path connecting constitute and the argument Y2.

Our one-to-many translation detection consists of the following steps: a) candidate generation of aligned syntactic configurations; b) filtering of the configurations; c) alignment post-editing, i.e. assembling the target tokens corresponding to the source item. The following paragraphs briefly characterize these steps.
[Figure 1: Example of a typical syntactic MWE configuration]
Data: We word-aligned the German and English portions of the Europarl corpus by means of the GIZA++ tool. Both portions were assigned flat syntactic dependency analyses by means of the MaltParser (Nivre et al 2006), such that we obtain a parallel resource of word-aligned dependency parses. Each sentence pair in our resource can be represented by the triple (DG, DE, AGE): DG is the set of dependency triples (s1, rel, s2) such that s2 is a dependent of s1 of type rel, and s1, s2 are words of the source language; DE is the set of dependency triples of the target sentence; AGE corresponds to the set of pairs (s1, t1) such that s1 and t1 are aligned.
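The triple representation can be made concrete as plain data structures. The sketch below is our own rendering of the data model just described, not the authors' code; the class and field names, and the positional encoding of tokens, are our assumptions.

```python
# Sketch of the word-aligned dependency parse representation (DG, DE, AGE).
# Dependency triples are (head, relation, dependent); tokens carry their
# sentence position so that repeated word forms stay distinct.
from typing import NamedTuple, Set, Tuple

Token = Tuple[int, str]          # (sentence position, word form)
Dep = Tuple[Token, str, Token]   # (head, relation type, dependent)

class AlignedSentencePair(NamedTuple):
    d_g: Set[Dep]                   # dependency triples, German sentence
    d_e: Set[Dep]                   # dependency triples, English sentence
    a_ge: Set[Tuple[Token, Token]]  # word alignment links (source, target)

# Example (15)/(16): "Die Korruption behindert die Entwicklung" /
# "Corruption constitutes an obstacle to development"
src = {((3, "behindert"), "SB", (2, "Korruption")),
       ((3, "behindert"), "OA", (5, "Entwicklung"))}
tgt = {((2, "constitutes"), "SBJ", (1, "Corruption")),
       ((2, "constitutes"), "OBJ", (4, "obstacle")),
       ((4, "obstacle"), "NMOD", (3, "an")),
       ((4, "obstacle"), "NMOD", (5, "to")),
       ((5, "to"), "PMOD", (6, "development"))}
links = {((2, "Korruption"), (1, "Corruption")),
         ((5, "Entwicklung"), (6, "development")),
         ((3, "behindert"), (4, "obstacle"))}

pair = AlignedSentencePair(src, tgt, links)
```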
Candidate Generation: This step generates a list of source configurations by searching for occurrences of the source lexical verb where it is linked to some syntactic dependents (e.g. its arguments). An example input would be the configuration ((verb, SB), (verb, OA)) for our German verbs.
Filtering: Given our source candidates, a valid parallel configuration (DG, DE, AGE) is then defined by the following conditions:
1. The source configuration DG is the set of tuples (s1, rel, sn), where s1 is our source item and sn some dependent.
2. For each sn in DG, there is a tuple (sn, tn) in AGE, i.e. every dependent has an alignment.
3. There is a target item t1 in DE such that for each tn there is a path p ⊆ DE, p = (t1, rel, tx), (tx, rel, ty), ..., (tz, rel, tn), that connects t1 and tn. Thus, the target dependents have a common root.

To filter out noise due to parsing or alignment errors, we further introduce a filter on the length of the path that connects the target root and its dependents, and we exclude paths that cross sentence boundaries. Moreover, the above candidate filtering doesn't exclude configurations which exhibit paraphrases involving head-switching or complex coordination. Head-switching can be detected with the help of alignment information: if there is an item in our target configuration that has a reliable alignment with an item not contained in our source configuration, our target configuration is likely to contain such a structural paraphrase and is excluded from our candidate set. Coordination can be discarded by imposing the condition that the configuration must not contain a coordination relation. This generate-and-filter strategy now extracts a set of sentences where we are likely to find a good one-to-one or one-to-many translation for the source verb.
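The three conditions, together with the path-length filter, amount to a reachability check over the target parse. The sketch below is our own reading of the procedure (the function name, helper structure, and recursion bound are our choices, not the authors' code), operating on (head, relation, dependent) triples as defined above.

```python
from collections import defaultdict

def find_target_root(d_g, d_e, a_ge, source_verb, max_path_len=3):
    """Return a target root t1 from which the target images of all the
    source verb's dependents are reachable, or None if the candidate
    configuration fails conditions 1-3 or the path-length filter."""
    # Condition 1: collect the dependents of the source verb in DG.
    src_deps = [dep for head, rel, dep in d_g if head == source_verb]
    # Condition 2: every dependent must have an alignment link.
    align = dict(a_ge)
    if not src_deps or any(d not in align for d in src_deps):
        return None
    targets = [align[d] for d in src_deps]
    # Child adjacency of the target parse.
    children = defaultdict(set)
    for head, rel, dep in d_e:
        children[head].add(dep)

    def reachable(root, node, limit):
        kids = children.get(root, set())
        if node in kids:
            return True
        if limit == 0:
            return False
        return any(reachable(c, node, limit - 1) for c in kids)

    # Condition 3: some target item t1 dominates all aligned dependents
    # within the path-length bound.
    for t1 in list(children):
        if all(reachable(t1, t, max_path_len) for t in targets):
            return t1
    return None
```

On the configuration of example (15)/(16), the English verb would be returned as the common root dominating both aligned arguments, while a pair without a full set of alignments is filtered out.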
Alignment Post-editing: In the final alignment step, one now needs to figure out which lexical material in the aligned syntactic configurations actually corresponds to the translation of the source item. The intuition discussed in 3.2 was that all the items lying on a path between the root item and the terminals belong to the translation of the source item. However, these items may have other syntactic dependents that may also be part of the one-to-many translation. As an example, consider the configuration in Figure 1, where the article an, which is part of the LVC create an obstacle to, has to be aligned to the German source verb.
Thus, for the set of items ti for which there is a dependency relation (tx, rel, ti) in DE such that tx is an element of our target configuration, we need to decide whether (s1, ti) ∈ AGE. This translation problem largely parallels collocation translation problems discussed in the literature, as in (Smadja and McKeown 1994). But crucially, our syntactic filtering strategy has substantially narrowed down the number of items that are possible parts of the one-to-many translation. Thus, a straightforward way to assemble the translational correspondence is to compute the correlation or association of the possibly missing items with the given translation pair, as proposed in (Smadja and McKeown 1994). We therefore propose the following alignment post-editing algorithm. Given the source item s1 and the set of target items T, where each ti ∈ T is an element of our target configuration:

1. Compute corr(s1, T), the correlation between s1 and T.

2. For each ti, tx such that there is a (ti, rel, tx) ∈ DE, compute corr(s1, T + tx).

3. If corr(s1, T + tx) ≥ corr(s1, T), add tx to T.
As the Dice coefficient is often reported to give the best results, e.g. in (Smadja and McKeown 1994), we also chose Dice as our correlation measure; in future work, we will experiment with other association measures. Our correlation scores are thus defined by the formula

    corr(s1, T) = 2 * freq(s1 ∧ T) / (freq(s1) + freq(T))

We define freq(T) as the number of sentence pairs whose target sentence contains occurrences of all ti ∈ T, and freq(s1) accordingly. The joint frequency freq(s1 ∧ T) is the number of sentence pairs where s1 occurs in the source sentence and T in the target sentence.
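Steps 1-3 amount to a greedy expansion of the target set T guided by the Dice score. The following is a minimal sketch under our own assumptions about the corpus encoding (each sentence pair as a pair of lemma sets); the function names are ours and this is an illustration, not the authors' implementation.

```python
def dice(s1, t_set, corpus):
    """corr(s1, T): Dice coefficient between the source lemma s1 and the
    joint occurrence of all target lemmas in t_set. corpus is a list of
    (source_lemmas, target_lemmas) set pairs."""
    f_s = sum(1 for src, _ in corpus if s1 in src)
    f_t = sum(1 for _, tgt in corpus if t_set <= tgt)
    f_joint = sum(1 for src, tgt in corpus if s1 in src and t_set <= tgt)
    if f_s + f_t == 0:
        return 0.0
    return 2.0 * f_joint / (f_s + f_t)

def postedit(s1, t_set, candidate_deps, corpus):
    """Greedily add target dependents tx whenever corr(s1, T + tx) does
    not fall below corr(s1, T), as in steps 2-3 above."""
    t = set(t_set)
    for tx in candidate_deps:
        if dice(s1, t | {tx}, corpus) >= dice(s1, t, corpus):
            t.add(tx)
    return t
```

In a toy corpus where riskieren cooccurs with run ... the risk, the article the is absorbed into the translation while an unrelated dependent is rejected.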
The output translation can then be represented as a dependency configuration of the following kind: ((run, SBJ), (run, OBJ, risk), (risk, NMOD, the), (risk, NMOD, of), (of, PMOD)), which is the syntactic representation of the English MWE run the risk of.
3.3 Evaluation
Our translational approach to MWE extraction has the advantage that evaluation is not exclusively bound to manual judgement of candidate lists. Instead, we can first evaluate the system output against translation gold standards, which are easier to obtain. The linguistic classification of the candidates according to their compositionality can then be treated as a separate problem.

We present two experiments in this evaluation section. We first evaluate the translation detection on our gold standard to assess the general quality of the extraction method. Since this gold standard is too small to draw conclusions about the quality of the MWE patterns the system detects, we further evaluate the translational correspondences for a larger set of verbs.

Translation evaluation: In the first experiment, we extracted all types of translational correspondences for the verbs we annotated in the gold standard. We converted the output dependency configurations to the lemmatized bag-of-words form we already applied for the alignment evaluation, and calculated the FPR and FNR of the translation types. The evaluation is displayed in Table 4. Nearly all translation types that our system detected are correct. This confirms our hypothesis that syntactic filtering yields more reliable translations than just cooccurrence-based alignments. However, the false negative rate is also very high. This low recall is due to the fact that our syntactic filters are very restrictive, such that a major part of the occurrences of a source lemma don't figure in the prototypical syntactic configuration. Columns two and three of the evaluation table present the FPR and FNR for experiments with a relaxed syntactic filter that doesn't constrain the syntactic type of the parallel argument relations. While the FNR does not decrease substantially, the FPR increases significantly. This means that the strict syntactic filters mainly fire on noisy configurations and don't decrease the recall. A manual error analysis has also shown that the relatively flat annotation scheme of our dependency parses significantly narrows down the number of candidate configurations that our algorithm detects. As the dependency parses don't provide deep analyses of tense or control phenomena, very often a verb's arguments don't figure as its syntactic dependents and no configuration is found. Future work will explore the impact of deep syntactic analysis on the detection of translational correspondences.
MWE evaluation: In a second experiment, we evaluated the patterns of correspondence found by our extraction method for use in an MWE context. To this end, we selected 50 random verbs occurring in the Europarl corpus and extracted their respective translational correspondences. This set of 50 verbs yields a set of 1592 one-to-many types of translational correspondence. We filtered out the types which display only morphological variation, such that the set of potential MWE types comprises 1302 types. Out of these, we evaluated a random sample of 300 types by labelling the types with the MWE categories we established for the analysis of our gold standard. During the classification, we encountered a further category of one-to-many correspondence which cannot be considered an MWE: the category of alternation. For instance, we found a translational correspondence between the active realization of the German verb begrüßen (en. appreciate) and the English passive be pleased by.

The classification is displayed in Table 5. Almost 83% of the translational correspondences that our system extracted are perfect translation types, and almost 60% of the extracted types can be considered MWEs that exhibit some kind of semantic idiosyncrasy. The other translations could be classified as paraphrases or alternations. In our random sample, the proportion of idioms is significantly higher than in our gold standard, which confirms our intuition that the MWE patterns of the one-to-many translations for a given verb are related to language-specific semantic properties of the verbs and the lexical concepts they realize.
4 Related Work
The problem sketched in this paper has clear connections to statistical MT. So-called phrase-based translation models generally target whole-sentence alignment and do not necessarily recur to linguistically motivated phrase correspondences (Koehn et al 2003). Syntax-based translation that specifies formal relations between bilingual parses was established by (Wu 1997). Our use of syntactic configurations can be seen as a heuristic to check relaxed structural parallelism.

      Strict Filter    Relaxed Filter
      FPR    FNR       FPR    FNR
v1    0.0    0.96      0.5    0.96
v2    0.25   0.88      0.47   0.79
v3    0.25   0.74      0.56   0.63
v4    0.0    0.875     0.56   0.84

Table 4: False positive and false negative rate of one-to-many translations.

Trans. type    MWE type   Proportion
MWEs                      57.5
               V Part      8.2
               V Prep     51.8
               LVC        32.4
               Idiom      10.6
Paraphrases               24.4
Alternations               1.0
Noise                     17.1

Table 5: Classification of 300 types sampled from the set of one-to-many translations for 50 verbs.
Work on MWEs in a crosslingual context hasalmost exclusively focussed on MWE translation(Smadja and McKeown 1994 Anastasiou 2008)In (Villada Moiron and Tiedemann 2006) the au-thors make use of alignment information in a par-allel corpus to rank MWE candidates These ap-proaches donrsquot rely on the lexical semantic knowl-edge about MWEs in form of one-to-many trans-lations
By contrast previous approaches to paraphraseextraction made more explicit use of crosslingualsemantic information In (Bannard and Callison-Burch 2005) the authors use the target languageas a pivot providing contextual features for iden-tifying semantically similar expressions Para-phrasing is however only partially comparable tothe crosslingual MWE detection we propose inthis paper Recently the very pronounced contextdependence of monolingual pairs of semanticallysimilar expressions has been recognized as a ma-jor challenge in modelling word meaning (Erk andPado 2009)
The idea that parallel corpora can be used as a linguistic resource that provides empirical evidence for monolingual idiosyncrasies has already
been exploited in, e.g., morphology projection (Yarowsky et al., 2001) or word sense disambiguation (Dyvik, 2004). While in a monolingual setting it is quite tricky to come up with theoretical or empirical definitions of sense discriminations, the crosslingual scenario offers a theory-neutral, data-driven solution. Since ambiguity is an idiosyncratic property of a lexical item in a given language, it is not likely to be mirrored in a target language. Similarly, our approach can also be seen as a projection idea: we project the semantic information of a simplex realization in a source language to an idiosyncratic multiword realization in the target language.
5 Conclusion
We have explored the phenomenon of one-to-many translations in parallel corpora from the perspective of MWE identification. Our manual study on a translation gold standard, as well as our experiments in automatic translation extraction, have shown that one-to-many correspondences provide a rich resource and fruitful basis of study for data-driven MWE identification. The crosslingual perspective raises new research questions about the identification and interpretation of MWEs. It challenges the distinction between paraphrases and MWEs, a problem that does not arise at all in the context of monolingual MWE extraction. It also allows for the study of the relation between the semantics of lexical concepts and their MWE realization. Further research in this direction should investigate translational correspondences on a larger scale and further explore these for the monolingual interpretation of MWEs.
Our extraction method, which is based on syntactic filters, identifies MWE types with much higher precision than purely cooccurrence-based word alignment, and captures the various patterns we found in our gold standard. Future work on the extraction method will have to focus on the generalization of these filters and the generalization to items other than verbs. The experiments presented in this paper also suggest that the MWE realization of certain lexical items in a target language is subject to certain linguistic patterns. Moreover, the method we propose is completely language-independent, such that further research has to study the impact of the relatedness of the considered languages on the patterns of one-to-many translational correspondences.
References

Dimitra Anastasiou. 2008. Identification of idioms by MT's hybrid research system vs. three commercial systems. In Proceedings of the EAMT, pp. 12–20.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of COLING-02, pp. 1–7.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 597–604.

Lea Cyrus. 2006. Building a resource for studying translation shifts. In Proceedings of the 5th LREC, pp. 1240–1245.

Helge Dyvik. 2004. Translations as semantic mirrors: From parallel corpus to WordNet. Language and Computers, 311–326.

Katrin Erk and Sebastian Pado. 2009. Paraphrase assessment in structured vector space: Exploring parameters and datasets. In Proceedings of the EACL GEMS Workshop, pp. 57–65.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Multi-language word alignments annotation guidelines. Technical Report 38, INESC-ID Lisboa.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL '03, pp. 48–54.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005, pp. 79–86.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC-2006, pp. 2216–2219.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Pavel Pecina. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC MWE 2008 Workshop, pp. 54–57.

Ivan A. Sag, Timothy Baldwin, Francis Bond, and Ann Copestake. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of CICLing-2002, pp. 1–15.

Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the HLT '94 Workshop, pp. 152–156.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL MWE 2006 Workshop, pp. 33–40.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT 2001, pp. 1–8.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 31–39, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
A Re-examination of Lexical Association Measures
Abstract
We review the lexical Association Measures (AMs) that have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. We also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English performed on the Penn Treebank, showing significant improvements over the original AMs.
1 Introduction
Recently, the NLP community has witnessed a renewed interest in the use of lexical association measures for extracting Multiword Expressions (MWEs). Lexical Association Measures (hereafter AMs) are mathematical formulas which can be used to capture the degree of connection or association between the constituents of a given phrase. Well-known AMs include Pointwise Mutual Information (PMI), Pearson's χ², and the Odds ratio. These AMs have been applied in many different fields of study, from information retrieval to hypothesis testing. In the context of MWE extraction, many published works have been devoted to comparing their effectiveness. Krenn and Evert (2001) evaluate Mutual Information (MI), Dice, Pearson's χ², the log-likelihood ratio, and the T score. In Pearce (2002), AMs such as the Z score, Pointwise MI, cost reduction, left and right context entropy, and the odds ratio are evaluated. Evert (2004) discussed a wide range of AMs, including exact hypothesis tests such as the binomial test and Fisher's exact test, and various coefficients such as Dice and Jaccard. Later, Ramisch et al. (2008) evaluated MI, Pearson's χ², and Permutation Entropy. Probably the most comprehensive evaluation of AMs was presented in Pecina and Schlesinger (2006), where 82 AMs were assembled and evaluated over Czech collocations. These collocations contained a mix of idiomatic expressions, technical terms, light verb constructions, and stock phrases. In their work, the best combination of AMs was selected using machine learning.
While these previous works have evaluated AMs, few details have been given on why the AMs perform as they do. Such an analysis is needed in order to explain their identification performance and to help us recommend AMs for future tasks. This weakness of previous works motivated us to address this issue. In this work, we contribute to a further understanding of association measures, using two different MWE extraction tasks to motivate and concretize our discussion. Our goal is to be able to predict a priori what types of AMs are likely to perform well for a particular MWE class.
We focus on the extraction of two common types of English MWEs that can be captured by a bigram model: Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs). VPCs consist of a verb and one or more particles, which can be prepositions (e.g., put on, bolster up), adjectives (cut short), or verbs (make do). For simplicity, we focus only on bigram VPCs that take prepositional particles, the most common class of VPCs. A special characteristic of VPCs that affects their extraction is the
Hung Huu Hoang, Dept. of Computer Science, National University of Singapore, hoanghuu@comp.nus.edu.sg

Su Nam Kim, Dept. of Computer Science and Software Engineering, University of Melbourne, snkim@csse.unimelb.edu.au

Min-Yen Kan, Dept. of Computer Science, National University of Singapore, kanmy@comp.nus.edu.sg
mobility of noun phrase complements in transitive VPCs. They can appear after the particle (Take off your hat) or between the verb and the particle (Take your hat off). However, a pronominal complement can only appear in the latter configuration (Take it off).
In comparison, LVCs comprise a verb and a complement, which is usually a noun phrase (make a presentation, give a demonstration). Their meanings come mostly from their complements, and as such, verbs in LVCs are termed semantically light, hence the name light verb. This explains why modifiers of LVCs modify the complement instead of the verb (make a serious mistake vs. make a mistake seriously). This phenomenon also shows that an LVC's constituents may not occur contiguously.
2 Classification of Association Measures
Although different AMs have different approaches to measuring association, we observed that they can effectively be classified into two broad classes. Class I AMs look at the degree of institutionalization, i.e., the extent to which the phrase is a semantic unit rather than a free combination of words. Some of the AMs in this class directly measure this association between constituents using various combinations of co-occurrence and marginal frequencies. Examples include MI, PMI, and their variants, as well as most of the association coefficients, such as Jaccard, Hamann, Brawn-Blanquet, and others. Other Class I AMs estimate a phrase's MWE-hood by judging the significance of the difference between observed and expected frequencies. These AMs include, among others, statistical hypothesis tests such as the T score, Z score, and Pearson's χ² test.
Class II AMs feature the use of context to measure non-compositionality, a peculiar characteristic of many types of MWEs, including VPCs and idioms. This is commonly done in one of the following two ways. First, non-compositionality can be modeled through the diversity of contexts, measured using entropy. The underlying assumption of this approach is that non-compositional phrases appear in a more restricted set of contexts than compositional ones. Second, non-compositionality can also be measured through context similarity between the phrase and its constituents. The observation here is that non-compositional phrases have different semantics from those of their constituents. It then follows that the contexts in which the phrase and its constituents appear would be different (Zhai, 1997). Some VPC examples include carry out and give up. A close approximation stipulates that the contexts of a non-compositional phrase's constituents are also different from each other. For instance, phrases such as hot dog and Dutch courage are comprised of constituents that have unrelated meanings. Metrics that are commonly used to compute context similarity include cosine and Dice similarity, distance metrics such as the Euclidean and Manhattan norms, and probability distribution measures such as Kullback-Leibler divergence and Jensen-Shannon divergence.
Table 1 lists all the AMs used in our discussion. The lower-left legend defines the variables a, b, c, and d with respect to the raw co-occurrence statistics observed in the corpus data. When an AM is introduced, it is prefixed with its index given in Table 1 (e.g., [M2] Mutual Information) for the reader's convenience.
3 Evaluation
We will first present how VPC and LVC candidates are extracted and used to form our evaluation data set. Second, we will discuss how the performance of AMs is measured in our experiments.
3.1 Evaluation Data
In this study, we employ the one-million-word Wall Street Journal (WSJ) section of the Penn Treebank. To create the evaluation data set, we first extract the VPC and LVC candidates from our corpus, as described below. We note here that the mobility property of both VPC and LVC constituents has been used in the extraction process.
For VPCs, we first identify particles using a pre-compiled set of 38 particles based on Baldwin (2005) and Quirk et al. (1985) (Appendix A). Here we do not use the WSJ particle tag, to avoid possible inconsistencies pointed out in Baldwin (2005). Next, we search to the left of the located particle for the nearest verb. As verbs and particles in transitive VPCs may not occur contiguously, we allow an intervening NP of up to 5 words, similar to Baldwin and Villavicencio (2002) and Smadja (1993), since longer NPs tend to be located after particles.
Extraction of LVCs is carried out in a similar fashion. First, occurrences of light verbs are located, based on the following set of seven frequently used English light verbs: do, get, give, have, make, put, and take. Next, we search to the right of the light verbs for the nearest noun,
M1   Joint Probability          f(xy) / N
M2   Mutual Information         (1/N) Σ_ij f_ij log(f_ij / f̂_ij)
M3   Log likelihood ratio       2 Σ_ij f_ij log(f_ij / f̂_ij)
M4   Pointwise MI (PMI)         log( P(xy) / (P(x*) P(*y)) )
M5   Local-PMI                  f(xy) × PMI
M6   PMI^k                      log( N f(xy)^k / (f(x*) f(*y)) )
M7   PMI²                       log( N f(xy)² / (f(x*) f(*y)) )
M8   Mutual Dependency          log( P(xy)² / (P(x*) P(*y)) )
M9   Driver-Kroeber             a / sqrt((a+b)(a+c))
M10  Normalized expectation     2a / (2a + b + c)
M11  Jaccard                    a / (a + b + c)
M12  First Kulczynski           a / (b + c)
M13  Second Sokal-Sneath        a / (a + 2(b + c))
M14  Third Sokal-Sneath         (a + d) / (b + c)
M15  Sokal-Michiner             (a + d) / (a + b + c + d)
M16  Rogers-Tanimoto            (a + d) / (a + 2b + 2c + d)
M17  Hamann                     ((a + d) − (b + c)) / (a + b + c + d)
M18  Odds ratio                 ad / bc
M19  Yule's ω                   (sqrt(ad) − sqrt(bc)) / (sqrt(ad) + sqrt(bc))
M20  Yule's Q                   (ad − bc) / (ad + bc)
M21  Brawn-Blanquet             a / max(a+b, a+c)
M22  Simpson                    a / min(a+b, a+c)
M23  S cost                     log(1 + min(b, c)/(a + 1))^(−1/2)
M24  Adjusted S cost*           log(1 + max(b, c)/(a + 1))^(−1/2)
M25  Laplace                    (a + 1) / (a + min(b, c) + 2)
M26  Adjusted Laplace*          (a + 1) / (a + max(b, c) + 2)
M27  Fager                      [M9] − (1/2) max(b, c)
M28  Adjusted Fager*            [M9] − max(b, c) / sqrt(aN)
M29  Normalized PMIs*           PMI / NF(α), PMI / NFmax
M30  Simplified normalized PMI for VPCs*   log(ad) / (α×b + (1−α)×c)
M31  Normalized MIs*            MI / NF(α), MI / NFmax

NF(α) = α·P(x*) + (1−α)·P(*y), with α ∈ [0, 1];  NFmax = max(P(x*), P(*y))

Contingency table of a bigram (x, y), recording co-occurrence and marginal frequencies: a = f11 = f(xy), b = f12 = f(x ȳ), c = f21 = f(x̄ y), d = f22 = f(x̄ ȳ); f(x*) and f(*y) are the marginal frequencies; x̄ stands for all words except x; N is the total number of bigrams. The expected frequency under the independence assumption is f̂(xy) = f(x*) f(*y) / N.

Table 1: Association measures discussed in this paper. Starred AMs (*) are developed in this work.
permitting a maximum of 4 intervening words to allow for quantifiers (a/an, the, many, etc.), adjectival and adverbial modifiers, etc. If this search fails to find a noun, as when LVCs are used in the passive (e.g., the presentation was made), we search to the left of the light verb, again allowing a maximum of 4 intervening words. The above extraction process produced a total of 8,652 VPC and 11,465 LVC candidates when run on the corpus. We then filter out candidates with observed frequencies of less than 6, as suggested in Pecina and Schlesinger (2006), to obtain a set of 1,629 VPCs and 1,254 LVCs.
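The windowed left-to-right search described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the particle set is truncated from the paper's 38-item list, and the POS-tagged sentence is invented:

```python
PARTICLES = {"up", "out", "off", "on"}   # stand-in for the full 38-particle set

def vpc_candidates(tagged, max_gap=5):
    """Scan a POS-tagged sentence for (verb, particle) pairs, allowing an
    intervening NP of up to `max_gap` tokens between verb and particle."""
    cands = []
    for i, (tok, pos) in enumerate(tagged):
        if tok in PARTICLES:
            # search left of the particle for the nearest verb in the window
            for j in range(i - 1, max(i - 2 - max_gap, -1), -1):
                if tagged[j][1].startswith("VB"):
                    cands.append((tagged[j][0], tok))
                    break
    return cands

sent = [("take", "VB"), ("your", "PRP$"), ("hat", "NN"), ("off", "RP")]
print(vpc_candidates(sent))   # [('take', 'off')]
```

The LVC search would mirror this, scanning rightward from a light verb for the nearest noun.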
Separately, we use the following two available sources of annotations: 3,078 VPC candidates extracted and annotated in Baldwin (2005), and 464 annotated LVC candidates used in Tan et al. (2006). Both sets of annotations give positive and negative examples.
Our final VPC and LVC evaluation datasets were then constructed by intersecting the gold-standard datasets with our corresponding sets of extracted candidates. We also concatenated both sets of evaluation data for composite evaluation; this set is referred to as "Mixed". Statistics of our three evaluation datasets are summarized in Table 2.

                     VPC data       LVC data    Mixed
Total (freq ≥ 6)     413            100         513
Positive instances   117 (28.33%)   28 (28%)    145 (28.27%)

Table 2: Evaluation data sizes (type count, not token).

While these datasets are small, our primary goal in this work is to establish initial comparable baselines and describe interesting phenomena that we plan to investigate over larger datasets in future work.
3.2 Evaluation Metric
To evaluate the performance of AMs, we can use the standard precision and recall measures, as in much past work. We note that the ranked list of candidates generated by an AM is often used as a classifier by setting a threshold. However, setting a threshold is problematic, and optimal threshold values vary for different AMs. Additionally, using the list of ranked candidates directly as a classifier does not consider the confidence indicated by the actual scores. Another way to avoid setting threshold values is to measure the precision and recall of only the n most likely candidates (the n-best method). However, as discussed in Evert and Krenn (2001), this method depends heavily on the choice of n. In this paper, we opt for average precision (AP), which is the average of the precisions at all possible recall values. This choice also makes our results comparable to those of Pecina and Schlesinger (2006).
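Average precision over a ranked candidate list reduces to averaging the precision measured at each annotated-positive item. A minimal Python sketch (the boolean lists are invented examples):

```python
def average_precision(ranked_labels):
    """AP = mean of the precision values taken at each positive item in the
    ranked candidate list (True = annotated MWE, ordered by descending score)."""
    hits, precisions = 0, []
    for i, is_pos in enumerate(ranked_labels, start=1):
        if is_pos:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

# A ranking that places both positives first scores a perfect 1.0:
print(average_precision([True, True, False, False]))   # 1.0
```

Because AP integrates over every recall point, it needs no threshold and no choice of n, which is exactly the property motivating its use here.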
3.3 Evaluation Results
Figure 1 (a, b) gives the two average precision profiles of the 82 AMs presented in Pecina and Schlesinger (2006), when we replicated their experiments over our English VPC and LVC datasets. We observe that the average precision profile for VPCs is slightly concave, while the one for LVCs is more convex. This can be interpreted as VPCs being more sensitive to the choice of AM than LVCs. Another point we observed is that a vast majority of Class I AMs, including PMI, its variants, and the association coefficients (excluding hypothesis tests), perform reasonably well in our application. In contrast, the performance of most context-based and hypothesis test AMs is very modest. Their mediocre performance indicates their inapplicability to our VPC and LVC tasks. In particular, the high frequencies of particles in VPCs and of light verbs in LVCs both undermine their contexts' discriminative power and skew the difference between the observed and expected frequencies that hypothesis tests rely on.
4 Rank Equivalence
We note that some AMs, although not mathematically equivalent (i.e., assigning identical scores to input candidates), produce the same lists of ranked candidates on our datasets. Hence, they achieve the same average precision. The ability to identify such groups of AMs is helpful in simplifying their formulas, which in turn assists in analyzing their meanings.

Definition: Association measures M1 and M2 are rank equivalent over a set C, denoted by M1 ≡r_C M2, if and only if M1(cj) > M1(ck) ⇔ M2(cj) > M2(ck) and M1(cj) = M1(ck) ⇔ M2(cj) = M2(ck), for all cj, ck ∈ C, where Mk(ci) denotes the score assigned to ci by the measure Mk. As a corollary, the following also holds for rank equivalent AMs:
Corollary: If M1 ≡r_C M2, then APC(M1) = APC(M2), where APC(Mi) stands for the average precision of the AM Mi over the data set C. Essentially, M1 and M2 are rank equivalent over a set C if their ranked lists of all candidates taken from C are the same, ignoring the actual calculated scores.¹ As an example, the three AMs Odds ratio, Yule's ω, and Yule's Q (Table 3, row 5), though not mathematically equivalent, can be shown to be rank equivalent. The five groups of rank equivalent AMs that we have found are listed in Table 3. This allows us to replace the 15 AMs below with their (most simple) representatives from each rank equivalent group.
¹ Two AMs may be rank equivalent with the exception of some candidates where one AM is undefined, due to a zero in the denominator, while the other AM is still well-defined. We call these cases weakly rank equivalent. With a reasonably large corpus, such candidates are rare for our VPC and LVC types; hence, we still consider such AM pairs to be rank equivalent.
1) [M2] Mutual Information, [M3] Log likelihood ratio
2) [M7] PMI², [M8] Mutual Dependency, [M9] Driver-Kroeber (aka Ochiai)
3) [M10] Normalized expectation, [M11] Jaccard, [M12] First Kulczynski, [M13] Second Sokal-Sneath (aka Anderberg)
4) [M14] Third Sokal-Sneath, [M15] Sokal-Michiner, [M16] Rogers-Tanimoto, [M17] Hamann
5) [M18] Odds ratio, [M19] Yule's ω, [M20] Yule's Q

Table 3: Five groups of rank equivalent AMs.
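The pairwise-ordering condition in the definition above is easy to check empirically. A minimal Python sketch (not the authors' code; the contingency tables are invented test cases) that verifies row 5 of Table 3 on a small candidate set:

```python
from itertools import combinations

def same_ranking(m1, m2, tables):
    """Rank equivalence over a finite candidate set C: every pair of
    candidates must be ordered (and tied) identically by both measures."""
    s1 = [m1(*t) for t in tables]
    s2 = [m2(*t) for t in tables]
    return all(
        (s1[i] > s1[j]) == (s2[i] > s2[j]) and (s1[i] == s1[j]) == (s2[i] == s2[j])
        for i, j in combinations(range(len(tables)), 2)
    )

odds  = lambda a, b, c, d: (a * d) / (b * c)                  # [M18]
yuleq = lambda a, b, c, d: (a * d - b * c) / (a * d + b * c)  # [M20]

tables = [(8, 2, 4, 86), (5, 5, 5, 85), (20, 1, 2, 77)]
print(same_ranking(odds, yuleq, tables))   # True
```

Yule's Q is the monotone transform (OR − 1)/(OR + 1) of the odds ratio, which is why the two can never disagree on an ordering.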
5 Examination of Association Measures
We highlight two important findings in our analysis of the AMs over our English datasets. Section 5.1 focuses on MI and PMI, and Section 5.2 discusses penalization terms.
5.1 Mutual Information and Pointwise Mutual Information
[Figure 1: Average precision (AP) performance of the 82 AMs from Pecina and Schlesinger (2006) on our English VPC (Figure 1a) and LVC (Figure 1b) datasets. Bold points indicate AMs discussed in this paper: • hypothesis test AMs; ◊ Class I AMs excluding hypothesis test AMs; + context-based AMs.]

In Figure 1, over the 82 AMs, PMI ranks 11th in identifying VPCs, while MI ranks 35th in
identifying LVCs. In this section, we show how their performance can be improved significantly.
Mutual Information (MI) measures the common information between two variables, or the reduction in uncertainty of one variable given knowledge of the other:

MI(U;V) = Σ_{u,v} p(uv) log( p(uv) / (p(u*) p(*v)) )

In the context of bigrams, the above formula can be simplified to:

[M2] MI = (1/N) Σ_ij f_ij log( f_ij / f̂_ij )

While MI holds between random variables, [M4] Pointwise MI (PMI) holds between specific values:

PMI(x, y) = log( P(xy) / (P(x*) P(*y)) ) = log( N f(xy) / (f(x*) f(*y)) )

It has long
been pointed out that PMI favors bigrams with low-frequency constituents, as evidenced by the product of the two marginal frequencies in its denominator. To reduce this bias, a common solution is to assign more weight to the co-occurrence frequency f(xy) in the numerator, by either raising it to some power k (Daille, 1994) or multiplying PMI by f(xy). Table 4 lists these adjusted versions of PMI and their performance over our datasets. We can see from Table 4 that the best performance of PMI^k is obtained at k values less than one, indicating that it is better to rely less on f(xy). Similarly, multiplying f(xy) directly into PMI reduces the performance of PMI. As such, assigning more weight to f(xy) does not improve the AP performance of PMI.

AM                  VPCs           LVCs           Mixed
Best [M6] PMI^k     54.7 (k=.13)   57.3 (k=.85)   54.4 (k=.32)
[M4] PMI            51.0           56.6           51.5
[M5] Local-PMI      25.9           39.3           27.2
[M1] Joint Prob.    17.0           28.0           17.5

Table 4: AP performance of PMI and its variants. Best k settings shown in parentheses.
Another shortcoming of (P)MI is that both grow not only with the degree of dependence but also with frequency (Manning and Schütze, 1999, p. 66). In particular, we can show that MI(X;Y) ≤ min(H(X), H(Y)), where H(·) denotes entropy, and PMI(x,y) ≤ min(−log P(x*), −log P(*y)). These two inequalities suggest that the allowed score ranges of different candidates vary, and consequently MI and PMI scores are not directly comparable. Furthermore, in the case of VPCs and LVCs, the differences among the score ranges of different candidates are large, due to the high frequencies of particles and light verbs. This has motivated us to normalize these scores before using them for comparison. We suggest MI and PMI be divided by one of the following two normalization factors: NF(α) = α·P(x*) + (1−α)·P(*y), with α ∈ [0, 1], and NFmax = max(P(x*), P(*y)). NF(α), being dependent on α, can be optimized by setting an appropriate α value, which is inevitably affected by the MWE type and the corpus statistics. On the other hand, NFmax is independent of α and is recommended when one needs to apply normalized (P)MI to a mixed set of different MWE types, or when sufficient data for parameter tuning is unavailable. As shown in Table 5, normalized MI and PMI show considerable improvements of up to 80%. Also, PMI and MI, after being normalized with NFmax, rank number one in the VPC and LVC task, respectively. If one re-writes MI as MI = (1/N) Σ_ij f_ij × PMI_ij, it is easy to see the heavy dependence of MI on direct frequencies compared with PMI, and this explains why normalization is a pressing need for MI.

AM             VPCs          LVCs          Mixed
MI / NF(α)     50.8 (α=.48)  58.3 (α=.47)  51.6 (α=.5)
MI / NFmax     50.8          58.4          51.8
[M2] MI        27.3          43.5          28.9
PMI / NF(α)    59.2 (α=.8)   55.4 (α=.48)  58.8 (α=.77)
PMI / NFmax    56.5          51.7          55.6
[M4] PMI       51.0          56.6          51.5

Table 5: AP performance of normalized (P)MI versus standard (P)MI. Best α settings shown in parentheses.
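The two normalization factors translate directly into code. A minimal Python sketch of [M29] (not the authors' implementation; the contingency counts are invented for illustration):

```python
import math

def pmi(a, b, c, d):
    """[M4] PMI = log( N * f(xy) / (f(x*) * f(*y)) )."""
    N = a + b + c + d
    return math.log(N * a / ((a + b) * (a + c)))

def norm_pmi_alpha(a, b, c, d, alpha=0.5):
    """[M29] PMI / NF(alpha), NF(alpha) = alpha*P(x*) + (1-alpha)*P(*y)."""
    N = a + b + c + d
    nf = alpha * (a + b) / N + (1 - alpha) * (a + c) / N
    return pmi(a, b, c, d) / nf

def norm_pmi_max(a, b, c, d):
    """[M29] PMI / NFmax, NFmax = max(P(x*), P(*y))."""
    N = a + b + c + d
    return pmi(a, b, c, d) / (max(a + b, a + c) / N)

# Both marginals are small probabilities, so normalization stretches the
# score range of low-marginal candidates far more than high-marginal ones:
print(round(norm_pmi_max(8, 2, 4, 86), 2))   # ≈ 15.81
```

Since NF(α) and NFmax are always below 1, normalization enlarges every score, but by a factor that shrinks as a constituent (e.g. a frequent particle) dominates the corpus, which is exactly the compensation the text argues for.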
5.2 Penalization Terms
It can be seen that, given equal co-occurrence frequencies, higher marginal frequencies reduce the likelihood of being an MWE. This motivates us to use marginal frequencies to synthesize penalization terms, which are formulae whose values are inversely proportional to the likelihood of being an MWE. We hypothesize that incorporating such penalization terms can improve the respective AMs' detection AP.

Take as an example the AMs [M21] Brawn-Blanquet (aka Minimum Sensitivity) and [M22] Simpson. These two AMs are identical except
for one difference in the denominator: Brawn-Blanquet uses max(b, c), while Simpson uses min(b, c). It is intuitive, and confirmed by our experiments, that penalizing against the more frequent constituent by choosing max(b, c) is more effective. This is further attested in the AMs [M23] S cost and [M25] Laplace, where we tried replacing the min(b, c) term with max(b, c). Table 6 shows the average precision on our datasets for all these AMs.
AM                      VPCs   LVCs   Mixed
[M21] Brawn-Blanquet    47.8   57.8   48.6
[M22] Simpson           24.9   38.2   26.0
[M24] Adjusted S cost   48.5   57.7   49.2
[M23] S cost            24.9   38.8   26.0
[M26] Adjusted Laplace  48.6   57.7   49.3
[M25] Laplace           24.1   38.8   25.4

Table 6: Replacing min() with max() in selected AMs.
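The min-versus-max contrast is easiest to see on a candidate with one very frequent constituent. A minimal Python sketch (invented counts; "think: a particle" is our illustrative reading, not the paper's data):

```python
def brawn_blanquet(a, b, c, d):
    """[M21] a / max(a+b, a+c) -- penalizes via the MORE frequent constituent."""
    return a / max(a + b, a + c)

def simpson(a, b, c, d):
    """[M22] a / min(a+b, a+c) -- penalizes via the LESS frequent constituent."""
    return a / min(a + b, a + c)

# Invented counts: the second constituent (think: a particle) is very
# frequent. The max() variant down-ranks the candidate sharply; min() barely
# reacts, since it only sees the rare verb.
frequent_particle = (10, 2, 500, 10000)
print(brawn_blanquet(*frequent_particle) < simpson(*frequent_particle))  # True
```

This asymmetry is precisely why the max() substitutions in Table 6 lift S cost and Laplace to roughly double their original AP.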
In the [M27] Fager AM, the penalization term max(b, c) is subtracted from the first term, which is none other than [M9], rank equivalent to [M7] PMI². In our application this AM is not good, since the second term is far larger than the first term, which is less than 1. As such, Fager is largely equivalent to just −½ max(b, c). In order to make use of the first term, we need to replace the constant ½ with a scaled-down version of max(b, c). Using the independence assumption, we have approximately derived √(aN) as a lower-bound estimate of max(b, c), producing [M28] Adjusted Fager. We can see from Table 7 that this adjustment improves Fager on both datasets.

AM                    VPCs   LVCs   Mixed
[M28] Adjusted Fager  56.4   54.3   55.4
[M27] Fager           55.2   43.9   52.5

Table 7: Performance of Fager and its adjusted version.
The next experiment involves [M14] Third Sokal-Sneath, which can be shown to be rank equivalent to −b − c. We further notice that the frequencies c of particles are normally much larger than the frequencies b of verbs. Thus, this AM runs the risk of ranking VPC candidates based only on the frequencies of particles. So it is necessary that we scale b and c properly, as in the variant −α×b − (1−α)×c. Having scaled the constituents properly, we still see that this variant by itself is not a good measure, as it uses only constituent frequencies and does not take into consideration the co-occurrence frequency of the two constituents. This has led us to experiment with [MR14] PMI / (α×b + (1−α)×c). The denominator of [MR14] is obtained by removing the minus signs from the scaled variant, so that it can be used as a penalization term. The choice of PMI for the numerator is due to the fact that the denominator of [MR14] is in essence similar to NF(α) = α·P(x*) + (1−α)·P(*y), which has been successfully used to divide PMI in the normalized PMI experiment. We heuristically simplified [MR14] to the following AM: [M30] log(ad) / (α×b + (1−α)×c). The α settings in Table 8 below are taken from the best α settings obtained in the experiment on normalized PMI (Table 5). It can be observed from Table 8 that [M30], while computationally simpler than normalized PMI, performs as well as normalized PMI and better than Third Sokal-Sneath over the VPC data set.

AM                               VPCs          LVCs          Mixed
PMI / NF(α)                      59.2 (α=.8)   55.4 (α=.48)  58.8 (α=.77)
[M30] log(ad)/(α×b + (1−α)×c)    60.0 (α=.8)   48.4 (α=.48)  58.8 (α=.77)
[M14] Third Sokal-Sneath         56.5          45.3          54.6

Table 8: AP performance of suggested VPC penalization terms and AMs.
With the same intention and method, we have found that, while the weighted sum of the marginal frequencies is a good penalization term for VPCs, the product of the marginal frequencies is more suitable for LVCs (rows 1 and 2, Table 9). As with the linear combination, the product bc should also be weighted accordingly, as b^α × c^(1−α). The best α value is again taken from the normalized PMI experiments (Table 5), and is nearly .5. Under this setting, this penalization term is exactly the denominator of [M18] Odds ratio. Table 9 below shows our experimental results in deriving the penalization term for LVCs.
AM                VPCs   LVCs   Mixed
−b − c            56.5   45.3   54.6
1/(b·c)           50.2   53.2   50.2
[M18] Odds ratio  44.3   56.7   45.6

Table 9: AP performance of suggested LVC penalization terms and AMs.
6 Conclusions
We have conducted an analysis of the 82 AMs assembled in Pecina and Schlesinger (2006) for the tasks of English VPC and LVC extraction over the Wall Street Journal Penn Treebank data. In our work, we have observed that AMs can be divided into two classes: ones that do not use context (Class I) and ones that do (Class II). We find that the latter is not suitable for our VPC and LVC detection tasks, as the size of our corpus is too small to rely on the frequency of candidates' contexts. This phenomenon also revealed the inappropriateness of hypothesis tests for our detection task. We have also introduced the novel notion of rank equivalence to MWE detection, in which we show that complex AMs may be replaced by simpler AMs that yield the same average precision performance.
We further observed that certain modifications to some AMs are necessary. First, in the context of ranking, we have proposed normalizing the scores produced by MI and PMI in cases where the distributions of the two events are markedly different, as is the case for light verbs and particles. While our claims are limited to the datasets analyzed, they show clear improvements: normalized PMI produces better performance over our mixed MWE dataset, yielding an average precision of 58.8% compared to 51.5% when using standard PMI, a significant improvement as judged by a paired t-test. Normalized MI also yields the best performance over our LVC dataset, with a significantly improved AP of 58.3%.
We also show that marginal frequencies can be used to form effective penalization terms. In particular, we find that α×b + (1−α)×c is a good penalization term for VPCs, while b^α × c^(1−α) is suitable for LVCs. Our introduced α tuning parameter should be set to properly scale the values b and c, and should be optimized per MWE type. In cases where a common factor is applied to different MWE types, max(b, c) is a better choice than min(b, c). In future work, we plan to expand our investigations over larger
web-based datasets of English to verify the performance gains of our modified AMs
Acknowledgement: This work was partially supported by a National Research Foundation grant, "Interactive Media Search" (grant R 252 000 325 279).
References

Baldwin, Timothy (2005). The deep lexical acquisition of English verb-particle constructions. Computer Speech and Language, Special Issue on Multiword Expressions, 19(4):398–414.
Baldwin, Timothy and Villavicencio, Aline (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 98–104, Taipei, Taiwan.
Daille, Béatrice (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
Evert, Stefan (2004). Online repository of association measures, http://www.collocations.de/, a companion to The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, University of Stuttgart.
Evert, Stefan and Krenn, Brigitte (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.
Katz, Graham and Giesbrecht, Eugenie (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 12–19, Sydney, Australia.
Krenn, Brigitte and Evert, Stefan (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations, pages 39–46, Toulouse, France.
Manning, Christopher D. and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
Pearce, Darren (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1530–1536, Las Palmas, Canary Islands.
Pecina, Pavel and Schlesinger, Pavel (2006). Combining association measures for collocation extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), pages 651–658, Sydney, Australia.
Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. Longman, London, UK.
Ramisch, Carlos, Schreiner, Paulo, Idiart, Marco and Villavicencio, Aline (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC-2008 Workshop on Multiword Expressions: Towards a Shared Task for Multiword Expressions, pages 50–53, Marrakech, Morocco.
Smadja, Frank (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
Tan, Y. Fan, Kan, Min-Yen and Cui, Hang (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word Expressions in a Multilingual Context, pages 49–56, Trento, Italy.
Zhai, Chengxiang (1997). Exploiting context to identify lexical atoms – A statistical view of linguistic context. In International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), pages 119–129, Rio de Janeiro, Brazil.
Appendix A. List of particles used in identifying verb particle constructions: about, aback, aboard, above, abroad, across, adrift, ahead, along, apart, around, aside, astray, away, back, backward, backwards, behind, by, down, forth, forward, forwards, in, into, off, on, out, over, past, round, through, to, together, under, up, upon, without.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 40–46, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Mining Complex Predicates In Hindi Using A Parallel Hindi-English Corpus
R. Mahesh K. Sinha
Department of Computer Science & Engineering, Indian Institute of Technology Kanpur
Kanpur 208016, India
rmk@iitk.ac.in
Abstract
A complex predicate is a noun, a verb, an adjective or an adverb followed by a light verb that behaves as a single unit of verb. Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family. Detecting and interpreting CPs constitute an important and somewhat difficult task. Linguistic and statistical methods have yielded limited success in mining this data. In this paper, we present a simple method for detecting CPs of all kinds using a Hindi-English parallel corpus. A CP is hypothesized by detecting the absence of the conventional meaning of the light verb in the aligned English sentence. This simple strategy exploits the fact that a CP is a multiword expression with a meaning distinct from the meaning of its light verb. Although there are several shortcomings in the methodology, this empirical method surprisingly yields mining of CPs with an average precision of 89% and a recall of 90%.
1 Introduction
Complex predicates (CPs) are abundantly used in Hindi and other languages of the Indo-Aryan family and have been widely studied (Hook, 1974; Abbi, 1992; Verma, 1993; Mohanan, 1994; Singh, 1994; Butt, 1995; Butt and Geuder, 2001; Butt and Ramchand, 2001; Butt et al., 2003). A complex predicate is a multiword expression (MWE) where a noun, a verb or an adjective is followed by a light verb (LV) and the MWE behaves as a single unit of verb. The general theory of complex predicates is discussed in Alsina (1996). These studies attempt to model the linguistic facts of complex predicate formation and the associated semantic roles.
CPs empower the language in its expressiveness but are hard to detect. Detection and interpretation of CPs are important for several natural language processing tasks such as machine translation, information retrieval, summarization, etc. A mere listing of the CPs constitutes a valuable linguistic resource for lexicographers, wordnet designers (Chakrabarti et al., 2007) and other NLP system designers. A computational method using a Hindi corpus has been used to mine CPs and categorize them based on statistical analysis (Sriram and Joshi, 2005), with limited success. Chakrabarti et al. (2008) present a method for automatic extraction of V+V CPs only from a corpus, based on linguistic features; they report an accuracy of about 98%. An attempt has also been made to use a parallel corpus for detecting CPs by projecting POS tags from English to Hindi (Soni, Mukerjee and Raina, 2006). It uses the GIZA++ word alignment tool to align the projected POS information. A success of 83% precision and 46% recall has been reported.
In this paper, we present a simple strategy for mining CPs in Hindi using projection of the meaning of the light verb in a parallel corpus. In the following section, the nature of CPs in Hindi is outlined; this is followed by the system design, experimentation and results.
2 Complex Predicates in Hindi
A CP in Hindi is a syntactic construction consisting of either a verb, a noun, an adjective or an adverb as the main predicator, followed by a light verb (LV). Thus a CP can be a noun+LV, an adjective+LV, a verb+LV or an adverb+LV. Further, it is also possible that a CP is followed by an LV (CP+LV). The light verb carries the tense and agreement morphology. In V+V CPs, the contribution of the light verb denotes aspectual terms such as continuity, perfectivity, inception, completion, or denotes an expression of forcefulness, suddenness, etc. (Singh, 1994; Butt, 1995). The CP in a sentence syntactically acts as a single lexical unit of verb that has a meaning distinct from that of the LV. CPs are also referred to as complex or compound verbs.
Given below are some examples.

(1) CP = noun+LV; noun = ashirwad (blessings), LV = denaa (to give)
usane mujhe ashirwad diyaa
उसने मुझे आशीर्वाद दिया
he me blessings gave
he blessed me

(2) No CP
usane mujhe ek pustak dii
उसने मुझे एक पुस्तक दी
he me one book gave
he gave me a book

In (1), the light verb diyaa (gave), in its past tense form, combines with the noun ashirwad (blessings) to make a complex predicate verb form ashirwad diyaa (blessed) in the past tense. The CP here is ashirwad denaa and its corresponding English translation is 'to bless'. On the other hand, in example (2), the verb dii (gave) is a simple verb in past tense form and is not a light verb. Although the same Hindi verb denaa (to give) is used in both examples, it is a light verb in (1) and a main verb in (2). Whether it acts as a light verb or not depends upon the semantics of the preceding noun. However, it is observed that the English meaning in the case of the complex predicate is not derived from the individual meanings of the constituent words. It is this observation that forms the basis of our approach for mining CPs.

(3) CP = adjective+LV; adjective = khush (happy), LV = karanaa (to do)
usane mujhe khush kiyaa
उसने मुझे खुश किया
he me happy did
he pleased me

Here the Hindi verb kiyaa (did) is the past tense form of the light verb karanaa (to do) and the preceding word khush (happy) is an adjective. The CP here is khush karanaa (to please).

(4) CP = verb+LV; verb = paRhnaa (to read), LV = lenaa (to take)
usane pustak paRh liyaa
उसने पुस्तक पढ़ लिया
he book read took
he has read the book

Here the Hindi verb liyaa (took) is the past tense form of the light verb lenaa (to take) and the preceding word paRh (read) is the verb paRhnaa (to read) in its stem form. The CP is paRh lenaa (to finish reading). In such cases, the light verb acts as an aspectual, a modal or an intensifier.

(5) CP = verb+LV; verb = phaadanaa (to tear), LV = denaa (to give)
usane pustak phaad diyaa
उसने पुस्तक फाड़ दिया
he book tear gave
he has torn the book

Here the Hindi verb diyaa (gave) is the past tense form of the light verb denaa (to give) and the preceding word phaad (tear) is the stem form of the verb phaadanaa (to tear). The CP is phaad denaa (to cause and complete the act of tearing).

(6) CP = verb+LV; verb = denaa (to give), LV = maaranaa (to hit, to kill)
usane pustak de maaraa
उसने पुस्तक दे मारा
he book give hit
he threw the book

Here the Hindi verb maaraa (hit/killed) is the past tense form of the light verb maaranaa (to hit, to kill) and the preceding word de (give) is the verb denaa (to give) in its stem form. The CP is de maaranaa (to throw). The verb combination yields a new meaning. This may also be considered a semi-idiomatic construct by some people.

(7) CP = adverb+LV1+LV2; adverb = vaapas (back), LV1 = karanaa (to do), LV2 = denaa (to give); or CP = CP+LV; CP = vaapas karanaa (to return), LV = denaa (to give)
usane pustak vaapas kar diyaa
उसने पुस्तक वापस कर दिया
he book back do gave
he returned the book

Here two Hindi light verbs are used. The verb kar (do) is the stem form of the light verb karanaa (to do), and the verb diyaa (gave) is the past tense form of the light verb denaa (to give). The preceding word vaapas (back) is an adverb. One interpretation is that the CP (a conjunct verb) vaapas karanaa (to return) is followed by another LV, denaa (to give), signifying completion of the task. Another way of looking at it is to consider these as a combination of two CPs: vaapas karanaa (to return) and kar denaa (to complete the act). The semantic interpretation in the two cases remains the same. It may be noted that the word vaapas (return) is also a noun, and in such a case the CP is a noun+LV.
From all the above examples, the complexity of the task of mining CPs is evident. However, it is also observed that in the translated text the meaning of the light verb does not appear in the case of CPs. Our methodology for mining CPs is based on this observation and is outlined in the following section.
3 System Design
As outlined earlier, our method for detecting a CP is based on detecting a mismatch of the Hindi light verb meaning in the aligned English sentence. The steps involved are as follows:
1) Align the sentences of the Hindi-English corpus.
2) Create a list of Hindi light verbs and their common English meanings as a simple verb (Table 1).
3) For each Hindi light verb, generate all the morphological forms (Figure 1).
4) For each English meaning of the light verb as given in Table 1, generate all the morphological forms (Figure 2).
5) For each Hindi-English aligned sentence, execute the following steps:
   a) For each light verb of Hindi (Table 1), execute the following steps:
      i) Search for a Hindi light verb (LV) and its morphological derivatives (Figure 1) in the Hindi sentence and mark its position in the sentence (K).
      ii) If the LV or its morphological derivative is found, then search for the equivalent English meanings of any of its morphological forms (Figure 2) in the corresponding aligned English sentence.
      iii) If no match is found, then scan the words in the Hindi sentence to the left of the Kth position (as identified in step (i)); else, if a match is found, then exit, i.e. go to step (a).
      iv) If the scanned word is a 'stop word' (Figure 3), then ignore it and continue scanning.
      v) Stop the scan when it is not a 'stop word' and collect the Hindi word (W).
      vi) If W is an 'exit word', then exit, i.e. go to step (a); else the identified CP is W+LV.
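The inner loop of the procedure above can be sketched in code. This is an illustrative reconstruction, not the author's implementation: the toy data below is transliterated, whereas a real run would use the full Devanagari word lists and the morphological generators of Figures 1 and 2.

```python
# Sketch of the CP detection loop (steps i-vi above), not the author's code.
# light_verbs maps each LV lemma to (hindi_forms, english_forms), both sets.

def mine_cps(hi_sentence, en_sentence, light_verbs, stop_words, exit_words):
    cps = []
    hi_words = hi_sentence.split()
    en_words = set(en_sentence.lower().split())
    for lv, (hi_forms, en_forms) in light_verbs.items():
        for k, word in enumerate(hi_words):
            if word not in hi_forms:
                continue                      # step (i): locate the LV at position K
            if en_forms & en_words:
                continue                      # step (ii): literal meaning present -> not a CP
            for w in reversed(hi_words[:k]):  # step (iii): scan left of position K
                if w in stop_words:
                    continue                  # step (iv): skip stop words
                if w in exit_words:
                    break                     # step (vi): exit word -> no CP here
                cps.append((w, lv))           # steps (v)-(vi): identified CP is W + LV
                break
    return cps

# Toy example (1) from the text, in transliteration:
lv = {"denaa": ({"diyaa", "dii"}, {"give", "gives", "gave", "given", "giving"})}
print(mine_cps("usane mujhe ashirwad diyaa", "he blessed me",
               lv, set(), {"usane", "mujhe"}))
# -> [('ashirwad', 'denaa')]
```

On example (2), "usane mujhe ek pustak dii" aligned with "he gave me a book", the same call returns an empty list, since the literal meaning "gave" is found in the English sentence.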
Hindi has a large number of light verbs. A list of some of the commonly used light verbs, along with their common English meanings as a simple verb, is given in Table 1. The light verb kar (do) is the most frequently used light verb. Using its literal meaning 'do' as a criterion for testing a CP is quite misleading, since 'do' in English is used in several other contexts. Such meanings have been shown within parentheses and are not used for matching.
light verb (base form)    root verb meaning
baithanaa बैठना           sit
bananaa बनना              make/become/build/construct/manufacture/prepare
banaanaa बनाना            make/build/construct/manufacture/prepare
denaa देना                give
lenaa लेना                take
paanaa पाना               obtain/get
uthanaa उठना              rise/arise/get-up
uthaanaa उठाना            raise/lift/wake-up
laganaa लगना              feel/appear/look/seem
lagaanaa लगाना            fix/install/apply
cukanaa चुकना             (finish)
cukaanaa चुकाना           pay
karanaa करना              (do)
honaa होना                happen/become/be
aanaa आना                 come
jaanaa जाना               go
khaanaa खाना              eat
rakhanaa रखना             keep/put
maaranaa मारना            kill/beat/hit
daalanaa डालना            put
haankanaa हाँकना          drive
For each of the Hindi light verbs, all morphological forms are generated. A few illustrations are given in Figures 1(a) and 1(b). Similarly, for each English meaning of the light verb, all of its morphological derivatives are generated. Figure 2 shows a few illustrations of the same.
There are a number of words that can appear in between the nominal and the light verb in a CP. These words are ignored in the search for a CP and are treated as stop words. These are words that denote negation or are emphasizers, intensifiers, interrogative pronouns or particles. A list of the stop words used in the experimentation is given in Figure 3.
Figure 1(a): Morphological derivatives of the sample Hindi light verb 'jaanaa' जाना (to go)

Figure 1(b): Morphological derivatives of the sample Hindi light verb 'lenaa' लेना (to take)
Figure 2 Morphological derivatives of sample English meanings
We use a list of words that we have named 'exit words', which cannot form part of a CP in Hindi. We have used Hindi case (vibhakti) markers (also called parsarg), conjunctions and pronouns as the 'exit words' in our implementation. Figure 4 shows a partial list. However, this list can be augmented based on analysis of errors in LV identification. It should be noted that we do not perform parts-of-speech (POS) tagging, so the nature of the word preceding the LV is unknown to the system.
English word: sit. Morphological derivations: sit, sits, sat, sitting.
English word: give. Morphological derivations: give, gives, gave, given, giving.
…
LV: jaanaa जाना (to go). Morphological derivatives: jaa, jaae, jaao, jaaeM, jaauuM, jaane, jaanaa, jaanii, jaataa, jaatii, jaate, jaaniiM, jaatiiM, jaaoge, jaaogii, gaii, jaauuMgaa, jaayegaa, jaauuMgii, jaayegii, gaye, gaiiM, gayaa, gayii, jaayeMge, jaayeMgii, jaakara.
जा (go, stem), जाए (go, imperative), जाओ (go, imperative), जाएँ (go, imperative), जाऊँ (go, first-person), जाने (go, infinitive oblique), जाना (go, infinitive masculine singular), जानी (go, infinitive feminine singular), जाता (go, indefinite masculine singular), जाती (go, indefinite feminine singular), जाते (go, indefinite masculine plural/oblique), जानीं (go, infinitive feminine plural), जातीं (go, indefinite feminine plural), जाओगे (go, future masculine singular), जाओगी (go, future feminine singular), गई (go, past feminine singular), जाऊँगा (go, future masculine first-person singular), जायेगा (go, future masculine third-person singular), जाऊँगी (go, future feminine first-person singular), जायेगी (go, future feminine third-person singular), गये (go, past masculine plural/oblique), गईं (go, past feminine plural), गया (go, past masculine singular), गयी (go, past feminine singular), जायेंगे (go, future masculine plural), जायेंगी (go, future feminine plural), जाकर (go, completion) …
LV: lenaa लेना (to take). Morphological derivatives: le, lii, leM, lo, letaa, letii, lete, liiM, luuM, legaa, legii, lene, lenaa, lenii, liyaa, leMge, loge, letiiM, luuMgaa, luuMgii, lekara.
ले (take, stem), ली (take, past), लें (take, imperative), लो (take, imperative), लेता (take, indefinite masculine singular), लेती (take, indefinite feminine singular), लेते (take, indefinite masculine plural/oblique), लीं (take, past feminine plural), लूँ (take, first-person), लेगा (take, future masculine third-person singular), लेगी (take, future feminine third-person singular), लेने (take, infinitive oblique), लेना (take, infinitive masculine singular), लेनी (take, infinitive feminine singular), लिया (take, past masculine singular), लेंगे (take, future masculine plural), लोगे (take, future masculine singular), लेतीं (take, indefinite feminine plural), लूँगा (take, future masculine first-person singular), लूँगी (take, future feminine first-person singular), लेकर (take, completion) …
Figure 3 Stop words in Hindi used by the system
Figure 4 A few exit words in Hindi used by the system
The inner loop of the procedure identifies multiple CPs that may be present in a sentence. The outer loop mines the CPs in the entire corpus. The experimentation and results are discussed in the following section.
4 Experimentation and Results
The CP mining methodology outlined earlier has been implemented and tested over multiple files of the EMILLE (McEnery, Baker, Gaizauskas and Cunningham, 2000) English-Hindi parallel corpus. A summary of the results obtained is given in Table 2. As can be seen from this table, the precision obtained is 80% to 92% and the recall is between 89% and 100%. The F-measure is 88% to 97%. This is a remarkable and somewhat surprising result from such a simple methodology, without much linguistic or statistical analysis. This is much higher than what has been reported on the same corpus by Mukerjee et al. 2006 (83% precision and 46% recall), who use projection of POS and word alignment for CP identification. This is the only other work that uses a parallel corpus and covers all kinds of CPs. The results reported by Chakrabarti et al. (2008) are only for V-V CPs; moreover, they do not report the recall value.

                                 File 1  File 2  File 3  File 4  File 5  File 6
No. of sentences                    112     193     102      43     133     107
Total no. of CPs (N)                200     298     150      46     188     151
Correctly identified CPs (TP)       195     296     149      46     175     135
V-V CPs                              56      63       9       6      15      20
Incorrectly identified CPs (FP)      17      44       7      11      16      20
Unidentified CPs (FN)                 5       2       1       0      13      16
Accuracy (%)                      97.50   99.33   99.33  100.00   93.08   89.40
Precision (TP/(TP+FP)) (%)        91.98   87.05   95.51   80.70   91.62   87.09
Recall (TP/(TP+FN)) (%)           97.50   98.33   99.33  100.00   93.08   89.40
F-measure (2PR/(P+R)) (%)          94.6    92.3    97.4    89.3    92.3    88.2

Table 2: Results of the experimentation
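The evaluation figures in Table 2 follow directly from the raw counts via the formulas given in the row labels; the quick check below recomputes File 1 with a small helper function of our own.

```python
def scores(tp, fp, fn):
    # precision, recall and F-measure as defined in the Table 2 row labels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# File 1: TP = 195, FP = 17, FN = 5
p, r, f = scores(195, 17, 5)
print(round(p * 100, 2), round(r * 100, 2), round(f * 100, 2))
# -> 91.98 97.5 94.66
```

The recomputed precision and recall match Table 2 exactly; the F-measure comes out to 94.66, consistent with the 94.6 reported for File 1 up to rounding.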
ने (ergative case marker), को (accusative case marker), का (possessive case marker), के (possessive case marker), की (possessive case marker), से (from/by/with), में (in/into), पर (on/but), और (and; Hindi particle), तथा (and), या (or), लेकिन (but), परन्तु (but), कि (that; Hindi particle), मैं (I), तुम (you), आप (you), वह (he/she), मेरा (my), मेरे (my), मेरी (my), तुम्हारा (your), तुम्हारे (your), तुम्हारी (your), उसका (his), उसकी (her), उसके (his/her), अपना (own), अपनी (own), अपने (own), उनके (their), मैंने (I, ergative), तुम्हें (to you), आपको (to you), उसको (to him/her), उनको (to them), उन्हें (to them), मुझको (to me), मुझे (to me), जिसका (whose), जिसकी (whose), जिसके (whose), जिनको (to whom), जिनके (to whom)
नहीं (no/not), न (no/not; Hindi particle), भी (also; Hindi particle), ही (only; Hindi particle), तो (then; Hindi particle), क्यों (why), क्या (what; Hindi particle), कहाँ (where; Hindi particle), कब (when), यहाँ (here), वहाँ (there), जहाँ (where), पहले (before), बाद में (after), शुरू में (in the beginning), आरम्भ में (in the beginning), अन्त में (in the end), आखिर में (in the end)
Given below are some sample outputs.

(1) English sentence: I also enjoy working with the children's parents, who often come to me for advice - it's good to know you can help.
Aligned Hindi sentence: मुझे बच्चों के माता-पिताओं के साथ काम करना भी अच्छा लगता है जो कि अक्सर सलाह लेने आते हैं - यह जानकर खुशी होती है कि आप किसी की मदद कर सकते हैं।
The CPs identified in the sentence: (i) काम करना (to work), (ii) अच्छा लगना (to feel good, enjoy), (iii) सलाह लेना (to seek advice), (iv) खुशी होना (to feel happy, good), (v) मदद करना (to help)
Here the system identified 5 different CPs, all of which are correct, and no CP in the sentence has gone undetected. The POS projection and word alignment method (Mukerjee et al., 2006) would fail to identify the CPs सलाह लेना (to seek advice) and खुशी होना (to feel happy).

(2) English sentence: Thousands of children are already benefiting from the input of people like you - people who care about children and their future, who have the commitment, energy and enthusiasm to be positive role models, and who value the opportunity for a worthwhile career.
Aligned Hindi sentence: आप जैसे लोग जो कि बच्चों और उनके भविष्य के बारे में सोचते हैं - इस समय भी हज़ारों बच्चों को लाभ पहुँचा रहे हैं। अच्छे आदर्श बनने के लिए ऐसे लोगों में प्रतिबद्धता, उत्साह और लगन है और वे एक समर्थन-योग्य व्यवसाय की कद्र करते हैं।
The CPs identified in the sentence: (i) आदर्श बनना (to be a role model), (ii) कद्र करना (to respect)
Here also the two CPs identified are correct.

It is obvious that this empirical method of mining CPs will fail whenever the Hindi light verb maps onto its core meaning in English. It may also produce garbage, as the POS of the preceding word is not checked. However, the mining success rate obtained suggests these cases are small in number in practice. Use of the 'stop words', which allows intervening words within the CPs, helps a lot in improving the performance. Similarly, use of the 'exit words' avoids a lot of incorrect identification.
5 Conclusions
The simple empirical method for mining CPs outlined in this work yields an average 89% precision and 90% recall, which is better than the results reported so far in the literature. The major drawback is that we have to generate a list of all possible light verbs, and this list appears to be very large for Hindi. Since no POS tagging or statistical analysis is performed, the identified CPs are merely a list of mined CPs in Hindi with no linguistic categorization or analysis. However, this list of mined CPs is valuable to lexicographers and other language technology developers. The list can also be used by word alignment tools, where the identified components of CPs are grouped together before the word alignment process. This will increase both the alignment accuracy and the speed.
The methodology presented in this work is equally applicable to all other languages within the Indo-Aryan family
References

Anthony McEnery, Paul Baker, Rob Gaizauskas and Hamish Cunningham. 2000. EMILLE: Building a Corpus of South Asian Languages. Vivek: A Quarterly in Artificial Intelligence, 13(3):23–32.
Amitabh Mukerjee, Ankit Soni and Achala M. Raina. 2006. Detecting Complex Predicates in Hindi using POS Projection across Parallel Corpora. Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, 11–18.
Alex Alsina. 1996. Complex Predicates: Structure and Theory. CSLI Publications, Stanford, CA.
Anvita Abbi. 1992. The explicator compound verb: some definitional issues and criteria for identification. Indian Linguistics, 53:27–46.
Debasri Chakrabarti, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2007. Complex Predicates in Indian Language Wordnets. Lexical Resources and Evaluation Journal, 40(3-4).
Debasri Chakrabarti, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma and Pushpak Bhattacharyya. 2008. Hindi Compound Verbs and their Automatic Extraction. Computational Linguistics (COLING08), Manchester, UK.
Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. Manohar Publishers and Distributors, New Delhi.
Miriam Butt. 1995. The Structure of Complex Predicates in Urdu. CSLI Publications.
Miriam Butt and Gillian Ramchand. 2001. Complex Aspectual Structure in Hindi/Urdu. In Maria Liakata, Britta Jensen and Didier Maillat (Editors), Oxford University Working Papers in Linguistics, Philology & Phonetics, Vol. 6.
Miriam Butt, Tracy Holloway King and John T. Maxwell III. 2003. Complex Predicates via Restriction. Proceedings of the LFG03 Conference.
Miriam Butt and Wilhelm Geuder. 2001. On the (semi)lexical status of light verbs. In Norbert Corver and Henk van Riemsdijk (Editors), Semi-lexical Categories: On the content of function words and the function of content words. Mouton de Gruyter, Berlin, 323–370.
Mona Singh. 1994. Perfectivity, Definiteness and Specificity: A Classification of Verbal Predicates in Hindi. Doctoral dissertation, University of Texas, Austin.
Peter Edwin Hook. 1974. The Compound Verb in Hindi. Center for South and Southeast Asian Studies, The University of Michigan.
Tara Mohanan. 1994. Argument Structure in Hindi. CSLI Publications, Stanford, California.
Venkatapathy Sriram and Aravind K. Joshi. 2005. Relative compositionality of multi-word expressions: a study of verb-noun (V-N) collocations. In Proceedings of the International Joint Conference on Natural Language Processing 2005, Jeju Island, Korea, 553–564.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 47–54, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren1, Yajuan Lü1, Jie Cao1, Qun Liu1, Yun Huang2
1Key Lab of Intelligent Info. Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
2Department of Computer Science, School of Computing, National University of Singapore, Computing 1, Law Link, Singapore 117590
{renzhixiang, lvyajuan, caojie, liuqun}@ict.ac.cn, huangyun@comp.nus.edu.sg
Abstract
Multiword expressions (MWEs) have been proved useful for many natural language processing tasks. However, how to use them to improve the performance of statistical machine translation (SMT) is not well studied. This paper presents a simple yet effective strategy to extract domain bilingual multiword expressions. In addition, we implement three methods to integrate bilingual MWEs into Moses, the state-of-the-art phrase-based machine translation system. Experiments show that bilingual MWEs could improve translation performance significantly.
1 Introduction
The phrase-based machine translation model has been proved a great improvement over the initial word-based approaches (Brown et al., 1993). Recent syntax-based models perform even better than phrase-based models. However, when syntax-based models are applied to a new domain with little syntax-annotated corpus, the translation performance decreases. To utilize the robustness of phrases and make up for the lack of syntactic or semantic information in the phrase-based model for domain translation, we study domain bilingual multiword expressions and integrate them into the existing phrase-based model.
A multiword expression (MWE) can be considered as a word sequence with a relatively fixed structure representing a special meaning. There is no uniform definition of MWE, and many researchers give different properties of MWEs. Sag et al. (2002) roughly defined MWEs as "idiosyncratic interpretations that cross word boundaries (or spaces)". Cruys and Moiron (2007) focused on the non-compositional property of MWEs, i.e. the property that the whole expression cannot be derived from its
component words. Stanford University launched a MWE project1, in which different qualities of MWEs were presented. For a bilingual multiword expression (BiMWE), we define a bilingual phrase as a bilingual MWE if (1) the source phrase is a MWE in the source language, and (2) the source phrase and the target phrase must be exact translations of each other, i.e. there is no additional (boundary) word in the target phrase which cannot find a corresponding word in the source phrase, and vice versa. In recent years, many useful methods have been proposed to extract MWEs or BiMWEs automatically (Piao et al., 2005; Bannard, 2007; Fazly and Stevenson, 2006). Since a MWE usually constrains the possible senses of a polysemous word in context, MWEs can be used in many NLP applications such as information retrieval, question answering, word sense disambiguation and so on.
For machine translation, Piao et al. (2005) have noted that the issue of MWE identification and accurate interpretation from source to target language remains an unsolved problem for existing MT systems. This problem is more severe when MT systems are used to translate domain-specific texts, since these may include technical terminology as well as more general fixed expressions and idioms. Although some MT systems may employ a machine-readable bilingual dictionary of MWEs, it is time-consuming and inefficient to obtain this resource manually. Therefore, some researchers have tried to use automatically extracted bilingual MWEs in SMT. Tanaka and Baldwin (2003) described an approach to noun-noun compound machine translation, but no significant comparison was presented. Lambert and Banchs (2005) presented a method in which bilingual MWEs were used to modify the word alignment so as to improve the SMT quality. In their work, a bilingual MWE in the training corpus was grouped as
1 http://mwe.stanford.edu
one unique token before training the alignment models. They reported that both alignment quality and translation accuracy were improved on a small corpus. However, in their further study, they reported even lower BLEU scores after grouping MWEs according to part-of-speech on a large corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge, the role and usefulness of MWEs in full-scale SMT is intuitively positive. The difficulty lies in how to integrate bilingual MWEs into an existing SMT system to improve SMT performance, especially when translating domain texts.
In this paper, we implement three methods that integrate domain bilingual MWEs into a phrase-based SMT system and show that these approaches improve translation quality significantly. The main difference between our methods and Lambert and Banchs' work is that we directly aim at improving the SMT performance rather than improving the word alignment quality. In detail, the differences are listed as follows:
• Instead of using the bilingual n-gram translation model, we choose the phrase-based SMT system Moses2, which achieves significantly better translation performance than many other SMT systems and is a state-of-the-art SMT system.
• Instead of improving translation indirectly by improving the word alignment quality, we directly target the quality of translation. Some researchers have argued that large gains in alignment performance under many metrics only lead to small gains in translation performance (Ayan and Dorr, 2006; Fraser and Marcu, 2007).
Besides the above differences, there are some advantages to our approaches:
• In our method, automatically extracted MWEs are used as additional resources rather than as a phrase-table filter. Since bilingual MWEs are extracted according to noisy automatic word alignment, errors in word alignment would otherwise further propagate to the SMT system and hurt SMT performance.
• We conduct experiments on a domain-specific corpus. For one thing, a domain-specific
2 http://www.statmt.org/moses
corpus potentially includes a large number of technical terminologies as well as more general fixed expressions and idioms, i.e. a domain-specific corpus has high MWE coverage. For another, after investigation, current SMT systems cannot effectively deal with these domain-specific MWEs, especially for Chinese, since these MWEs are more flexible and concise. Take the Chinese term "软坚散结" for example. The meaning of this term is "soften hard mass and dispel pathogenic accumulation". Every word of this term represents a special meaning and cannot be understood literally or without this context. Such terms are difficult to translate even for humans, let alone machine translation. So treating these terms as MWEs and applying them in an SMT system has practical significance.
• In our approach, no additional corpus is introduced. We attempt to extract useful MWEs from the training corpus and adopt suitable methods to apply them. Thus it allows full exploitation of the available resources without greatly increasing the time and space complexity of the SMT system.
The remainder of the paper is organized as follows. Section 2 describes the bilingual MWE extraction technique. Section 3 proposes three methods to apply bilingual MWEs in an SMT system. Section 4 presents the experimental results. Section 5 draws conclusions and describes future work. Since this paper mainly focuses on the application of BiMWEs in SMT, we only give a brief introduction to monolingual and bilingual MWE extraction.
2 Bilingual Multiword Expression Extraction
In this section, we describe our approach to bilingual MWE extraction. In the first step, we obtain monolingual MWEs from the Chinese part of the parallel corpus. After that, we look for the translations of the extracted MWEs in the parallel corpus.
2.1 Automatic Extraction of MWEs
In the past two decades, many different approaches to automatic MWE identification have been reported. In general, these approaches fall into three main trends: (1) statistical approaches (Pantel and Lin, 2001; Piao et
al., 2005), (2) syntactic approaches (Fazly and Stevenson, 2006; Bannard, 2007), and (3) semantic approaches (Baldwin et al., 2003; Cruys and Moiron, 2007). Syntax-based and semantic-based methods achieve high precision, but syntactic or semantic analysis has to be introduced as a preprocessing step, so it is difficult to apply them to domains with little syntactic or semantic annotation. Statistical approaches only consider frequency information, so they can be used to obtain MWEs from bilingual corpora without deeper syntactic or semantic analysis. However, most statistical measures only take two words into account, so it is not easy to extract MWEs containing three or more words.
The Log Likelihood Ratio (LLR) has been proved a good statistical measure of the association of two random variables (Chang et al., 2002). We adopt the idea of statistical approaches and propose a new algorithm, named the LLR-based Hierarchical Reducing Algorithm (HRA for short), to extract MWEs of arbitrary length. To illustrate our algorithm, we first define some useful terms. In the following definitions, we assume the given sentence is "A B C D E".
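The LLR score used throughout the HRA can be computed from a 2x2 contingency table of bigram counts. A minimal sketch in Dunning's (1993) formulation; the count layout is our assumption, not code from the paper:

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.
    k11 = count(w1 w2), k12 = count(w1, not w2),
    k21 = count(not w1, w2), k22 = count(neither)."""
    def h(*ks):
        # sum of k * log(k / N) over the cells (0 * log 0 treated as 0)
        n = sum(ks)
        return sum(k * math.log(k / n) for k in ks if k > 0)
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))
```

An LLR near 0 indicates the two words are independent; large values indicate strong association.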
Definition 1 (Unit): A unit is any sub-string of the given sentence. For example, "A B", "C", and "C D E" are all units, but "A B D" is not a unit.

Definition 2 (List): A list is an ordered sequence of units which exactly covers the given sentence. For example, "A", "B C D", "E" forms a list.

Definition 3 (Score): The score function is defined only on two adjacent units and returns the LLR between the last word of the first unit and the first word of the second unit3. For example, the score of the adjacent units "B C" and "D E" is defined as LLR("C", "D").

Definition 4 (Select): The selecting operator finds the two adjacent units with the maximum score in a list.

Definition 5 (Reduce): The reducing operator removes two specific adjacent units, concatenates them, and puts the resulting unit back in the removed position. For example, if we reduce unit "B C" and unit "D" in the list "A", "B C", "D", "E", we get the list "A", "B C D", "E".
Initially, every word in the sentence is considered as one unit, and all these units form an initial list L. If the sentence is of length N, then the
3 We use a stoplist to eliminate the units containing function words by setting their score to 0.
list contains N units. The final set of MWEs, S, is initialized to the empty set. After initialization, the algorithm enters an iterating loop with two steps: (1) select the two adjacent units with the maximum score in L, naming them U1 and U2, and (2) reduce U1 and U2 in L and insert the reducing result into the final set S. Our algorithm terminates on two conditions: (1) if the maximum score after selection is less than a given threshold, or (2) if L contains only one unit.
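The select/reduce loop above can be sketched as follows. This is a simplified reading of the HRA: `llr_score` is a stand-in for the LLR table over adjacent words, and the stoplist handling of footnote 3 is omitted:

```python
def hra(words, llr_score, threshold=20.0):
    """LLR-based Hierarchical Reducing Algorithm (sketch).
    llr_score(w1, w2) -> LLR of an adjacent word pair."""
    units = list(words)   # initial list L: one unit per word
    mwes = []             # final MWE set S
    while len(units) > 1:
        # Select: the adjacent unit pair with the maximum score, where the
        # score is the LLR of (last word of first unit, first word of second)
        scores = [llr_score(units[i].split()[-1], units[i + 1].split()[0])
                  for i in range(len(units) - 1)]
        i = max(range(len(scores)), key=scores.__getitem__)
        if scores[i] < threshold:
            break
        # Reduce: merge the pair in place and record the result as an MWE
        merged = units[i] + " " + units[i + 1]
        units[i:i + 2] = [merged]
        mwes.append(merged)
    return mwes
```

With toy scores, the MWEs come out in bottom-up merge order, mirroring the hierarchical structure of Figure 1.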
[Figure omitted: hierarchical reduction of the example sentence, showing the merged units at each level and the adjacent-word LLR scores 1471, 67552, 10596, 0, 0, 8096]

Figure 1: Example of the Hierarchical Reducing Algorithm
Let us make the algorithm clearer with an example. Assume the score threshold is 20 and the given sentence is "pound p Eacute w egrave "4. Figure 1 shows the hierarchical structure of the given sentence (based on the LLR of adjacent words). In this example, four MWEs ("p Eacute", "p Eacute w", "egrave ", "pound p Eacute w") are extracted, in that order, and the sub-strings above the dotted line in Figure 1 are not extracted.
From the above example, we can see that the extracted MWEs correspond to human intuition. In general, the basic idea of the HRA is to reflect the hierarchical structural pattern of natural language. Furthermore, in the HRA, MWEs are scored with the minimum LLR of the adjacent words within them, which gives a lexical confidence for extracted MWEs. Finally, for a given sentence of length N, the HRA is guaranteed to terminate within N − 1 iterations, which is very efficient.
However, the HRA has a problem: it extracts substrings before extracting the whole string, even if a substring appears only in that particular whole string, which we consider useless. To solve this problem, we use contextual features,
4 The whole sentence means "healthy tea for preventing hyperlipidemia"; we give the meaning of each Chinese word: pound (preventing), p (hyper-), Eacute (-lipid-), w (-emia), (for), egrave (healthy), (tea).
contextual entropy (Luo and Sun, 2003) and C-value (Frantzi and Ananiadou, 1996), to filter out those substrings which exist only in a few MWEs.
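A sketch of the C-value filter mentioned here (Frantzi and Ananiadou, 1996); the data structures are our assumptions. A candidate that occurs only inside a single longer MWE gets a score of 0 and can be discarded:

```python
import math

def c_value(term, freq, nesting):
    """C-value of a candidate term (sketch).
    term: tuple of words; freq: dict term -> corpus frequency;
    nesting: dict term -> list of longer candidates containing it."""
    longer = nesting.get(term, [])
    base = math.log2(len(term)) if len(term) > 1 else 0.0
    if not longer:
        return base * freq[term]
    # discount by the mean frequency of the longer terms it nests in
    return base * (freq[term] - sum(freq[b] for b in longer) / len(longer))
```

A two-word candidate seen 8 times on its own scores 8.0, while one seen only inside a single longer candidate scores 0.0.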
2.2 Automatic Extraction of MWE Translations
In subsection 2.1 we described the algorithm to obtain MWEs; in this subsection we introduce the procedure to find their translations in the parallel corpus.

To mine the English translations of Chinese MWEs, we first obtain the candidate translations of a given MWE from the parallel corpus. The steps are as follows:
1. Run GIZA++5 to align words in the training parallel corpus.

2. For a given MWE, find the bilingual sentence pairs whose source-language sentences include the MWE.

3. Extract the candidate translations of the MWE from the above sentence pairs according to the algorithm described by Och (2002).
After the above procedure, we have extracted all possible candidate translations of a given MWE. The next step is to distinguish right candidates from wrong ones. We construct a perceptron-based classification model (Collins, 2002) to solve this problem. We design two groups of features: translation features, which describe the mutual translating chance between the source phrase and the target phrase, and language features, which describe how likely a candidate is to be a reasonable translation. The translation features include: (1) the logarithm of the source-target translation probability, (2) the logarithm of the target-source translation probability, (3) the logarithm of the source-target lexical weighting, (4) the logarithm of the target-source lexical weighting, and (5) the logarithm of the phrase pair's LLR (Dunning, 1993). The first four features are exactly the same as the four translation probabilities used in a traditional phrase-based system (Koehn et al., 2003). The language features include: (1) the left entropy of the target phrase (Luo and Sun, 2003), (2) the right entropy of the target phrase, (3) the first word of the target phrase, (4) the last word of the target phrase, and (5) all words in the target phrase.
5 http://www.fjoch.com/GIZA++.html
We randomly select and annotate 33,000 phrase pairs, of which 30,000 pairs are used as the training set and 3,000 pairs as the test set. We use the perceptron training algorithm to train the model. As the experiments reveal, the classification precision of this model is 91.67%.
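The classifier can be sketched as a standard binary perceptron (Collins, 2002, reduced to the binary case); the feature names below are illustrative stand-ins for the translation and language features listed above:

```python
def train_perceptron(data, epochs=10):
    """Binary perceptron (sketch). data: list of (feature_dict, label)
    with label in {+1, -1}; features map name -> value."""
    w = {}
    for _ in range(epochs):
        for feats, y in data:
            score = sum(w.get(f, 0.0) * v for f, v in feats.items())
            if y * score <= 0:                  # misclassified: update weights
                for f, v in feats.items():
                    w[f] = w.get(f, 0.0) + y * v
    return w

def classify(w, feats):
    """+1 = accept the candidate translation, -1 = reject it."""
    return 1 if sum(w.get(f, 0.0) * v for f, v in feats.items()) > 0 else -1
```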
3 Application of Bilingual MWEs
Intuitively, bilingual MWEs are useful for improving the performance of SMT. However, as described in Section 1, how to integrate bilingual MWEs into an SMT system still needs further research. In this section, we propose three methods to utilize bilingual MWEs, and we compare their performance in Section 4.
3.1 Model Retraining with Bilingual MWEs
The bilingual phrase table is very important for a phrase-based MT system. However, due to errors in automatic word alignment and unaligned-word extension in phrase extraction (Och, 2002), many meaningless phrases are extracted, which results in inaccurate phrase probability estimation. To alleviate this problem, we take the automatically extracted bilingual MWEs as parallel sentence pairs, add them into the training corpus, and retrain the model using GIZA++. By increasing the occurrences of bilingual MWEs, which are good phrases, we expect that the alignment will be modified and the probability estimation will become more reasonable. Wu et al. (2008) also used this method to perform domain adaptation for SMT. Different from their approach, in which bilingual MWEs are extracted from an additional corpus, we extract bilingual MWEs from the original training set. That additional resources can improve domain-specific SMT performance has been shown by many researchers (Wu et al., 2008; Eck et al., 2004). However, our method shows that making better use of the resources at hand can also enhance the quality of an SMT system. We use "Baseline+BiMWE" to denote this method.
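The retraining step reduces to appending the MWE pairs to the parallel corpus as pseudo sentence pairs before rerunning GIZA++; a minimal sketch (function and variable names are ours):

```python
def augment_corpus(src_lines, tgt_lines, bi_mwes):
    """Baseline+BiMWE (sketch): append bilingual MWEs to the parallel
    training corpus as extra sentence pairs, keeping the two sides
    line-aligned. bi_mwes: list of (source_mwe, target_translation)."""
    src = list(src_lines) + [s for s, t in bi_mwes]
    tgt = list(tgt_lines) + [t for s, t in bi_mwes]
    return src, tgt
```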
3.2 New Feature for Bilingual MWEs
Lopez and Resnik (2006) pointed out that better feature mining can lead to substantial gains in translation quality. Inspired by this idea, we append one feature to the bilingual phrase table to indicate whether a bilingual phrase contains a bilingual MWE. In other words, if the source-language phrase contains an MWE (as a substring) and
the target-language phrase contains the translation of the MWE (as a substring), the feature value is 1; otherwise, the feature value is set to 0. Due to the high reliability of bilingual MWEs, we expect this feature to help the SMT system select better and more reasonable phrase pairs during translation. We use "Baseline+Feat" to denote this method.
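The 0/1 feature can be sketched as a substring test against the set of extracted bilingual MWEs (a simplified reading; names and the set-of-pairs layout are ours):

```python
def mwe_feature(src_phrase, tgt_phrase, bi_mwes):
    """Return 1.0 if the phrase pair contains a bilingual MWE as a
    (source substring, target substring) pair, else 0.0.
    bi_mwes: set of (src_mwe, tgt_mwe) string pairs."""
    for s, t in bi_mwes:
        if s in src_phrase and t in tgt_phrase:
            return 1.0
    return 0.0
```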
3.3 Additional Phrase Table of Bilingual MWEs
Wu et al. (2008) proposed a method to construct a phrase table from a manually made translation dictionary. Instead of manually constructing a translation dictionary, we construct an additional phrase table containing the automatically extracted bilingual MWEs. As to probability assignment, we simply assign 1 to the four translation probabilities. Since Moses supports multiple bilingual phrase tables, we combine the original phrase table and the newly constructed bilingual MWE table. For each phrase in the input sentence, the decoder searches all candidate translation phrases in both phrase tables during translation. We use "Baseline+NewBP" to denote this method.
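A sketch of building the additional table for Baseline+NewBP, writing one Moses-style entry per bilingual MWE with all four translation probabilities set to 1 (the exact phrase-table field layout is an assumption here):

```python
def mwe_phrase_table(bi_mwes, path):
    """Write an additional phrase table for the extracted bilingual MWEs
    (sketch). Assumed line layout: src ||| tgt ||| four scores."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in bi_mwes:
            f.write(f"{src} ||| {tgt} ||| 1 1 1 1\n")
```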
4 Experiments
4.1 Data
We run experiments on two domain-specific patent corpora: one for the traditional medicine domain, and the other for the chemical industry domain. Our translation task is Chinese-to-English.

Table 1 shows the data statistics for the traditional medicine domain. For the language model, we use the SRI Language Modeling Toolkit6 to train a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the target side of the training corpus. Using our bilingual MWE extraction algorithm, 80,287 bilingual MWEs are extracted from the training set.
                      Chinese     English
Training  Sentences        120,355
          Words       4,688,873   4,737,843
Dev       Sentences          1,000
          Words          31,722      32,390
Test      Sentences          1,000
          Words          41,643      40,551

Table 1: Traditional medicine corpus
6 http://www.speech.sri.com/projects/srilm
Table 2 gives detailed information about the data in the chemical industry domain. In this experiment, 59,466 bilingual MWEs are extracted.
                      Chinese     English
Training  Sentences        120,856
          Words       4,532,503   4,311,682
Dev       Sentences          1,099
          Words          42,122      40,521
Test      Sentences          1,099
          Words          41,069      39,210

Table 2: Chemical industry corpus
We test translation quality on the test set and use the open-source tool mteval-v11b.pl7 to calculate the case-sensitive BLEU-4 score (Papineni et al., 2002) as our evaluation criterion. For this evaluation, there is only one reference per test sentence. We also perform statistical significance tests between translation results (Collins et al., 2005). The mean of all scores and the relative standard deviation are calculated with a 99% confidence interval of the mean.
4.2 MT Systems
We use the state-of-the-art phrase-based SMT system Moses as our baseline system. The features used in the baseline system include: (1) four translation probability features, (2) one language model feature, (3) distance-based and lexicalized distortion model features, (4) word penalty, and (5) phrase penalty. For the "Baseline+BiMWE" method, bilingual MWEs are added into the training corpus; as a result, a new alignment and a new phrase table are obtained. For the "Baseline+Feat" method, one additional 0/1 feature is introduced for each entry in the phrase table. For "Baseline+NewBP", the additional phrase table constructed from bilingual MWEs is used.
Features are combined in the log-linear model. To obtain the best translation ê of the source sentence f, the log-linear model uses the following equation:

    ê = argmax_e p(e|f) = argmax_e Σ_{m=1}^{M} λ_m h_m(e, f)    (1)

in which h_m and λ_m denote the m-th feature and its weight. The weights are automatically tuned by minimum error rate training (Och, 2002) on the development set.
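Equation (1)'s decision rule amounts to picking the candidate with the highest weighted feature sum; a minimal sketch over an explicit candidate list (real decoders search this space implicitly):

```python
def best_translation(candidates, weights):
    """Pick e maximizing sum_m lambda_m * h_m(e, f) (Eq. 1, sketch).
    candidates: list of (translation, feature_vector) pairs;
    weights: the lambda_m tuned by MERT."""
    return max(candidates,
               key=lambda c: sum(l * h for l, h in zip(weights, c[1])))[0]
```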
7 http://www.nist.gov/speech/tests/mt/resources/scoring.htm
4.3 Results
Methods            BLEU
Baseline           0.2658
Baseline+BiMWE     0.2661
Baseline+Feat      0.2675
Baseline+NewBP     0.2719

Table 3: Translation results of using bilingual MWEs in the traditional medicine domain
Table 3 gives our experimental results. From this table, we can see that bilingual MWEs improve translation quality in all cases. The Baseline+NewBP method achieves the largest improvement, 0.61 BLEU points over the baseline system. The Baseline+Feat method comes next, with a 0.17 BLEU point improvement, and Baseline+BiMWE achieves slightly higher translation quality than the baseline system.
To our disappointment, however, none of these improvements is statistically significant. We manually examine the extracted bilingual MWEs labeled positive by the perceptron algorithm and find that, although the classification precision is high (91.67%), the proportion of positive examples is relatively low (76.69%). The low positive proportion means that many negative instances have been wrongly classified as positive, which introduces noise. To remove noisy bilingual MWEs, we use the length ratio x of the source phrase over the target phrase to rank the bilingual MWEs labeled positive. Assuming x follows a Gaussian distribution, the ranking score of a phrase pair (s, t) is defined by the following formula:
    Score(s, t) = log(LLR(s, t)) × (1 / (√(2π) σ)) × e^(−(x−μ)² / (2σ²))    (2)

Here the mean μ and variance σ² are estimated from the training set. After ranking by score, we select the top 50,000, 60,000, and 70,000 bilingual MWEs and apply the three methods described in Section 3. The results are shown in Table 4.
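Equation (2) can be sketched directly; x is the source/target length ratio, and we read the Gaussian factor as the standard density with μ and σ estimated from training data (an interpretation of the garbled original):

```python
import math

def rank_score(llr_value, x, mu, sigma):
    """Ranking score of a phrase pair (Eq. 2, sketch): log LLR weighted
    by a Gaussian density over the length ratio x."""
    gauss = (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
             / math.sqrt(2 * math.pi * sigma ** 2))
    return math.log(llr_value) * gauss
```

Pairs whose length ratio sits near the corpus mean keep most of their LLR weight; outliers are pushed down the ranking.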
From this table, we can conclude that: (1) all three methods improve the BLEU score on all settings; (2) except for the Baseline+BiMWE method, the other two methods obtain significant improvements in BLEU score (0.2728, 0.2734, 0.2724) over the baseline system (0.2658); (3) when the scale of bilingual MWEs is relatively small (50,000, 60,000), the Baseline+Feat method performs better
Methods            50,000   60,000   70,000
Baseline                    0.2658
Baseline+BiMWE     0.2671   0.2686   0.2715
Baseline+Feat      0.2728   0.2734   0.2712
Baseline+NewBP     0.2662   0.2706   0.2724

Table 4: Translation results of using bilingual MWEs in the traditional medicine domain
than the others; (4) as the number of bilingual MWEs increases, the Baseline+NewBP method outperforms the Baseline+Feat method; (5) comparing Tables 4 and 3, we can see that more bilingual MWEs do not always mean better phrase-based SMT performance. This conclusion is the same as that of Lambert and Banchs (2005).
To verify the assumption that bilingual MWEs improve SMT performance not only in one particular domain, we also perform experiments in the chemical industry domain. Table 5 shows the results. From this table, we can see that all three methods improve translation performance in the chemical industry domain as well as in the traditional medicine domain.
Methods            BLEU
Baseline           0.1882
Baseline+BiMWE     0.1928
Baseline+Feat      0.1917
Baseline+NewBP     0.1914

Table 5: Translation results of using bilingual MWEs in the chemical industry domain
4.4 Discussion
In order to understand in what respects our methods improve translation quality, we manually analyze some test sentences and give some examples in this subsection.
(1) For the first example in Table 6, "Iuml oacute" is aligned to other words and not correctly translated by the baseline system, while it is aligned to the correct target phrase "dredging meridians" by Baseline+BiMWE, since the bilingual MWE ("Iuml oacute", "dredging meridians") has been added into the training corpus and then aligned by GIZA++.
(2) For the second example in Table 6, " " has two candidate translations in the phrase table: "tea" and "medicated tea". The baseline system chooses "tea" as the translation of " ", while the Baseline+Feat system chooses
Src:      T not auml k Ouml Eacute Aring raquo Iuml oacute ) 9 | Y S _ Ouml otilde egraveE 8
Ref:      the obtained product is effective in tonifying blood, expelling cold, dredging meridians, promoting production of body fluid, promoting urination and tranquilizing mind, and can be used for supplementing nutrition and protecting health
Baseline: the food has effects in tonifying blood, dispelling cold, promoting salivation and water, and tranquilizing and tonic effects and making nutritious health
+BiMWE:   the food has effects in tonifying blood, dispelling cold, dredging meridians, promoting salivation, promoting urination and tranquilizing, tonic, nutritious, pulverizing

Src:      curren iexclJ J NtildeJ 5J
Ref:      the product can also be made into tablet, pill, powder, medicated tea or injection
Baseline: may also be made into tablet, pill, powder, tea or injection
+Feat:    may also be made into tablet, pill, powder, medicated tea or injection

Table 6: Translation examples
"medicated tea", because the additional feature gives a high probability to the correct translation "medicated tea".
5 Conclusion and Future Work
This paper presents the LLR-based Hierarchical Reducing Algorithm to automatically extract bilingual MWEs, and investigates the performance of three different strategies for applying bilingual MWEs in an SMT system. The translation results show that using an additional feature to indicate whether a bilingual phrase contains a bilingual MWE performs best in most cases. The other two strategies also improve the quality of the SMT system, although not as much as the first one. These results are encouraging and motivate further research in this area.
The strategies for applying bilingual MWEs in this paper are rough and simple. More sophisticated approaches should be considered when applying bilingual MWEs. For example, we may consider other features of the bilingual MWEs and examine their effect on SMT performance. Besides application in phrase-based SMT systems, bilingual MWEs may also be integrated into other MT models, such as hierarchical phrase-based models or syntax-based translation models. We will do further studies on improving statistical machine translation using domain bilingual MWEs.
Acknowledgments
This work is supported by the National Natural Science Foundation of China, Contracts 60603095 and 60873167. We would like to thank the anonymous reviewers for their insightful comments on an earlier draft of this paper.
References
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: An extensive analysis of word alignments and their impact on MT. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, pages 9–16.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL Workshop on A Broader Perspective on Multiword Expressions, pages 1–8.

Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Baobao Chang, Pernilla Danielsson, and Wolfgang Teubert. 2002. Extraction of translation unit from Chinese-English parallel corpora. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, pages 1–5.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical report.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguistics, pages 531–540.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Empirical Methods in Natural Language Processing Conference, pages 1–8.

Tim Van de Cruys and Begona Villada Moiron. 2007. Semantics-based multiword expression extraction. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 25–32.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Matthias Eck, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine translation in the medical domain using the Unified Medical Language System. In Proceedings of the 20th International Conference on Computational Linguistics, pages 792–798.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the EACL, pages 337–344.

Katerina T. Frantzi and Sophia Ananiadou. 1996. Extracting nested collocations. In Proceedings of the 16th Conference on Computational Linguistics, pages 41–46.

Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54.

Patrik Lambert and Rafael Banchs. 2005. Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 396–403.

Patrik Lambert and Rafael Banchs. 2006. Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, pages 9–16.

Adam Lopez and Philip Resnik. 2006. Word-based alignment, phrase-based translation: What's the link? In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Visions for the Future of Machine Translation, pages 90–99.

Shengfen Luo and Maosong Sun. 2003. Two-character Chinese word extraction based on hybrid of internal and contextual measures. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 24–30.

Franz Josef Och. 2002. Statistical Machine Translation: From Single-Word Models to Alignment Templates. PhD thesis, Computer Science Department, RWTH Aachen, Germany.

Patrick Pantel and Dekang Lin. 2001. A statistical corpus based term extractor. In AI '01: Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, pages 36–46.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics, pages 311–318.

Scott Songlin Piao, Paul Rayson, Dawn Archer, and Tony McEnery. 2005. Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19(4):378–397.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15.

Takaaki Tanaka and Timothy Baldwin. 2003. Noun-noun compound machine translation: A feasibility study on shallow processing. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 17–24.

Hua Wu, Haifeng Wang, and Chengqing Zong. 2008. Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the Conference on Computational Linguistics (COLING), pages 993–1000.
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 55–62, Suntec, Singapore, 6 August 2009. © 2009 ACL and AFNLP
Bottom-up Named Entity Recognition using a Two-stage Machine Learning Method

Hirotaka Funayama, Tomohide Shibata, Sadao Kurohashi
Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{funayama, shibata, kuro}@nlp.kuee.kyoto-u.ac.jp
Abstract
This paper proposes Japanese bottom-up named entity recognition using a two-stage machine learning method. Most work has formalized Named Entity Recognition as a sequential labeling problem, in which only local information is utilized for label estimation, and thus a long named entity consisting of several morphemes tends to be wrongly recognized. Our proposed method regards a compound noun (chunk) as the labeling unit and first estimates the labels of all the chunks in a phrasal unit (bunsetsu) using a machine learning method. Then the best label assignment in the bunsetsu is determined bottom-up, as in the CKY parsing algorithm, using a machine learning method. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
1 Introduction
Named Entity Recognition (NER) is the task of recognizing named entities such as person names, organization names, and locations. It is used for several NLP applications, such as Information Extraction (IE) and Question Answering (QA). Most work uses machine learning methods such as Support Vector Machines (SVMs) (Vapnik, 1995) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) with a hand-annotated corpus (Krishnan and Manning, 2006; Kazama and Torisawa, 2008; Sasano and Kurohashi, 2008; Fukushima et al., 2008; Nakano and Hirai, 2004; Masayuki and Matsumoto, 2003).
In general, NER is formalized as a sequential labeling problem. For example, regarding a morpheme as the basic unit, each morpheme is first labeled as S-PERSON, B-PERSON, I-PERSON, E-PERSON, S-ORGANIZATION, etc. Then, considering the labeling results of the morphemes, the best NE label sequence is recognized.
When the label of each morpheme is estimated, only local information around the morpheme (e.g., the morpheme itself, the two preceding morphemes, and the two following morphemes) is utilized. Therefore, a long named entity consisting of several morphemes tends to be wrongly recognized. Let us consider the example sentences shown in Figure 1.
In sentence (1), the label of "Kazama" can be recognized as S-PERSON (a PERSON consisting of one morpheme) by utilizing surrounding information such as the suffix "san" (Mr.) and the verb "kikoku shita" (returned home).
On the other hand, in sentence (2), when the label of "shinyou" (credit) is recognized as B-ORGANIZATION (the beginning of an ORGANIZATION), only the information from "hatsudou" (invoke) to "kyusai" (relief) can be utilized; the information of the morpheme "ginkou" (bank), which is three morphemes away from "shinyou", cannot be utilized. To cope with this problem, Nakano et al. (Nakano and Hirai, 2004) and Sasano et al. (Sasano and Kurohashi, 2008) utilized information about the head of the bunsetsu1. In their methods, when the label of "shinyou" is recognized, the information of the morpheme "ginkou" can be utilized.
However, these methods do not work when the morpheme that we want to refer to is not the head of the bunsetsu, as in sentence (3). In this example, when "gaikoku" (foreign) is recognized as B-ARTIFACT (the beginning of an ARTIFACT), we want to refer to "hou" (law), not "ihan" (violation), which is the head of the bunsetsu.
This paper proposes Japanese bottom-up named

1 Bunsetsu is the smallest coherent phrasal unit in Japanese. It consists of one or more content words followed by zero or more function words.
(1) kikoku-shita   Kazama-san-wa
    return home    Mr. Kazama TOP
    'Mr. Kazama, who returned home'

(2) hatsudou-shita   shinyou-kumiai-kyusai-ginkou-no   setsuritsu-mo
    invoke           credit union relief bank GEN      establishment
    'the establishment of the invoking credit union relief bank'

(3) shibunsyo-gizou-to                   gaikoku-jin-touroku-hou-ihan-no            utagai-de
    private document falsification and   foreigner registration law violation GEN   suspicion INS
    'on suspicion of the private document falsification and the violation of the foreigner registration law'

Figure 1: Example sentences
entity recognition using a two-stage machine learning method. Different from previous work, this method regards a compound noun as the labeling unit (we call it a chunk hereafter) and estimates the labels of all the chunks in the bunsetsu using a machine learning method. In sentence (3), all the chunks in the second bunsetsu (i.e., "gaikoku", "gaikoku-jin", ..., "gaikoku-jin-touroku-hou-ihan", ..., "ihan") are labeled, and when the chunk "gaikoku-jin-touroku-hou" is labeled, the information about "hou" (law) is utilized in a natural manner. Then, the best label assignment in the bunsetsu is determined. For example, among the combinations "gaikoku-jin-touroku-hou-ihan" (ARTIFACT+OTHER reading), "gaikoku-jin" (PERSON) and "touroku-hou-ihan" (OTHER), etc., the best label assignment, "gaikoku-jin-touroku-hou" (ARTIFACT) and "ihan" (OTHER), is chosen based on a machine learning method. In this determination of the best label assignment, as in the CKY parsing algorithm, the label assignment is determined by bottom-up dynamic programming. We conducted an experiment on CRL NE data and achieved an F-measure of 89.79, which is higher than previous work.
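The chunks of a bunsetsu are simply all contiguous morpheme subsequences (n(n+1)/2 of them for n morphemes); a minimal sketch:

```python
def enumerate_chunks(morphemes):
    """Enumerate all chunks (contiguous morpheme subsequences) of a
    bunsetsu, the labeling units of the bottom-up method (sketch)."""
    n = len(morphemes)
    return [tuple(morphemes[i:j])
            for i in range(n)
            for j in range(i + 1, n + 1)]
```

For the two-morpheme bunsetsu "gaikoku-jin" this yields the three chunks "gaikoku", "gaikoku-jin", and "jin".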
This paper is organized as follows. Section 2 reviews related work on NER, especially focusing on methods based on sequential labeling. Section 3 gives an overview of our proposed method. Section 4 presents the two machine learning models, and Section 5 describes the analysis algorithm. Section 6 gives experimental results.
2 Related Work
In Japanese Named Entity Recognition, the definition of Named Entity used in the IREX Workshop (IREX
class          example
PERSON         Kimura Syonosuke
LOCATION       Taiheiyou (Pacific Ocean)
ORGANIZATION   Jimin-tou (Liberal Democratic Party)
ARTIFACT       PL-houan (PL bill)
DATE           21-seiki (21st century)
TIME           gozen-7-ji (7 a.m.)
MONEY          500-oku-en (50 billion yen)
PERCENT        20 percent

Table 1: NE classes and their examples
Committee, 1999) is usually used. In this definition, NEs are classified into eight classes: PERSON, LOCATION, ORGANIZATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. Table 1 shows example instances of each class.
NER methods are divided into two approaches: rule-based approaches and machine learning approaches. According to previous work, machine learning approaches achieve better performance than rule-based approaches.
In general, a machine learning method is formalized as a sequential labeling problem, which first assigns each token (character or morpheme) one of several labels. In the SE-algorithm (Sekine et al., 1998), S is assigned to an NE composed of one morpheme; B, I, and E are assigned to the beginning, middle, and end of an NE, respectively; and O is assigned to morphemes that are not part of an NE2. The labels S, B, I, and E are prepared for each NE class, and thus the total number of labels is 33 (= 8 × 4 + 1).
The model for label estimation is learned by machine learning. The following features are generally utilized: characters, type of
2 Besides, there are the IOB1 and IOB2 algorithms, using only I, O, and B, and the IOE1 and IOE2 algorithms, using only I, O, and E (Kim and Veenstra, 1999).
Figure 2: An overview of our proposed method (the bunsetsu "Habu-Yoshiharu-Meijin"; (a) initial state, (b) final output)
character, POS, etc. of the morpheme and of the surrounding two morphemes. Methods utilizing SVM or CRF have been proposed.
Most NER methods based on sequential labeling use only local information. Therefore, methods utilizing global information have been proposed. Nakano et al. utilized as features the word sub-class of the NE in the analyzing direction within the bunsetsu, the noun at the end of the bunsetsu adjacent to the analyzing direction, and the head of each bunsetsu (Nakano and Hirai, 2004). Sasano et al. utilized a cache feature, coreference results, a syntactic feature, and a case-frame feature as structural features (Sasano and Kurohashi, 2008).
Some work has acquired knowledge from a large unannotated corpus and applied it to NER. Kazama et al. utilized a named entity dictionary constructed from Wikipedia and a noun clustering result obtained from a huge number of dependency-relation pairs (Kazama and Torisawa, 2008). Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus (Fukushima et al., 2008).
In Japanese NER research, the CRL NE data are usually utilized for evaluation. These data consist of approximately 10 thousand sentences from newspaper articles, in which approximately 20 thousand NEs are annotated. Previous work achieved an F-measure of about 0.89 on these data.
3 Overview of Proposed Method
Our proposed method first estimates the labels of all the compound nouns (chunks) in a bunsetsu.
Then, the best label assignment is determined by bottom-up dynamic programming, as in the CKY parsing algorithm. Figure 2 illustrates an overview of our proposed method. In this example, the bunsetsu "Habu-Yoshiharu-Meijin" (Grand Master Yoshiharu Habu) is analyzed. First, the labels of all the chunks in the bunsetsu ("Habu", "Habu-Yoshiharu", "Habu-Yoshiharu-Meijin", ..., "Meijin", etc.) are estimated using a machine learning method, as shown in Figure 2 (a).
We call the state in Figure 2 (a), where the labels of all the chunks have been estimated, the initial state. From this state, the best label assignment in the bunsetsu is determined. This procedure is performed from the lower left (corresponding to each morpheme) to the upper right, like the CKY parsing algorithm, as shown in Figure 2 (b). For example, when the label assignment for "Habu-Yoshiharu" is determined, the label assignment "Habu-Yoshiharu" (PERSON) and the label assignment "Habu" (PERSON) plus "Yoshiharu" (OTHER) are compared, and the better one is chosen. While grammatical rules are utilized in the general CKY algorithm, this method chooses the better label assignment for each cell using a machine learning method.
The learned models are the following:
• the model that estimates the label of a chunk (label estimation model)
• the model that compares two label assignments (label comparison model)
The two models are described in detail in the next section.
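The bottom-up search that these two models drive can be sketched as a CKY-style dynamic program over chunk spans (a minimal illustration, not the authors' implementation; `estimate` and `compare` are hypothetical stand-ins for the label estimation and label comparison models):

```python
# CKY-style search over label assignments in a bunsetsu (sketch).
# `estimate(chunk)` returns a label for one chunk; `compare(a, b)`
# returns True if assignment `a` beats assignment `b`. Both are
# stand-ins for the SVM models described in the paper.

def best_assignment(morphemes, estimate, compare):
    n = len(morphemes)
    # cell[(i, j)] holds the current best label assignment for the
    # span of morphemes i..j, as a list of (chunk_string, label) pairs.
    cell = {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            chunk = "-".join(morphemes[i:j + 1])
            best = [(chunk, estimate(chunk))]  # the whole span as one chunk
            for k in range(i, j):              # every binary split of the span
                cand = cell[(i, k)] + cell[(k + 1, j)]
                if compare(cand, best):
                    best = cand
            cell[(i, j)] = best
    return cell[(0, n - 1)]
```

With a `compare` that always prefers the unsplit span, the bunsetsu stays one chunk; with one that always prefers splits, it decomposes into single morphemes.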
Figure 3: Label assignment for all the chunks in the bunsetsu "Habu-Yoshiharu-Meijin"
4 Model Learning
4.1 Label Estimation Model

This model estimates the label for each chunk. The analysis unit is basically the bunsetsu. This is because 93.5% of named entities are located within a bunsetsu in the CRL NE data. Exceptionally, the following expressions located across multiple bunsetsus tend to be NEs:
• expressions enclosed in parentheses (e.g., "'Himeyuri-no tou'" (The Tower of Himeyuri) (ARTIFACT))
• expressions that have an entry in Wikipedia (e.g., "Nihon-yatyou-no kai" (Wild Bird Society of Japan) (ORGANIZATION))
Hereafter, a bunsetsu is expanded when one of the above conditions is met. By this expansion, 98.6% of named entities are located within a bunsetsu.3
For each bunsetsu, the head or tail function words are deleted. For example, in the bunsetsu "Habu-Yoshiharu-Meijin-wa", the tail function word "wa" (TOP) is deleted. In the bunsetsu "yaku-san-bai" (about three times), the head function word "yaku" (about) is deleted.
Next, for learning the label estimation model, all the chunks in a bunsetsu are assigned the correct label from a hand-annotated corpus. The label set consists of 13 classes: the eight NE classes (as shown in Table 1) and five additional classes, OTHERs, OTHERb, OTHERi, OTHERe, and invalid.
A chunk that corresponds to a whole bunsetsu and does not contain any NEs is labeled as OTHERs, and head, middle, and tail chunks that do not correspond to an NE are labeled as OTHERb, OTHERi, and OTHERe, respectively.4
3 As examples in which an NE is not covered by an expanded bunsetsu, there are "Toru-no Kimi" (PERSON) and "Osaka-fu midori-no kankyo-seibi-shitsu" (ORGANIZATION).
4 Each OTHER label is assigned to the longest chunk that satisfies its condition.
1. number of morphemes in the chunk
2. the position of the chunk in its bunsetsu
3. character type5
4. the combination of the character types of adjoining morphemes
   - For the chunk "Russian Army", this feature is "Katakana/Kanji"
5. word class, word sub-class, and several features provided by the morphological analyzer JUMAN
6. several features6 provided by the parser KNP
7. string of the morpheme in the chunk
8. IPADIC7 feature
   - If the string of the chunk is registered in one of the following categories of IPADIC, "person", "location", "organization", and "general", this feature fires
9. Wikipedia feature
   - If the string of the chunk has an entry in Wikipedia, this feature fires
   - the hypernym extracted from its definition sentence using some patterns (e.g., the hypernym of "the Liberal Democratic Party" is "political party")
10. cache feature
    - When the same string as the chunk appears in the preceding context, the label of the preceding chunk is used for this feature
11. particles that the bunsetsu includes
12. the morphemes, particles, and head morpheme in the parent bunsetsu
13. the NE-category ratio in a case slot of a predicate/noun case frame (Sasano and Kurohashi, 2008)
    - For example, in the ga (NOM) case of the predicate case frame "kaiken" (interview), the NE ratio "PERSON:0.245" is assigned to the case slot. Hence, in the sentence "Habu-ga kaiken-shita" (Mr. Habu interviewed), the feature "PERSON:0.245" is utilized for the chunk "Habu"
14. parenthesis feature
    - When the chunk is in a parenthesis, this feature fires

Table 2: Features for the label estimation model
A chunk that is neither any of the eight NE classes nor the above four OTHERs is labeled as invalid.
In the example shown in Figure 3, "Habu-Yoshiharu" is labeled as PERSON, "Meijin" is labeled as OTHERe, and the other chunks are labeled as invalid.
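Enumerating every chunk of a bunsetsu, as in Figure 3, can be sketched as follows (an illustration; joining morphemes with "-" mirrors the romanized notation used in the paper):

```python
# List all chunks (contiguous morpheme sequences) of a bunsetsu.
# For n morphemes there are n * (n + 1) / 2 chunks.

def all_chunks(morphemes):
    n = len(morphemes)
    return ["-".join(morphemes[i:j]) for i in range(n)
            for j in range(i + 1, n + 1)]

print(all_chunks(["Habu", "Yoshiharu", "Meijin"]))
# ['Habu', 'Habu-Yoshiharu', 'Habu-Yoshiharu-Meijin',
#  'Yoshiharu', 'Yoshiharu-Meijin', 'Meijin']
```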
Next, the label estimation model is learned from the data in which the above label set is assigned
5 The following five character types are considered: Kanji, Hiragana, Katakana, Number, and Alphabet.
6 When a morpheme is ambiguous, all the corresponding features fire.
7 http://chasen.aist-nara.ac.jp/chasen/distribution.html.ja
to all the chunks. The features for the label estimation model are shown in Table 2. Among them, for features (3) and (5)–(8), three categories according to the position of the morpheme in the chunk are prepared: "head", "tail", and "anywhere". For example, in the chunk "Habu-Yoshiharu-Meijin", for the morpheme "Habu", feature (7) is set to "Habu" in "head", and for the morpheme "Yoshiharu", feature (7) is set to "Yoshiharu" in "anywhere".
The label estimation model is learned from the pairs of label and features for each chunk. To handle the multiple classes, the one-vs-rest method is adopted (consequently, 13 models are learned). The SVM output is transformed by the sigmoid function 1/(1 + exp(−βx)), and the transformed values are normalized so that the values of the 13 labels in a chunk sum to one.
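The transformation just described can be sketched as follows (an illustration; the default beta = 1 matches the experimental setting reported in Section 6):

```python
import math

# Transform raw one-vs-rest SVM outputs for the 13 labels of one
# chunk with the sigmoid 1 / (1 + exp(-beta * x)), then normalize
# so that the 13 scores sum to one.

def label_scores(svm_outputs, beta=1.0):
    sig = [1.0 / (1.0 + math.exp(-beta * x)) for x in svm_outputs]
    total = sum(sig)
    return [s / total for s in sig]

scores = label_scores([2.1, -0.3, -1.5])  # toy 3-label example
print(round(sum(scores), 6))  # 1.0
```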
The purpose of setting up the label "invalid" is as follows. For the chunks "Habu" and "Yoshiharu" in Figure 3, since the label "invalid" has a relatively high score, the score of the label PERSON is relatively low. Therefore, when the label comparison described in Section 4.2 is performed, the label assignment "Habu-Yoshiharu" (PERSON) is likely to be chosen. For a chunk where the label invalid has the highest score, the label with the second highest score is adopted.
4.2 Label Comparison Model
This model compares two label assignments for a certain string. For example, for the string "Habu-Yoshiharu", the model compares the following two label assignments:
• "Habu-Yoshiharu" is labeled as PERSON
• "Habu" is labeled as PERSON and "Yoshiharu" is labeled as MONEY
First, as shown in Figure 4, the two compared sets of chunks are lined up on either side of "vs" (the left one is called the first set and the right one the second set). When the first set is correct, the example is positive; otherwise, it is negative. The maximum number of chunks in each set is five, and thus examples in which the first or second set has more than five chunks are not utilized for model learning.
Then, features are assigned to each example. The feature vector (13 dimensions) for each chunk is defined as follows: the first 12 dimensions are used
Figure 4: Assignment of positive/negative examples
for each label, whose score is estimated by the label estimation model, and the last (13th) dimension is used for the score of the SVM output. Then, for the first and second sets, the features for each chunk are arranged from the left, and zero vectors are placed in the remainder.
Figure 5 illustrates the features for "Habu-Yoshiharu" vs "Habu + Yoshiharu". The label comparison model is learned from such data using SVM. Note that only the fact that "Habu-Yoshiharu" is PERSON can be found in the hand-annotated corpus, and thus for the example "Habu-Yoshiharu-Meijin" vs "Habu + Yoshiharu-Meijin" we cannot determine which one is correct. Therefore, such examples cannot be used for model learning.
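The zero-padded encoding of Figure 5 can be sketched as follows (an illustration under the stated dimensions; `encode_set` and `encode_example` are hypothetical helper names):

```python
# Encode one comparison example for the label comparison model:
# each set holds up to five chunks, each represented by a
# 13-dimensional vector (12 label scores + 1 SVM score); unused
# chunk slots are filled with zero vectors, as in Figure 5.

DIM, MAX_CHUNKS = 13, 5

def encode_set(chunk_vectors):
    assert len(chunk_vectors) <= MAX_CHUNKS
    flat = []
    for vec in chunk_vectors:
        assert len(vec) == DIM
        flat.extend(vec)
    flat.extend([0.0] * (DIM * (MAX_CHUNKS - len(chunk_vectors))))
    return flat

def encode_example(first_set, second_set):
    # first set followed by second set: 2 * 5 * 13 = 130 dimensions
    return encode_set(first_set) + encode_set(second_set)

x = encode_example([[0.1] * 13], [[0.2] * 13, [0.3] * 13])
print(len(x))  # 130
```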
5 Analysis
First, the labels of all the chunks in a bunsetsu are estimated using the label estimation model described in Section 4.1. Then, the best label assignment in the bunsetsu is determined by applying the label comparison model described in Section 4.2 iteratively, as shown in Figure 2 (b). In this step, the better label assignment is determined bottom-up, as in the CKY parsing algorithm.
For example, the initial state shown in Figure 2 (a) is obtained using the label estimation model. Then, the label assignment is determined using the label comparison model, proceeding from the lower left (corresponding to each morpheme) to the upper right. In determining the label assignment for the cell of "Habu-Yoshiharu", as shown in Figure 6 (a), the model compares the label assignment "B" with the label assignment "A+D". In this case, the model chooses the label assignment "B"; that is, "Habu-Yoshiharu" is labeled as PERSON. Similarly, in determining the label assignment for the cell of "Yoshiharu-Meijin", the model compares the
chunk    Habu-Yoshiharu    Habu    Yoshiharu
label    PERSON            PERSON  MONEY
vector   V11 0 0 0 0       V21 V22 0 0 0

Figure 5: An example of the features for the label comparison model (the example is "Habu-Yoshiharu vs Habu + Yoshiharu"; V11, V21, V22, and 0 are vectors of dimension 13)
Figure 6: The label comparison model ((a) label assignment for the cell "Habu-Yoshiharu"; (b) label assignment for the cell "Habu-Yoshiharu-Meijin")
label assignment "E" with the label assignment "D+F". In this case, the model chooses the label assignment "D+F"; that is, "Yoshiharu" is labeled as MONEY and "Meijin" is labeled as OTHERe. When the chosen label assignment consists of multiple chunks, the content of the cell is updated. In this case, the cell "E" is changed from "Yoshiharu-Meijin" (OTHERe) to "Yoshiharu + Meijin" (MONEY + OTHERe).
As shown in Figure 6 (b), in determining the best label assignment for the upper right cell, that is, the final output, the model compares the label assignments "A+D+F", "B+F", and "C". When there are more than two candidate label assignments for a cell, all the label assignments are compared pairwise, and the label assignment that obtains the highest score is adopted.
In the label comparison step, a label assignment in which one OTHER* label follows another (OTHER* - OTHER*) is not allowed, since each OTHER label is assigned to the longest chunk, as described in Section 4.1. When the first combination of chunks equals the second combination, the comparison is not performed.
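The pairwise selection among more than two candidates can be sketched as follows (an illustration, not the authors' code; `compare(a, b)` is a hypothetical stand-in for the label comparison model, and counting pairwise wins stands in for the score used to pick the winner):

```python
from itertools import combinations

# Compare every pair of candidate label assignments for a cell and
# adopt the candidate with the most pairwise wins.

def select(candidates, compare):
    wins = {i: 0 for i in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if compare(candidates[i], candidates[j]) else j
        wins[winner] += 1
    return candidates[max(wins, key=wins.get)]
```

For instance, with a toy `compare` that prefers the shorter string, `select(["abc", "ab", "a"], ...)` returns `"a"`.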
6 Experiment
To demonstrate the effectiveness of our proposed method, we conducted an experiment on the CRL NE data. In these data, 10,718 sentences in 1,174 news articles are annotated with the eight NE classes. Expressions that are difficult to annotate manually are labeled as OPTIONAL and were not used for
Figure 7: 5-fold cross validation
the model learning8 or the evaluation.

We performed 5-fold cross validation, following
previous work. Unlike previous work, our method has to learn the SVM models twice. Therefore, the corpus was divided as shown in Figure 7. Let us consider the analysis of part (a). First, the label estimation model 1 is learned from parts (b)–(e). Then, the label estimation model 2b is learned from parts (c)–(e), and by applying the learned model to part (b), features for learning the label comparison model are obtained. Similarly, the label estimation model 2c is learned from parts (b), (d), and (e), and by applying it to part (c), features are obtained. The same holds for parts
8 Exceptionally, "OPTIONAL" expressions are used when the label estimation models for OTHER* and invalid are learned.
              Recall              Precision
ORGANIZATION  81.83 (3008/3676)   88.37 (3008/3404)
PERSON        90.05 (3458/3840)   93.87 (3458/3684)
LOCATION      91.38 (4992/5463)   92.44 (4992/5400)
ARTIFACT      46.72 (349/747)     74.89 (349/466)
DATE          93.27 (3327/3567)   93.12 (3327/3573)
TIME          88.25 (443/502)     90.59 (443/489)
MONEY         93.85 (366/390)     97.60 (366/375)
PERCENT       95.33 (469/492)     95.91 (469/489)
ALL-SLOT      87.87               91.79
F-measure     89.79

Table 3: Experimental results
(d) and (e). Then, the label comparison model is learned from the obtained features. After that, the analysis of part (a) is performed using both the label estimation model 1 and the label comparison model.
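The partitioning of Figure 7 can be sketched as follows (an illustration of the nested split; the dictionary keys are hypothetical names, not the authors' code):

```python
# For each held-out part, label estimation model 1 is trained on the
# other four parts; to obtain comparison-model training features for
# each of those four parts p, a second-stage estimation model is
# trained on the remaining three parts and applied to p.

PARTS = ["a", "b", "c", "d", "e"]

def folds(parts):
    plan = []
    for test in parts:
        train1 = [p for p in parts if p != test]      # for model 1
        stage2 = [(p, [q for q in train1 if q != p])  # per-part model 2
                  for p in train1]
        plan.append({"test": test, "model1_train": train1,
                     "model2": stage2})
    return plan

plan = folds(PARTS)
print(len(plan))  # 5
```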
In this experiment, the Japanese morphological analyzer JUMAN9 and the Japanese parser KNP10 were adopted. The two SVM models were learned with a polynomial kernel of degree 2, and β in the sigmoid function was set to 1.
Table 3 shows the experimental results. The F-measure over all NE classes is 89.79.
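As a quick consistency check, this figure follows from the ALL-SLOT recall and precision in Table 3 via F = 2PR / (P + R):

```python
# Recompute the overall F-measure from the ALL-SLOT row of Table 3.
recall, precision = 0.8787, 0.9179
f = 2 * precision * recall / (precision + recall)
print(round(100 * f, 2))  # 89.79
```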
7 Discussion
7.1 Comparison with Previous Work
Table 4 presents the comparison with previous work; our method outperformed previous work. Among previous work, Fukushima et al. acquired a huge number of category-instance pairs (e.g., "political party - New Party DAICHI", "company - TOYOTA") with some patterns from a large Web corpus, and Sasano et al. utilized the result of coreference resolution as a feature for model learning. Therefore, by incorporating such knowledge and/or analysis results, the performance of our method could be improved further.
Compared with Sasano et al., our method achieved better performance in analyzing long compound nouns. For example, in the bunsetsu "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" (Treaty on Conventional Armed Forces in Europe), while Sasano et al. labeled "Oushu" (Europe) as LOCATION, our method correctly labeled "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" as ARTIFACT. Sasano et al. incorrectly labeled "Oushu" as LOCATION although they utilized the information about
9 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
10 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
the head of the bunsetsu, "jyouyaku" (treaty). In our method, for the cell "Oushu", invalid has the highest score, and thus the score of LOCATION drops relatively. Similarly, for the cell "senryoku-sakugen-jyouyaku", invalid has the highest score. Consequently, "Oushu-tsuujyou-senryoku-sakugen-jyouyaku" is correctly labeled as ARTIFACT.
For the bunsetsu "gaikoku-jin-touroku-hou-ihan" (the violation of the foreigner registration law), while Sasano et al. labeled "touroku-hou" as ARTIFACT, our method correctly labeled "gaikoku-jin-touroku-hou" as ARTIFACT. Sasano et al. cannot utilize the information about "hou", which is useful for label estimation, since the head of this bunsetsu is "ihan". In contrast, our method can utilize the information about "hou" when estimating the label of the chunk "gaikoku-jin-touroku-hou".
7.2 Error Analysis
There were some errors in analyzing Katakana alphabet words. In the following example, although the correct analysis labels "Batistuta" as PERSON, the system labeled it as OTHERs:
(4) Italy-de katsuyaku-suru Batistuta-wo kuwaeta Argentine
    Italy-LOC active Batistuta-ACC call Argentine
    'Argentine called Batistuta, who was active in Italy'
There is no entry for "Batistuta" in the dictionary of JUMAN or in Wikipedia, and thus only the surrounding information is utilized. However, the case analysis of "katsuyaku" (active) is incorrect, which leads to the error on "Batistuta".
There were some errors in applying the label comparison model even though the analysis of each chunk was correct. For example, for the bunsetsu "HongKong-seityou" (Government of Hong Kong), the correct analysis labels "HongKong-seityou" as ORGANIZATION. As shown in Figure 8 (b), the system incorrectly labeled "HongKong" as LOCATION: although in the initial state, shown in Figure 8 (a), "HongKong-seityou" was correctly labeled as ORGANIZATION, the label assignment "HongKong + seityou" was incorrectly chosen by the label comparison model. To cope with this problem, we plan to adjust the value of β in the sigmoid function and to refine the
                                F1     analysis unit   distinctive features
(Fukushima et al., 2008)        89.29  character       Web
(Kazama and Torisawa, 2008)     88.93  character       Wikipedia, Web
(Sasano and Kurohashi, 2008)    89.40  morpheme        structural information
(Nakano and Hirai, 2004)        89.03  character       bunsetsu features
(Masayuki and Matsumoto, 2003)  87.21  character
(Isozaki and Kazawa, 2003)      86.77  morpheme
proposed method                 89.79  compound noun   Wikipedia, structural information

Table 4: Comparison with previous work (all work was evaluated on the CRL NE data using cross validation)
Figure 8: An example of an error in the label comparison model ((a) initial state; (b) final output)
features for the label comparison model.
8 Conclusion
This paper proposed a bottom-up named entity recognition method using two-stage machine learning. The method first estimates the labels of all the chunks in a bunsetsu using machine learning, and then the best label assignment is determined by bottom-up dynamic programming. We conducted an experiment on the CRL NE data and achieved an F-measure of 89.79.
We are planning to integrate this method with the syntactic and case analysis method of Kawahara and Kurohashi (2007) and to perform syntactic, case, and named entity analysis simultaneously to improve the overall accuracy.
References

Ken'ichi Fukushima, Nobuhiro Kaji, and Masaru Kitsuregawa. 2008. Use of massive amounts of web text in Japanese named entity recognition. In Proceedings of the Data Engineering Workshop (DEWS2008), A3-3. (in Japanese).
IREX Committee, editor. 1999. Proceedings of the IREX Workshop.
Hideki Isozaki and Hideto Kazawa. 2003. Speeding up support vector machines for named entity recognition. Transactions of the Information Processing Society of Japan, 44(3):970–979. (in Japanese).
Daisuke Kawahara and Sadao Kurohashi. 2007. Probabilistic coordination disambiguation in a fully-lexicalized Japanese parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 304–311.
Jun'ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proceedings of ACL-08: HLT, pages 407–415.
Erik F. Tjong Kim Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of EACL '99, pages 173–179.
Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of COLING-ACL 2006, pages 1121–1128.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282–289.
Masayuki Asahara and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proceedings of HLT-NAACL 2003, pages 8–15.
Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of the Information Processing Society of Japan, 45(3):934–941. (in Japanese).
Ryohei Sasano and Sadao Kurohashi. 2008. Japanese named entity recognition using structural natural language processing. In Proceedings of the Third International Joint Conference on Natural Language Processing, pages 607–612.
Satoshi Sekine, Ralph Grishman, and Hiroyuki Shinnou. 1998. A decision tree method for finding and classifying names in Japanese texts. In Proceedings of the Sixth Workshop on Very Large Corpora (WVLC-6), pages 171–178.
Vladimir Vapnik 1995 The Nature of StatisticalLearning Theory Springer
Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, pages 63–70, Suntec, Singapore, 6 August 2009. ©2009 ACL and AFNLP
Author Index
Bhutada Pravin 17
Cao Jie 47
Caseli Helena 1
Diab Mona 17
Finatto Maria Jose 1
Fujii Hiroko 63
Fukui Mika 63
Funayama Hirotaka 55
Hoang Hung Huu 31
Huang Yun 47
Kan Min-Yen 9, 31
Kim Su Nam 9, 31
Kuhn Jonas 23
Kurohashi Sadao 55
Liu Qun 47
Lu Yajuan 47
Machado Andre 1
Ren Zhixiang 47
Shibata Tomohide 55
Sinha R Mahesh K 40
Sumita Kazuo 63
Suzuki Masaru 63
Villavicencio Aline 1
Wakaki Hiromi 63
Zarrieszlig Sina 23