COLING 2018
The 27th International Conference on Computational Linguistics
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing (LR4NLP-2018)
August 20, 2018
Santa Fe, New Mexico, USA
©2018 The Association for Computational Linguistics
Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]
ISBN 978-1-948087-54-4
Copyright of each paper stays with the respective authors (or their employers).
ISBN 978-1-948087-54-4
Anabela Barreiro, Kristina Kocijan, Peter Machonis and Max Silberztein (eds.)
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing
Preface
Linguists and developers of NLP software have been working separately for many years. Since stochastic methods, such as statistical and neural-network based parsers, have proven overwhelmingly successful in the software industry, NLP researchers have typically turned their focus towards technical issues specific to stochastic methods, such as improving recall and precision and developing larger and larger training corpora. At the same time, linguists have kept focusing on problems related to the development of exhaustive and precise resources that are mainly “neutral” vis-à-vis any NLP application, such as parsing and generating sentences.
However, recent progress in both fields has been reducing many of these differences, with large-coverage linguistic resources being used more and more by robust NLP software. For instance, NLP researchers now use large dictionaries of multiword units and expressions, and several linguistic experiments have shown the feasibility of using large phrase-structure grammars (a priori used for text parsing) in “generation” mode to automatically produce paraphrases of sentences that are described by these grammars.
The First Workshop on Linguistic Resources for Natural Language Processing (LR4NLP) of the 27th International Conference on Computational Linguistics (COLING 2018), held in Santa Fe, New Mexico, on August 20, 2018, brought together participants interested in developing large-coverage linguistic resources and researchers with an interest in developing real-world Natural Language Processing (NLP) software. The presentations at the LR4NLP Workshop were organized into four sessions, as follows:
• Clash of the Titans: Linguistics vs. Statistics vs. Neural Networks
• May the Force Be with NooJ
• One for the Road: Monolingual Resources
• Language Resources Without Borders
The first session, Clash of the Titans: Linguistics vs. Statistics vs. Neural Networks, focused on linguistic and stochastic approaches and results. Our invited speaker, Mark Liberman, showed how semi-automatic analysis of large digital speech collections is transforming the science of phonetics, and offered exciting opportunities to researchers in other fields, such as the possibility of improving parsing algorithms by incorporating features from speech as well as text. He was followed by Silberztein, who presented a series of experiments aimed at evaluating reference corpora, such as the Open American National Corpus, and proposed a series of tasks to enhance them. Zhang & Moldovan then assessed the strengths and limitations of neural-network systems relative to rule-based systems on Semantic Textual Similarity, comparing the performance of both against the SemEval 2012 benchmark.
Several workshop participants have been using the NooJ software to develop the large-coverage linguistic resources needed by their NLP applications. NooJ was particularly germane to this workshop because it is not only used by linguists to develop resources in the form of electronic dictionaries and morphological and syntactic grammars, but also by computational linguists to parse and annotate large corpora, as well as by software engineers to develop NLP applications. Thus, we allocated the entire second session, May the Force Be with NooJ, to researchers using this platform. Machonis showed how a lexicon-grammar dictionary of English phrasal verbs can be transformed into a NooJ dictionary, in order to accurately identify these structures in large corpora. Phrasal verbs are located by means of a grammar, and the results are then refined with a series of dictionaries, disambiguating grammars, and filters. Likewise, Kocijan et al. demonstrated how they use NooJ to detect and describe the major derivational processes used in the formation of perfective, imperfective, and bi-aspectual Croatian verbs. Annotated chains are exported into a format adequate for a web-based system and further used to enhance the aspectual and derivational information for each verb. Next, Boudhina & Fehri presented a rule-based system for disambiguating French locative verbs in order to accurately translate them into Arabic. They used the Dubois & Dubois-Charlier French Verb dictionary, a set of French syntactic grammars, as well as a bilingual French-Arabic dictionary developed within the NooJ platform. Finally, Rodrigo et al. presented a NooJ application aimed at teaching Spanish as a foreign language to native speakers of Italian. Their presentation included an analysis of a journalistic corpus over a thirty-year time span, focusing on adjectives used in the Argentine Rioplatense variety of Spanish.
In the third session, One for the Road: Monolingual Resources, researchers examined a variety of large-coverage, monolingual linguistic resources for NLP applications. Dorr & Voss described the linguistic resource STYLUS (SysTematicallY Derived Language USe), which they produced by extracting a set of argument realizations from lexical-semantic representations for a range of 500 English verb classes. Their Verb Database contains a total of 9,525 entries and includes information about components of meaning and collocations. STYLUS enables systematic derivation of regular patterns of language usage without requiring manual annotation. Then, Gezmu et al. presented a corpus of contemporary Amharic, automatically tagged for morpho-syntactic information. Texts were collected from 25,199 documents from different domains, and about 24 million orthographic words were tokenized. Malireddy et al. discussed a new summarization technique, called Telegraphic Summarization, which, instead of selecting whole sentences, picks short segments of text spread across sentences in order to build the resulting summary. They proposed a set of guidelines to create such summaries and annotated a gold corpus of 200 English short stories. Finally, Abera et al. described the procedures that were used to create the first speech corpus of Tigrinya, a Semitic language spoken in the Horn of Africa, for speech recognition purposes.
The closing session, Language Resources Without Borders, focused on the development of large-coverage, multilingual linguistic resources for Machine Translation (MT). Abate et al. described the development of parallel corpora for five Ethiopian languages: Amharic, Tigrigna, Afan-Oromo, Wolaytta and Geez. The authors conducted statistical machine translation experiments for seven language pairs, which showed that the morphological complexity of these languages has a negative impact on the performance of the translation, especially for the target languages. Then, using the FrameNet and SALSA corpora, Sikos & Padó examined English and German, highlighting how inferences can be made about cross-lingual frame applicability using a vector space model. They showed how multilingual vector representations of frames learned from manually annotated corpora can address the need of accessing broad-coverage resources for any language pair. Next, Zhai et al. presented a parallel multilingual oral corpus, the TED Talks, in English, French, and Chinese. The authors categorized and annotated translation relations, to distinguish literal translation from other translation techniques. They developed a classifier to automatically detect these relations, with the long-term objective of achieving better semantic control when dealing with paraphrases or translational equivalencies. Tomokiyo et al. aimed at improving the Cesselin, a well-known, open-source Japanese-French dictionary. They hypothesized that the degree of lexical similarity between results of MT into a third language might provide insight on how to better annotate proverbs, idiomatic constructions, and phrases containing quantifiers. To test this, they used Google Translate to translate both the Cesselin Japanese expressions and their French translations into English. Their results showed much promise, in particular for distinguishing normal usage from idiomatic examples.
Barreiro & Batista presented a detailed analysis of Portuguese contractions in an aligned bilingual Portuguese-English corpus and argued that the choice of whether or not to decompose contractions depends on their context, for which the occurrence of multiword units is key. Finally, Dhar et al. presented a newly created parallel corpus of English and code-mixed English-Hindi. Using 6,088 code-mixed English-Hindi sentences previously available, they created a parallel English corpus using human translators. They then presented a technique to augment run-of-the-mill MT approaches, which achieves superior translations without the need for specially designed translation systems, and which can be plugged into any existing MT system.
The common theme of all of the papers presented in this workshop was how to build large linguistic resources in the form of annotated corpora, dictionaries, and morphological and syntactic grammars that can be used by NLP applications. Linguists, as well as computational linguists who work on NLP applications based on linguistic methods, will find advanced, up-to-the-minute studies on these themes in this volume. We hope that readers will appreciate the importance of this volume, both for the intrinsic value of each linguistic formalization and its underlying methodology, and for the potential for developing automatic NLP applications.
Editors:
Anabela Barreiro, INESC-ID, Lisbon, Portugal
Kristina Kocijan, University of Zagreb, Zagreb, Croatia
Peter Machonis, Florida International University, Miami, USA
Max Silberztein, Université de Franche-Comté, Besançon, France
Organizers and Review Committee
Workshop Organizers
Anabela Barreiro, INESC-ID, Lisbon, Portugal
Kristina Kocijan, University of Zagreb, Croatia
Peter Machonis, Florida International University, USA
Max Silberztein, Université de Franche-Comté, France
Peer Review Committee
Program Committee Chair
Max Silberztein, Université de Franche-Comté, France
Jorge Baptista, University of Algarve, Portugal
Anabela Barreiro, INESC-ID Lisbon, Portugal
Xavier Blanco, Autonomous University of Barcelona, Spain
Nicoletta Calzolari, Istituto di Linguistica Computazionale, Italy
Christiane Fellbaum, Princeton University, USA
Héla Fehri, University of Sfax, Tunisia
Yuras Hetsevich, National Academy of Sciences, Belarus
Kristina Kocijan, University of Zagreb, Croatia
Mark Liberman, University of Pennsylvania, USA
Elena Lloret Pastor, Universidad de Alicante, Spain
Peter Machonis, Florida International University, USA
Slim Mesfar, Carthage University, Tunisia
Simon Mille, Universitat Pompeu Fabra, Spain
Mario Monteleone, University of Salerno, Italy
Johanna Monti, University of Naples - L’Orientale, Italy
Bernard Scott, Logos Institute, USA
Invited Speaker
Mark Liberman, University of Pennsylvania, USA
Session Chairs
Anabela Barreiro, INESC-ID, Lisbon, Portugal
Kristina Kocijan, University of Zagreb, Croatia
Peter Machonis, Florida International University, USA
Max Silberztein, Université de Franche-Comté, France
Table of Contents
Corpus Phonetics: Past, Present, and Future
    Mark Liberman . . . 1

Using Linguistic Resources to Evaluate the Quality of Annotated Corpora
    Max Silberztein . . . 2

Rule-based vs. Neural Net Approaches to Semantic Textual Similarity
    Linrui Zhang and Dan Moldovan . . . 12

Linguistic Resources for Phrasal Verb Identification
    Peter Machonis . . . 18

Designing a Croatian Aspectual Derivatives Dictionary: Preliminary Stages
    Kristina Kocijan, Krešimir Šojat and Dario Poljak . . . 28

A Rule-Based System for Disambiguating French Locative Verbs and Their Translation into Arabic
    Safa Boudhina and Héla Fehri . . . 38

A Pedagogical Application of NooJ in Language Teaching: The Adjective in Spanish and Italian
    Andrea Rodrigo, Mario Monteleone and Silvia Reyes . . . 47

STYLUS: A Resource for Systematically Derived Language Usage
    Bonnie Dorr and Clare Voss . . . 57

Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus
    Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser and Andreas Nürnberger . . . 65

Gold Corpus for Telegraphic Summarization
    Chanakya Malireddy, Srivenkata N M Somisetty and Manish Shrivastava . . . 71

Design of a Tigrinya Language Speech Corpus for Speech Recognition
    Hafte Abera and Sebsibe H/Mariam . . . 78

Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs
    Solomon Teferra Abate, Michael Melese, Martha Yifiru Tachbelie, Million Meshesha, Solomon Atinafu, Wondwossen Mulugeta, Yaregal Assabie, Hafte Abera, Binyam Ephrem, Tewodros Abebe, Wondimagegnhue Tsegaye, Amanuel Lemma, Tsegaye Andargie and Seifedin Shifaw . . . 83

Using Embeddings to Compare FrameNet Frames Across Languages
    Jennifer Sikos and Sebastian Padó . . . 91

Construction of a Multilingual Corpus Annotated with Translation Relations
    Yuming Zhai, Aurélien Max and Anne Vilnat . . . 102

Towards an Automatic Classification of Illustrative Examples in a Large Japanese-French Dictionary Obtained by OCR
    Christian Boitet, Mathieu Mangeot and Mutsuko Tomokiyo . . . 112

Contractions: To Align or Not to Align, That Is the Question
    Anabela Barreiro and Fernando Batista . . . 122

Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach
    Mrinal Dhar, Vaibhav Kumar and Manish Shrivastava . . . 131
Conference Program
Monday, August 20, 2018
9:00–10:30 Session S1: Clash of the Titans: Linguistics vs. Statistics vs. Neural-nets
9:10–9:50 Corpus Phonetics: Past, Present, and Future
Mark Liberman

9:50–10:10 Using Linguistic Resources to Evaluate the Quality of Annotated Corpora
Max Silberztein

10:10–10:30 Rule-based vs. Neural Net Approaches to Semantic Textual Similarity
Linrui Zhang and Dan Moldovan
11:00–12:20 Session S2: May the Force Be with NooJ
11:00–11:20 Linguistic Resources for Phrasal Verb Identification
Peter Machonis

11:20–11:40 Designing a Croatian Aspectual Derivatives Dictionary: Preliminary Stages
Kristina Kocijan, Krešimir Šojat and Dario Poljak

11:40–12:00 A Rule-Based System for Disambiguating French Locative Verbs and Their Translation into Arabic
Safa Boudhina and Héla Fehri

12:00–12:20 A Pedagogical Application of NooJ in Language Teaching: The Adjective in Spanish and Italian
Andrea Rodrigo, Mario Monteleone and Silvia Reyes
Monday, August 20, 2018 (continued)
14:00–15:20 Session S3: One for the Road: Monolingual Resources
14:00–14:20 STYLUS: A Resource for Systematically Derived Language Usage
Bonnie Dorr and Clare Voss

14:20–14:40 Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus
Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser and Andreas Nürnberger

14:40–15:00 Gold Corpus for Telegraphic Summarization
Chanakya Malireddy, Srivenkata N M Somisetty and Manish Shrivastava

15:00–15:20 Design of a Tigrinya Language Speech Corpus for Speech Recognition
Hafte Abera and Sebsibe H/Mariam
16:00–18:00 Session S4: Language Resources without Borders
16:00–16:20 Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs
Solomon Teferra Abate, Michael Melese, Martha Yifiru Tachbelie, Million Meshesha, Solomon Atinafu, Wondwossen Mulugeta, Yaregal Assabie, Hafte Abera, Binyam Ephrem, Tewodros Abebe, Wondimagegnhue Tsegaye, Amanuel Lemma, Tsegaye Andargie and Seifedin Shifaw

16:20–16:40 Using Embeddings to Compare FrameNet Frames Across Languages
Jennifer Sikos and Sebastian Padó

16:40–17:00 Construction of a Multilingual Corpus Annotated with Translation Relations
Yuming Zhai, Aurélien Max and Anne Vilnat

17:00–17:20 Towards an Automatic Classification of Illustrative Examples in a Large Japanese-French Dictionary Obtained by OCR
Christian Boitet, Mathieu Mangeot and Mutsuko Tomokiyo

17:20–17:40 Contractions: To Align or Not to Align, That Is the Question
Anabela Barreiro and Fernando Batista

17:40–18:00 Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach
Mrinal Dhar, Vaibhav Kumar and Manish Shrivastava
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, page 1, Santa Fe, New Mexico, USA, August 20, 2018.
Corpus Phonetics: Past, Present, and Future
Mark Liberman
Department of Linguistics
University of Pennsylvania
Philadelphia, PA, USA
Invited Speaker
Abstract
Semi-automatic analysis of digital speech collections is transforming the science of phonetics, and offers interesting opportunities to researchers in other fields. Convenient search and analysis of large published bodies of recordings, transcripts, metadata, and annotations – as much as three or four orders of magnitude larger than a few decades ago – has created a trend towards “corpus phonetics,” whose benefits include greatly increased researcher productivity, better coverage of variation in speech patterns, and essential support for reproducibility.

The results of this work include insight into theoretical questions at all levels of linguistic analysis, as well as applications in fields as diverse as psychology, sociology, medicine, and poetics, as well as within phonetics itself. Crucially, analytic inputs include annotation or categorization of speech recordings along many dimensions, from words and phrase structures to discourse structures, speaker attitudes, speaker demographics, and speech styles. Among the many near-term opportunities in this area we can single out the possibility of improving parsing algorithms by incorporating features from speech as well as text.
Biography
Mark Liberman is the Christopher H. Browne Professor of Linguistics at the University of Pennsylvania, as well as Professor of Computer and Information Science, Faculty Director of Ware College House, and Director of the Linguistic Data Consortium. Before moving to Penn, he was a member of technical staff and head of the Linguistics Research Department at AT&T Bell Laboratories from 1975 to 1990. He is a fellow of the Linguistic Society of America and the American Association for the Advancement of Science, and co-editor of the Annual Review of Linguistics.

He received the Antonio Zampolli Prize from the European Language Resources Association in 2010, and the IEEE James L. Flanagan Speech and Audio Processing Award in 2017.

His current research focuses on features of speech, language, and communicative interaction that are associated with neuropsychological categories and with relevant dimensions of variation in the population at large.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 2–11, Santa Fe, New Mexico, USA, August 20, 2018.
Using Linguistic Resources to
Evaluate the Quality of Annotated Corpora
Max Silberztein
Université de Franche-Comté
Abstract
Statistical and neural network based methods that compute their results by comparing a given
text to be analyzed with a reference corpus assume that the reference corpus is complete and
reliable enough. In this article, I conduct several experiments to verify this assumption and I
suggest ways to improve these reference corpora by using carefully handcrafted linguistic
resources.
1 Introduction*
Nowadays, most Natural Language Processing (NLP) applications use stochastic methods that are, for example, statistical- or neural network-based, in order to analyze new texts. Analyzing a text thus involves comparing it with a “training” or reference corpus, which is a set of texts that have been either pre-analyzed manually or parsed automatically and then checked by a linguist. Provided that the reference corpus and the text to analyze are similar enough, these methods produce satisfactory results.

Because natural languages contain infinite sets of sentences, these methods cannot just compare the text to be analyzed with the reference corpus directly at the sentence level. They rather process both the text and the reference corpus at the wordform level (i.e. contiguous sequences of letters). To analyze a sentence in a new text, they first look up how each wordform of the text was tagged in the reference corpus, and then they compare the context of the wordform in the text to be analyzed with similar ones in the reference corpus.
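The lookup-and-compare procedure just described can be sketched in miniature. The following is a hypothetical illustration, not the algorithm of NooJ or of any particular stochastic tagger: it picks the most frequent tag observed for a (previous word, wordform) pair in a toy reference corpus, and falls back to the wordform's overall most frequent tag when the context is unseen.

```python
from collections import Counter, defaultdict

# Toy reference corpus of (wordform, tag) pairs, Penn-style.
reference = [("the", "DT"), ("milk", "NN"), ("turns", "VBZ"), ("sour", "JJ"),
             ("they", "PRP"), ("milk", "VBP"), ("the", "DT"), ("cows", "NNS")]

# For each (previous word, wordform) context, count the tags observed.
context_tags = defaultdict(Counter)
prev = "<s>"
for word, tag_ in reference:
    context_tags[(prev, word)][tag_] += 1
    prev = word

def tag(prev_word, word):
    """Most frequent tag for this exact context; fall back to the wordform's
    overall most frequent tag; None if the wordform is entirely unseen."""
    if (prev_word, word) in context_tags:
        return context_tags[(prev_word, word)].most_common(1)[0][0]
    overall = Counter()
    for (_, w), tags in context_tags.items():
        if w == word:
            overall.update(tags)
    return overall.most_common(1)[0][0] if overall else None

print(tag("the", "milk"), tag("they", "milk"), tag("the", "cheese"))
# → NN VBP None
```

An unseen wordform (cheese) yields nothing: the method has no data to deduce a tag from, which is precisely the coverage problem examined in the experiments that follow.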
The basic assumption of these stochastic methods is that if the reference corpus is sufficiently large, the wordforms that constitute the text to be analyzed will have enough occurrences to find identical, or at least similar, contexts. Conversely, if the reference corpus is too small or too different from the text to be analyzed, then the application will produce unreliable results. Therefore, evaluating the quality of an annotated corpus means answering the following questions:

• what is the minimum size of the annotated corpus needed to produce reliable analyses?

• how reliable is the information stored in an annotated corpus?

• how much information is missing in an annotated corpus, and how does the missing information affect the reliability of the analysis of new texts?
For this experiment, I have used the NooJ linguistic development environment1 to study the Slate corpus included in the Open American National Corpus2. This sub-corpus, consisting of 4,531 articles/files, contains 4,302,120 wordforms. Each wordform is tagged according to the Penn tag set3.
This work is licensed under a Creative Commons Attribution 4.0 International License. Page numbers and proceedings footer are added by the organizers. License details: http://creativecommons.org/licenses/by/4.0/

1 NooJ is a free open-source linguistic development environment, distributed by the European Metashare platform; see (Silberztein, 2003) and (Silberztein, 2016).

2 The Open American National Corpus (OANC) is a free corpus and can be downloaded at www.anc.org. We have looked at the Corpus of Contemporary American English (COCA), which has a subset that is free of charge: we will see in section 2 that it has problems related to vocabulary similar to those of the OANC. (Silberztein, 2016) has evaluated the reliability of the Penn treebank and found results similar to those discussed in section 4.

3 In the OANC, as well as in other annotated corpora such as the Penn treebank or the COCA, sequences of digits, punctuation characters, and sequences that contain one or more dashes are also processed as linguistic units.
2 Vocabulary Coverage
2.1 Stability of the vocabulary
As a first experiment, I split the Slate corpus into two files: Even.txt contains all the articles whose original filename ends with an even number (e.g. “ArticleIP_1554.txt”), whereas Odd.txt contains all the articles whose original filename ends with an odd number (e.g. “ArticleIP_1555.txt”). These two corpora are composed of intertwined articles, so that vocabulary differences cannot be blamed on chronological or structural considerations.

As Figure 1 shows,4 almost half of the wordforms in the vocabulary of the total corpus either occur in Even.txt but not in Odd.txt, or occur in Odd.txt but not in Even.txt. In other words, for half of the wordforms in the corpus’ vocabulary, whether or not they occur appears to be a random accident. The fact that the vocabularies of two random subsets of this 4-million-wordform corpus are so different shows that this corpus is still too small to have a stabilized vocabulary.
Figure 1. Vocabulary is unstable
Wordforms that occur in one sub-corpus but not in the other are not necessarily hapaxes or even rare wordforms: there are 4,824 wordforms that occur more than once in Even.txt but never in Odd.txt, and 4,574 wordforms that occur more than once in Odd.txt but never in Even.txt. The following are examples of wordforms that occur 10 times or more in one sub-corpus, but never occur in the other one:

• In Odd.txt: cryptography (12 occurrences), mammary (12), predation (12), selector (15), etc.

• In Even.txt: irradiation (13), jelly (17), obsolescence (11), quintet (16), sturgeon (10), etc.

The vocabulary covered by this 4-million-wordform corpus is still massively unstable, which indicates that the corpus size is much too small to cover a significant portion of the vocabulary.
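The even/odd split experiment can be reproduced in a few lines. The sketch below uses toy stand-ins for the 4,531 Slate article files; `vocabulary` and `split_stability` are hypothetical helper names, not NooJ functions.

```python
from collections import Counter
import re

def vocabulary(text):
    """Case-folded wordform counts (contiguous sequences of letters)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def split_stability(articles):
    """Split the articles into even/odd halves and compare their vocabularies."""
    even = vocabulary(" ".join(articles[0::2]))
    odd = vocabulary(" ".join(articles[1::2]))
    only_even, only_odd = set(even) - set(odd), set(odd) - set(even)
    shared = set(even) & set(odd)
    unstable = len(only_even) + len(only_odd)
    return shared, only_even, only_odd, unstable / (unstable + len(shared))

# Toy stand-ins for the 4,531 Slate article files.
articles = ["the cat sat on the mat", "a dog sat on the rug",
            "the cat chased a mouse", "a dog chased the cat"]
shared, only_even, only_odd, ratio = split_stability(articles)
print(f"{len(shared)} shared, {sorted(only_even)} only even, "
      f"{sorted(only_odd)} only odd, {ratio:.0%} unstable")
```

On the real corpus, the counts in `only_even` and `only_odd` also make it straightforward to list the non-hapax exclusives (forms occurring more than once in one half and never in the other) discussed above.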
2.2 Evolution of the vocabulary
As a second experiment, I studied the evolution of the size of the vocabulary in the full corpus. As we can see in Figure 2, the number of different wordforms (dashed line) first increases sharply and then settles down to a quasi-linear progression. That is not surprising, because in a magazine we expect to find a constant flow of new proper names (e.g. Abbott) and new typos (e.g. achives) as new articles are added to a corpus.
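The growth curve underlying Figure 2 can be computed by counting distinct wordforms at regular checkpoints while scanning the corpus in document order. The token stream below is a toy stand-in for the 4.3 million Slate wordforms, and `growth_curve` is a hypothetical helper.

```python
import re

def growth_curve(tokens, step):
    """Number of distinct wordforms observed after every `step` tokens."""
    seen, curve = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy token stream; for the real experiment, `tokens` would be the
# 4.3 million wordforms of the Slate corpus in document order.
tokens = re.findall(r"[a-z]+", "the quick brown fox jumps over the lazy dog " * 300)
curve = growth_curve(tokens, step=900)
print(curve)  # the curve flattens once all 8 distinct forms have been seen
```

A real corpus never flattens completely (new proper names and typos keep arriving), which is exactly the quasi-linear tail visible in Figure 2.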
In NooJ, every element of the vocabulary is described by one, and only one, lexical entry. If a given wordform is polysemous, i.e. corresponds to more than one vocabulary element (e.g. to milk vs. some milk), then it is described by more than one lexical entry.

NooJ’s Atomic Linguistic Units (ALUs) are either lexical entries (e.g. eat), inflected forms of a lexical entry (e.g. eaten), or derived forms of a lexical entry (e.g. eatable). NooJ handles affix ALUs (e.g. re-, -ly), simple ALUs (e.g. table), compound ALUs (e.g. blue collars, in spite of) as well as discontinuous ALUs (e.g. turns … off, took ... into account). Note that contracted forms (e.g. cannot) and agglutinated forms (e.g. autodialed), very frequent in languages such as Arabic or German, are not ALUs: they are processed as sequences of ALUs.
4 The total number of wordforms shown in Figure 1 is lower than 88,945, because we unified the forms that have different cases in the two sub-corpora. For instance, the wordform Deductions (only in Odd.txt) and the wordform deductions (only in Even.txt) are counted as one wordform.
Figure 2. Evolution of the vocabulary
When evaluating the coverage of a reference corpus, it is important to distinguish ALUs from wordforms that have no or poor intrinsic linguistic value, such as numbers, uncategorized proper names, and typos. To estimate the evolution of the vocabulary, I applied the English DELAS dictionary to the Slate corpus. The DELAS dictionary5 represents standard English vocabulary, i.e. the vocabulary used in general newspapers and shared by most English speakers. It also includes non-basic terms such as aardvark (an African nocturnal animal) or zugzwang (a situation in chess), but it does not contain the millions of terms that can be found in scientific and technical vocabularies (e.g. the medical term erythematosus), nor terms used in regional variants all over the world (e.g. the Indian-English term freeship). The DELAS dictionary contains approximately 160,000 entries, which correspond to 300,000 inflected forms. Applying it to this corpus recognizes 51,072 wordforms, approximately a sixth of the standard vocabulary. Here are a few terms that never occur in the Slate corpus:

abominate, acarian, adjudicator, aeolian, aftereffect, agronomist, airlock, alternator, amphibian, anecdotic, apiculture, aquaculture, arachnid, astigmatism, atlas, autobiographic, aviator, awoken, axon, azimuth, etc.
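The coverage measurement itself reduces to set operations between the dictionary's inflected forms and the corpus vocabulary. The two sets below are hypothetical miniatures standing in for the 300,000 DELAS inflected forms and the Slate wordform list.

```python
# Hypothetical miniature stand-ins: the real DELAS expands to ~300,000
# inflected forms, and the Slate vocabulary holds ~89,000 wordforms.
delas_forms = {"abominate", "eat", "eats", "eaten", "table", "tables", "aardvark"}
corpus_vocabulary = {"eat", "eats", "table", "slept", "google"}

recognized = corpus_vocabulary & delas_forms  # standard-vocabulary wordforms
missing = delas_forms - corpus_vocabulary     # dictionary forms never attested
unknown = corpus_vocabulary - delas_forms     # proper names, typos, technical terms

coverage = len(recognized) / len(delas_forms)
print(f"recognized={sorted(recognized)} coverage={coverage:.0%}")
```

The `missing` set is what produces lists like the one above (abominate, acarian, adjudicator, ...), while `unknown` isolates the wordforms with poor intrinsic linguistic value that should be excluded from the count.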
The number of ALUs present in the corpus (solid line in Figure 2) grows more and more slowly. By extrapolating it, I estimate that one would need to add at least 16 million wordforms to this 4-million-wordform corpus to get a decent 50% coverage of the standard vocabulary. Manually tagging such a large corpus — or even checking it after some automatic processing — would represent a considerable workload,6 obviously implying a much larger project than simply constructing an English dictionary such as the DELAS.

Processing verbs correctly is crucial for any automatic parser, because verbs impose strong constraints upon their contexts: for instance, as the verb to sleep is intransitive, one can deduce that any phrase that occurs right after it (e.g. Joe slept last night) has to be an adjunct rather than an argument (as in Joe enjoyed last night). Knowing that the verb to declare expects a subject that is a person or an organization allows automatic software to retrieve the pronoun’s reference in the sentence: They
5 The DELA family of dictionaries was created in the LADL laboratory; see (Courtois and Silberztein 1990; Klarsfeld 1991; Chrobot et al., 1999).

6 20 million wordforms to check, at one second per check, 8 hours a day, 5 days a week, would take 3 years. By contrast, it takes between 6 and 9 months for a linguist to construct a NooJ module for a new language (including a DELAS-type dictionary).
[Figure 2 plot: x-axis “Number of occurrences” (0–4,500,000); y-axis “Size of the vocabulary” (0–100,000); series: Lexemes (solid line) and Word forms (dashed line)]
declared an income, etc. However, if the reference corpus does not contain any occurrence of a verb, a statistical or neural-network based parser has no means to deduce anything about its syntactic context or its distributional restrictions, and therefore it will not be able to reliably process sentences that contain that verb.

The 4-million-wordform Slate corpus contains 12,534 wordforms tagged as verbal forms (tags VB, VBD, VBG, VBN, VBP or VBZ), which represents only a fifth of the 62,188 verbal forms processed by NooJ. The following are examples of verbs that never occur in the Slate corpus:

acerbate, acidify, actualize, adjure, administrate, adulate, adulterate, agglutinate, aggress, aliment, amputate, aphorize, appertain, arraign, approbate, asphyxiate, etc.7
2.3 Compound words

In the Slate corpus, a few compound words that contain a hyphen have been tagged as units, e.g. “a-capella_JJ”. But these same exact compounds, when spelled with a space character, were processed as sequences of two independent units, e.g. “a_DT capella_NN”. At the same time, a large number of sequences that contain a hyphen, but are not compounds, have also been tagged as linguistic units, e.g.:

abide-and, abuse-suggesting, activity-regarded, adoption-related, Afghan-based, etc.

Similarly, in the COCA, we can find left-wing, wing-feathers and ultra-left-wing properly tagged as ALUs, whereas left wing, wing commander and wing nuts are processed as sequences of two independent units. It seems that there is a systematic confusion between compound ALUs and sequences that contain a dash8 in annotated corpora. In reality, most compounds do not contain a hyphen. For example, all occurrences of the adverb as a matter of fact have been tagged in the Slate corpus as:

As_IN a_DT matter_NN of_IN fact_NN
and in the COCA as:
as as ii
a a at1
matter matter nn1
of of jj32_nn132
fact fact nn1
This type of analysis makes it impossible for any NLP application to process this adverb correctly. One would not want an MT system to translate this adverb word by word, nor a search engine to return these occurrences when a user is looking for the noun matter. Likewise, a Semantic Web application should not even try to link these occurrences to the entities dark matter (2 occurrences), gray matter (2 occ.), organic matter (1 occ.) or reading matter (1 occ.), etc.
NooJ’s DELAC dictionary9 contains over 70,000 compound nouns (e.g. bulletin board), adjectives
(e.g. alive and well), adverbs (e.g. in the line of fire) and prepositions (e.g. for the benefit of). These
entries correspond to approximately 250,000 inflected compound words. By applying the DELAC
dictionary to the corpus, NooJ found 166,060 occurrences of compound forms, as seen in Figure 3.
These 166,060 compounds correspond to approximately 400,000 wordforms (i.e. 10% of the corpus)
whose tags are either incorrect, or at least not relevant for any precise NLP application.
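Applying a compound dictionary to a tokenized text can be done with a greedy longest-match pass. The following is a minimal sketch of that idea; the two-entry toy dictionary and its tag labels are invented for illustration, and a real system such as NooJ uses far more sophisticated lookup:

```python
def annotate_compounds(tokens, compound_dict):
    """Greedy longest-match: group token runs listed in compound_dict as single units."""
    out, i = [], 0
    max_len = max(len(entry) for entry in compound_dict)
    while i < len(tokens):
        # Try the longest candidate first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            cand = tuple(t.lower() for t in tokens[i:i + n])
            if cand in compound_dict:
                out.append(("_".join(tokens[i:i + n]), compound_dict[cand]))
                i += n
                break
        else:  # no compound starts here: emit a single token
            out.append((tokens[i], None))
            i += 1
    return out

compounds = {("as", "a", "matter", "of", "fact"): "ADV",
             ("bulletin", "board"): "N"}
result = annotate_compounds("As a matter of fact , the bulletin board fell".split(),
                            compounds)
print(result)
```

With such a pass, as a matter of fact would be annotated as one adverbial unit rather than five independent tokens.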
7 Some of these forms do occur in the Slate corpus, but not as verbs; they have been tagged as adjectives, e.g. actualized, amputated, arraigned, etc.
8 Silberztein (2016) presents a set of three criteria to distinguish analyzable sequences of words from lexicalized multiword units: (1) the meaning of the whole cannot be completely computed from its components (e.g. a green card is much more than just a card that has a green color); (2) everyone uses the same term to name an entity (e.g. compare a washing-machine with a clothes-cleaning device); and (3) the transformational rules used to compute the relation between its components have some idiosyncratic constraints (compare the function of the adjective presidential in the two expressions presidential election (we elect the president, *presidents elect someone) and presidential race (*we race the president, the presidents race against each other)).
9 Silberztein (1990) presents the first electronic dictionary for compounds (the French DELAC), designed to be used by automatic NLP software. The English DELAC is presented in Chrobot et al. (1999).
These 166,060 occurrences represent 25,277 different compound forms, which amounts to only 10%
of the English vocabulary. Standard terms such as the following never occur in the Slate corpus:
abandoned ship, access path, administrative district, aerosol spray, after hours,
agglutinative language, air bed, album jacket, ammunition belt, anchor box, appeal court,
aqueous humor, arc welder, assault charge, attitude problem, auction sale, aviator’s ear,
awareness campaign, axe to grind, azo dye, etc.
Moreover, most compound words actually found in the Slate corpus do not occur in all their forms:
some nouns occur only in their singular form, whereas others occur only in their plural form. For instance, there are no occurrences of the singular forms of the following nouns:
absentee voters, access codes, additional charges, affinity groups, AID patients, Alsatian wines,
amusement arcades, ancient civilizations, appetite suppressants, armed extremists,
assembly operations, attack helicopters, audio guides, average wages, ax-grinders, etc.
Even if encountering a single inflected or derived form of a lexical entry allowed an automatic system to correctly parse all the other inflected and derived forms of that entry, the Slate corpus would still cover only 10% of the compounds of the vocabulary. I estimate that one would need to add over 32 million wordforms to the corpus to reach even a 50% coverage of English compounds.10
Figure 3. Compounds in the Slate corpus
2.4 Phrasal Verbs
Any precise NLP application must take into account all multiword units, even those that are
discontinuous. Examples of discontinuous expressions include idiomatic expressions (e.g. to read … the
10 32 million wordforms to check, at one second per check, 8 hours a day, 5 days a week, would take over 4 years. By contrast, it typically takes a year for a linguist to construct a DELAC-type dictionary.
riot act), verbs that have a frozen complement (e.g. to take … into account), phrasal verbs (e.g. to turn
… off) and associations of predicative nouns and their corresponding support verb (e.g. to take a (<E>
| good | long | refreshing) shower).
For this experiment, I applied NooJ’s dictionary of phrasal verbs11 to the Slate corpus. This dictionary
contains 1,260 phrasal verbs, from act out (e.g. Joe acted out the story) to zip up (e.g. Joe zipped up his
jacket). NooJ recognized over 12,000 occurrences12 of verbal phrases, such as in:
… acting out their predictable roles in the…
… I would love to ask her out,…
… would have backed North Vietnam up…
… Warner still wanted to boss him around…
… We booted up and victory!...
However, fewer than a third of the phrasal verbs described in the dictionary had one or more occurrences in the Slate corpus. For instance, phrasal verbs such as the following have no occurrence in the corpus:
argue down, bring about, cloud up, drag along, eye up, fasten up, goof up, hammer down, etc.
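One simple way to approximate the recognition of such discontinuous units is a bounded-window match between a verb form and its particle. The sketch below is a hypothetical illustration of that idea, not NooJ's grammar-based recognizer (the verb set, particle and window size are all assumptions):

```python
def find_phrasal(tokens, verb_forms, particle, window=3):
    """Find (verb ... particle) pairs with at most `window` tokens in between."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() in verb_forms:
            # Scan a bounded window to the right for the particle.
            for j in range(i + 1, min(i + 1 + window + 1, len(tokens))):
                if tokens[j].lower() == particle:
                    hits.append((i, j))
                    break
    return hits

sent = "Joe backed North Vietnam up".split()
print(find_phrasal(sent, {"backed", "back"}, "up"))  # [(1, 4)]
```

As footnote 12 notes, such naive matching produces false positives (e.g. grew out of), which is why priority rules for compound prepositions are needed on top of it.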
3 Hapaxes
3.1 Wordforms and ALUs
In most fields that use statistical approaches (e.g. economics, medicine, physics), hapaxes (i.e. statistical events that occur only once) are rightfully ignored as "accidents": they behave like noise and pollute analysis results.
In linguistics, a hapax is a wordform that occurs only once in a reference corpus. There are reasons
to ignore hapaxes during a text analysis since the unique syntactic context available cannot be used to
make any reliable generalization. Following are examples of hapaxes that occur right after a verb in the
OANC:
• an adjective, e.g. … one cage is left unrattled…
• an adverb, e.g. … you touched unerringly on all the elements…
• a noun, e.g. that score still seemed misogynous…
• an organization name, e.g. the center caused Medicare to pay for hundreds…
• a person name, e.g. Lewinsky told Jordan that…
• a verbal form, e.g. the deal might defang last year’s welfare reform...
• a foreign word, e.g. they graduated magna cum laude…
• or even a typo, e.g. whipped mashed potatos and…
It is only by taking into account multiple syntactic contexts for a wordform that one can hope to describe its behavior reliably. If a wordform in the text to be analyzed corresponds to a hapax in the reference corpus that occurs right after a verb, only by luck would a statistical or neural-network based parser tag it correctly.
There are 31,275 different hapaxes in the Slate corpus, out of 88,945 different wordforms, i.e. a third
of the vocabulary covered by the Slate corpus, which covers itself a sixth of the standard vocabulary.
Consequently, statistical parsers that do not exclude hapaxes will produce unreliable results for up to
one third of the wordforms present in the reference corpus’ vocabulary.
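Counting hapaxes is a one-liner over the wordform frequency table; a minimal sketch with a toy word list standing in for a corpus:

```python
from collections import Counter

def hapaxes(wordforms):
    """Return the set of wordforms that occur exactly once."""
    counts = Counter(w.lower() for w in wordforms)
    return {w for w, n in counts.items() if n == 1}

text = "the cat sat on the mat while the dog slept".split()
h = hapaxes(text)
print(sorted(h))  # ['cat', 'dog', 'mat', 'on', 'sat', 'slept', 'while']
```

Run over the Slate corpus, this kind of count yields the 31,275 hapaxes out of 88,945 wordforms reported above.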
11 Peter Machonis is the author of the Lexicon-Grammar table for English phrasal verbs, which has been integrated into NooJ via a linked dictionary/grammar pair (Machonis, 2010).
12 There are a few false positives, i.e. phrasal verbs that were recognized but do not actually occur in the text, such as The Constitution grew out of a convention; they represent less than 2% of the matches. Half of these errors could be avoided by using simple local grammars, e.g. giving priority to compounds (so that the recognition of the compound preposition out of would block the recognition of the phrasal verb grew out in this example).
3.2 Compound words
As we have seen previously, the OANC and the COCA (as most reference corpora) contain no special
tags for compound words, which are nevertheless crucial for any precise NLP application. To
automatically identify them, some researchers use statistical methods to try to locate collocations.13 Their
idea is that if, for instance, the two wordforms nuclear and plant occur together in a statistically
meaningful way, one may deduce that the sequence nuclear plant corresponds to an English term. Even
if one subscribes to this principle, statistical methods cannot deduce that a sequence of wordforms
probably corresponds to a term if it occurs only once. In the Slate corpus, there are 9,007 compound
words that only occur once, e.g.:
absent without leave, accident report, adhesive tape, after a fashion, age of reason, air support,
alarm call, American plan, animal cracker, application process, artesian well, assault pistol,
atomic power, automatic pilot, aversion therapy, away team, axe to grind, etc.
If one removes these hapaxes from the list of compound words that occur in the corpus, the number of compound words that could theoretically be detected as collocations is reduced to 25,277 − 9,007 = 16,270 different forms, i.e. only 6% of the vocabulary.
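This arithmetic has a direct consequence for any frequency-threshold collocation detector: a bigram that occurs once can never clear the threshold. A minimal sketch (toy token list; real collocation extractors add association measures on top of raw counts):

```python
from collections import Counter

def candidate_collocations(tokens, min_count=2):
    """Bigrams occurring at least min_count times; hapax bigrams are invisible."""
    toks = [t.lower() for t in tokens]
    bigrams = Counter(zip(toks, toks[1:]))
    return {bg: n for bg, n in bigrams.items() if n >= min_count}

toks = "nuclear plant safety at the nuclear plant near the artesian well".split()
print(candidate_collocations(toks))  # {('nuclear', 'plant'): 2}
```

Note that artesian well, a perfectly good English term, is invisible to this detector precisely because it occurs only once.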
Note that the collocation criterion does not really make sense from a linguistic point of view. The fact that a sequence occurs often does not make it an element of the English vocabulary (e.g. the sequence was in the occurs 69 times), and conversely the fact that a sequence occurs only once in a corpus (e.g. after a fashion) does not make it any less an element of the English vocabulary. Just as it would not make sense to state that alright is not an element of the English vocabulary because it occurs only once in the Slate corpus, it does not make sense to state that artesian well is not a term because it occurs only once in the same corpus.
3.3 Polysemy
To be reliable, statistics-based disambiguation techniques need to process units that are frequent enough. For example, in the Slate corpus, the wordform that is tagged 42,781 times as a subordinating conjunction (IN), out of 62,286 occurrences. It is then fair to predict that any of its occurrences has a roughly 70% probability of being a subordinating conjunction.
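The 70% figure follows from a simple majority-tag baseline; a quick check with the counts reported above:

```python
def majority_tag_accuracy(tag_counts):
    """Accuracy of always guessing the most frequent tag for a wordform."""
    total = sum(tag_counts.values())
    return max(tag_counts.values()) / total

# Counts for the wordform "that" in the Slate corpus: 42,781 IN tags
# out of 62,286 occurrences (all other tags lumped together).
p = majority_tag_accuracy({"IN": 42781, "OTHER": 62286 - 42781})
print(round(p, 2))  # 0.69
```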
However, if a corpus contains only one occurrence of a polysemous wordform, predicting its function
in a new text can only produce unreliable results. For instance, the wordform shrivelled occurs only
once in the Slate corpus:
… an orange that was, in Zercher’s words, shrivelled and deformed...
It has been correctly tagged as an adjective (JJ), but that is no reason to deduce that this wordform will always function as an adjective, as one can see in the sentence The lack of rain has shrivelled the crops.
In the Slate corpus, there are 2,285 wordforms that have two or more potential tags, but occur only once,
e.g.:
aboveboard (adjective or adverb), accusative (adjective or noun), advert (noun or verb),
aflame (adjective or adverb), agglomerate (adjective, noun or verb), airdrop (noun or verb),
alright (adjective or adverb), amnesiac (adjective or noun), angora (adjective or noun),
apologetics (singular or plural noun), aqueous, armour (noun or verb), astringent (adjective or noun),
attic (adjective or noun), auburn (adjective or noun), Azerbaijani (adjective or noun), etc.
Any parser that processes these wordforms as monosemous (because they only occur once in the
reference corpus) produces unreliable results.
13 See, for instance, the European PARSEME initiative http://typo.uni-konstanz.de/parseme and the program
of the SIGLEX-MWE (Special Interest Group on Multiword Expressions) workshops
http://multiword.sourceforge.net/PHITE.php?sitesig=MWE.
4 Reliability
All statistical or neural-network based NLP applications that compare a reference corpus to the texts to
analyze assume that the reference corpus can be relied upon: if the tags used to compute an analysis are
incorrect, then one cannot expect these applications to produce perfect analyses. The fact that reference
corpora contain errors is well known to NLP researchers.14 Even so, I believe that the actual number of errors has been largely ignored or minimized.
The Open American National Corpus has been tagged by an enhanced version of GATE’s ANNIE
system,15 using the Penn tag set. However, during this series of experiments, every superficial look at
the Slate corpus16 uncovered mistakes. For instance, when looking at the wordform that in section 3.3, I found the following analyses:
• 41,622 times as a subordinating conjunction (IN),
• 10,151 times as a Wh-determiner (WDT),
• 10,491 times as a determiner (DT),
• 124 times as an adverb (RB).
All WDT (Wh-determiner) and RB (adverb) tags for the wordform that are incorrect: in fact, they
correspond to pronoun uses of that. Here are a few examples of mistakes:
• Incorrect WDT: … Publications that refuse to… phrase that describes this… stigma that came
with…
• Incorrect RB: … likes to boast that, … to do just that, and delightfully … know that, …
Thus, at least 25% of the occurrences of the wordform that have been tagged incorrectly. There is a
systematic confusion between Wh-determiners and pronouns (WDT instead of WP); occurrences of the
wordform that that are followed by a period or a comma have been systematically tagged as adverbs.
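The claim that every WDT and RB tag on that is wrong invites a trivially simple local correction rule. The sketch below applies the blanket rewrite this section suggests (a real local grammar would of course inspect the syntactic context rather than retag unconditionally):

```python
def fix_that_tags(tagged):
    """Local correction: 'that' tagged WDT or RB is retagged as a pronoun (WP)."""
    fixed = []
    for form, tag in tagged:
        if form.lower() == "that" and tag in {"WDT", "RB"}:
            tag = "WP"
        fixed.append((form, tag))
    return fixed

sent = [("publications", "NNS"), ("that", "WDT"), ("refuse", "VBP")]
print(fix_that_tags(sent))  # [('publications', 'NNS'), ('that', 'WP'), ('refuse', 'VBP')]
```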
The "systematic" aspect of these mistakes in the corpus is unsettling, because it means that enlarging the corpus to 10 or even 100 million words will not enhance the usefulness of statistical methods: no really useful information is added to the corpus if it is added automatically. As a matter of fact, a superficial look at the full OANC corpus shows the same systematic mistakes as in the Slate sub-corpus, which suggests that automatic disambiguation rules were used.
If, as these systematic mistakes imply, the tagger used automatic rules to disambiguate wordforms, it is essential that we be able to look at these rules, so that we can correct them. One may even argue that such automatic disambiguation rules could be inserted directly into the final NLP application: in that case, the reference corpus becomes less and less useful.
To estimate the accuracy of the tags in the corpus, I have compiled a list of its 84,386 useful
wordforms17 associated with their tags, and parsed it with NooJ:
• Out of the 49,323 wordforms that were not tagged as proper names, 11,019 tags — i.e. over 20%
— are considered as incorrect by NooJ. Examples of incorrect tags are:
abbreviate, abduct, abhor, abhors, etc. (not nouns),
about, agonized, bible, cactus, California, etc. (not adjectives)
expenditures, Japanese, many, initiatives, wimp, etc. (not verbs)
anomaly, back, because, by, of, out, particular, upon, etc. (not adverbs)
A number of derived or agglutinated forms have been tagged as ALUs, e.g.:
14 See for instance Green and Manning (2010) on tagging errors in Arabic corpora, Kulick et al. (2011) and Volokh and Neumann (2011) on errors in treebanks, and Dickinson (2015) on methods to detect annotation errors.
15 See http://gate.ac.uk.
16 The Open American Corpus has been tagged
17 Tags such as CD (cardinal number), LS (list item marker), POS (possessive ending), SYM (symbol) and TO (to) are not useful in the sense that they do not add any information to the word they describe; they can be added automatically with simple SED-type replacements such as s/\([0-9]*\)/\1_CD/.
audienceless, autodialed, balancingly, bioremediation, barklike, etc.
However, there are good linguistic reasons to consider these wordforms as analyzable sequences of two ALUs (i.e. two tags), in order to process prefixes such as auto- and bio- and suffixes such as -less, -ly and -like as ALUs in their own right.
• Out of the 35,063 uppercase wordforms that have been associated with a proper-name tag (NNP or NNPS), we also get a large number18 of incorrect tags, e.g.:
Abacuses, Abandoned, ABATEMENT, Abattoir, Abbreviated, Ablaze,
Abnormal, Abolished, Abuse, Abstract, Accidental, ALMOST, etc.
The Penn tag set does not contain tags for typos: all uppercase typos were tagged as proper names
(NNP or NNPS), e.g.: Aconfession, Afew, AffairsThe, Allpolitics, etc. Most other typos have been
associated with a noun tag, e.g.:
absentionists (NNS), achives (NNS), accrossthe (NN), afteryou (NN), alwaysattend (NN), etc.
18,949 sequences that include the em dash19 have been incorrectly processed as compound words,
e.g. aggression−that (JJ), believe−that (JJ), chance−a (JJ), etc. Finally, one needs to mention that there
are over 200 typos in the tags themselves, for instance:
a,DTn, believ_VBe, classi_JJc, di_VBDd, JFK-styl_NNPe, etc.
Adding the 10% irrelevant tags (e.g. "as_IN" in "as a matter of fact") to the 20% incorrectly tagged sequences that include a hyphen or an em dash (e.g. "minister—an_JJ"), to the 20% impossible tags (e.g. "many_VB"), as well as the typos, makes the Slate corpus unreliable at best.
5 Granularity of the linguistic information
Dictionaries for NLP applications actually need to associate their lexical entries with more information
than just their part of speech. Nouns need to be classified at the very least as Human, Concrete or
Abstract, because there are syntactic and semantic rules that do not apply in the same way to human,
concrete or abstract noun phrases. Verbs need to be associated with a description of their subjects,
prepositions and complements. For instance, in order to analyze correctly the following sentence:
The classroom burst into laughter
a computer program needs to know that the (compound) verb to burst into laughter expects a human
subject, that classroom is not a human subject, and therefore it needs to activate a special analysis (e.g.
metonymy) to process (e.g. index/link/translate/etc.) it.
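At its simplest, a selectional-restriction check of this kind reduces to a lexicon lookup plus a flag. The sketch below uses a toy +Human lexicon; the noun list and function name are invented for illustration, and a real system would consult dictionary features rather than a hard-coded set:

```python
HUMAN_NOUNS = {"teacher", "student", "crowd"}  # toy +Human lexicon

def needs_metonymy(subject, verb_requires_human=True):
    """Flag subjects that violate a verb's +Human selectional restriction."""
    return verb_requires_human and subject.lower() not in HUMAN_NOUNS

print(needs_metonymy("classroom"))  # True: trigger a metonymic reading
print(needs_metonymy("teacher"))    # False
```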
In the previous evaluations, I counted wordforms as if each represented all its corresponding ALUs: for instance, when the wordform agent occurs in the corpus, I counted it as if it represented two lexical entries, the person (as in a secret agent) and the product (as in a bleaching agent). But when parsing a text, NooJ produces on average over two potential analyses for each wordform. In consequence, if we were to take this linguistic information into account by enriching the tag set (e.g. adding a +Human or +NonHuman feature for nouns, adding a +Transitive or +Intransitive feature for verbs, distinguishing between the multiple meanings of each lexical entry, etc.), the coverage of the corpus would be half of our previous estimate.
18 It is not possible to know whether an uppercase word is a proper name without a syntactic analysis of its context. For instance, Black, Carpenter, Hope, etc. are common words as well as proper names, depending on their context. That said, by scanning the list of words tagged as proper names, I estimate that at least 15% of them are mistakes.
19 The em dash is represented by two consecutive dash characters in the OANC Slate corpus.
6 Conclusion and proposals
On the one hand, a large number of NLP applications, such as intelligent search engines, automatic abstracting, information extraction and machine translation, rely on reference corpora to perform sophisticated analyses. On the other hand, linguists have spent years carefully handcrafting a large quantity of linguistic resources. Using these resources could enhance reference corpora significantly:
• a first operation would be to compare the set of tagged wordforms in the reference corpus with
the lexical entries in an English dictionary such as the DELAS; this operation would allow one to
correct typos as well as “impossible” tags (e.g. many should never be tagged as a verb);
• a second operation would be to tag compound words (e.g. as a matter of fact) and discontinuous
expressions (e.g. ask … out) already listed and described in dictionaries, as processing these
linguistic units as a whole is crucial for any precise NLP application;
• a third operation would be to develop a set of simple local grammars to correct tags in the text that are not compatible with English grammar (e.g. that should not be tagged as a Wh-determiner in phrase that describes this). These local grammars could even replace the meaningless rules (e.g. tagging that before a comma as an adverb) used to tag the reference corpus;
• a fourth, more ambitious project would be to enhance the tag set in order to distinguish human,
concrete and abstract nouns, as well as to classify verbs according to their complements. That
would allow sophisticated NLP applications such as Information Extraction and MT software to
be trained on a semantically rich tagged corpus.
References
Agata Chrobot, Blandine Courtois, Marie Hamani, Maurice Gross, Katia Zellagui. 1999. Dictionnaire
Electronique DELAC anglais : noms composés. Technical Report #59, LADL, Université Paris 7: Paris, France.
Blandine Courtois and Max Silberztein (editors). 1990. Les dictionnaires électroniques du français. Larousse:
Paris, France.
Markus Dickinson. 2015. Detection of Annotation Errors in Corpora. In Language & Linguistics Compass, vol 9,
Issue 3. Wiley Online Library, https://doi.org/10.1111/lnc3.12129.
Spence Green and Christopher Manning. 2010. Better Arabic parsing: Baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 394-402.
Gaby Klarsfeld. 1991. Dictionnaire morphologique de l’anglais. Technical Report, LADL, Université Paris 7:
Paris, France.
Seth Kulick, Ann Bies, Justin Mott. 2011. Further Developments in Treebank Error Detection Using Derivation
Trees. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies, Association for Computational Linguistics, pages 693–698, Portland, Oregon, USA.
Peter Machonis. 2010. English Phrasal Verbs: from Lexicon-Grammar to Natural Language Processing. Southern Journal of Linguistics, vol. 34, no. 1: 21-48.
Max Silberztein. 2003. The NooJ Manual. Available for download at www.nooj-association.org.
Max Silberztein. 1990. Le dictionnaire électronique DELAC. In Les dictionnaires électroniques du français.
Larousse: Paris, France.
Max Silberztein. 2016. Formalizing Natural Languages: the NooJ Approach. Cognitive Science Series. Wiley-
ISTE: London, UK.
Alexander Volokh, Günter Neumann. 2011. Automatic detection and correction of errors in dependency tree-banks.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language
Technology: short papers, vol. 2, pages 346-350.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 12–17, Santa Fe, New Mexico, USA, August 20, 2018.
Rule-based vs. Neural Net Approach to
Semantic Textual Similarity
Linrui Zhang
The University of Texas at Dallas
800 West Campbell Road; MS EC31, Richardson, TX 75080, U.S.A.
[email protected]

Dan Moldovan
The University of Texas at Dallas
800 West Campbell Road; MS EC31, Richardson, TX 75080, U.S.A.
[email protected]
Abstract
This paper presents a neural net approach to determining Semantic Textual Similarity (STS) using attention-based bidirectional Long Short-Term Memory networks (Bi-LSTM). To date, most traditional STS systems have been rule-based, built on top of extensive use of linguistic features and resources. In this paper, we present an end-to-end attention-based Bi-LSTM neural network system that takes solely word-level features, without expensive feature engineering work or the use of external resources. By comparing its performance with that of traditional rule-based systems on the SemEval 2012 benchmark, we assess the limitations and strengths of neural net systems as opposed to rule-based systems for STS.
1 Introduction
Semantic Textual Similarity (STS) is the task of determining the resemblance of the meanings between two sentences (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015; Agirre et al., 2016; Cer et al., 2017). For the sentence pairs below, on a scale from 0 to 5, (1) is very similar [5.0], (2) is somewhat similar [3.0] and (3) is not similar [0.2]:
1. Someone is removing the scales from the fish.
A person is descaling a fish.
2. A woman is chopping an herb.
A man is finely chopping a green substance.
3. The woman is smoking.
The man is walking.
In STS tasks, the performance of traditional models relies highly on the use of linguistic resources and hand-crafted features. For example, in SemEval 2012 Task 06: A Pilot on Semantic Textual Similarity (Agirre et al., 2012), the top three performers (Šarić et al., 2012; Bär et al., 2012; Banea et al., 2012) all derived knowledge from WordNet, Wikipedia and other large corpora. In particular, Banea et al. built their models from 6 million Wikipedia articles and more than 9.5 million hyperlinks; Bär et al. used Wiktionary, which contains over 3 million entries; and Šarić et al. used The New York Times Annotated Corpus, which contains over 1.8 million news articles. Blanco and Moldovan (2013) proposed a model with a semantic representation of sentences, which was considered in 2015 to use the smallest amount of external resources and features. However, their model still required WordNet, with approximately 120,000 synsets, and a semantic parser.
Complex neural network architectures are increasingly being used to learn to compute the semantic resemblances among natural language texts. To date, two end-to-end neural network models have been proposed for STS tasks (Shao, 2017; Prijatelj et al., 2017), and both of them follow a standard sentence-pair modeling neural network architecture that contains three components: a word embedding component that transforms words into word vectors, a sentence embedding component that takes word vectors as input and encodes the sentence into a single vector that represents the semantic meaning of the original sentence, and a comparator component that evaluates the similarity between sentence vectors and generates a similarity score.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
In this paper, we modified and improved the models proposed by Prijatelj et al. (2017) and Shao (2017), and propose a Bi-LSTM neural network model as the representative of the neural net approach, evaluated on the SemEval 2012 dataset. In the experimental section, we compare our system with the top three performers in SemEval 2012, which used traditional rule-based models. Because neither Shao nor Prijatelj et al. considered attention mechanisms (Yang et al., 2016) in their systems, we specifically applied attention mechanisms to improve the performance of our system.
The goal of the paper is to illustrate that with well-designed neural network models, we can achieve competitive results (compared to traditional rule-based models) without expensive feature engineering work and external resources. We also make an assessment on the limitations of neural net systems as opposed to rule-based systems on STS.
2 Related Work
Determining textual similarity has been a stand-alone task only since SemEval 2012, but it is often a component of NLP applications such as information retrieval, paraphrase recognition, grading answers to questions and many other tasks. In this section, we only list the works that are involved in our evaluation systems: the top performers in SemEval 2012 and recent neural network-based approaches in SemEval 2017.
The performance of the rule-based models (Šarić et al., 2012; Bär et al., 2012; Banea et al., 2012) mostly relies on word pairings and knowledge derived from large corpora, e.g. Wikipedia. Regardless of details, each word in sent1 is paired with the word in sent2 that is most similar according to some similarity measure. Then, all similarities are added and normalized by the length of sent1 to obtain the similarity score from sent1 to sent2. The process is repeated to obtain the similarity score from sent2 to sent1, and both scores are then averaged to determine the overall textual similarity. Several word-to-word similarity measures are often combined with other shallow features (e.g. n-gram overlap, syntactic dependencies) to obtain the final similarity score.
Shao (2017) proposed a simple convolutional neural network (CNN) model for STS. He used a CNN as the sentence embedding component to encode the original sentences into sentence-level vectors, and generated a semantic difference vector by concatenating the element-wise absolute difference and the element-wise multiplication of the corresponding sentence vectors. He then passed the semantic difference vector into a fully connected neural network to perform regression and generate a similarity score on a continuous inclusive scale from 0 to 5. His model ranked 3rd on the primary track of SemEval 2017.
Prijatelj et al. (2017) wrote a survey on neural networks for semantic textual similarity. The framework of their model is similar to Shao's, but they explored various neural network architectures, from simple to complex, and reported the results of applying combinations of these neural network models within this framework.
3 System Description
Figure 1 provides an overview of our neural network-based model. The sentence pairs first pass through a pre-processing step, described in subsection 3.1, to generate word embeddings. The attention-based Bi-LSTM models transform the word embeddings into sentence-level vectors, as described in subsection 3.2. In subsection 3.3, we use the same semantic difference vector as Shao to represent the semantic difference between the sentence-level vectors. Lastly, we pass the semantic difference vector into fully connected neural networks to generate the similarity score between the original sentence pairs.
3.1 Pre-processing
We first applied a simple NLP pipeline to the input sentences to tokenize them, remove punctuation and lower-case all the tokens. Second, we looked up the word embeddings in the pretrained 50-dimension GloVe vectors, setting non-existing words to the zero vector. Third, we enhanced the word embeddings by adding a true/false (1/0) flag to them if the corresponding word appears in both sentences. Lastly, we unified the length of the inputs by padding the sentences.
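The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: a toy 4-dimension embedding table stands in for the real 50-dimension GloVe vectors, and the maximum length is capped at 6 for the example:

```python
def preprocess(sent1, sent2, emb, dim=4, max_len=6):
    """Lower-case, embed (zero vector for OOV), append shared-word flag, pad."""
    def encode(tokens, other):
        rows = []
        for w in tokens[:max_len]:
            vec = emb.get(w, [0.0] * dim)            # zero vector for OOV words
            rows.append(vec + [1.0 if w in other else 0.0])  # shared-word flag
        while len(rows) < max_len:                   # pad to a fixed length
            rows.append([0.0] * (dim + 1))
        return rows
    t1, t2 = sent1.lower().split(), sent2.lower().split()
    return encode(t1, set(t2)), encode(t2, set(t1))

emb = {"a": [0.1] * 4, "woman": [0.2] * 4, "man": [0.3] * 4, "is": [0.4] * 4}
x1, x2 = preprocess("A woman is chopping", "A man is walking", emb)
print(len(x1), len(x1[0]))  # 6 5
```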
Figure 1: The structure of attention-based Bi-LSTM network for Semantic Textual Similarity.
3.2 Attention-based LSTM
Since sentences are sequences of words, and the order of the words matters, it is natural to use LSTMs (Hochreiter and Schmidhuber, 1997) to encode sentences into vectors. However, sometimes the backward sequence contains useful information as well, especially for long and unstructured sentences. Because of this, Irsoy and Cardie (2014) proposed deep bidirectional RNNs that can make predictions based on future words by having the RNN model read through the sentence backwards. In this section, we will first introduce a regular LSTM network and then extend it into a Bi-LSTM. At the end of this section, we will apply attention mechanisms to improve the performance of the system.
The traditional LSTM unit is defined by five components: an input gate, a forget gate, an output gate, a new memory generation cell, and a final memory cell.

The input gate decides whether the input $x_t$ is worth preserving, based on the input word $x_t$ and the past hidden state $h_{t-1}$:

$i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})$ (1)

The forget gate $f_t$ assesses whether the past memory cell is useful for computing the current memory cell:

$f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})$ (2)

The output gate separates the final memory $c_t$ from the hidden state $h_t$:

$o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})$ (3)

The new memory generation cell produces a new memory $\tilde{c}_t$ from the input word $x_t$ and the past hidden state $h_{t-1}$:

$\tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1})$ (4)

The final memory cell produces the final memory $c_t$ by summing the advice of the forget gate $f_t$ and the input gate $i_t$:

$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$ (5)
14
$h_t = o_t \circ \tanh(c_t)$ (6)

A Bi-LSTM can be viewed as a network that maintains two hidden LSTM layers at each time-step $t$, one for the forward pass and one for the backward pass. The final classification results are generated by combining the scores produced by both hidden layers. The mathematical representation of a simplified Bi-LSTM is as follows:

$\overrightarrow{h}_t = f(\overrightarrow{W} x_t + \overrightarrow{U} \overrightarrow{h}_{t-1} + \overrightarrow{b})$ (7)

$\overleftarrow{h}_t = f(\overleftarrow{W} x_t + \overleftarrow{U} \overleftarrow{h}_{t+1} + \overleftarrow{b})$ (8)

$\hat{y}_t = g(U h_t + c) = g(U[\overrightarrow{h}_t; \overleftarrow{h}_t] + c)$ (9)

where $\hat{y}_t$ is the final prediction, and the arrows → and ← indicate directions. The remaining terms are defined as in a regular LSTM network: $W$ and the arrowed $U$ are weight matrices associated with the input $x_t$ and the hidden states $h_t$, the unarrowed $U$ combines the two hidden LSTM layers, $b$ and $c$ are bias terms, and $g(x)$ and $f(x)$ are activation functions.
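Equations (1)-(9) can be sketched in NumPy as below. This is a minimal illustration under our own assumptions, not the authors' implementation: bias terms are omitted, both directions share one parameter set `P` for brevity (in practice each direction has its own weights), and the output layer of Eq. (9) is left out.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step implementing Eqs. (1)-(6); P maps names to weight matrices."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sig(P["Wi"] @ x + P["Ui"] @ h_prev)            # input gate, Eq. (1)
    f = sig(P["Wf"] @ x + P["Uf"] @ h_prev)            # forget gate, Eq. (2)
    o = sig(P["Wo"] @ x + P["Uo"] @ h_prev)            # output gate, Eq. (3)
    c_tilde = np.tanh(P["Wc"] @ x + P["Uc"] @ h_prev)  # new memory, Eq. (4)
    c = f * c_prev + i * c_tilde                       # final memory, Eq. (5)
    h = o * np.tanh(c)                                 # hidden state, Eq. (6)
    return h, c

def bi_lstm(xs, P, d):
    """Run the sequence forwards and backwards, concatenating the two states."""
    fwd, bwd = [], []
    h, c = np.zeros(d), np.zeros(d)
    for x in xs:                        # forward pass, Eq. (7)
        h, c = lstm_step(x, h, c, P)
        fwd.append(h)
    h, c = np.zeros(d), np.zeros(d)
    for x in reversed(xs):              # backward pass, Eq. (8)
        h, c = lstm_step(x, h, c, P)
        bwd.append(h)
    bwd.reverse()
    # concatenation [h_fwd; h_bwd] feeding Eq. (9)
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each output vector has twice the hidden dimension, since the forward and backward states are concatenated before the final prediction layer.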
Not all words contribute equally to the representation of the sentence meaning; thus, we extract the words that are more informative for the sentence and aggregate them into the sentence-level vector by applying the attention mechanism. Specifically:
First, we feed the hidden state $h_t$ through a one-layer MLP to get $u_t$, which can be viewed as a hidden representation of $h_t$:

$u_t = \tanh(W h_t + b)$ (10)

Second, we multiply $u_t$ with a context vector $u_w$ and normalize the results through a softmax function to get the importance weight $a_t$ of each hidden state $h_t$. The context vector can be seen as a high-level vector that selects informative words in the sentence (Sukhbaatar et al., 2015), and it is jointly learned during the training process.

$a_t = \dfrac{\exp(u_t^{\top} u_w)}{\sum_t \exp(u_t^{\top} u_w)}$ (11)

Lastly, the final state $s$ is a weighted sum of the hidden states with their importance weights:

$s = \sum_t a_t h_t$ (12)
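The attention pooling of Eqs. (10)-(12) reduces to a few matrix operations. The sketch below is our own illustration (names, shapes, and the numerically stabilized softmax are assumptions, not the paper's code):

```python
import numpy as np

def attention_pool(H, W, b, u_w):
    """Eqs. (10)-(12): MLP projection, softmax weights, weighted sum.

    H is a (T, d) matrix of hidden states, one row per time step.
    """
    U = np.tanh(H @ W.T + b)             # u_t = tanh(W h_t + b), Eq. (10)
    scores = U @ u_w                     # u_t . u_w for each time step
    a = np.exp(scores - scores.max())    # softmax over time steps, Eq. (11)
    a = a / a.sum()
    return a @ H, a                      # s = sum_t a_t h_t, Eq. (12)
```

The returned weights `a` sum to one, so `s` is a convex combination of the hidden states, with more informative words weighted more heavily.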
3.3 Semantic Difference of Sentences
We used the same semantic difference vector as Shao (2017), concatenating the element-wise absolute difference and the element-wise multiplication of the corresponding sentence-level embedding pairs.
The generated semantic difference vector is passed through a two-layer fully connected neural network with a softmax output layer, which generates a probability distribution over the six similarity labels used in the SemEval 2012 task. We multiplied this distribution by a constant vector of the integers 0 to 5 to convert it into a float, the final semantic similarity score for the original sentence pair.
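The two steps above (difference vector, then expected score) can be sketched as follows. The two-layer network is stood in for by a hypothetical `score_fn` argument, since its exact architecture is not reproduced here:

```python
import numpy as np

def similarity_score(s1, s2, score_fn):
    """Concatenate |s1 - s2| and s1 * s2, then map a label distribution to a score.

    score_fn is a placeholder for the two-layer softmax network: it takes the
    semantic difference vector and returns probabilities over labels 0..5.
    """
    diff = np.concatenate([np.abs(s1 - s2), s1 * s2])  # semantic difference vector
    probs = score_fn(diff)                             # distribution over 6 labels
    return probs @ np.arange(6)                        # expected value = float score
```

Taking the expectation over the label distribution yields a continuous score in [0, 5], matching the gold-standard scale.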
4 Experiment
4.1 Corpus
We evaluated our system on the corpora used in SemEval 2012 Task 6: A Pilot on Semantic Textual Similarity. It contains five corpora: (1) MSRvid, short sentences from video descriptions; (2) MSRpar, long sentences of paraphrases; (3) SMTeuroparl, output of machine translation systems and reference translations; (4) OnWN, OntoNotes and WordNet glosses; and (5) SMTnews, output of machine translation systems in the news domain and gold translations. For corpora (1) to (3), both training and testing data are provided; corpora (4) and (5) are surprise data (new-domain data), for which only testing data are provided. For more details about the corpora, please refer to Agirre et al. (2012).
We followed the training and testing splits of the original corpus. Since corpora (4) and (5) have no corresponding training data, we used the training data of corpus (2) for corpus (4), since both contain long, hard-to-parse sentences, and the training data of corpus (3) for corpus (5), since both contain ungrammatical sentences.
4.2 Experiment Results
We used the Pearson correlation coefficient to evaluate performance. We introduce two neural network models, a regular LSTM model and a Bi-LSTM model, and, for each model, we also demonstrate their performance with attention mechanisms. Table 1 shows the results of the neural network-based systems and the traditional rule-based systems on in-domain data (corpora 1 to 3) and out-of-domain data (corpora 4 and 5).
System               | MSRvid | MSRpar | SMTeuroparl | OnWN   | SMTnews
LSTM Basic           | 0.7774 | 0.5278 | 0.2787      | 0.4519 | 0.2071
LSTM +attention      | 0.7851 | 0.5891 | 0.3492      | 0.4773 | 0.2635
Bi-LSTM Basic        | 0.7661 | 0.5258 | 0.3993      | 0.4591 | 0.3298
Bi-LSTM +attention   | 0.7762 | 0.6210 | 0.4368      | 0.5607 | 0.3976
Bär et al., 2012     | 0.8739 | 0.6830 | 0.5280      | 0.6641 | 0.4937
Šarić et al., 2012   | 0.8620 | 0.6985 | 0.3612      | 0.7049 | 0.4683
Banea et al., 2012   | 0.8750 | 0.5353 | 0.4203      | 0.6715 | 0.4033

Table 1: The Pearson correlation coefficient of our system and the top three performers in the SemEval 2012 benchmark.
4.3 Results Analysis
From the results, we can observe the following: (1) The overall performance of the rule-based models is still slightly better than that of the neural network-based approach. However, we must note that the neural network models are end-to-end models that do not use complicated linguistic features and resources. (2) The neural network-based approaches are better at handling long sentences, whereas the rule-based systems are good at handling short sentences. The reason is that the performance of traditional rule-based models relies heavily on feature extraction, and long sentences are usually hard to parse; errors that occur in the feature extraction step propagate to the end and decrease the performance of the system. The neural network models, by contrast, take only word-level features and do end-to-end training, so they do not have this "error propagation" issue. Moreover, since we add attention mechanisms, the system can aggregate the influence of the informative words and ignore the unimportant words in long sentences. From the results we observe that our system beats the third-ranked performer on the MSRpar corpus, and the second- and third-ranked performers on the SMTeuroparl corpus, which contains mainly long sentences. (3) The regular LSTM model performs poorly on the SMTeuroparl corpus, but the Bi-LSTM dramatically increases the performance (ranking just after the top rule-based performer). The reason is that in the SMTeuroparl corpus, one sentence in each pair is usually ungrammatical. A regular LSTM captures only the forward sequential information of a sentence, so it misses some information when sentences are unstructured. The Bi-LSTM model compensates for this missing information by also capturing the backward sequential information, which makes the system more robust when handling ungrammatical sentences. (4) The traditional rule-based models show a huge advantage over neural network-based models on new-domain datasets. The reason is that the neural network models are supervised models that depend heavily on the training data; when transferred to new domains lacking training data, the performance of the system drops dramatically. Rule-based systems, on the other hand, rely mostly on word pairings and linguistic resources that are not as dependent on training data.
References
Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393. Association for Computational Linguistics.
Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, volume 1, pages 32–43.
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91.
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. Semeval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263.
Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511.
Carmen Banea, Samer Hassan, Michael Mohler, and Rada Mihalcea. 2012. UNT: A supervised synergistic approach to semantic text similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 635–642. Association for Computational Linguistics.
Daniel Bär, Chris Biemann, Iryna Gurevych, and Torsten Zesch. 2012. UKP: Computing semantic textual similarity by combining multiple content similarity measures. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 435–440. Association for Computational Linguistics.
Eduardo Blanco and Dan Moldovan. 2013. A semantically enhanced approach to determine textual similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1235–1245.
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the
2014 conference on empirical methods in natural language processing (EMNLP), pages 720–728.
Derek S Prijatelj, Jonathan Ventura, and Jugal Kalita. 2017. Neural networks for semantic textual similarity. In Proceedings of the 14th International Conference on Natural Language Processing, pages 456–465.
Frane Šarić, Goran Glavaš, Mladen Karan, Jan Šnajder, and Bojana Dalbelo Bašić. 2012. TakeLab: Systems for measuring semantic text similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 441–448. Association for Computational Linguistics.
Yang Shao. 2017. Hcti at semeval-2017 task 1: Use convolutional neural network to evaluate semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 130–133.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, Montréal, Canada, pages 2440–2448.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 18–27, Santa Fe, New Mexico, USA, August 20, 2018.
Linguistic Resources for Phrasal Verb Identification
Peter A. Machonis
Florida International University
Dept. of Modern Languages
11200 SW 8th Street
Miami, FL 33199 USA [email protected]
Abstract
This paper shows how a lexicon grammar dictionary of English phrasal verbs (PV) can be transformed into an electronic dictionary, in order to accurately identify PV in large corpora within the linguistic development environment NooJ. The NooJ program is an alternative to statistical methods commonly used in NLP: all PV are listed in a dictionary and then located by means of a PV grammar in both continuous and discontinuous format. Results are then refined with a series of dictionaries, disambiguating grammars, filters, and other linguistic resources. The main advantage of such a program is that all PV can be identified in any corpus. The only drawback is that PV not listed in the dictionary (e.g., archaic forms, recent neologisms) are not identified; however, new PV can easily be added to the electronic dictionary, which is freely available to all.
1 Introduction
Although described as early as the 1700s, English phrasal verbs (PV), or verb-particle combinations such as figure out, look up, turn on, etc., have long been considered a characteristic trait of the English language and are to this day one of the most difficult features of English for non-native speakers to master. PV began attracting the attention of linguists in the early 1900s with Kennedy's (1920) classic study. Many have reiterated his historical analysis, such as Konishi (1958:122), who also finds a steady growth of these combinations after Old English, a slight drop during the Age of Reason – with authors such as Dryden and Johnson, who avoided such "grammatical irregularities" – followed by a new expansion in the 19th century.
A renewal of interest in PV arose in the 1970s, with the works of Bolinger (1971:xi), who associated PV with a "creativeness that surpasses anything else in our language," and Fraser (1976), who first presented detailed descriptions of PV transformations. In particular, Fraser studied constraints on particle position. Although many PV allow movement (figure out the answer, figure the answer out), if the direct object is a pronoun, it can only appear before the particle (figure it out, *figure out it). Fraser (1976:19) also showed that some PV idioms prohibit particle movement (e.g., dance up a storm, *dance a storm up), whereas others permit movement (e.g., turn back the clock or turn the clock back).
More recently, Hampe (2002), in her corpus-based study of semantic redundancy in English, suggests that compositional PV can function as an "index of emotional involvement of the speaker," and today linguists and computer scientists debate the status of compositional vs. idiomatic PV. Whereas idiomatic PV, such as break up the audience "cause to laugh" or burn out the teacher "exhaust," cannot be derived from the meaning of the verb plus particle and must be explicitly listed in the lexicon, compositional PV, such as drink up the milk or boot up the computer, can be derived from the meaning of the regular verb. In this case, the particle simply functions as an intensifier (e.g., rev up the engine), an aspect marker (e.g., lock up the car), or an adverbial noting direction (e.g., drive up prices). Although these are strong arguments in favor of separating compositional from idiomatic PV, Machonis (2009) points out a disadvantage of treating compositional PV separately from frozen ones: simple verb entries can become enormously complex when all English particles – Fraser (1976) lists fifteen different particles – are taken into account.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
PV present one of the thorniest problems for Natural Language Processing; in fact, Sag et al. (2002:14) state that multiword expressions "constitute a key problem that must be resolved in order for linguistically precise NLP to succeed." This paper shows how a lexicon grammar dictionary can be transformed into an electronic dictionary, in order to correctly identify PV, both continuous and discontinuous, in large corpora, using multiple algorithms and filters within the linguistic development environment NooJ (Silberztein, 2016).
2 Using NooJ for Automatic PV Recognition
2.1 Previous Work
Previous studies using NooJ (Machonis, 2010; 2012; 2016) showed that the automatic recognition of PV is far more complex than that of other multiword expressions, due to three main factors: (1) their possible discontinuous nature (e.g., let out the dogs → let the dogs out); (2) their confusion with verbs followed by simple prepositions (e.g., Do you remember what I asked you in Rome? (verb + prepositional phrase) vs. Did you ask the prince in when he arrived? (PV)); and (3) genuine ambiguity resolvable only from context (e.g., Her neighbor was looking over the broken fence, which can mean either "looking above the fence" (preposition) or "examining the fence" (PV)). Even with disambiguating grammars, adverbial and adjectival expression filters, and idiom dictionaries, our previous PV studies using NooJ achieved only 88% precision on written texts and 78% precision on an oral corpus, with most of the noise coming from the particles in and on, which are fairly tricky to distinguish automatically from prepositions (e.g., had a strange smile on her thin lips vs. had her hat and jacket on), even with the disambiguation grammars, filters, and extra dictionaries mentioned above.
2.2 Using Lexicon Grammar Tables in Tandem with NooJ
The NooJ platform is a freeware linguistic development environment, downloadable from http://www.nooj4nlp.net/, which allows linguists to describe several levels of linguistic phenomena and then apply the formalized descriptions to any corpus of texts. Instead of relying on a part-of-speech tagger that necessarily produces a certain percentage of tagging mistakes, NooJ uses a Text Annotation Structure (TAS) that holds all unresolved ambiguities. Furthermore, these annotations, as opposed to tags, can represent discontinuous linguistic units, such as PV (Silberztein, 2016).
Lexicon grammar (Gross, 1994; 1996) accentuates the reproducibility of linguistic data in the form
of exhaustive tables or matrices, which contain both lexical and syntactic information. For example,
each verb in a table would be discussed by a team of linguists and marked as plus (+) or minus (-) for
all possible complements and relevant transformations. This descriptive approach to syntax showed the
enormous complexity of language and challenged the Chomskian model (Gross, 1979).
Our original PV tables included 700 entries of transitive and neutral PV1 with the particle up, 200 with out, and 300 entries with other particles, such as away, back, down, in, off, on, and over.
These tables are manually constructed; a sample is given in Table 1. The first two columns represent potential subjects, N0, which can be human, non-human, or both. This is followed by the verb, the particle, and an example of a direct object, N1. The direct object is also classified as human, non-human, or both, although only one example is given. The next column, N0 V N1, takes into consideration cases where the verb can have a similar meaning even if the particle is not used. A plus indicates that the PV can be used without the particle: e.g., The chef beat up the eggs → The chef beat the eggs. These would be considered compositional PV, since the verb keeps its regular meaning and the particle is merely viewed as an intensifier or aspect marker, as explained in the introduction. The next column, N1 V Part, identifies neutral verbs, with a plus in that column indicating that the verb has both a transitive and an intransitive linked use: e.g., She booted up the computer → The computer booted up. Finally, a plus in the N1 V column signifies that the verb can be neutral even if the particle is not expressed: e.g., The building blew, The water boiled, The computer booted. The last column gives a synonym for the PV. Note that different meanings of the same PV (e.g., beat up, blow up, bolster up) necessitate different values in the lexicon grammar.
1 Transitive PV take a direct object and have no intransitive use: e.g., The bully beat up the child, but not *The bully beat up nor *The child beat up. Neutral PV (also referred to as ergative PV) take a direct object, which could also function as the subject: e.g., The terrorists blew up the building → The building blew up. For more information on neutral or ergative verbs within a lexicon grammar framework, see Machonis (1997).
Table 1: Sample Lexicon Grammar Table: Phrasal Verbs with the Particle up
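The table itself appears as an image in the original; one row of it might be represented programmatically as in the sketch below. This is our own illustration of the lexicon grammar columns just described: the field names, the `beat up` feature values, and the synonym are assumptions, not the authors' data format.

```python
from dataclasses import dataclass

@dataclass
class PVEntry:
    """One lexicon grammar row for a phrasal verb (field names are illustrative)."""
    n0_human: bool       # subject N0 can be human
    n0_nonhuman: bool    # subject N0 can be non-human
    verb: str
    particle: str
    n1_example: str      # example direct object N1
    n0_v_n1: bool        # verb keeps a similar meaning without the particle
    n1_v_part: bool      # neutral: N1 V Part (The computer booted up)
    n1_v: bool           # neutral even without the particle (The computer booted)
    synonym: str

# Hypothetical entry modeled on the example "The chef beat up the eggs".
beat_up = PVEntry(n0_human=True, n0_nonhuman=False, verb="beat", particle="up",
                  n1_example="the eggs", n0_v_n1=True, n1_v_part=False,
                  n1_v=False, synonym="whisk")
```

Encoding each plus/minus column as a boolean field mirrors the matrix layout of the lexicon grammar tables, so entries can be filtered or exported mechanically.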
Figure 1 below is a sample of the NooJ PV Dictionary, which mirrors all of the syntactic information
contained within the highlighted area of the lexicon grammar entry in Table 1 above. As can be seen,
there is also a French translation of the English PV in the NooJ dictionary. The NooJ PV Grammar in
Figure 2 works in tandem with this dictionary to annotate PV in large corpora. The bottom portion of
the graph represents the path for continuous PV, while the top portion of the graph represents the path
for identifying discontinuous PV, i.e., with a noun phrase and optional adverb inserted between the verb
and the particle. The noun phrase has an embedded NP structure, which is explained further in 3.1.
Most importantly, the PV grammar uses the NooJ functionality $THIS=$V$Part, which assures that a
particular particle must be associated with a specific verb in the PV dictionary in order for it to be
recognized as a PV. That is, NooJ only recognizes verb-particle combinations listed in the PV dictionary,
not simply any verb that can be part of a PV followed by any particle.
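The effect of the $THIS=$V$Part constraint can be illustrated with a toy matcher. This is not NooJ and not the authors' grammar: the dictionary contents, the lemmatized-token assumption, and the `max_gap` window standing in for the inserted noun phrase are all our own simplifications.

```python
# Toy PV dictionary of licensed (verb, particle) pairs; the real NooJ
# dictionary also carries syntactic features and French translations.
PV_DICT = {("figure", "out"), ("look", "up"), ("let", "out")}
PARTICLES = {"out", "up"}

def find_pv(tokens, max_gap=4):
    """Annotate continuous and discontinuous PV licensed by PV_DICT.

    tokens are assumed already lemmatized. A particle is accepted only when
    the specific verb-particle pair is in the dictionary, mimicking the
    $THIS=$V$Part check: any verb + any particle is NOT enough.
    """
    hits = []
    for i, verb in enumerate(tokens):
        # allow up to max_gap tokens (the inserted NP) between verb and particle
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] in PARTICLES and (verb, tokens[j]) in PV_DICT:
                hits.append((verb, tokens[j], i, j))
                break
    return hits
```

A continuous PV matches at adjacent positions, a discontinuous one across the gap, and an unlisted combination such as run + out is ignored even though out is a particle.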
Figure 1: NooJ PV Dictionary Showing Highlighted Area of Lexicon Grammar in Table 1
In addition to the PV grammar and dictionary that work together to identify PV in large corpora, we had to add other types of resources to remove noise. These include three disambiguation grammars, which examine the immediately preceding and following environments of potential PV and eliminate nouns that are mistaken for verbs (e.g., take a run down to Spain ≠ run down, his hands still in his pockets ≠ hand in), prepositions that are identified as PV (e.g., what a comfort I take in it ≠ take in), and prepositions that introduce locative expressions (asked you in Rome). We have also written adverbial and adjectival expression filters and idiom dictionaries that identify certain fixed expressions as "unambiguous," so that they cannot be given the TAS of PV (e.g., asked in a low tone ≠ ask in, put on one's guard ≠ put on, take an interest in ≠ take in). The underlined expressions in the examples represent fixed expressions, which consequently cannot be part of a candidate PV string. The goal of these extra grammars and dictionaries is to remove noise without creating silence. More details on these resources are described in Machonis (2016).
Figure 2: NooJ PV Grammar
This PV grammar is fairly accurate and, with the recent improvements (see section 3), can now correctly identify many discontinuous PV involving two, three, or four word forms, such as the following from our sample corpora – the 1881 Henry James novel The Portrait of a Lady (233,102 word forms) and an oral corpus consisting of 25 transcribed Larry King Live programs from January 2000 (228,950 word forms):
She had reasoned the matter well out, (Portrait of a Lady)
Shall I show the gentleman up, ma’am?
Mayor Ed Koch has a great new book out (Larry King Live)
We now know that they tracked it all the way down and then back up
That really turned the national economy around
3 Improvements to Original Grammar
To improve precision while avoiding noise, refinements were made to the original 2010 NooJ PV grammar. In this section, we present some of these enhancements.
3.1 PV Grammar
One of the first things we did was to limit the embedded NP node in the PV grammar (Figure 2). While the original NP node (Figure 3) could accept a variety of noun phrases, the refined NP node (Figure 4) and DET node (Figure 5) within the present grammar restrict the type of NP allowed. For example, the sentence he had found out was previously annotated as a PV2, i.e., as had NP out, on the model of the example above, has a great new book out. Now, since found does not have an article associated with the singular form, the verb in the sentence he had found out is no longer annotated as a PV, while the phrase has a great new book out, where the singular NP is introduced by a determiner, still is. The new NP node is also able to identify PV originally overlooked, such as the following from our oral corpus: and I checked all this out. Other PV noise removed by the new grammar includes the following from Portrait of a Lady:
with Pansy’s little figure marching up the middle of
the band of tapestry Pansy had left on the table.
on the contrary, he had only let out sail.
2 Although found is generally the past tense or past participle of the verb find, it can also be the present tense of the verb
found, an adjective (found materials), or a noun (they earn so much a week and found “food and lodging in addition to pay”).
NooJ recognizes all of these possibilities in the TAS.
The expression let out, however, is correctly identified as a PV in this last example. While disambiguation grammars and idiom dictionaries are able to remove noise automatically in NooJ, avoiding improper noun phrases is an important first step. In fact, the original 2010 PV grammar used a punctuation node after the particle to identify all discontinuous PV, which severely limited the recall of discontinuous PV. The new grammar without the punctuation node, however, had the disadvantage of identifying as PV many prepositional phrases involving overlapping particles and prepositions, which created a real difficulty for processing PV. Consequently, we added a series of disambiguation grammars.
Figure 3: Original NP Node in NooJ PV Grammar
Figure 4: Revised NP Node in NooJ PV Grammar
Figure 5: New DET Node in NooJ PV Grammar
3.2 Disambiguation Grammars
After a PV analysis, the TAS can be automatically modified by any of three PV disambiguation grammars, described in more detail in Machonis (2016), which specify certain structures that are not to be assigned PV status. The first grammar examines the environment to the right of a candidate PV string. This syntactically motivated grammar states that if a PV occurs with a pronoun object, the PV must be in the discontinuous format (e.g., figure it out, look him up, take them away). Thus, if an object pronoun follows a supposed particle, that particle must be a preposition, as in the following: what sort of pressure is put on them back in Cuba. The first disambiguation grammar specifies that this instance of put on (e.g., put on my T-shirt) is not a PV. The PV put on is very common in our oral corpus (e.g., put on nine pounds, put on my wedding dress, put on a prayer shawl, put my jeans on), yet shows enormous potential for overlapping with prepositional phrases.
The second disambiguation grammar identifies verbs that are actually nouns by examining the environment to the left of a hypothetical PV. In essence, if a determiner or adjective appears immediately before the hypothetical PV, this second disambiguation grammar correctly assumes that it is a noun and removes the PV status from the TAS. This grammar successfully eliminates much noise derived from PV that overlap with nouns, such as break in, check out, cheer up, figure out, hand in, head up, play out, sort out, take up, time in, etc.
Our third disambiguation grammar examines the environment to the right of a candidate PV string,
but specifically focuses on prepositions introducing locative prepositional phrases that are clearly not
part of a PV. This third disambiguation grammar makes use of a supplemental Locative Dictionary,
which contains some frequent locatives found in our corpora, such as church, library, sitting-room, as
well as place names such as London, Paris, Rome. These nouns are all marked as N+Loc and the PV
status is automatically removed from the TAS by this grammar. For example, the place names China
and New Hampshire, recently added to the Locative Dictionary, assure that the following sentences are
not considered PV:
We will be doing it again in New Hampshire.
The Democrats have a big dispute on China.
Not all locative expressions have to be added to the dictionary, since some noise is already avoided
by means of the new NP node in the PV grammar mentioned in 3.1 above. For example, the following
represent cases of noise recently removed from our transcribed Larry King Live corpus. That is, the
singular noun place does not have a determiner and the potential PV take something in is no longer
recognized as a PV in these cases:
that trial is going to take place in Albany
process that’s about to take place in Florida,
effort that will have to take place in the Pacific Ocean
3.3 Avoiding Idiom / Phrasal Verb Overlap
Although PV, as multiword expressions, are idioms in themselves, another problem we face in trying to
accurately identify PV in large corpora is the overlap of certain idiomatic expressions with PV. For
example, idioms that contain prepositions, such as in, off, and on, can easily be mistaken for PV in our
corpora:
asked her in a low tone ≠ PV ask in
put the girl on her guard ≠ PV put on
take an interest in her ≠ PV take in
Among the additional lexical resources incorporated into NooJ, we have an Adverbials Grammar, which identifies expressions such as at one time, in a low tone, in her lap, on one's mind, etc. as unambiguous adverbs (ADV+UNAMB). Consequently, neither the noun/verb (time) nor the preposition (in, on) can be associated with a PV.
Another grammar identifies a few idioms that also appear with certain support verbs such as have,
put, and take, but can create noise when they are identified as PV (e.g., have NP on, put NP on, take NP
off). This Adjectivals Grammar labels certain expressions as unambiguous adjectives (A+UNAMB) and
consequently eliminates PV noise from sentences such as:
she had been much on her guard
the airport has already put grief counselors on duty here
was waiting to take him off his guard.
We have also incorporated into our PV analysis a larger dictionary of simple Prep C1 idioms that do
not have multiple modifiers, such as in a haze, in the clouds, off duty, on a collision course, out of the
question, over the top, etc. These are assigned the notation A+PrepC1+UNAMB, thus avoiding noise
with potential PV using the particles in, off, on, out, over. For example, the expression on pins and
needles is no longer confused with the PV keep something on in the following sentence from Portrait of
a Lady: Your relations with him, while he was here, kept me on pins and needles.
Finally, there are two dictionaries that work in tandem with grammars targeting more complex idiomatic expressions such as keep an eye on, take an interest in, take great pleasure in, have an opinion on, put blame on, take part in, etc., where the frozen noun can take a variety of determiners and modifiers (Idiom1). Another dictionary lists expressions such as turn one's back on, turn one's back against, etc. (Idiom2).
(Idiom2). These two dictionaries work together with grammars that first identify these idioms. For
example, the Idiom1 Grammar in Figure 6 annotates the expression put the blame on as V+Idiom1, DET,
N+Idiom, PREP+Idiom. Then the Idiom Disambiguation Grammar (Figure 7) removes any potential
PV from the TAS with the “this is not a PV” notation <!V+PV>, thus avoiding noise with the true PV
put something on.
Figure 6: Verbal Idiom1 Grammar
Figure 7: Idiom Disambiguation Grammar
As can be seen, most of the potential noise created by idiomatic expressions comes from the prepositions in and on, especially when used with high frequency verbs such as keep, have, put, take, and turn.
4 Results and Future Work
In Table 2, we present an indication of improvements made to our overall PV analysis using the two
sample texts – the 1881 novel The Portrait of a Lady (233,102 word forms) and the 2000 oral corpus of
transcribed Larry King Live programs (228,950 word forms). The year 2010 represents the output of
our original grammar with a punctuation node that overlooked many discontinuous PV. The year 2016 represents results when the punctuation node was removed from the PV Grammar and the three disambiguation grammars were added. The 2018 results represent more recent modifications to the PV Grammar, along with fine-tuning of the adverbial and adjectival expression filters and auxiliary dictionaries
within NooJ. The first column represents the overall number of PV strings identified by NooJ. If a PV
was automatically removed by a disambiguation grammar or other filter, it was not counted. Of the
potential PV strings identified, the next two columns represent correct continuous and discontinuous PV
manually verified. The next two columns represent noise: either prepositions that were incorrectly annotated as particles and part of a PV, or other PV misidentifications due to nouns mistaken for verbs
(e.g., I should spend the evening out), verbs for nouns (to make her reach out a hand), etc. The last
column represents precision: True Positives / (True Positives + False Positives).
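For instance, for the Portrait of a Lady (2018) row of Table 2, the precision figure can be recomputed from the four count columns (a trivial check in Python):

```python
def precision(correct_continuous, correct_discontinuous,
              noise_prepositions, noise_misidentifications):
    """Precision = True Positives / (True Positives + False Positives)."""
    tp = correct_continuous + correct_discontinuous
    fp = noise_prepositions + noise_misidentifications
    return tp / (tp + fp)

# Portrait of a Lady (2018): 426 + 152 correct PV, 55 + 3 incorrect PV
p = precision(426, 152, 55, 3)
print(f"{p:.2%}")  # 90.88%, matching the last column of Table 2
```

Note that tp + fp here equals 636, the total number of potential PV strings identified by NooJ for that row.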
As can be seen, the number of PV automatically identified by NooJ has grown since we first started this long-term project, mainly because many discontinuous PV were not annotated due to the punctuation node requirement of the initial grammar. Although this change also created much noise (false positives, incorrect PV) in our 2016 analysis, recent changes to the PV Grammar, the addition of idiom dictionaries, and the tweaking of disambiguating grammars and other linguistic resources have greatly improved precision, especially with our literature sample. Precision is still a major problem with our oral
corpus, however, chiefly due to the noise created by the prepositions in and on. In fact, precision can be
greatly improved by removing all PV with the particles in and on from the NooJ dictionary. While this
does not accomplish the main NLP goal of annotating every PV in a large corpus, our resource could
serve as a springboard for a purely linguistic endeavor, such as analyzing the evolution of PV throughout
the history of the English language.
Table 2: Comparison of PV Identification in 2010 & 2016 Studies vs. Today

TEXT                      | Potential PV strings identified by NooJ | Correct PV (continuous) | Correct PV (discontinuous) | Incorrect PV (Prepositions) | Incorrect PV (Misidentifications) | Percentage incorrect | PRECISION: Percentage of PV correctly identified
Portrait of a Lady (2010) | 583 | 405 | 83  | 44  | 51 | 16.30% | 83.70%
Portrait of a Lady (2016) | 658 | 426 | 152 | 62  | 18 | 12.16% | 87.84%
Portrait of a Lady (2018) | 636 | 426 | 152 | 55  | 3  | 9.12%  | 90.88%
Larry King Live (2010)    | 614 | 424 | 102 | 53  | 35 | 14.33% | 85.67%
Larry King Live (2016)    | 800 | 451 | 172 | 136 | 39 | 21.88% | 77.88%
Larry King Live (2018)    | 730 | 452 | 169 | 97  | 12 | 14.93% | 85.07%

Thim (2012:201-5), in his detailed PV study, highlights “the little attention Late Modern English – in particular the 19th century – has received.” “Most of the 19th century is not covered at all,” he states. A work in progress actually involves reducing the NooJ dictionary to include only six particles (out, up,
down, away, back, off) instead of twelve, which has helped us achieve 98% precision in an analysis of
The Portrait of a Lady. In order to get a better idea of the evolution of PV in Late Modern English, we
plan to use the NooJ PV Grammar, with a limited PV Dictionary, to analyze numerous 19th century texts
from a variety of authors, both American (Herman Melville, James Fenimore Cooper, Washington Irving, Nathaniel Hawthorne, Harriet Beecher Stowe, Mark Twain, Edith Wharton) and British (Charles
Dickens, Jane Austen, Walter Scott, the Brontë sisters, George Eliot, Thomas Hardy, Oscar Wilde).
Previously, Hiltunen (1994:135) examined English texts, limiting his searches to these six typical particles representing three levels of PV frequency: high (out, up), mid (down, away), and low (back, off).
By doing the same, we could get an accurate snapshot of PV usage, although limited to these six particles, in numerous 19th century novels with improved precision. In fact, our dictionary and grammar
could become very useful instruments to automatically measure PV usage in different genres – novels,
plays, nonfiction, technical material, daily life texts (e.g., news articles, blogs), etc. – and at different
periods in the history of the English language.
In conclusion, the NooJ PV dictionary and grammar are great resources for identifying this most
difficult, characteristic feature of the English language. While PV are indeed a “pain in the neck for
NLP,” what we have described is a reliable first step in accurately identifying them in large corpora,
while automatically removing as much noise as possible. As we have seen in this paper, incorporating
other idioms in an NLP analysis greatly helps to alleviate this noise. And although certain prepositions
still create a fair amount of noise, many of these problems can eventually be resolved when we are able
to build a NooJ grammar to recognize entire English sentences, another future goal.
References
Dwight Bolinger. 1971. The Phrasal Verb in English. Harvard University Press, Cambridge, MA.
Bruce Fraser. 1976. The Verb-Particle Combination in English. Academic Press, New York, NY.
Maurice Gross. 1979. On the failure of generative grammar. Language 55(4):859-885.
Maurice Gross. 1994. Constructing Lexicon grammars. In Beryl T. (Sue) Atkins & Antonio Zampolli (eds.), Computational Approaches to the Lexicon, 213-263. Oxford University Press, Oxford, UK.
Maurice Gross. 1996. Lexicon Grammar. In Keith Brown & Jim Miller (eds.), Concise Encyclopedia of Syntactic
Theories, 244-258. Elsevier, New York, NY.
Beate Hampe. 2002. Superlative Verbs: A corpus-based study of semantic redundancy in English verb-particle
constructions. Gunter Narr Verlag, Tübingen, Germany.
Risto Hiltunen. 1994. On Phrasal Verbs in Early Modern English: Notes on Lexis and Style. In Dieter Kastovsky
(ed.), Studies in Early Modern English, 129-140. Mouton de Gruyter, Berlin, Germany.
Arthur Garfield Kennedy. 1920. The Modern English Verb-adverb Combination. Stanford University Press, Stanford, CA.
Tomoshichi Konishi. 1958. The growth of the verb-adverb combination in English: A brief sketch. In Kazuo Araki, Taiichiro Egawa, Toshiko Oyama & Minoru Yasui (eds.), Studies in English grammar and linguistics: A miscellany in honour of Takanobu Otsuka, 117-128. Kenkyusha, Tokyo, Japan.
Peter A. Machonis. 1997. Neutral verbs in English: A preliminary classification. Lingvisticæ Investigationes
21(2):293-320.
Peter A. Machonis. 2009. Compositional phrasal verbs with up: Direction, aspect, intensity. Lingvisticae Investigationes 32(2):253-264.
Peter A. Machonis. 2010. English Phrasal Verbs: from Lexicon Grammar to Natural Language Processing. Southern Journal of Linguistics 34(1):21-48.
Peter A. Machonis. 2012. Sorting NooJ out to take Multiword Expressions into account. In Kristina Vučković, Božo Bekavac, & Max Silberztein (eds.), Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the NooJ 2011 International Conference, 152-165. Cambridge Scholars Publishing, Newcastle upon Tyne, UK.
Peter A. Machonis. 2016. Phrasal Verb Disambiguating Grammars: Cutting Out Noise Automatically. In Linda
Barone, Max Silberztein, & Mario Monteleone (eds.), Automatic Processing of Natural-Language Electronic
Texts with NooJ, 169-181. Springer International Publishing AG, Cham, Switzerland.
NooJ: A Linguistic Development Environment. http://www.nooj4nlp.net/
Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake & Dan Flickinger. 2002. Multiword expressions: A
pain in the neck for NLP. Proceedings of the Third International Conference on Intelligent Text Processing and
Computational Linguistics, 1-15. CICLING, Mexico City, Mexico.
Max Silberztein. 2016. Formalizing Natural Languages: The NooJ Approach. Wiley ISTE, London, UK.
Stephan Thim. 2012. Phrasal Verbs: The English Verb-Particle Construction and Its History. Walter de Gruyter,
Berlin, Germany.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 28–37, Santa Fe, New Mexico, USA, August 20, 2018.
Designing a Croatian Aspectual Derivatives Dictionary: Preliminary Stages
Kristina Kocijan
Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
[email protected]

Krešimir Šojat
Department of Linguistics, Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
[email protected]

Dario Poljak
Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
[email protected]
Abstract
The paper focuses on derivationally connected verbs in Croatian, i.e. on verbs that share the same lexical morpheme and are derived from other verbs via prefixation, suffixation and/or stem alternations. As in other Slavic languages with rich derivational morphology, each verb is marked for aspect, either perfective or imperfective. Some verbs, mostly of foreign origin, are marked as bi-aspectual verbs. The main objective of this paper is to detect and to describe major derivational processes and affixes used in the derivation of aspectually connected verbs with NooJ. Annotated chains are exported into a format adequate for a web-based system and further used to enhance the aspectual and derivational information for each verb.
1 Introduction*
In this paper we deal with the representation of derivational processes in Croatian, a South Slavic language with rich inflectional and derivational morphology. The paper focuses on derivationally connected verbs, i.e. on those verbs which share the same lexical morpheme and which are derived from other verbs mostly via prefixation and suffixation. As in other Slavic languages, each verb is always marked for aspect and classified as perfective, imperfective, or bi-aspectual. Generally, the imperfective aspect is used to describe actions, processes and states as unfinished or ongoing, whereas the perfective aspect refers to them as finished or completed, e.g.:
• 1a. Pisao sam (imperfective) pismo jedan sat. 1b. I was writing a letter for an hour.
• 2a. Napisao (perfective) sam pismo za jedan sat. 2b. I wrote a letter in an hour.
Verbs like pisati "to write + imperfective" and napisati "to write, to finish writing + perfective" are referred to as aspectual pairs. Verbs in aspectual pairs are closely related in meaning, except that one expresses imperfective and the other perfective aspect. Aspect in Croatian is morphologically marked in each verbal form and it affects the inflectional properties of verbs (e.g. only perfectives can be used in the aorist, and imperfectives in the imperfect past tense; gerunds are commonly formed from imperfectives
etc). Aspect in Croatian is regarded as a word-formation process and members of aspectual pairs are treated as separate lexical entries in dictionaries. In terms of derivation, perfectives are commonly derived from imperfectives by prefixation, while imperfectives are formed from perfectives by suffixation or stem alternation. The presence of a certain affix indicates whether a verb is a perfective or an imperfective. A relatively small group of bi-aspectual verbs, predominantly of foreign origin, can be used as perfectives and imperfectives in the same morphological form. Various factors can determine whether they will be used as perfectives or imperfectives (e.g. the context, the type of time adverbial used in a sentence etc.). Although based on the opposition of only two aspects and overtly marked, numerous
This work is licenced under a Creative Commons Attribution 4.0 International License. Page numbers and proceedings footer are added by the organizers. License details: http://creativecommons.org/licenses/by/4.0/
28
studies in the area of second language acquisition indicate that aspect is one of the most complicated categories for learners of Slavic languages.
In this paper we present preliminary stages in the construction of the database of Croatian aspectually and derivationally connected verbs, i.e. aspectual derivatives. Apart from its potential pedagogical use, the database of aspectual derivatives is one of the first attempts to systematically present this area of Croatian derivational morphology. The paper is structured as follows: In Section 2, we briefly describe major derivational processes in Croatian and focus on the derivation of verbs from other verbs and aspectual changes that take place. Sections 3 and 4 present the processing of aspectual derivatives in Croatian in NooJ and provide an overview of underlying principles. In Section 5, the design and the structure of the database is discussed. The paper concludes with an outline of future work.
2 Derivation of verbs
Major word-formation processes in Croatian are derivation and compounding. In what follows, we discuss only derivation, which predominantly consists of affixation. Although there are some other processes like conversion or back formation, they are not as prominent as prefixation or suffixation (Šojat et al., 2013).
Croatian verbs can thus be divided into simple imperfectives (pisati "to write + imperfective") and prefixed perfectives (na-pisati "to write + perfective") denoting that the action is completed. Other prefixes used for the derivation of perfectives add different semantic features (pisati "to write + imperfective" – pre-pisati "to copy by writing + perfective" – pot-pisati "to sign + perfective") and enable further derivation, either through prefixation, suffixation or simultaneous prefixation and suffixation. These perfectives can be derived into secondary imperfectives denoting iterative actions through suffixation (potpis-ivati "to sign several / many times"). Other suffixes are used for the derivation of diminutive verbs (e.g. pisati – pis-karati "to scribble + imperfective") or verbs expressing punctual actions (vikati "to shout + imperfective" – viknuti "to shout once + perfective"). Further, some secondary imperfectives are derived via prefixation into perfectives denoting distributive actions (is-potpisivati "to sign each one + perfective", e.g. each letter, every document etc.). In some cases aspectual distinctions are expressed by vowel variations or suppletive forms (e.g. doći "to come + perfective" – dolaziti "to come + imperfective"). A detailed account of morpho-semantic relations among Croatian verbal derivatives can be found in Šojat et al. (2012).
In the following section we show how existing language resources can benefit from the information about verbal aspect in terms of their extension and enrichment. We demonstrate this on the inflectional dictionary for Croatian verbs in NooJ.
3 Verbs in NooJ Dictionary
The main language resources (LR) for Croatian, as prepared in Vučković (2009) and explained in Vučković et al. (2010), include the dictionary of Croatian verbs. Each verb was originally marked for the main category <V>, reflexive property <V+Prelaz=pov> and a paradigm rule FLX responsible for describing the rules used to build all the simple verb forms,1 for example:
pisati,V+FLX=PISATI “to write”
asistirati,V+FLX=SJATI “to assist”
sjati,V+Prelaz=pov+FLX=SJATI “to shine”.
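Such entries are plain comma-separated text, so they are easy to process outside NooJ as well; a minimal reader (our own sketch, not part of NooJ) can split a lemma from its category and attribute markers:

```python
def parse_entry(line):
    """Split a NooJ-style line 'lemma,CAT+Attr=value+FLX=NAME' into
    a lemma, a category, and a dictionary of attributes."""
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    category, attrs = parts[0], {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        attrs[key] = value
    return lemma, category, attrs

print(parse_entry("sjati,V+Prelaz=pov+FLX=SJATI"))
# ('sjati', 'V', {'Prelaz': 'pov', 'FLX': 'SJATI'})
print(parse_entry("pisati,V+FLX=PISATI"))
# ('pisati', 'V', {'FLX': 'PISATI'})
```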
In all cases, the name of a verb is used as a representative of a specific conjugation paradigm. For example, the verb sjati "to shine" uses the set of conjugation rules that we refer to as SJATI <FLX=SJATI>. All the other verbs that share the same set of conjugation rules will be associated with this paradigm name, like the verb asistirati "to assist." In addition to these three markers, a small subset of verbs was also marked for valency (Vučković et al., 2010) to improve the performance of the Croatian chunker, and some verbs were marked for aspect.
Although each Croatian verb has an aspect by its nature, either perfective, imperfective or bi-aspectual, this information was not originally encoded into the main dictionary entries. This is mainly due to
1 Complex tenses in Croatian such as future I and II, perfect, pluperfect, or conditionals are described within syntax gram-mars and are beyond the scope of this paper.
the absence of that information in the resources built after the MULTEXT-East specifications prepared by Tadić, as explained in Erjavec (2001), from which NooJ resources have been adopted.
There are 4,168 main verb entries in the NooJ Croatian dictionary. At the beginning of our project, there were only 1,448 verbs that had been marked as either perfective <Aspect=perf>, imperfective <Aspect=imperf> or bi-aspectual <Aspect=bi> regarding the category of aspect (Table 1). This means that a little over 65% of verbs had no marker assigned for this category.
Total +perf +imperf +bi no marker
4,168 673 534 241 2,720
Table 1: Original distribution of Aspect markers
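The figures in Table 1 are straightforward to verify (a quick arithmetic check in Python):

```python
total = 4168
marked = {"perf": 673, "imperf": 534, "bi": 241}  # counts from Table 1

unmarked = total - sum(marked.values())
print(unmarked)                   # 2720 verbs with no Aspect marker
print(f"{unmarked / total:.1%}")  # 65.3%, i.e. a little over 65%
```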
The importance of the information on aspect encoded in the verb lies, among other things, in the list of possible verbal forms. For example, only perfective verbs can have aorist past tense and past adverbial participle forms, while only imperfective verbs can have the imperfect past tense and the present adverbial participle. Before we could add the tag for aspect automatically to the remaining verbs, it was important to check if the existing aspect markers were correctly assigned and what data could be used to correctly mark the remaining 65% of verbs. Since the list of possible tenses is embedded into the conjugation paradigms (FLX), we decided to start our investigation with them.
3.1 Data Analysis of the original dictionary
At the very beginning of our analysis, we found two paradigms that were using both Aorist and Imperfect endings [FLX=POKLONITI and FLX=BIRATIDODATI]. Two does not sound like a large number, but if we take into account that the POKLONITI paradigm is responsible for the conjugation of 106 verbs and the BIRATIDODATI paradigm for 204, we are no longer talking about small numbers. We proceeded with the analysis by double-checking each of the 310 verbs. We learned that all the verbs using the POKLONITI paradigm are actually perfective in aspect (as is the paradigm itself), while the other verbs are all bi-aspectual, which makes the presence of aorist and imperfect endings appropriate. Thus, we were able to annotate these verbs with the appropriate markers <Aspect=perf> and <Aspect=bi>, respectively. In addition, we revisited the POKLONITI paradigm and removed the endings for Imperfect. At this point, we also decided to recheck the aspect category of all the verbs.
To avoid checking each verb individually, we cross-referenced all of the assigned aspects with the aspect of the verb used to name the conjugation paradigm. From that list (Table 2), after removing all the verbs whose aspect matched the aspect of a paradigm, and all the verbs that had no aspect defined, we were left with only 62 verbs that had been marked as mismatched and that needed to be checked manually.
Since the mismatched aspect could be due either to an incorrectly marked aspect of a verb or to an incorrectly described paradigm, we checked both, starting with the verbs. Paradigm analysis revealed some missing aorist and/or past participle descriptions in the category of perfective verbs, and missing imperfect and/or present participle descriptions of imperfective verbs. Also, some descriptions of bi-aspectuals were missing, either aorist and past participle descriptions, or imperfect and present participle descriptions. All of these cases were corrected to mirror the related aspect, either by correcting the value of the Aspect attribute or by changing the paradigm name.
For the final analysis we wanted to make sure that there were no duplicates among our paradigms, i.e. that we did not have different paradigm names describing the same conjugation patterns. We detected 16 such occurrences, which we resolved by choosing the paradigm that was used more often in the dictionary.
3.2 The new verb dictionary and paradigms
There are 209 paradigms that describe the conjugation rules for Croatian verbs. Some rules describe only one verb in the dictionary (there are 67 such rules), while others describe more (Figure 1). The largest group of verbs (538), described by the same paradigm BIRATI, makes up 13% of all the verbs in the dictionary, while the runner-up (SJATI) describes 7% of verbs.
Table 2. Distribution of different aspects assigned to paradigms
Figure 1. Distribution of paradigms among Croatian verbs
However, a closer look shows us that these descriptions are not as different as they might first appear. We will show this through a detailed analysis of simple verb forms by tense category, starting with present tense.
What makes these paradigms different is the list of tenses they describe, but also how each tense is described. These two categories (list of tenses, form of tenses) are linked by an OR operator, meaning that the difference in either one or both from the existing list of paradigms, results in a new paradigm.
To demonstrate, let us compare the UGOJITI and POKLONITI paradigms. They both have the same list of tenses they can form (present, imperative, PDR - verbal adjective active, PDT - verbal adjective passive, aorist and GPP - past adverbial participle), but the way they form the PDT differs [<B3>en(:PDT) and <B3>jen(:PDT), respectively] and thus they constitute different paradigm rules.
However, since NooJ allows multiple usage of its grammars (Silberztein, 2016), each tense rule is described as a separate sub-grammar and then called from a paradigm where needed. In order to describe derivations of Croatian simple verb forms that are in the NooJ dictionary, we built 280 such sub-grammars whose different combinations build 209 paradigms that describe 4,168 verbs and recognize 377,603 forms, taking into account simple verb tenses (long/short infinitive, present, imperfect, aorist, passive/active verbal adjectives, imperative, present/past adverbial participles), gender (masculine, feminine, neuter), number (singular, plural) and person (1st, 2nd, 3rd).
Figure 2. Distribution of sub-grammars per tense
As expected, inside each tense category, some descriptions are more common than others. This distribution (Figure 2) is different for each of the tenses. The same is true for the number of rules, which ranges from 15 (GPP) to 61 (Present). The number of paradigms that do not have rules for a given tense is marked in gray (for example: 2,153 paradigms do not have Imperfect, 2,079 do not have GPP, 2,015 do not have GPS etc.). Present is the only tense that is found in all paradigms.
Figure 3. Distribution of Present rules used among existing verbs
Figure 3 shows the distribution of rules2 used to build the present tense found in the existing 209 paradigms. It may look as if there are 61 different suffixes for the Croatian present tense, but this is not the case. Throughout the paradigms, the same set of suffixes is used for all three genders (masculine, feminine, neuter), for all three persons singular [-m, -š, -/], and for the first and second person plural [-mo, -te]. The third person plural may have the ending [-e] or [-u]. So, why are there 61 different present tense descriptions in this figure?
2 Each circle represents one rule; the bigger the circle, the more verbs the rule describes.

Although the set of suffixes is the same (with two possible alterations for 3rd person plural), the changes that occur before the suffix differ. In some cases it is enough to remove (<B2>) only the infinitive ending (-ti or –ći) and add the suffix, but in some cases more letters need to be removed (<B3>..<B5>) and more letters prior to the suffix need to be added, as in the following examples:
isteći -> <B2>če(:PRsingular) => istečem [removes last 2 characters, inserts ‘če’ before adding the suffix for Present]
izleći -> <B2>že(:PRsingular) => izležem [removes last 2 characters, inserts ‘že’ before adding the suffix for Present]
otprijeti -> <B5>e(:PRsingular) => otprem [removes last 5 characters, inserts ‘e’ before adding the suffix for Present].
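The <Bn> commands can be mimicked with a small helper (our own sketch, not NooJ's actual inflection engine; the final -m below is the first person singular suffix from the set given above):

```python
def present_form(infinitive, delete, insert, suffix):
    """Mimic a NooJ rule: <Bn> deletes the last n characters of the
    infinitive, then a stem extension and a personal suffix are added."""
    return infinitive[:-delete] + insert + suffix

print(present_form("isteći", 2, "če", "m"))    # istečem  (<B2>če)
print(present_form("izleći", 2, "že", "m"))    # izležem  (<B2>že)
print(present_form("otprijeti", 5, "e", "m"))  # otprem   (<B5>e)
```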
After removing duplicate verbs from the dictionary and sorting out the paradigm sets, we were able to automatically add the missing aspect information for all unmarked verbs. The total number of verbs now amounts to 4,134. The largest aspect category is perfective verbs, followed by imperfective verbs and bi-aspectual verbs. Their distribution is visualized in Figure 4.
Figure 4. Distribution of verbs per aspect category in the dictionary and number of paradigms used to describe each category
4 Grammar modeling for aspectual derivatives
Computational derivation is a well-known process for creating new words in order to enlarge the dictionary economically (Trost, 2003). NooJ provides two routes to describe derivations.
The first one uses a derivational module that allows a direct link between the dictionary with the list of words that can be derived and a grammar that provides rules for their derivation, either graphically in the form of an enhanced recursive transition network (ERTN) or via formal grammar rules as a context-free grammar (CFG). This link is defined via an attribute DRV that holds the name of the paradigm responsible for the allowed derivation(s).
The second one uses a morphological grammar module that may simulate the dictionary entries via ERTN. It can recognize a defined set of letters and tag them in the same manner that we would manually do in the dictionary. The difference is that the grammar can have a few graphs describing multiple dictionary entries (e.g. if we wanted to recognize and tag Roman numerals, we could do it with a minimum of five graphs instead of 3,999 dictionary entries).
Since our main objective is to produce derivational paths in a format that we can use to populate the web-based database, we have opted for the second approach, which left us more room to accommodate the output to the database design (see Figure 5). To avoid recognizing words that start with the same set of letters as the prefixes used in derivations, we introduced a constraint that checks the dictionary to validate that the primary verb exists. So, for example, sufinancirati "co-finance" will be recognized, since there is a verb ‘financirati,V’ in the dictionary. On the other hand, the word suncobran “parasol” will not be recognized, since there is no dictionary entry marked as ‘ncobran,V’ (nor any such word in Croatian).
For the preliminary grammar model, we used all the derivations for the verb pisati “to write” (Table 3a) and the verbs bacati / baciti “to throw away” (Table 3b). All the pairs have both perfective and
imperfective forms. This is not true in only two cases: for the derived form napisati, which has no aspectual pair, and the aspectual pair ispotpisati - ispotpisivati, which share the same aspect (perfective).3 If we put aside these two exceptions, from the remaining pairs we can conclude that if there is a verb in the dictionary to which a prefix is added, then the newly derived verb will be perfective in aspect. If a verb derived in such a manner adds suffix2 (SUF 2), then the new verb will be imperfective in aspect if the length of suffix2 is 3 or 4, and perfective if its length is 0, 1 or 2.
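Setting those two exceptions aside, the length-based rule can be sketched as follows (our own illustration; the segmentations in the comments assume the prefix + root + suffix2 decomposition used in the grammar):

```python
def derived_aspect(suffix2):
    """Aspect of a prefix-derived verb, predicted from the length of
    suffix2: length 3 or 4 -> imperfective, length 0 to 2 -> perfective."""
    if len(suffix2) in (3, 4):
        return "imperfective"
    if len(suffix2) <= 2:
        return "perfective"
    raise ValueError("length > 4 is not covered by the rule")

# assumed segmentation is-pis-iva-ti: suffix2 'iva' (length 3)
print(derived_aspect("iva"))  # imperfective, as in ispisivati
# assumed segmentation na-pis-a-ti: suffix2 'a' (length 1)
print(derived_aspect("a"))    # perfective, as in napisati
```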
Table 3. Aspectual derivatives of the root verbs a) PISATI "to write" and b) BACATI / BACITI "to throw away"
We have applied these rules to the morphological grammar built in NooJ. Figure 5 illustrates how the grammar works, using the verb ispisivati “to write out.”
Figure 5. Morphological grammar that recognizes and annotates a verb derived by a single prefixation and suffixation (example of the verb ispisivati)
Possible prefixes in the first position (i.e. the position closest to the root), such as the prefix is, are listed in the P1_ node. The following node holds any set of letters which are recognized as the root of the verb, used to build the constraints that check if such a root concatenated with a + ti exists in the dictionary as a verb in infinitive form whose Aspect is defined as INF <pis+a+ti=:V+INF+inf>. If this constraint is validated, we check the length of suffix2. Since the length in our example is 3, we proceed with the path where <$@S2$LENGTH=3>. It then leads us to the annotation section that adds the recognized lemma as the superlemma of the derived verb, and marks the POS, Root and Derivational chain with the aspect marker for each derivation [+DERIV=pisati_inf->ispisati_fin->ispisivati_inf]. This information is then exported from NooJ and added to the Specifics table of our web database, as discussed in Section 5.

3 This may be due to the fact that the verb ispotpisivati is actually derived from potpisivati, while the verb ispotpisati is redundant in the semantics of its prefixes, i.e. the prefix is does not bring anything new to the meaning of pot in this context. In the hrWaC 2.2 web corpus (Ljubešić & Klubička, 2014) it shows up only 7 times, mostly in an informal web setting.
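The exported +DERIV value can itself be unpacked into ordered derivation steps before it is loaded into the Specifics table; in the sketch below (our own post-processing code), we assume the _inf and _fin suffixes mark imperfective and perfective steps, as the chain for ispisivati suggests:

```python
ASPECT = {"inf": "imperfective", "fin": "perfective"}  # assumed abbreviations

def parse_deriv(chain):
    """Turn a '+DERIV=a_inf->b_fin->c_inf' chain value into ordered
    (verb, aspect) steps ready for database insertion."""
    steps = []
    for item in chain.split("->"):
        verb, _, tag = item.partition("_")
        steps.append((verb, ASPECT.get(tag, tag)))
    return steps

print(parse_deriv("pisati_inf->ispisati_fin->ispisivati_inf"))
# [('pisati', 'imperfective'), ('ispisati', 'perfective'), ('ispisivati', 'imperfective')]
```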
In Table 3b, there is an imperfective verb BACATI "to throw away" from which the perfective verb nabacati "to throw onto" is derived by prefixation. Its imperfective pair is the verb nabacivati "to throw onto" derived via suffixation. The perfective verb BACITI uses prefixation to produce a perfective form nabaciti whose imperfective pair is also nabacivati. This means that the imperfective form nabacivati
should be found in both aspectual derivational chains. But ambiguity is not a stranger to language.
5 Database and interface design
After extracting the data into usable chunks, we wanted to present it in a form usable to others. To reach the widest possible audience with our tool, we focused on bringing it to the web. In that way it will be available to everyone with basic Internet access and can be dynamically updated as new chains are prepared within NooJ. However, to accomplish that, we needed to create a searchable interface backed by a well-structured database, whose main function is to support our information system as defined in Gunjal (2003).
Due to the nature of our data, we decided to split it into three separate semantic data-groups (Figure 6). The Main data set stores all the data at the morphological level that will be searchable through the online tool, with various levels of granularity. The Specifics provides derivational data focusing on one model used for the derivation. The Examples set unifies all the semantics and should provide a better understanding of the verb’s usage.
Figure 6. Data structure of three data groups
Since we are using CSV to represent our data, it is already in a rather structured state. Thus, our data can be described as a structure in the first normal form. This means that the data itself can fit into a tabular format and that it always contains only one value for each cell. The first normal form also assumes the usage of primary keys for the unique representation of every row of data. This can be performed automatically while importing it into the database as described in Gilfillan (2015).
As can be seen in Figure 6, there is still some redundant data, since the Main Data file contains both the Form and Root components, which are present in the other files as well. The idea behind the interface is to let users search by the Form or Root fields, with additional (optional) Suffix and Prefix information, and then conditionally show examples and specifics depending on which Form is selected. We can therefore reduce the clutter by imposing a foreign key constraint after importing the files into our database and removing the Form and Root information from the other two tables (since that data will already be in the search results from querying the Main Data file).
We used MySQL (Coulter, 2017) to store the data and created primary keys as well as foreign key constraints. After importing all the data and imposing the foreign key constraints, we are left with three tables (Figure 7) whose structure is almost identical to the one we started with (Figure 6). The Form and Root fields are now present exclusively in the Main Data table. To access them from the other tables, we use a foreign key named mainID constrained to the ID field, which also acts as the primary key of the Main Data table. The foreign key constraint is set up in a way that lets us update or delete multiple records at once without ever leaving the Main Data table.
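The layout described above can be sketched as follows. This is a minimal illustration using SQLite as a stand-in for MySQL; apart from mainID and ID, the table and column names are our own illustrative choices, not the actual schema.

```python
import sqlite3

# Minimal sketch of the three-table design: a searchable Main Data table
# plus Specifics and Examples tables linked to it via a mainID foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE main_data (
    id     INTEGER PRIMARY KEY,           -- 'ID', primary key of the searchable table
    form   TEXT NOT NULL,                 -- verb form, e.g. 'nabacivati'
    root   TEXT NOT NULL,                 -- root, e.g. 'bac'
    prefix TEXT,
    suffix TEXT
);
CREATE TABLE specifics (
    id     INTEGER PRIMARY KEY,
    mainID INTEGER NOT NULL REFERENCES main_data(id)
           ON UPDATE CASCADE ON DELETE CASCADE,
    model  TEXT                           -- derivational model used
);
CREATE TABLE examples (
    id       INTEGER PRIMARY KEY,
    mainID   INTEGER NOT NULL REFERENCES main_data(id)
             ON UPDATE CASCADE ON DELETE CASCADE,
    sentence TEXT                         -- contextual example of usage
);
""")

conn.execute("INSERT INTO main_data VALUES (1, 'nabacivati', 'bac', 'na', 'iva')")
conn.execute("INSERT INTO specifics (mainID, model) VALUES (1, 'prefix+suffix')")
conn.execute("INSERT INTO examples (mainID, sentence) VALUES (1, '...')")

# Deleting from Main Data alone cascades to the dependent rows in both
# child tables, without ever touching them directly.
conn.execute("DELETE FROM main_data WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM specifics").fetchone()[0]
print(remaining)  # 0
```

The ON UPDATE/ON DELETE CASCADE clauses are what make it possible to update or delete multiple records at once from the Main Data table only.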
Figure 7. Representation of tables and their data in the database
Furthermore, a search is performed on only one (Main Data) table with all the additional information
retrieved only upon the user’s request. Figure 8 depicts one such instance when the verb form ispisivati
is selected for search.4 The derivational chain [Derivacija: pisati -> ispisati -> ispisivati] holds additional
information on aspect that is color-coded (perfective verbs are marked in red, imperfectives in blue) and
available via a hover feature.
Figure 8. Web interface showing additional information on derivation and examples for the verb ispisivati
6 Conclusion
In this paper we presented the preliminary steps taken in the processing of Croatian aspectual pairs. This phase of the project consists of the extension of the existing verb dictionary and its enrichment with information on verbal aspect, which resulted in a significant expansion and improvement of its coverage. The second preliminary step was to design and build the web-based database of aspectual derivatives. The database will enable various types of queries and provide information about the affixes used in a specific derivational process, full derivational chains of verbs, basic meaning definitions, contextual examples, etc. Our next goal is to populate the database with aspectual derivatives from other derivational families. The database will remain free for on-line search.
4 The web interface is deployed on a Heroku server at https://vidski-parnjaci.herokuapp.com.
References
Tom Coulter. 14.12.2017. Why MySQL is still King. Retrieved 22.06.2018 from https://www.freelancer.com/community/articles/why-mysql-is-still-king
Tomaž Erjavec (ed.). 2001. Specifications and Notation for MULTEXT-East Lexicon Encoding. MULTEXT-East
Report, Concede Edition D1.1F/Concede, Jožef Stefan Institute, Ljubljana. http://nl.ijs.si/ME/Vault/V2/msd/msd.pdf
Ian Gilfillan. 13.06.2015. Database Normalization Overview. MariaDB Docs. Retrieved 22.06.2018 from https://mariadb.com/kb/en/library/database-normalization-overview/
Bhojaraju Gunjal. 2003. Database Management: Concepts and Design. In Proceedings of 24th IASLIC–SIG-2003. Dehradun: Survey of India. Retrieved 20.06.2018 from https://www.researchgate.net/publication/257298522_Database_Management_Concepts_and_Design
Nikola Ljubešić and Filip Klubička. 2014. {bs,hr,sr}WaC – Web corpora of Bosnian, Croatian and Serbian, In Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ EACL 2014, (eds.) F. Bildhauer & R. Schäfer, Association for Computational Linguistics, pages 29–35, Gothenburg, Sweden.
Max Silberztein. 2016. Formalizing Natural Languages: The NooJ Approach, Cognitive science series, Wiley-ISTE, London, UK.
Krešimir Šojat, Matea Srebačić, and Marko Tadić. 2012. Derivational and Semantic Relations of Croatian Verbs. Journal of Language Modelling, 00 (2012), 1: 111-142.
Krešimir Šojat, Matea Srebačić, and Vanja Štefanec. 2013. CroDeriV i morfološka raščlamba hrvatskoga glagola, Suvremena lingvistika. 39 (2013), 75: 75-96.
Harald Trost. 2003. Morphology. The Oxford Handbook of Computational Linguistics (ed. R. Mitkov), Oxford University Press: 25-47.
Kristina Vučković. 2009. Model parsera za hrvatski jezik. PhD dissertation, Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb.
Kristina Vučković, Marko Tadić, and Božo Bekavac. 2010. Croatian Language Resources for NooJ. CIT: Journal
of computing and information technology, 18(2010):295-301.
Kristina Vučković, Nives Mikelić-Preradović, and Zdravko Dovedan. 2010. Verb Valency Enhanced Croatian Lexicon. In T. Varadi, J. Kuti, M. Silberztein (eds.) Applications of Finite-State Language Processing, pages 52-60, Cambridge Scholars Publishing, Newcastle upon Tyne.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 38–46, Santa Fe, New Mexico, USA, August 20, 2018.
A Rule-Based System for Disambiguating French Locative Verbs and
Their Translation into Arabic
Safa Boudhina
University of Gabes [email protected]
Héla Fehri
University of Sfax [email protected]
Abstract
This paper presents a rule-based system for disambiguating French locative verbs and their trans-
lation into Arabic. The disambiguation phase is based on the use of the French Verb dictionary
of Dubois and Dubois Charlier (LVF) as a linguistic resource, from which a base of disambigu-
ation rules is extracted. The extracted rules take the form of transducers which are subsequently
applied to texts. The translation phase consists in translating the disambiguated locative verbs
returned by the disambiguation phase. The translation takes into account the verb tense, as well
as the inflected form of that verb. This phase is based on bilingual dictionaries that contain the
different French locative verbs and their translation into Arabic. The experimentation and the
evaluation are done using the linguistic platform NooJ, which is both a language resource development
environment and a tool for processing large corpora (Fehri, 2012).
1 Introduction
Lexical ambiguity represents an obstacle for the automatic processing of natural language. In fact, the
difficulty of automatically dealing with this phenomenon was recognized as early as the first appearance
of automatic translation systems (Apidaniaki, 2008). This ambiguity can occur at the level of different
grammatical categories of words, including the verb. In its conjugated form, the verb is distinguished
by tense, mood, person, number and the syntactic constructions in which this verb appears.
However, the task of solving the problem of ambiguity, known as Word Sense Disambiguation
(WSD), is not an easy one. In fact, it is essential to specify the elements that contribute to the selection
of the appropriate meaning of the ambiguous word. Gross et al. (1997) state that “any syntactic differ-
ence corresponds to an essential semantic difference.” This statement emphasizes the importance of
syntactic constructions in the choice of the meaning of the verb. In other words, the meaning of the verb
strongly depends on the syntactic construction in which it appears. Furthermore, to properly handle this
condition, the entire sentence containing the ambiguous verb should be analyzed, and this requires a
certain linguistic knowledge.
In this perspective, the dictionary of French Verbs of Jean Dubois and Françoise Dubois-Charlier
(LVF) constitutes a relevant and useful resource for the WSD of verbs. Indeed, it proposes a verbal
classification that relates the syntactic and semantic characteristics of the verb.
WSD may be seen as an intermediate task for some applications such as machine translation. How-
ever, to provide the suitable translation of an ambiguous word, WSD is indispensable in identifying the
most appropriate meaning of the word in question. In addition, the quality of the translation does not
depend solely on the preservation of the meaning of the translated word in the target language. The
consideration of the form of the verb and the tense in which that verb is conjugated is also very useful
and influences the quality of the translated word.
The main objective of the present work is to develop a rule-based system for disambiguating French
verbs, in particular locative verbs. This system also makes it possible to translate the disambiguated
verbs while taking into account the tense of the verb and its inflected form. Our system is based on the
LVF dictionary using the NooJ platform.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecom-
mons.org/licenses/by/4.0/
2 Related work
According to the resources used, WSD approaches can be roughly classified into two main categories.
The first one is the corpora-based approach, in which systems are trained to perform the task of word
sense disambiguation (Dixit et al., 2015). This type includes supervised and unsupervised methods. The
supervised methods are based on a learning corpus grouping examples of disambiguated instances of
words, while unsupervised methods exploit the results of automatic meaning-acquisition methods. The
second main category is the knowledge-based approach, which exploits knowledge resources to infer
the senses of words in context (Dixit et al., 2015).
Supervised WSD methods require annotated corpora to be trained (Nessereddine et al., 2015). However,
such corpora are only available for a few languages, which complicates supervised lexical disambiguation.
In this perspective, Nessereddine et al. (2015) recently proposed a method for creating a system
of lexical disambiguation for the French language. Since this language suffers from a scarcity of anno-
tated corpora, their method consists of constructing an annotated corpus by the transfer of annotations.
These annotations are obtained from the automatic translation of an English annotated corpus into
French. The Bayesian classifier is then used to build the disambiguation system. The precision obtained
is of the order of 52%. The limit of this method is that it strongly depends on the annotations obtained
by the translation phase. In addition, the quality of the disambiguation system is restricted by the exam-
ples used during the learning phase.
In general, the problem with supervised, corpora-based disambiguation approaches is that obtaining
large amounts of annotated text is very costly in terms of time and money, which results in
a data acquisition bottleneck (Tchechmedjiev, 2012). Moreover, the quality of the disambiguation of
these approaches is restricted by the examples used for training.
For unsupervised methods, the notion of meaning is directly induced by the corpus (Audibert, 2003).
The method proposed by Schütze (1998), which relies on the use of the vector model can be taken as an
example. In this method, a vector is associated with each word m of the corpus. This vector is derived
from words that co-occur in the context of m. Subsequently, according to their degree of similarity, these
vectors are grouped into clusters. Each cluster refers to a possible meaning of the word m.
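The idea behind Schütze's method can be sketched as follows. The corpus, stopword list, window size and similarity threshold below are toy choices made for illustration, not the original experimental settings, and the greedy grouping stands in for the clustering algorithm actually used.

```python
from collections import Counter
from math import sqrt

# Words too common to characterize a sense; a toy list for this sketch.
STOPWORDS = {"the", "a", "of", "and", "on", "at", "in", "he", "she", "they"}

def context_vector(tokens, i, window=4):
    """Bag-of-words vector of the words around position i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return Counter(t for j, t in enumerate(tokens[lo:hi], lo)
                   if j != i and t not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(occurrences, threshold=0.2):
    """Greedy clustering: join the first cluster whose seed vector is
    similar enough, otherwise open a new cluster (one per induced sense)."""
    clusters = []
    for item, vec in occurrences:
        for members in clusters:
            if cosine(vec, members[0][1]) >= threshold:
                members.append((item, vec))
                break
        else:
            clusters.append([(item, vec)])
    return clusters

sentences = [
    "he sat on the bank of the river watching the water",
    "she walked along the bank of the river near the water",
    "the bank approved the loan and the mortgage payment",
    "a loan officer at the bank reviewed the mortgage",
]
occurrences = []
for s in sentences:
    tokens = s.split()
    occurrences.append((s, context_vector(tokens, tokens.index("bank"))))

senses = cluster(occurrences)
print(len(senses))  # 2: a 'river' sense and a 'loan' sense of "bank"
```

Each resulting cluster plays the role of one induced meaning of the word m, without reference to any predefined sense inventory.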
The problem with this method, as well as unsupervised approaches in general, is that the senses do
not correspond to any well-defined set of meanings. The distinctions of meaning can sometimes be
confusing and are, moreover, often difficult for other applications to use (Audibert, 2003).
In knowledge-based approaches, the information needed for lexical disambiguation is derived from
external resources of the studied corpus. These resources are of different natures. Indeed, some studies
have tested the adequacy of the definitions provided by the Collins English Dictionary (CED), the
Merriam-Webster New Pocket Dictionary, and the Longman Dictionary of Contemporary English (LDOCE) for the automatic
processing of disambiguation. Other works have exploited the information provided by Roget's Interna-
tional Thesaurus or by WordNet semantic lexicons explaining the meanings and relations between senses
or shades of meaning (Vasilescu, 2003).
The disambiguation system established by Lesk is one of the first systems that relies on a dictionary.
The principle of this method is first of all to extract from the dictionary all the possible meanings for
each occurrence of the ambiguous word. Each sense corresponds to a particular definition in the dic-
tionary. Subsequently, a score is assigned for each of these senses. This score is equal to the number of
words in common between the definition of the word to be disambiguated and the definitions of the
words co-occurring in its context (Vasilescu, 2003). The sense selected as being the most appropriate
sense is supposed to be the one that maximizes the score. This method correctly disambiguates between
50% and 70% of the cases (Boubekeur-Amirouche, 2008).
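The scoring step described above can be sketched in a few lines, in the simplified variant that compares each sense definition directly with the words of the context; the toy dictionary entries are invented for illustration.

```python
# Simplified Lesk: pick the sense whose definition shares the most
# words with the ambiguous word's context.
def lesk(word, context, dictionary):
    context_words = set(context.lower().split())
    best_sense, best_score = None, -1
    for sense, definition in dictionary[word].items():
        # Score = number of words shared by the definition and the context.
        score = len(set(definition.lower().split()) & context_words)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy dictionary entries, invented for this sketch.
toy_dict = {
    "bank": {
        "bank_1": "sloping land beside a body of water such as a river",
        "bank_2": "financial institution that accepts deposits and lends money",
    }
}

sense = lesk("bank", "they fished from the bank of the river", toy_dict)
print(sense)  # bank_1
```

The sense maximizing the overlap score is selected, exactly the decision rule attributed to Lesk in the text.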
The method proposed by Brun et al. (2001) is based on an electronic dictionary. The dictionary used
in this method gives an example for each meaning. Each example is analyzed through a specific ana-
lyzer, from which rules of disambiguation are elaborated. These rules are then generalized using a se-
mantic network that assigns a semantic class to each existing unit in the rule of origin. Subsequently, to
disambiguate a word in any sentence, it is syntactically analyzed and the syntactic relationships are
derived from it (Brun et al., 2001). These relationships are then compared with the disambiguation rules.
The disambiguation rule whose arguments are the same is then selected and its meaning number is as-
sociated with the word in question (Brun et al., 2001).
This method was applied to 850 French sentences for the set of nouns, adjectives and verbs. The
precision rate obtained for verbs is of the order of 58%. The problem with this method is that the elab-
orated rules of disambiguation depend on the words that exist in the examples found in the dictionary.
This link has also generated a problem of coherence between the dictionary and the chosen semantic
network, due to the semantic classification proposed by the thesaurus. In addition, some lexical entries
haven’t been disambiguated because of the lack of examples for certain meanings from which the dis-
ambiguation rules are extracted.
This method has the disadvantage of being very sensitive to the words that exist in each definition.
Indeed, the choice of meanings based on a limited number of common words can be the source of errors.
3 Description of the French Verb Dictionary (LVF)
The LVF is a lexical database of French language, which contains 25,609 entries for 12,310 different
verbs. The verbs are classified into 14 classes, a classification based on the hypothesis of a correspondence
between syntax and meaning (Sénéchal et al., 2007). The class denoted L is the class of locative
verbs, the topic of this work. It contains 1,524 locative verbs (e.g., baigner "to bathe", apparaître "to
appear"). We note that our choice of this type of verbs was made arbitrarily.
4 The creation of electronic locative verbs dictionary with NooJ
The developed disambiguation system focuses on the use of the LVF dictionary. This dictionary has
been built to be used by linguists, and is not directly exploitable by automatic text analysis programs
(Silberztein, 2010). For this reason, we opt for the formalization of the LVF dictionary. More precisely,
we are interested in the formalization of the locative verbs described in the LVF. This formalization
consists in reformulating the information contained in LVF so that it can be exploited by automatic
analysis programs, notably NooJ (Silberztein, 2010). The electronic dictionary created in NooJ, called
LocativeVerbs.dic contains 1,524 verbal entries. Figure 1 presents an extract from this dictionary.
Figure 1. Extract of electronic dictionary LocativeVerbs.dic
As can be seen in Figure 1, each locative verbal entry is associated with a set of features: a lemma
that corresponds to the verb in the infinitive form; a label indicating the grammatical category (V); the
FLX property, which gives the inflectional model related to the verb; the CONST, which shows the syntactic-semantic
properties of the verb; the sens, which indicates the meaning of the verb; and finally the
translation into the target language (AR). It is worth noting that a verb can be associated with more than
10 senses. The verb baigner "to bathe" is used as an example. Since this verb has four different meanings
described in the LVF dictionary (baigner 01 = nager, baigner 02 = immerger, baigner 03 = tremper, baigner
04 = se tremper), it appears four times in the LocativeVerbs dictionary, as shown in Figure 1.
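The feature structure of such an entry can be illustrated with a small parser. The exact line format and the Arabic gloss below are assumptions based on the features just listed (FLX, CONST, sens, AR), not the actual NooJ dictionary syntax.

```python
# Hypothetical representation of one LocativeVerbs.dic entry:
# lemma, category, then +key=value feature pairs.
entry = "baigner,V+FLX=AIMER+CONST=A11+sens=nager+AR=سبح"

def parse_entry(line):
    """Split a dictionary line into its lemma and a feature map."""
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    features = {"category": parts[0]}          # first part is the category (V)
    for p in parts[1:]:
        key, _, value = p.partition("=")       # e.g. 'CONST=A11'
        features[key] = value
    return lemma, features

lemma, feats = parse_entry(entry)
print(lemma, feats["CONST"], feats["sens"])  # baigner A11 nager
```

A verb with four senses would simply yield four such records sharing the same lemma, which is how the disambiguation phase can later pick among them by CONST code.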
The NooJ implementation of our system involves a two-phase process: (1) disambiguation of French
locative verbs and (2) translation into Arabic. Each phase requires the construction of proper transducers.
5 Phase of disambiguation
Each verbal entry described in the LVF dictionary is specified by the different possible meanings that
correspond to it. These meanings are illustrated by different uses, semantic indicators, syntactic charac-
teristics, etc. The information related to the given meaning of the ambiguous verb, thus paves the way
to extract the necessary disambiguation rules. The extracted rules then make it possible to identify the
syntactic patterns that will be transformed into transducers to achieve the task of disambiguating verbs.
5.1 Identification of disambiguation rules
The French verbs described in the LVF are classified into semantic classes defined by syntactic argu-
ments, which make it possible to assign the appropriate meaning to each occurrence of the verb. That is,
each use of the verb is described by a particular syntactic construction, and then the meaning of the verb
is determined by the construction in which the verb appears. In fact, the verb can have four syntactic
constructions: intransitive (A), indirect transitive (N), direct transitive (T), and pronominal (P). Each
construction is composed of a set of components (subject, object, complement), which are also associ-
ated with semantic features, such as human, inanimate, animal, etc. For example, the word apparaître
“to appear” has two different constructions:
A11: Inanimate subject, Human object
A31: Human subject, Complement
To solve these semantic and syntactic ambiguities, we created formal grammars that take into consid-
eration the differences between syntactic constructions and the various types of components (human,
inanimate, animal) in each construction. The elaboration of such a grammar is not an easy task because
it is necessary to find a representation that considers recursion, as well as the length of the syntactic
construction. Since the syntactic construction is formed by other components, its length is not known in
advance. The proposed representation is illustrated by syntactic patterns.
5.2 Identification of syntactic patterns
In order to facilitate the identification of the transducers necessary for the disambiguation of verbs, we
transformed the definition of each syntactic construction into a syntactic pattern. A major advantage of
this method is that it makes it possible to arrange the various constituents of the construction in a linear
manner that can be easily transformed into graphs (Fehri, 2012). According to the classification pro-
posed by the LVF, we can construct 56 syntactic patterns that describe the locative verbs. A sample
pattern is shown below:
Pattern N1j:= <human subject><verb><Prepositional complement, PREP = « en, dans »>
This pattern describes the different components of the N1j construction. According to this construction,
we can distinguish the meaning of the verb when the latter appears in a sentence which has a human
subject and a prepositional complement introduced by the preposition (PREP) en “in, to” or dans “in,
inside”. We note that the identified patterns were subsequently refined.
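How such a pattern could be checked against an annotated sentence can be sketched as follows. The token representation and feature names (Hum, Chose) are an illustrative simplification of what the NooJ transducers actually operate on.

```python
# Sketch of pattern N1j:
#   <human subject> <verb> <prepositional complement with en/dans>
def matches_n1j(tokens):
    """Scan annotated tokens for a human noun, then a verb,
    then a complement introduced by 'en' or 'dans'."""
    state = 0
    for tok in tokens:
        if state == 0 and tok.get("cat") == "N" and "Hum" in tok.get("feats", ()):
            state = 1                       # human subject found
        elif state == 1 and tok.get("cat") == "V":
            state = 2                       # verb found
        elif state == 2 and tok.get("cat") == "PREP" and tok["form"] in ("en", "dans"):
            return True                     # licensed preposition found
    return False

# "L'homme habitait dans la maison" with simplified annotations.
sentence = [
    {"form": "l'", "cat": "DET"},
    {"form": "homme", "cat": "N", "feats": ("Hum",)},
    {"form": "habitait", "cat": "V"},
    {"form": "dans", "cat": "PREP"},
    {"form": "la", "cat": "DET"},
    {"form": "maison", "cat": "N", "feats": ("Chose",)},
]
print(matches_n1j(sentence))  # True
```

A sentence whose subject carries the feature Chose instead of Hum fails the first constraint, so the N1j reading (and the associated verb sense) is rejected.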
5.3 Creation of transducers for locative verbs
For a reliable disambiguation system of locative verbs, we transformed each syntactic pattern into a
transducer, which is manually built in NooJ. The meaning of the verb is related to the type of the con-
struction and its constituents. The created transducer is represented in Figure 2.
Figure 2. Main transducer of verb disambiguation
The transducer illustrated in Figure 2 contains four sub-graphs. Each sub-graph represents a category
of syntactic construction: intransitive (A), indirect transitive (N), pronominal (P), and direct
transitive (T).
6 Phase of Translation
Understanding a word sense is an inherent part of correct translation of words whose meanings depend
on the context (Turdakov, 2010). For this reason, the proposed method of translation takes as input the
disambiguated verbs returned by the phase of disambiguation. In addition, the quality of the translation
of the verb depends not only on the preservation of the meaning of the translated verb in the target
language, but also on the tense in which the verb is conjugated and on its inflected form, both of which
influence the quality of the result. In this context, we built a syntactic
grammar allowing the translation of each disambiguated locative verb by taking into account the tense
of the verb and its inflected form.
However, the creation of such a grammar is not an easy task because this study deals with two very
different languages: French, an Indo-European language, and Arabic, a Semitic language. We now ex-
amine some cases of translation of French locative verbs into Arabic. For example, unlike the French
verb, the Arabic verb is not based on tense but on aspect. In fact, the Arabic verbal system is essentially
represented by three aspects: accomplished, unaccomplished and imperative. Consider the following
dictionary entry and French sentence:
accueillir 02: meaning = recevoir (to welcome) + AR= استقبل istaqbala “welcomed”
La fille accueillait une amie dans sa maison.
The tense of the verb accueillir is the imperfect. In Arabic, the imperfect is rendered by كان يفعل
kāna yafʿalu "was doing", and the verb accueillait will therefore be translated by كانت تستقبل kānat
tastaqbilu "was welcoming."
Another difference between the two languages can be seen in the case of gender. In French, the verb
is conjugated with the same form for both the third person singular masculine subject pronoun il "he"
and the feminine pronoun elle "she": il écrit "he writes", elle écrit "she writes". In Arabic, however,
the conjugation of the verb differs for the two subject pronouns. Indeed, il écrit is translated into Arabic
as كتب kataba "he writes", while elle écrit is translated by كتبت katabat "she writes".
Finally, the Arabic language specifically notes when there are two of something, called المثنى al-
muṯaná “dual”. This form does not exist in French, which has only singular and plural. In Arabic, the
dual number has its own suffixes, which totally differ from those of the plural.
After this level of analysis, we move to the practice of automatic translation using the NooJ platform.
In this step, we integrate the locative verbs into a bilingual dictionary. We also create a formal grammar,
seen in Figure 3.
Figure 3. Transducer for the translation of conjugated verbs in imperfect
Figure 3 represents an extract of the transducer handling the translation of a verb conjugated
in the imperfect. The node between brackets, called Verbe, represents the form of the French
verb to translate. For example, <V+CONST+3+s+I> indicates that the verb is conjugated in the imperfect
(I) in the third person singular (3+s).
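The agreement logic such a transducer encodes can be sketched as a lookup. The tables below are a toy sketch covering only the accueillir 02 example discussed above; they are not the system's actual bilingual dictionary.

```python
# Arabic renders the French imparfait with kāna + an imperfective verb;
# both parts agree with the subject in person, number and gender.
KANA = {
    ("3", "s", "m"): "كان",    # kāna  (he was)
    ("3", "s", "f"): "كانت",   # kānat (she was)
}
IMPERFECTIVE = {
    "accueillir_02": {          # sense 02 = recevoir, AR = استقبل
        ("3", "s", "m"): "يستقبل",  # yastaqbilu
        ("3", "s", "f"): "تستقبل",  # tastaqbilu
    },
}

def translate_imperfect(sense, person, number, gender):
    """French imparfait -> kāna + imperfective, agreeing with the subject."""
    key = (person, number, gender)
    return KANA[key] + " " + IMPERFECTIVE[sense][key]

# "La fille accueillait une amie dans sa maison." (3rd sg. feminine subject)
print(translate_imperfect("accueillir_02", "3", "s", "f"))  # كانت تستقبل
```

The same mechanism would extend to the dual forms mentioned above by adding a "d" number value with its own suffix tables.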
7 Experimentation and evaluations
We tested the developed system with NooJ. As mentioned above, this platform uses already
built syntactic local grammars (Fehri et al., 2011). Table 1 lists the dictionaries that we
added to the resources in NooJ.
Dictionaries Number of entries Annotation in the dictionary
Human nouns 250 N+Hum
Inanimate nouns 305 N+Chose
Animal nouns 480 N+Anl
Table 1. Added dictionaries
In addition to the dictionaries mentioned in Table 1, other NooJ dictionaries -- adjectives, determiners
and verbs -- are used (Fehri et al., 2016).
7.1 Experimentation of disambiguation phase
To evaluate the disambiguation phase, we applied our resources to a corpus of journalistic articles pub-
lished in Le Monde Diplomatique newspaper. The various topics that are covered deal with current world
politics, economics, culture, sports, etc. The purpose of using a wide variety of subject areas is to have
a broad coverage of the complex structures of sentences that constitute the context of the verbs to be
disambiguated. This corpus contains 1,009 articles giving a total of 2,006,631 graphic words including
1,544 occurrences of locative verbs. These verbs are manually identified using NooJ queries.
Figure 4. Concordance table of locative verb disambiguation
Each line of the concordance table shown in Figure 4 displays the sequence identified by the disam-
biguation transducer. For example, in the first line of Figure 4, the verb is assigned the meaning of the
construction <N1j>. Consequently, the context of the verb baignaient (< baigner “to bathe”) takes on
the meaning of immerger “immerse” in this particular context. To measure the performance of our tool,
the following evaluation metrics are used: precision, recall and F-measure. The results obtained are
described in Table 2.
Nb. of occurrences (locative verbs) Precision Recall F-measure
Newspaper Le Monde diplomatique (1009 texts) 1544 0.75 0.85 0.79
Table 2. Obtained results
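The reported F-measure follows directly from the precision and recall in Table 2 (the table truncates the value to 0.79):

```python
# F-measure is the harmonic mean of precision and recall.
precision, recall = 0.75, 0.85
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 4))  # 0.7969
```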
The results shown in Table 2 reflect the existence of some issues that have not been resolved by our tool.
These problems lie in the inability to disambiguate the verbs that appear in sentences whose subject is
implicitly defined, so that it is not possible to identify its nature. Take the example of the following
phrase: most being located near the islands. In this example, the subject is represented by the noun most.
This noun is not followed by anything that specifies the subject, such as most people. As a result, we
cannot decide on the nature of the subject (human, animal or thing) and subsequently the verb to locate
cannot be disambiguated.
In addition, the LVF construction codes describe the semantic traits of the arguments (subject, object, etc.)
of each verbal use, as human, thing, animal, etc. For example, to distinguish the use shelter 3 in The embassy
shelters the refugees from shelter 2 used in The ambassador shelters refugees at home, one must take into
account that the noun embassy is a thing, while ambassador and refugee are human nouns. But such a purely
lexical characterization of the arguments of a verb does not allow researchers to analyze many sentences
that include semantic mechanisms, such as metaphor or metonymy, or to modify the function of nominal
groups. Thus, for example, the noun embassy, principally lexicalized as a thing, can play the role of a
human by metonymy: the phrase The embassy shelters the refugees must then be considered as ambigu-
ous, depending on whether embassy designates the building itself or the embassy staff.
7.2 Experimentation of translation phase
The translation phase is applied to the disambiguated locative verbs returned by the disambiguation
phase. Thus, as pointed out before, this verb translation from French into Arabic takes into account the
verbal tense used and the inflected form of the verb. The results obtained are illustrated in Figure 5.
Figure 5. Concordance table of locative verb translation
Figure 5 shows the Arabic equivalent assigned to the ambiguous verbs shown in Figure 4 of the pre-
vious section. As a result, we have translated the verb according to the context in which it appears. Take
the example of the verb baignaient (< baigner “to bathe”) on the first line of the concordance table.
This verb can have four different meanings, corresponding to four separate translations. Here, in this
particular context, the verb baigner “to bathe” has been disambiguated with the meaning immerger “im-
merse” which is then translated into Arabic by انغمر enghamara “immersed,” thus eliminating the other
three translations.
Our translation tool gives satisfactory results, with a precision of 87%, which is comparable to well-
known translators such as Google and Babylon that support the translation from French to Arabic.
French Locative Verbs | Obtained result (Our tool) | Obtained result (Google) | Obtained result (Babylon)
Les ouvriers baignaient dans une atmosphère | كانوا ينغمرون في جو | استحم العمال في جو | العمال baignaient في جو
Les autorités sanitaires avaient fondé leur décision sur des preuves | كنّ قد بنين قراراتها بشأن الأدلة | استندت السلطات الصحية الى قراراتها بشأن الأدلة | السلطات الصحية التي تقوم على قرار بشأن الأدلة
Son nom figurait dans des agendas | كان اسمه وارد في مذكرات | ورد اسمه في يوميات | —
Le bateau finira en Méditerrané | سينتهي القارب في البحر الأبيض المتوسط | فإن القارب ينتهي في البحر الأبيض المتوسط | ستنجز مركب بالفعل في Méditerrané
Table 3. Experimental results
As shown in Table 3, Google did not use the correct sense of the verb. Moreover, the tense and the
form of the verb have not been respected, as for the verb baigner, which appears in the sentence
Les ouvriers baignaient dans une atmosphère "the workers were immersed in an atmosphere." Our
tool gives as result كانوا ينغمرون kānū yanghamirūna "were immersed." This translation corresponds
to the meaning of the verb baigner in this particular context, and the translated verb respects the
tense and the form of the source verb. Google, however, translates the verb baignaient as استحم
estahama "has bathed," where neither the verbal tense nor the inflected form of the verb is respected.
Babylon does not even produce any translation of the verb baigner.
Note also that there are verbs that are not recognized by either Google or Babylon and therefore no
translation is provided for the verb appear in the sentence His name appeared in diaries. The results
obtained reflect the effectiveness of our translation tool. Indeed, a specific treatment for the locative
verbs allows us to understand the meaning of the verb according to the statements produced in the source
language and thus to propose a satisfactory translated version that takes into account the characteristics
of the verb.
8 Conclusion and future perspectives
In the present work, we have developed a rule-based system for disambiguating French locative verbs
and their translation into Arabic. The proposed disambiguation approach is based on the LVF dictionary.
We have shown the effectiveness of this dictionary as a relevant resource for the task of disambiguation.
As for the translation, it consists in translating the already disambiguated French locative verbs into
Arabic, taking into account the verbal tense used as well as the inflected form of the verb.
The experimentation and the evaluation were done using the linguistic platform NooJ, with satisfactory
results.
In the future, we intend to study the syntactic and semantic phenomena in more detail. We also plan
to generalize our phase of disambiguation to all the verbs. As for translation, we plan to take into account
the linguistic context of the verb in order to specify the most appropriate verbal tense with the syntax of
the Arabic sentence.
References
Marianna Apidaniaki. 2008. Acquisition automatique de sens pour la désambiguïsation et la sélection lexicale en
traduction. PhD thesis, University Paris Diderot (Paris 7).
Laurent Audibert. 2003. Outils d'exploration de corpus et désambiguïsation lexicale automatique. Autre [cs.OH].
University of Provence - Aix-Marseille I, French.
Fatiha Boubekeur-Amirouche. 2008. Contribution à la définition de modèles de recherche d'information flexibles
basés sur les CP-Nets. PhD thesis in computer science, University of Toulouse.
Caroline Brun. Bernard Jacquemin, and Frédérique Segond. 2001. Exploitation de dictionnaires électroniques pour
la désambiguïsation sémantique lexicale. Automatic Language Processing, ATALA, 42 (3): 677-691.
Vimal Dixit, Kamlesh Dutta and Pardeep Singh. 2015. Word Sense Disambiguation and Its Approaches. CPUH-
Research Journal: 2015, 1(2): 54-58.
Héla Fehri. 2012. Reconnaissance automatique des entités nommées arabes et leurs traductions vers le français.
PhD thesis in computer science, University of Sfax.
Héla Fehri, Kais Haddar, and Abdelmajid Ben Hamadou. 2011. Recognition and Translation of Arabic Named
Entities with NooJ Using a New Representation Model. Proceedings of the 9th International Workshop on Finite
State Methods and Natural Language Processing, Blois, France, pages 134–142.
Héla Fehri, Mohamed Zaidi, and Kamel Boudhina. 2016. Création d’un dictionnaire des verbes NooJ open source
pour la langue Arabe. Report of the end of studies project. Higher Institute of Management of Gabes.
Gaston Gross and André Clas. 1997. Synonymie, polysémie et classes d'objets. Meta 421 (1997.DOI:
10.7202/002977ar): 147–154.
Mohammad Nessereddine, Andon Tchechmedjiev, Hervé Blanchon, and Didier Schwab. 2015. Création rapide et
efficace d’un système de désambiguïsation lexicale pour une langue peu dotée. 22nd Automatic Processing of
Natural Languages, Caen. http://talnarchives.atala.org/TALN/TALN-2015/taln-2015-
long-008.pdf.
Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics: Special Issue on Word
Sense Disambiguation, 24 (1): 97–123.
Morgane Sénéchal and Dominique Willems. 2007. Classes verbales et régularités polysémiques : le cas des verbes
trivalenciels locatifs. French language (no. 153): 92-110.
Max Silberztein. 2010. La formalisation du dictionnaire LVF avec NooJ et ses applications pour l'analyse automa-
tique de corpus. Languages (No. 179-180), DOI 10.3917 / lang.179.0221: 221-241.
Andon Tchechmedjiev. 2012. État de l’art : mesures de similarité sémantique locales & algorithmes globaux pour
la désambiguïsation lexicale à base de connaissances. Proceedings of the joint JEP-TALN-RECITAL conference
volume 3: RECITAL, Grenoble, pages 295–308.
D.Yu Turdakov. 2010. Word Sense Disambiguation Methods. Programming and Computer Software. Vol. 36, No.
6: 309–326.
Florentina Vasilescu. 2003. Désambiguïsation de corpus monolingues par des approches de type Lesk. Master of
Science in Computer Science, University of Montréal.
46
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 47–56, Santa Fe, New Mexico, USA, August 20, 2018.
A Pedagogical Application of NooJ in Language Teaching:
the Adjective in Spanish and Italian
Andrea Rodrigo
Universidad Nacional de Rosario Entre Ríos 758
S2000CRN Rosario, Argentina [email protected]
Mario Monteleone
Università di Salerno Via Giovanni Paolo II, 132 84084 Fisciano SA, Italy
Silvia Reyes
Universidad Nacional de Rosario Entre Ríos 758
S2000CRN Rosario, Argentina [email protected]
Abstract
This paper relies on the work developed by the research team IES_UNR (Argentina) and presents a pedagogical application of NooJ for the teaching and learning of Spanish as a foreign language. As this proposal specifically addresses learners of Spanish whose mother tongue is Italian, it also entailed vital collaboration with Mario Monteleone from the University of Salerno, Italy. The adjective was chosen on account of its low frequency of occurrence in texts written in Spanish, and particularly in the Argentine Rioplatense variety, with the aim of developing strategies to increase its use. The features that the adjective shares with other grammatical categories render it extremely productive and provide elements that enrich the learner's proficiency. The reference corpus contains the front pages of the Argentinian newspaper Clarín related to an emblematic historical moment, whose starting point is March 24, 1976, when a military coup began, and covers a thirty-year period until March 24, 2006. The use of the linguistic resources created in NooJ for the automatic processing of texts written in Spanish accounts for the adjective in a relevant historical context for Argentina.
1 Introduction*
1.1 Research Subject
Our subject matter deals with the pedagogical application of NooJ in the teaching and learning of Spanish as a foreign language, and relies on the research project carried out by the IES_UNR group since 2015 (Rodrigo et al. 2018). An antecedent of this work can be found in Frigière and Fuentes (2015), which examines the pedagogical application of NooJ for learners of French as a foreign language. In this paper, we focus on a specific group: native speakers of Italian learning Spanish as a foreign language. Our interest lies in one main grammatical category: the adjective. Our work entails the following steps: (1) selection of a reference corpus in Spanish; (2) presentation of the adjective in relation to other categories and inserted in real texts; (3) use of NooJ for automatic text processing (Silberztein 2015); and (4) presentation of a pedagogical proposal specifically directed to learners of Spanish whose mother tongue is Italian (see Appendix B).
1.2 Reference corpus
Our approach to Spanish takes the Argentine Rioplatense variety1 and centres on journalistic texts produced over thirty years in Argentina. These written texts represent an important historical moment, since they are concerned with the last dictatorship and the democratic period immediately following it.
This work is licenced under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
1 Although our conclusions may also apply to other Spanish dialects, whether European or American varieties.
Our reference corpus, the Clarín_corpus,2 spans the years 1976 to 2006, and includes the same date each year, March 24, which was the day the Armed Forces took power by force in 1976 and overthrew the democratic Argentine government led by President María Estela Martínez de Perón. This repressive de facto regime, known as la Junta Militar "the Military Junta," seized power and ruled until December 1983, when a democratic government was again elected by the Argentinian people. The first front page published by Clarín when the coup d'état began is displayed in Figure 1.
Figure 1. Front page published on March 24, 1976.
The phrase Nuevo Gobierno “New Government” is a distant and neutral expression, whereas at the bottom right corner of the last front page of the corpus, which corresponds to March 24, 2006 (see Appendix A), the word noche “night,” in the expression A 30 años de la noche más larga “30 years after the longest night,” overtly condemns the beginning of the military regime.
The corpus comprises 2,898 words and consists exclusively of front pages of the daily newspaper Clarín, on the assumption that the front page is a focus of journalistic and ideological attention.
1.3 The adjective
At a first stage, the adjective is presented in relation to the noun, the verb and the adverb; then, the frequency of adjectives in the Clarín_corpus is detected, in contrast to other categories. At a second stage, our intention is to observe what types of adjectives appear there. For instance, if the term desaparecido "disappeared" − which from a historical viewpoint has so many connotations in Argentina − is used, we observe what prefixes and suffixes are allowed in derivations, and in what types of constructions it is included. It is important to recall that the word desaparecido is also listed in English dictionaries. The Oxford Dictionary of English3 defines it as "(especially in South America) a person who has disappeared, presumed killed by members of the armed services or the police." The Merriam-Webster's Dictionary4 offers the following definition: "an Argentine citizen who has been abducted and usually murdered by right-wing terrorists."
2 Clarín Corpus available at http://tapas.clarin.com/tapa.html#19780324
3 Oxford Dictionary of English available at https://en.oxforddictionaries.com
4 Merriam-Webster's Dictionary available at https://www.merriam-webster.com
1.4 Use of NooJ
We have been working with the linguistic resources created for the Spanish Module Argentina,5 specifically built by the IES_UNR research team to be used with NooJ,6 a computational tool designed by Max Silberztein. Our linguistic resources include dictionaries and grammars that are now in full development. In this paper we will particularly deal with the dictionary of adjectives and some specific grammars, leaving aside the dictionaries of other grammatical categories, which we use to recognise different types of entries such as nouns, verbs and adverbs.
Similarly, the linguistic resources created for the Italian Module7 provide an anchor for Italian learners of Spanish as a foreign language. Notice, however, that the Italian Module is not our main purpose, for it is only tangentially addressed as a starting point for those learners.
The choice of NooJ entails, above all, a methodological decision, which assumes that it is possible to create language teaching and learning situations by taking natural language processing into account. It is understood that, in a given environment and with a given input, learners formulate linguistic hypotheses that may be checked with a computational tool. The NooJ platform provides a favourable means to create this type of situation since: (1) it articulates different levels of analysis (orthography, morphology, syntax, etc.); (2) its connected dictionaries and grammars involve an integral knowledge of language; and (3) a single platform allows a comparison of Italian and Spanish using a common language.
We will try to formulate possible language teaching and learning strategies for learners of Spanish as a foreign language, whose mother tongue is Italian.
2 Development
2.1 Comparison between the adjective and other categories in Spanish
From a morphological viewpoint, Spanish nouns and adjectives share the same diminutive suffix -ito (e.g., mesa “table” - mesita “little table” and claro “clear”- clarito “very clear”). Many adjectives derive from nouns as in: cultura “culture” - cultural “cultural” and trabajo “work” - trabajador “worker”. From a syntactic viewpoint, an adjective may become the head of a noun phrase, as in the sentence el
militar abrió fuego “the military (man) opened fire,” where the adjective militar “military” functions as a noun.
From a morphological viewpoint, it is possible to create adjectives out of certain verbs (e.g., tonificar "to tone" - tonificante "toning" and desear "to desire" - deseable "desirable"). From a syntactic viewpoint, past participles and adjectives occur in common environments, as in: un traje combinado "a combined suit" and un traje azul "a blue suit."
From a morphological viewpoint, adjectives and adverbs admit the same diminutive and superlative suffixes -ito and -ísimo respectively (e.g., buenito “very good” - tempranito “very early” and buenísimo “great” - tempranísimo “very early”). Moreover, adverbs in -mente derive from adjectives as in: lento
“slow” - lentamente “slowly.” From a syntactic and semantic viewpoint, adjectival adverbs constitute an intersection of adjectives and adverbs, as in the sentence llovió fuerte “it rained hard,” where fuerte “hard” functions as an adjectival adverb modifying the verb llovió; whereas in the phrase un fuerte viento “a strong wind,” fuerte “strong” is simply an adjective modifying a noun.
2.2 Comparison between the adjective and other categories in Italian
In Italian, adjectives have a high percentage of categorical ambiguity with two other grammatical categories, specifically the noun and the past participle (Monteleone 2016). A typical example of this ambiguity is the word angosciato "anguished." In the sentence Giovanni è angosciato "Giovanni is anguished," angosciato is an adjective. In the sentence Paolo è un angosciato "Paolo is an anguished man," angosciato functions as a noun. However, in the sentence Paolo è angosciato da Giovanna "Paolo is anguished because of Giovanna," angosciato is a past participle.
5 Spanish Module (Argentina) available at http://www.nooj-association.org/index.php?option=com_k2&view=item&id=6:spanish-module-argentina&Itemid=611
6 NooJ software available at http://nooj-association.org
7 Italian Module available at http://www.nooj-association.org/index.php?option=com_k2&view=item&task=download&id=80_cf86247a0b995f68b04b2d484130c7f4&Itemid=611
However, through specific automatic parsing tools (grammars, automata and finite-state transducers in NooJ), it is possible to resolve this ambiguity and automatically tag syntactic sequences containing only adjectives. For reasons of space, those tools will not be discussed here, but will be developed in a future paper.
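As a toy illustration of how context-based disambiguation could work, the three example sentences above can be distinguished by their immediate context. The regular-expression rules below are hypothetical and far cruder than the NooJ grammars the authors allude to; they are a minimal sketch, not the authors' method.

```python
import re

def tag_angosciato(sentence):
    """Assign a category to 'angosciato' from its immediate context,
    using three hypothetical rules based on the paper's examples."""
    # "è un angosciato": predicate nominal, so a noun
    if re.search(r"\bè un\w* angosciato", sentence):
        return "NOUN"
    # "angosciato da X": agentive by-phrase, so a past participle
    if re.search(r"angosciato da\b", sentence):
        return "PAST_PARTICIPLE"
    # bare copular "è angosciato": an adjective
    if re.search(r"\bè angosciato", sentence):
        return "ADJECTIVE"
    return "UNKNOWN"

print(tag_angosciato("Giovanni è angosciato"))           # ADJECTIVE
print(tag_angosciato("Paolo è un angosciato"))           # NOUN
print(tag_angosciato("Paolo è angosciato da Giovanna"))  # PAST_PARTICIPLE
```

Note that rule order matters: the noun and past-participle contexts must be checked before the bare copular pattern, which would otherwise match all three sentences.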
3 Analysis of the Clarín_corpus
3.1 Category frequency
Within NooJ, we apply the command "Linguistic Analysis" in order to obtain the frequency of the categories contained in the corpus: nouns, verbs, adjectives and adverbs. Then, by applying "Locate" for each category in turn (<N>, <V>, <ADJ> and <ADV>), the following counts are obtained: Nouns: 1,200; Verbs: 658; Adjectives: 353; Adverbs: 137; Other categories: 550 (conjunctions, prepositions, pronouns).
The graph in Figure 2 shows that adjectives and adverbs are the least frequent of the five categories. This data provides a first conclusion: the low frequency of adjectives indicates that, given its low proportion in the text input, this category is a good candidate for reinforcement through the foreign language teacher's intervention.
Figure 2. Frequency of grammatical categories in the Clarín_corpus.
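As a quick sanity check (an illustrative sketch, not part of the authors' NooJ workflow), the reported counts can be tabulated directly; they sum exactly to the 2,898 tokens of the corpus:

```python
# Category counts reported for the Clarín_corpus (Section 3.1).
counts = {
    "nouns": 1200,
    "verbs": 658,
    "adjectives": 353,
    "adverbs": 137,
    "other": 550,  # conjunctions, prepositions, pronouns
}

total = sum(counts.values())
print(total)  # 2898, the size of the corpus

# Relative frequency of each category, most frequent first
for cat, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{cat:10s} {n:5d}  {100 * n / total:5.1f}%")
```

Adjectives account for roughly 12% of the tokens, and adverbs under 5%, which is the low proportion the pedagogical proposal sets out to reinforce.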
3.2 Adjectives
As frequency is our working criterion, the adjectives with the highest frequency in the corpus are the starting point. We then specify which categories precede these adjectives, which morphological suffixes and prefixes they admit, and which adjectival structures are most recurrent.
In Figure 3, after applying the commands "Locate" and "Statistical Analysis," it is possible to recognise the twenty most frequent adjectives in the corpus: militar "military," político "political," política "political," desaparecidos "disappeared," general "general," muertos "dead," derechos "right," Militar "Military," Argentina "Argentine," listas "ready," heridos "injured," humanos "human," Armadas "Armed," solo "alone," fuerte "strong," primer "first," exclusivo "exclusive," argentino "Argentine," central "central" and especial "special."
Some of the adjectives in the above list are selected in order to specify their inflectional and derivational paradigms, their context of occurrence and their syntactic structure.
Figure 3. Frequency of adjectives in the Clarín_corpus.
3.3 Morphological Inflection and Derivation
Next, the inflectional paradigms of some of the most frequent adjectives are shown, as they are recorded in the inflectional grammars (Rodrigo et al. 2018) and in the dictionary of adjectives created in NooJ for the Spanish Module Argentina.
The adjective militar "military" inflects like ÁGIL: militar, militares. The following words are part of its derivational paradigm as recorded in the dictionary of adjectives: antimilitar "anti-military," antimilitarista "anti-militarist," paramilitar "paramilitary," desmilitarizado "demilitarised," militarizado "militarised."
The adjective político "political" belongs to the paradigm FLACO: político, política, políticos, políticas. The words apolítico "apolitical," geopolítico "geopolitical" and impolítico "non-political" are part of its derivational paradigm in the dictionary of adjectives.
The adjective desaparecido "disappeared" also follows the paradigm FLACO: desaparecido, desaparecida, desaparecidos, desaparecidas. It is important to note that there are no words derived from desaparecido.
The adjective general "general" inflects like ÁGIL: general, generales. The words generalista "generalist," generalizable "generalisable" and generalizado "generalised" are part of its derivational paradigm as recorded in the dictionary of adjectives.
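The paradigms above can be emulated with NooJ's two basic inflectional operators, <E> (keep the lemma unchanged) and <B> (delete its final character). The sketch below is an illustrative Python re-implementation, not NooJ's engine; the FLACO operator string follows the lexical grammar quoted in Appendix B, while the ÁGIL pattern is assumed from the forms listed here.

```python
def apply_ops(lemma, ops):
    """Apply a NooJ-style operator string to a lemma:
    '<E>' keeps the form, '<B>' deletes its last character,
    and any other character is appended as a literal suffix."""
    form = lemma
    i = 0
    while i < len(ops):
        if ops.startswith("<E>", i):
            i += 3
        elif ops.startswith("<B>", i):
            form = form[:-1]
            i += 3
        else:
            form += ops[i]
            i += 1
    return form

# FLACO = <E>/masc+sg | s/masc+pl | <B>a/fem+sg | <B>as/fem+pl
FLACO = {"masc+sg": "<E>", "masc+pl": "s", "fem+sg": "<B>a", "fem+pl": "<B>as"}
# AGIL inflects only for number (militar, militares) -- assumed pattern
AGIL = {"sg": "<E>", "pl": "es"}

def inflect(lemma, paradigm):
    return {feats: apply_ops(lemma, ops) for feats, ops in paradigm.items()}

print(inflect("desaparecido", FLACO))
# {'masc+sg': 'desaparecido', 'masc+pl': 'desaparecidos',
#  'fem+sg': 'desaparecida', 'fem+pl': 'desaparecidas'}
print(inflect("militar", AGIL))  # {'sg': 'militar', 'pl': 'militares'}
```

The <B> operator is what lets one paradigm cover both -o/-a alternating adjectives (FLACO) and invariable-gender ones (ÁGIL) without listing every surface form in the dictionary.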
3.4 Context of occurrence
The context of occurrence allows us to know which categories can accompany the adjective. The adjective militar "military" is preceded by nouns, as in the phrases reacción militar "military reaction," autocrítica militar "military self-criticism," caudillo militar "military leader," uniforme militar "military uniform," golpe militar "military coup," colegio militar "military college."
The adjectives político, política "political" are preceded by nouns, as in: sector político "political sector," Reforma política "political Reform," favoritismo político "political favouritism," enemigo político "political enemy" and terror político "political terror." However, they may also be preceded by determiners: su política "his/her/their/your politics," la política "(the) politics."
The adjective desaparecidos “disappeared” is preceded by verbs (e.g., son desaparecidos “(they) are disappeared”), by the construction adjective+preposition (e.g., miles de desaparecidos “thousands of disappeared”), by numeral adjectives (e.g., catorce desaparecidos “fourteen disappeared”), and by nouns (e.g., alumnos desaparecidos “disappeared students”).
3.5 Syntactic structure
We analyse two expressions containing adjectives selected from the Clarín_corpus: Hubo miles de desaparecidos "There were thousands of disappeared" and último golpe militar "last military coup." The first example is an impersonal sentence and the second is a noun phrase.
Our analysis is based on the concepts developed by Bès (1999), and our approach is related to context-sensitive grammars. Our NooJ grammars take into account two types of constructions: nucleus phrases, such as the nucleus verb phrase (SVN), the nucleus adjective phrase (SADJN), the nucleus preposition phrase (SPN) and the nucleus noun phrase (SNN), and larger phrases, such as the noun phrase (SINNOM), which contain nucleus phrases. These expressions will be analysed with the different syntactic grammars available in the Spanish Module Argentina.
In Spanish, the expression Hubo miles de desaparecidos "There were thousands of disappeared" is an impersonal sentence, i.e., one without a subject. In the above example, we distinguish a nucleus adjective phrase (SADJN), miles "thousands," and a nucleus preposition phrase (SPN), de desaparecidos "of disappeared," both inside the noun phrase (SINNOM). On the other hand, the verb hubo "were," the past indicative of haber "to be," constitutes a nucleus verb phrase (SVN). These two structures make up the impersonal sentence, as can be seen in Figure 4.
Figure 4. Analysis of Hubo miles de desaparecidos.
The corresponding main grammar for impersonal sentences is displayed in Figure 5.
Figure 5. Grammar for impersonal sentences.
As regards the second expression, the noun phrase último golpe militar "last military coup," instead of analysing it with the "Linguistic Analysis" command, we use the NooJ "Show Debug" feature. By using "Show Debug," it is possible to check whether the phrase entered can be considered acceptable when the noun phrase grammar is applied. As can be seen in Figure 6, the result is a Perfect Match.
Figure 6. Noun Phrase Grammar.
“Show Debug” checks to see if the grammar designed for the noun phrase identifies that último “last” and militar “military” are two nucleus adjective phrases (SADJN), inside a larger unit, the noun phrase (SINNOM). The intermediate unit is the nucleus noun phrase (SNN), último golpe “last coup.”
The analyses of both expressions, Hubo miles de desaparecidos "There were thousands of disappeared" and último golpe militar "last military coup," show the adjective in two different syntactic constructions. In the first example, the adjective miles "thousands" is the nucleus of the noun phrase (SINNOM) miles de desaparecidos "thousands of disappeared" and is part of an impersonal sentence; in the second example, two adjectives occur in different positions: último "last" is a nucleus adjective phrase (SADJN) inside the nucleus noun phrase (SNN) último golpe "last coup," and, at the same time, último "last" is part of the noun phrase (SINNOM) último golpe militar "last military coup," while militar "military" is a nucleus adjective phrase (SADJN) inside the noun phrase (SINNOM), peripheral to the nucleus noun phrase (SNN).
The syntactic structure shows how the adjective can integrate noun phrases and even become the nucleus of a noun phrase.
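The two analyses can be summarized as nested structures. The tuple encoding and `leaves` helper below are illustrative assumptions: the node labels (SINNOM, SNN, SADJN, SPN, SVN) follow the paper, but this is not NooJ's grammar format.

```python
def leaves(node):
    """Collect the terminal words of a nested (label, children...) tuple."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [w for child in children for w in leaves(child)]

# último golpe militar: a SADJN and an N inside the SNN,
# plus a peripheral SADJN directly under the SINNOM
noun_phrase = ("SINNOM",
               ("SNN", ("SADJN", "último"), ("N", "golpe")),
               ("SADJN", "militar"))

# Hubo miles de desaparecidos: a SVN plus a SINNOM
# containing the SADJN nucleus and a SPN
impersonal = ("S",
              ("SVN", "hubo"),
              ("SINNOM", ("SADJN", "miles"), ("SPN", "de desaparecidos")))

print(" ".join(leaves(noun_phrase)))  # último golpe militar
print(" ".join(leaves(impersonal)))   # hubo miles de desaparecidos
```

Reading the leaves back in order reconstructs each surface expression, which makes the nesting of nucleus phrases inside larger phrases easy to inspect.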
4 Conclusions
In this paper, we have proposed a pedagogical application of the NooJ approach to a specific situation by choosing a significant corpus (the Clarín_corpus), which can provide an idea of the use of the Spanish language in a particular historical and cultural context. Our intention has been to identify adjectives and to do a global reading of them within the corpus. Our focus was mainly placed on the adjective desaparecido "disappeared," because of its ideological importance. As we have already stated in the Introduction, our subject matter deals with the pedagogical application of NooJ in the teaching and learning of Spanish as a foreign language, and we have concentrated on a specific group: native speakers of Italian learning Spanish as a foreign language. Our strategy has been to make explicit some contact points between Spanish and Italian by using NooJ as a common platform. Our purpose was not to compare the Italian and Spanish resources, but to highlight that both languages can be formalised using a common platform, since the structure of dictionaries and grammars is the same. This provides unity and coherence to our pedagogical proposal, which is the goal of the research project carried out by the IES_UNR team, since both languages can be analysed with NooJ (Appendix B).
References
Gabriel G. Bès. 1999. La phrase verbale noyau en français. Recherches sur le français parlé, volume 15: 273–358. Université de Provence.
Andrea Rodrigo, Silvia Reyes and Rodolfo Bonino. 2018. Some Aspects Concerning the Automatic Treatment of Adjectives and Adverbs in Spanish: A Pedagogical Application of the NooJ Platform. In: Mbarki S., Mourchid M., Silberztein M. (eds.) Formalizing Natural Languages with NooJ and Its Natural Language Processing Applications. NooJ 2017. Communications in Computer and Information Science, volume 811: 130–140. Springer, Cham. https://doi.org/10.1007/978-3-319-73420-0_11.
Julia Frigière and Sandrine Fuentes. 2015. Pedagogical Use of NooJ dealing with French as a Foreign Language. In J. Monti, M. Silberztein, M. Monteleone, and M. P. di Buono (eds.) Formalizing Natural Languages with NooJ, Cambridge Scholars Publishing, London, 186-197.
Mario Monteleone. 2016. Local Grammars and Formal Semantics: Past Participles vs. Adjectives in Italian. In T. Okrut, Y. Hetsevich, M. Silberztein, and H. Stanislavenka (eds.) Automatic Processing of Natural-Language Electronic Texts with NooJ. Communications in Computer and Information Science, volume 607: 83–95. Springer, Cham.
Max Silberztein. 2015. La formalisation des langues : l'approche de NooJ. ISTE Editions, London.
Appendices
A. Last front page of the Clarín_corpus.
The last front page that was published by Clarín (2006) is displayed in Figure 7.
Figure 7. Last front page published on March 24, 2006.
B. A pedagogical proposal.
Our work will be carried out in parallel to the NooJ Italian module, since the starting point is the same platform and our proposal includes the activities detailed below.
• Open the text in the Italian Module: Ringraziamenti (Dan Brown, The Da Vinci Code, 2003)8.
• Apply Locate <A> in order to extract all adjectives, as stated in the properties_definition file (Figure 8):
Figure 8. Locate <A> in the Italian text.
• Apply Locate and search a string of characters for the adjective buon "good." Note how concordance is done. Figure 9 presents a sample of some of the sequences.
Figure 9. Locate buon “good” in the Italian text.
8 Ringraziamenti (Dan Brown, The Da Vinci Code, 2003) available at https://archive.org/stream/IlCodiceDaVinciDanBrown/Il%20codice%20Da%20Vinci%20-%20Dan%20Brown_djvu.txt
• Identify the inflectional features of some of the nouns following the previous adjective: e.g., il buon nome “the good name,” un buon amico “a good friend,” etc.
• Return to the Spanish Module Argentina, and perform Linguistic Analysis on the following sequence: un hombre bueno "a good man" (Figure 10).
Figure 10. Linguistic Analysis of an adjectival sequence.
• Detect concordance features in the sequence: masculine (masc) and singular (sg).
• State which variations this adjective may adopt, since according to the corresponding lexical grammar, bueno “good” inflects like FLACO (FLACO = <E>/masc+sg | s/masc+pl | <B> a/fem+sg | <B> as/fem+pl;).
• Look up, in the dictionary of adjectives, another adjective inflecting like desaparecido and complete the following expressions:
Un hombre… "A man…"
Unos hombres… "Some men…"
Mi libro… "My book…"
La señora… "The lady…"
• Check through Linguistic Analysis how it is possible, by virtue of concordance, to build noun phrases.
• Do an online search to complete the analysis, with its historical implications, of the term desaparecido in expressions such as desaparecidos Argentina and disappeared Argentina9.
9 The searches 'desaparecidos Argentina' and 'disappeared Argentina' are available at https://es.wikipedia.org/wiki/Desaparecidos_durante_el_terrorismo_de_Estado_en_Argentina and https://en.wikipedia.org/wiki/Dirty_War.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 57–64, Santa Fe, New Mexico, USA, August 20, 2018.
STYLUS: A Resource for Systematically Derived Language Usage
Bonnie Dorr
Institute for Human and Machine Cognition
15 SE Osceola Ave, Ocala, FL 34471
Clare Voss
U.S. Army Research Laboratory
Adelphi, MD 20783
Abstract
Starting from an existing lexical-conceptual structure (LCS) Verb Database of 500 verb classes (containing a total of 9525 verb entries), we automatically derived a resource that supports argument identification for language understanding and argument realization for language generation. The extended resource, called STYLUS (SysTematicallY Derived Language USage), supports constraints at the syntax-semantics interface through the inclusion of components of meaning and collocations. We show that the resulting resource covers three cases of language usage patterns, both for spatially oriented applications such as dialogue management for robot navigation and for non-spatial applications such as generation of cyber-related notifications.
1 Introduction*
This paper presents a derivative resource, called STYLUS (SysTematicallY Derived Language USage), produced through extraction of a set of argument realizations from lexical-semantic representations for a range of different verb classes (Appendix A). Prior work (Jackendoff, 1996; Levin, 1993; Olsen, 1994; Kipper et al., 2008; Palmer et al., 2017) has suggested a close relation between underlying lexical-semantic structures of predicates and their syntactic argument structure. Subsequent work (Dorr and Voss, 2018) argued that regular patterns of language usage can be systematically derived from lexical-semantic representations and used in applications such as dialogue management for robot navigation. The latter investigation focused on the spatial dimension, e.g., motion and direction.
We adopt the view that this systematicity also holds for verbs in the non-spatial dimension, including those that have been metaphorically related to the spatial dimension by a range of corpus-based techniques (Dorr and Olsen, 2018). We consider one example of a non-spatial application: generation of cyber-related textual notifications. We argue that this application requires knowledge at the syntax-semantics interface that is analogous to spatial knowledge for robot navigation. A recent survey of narrative generation techniques (Kybartas and Bidarra, 2017) highlights several important components of narrative generation (including a story, plot, space, and discourse for telling the story), leaving open the means for surface realization of arguments from an underlying lexical-semantic structure. STYLUS is designed to accommodate mechanisms for filling this gap.
STYLUS extends an existing Lexical-Conceptual Structure Verb Database (LVD) (Dorr et al., 2001), which includes 500 verb classes (9525 verb entries), with structurally specified information about the realization of arguments for both spatial and non-spatial verbs. STYLUS was produced through systematic derivation of regular patterns of language usage (Block, Overlap, Fill) without requiring manual annotation. The next section reviews related work, starting with the spatial underpinnings of the original LVD. Following this, we describe our extensions and present examples for two applications: natural language processing for robot navigation and generation of cyber-related notifications.
2 Background
2.1 Spatial Language Understanding
Spatial language understanding has made great strides in recent years, with the emergence of language resources and standards for capturing spatial information, e.g., (ISO-24617-7, 2014), which provide guidelines for annotating spatial information in English language texts (Pustejovsky and Lee, 2017; Pustejovsky and Yocum, 2014). This work differs from the perspective adopted for STYLUS in that it provides annotation guidelines for training systems for spatial information extraction, and so does not focus on generalized mappings at the syntax-semantics interface.
* This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
The Semantic Annotation Framework (semAF) identifies places, paths, spatial entities, and spatial relations that can be used to associate sequences of processes and events in news articles (Pustejovsky et al., 2011). Prepositions and particles (near, off) and verbs of position and movement (lean, swim) have corresponding components of meaning and collocations, as adopted in this paper.
Spatial role labeling using holistic spatial semantics (i.e., analysis at the level of the full utterance) has been used for identifying spatial relations between objects (Kordjamshidi et al., 2011). The association between thematic roles and their corresponding surface realizations has been investigated previously, including in the LCS formalism (described next), but Kordjamshidi et al.'s approach also ties into deeper notions such as region of space and frame of reference.
2.2 Lexical-Conceptual Structure Verb Database (LVD)
Lexical Conceptual Structure (LCS) (Jackendoff, 1983; Jackendoff, 1990; Dorr, 1993; Dowty, 1979; Guerssel et al., 1985) has been used for wide-ranging applications, including interlingual machine translation (Voss and Dorr, 1995; Habash and Dorr, 2002), lexical acquisition (Habash et al., 2006), cross-language information retrieval (Levow et al., 2000), language generation (Traum and Habash, 2000), and intelligent language tutoring (Dorr, 1997).
LCS incorporates primitives whose combination captures syntactic generalities, i.e., actions and entities must be systematically related to a syntactic structure: GO, STAY, BE, GO-EXT, ORIENT, and also an ACT primitive developed by Dorr and Olsen (1997). LCS is grounded in the spatial domain and is naturally extended to non-spatial domains, as specified by fields. For example, the spatial dimension of the LCS representation corresponds to the (Loc)ational field, which underlies the meaning of John traveled from Chicago to Boston in the LCS [John GOLoc [From Chicago] [To Boston]]. This is straightforwardly extended to the (Temp)oral field to represent analogous meanings such as The meeting went from 7pm to 9pm in the LCS [Meeting GOTemp [From 7pm] [To 9pm]].
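To make the field mechanism concrete, the following is a minimal Python sketch (our own illustration, not part of the LVD, which uses a richer Lisp-style notation): the same GO primitive is instantiated once in the Loc field and once in the Temp field.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical, simplified encoding of an LCS node for illustration only.
@dataclass
class LCS:
    primitive: str                     # GO, STAY, BE, GO-EXT, ORIENT, ACT
    sem_field: str                     # "Loc", "Temp", "Poss", ...
    theme: str
    args: List[Tuple[str, str]] = field(default_factory=list)  # (preposition, ground)

    def __str__(self):
        parts = " ".join(f"[{p.capitalize()} {g}]" for p, g in self.args)
        return f"[{self.theme} {self.primitive}{self.sem_field} {parts}]"

# The same GO primitive, instantiated in two different fields:
spatial  = LCS("GO", "Loc",  "John",    [("from", "Chicago"), ("to", "Boston")])
temporal = LCS("GO", "Temp", "Meeting", [("from", "7pm"), ("to", "9pm")])

print(spatial)   # [John GOLoc [From Chicago] [To Boston]]
print(temporal)  # [Meeting GOTemp [From 7pm] [To 9pm]]
```

The point of the sketch is that only the field label changes between the two readings; the primitive and argument structure are shared.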
The LVD developed in prior work (Dorr et al., 2001) includes a set of LCS templates classified according to an extension of Levin (1993)'s 192 classes to a total of 500 classes covering 9525 verb entries (an additional 5500+ verb entries beyond the original 4000+ verb entries). The first 44 classes were added beyond the original set of semantic classes (Dorr and Jones, 1996). The remaining classes were derived through aspectual distinctions to yield a set of LCS classes that were finer-grained than the original Levin classes (Olsen et al., 1997). Each LCS class consists of a set of verbs and, in several cases, the classes included non-Levin words (those not in Levin (1993)), derived semi-automatically (Dorr, 1997).
The original LVD provides a mapping of lexical-semantic structures to their surface realization. This mapping serves as a foundation for the enrichments that yield STYLUS. The new resource benefits from decades of prior study that led to the LVD. Specifically, Levin's classes are based on significant corpus analysis and have been validated in numerous within-language studies (Levin and Rappaport Hovav, 1995; Rappaport Hovav and Levin, 1998) and cross-language studies (Guerssel et al., 1985; Levin, 2015). Thus, STYLUS is expected to have an important downstream impact, in both depth and breadth, for future linguistic investigations and computational applications.
2.3 Syntax-Semantics Interface
Prior work (Jackendoff, 1996; Levin, 1993; Dorr and Voss, 1993; Voss and Dorr, 1995; Kipper et al., 2004; Kipper et al., 2008; Palmer et al., 2017) suggests that there is a close relation between underlying lexical-semantic structures of verbs and nominal predicates and their syntactic argument structure. VerbNet (Kipper et al., 2004) reinforces the view taken in this resource paper that prepositions and their relation to verb classes serve as significant predictors of semantic content, but it does not leverage an inner structure of events for compositional derivation of argument realizations.
FrameNet also sits at the syntax-semantics interface (Fillmore, 2002), with linking generalizations based on valency to map semantic frames (events and participants) to their corresponding surface structure. Osswald and Van Valin (2014) point out that such generalizations are hindered by bottom-up/data-driven frames, and they argue for a richer frame representation with an inner structure of an event. This notion of "inner structure" is seen in the work of Voss et al. (1998) and Dorr and Voss (2018), which suggests that the generation of a preposition (in English) is dependent on both the internal semantics of the predicate and structural idiosyncrasies at the syntax-semantics interface. As such, the following two definitions are fundamental to the syntax-semantics mappings adopted in STYLUS1:
• Component of meaning: Implicit semantic unit, such as UPWARD for the verb elevate
• Collocation: Explicit juxtaposition of a particular word, such as up for the verb lift in “lift up”
Leveraging these definitions, Section 3 describes the enrichments necessary to produce STYLUS
without requiring training on annotated data.
3 Addition of Components of Meaning and Collocations to LVD Classes
We investigate the systematic derivation of language usage patterns for both understanding and generation of language and leverage this investigation to simplify and enrich the LVD presented in Section 2. The resulting resource, STYLUS (downloadable as described in Appendix A), relies on lexically implicit components of meaning and lexically explicit collocations (as defined above) to cover three cases of language usage patterns applicable to language understanding and generation:
• Block refers to components of meaning that do not co-occur with their collocational counterparts, e.g., elevate and ascend include the UPWARD component and thus do not co-occur with the collocation up;
• Overlap refers to components of meaning that optionally co-occur with their collocational
counterparts, e.g., lift and raise include the UPWARD component but optionally co-occur with
the collocation up;
• Fill refers to underspecified components of meaning that fall into one of two cases:
- Oblig: obligatory co-occurrence with collocations, e.g., the verb put does not specify a direction and thus always co-occurs with a collocation such as up;
- Opt: optional co-occurrence with collocations, e.g., the motion verb move does not specify a direction but optionally co-occurs with a collocation such as up.
STYLUS contains simplified verb classes from the LVD, omitting the full LCS structures and thematic roles while adding prepositional collocations and components of meaning. For example, in the LVD Verbs of inherently directed motion (Class 51.1.a in (Levin, 1993)), the verb leave need not co-occur with a prepositional collocate (e.g., leave the room) whereas the verb depart can co-occur with from (e.g., departed from the room). In either case, the component of meaning is uniformly "GO TO (outside the room) FROM (inside of the room)," and the collocation from is optional:
LVD Class Entry:
(:NUMBER "51.1.a"
:NAME "Verbs Inherently Directed Motion / -from/to"
:WORDS (advance arrive ascend climb come depart descend enter escape
exit fall flee go leave plunge recede return rise tumble)
:NON_LEVIN_WORDS (approach come! defect head)
:LCS (go loc (* thing 2)
((* from 3) loc (thing 2) (at loc (thing 2) (thing 4)))
((* to 5) loc (thing 2) ([at] loc (thing 2) (thing 6)))
(!!+ingly 26))
:THETA_ROLES ((1 "_th,src(from),goal(towards)")
(1 "_th,goal,src(from)")
(1 "_th,src(from),goal(to)")))
1 Teletype font is used for components of meaning such as UPWARD. Several examples throughout this paper were purposely selected to illustrate the full range of syntactic realizations for the concept of "upwardness." Other verbs and collocations could easily have been selected (e.g., lower with the collocation down), but a varied selection of lexical distinctions would confound the illustration of more general distinctions at the syntax-semantics interface.

Derivation of components of meaning and collocations, along with their obligatoriness or optionality, was achieved by a simple automated procedure, without manual annotation. Components of meaning were derived either from the LCS structure or from verb-preposition pairs in a "Categorial Variation" database (Habash and Dorr, 2003). For example, the LCS above (from the :LCS slot) includes a sublexical component of meaning, ((* from 3) loc (thing 2) (at loc (thing 2) (thing 4))), that maps optionally to a preposition from in the surface form (as in exit (from) the room). Prepositional collocations (such as from) were derived from the thematic roles (in the :THETA_ROLES slot) of the original LVD. These prepositions are specified in parentheses, and are preceded either by a comma (,) for optional collocations or by an underscore (_) for obligatory collocations. In the example above, the theme is obligatory (_th), whereas the source (,src) and goal (,goal) are optional. Verbs are thus paired with their corresponding prepositions by virtue of their class membership in the final STYLUS resource2:
STYLUS Class Entry:
(:NUMBER "51.1.a"
:NAME "Verbs Inherently Directed Motion / -from/to"
:WORDS (advance arrive ascend climb come depart descend enter escape
exit fall flee go leave plunge recede return rise tumble)
:NON_LEVIN_WORDS (approach come! defect head)
:COLLOCATIONS ("from" "to" "towards")
:COMPONENT_OF_MEANING (FROM, AWAY_FROM, OUT_OF, OR, UP_TO, UP,
BEFORE, INTO, TO)) ;; Fill-Opt <<SPATIAL>>
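The comma/underscore convention on theta roles described above can be decoded with a short sketch. This is our own illustration of how strings such as "_th,src(from),goal(towards)" might be parsed; the actual grammar of the LVD :THETA_ROLES slot is richer than this toy regular expression assumes.

```python
import re

# Each role is prefixed by "_" (obligatory) or "," (optional), with an
# optional preposition in parentheses, e.g. "_th,src(from),goal(towards)".
ROLE = re.compile(r"([_,])(\w+)(?:\(([^)]*)\))?")

def parse_theta_roles(spec):
    """Return (role, obligatory?, preposition-or-None) triples."""
    return [(role, marker == "_", prep or None)
            for marker, role, prep in ROLE.findall(spec)]

print(parse_theta_roles("_th,src(from),goal(towards)"))
# [('th', True, None), ('src', False, 'from'), ('goal', False, 'towards')]
```

Applied to the class entry above, the decoder recovers the obligatory theme and the optional source and goal collocations.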
The derivative collocations and components of meaning obviate the need for spelling out the full LCS, thus ensuring a more compact representation of the original LVD. In addition, the Block, Overlap, Fill-Oblig, and Fill-Opt designations were easily derived automatically from the :COLLOCATIONS and :COMPONENT_OF_MEANING slots according to the four rules below:
(1) Tag Block if :COMPONENT_OF_MEANING filled and :COLLOCATIONS not filled, e.g., Face the doorway;
(2) Tag Overlap if :COMPONENT_OF_MEANING and :COLLOCATIONS both filled, and use of comma (,) in front of the corresponding thematic role, e.g., ",src(behind)" for Follow (behind) the car;
(3) Tag Fill-Oblig if :COMPONENT_OF_MEANING and :COLLOCATIONS both filled, and use of underscore (_) in front of the corresponding thematic role, e.g., "_goal(to)" for Put the box onto the table;
(4) Tag Fill-Opt if :COMPONENT_OF_MEANING not filled and either :COLLOCATIONS not filled or use of comma (,) in front of a modifier role, as in Move the box (to the table).
Finally, each entry was automatically tagged <<SPATIAL>> if the LCS indicated a Loc(ational) or
Poss(essional) field in association with the underlying primitive (GO, BE, etc.). All non-spatial entries
were automatically tagged <<NON-SPATIAL>>.
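The four tagging rules and the spatial/non-spatial distinction above can be sketched compactly. This is an illustrative simplification, not the released procedure: it assumes each class has been reduced to whether its :COMPONENT_OF_MEANING and :COLLOCATIONS slots are filled, plus the marker ("," or "_") found on the thematic role corresponding to the collocation.

```python
# Toy classifier mirroring rules (1)-(4) from Section 3.
def usage_pattern(com_filled, coll_filled, role_marker=None):
    """role_marker: "," (optional) or "_" (obligatory) on the thematic
    role corresponding to the collocation, if any."""
    if com_filled and not coll_filled:
        return "Block"                              # e.g., Face the doorway
    if com_filled and coll_filled:
        # "," -> Overlap (rule 2); "_" -> Fill-Oblig (rule 3)
        return "Overlap" if role_marker == "," else "Fill-Oblig"
    return "Fill-Opt"                               # rule 4: COM underspecified

def spatial_tag(lcs_field):
    # <<SPATIAL>> if the LCS primitive carries a Loc or Poss field
    return "<<SPATIAL>>" if lcs_field in ("Loc", "Poss") else "<<NON-SPATIAL>>"

print(usage_pattern(True, False))        # Block
print(usage_pattern(True, True, ","))    # Overlap
print(usage_pattern(True, True, "_"))    # Fill-Oblig
print(usage_pattern(False, True, ","))   # Fill-Opt
print(spatial_tag("Loc"))                # <<SPATIAL>>
```

The sketch makes explicit that the four designations partition the classes on just two slot-filled flags and one marker.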
4 STYLUS: Spatial and Non-Spatial Subset of LVD Classes
The number of classes associated with each of the language usage patterns introduced above is given in Table 1, with tallies for the full set of LVD classes, the spatial subset, and the non-spatial subset.3 Representative examples of verbs in both the spatial subset (e.g., for robot navigation) and the non-spatial subset (e.g., for cyber-notification generation) are also provided in Table 1.
Interestingly, the Spatial Subset of classes is sizeable (44% of the entire set). However, STYLUS development has yielded generalizations beyond prior LCS-inspired work that focused strictly on spatial verbs (Voss et al., 1998). Most notably, the Block, Overlap, and Fill patterns are generalizable to a large number of LCS classes that are non-spatial as well. The obvious cases are those with non-zero values in Table 1 (Overlap and Fill): e.g., tread on (Overlap), cram for (Fill-Oblig), and blend together (Fill-Opt). However, although the number of non-spatial Block cases appears to be 0, the corresponding 7 spatial usages also extend to metaphorical extensions; e.g., in the phrase elevated her spirits, the use of "up" is blocked. Although STYLUS encodes elevate as a spatial verb, this same lexico-semantic property applies for its non-spatial (metaphorical) usage. Analogously, STYLUS encodes the contrasting verb lift as a spatial verb, designated as Overlap, to enable lifted up her hand. A metaphorical (non-spatial) usage would be lift up her spirits. Such cases indicate that Block, Overlap, and Fill apply even more broadly than was conceived in the original LVD, which means that the numbers in the Non-Spatial Subset column of Table 1 are understated and would benefit from additional analysis of metaphorical extensions.

2 In the original LVD, the :SPATIAL_COMPONENT_OF_MEANING field name was used. More recently, such components have been determined to be useful for non-spatial verbs as well. For example, admire (in class 31.2.c) has a toward component of meaning, whereas abhor has an away-from component of meaning. Thus, the more general :COMPONENT_OF_MEANING field name has been subsequently adopted.
3 N/A refers to verb classes whose members take bare NP or S arguments. Intrans refers to intransitive verbs.
Lexical Ops    LVD Classes   Spatial Subset   Spatial Examples        Non-Spatial Subset   Non-Spatial Examples
Block                    7                7   elevate, face, pocket                    0   infect, archive
Overlap                 17               10   advance, lower, lift                     7   follow, precede
Fill-Oblig             310              128   drive, rotate, put                     182   mount, install
Fill-Opt                87               49   remove, slide                           38   move, add
Intrans                  6                3   float, part, squirm                      3   choke, drown, snap
N/A                     73               12   —                                       61   —
Tot. Classes           500              219                                          281
Tot. Verbs            9525             4640                                         4885

Table 1: Language Usage Patterns for Spatial and Non-Spatial Verbs with Representative Examples
To explore the broader applicability of the Block, Overlap, and Fill patterns, we first examined verbs in the Spatial Subset and determined that many of them are among those relevant to robot navigation, e.g., move, go, advance, drive, return, rotate, and turn. Others are easily accommodated by extending classes, without modification to the spatial language usage patterns described above. For example, back up matches the class template containing advance, and pivot matches the class template containing rotate. We then considered the Non-Spatial Subset for generation of cyber-related notifications, a sub-task of an ongoing cyber-attack prediction project (Dalton et al., 2017). We determined that the Fill, Overlap, and Block patterns apply to members of the same classes that were relevant to robot navigation. Specifically, we examined the four classes shown in Table 2.
Each row below gives a class [with its pattern], its robot navigation usage (understanding and generation), and its cyber notification usage (generation only):

9.8-Fill [Block]
  Robot Navigation - Understanding: Face the doorway / Face to the doorway
  Robot Navigation - Generation: Where do I face? / *Where do I face to?
  Cyber Notification (Generation Only): Risk that trojan horse virus will infect (*to) System A increased 5%

47.8.1-Contig [Overlap]
  Robot Navigation - Understanding: Follow the car / Follow behind the car
  Robot Navigation - Generation: Which car do I follow (behind)?
  Cyber Notification (Generation Only): Risk increased 25% that VPN failure will follow (behind) malware attack

9.1-Put [Fill-Oblig]
  Robot Navigation - Understanding: Put it on the table / *Put the box
  Robot Navigation - Generation: Where do I put it? / *Do I put the box?
  Cyber Notification (Generation Only): Risk of malware increases 5% if disk A mounted on file server B

11.2-Slide [Fill-Opt]
  Robot Navigation - Understanding: Move it to the table / Move it
  Robot Navigation - Generation: Where do I move it? / Do I move it?
  Cyber Notification (Generation Only): 10% increased malware risk if System A files are moved (to System B)

Table 2: Class-Based Patterns for Robot Navigation and Cyber Notification
Note that Block, Overlap, and Fill can be applied as validity constraints on language usage patterns in the spatial domain of human-robot commands (for understanding and generation).4 Correspondingly, for the non-spatial domain of cyber notification generation, these same four classes included other class members (e.g., for 9.8 and 9.1) or meaning extensions of the same class members (e.g., for 47.8.1 and 11.2) that exhibited the same language usage patterns as their robot navigation counterparts.
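The use of the patterns as validity constraints on generation can be sketched as a toy check (our own illustration; a real system would consult the full STYLUS entry for the class rather than a bare pattern label):

```python
# Toy validity constraint: given a class's pattern, may a generated form
# include a prepositional collocation?
def valid_generation(pattern, has_collocation):
    if pattern == "Block":
        return not has_collocation   # blocks e.g. *Where do I face to?
    if pattern == "Fill-Oblig":
        return has_collocation       # rules out e.g. *Put the box
    return True                      # Overlap / Fill-Opt: collocation optional

print(valid_generation("Fill-Oblig", False))  # False -> *Put the box
print(valid_generation("Block", True))        # False -> *Where do I face to?
print(valid_generation("Fill-Opt", False))    # True  -> Move it
```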
5 Conclusion and Future Work
STYLUS provides the basis for both understanding and generation in the spatial robot navigation domain and for generation in the non-spatial cyber notification domain. A larger-scale application and evaluation of the effectiveness of STYLUS for understanding and generation in these two domains is a future area of study. A starting point is an ongoing Bot Language project (Marge et al., 2017) that has heretofore focused on dialogue annotation (Traum et al., 2018) and has not yet incorporated the lexicon-based knowledge necessary for automatically detecting incomplete, vague, or implicit navigation commands.

4 An asterisk at the start of a sentence indicates an invalid generated form. Verbs in the designated class are italicized.
Another avenue for exploration is the enhancement of cyber notifications through systematic derivation of mappings to surface realizations for other parts of speech. This work will involve access to a "Categorial Variation" database (CatVar) (Habash and Dorr, 2003) to map verbs in the LCS classes to their nominalized and adjectivalized forms. For example, the CatVar entry for infect includes the nominalized form infection, which provides additional options that may be more fluent in cyber-related notifications, e.g., viral infection of system might be considered less stilted than virus will infect system.
Although STYLUS is strictly for English, the lexical-semantic foundation from which it is derived
has been investigated for a range of multilingual applications (Habash et al., 2006; Cabezas et al.,
2001; Levow et al., 2000). Future work will examine the derivation of this resource for non-English
languages.
Acknowledgements
This research is supported, in part, by the Institute for Human and Machine Cognition, in part by the U.S. Army Research Laboratory, and in part by the Office of the Director of National Intelligence (ODNI) and the Intelligence Advanced Research Projects Activity (IARPA) via the Air Force Research Laboratory (AFRL) contract number FA875016C0114. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, AFRL, ARL, or the U.S. Government.
References
Clara Cabezas, Bonnie J. Dorr, and Philip Resnik. 2001. Spanish Language Processing at University of Maryland: Building Infrastructure for Multilingual Applications. In Proceedings of the Second International Workshop on Spanish Language Processing and Language Technologies (SLPLT-2), Jaen, Spain, http://www.umiacs.umd.edu/users/bonnie/Publications/slplt-01.htm.

Adam Dalton, Bonnie Dorr, Leon Liang, and Kristy Hollingshead. 2017. Improving cyber-attack predictions via unconventional sensors discovered through information foraging. In Proceedings of the 2017 International Workshop on Big Data Analytics for Cyber Intelligence and Defense, BDA4CID '17, pages 4560–4565. IEEE.

Bonnie Dorr and Doug Jones. 1996. Acquisition of Semantic Lexicons: Using Word Sense Disambiguation to Improve Precision. In Proceedings of the Workshop on Breadth and Depth of Semantic Lexicons, 34th Annual Conference of the Association for Computational Linguistics, pages 42–50. Kluwer Academic Publishers.

Bonnie J. Dorr and Mari Broman Olsen. 1997. Deriving Verbal and Compositional Lexical Aspect for NLP Application. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 151–158.

Bonnie J. Dorr and Mari Broman Olsen. 2018. Lexical Conceptual Structure of Literal and Metaphorical Spatial Language: A Case Study of Push. In Proceedings of the NAACL Workshop on Spatial Language Understanding, pages 31–40.

Bonnie J. Dorr and Clare R. Voss. 1993. Machine Translation of Spatial Expressions: Defining the Relation between an Interlingua and a Knowledge Representation System. In Proceedings of the Twelfth Conference of the American Association for Artificial Intelligence, pages 374–379.

Bonnie J. Dorr and Clare Voss. 2018. The Case for Systematically Derived Spatial Language Usage. In Proceedings of the NAACL Workshop on Spatial Language Understanding, pages 63–70.

Bonnie J. Dorr, Mari Olsen, Nizar Habash, and Scott Thomas. 2001. LCS Verb Database Documentation. http://www.umiacs.umd.edu/~bonnie/Demos/LCS_Database_Documentation.html.
Bonnie J. Dorr. 1993. Machine Translation: A View from the Lexicon. MIT Press, Cambridge, MA.

Bonnie J. Dorr. 1997. Large-Scale Dictionary Construction for Foreign Language Tutoring and Interlingual Machine Translation. Machine Translation, 12:271–322.

David Dowty. 1979. Word Meaning and Montague Grammar. Reidel, Dordrecht.

Charles J. Fillmore. 2002. Linking sense to syntax in FrameNet (keynote speech). In 19th International Conference on Computational Linguistics, Taipei. COLING, http://www.lirmm.fr/~lafourca/ML-enseign/DEA/DEA03-TALN-articles/fillmore.pdf.

M. Masten Guerssel, Kenneth Hale, Mary Laughren, Beth Levin, and Josie White Eagle. 1985. A Cross-linguistic Study of Transitivity Alternations. In W. H. Eilfort, P. D. Kroeber, and K. L. Peterson, editors, Papers from the Parasession on Causatives and Agentivity at the Twenty-first Regional Meeting of the Chicago Linguistic Society, pages 48–63.

Nizar Habash and Bonnie J. Dorr. 2002. Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation. In Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, pages 84–93, Tiburon, CA.

Nizar Habash and Bonnie Dorr. 2003. A categorial variation database for English. In NAACL/HLT 2003, Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference, pages 96–102.

Nizar Habash, Bonnie J. Dorr, and Christof Monz. 2006. Challenges in Building an Arabic GHMT system with SMT Components. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages 56–65, Boston, MA, August.

ISO-24617-7. 2014. Language Resource Management - Semantic Annotation Framework - Part 7: Spatial information (ISOspace), http://www.iso.org/standard/60779.html.

Ray Jackendoff. 1983. Semantics and Cognition. MIT Press, Cambridge, MA.

Ray Jackendoff. 1990. Semantic Structures. MIT Press, Cambridge, MA.

Ray Jackendoff. 1996. The Proper Treatment of Measuring Out, Telicity, and Perhaps Even Quantification in English. Natural Language and Linguistic Theory, 14:305–354.

Karin Kipper, Benjamin Snyder, and Martha Palmer. 2004. Using Prepositions to Extend a Verb Lexicon. In Proceedings of the HLT/NAACL Workshop on Computational Lexical Semantics, pages 23–29.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A Large-scale Classification of English Verbs. Language Resources and Evaluation, 42(1):21–40.

Parisa Kordjamshidi, Martijn Van Otterlo, and Marie-Francine Moens. 2011. Spatial Role Labeling: Task Definition and Annotation Scheme. In Proceedings of the Seventh Conference on International Language Resources and Evaluation, pages 413–420.

Ben Kybartas and Rafael Bidarra. 2017. A survey on story generation techniques for authoring computational narratives. IEEE Transactions on Computational Intelligence and AI in Games, 9(3):239–253.

Beth Levin and Malka Rappaport Hovav. 1995. Unaccusativity: At the Syntax-Lexical Semantics Interface, Linguistic Inquiry Monograph 26. MIT Press, Cambridge, MA.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press.

Beth Levin. 2015. Verb Classes Within and Across Languages. In B. Comrie and A. Malchukov, editors, Valency Classes: A Comparative Handbook, pages 1627–1670. De Gruyter, Berlin.

Gina-Anne Levow, Bonnie J. Dorr, and Dekang Lin. 2000. Construction of Chinese-English Semantic Hierarchy for Cross-Language Retrieval. In Workshop on English-Chinese Cross Language Information Retrieval, International Conference on Chinese Language Computing, pages 187–194, Chicago, IL.

Matthew Marge, Claire Bonial, Ashley Foots, Cassidy Henry, Cory Hayes, Kimberly Pollard, Ron Artstein, Clare Voss, and David Traum. 2017. Exploring Variation of Natural Human Commands to a Robot in a Collaborative Navigation Task. In ACL 2017 RoboNLP Workshop, pages 58–66.
Malka Rappaport Hovav and Beth Levin. 1998. Building Verb Meanings. In M. Butt and W. Geuder, editors, The Projection of Arguments: Lexical and Compositional Factors, pages 97–134. CSLI Publications, Stanford, CA.

Mari Broman Olsen, Bonnie J. Dorr, and Scott Thomas. 1997. Toward Compact Monotonically Compositional Interlingua Using Lexical Aspect. In Proceedings of the Workshop on Interlinguas in MT, pages 33–44, San Diego, CA, October.

Mari Broman Olsen. 1994. The Semantics and Pragmatics of Lexical and Grammatical Aspect. Studies in the Linguistic Sciences, 24(1–2):361–375.

Rainer Osswald and Robert D. Van Valin. 2014. FrameNet, Frame Structure, and the Syntax-Semantics Interface. In T. Gamerschlag, D. Gerland, R. Osswald, and W. Petersen, editors, Frames and Concept Types: Applications in Language and Philosophy (Studies in Linguistics and Philosophy 94), pages 125–156. Springer, Heidelberg.

Martha Palmer, Claire Bonial, and Jena D. Hwang. 2017. VerbNet: Capturing English Verb Behavior, Meaning and Usage. In Susan Chipman, editor, The Oxford Handbook of Cognitive Science. Oxford University Press.

James Pustejovsky and Kiyong Lee. 2017. Enriching the Notion of Path in ISOspace. In Proceedings of the 13th Joint ISO-ACL Workshop on Interoperable Semantic Annotation (ISA-13), http://aclweb.org/anthology/W17-7415.

James Pustejovsky and Zachary Yocum. 2014. Image Annotation with ISO-Space: Distinguishing Content from Structure. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 426–431.

James Pustejovsky, Jessica Moszkowicz, and Marc Verhagen. 2011. Using ISO-Space for Annotating Spatial Information, http://www2.denizyuret.com/bib/pustejovsky/pustejovsky2011cosit/COSIT-ISO-Space.final.pdf.

David Traum and Nizar Habash. 2000. Generation from Lexical Conceptual Structures. In Proceedings of the Workshop on Applied Interlinguas, North American Association for Computational Linguistics / Applied NLP Conference, pages 34–41.

David Traum, Cassidy Henry, Stephanie Lukin, Ron Artstein, Felix Gervits, Kimberly Pollard, Claire Bonial, Su Lei, Clare Voss, Matthew Marge, Cory Hayes, and Susan Hill. 2018. Dialogue Structure Annotation for Multi-Floor Interaction. In Nicoletta Calzolari (Conference chair), Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).

Clare R. Voss and Bonnie J. Dorr. 1995. Toward a Lexicalized Grammar for Interlinguas. Machine Translation, 10:143–184.

Clare R. Voss, Bonnie J. Dorr, and M. U. Şencan. 1998. Lexical Allocation in Interlingua-based Machine Translation of Spatial Expressions. In Patrick Oliver and Klaus-Peter Gapp, editors, Representation and Processing of Spatial Expressions, pages 133–148. L. Erlbaum Associates Inc., Hillsdale, NJ, USA.
Appendices
Appendix A. Supplemental Material: STYLUS
STYLUS (SysTematicallY Derived Language USage) is a simplified and extended version of the Lexical Conceptual Structure Verb Database (LVD) (Dorr et al., 2001) with 500 classes of verbs that incorporate components of meaning and prepositional collocations (in ASCII .txt format). The classes include over 9500 verb entries, about 5500 entries beyond the original 4000+ entries in Levin (1993). This resource can be downloaded from the link below:
https://www.dropbox.com/s/3fwzlg1wrirjhv1/LCS-Bare-Verb-Classes-Final.txt?dl=0
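A hypothetical reader for the released text file is sketched below, assuming entries follow the Lisp-style layout shown in Section 3; the slot names and exact layout of the released file may differ, so this is an illustration rather than a supported loader.

```python
import re

# Assumes each entry contains :NUMBER "..." followed (eventually) by a
# :WORDS (...) slot, as in the class entries shown in Section 3.
ENTRY = re.compile(r':NUMBER\s+"([^"]+)".*?:WORDS\s+\(([^)]*)\)', re.DOTALL)

def read_classes(text):
    """Map class number -> list of member verbs."""
    return {num: words.split() for num, words in ENTRY.findall(text)}

sample = '''(:NUMBER "51.1.a"
 :NAME "Verbs Inherently Directed Motion / -from/to"
 :WORDS (advance arrive depart leave)
 :COLLOCATIONS ("from" "to" "towards"))'''

classes = read_classes(sample)
print(classes["51.1.a"])   # ['advance', 'arrive', 'depart', 'leave']
```

A fuller reader would also pull out the :COLLOCATIONS and :COMPONENT_OF_MEANING slots, which carry the enrichments described in Section 3.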
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 65–70, Santa Fe, New Mexico, USA, August 20, 2018.
Contemporary Amharic Corpus: Automatically Morpho-Syntactically
Tagged Amharic Corpus
Andargachew Mekonnen Gezmu
Otto-von-Guericke-Universität
Binyam Ephrem Seyoum
Addis Ababa University
Michael Gasser
Indiana University
Andreas Nürnberger
Otto-von-Guericke-Universität
Abstract
We introduce the contemporary Amharic corpus, which is automatically tagged for morpho-syntactic information. Texts are collected from 25,199 documents from different domains and about 24 million orthographic words are tokenized. Since it is partly a web corpus, we made some automatic spelling error corrections. We have also modified the existing morphological analyzer, HornMorpho, to use it for automatic tagging.
1 Introduction
Amharic is a Semitic language that serves as the working language of the Federal Government of Ethiopia. Next to Arabic, it is the most spoken Semitic language. Though it plays several roles, it is considered one of the less-resourced languages. This is because it lacks basic tools and resources for carrying out natural language processing research and applications.
One of the main hurdles in natural language processing of Amharic is the absence of sizable, clean, and properly tagged corpora. Amharic is morphologically rich, and the boundary of the syntactic word within an orthographic word is unclear in most cases. Even if it is possible to use some techniques for collecting a big corpus from the web, basic natural language tasks like POS tagging or parsing would be challenging if we base our analysis on orthographic words. In Amharic, orthographic words represent different linguistic units: it is possible for an orthographic word to represent a phrase, a clause, or a sentence. This is because some syntactic words like prepositions, conjunctions, auxiliaries, etc. can be attached to other lexical categories. Thus, for proper tagging and syntactic analysis, these morphemes should be separated from their phonological host. In this project, we made an effort to collect, segment, and tag text documents from various sources.
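The segmentation problem described above can be illustrated with a deliberately toy sketch. The proclitic list below is hypothetical and transliterated (e.g., be- "in/by", le- "for/to"); it is not the rule set used by HornMorpho or by this corpus, and a real segmenter would work on Ethiopic script with morphological analysis rather than string prefixes.

```python
# Purely illustrative proclitic segmentation; the prefix list is a
# hypothetical toy in transliteration, not the actual Amharic rule set.
PROCLITICS = ["be", "le", "ke", "ye"]   # e.g., be- "in/by", le- "for/to"

def segment(orthographic_word):
    """Split a known proclitic off an orthographic word (toy heuristic)."""
    for p in PROCLITICS:
        if orthographic_word.startswith(p) and len(orthographic_word) > len(p) + 2:
            return [p + "-", orthographic_word[len(p):]]
    return [orthographic_word]

print(segment("bebet"))   # ['be-', 'bet']  (toy: be- + bet "house")
print(segment("bet"))     # ['bet']  (no segmentation: remainder too short)
```

Even this toy shows why naive prefix stripping is insufficient: the length guard is needed so that words that merely begin with a proclitic-like string are not wrongly split, which is exactly the ambiguity that motivates full morphological analysis.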
2 Related Works
Even though Amharic is a less-resourced language, there are some corpus collections by different initiatives. Since the introduction of the notion of considering the web as a corpus, which is motivated by the practical reasons of getting larger data with open access and at low cost, there have been corpora of different sizes for Amharic. In this section, we describe three corpora: the Walta Information Center (WIC) corpus, the HaBit corpus (amWaC17), and the An Crúbadán corpus.
The WIC corpus is a medium-sized corpus of 210,000 tokens collected from 1,065 Amharic news documents (Demeke and Getachew, 2006). The corpus is manually annotated for POS tags. The tag-set, which is proposed based on the orthographic word, comprises 31 tags. They use compound tags for those words with prepositions and conjunctions. Even though they distinguish some subclasses of a given category (like noun and verb), the distinction no longer exists when these forms attach prepositions and/or conjunctions. However, they propose independent tags for both prepositions and conjunctions without favoring segmentation of these clitics. In the annotation process, annotators were given training and guidelines. Each annotator was given a different news article. They were asked to write the tags by hand, and the annotation was given to a typist to insert the handwritten tags into the document. In such a process, there may be some loss of information, which could be attributed to the inconsistencies of the annotation (Gambäck et al., 2009; Tachbelie and Menzel, 2009; Yimam et al., 2014). The WIC report does not provide inter-annotator consistency measurements or the method of keeping consistency during the process of the annotation. According to the recent report of Rychlý and Suchomel (2016), the efficiency of an automatic tagger developed or trained on this corpus is 87.4%. Since this corpus is accessible to most NLP researchers working on Amharic, it has been used to train a stemmer (Argaw and Asker, 2007), a named entity recognizer (Alemu, 2013), and a chunker (Ibrahim and Assabie, 2014).
The HaBit corpus, called amWaC17 (Amharic 'Web as Corpus', the year 2017), is another web corpus, which was developed by crawling using SpiderLing (Rychlý and Suchomel, 2016). Most of the crawling was done in August 2013, October 2015, and January 2016. The current version includes a web corpus crawled from May to September 2017 and consists of 30.5 million tokens collected from 75,509 documents. The corpus is cleaned and tagged for POS using a TreeTagger trained on WIC. They have used the same tag-set used in WIC and followed the same principles as WIC in tagging tokens.
The An Crúbadán corpus was developed under a project called corpus building for under-resourced languages. The initiative aimed at the creation of text corpora for a large number of under-resourced languages by crawling the web (Scannell, 2007). The project collected written corpora for more than 2,000 languages, and Amharic was one of the languages included in this project. The Amharic corpus consists of 16,970,855 words crawled from 9,999 documents. The major sources used are the Amharic Wikipedia, the Universal Declaration of Human Rights, and JW.org. This corpus is a list of words with their frequencies.
In the three corpora mentioned above, the written or orthographic form is treated as a word. Syntactic words like prepositions, conjunctions, and articles are not treated separately; in other words, clitics were not segmented from their hosts. In addition, although the web makes it possible to collect authentic data, such sources are noisy. As Amharic is not standardized, these sources show a lot of variation, and one can expect misspellings, typographical errors, and grammatical errors. This calls for manual or automatic editing.
3 Data Collection and Preprocessing
3.1 Data Collection
The Contemporary Amharic Corpus (CACO) is collected from archives of various sources which are
edited and made available for the public. All of the documents are written in modern or contemporary
Amharic. Table 1 summarizes documents used in the corpus.
Type of Documents Titles
Newspapers አዲስ አድማስ, አዲስ ዘመን, ሪፖርተር, ነጋሪት ጋዜጣ
News articles Ethiopian News Agency, Global Voices
Magazines ንቁ, መጠበቂያ ግንብ
Fictions የልምዣት, ግርዶሽ, ልጅነት ተመልሶ አይመጣም, የአመጽ ኑዛዜ, የቅናት ዛር, አግዐዚ
Historic novel አሉላ አባነጋ, ማዕበል የአብዮቱ ማግሥት, የማይጨው ቁስለኛ, የታንጉት ሚስጢር
Short stories የዓለም መስታወት, የቡና ቤት ስዕሎችና ሌሎችም ወጎች
History books አጭር የኢትዮጲያ ታሪክ, ዳግማዊ አጤ ምኒልክ, ዳግማዊ ምኒልክ,
የእቴጌ ጣይቱ ብጡል (፲፰፻፴፪ - ፲፱፻፲) አጭር የሕይወት ታሪክ, ከወልወል እስከ ማይጨው
Politics book ማርክሲዝምና የቋንቋ ችግሮች, መሬት የማን ነው
Children’s book ፒኖኪዮ, ውድድር
Amharic Bible1 አዲስ ዓለም ትርጉም መጽሐፍ ቅዱስ
Table 1: Data sources for CACO.
In total, 25,199 documents from these sources are preprocessed and tagged. The preprocessing and tagging tasks are discussed in the subsequent sections.

1 We used the New World Translation of the Bible, which is translated into contemporary (not archaic) Amharic.
3.2 Preprocessing
The preprocessing of the documents involves spelling correction, normalization of punctuation marks, and sentence extraction from paragraphs. Different types of misspellings are observed in the documents. Misspellings result from omitted spaces (e.g., አንዳንድየህክምናተቋማትናባለሙያዎቻቸው), from replacing letters with visually similar characters (e.g., ቇጠረ for ቆጠረ), and from typographical errors. In addition, four Amharic phonemes have more than one homophonic character representation. These homophonic characters are commonly used interchangeably, which can be considered a source of real-word misspellings. To correct the misspellings, we used the spelling corrector developed by Gezmu et al. (2018), which is mainly employed to correct the first two types of spelling errors. As intensive manual intervention is needed to select the correct spelling from the plausible suggestions for typographical errors, we have not corrected typographical errors in the current version of the corpus. To deal with real-word spelling errors resulting from homophonic letters, we adhere to the Ethiopian Languages Academy (ELA) spelling reform (ELA, 1970; Aklilu, 2004). Following this reform, homophonic characters are replaced with their common forms: ሐ and ኀ are replaced with ሀ, ሠ with ሰ, ዐ with አ, and ፀ with ጸ.
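The replacements above can be implemented with a simple translation table. A minimal sketch: only the base (first-order) symbols named in the text are mapped here, whereas a full implementation would also map the other vowel orders of each homophonic series.

```python
# Normalize homophonic Amharic characters to their common forms,
# following the ELA spelling reform described above.
# Sketch only: the remaining vowel orders of each series are omitted.
HOMOPHONE_MAP = str.maketrans({
    "ሐ": "ሀ",
    "ኀ": "ሀ",
    "ሠ": "ሰ",
    "ዐ": "አ",
    "ፀ": "ጸ",
})

def normalize_homophones(text: str) -> str:
    """Replace homophonic characters with their common forms."""
    return text.translate(HOMOPHONE_MAP)
```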
The other issue that needs attention is the normalization of punctuation marks. Different styles of punctuation have been used in the documents. For instance, for a double quotation mark, two single quotation marks, “, ”, ‹‹, ››, ``, ", « or » are used. Thus, normalization of punctuation marks is a non-trivial matter. We normalized all types of double quotes to ", all single quotes to ', question marks (e.g., ? and ፧) to ?, word separators (e.g., : and ፡) to a plain space, full stops (e.g., :: and ፡፡) to ።, exclamation marks (e.g., ! and !) to !, hyphens (e.g., :- and ፡—) to ፦, and commas (e.g., ፥ and ÷) to ፣.
From the collected documents, sentences are identified and extracted by their boundaries—either double colon-like symbols (።) or question marks (?)—and are tokenized based on the orthographic-word boundary, a white space.
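The normalization and sentence-extraction steps above can be sketched as follows. The variants handled are an illustrative subset of those listed in the text, not the full inventory used for the corpus; note that ፡፡ must be mapped to ። before single ፡ is replaced by a space.

```python
import re

def normalize(text: str) -> str:
    """Normalize a subset of the punctuation variants described above."""
    text = text.replace("፡፡", "።").replace("::", "።")  # full-stop variants first
    text = re.sub("[“”«»]|‹‹|››|``", '"', text)         # double-quote variants
    text = text.replace("፧", "?")                       # question-mark variant
    text = text.replace("፥", "፣").replace("÷", "፣")    # comma variants
    text = text.replace("፡", " ").replace(":", " ")    # word separators -> space
    return text

def split_sentences(text: str) -> list:
    """Split after each sentence boundary: the full stop ። or a question mark."""
    parts = re.split("(?<=[።?])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence: str) -> list:
    """Orthographic-word boundary: whitespace."""
    return sentence.split()
```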
About 1.6 million sentences are collected from the documents. When the sentences are tokenized by
plain space as a word boundary, around 24 million orthographic words are found. We build a 3-gram
language model using the corpus. The corpus statistics are given in Table 2.
Elements Numbers
Sentences 1,605,452
Tokens 24,049,484
Unigrams 919,407
Bigrams 9,170,309
Trigrams 16,033,318
Table 2: Statistical information for CACO corpus.
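The n-gram counts in Table 2 can be reproduced by counting distinct n-grams over the tokenized sentences. The snippet below is an illustrative sketch on toy data, not the exact tooling used to build the CACO language model.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams over whitespace-tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy example (not the actual corpus): the number of distinct
# unigrams/bigrams/trigrams corresponds to the rows of Table 2.
sents = ["a b c", "a b d"]
unigrams = ngram_counts(sents, 1)
bigrams = ngram_counts(sents, 2)
```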
4 Segmentation and Tagging
4.1 Amharic Orthography
The Amharic script is known as Ge’ez or Ethiopic. The system allows representing a consonant and a
vowel together as a symbol. Each symbol represents a CV syllable. However, a syllable in Amharic may
have a CVC structure. It is worth taking into account that the writing system does not mark gemination.
Gemination is phonemic; it can bring meaning distinctions and in some cases can also convey grammat-
ical information.
As in other languages, Amharic texts show variation in writing compound words. They can be written
in three ways: with a space between the compound words (e.g., ሆደ ሰፊ /hodə səffi/ “patient”, ምግብ ቤት
/mɨgɨb bet/ “restaurant”), separated by a hyphen (e.g., ስነ-ልሳን /sɨnə-lɨsan/ “linguistics”, ስነ-ምግባር /sɨnə-
mɨgɨbar/ “behavior”) or written as a single word (e.g., ቤተሰብ /betəsəb/ “family”, ምድረበዳ /mɨdrəbəda/
“desert”).
Orthographic words, which are separated by whitespace (or, in old documents, by the word separator ፡), may be coupled with function words like prepositions, conjunctions, or auxiliaries. In most cases, an orthographic word may function as a phrase (e.g., ከቤትዋ /kəbətwa/ “from her house”), a clause (e.g., የመጣው /jəmmət‘t‘aw/ “the one who came”), or even a sentence (e.g., አልበላችም /ʔalbəllaʧʧɨm/ “She did not eat.”).
In addition, the writing system allows people to write in a phonemic form (the abstract form, or what one intends to say) or in a phonetic form (what is actually uttered). As a result, a given word may appear in different forms.

All the above features need to be addressed in processing Amharic texts. Some of these problems call for standardization efforts, whereas others stem from the writer's choice between recording what is in mind and what is actually produced. Furthermore, for proper syntactic analysis, clitics should be segmented. We used an existing morphological analyzer, with some improvements, to perform the segmentation and tagging.
4.2 Proposed Tag-sets
As we have indicated in Section 2, most work on POS tagging for Amharic is based on orthographic words. In such an approach, compound tagsets are proposed for words fused with other syntactic words. Corpora tagged in this way may not be usable for syntactic analysis, which requires information about the structure within a phrase, a clause, and a sentence. Tagging with bundled tagsets hides such syntactic information.
Since one of the main objectives of the tagging task is to provide lexical information that can be employed for syntactic analysis such as parsing, we suggest that syntactic words, rather than orthographic words, should be the unit of analysis for tagging. In light of this, we follow the proposal of Seyoum et al. (2018), in which the syntactic word is used as the unit of analysis in favor of a lexicalist syntactic view. They suggested that clitics should be segmented in a preprocessing step. Since manual segmentation is very costly and time-consuming, we opted for morphological segmentation using an existing tool that comes closest to our needs. For this purpose, we found HornMorpho to be a suitable candidate, with some improvements. Although this tagset has not yet been fully integrated into HornMorpho, we use it for analysis since no other analyzer has comparable coverage.
4.3 HornMorpho Analysis and Limitations
HornMorpho (Gasser, 2011) is a rule-based system for morphological analysis and generation. As in
many other modern computational morphological systems, it takes the form of a cascade of composed
finite-state transducers (FSTs) that implement a lexicon of roots and morphemes, morphotactics, and
alternation rules governing phonological or orthographic changes at morpheme boundaries (Beesley and
Karttunen, 2003). To handle the complex long-distance dependencies between morphemes that characterize the morphology of Amharic, the transducers in HornMorpho are weighted with feature structures,
an approach originally introduced by Amtrup (2003). During transduction, the feature structure weights
on arcs that are traversed are unified, resulting in transducer output that includes an accumulated feature
structure as well as a character string. In this way, the transducers retain a memory of where they have
been, allowing the system to implement agreement and co-occurrence constraints and to output characters based on morphemes encountered earlier in the word.
As an example, consider the imperfective forms of Amharic verbs, which, as in other Semitic languages, include subject person/gender/number agreement prefixes as well as suffixes. The prefix t- encodes either third person singular feminine or second person (masculine singular, feminine singular, or plural). The suffix following an imperfective stem, or the absence of one, may disambiguate the prefix. Once it has seen the prefix, the HornMorpho transducer “remembers” the set of possible subject person, gender, and number features that are consistent with the input so far. When it reaches the suffix (or finds none), it can reduce the set of possible features (or reject the word if an inconsistent suffix is found). So in the word ትፈልጊያለሽ /tɨfəllɨgijalləʃ/ “you (sing. fem.) want,” segmented as t-fəllɨg-i-alləʃ, the presence of the suffix -i following the stem -fəllɨg- tells HornMorpho to constrain the subject person, gender, and number to second person singular feminine; without the proper prefix, HornMorpho would reject the word at this point.
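The constraint logic just described can be illustrated as intersection of candidate feature sets. The tables below are hypothetical simplifications invented for this illustration; they are not HornMorpho's actual feature structures or weighted-FST machinery.

```python
# Illustration of HornMorpho-style agreement filtering as set intersection.
# Feature sets here are hypothetical simplifications for the t- prefix case.
PREFIX_SUBJECTS = {
    "t": {"3sf", "2sm", "2sf", "2pl"},  # t-: 3sg.fem or 2nd person
}
SUFFIX_SUBJECTS = {
    "i": {"2sf"},           # -i after an imperfective stem: 2sg.fem
    "": {"3sf", "2sm"},     # no suffix: remaining candidates (assumed)
}

def subject_features(prefix: str, suffix: str):
    """Intersect the candidate sets licensed by prefix and suffix."""
    candidates = PREFIX_SUBJECTS[prefix] & SUFFIX_SUBJECTS[suffix]
    if not candidates:
        raise ValueError("inconsistent affix combination")
    return candidates

# t-fəllɨg-i (as in ትፈልጊያለሽ): the suffix narrows the subject to 2sg.fem.
assert subject_features("t", "i") == {"2sf"}
```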
The major focus in the development of HornMorpho was on lexical words, not on function words. More specifically, it analyzes nouns and verbs; since Amharic adjectives behave like nouns, HornMorpho does not distinguish between adjectives and nouns. In addition, compound words and light verb constructions are not handled (in version 2.5). It also gives wrong results for words with more than two clitics attached. For instance, for በበኩላቸው /bəbbəkkulaʧʧəw/ “from his/her side or perspective,” which occurs 4,293 times in amWaC17, the system produces fifteen guessed analyses. Furthermore, since clitics are not considered separate words, the system does not give any analysis for clitics. Therefore, in order to use this tool for tagging a big corpus, we needed to improve its coverage.
4.4 Improvement to HornMorpho
For the purposes of this paper, we modified HornMorpho in several ways. Since its major focus is to analyze nouns and verbs, it did not give morphological analyses for other syntactic words. As a result of our improvements, it now distinguishes more parts of speech (verbs, nouns, adjectives, adverbs, conjunctions, and adpositions). Morphological analysis for constructions like light verbs and compound nouns and verbs is now included in the improved version. As a rule-based system, it has a limited lexicon; for our purpose, we have enhanced the coverage of the lexicon, and the improved version recognizes personal and place names.
<s>
  <w pos="PRON" morphemes="be(prep)-{yh}" latin="bezihu">በዚሁ</w>
  <w pos="NM_PRS" morphemes="ye(gen)-{dereja}" latin="yedereja">የደረጃ</w>
  <w pos="N" morphemes="{zrzr}" latin="zrzr">ዝርዝር</w>
  <w pos="NADJ" morphemes="{'and}-m(cnj)" latin="'andm">አንድም</w>
  <w pos="N" morphemes="{'ityoP_yawi}" latin="'ityoPyawi">ኢትዮጵያዊ</w>
  <w pos="V" morphemes="al(neg1)-{ktt+te1a2_e3}(prf,recip,pas)-e(sb=3sm)-m(neg2)" latin="'altekatetem">አልተካተተም</w>
  <w pos="PUN">።</w>
</s>
Figure 1. An example sentence that is tagged with HornMorpho.
Figure 1 shows an example sentence from our corpus as tagged by HornMorpho: በዚሁ የደረጃ ዝርዝር አንድም ኢትዮጵያዊ አልተካተተም ። /bəzihu jədərəʤa zɨrzɨr ʔandɨmm ʔitjop’ɨjawi ʔaltəkatətəm./ “No Ethiopian has been included in the ranking list,” transliterated as bezihu yedereja zrzr 'andm 'ityoPyawi 'altekatetem ።. The HornMorpho segmenter puts the stem between {}, gives a POS label to the lexical word, and labels the grammatical morphemes, which are mostly function words and are separated by hyphens, with features/tags. For verbs, the stem is represented as a sequence of root consonants plus a template consisting of root-consonant positions, vowels, and gemination characters. The root consonants and the template are separated by a “+”, the root-consonant positions in the template are represented by numbers, and an underscore indicates gemination. During analysis by HornMorpho, the Amharic text is transliterated to Latin script using the System for Ethiopic Representation in ASCII (Firdyiwek and Yaqob, 1997).
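The output format just described can be unpacked with a small parser. The sketch below assumes only the format visible in Figure 1 (hyphen-separated pieces, the stem in {} with optional tags, affixes as form(tags)); it is not part of the HornMorpho API.

```python
import re

def parse_morphemes(morphemes: str):
    """Split a HornMorpho-style morphemes string into stem and affixes.

    Illustrative sketch based on the format shown in Figure 1:
    pieces are hyphen-separated; the stem is enclosed in {} with an
    optional (tags) suffix; affixes are written as form(tags).
    """
    stem, affixes = None, []
    for piece in morphemes.split("-"):
        m = re.fullmatch(r"\{(?P<stem>[^}]*)\}(\((?P<tags>[^)]*)\))?", piece)
        if m:
            stem = (m.group("stem"), m.group("tags"))
            continue
        m = re.fullmatch(r"(?P<form>[^()]+)\((?P<tags>[^)]*)\)", piece)
        if m:
            affixes.append((m.group("form"), m.group("tags")))
    return stem, affixes

# The verb from Figure 1: stem with grammatical tags, plus three affixes.
stem, affixes = parse_morphemes("al(neg1)-{ktt+te1a2_e3}(prf,recip,pas)-e(sb=3sm)-m(neg2)")
```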
5 Conclusion and Future Work

In this paper, we have introduced a new resource, CACO, a partly web-based corpus of Amharic. The corpus has been collected from sources covering different domains, including newspapers, fiction, historical books, political books, short stories, children's books, and the Bible. These sources are believed to be well edited but still need automatic editing. The corpus consists of 24 million tokens from 25,199 documents. It is automatically tagged for POS using HornMorpho, to which we made some modifications for the tagging. In doing so, we noticed that Amharic orthographic words should be properly segmented in a preprocessing stage. Currently, many words carry unclassified tags because of the limitations of HornMorpho. Therefore, as future work, we plan to enhance HornMorpho for automatic tagging. In addition, we plan to increase the size of the data and apply our method to other existing corpora.
Acknowledgments
We would like to extend our sincere gratitude to the anonymous reviewers for their valuable feedback. We would also like to thank the University of Oslo for its partial support under the project Linguistic Capacity Building - Tools for inclusive development of Ethiopia.
References
Amsalu Aklilu. 2004. Sabean and Ge'ez symbols as a guideline for Amharic spelling reform. In Proceedings of the First International Symposium on Ethiopian Philology, pages 18–26.

Besufikad Alemu. 2013. A Named Entity Recognition for Amharic. Master's thesis, Addis Ababa University, Addis Ababa, Ethiopia.

Jan W. Amtrup. 2003. Morphology in machine translation systems: Efficient integration of finite state transducers and feature structure descriptions. Machine Translation, 18(3):217–238.

Atelach Alemu Argaw and Lars Asker. 2007. An Amharic stemmer: Reducing words to their citation forms. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pages 104–110. Association for Computational Linguistics.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-State Morphology: Xerox Tools and Techniques. CSLI, Stanford.

Girma A. Demeke and Mesfin Getachew. 2006. Manual annotation of Amharic news items with part-of-speech tags and its challenges. Ethiopian Languages Research Center Working Papers, 2:1–16.

ELA. 1970. የአማርኛ ፡ ፊደል ፡ ሕግን ፡ አንዲጠብቅ ፡ ለማድረግ ፡ የተዘጋጀ ፡ ራፖር ፡ ማስታወሻ (Engl.: A memorandum for standardization of Amharic spelling). Journal of Ethiopian Studies, 8(1):119–134.

Yitna Firdyiwek and Daniel Yaqob. 1997. The system for Ethiopic representation in ASCII. https://www.researchgate.net/publication/2682324_The_System_for_Ethiopic_Representation_in_ASCII.

Björn Gambäck, Fredrik Olsson, Atelach Alemu Argaw, and Lars Asker. 2009. Methods for Amharic part-of-speech tagging. In Proceedings of the First Workshop on Language Technologies for African Languages, pages 104–111. Association for Computational Linguistics.

Michael Gasser. 2011. HornMorpho: A system for morphological processing of Amharic, Oromo, and Tigrinya. In Conference on Human Language Technology for Development, pages 94–99, Alexandria, Egypt.

Andargachew Mekonnen Gezmu, Andreas Nürnberger, and Binyam Ephrem Seyoum. 2018. Portable spelling corrector for a less-resourced language: Amharic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 4127–4132, Miyazaki, Japan. European Language Resources Association (ELRA).

Abeba Ibrahim and Yaregal Assabie. 2014. Amharic sentence parsing using base phrase chunking. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 297–306. Springer, Berlin, Heidelberg.

Pavel Rychlý and Vít Suchomel. 2016. Annotated Amharic corpora. In International Conference on Text, Speech, and Dialogue, pages 295–302. Springer, Cham.

Kevin P. Scannell. 2007. The Crúbadán Project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, volume 4, pages 5–15.

Binyam Ephrem Seyoum, Yusuke Miyao, and Baye Yimam Mekonnen. 2018. Universal Dependencies for Amharic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 2216–2222, Miyazaki, Japan. European Language Resources Association (ELRA).

Martha Yifiru Tachbelie and Wolfgang Menzel. 2009. Amharic part-of-speech tagger for factored language modeling. In Proceedings of the International Conference RANLP-2009, pages 428–433.

Seid Muhie Yimam. 2014. Automatic annotation suggestions and custom annotation layers in WebAnno. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 91–96.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 71–77, Santa Fe, New Mexico, USA, August 20, 2018.
Gold Corpus for Telegraphic Summarization
Chanakya Malireddy, LTRC, IIIT Hyderabad
Srivenkata N M Somisetty, IIIT Hyderabad
Manish Shrivastava, LTRC, IIIT Hyderabad
Abstract
Most extractive summarization techniques operate by ranking all the source sentences and then selecting the top-ranked sentences as the summary. Such methods are known to produce good summaries, especially when applied to news articles and scientific texts. However, they do not fare so well when applied to texts such as fictional narratives, which do not have a single central or recurrent theme. This is because the information or plot of the story is usually spread across several sentences. In this paper, we discuss a different summarization technique called Telegraphic Summarization. Here, we do not select whole sentences; rather, we pick short segments of text spread across sentences as the summary. We have tailored a set of guidelines to create such summaries and, using the same, annotate a gold corpus of 200 English short stories.
1 Introduction
The purpose of summarization is to capture all the useful information from the source text in as few words as possible. Extractive summarization involves identifying parts of the text which are important. Such summaries are usually generated by ranking all the source sentences, according to some heuristic or metric, and then selecting the top sentences as the summary. Most extractive summarization systems have been developed for domains such as newswire articles (Lee et al., 2005) and encyclopedic and scientific texts (Teufel and Moens, 2002). They work well in such domains because these texts revolve around a central theme and the information is often reinforced by reiteration across several sentences. However, fictional narratives do not talk about a single topic. They describe a sequence of events and often contain dialogue. Information is not repeated, and each sentence contributes to developing the plot further. Hence, selecting a subset of such sentences does not accurately capture the story.
In this paper, we focus on telegraphic summarization. A telegraphic summary does not contain whole sentences; instead, shorter segments are selected across sentences, and it reads like a telegram. We have described a set of guidelines to create such summaries. These guidelines were used to annotate a gold corpus of 200 English short stories. The dataset can be very useful for gaining an insight into story structures. To the best of our knowledge, no such corpus already exists. Formally, the paper makes the following contributions:
1. Discusses drawbacks of applying traditional extractive summarization methods to narrative texts.

2. Describes a set of guidelines to generate telegraphic summaries.

3. Provides a gold corpus of 200 short stories and their telegraphic summaries annotated using these guidelines.

4. Provides abstractive summaries for 50 stories and a set of 45 multiple-choice questions (MCQs) for evaluation purposes.
The paper is organized as follows. Section 2 discusses existing and related work. Section 3 describes the data collection and summarization process. Section 4 discusses the analysis performed on the dataset. Section 5 discusses conclusions and future work.
2 Related Work
Automatic text summarization was first attempted in the middle of the 20th century (Luhn, 1958). Since then it has been applied to several domains and corpora, such as news articles (Lee et al., 2005), scientific articles (Teufel and Moens, 2002), and blogs (Hu et al., 2007).
News articles have been the focus of summarization systems for a long time because of the vast practical applications. In fact, most datasets available today are built from news corpora. However, a comparative study has shown that a single summarization technique does not perform equally well across all domains (Ceylan et al., 2010). Therefore, separate systems have to be built to deal with the domain of fiction and its nuances, and news-corpora-based datasets are not sufficient to train and evaluate them.
There has been research on summarizing short fiction (Kazantseva and Szpakowicz, 2010), fairy tales (Lloret and Palomar, 2009), and whole books (Mihalcea and Ceylan, 2007). But the aforementioned work on short fiction summarization had a different objective: helping a reader decide whether one would be interested in reading the complete story. Hence it contains just enough information to help the reader decide but does not reveal the entire plot. Our dataset, however, aims to summarize the entire plot. This is useful for learning plot structures and story organization.
Turney (2000) and the KEA algorithm by Witten et al. (1999) have attacked the problem of key-phrase extraction. But the key-phrases they extract do not form a cohesive summary; they merely try to list the major themes discussed in the article and therefore cannot be applied to the domain of fiction.
Grefenstette (1998) proposed the use of sentence shortening to generate telegraphic texts that would help a blind reader skim a page (using text-to-speech). He provided eight levels of telegraphic reduction. The first (and the most drastic) generated a stream of all the proper nouns in the text. The second generated all nouns present in the subject or object position. The third, in addition, included the head verbs. The least drastic reduction generated all subjects, head verbs, objects, subclauses, prepositions, and dependent noun heads. Since then, Jing (2000), Riezler et al. (2003), and Knight and Marcu (2000) have explored statistical models for sentence shortening that, in addition, aim at ensuring the grammaticality of the shortened sentences. Intuitively, it appears that sentence shortening can allow more important information to be included in a summary. However, Lin (2003) showed that statistical sentence-shortening approaches like that of Knight and Marcu (2000) resulted in significantly worse content selection. He concluded that pure syntax-based compression does not improve overall summarizer performance, even though it performs well at the sentence level, which is why there is a need for semantically aware techniques. Our dataset provides the tools to evaluate and, in the future, maybe even train such algorithms.
3 Data Construction
3.1 Collection and Preprocessing
We collected English short stories containing 300 to 1100 words, available in the public domain1. 200 stories were then randomly picked, after ensuring that works from at least 20 different authors had been selected to keep the dataset diverse in terms of genre and writing style. As a result, the dataset spans 39 authors. These stories were then manually processed to remove any spelling, grammatical, and encoding errors that might have crept in.
3.2 Summarization
The summarization was performed by 5 annotators. The annotators are not native English speakers but are fluent in the language. Each annotator summarized 40 stories and cross-annotated 4 stories (1 from each remaining annotator), to help calculate inter-annotator agreement. Each annotator also performed abstractive summarization on 10 stories. Unlike extractive summarization, abstractive summaries need not contain the same words as used in the source; instead, they are written by the annotators in their “own words” based on their understanding of the text. The abstractive summary for a story was not provided by the same annotator who generated its telegraphic summary. An example is shown in Figure 1.

1https://americanliterature.com/short-story-library

Figure 1: An example of telegraphic and abstractive summaries created according to the guidelines. The story is displayed in the left panel, the telegraphic summary is highlighted and also listed in the box titled ‘extractive summary’, and the abstractive summary is shown below it.
Guidelines followed for telegraphic summarization:
1. A segment is defined as a continuous span of words in the source, chosen as a part of the summary.
2. A word should not be fragmented, e.g., if the word “breaking” appears in the source, it should not be broken into fragments like “break”.

3. Each segment should be relevant to the plot, try to advance the story, and have some continuity with the preceding and the following segment.

4. Segments extracted from dialogues or parentheses should be enclosed in quotes or parentheses respectively.

5. Segments should be arranged in the same order as they appear in the story.

6. The summary should be minimal. If multiple segments mean the same thing, pick the shortest. Adjectives, adverbs, and modifiers are not to be included if they are not relevant to the plot.

7. When the segments are read in sequence, the plot should be apparent and unambiguous.
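Guidelines 1, 2, and 5 are mechanically checkable: each segment must occur verbatim as a continuous span of the story, and segments must appear in source order. A minimal sketch of such a check (relevance and minimality, guidelines 3 and 6, still require human judgment):

```python
def validate_summary(story: str, segments: list) -> bool:
    """Check that every segment is a verbatim, continuous span of the
    story and that segments follow source order (guidelines 1, 2, 5)."""
    pos = 0
    for seg in segments:
        idx = story.find(seg, pos)   # must appear after the previous segment
        if idx == -1:
            return False
        pos = idx + len(seg)
    return True

# Hypothetical toy story, not from the corpus.
story = "The old man walked to the sea. He cast his net and waited."
assert validate_summary(story, ["old man walked", "cast his net"])
assert not validate_summary(story, ["cast his net", "old man walked"])
```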
Guidelines followed for abstractive summarization:
1. Summaries should be written from a third-party perspective, e.g., “This story is about a girl...”

2. Summaries should only discuss the plot and try to avoid inferences and opinions not immediately apparent from the story.

3. Summaries should maintain the same order of events as they occur in the source text.
4 Data Analysis and Evaluation
The length of the stories varies from 300 to 1100 words, with the average length being 650. The average summarization factor (length of summary / length of story) is 0.37 and 0.36 for telegraphic and abstractive summaries, respectively.
Figure 2: Summarization Factor vs Story Length
From Figure 2 we can see that the summarization factor tends to be high for very short stories, since nearly every word is important. It reduces gradually as the length of the story increases, because longer stories tend to be more descriptive and contain extraneous information not relevant to the central plot.
4.1 Quantitative Evaluation

We used the Alpha metric proposed in Krippendorff (1980) to measure the inter-annotator agreement. Alpha was computed based on the 20 cross-annotated stories and found to be 0.73.
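Krippendorff's alpha for nominal data compares observed to expected disagreement, alpha = 1 - D_o/D_e. The sketch below implements the standard nominal-data formula; it is a generic illustration, not necessarily the exact computation used to obtain the reported 0.73.

```python
from collections import Counter

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.

    `ratings` is a list of units; each unit is the list of labels it
    received from the coders. Sketch of the standard formula.
    """
    units = [u for u in ratings if len(u) > 1]       # only pairable units
    n = sum(len(u) for u in units)                   # total pairable values
    totals = Counter(label for u in units for label in u)
    # Observed disagreement: pairwise label mismatches within each unit.
    d_o = 0.0
    for u in units:
        c = Counter(u)
        pairs = sum(c[a] * c[b] for a in c for b in c if a != b)
        d_o += pairs / (len(u) - 1)
    d_o /= n
    # Expected disagreement from the overall label distribution.
    d_e = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    d_e /= n * (n - 1)
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Perfect agreement yields alpha = 1; agreement at chance level yields 0.
assert krippendorff_alpha_nominal([["in", "in"], ["out", "out"]]) == 1.0
```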
We generated 50 summaries using popular online extractive summarization tools, Smmry2 and Resoomer3, which generate summaries by ranking and selecting the top sentences (after the selection step, the sentences are re-arranged according to the source order). These summaries have the same summarization factor as the corresponding telegraphic summaries.
\[
\mathrm{ROUGE}\text{-}N = \frac{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count_{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)} \tag{1}
\]
The ROUGE-N score (Lin, 2004) is a popular metric used to evaluate summaries produced by a system. ROUGE-N recall is computed as shown in Eq. 1, where N stands for the length of the n-gram gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries. ROUGE-N precision can be calculated by replacing the denominator in Eq. 1 with the total number of n-grams present in the system summaries instead of the reference summaries. The F1 score is defined as the harmonic mean of precision and recall.
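With a single reference and a single candidate, Eq. 1 and its precision counterpart reduce to the following sketch; tokenization is simplified to whitespace splitting, whereas the paper additionally removes stop words and stems before scoring.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N F1 for one candidate/reference pair (Eq. 1, single reference)."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    match = sum((cand & ref).values())   # clipped n-gram overlap
    if match == 0:
        return 0.0
    recall = match / sum(ref.values())       # denominator: reference n-grams
    precision = match / sum(cand.values())   # denominator: candidate n-grams
    return 2 * precision * recall / (precision + recall)
```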
2 https://smmry.com/
3 https://resoomer.com/en/
In our case, there was only one reference (the abstractive summary) and one summary from each system: Telegraphic, Smmry, and Resoomer. We report the average F1 score, after removing stop words and stemming, for N = 1, 2, 3, 4 in Table 1.
              N=1    N=2    N=3    N=4
Telegraphic   0.582  0.258  0.108  0.051
Smmry         0.498  0.184  0.092  0.056
Resoomer      0.483  0.179  0.093  0.057
Table 1: ROUGE-N F1 score
The higher F1 scores for N = 1, 2, 3 indicate that telegraphic summaries are more adequate in terms of capturing relevant content. Since telegraphic summaries select shorter segments of text, they can retain more information while maintaining the same summarization factor as their sentence-level counterparts. The ROUGE-4 score for the Smmry and Resoomer summaries is slightly higher because they select entire source sentences and are therefore likely to have more 4-gram overlaps.
4.2 Qualitative Evaluation
Higher-order n-gram ROUGE measures try to judge fluency to some degree, but since ROUGE is based only on content overlap, it can only measure adequacy, not coherence. In order to gain an insight into the coherence of the summaries, we made a set of 45 MCQs from 15 stories in the dataset. Questions were not set by the same annotator who generated the corresponding telegraphic summary. The test was administered to three groups of two participants each. Each group was allotted the summaries produced by a single system: Telegraphic, Smmry, or Resoomer.
Figure 3: An example set of questions for the story ‘The Little Match Girl’.
An example set of questions for the story ‘The Little Match Girl’ is shown in Figure 3. The ‘id’ field refers to the unique id we assign to each story in the dataset. For each story, we made a set of 3 multiple-choice questions with 3 options each. The answers refer to the index of the correct option in the list. Apart from the given options, the participants were allowed to choose option 4, “Can’t say”, if they could not answer a question based on the summary. Average scores are reported in Table 2.
              Correct  Incorrect  Can’t say
Telegraphic   88.9%    2.2%       8.9%
Smmry         62.2%    4.4%       33.4%
Resoomer      60.0%    2.2%       37.8%
Table 2: Questionnaire Results
Higher scores on the questionnaire indicate that the telegraphic summaries were more coherent and allowed the reader to understand the story better. Participants who read the Smmry and Resoomer summaries were unable to understand the story and chose “Can’t say” as the answer for nearly a third of the questions.
5 Conclusion and Future Work
In this paper, we highlight the shortcomings of applying traditional extractive summarization techniques to narrative texts and show how telegraphic summarization can be used to overcome these shortcomings.
We defined a set of guidelines to help generate telegraphic summaries. Using the same, we then constructed a corpus of 200 English short stories and their telegraphic summaries. 50 abstractive summaries and 45 MCQs are also provided for evaluation purposes. This corpus has been made public4 and can be used as a gold standard to evaluate such summarization tasks. The MCQs can also be used to evaluate QA systems.
In the future, we intend to extend this corpus by adding more stories. We plan on developing algorithms to automatically generate high-quality telegraphic summaries, and an extended corpus could be used as training data for supervised techniques.
References

Hakan Ceylan, Rada Mihalcea, Umut Ozertem, Elena Lloret, and Manuel Palomar. 2010. Quantifying the limits and success of extractive summarization systems across domains. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 903–911. Association for Computational Linguistics.

Gregory Grefenstette. 1998. Producing intelligent telegraphic text reduction to provide an audio scanning service for the blind. In Working Notes of the AAAI Spring Symposium on Intelligent Text Summarization, pages 111–118. The AAAI Press, Menlo Park, CA.

Meishan Hu, Aixin Sun, and Ee-Peng Lim. 2007. Comments-oriented blog summarization by sentence extraction. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM ’07, pages 901–904, New York, NY, USA. ACM.

Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC ’00, pages 310–315, Stroudsburg, PA, USA. Association for Computational Linguistics.

Anna Kazantseva and Stan Szpakowicz. 2010. Summarizing short stories. Computational Linguistics, 36(1):71–109.

Kevin Knight and Daniel Marcu. 2000. Statistics-based summarization - step one: sentence compression. In AAAI/IAAI 2000, pages 703–710.

Klaus Krippendorff. 1980. Content Analysis: An Introduction to Its Methodology. Sage CommText Series. Sage Publications.

Chang-Shing Lee, Zhi-Wei Jian, and Lin-Kai Huang. 2005. A fuzzy ontology and its application to news summarization. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(5):859–880.

Chin-Yew Lin. 2003. Improving summarization performance by sentence compression: a pilot study. In Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages - Volume 11, pages 1–8. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Elena Lloret and Manuel Palomar. 2009. A Gradual Combination of Features for Building Automatic Summarisation Systems, pages 16–23. Springer Berlin Heidelberg, Berlin, Heidelberg.

Hans Peter Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165.

Rada Mihalcea and Hakan Ceylan. 2007. Explorations in automatic book summarization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 380–389.
4 https://github.com/m-chanakya/shortstories
Stefan Riezler, Tracy H. King, Richard Crouch, and Annie Zaenen. 2003. Statistical sentence condensation using ambiguity packing and stochastic disambiguation methods for lexical-functional grammar. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 118–125. Association for Computational Linguistics.

Simone Teufel and Marc Moens. 2002. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–445.

Peter D. Turney. 2000. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336.

Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: practical automatic keyphrase extraction. CoRR, cs.DL/9902007.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 78–82, Santa Fe, New Mexico, USA, August 20, 2018.
Design of a Tigrinya Language Speech Corpus
for Speech Recognition
Hafte Abera
Addis Ababa University
Addis Ababa, Ethiopia
hafte.abera@aau.edu.et
Sebsibe H/Mariam
Addis Ababa University
Addis Ababa, Ethiopia
Abstract
In this paper, we describe the first Tigrinya language speech corpus designed and developed for speech recognition purposes. Tigrinya, often written as Tigrigna (ትግርኛ) /tɪˈɡrinjə/, belongs to the Semitic branch of the Afro-Asiatic languages and shows characteristic features of a Semitic language. It is spoken by ethnic Tigray-Tigrigna people in the Horn of Africa. This paper outlines different corpus design processes and related work on the creation of speech corpora for different languages. The authors also describe the procedures that were used to create a speech recognition corpus for Tigrinya, an under-resourced language. One hundred and thirty native Tigrinya speakers were recorded for the training and test datasets. Each speaker read 100 texts, which consisted of syllabically rich and balanced sentences. Ten thousand sentences were used, which contained all of the contextual syllables and phones of Tigrinya.
1 Introduction
A speech corpus is defined as a collection of speech signals that is accessible in computer-readable form and has annotation, metadata and documentation that allow re-use of the data in house or by other scientists (Gibbon et al., 1997). A speech corpus is one of the fundamental requirements for developing speech recognition and synthesis systems and for analyzing the characteristics of speech signals. Moreover, for phonetic research, a speech corpus can provide diverse and accurate data to help researchers find the rules of languages. The main reason for preparing a Tigrinya speech corpus is, as a first step, to explore the possibility of developing a Tigrinya speech recognition system (Li & Zu, 2006). Tigrinya, often written as Tigrigna (ትግርኛ) /tɪˈɡriːnjə/, belongs to the Semitic branch of the Afro-Asiatic languages and shows characteristic features of a Semitic language (Tewolde, 2002; Berhane, 1991). It is spoken by ethnic Tigray-Tigrigna people in the Horn of Africa.
Speech corpus design is one of the key issues in building high-quality speech recognition systems (mostly using read speech). Radová (1998) pointed out that most speech corpora contain read speech, either for practical reasons, because annotating non-read speech is more difficult, or simply because the intended application or investigation requires read speech. For the first reason, a read speech corpus was prepared and used for this work.
The preparation of any type of speech corpus is normally a project on its own and handled on the
basis of an agreement between corpus producers and corpus users. However, in cases like this study,
where the required corpus is not available, speech recognition experiments are conducted on the newly
produced corpus. The advantage in the latter case is that the corpus is produced with full and specific
knowledge of its intended use.
According to Li & Yin (2006), the development of speech corpora has created new problems: (1) many corpora have been established, and much money and time have been invested in them; and (2) these corpora are difficult to share among different affiliations. The foremost cause of this problem is the lack of general specifications for corpus collection, annotation, and distribution. In order to solve this problem, standardization research on speech corpora is necessary and specifications should be stipulated (Li & Yin, 2012).
This work is licenced under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
2 Benchmarks and procedure for speech corpus construction
The Tigrinya Speech Recognition Corpus has been designed to satisfy a set of guidelines, which specify the required quality of the speech data and the proportional distribution of data across different speaker characteristics. In this paper the authors provide a brief overview of the criteria defined for the speech corpus, adopted from Li & Zu (2006). The criteria are explained in the following sub-sections.
2.1 Specification of speakers
Any speech corpus should represent speakers of both genders in equal proportions and contain speech from speakers of different ages. In this work, the speakers were grouped into six age groups, as shown in Table 1: from 18 to 22 (up to 19%), from 23 to 27 (at least 25%), from 28 to 32 (at least 25%), from 33 to 37 (at least 15%), from 38 to 42 (at least 13%), and from 43 to 70 (at least 4%). The share of the oldest group is relatively small, since we did not have enough speakers to cover that age group. The researchers excluded the speech of children and of people older than 70, because the speech characteristics of these groups are quite different and would require training separate acoustic models.
Age Range    Training set (M/F)    Test set (M/F)    Adaptation set (M/F)    Total
18-22        10 / 9                1 / 1             3 / 1                   25
23-27        12 / 10               1 / 2             4 / 3                   32
28-32        14 / 12               1 / 2             1 / 2                   32
33-37        8 / 8                 1 / 1             1 / 1                   20
38-42        9 / 6                 0 / 0             1 / 1                   17
43-70        1 / 1                 0 / 0             1 / 1                   4
Total        54 / 46               4 / 6             11 / 9                  130
Table 1: Age and Sex Distribution of the Readers
2.2 Specification of corpus design
Once the text corpus was available, the next important step in preparing a read speech corpus was recording the speech. In the recording of the selected sentences, each speaker was asked to read exactly what was presented to him or her. The text to be read was presented on a mobile phone. In read speech recording, the degree of control is very high. For example, during the recording, each utterance of the speaker can be checked directly for errors, and if an error is found, the speaker is asked to re-read the text.
The recordings of the Tigrigna speech corpus were made in an office environment, as well as outside the office. The text was read by 130 native Tigrigna speakers. For recording purposes, the mobile application Lig-Aikuma, which displays one sentence at a time for the speaker to read, was used. All recording was done in the presence of a researcher. The purpose of the project was first explained to each speaker, who was then instructed what to do. The recording session was controlled by the researcher, which included running the recording program, starting and stopping the recording session, playing back the recorded speech, re-recording sentences (if required), and moving to the next sentence. Every speaker was instructed to start the recording when he or she was ready. After each reader’s session was finished, all the utterances were listened to by both the reader and the researcher for possible corrections.
The Tigrigna speech corpus was designed according to best-practice guidelines established for other languages. Standard speech corpora, such as the Amharic speech corpus (Abate et al., 2005), consist of a training set, a speaker adaptation set, and test sets. To make it comparable with commonly used standard corpora, the Tigrinya corpus was designed to contain the same components.
The training set consists of a total of ten thousand different sentences, read by 100 speakers of different Tigrai dialects. As already mentioned, Table 1 shows the age and sex distribution of all speakers. Due to time constraints, it was difficult to keep the age and sex balance in the speaker sample.
The test and speaker adaptation sets were read by 10 and 20 speakers, respectively, from different dialects. For the test sets, different sentences were selected for each speaker. Table 2 shows the number of sentences that were automatically selected from the text database and read for the collection of speech data for the corpus.
Data set                 Selected sentences    Recorded sentences    Duration (hours)
Training set             10,000                10,000                17:57
Speaker adaptation set   69                    69                    2:30
Development test set     1,000                 600                   2:10
Evaluation test set      1,000                 400                   1:20
Table 2: Elements of Tigrinya Speech Corpus
2.3 Specification of recording
Data collection was carried out using an improved version of the Android application Aikuma, originally developed by Steven Bird and colleagues (Blachon et al., 2016). The resulting app, called Lig-Aikuma, runs on various mobile phones and tablets and offers a range of different speech collection modes (recording, re-speaking, translation and elicitation). Lig-Aikuma’s improved features include smart generation and handling of speaker metadata, as well as re-speaking and parallel audio data mapping. For every mode, a metadata file is saved with the recording file. Metadata forms are filled in before any recording. In addition, the metadata have been enriched with new details about the languages (language of the recording, mother tongue of the speaker, other languages spoken) and about the speaker (name, age, gender, region of origin). Moreover, in order to save time, a feature saves the latest metadata as a session and uses it to preload the form the next time it is needed. The files are named using the following format: DATE-TIME_DEVICE-NAME_LANG. For example, 2016-02-07-01-32-18_HUAWEI-HUAWEI Y625-U32_tir is the name of a recording made on February 7, 2016 at 01:32:18, in Tigrinya, on a HUAWEI device.
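The file-name convention can be unpacked mechanically. The sketch below assumes the underscore-delimited layout suggested by the example; device names containing underscores would need a more careful split.

```python
def parse_recording_name(name):
    """Split a Lig-Aikuma-style recording name into its three fields:
    DATE-TIME, DEVICE-NAME and LANG, separated by underscores.
    Assumes the layout of the example in the text; real files may vary."""
    timestamp, device, lang = name.split("_")
    date = timestamp[:10]                    # YYYY-MM-DD
    time = timestamp[11:].replace("-", ":")  # HH:MM:SS
    return {"date": date, "time": time, "device": device, "lang": lang}

info = parse_recording_name("2016-02-07-01-32-18_HUAWEI-HUAWEI Y625-U32_tir")
```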
Following the general trend of speech recognition system development (e.g., Abate et al., 2005; Blachon et al., 2016; De Vries et al., 2011), all speech data were sampled at 16 kHz with 16 bits allocated per sample. In order to develop speech recognition systems that can be executed in different environments and used for a wide range of purposes, we should be able to build speech acoustic models that are robust and representative of noise in different environments (Li & Zu, 2006). Therefore, the Tigrinya speech recognition corpus has been developed to include speech data with different types of background noise (office, street, in-car, etc.) and with different signal-to-noise ratios (SNR). The majority of the data, however, contain a relatively low level of noise (the SNR being between 20 and 85 dB). Overlapping speech segments (with two or more simultaneous speakers) are not included in the corpus. The Tigrigna speech corpus consists of utterances in the Tigrinya literary (formal) language, pronounced, however, by a variety of speakers, including speakers of both the Southern and Northern Tigrinya dialects (Berhane, 1991).
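The quoted SNR figures are on the usual decibel scale; for reference, the conversion from signal and noise power is:

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from average signal
    and noise powers: SNR = 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)
```

A power ratio of 100:1 corresponds to 20 dB, the low end of the range above.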
2.4 Specification of annotation
Following recent research on speech recognition corpora for other languages (Abushariah et al., 2012; Radová, 1999; Arora et al., 2004), the corpus has been designed to be phonetically balanced, in order to be representative of natural speech, and phonetically rich, so that the trained acoustic models would efficiently generalize over different speakers with different characteristics. Abera et al. (2016) previously showed how to analyze and create corpora with respect to syllabic richness and balance. An example of an orthographically annotated utterance is given in Figure 1.
2.5 Availability
Thus far the corpus has been used for our research in developing an automatic speech recognizer for Tigrinya. As our research is in its final phase, we intend to make the corpus available to researchers and developers through a third party.
Figure 1: Example of the Phonologically Annotated Utterance ፃዕሪ ግብፅን ንግድ ኢትዮጵያን እቶም ነዊሕ
ዝቚመቶም ሰብ ሳባን ናባኻ ኺሐልፉ ናትካውን ኪዀኑ እዮም “Tsaeri gbtsn ngd etyopyan etom newih zquametom
seb saban nabaKa KiHalf Natkawn kiKuinu eyom”
3 Conclusion and future work
In this paper the authors presented the overall design of the Tigrinya Speech Recognition Corpus. The corpus is the first database available to train and test an automatic speech recognition system that takes into account dialectal variation in Tigrinya. The corpus consists of 17.57 hours of speech audio data for training and 3.3 hours of phonetically annotated speech audio data for development and evaluation. The corpus is both phonetically rich and balanced. This paper also described the reasoning behind different choices made during the development of the corpus. We believe that this paper can serve as a guideline for other researchers who are developing, or want to develop, speech recognition corpora for under-resourced languages.
In the years to come, the Tigrinya Speech Recognition Corpus will be an asset for further research in speech recognition in general, as well as in speech synthesis of Tigrinya, a language that did not previously have a dedicated speech recognition corpus. In addition, the speech corpus should be able to play a crucial role in linguistic research, such as comparing differences in pronunciation between men and women. Additional experiments with this corpus are anticipated to improve the performance of our speech recognition system.
References

Solomon Teferra Abate. 2006. Automatic speech recognition for Amharic. Ph.D. thesis, University of Hamburg, Hamburg.

Solomon Teferra Abate, Wolfgang Menzel, and Bairu Tafila. 2005. An Amharic speech corpus for large vocabulary continuous speech recognition. In Eurospeech, 9th European Conference on Speech Communication and Technology, pages 1601-1604, Lisbon, Portugal.

Hafte Abera, Climent Nadeu, and Sebsibe H. Mariam. 2016. Extraction of syllabically rich and balanced sentences for Tigrigna language. In 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, pages 2094-2097, Jaipur, India.

Mohammad A. M. Abushariah, Raja N. Ainon, Roziati Zainuddin, Moustafa Elshafei, and Othman O. Khalifa. 2012. Phonetically rich and balanced text and speech corpora for Arabic language. Language Resources and Evaluation, 46(4):601-634.

Karunesh Arora, Sunita Arora, Kapil Verma, and Shyam Sunder Agrawal. 2004. Automatic extraction of phonetically rich sentences from large text corpus of Indian languages. In Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP), pages 2885-2888, Jeju, Korea.

David Blachon, Elodie Gauthier, Laurent Besacier, Guy-Noël Kouarata, Martine Adda-Decker, and Annie Rialland. 2016. Parallel speech collection for under-resourced language studies using the Lig-Aikuma mobile device app. Procedia Computer Science, 81:61-66.

Nic De Vries, Jaco Badenhorst, Marelie H. Davel, Etienne Barnard, and Alta De Waal. 2011. Woefzela - an open-source platform for ASR data collection in the developing world. In Proceedings of INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, pages 3177-3180, Florence, Italy.

Dafydd Gibbon, Roger Moore, and Richard Winski. 1997. Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, Berlin, New York.

Ai-jun Li and Zhi-gang Yin. 2007. Standardization of speech corpus. Data Science Journal, 6:S806-S812.

Ai-jun Li and Yiqing Zu. 2006. Corpus design and annotation for speech synthesis and recognition. Advances in Chinese Spoken Language Processing, pages 243-268.

Vlasta Radová. 1998. Design of the Czech speech corpus for speech recognition applications with a large vocabulary. In Text, Speech, Dialogue: Proceedings of the First Workshop on Text, Speech, Dialogue, pages 299-304, Brno, Czech Republic.

Vlasta Radová, Petr Vopálka, and P. Ircing. 1999. Methods of phonetically balanced sentences selection. In Proceedings of the 3rd Multiconference on Systemics, Cybernetics and Informatics, pages 334-339, Orlando, USA.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 83–90, Santa Fe, New Mexico, USA, August 20, 2018.
Parallel Corpora for Bi-Directional Statistical Machine Translation for
Seven Ethiopian Language Pairs
Solomon Teferra Abate
Addis Ababa University,
Addis Ababa, Ethiopia
Michael Melese Woldeyohannis
Addis Ababa University,
Addis Ababa, Ethiopia
Martha Yifiru Tachbelie
Addis Ababa University,
Addis Ababa, Ethiopia
Million Meshesha
Addis Ababa University,
Addis Ababa, Ethiopia
Solomon Atinafu
Addis Ababa University,
Addis Ababa, Ethiopia
Wondwossen Mulugeta
Addis Ababa University,
Addis Ababa, Ethiopia
Yaregal Assabie
Addis Ababa University,
Addis Ababa, Ethiopia
Hafte Abera
Addis Ababa University,
Addis Ababa, Ethiopia
Biniyam Ephrem
Addis Ababa University,
Addis Ababa, Ethiopia
Tewodros Abebe
Addis Ababa University,
Addis Ababa, Ethiopia
Wondimagegnhue Tsegaye
Bahir Dar University,
Bahir Dar, Ethiopia
Amanuel Lemma
Aksum University,
Axum, Ethiopia
Tsegaye Andargie
Wolkite University,
Wolkite, Ethiopia
Seifedin Shifaw
Wolkite University,
Wolkite, Ethiopia
Abstract*
In this paper, we describe the development of parallel corpora for Ethiopian languages: Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. To check the usability of all the corpora, we conducted baseline bi-directional statistical machine translation (SMT) experiments for seven language pairs. The performance of the bi-directional SMT systems shows that all the corpora can be used for further investigations. We have also shown that the morphological complexity of the Ethio-Semitic languages has a negative impact on the performance of the SMT, especially when they are target languages. Based on the results we obtained, we are currently working towards handling the morphological complexities to improve the performance of statistical machine translation among the Ethiopian languages.
* This work is licenced under a Creative Commons Attribution 4.0 International License. Page numbers and proceedings
footer are added by the organizers. License details: http://creativecommons.org/licenses/by/4.0/
1 Introduction
The advancement of technology and the rise of the internet as a means of communication have led to an ever-increasing demand for Natural Language Processing (NLP) applications. NLP applications are useful in facilitating human-human communication via computing systems. One of the NLP applications that facilitates human-human communication is machine translation. Machine translation (MT) refers to a process by which computer software is used to translate a text from one language to another (Koehn, 2009). Given the high volume of digital text, the ideal aim of machine translation systems is to produce the best possible translation with minimal human intervention (Hutchins, 2005).
The translation of natural language by machine became a reality, for technologically favored languages, in the late 20th century, although it has been dreamt of since the seventeenth century (Hutchins, 1995). Various approaches to MT have been and are being used in the research community. These approaches are broadly classified into rule-based and corpus-based MT (Koehn, 2009). Rule-based machine translation demands various kinds of linguistic resources, such as morphological analyzers and synthesizers, syntactic parsers, semantic analyzers and so on. On the other hand, corpus-based approaches (as the name implies) require parallel and monolingual corpora. Since corpus-based approaches do not require deep linguistic analysis of the source and target languages, they are the preferred approach for under-resourced languages of the world, including Ethiopian languages.
1.1 Machine Translation for Ethiopian Languages
Research in the development of MT has been conducted for technologically favored and economically as well as politically important languages of the world since the 17th century. As a result, notable progress towards the development and use of MT systems has been made for these languages. However, research in the area of MT for Ethiopian languages, which are under-resourced as well as economically and technologically disadvantaged, has started only recently. Most of the research on MT for Ethiopian languages has been conducted by graduate students (Tariku, 2004; Sisay, 2009; Eleni, 2013; Jabesa, 2013; Akubazgi, 2017), including two PhD works: one that tried to integrate Amharic into a unification-based machine translation system (Sisay, 2004) and another that investigated English-Amharic statistical machine translation (Mulu, 2017). Besides this, Michael and Million (2017) experimented with a bi-directional Amharic-Tigrigna SMT system using words and morphemes as units.
Due to the unavailability of linguistic resources, and since the most widely used MT approach is statistical, most of these studies have been conducted using statistical machine translation (SMT), which requires large bilingual and monolingual corpora. However, as there were no such corpora for SMT experiments on Ethiopian languages, the researchers had to prepare their own small corpora for their experiments. This, in turn, affects the results they obtain.

In addition, since there are no standard corpora for conducting replicable and consistent experiments for performance evaluation, it is difficult to know the progress made in the area for local languages. Moreover, since the researchers had to spend their time on corpus preparation, they usually had limited time for experimentation, exploration and development of MT systems.
1.2 Motivation of this Paper
African languages, which constitute around 30% (2,139) of the world’s languages, suffer greatly from the lack of sufficient language resources (Simons and Fennig, 2017). This is true for Ethiopian languages. On the other hand, Ethiopia being a multilingual and multi-ethnic country, its constitution decrees that each citizen has the right to speak, write and develop in his/her own language. However, there is still a need to share information among citizens who speak different languages. For example, Amharic is the regional language of the Amhara and Southern Nations and Nationalities regions, Afan-Oromo is that of the Oromia region, while the Tigray region uses Tigrigna. All these regions produce a lot of information that needs to be shared with the other regions of the nation. There is, therefore, a lot of translation demand among the different language communities of the federal government of Ethiopia.
In order to enable the citizens of the country to use the documents and the information produced in other Ethiopian languages, the documents need to be translated into the languages they understand best. Since manual translation is expensive, a promising alternative is the use of machine translation, particularly SMT, as Ethiopian languages suffer from a lack of basic linguistic resources such as morphological analysers, syntactic analysers, morphological synthesizers, etc. The major and basic resource required for SMT is parallel corpora, which are not available for Ethiopian languages. The collection and preparation of parallel corpora for Ethiopian languages is, therefore, an important endeavour to facilitate future MT research and development.
We have, therefore, collected and prepared parallel corpora for seven Ethiopian language pairs, taking representatives from the Semitic, Cushitic and Omotic language families. We have considered Amharic, Tigrigna and Ge’ez from the Semitic, Afan-Oromo from the Cushitic and Wolaytta from the Omotic language families. This paper, therefore, describes the parallel corpora we collected and prepared for these Ethiopian languages and the SMT experiments conducted using the corpora as a way of verifying their usability.
2 Nature of the Language Pairs
The language pairs in the corpora belong to the Semitic (Ge’ez, Amharic and Tigrigna), Cushitic (Afan-Oromo) and Omotic (Wolaytta) language families. Except for Ge’ez, these languages have native speakers. Ge’ez serves as the liturgical language of the Ethiopian Orthodox Church. It is taught as a second language in the traditional schools of the church and given as a course in different universities. There is a rich body of literature in Ge’ez, including philosophical, medical and astrological writings. Because of this, there is a big initiative to translate the documents written in Ge’ez into other widely used languages. On the other hand, Amharic is spoken by more than 27 million people, which makes it the second most spoken Semitic language in the world. Tigrigna is spoken by 9 million people. Afan-Oromo and Wolaytta are spoken by more than 34 million and 2 million speakers, respectively (Simons and Fennig, 2017).

The writing systems of these language pairs are the Ge’ez (Ethiopic) script and the Latin alphabet. Ge’ez, Amharic and Tigrigna are written in the Ge’ez script, whereas both Afan-Oromo and Wolaytta are written in the Latin alphabet. It is believed that the earliest known writing in the Ge’ez script dates back to the 5th century BC. The Ge’ez script is a syllabary in which each character represents a consonant and a vowel. Each character gets its basic shape from the consonant of the syllable, and the vowel is represented through systematic modifications of the basic shape. The script is also used to write other languages like Argobba, Harari, Gurage, etc.
The languages have different functions in the country. Amharic, for instance, is the working language of the Federal Government of Ethiopia. It also serves as the regional working language of some other regional states and facilitates inter-regional communication. Tigrigna and Afan-Oromo are working languages in the Tigray and Oromia regional administrations, respectively. Apart from this, they serve as media of instruction in primary and secondary schools. These languages are also used widely in electronic media like news, blogs and social media. Some of the governmental websites are available in Amharic, Tigrigna and Afan-Oromo. Currently, Google offers search capability in these Ethiopian languages, and it recently added Amharic to its translation service.
2.1 Morphological Features
Like other Semitic languages, Ge’ez (Dillmann and Bezold, 1907), Amharic (Leslau, 2000; Anbessa and Hudson, 2007) and Tigrigna (Mason, 1996; Yohannes, 2002) make use of the root-and-pattern system. In these languages, a root (also called a radical) is a set of consonants which bears the basic meaning of the lexical item, whereas a pattern is composed of a set of vowels inserted between the consonants of the root. These vowel patterns, together with affixes, result in derived words. Such derivational processes make these languages morphologically complex.

In addition to morphological information, some syntactic information is also expressed at the word level. Furthermore, an orthographic word may attach syntactic words like prepositions, conjunctions, negation markers, etc., which creates various word forms (Gasser, 2010; Gasser, 2011). In these languages, nominals are inflected for number, gender, definiteness and case, whereas verbs are inflected for person, number, gender, tense, aspect and mood (Griefenow-Mewis, 2001).
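The root-and-pattern mechanism can be illustrated schematically, in Latin transliteration with a made-up root and vowel pattern; real Semitic templates involve gemination and affixes and are far more intricate.

```python
from itertools import zip_longest

def apply_pattern(root, vowels):
    """Toy illustration of root-and-pattern morphology: interleave a
    consonantal root (the lexical meaning) with a vowel pattern (the
    grammatical template). Purely schematic, not real Amharic or Ge'ez."""
    return "".join(c + v for c, v in zip_longest(root, vowels, fillvalue=""))
```

Here `apply_pattern(list("sbr"), ["a", "a"])` yields "sabar", while a different vowel pattern applied to the same root would yield a different derived form, which is why the number of surface word forms grows so quickly in these languages.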
Unlike the Semitic languages, which allow prefixing, Afan-Oromo essentially allows only suffixing. Most functional words, like postpositions, are also suffixed. However, there are some prepositions written as separate words.
85
Wolaytta, like Afan-Oromo, is a suffixing language in which words can be generated from root words recursively by adding suffixes only. Wolaytta nouns are inflected for number, gender and case, whereas verbs are inflected for person, number, gender, aspect and mood (Wakasa, 2008).
2.2 Syntactic Features
The Ethiopian languages under our consideration follow Subject-Object-Verb (SOV) word order, except for Ge’ez, which allows the verb to come first: in Ge’ez, the basic word order is Verb-Subject-Object (VSO).
3 Challenges of SMT
Statistical machine translation is greatly impacted by the linguistic features of the target languages. The challenges range from the writing system to word ordering and morphological complexity.
3.1 Writing System
The Ge’ez writing system, used by Amharic, Tigrigna and Ge’ez, includes different characters that represent the same sound, so words that convey the same meaning can be spelled in different ways, especially in Amharic. For example, ‘peace’ can be written as ሰላም or ሠላም. Such character variations affect the probability estimates, which has a direct impact on the performance of SMT systems.
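One common mitigation is to normalize such homophonic characters to a single canonical form before training. The sketch below maps only the pair from the example above; a real normalizer would cover every homophone series in the script.

```python
# Map each variant character to a canonical one. Only the pair from the
# example above is included; a full table would list all homophone series.
HOMOPHONES = str.maketrans({"ሠ": "ሰ"})

def normalize(text):
    """Collapse spelling variants so that e.g. ሰላም and ሠላም receive
    identical counts during SMT training."""
    return text.translate(HOMOPHONES)
```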
3.2 Word Ordering
Most of the languages under consideration have the same word order. In this respect, Amharic, Afan-
Oromo, Tigrigna and Wolaytta are SOV, while only Ge'ez is VSO. This challenges machine
translation systems in which Ge'ez is part of the pair. Another challenge is flexibility in word
order. For instance, even though Afan-Oromo follows SOV word order, nouns are marked according
to their role in a sentence, which makes the word order flexible. Such flexibility poses a
challenge for translation from a source language into Afan-Oromo.
3.3 Morphological Complexity
While word alignment can be done automatically or with supervision, morphological agreement
between words in the source and target is crucial. For instance, Amharic and Ge'ez have subject
agreement, object agreement and genitive (possessive) agreement, each of which is expressed by
bound morphemes. In Amharic, in the word ገድልክ /you killed/, the subject "you" is represented by
the suffix "+ክ", while the same subject is represented as "+ከ" in the Ge'ez ቀተልከ /you killed/. Most
of the morphemes in the considered Ethiopian languages are bound morphemes.
4 Parallel Corpora Preparation
The development of machine translation more often uses the statistical approach because it requires
very limited computational linguistic resources compared to the rule-based approach. Nevertheless,
the statistical approach relies to a great extent on parallel corpora of the source and target
languages.
The research team applied different techniques to collect parallel corpora for the selected
Ethiopian language pairs. The collected data are restricted to the religious domain, the only
domain for which we have data for all the considered language pairs. They include the Holy Bible
and various documents on spiritual themes collected from Jehovah's Witnesses (JW)1, Ethiopicbible2,
Ebible3 and Geez Experience4, which are freely accessible websites.
A simple web crawler was used to extract parallel text from the websites. Python libraries such
as requests and BeautifulSoup were used to analyse the structure of the websites, extract the texts
1 available at https://www.jw.org
2 available at https://www.ethiopicbible.com
3 available at http://ebible.org
4 available at https://www.geezexperience.com
and combine them into a single text file. To collect the Bible data, we generated the URL structure
so that it reflects the book names, chapters and verses of the Bible in each language.
For the "daily text" published at JW.org, we used the date information to generate a URL for
each language. Finally, we extracted the data based on the generated URLs and merged them into
a single UTF-8 text file for each language.
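The URL-generation step can be sketched as follows (a minimal sketch; the function name and the URL pattern are hypothetical, since the actual site layouts differ):

```python
# Hypothetical sketch of the URL-generation step: build one URL per chapter
# so the same book/chapter structure can be fetched for each language.
def generate_bible_urls(base, lang, book, num_chapters):
    return [f"{base}/{lang}/{book}/{chapter}"
            for chapter in range(1, num_chapters + 1)]

urls = generate_bible_urls("https://example.org/bible", "am", "genesis", 3)
print(urls[0])  # -> https://example.org/bible/am/genesis/1
```

Each generated URL would then be fetched with requests and its text extracted with BeautifulSoup, as described above.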
4.1 Pre-processing
Data preprocessing is an important and basic step in preparing bilingual and multilingual parallel
corpora. Since the collected parallel data have different formats and characteristics, preparing
usable parallel corpora manually is very difficult and time-consuming, as it requires analysing the
structure of the collected raw data with different linguistic methods. We have, therefore, applied
automatic text pre-processing methods that require minimal human intervention. As part of the
pre-processing, unnecessary links, numbers, symbols and foreign-language text in each language
have been removed. During pre-processing, the following tasks have been performed: character
normalization, sentence tokenization and sentence-level alignment.
4.1.1 Character Normalization
As indicated in Section 3.1, there are characters in Amharic that have similar roles and are
redundant. For example, the character ሀ can also be written as ሐ, ሓ, ኃ, ኅ or ሃ. Though these
characters used to carry semantic differences in traditional writing, they are now mostly used
interchangeably. To prevent words with the same meaning from being treated as different words
due to these variations, we replaced each set of characters with the same function by the single
most frequently used character.
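This replacement can be implemented as a simple character translation table (a minimal sketch; the variant-to-canonical mapping below covers only the characters mentioned above):

```python
# Map homophonic Amharic character variants to a single canonical character.
# Only the variants discussed in the text are included here.
NORMALIZATION_TABLE = str.maketrans({
    "ሐ": "ሀ", "ሓ": "ሀ", "ኃ": "ሀ", "ኅ": "ሀ", "ሃ": "ሀ",
    "ሠ": "ሰ",  # so both spellings of "peace" normalize to ሰላም
})

def normalize(text):
    return text.translate(NORMALIZATION_TABLE)

print(normalize("ሠላም"))  # -> ሰላም
```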
4.1.2 Sentence Tokenization and Alignment
Lines that contain multiple sentences in either the source or the target language are tokenized.
The team set two criteria to check whether the aligned sentences are correct. The first criterion
is counting and matching the number of sentences in the source language and the target language.
In the parallel corpora of the language pairs in which Ge'ez is the target, the source language
contains multiple verses in a single line, while on the Ge'ez side each line contains a single
verse. In such cases, we merged the corresponding Ge'ez verses to produce a line that is aligned
with that of the source language.
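The verse-merging step can be sketched as follows (a minimal sketch under an assumed data layout: each source line is known to span n verses, while the Ge'ez side holds one verse per line):

```python
# Merge consecutive single-verse Ge'ez lines so that each output line
# matches one multi-verse line on the source side.
def merge_verses(geez_lines, verses_per_source_line):
    merged, i = [], 0
    for n in verses_per_source_line:
        merged.append(" ".join(geez_lines[i:i + n]))
        i += n
    return merged

print(merge_verses(["v1", "v2", "v3"], [2, 1]))  # -> ['v1 v2', 'v3']
```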
4.2 Corpus Size and Distribution of Words
The corpora have been analysed to examine the relationship between the languages in each language
pair. As reported in the literature, the Ethio-Semitic languages have more complex morphology
than the other Ethiopian languages. Due to this difference, the same number of sentences in these
language pairs is tokenized into significantly different numbers of tokens and word types. Table 1
clearly shows that the vocabulary of the languages in the Ethio-Semitic language family is much
larger than the vocabulary of the other two language families.
Sentences  Language     Tokens   Types   Avg. sentence length
34,349     Amharic      521,035  98,841  15
           Tigrigna     546,570  87,649  15
11,546     Amharic      148,084  38,097  12
           Ge'ez        158,003  33,386  13
11,457     Amharic      163,816  37,283  14
           Afan-Oromo   214,335  24,005  18
10,987     Tigrigna     162,508  32,953  14
           Afan-Oromo   206,844  23,536  18
9,400      Amharic      119,262  32,780  12
           Wolaytta     137,869  25,331  14
2,923      Afan-Oromo   46,340   8,118   15
           Wolaytta     33,828   8,786   11
2,504      Tigrigna     34,780   9,864   13
           Wolaytta     29,458   7,989   11

Table 1: Sentence and Word Distribution of the Parallel Corpora
Conversely, the token counts of the non-Semitic languages are significantly higher than those of
the Ethio-Semitic languages. This is because syntactic words like prepositions, conjunctions and
negation markers are bound in the Ethio-Semitic language group. It is clear, therefore, that such
differences between the languages in a language pair make SMT difficult, because they aggravate
data sparsity and result in a weakly trained translation model. Although our data are too small to
draw firm conclusions, we could also see that the Ethio-Cushitic and Ethio-Omotic languages are
morphologically more similar to each other than to the Ethio-Semitic languages.
We have also observed morphological differences among the Ethio-Semitic languages themselves,
revealed by the differences in the numbers of tokens and word types in the same corpora for the
Amharic-Tigrigna and Amharic-Ge'ez language pairs. The data reveal that Amharic is the most
morphologically complex language of the family.
5 SMT Experiments and Results
To check the usability of the collected parallel corpora for seven Ethiopian language pairs, we have
conducted bi-directional SMT experiments.
5.1 Experimental Setup
To conduct the SMT experiments, each parallel corpus was divided into three sets: 80% for
training, 10% for tuning and 10% for testing. Moses (Koehn, 2009) was used along with the
Giza++ alignment tool (Och and Ney, 2003) for aligning words and phrases. The SRILM toolkit
(Stolcke, 2002) was used to build the language models from the target-language sentences of the
training and tuning sets of the parallel corpora. Bilingual Evaluation Understudy (BLEU) is used
for automatic scoring.
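The 80/10/10 split can be sketched as follows (a minimal sketch; the paper does not specify whether the sentence pairs were shuffled before splitting):

```python
# Split a list of sentence pairs into 80% training, 10% tuning, 10% test.
def split_corpus(pairs):
    n = len(pairs)
    n_train, n_tune = int(0.8 * n), int(0.1 * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_tune],
            pairs[n_train + n_tune:])

train, tune, test = split_corpus(list(range(100)))
print(len(train), len(tune), len(test))  # -> 80 10 10
```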
5.2 Experimental Results
Table 2 presents the experimental results of the bi-directional SMT systems developed for the
seven Ethiopian language pairs. The table shows the difference in the performance of the systems
developed for the same language pair in different directions.
Sentences  Language pair           BLEU
34,349     Amharic - Tigrigna      21.22
           Tigrigna - Amharic      19.06
11,457     Amharic - Afan Oromo    17.79
           Afan Oromo - Amharic    13.11
10,987     Tigrigna - Afan Oromo   16.82
           Afan Oromo - Tigrigna   14.61
9,400      Amharic - Wolaytta      11.23
           Wolaytta - Amharic       7.17
11,546     Ge'ez - Amharic          7.31
           Amharic - Ge'ez          6.29
2,923      Wolaytta - Afan Oromo    4.73
           Afan Oromo - Wolaytta    2.73
2,504      Tigrigna - Wolaytta      2.2
           Wolaytta - Tigrigna      3.8

Table 2: Experimental Results
The performance of the SMT systems decreases when Ethio-Semitic languages are on the target
side. This confirms that the Ethio-Semitic languages, as targets, are more challenging for SMT
than the other language families. The only exception is the Tigrigna-Wolaytta pair, where
performance is higher when Tigrigna is the target. This could be attributed to the small amount
of data we have for this language pair.
The results in Table 2 also show the effect of data size on the performance of SMT systems: as
the data size increases, performance also increases. Here again, however, we obtain an exceptionally
low BLEU score for the Amharic-Ge'ez language pair compared to the Amharic-Afan Oromo pair,
although the data sizes used are almost equal. The Amharic-Wolaytta and Wolaytta-Amharic systems
perform better than the Amharic-Ge'ez and Ge'ez-Amharic systems, although the data size used in
the former is smaller than in the latter. These results confirm that the morphological complexity
of the languages affects SMT performance more severely than the amount of data does. From the
difference between the results for Amharic-Ge'ez (6.29) and Ge'ez-Amharic (7.31), it appears
that syntactic differences affect the performance of SMT more than differences in morphological
features. We have seen from the word-type counts that Amharic has more complex morphology than
Ge'ez, which, however, has a more flexible syntactic structure.
Regardless of data size, translation into the Ethio-Semitic languages achieves lower BLEU scores
than translation from them. This is because, when the Ethio-Semitic languages, especially Amharic,
are used as the target, translation from morphologically less complex source languages is challenged
by one-to-many alignment. In the other direction, better performance is registered since the
alignment is many-to-one. Besides this, the word-based language model favours the non-Semitic
languages over the Semitic ones due to the morphological complexity of the latter language family.
6 Conclusions and Future Work
This paper presents our work on the preparation of usable parallel corpora for Ethiopian
languages. The corpora have been collected from the web in the religious domain and then
pre-processed and normalized. We now have usable parallel corpora for seven Ethiopian language
pairs. Using the corpora, bi-directional statistical machine translation experiments have been
conducted. The results show that translation systems from Ethio-Semitic languages into the Omotic
or Cushitic language families achieve better BLEU scores than those in the other directions. This
leads us to conclude that the Ethio-Semitic language family has the most complex morphology,
which greatly affects the performance of SMT.
Finding solutions that minimize the negative effect of the morphological complexity of the
Ethio-Semitic languages on the performance of SMT is therefore a future endeavour. Previous work
by Mulu (2017), who gained a significant improvement by applying morphological segmentation, will
guide us in using morphemes instead of words as units for both the translation and the statistical
language models. The most attractive current approach to machine translation is neural network
(ANN) modelling, which, however, requires more data than we have. We will therefore use our
experience in corpus preparation and work towards applying state-of-the-art technologies to
develop usable machine translation systems for the Ethiopian languages.
As is well known, domain is the most important factor in the performance of SMT. Thus, we are
also working on the development and organization of parallel corpora for the Ethiopian languages
in different domains.
Acknowledgement
We would like to acknowledge Addis Ababa University for the thematic research fund and NORHED
(Linguistic Capacity Building-Tools) for sponsoring conference participation.
References
Gebremariam Akubazgi. 2017. Amharic-Tigrigna machine translation using hybrid approach. Master's
thesis, Addis Ababa University.
Teferra Anbessa and Grover Hudson. 2007. Essentials of Amharic. Rüdiger Köppe, Verlag, Köln.
August Dillmann and Carl Bezold. 1907. Ethiopic Grammar, enlarged and improved by C. Bezold,
translated by J. A. Crichton. London: Williams & Norgate.
Teshome Eleni. 2013. Bidirectional English-Amharic machine translation: An experiment using
constrained corpus. Master's thesis, Addis Ababa University.
Michael Gasser. 2010. A dependency grammar for Amharic. In Proceedings of the Workshop on Language
Resources and Human Language Technologies for Semitic Languages, Valletta, Malta. ftp://html.soic.indiana.edu/pub/gasser/sem_ws10.pdf
Michael Gasser. 2011. Hornmorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya.
In Conference on Human Language Technology for Development, pages 94-99, Alexandria, Egypt.
Catherine Griefenow-Mewis. 2001. A grammatical sketch of written Oromo, volume 16. Rüdiger Köppe.
John Hutchins. 1995. Machine translation: A brief history. In Concise history of the language sciences, pages
431–445. Elsevier.
John Hutchins. 2005. The history of machine translation in a nutshell,
http://www.hutchinsweb.me.uk/Nutshell-2005.pdf.
Daba Jabesa. 2013. Bi-directional English-Afaan Oromo machine translation using hybrid approach.
Master's thesis, Addis Ababa University.
Philipp Koehn. 2009. Statistical machine translation. Cambridge University Press.
Wolf Leslau. 2000. Introductory grammar of Amharic, volume 21. Otto Harrassowitz Verlag.
John S Mason. 1996. Tigrinya grammar. Red Sea Press (NJ).
Melese Michael and Meshesha Million. 2017. Experimenting statistical machine translation for
Ethiopic Semitic languages: The case of Amharic-Tigrigna. In LNICST, EAI International Conference
on ICT for Development for Africa, pages 140-149. Springer.
Gebreegziabher Teshome Mulu. 2017. English-Amharic Statistical Machine Translation. PhD Dissertation,
IT Doctoral Program, Addis Ababa University, Addis Ababa, Ethiopia.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models.
Computational linguistics, 29(1):19–51.
Gary F Simons and Charles D Fennig. 2017. Ethnologue: Languages of the world. SIL, Dallas, Texas.
Adugna Sisay. 2009. English-Afan Oromo machine translation: An experiment using statistical
approach. Master's thesis, Addis Ababa University.
Fissaha Sisay. 2004. Adding Amharic to a Unification-based Machine Translation System: An Experiment.
Peter Lang.
Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Seventh International
Conference on Spoken Language Processing.
https://www.isca-speech.org/archive/icslp_2002/i02_0901.html
Tsegaye Tariku. 2004. English-Tigrigna factored statistical machine translation. Master's thesis,
Addis Ababa University.
Motomichi Wakasa. 2008. A descriptive study of the modern Wolaytta language. Unpublished PhD thesis,
University of Tokyo.
Tesfay Tewolde Yohannes. 2002. A modern Grammar of Tigrinya. Tipografia U. Detti.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 91–101, Santa Fe, New Mexico, USA, August 20, 2018.
Using Embeddings to Compare FrameNet Frames Across Languages

Jennifer Sikos
IMS, University of Stuttgart
Stuttgart, Germany

Sebastian Padó
IMS, University of Stuttgart
Stuttgart, Germany
Abstract
Much of the recent interest in Frame Semantics is fueled by the substantial extent of its applicability across languages. At the same time, lexicographic studies have found that the applicability of individual frames can be diminished by cross-lingual divergences regarding polysemy, syntactic valency, and lexicalization. Due to the large effort involved in manual investigations, there are so far no broad-coverage resources with "problematic" frames for any language pair.
Our study investigates to what extent multilingual vector representations of frames learned from manually annotated corpora can address this need by serving as a wide-coverage source for such divergences. We present a case study for the language pair English–German using the FrameNet and SALSA corpora and find that inferences can be made about cross-lingual frame applicability using a vector space model.
1 Introduction
Frame Semantics (Fillmore et al., 2003) is a theory of predicate-argument structure that describes meaning not at the level of individual words, but is instead based on the concept of a scenario or scene called a frame. Frames are defined by the group of words that evoke the scene (frame-evoking elements or FEEs), as well as their expected semantic arguments (frame elements). A JUDGMENT frame, for instance, would have FEEs praise, criticize, and boo and frame elements such as COGNIZER, EVALUEE, EXPRESSOR and REASON. The Berkeley FrameNet project (Baker et al., 1998) is the most well-known lexical resource based on Frame Semantics, and provides definitions for over 1200 frames.
A very attractive feature of Frame Semantics is that it can abstract away from individual words, which makes it a promising descriptive framework for events that generalize across languages (Boas, 2005; Lönneker-Rodman, 2007). However, there is a fundamental tension within Frame Semantics between the characterization of the scene, which is arguably relatively language-independent, and the characterization of the FEEs and frame elements, which can vary among even closely related languages. For example, FrameNet distinguishes between the frames OPERATE_VEHICLE (drive) and USE_VEHICLE (ride), which are often indistinguishable in German, at least for the very frequent verb fahren (drive/ride) (Burchardt et al., 2009). Another example is the well-known difference between Germanic and Romance languages regarding the conceptualization of motion events, where manner can be incorporated into the semantics of the FEE (run into a room) or externalized (enter a room running) (Subirats, 2009). In general, some frames are evidently more parallel than others cross-lingually (Tonelli and Pianta, 2008; Torrent et al., 2018; Vossen et al., 2018).
The question of whether a particular frame transfers across languages is often determined by linguists on the basis of large corpora, comparing either frames assigned to direct translations (Padó, 2007) or extracting "typical" uses of frames in the languages under consideration. Identifying the degree of applicability can, however, be a long process that must be repeated for each new frame and language.
In this paper, we investigate to what extent this problem can be alleviated by the use of computational methods. More concretely, we build distributional representations, also known as embeddings, of frames
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
(Turney and Pantel, 2010; Baroni and Lenci, 2010; Hermann et al., 2014) which build a compact representation of a frame's meaning from the entire corpus. These embeddings are constructed for each language separately on the basis of expert-labeled corpora but are represented in a joint space, which enables us to automatically measure the cross-lingual similarity of language-specific frames (e.g., AWARENESS in English and AWARENESS in German). Our hypothesis is that frames which generalize well across languages will show a high similarity, while frames which do not generalize well will yield a low similarity. In this sense, we can see the set of frames that show the lowest self-similarity across languages as a repository of cases of potential generalization failure. Concretely, we carry out a case study for the language pair English–German, which is relatively well-studied in terms of Frame Semantics.
This paper is organized as follows. First, we describe distributional semantic models (DSMs) and their utility in capturing cross-lingual lexical semantics. We give an overview of the annotated corpora that we use for our experiments on English–German frames, followed by an overview of the model structure and input for training our DSMs. Finally, we present results of our frame embeddings on both English and German data and interpret the output of our computational model through findings from the theoretical literature. To our knowledge, our work is the first to explore the interoperability of frames across languages with embeddings. We present results that demonstrate how DSMs can capture latent semantic relationships between frames and speculate about the conclusions that can be drawn when using DSMs for frame semantic analysis.
2 Background
2.1 Word Embeddings
Vector representations of word meanings, generally known as distributional semantic models (DSMs) or embeddings, have become a popular tool for research in lexical semantics within the Natural Language Processing community. The distributional hypothesis (Harris, 1954), which claims that similar words should be expected to appear in similar contexts, forms the basis for DSMs. Building on this assumption, current DSMs represent words as vectors in a low-dimensional space in which similarity is supposed to indicate semantic relatedness. DSMs can either be computed using count-based methods, which directly count word-context co-occurrences and optionally apply dimensionality reduction, or prediction-based methods, where the representations are the parameter vectors of an underlying optimization task (Baroni et al., 2014).
A widely-used prediction-based model is the word2vec model (Mikolov et al., 2013c). Word2vec produces word vectors by predicting the target for a window of context, a setup known as the continuous bag-of-words (CBOW) model. In the CBOW architecture, an input for the target verb "cook" with a context window of 2 might appear as ['the', 'chef', 'a', 'meal'], and embeddings would accordingly be generated for each token in the vocabulary. The model is trained over a neural network with negative sampling, where noisy words are selected from a noise distribution and the model learns to approximate word meaning by discriminating the difference between the true and noisy word.
While the use of embeddings to model meaning at the word level is the most widespread application, they can in fact be applied to linguistic units of almost any granularity, from morphemes to word senses to sentences and even paragraphs and documents. In this study, we apply them to modeling the meaning of frames, i.e., semantically motivated predicate classes, following Hermann et al. (2014).
2.2 Cross-lingual Embeddings
The goal of a cross-lingual DSM is to make meaning representations from different languages comparable, either by learning multilingual embeddings in the same space or by mapping monolingual embeddings into a joint space (Ruder et al., 2017; Upadhyay et al., 2016). Mikolov et al. (2013b) introduced a particularly simple version that takes advantage of a vocabulary of shared bilingual seed words to map embeddings from a source language onto the vector space of a target language. They learn a transformation matrix W by minimizing the mean squared error (MSE) between the source seed words and their translations from the bilingual dictionary:
MSE = \sum_{i=1}^{n} \| W x_{s_i} - x_{t_i} \|^2
where x_{s_i} represents the seed words from the source language, and x_{t_i} are seed words from the target. The transformation matrix can then be applied to an embedding from the source space to project it onto the target language space. This can be achieved by solving z = Wx, where x is the source embedding and z is the same vector transformed into the target language space. This results in a shared space where semantic units that evoke similar concepts in one language are close to their translations in another.
Conceptually, the bilingual seed words are a set of words that are assumed, a priori, to have the same meaning in both languages. A good choice for this set appears to be named entities, which are less prone to semantic problems like polysemy and vagueness, and which are furthermore easy to collect for most language pairs. We will adopt this choice in the present study.
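On toy data, learning W by least squares over seed-word pairs and then projecting with z = Wx can be sketched with NumPy as follows (the dimensionality is reduced for illustration; the paper uses 300 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                              # toy dimensionality
X_s = rng.normal(size=(50, d))     # source seed-word embeddings x_{s_i}
W_true = rng.normal(size=(d, d))
X_t = X_s @ W_true.T               # toy targets produced by a known map

# Minimize sum_i ||W x_{s_i} - x_{t_i}||^2 with ordinary least squares.
M, *_ = np.linalg.lstsq(X_s, X_t, rcond=None)
W = M.T

# Project a new source embedding into the target space: z = W x.
x = rng.normal(size=d)
z = W @ x
```

Since the toy targets are generated by an exact linear map, least squares recovers W up to numerical precision; with real embeddings the fit is only approximate.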
3 Annotated Corpora
The popularity of the Berkeley FrameNet project has inspired the Multilingual FrameNet project, where frame-semantic resources have emerged in multiple languages across the world (Torrent et al., 2018). In our work, we focus on two languages that are relatively well-represented in the Frame Semantics literature, English and German, using the FrameNet and SALSA corpora, respectively.
3.1 FrameNet Corpus
The FrameNet corpus (release 1.5) provides over 173k annotated data sets with manually-curated sentences, drawn from the 100-million-word British National Corpus (BNC)1, which is balanced across multiple genres, including novels and news reports. FrameNet's FEEs can be verbal, nominal, adjectival, and even adverbial and prepositional in some cases. In FrameNet's v1.5 complete annotation set, there are over 11k FEEs, where each FEE attestation is labeled. This produces sentences in which multiple FEEs are present, and lexical units are annotated on average 20.7 times in the current annotated corpus.
3.2 SALSA Corpus
SALSA (Erk et al., 2003; Burchardt et al., 2009) is a German corpus with frame-semantic annotations. It was constructed over the syntactic annotations of the German TIGER corpus (version 2.1), and is composed of 1.5 million words from German newswire text. SALSA (release 2.0) provides 24,184 sentences with frame annotations, and the corpus has labels for 998 unique FEEs. Each unique FEE is annotated on average 36.9 times.
3.3 Parallelism
For the purposes of our study, it is important that the corpora are as comparable as possible. Comparability must be considered on two levels: corpus composition and annotation. With regard to corpus composition, we note that SALSA is based on a corpus consisting exclusively of newswire, while FrameNet is based on the much broader BNC. Nevertheless, we maintain that the match is good, albeit not perfect. As for the semantic annotation schema, SALSA adopted the annotation scheme of the Berkeley FrameNet project fairly directly (Ellsworth et al., 2004). The main difference is that SALSA proceeded predicate-by-predicate instead of frame-by-frame, which required the introduction of "pseudo-frames" for senses that were not covered by FrameNet at the time of annotation (Burchardt et al., 2009). We exclude these from our analysis. Finally, another small difference regards the annotation of multi-word expressions, which we address further in Section 4.
4 Methods
To build frame embeddings for English and German, we pre-process the two corpora introduced in the previous section. Below, we describe the steps we took to format the input for the model, and then we elaborate on how specifically we build the frame embeddings.
1http://www.natcorp.ox.ac.uk/
93
4.1 Preprocessing the FrameNet corpus

We use the FrameNet 1.5 corpus, which provides tokenized and lemmatized sentences (Bauer et al., 2012). We perform three pre-processing steps. First, multi-word frame-evoking elements (FEEs) are concatenated into a single token. This is motivated by the presence of multi-word FEEs that evoke different frames from the single-word FEEs that they contain. One such case is the lexical unit ride, which evokes the RIDE_VEHICLE frame, and the multi-word FEE ride out, which evokes SURVIVING. We keep instances of MWE FEEs such as ride out as single units in the vector space.
The second step is the detection of named entities using the pretrained English model from spaCy's named entity recognition (NER) package2. We later use these entities as seed words for the cross-lingual mapping of embeddings (cf. Section 2). Again, for simplicity, we concatenate entities that span multiple tokens into a single span, e.g., "San Francisco" would be converted to "San_Francisco".
The third step optionally replaces each occurrence of an FEE by the name of the frame it evokes to produce the "Frame Corpus". By applying the word2vec embedding algorithm to the Frame Corpus, we can generate a monolingual embedding for each frame. Without this replacement, we retain the so-called "FEE Corpus", from which we can generate standard word-level representations for the FEEs. An example of the FEE Corpus and Frame Corpus model inputs is shown below in Table 1.
Original:     "The Washington Post reported on the country's biological weapons labs"
FEE Corpus:   [The_Washington_Post, report, on, the, country, 's, biological_weapon, lab]
Frame Corpus: [The_Washington_Post, STATEMENT, on, the, country, 's, WEAPON, lab]

Table 1: Example of original FrameNet sentence and the FEE Corpus and Frame Corpus versions
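The FEE-to-frame replacement can be sketched as follows (the mapping here is a toy fragment; in the actual corpus the frame label comes from each token's annotation, since the same FEE can evoke different frames in different sentences):

```python
# Toy FEE -> frame lexicon illustrating the Frame Corpus construction.
fee_to_frame = {"report": "STATEMENT", "biological_weapon": "WEAPON"}

def to_frame_corpus(tokens):
    # Replace annotated FEEs by their frame names; keep other tokens as-is.
    return [fee_to_frame.get(tok, tok) for tok in tokens]

tokens = ["The_Washington_Post", "report", "on", "the", "country",
          "'s", "biological_weapon", "lab"]
print(to_frame_corpus(tokens))
```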
4.2 Preprocessing the SALSA corpus

Similar to FrameNet, the SALSA corpus provides pre-tokenized and lemmatized sentences (Erk et al., 2003). Multi-word predicates are prevalent in the SALSA data but cannot be concatenated as easily as they are in English; SALSA decided to annotate some constructions as FEEs that are arguably more syntactically than semantically motivated. Table 3 shows the three most frequent MWE patterns. The first class, separated prefix verbs, are merged into a single word. The second class, combinations of verbal FEEs with modals, and the third class, combinations of verbal FEEs with the particle zu (to), we decided not to concatenate. As for English, we recognize named entities using spaCy (using the pretrained German model in this case), and convert any named entities that span multiple words into a single token ("Volksrepublik China" becomes "Volksrepublik_China"). Also identical to the English FrameNet input, we build two corpora using the German data, a FEE Corpus and a Frame Corpus, to generate embeddings for FEEs and frames, respectively. An example is shown in Table 2.
Original:     "Konzernchefs lehnen den Milliardär als US-Präsident ab"
              (CEOs reject the billionaire as US President)
FEE Corpus:   [Konzernchef, ablehnen, der, Milliardär, als, US-Präsident]
Frame Corpus: [Konzernchef, JUDGMENT_COMMUNICATION, der, Milliardär, als, US-Präsident]

Table 2: Example of original SALSA sentence and the FEE Corpus and Frame Corpus versions
4.3 Frame Embeddings

For both English and German, we first learn 300-dimensional monolingual FEE-level and frame-level embeddings from the respective corpora, using the word2vec CBOW method (see Mikolov et al. (2013a) for more in-depth details). Then, we employ Mikolov et al. (2013a)'s method based on a shared bilingual vocabulary to learn a linear projection from the source to the target vector space. As a bilingual vocabulary, we use the intersection of the recognized named entities of the standard classes (organizations; persons
2https://spacy.io/
VVFIN+PTKVZ → PTKVZ_VV     Example: gehört an → angehören
VM* VV*     → no change    Example: müssen rechnen
PTKVZU      → no change    Example: zu sagen

Table 3: Rules for combining multi-word predicates in the German SALSA corpus
Bill Clinton        Mussolini
Netanyahu           Vegas
Pentagon            CIA
Intel               IBM
San Francisco       McDonald's
Apple               Bill Gates
Mao Zedong          Reuters
Washington Times    New York Times

Table 4: Examples of named entities occurring in both FrameNet and SALSA
and person groups; locations and geopolitical entities) from FrameNet and SALSA. In other words, we first match on named entities whose surface form is identical between English and German – so New York / New York. To increase the size of the seed entities, we use a bilingual dictionary for English–German (Conneau et al., 2017) to further match any named entities where the surface form varies but reflects the same individual or location, for example President Clinton / Präsident Clinton. Overall, 40,262 entities were detected in the English FrameNet sentences, while 15,398 entities were detected in the German SALSA. The intersection of entities that appeared in both corpora was 2,899. The list in Table 4 gives a sample of the entity names that appear in both FrameNet and SALSA.
The models are trained on the pre-processed corpora. This results in a total of 230 frame embeddings for both EN and DE, 11,830 English FEE embeddings, and 998 German FEE embeddings.3
5 Experiment 1: Hyperparameter optimization and sanity check
Producing embeddings with word2vec's CBOW algorithm involves choices regarding several hyperparameters, notably the number of negative samples (neg), the learning rate (alpha), the size of the context window (win), and the number of iterations (iter) – cf. Section 2. We use the same hyperparameter values for the Frame Corpus-based and FEE Corpus-based embeddings.
It is common practice to choose the values for these hyperparameters based either on performance on a development set or on some auxiliary task, such as word similarity prediction. In order to validate the FrameNet/SALSA frames that we use in our subsequent analysis, we decided to define an auxiliary task; namely, a check of the quality of the monolingual frame embeddings. An added benefit of this auxiliary task is that it can also be understood as a form of sanity checking for the monolingual embeddings before using them for cross-lingual comparison.
More concretely, we ask how well the embedding we obtain for each frame corresponds to the centroid of all FEE embeddings of that frame. For example, we would expect the COMMERCE_BUY frame embedding to be highly similar to the centroid of the embeddings of the FEEs buy and purchase, but dissimilar to the centroid of embeddings for predicates such as tell or beat. Formally, let c_FEE(φ) denote the FEE centroid for a frame φ, defined as
$$ c_{\mathrm{FEE}}(\phi) = \frac{1}{|\{\mathit{fee} \mid \mathit{fee} \in \phi\}|} \sum_{\mathit{fee} \in \phi} \overrightarrow{\mathit{fee}} $$
We can now assemble, for each frame f, the list of FEE centroids c_FEE of all frames, ranked by their cosine similarity to the frame embedding ~f. This list enables us to define the accuracy of the model using standard evaluation measures for ranked retrieval. We report the percentage of frames whose most similar FEE centroid is their own (R@1), the percentage whose FEE centroid is among the five most similar ones (R@5), and the percentage whose FEE centroid is among the ten most similar ones (R@10).
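This ranked-retrieval evaluation can be sketched as follows. The sketch assumes frame embeddings and FEE centroids are already computed and stacked row-wise, with row i of both matrices referring to the same frame; the toy vectors are illustrative, not the paper's data.

```python
import numpy as np

def recall_at_k(frame_embs, fee_centroids, k):
    """Fraction of frames whose own FEE centroid is among the k centroids
    most cosine-similar to the frame's embedding (R@k)."""
    F = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    C = fee_centroids / np.linalg.norm(fee_centroids, axis=1, keepdims=True)
    sims = F @ C.T  # cosine similarity matrix: frames x centroids
    hits = sum(i in np.argsort(-sims[i])[:k] for i in range(len(F)))
    return hits / len(F)

# Toy setup: each frame embedding sits close to its own FEE centroid
# (the centroid would in practice be the mean of the frame's FEE vectors),
# so R@1 should be perfect here.
rng = np.random.default_rng(1)
centroids = rng.normal(size=(5, 20))
frames = centroids + 0.01 * rng.normal(size=(5, 20))
print(recall_at_k(frames, centroids, 1))
```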
5.1 Results
Tables 5 and 6 give a sampling of the hyperparameter settings that were used to evaluate the frame embeddings. The numbers indicate that the choice of hyperparameters is in fact important. Furthermore, the behavior of the embeddings with respect to the hyperparameters is largely parallel between English and German: too few iterations and too few negative samples result in noisy embeddings. Setting neg to
3The embeddings are available at http://www.ims.uni-stuttgart.de/data/XLFrameEmbed.html.
neg  alpha  win  iter  R@1    R@5    R@10
5    .025   2    10    0.481  0.701  0.766
10   .025   5    20    0.832  0.951  0.970
10   .05    2    20    0.733  0.923  0.957
20   .025   2    30    0.931  0.987  0.993
20   .025   2    35    0.929  0.990  0.994
30   .025   2    30    0.912  0.988  0.993
Table 5: FrameNet (EN): Evaluation of Frame Embedding Quality
neg  alpha  win  iter  R@1    R@5    R@10
5    .025   2    10    0.408  0.591  0.653
10   .025   2    20    0.843  0.925  0.938
10   .05    2    20    0.938  0.986  0.993
20   .025   5    20    0.904  0.972  0.979
20   .025   2    35    0.965  0.979  0.986
30   .025   2    30    0.931  0.993  0.993
Table 6: SALSA (DE): Evaluation of Frame Embedding Quality
Top 10 Most Similar Frame Embeddings

Frame: COMMERCE_BUY
  FrameNet (EN): COMMERCE_SELL, GETTING, SUPPLY, IMPORTING, DEGREE_OF_PROCESSING, TRANSFER, IMPORT_EXPORT, RECEIVING, AMASSING, CAUSE_MOTION
  SALSA (DE): GETTING, IMPORT_EXPORT, RECEIVING, CAUSE_MOTION, REMOVING, MANUFACTURING, ACTIVITY_START, USING, COMMITMENT, BRINGING

Frame: DECIDING
  FrameNet (EN): COMMUNICATION_RESPONSE, ASSESSING, SOLE_INSTANCE, ACTIVITY_ONGOING, ATTEMPT, TELLING, GIVING, ARRIVING, DESIRING
  SALSA (DE): COMMUNICATION_RESPONSE, ASSESSING, ACTIVITY_ONGOING, ATTEMPT, TELLING, GIVING, ARRIVING, DESIRING, RESPONSE, PERCEPTION_ACTIVE
Table 7: Top 10 nearest neighbors for FrameNet/SALSA frame embeddings
20 and running for 30 to 35 iterations, however, yields arguably good embeddings with R@1 values of over 90%; that is, the vast majority of frame embeddings are located closest to their own FEE centroid. For R@10, the numbers approach (for German) or even exceed (for English) 99%. Note that these tables include all frames that appear in the corpora. When we restrict scoring to frames that occur more than 5 times in the corpus, we obtain R@1 scores of over 97% for both languages. For the remainder of the paper, we adopt the hyperparameters that yield the highest accuracy.
We complement these results with a more qualitative analysis, where we find the top ten nearest-neighbor frame embeddings for two frames in the same monolingual space. This determines to what extent the frame embeddings form sensible semantic neighborhoods: we would expect a frame like COMMERCE_BUY to have COMMERCE_SELL as one of its nearest neighboring frames, while a frame like GIVING_BIRTH would not. The results are shown in Table 7 and demonstrate that essentially all nearest neighbors are semantically related to the target frame, expressing a wide range of concepts. For example, for COMMERCE_BUY, we find the inverse perspectivization COMMERCE_SELL, entailments (GETTING, RECEIVING, CAUSE_MOTION), preconditions (COMMITMENT), scenarios and narrative chains (IMPORTING, IMPORT_EXPORT, MANUFACTURING), and aspectual class (ACTIVITY_START). At the same time, the lists are sufficiently dissimilar to motivate our second experiment, which assesses to what extent these differences reflect cross-lingual differences in frame usage and applicability.
6 Experiment 2: Analysis of Cross-Lingual Frame Correspondences
As described in Section 4, we project the FrameNet and SALSA frame embeddings into the same space. This means we can compare them via a simple distance metric such as cosine similarity, and we can visualize them via dimensionality reduction to two dimensions. We start our analysis by confirming that the joint space respects the monolingual similarities and introduces reasonable cross-lingual ones. For clarity, we will affix -EN to English (FrameNet) frames and -DE to German (SALSA) frames.
Figure 1 shows the nearest frame neighbors for VERDICT in English and in the joint FrameNet-SALSA space. The joint space, while introducing related frames from the German space, still preserves the relationships on the source side, namely ARREST, NOTIFICATION_OF_CHARGES, and TRIAL. Additional related frames emerge from the German data, including CRIMINAL_INVESTIGATION and PROCESS_START.
We now proceed to analyze our embeddings through the perspective of prior cross-lingual studies
Figure 1: English space around VERDICT-EN (left), and joint space (right). Many semantic relationships are preserved in the cross-lingual mapping.
from the linguistics literature. We focus in particular on the frames and frame properties that diverge in English and German, and compare these findings with results from our corpus-based frame embeddings. Conversely, we explore frames that have a high potential for universality and compare their similarities in the cross-lingual vector space.
6.1 Specificity vs. Cross-lingual Frame Applicability
FrameNet frames form a hierarchy, with more abstract, schematic, often scenario-like frames at the top and more specific, more constrained frames at the bottom (Fillmore et al., 2003). Arguably, the more specific a frame is, the higher the risk that its constraints do not carry over well to other languages, while frames that are higher in the FrameNet hierarchy should be more universally applicable (Padó, 2007).
As two examples of such general frames, consider COMMUNICATION and MOTION, which apply to general speaking and moving events (with FEEs such as speak, communicate, say and move, come, go, respectively). Both frames are extended by more specific child frames, such as COMMUNICATION_RESPONSE, COMMUNICATION_MANNER, MOTION_DIRECTIONAL and SELF_MOTION. We now test whether the specificity of the frames correlates with the cross-lingual similarities of the frame embeddings. We expect MOTION-DE and MOTION-EN to be highly similar, while MOTION_DIRECTIONAL-DE and -EN should be less similar when mapped into a joint, multilingual vector space.
The results are visualized in Figure 2: in general, both coarsely defined frames have a higher similarity than their respective child frames, although this tendency is much more pronounced for COMMUNICATION than for MOTION. The one exception to this trend is the SELF_MOTION frame, which can itself be argued to be a general frame for self-propelled movement; in fact, SELF_MOTION is annotated more often in the SALSA corpus (count=76) and with a greater variety of FEEs (31 unique FEEs) than MOTION (count=32), which has significantly fewer unique FEE annotations (10 unique FEEs).
6.2 Most and Least Similar Frames
Next, we computed the frames that were most and least similar across FrameNet and SALSA. Our expectation was that the list of most similar frames would be dominated by very general frames (as per the previous analysis) and that the list of least similar frames would contain mostly cases known as problematic from the lexicographic literature.
These expectations were only confirmed to a limited extent. Among the most similar frames, frames tend to belong to abstract domains such as Cognition (GRASP, JUDGMENT, MEMORY) or to describe abstract properties of events (ACTIVITY_RESUME, LIKELIHOOD). Two concrete frames are also prominent: VERDICT and ARREST. This was a surprising result, given the differences in the judicial systems of the USA and Germany.
Even for very similar frames, it is striking how little correlation there is between the FrameNet and SALSA frequencies for certain frames, and the imbalance is more pronounced for the least similar frames.
Figure 2: Cross-lingual similarity scores for COMMUNICATION and MOTION.
Frame                     Similarity DE-EN
COMMUNICATION             0.54
COMMUNICATION_RESPONSE    0.17
COMMUNICATION_MANNER      0.14
MOTION                    0.39
CAUSE_MOTION              0.15
MOTION_DIRECTIONAL        0.27
SELF_MOTION               0.46

Figure 3: Cross-lingual similarity between the less specific COMMUNICATION and MOTION frames and their more specific child frames.
Most Similar Frames  Similarity  Freq EN/DE | Least Similar Frames  Similarity  Freq EN/DE
GRASP                0.66        287/21     | ORIGIN                -0.14       194/8
COMMUNICATION        0.54        84/25      | PEOPLE_BY_VOCATION    -0.11       586/12
VERDICT              0.51        215/77     | UNDERGOING            -0.07       38/136
ARREST               0.51        189/103    | INGEST_SUBSTANCE      -0.02       187/7
BUILDING             0.49        393/52     | EMPLOYING              0.00       151/428
ACTIVITY_RESUME      0.49        37/8       | TEXT                   0.03       1080/4
JUDGMENT             0.48        1212/33    | SENSATION              0.03       471/1
MEMORY               0.48        209/41     | TAKING_SIDES           0.04       189/18
COTHEME              0.48        665/7      | COMMITTING_CRIME       0.07       48/55
LIKELIHOOD           0.48        577/6      | SENTENCING             0.07       77/39
Table 8: Most and least similar English–German frame pairs in joint space.
In fact, we believe that for a number of frames in this list (e.g., SENSATION), the reason for low similarity is that SALSA provides so few annotations that no reliable embeddings can be learned, irrespective of any linguistic or lexicographic considerations. For the remaining frames, the low similarity appears to result from large differences in the FEEs that were chosen for annotation. For example, FrameNet's ORIGIN
contains many nationalities and ethnicities (American, Assyrian, Byzantine) while SALSA annotated family terms like Kind, Sohn, Enkel (child, son, grandson). Similarly, PEOPLE_BY_VOCATION, which in FrameNet covers a wide range of professions, contains only one FEE in German, Kumpel (miner). These appear to be artifacts of SALSA's strategy of annotating lemma by lemma, as a result of which some frames end up being annotated for only a single, or very few, FEEs (Ellsworth et al., 2004). So, the least similar frames provide information about annotation choices rather than conceptual applicability.
Note that the definition of abstract/general frames can be made more concrete by considering FrameNet's frame-to-frame relations "X is Inherited by / is Used by / has Subframe Y". All of these relations share the property that X is a more abstract frame than Y. Thus, if a frame occurs in position X in a high number of relations, it should be a more abstract frame and behave more similarly across languages. This turns out to be true in our sample: the ten most similar frames listed in Table 8 filled position X in a high average number of relations (4.89), while the ten least similar frames had a comparably low average (1.7).
6.3 Monolingual Annotation Choices
Nevertheless, the previous analysis raises the question of how monolingual annotation choices influence cross-lingual comparability. As an example, let us consider the English verb announce, which is relatively flexible regarding its subcategorization behavior; Boas (2009) analyzed instances of the verb and found
FEE         Annotated German frames  Count  Most similar English frames
ankündigen  HERALDING-DE             85     STATEMENT-EN, REPARATION-EN,
            OMEN-DE                  1      OMEN-EN, REST-EN
            EVIDENCE-DE              1
Table 9: SALSA FEEs and frames for translations of the English STATEMENT FEE announce (left), and most similar English frame embeddings (right)
that the STATEMENT-EN frame, which can express the SPEAKER, MEDIUM, and MESSAGE roles, is able to cover all of the following uses (Boas, 2009):
(1) a. [The CEO SPEAKER] announced [that the company would be acquired MESSAGE].
    b. [The press report MEDIUM] announced [that the company would be acquired MESSAGE].
    c. [The CEO SPEAKER] announced [that the company would be acquired MESSAGE] [by email MEDIUM].
The closest German translation of announce is ankündigen, which can likewise express each of the alternations above. Consequently, we would expect the instances of ankündigen in SALSA to be annotated with STATEMENT-DE.
However, as the left-hand cell of Table 9 shows, SALSA decided to annotate the overwhelming majority of the ankündigen instances with the frame HERALDING. HERALDING is a specific frame involving a COMMUNICATOR informing the public about a future course of action (EVENT), and in English HERALDING can only be evoked by the verb herald. Presumably, the reason for the German annotation is that the very typical use of ankündigen with a direct object NP expresses not just a statement event but a commitment by the speaker to carry out the action described by the NP:
(2) [Regierung in Rom COMMUNICATOR] kündigt [Preisstopp und Sparprogramm EVENT] an. (s1494)
    (Government in Rome announces price freeze and austerity program)
Unfortunately, in FrameNet there is no direct connection of any type between STATEMENT and HERALDING. What our cross-lingual embeddings can provide in this case is an insight into the relationship between these two frames. We do this by retrieving the most similar English frame embeddings for the SALSA ankündigen embedding. We find that the most similar frame to ankündigen is STATEMENT-EN. This indicates that the usage of ankündigen in the SALSA corpus is, even if not labeled as such, semantically still very similar to STATEMENT-EN. Arguably, this pattern, a predicate and its direct translations belonging to two different frames, provides evidence that the frames involved share some kind of conceptual relationship (Sikos and Padó, 2018). In other words, the cross-lingual model has identified a potential gap in the FrameNet frame relations.
7 Conclusion
In this paper, we have looked at the question of the cross-lingual applicability of FrameNet semantic frames. Our starting hypothesis was that multilingual vector spaces can serve as a rich source of information on this topic. A frame vector space can represent frame-evoking elements and frames as vectors, which are readily comparable by geometric distances, thus providing similarity judgments between them.
In our experiments, we have constructed a joint, bilingual space by applying word embedding methods to the annotations of the English FrameNet corpus and the German SALSA corpus. The outcome of the three specific analyses that we have carried out – specificity, frame similarity, and translational variation – provides overall positive evidence for this hypothesis: in many cases, our findings correspond well to previous linguistic findings reported, for example, in Boas (2009). Based on our results, we believe that vector space models can support linguistic/lexicographic work on cross-lingual frame semantics by guiding lexicographers towards potential frame mismatches and questions of cross-lingual applicability.
That said, the picture that emerged is somewhat more nuanced than we anticipated. As usual in corpus-based research, one must be aware that the similarity structure of the embeddings is not always directly interpretable in terms of frame generalizability. Among the factors that impact embeddings are "usual
suspects" like frequency (low-frequency embeddings are potentially noisy, cf. Table 9) and polysemy (the sense distribution of FEEs influences vectors), but also the annotation decisions of individual annotation projects (cf. SALSA's incomplete coverage of frames in terms of FEEs, and the decision to annotate ankündigen with HERALDING-DE). A deeper investigation of these factors is a topic for future work.
Another clear avenue for future investigation is to apply our analysis to compare semantic annotations in multiple languages, which is fast becoming a possibility as new, multilingual FrameNets emerge.
References

Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley FrameNet project. In Proceedings of ACL/COLING, pages 86–90, Montreal, QC.

Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, pages 238–247, Baltimore, MD.

Daniel Bauer, Hagen Fürstenau, and Owen Rambow. 2012. The dependency-parsed FrameNet corpus. In Proceedings of LREC, pages 3861–3867.

Hans C Boas. 2005. Semantic frames as interlingual representations for multilingual lexical databases. International Journal of Lexicography, 18(4):445–478.

Hans C Boas. 2009. Multilingual FrameNets in computational lexicography – Methods and applications. De Gruyter.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2009. Using FrameNet for the semantic analysis of German: Annotation, representation, and automation. In Hans C. Boas, editor, Multilingual FrameNets in Computational Lexicography – Methods and Applications, pages 209–244. De Gruyter.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv:1710.04087.

Michael Ellsworth, Katrin Erk, Paul Kingsbury, and Sebastian Padó. 2004. PropBank, SALSA and FrameNet: How design determines product. In Proceedings of the LREC Workshop on Building Lexical Resources From Semantically Annotated Corpora, pages 17–23, Lisbon, Portugal.

Katrin Erk, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2003. Towards a resource for lexical semantics: A large German corpus with extensive semantic annotation. In Proceedings of ACL, pages 537–544, Sapporo, Japan.

Charles J Fillmore, Christopher R Johnson, and Miriam R L Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16(3):235–250.

Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Proceedings of ACL, pages 1448–1458, Baltimore, MD.

Birte Lönneker-Rodman. 2007. Multilinguality and FrameNet. Technical report, International Computer Science Institute.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv:1301.3781.

Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Sebastian Padó. 2007. Translational equivalence and cross-lingual parallelism: The case of FrameNet frames. In Proceedings of the NODALIDA Workshop on Building Frame Semantics Resources for Scandinavian and Baltic Languages, pages 39–46, Tartu, Estonia.

Sebastian Ruder, Ivan Vulic, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. arXiv:1706.04902.

Jennifer Sikos and Sebastian Padó. 2018. FrameNet's 'Using' relation as source of concept-driven paraphrases. Constructions and Frames, 10(1). In press.

Carlos Subirats. 2009. Spanish FrameNet: A frame semantic analysis of the Spanish lexicon. In Hans C. Boas, editor, Multilingual FrameNets in Computational Lexicography – Methods and Applications, pages 135–162. De Gruyter.

Sara Tonelli and Emanuele Pianta. 2008. Frame information transfer from English to Italian. In Proceedings of LREC, pages 2252–2256, Marrakech, Morocco.

Tiago Timponi Torrent, Lars Borin, and Collin Baker, editors. 2018. Proceedings of the LREC 2018 International FrameNet Workshop on Multilingual FrameNets and Constructicons, Miyazaki, Japan.

Peter D Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of ACL, pages 1661–1670, Berlin, Germany.

Piek Vossen, Antske Fokkens, Isa Maks, and Chantal van Son. 2018. Towards an open Dutch FrameNet lexicon and corpus. In Proceedings of the LREC 2018 International FrameNet Workshop: Multilingual FrameNets and Constructicons, pages 75–80, Miyazaki, Japan.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 102–111, Santa Fe, New Mexico, USA, August 20, 2018.
Construction of a Multilingual Corpus Annotated with Translation Relations
Yuming Zhai (LIMSI-CNRS, Univ. Paris-Sud, Univ. Paris-Saclay, France)
Aurélien Max (LIMSI-CNRS, Univ. Paris-Sud, Univ. Paris-Saclay, France)
Anne Vilnat (LIMSI-CNRS, Univ. Paris-Sud, Univ. Paris-Saclay, France)
Abstract
Translation relations, which distinguish literal translation from other translation techniques, constitute an important subject of study for human translators (Chuquet and Paillard, 1989). However, automatic processing techniques based on interlingual relations, such as machine translation or paraphrase generation exploiting translational equivalence, have not made explicit use of these relations until now. In this work, we present a categorization of translation relations and annotate a parallel multilingual (English, French, Chinese) corpus of oral presentations, the TED Talks, with these relations. Our long-term objective is to automatically detect these relations in order to integrate them as important characteristics for the search of monolingual segments in a relation of equivalence (paraphrase) or of entailment. The annotated corpus resulting from our work will be made available to the community.
1 Introduction
Human translators have studied translation relations for a long time (Vinay and Darbelnet, 1958; Chuquet and Paillard, 1989); these relations categorize translation techniques other than literal translation. To the best of our knowledge, however, no automatic processing techniques explicitly implement these interlingual relations.
As an important natural language understanding and generation task, machine translation (MT) has been substantially improved, first with phrase-based statistical machine translation (PBMT) and more recently with neural machine translation (NMT). MT has also been exploited to generate paraphrases from bilingual parallel corpora, as originally proposed by Bannard and Callison-Burch (2005). The assumption is that two segments in the same language are potential paraphrases if they share common translations in a foreign language. Currently the largest resource of paraphrases, PPDB (Paraphrase Database) (Ganitkevitch et al., 2013), has been built following this method exploiting translational equivalence. Nonetheless, the work of Pavlick et al. (2015) revealed that PPDB contains relations other than strict equivalence (paraphrase), i.e. Entailment (in two directions), Exclusion, Other related, and Independent. The existence of these other relations in PPDB reflects the lack of semantic control during the paraphrasing process.
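The pivot assumption of Bannard and Callison-Burch (2005) can be made concrete with a small sketch: the paraphrase probability is p(e2|e1) = Σ_f p(e2|f)·p(f|e1), marginalizing over foreign pivot phrases f. The phrase pairs and probabilities below are invented toy values, not entries from a real phrase table.

```python
from collections import defaultdict

# Toy phrase-translation probabilities (all values hypothetical).
# p(f|e): English phrase -> {foreign pivot phrase: probability}
p_f_given_e = {
    "under control": {"unter Kontrolle": 0.8, "im Griff": 0.2},
}
# p(e|f): foreign pivot phrase -> {English phrase: probability}
p_e_given_f = {
    "unter Kontrolle": {"under control": 0.7, "in check": 0.3},
    "im Griff": {"under control": 0.5, "in check": 0.5},
}

def paraphrase_probs(e1):
    # p(e2|e1) = sum over pivots f of p(e2|f) * p(f|e1), excluding e1 itself
    probs = defaultdict(float)
    for f, p_f in p_f_given_e[e1].items():
        for e2, p_e2 in p_e_given_f[f].items():
            if e2 != e1:
                probs[e2] += p_e2 * p_f
    return dict(probs)

print(round(paraphrase_probs("under control")["in check"], 2))  # → 0.34
```

Note that nothing in this computation enforces semantic equivalence, which is exactly why PPDB also contains entailment and other looser relations.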
We propose a categorization of translation relations that models human translators' choices, and we annotate a multilingual (English, French, Chinese) parallel corpus of oral presentations, the TED Talks1, with these relations. The annotation is still ongoing, and we are developing a classifier based on these annotations to perform automatic detection. Our further goal is to integrate this information as important characteristics for the search of monolingual segments in a relation of equivalence (paraphrase) or of entailment.
After presenting related work in section 2, we describe our parallel corpus of TED Talks (section 3) and the translation relations (section 4). The annotation process and statistics follow in section 5. A contrastive study between target languages is presented in section 6. Finally, we conclude in section 7.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
1https://www.ted.com/
2 Related work
Deng and Xue (2017) have studied the divergences present in English-Chinese machine translation, using a hierarchical alignment scheme between the parse trees of the two languages. Seven types of divergences were identified, and some of them cause important difficulties for automatic word alignment. In particular, we can highlight lexical differences resulting from non-literal translations, and structural differences between languages (with or without change of syntagm type). In order to supply a dedicated dataset of multiword expressions (MWEs) for evaluating machine translation, Monti et al. (2015) specifically annotated these expressions in the English-Italian TED Talks corpus, together with their translations generated by an automatic system. The phenomena discussed in these two studies are included in the translation relations that we present in this article.
PARSEME (PARSing and Multiword Expressions)2 is a European scientific network set up to elaborate universal terminologies and annotation guidelines for MWEs in 18 languages (Savary et al., 2015). Its main outcome is a multilingual 5-million-word annotated corpus, which underlies a shared task on the automatic identification of verbal MWEs (Savary et al., 2017). Our trilingual annotated corpus focuses on the bilingual relation between translations, and we annotate all words in the corpus, including continuous and discontinuous MWEs.
As a complement to MT evaluation metrics, which imperfectly reflect systems' performance, Isabelle et al. (2017) introduced a challenge set based on difficult linguistic material. The authors could hence determine some remaining difficulties for recent NMT systems. The problems include, in particular, incomplete generalizations; translating common and syntactically flexible idioms; or crossing movement verbs, e.g. swim across X → traverser X à la nage. The annotated corpus that we present here could also constitute a challenge set for the purpose of evaluating MT systems in cases where human translators resort to different translation relations.
3 Corpus
In order to study translation relations for several pairs of languages, we have worked on a multilingual parallel corpus. This corpus is available from the Web inventory WIT3 (Cettolo et al., 2012), which gives access to a collection of transcribed and translated talks, including the corpus of TED Talks3. This corpus was released for the evaluation campaigns IWSLT 2013 and 2014.4 The source language, i.e. the original language in which the speakers expressed themselves, is English. We have computed the intersection of a parallel corpus with translations in French5, Chinese, Arabic, Spanish and Russian. The translation of subtitles for TED Talks is controlled by volunteers and language coordinators for each language6, which generally ensures good quality translations. The corpus to be annotated contains 2 436 lines of parallel sentences for each pair of languages (the English corpus contains 51 926 tokens). For the moment, we annotate the English-French and English-Chinese corpora to validate our hierarchy of translation relations, since these two target languages are very dissimilar in several linguistic aspects: morphology, grammar, expression, etc.
For English and French, the corpus has been tokenized with the Stanford Tokenizer7 and lemmatized with TreeTagger (Schmid, 1995), while keeping the tokenization of the Stanford Tokenizer. Capital letters at the beginning of each sentence are kept only if the words always appear with a capital initial elsewhere in the corpus; otherwise they are lowercased. We have used the tool THULAC (Li and Sun, 2009) for the segmentation of the Chinese corpus, and made several corrections before annotation. The words are automatically aligned by training FastAlign (Dyer et al., 2013) with its default parameters on each entire parallel corpus (i.e. 163 092 lines and 3 303 660 English tokens). We import
2 http://www.parseme.eu/
3 https://wit3.fbk.eu/
4 We have used the training corpus of 2014 (160 656 lines), and the development corpus (880 lines) and test corpus (1 556 lines) of 2010.
5 The sentence boundaries have been corrected in the French test corpus to calculate the intersection.
6 https://www.ted.com/participate/translate/get-started
7 http://nlp.stanford.edu/software/tokenizer.shtml
these automatic alignments before annotation to accelerate the process, in particular for the words which are literally translated. The annotators correct these alignments where necessary.
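As an illustration of the import step, FastAlign writes one line per sentence pair in Pharaoh format: whitespace-separated "i-j" links, where i indexes a source token and j a target token, both zero-based. A minimal reader for such lines could look like this (the example sentence pair and its links are ours, for illustration only):

```python
def parse_alignment(line):
    """Parse one Pharaoh-format alignment line (e.g. "0-0 1-1 2-2 3-2")
    into a list of (source_index, target_index) pairs."""
    return [tuple(int(x) for x in link.split("-")) for link in line.split()]

# e.g. "what time is it" / "quelle heure est-il" (link values illustrative)
links = parse_alignment("0-0 1-1 2-2 3-2")
print(links)  # → [(0, 0), (1, 1), (2, 2), (3, 2)]
```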
4 Translation relations
Figure 1: Hierarchy of translation relations.
The first attempt to establish a taxonomy of translation procedures was carried out by Vinay and Darbelnet (1958). The hierarchy of translation relations that we propose here is based on the theories clarified in the work of Chuquet and Paillard (1989), and on phenomena found during our initial study of the corpus (see figure 1).
The colored nodes represent our categories; the other nodes present the hierarchy (i.e. Non-Literal, Unaligned, Modulation, No Type) or more specific phenomena for which there is no dedicated label (e.g. Remove Metaphor, Other Generalization). We present their definitions and typical examples below:
1. Literal translation: word-for-word translation (including insertion or deletion of determiners, changes between singular and plural forms), or possible literal translation of some idioms:
What time is it? → Quelle heure est-il ?, facts are stubborn → les faits sont têtus
2. Equivalence:
i) non-literal translation of proverbs, idioms, or fixed expressions:
Birds of a feather flock together. → Qui se ressemble s’assemble.
ii) semantic equivalence at the supra-lexical level; translation of terms:
magic trick → tour de magie, hatpin → épingle à chapeau
3. Modulation: consists in changing the point of view, either to circumvent a translation difficulty or to reveal a way of seeing things that is specific to the speakers of the target language. This category could result in a semantic shift between source text and target text. Apart from its two sub-types Particularization and Generalization, all the other phenomena are represented by the sub-type Other Modulation:
this is a completely unsustainable pattern → il est absolument impossible de continuer sur cette tendance, I had an assignment → on m'avait confié une mission
4. Particularization: the translation is more precise or presents a more concrete sense:
the director said → le directeur déclara, language loss → l’extinction du langage
5. Generalization: this category contains three sub-types, but we annotate them with the same label:
i) the translation is more general or neutral; in other cases, this translation technique makes the sense more accessible in the target language:
look carefully at → regardez, as we sit here in ... → alors que nous sommes à ...
ii) translation of an idiom by a non-fixed expression:
trial and error → procéder par tâtonnements
iii) removal of a metaphorical image:
ancient Tairona civilization which once carpeted the Caribbean coastal plain → anciennes civilisations tyranniques qui occupaient jadis la plaine côtière des Caraïbes
6. Transposition: translating words or expressions by using other grammatical categories than the ones used in the source language, without altering the meaning of the utterance:
astonishingly inquisitive → dotée d’une curiosité stupéfiante
patients over the age of 40 → les malades ayant dépassé l’âge de 40 ans
7. Modulation plus Transposition: this category can contain any sub-type of Modulation combined with Transposition:
this is a people who cognitively do not distinguish → c’est un peuple dont l’état des connaissances ne permet pas de faire la distinction
8. Idiom: translating a non-fixed expression by an idiom (frequently used when translating English to Chinese):
at any given moment → à un instant "t"
died getting old →行将就木 "getting closer and closer to the coffin"
9. Metaphor: this category contains two sub-types reduced to only one label:
i) keep the same metaphorical image by using a non-literal translation:
the Sun begins to bathe the slopes of the landscape → le soleil qui inonde les flancs de ce paysage
ii) introduce metaphorical expression to translate non-metaphor:
if you faint easily → si vous tombez dans les pommes facilement
10. Unaligned - Explicitation:
i) resumptive anaphora (Charolles, 2002): add a phrase or sentence summarizing the preceding information (which could be present in the previous sentence), to help the understanding of the present sentence.
ii) introduce in the target language clarifications that remain implicit in the source language but emerge from the situation; add language-specific function words:
feel their past in the wind → ressentent leur passé souffler dans le vent
an entire book → 一本完整的书 (add the Chinese classifier 本)
11. Unaligned - Simplification: deliberately remove certain content words in the translation:
and you’ll suddenly discover what it would be like → et vous découvrirez ce que ce serait
12. Unaligned and no type attributed: function words necessary in one language but not in the other; segments not translated but which don’t impact the meaning; segments giving repeated information in context; translated segments which don’t correspond to any source segment:
minus 271 degrees, colder than → moins 271 degrés, ce qui est plus froid
the last example I have time to → le dernier exemple que j’ai le temps de
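The hierarchy and the twelve labels above can be encoded as a small tree. The sketch below is a hypothetical encoding: the exact parent-child links are reconstructed from the prose (colored nodes are annotation categories; Non-Literal, Unaligned, Modulation and No Type are structuring nodes), and the names `HIERARCHY` and `leaves` are ours, not part of the annotation scheme itself.

```python
# Hypothetical encoding of the hierarchy of translation relations (Figure 1).
# Structuring nodes map to their children; category leaves have no entry.
HIERARCHY = {
    "Translation relation": ["Literal", "Non-Literal", "Unaligned"],
    "Non-Literal": ["Equivalence", "Modulation", "Transposition",
                    "Modulation+Transposition", "Idiom", "Metaphor"],
    "Modulation": ["Particularization", "Generalization", "Other Modulation"],
    "Unaligned": ["Explicitation", "Simplification", "No Type"],
}

def leaves(node="Translation relation"):
    """Leaf nodes under `node`, i.e. the labels an annotator can assign."""
    children = HIERARCHY.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(child)]

print(leaves("Modulation"))
# → ['Particularization', 'Generalization', 'Other Modulation']
```

A flat list of assignable labels is then simply `leaves()`, which keeps the annotation guide and any tooling in sync with the tree.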
5 Annotation
5.1 Annotation tool and configuration
We have used the Web application Yawat8 (Germann, 2008), which allows us to align words or segments (continuous or discontinuous), and then to assign labels adapted to our task on monolingual or bilingual units (see figure 2).
Figure 2: Annotation interface of Yawat.
Here is a trilingual example from our corpus:
well, we use that great euphemism, "trial and error", which is exposed to be meaningless.
eh bien, nous employons cet euphémisme, procéder par tâtonnements, qui est dénué de sens.
我们 "we" 普通人 "ordinary people" 会 "particle for future tense" 做 "do" 各种各样 "diverse" 的 "particle for attribute" 实验 "experience" 不断 "continuously" 地 "particle for adverb" 犯错误 "make a mistake" 结果 "consequently" 却 "however" 一无所获 "have gained nothing"
The segments well and we use that great euphemism are translated literally in French (with "great" omitted), but they are omitted in Chinese. The idiom trial and error is translated by a generalization in the two languages. The segment which is exposed to be is translated by a generalization in French (est "is") and by a modulation in Chinese (结果 "consequently" 却 "however"). The adjective meaningless is translated by a transposition in French (dénué de sens "lacking meaning") and by a four-character idiom in Chinese (一无所获 "have gained nothing").
The spelling errors in the original corpus have not been corrected because, on the one hand, there are not many of them and, on the other hand, they generally don’t prevent us from assigning labels. Nonetheless, we have included a category Uncertain for pairs for which the annotators don’t know which label to assign, or for those which contain obvious translation errors.
Three annotators9 have participated in the annotation. Annotator training relies on an annotation guide which defines all categories, illustrated with characteristic examples. The hierarchy of relations provides a general view of the relations between these categories. To better understand the context, the annotators could watch the corresponding video of the talks10 before annotating.
5.2 Control study
We have evaluated the feasibility of our annotation task in a conventional way, by measuring the inter-annotator agreement on a control corpus, which contains 100 pairs of trilingual parallel sentences (3 055 English tokens, 3 238 French tokens and 4 195 Chinese characters). For each pair of languages (EN-FR and EN-ZH), two annotators have independently annotated the corpus.
Since there are disagreements on the boundaries of certain segments, we have calculated Cohen’s Kappa (Cohen, 1960) only for the segments annotated with exactly the same boundaries by the two annotators. The value of 0.672 signifies a substantial agreement for the pair EN-FR, while EN-ZH has a lower agreement (0.61, the minimum value to be considered substantial). The number of tokens
8Yet Another Word Alignment Tool, which is available for research under the GNU Affero General Public License v3.0.
9One French annotator and one Chinese annotator for the whole corpus, and another Chinese annotator for only the English-Chinese control corpus.
10The corpora to be annotated are transcriptions of TED Talks and their translations.
annotated in segments with the same boundaries represents 72.60% of the English tokens for EN-FR, but only 52.76% for EN-ZH.
We also calculate another inter-annotator agreement in a more flexible way, by including the pairs with different but compatible boundaries (i.e. without overlap of boundaries) and with a common category.11 In this scenario, while the coverage of English tokens increases to 85.56% for EN-FR and 74.10% for EN-ZH, the Kappa decreases to 0.617 for EN-FR and to 0.60 (moderate) for EN-ZH. The remaining tokens belong to segments with incompatible boundaries (see table 1 and table 2).
          κ      %EN tokens
strict    0.672  72.60%
flexible  0.617  85.56%

Table 1: EN-FR inter-annotator agreement.
          κ      %EN tokens
strict    0.61   52.76%
flexible  0.60   74.10%

Table 2: EN-ZH inter-annotator agreement.
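The strict agreement computation can be illustrated with a direct implementation of Cohen’s kappa. This is a minimal sketch, not the authors’ actual tooling: the two label sequences below are invented toy data standing in for segments both annotators delimited identically.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa (Cohen, 1960) between two annotators' nominal labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Toy labels over six same-boundary segments (hypothetical, not corpus data).
a1 = ["Literal", "Literal", "Modulation", "Equivalence", "Literal", "Transposition"]
a2 = ["Literal", "Modulation", "Modulation", "Equivalence", "Literal", "Literal"]
print(round(cohens_kappa(a1, a2), 3))
# → 0.5
```

With real data, `labels_a` and `labels_b` would be the category labels of the segments annotated with exactly the same boundaries by the two annotators, as in the strict rows of Tables 1 and 2.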
We have compared the annotations of the two annotators in a confusion matrix. There are certain disagreements between Literal and the following categories: Equivalence (e.g. in this way → de cette façon), Modulation (e.g. this entire time → tout ce temps), Particularization (e.g. snuff → tabac) and Transposition (e.g. their prayers alone → seulement leurs prières). However, our categories Equivalence and Literal are very close, and Particularization is a sub-type of Modulation; consequently we could consider these confusions acceptable in a more flexible measure. Modulation presents the majority of its confusions with Literal and Transposition (e.g. from the forest floor → tombées par terre), which indicates that it is necessary to better explain their differences when training annotators. Mod+Trans is a combined type for which certain annotators sometimes perceive only one of the two types (e.g. a great distance → de loin). Metaphor is the origin of several disagreements (e.g. at the base of glaciers → aux pieds des glaciers), which could mainly be explained by the difficulty of the annotation task for a non-native annotator of the target language.
5.3 Annotation scheme with several passes
The calculation of inter-annotator agreement enables certain standard interpretations on the control corpus, and the confusion matrix helps us to identify the difficulties of the task. For the purpose of converging on the boundaries of segments and on the attribution of labels, we have adopted an annotation scheme with several passes to ensure a better annotation quality. For each subcorpus12, the first annotator conducts a first annotation of all categories; then the second annotator takes over, which allows him/her to modify the alignments and/or the categories in case of disagreement. Each annotation file is saved at the end of each pass to document the differences in annotation. This alternation can be repeated until the convergence of all annotations. In practice, we limit ourselves to 3 passes, the third one being accomplished by the first annotator. We observe that the number of modifications in the third pass drops gradually as documents are annotated, which reflects a fast, progressive adaptation of the annotators to the task. This annotation scheme is cost-intensive, but has been made necessary by the intended quality, as well as by the inherent difficulties of segmentation observed in the control corpus.
5.4 Statistics of annotated subcorpora
Apart from the two control corpora, we have annotated six English-French subcorpora and three English-Chinese subcorpora. Their statistics are presented in table 3. Table 4 shows the distribution of the number of tokens annotated with the different translation relations for the English-French corpus.
11For example, I was asked by and I was asked by my professor at Harvard have both been annotated with Modulation, and my professor at Harvard has been annotated with Literal by the first annotator. Here we consider that the two boundaries are compatible, and that there is an agreement between the two annotators on one segment (the longest one), because of the common category Modulation.
12Each subcorpus represents one or several complete TED Talks speeches, in order to ensure a better understanding of the context.
          Nb lines  Nb EN tokens  Nb FR tokens  Nb ZH characters
control      100       3 055         3 238          4 195
1             95       1 792         1 774          2 388
2            106       2 282         2 545          3 851
3            101       2 189         2 357          3 380
4             92       1 381         1 489            -
5            133       2 566         2 766            -
6            120       2 691         2 919            -
Total        747      15 956        17 088         13 814

Table 3: Statistics of annotated subcorpora.
                   English   French   % EN tokens
Literal             8 701     9 086     67.44%
Equivalence           690       874      5.35%
Modulation          1 671     1 734     12.95%
Transposition         208       297      1.61%
Mod+Trans             250       301      1.94%
Generalization        198       159      1.53%
Particularization     391       560      3.03%
Idiom                   4         6      0.03%
Metaphor               16        19      0.12%
Simplification        166         0      1.29%
Explicitation           0       165      0.00%
Uncertain             127       148      0.98%
All types          12 422    13 349     96.29%
No Type               479       501      3.71%
Total nb tokens    12 901    13 850        -

Table 4: Statistics of English-French annotations (number of tokens).
6 Contrast between target languages
We present comparative statistics based on the three annotated trilingual subcorpora (see table 5). English and French are very similar in grammar and syntax, but Chinese translators must often change word or even phrase order to adapt the translation.
We can see that fewer English tokens are translated into Chinese using literal translation. For Equivalence, Modulation and Transposition, the proportions are not very different, but the boundaries of the segments can differ. Chinese translations use the complicated Modulation+Transposition less, but resort much more often to Generalization and Particularization. In general, Idiom and Metaphor are under-represented in both target languages, but translating by using a four-character Chinese idiom is considered good practice, which results in a concise language style and in translations adapted to Chinese culture. Simplification and Explicitation show clear differences with the French translations. Chinese translations often render only the most important information and leave some other content phrases untranslated. Concerning Explicitation, this includes adding necessary Chinese classifiers, inserting content words (e.g. a subject noun phrase) to keep the phrase grammatical instead of translating literally word by word, and some instances of resumptive anaphora (see section 4). There are also more instances of Uncertain; together with Simplification, these phenomena reflect that the quality of the Chinese translations is not as good as that of the French translations. Many more English words do not have any type attributed, because of the big differences in grammar. At the same time, Chinese translations are more concise than the transcriptions of oral English, which results in the omission of many English transition words.
Here are some examples:
1) EN: We had to worry about the lawyers and so on.
FR: On a dû se préoccuper des avocats et des trucs dans le genre. "We had to worry about the lawyers and things like that." (Literal, Modulation+Transposition)
ZH: 要顾及到很多法律问题 "have to worry about many legal issues" (Generalization)
2) EN: pressured the people a little bit about it
FR: obliger le peuple à en parler "force the people to talk about it" (Modulation)
ZH: 刨根问底 "inquire into the root of the matter" (Idiom)
3) EN: and it doesn’t matter how much information we’re looking at, how big these collections are or how big the images are.
FR: et ce quelle que soit la quantité d’informations que l’on visionne, la taille de ces collections ou la taille des images. "regardless of the amount of information viewed, the size of the collections or the size of the images." (Modulation+Transposition)
ZH: 不管所见到的数据有多少、图像集有多大以及图像本身有多大, Seadragon都拥有这样的处理能力。"it doesn’t matter how much data we’re looking at, how big the collections of images are, and how big the images themselves are, Seadragon possesses this processing ability" (Explicitation for the last phrase: resumptive anaphora which refers to information in the previous sentence)
                            EN-FR                           EN-ZH
                   English  French  %EN tokens   English  Chinese  %EN tokens
Literal             4 267    4 423    68.13%      3 307    5 311     52.80%
Equivalence           406      514     6.48%        426      629      6.80%
Modulation            589      617     9.40%        615      863      9.82%
Transposition         142      195     2.27%        166      258      2.65%
Mod+Trans             188      225     3.00%         90      134      1.44%
Generalization         90       65     1.44%        200      208      3.19%
Particularization     197      256     3.15%        273      661      4.36%
Idiom                   4        6     0.06%         10       21      0.16%
Metaphor               10       15     0.16%          6       10      0.10%
Simplification         92        -     1.47%        314        -      5.01%
Explicitation           -       58        -           -      929         -
Uncertain              79       79     1.26%        157      277      2.51%
All types           6 064    6 453    96.82%      5 564    9 301     88.84%
No type               199      223     3.18%        699      318     11.16%
Total nb tokens     6 263    6 676        -       6 263    9 619         -
Table 5: Contrasts between translations towards different target languages (number of tokens).
7 Conclusion and perspectives
In this work we are interested in translation relations which, to the best of our knowledge, have never been taken into account either in machine translation or in paraphrase generation exploiting translational equivalence. We categorized these relations and annotated them in a multilingual parallel corpus of TED Talks. We chose this specific genre (transcribed and translated speech) in order to obtain more diversity than with technical corpora. The measured inter-annotator agreement is strong for segments with the same boundaries, but we have adopted a more time-consuming annotation process with three passes to ensure a better annotation quality.
A task parallel to our corpus annotation is the development of a classifier to automatically detect these relations. Our long-term objective is to achieve better semantic control by using this information as important features for exploiting translational equivalence, in the search for monolingual segments in an equivalence relation (paraphrases) or an entailment relation.
We think that there are many other possible applications of this corpus, for example: evaluating machine translation or automatic word alignment on different translation relations, as in the work of Isabelle et al. (2017); understanding the characteristics of non-literal translations in order to improve machine translation towards generating natural and native expressions; or providing a concordancer for translation studies to investigate different translation relations. We could also supply pre-processed parallel corpora in other languages (Arabic, Spanish and Russian) for those interested in enriching the annotation and the comparative studies.
Acknowledgements
This work is funded by a scholarship from Paris-Sud University. We thank Yaqiu Liu for her annotation of the English-Chinese control corpus and for her valuable insights into corpus analysis. We are grateful to
our anonymous reviewers for their thoughtful and constructive comments.
References

Colin J. Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA, pages 597–604.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May.

Michel Charolles. 2002. La référence et les expressions référentielles en français. Ophrys.

Hélène Chuquet and Michel Paillard. 1989. Approche linguistique des problèmes de traduction anglais-français. Ophrys.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

Dun Deng and Nianwen Xue. 2017. Translation Divergences in Chinese-English Machine Translation: An Empirical Investigation. Computational Linguistics, 43(3):521–565.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 644–648. The Association for Computational Linguistics.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 758–764.

Ulrich Germann. 2008. Yawat: Yet Another Word Alignment Tool. In ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, Demo Papers, pages 20–23. The Association for Computational Linguistics.

Pierre Isabelle, Colin Cherry, and George F. Foster. 2017. A Challenge Set Approach to Evaluating Machine Translation. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2476–2486. Association for Computational Linguistics.

Zhongguo Li and Maosong Sun. 2009. Punctuation As Implicit Annotations for Chinese Word Segmentation. Computational Linguistics, 35(4):505–512, December.

Johanna Monti, Federico Sangati, and Mihael Arcan. 2015. TED-MWE: a bilingual parallel corpus with MWE annotation. Towards a methodology for annotating MWEs in parallel multilingual corpora. In Proceedings of the Second Italian Conference on Computational Linguistics CLiC-it, pages 193–197, Trento.

Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. 2015. Adding semantics to data-driven paraphrasing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1512–1522.

Agata Savary, Manfred Sailer, Yannick Parmentier, Michael Rosner, Victoria Rosén, Adam Przepiórkowski, Cvetana Krstev, Veronika Vincze, Beata Wójtowicz, Gyri Smørdal Losnegaard, Carla Parra Escartín, Jakub Waszczuk, Matthieu Constant, Petya Osenova, and Federico Sangati. 2015. PARSEME – PARSing and Multiword Expressions within a European multilingual network. In 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2015), Poznan, Poland, November. Available at https://hal.archives-ouvertes.fr/hal-01223349/document.

Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, and Antoine Doucet. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In Stella Markantonatou, Carlos Ramisch, Agata Savary, and Veronika Vincze, editors, Proceedings of the 13th Workshop on Multiword Expressions, MWE@EACL 2017, Valencia, Spain, April 4, 2017, pages 31–47. Association for Computational Linguistics.

Helmut Schmid. 1995. Improvements In Part-of-Speech Tagging With an Application To German. In Proceedings of the ACL SIGDAT-Workshop, pages 47–50.

Jean-Paul Vinay and Jean Darbelnet. 1958. Stylistique comparée du français et de l’anglais: méthode de traduction. Bibliothèque de stylistique comparée. Didier.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 112–121, Santa Fe, New Mexico, USA, August 20, 2018.
Towards an Automatic Classification of Illustrative Examples
in a Large Japanese-French Dictionary Obtained by OCR
Mutsuko Tomokiyo
UGA, LIG-GETALP Grenoble
Christian Boitet
UGA, LIG-GETALP Grenoble
Mathieu Mangeot
UGA, LIG-GETALP Grenoble
Abstract
This paper focuses on improving the Cesselin, a large, open source Japanese-French bilingual dictionary digitized by OCR, available on the web, and contributively improvable online. Labelling its examples (about 226,000) would significantly enhance their usefulness for language learners. Examples are proverbs, idiomatic constructions, normal usage examples, and, for nouns, phrases containing a quantifier. Proverbs are easy to spot, but not the other types. To find a method for automatically or at least semi-automatically annotating them, we have studied many entries, and hypothesized that the degree of lexical similarity between the results of MT into a third language might give good cues. To confirm that hypothesis, we sampled 500 examples and used Google Translate to translate into English the Cesselin Japanese expressions and their French translations. The hypothesis holds well, in particular for distinguishing examples of normal usage from idiomatic examples. Finally, we propose a detailed annotation procedure and discuss its future automatization.
1 Introduction
The Cesselin Japanese-French bilingual dictionary was edited by a French missionary, Gustave Cesselin (1873-1944), and published in 1939 and 1957 in Japan (82,703 entries and 2,345 pages)1. We have converted it to an electronic form by using an OCR (optical character reader) and made some improvements.
ments. An example of the entry for飲む nomu “drink” in original and online form on the Jibiki platform
is given in Figure 2 below (before the references). Our aims are to improve, update and complete the Cesselin to make it available as a modern on-line
dictionary for Japanese or French language learners or researchers in Japanese studies, and also to con-tribute to some improvement in the translation performance of Japanese↔French machine translation (MT) systems. About 60,000 other articles have already been added from other free sources.
The Cesselin includes knowledge on spoken and written Japanese2, and rich illustrative examples containing the headwords. It is however not satisfactory for modern users, because it contains outdated
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
1A few other on-line Japanese-French dictionaries exist: the Diko (3,000 entry words, Diko), the Dictionnaire français-japonais (15,000 entry words, Lexilogos), the Dictionnaire japonais (24,000 entry words, Assimil), the Dictionnaire Glosbe (number of entry words unknown, Glosbe), etc. The number of words in these dictionaries may change, because they are edited and extended in a collaborative way.
2The Japanese writing system was fixed by the Japanese government in 1986, in order for the written form to conform to pronunciation.
phonetic descriptions, old Chinese characters and old forms of Okurigana.3 Also, its examples are given without any indication4 of their types. As in many dictionaries, the examples are given to help in understanding the various meanings and usages of words, and to show their grammatical constructions and their usage in collocations, proverbs, and quantified phrases. Information on the exact type of each example would help users looking for specific information such as numerical quantifiers, and would enable MT developers to deal methodically with lexical ambiguities (Tomokiyo et al., 2016). It would also help language learners or researchers to practice Japanese or French, and to deepen their understanding of Japanese culture.
In Section 2, we explain an experiment made in order to investigate what kinds of examples are included in the Cesselin. After getting a general impression while correcting the OCR results of about 15,000 entries, we have manually analyzed 500 examples (out of about 226,000), taken from 25 complex entries, established a classification of examples, and proposed classification criteria. In Section 3, we propose an automatable step-by-step procedure to classify and annotate the remaining 225,500 examples according to these criteria. In Section 4, we briefly present the Universal Networking Language (UNL) Universal Words (UW) dictionary we used to test words in English translations for synonymy. We also examine some technical aspects of the automatization of our procedure. That step, now beginning, will likely save much human expert time, because it will only be necessary to “post-edit” automatically generated annotations and, we hope, to correct less than 5% of them.
2 Case studies on using results of Google Translate5 for the classification
To find a classification of the examples that would be useful and amenable to automatic, or at least semi-automatic, annotation, we have studied many entries, and hypothesized that the degree of lexical similarity between the results of MT into a third language might give good cues. To confirm that hypothesis, we selected 500 examples from complex entries (having more than 25 examples on average) and used Google Translate (GT) to translate into English both their Japanese expressions and their French translations.6 We then compared the two translation outputs from the point of view of lexical similarity. Figure 1 illustrates the process.
That comparison confirmed the assumption7 that, if an example is a proverb, or if a word in an example is employed in a collocation, the J→E translated example (Ej) usually contains lexemes that completely differ from those of the F→E MT-translated example (Ef). Indeed, in proverbs and collocational expressions (including classifiers/quantifiers), words tend to be used in a figurative meaning, and GT does not in general produce adequate translations for them, but very often literal and incorrect ones. Thus, the two outputs contain different lexemes,8 and one can consequently distinguish proverbs and collocations from other usage examples.
Figure 1: Assumption: according to the similarity of the bag of words w(Ej) and w(Ef), the example is collocational, proverbial, idiomatic, or concerns a quantifier/classifier
3A Japanese word is written in Kanji (漢字, Chinese characters) and Hiragana (平仮名, Japanese characters), in Kanji only, or in Hiragana and/or Katakana (片仮名) only. Okurigana (送り仮名) are hiragana suffixes following Kanji stems. The rules for Okurigana changed several times, and were standardized by the Japanese government in 1973.
4Excepting indications of figurative meaning usage by the label {fig} and some domain-specific names.
5https://translate.google.fr/?hl=fr - fr/ja/La démocratie n%27a plus de cote.
6We assume that the translations into French proposed by the Cesselin are correct.
7The assumption is grounded on our studies of Japanese, French and English classifiers and quantifiers (Tomokiyo et al., 2017).
8We don’t consider fillers such as John and Mary in John pulled Mary’s leg, only pulled and leg.
Cesselin:  Japanese ------------→ French
              ↓ GT J→E              ↓ GT F→E
              Ej                    Ef
           w(Ej) ≈? w(Ef)
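The comparison w(Ej) ≈? w(Ef) can be sketched as a bag-of-words overlap. This is a simplified illustration only: in the paper the matching was judged manually by a linguist, so the tokenizer, stop-word list and Jaccard-style score below are our own invented stand-ins.

```python
def bag_of_words(sentence, stopwords=frozenset({"a", "an", "the", "is", "to"})):
    """Lowercased word bag, ignoring basic punctuation and a few function words."""
    tokens = sentence.lower().replace(".", "").replace(",", "").split()
    return {t for t in tokens if t not in stopwords}

def overlap(ej, ef):
    """Share of common words between the J→E output Ej and the F→E output Ef."""
    wj, wf = bag_of_words(ej), bag_of_words(ef)
    return len(wj & wf) / len(wj | wf)

# Literal example: both routes produce the same English -> full concordance.
print(overlap("Drink water.", "Drink water."))
# → 1.0

# Proverb: GT renders the Japanese literally but the French idiomatically,
# so the two outputs share no content words -> no concordance (Ø P).
print(overlap("I'm not intoxicated with drinking alcohol.",
              "There is no smoke without fire."))
# → 0.0
```

A high score suggests an ordinary usage example, while a near-zero score flags a candidate proverb or collocation to be checked against the other criteria of Table 2.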
2.1 Japanese-English and French-English translation of the 500 examples by the GT system
The structure of an entry in the Cesselin is as follows: headword (in Japanese), pronunciation given in two different forms (in Japanese and in transliteration in Latin characters, called ローマ字 (Rômaji)), conjugated forms for verbs and adjectives, part of speech label, definitions (in French), and examples in Japanese, with their transliteration in ローマ字 and their translation or translations into French. Here is an extract of the entry for the verb nomu "drink" from the Cesselin dictionary.
nomu[ma, mi, me] 飲む – 呑む【のむ】v.t. 1. Boire, avaler.9
2. Aspirer, sucer, fumer.10
3. Prendre.11
4. 飲まねば薬も効能なし(Nomaneba kusuri mo kônô nashi) “Qui veut la fin veut les moyens.12”
5. 飲まぬ酒には酔はぬ (Nomanu sake ni wa yowanu) “Il n'y a point de fumée sans feu.13”
We have submitted to GT 500 input examples14 contained in about 20 complex entries of the Cesselin (Table 1): 飲む nomu “drink,” 買う kau “buy,” 走る hashiru “run,” 居る iru “stay, be,” 来る kuru “come,” 思う omou “think,” 言う iu “say,” 可愛い kawaii “pretty,” 強い tsuyoi “strong,” 安い yasui “cheap,” すぐ sugu “soon, immediately,” から kara “because, from, since,” 家 ie “house, family,” 夜 yoru “night,” 足 ashi “leg,” 車 kuruma “car,” 年 nen “year,” 何時 nanji “what time,” だけ dake “only,” ながら nagara “while,” 必要 hitsuyou “necessity,” もう mou “already,” 事 koto “thing.”
(a) Japanese examples in the Cesselin | (b) GT J→E translations | (c) GT F→E translations | (d) Cesselin French translations | (e) Matching results for J-E and F-E15

水を飲む。(Mizu wo nomu) | Drink water. | Drink water. | Boire de l'eau. | 100%
飲まぬ酒には酔わぬ。(Nomanu sake niha yowanu.) | I'm not intoxicated with drinking alcohol. | There is no smoke without fire. | Il n'y a point de fumée sans feu. | (Ø P)
飲んだり吐出したりする。(Nondari hakidashitari suru.) | It drinks or spits out. | Swallow and turn in turn. | Avaler et rendre tour à tour. | Ø
飲んだり吐出したりする。(Nondari hakidashitari suru.) | It drinks or spits out. | Accept and refuse. | {fig} Accepter et refuser. | Ø {fig} (collocation)

Table 1: Excerpt of translations obtained by GT (J→E and F→E)
9To drink, to swallow
10To inhale, to suck, to smoke
11To take
12He who wills the end wills the means.
13There is no smoke without fire.
14The chosen words belong to the 200 most frequent words in the frequency list, except for 助詞 joshi “postpositions”: 『現代日本語書き言葉均衡コーパス』(Gendai Nihongo kakikotoba kinkou kôpasu) 語彙表 (Goihyou), The Balanced Corpus of Contemporary Written Japanese (BCCWJ) (http://pj.ninjal.ac.jp/corpus_center/bccwj/freq-list.html).
15Matching results are judged by a linguist in the experiment.
2.2 Comparison and contrast of the two translation results
We have asked the following questions concerning the two outputs by GT:
• do the two outputs include totally the same lexemes?
• when they include different lexemes, what kind of lexemes are common to both?
Column (e) in Table 1 shows a manual comparison of the J→E translation outputs (Column (b)) with the F→E translation outputs (Column (d)) from the point of view of the lexemes in an example.
Table 2 shows the comparison of the two translation outputs.
| items | explanation | count | percentage |
|---|---|---|---|
| (1) 100% concordance (100%) | GT J→E and F→E translations of an example include exactly the same words. | 63/500 | 12.6% |
| (2) 100%* concordance (100%*) | GT J→E and F→E translations of an example include synonymous words. | 79/500 | 15.8% |
| (3) Concordance of words with ⊆ or ⊇ (100% ⊆ or ⊇) | The lengths of the GT J→E translation and the GT F→E translation differ significantly, but the two include almost the same words. | 26/500 | 5.2% |
| (4) Concordance by morphosyntactic analysis (100%**) | GT J→E and F→E translations of an example include the same words for the predicate verb and its principal actants. | 15/500 | 3.0% |
| (5) No concordance with {fig} label (Ø {fig}) | GT J→E and F→E translations of an example don't include any common words and the French example is marked by the {fig} label. | 8/500 | 1.6% |
| (6) No concordance (Ø) | GT J→E and F→E translations of an example don't include any common word except articles. | 269/500 | 53.8% |
| (7) No concordance and proverbs (Ø P) | GT J→E and F→E translations of an example don't include any common content word and the example appears in a Japanese proverbs dictionary. | 32/500 | 6.4% |
| (8) No concordance with usage domain (Ø D) | GT J→E and F→E translations of an example having an indication of usage domain don't include any common word. | 4/500 | 0.8% |
| (9) Zero output (x) | Zero output by GT. | 4/500 | 0.8% |

Table 2: Comparison of J-E and F-E translation outputs for the 500 examples: different cases
According to this experiment, the examples in the Cesselin can be classified into five classes:
• ordinary usage examples to give grammatical information or a context where headwords are used
• proverbs
• collocational expressions
• classifiers/quantifiers, which depend on signified entities
• domain-specific examples
2.3 Explanations for examples annotation
In the following examples for the verb 飲む nomu "drink" in Table 3, (a) is a proverb, (b) is a sentence including the form 飲む in a figurative meaning, and (c) is a sentence that is neither a proverb nor a collocation, but ordinary usage. When one compares the lexemes in the J→E and F→E translations by GT,
while (a) and (b) don't contain any common word in their predicative part, in (c) the two translations contain two common words (drink and water).

That confirms that examples showing ordinary usages have a tendency to be translated as sentences having many words (or synonyms) in common. We thought this phenomenon could be a decisive cue to classify all examples in the Cesselin, because J→E translations by GT16 of polylexical expressions are almost always word-for-word translations. We suggest that this comes from a lack of information on the figurative usages of words in GT resources (bilingual dictionary and parallel corpus). Hence, when a word in a Japanese example is used in a figurative meaning depending on the context, the translation result almost never lexically matches the F→E translation result.
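The lexical-matching criterion described above can be sketched as a simple overlap test between the two English outputs. This is our own minimal illustration, not the authors' implementation: it assumes whitespace tokenization and a small English article list, whereas the real procedure relies on lemmatization.

```python
# Sketch of the lexical concordance test between the GT J->E and F->E outputs.
# Assumptions (ours, not from the paper): whitespace tokenization and a
# minimal English article list; the actual procedure uses lemmatization.

ARTICLES = {"a", "an", "the"}

def content_lexemes(translation):
    """Lowercased word set of a translation, with punctuation and articles removed."""
    words = translation.lower().replace(".", "").replace(",", "").split()
    return {w for w in words if w not in ARTICLES}

def concordance(j_to_e, f_to_e):
    """Classify a pair of English translations of the same Cesselin example."""
    wj, wf = content_lexemes(j_to_e), content_lexemes(f_to_e)
    if wj == wf:
        return "100%"           # same lexemes: ordinary-usage candidate
    if not (wj & wf):
        return "Ø"              # no common lexeme: collocation/proverb candidate
    if wj <= wf or wj >= wf:
        return "100% ⊆ or ⊇"    # one translation subsumes the other
    return "partial"

print(concordance("Drink water.", "Drink water."))                    # -> 100%
print(concordance("It drinks or spits out.", "Accept and refuse."))   # -> Ø
```

The "Ø" class is where figurative usages surface: since GT translates the Japanese side almost word-for-word, a figurative example shares virtually no lexeme with the translation of its French idiomatic equivalent.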
| | Japanese examples | GT J-E translations | GT F-E translations | Cesselin translations of Japanese examples |
|---|---|---|---|---|
| (a) Proverb | 飲まぬ酒には酔わぬ17 (Nomanu sake ni-ha yowanu) | I'm not intoxicated with drinking alcohol. | There is no smoke without fire. | Il n'y a point de fumée sans feu. |
| (b) Collocation | 彼は妻君に飲まれている。(Kare ha saikun ni nomareteiru) | He is being drunk by his wife. | His wife is bowing him. | Sa femme le berne.18 |
| (c) Ordinary usage | 水を飲む。(Mizu wo nomu) | Drink water. | Drink water. | Boire de l'eau. |

Table 3: Examples for a proverb, a collocational expression and an ordinary usage
Among ordinary usage examples, proverbs, collocational expressions, classifiers/quantifiers and domain-specific examples, we have distinguished proverbs from the other types of examples by using a proverb dictionary. Using also labels such as {fig} (in Table 4) and {botanic, zoology, etc.}, which are attached to some examples in the Cesselin, we have been able to distinguish collocational expressions and domain-specific examples, respectively, from the other examples.
3 Examples annotation procedure
We propose the following procedure for classification and annotation of the examples in the Cesselin.
Step 1) Differentiation of proverbs from others

We distinguish proverbs from other examples using the proverb dictionary.19 When Japanese examples appear in this Japanese proverbs dictionary, they are annotated as {prov} for proverb.
Step 2) Differentiation of ordinary usage examples from others

When the lexemes in the two translation outputs show 100% concordance, the example is annotated as {ordusg} for ordinary usage.
Step 3) Differentiation of collocational expressions and expressions in specific domains

In the Cesselin, the label {fig} indicates a figurative usage of headwords, but is not used consistently. When the translation for an example is marked with that label (see {fig} in Table 4), we annotate it as {colexp} for collocations20, and when the French translation is labelled with {botanic, zoology, etc.}, we annotate it as {domexp} for domain-specific usage.

16 Of course, the translation performance of commercial systems depends on the considered language pair.
17 A word-for-word translation would be: One doesn't get drunk on sake which one doesn't drink.
18 In modern French, one would say Sa femme le trompe, another illustration of the need to update the content.
19 新明解故事ことわざ辞典 (Shinmeikai koji-kotowaza jiten, Dictionary of legends and proverbs), 三省堂編修所 (Sanseido Editions), Tokyo. That dictionary contains 7,300 items.
20 Note that phrases or sentences which include words used in a figurative meaning are not always collocational expressions.
Table 4: Examples of words having a figurative meaning, with and without the label {fig}
Step 4) Differentiation of collocational expressions without any label in the Cesselin

There are cases where French translations in the Cesselin have no {fig} label, but are used in a figurative meaning (Table 5). When their English translations have no common lexeme (w(Ej) ∩ w(Ef) = ∅), we classify them as collocational expressions, after referring to a KWIC (Keywords In Context) list23 obtained by using the Sketch Engine24 software. When an example appears many times in the KWIC list, it is considered a collocation or a classifier/quantifier (Mari, 2011) and we annotate it as {colexp} for collocational usage. If, comparing with the KWIC list, the example is a noun phrase of the type "number + noun" or "noun + のような (no-youna, like) + noun,"25 we annotate it as {quant} for classifiers/quantifiers (Tomokiyo et al., 2017; Miyagawa, 1989).
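Step 4 can be sketched as a frequency lookup over the KWIC list combined with a surface-pattern test. This is a hypothetical simplification: the frequency threshold, the toy KWIC list, and the reduction of the quantifier test to the のような pattern are our own assumptions, not the authors' settings.

```python
import re

# Sketch of Step 4 (our own simplification): an expression whose two English
# translations share no lexeme is looked up in a KWIC list; frequent hits are
# tagged {colexp}, and "noun + のような + noun" patterns are tagged {quant}.
# FREQ_THRESHOLD is a hypothetical cut-off, not a value from the paper.

QUANT_PATTERN = re.compile(r"のような")   # "no-youna, like" quantifier pattern
FREQ_THRESHOLD = 5

def annotate(expression, kwic_lines):
    """Label an unmatched expression using KWIC frequency and shape."""
    if QUANT_PATTERN.search(expression):
        return "{quant}"                  # classifier/quantifier shape
    hits = sum(1 for line in kwic_lines if expression in line)
    if hits >= FREQ_THRESHOLD:
        return "{colexp}"                 # frequent in context: collocation
    return "unlabelled"

corpus = ["今日こそ足を洗うつもりだ"] * 6   # toy KWIC list
print(annotate("足を洗う", corpus))        # -> {colexp}
print(annotate("山のような問題", corpus))  # -> {quant}
```

In the real setting the KWIC list is produced by Sketch Engine from the parallel corpus, so the lookup would be a query against that tool rather than a substring scan.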
| headword | Japanese example | GT J→E translation | GT F→E translation | French translation for the Japanese example |
|---|---|---|---|---|
| 足 (ashi, foot) | 足を洗う26 (ashi wo arau) | Wash your feet. | Rise from a lower class. | S'élever d'une classe inférieure. |

Table 5: Examples of collocational expressions without any label in the Cesselin
Step 5) Checking the synonymy relationship between the main words in the two translated examples

In Table 6, the J→E translation of the example 必要費 hitsuyou hi "necessary cost" differs from the F→E translation in the words cost and expenses, and the J→E translation of the example 車に乗る kuruma ni noru "to get in a car" differs from the F→E translation in the predicative verbs ride and get in. In these cases, we check whether a synonymy relationship holds between the two words by using a UNL UW dictionary (see Section 4.1). If the two words (rather, word senses) are synonymous, the two translated examples are considered to have 100% concordance, and the example is annotated as {ordusg} for ordinary usage.
21 To catch the person, drinking him.
22 The concrete meaning of 飲んだり吐出したりする: Drinking and spitting alternately.
23 The input is a corpus developed by Mathieu Mangeot ("Mathieu's corpus") in the framework of the Jibiki project since 2014 (http://jibiki.fr). It is a Japanese-French parallel corpus coming from newspapers, novels, the Bible, etc., and contains 9,268,785 words at the moment.
24 The Sketch Engine is a corpus management tool containing 400 ready-to-use corpora. We have added Mathieu's corpus to it. See https://www.sketchengine.co.uk
25 E.g. 山のような問題 yama no youna mondai "problems like a mountain" → a lot of problems
26 The concrete meaning of 足を洗う is: Wash one's feet. Note that the expression 足を洗う makes sense in concrete as well as figurative usage. We have not yet solved this problem.
| items | Japanese examples | GT J-E translations | GT F-E translations | French translation of Japanese examples |
|---|---|---|---|---|
| Collocation without indication | 人を飲んで掛かる (Hito wo nonde kakaru)21 | I'm drinking people and hanging. | Miss someone, look down. | Manquer à quelqu'un, regarder de haut. |
| Collocation with indication {fig} | 飲んだり吐出したりする22 (Nondari hakidashitari suru) | To drink or to spit. | Accept and refuse. | {fig} Accepter et refuser. |
| headword | Japanese examples | GT J→E translations | GT F→E translations | French translations | comparison |
|---|---|---|---|---|---|
| 必要 (hitsuyou)27 | 必要費 (hitsuyou hi) | Necessary cost | Necessary expenses. | Dépenses nécessaires. | 100%*28 |
| 車 (kuruma)29 | 車に乗る (kuruma ni noru) | I ride a car. | Get in the car | Monter en voiture. | 100%* |

Table 6: Two translated examples having synonymous lexemes
Step 6) Analyzing translated examples having different lengths

Examples translated into English may have different lengths because of the nominalization30 of a predicative verb, as in (a) in Table 7, or because of a difference of register (formal vs. informal setting, degree of politeness), as in the utterances in (b) in Table 7.

In example (b), Excuse me, are not you, and Mr. Yamada are common to the two translated examples, and only Please and but are added in the F→E translation. In this case, the example is syntactically analyzed, and if the predicative verb and its main actants are the same, the two examples are considered to have 100% concordance, and we annotate the example as {ordusg} for ordinary usage.
| Expression in the Cesselin | Japanese examples | GT J→E translations | GT F→E translations | French translation for the Japanese examples |
|---|---|---|---|---|
| (a) もう (mou) (J ⊇ F) | もう休みましょう (Mou yasumi masyou)31 | Let's have a rest now. | Let's rest now! | Reposons-nous maintenant! |
| (b) ながら (nagara) (J ⊆ F) | 失礼ながら山田様ではございませんか (Shitsurei nagara Yamada san deha gozaimasen ka)32 | Excuse me, are not you, Mr. Yamada? | Please excuse me, but are not you Mr. Yamada? | Veuillez bien m'excuser, mais n'êtes-vous pas monsieur Yamada? |

Table 7: Two translated examples having different lengths
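Step 6 reduces each translation to its predicative verb and main actants before comparing. A minimal sketch follows; we assume the parses are supplied by an external syntactic analyzer (which we do not reproduce here), so the dictionaries below are hand-written, hypothetical parse outputs.

```python
# Sketch of Step 6 (assumption: normalized parses come from an external
# English analyzer; the paper's own analysis is not reproduced here).
# Two translations count as 100% concordant when the predicative verb
# and its main actants match, whatever extra politeness words are added.

def core_match(parse_j, parse_f):
    """Compare predicate and main actants of two parsed translations."""
    keys = ("predicate", "subject", "object")
    return all(parse_j.get(k) == parse_f.get(k) for k in keys)

# Hypothetical normalized parses for "Let's have a rest now." / "Let's rest now!"
j_parse = {"predicate": "rest", "subject": "we"}
f_parse = {"predicate": "rest", "subject": "we"}
print(core_match(j_parse, f_parse))  # -> True, so annotate as {ordusg}
```

Words outside the predicate-actant core (Please, but, politeness markers) are simply ignored by the comparison, which is what lets the different-length pair in Table 7 (b) still count as concordant.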
4 Towards automatic classification
4.1 Usage of a UW dictionary as a pivot of interlingual lexemes
In order to get information on the synonymy relationship between English words in the two translated examples, and also morphosyntactic information, we have used a UW dictionary33 made available by the UNL project,34 as a pivot of “interlingual lexemes.”
27 hitsuyou = "necessary"
28 The symbol 100%* stands for the matching of 100% of the two translation outputs with synonymy information.
29 A car.
30 We are not yet engaged in this case.
31 Let's take a rest now.
32 Excuse me, are not you Mr. Yamada?
33 The UNL-UW dictionary contains at the moment 126,9421 headwords for Japanese, 520,305 headwords for French and 1,458,686 headwords for English. The semantic attributes consist of 58 labels and 39 semantic relation labels.
34 The UNL (Universal Networking Language) project was launched at the Institute of Advanced Studies (IAS) of the United Nations University in Tokyo (UNU) in April 1996 under the aegis of the United Nations University in Tokyo, with financial
A Universal Word (UW) is a character-string which represents a sense of a word. It is made of a headword (usually an English word or term) followed by a list of semantic restrictions between parentheses. A UNL semantic representation of an utterance is a hypergraph, where nodes are UWs having semantic attributes, and arcs bear semantic relations between two nodes or scopes. One node has to bear the @entry attribute. A scope is an arc-connected subgraph having an entry node, and may be referred to as the origin or extremity of an arc.
Here are some examples of entries in the UW dictionary:
(a) expense(icl>cost) [expense is "included" in cost, hence cost has its nominal sense here]

(b) look(agt>thing, equ>search, icl>examine(icl>do, obj>thing))

Here:
- icl in (b) gives synonym information (examine);
- agt and obj in (b) denote the nature of the two main actants of this predicative verb (in discourse);
- equ in (b) expresses the fact that look (maybe the headword should be look_for) is linked to search by an equ path in the UNL ontology map35 (Uchida et al., 2006).
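Reading such entries amounts to splitting the headword from its top-level restrictions while respecting nested parentheses, then keeping the equ/icl values as synonymy cues. The parser below is our own minimal sketch, not the UNL project's tooling.

```python
# Minimal UW-string reader (our own sketch, not the official UNL tools):
# "headword(rel>value, rel>value(...))" -> (headword, top-level restrictions).

def parse_uw(uw):
    """Split a UW string into its headword and top-level restrictions."""
    head, _, rest = uw.partition("(")
    restrictions, token, depth = [], "", 0
    for ch in rest:
        if ch == "(":
            depth += 1
            token += ch
        elif ch == ")":
            if depth == 0:          # closing parenthesis of the UW itself
                break
            depth -= 1
            token += ch
        elif ch == "," and depth == 0:
            restrictions.append(token.strip())
            token = ""
        else:
            token += ch
    if token.strip():
        restrictions.append(token.strip())
    return head, restrictions

def synonyms(uw):
    """Collect the values of equ/icl restrictions as synonymy cues."""
    _, restr = parse_uw(uw)
    return [r.split(">", 1)[1].split("(")[0] for r in restr
            if r.startswith(("equ>", "icl>"))]

print(parse_uw("expense(icl>cost)"))
# -> ('expense', ['icl>cost'])
print(synonyms("look(agt>thing, equ>search, icl>examine(icl>do, obj>thing))"))
# -> ['search', 'examine']
```

With such a reader, checking whether cost and expense (or ride and get in) can count as synonymous reduces to testing whether one word appears among the equ/icl cues of a UW of the other.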
4.2 Mechanisms needed for automatic annotation
Before starting this research, we produced a morphosyntactic analysis of the Cesselin examples by using MeCab (Mangeot, 2016).36 Hence, the Japanese examples are already segmented into words and annotated according to a grammar based on Hashimoto's grammar.37
In order to build an automatic classifier/annotator of the Cesselin examples, we further need:

(1) a mechanism to match examples in the Cesselin with proverbs contained in one or more proverb dictionaries. As no Japanese proverb dictionary is available in electronic form (at least not in .epub format), a first challenge is to build an online Japanese proverb database. We envisage using the same method as that used for computerizing the Cesselin.

(2) a mechanism to call GT on an example and its translation, and to evaluate the lexical similarity of the two translation outputs. The first part is easy, while the second needs the three following resources.

(3) a mechanism to call an English morphosyntactic analyzer on the translated examples, to get the lemmas and possibly other attributes (e.g., number, determination). PhpMorphy would be a possibility (http://phpmorphy.sourceforge.net/dokuwiki/).

(4) a mechanism to check the synonymy between two words by comparing the UWs having their lemmas as headwords in the UW dictionary. For this, we intend to put the UW dictionary in the form of a lexical network (an earlier version has been made using the PIVAX/Jibiki lexical database).
(5) a mechanism to access a KWIC list produced from the set of Japanese proverbs.
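The mechanisms above can be chained into a single classifier. The sketch below is a hypothetical skeleton of that pipeline: every component is a placeholder passed in as a function, and the names, signatures and ordering of the checks are our own assumptions rather than the authors' implementation.

```python
from collections import namedtuple

# Hypothetical end-to-end skeleton chaining mechanisms (1)-(5) above.
# All component functions are injected placeholders, not real resources.

Example = namedtuple("Example", "japanese french")

def annotate_example(ex, proverb_db, kwic, gt_je, gt_fe, synonymous):
    if ex.japanese in proverb_db:                       # mechanism (1): proverb lookup
        return "{prov}"
    ej = set(gt_je(ex.japanese).lower().split())        # mechanism (2): call GT twice
    ef = set(gt_fe(ex.french).lower().split())
    if ej == ef or synonymous(ej, ef):                  # mechanisms (3)+(4): lemma/synonym match
        return "{ordusg}"
    if "{fig}" in ex.french:                            # Cesselin {fig} label (Step 3)
        return "{colexp}"
    if not (ej & ef) and kwic.count(ex.japanese) > 1:   # mechanism (5): KWIC frequency
        return "{colexp}"
    return "unclassified"

ex = Example("水を飲む。", "Boire de l'eau.")
print(annotate_example(ex, proverb_db=set(), kwic=[],
                       gt_je=lambda s: "drink water",
                       gt_fe=lambda s: "drink water",
                       synonymous=lambda a, b: False))   # -> {ordusg}
```

Plugging in the real proverb database, GT calls, PhpMorphy lemmatization, the UW lexical network and the KWIC list would turn each placeholder into one of the mechanisms listed above.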
5 Conclusion and perspectives
Labelling the examples (about 226,000) of the Cesselin would significantly enhance their usefulness for language learners. We have hypothesized that the degree of lexical similarity between the results of MT (by GT) into English might give good cues. That hypothesis has been confirmed by manually applying a procedure derived from it to 500 examples, obtaining 100% correct annotations. Incidentally, that success confirms that the handling of polylexical expressions by MT systems (even neural ones) is still very
support from the ASCII corporation and IAS. See http://www.undl.org/unlsys/unl/unl2005/attribute.htm (Uchida et al., 2006).
35 The semantic relation labels are created from the UNL ontology, which stores all relational information in a lattice structure, where UWs are interconnected through relations including hierarchical relations (10 levels), such as icl (a-kind-of) and iof (an-instance-of), which denote a headword's sub-meanings. See http://www.undl.org/unlexp/.
36 Written Japanese has no space between words in a sentence.
37 Hashimoto's grammar was introduced by the linguist Shinkichi Hashimoto (1946), and is widely adopted in the educational field in Japan.
poor. We hope our work could help improve them, at least for the J-F pair. We are working towards automating that procedure and applying it to the remaining 225,500 examples. Of course, we don't expect to get 100% correct annotations on this large set, but we hope that the corrections, done by cooperative online post-editing, will be limited to 5% or 10%.
We are also planning to study whether this method could spot corresponding proverbs, collocational expressions and quantified expressions in bilingual aligned J-F corpora. It will be a good surprise if it works, because sentences in usual texts tend to be considerably longer than dictionary examples, which are in general not very long. Also, general texts rarely include proverbs, which are frequent in dictionary examples and easy to spot. However, our method of differentiating collocational expressions from other types of expressions might also work in general documents. We plan to test this in the future. It might also be possible to disambiguate polylexical expressions by describing the context where a word appears in the UW dictionary (Uchida et al., 2006).
Figure 2: The nomu “drink” entry in the original Cesselin and the online Cesselin/Jibiki (18 examples).
References
Mathieu Mangeot. 2016. Collaborative construction of a good quality, broad coverage and copyright free Japanese-French dictionary. HAL-01294566.
Alda Mari. 2011. Quantificateurs polysémiques. Université Paris-Sorbonne, Vol.23, France.
Sigeru Miyagawa. 1989. Structure and case marking in Japanese. Syntax and Semantics, Vol. 22, New York.
Mutsuko Tomokiyo, Mathieu Mangeot, and Christian Boitet. 2016. Corpus and dictionary development for classifiers/quantifiers towards a French-Japanese machine translation. COLING, CogALex 2016, pages 185-192, Japan.

Mutsuko Tomokiyo, Mathieu Mangeot, and Christian Boitet. 2017. Development of a classifiers/quantifiers dictionary towards French-Japanese MT. MT-Summit 2017, Research Track, Vol. 1, pages 216-226, Japan.

Hiroshi Uchida, Meihyin Zhu, and Tarcisio Della Senta. 2006. Universal Networking Language. UNDL Foundation, Japan.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 122–130, Santa Fe, New Mexico, USA, August 20, 2018.
Contractions: To Align or Not to Align, That Is the Question
Anabela Barreiro
INESC-ID, Rua Alves Redol 9
1000-029 Lisboa, Portugal

Fernando Batista
Instituto Universitário de Lisboa (ISCTE-IUL)
INESC-ID, Rua Alves Redol 9
1000-029 Lisboa, Portugal
Abstract
This paper performs a detailed analysis of the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually in a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether each contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go into two separate translation alignment pairs (PT - [no seio de] [a União Europeia] EN - [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT - [no que diz respeito a] EN - [with regard to] or PT - [além disso] EN - [in addition]). A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
1 Introduction
The past decade has seen a significant advance in the field of machine translation, mainly due to the growth of publicly available corpora, from which an enormous amount of translation alignments have been extracted. Alignments of multiword units and other phrases represent the driving force in the development of translation systems, and the success of systems like Google Translate has a great deal to do with the huge lexical coverage available in the large amounts of corpora that they have access to (Barreiro et al., 2014b) and from which translation alignments are extracted. But the quality of these alignments is also very important. For example, several authors have pointed out that the integration of multiword units in translation models based on linguistic knowledge is considered an impact factor in obtaining better quality translations (cf. (Chiang, 2005), (Marcu et al., 2006), or (Zollmann and Venugopal, 2006), among others). Expert participation extends to the gathering, enhancement and integration of language resources, including non-contiguous multiword unit alignments (Barreiro and Batista, 2016). Above all, high quality machine translation depends on the quality of the alignments used in the processes of machine learning. Some systems use unsupervised learning, in which the machine itself decides which segments of a source-language phrase align with which target-language phrase segments (Och and Ney, 2000), while others use supervised learning based on previous alignments made manually by linguists (Blunsom and Cohn, 2006). In this paper, we focus on the alignment of multiword units where contractions occur, a challenge that has been overlooked in the existing literature and can be responsible for grammatical errors in translations.
A contraction is a word formed from two or more words of different parts-of-speech (most frequently) or the same part-of-speech (more seldom) that would otherwise appear next to each other in a sequence. For example, in English the most common contractions are those where the word not is added to an
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
auxiliary verb in negative sentences, with omission of internal letters (e.g., is not → isn't), or those consisting of combinations of pronouns with auxiliary verbs, in which a word or a syllable is substituted by an apostrophe (e.g., it is → it's). These contractions are mainly used in speech and informal writing, but not in formal writing, unlike in the Romance languages, where contractions are non-optional. The most common contractions in the Romance languages are those where prepositions are contracted with articles or pronouns, with addition, replacement, or omission of letters. For example, in Portuguese the contraction nas "in, at the" results from the concatenation of the preposition em with the feminine plural definite article as; in Italian, the contraction degli "of" results from the concatenation of the preposition di with the masculine plural definite article gli; in Spanish, the contraction al "to, at the" results from the concatenation of the preposition a with the masculine singular definite article el; in French, the contraction aux "at, for, to the" results from the concatenation of the preposition à with the masculine plural definite article les. However, contractions can also be composed of two words with the same part-of-speech, e.g., two determiners (la une → l'une) or two prepositions (de après → d'après), as in French.
We describe a linguistically motivated approach to the alignment of multiword units in which the contractions occurring in these multiword units are required to be decomposed, except in specific circumstances determined by the context, such as when they constitute a non-variable (non-inflecting) element of a frozen multiword unit. Decomposition allows the correct alignment of a multiword unit, such as the prepositional compound apesar de "in spite of", in the sense that it separates the preposition de "of" that is part of the multiword from a concatenated element, in this case the feminine singular definite article a "the", which is not part of the multiword, but rather belongs to the phrase or expression that immediately follows it (e.g., apesar da → [apesar de] [a NP]). Similarly, the masculine plural definite article os "the" in the expression à luz de "in light of" requires to be split from the preposition de (e.g., [à luz dos] → [à luz de] [os NP]). However, the contraction of the preposition a "at" with the feminine singular definite article a in this expression is not decomposed from its composed form à, because it represents a fixed element of the multiword unit, never changing its form. Failure to align and process correctly these multiword units involving contractions with elements that are external to them leads to errors in the translated texts. Even if these errors do not affect the understanding of the translated text, they may compromise the quality of the translation, leading to greater post-editing efforts.
In our experiment, a linguist has manually pre-processed a subset of the reference Europarl parallel corpus (Koehn, 2005) containing 400 Portuguese-English parallel sentences. From this subset corpus, the EN–PT CLUE4Translation Alignment Collection was achieved by adopting the methodology described in Section 3 for the alignment of Portuguese multiwords and other phrasal units involving contractions in the original corpus. This methodology was achieved during the development of the CLUE Alignment Guidelines, a set of linguistically-informed guidelines for the alignment of translation or paraphrastic units in bitexts. In other words, the Guidelines were developed in two separate sets of documents containing statements by which to determine courses of action regarding the alignment of multiwords and other phrasal units, depending on whether these linguistic units are used in translation (CLUE4Translation Alignment Guidelines) or in paraphrasing (CLUE4Paraphrasing Alignment Guidelines). The approach reinforces the weight of multiwords as objects of representation in the alignment between the source and the target languages. This is independent of the source-target pair being two different languages (e.g., translation), two language varieties (e.g., variety adaptation), or the same language (e.g., paraphrases). The annotation of the subset corpus was performed with the CLUE-Aligner tool (Barreiro et al., 2016), a paraphrastic and translation unit aligner built to provide an efficient solution for the alignment of non-contiguous multiword units. CLUE-Aligner was developed within the eSPERTo project1, whose objective is to develop a context-sensitive, linguistically enhanced paraphrase system that can be used in natural language processing applications, such as intelligent writing aids, summarization tools, smart dialogue systems, and language learning, among others. Our broader research aims to contribute to new machine translation systems that produce high quality translation, for which linguistically-based alignments are extremely important.

1 eSPERTo stands for System of Paraphrasing for Editing and Revision of Text (in Portuguese, Sistema de Parafraseamento para Edição e Revisão de Texto). eSPERTo's core linguistic resources were extracted from OpenLogos bilingual resources (Barreiro et al., 2014a), the free open source version of the Logos System (Scott, 2003) (Barreiro et al., 2011) (Scott, 2018), adapted and integrated into the NooJ linguistic engine (Silberztein, 2016). eSPERTo is available at https://esperto.l2f.inesc-id.pt/esperto/esperto/demo.pl
2 Related Work
In NLP tasks, contractions are problematic for several reasons, among them: (i) two or more function words2, mostly with different parts-of-speech, overlap, which makes syntactic analysis and generation difficult; (ii) in cross-language analysis, the contrast between languages that have contractions and languages that do not have them, or do not have them in the same contexts, may present additional difficulties. Although most parsers and part-of-speech taggers can process contractions successfully, the alignment of segments in a parallel pair of sentences, where one particular segment corresponds to a contraction in one language and to more than one segment (no contraction) in the other language, has not been adequately addressed in alignment annotation guidelines or alignment research (cf. (Och and Ney, 2000), (Lambert et al., 2005), (Graça et al., 2008), or (Tiedemann, 2011), among others). For example, the Portuguese contraction of the preposition em and the demonstrative pronoun este in neste corresponds to two words in English (in this) and in Spanish (en esta), as illustrated in example (1).
(1) EN - to make further progress in this area
    ES - a fin de avanzar en esta dirección
    PT - com o intuito de conseguir um avanço neste (em + este) domínio
In addition, the freely available parallel corpora most used in alignment tasks (Koehn, 2005) have not been pre-processed in order to make possible the correct alignment of the pairs of multiword units involving contractions. These shortcomings and the lack of adequate directives to guide annotators in alignment tasks are responsible for machine translation errors, but they also negatively affect other NLP tasks involving alignment resources, such as paraphrasing, among others. Our contraction pre-processing task aims to advance the state of the art in alignment by taking into consideration the correct alignment of multiword units where contractions existed in the original corpus.3 The methodology used to decide whether contractions need to be decomposed for the alignment of their canonical forms or whether they are required to be maintained inside the multiword unit is presented in Section 3.
The Romance languages have peculiar behaviour with regard to the use of contractions. Some languages require a particular contraction, while other languages require another type of contraction. Our methodology is consistent with regard to the decomposition of contractions: they are decomposed when the alignment refers to canonical forms, i.e., separate words like a preposition and a determiner cannot align with a contraction, and maintained when they are part of a frozen compound or fixed expression. For example, the English lexical bundle in that sense requires the contraction in the Portuguese translation nesse sentido to be maintained. The equivalents in the remaining Romance languages do not contain contractions (en ese sentido in Spanish, and en ce sens in French).
3 Methodology
In our alignment task, the PT–EN CLUE4Translation parallel corpus was pre-processed for a framework decision regarding whether its contractions should be decomposed or maintained. Sections 3.1 and 3.2 discuss the alignment issues specific to each one of the decisions, with a set of real-world alignment examples, which aid in the understanding of the issues raised. Initially, the pre-processing task consisted of a semi-manual decomposition by a linguist of all contractions. Decomposition allowed for the correct alignment of multiword units where contracted forms were required to be split so that those multiwords and the phrases that follow them could be mapped to the corresponding elements in the source language, as illustrated in Section 3.1. Subsequently, all the decomposed forms were reviewed and the decomposed
2 Function or structure words, such as prepositions, determiners, auxiliary verbs and pronouns, among others, have little lexical or ambiguous meaning, and are used to express grammatical (or structural) relationships with other words within a sentence. They are extensively described in grammars. Function words are in contrast with content or lexical words, which include nouns, verbs, adjectives, and most adverbs, normally carrying very specific meanings listed in dictionaries.
3 This topic has been only superficially described in earlier work (Barreiro and Mota, 2017).
Figure 1: Alignment of the compound word em nome de "on behalf of" and the noun phrase o meu grupo "my group" with their internal elements (individual words)
forms in multiwords and frozen expressions were changed back to contractions, as described in Section 3.2. This methodology prioritized decomposition for statistical reasons only: the number of contractions that need to be decomposed in the corpus is much greater than the number of contractions that require to be maintained.
3.1 Decomposed Contractions
The Portuguese word do "of the" occurring in the original corpus corresponds to the contraction of the preposition de with the masculine definite article o that agrees with its masculine noun modifier grupo in the phrase em nome do meu grupo "on behalf of my group". This contraction was decomposed into two elements, the preposition de "of" and the masculine singular definite article o "the" (de + o), in order to align correctly both the canonical form (lemma) of the compound word em nome de "on behalf of" and the noun phrase o meu grupo "my group", where the preposition of the contraction goes with the compound and the definite article goes with the noun phrase, i.e., the decomposition is required to make it possible for the two concatenated words to go into two different alignment pairs, as illustrated in Figure 1. Similar decomposition has taken place in contractions such as those illustrated in examples (2)–(5).
(2) EN - across + [the Atlantic]
    PT - do outro lado de (do = [de+o]) [o Atlântico]

(3) EN - issues like + [the NP]
    PT - questões como a de (dos = [de+os]) [os NP]

(4) EN - with respect to + [the N]
    PT - quanto a (ao = [a+o]) [o N]

(5) EN - fully approves [NP: the joint position of the council]
    PT - dá a sua total aprovação a (à = [a+a]) [NP: a posição comum do conselho]
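The decomposition step described above can be sketched as a small pre-processing function. The contraction table below lists only a handful of the Portuguese contractions discussed in this paper; the function name, the tokenization by whitespace, and the span-based way of marking non-compositional multiword units (Section 3.2) are our own illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: decompose Portuguese contractions into
# preposition + article, skipping frozen (non-compositional) spans.
CONTRACTIONS = {
    "do": ("de", "o"), "da": ("de", "a"), "dos": ("de", "os"), "das": ("de", "as"),
    "no": ("em", "o"), "na": ("em", "a"), "ao": ("a", "o"), "à": ("a", "a"),
}

def decompose(tokens, frozen_spans=()):
    """Split contractions into their elements, except inside
    non-compositional multiword units (given as token-index spans)."""
    frozen = {i for start, end in frozen_spans for i in range(start, end)}
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() in CONTRACTIONS and i not in frozen:
            out.extend(CONTRACTIONS[tok.lower()])
        else:
            out.append(tok)
    return out

print(decompose("em nome do meu grupo".split()))
# ['em', 'nome', 'de', 'o', 'meu', 'grupo']
print(decompose("à luz de algo".split(), frozen_spans=[(0, 3)]))
# ['à', 'luz', 'de', 'algo']
```

The first call reproduces the alignment-friendly form of example (1): de closes the compound em nome de, while o opens the noun phrase o meu grupo. The second call shows a frozen expression (à luz de) whose contraction is left intact.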
Decomposition of contractions also has implications for coordination. For example, the coordinated noun phrases o parlamento "the parliament" and o conselho "the council" illustrated in Figure 2 are direct complements of the Portuguese prepositional verb realizado por "carried out by". While in English the preposition by of the prepositional verb is not repeated before the second noun phrase, in Portuguese the preposition por is repeated in the coordination introduced by the prepositional verb, realizado por [NP] e por [NP]. The CLUE-Aligner alignment tool allows the alignment of this non-contiguous coordinating structure, excluding the NP elements (gaps), which are the variable elements of the coordination, and making it possible to align them separately. Alignment methodologies require these linguistic nuances captured in translation to be handled correctly.
Figure 2: Alignment of the coordinated multiword realizado por [NP] e por [NP], implying the double decomposition of the contraction pelo into the preposition por of the prepositional verb and the masculine singular definite article o of the coordinated NP.
3.2 Non-decomposed Contractions
In a second pre-processing step, the decomposed contractions à = a + a "to + the", na = em + a "in + the", and do = de + o "of + the" were restored in non-compositional multiword units, such as the fixed expressions à luz de "in light of", already mentioned in Section 1, à data "at the time", illustrated in Figure 3, and na ordem do dia "the next item" (better translated as "in the agenda"), illustrated in Figure 4.
4 Analysis of Preliminary Results
Preliminary results confirmed that most contractions require decomposition in contexts where they are part of a multiword unit. For example, the most frequent contractions in the corpus (de + a, de + o, de + os, and em + a), with more than 50 occurrences each, establish syntactic relationships between multiwords, such as compounds, prepositional nouns, etc., some of which are discontinuous (e.g., centrar-se [] em "to deal with"). In these contractions, the preposition establishes the final border of the first phrase (i.e., it is the last word in the phrase), and the determiner establishes the initial border of the phrase immediately after (i.e., it is the first word in the phrase). The second noun phrase can be a named entity (e.g., União Europeia "European Union", Ásia "Asia") or a term (e.g., capital de risco "risk capital", fundos de pensão "pension funds"). There are also occurrences of contractions that require decomposition in contexts where the preposition is part of a multiword unit (as its last word, e.g., em relação a "with regard to") and the determiner is part of a regular noun phrase (e.g., as observações "the comments"). Table 1 presents the frequency of contractions in contexts in which they require decomposition.
With regard to contractions that cannot be decomposed, most of them occur at the beginning or in the middle of the multiword unit, seldom at the end. For example, the contractions no, neste, pelo, and às in the multiwords no que diz respeito a "with regard to", neste momento "at this time", pelo contrário "on the contrary", and às 12h30 "at 12.30 p.m." cannot be decomposed, because they are not positioned at the border with the next phrase. The same goes for the contraction à in the multiword unit até à data "so far", which occurs in a middle position. Exceptionally, the contraction disso in the multiword além disso "in addition" also remains undecomposed, because it corresponds to a fixed adverbial expression. Table 2 presents the frequency of contractions in contexts in which they cannot be decomposed.
Figure 3: Alignment of the fixed expression à data
Figure 4: Alignment of the fixed expression na ordem do dia
Decomp. | Freq | PT example | EN translation
de a | 113 | [no seio de] [a União Europeia] | [within] [the European Union]
de o | 93 | [a promoção de] [o capital de risco] | [encouraging] [risk capital]
de os | 68 | [o favorecimento de] [os fundos de pensão] | [to favour] [pension funds]
em a | 61 | [integração em] [a Ásia] | [integration in] [Asia]
a a | 44 | [dar prioridade a] [a extensão de] | [focusing on] [the extension of]
a o | 34 | [prestar-se [] atenção a] [o trabalho infantil] | [attention must [] be paid to] [child labour]
de as | 29 | [o objectivo de] [as redes transeuropeias] | [the purpose of] [the trans-European networks]
em o | 29 | [fusões em] [o mercado de capitais] | [mergers on] [the capital market]
a os | 20 | [no concernente a] [os fundos de pensões] | [as for] [pension funds]
em as | 16 | [centrar-se [] em] [as questões comuns] | [to deal with] [questions which unite us]
a as | 15 | [em relação a] [as observações] | [with regard to] [the comments]
em os | 12 | [com base em] [os mesmos critérios] | [to use the same yardstick]
por o | 10 | [realizado por] [o parlamento] | [carried out by] [parliament]
por os | 10 | [angariados por] [os mercados de capital de risco] | [raised from] [venture capital]
por a | 9 | [influenciados [] por] [a instalação de] | [compromised [] by] [fitting]
em uma | 7 | [assenta em] [uma relação de igualdade] | [based on] [a relationship of equality]
Table 1: Frequency of contractions in contexts in which they require decomposition
contracted | freq | PT example | EN translation
no | 43 | no que diz respeito a | with regard to
do | 34 | inclusão [na ordem do dia] | added [to the agenda]
da | 33 | [da mesma forma que] | [in the same way that]
nos | 17 | nos dois sentidos | on both sides
dos | 17 | a carta dos direitos fundamentais | the charter of fundamental rights
na | 17 | na sua quase unanimidade | almost unanimously
neste | 13 | neste momento | at this time
à | 13 | até à data | so far
ao | 13 | ao dar prioridade a | by focusing on
disso | 7 | além disso | in addition
pelo | 6 | pelo contrário | on the contrary
das | 5 | redução [] das despesas | reducing [] expenditure
nesse | 2 | nesse sentido | to this effect
às | 2 | às 12h30 | at 12.30 p.m.
desse | 2 | desse modo | hence
consigo | 2 | em paz consigo próprio | at peace with itself
Table 2: Frequency of contractions in contexts in which they cannot be decomposed
A few observations are worth noting with regard to undecomposable contractions. One is that some semantico-syntactic patterns function as linguistic constraints. For example, the contractions às, nos, or pelos cannot be decomposed when used with time-related named entities, such as às seis horas da tarde "at 6 p.m.", às sextas-feiras "on Fridays", nos anos sessenta "in the sixties", or pelos anos seguintes "for all years ahead", among others. Another important observation is that, in normal circumstances, contractions of prepositions with pronouns, such as consigo in the expression consigo próprio "with itself", should not be decomposed.
The alignment task has given us cause to reflect on how certain linguistic units have been aligned in previous research work. As far as alignments involving the contraction phenomenon are concerned, have there been discussions on whether the contraction should be maintained or decomposed in cases such as muitos dos presentes nesta assembleia "many in the house", or pelas mais variadas razões "for a variety of reasons"? What about other linguistic phenomena? Is there scientific ground for establishing "strict" boundaries for aligning paraphrastic units or translation units, or are alignment decisions sometimes arbitrary? While this is not the first attempt to establish guidelines for alignment tasks, we have made an attempt to treat contractions in a scientific way, either maintaining the contraction at the beginning and in the middle of a multiword unit or decomposing the contraction at the end of the multiword unit. The resulting alignment data may still contain errors, but we tried to make decisions in more than an ad-hoc fashion.
5 Final Remarks
Language experts’ involvement in machine translation is essential in pre-editing tasks to improve the quality of the text to be translated (input or source text), and in post-editing tasks to improve the translated text (output or target text). High quality machine translation is directly related to the human factor, namely to the intervention of language specialists involved in translation and their role in the validation of correct translation alignments. When used in machine translation systems, alignments containing linguistic knowledge contribute to improved accuracy, reduced computational complexity and ambiguity, and improved translation quality, as illustrated for the contractions described in this paper. Given that contractions can be a frequent phenomenon in a language, the results that can be obtained through their correct alignment in a system can be significantly better than those obtained in a purely statistical or ad-hoc manner. However, there are other linguistic phenomena that require further examination. Without a suitable linguistic approach to the alignment task, and limited by the capacity of the algorithms, systems will continue to be overloaded with poor quality alignments, which will produce translations of limited quality, requiring a greater post-editing effort. Moreover, there is still a shortage of manually annotated alignments that can be used for training and evaluation for many language pairs or language variants, especially those with scarce resources. In this paper, we have used a methodology to align multiword units involving contractions, which pose a challenge to correct alignment. The proposed alignment methodology does not depend on the application, so the pairs of aligned multiwords and phrases can be used in translation, paraphrasing, variety adaptation and other NLP tasks. We also hope that the linguistic knowledge learned in our alignment task can help solve problems related to the alignment of multiword units, provide better solutions to process and align them, and ultimately serve to build a more sophisticated automatic alignment tool.
Acknowledgements
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under project eSPERTo, with reference EXPL/MHC-LIN/2260/2013, and with reference UID/CEC/50021/2013. Anabela Barreiro was also funded by FCT through post-doctoral grant SFRH/BPD/91446/2012.
References

Anabela Barreiro and Fernando Batista. 2016. Machine Translation of Non-Contiguous Multiword Units. In Wolfgang Maier, Sandra Kübler, and Constantin Orasan, editors, Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing, DiscoNLP 2016, pages 22–30, San Diego, California, June. Association for Computational Linguistics (ACL).

Anabela Barreiro and Cristina Mota. 2017. e-PACT: eSPERTo Paraphrase Aligned Corpus of EN-EP/BP Translations. Tradução em Revista, 1(22):87–102.

Anabela Barreiro, Bernard Scott, Walter Kasper, and Bernd Kiefer. 2011. OpenLogos Rule-Based Machine Translation: Philosophy, Model, Resources and Customization. Machine Translation, 25(2):107–126.

Anabela Barreiro, Fernando Batista, Ricardo Ribeiro, Helena Moniz, and Isabel Trancoso. 2014a. OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, pages 3774–3781, Reykjavik, Iceland, May. European Language Resources Association.

Anabela Barreiro, Johanna Monti, Brigitte Orliac, Susanne Preuss, Kutz Arrieta, Wang Ling, Fernando Batista, and Isabel Trancoso. 2014b. Linguistic Evaluation of Support Verb Constructions by OpenLogos and Google Translate. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, pages 35–40, Reykjavik, Iceland, May. European Language Resources Association.

Anabela Barreiro, Francisco Raposo, and Tiago Luís. 2016. CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Translation Units. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the 10th Edition of the Language Resources and Evaluation Conference, LREC 2016, pages 7–13. European Language Resources Association.

Phil Blunsom and Trevor Cohn. 2006. Discriminative Word Alignment with Conditional Random Fields. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 65–72, Sydney, Australia. Association for Computational Linguistics.

David Chiang. 2005. A Hierarchical Phrase-based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL 2005, pages 263–270, Ann Arbor, Michigan, USA. Association for Computational Linguistics.

João Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino Caseiro. 2008. Building a Golden Collection of Parallel Multi-Language Word Alignment. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008, pages 986–993, Marrakech, Morocco, May. European Language Resources Association.

Philipp Koehn. 2005. EuroParl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT, Asia-Pacific Association for Machine Translation.

Patrik Lambert, Adrià De Gispert, Rafael Banchs, and José B. Mariño. 2005. Guidelines for Word Alignment Evaluation and Manual Alignment. Language Resources and Evaluation, 39(4):267–285, Dec.

Daniel Marcu, Wei Wang, Abdessamad Echihabi, and Kevin Knight. 2006. SPMT: Statistical Machine Translation with Syntactified Target Language Phrases. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, pages 44–52, Stroudsburg, PA, USA. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2000. Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL 2000, pages 440–447, Hong Kong. Association for Computational Linguistics.

Bernard Scott. 2003. The Logos Model: An Historical Perspective. Machine Translation, 18(1):1–72.

Bernard Scott. 2018. Translation, Brains and the Computer: A Neurolinguistic Solution to Ambiguity and Complexity in Machine Translation. Machine Translation: Technologies and Applications. Springer International Publishing.

Max Silberztein. 2016. Formalizing Natural Languages: the NooJ Approach. Wiley.

Jörg Tiedemann. 2011. Bitext Alignment. Synthesis Digital Library of Engineering and Computer Science. Morgan & Claypool.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax Augmented Machine Translation via Chart Parsing. In Proceedings of the Workshop on Statistical Machine Translation, SMT 2006, pages 138–141, New York City, June. Association for Computational Linguistics.
Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 131–140, Santa Fe, New Mexico, USA, August 20, 2018.
Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach
Mrinal Dhar, Vaibhav Kumar and Manish Shrivastava
IIIT Hyderabad
Gachibowli, Hyderabad, Telangana, India
Abstract
Code-mixing, the use of two or more languages in a single sentence, is produced by multilingual speakers across the world. The phenomenon presents itself prominently in social media discourse. Consequently, there is a growing need for translating code-mixed hybrid language into standard languages. However, due to the lack of gold parallel data, existing machine translation systems fail to properly translate code-mixed text.
In an effort to initiate the task of machine translation of code-mixed content, we present a newly created parallel corpus of code-mixed English-Hindi and English. We selected previously available English-Hindi code-mixed data as a starting point for our parallel corpus, and 4 human translators, fluent in both English and Hindi, translated the 6,096 code-mixed English-Hindi sentences into English. With the help of the created parallel corpus, we analyzed the structure of English-Hindi code-mixed data and present a technique to augment run-of-the-mill machine translation (MT) approaches that can help achieve superior translations without the need for specially designed translation systems. The augmentation pipeline is presented as a pre-processing step and can be plugged into any existing MT system, which we demonstrate by improving code-mixed translations done by systems like Moses, Google Neural Machine Translation System (NMTS) and Bing Translator.
1 Introduction
In the last decade, digital communication mediums like e-mail, Facebook, Twitter, etc. have allowed people to have conversations in a much more informal manner than before. This informal nature of conversations has given rise to a new form of hybrid language, called code-mixed language, that lacks a formally defined structure.
Myers-Scotton (1993) defines code-mixing as “the embedding of linguistic units such as phrases, words and morphemes of one language into an utterance of another language”. Code-switching is a similar concept, except that code-mixing is observed entirely within a single sentence, while code-switching occurs across sentences. For the purposes of this study, however, we will not make a distinction between code-mixing and code-switching, and treat them both as code-mixing.
Code-mixing of Hindi and English, where sentences follow the syntax of Hindi but borrow some vocabulary from English, is very prevalent in social media content in India, where most people are multilingual. An example of a code-mixed English-Hindi sentence is presented below:
This work is licenced under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
• “Main kal movie dekhne jaa rahi thi and raaste me I met Sam.”
Gloss : I yesterday [movie] to-see go Continuous-marker was [and] way in [I met Sam].
English Translation : I was going to a movie yesterday and I met Sam on the way.
This phenomenon is present in informal communication in almost every multilingual society, as studied by Choudhury et al. (2007). They investigated how language used on these social media platforms, which they call texting language, differs from the standard language found in more formal texts like books. Code-mixing is also more common in areas of the world where people are naturally bi- or multilingual. Usually, these are areas where languages change over short geo-spatial distances and people generally have at least a basic knowledge of the neighbouring languages (Jamatia et al., 2015). A very good example of this is a country like India, which has extensive language diversity and where dialectal changes frequently instigate code-mixing.
In recent times, we have seen an explosion of Computer Mediated Communication (CMC) worldwide (Herring, 2003). In CMC, language use lies somewhere between the spoken and written forms of a language, and tends to use simple shorter constructions, contractions and phrasal repetitions typical of speech (Danet and Herring, 2007). Such conversations, especially on social media, are both multi-party and multilingual, with mixing occurring between two or more languages, and with the choice of language use highly influenced by the speakers and their communicative goals (Crystal, 2011). With multiple languages coming into play, and the variety of factors which influence their usage, the task of processing such text becomes quite difficult.
As of now, according to the data of internetworldstats.com, the Internet has 450 million English-speaking users out of 1.5 billion total users. This means that the market for the English language is slightly less than one third of the total market. In other terms, most current approaches to information extraction exploiting social media and user-generated content (UGC), which are predominantly developed for English, are working with a mere third of the total data available.
The huge part of UGC that is not in English is currently being neglected, mostly due to the relatively ephemeral character of UGC in general. Such content usually remains untranslated unless the users themselves so choose, because (i) it expresses opinions, and most users are much more likely to express or look for opinions in their own language rather than translated from a different one; (ii) it is generated and updated at an extremely fast pace and has a very short lifespan, which rules out in practice the possibility of translation by human subjects; and (iii) it is also produced in immense quantities, which, together with the previous point, renders translation by human subjects effectively impossible (both in terms of time and cost). All this leads to an enormous body of information being constantly generated and constantly lost behind language barriers: the consolidation of Web 2.0 has caused an unprecedented increase in the amount of data, and each individual user is currently being deprived of most of it (Carrera et al., 2009).
These language barriers are intensified by the fact that a huge number of people don’t use just one language, but multiple languages simultaneously, in the form of code-mixing. Even if we were able to create a system capable of processing information in every language, or perhaps a system capable of translating text from any given language to another, we would still not be able to break down the language barrier completely, due to the phenomenon of code-mixing. This is where code-mixed translation comes into play.
While translation of code-mixed text has been a requirement for some time, there is a noticeable lack of resources for this task. To tackle this problem, we have created a set of 6,096 English-Hindi code-mixed and monolingual English gold standard parallel sentences as an initial attempt to promote the generation of data resources for this domain.
However, most MT systems require a significant number of “parallel” sentences to perform well. While we wait for large parallel corpora for code-mixed text to be developed, we could make do by equipping the existing state-of-the-art MT systems to handle code-mixed content. This necessity has motivated us to develop an augmentation pipeline to support code-mixing on existing MT systems.
To summarize, the main contributions of this work are as follows:
• We are releasing¹ a gold standard parallel corpus consisting of 6,096 code-mixed English-Hindi and monolingual English sentences that we created.
• We have developed an augmentation pipeline for existing machine translation systems that can boost their translation performance on code-mixed content.
• We carry out experiments involving various machine translation systems like Moses, Google NMTS and Bing Translator to compare their translation performance on code-mixed text, and demonstrate how our augmentation pipeline improves their translation results.
The rest of our paper is divided into the following sections: We begin with a study of research conducted in this domain in Section 2. We discuss the process of creating the corpus and its features in Section 3. In Section 4, we introduce our augmentation pipeline for machine translation systems and describe the approach in detail. Then in Section 5, we perform experiments with existing machine translation systems and describe the impact of our augmentation pipeline on their translation accuracy for code-mixed data, followed by a discussion of our results.
2 Related Work
In recent times, there has been a lot of interest from the Computational Linguistics community in supporting code-mixing in language systems and models.
Due to the massive growth of social media content, the usage of noisy non-standard tokens online has also increased. Hence, text normalization systems have become necessary to convert these non-standard tokens to their standard form. Language identification at the word level was attempted by Nguyen and Dogruoz (2013) on Turkish-Dutch posts collected from an online chat forum. They performed comparisons between different approaches, such as using a dictionary and statistical models. They were able to develop a system with an accuracy of 97.6%, and concluded that language models prove to be better than a dictionary-based approach. Barman et al. (2014) explored the same task on social media text in code-mixed Bengali-Hindi-English. They annotated a corpus with over 180,000 tokens, and used statistical models with monolingual dictionaries to achieve an accuracy of 95.76%.
Vyas et al. (2014) deal with POS tagging of English-Hindi code-mixed data that they extracted from Twitter and Facebook. Social media is a good source for obtaining code-mixed data, as it is the preferred informal platform for communication among the urban youth. Bali et al. (2014) have done a study of English-Hindi code-mixing on Facebook; their investigation demonstrates the extent of code-mixing in the digital world, and the need for systems that can automatically process this data. Sharma et al. (2016) have addressed shallow parsing of English-Hindi code-mixed data obtained from online social media. To the best of our knowledge, they were the first to attempt shallow parsing of code-mixed data.
Language identification in code-mixed data is an important step because it may determine how to further process the data in tasks like POS tagging. Chittaranjan et al. (2014) explore a CRF-based system for word-level identification of languages.
Apart from this, Raghavi et al. (2015) have explored the problem of classification of code-mixed questions. By translating words to English before extracting features, they were able to achieve greater accuracy in classification. WebShodh, developed by Chandu et al. (2017), is an online web-based question answering system based on this work.
Gupta et al. (2016) created a dataset of code-mixed English-Hindi sentences along with the associated language and normalization annotations. This was the first attempt to create such a linguistic resource for the language pair. They also presented an empirical study detailing the construction of a language identification and normalization system designed for the language pair.
Code-mixed translation has been attempted previously by Sinha and Thakur (2005) from a linguistics perspective. However, we primarily draw our inspiration from the skeleton model presented by Rijhwani et al. (2016). Though their broad idea is similar to ours, we find that many of their assumptions were very simplistic, no evaluation or results were provided, and no systems were released. We provide a deeper look into the code-mixed translation process and demonstrate its impact along with a benchmark dataset.

¹ https://github.com/mrinaldhar/en-hi-codemixed-corpus
3 Corpus creation
3.1 Analysis
We started with the dataset created and released by Gupta et al. (2016). They collected 1,446 sentences from social media, and performed language identification and word normalization on these sentences. We also obtained 771 sentences from the dataset released as part of the ICON 2017 tool contest on POS-tagging for code-mixed social media text, created by collecting WhatsApp chat messages. Additionally, we use the dataset released by Joshi et al. (2016) for the task of sentiment analysis of code-mixed content, which contains 3,879 code-mixed sentences.
For our study, we removed annotations such as sentiment labels, POS tags, etc. from the obtained datasets, and only used the raw sentences for the task of corpus creation. For the augmentation pipeline, we also make use of the language identifiers, wherever available in the dataset samples.
The 6,096 code-mixed sentences contain a total of 63,913 tokens. Of these tokens, 37,673 are Hindi words and 16,182 are English words. The remaining tokens were marked as “Rest”, meaning they could be abbreviations, named entities, etc. The tokens in the data were already normalized.
It is crucial to determine the level of code-mixing in the data, since if the extent of code-mixing is too low, the corpus is as good as a monolingual corpus and would not provide much benefit for the task of code-mixed translation. Hence, in order to find the level of code-mixing present in the data, we use the Code-Mixing Index (CMI) (Das and Gamback, 2014), defined as:
\[
\mathrm{CMI} =
\begin{cases}
100 \times \left[ 1 - \dfrac{\max\{w_i\}}{n - u} \right] & n > u \\
0 & n = u
\end{cases}
\tag{1}
\]
where w_i is the number of words tagged with a particular language tag, max{w_i} represents the number of words of the most prominent language, n is the total number of tokens, and u represents the number of language-independent tokens (such as named entities and abbreviations); in our case, these are the words marked as “Rest”.
Summarized statistics of the data can be found in Table 1. From the table we can see that the Code-Mixing Index is around 30.5; hence, we can safely assume that the data has a decent variety of code-mixed sentences which can be effectively used for the creation of translation systems.
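The sentence-level CMI of equation (1) can be sketched directly from its definition. The function below is our own illustration (the paper does not provide code); the tag names "hi", "en" and "Rest" follow the corpus annotation described above.

```python
# Illustrative sketch of the Code-Mixing Index (Das and Gamback, 2014)
# for a single sentence, given word-level language tags.
from collections import Counter

def code_mixing_index(tags):
    """tags: list of word-level language tags; tokens tagged "Rest"
    (named entities, abbreviations, etc.) are language independent."""
    n = len(tags)                               # total number of tokens
    u = sum(1 for t in tags if t == "Rest")     # language-independent tokens
    if n == u:
        return 0.0
    counts = Counter(t for t in tags if t != "Rest")
    max_wi = max(counts.values())               # most prominent language
    return 100.0 * (1.0 - max_wi / (n - u))

# A sentence with 5 Hindi tokens, 3 English tokens and 2 "Rest" tokens:
print(round(code_mixing_index(["hi"] * 5 + ["en"] * 3 + ["Rest"] * 2), 1))  # 37.5
```

A monolingual sentence scores 0 (its dominant language covers all non-"Rest" tokens), while heavier mixing pushes the score towards the corpus-level value of 30.5 reported in Table 1.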
3.2 Methodology for Annotation
For the task of translation, we selected 4 annotators who were fluent in both English and Hindi. Before starting the process of annotation, we randomly sampled 100 sentences from the corpus. We then asked one of the annotators to translate the given sentences into English. The other three annotators judged the translated sentences in two categories, Totally Correct (TC) and Requires Changes (RC). Finally, we used the Fleiss’ Kappa measure in order to calculate agreement.
Fleiss’ Kappa is a statistical measure for assessing the reliability of agreement between a fixed number of raters when assigning categorical ratings to a number of items. The measure is defined as follows:

Let N denote the number of subjects, n the number of ratings per subject, and k the number of categories into which assignments are made. In our case, N = 100, n = 3, k = 2. Let the subjects be indexed by i = 1, ..., N, the categories by j = 1, ..., k, and let n_ij be the number of raters who assigned the j-th category to the i-th subject. Then,
\[
P_i = \frac{1}{n(n-1)} \left[ \left( \sum_{j=1}^{k} n_{ij}^2 \right) - n \right] \tag{2}
\]
here, P_i denotes the extent to which raters agree on the i-th subject,

\[
P = \frac{1}{N} \sum_{i=1}^{N} P_i \tag{3}
\]

\[
p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij} \tag{4}
\]

\[
P_e = \sum_{j=1}^{k} p_j^2 \tag{5}
\]

and finally,

\[
\kappa = \frac{P - P_e}{1 - P_e} \tag{6}
\]
where κ denotes the Fleiss’ Kappa score. Using this, we found that the κ score for code-mixed to English translation was close to 0.88. We can consider this score to correspond to almost complete agreement. We speculate that such high agreement could arise from the fact that the annotators had the same cultural background and shared similar communities.
In Table 2 we provide a few examples of code-mixed English-Hindi translated into English with the help of our pipeline. From the examples we can clearly see that the dataset sentences contain transliterated Hindi words. Transliterated words usually pose a problem because one needs to come up with a standard form of those words before proceeding to the task of translation. However, Gupta et al. (2016) had already normalized the transliterated words. Hence, we were able to translate sentences without having to normalize the words.
Type                                    Value
Total no. of code-mixed sentences       6,096
Total no. of tokens                     63,913
Total no. of Hindi tokens               37,673
Total no. of English tokens             16,182
Total no. of 'Rest'                     10,094
Code Mixing Index                       30.5

Table 1: Corpus Statistics
Some examples from the dataset which illustrate code-mixing:
1. I was really trying ki aajayun
Gloss: [I was really trying] I come
Translation: I was really trying to come
2. Sorrrry, aaj subah tak pata nhi tha that I wudnt be able to come today
Gloss: [Sorrrry], today morning till know not did [that I wudnt be able to come today]
Translation: Sorrrry, I didn’t know until this morning that I wudnt be able to come today.
3. tu udhar ka permanent intezaam karke aa !
Gloss: you there is [permanent] arrangement do come !
Translation: Come over after making a permanent arrangement there !
4. btw i was thinking mai pehle ghar chale jaungi, and sham ko venue pe aa jaungi
Gloss: [btw i was thinking] i first home walk go, [and] evening [venue] on come go
Translation: btw I was thinking that I’ll first go home and then come to the venue in the evening.
Figure 1: Translation augmentation pipeline (code-mixed input sentence → identify languages involved → determine Matrix language → translate into Matrix language → translate into target language → English/Hindi monolingual sentence)
4 MT Augmentation Pipeline
Instead of attempting to create a new machine translation system specifically for code-mixed text, which is not feasible due to the lack of abundant gold-standard data, we introduce an approach that augments existing systems for use with code-mixed languages, in the form of the pipeline described in Figure 1.
In order to improve the results of MT systems on code-mixed input, we transform the code-mixed sentence to more closely resemble the kind of input these systems handle best: a monolingual sentence. Since a code-mixed sentence follows the structure of its Matrix language, we translate the words of the code-mixed sentence that belong to the Embedded language (Em) into the Matrix language (Ma) using a machine translation system (Em-Ma MT). The resulting sentence is then translated into the desired target language (Tgt) using another existing MT system (Ma-Tgt MT), which may or may not be the same as the Em-Ma MT system, depending on what the Matrix and target languages are. For our experiments, we fixed the target language (Tgt) to English, as we attempt to translate English-Hindi code-mixed text into English. The pipeline also recognizes when the target language is the same as the Matrix language, in which case there is no need to translate parts or the whole of the sentence a second time.
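The two translation steps above can be sketched as follows. This is a minimal, self-contained illustration: `em_ma_mt` is a toy stand-in for the Em-Ma MT engine (a real system would call an existing translator here), and the Matrix language is assumed to be given by the procedure of Section 4.2:

```python
def em_ma_mt(phrase):
    """Toy stand-in for the Em-Ma MT engine (illustrative lexicon only)."""
    toy = {"ki aajayun": "that I come"}
    return toy.get(phrase, phrase)

def augment_and_translate(tagged, ma, tgt):
    """tagged: (word, language) pairs; ma: Matrix language from Sec. 4.2."""
    # Step 1 (Sec. 4.3): translate contiguous Embedded-language spans into
    # the Matrix language, yielding a (near-)monolingual sentence.
    words, span = [], []
    for w, lang in tagged:
        if lang == ma:
            if span:
                words.append(em_ma_mt(" ".join(span)))
                span = []
            words.append(w)
        else:
            span.append(w)
    if span:
        words.append(em_ma_mt(" ".join(span)))
    sentence = " ".join(words)
    # Step 2 (Sec. 4.4): call the Ma-Tgt MT engine, unless Matrix == Target.
    return sentence if ma == tgt else f"[Ma-Tgt MT]({sentence})"

print(augment_and_translate(
    [("I", "en"), ("was", "en"), ("really", "en"), ("trying", "en"),
     ("ki", "hi"), ("aajayun", "hi")], ma="en", tgt="en"))
# → I was really trying that I come
```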
4.1 Identification of languages involved
As a pre-processing step, we identify the languages that are involved in creating the hybrid code-mixedlanguage in the text. This step is crucial in determining the pipeline to follow for the rest of the transla-tion.
Gupta et al. (2016) released their dataset with manually annotated language identifiers associatedwith every word of the sentences in the data. For the other datasets, we make use of the LanguageIdentification (LID) system by Bhat et al. (2015) to identify the corresponding language for each wordin a code-mixed sentence.
4.2 Determination of Matrix Language
In the Matrix Language-Frame model proposed by Myers-Scotton (1997), a code-mixed sentence is formed from a Matrix language and an Embedded language. The overall morpho-syntactic structure of the sentence is that of the Matrix language; however, words from the Embedded language are also present in the sentence.

In other words, the Matrix language lends its syntactic structure and the Embedded language lends its vocabulary to the code-mixed sentence. We determine the Matrix language by employing the following heuristics, in decreasing order of preference:
1. The number of words of each language in the sentence. The language contributing more words is likely to be the Matrix language.

2. The syntactic structure of the text, as indicated by the language of the verb. For example, if the two languages involved in code-mixing are English and Hindi, and a sentence follows SVO order, then English is likely to be the Matrix language of the sentence.

3. The function words of a particular language used in the text. The language whose function words appear in the sentence is likely to be the Matrix language, since function words are closely associated with syntax.
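Heuristics 1 and 3 can be sketched in code: count words per language and use function-word counts as a tie-breaker. The tiny function-word lists below are illustrative stand-ins, not the actual resources used:

```python
# Illustrative (non-exhaustive) function-word lists.
FUNCTION_WORDS = {
    "en": {"the", "a", "is", "to", "of", "and", "that", "i"},
    "hi": {"ka", "ki", "ke", "mein", "par", "aur", "toh", "hai"},
}

def matrix_language(tagged):
    """tagged: (word, language) pairs produced by the LID step (Sec. 4.1)."""
    counts, fw = {}, {}
    for word, lang in tagged:
        counts[lang] = counts.get(lang, 0) + 1
        if word.lower() in FUNCTION_WORDS.get(lang, set()):
            fw[lang] = fw.get(lang, 0) + 1
    # Heuristic 1: the majority language wins; heuristic 3 breaks ties
    # via function-word counts.
    return max(counts, key=lambda l: (counts[l], fw.get(l, 0)))

tagged = [("toh", "hi"), ("hum", "hi"), ("aaj", "hi"), ("train", "en"),
          ("ki", "hi"), ("ticket", "en"), ("karwa", "hi"), ("lenge", "hi")]
print(matrix_language(tagged))  # → hi
```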
4.3 Translation into Matrix Language
While most translation systems are unable to translate foreign (including code-mixed) words due to lack of training data, borrowed and commonly used words are often found in parallel corpora. Commercial systems, like Google Translate, can translate frequent short phrases as well; errors creep in when they have to deal with longer phrases. Keeping this in mind, and to reduce the translation cost per sentence, we select the longest string of words belonging to the Embedded language for translation into the Matrix language. This ensures a maximally meaningful translation with a single call to the Em-Ma MT engine, and results in a code-mixed sentence that follows the syntax of the Matrix language and has the majority of its words in that Matrix language.
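The span-selection step described above can be sketched as a scan for the longest contiguous run of Embedded-language words (the tagged sentence below is taken from the dataset examples earlier in this paper):

```python
def longest_embedded_run(tagged, em):
    """Return (start, end) of the longest run of Embedded-language words.

    tagged: list of (word, language) pairs; em: Embedded language code.
    """
    best = (0, 0)
    start = None
    # The sentinel pair flushes a run that extends to the end of the sentence.
    for i, (_, lang) in enumerate(tagged + [("", None)]):
        if lang == em:
            if start is None:
                start = i
        else:
            if start is not None and i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

tagged = [("btw", "en"), ("i", "en"), ("was", "en"), ("thinking", "en"),
          ("mai", "hi"), ("pehle", "hi"), ("ghar", "hi"), ("chale", "hi"),
          ("jaungi", "hi"), ("and", "en"), ("sham", "hi"), ("ko", "hi")]
s, e = longest_embedded_run(tagged, "hi")
print(" ".join(w for w, _ in tagged[s:e]))  # → mai pehle ghar chale jaungi
```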
4.4 Translation into Target Language
Now that the code-mixed sentence has been translated into the Matrix language, it can be directly translated into the target language using the Ma-Tgt MT engine.
5 Evaluation and Results
In order to evaluate our method of augmentation, we consider the following existing machine translation systems: Moses (Koehn et al., 2007), Google's Neural Machine Translation System (NMTS) (Wu et al., 2016), and Bing Translator.
For training a translation model for Moses, we used the English-Hindi parallel corpus released by Kunchukuttan et al. (2017). This dataset consists of 1,492,827 parallel monolingual English and monolingual Hindi sentences. The Hindi sentences are in the Devanagari script and required pre-processing for use with our code-mixed dataset, which is entirely in Roman script.
We compare the output translations of these MT systems on code-mixed data, with and without the augmentation by our system. As accuracy metrics, we chose BLEU score, Word Error Rate (WER) and Translation Error Rate (TER), which are standard metrics for evaluating machine translation.
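Of these metrics, WER is the simplest to illustrate: it is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the reference length. The sketch below shows this definition; note that different toolkits scale WER differently (e.g. percentages, or per-sentence error counts), so absolute values are not comparable across tools:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)

ref = "i did not know until this morning"
hyp = "did not know until this morning"
print(wer(ref, hyp))  # one deletion over seven reference words
```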
As can be observed from Table 3, our augmentation pipeline significantly improves the translation accuracy of existing machine translation systems. Note that among the systems themselves, Google NMTS performs much better on code-mixed English-Hindi data than traditional phrase-based systems like Moses and even neural systems like Bing Translator. Even though Moses does not perform as well as the other systems on code-mixed data, our pipeline is still able to boost its performance significantly.
Original sentence: room mei shayad kal bhi nahi stay karungi , cancel ho sakti hai uski booking abhi ?
Without Augmentation: I will not stay in the room tomorrow, can I cancel her booking now?
With Augmentation: I will not stay in the room tomorrow, can I cancel her booking now?

Original sentence: Sorrrry , aaj subah tak pata nhi tha that I wudnt be able to come today
Without Augmentation: Sorry , aaj subah tak pata nahi tha that I wouldn't be able to come today
With Augmentation: Sorry , Did not know until this morning that I wudnt be able to come today

Original sentence: I was really trying ki aajayun par if its possible and any other guest needs a room , mera room de de kisi ko bhi
Without Augmentation: I was really trying ki aajayun par if its possible and any other guest needs a room , mera room de de kisi ko bhi
With Augmentation: I was really trying I come par if its possible and any other guest needs a room , Give my room to anyone

Original sentence: toh hum aaj train ki ticket karwa lenge .
Without Augmentation: So we will get a train ticket today.
With Augmentation: So we will get a train ticket today.

Original sentence: tu udhar ka permanent intezaam karke aa !
Without Augmentation: You come here by arranging Permanent!
With Augmentation: You come here with a permanent arrangement!

Table 2: Augmenting Google Translate with our pipeline
                 Without Augmentation         With Augmentation
                 BLEU    WER      TER         BLEU    WER      TER
Moses            14.9    10.671   2.403       16.9    9.505    2.295
Google NMTS      28.4    5.882    0.692       37.8    4.030    0.537
Bing Translator  18.9    8.940    1.108       25.0    8.054    0.917

Table 3: Comparison of performance with and without using our augmentation pipeline. (Note: for BLEU, higher is better; for WER and TER, lower is better.)
6 Conclusions
In this paper, we have created a gold-standard parallel corpus of 6,096 English-Hindi code-mixed sentences and their monolingual English translations, to promote machine translation of code-mixed data and the creation of data resources for this domain.
We have also developed an augmentation pipeline that improves the translation of code-mixed data by existing machine translation systems, without training an MT system specifically for code-mixed text. Using the selected evaluation metrics, we have shown a quantifiable improvement in translation accuracy with the proposed augmentation.
As part of our study, we have observed that long-distance re-ordering of words is still an issue in code-mixed MT. Also, since code-mixed language does not have a standard form, it is difficult to establish a correct spelling for a particular word. Language identification is the most critical module for translation augmentation, because the same orthographic form can correspond to valid words in multiple languages.
To take this work further, we intend to develop an end-to-end code-mixed MT system which canjointly perform the normalization, language identification, matrix language identification and two-steptranslation tasks. A hybrid model, partially trained on gold parallel corpus, may also be attempted.
Acknowledgements
A special thanks to Aashna Jena, Abhinav Gupta, Arjun Nemani, Freya Mehta, Manan Goel, Saraansh Tandon, Shweta Sahoo, Ujwal Narayan and Yash Mathne for their help with this study. Thanks to Vatika Harlalka for proof-reading this paper and ensuring there were as few errors as possible.
References

Kalika Bali, Jatin Sharma, Monojit Choudhury, and Yogarshi Vyas. 2014. "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In Proceedings of the First Workshop on Computational Approaches to Code Switching, EMNLP.
Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 13–23.
Irshad Ahmad Bhat, Vandan Mujadia, Aniruddha Tammewar, Riyaz Ahmad Bhat, and Manish Shrivastava. 2015. IIIT-H system submission for FIRE 2014 shared task on transliterated search. In Proceedings of the Forum for Information Retrieval Evaluation, FIRE '14, pages 48–53, New York, NY, USA. ACM.
Jordi Carrera, Olga Beregovaya, and Alex Yanishevsky. 2009. Machine translation for cross-language socialmedia. PROMT Americas Inc.
Khyathi Raghavi Chandu, Manoj Chinnakotla, Alan W. Black, and Manish Shrivastava. 2017. Webshodh: A codemixed factoid question answering system for web. In Gareth J.F. Jones, Seamus Lawless, Julio Gonzalo, LiadhKelly, Lorraine Goeuriot, Thomas Mandl, Linda Cappellato, and Nicola Ferro, editors, Experimental IR MeetsMultilinguality, Multimodality, and Interaction, pages 104–111, Cham. Springer International Publishing.
Gokul Chittaranjan, Yogarshi Vyas, Kalika Bali, and Monojit Choudhury. 2014. Word-level language identification using CRF: Code-switching shared task report of MSR India system, pages 73–79. Association for Computational Linguistics.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu. 2007.Investigation and modeling of the structure of texting language. International Journal of Document Analysisand Recognition (IJDAR), 10(3):157–174.
David Crystal. 2011. Internet linguistics: A Student Guide (1st ed.). Routledge, New York, NY.
Brenda Danet and Susan C Herring. 2007. The multilingual Internet: Language, culture, and communicationonline. Oxford University Press on Demand.
Amitava Das and Björn Gambäck. 2014. Identifying languages at the word level in code-mixed Indian social media text. In Proceedings of the 11th International Conference on Natural Language Processing, pages 169–178.
Sakshi Gupta, Piyush Bansal, and Radhika Mamidi. 2016. Resource creation for Hindi-English code mixed social media text. In The 4th International Workshop on Natural Language Processing for Social Media, at the 25th International Joint Conference on Artificial Intelligence.
Susan C Herring. 2003. Media and language change: Introduction. Journal of Historical Pragmatics, 4(1):1–17.
Anupam Jamatia, Björn Gambäck, and Amitava Das. 2015. Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages. In Proceedings of Recent Advances in Natural Language Processing 2015, pages 239–248.
Aditya Joshi, Ameya Prabhu, Manish Shrivastava, and Vasudeva Varma. 2016. Towards sub-word level com-positions for sentiment analysis of hindi-english code mixed text. In Proceedings of COLING 2016, the 26thInternational Conference on Computational Linguistics: Technical Papers, pages 2482–2491. The COLING2016 Organizing Committee.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, BrookeCowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and EvanHerbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th AnnualMeeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg,PA, USA. Association for Computational Linguistics.
Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2017. The IIT Bombay English-Hindi parallel corpus. Under review at LREC 2018, CoRR, abs/1710.02855.
Carol Myers-Scotton. 1993. Common and uncommon ground: Social and structural factors in codeswitching.Language in society, 22(4):475–503.
Carol Myers-Scotton. 1997. Duelling Languages: Grammatical Structure in Codeswitching. Clarendon Press.
Dong Nguyen and A Seza Dogruoz. 2013. Word level language identification in online multilingual communi-cation. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages857–862.
Khyathi Chandu Raghavi, Manoj Kumar Chinnakotla, and Manish Shrivastava. 2015. "Answer ka type kya he?": Learning to classify questions in code-mixed language. In Proceedings of the 24th International Conference on World Wide Web, pages 853–858. ACM.
Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, and Kalika Bali. 2016. Translating code-mixed tweets: A language detection based system. In 3rd Workshop on Indian Language Data Resource and Evaluation - WILDRE-3.
Arnav Sharma, Sakshi Gupta, Raveesh Motlani, Piyush Bansal, Manish Srivastava, Radhika Mamidi, and Dipti M Sharma. 2016. Shallow parsing pipeline for Hindi-English code-mixed social media text. In HLT-NAACL, pages 1340–1345. The Association for Computational Linguistics.
Rai Mahesh Kumar Sinha and Anil Thakur. 2005. Machine translation of bi-lingual hindi-english (hinglish) text.10th Machine Translation summit (MT Summit X), Phuket, Thailand, pages 149–156.
Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury. 2014. POS tagging ofenglish-hindi code-mixed social media content. In Proceedings of the 2014 Conference on Empirical Methodsin Natural Language Processing (EMNLP), pages 974–979. Association for Computational Linguistics.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
Author Index
Šojat, Krešimir, 28
Abebe, Tewodros, 83
Abera, Hafte, 78, 83
Andargie, Tsegaye, 83
Assabie, Yaregal, 83
Atinafu, Solomon, 83

Barreiro, Anabela, 122
Batista, Fernando, 122
Boitet, Christian, 112
Boudhina, Safa, 38

Dhar, Mrinal, 131
Dorr, Bonnie, 57

Ephrem, Binyam, 83

Fehri, Héla, 38

Gasser, Michael, 65
Gezmu, Andargachew Mekonnen, 65

H/Mariam, Sebsibe, 78

Kocijan, Kristina, 28
Kumar, Vaibhav, 131

Lemma, Amanuel, 83
Liberman, Mark, 1

Machonis, Peter, 18
Malireddy, Chanakya, 71
Mangeot, Mathieu, 112
Max, Aurélien, 102
Melese, Michael, 83
Meshesha, Million, 83
Moldovan, Dan, 12
Monteleone, Mario, 47
Mulugeta, Wondwossen, 83

Nürnberger, Andreas, 65

Padó, Sebastian, 91
Poljak, Dario, 28

Reyes, Silvia, 47

Rodrigo, Andrea, 47

Seyoum, Binyam Ephrem, 65
Shifaw, Seifedin, 83
Shrivastava, Manish, 71, 131
Sikos, Jennifer, 91
Silberztein, Max, 2
Somisetty, Srivenkata N M, 71

Teferra Abate, Solomon, 83
Tomokiyo, Mutsuko, 112
Tsegaye, Wondimagegnhue, 83

Vilnat, Anne, 102
Voss, Clare, 57

Yifiru Tachbelie, Martha, 83

Zhai, Yuming, 102
Zhang, Linrui, 12