The Floating Arabic Dictionary: An
Automatic Method for Updating a Lexical
Database through the Detection and
Lemmatization of Unknown Words
Mohammed Attia, Younes Samih, Khaled Shaalan
and Josef van Genabith
Faculty of Engineering and IT, The British University in Dubai
Heinrich-Heine-Universität, Germany
School of Computing, Dublin City University, Ireland
Outline
• Introduction
• Morphological Guesser
• Methodology
• Testing and Evaluation
• Conclusion
Introduction
• Why deal with unknown words?
• Complexity of lemmatization in Arabic
• Data used
Introduction
A living language is just
… living
… dynamic
… constantly changing
… new words appear
… old words die out
… some words are seasonal
… some are core
Introduction
Why deal with unknown words?
• Language is always changing
• New words appear
• Old words disappear
• Unknown words make up 29% of the word types in the corpus (mostly Arabic Gigaword)
• Unknown (out-of-vocabulary, OOV) words always cause problems for:
• Morphological analysers
• Parsers
• Machine Translation & other applications
Review of Arabic lexicographic work
Kitab al-'Ain by al-Khalil bin Ahmed al-Farahidi (died 789)
(refinement/expansion/organizational improvement)
▼
• Tahzib al-Lughah by Abu Mansour al-Azhari (died 980)
• al-Muheet by al-Sahib bin 'Abbad (died 995)
• Lisan al-'Arab by ibn Manzour (died 1311)
• al-Qamous al-Muheet by al-Fairouzabadi (died 1414)
• Taj al-Arous by Muhammad Murtada al-Zabidi (died 1791)
• Muheet al-Muheet (1869) by Butrus al-Bustani
• al-Mu'jam al-Waseet (1960)
• Buckwalter Arabic Morphological Analyzer (2002)
Size: 40,222 lemmas (including 2,034 proper nouns)
Includes many obsolete lexical items
Many modern words are missing
Review of Arabic lexicographic work
Buckwalter obsolete words: 8,400 obsolete words
• رمل (sand): هيلان، وعس، ميعاس، عثير
• صحراء (desert): فيفاء، فدفد، قواء، موماة، متلف، سبسب
• سرج (saddle): حداجة، مخلوفة
• حمل (load): ظعينة، حدج، ظعون، وقر
• لجام (bridle): كعام، كعم، فدام، غمامة، أرنبة، شكم
• راكب (rider): حداء
• جمل (camel): هجينة
• رداء (gown): دفية، بشتة
• حذاء (shoes): مبذل، بشمق، صرمة، قبقاب، زربول، زربون
Review of Arabic lexicographic work
Not in Dictionaries: about 10,000 need to be added
تكنولوجيا:
رقمنة، أتمتة، مكننة
فيس بوك، تويتر، تغريدة
هاتف / جوال / تليفون محمول
لاب توب
الهواتف الذكية
حوسبة
بريد إلكتروني
دي في دي، سي دي
سبام، فيروس
ملتي ميديا
كمبيوتر لوحي، شاشة لمسية
شيفرة
سياسة: أمننة، شرعنة، أفروعربية، إثني، إقصائي، تسييس، محاصصة، جبهوي، جمهوعسكرية، العصبوية، شخصنة، أمركة، عصرنة
اقتصاد: خصخصة، يورو، تضخم، مليار، ترليون، تجارة_إلكترونية، قيمة_دفترية، داو_جونز، بورصة، تعويم، أسهم، ريعي
Review of Arabic lexicographic work
Not in Dictionaries: about 10,000 need to be added
Technology:
Digitalizing, automating, mechanizing
Facebook, twitter, tweet
Mobile phone
Laptop
Smartphone
Computerizing
DVD, CD
Spam, virus
Multimedia
Tablet PC, touch screen
Politics: legalizing, Afro-Arab, ethnic, ostracizing, Americanize, modernize
Economy: privatization, Euro, inflation, Billion, Trillion, e-commerce
Introduction Floating
Dictionary
Introduction
Complexity of lemmatization in Arabic
• Lemmatization means reducing words to their base
(canonical) forms
• played -> play, studies -> study
• went -> go, wives -> wife
• New words in English appear in their base form 86% of
the time (Lindén, 2008)
• New words in Arabic appear in their base form 45% of
the time
• Arabic morphology is complex and semi-algorithmic:
root, patterns, inflections, clitics, etc.
Introduction
Slots of the Arabic verb, in order (proclitics, prefix, lemma, suffix, enclitic):
• Proclitic (conjunction/question article): conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ʾa ‘is it true that’
• Proclitic (Comp): ل li ‘to’; س sa ‘will’; ل la ‘then’
• Prefix (tense/mood, number/gender): imperfective tense (5); perfective tense (1); imperative (2)
• Lemma: verb
• Suffix (tense/mood, number/gender): imperfective tense (10); perfective tense (12); imperative (5)
• Enclitic (object pronoun): first person (2); second person (5); third person (5)
Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Verbs
The lemma شكر šakara ‘to thank’ generates 2,552 valid forms, for example:
وسيشكرونه
wasayashkurunahu
wa@sa@yashkuruna@hu
and@will@thank[they]@him
Slots of the Arabic noun, in order (proclitics, lemma, suffix, enclitic):
• Proclitic (conjunction/question article): conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ʾa ‘is it true that’
• Proclitic (preposition): ب bi ‘with’, ك ka ‘as’ or ل li ‘to’
• Proclitic (definite article): ال al ‘the’
• Lemma (noun): stem
• Suffix (gender/number): masculine dual (4); feminine dual (4); masculine regular plural (4); feminine regular plural (1); feminine mark (1)
• Enclitic (genitive pronoun): first person (2); second person (5); third person (5)
Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Nouns
The lemma مدرس mudarris ‘teacher’ generates 519 valid forms, for example:
وللمدرسين
walilmudarrisiyna
wa@li@al@mudarrisiyna
and@to@the@teachers
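To give a feel for why one noun lemma yields hundreds of forms, the sketch below multiplies out the slot inventories from the noun table. It is only a naive upper bound: it assumes the bracketed numbers are counts of forms per slot and that every slot combines freely with every other, which is not true of Arabic (for example, a noun normally cannot carry both the definite article and a genitive pronoun). The constraints that bring the total down to the 519 valid forms quoted above are not modelled here.

```python
from itertools import product

# Slot inventories taken from the noun table; "" means "slot left empty".
# Reading the bracketed numbers as form counts is an assumption.
CONJ_OR_QUESTION = ["", "wa", "fa", "'a"]
PREPOSITION = ["", "bi", "ka", "li"]
ARTICLE = ["", "al"]
SUFFIX_COUNTS = {
    "masculine dual": 4,
    "feminine dual": 4,
    "masculine regular plural": 4,
    "feminine regular plural": 1,
    "feminine mark": 1,
}
ENCLITIC_COUNTS = {"first person": 2, "second person": 5, "third person": 5}


def proclitic_combinations():
    """Every way of filling the three proclitic slots, ignoring constraints."""
    return list(product(CONJ_OR_QUESTION, PREPOSITION, ARTICLE))


def naive_upper_bound():
    """Proclitic combinations x suffix choices x enclitic choices."""
    n_suffix = 1 + sum(SUFFIX_COUNTS.values())      # 1 = no suffix
    n_enclitic = 1 + sum(ENCLITIC_COUNTS.values())  # 1 = no enclitic
    return len(proclitic_combinations()) * n_suffix * n_enclitic


print(len(proclitic_combinations()))  # 32
print(naive_upper_bound())            # 6240
```

The combination ("wa", "li", "al") is the one used in wa@li@al@mudarrisiyna.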
Introduction
Difference between stemming and lemmatizing
وسيقولونها
wa-sa-ya+quwl+uwna-ha
‘and they will say it’
• Stemming: quwl قول
• Lemmatizing (stem plus alteration rules): qAla قال
Introduction
Data used
• A large-scale corpus of 1,089,111,204 words
• 85% from the Arabic Gigaword Fourth Edition
• 15% from news articles crawled from the Al-Jazeera web site
To put the size in perspective:
• Printed on paper, it would stand more than twice the height of the Eiffel Tower
• = 16,000 large books
• = 640 meters of bookshelves
• An average reader reads 200 wpm with 60% comprehension; reading the whole corpus would take about 11 years, 24/7
Technical issues:
• 20-30 days to analyze with MADA using 10 parallel sessions
• A machine with 256 GB of RAM is needed to read a 3-, 4-, or 5-gram language model of the Arabic Gigaword
Morphological Guesser
We develop a morphological guesser for unknown Arabic words that handles all possible:
• Clitics
• Prefixes
• Suffixes
• And all relevant alteration operations that include
insertion, assimilation, and deletion
Guesser LEXC
======
LEXICON Conjunctions
+وـ conj:وـ Prepositions;
+فـ conj:فـ Prepositions;
Prepositions;
LEXICON Prepositions
+لـ prep:لـ Article;
+كـ prep:كـ Article;
+بـ prep:بـ Article;
Article;
LEXICON Article
+الـ defArt Nouns;
+الـ defArt Adjectives;
Nouns;
Adjectives;
LEXICON Nouns
+noun GuessWords;
^ss^خادم^se^ FemMascduMascpl;
....
LEXICON Adjectives
+adj+fem GuessWords;
+adj+masc GuessWords;
^ss^سعيد^se^+adj+masc FemMascduFemduMascplFempl;
....
LEXICON GuessWords
^ss^^GUESSNOUNSTEM^^se^ FemMascduFemduMascplFempl;
^ss^^GUESSNOUNSTEM^^se^ FemMascduFemduFempl;
^ss^^GUESSNOUNSTEM^^se^ FemMascduFemdu;
….
ALTERATION RULES
=================
a -> b || L _ R
XFST
=====
read regex < arb-Alphabet.txt
define Alphabet
define PossNounStem [[Alphabet]^{2,24}] "+Guess":0;
substitute defined PossNounStem for "^GUESSNOUNSTEM^"
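The behaviour of the guesser can be approximated in a few lines of Python. This is a rough sketch, not the finite-state implementation: it works on Buckwalter-style transliteration, uses a deliberately tiny and hypothetical inventory of clitics and suffixes, and leaves out the alteration rules entirely (so assimilations such as li + al are not handled). The 2 to 24 letter limit on guessed stems mirrors the PossNounStem definition above; the function name `guess` and the dictionary keys are illustrative only.

```python
from itertools import product

# Hypothetical, much-reduced inventories in Buckwalter-style transliteration.
CONJUNCTIONS = ["", "w", "f"]
PREPOSITIONS = ["", "l", "k", "b"]
ARTICLES = ["", "Al"]
SUFFIXES = [("wn", "+masc+pl+nom"), ("At", "+fem+pl"), ("", "+sg")]
POS_TAGS = ["adj", "noun"]
MIN_STEM, MAX_STEM = 2, 24  # mirrors [Alphabet]^{2,24}


def guess(word):
    """Return every analysis whose clitics and suffix fit the surface form."""
    analyses = []
    for conj, prep, art in product(CONJUNCTIONS, PREPOSITIONS, ARTICLES):
        prefix = conj + prep + art
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix, features in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)]
            if not MIN_STEM <= len(stem) <= MAX_STEM:
                continue
            for pos in POS_TAGS:
                analyses.append({
                    "conj": conj, "prep": prep, "article": art,
                    "pos": pos, "stem": stem, "features": features,
                })
    return analyses


for analysis in guess("wAlmswqwn"):
    print(analysis)
```

For wAlmswqwn this yields 12 candidate analyses (three segmentations, two suffix readings, two POS tags), which shows why the guesser on its own over-generates and needs a disambiguation step.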
Methodology
We use a pipeline-based approach
• First: a machine learning (SVM), context-sensitive tool
(MADA) is used to predict:
• POS
• Morpho-syntactic features of number, gender, person, tense, etc.
• Second: The finite-state morphological guesser is used
to produce all the possible interpretations of words and
suggested lemmas.
• Third: The two outputs are matched and the analysis on which they agree is selected.
Methodology
Example: والمسوقون
wa-Al-musaw~iquwna “and-the-marketers”
MADA output:
form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na
pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d
Finite-state guesser output for والمسوقون:
+adjوالمسوق+Guess+masc+pl+nom@
+adjوالمسوقون+Guess+sg@
+nounوالمسوق+Guess+masc+pl+nom@
+nounوالمسوقون+Guess+sg@
و+conj@ال+defArt@+adjمسوق+Guess+masc+pl+nom@
و+conj@ال+defArt@+adjمسوقون+Guess+sg@
و+conj@ال+defArt@+nounمسوق+Guess+masc+pl+nom@ (correct analysis)
و+conj@ال+defArt@+nounمسوقون+Guess+sg@
و+conj@+adjالمسوق+Guess+masc+pl+nom@
و+conj@+adjالمسوقون+Guess+sg@
و+conj@+nounالمسوق+Guess+masc+pl+nom@
و+conj@+nounالمسوقون+Guess+sg@
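The matching step for this example can be sketched as follows. The MADA line is the one shown above; the guesser analyses are transcribed by hand into tuples in Buckwalter-style transliteration. The matching criteria used here (POS, number, conjunction and definite article) and the function names are assumptions made for illustration, not a description of the exact rules used in the pipeline.

```python
MADA_LINE = (
    "form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na "
    "pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d"
)

# Guesser analyses as (conjunction, article, POS, guessed lemma, number).
GUESSES = [
    ("", "", "adj", "wAlmswq", "p"),
    ("", "", "adj", "wAlmswqwn", "s"),
    ("", "", "noun", "wAlmswq", "p"),
    ("", "", "noun", "wAlmswqwn", "s"),
    ("w", "Al", "adj", "mswq", "p"),
    ("w", "Al", "adj", "mswqwn", "s"),
    ("w", "Al", "noun", "mswq", "p"),
    ("w", "Al", "noun", "mswqwn", "s"),
    ("w", "", "adj", "Almswq", "p"),
    ("w", "", "adj", "Almswqwn", "s"),
    ("w", "", "noun", "Almswq", "p"),
    ("w", "", "noun", "Almswqwn", "s"),
]


def parse_mada(line):
    """Turn a 'key:value key:value ...' line into a dictionary."""
    return dict(field.split(":", 1) for field in line.split())


def agrees(guess, mada):
    """True if a guesser analysis is compatible with MADA's prediction."""
    conj, article, pos, _lemma, number = guess
    return (
        pos == mada["pos"]
        and number == mada["num"]
        and (conj == "w") == (mada["prc2"] == "wa_conj")
        and (article == "Al") == (mada["prc0"] == "Al_det")
    )


def select_analyses(guesses, mada):
    """Keep only the guesser analyses that MADA agrees with."""
    return [g for g in guesses if agrees(g, mada)]


print(select_analyses(GUESSES, parse_mada(MADA_LINE)))
```

Under these criteria only one of the twelve analyses survives: the noun reading with conjunction, definite article and lemma mswq.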
Methodology
Results
• Corpus size is 1,089,111,204 tokens, 7,348,173 types
• Unknown types in the corpus: 2,116,180 (29%)
• After spell checking, correctly spelt types are 208,188
• Types with frequency of 10 or more: 40,277
• After lemmatization: 18,399 types
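The filtering funnel above can be checked with a short calculation. The counts are the ones on the slide; the stage labels and the helper name `retention` are only illustrative.

```python
# Counts from the Results slide; each stage is a subset of the previous one.
FUNNEL = [
    ("all types", 7_348_173),
    ("unknown types", 2_116_180),
    ("correctly spelt", 208_188),
    ("frequency of 10 or more", 40_277),
    ("after lemmatization", 18_399),
]


def retention(funnel):
    """Percentage of the previous stage that survives into each stage."""
    rows = []
    for (_, previous), (name, count) in zip(funnel, funnel[1:]):
        rows.append((name, count, round(100 * count / previous, 1)))
    return rows


for name, count, percent in retention(FUNNEL):
    print(f"{name}: {count:,} ({percent}% of previous stage)")
```

The first ratio reproduces the 29% of unknown types; the spell-checking step is by far the most selective, keeping roughly one unknown type in ten.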
Testing and Evaluation
We create a gold standard of 1,310 words, manually annotated for:
• Gold lemma
• Gold POS
• Lexical relevance (include in a
dictionary): yes or no
Among unknown words,
- Proper nouns are the most common
- Verbs are the least common
Gold POS | Type count | Ratio
noun_prop | 584 | 45%
noun | 264 | 20%
adj | 255 | 19%
verb | 52 | 4%
noun_fem_plural (pluralia tantum) | 28 | 2%
noun_broken_plural | 28 | 2%
others: noun_masc_plural (pluralia tantum) (4), part (3), pron_dem (1) | 8 | 0.6%
Excluded | Type count | Ratio
misspelling | 55 | 4%
not_known | 15 | 1%
colloquial | 19 | 1.5%
Lexicographic relevance | Type count | Ratio
Include in a dictionary | 671 | 51%
Don’t include in a dictionary | 639 | 49%
Testing and Evaluation
Evaluating POS (accuracy)
• Baseline: The most frequent tag (proper name)
for all unknown words: 45%
• MADA: 60%
• Voted POS tagging: 69%. When a lemma is assigned a different POS tag with a higher frequency, we take the higher-frequency tag.
# | POS tagging | Accuracy
1 | POS tagging baseline | 45%
2 | MADA POS tagging | 60%
3 | Voted POS tagging | 69%
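A minimal sketch of the voting idea, assuming the input is one (lemma, POS) pair per corpus occurrence: each lemma ends up with the tag it received most often. The function name `vote_pos` and the toy data are hypothetical, and ties are not treated specially here (the first tag seen wins).

```python
from collections import Counter, defaultdict


def vote_pos(tagged_occurrences):
    """Map each lemma to the POS tag it received most often.

    tagged_occurrences: iterable of (lemma, pos) pairs, one per occurrence.
    """
    counts = defaultdict(Counter)
    for lemma, pos in tagged_occurrences:
        counts[lemma][pos] += 1
    return {lemma: tags.most_common(1)[0][0] for lemma, tags in counts.items()}


# Toy data: "mswq" is tagged noun twice and adj once, so noun wins the vote.
occurrences = [
    ("mswq", "noun"),
    ("mswq", "adj"),
    ("mswq", "noun"),
    ("twytr", "noun_prop"),
]
print(vote_pos(occurrences))
```

Voting over all occurrences lets frequent, confidently tagged contexts override an isolated tagging error for the same lemma.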
Testing and Evaluation
Evaluating Lemmatization (accuracy)
• Baseline: new words appear in their base form:
45%
• Pipelined strict definite article ‘al’: 54%
• Pipelined ignoring definite article ‘al’: 63%
# | Lemmatization | Accuracy
1 | Lemma first-order baseline | 45%
2 | Pipelined lemmatization (first-order decision) with strict definite article matching | 54%
3 | Pipelined lemmatization (first-order decision) ignoring definite article matching | 63%
Testing and Evaluation
Evaluating Lemma Weighting
• The weighting criteria aim to push lexicographically relevant words up the list and less interesting words down.
• We aim to make the number of important words high in the
top 100 and low in the bottom 100
Word Weight = ((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor
Good words | In top 100 | In bottom 100
relying on frequency alone (baseline) | 63 | 50
relying on number of sister forms * 800 | 87 | 28
relying on POS factor | 58 | 30
using combined criteria | 78 | 15
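The weighting formula translates directly into code. In this sketch, "sister forms" is read as the distinct surface forms that share a lemma, and "frequencies of sister forms" as the sum of their corpus frequencies; both readings, the function name and the example numbers are assumptions. The POS factor values are not given on the slide, so it is left as a plain parameter.

```python
def word_weight(sister_frequencies, pos_factor=0.0, sister_bonus=800):
    """Weight of a lemma following the Word Weight formula.

    sister_frequencies: corpus frequency of each sister form of the lemma.
    pos_factor: additive POS-dependent term (values are not specified here).
    sister_bonus: the constant 800 from the formula.
    """
    n_sisters = len(sister_frequencies)
    total_frequency = sum(sister_frequencies)
    return (n_sisters * sister_bonus + total_frequency) / 2 + pos_factor


# A lemma seen in three forms outranks an equally frequent one-form lemma.
print(word_weight([300, 200, 100]))  # 1500.0
print(word_weight([600]))            # 700.0
```

Multiplying the number of sister forms by a large constant rewards lemmas that inflect productively, which matches the table: the sister-form criterion alone puts 87 good words in the top 100, against 63 for raw frequency.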
Testing and Evaluation
Oxford new words list: June 2012
• BitTorrent: a protocol that underpins the practice of peer-to-peer file
sharing
• command line: a user interface that is navigated by typing
commands
• cybercast: A news or entertainment program transmitted over the
Internet.
• subcommunity: a distinct grouping within a community
• subjectivization: to make subjective
• subpersonality: a personality mode that kicks in (appears on a
temporary basis) to allow a person to cope with certain types of
psychosocial situations.
• superglue v: to stick with superglue
Testing and Evaluation
Words expected in the next Arabic dictionary/morphological analyser
Testing and Evaluation
Bird’s Eye view
Problem
• Out-of-vocabulary (OOV) words cause problems for morphological analysers, parsers, MT, etc.
• The manual extension of lexical databases is costly and time-consuming.
• With the large amount of data, manual extension of lexicons
becomes practically impossible.
Solution
• Creating an automatic method for updating a lexical database
• Integrating a Machine Learning method with a finite state
guesser to lemmatize unknown words
• Weighting new words by relevance and importance
Conclusion
• We develop a methodology for automatically extracting
and lemmatizing unknown words in Arabic
• We pipeline a finite-state guesser with a machine
learning tool for lemmatization
• We develop a weighting mechanism for predicting the
relevance and importance of lemmas
• Out of 2,116,180 unknown words, we create a lexicon of
18,399 lemmatized, POS-tagged and weighted entries.