The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the Detection and Lemmatization of Unknown Words

Mohammed Attia, Younes Samih, Khaled Shaalan and Josef van Genabith

Faculty of Engineering and IT, The British University in Dubai

Heinrich-Heine-Universität, Germany

School of Computing, Dublin City University, Ireland

Outline

• Introduction

• Morphological Guesser

• Methodology

• Testing and Evaluation

• Conclusion

Introduction

• Why deal with unknown words?

• Complexity of lemmatization in Arabic

• Data used

Introduction

A living language is just

… living

… dynamic

… constantly changing

… new words appear

… old words die out

… some words are seasonal

… some are core

Introduction

Why deal with unknown words?

• Language is always changing

• New words appear

• Old words disappear

• Unknown words make up 29% of the word types in our Gigaword-based corpus

• Unknown (out-of-vocabulary, OOV) words cause problems for:

• Morphological analysers

• Parsers

• Machine Translation & other applications

Review of Arabic lexicographic work

Kitab al-'Ain by al-Khalil bin Ahmed al-Farahidi (died 789)

(refinement/expansion/organizational improvement)

▼

• Tahzib al-Lughah by Abu Mansour al-Azhari (died 980)

• al-Muheet by al-Sahib bin 'Abbad (died 995)

• Lisan al-'Arab by ibn Manzour (died 1311)

• al-Qamous al-Muheet by al-Fairouzabadi (died 1414)

• Taj al-Arous by Muhammad Murtada al-Zabidi (died 1791)

• Muheet al-Muheet (1869) by Butrus al-Bustani

• al-Mu'jam al-Waseet (1960)

• Buckwalter Arabic Morphological Analyzer (2002)

Size: 40,222 lemmas (including 2,034 proper nouns)

Includes many obsolete lexical items

Many modern words are missed out


Obsolete words in Buckwalter: 8,400

رمل (sand): هيلان، وعس، ميعاس، عثير

صحراء (desert): فيفاء، فدفد، قواء، موماة، متلف، سبسب

سرج (saddle): حداجة، مخلوفة

حمل (load): ظعينة، حدج، ظعون، وقر

لجام (bridle): كعام، كعم، فدام، غمامة، أرنبة، شكم

راكب (rider): حداء

جمل (camel): هجينة

رداء (gown): دفية، بشتة

حذاء (shoes): مبذل، بشمق، صرمة، قبقاب، زربول، زربون


Not in Dictionaries: about 10,000 need to be added

تكنولوجيا:

رقمنة، أتمتة، مكننة

فيس بوك، تويتر، تغريدة

تليفون محمول / جوال / هاتف

لاب توب

الهواتف الذكية

حوسبة

بريد إلكتروني

دي في دي، سي دي

سبام، فيروس

ملتي ميديا

كمبيوتر لوحي، شاشة لمسية

شيفرة

سياسة: أمننة، شرعنة، أفروعربية، إثني، إقصائي، تسييس، محاصصة، جبهوي، جمهوعسكرية، العصبوية، شخصنة، أمركة، عصرنة

اقتصاد: داو جونز، تعويم، بورصة، يورو، ريعي، خصخصة، قيمة دفترية، أسهم، تضخم، تجارة إلكترونية، ترليون، مليار


Not in Dictionaries: about 10,000 need to be added

Technology:

Digitizing, automating, mechanizing

Facebook, Twitter, tweet

Mobile phone

Laptop

Smartphone

Computerizing

Email

DVD, CD

Spam, virus

Multimedia

Tablet PC, touch screen

Politics: legitimization, Afro-Arab, ethnic, exclusionary, Americanization, modernization

Economy: privatization, euro, inflation, billion, trillion, e-commerce

Introduction

Floating Dictionary

Introduction

Complexity of lemmatization in Arabic

• Lemmatization means reducing words to their base (canonical) forms

• played -> play; studies -> study

• went -> go; wives -> wife

• New words in English appear in their base form 86% of

the time (Lindén, 2008)

• New words in Arabic appear in their base form 45% of

the time

• Arabic morphology is complex and semi-algorithmic:

root, patterns, inflections, clitics, etc.

Introduction

Possible Concatenations in Arabic Verbs

Slot order: Proclitics + Prefix + Lemma + Suffix + Enclitic

• Proclitic, conjunction/question article: conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’

• Proclitic, complementizer (Comp): ل li ‘to’; س sa ‘will’; ل la ‘then’

• Prefix, tense/mood and number/gender: imperfective tense (5); perfective tense (1); imperative (2)

• Lemma: verb

• Suffix, tense/mood and number/gender: imperfective tense (10); perfective tense (12); imperative (5)

• Enclitic, object pronoun: first person (2); second person (5); third person (5)

The lemma شكر šakara ‘to thank’ generates 2,552 valid forms, for example:

وسيشكرونه

wasayashkurunahu

wa@sa@yashkuruna@hu

and@will@thank[they]@him

Introduction

Possible Concatenations in Arabic Nouns

Slot order: Proclitics + Lemma + Suffix + Enclitic

• Proclitic, conjunction/question article: conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’

• Proclitic, preposition: ب bi ‘with’, ك ka ‘as’ or ل li ‘to’

• Proclitic, definite article: ال al ‘the’

• Lemma: noun (stem)

• Suffix, gender/number: masculine dual (4); feminine dual (4); masculine regular plural (4); feminine regular plural (1); feminine mark (1)

• Enclitic, genitive pronoun: first person (2); second person (5); third person (5)

The lemma مدرس mudarris ‘teacher’ generates 519 valid forms, for example:

وللمدرسين

walilmudarrisiyna

wa@li@al@mudarrisiyna

and@to@the@teachers
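
The "@"-segmented notation used in the two examples can be turned back into a surface form by concatenating the slots and applying alteration rules. The Python sketch below is only an illustration: the function names are hypothetical, and it implements a single toy rule (the article "al" is reduced to "l" after the preposition "li"), not the full rule set of the guesser.

```python
def join_segments(segments):
    """Concatenate clitic-segmented transliteration into one surface form.

    Applies one toy alteration rule (an assumption made for illustration):
    the definite article "al" is reduced to "l" after the preposition "li".
    """
    out = []
    for i, seg in enumerate(segments):
        if seg == "al" and i > 0 and segments[i - 1] == "li":
            seg = "l"  # deletion: li + al -> lil
        out.append(seg)
    return "".join(out)


def surface_form(analysis, sep="@"):
    """Split an analysis such as 'wa@li@al@...' on the separator and rejoin it."""
    return join_segments(analysis.split(sep))


noun = surface_form("wa@li@al@mudarrisiyna")
verb = surface_form("wa@sa@yashkuruna@hu")
print(noun)  # walilmudarrisiyna
print(verb)  # wasayashkurunahu
```

The verb needs no alteration here, while the noun shows why plain concatenation is not enough: without the rule the output would contain "lial" instead of "lil".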

Introduction

Difference between stemming and lemmatizing

وسيقولونها

wa-sa-ya+quwl+uwna-ha

‘and they will say it’

• Stemming: quwl قول

• Lemmatizing (alteration rules): qAla قال
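
The contrast between the two outputs can be sketched in a few lines of Python. This is a toy illustration only: the function names are hypothetical, and the one-entry lookup table stands in for the alteration rules, which are not spelled out on the slide.

```python
# Hypothetical one-entry table standing in for the alteration rules:
# it maps the stem of a hollow verb to its citation (lemma) form.
STEM_TO_LEMMA = {"quwl": "qAla"}


def stem_of(segmented):
    """Return the stem of a form written as clitic-prefix+stem+suffix-clitic."""
    # Clitics are separated by "-"; the inflected core is the part with "+".
    core = next(part for part in segmented.split("-") if "+" in part)
    prefix, stem, suffix = core.split("+")
    return stem


def lemma_of(segmented):
    """Map the stem to its lemma, falling back to the stem when unknown."""
    stem = stem_of(segmented)
    return STEM_TO_LEMMA.get(stem, stem)


form = "wa-sa-ya+quwl+uwna-ha"
print(stem_of(form))   # quwl
print(lemma_of(form))  # qAla
```

Stemming stops once the affixes are removed; lemmatizing needs the extra mapping step, which is why the two results differ for this word.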

Introduction

Data used

• A large-scale corpus of 1,089,111,204

words

• 85% from the Arabic Gigaword Fourth Edition

• 15% from news articles crawled from the Al-

Jazeera web site

If printed on paper, the corpus would stand more than twice the height of the Eiffel Tower

= 16,000 large books

= 640 meters of bookshelves

An average reader reads 200 wpm with 60% comprehension.

You would need about 11 years of reading 24/7 to get through the Gigaword corpus

Technical issues:

20-30 days to analyze with MADA using 10 parallel sessions.

You will need a machine with 256GB RAM to load a 3-, 4-, or 5-gram language model of the Arabic Gigaword
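
The size comparisons can be checked with simple arithmetic. The sketch below uses only the figures quoted on the slide; the variable names are illustrative. At exactly 200 words per minute the full corpus works out to a little over 10 years of non-stop reading, the same order of magnitude as the 11 years quoted.

```python
CORPUS_WORDS = 1_089_111_204  # corpus size from the slide
WORDS_PER_MINUTE = 200        # average reading speed from the slide
BOOKS = 16_000                # "large books" from the slide
SHELF_METERS = 640            # bookshelf length from the slide

minutes = CORPUS_WORDS / WORDS_PER_MINUTE
years_nonstop = minutes / 60 / 24 / 365   # reading 24/7
words_per_book = CORPUS_WORDS / BOOKS     # implied size of one "large book"
cm_per_book = SHELF_METERS * 100 / BOOKS  # implied spine width

print(f"{years_nonstop:.1f} years of non-stop reading")
print(f"{words_per_book:,.0f} words per book")
print(f"{cm_per_book:.0f} cm of shelf per book")
```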

Morphological Guesser

We develop a morphological guesser for

Arabic unknown words that handles all

possible

• Clitics

• Prefixes

• Suffixes

• And all relevant alteration operations that include

insertion, assimilation, and deletion

Guesser LEXC

======

! Optional conjunction proclitics
LEXICON Conjunctions
وـ+conj:وـ Prepositions;
فـ+conj:فـ Prepositions;
Prepositions;    ! empty entry: the proclitic may be skipped

! Optional preposition proclitics
LEXICON Prepositions
لـ+prep:لـ Article;
كـ+prep:كـ Article;
بـ+prep:بـ Article;
Article;

! Optional definite article
LEXICON Article
الـ+defArt Nouns;
الـ+defArt Adjectives;
Nouns;
Adjectives;

LEXICON Nouns
+noun GuessWords;
^ss^خادم^se^ FemMascduMascpl;
....

LEXICON Adjectives
+adj+fem GuessWords;
+adj+masc GuessWords;
^ss^سعيد^se^+adj+masc FemMascduFemduMascplFempl;
....

! Placeholder stems, replaced later in XFST by any string over the alphabet
LEXICON GuessWords
^ss^^GUESSNOUNSTEM^^se^ FemMascduFemduMascplFempl;
^ss^^GUESSNOUNSTEM^^se^ FemMascduFemduFempl;
^ss^^GUESSNOUNSTEM^^se^ FemMascduFemdu;
….

ALTERATION RULES

=================

a -> b || L _ R

XFST

=====

read regex < arb-Alphabet.txt
define Alphabet
define PossNounStem [[Alphabet]^{2,24}] "+Guess":0;
substitute defined PossNounStem for "^GUESSNOUNSTEM^"

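
To make the idea behind the guesser concrete, here is a minimal Python sketch that enumerates candidate analyses by peeling optional proclitics and a suffix off a word. It is not the finite-state implementation: the clitic lists follow the LEXC fragment, the stem length bounds follow the XFST definition, but the suffix inventory (two endings only), the POS list, and all names are assumptions made for illustration. No alteration rules are applied.

```python
from itertools import product

CONJUNCTIONS = ["", "و", "ف"]
PREPOSITIONS = ["", "ل", "ك", "ب"]
ARTICLES = ["", "ال"]
POS_TAGS = ["adj", "noun"]
# Hypothetical suffix inventory: bare form and masculine plural nominative.
SUFFIXES = {"": "sg", "ون": "masc+pl+nom"}
MIN_STEM, MAX_STEM = 2, 24  # stem length bounds, as in the XFST definition


def guess(word):
    """Return every (conj, prep, article, pos, stem, features) reading of word."""
    analyses = []
    for conj, prep, art in product(CONJUNCTIONS, PREPOSITIONS, ARTICLES):
        prefix = conj + prep + art
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix, feats in SUFFIXES.items():
            if not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)]
            if MIN_STEM <= len(stem) <= MAX_STEM:
                for pos in POS_TAGS:
                    analyses.append((conj, prep, art, pos, stem, feats))
    return analyses


analyses = guess("والمسوقون")
for analysis in analyses:
    print(analysis)
```

Because every proclitic and the suffix are optional, a single unknown word yields many competing readings, which is why a second source of evidence is needed to pick one.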

Methodology

We use a pipeline-based approach

• First: a machine learning (SVM), context-sensitive tool

(MADA) is used to predict:

• POS

• Morpho-syntactic features of number, gender, person, tense, etc.

• Second: The finite-state morphological guesser is used

to produce all the possible interpretations of words and

suggested lemmas.

• Third: The two outputs are matched against each other and the analysis they agree on is selected.

Methodology

Example: والمسوقون

wa-Al-musaw~iquwna “and-the-marketers”

MADA output:

form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d

Finite-state guesser output:

والمسوقون    +adjوالمسوق+Guess+masc+pl+nom@

والمسوقون    +adjوالمسوقون+Guess+sg@

والمسوقون    +nounوالمسوق+Guess+masc+pl+nom@

والمسوقون    +nounوالمسوقون+Guess+sg@

والمسوقون    و+conj@ال+defArt@+adjمسوق+Guess+masc+pl+nom@

والمسوقون    و+conj@ال+defArt@+adjمسوقون+Guess+sg@

والمسوقون    و+conj@ال+defArt@+nounمسوق+Guess+masc+pl+nom@    (correct analysis)

والمسوقون    و+conj@ال+defArt@+nounمسوقون+Guess+sg@

والمسوقون    و+conj@+adjالمسوق+Guess+masc+pl+nom@

والمسوقون    و+conj@+adjالمسوقون+Guess+sg@

والمسوقون    و+conj@+nounالمسوق+Guess+masc+pl+nom@

والمسوقون    و+conj@+nounالمسوقون+Guess+sg@
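
The matching step can be sketched as a filter: parse the MADA feature line, then keep only the guesser analyses that agree with it on POS, clitics, number and gender. The Python below is a hedged illustration, not the authors' code. The tuple layout, the Buckwalter-style transliteration of the guesser analyses, and the function names are assumptions, and only six of the twelve analyses are listed.

```python
def parse_mada(line):
    """Turn a 'key:value key:value ...' feature line into a dict."""
    return dict(field.split(":", 1) for field in line.split())


mada = parse_mada(
    "form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na "
    "pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d"
)

# Hypothetical (conj, article, pos, lemma, gender, number) tuples,
# transliterated; a subset of the guesser analyses for the example word.
guesses = [
    ("", "", "adj", "wAlmswq", "masc", "pl"),
    ("", "", "noun", "wAlmswqwn", None, "sg"),
    ("wa", "Al", "adj", "mswq", "masc", "pl"),
    ("wa", "Al", "noun", "mswq", "masc", "pl"),
    ("wa", "Al", "noun", "mswqwn", None, "sg"),
    ("wa", "", "noun", "Almswq", "masc", "pl"),
]

NUMBER = {"p": "pl", "s": "sg"}
GENDER = {"m": "masc", "f": "fem"}


def agrees(guess, mada):
    """True when a guesser analysis is compatible with the MADA features."""
    conj, art, pos, lemma, gender, number = guess
    return (
        pos == mada["pos"]
        and (conj == "wa") == (mada["prc2"] == "wa_conj")
        and (art == "Al") == (mada["prc0"] == "Al_det")
        and number == NUMBER.get(mada["num"])
        and gender in (None, GENDER.get(mada["gen"]))
    )


selected = [g for g in guesses if agrees(g, mada)]
print(selected)  # [('wa', 'Al', 'noun', 'mswq', 'masc', 'pl')]
```

Only the reading with the conjunction, the definite article, noun POS and masculine plural survives, which gives the lemma mswq (مسوق).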

Methodology

Results

• Corpus size is 1,089,111,204 tokens, 7,348,173 types

• Unknown types in the corpus: 2,116,180 (29%)

• After spell checking, correctly spelt types are 208,188

• Types with frequency of 10 or more: 40,277

• After lemmatization: 18,399 types
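
The filtering funnel can be summarized by the share of items that survives each step. The short Python sketch below recomputes those shares from the counts on the slide; the labels are illustrative.

```python
funnel = [
    ("all types", 7_348_173),
    ("unknown types", 2_116_180),
    ("correctly spelt", 208_188),
    ("frequency >= 10", 40_277),
    ("after lemmatization", 18_399),
]

# Share of the previous stage that survives each filtering step.
survival = {}
for (_, previous), (label, count) in zip(funnel, funnel[1:]):
    survival[label] = count / previous
    print(f"{label:<22}{count:>10,}  {survival[label]:6.1%} of previous stage")
```

The first ratio reproduces the 29% of unknown types; the later ones show that spell checking is by far the most aggressive filter.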

Testing and Evaluation

We create a gold standard of 1,310 words manually annotated for:

• Gold lemma

• Gold POS

• Lexical relevance (include in a dictionary): yes or no

Among unknown words,

- Proper nouns are the most common

- Verbs are the least common

Gold POS type                                  Count   Ratio
noun_prop                                        584   45%
noun                                             264   20%
adj                                              255   19%
verb                                              52   4%
noun_fem_plural (pluralia tantum)                 28   2%
noun_broken_plural                                28   2%
others: noun_masc_plural (pluralia tantum) (4),
  part (3), pron_dem (1)                           8   0.6%

Excluded
misspelling                                       55   4%
not_known                                         15   1%
colloquial                                        19   1.5%

Lexicographic relevance
Include in a dictionary                          671   51%
Don’t include in a dictionary                    639   49%


Evaluating POS (accuracy)

• Baseline: the most frequent tag (proper name) for all unknown words: 45%

• MADA: 60%

• Voted POS tagging: 69%. When a lemma gets a different POS tag with a higher frequency, we take the more frequent tag

    POS tagging              Accuracy
1   POS tagging baseline     45%
2   MADA POS tagging         60%
3   Voted POS tagging        69%
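
One plausible reading of the voting step is a frequency-weighted majority vote per lemma. The Python sketch below follows that reading; the function name, the input layout, and the counts are hypothetical and are not taken from the evaluation data.

```python
from collections import Counter, defaultdict


def voted_pos(tagged_forms):
    """Pick, for each lemma, the POS tag with the highest summed frequency.

    tagged_forms: iterable of (lemma, pos, frequency) triples, one per
    tagged word form that was reduced to that lemma.
    """
    votes = defaultdict(Counter)
    for lemma, pos, freq in tagged_forms:
        votes[lemma][pos] += freq
    return {lemma: counts.most_common(1)[0][0] for lemma, counts in votes.items()}


# Hypothetical tag frequencies for two lemmas.
observations = [
    ("mswq", "noun", 120),
    ("mswq", "adj", 15),
    ("mswq", "noun_prop", 40),
    ("twytr", "noun_prop", 300),
    ("twytr", "noun", 20),
]

winners = voted_pos(observations)
print(winners)  # {'mswq': 'noun', 'twytr': 'noun_prop'}
```

Voting lets frequent, consistent taggings of a lemma override an occasional wrong tag on one of its forms.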


Evaluating Lemmatization (accuracy)

• Baseline: new words appear in their base form: 45%

• Pipelined, strict definite article ‘al’ matching: 54%

• Pipelined, ignoring definite article ‘al’ matching: 63%

    Lemmatization                                                                            Accuracy
1   Lemma first-order baseline                                                               45%
2   Pipelined lemmatization (first-order decision) with strict definite article matching     54%
3   Pipelined lemmatization (first-order decision) ignoring definite article matching        63%


Evaluating Lemma Weighting

• The weighting criteria aim to push lexicographically relevant words up the list and less interesting words down.

• We aim to make the number of important words high in the top 100 and low in the bottom 100

Word Weight = ((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor

Good words                                  In top 100   In bottom 100
relying on frequency alone (baseline)           63            50
relying on number of sister forms * 800         87            28
relying on POS factor                           58            30
using combined criteria                         78            15
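
The weighting formula translates directly into code. In the Python sketch below, the multiplier 800 comes from the slide; the POS factor values and the frequency lists are hypothetical, since the slide does not give them.

```python
def word_weight(sister_form_freqs, pos_factor, form_bonus=800):
    """Weight = ((number of sister forms * 800) + summed frequencies) / 2 + POS factor.

    sister_form_freqs: corpus frequencies of the inflected forms that share
    the lemma. pos_factor: a per-POS bonus whose values are not given on the
    slide, so any number passed here is an assumption.
    """
    n_forms = len(sister_form_freqs)
    total_freq = sum(sister_form_freqs)
    return (n_forms * form_bonus + total_freq) / 2 + pos_factor


# A lemma seen in three inflected forms versus a more frequent one seen in one.
rich = word_weight([120, 40, 15], pos_factor=50)
single = word_weight([1000], pos_factor=0)
print(rich, single)  # 1337.5 900.0
```

With these made-up numbers, the lemma attested in several sister forms outranks a more frequent lemma attested in only one form, which is the behaviour the criteria are meant to encourage.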


Oxford new words list: June 2012

• BitTorrent: a protocol that underpins the practice of peer-to-peer file

sharing

• command line: a user interface that is navigated by typing

commands

• cybercast: A news or entertainment program transmitted over the

Internet.

• subcommunity: a distinct grouping within a community

• subjectivization: the act of making something subjective

• subpersonality: a personality mode that kicks in (appears on a

temporary basis) to allow a person to cope with certain types of

psychosocial situations.

• superglue v: to stick with superglue


Words expected in the next Arabic dictionary/morphological analyser



Bird’s-Eye View

Problem

• Out-of-vocabulary (OOV) words cause problems for morphological analysers, parsers, MT, etc.

• The manual extension of lexical databases is costly and time-consuming.

• With the large amount of data, manual extension of lexicons

becomes practically impossible.

Solution

• Creating an automatic method for updating a lexical database

• Integrating a Machine Learning method with a finite state

guesser to lemmatize unknown words

• Weighting new words by relevance and importance

Conclusion

• We develop a methodology for automatically extracting

and lemmatizing unknown words in Arabic

• We pipeline a finite-state guesser with a machine

learning tool for lemmatization

• We develop a weighting mechanism for predicting the

relevance and importance of lemmas

• Out of 2,116,180 unknown word types, we create a lexicon of 18,399 lemmatized, POS-tagged and weighted entries.
