School of Computing FACULTY OF ENGINEERING Automatic Part-of-Speech Tagging of Arabic Text School of Computing FACULTY OF ENGINEERING Majdi Sawalha [email protected] Supervisor Dr. Eric Atwell [email protected]


Page 1: School of Computing FACULTY OF ENGINEERING Automatic Part-of-Speech Tagging of Arabic Text School of Computing FACULTY OF ENGINEERING Majdi Sawalha sawalha@comp.leeds.ac.uk

School of Computing, Faculty of Engineering

Automatic Part-of-Speech Tagging of

Arabic Text

Majdi Sawalha

sawalha@comp.leeds.ac.uk

Supervisor

Dr. Eric Atwell

[email protected]


Outline:

• Introduction

• Research focus and questions

• A word about Arabic Language

• Arabic Language Corpora

• Gold standard for evaluation

• Arabic Morphological Analysers and Stemmers

• Prior-Knowledge broad-lexical resource

• Hybrid Part-of-Speech tagger of Arabic language


Introduction

• What is Part-of-Speech tagging?

• What is a tag?

• What is a tagset?

Our aim

How can we widen the scope of Arabic Part-of-Speech tagging and develop a system that can process Arabic text in a wide range of formats, domains, and genres, both vowelized and non-vowelized?


Research focus and questions

How can we widen the scope of Arabic Part-of-Speech tagging and develop a system that can process Arabic text in a wide range of formats, domains, and genres, both vowelized and non-vowelized?

Research sub-questions:

• Can richer lexical resources derived from dictionaries and grammar textbooks improve the coverage of morphological analysis for a wider range of Arabic text formats, domains, and genres?

• How do we evaluate existing Part-of-Speech taggers and a new Part-of-Speech tagger on a wider range of text formats, domains, and genres, in both vowelized and non-vowelized text?

• How do I make the best reuse of existing tagger components and methods?


Introduction

Tagging Applications

• A good tagger can serve as a preprocessor.

• Large tagged text corpora are used as data for linguistic studies.

• Information technology applications:
  • Text indexing and retrieval.
  • Speech processing.


A word about Arabic Language

Arabic linguists classify words into three main categories:

• Verbs: words that denote an action and have tense.

• Nouns: names of a person, place, or object; they do not have tense.

• Particles: words whose meaning cannot be understood unless they are joined to a noun, a verb, or both.


A word about Arabic Language

Verb classifications

Verb (الفعل):

• Perfect / Past Verb (الفعل الماضي)

• Progress Verb (الفعل المضارع)

• Imperative Verb (فعل أمر)

Verb (الفعل):

• Complete Verb (فعل تام) / Incomplete Verb (فعل ناقص)

• Transitive Verb (فعل متعد) / Intransitive Verb (فعل لازم)

• Active Verb (فعل معلوم) / Passive Verb (فعل مجهول)


A word about Arabic Language

Nouns

• Arabic linguists distinguish between 21 types of nouns:
  • Verbal noun
  • Original noun
  • Pronoun
  • Personal noun
  • Demonstrative noun
  • Joining nouns
  • Interrogative noun
  • Conditional noun
  • Generalization nouns
  • Adverb
  • Present participle
  • Past participle
  • Adjective
  • Increased present participle
  • Comparing and contrasting entities, the comparative and the superlative
  • Adverb of place
  • Adverb of time
  • Noun of instrument
  • Proper noun
  • Noun of genus
  • Ordinal number nouns
  • Verb noun
  • The five nouns


A word about Arabic Language

Particles

• Building Particles

• Meaning Particles
  • Inactive Particles
  • Active Particles, grouped by what they affect:
    • Verb: Jussive, Subjunctive, Partial subjunctive
    • Noun: Genitive Case, Vocative, Exception
    • Both: Conjunction


Arabic Language Tagset

Evaluating existing Arabic tagsets

• Every researcher has developed their own tagset, either detailed or minimal.

• A comparison of the different tagsets will show:
  • The number of tags used.
  • The purpose of the tagset.
  • The source of information used when designing the tagset.
  • The errors in classifying tags into their categories.

• The goal is to design a more reliable, multi-level tagset that ranges from a minimal tagset to a more detailed one.
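The idea of a multi-level tagset can be sketched in a few lines of Python. This is only an illustration: the dot-separated encoding and the fine-grained labels (`V.PERF.ACTIVE`, `N.PROPER`, and so on) are assumptions made for this example and are not the tagset proposed in the talk. Only the three top-level categories (verb, noun, particle) come from the slides.

```python
# Hypothetical hierarchical tags: each dot-separated field adds one level of detail.
# Level 1 is the minimal tagset (V, N, P); deeper levels are more detailed.
DETAILED_TAGS = [
    "V.PERF.ACTIVE",      # perfect (past) verb, active voice
    "V.IMPERF.PASSIVE",   # imperfect verb, passive voice
    "N.PROPER",           # proper noun
    "N.ADJ.COMPARATIVE",  # comparative adjective
    "P.ACTIVE.JUSSIVE",   # active particle that puts a verb in the jussive
]


def collapse(tag, level):
    """Return the tag truncated to `level` fields (level 1 = minimal tagset)."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return ".".join(tag.split(".")[:level])


def tagset_at(tags, level):
    """Return the sorted list of distinct tags visible at the given level."""
    return sorted({collapse(tag, level) for tag in tags})


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, tagset_at(DETAILED_TAGS, level))
```

With an encoding like this, two taggers that use different levels of detail can be compared at the deepest level they share.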


A word about Arabic Language

Arabic Language challenges

• Writing constraints lead to ambiguities.

• Tokenization.

• Agglutination.

• Complex morphology.

• Vowel marks.

• Grammatical ambiguity: 2.8 in vowelized text and 5.6 in non-vowelized text.


Tokenization

• What is a token?
  • Main tokens are delimited by white space or a punctuation mark ( ، ؛ ؟ ! . etc.).

• Arabic morphology allows words to be prefixed or suffixed with clitics.
  • Clitics can be concatenated one after the other.
  • Arabic clitics are not easily recognizable.
  • A single word can comprise up to four independent morphemes.

• The tokenizer is responsible for:
  • Defining word boundaries.
  • Demarcating clitics, multiword expressions, abbreviations and numbers.

• Affixes carry morpho-syntactic features (tense, person, gender, number).

• Clitics serve syntactic functions (negation, definition, conjunction, preposition).
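The first step described above, splitting text into main tokens at white space and punctuation, might look like the following minimal Python sketch. The punctuation set is an assumption: it contains the marks shown on the slide plus a few Latin ones. The sketch does not demarcate clitics, multiword expressions, abbreviations or numbers, which the slide lists as further tokenizer responsibilities.

```python
import re

# Arabic comma (U+060C), semicolon (U+061B) and question mark (U+061F),
# plus "!" and "." from the slide; ",", ";", "?" and ":" are added as an assumption.
PUNCTUATION = "\u060c\u061b\u061f!.,;?:"

_CLASS = re.escape(PUNCTUATION)
# A token is either a single punctuation mark or a run of characters
# that are neither white space nor punctuation.
TOKEN_RE = re.compile(rf"[{_CLASS}]|[^\s{_CLASS}]+")


def main_tokens(text):
    """Split text into main tokens; punctuation marks become separate tokens."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    # Words are given in Buckwalter transliteration to keep the example readable.
    print(main_tokens("ktb AlktAb\u060c wktAbhm."))
```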


Tokenization

وليكتبونها [ wlyktbwnhA ] (And they write it)

و * ل * ي * كتب * ون * ها (w*l*y*ktb*wn*hA)

• w (و): Conjunction
• l (ل): Preposition
• y (ي): Progressive letter
• ktb (كتب): Root
• wn (ون): Relative pronoun (plural/subject)
• hA (ها): Relative pronoun (object)

Examples:

• ktb (كتب): Wrote
• yktb (يكتب): Write
• ktbh (كتبه): Wrote it
• yktbh (يكتبه): Writing it
• ktAb (كتاب): Book
• AlktAb (الكتاب): The book
• ktAbhm (كتابهم): Their book
• wktAbhm (وكتابهم): And their book

• Most Arabic words consist of a stem or root and a combination of prefixes and suffixes:
  1- Root
  2- Prefix(es) + Root
  3- Root + Suffix(es)
  4- Prefix(es) + Root + Suffix(es)
  5- Stem
  6- Prefix(es) + Stem
  7- Stem + Suffix(es)
  8- Prefix(es) + Stem + Suffix(es)
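The Prefix(es) + Root/Stem + Suffix(es) patterns can be illustrated with a toy affix-stripping segmenter over Buckwalter-transliterated words. This is a deliberately naive sketch, not the tokenizer from the talk: the prefix and suffix lists cover only the examples on this slide, and the minimum stem length of three letters is an assumption. A real segmenter would need a lexicon to avoid stripping letters that belong to the stem.

```python
# Illustrative affix lists, limited to the examples on the slide (assumption).
PREFIXES = ["Al", "w", "l", "y"]
SUFFIXES = ["hm", "hA", "wn", "h"]
MIN_STEM = 3  # never strip an affix if fewer than 3 letters would remain


def _strip(word, affixes, at_start):
    """Repeatedly strip the first matching affix from one side of the word."""
    found = []
    changed = True
    while changed:
        changed = False
        for affix in affixes:
            matches = word.startswith(affix) if at_start else word.endswith(affix)
            if matches and len(word) - len(affix) >= MIN_STEM:
                word = word[len(affix):] if at_start else word[:-len(affix)]
                found.append(affix)
                changed = True
                break
    return word, found


def segment(word):
    """Return [prefixes..., stem, suffixes...] in surface order."""
    word, prefixes = _strip(word, PREFIXES, at_start=True)
    word, suffixes = _strip(word, SUFFIXES, at_start=False)
    # Suffixes were collected from the outside in, so reverse them.
    return prefixes + [word] + suffixes[::-1]


if __name__ == "__main__":
    for w in ["ktb", "yktb", "ktbh", "AlktAb", "wktAbhm", "wlyktbwnhA"]:
        print(w, "->", "*".join(segment(w)))
```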


Vowels & Diacritical marks

• Arabic has 2 types of vowels:
  1- Long vowels: Alif ا , waw و , yaa ي (part of the Arabic alphabet).
  2- Short vowels: small vowel marks that are not part of the Arabic alphabet. These marks are placed above or below the Arabic letters.

Arabic has 5 other diacritical marks:

• Nunation: the doubling of a short vowel mark, used at the end of indefinite nouns.

• Sukun (absence of a vowel): the consonant is not followed by a vowel.

• Gemination (Shadda): duplication of the consonant.
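Because short vowels and the other diacritical marks are separate Unicode characters, vowelized and non-vowelized text can be told apart, and vowelized text reduced to its non-vowelized form, with a simple character range. The sketch below assumes the input uses the standard Arabic block code points U+064B to U+0652 (the three nunation marks, the three short vowels, shadda and sukun); the function names are illustrative.

```python
import re

# U+064B..U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
DIACRITICS_RE = re.compile("[\u064b-\u0652]")


def strip_diacritics(text):
    """Remove short vowels, nunation, shadda and sukun from Arabic text."""
    return DIACRITICS_RE.sub("", text)


def is_vowelized(text):
    """Return True if the text contains at least one diacritical mark."""
    return DIACRITICS_RE.search(text) is not None


if __name__ == "__main__":
    kataba = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kaf, ta, ba, each with a fatha
    print(is_vowelized(kataba), strip_diacritics(kataba))
```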


Vowelization & Part-of-Speech Tagging

Importance of using diacritics in Arabic language

• Adding semantic information to words.

• Determining the correct tag of a word in the sentence.

• Indicating the grammatical function of a word (mood, aspect and voice endings for verbs; case endings for nouns).

• Indicating the correct pronunciation of a word, supporting correct syntactic analysis, and removing semantic confusion for Arabic readers.


Vowelization & Part-of-Speech Tagging

• Diacritical marks affect the Part-of-Speech tag of the word and its meaning.
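A small Python sketch can make this concrete. It uses three well-known vowelized readings of the consonant skeleton "ktb", written in Buckwalter transliteration (where a, i, u are short vowels, o is sukun, ~ is shadda, and F, N, K are nunation). The tiny lexicon and the function names are assumptions for illustration only.

```python
from collections import defaultdict

# Buckwalter symbols for short vowels, sukun, shadda and nunation.
BUCKWALTER_DIACRITICS = set("aiuo~FNK")

# Illustrative lexicon: vowelized form -> (coarse tag, gloss).
VOWELIZED_LEXICON = {
    "kataba": ("verb", "he wrote"),
    "kutiba": ("verb", "it was written"),
    "kutub": ("noun", "books"),
}


def skeleton(word):
    """Drop diacritics, leaving the non-vowelized form of the word."""
    return "".join(ch for ch in word if ch not in BUCKWALTER_DIACRITICS)


# Index the lexicon by non-vowelized form.
INDEX = defaultdict(list)
for form, (tag, gloss) in VOWELIZED_LEXICON.items():
    INDEX[skeleton(form)].append((form, tag, gloss))


def candidate_tags(word):
    """Return the set of tags compatible with a vowelized or non-vowelized word."""
    if word in VOWELIZED_LEXICON:
        return {VOWELIZED_LEXICON[word][0]}
    return {tag for _, tag, _ in INDEX.get(word, [])}


if __name__ == "__main__":
    print(candidate_tags("ktb"))    # non-vowelized: several readings remain
    print(candidate_tags("kutub"))  # vowelized: a single reading
```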


Corpora (or Corpuses)

Corpus

A collection of text samples that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

Applications of Corpora

• Preparing and formatting text to be used by search tools.

• Useful for linguists, teachers and (advanced-level) learners.

• The study of syntactic structure.

• In lexicography, corpora are used for developing good dictionaries.

• Training machine learning software for grammar analysis, word clustering, machine translation, …


Arabic Language Corpora

Corpus of Contemporary Arabic (CCA) [University of Leeds Corpus] (2004)

• Engineered by Latifa Al-Sulaiti & Eric Atwell; written and some spoken; around 1M words; TAFL; websites and online magazines.

• FREE to download: http://www.comp.leeds.ac.uk/arabic

Buckwalter Arabic Corpus (1986-2003)

• Written; 2.5 to 3 billion words; lexicography; public resources on the Web.

An-Nahar Corpus (2001)

• Written; 140M words; general research; An-Nahar newspaper (Lebanon).

Al-Hayat Corpus (2002)

• Written; 18.6M words; language engineering and information retrieval; Al-Hayat newspaper (Lebanon).

Arabic Gigaword (2002)

• Written; around 400M words; natural language processing, information retrieval, language modelling; Agence France Presse, Al-Hayat news agency, An-Nahar news agency, Xinhua news agency.


Gold Standard Evaluation Corpus

Building Gold Standard Evaluation Corpus

- Different text domains, formats and genres of both vowelised and non-vowelised text:
  - The Qur’an.
  - Newspaper text.
  - Magazines.
  - School books.
  - Children’s books.
  - Blogs (text in blogs can be in Arabic script or in Roman-letter transcription).

- The Gold Standard will be checked by Arabic language scholars.


Gold Standard Evaluation Corpus

Sample of Qur’an Gold Standard (vowelized):

Alif. Lam. Mim. Do men imagine that they will be left (at ease) because they say, We believe, and will not be tested with affliction? Lo! We tested those who were before them. Thus Allah knoweth those who are sincere, and knoweth those who feign. Or do those who do ill-deeds imagine that they can outstrip Us? Evil (for them) is that which they decide. Whoso looketh forward to the meeting with Allah (let him know that) Allah's reckoning is surely nigh, and He is the Hearer, the Knower. And whosoever striveth, striveth only for himself, for lo! Allah is altogether Independent of (His) creatures. And as for those who believe and do good works, We shall remit from them their evil deeds and shall repay them the best that they did. We have enjoined on man kindness to parents; but if they strive to make thee join with Me that of which thou hast no knowledge, then obey them not. Unto Me is your return and I shall tell you what ye used to do. And as for those who believe and do good works, We verily shall make them enter in among the righteous.

Sample of Newspaper Gold Standard (non-vowelized):

Globalization will stay a hot topic of discussion for a long time. In this article, we consider in depth some of the questions raised by new writers who consider globalization as a new lifestyle for the modern man. Taking the lead from America, many writers describe the multi-ethnic and multicultural American life style as the ideal in the new global village where telecommunication, transportation, information systems and the media shorten the distances between disparate groups. Advocates of this point of view look forward to a new modern man, the Cosmopolitan man.


Arabic Morphological Analysers and Stemmers

Evaluating stemmers and morphological analyzers

• Three stemming algorithms were compared: the Shereen Khoja stemmer, the Tim Buckwalter morphological analyzer, and a tri-literal root extraction algorithm.

• Four different fair evaluation measurements were applied.

• Voting is used to combine the results of the different algorithms.

• The paper shows that more work is required in this field, as the stemming algorithms failed to achieve accuracy rates above 75% (Sawalha & Atwell, 2008).
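Combining by voting can be sketched as a simple majority vote over the roots proposed by the individual algorithms. The slides do not say how ties are resolved, so the rule used here (prefer the earliest-listed algorithm) is an assumption, and the candidate roots in the example are made up rather than real outputs of the three systems.

```python
from collections import Counter


def vote(candidates):
    """Return the most frequent non-empty candidate.

    `candidates` lists one proposed root per algorithm, in a fixed order.
    Ties are broken in favour of the earliest-listed algorithm (assumption).
    Returns None if no algorithm produced an answer.
    """
    valid = [c for c in candidates if c]
    if not valid:
        return None
    counts = Counter(valid)
    top = max(counts.values())
    return next(c for c in valid if counts[c] == top)


if __name__ == "__main__":
    # Hypothetical outputs of three algorithms for one word.
    print(vote(["ktb", "ktb", "kAtb"]))
    print(vote(["drs", None, "ktb"]))
```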


Prior-Knowledge broad-lexical resource of Arabic Language

• 15 Arabic language dictionaries* are used.

• The lexicon contains:
  • Roots and single words.
  • Multi-word expressions.
  • Idioms.
  • Collocations requiring special part of speech assignment.
  • Words with special part of speech tags.
  • Meanings.

* Freely available from www.almeshkat.com in MS-Word format

Prior-Knowledge broad-lexical resource of Arabic Language

• Lisan Al-Arab (لسان العرب): "Arab tongue"

• Taj Al-Arous min Jawaher Al-Qamus (تاج العروس من جواهر القاموس): "Bride crown from the dictionaries' jewels"


Existing Arabic language Part-of-Speech taggers and reuse

• Evaluating existing Part-of-Speech tagger components, using:
  • Gold Standard
  • Fair measurements
  • Multi-level tagset

• Analyzing and re-implementing the algorithms of Part-of-Speech taggers:
  • The best tagger components need to be re-implemented in Python.
  • Python will simplify the integration of the Part-of-Speech tagger into NLTK (the Natural Language Toolkit).
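Evaluating a tagger against a gold standard comes down to comparing two aligned sequences of (token, tag) pairs. The sketch below computes plain token-level accuracy, which is only one possible measurement; the slides do not specify which "fair measurements" are used. The tokens and tags in the example are made up.

```python
def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold-standard tag.

    Both arguments are lists of (token, tag) pairs over the same tokens.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = 0
    for (gold_token, gold_tag), (pred_token, pred_tag) in zip(gold, predicted):
        if gold_token != pred_token:
            raise ValueError("token mismatch: tokenizations are not aligned")
        if gold_tag == pred_tag:
            correct += 1
    return correct / len(gold)


if __name__ == "__main__":
    gold = [("ktb", "V"), ("AlktAb", "N"), ("w", "P"), ("ktAbhm", "N")]
    pred = [("ktb", "V"), ("AlktAb", "N"), ("w", "P"), ("ktAbhm", "V")]
    print(accuracy(gold, pred))
```

The token-mismatch check matters for Arabic in particular, because taggers that segment clitics differently produce sequences that cannot be compared tag by tag without realignment.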


Hybrid Part-of-Speech tagger

• A novel algorithm leading to a hybrid Part-of-Speech tagger for Arabic text, which combines the best components of existing taggers with novel resources and components:
  • Integrating the best tagger components.
  • Integrating the prior-knowledge lexical resource.
  • Integrating the morphological analyser.
  • Using unsupervised learning algorithms to solve the problem of unknown words.
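One way to picture how such components might fit together is a cascade in which each component is consulted only if the previous one has no answer. The Python sketch below is an assumption about the architecture, not the algorithm from the talk: the lexicon is a plain dictionary, the "morphological analyser" is a single toy rule (the definite article Al marks a noun), and unknown words simply receive a placeholder tag where the slides propose unsupervised learning.

```python
def guess_from_morphology(token):
    """Toy stand-in for a morphological analyser (assumption).

    In Buckwalter transliteration the definite article is written "Al";
    a word carrying it is treated here as a noun.
    """
    if token.startswith("Al") and len(token) > 4:
        return "N"
    return None


def hybrid_tag(tokens, lexicon, unknown_tag="UNK"):
    """Tag tokens by lexicon lookup, then morphology, then a fallback tag."""
    tagged = []
    for token in tokens:
        tag = lexicon.get(token) or guess_from_morphology(token) or unknown_tag
        tagged.append((token, tag))
    return tagged


if __name__ == "__main__":
    # Hypothetical prior-knowledge lexicon with coarse tags.
    lexicon = {"ktb": "V", "fy": "P"}
    print(hybrid_tag(["ktb", "AlktAb", "fy", "xyz"], lexicon))
```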
