
From lexical database to tagged Arabic corpus

Van Mol Mark

Leuven Language Institute – Faculty of Arts, KU Leuven, Leuven, Belgium

[email protected]

Abstract

A completely tagged corpus of Arabic can form the basis for different research activities. First of all, it is a suitable instrument for pure linguistic research, but it can also serve computer applications ranging from machine translation and the spell checking of scanned documents to the automatic tagging of raw corpora, and it can form the basis for the development of CALL applications. In this paper we discuss the lexical database that we have developed at the University of Leuven, as well as the first tools that have been developed to perform POS-tagging of Arabic corpora.

I. INTRODUCTION

Several years ago the Dutch Language Union decided to compile new bilingual dictionaries pairing Dutch with a number of other languages. Among those languages was Arabic. The production of the bilingual dictionaries Dutch – Arabic and Arabic

– Dutch was entrusted to the Radboud University of Nijmegen and to the

Catholic University of Leuven. The database which we discuss in this paper is

the one which was developed by our university. Besides the lexical database we

also developed a system for POS-tagging Arabic words in a sentence. So far, a corpus of 10,000,000 Arabic words has been preliminarily tagged. In the last phase of our research we developed a tool that performs the definitive POS-tagging of Arabic words in the database semi-automatically. The whole process will be demonstrated live at the conference.

II. THE PLATFORM OF THE DATABASE

From the beginning we opted to work in a Macintosh environment and

as a database we chose to work with 4D.

A. Adaptation of the database to the Arabic language

When working with the database on a Mac we encountered several

minor problems which had to be solved. The first was due to the Mac system

software in which no distinction was made between the letters dÁl and ÌÁ. The

software engineers of Mac solved this problem. The second was due to 4D itself. In order to search the database properly and to tag efficiently, it was a prerequisite that the database recognize the different characters, and also the diacritical elements, separately. This was not the case. When searching the database, the program did not distinguish between several Arabic characters, such as the different vowels and letters. For example, a search for the verb saÞala (to ask) also returned the verb sÁla (to flow). This was a serious problem, because the database is also meant to support research on the frequency of each separate word and searches on corpora with KWIC-index tools.
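The failure mode described above can be reproduced in a few lines. The following is a minimal sketch in Python, not a description of how 4D works internally: the function names `fold`, `exact_search` and `folded_search` are our own, and the folding step (Unicode decomposition followed by removal of combining marks) is only an assumed stand-in for whatever comparison the database originally applied.

```python
import unicodedata

# The two verbs from the example, written with explicit code points.
SAALA_ASK = "\u0633\u0623\u0644"  # sin + alif with hamza above + lam
SALA_FLOW = "\u0633\u0627\u0644"  # sin + bare alif + lam


def fold(text: str) -> str:
    """Lossy key: decompose the string and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def exact_search(query: str, entries: list[str]) -> list[str]:
    """Character-exact lookup: hamza and diacritics are significant."""
    return [entry for entry in entries if entry == query]


def folded_search(query: str, entries: list[str]) -> list[str]:
    """Lookup on the folded key: distinct words can collapse together."""
    return [entry for entry in entries if fold(entry) == fold(query)]


entries = [SAALA_ASK, SALA_FLOW]
print(len(exact_search(SAALA_ASK, entries)))   # only the verb searched for
print(len(folded_search(SAALA_ASK, entries)))  # both verbs are returned
```

Word frequencies and KWIC searches need the behaviour of `exact_search`; the folded comparison is shown only to make the problem visible.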

Together with the developers of 4D all these problems were solved. One

problem which remained was the overlap between the letter ÛÁd and the kašÐda.

This problem was solved by our own programmers by means of a work-around.

III. THE STRUCTURE OF THE DATABASE

The database is divided into two main parts: a lexical database, which contains all the information needed for the production of a bilingual dictionary, and a corpus.

Basically the lexical database consists of four categories. The first category contains a list of Dutch words with all the grammatical and lexical information that is relevant to each of these words in Dutch. The second category consists of a list of Arabic words with all the lexical and grammatical information that is relevant to each Arabic word. The two other categories are composed of the dictionary elements: an Arabic – Dutch compartment, in which the link is made between the first two categories, and a Dutch – Arabic compartment, in which the reverse link is made. Meanwhile the dictionary work has been finished and published in book form [1].

In the following I will limit myself to the second category of the lexical database, which is made up of the Arabic words with their added lexical and grammatical information, and to the second main part, which contains the corpus, because those are the two elements that are used in the semi-automatic tagging of corpora.

IV. THE CONTENTS OF THE ARABIC LEXICAL DATABASE

At present the lexical database contains 27,393 Arabic words or lexical items. These words were selected through the translation of a 4,000,000-word Arabic corpus. After the translation of the corpus, an educated native speaker of Arabic from Iraq was asked to go through the dictionary of Hans Wehr [2] and to add all the words from this dictionary which he had encountered in his life.


Each word has its own file, which contains the following information. First of all there is the fully voweled form of the word. For each word two kinds

of grammatical categories are given. The first is based on Western traditional

linguistic thought; the second on the Arabic division of grammatical categories

according to the tradition of the Arab grammarians. For every word the stem and

the stem category are given. An indication is given whether a word is dialectal

or of standard usage. Also an indication is given whether the word occurs in the

dictionary of Hans Wehr [2]. About 5% of the words found in the corpus were not found in the dictionary of Hans Wehr.

For nouns, of course, the different plural forms are given. The most crucial element, however, is the preliminarily tagged form of each word.
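The content of such a word file can be pictured as a simple record. The sketch below is our own illustration in Python: the field names, the example values and the ASCII stand-ins for the transliteration are all assumptions, and the actual 4D table layout is not described in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One Arabic word file; field names are illustrative, not the 4D schema."""
    voweled_form: str        # fully voweled form of the word
    western_category: str    # category from Western traditional grammar
    arabic_category: str     # category from the Arab grammatical tradition
    stem: str
    stem_category: str
    is_dialectal: bool       # dialectal versus standard usage
    in_wehr: bool            # whether the word occurs in Hans Wehr [2]
    pre_encoded_form: str    # preliminarily tagged form of the word
    plural_forms: list[str] = field(default_factory=list)  # nouns only


def share_missing_from_wehr(entries: list[LexicalEntry]) -> float:
    """Percentage of entries that were not found in the reference dictionary."""
    missing = sum(1 for entry in entries if not entry.in_wehr)
    return 100 * missing / len(entries)


# Hypothetical example values, written in plain ASCII for readability.
kitab = LexicalEntry(
    voweled_form="kitAb", western_category="noun", arabic_category="ism",
    stem="ktb", stem_category="placeholder", is_dialectal=False,
    in_wehr=True, pre_encoded_form="kitAb", plural_forms=["kutub"],
)
```

A helper such as `share_missing_from_wehr` shows how a figure like the roughly 5% mentioned above could be derived from the stored flags.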

V. A TWO-STEP POS-TAGGING APPROACH

In order to realize the definitive POS-tagging of Arabic words in texts

we work in two phases. The first phase consists of a preliminary treatment of

Arabic texts, which we call the preliminary encoding of Arabic words in a text.

For this preliminary encoding we make use of vowels and other diacritical signs that are added to the words in a systematic way, according to a convention developed at our university [3]. This preliminary encoding makes it possible for software to recognize all part-of-speech elements in an Arabic text in a simple manner. The second phase, the definitive POS-tagging of words in a text, is done by analysing the pre-tagged text and comparing it with the definitive POS-tags which are stored in the database.

VI. THE PRELIMINARY ENCODING OF ARABIC WORDS

In order to be able to disambiguate Arabic words in raw text corpora we

designed a preliminary encoding for the tagging of Arabic words. This

preliminary encoding makes use of the vowels of Arabic and all the other

diacritical signs including the kašÐda, which is one of the vital elements in our

encoding.

The aim of this preliminary encoding is to create a convention that makes it possible to disambiguate every word in the Arabic language, both in its original and in its derived forms, including its prefixes and suffixes.

We will illustrate some elements of this encoding by a few examples. In

order to disambiguate between nouns and adjectives we laid down the rule that

the first consonant of a noun always bears a vowel whereas the first consonant

of an adjective never does. So the form maÝrÙf as adjective (known) will be

written mÝrÙf and the noun (amicability) will be written maÝrÙf.
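The rule for the first consonant is easy to check mechanically. The following is a minimal sketch, assuming that the encoded text is stored as Unicode Arabic with the short vowels as separate combining characters; the function name is our own, and the sketch covers only this one rule (the convention for feminine words ending in tÁÞ marbÙÔa, described further on, is an exception it does not handle).

```python
# Short vowel signs: fatha, damma, kasra.
SHORT_VOWELS = {"\u064E", "\u064F", "\u0650"}


def first_consonant_is_voweled(encoded: str) -> bool:
    """True when the character right after the first letter is a short vowel.

    Under the convention described above this marks a noun; an unvoweled
    first consonant means the form is not marked as a noun.
    """
    return len(encoded) > 1 and encoded[1] in SHORT_VOWELS


# The example from the text: mim, (fatha), ayn, ra, waw, fa.
NOUN_FORM = "\u0645\u064E\u0639\u0631\u0648\u0641"  # encoded as a noun
ADJECTIVE_FORM = "\u0645\u0639\u0631\u0648\u0641"   # encoded as an adjective

print(first_consonant_is_voweled(NOUN_FORM))
print(first_consonant_is_voweled(ADJECTIVE_FORM))
```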

In order to disambiguate between the different persons of a verb, in both the present and the past tense, use was also made of the vowels. These were written only on the affixes; for the rest, the words remain unvoweled. So the first person of a verb in the past tense is written ktbtu, the second ktbta or ktbti, etc. The third person in the past tense is written ktb. In order to distinguish this form from the plural of the word kitÁb (book), which is also written ktb, a vowel is written above the first consonant of the plural form, which is the rule in our convention for marking a noun.

As far as the prefixes and the suffixes are concerned all were vocalized

as preparation for disambiguation. So when the end of a word contains hum it is considered to be a suffix, but when the end of a word contains the letters hm those are to be considered a part of the word itself. So the word sahm (arrow) is

written s-a-h-m without sukÙn, but the prefix bi followed by the suffix hum will

be written as b-hum.

The distinction between feminine nouns and feminine adjectives ending

in tÁÞ marbÙÔa is based on the following convention. With an adjective, the

vowel fatÎa is written before the tÁÞ marbÙÔa whereas with a noun no vowel is

written. So the word madÐna (town) is written in Arabic as mdÐnt but the

feminine form of the adjective madÐn (indebted) is written mdÐnat. Another

element of the convention is that the first consonant of feminine words ending in

a tÁÞ marbÙÔa is not vocalized, because there is no need to mark it twice as a

noun.

Proper nouns are also marked in every text by placing them between a hyphen (-) and an equals sign (=). In this way an inventory can also be made of all the proper nouns in an Arabic text corpus.
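An inventory of this kind can be drawn up with a single pattern match. The sketch below is illustrative only: it assumes that each proper noun is a single token without internal blanks, hyphens or equals signs, and the sample sentence is an invented ASCII placeholder rather than corpus material.

```python
import re
from collections import Counter

# A hyphen, then one token free of blanks, '=' and '-', then an equals sign.
PROPER_NOUN = re.compile(r"-([^\s=-]+)=")


def proper_noun_inventory(text: str) -> Counter:
    """Count every token that is marked as a proper noun in the text."""
    return Counter(PROPER_NOUN.findall(text))


# Invented placeholder text; only the -...= marking matters here.
sample = "qAl -zyd= An -fATmt= fy -bGdAd= w -zyd= fy byrwt"
inventory = proper_noun_inventory(sample)
print(inventory.most_common())
```

Tokens that the typist did not mark, such as the last word of the sample, are not included in the inventory.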

So far a corpus of more than 12,000,000 words has been preliminarily tagged. For the definitive POS-tagging, this corpus will be read into the database and compared with the definitive POS-tagset which is stored in the database. To give a better understanding of the tagging process, we have to point out some details about the way in which words and their derived and agglutinated forms are classified. We make a distinction between what we call minimal basic forms and maximal basic forms.

VII. MINIMAL BASIC FORMS AND MAXIMAL BASIC FORMS

There is no doubt that the agglutinative character of the Arabic language complicates its tagging. The string of characters between two blanks in Arabic often contains more than one word. One might even state that only seldom does such a string contain only one isolated word. One way to build a reference for tagging is to generate, for all the words, all the possible combinations that might occur between two blanks. We chose not to do so. In the generation of possible word-forms we made a distinction between minimal basic forms and maximal basic forms. All affixes that can be generated automatically with a word are considered to be part of the word. The word itself, or the word with an inflectional affix, is considered to be a minimal basic form. All such affixes have been generated automatically in the database for every word. This automatic generation of encoded affixes within a minimal basic form has been applied to the following categories.

For verbs, all the conjugated forms have been generated automatically, in both the present and the past tense, and also for the conjunctive and even for some dialectal forms.

As far as the nouns are concerned, the following forms have been generated: all the external plural forms in both genitive and accusative, together with the abbreviated forms which are used before suffixes or in a construct state, and the dual endings in both accusative and genitive, with their abbreviated forms as used in a construct state. Special attention was also paid to the word pattern of weak nouns.

As far as adjectives are concerned, the encoded feminine form (fatÎa with tÁÞ marbÙÔa) is generated, as well as the external plural and dual forms in both accusative and genitive, the forms used in a construct state, and special forms for weak adjectives.

This way 594,941 minimal basic forms were generated, forms which

might possibly occur in an Arabic text.
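The generation step can be illustrated for the past tense of a verb. The sketch below uses only the four encoded forms quoted earlier in this paper (ktbtu, ktbta, ktbti and ktb); the person labels, the suffix table and the function name are our own, and the real generator covers many more persons, tenses and word classes.

```python
# Encoded past-tense endings attested in the examples of this paper.
# Only the affix carries vowels; the stem itself stays unvoweled.
PAST_SUFFIXES = {
    "1sg": "tu",
    "2sg.m": "ta",
    "2sg.f": "ti",
    "3sg.m": "",
}


def minimal_past_forms(stem: str) -> dict[str, str]:
    """Generate the encoded minimal basic forms of one verb stem."""
    return {person: stem + suffix for person, suffix in PAST_SUFFIXES.items()}


forms = minimal_past_forms("ktb")
for person, form in forms.items():
    print(person, form)
```

Applying tables of this kind to every verb, noun and adjective in the lexical database is what produces the several hundred thousand minimal basic forms mentioned above.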

The maximal basic forms were not generated automatically. Affixes which do not belong to the derivation of a word, but which occur in a string of characters between two blanks, were stored separately in the database in their preliminary encoded form. Up to 74 affixes or combinations of affixes have been stored this way, including some widely used dialectal verbal affixes, such as the b-prefix of the verb and the dialectal future prefix h.

If we had also generated all the theoretically possible maximal basic forms of words, we would have obtained a huge database with lots of information that would probably never occur in real language use.
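A rough order of magnitude makes this concrete. The calculation below is only a back-of-the-envelope sketch: it assumes, unrealistically, that each of the 74 stored affixes or affix combinations could attach to every one of the 594,941 minimal basic forms, which the paper does not claim.

```python
MINIMAL_BASIC_FORMS = 594_941      # generated forms reported above
STORED_AFFIX_COMBINATIONS = 74     # separately stored affixes or combinations

# Crude estimate: every stored affix combination on every minimal form.
theoretical_maximal_forms = MINIMAL_BASIC_FORMS * STORED_AFFIX_COMBINATIONS
print(theoretical_maximal_forms)   # 44025634
```

Even under this crude assumption the database would grow from a few hundred thousand to tens of millions of forms, most of which would never be attested.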

This does not mean that we will not try to make an inventory of

maximal basic forms as they occur in texts. This process, however, will take

place simultaneously with the definitive POS-tagging of the corpus. The

inventory of maximal basic forms as they occur in real language use will, at a later stage of the research, be of great help for the fully automatic tagging of corpora, especially once we obtain statistical information about these forms. So far we have registered some 3,688 maximal basic forms in the corpus.

VIII. NEUTRAL FORMS

From every basic form, neutral forms are also generated. A neutral form is an Arabic word, or a string of Arabic characters, as it appears in a raw text corpus. The ultimate aim of the research is to tag a corpus fully automatically by comparing the neutral forms in a raw text with the neutral forms in the database and with the encoded minimal and maximal basic forms that are related to each neutral form. The more texts we definitively tag, the more statistical information on minimal and maximal basic forms we will obtain, which will probably be of great help for the fully automatic tagging of a raw Arabic corpus.
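The relation between encoded forms and neutral forms can be sketched as an index. The code below is a simplified illustration, assuming that a neutral form is obtained by removing the vowel signs, the other diacritics and the kašÐda from an encoded form; the function names are our own, and the actual derivation in the database may involve further steps.

```python
from collections import defaultdict

# Fathatan through sukun (U+064B to U+0652), plus the kashida (U+0640).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}
KASHIDA = "\u0640"


def neutral_form(encoded: str) -> str:
    """Strip the encoding signs so the form looks as it would in raw text."""
    return "".join(
        ch for ch in encoded if ch not in DIACRITICS and ch != KASHIDA
    )


def build_index(encoded_forms: list[str]) -> dict[str, list[str]]:
    """Map every neutral form to all encoded forms that share it."""
    index = defaultdict(list)
    for form in encoded_forms:
        index[neutral_form(form)].append(form)
    return dict(index)


VERB_KTB = "\u0643\u062A\u0628"        # verb, third person past: unvoweled
NOUN_KTB = "\u0643\u064F\u062A\u0628"  # plural noun: vowel on first consonant

index = build_index([VERB_KTB, NOUN_KTB])
print(len(index[neutral_form(NOUN_KTB)]))  # both readings share one neutral form
```

Looking up a raw-text string in such an index returns every encoded candidate; choosing among them is the disambiguation problem that the statistics are expected to help with.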

IX. THE CORPUS

According to the convention described above, a representative Arabic corpus has been compiled and preliminarily tagged. This corpus contains both oral and written sources, from media Arabic as well as from literature. The great

majority of the texts date from the year 2000. For the moment these texts are

only available in text format. Some of them have already been stored in the

database. The storage of the texts in the database is done according to a strict

classification.

The source information comprises the following elements: country of origin; type of text (media Arabic or literature), with a further distinction between fiction and non-fiction in the case of a literary text; general subject of the text; and detailed subject of the text. For example, when a general subject such as ‘sports’ is marked, the detailed information will indicate which sport is involved: basketball, football, etc. The name of the author is also indicated.

X. CORPUS EXPLORATION TOOLS OUTSIDE THE DATABASE

Some corpus exploration tools have been developed which operate outside the database. These tools comprise a program which divides a text into sentences and a KWIC (Key Word in Context) tool with a variety of search facilities [3].
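The principle of a KWIC search is simple enough to sketch in a few lines. The function below is a generic illustration, not the Leuven tool itself: it assumes the text has already been split into tokens and matches the keyword exactly, whereas the actual tool offers a variety of search facilities.

```python
def kwic(tokens: list[str], keyword: str, width: int = 3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for position, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, position - width):position])
            right = " ".join(tokens[position + 1:position + 1 + width])
            hits.append((left, token, right))
    return hits


# Placeholder tokens; in practice these would be encoded Arabic word forms.
tokens = "a b c d b e".split()
for left, key, right in kwic(tokens, "b", width=1):
    print(f"{left:>5} [{key}] {right}")
```

Because the comparison is exact, an encoded particle is found only where it was encoded as such, which is what makes particle studies on the pre-tagged corpus possible.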

Thanks to the preliminary encoding of the corpus, detailed searches can already be performed. Because all affixes are encoded, corpus-based studies of Arabic particles can be carried out. A search for the particle bi, for example, returns all instances where this particle is found. We have already conducted a corpus-based study on the use of some complementary particles in Arabic by making use of these corpus exploration tools. The results of the study have recently been published [4].

Despite the fact that many elements can be searched for in the preliminarily tagged corpus, these searches are still limited. For the corpus to be fully analyzable, complete and definitive tagging is necessary.

XI. THE LEUVEN TAGSET FOR ARABIC

The Leuven Tagset for Arabic is based both on categories from Western traditional grammar and on the categories of traditional Arabic grammar, of which the basic division is, as is generally known, verb, noun and particle. As far as the Western categories are concerned, we made a preliminary division into 64 categories. There are more head categories than in other tagsets because we made some refinements which are not common for other languages. We have, for example, three tags for adjectives: one for Arabic adjectives, one for adjectives denoting countries and one for adjectives derived from foreign words. The category of nouns is also more refined. We have specific tags for masculine nouns in general, for feminine nouns, for abbreviations, for nouns denoting a chemical element, for masculine and feminine proper nouns, for nouns denoting a country, a city or a month, for collectives, for nouns which only exist in a plural form, and finally for masculine foreign nouns and feminine foreign nouns. The last two tags, for example, will give us an indication of the impact of foreign words on the Arabic language.

Besides the main categories each tag also has a set of subtags. These

subtags are given according to the generated minimal basic forms. For nouns,

for example, we use the subtags for dual forms, plural forms, forms used in a

construct state, etc.

The Arabic tagset is based on both traditional Arabic grammar, such as the different kinds of ÎurÙf, and the morphological forms of the words. We created 359 tags based on Arabic grammatical categories and on Arabic morphological forms. This, of course, means a much finer refinement, which can be used in combination with the tags based on Western grammar. To give an example we take the maÒdar. This is an Arabic category which is assigned to certain nouns in Arabic. For the maÒdar, however, we do not limit ourselves to the category as a whole: specifications are also given, not only of the form of the maÒdar, which can be the second one or the fifth, but also of whether the maÒdar has a tÁÞ marbÙÔa, a nisba ending, or a nisba ending and a tÁÞ marbÙÔa.

The Arabic morphological forms will also give us details of the spread

of these forms in Arabic. For example, details are given about whether a word is

a faÝl form or a fiÝl form or a fuÝl form, etc. All this kind of tag information is

automatically added to words in context by one simple choice. The

implementation of both tagsets will make combined and more detailed searches

possible. One could, for example, search for a maÒdar of the fuÝÙl form, or a

plural of a noun of the same form.
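A combined search of this kind amounts to filtering on several tag layers at once. The sketch below is purely illustrative: the record fields, the tag labels and the three sample tokens (written with ASCII stand-ins) are hypothetical and do not reproduce the Leuven tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    form: str
    western_tag: str       # tag from the Western-grammar layer
    arabic_category: str   # tag from the Arabic-grammar layer
    pattern: str           # morphological form of the word


def search(tokens, **criteria):
    """Keep the tokens whose fields match every given criterion."""
    return [
        token for token in tokens
        if all(getattr(token, name) == value for name, value in criteria.items())
    ]


# Hypothetical tagged tokens.
tokens = [
    TaggedToken("duxUl", "noun", "masdar", "fu3Ul"),
    TaggedToken("buyUt", "noun-plural", "ism", "fu3Ul"),
    TaggedToken("Darb", "noun", "masdar", "fa3l"),
]

print(search(tokens, arabic_category="masdar", pattern="fu3Ul"))
print(search(tokens, western_tag="noun-plural", pattern="fu3Ul"))
```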

XII. THE SEMI-AUTOMATIC TAGGING OF THE CORPUS

To tag a text in the database, it first has to be placed in a field to which all the information about the text is added. Above the field there are pop-ups which the researcher uses to give a full external description of the text. If the text is part of a novel, information has to be added about the title of the novel, the author, the chapter and the pages from which the fragment is taken. In this first phase the text is marked as preliminarily encoded, because the typists have treated the text according to the principles described in the preliminary encoding.

Figure 1

This figure shows the interface in which a neutral text is inserted. The text of the

fields is in Dutch. Using pop-ups the researcher indicates the characteristics of the

text to be included in the database.


The second phase consists of the automatic comparison between the

data in the database and the text to be tagged. By means of this comparison the

program yields all the information available per word. If only one tag is

possible, this tag is immediately and automatically attached to the word in

question. If no tag is found, because the word has not yet been stored in the database, the new word is placed between braces {}. When different tags are possible, the program shows the different possibilities per word. It is then up to the researcher to make the appropriate choices based on the context in which the words appear. Once the researcher has gone through the whole text and added the appropriate tag to each word, the whole text is stored with all the relevant information per word. One choice suffices to add all the detailed information about one word.
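The three cases of this second phase can be summarised in a short routine. This is a minimal sketch under our own assumptions: the lexicon is a plain dictionary from word form to candidate tags, the tag labels are invented, and the callback `choose` stands in for the researcher's decision in the interface.

```python
def tag_text(words, lexicon, choose):
    """Semi-automatic tagging: one candidate is applied automatically,
    no candidate marks the word as new, several candidates are delegated."""
    tagged = []
    for word in words:
        candidates = lexicon.get(word, [])
        if not candidates:
            tagged.append(("{" + word + "}", None))   # new word, in braces
        elif len(candidates) == 1:
            tagged.append((word, candidates[0]))      # automatic tagging
        else:
            tagged.append((word, choose(word, candidates)))
    return tagged


# Hypothetical lexicon with invented tag labels.
lexicon = {
    "law": ["CONJ"],
    "kamA": ["CONJ-sub", "N-dual-calyx", "N-dual-sleeve", "V-3-dual"],
}


def pick_first(word, candidates):
    """Stand-in for the human choice: always take the first candidate."""
    return candidates[0]


result = tag_text(["kamA", "law", "kombyUtar"], lexicon, pick_first)
print(result)
```

The share of words that reach the first or third branch is what the preliminary encoding is designed to reduce.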

XIII. ILLUSTRATION OF THE TAGGING PROCESS OF A RAW CORPUS

In order to illustrate this process in more detail, I give an account of the tagging of the following sentence from a novel by a Lebanese writer. Because of the availability of the pre-tagged corpus, we have two options: we can base the tagging process on a raw corpus or on the pre-tagged corpus. I first give the results of the tagging based on a raw corpus. It will become clear that the tagging of a raw corpus requires much more decision making from the researcher than the tagging of the pre-tagged corpus. The sentence goes as follows:

KamÁ law Þannani qarrartu ‘iqÁb nafsÐ al-muÃÔariba, faqad Ûalaltu jÁlisan yawman kÁmilan Þak×aru min ‘ašar sÁÝÁt mutaÝÁqibat ÝalÁ kursiyy Ìašabiyy wa-ÞanÁ mašdÙd ÞilÁ jihÁz al-kombyÙtar ÝalÁ naÎw min al-walah aš-šayÔÁniyy

It can be translated as follows: As if I had decided to punish my

tormented self, for I remained sitting a whole day more than ten continuous hours on a wooden chair while I was tied onto the computer in a satanically passionate manner.

In the following, I will describe the different possible tags which were

offered by the computer program.

KamÁ yields four different tags or possibilities:

1) conjunction – subordinate, 2) noun – masculine – dual of construct state – nominative with the meaning of calyx, 3) noun – masculine – dual of construct state – nominative with the meaning of sleeve, and 4) the verb kamma, the third person dual of the perfect tense, with the meaning of to cover. From the context it is clear that the first choice is the appropriate one.

Figure 2

After the command ‘spellcheck’ the program yields the different possible grammatical tags for every word. In this table, the choices of tags are based on a non-vocalised, non-pre-tagged text. The first word is marked and in the table underneath the different possibilities are given. Note that the tags are given in Dutch. The first tag, CONJ-ond-S, stands for Conjunction – subordinate – Standard Arabic.

Law yields only one possibility, so that no choice has to be made. The

tagging of the word happens automatically.

ÞAnnani also gives four possibilities, in each of which the personal suffix is recognized. These are the right one, viz. 1) the relative pronoun Þanna, and three other possibilities which are not appropriate in this context, viz. 2) the third person of the past tense of the verb Þanna (to moan) in its active form, 3) the same form in the passive and 4) the subordinate conjunction Þan.

Qarrartu gives eight possibilities for two verbs, all in the past tense. For the verb to settle down (of the first form) three possible persons are given, viz. the first, the second and the third singular. For the correct verb, which is of the second form, five possibilities are given, viz. the four persons singular except the third person masculine, of which, of course, the first person has to be chosen.


‘IqÁb gives three possibilities, all of them nouns. The first is the plural

form of ‘aqaba (obstacle), the second possibility is the noun ‘uqÁb (eagle) and

the third and the right one is ‘iqÁb (punishment).

NafsÐ gives five serious possibilities. (When based on a raw corpus our program gives fifteen possibilities, because the suffixes are dealt with separately. As far as the suffixes are concerned, the program cannot distinguish beforehand between a verb and a noun, so the impossible combination verb + suffix i is given, whereas only the combination verb + suffix ni is possible.) The five serious possibilities are one adjective, nafsiyy (mental), and two nouns, nafs (soul) and nafas (breath), both with the suffix of the first person, and for each noun also its possible dual form in the accusative-genitive of the construct state.

Al-muÃÔariba gives only one possibility, a feminine adjective, which is automatically tagged.

Faqad gives ten possibilities, which are all realistic. We have three nouns in which the initial fa has been identified as the conjunction: fa-qudd (codfish), fa-qidd (strip of leather) and fa-qadd (shape). Another possible noun was identified in which the first radical was interpreted as an integral part of the noun, viz. faqd (loss). Furthermore the program gives the adverb qad. For the definitive tagging we made a distinction between two possibilities, viz. qad used before a past tense and qad used before a present tense. The four other possibilities are all verbs in the third person singular of the past tense, in both active and passive voice. In one case the string of characters is identified as the conjunction fa followed by the verb qadda (to cut off), in the other as the verb faqada (to miss).

Úalaltu gives eight possibilities based on two possible verbs, Ûalla (to remain) and Ûallala (to shade). For the verb of the first form only three persons are given (viz. the first, and the second masculine and feminine), whereas for the verb of the second form five possible persons are given, because the third person feminine singular is also given. The last possibility given is the passive voice of the third person singular. In the generation of pre-tagged word forms we have restricted the passive verb forms to the third person masculine and feminine singular, in order to limit the number of unrealistic possibilities.

JÁlisan gives two realistic possibilities and one unrealistic one. The first realistic one is the undefined adjective jÁlis (sitting) in the accusative. The second is the third person dual of the past tense, jÁlasÁ. The unrealistic one is the verb jÁlasa with the ending of an undefined accusative, which, of course, can only be added to a noun and not to a verb.

Yawman gives two possibilities. The first is an undefined noun in the accusative (day); the second is the noun yawm (day) in the nominative dual in a construct state. Both possibilities are realistic.

KÁmilan gives only one possibility, an adjective in the undefined accusative, which is automatically tagged as such.

ÞAk×aru gives five possibilities: first of all the elative of ka×Ðr (many), but also four possible verb forms. These include the first person of the present tense of the fourth form Þak×ara (to outnumber) and the third person of the past tense of the same verb. Furthermore, the first person of the present tense of the verb ka×ara (to be much) is given, as well as a passive voice. One other possibility, which the program did not give because the verb did not occur in the 4,000,000-word corpus, is the second form of the same stem, ka××ara (to increase); it seems a realistic possibility, but it is not in our database. This example illustrates the importance of tagged corpora as a reference point for the tagging of other corpora. At a later stage, when statistics are available, the accuracy of the tagging process will certainly increase.

Min gives us seven possibilities. Of course the most prominent one is

the preposition min. But the neutral form consisting of simple mim and nÙn

gives quite a number of other realistic possibilities. There is the interrogative

pronoun man (who) and the relative pronoun man (he who). Both are

preliminary encoded in the corpus in a different manner. But also the noun

mann (blessing) is possible and the verb manna (to grant) of which the third

person singular masculine is given in both active and passive voice. Something more unexpected is also given, viz. the third person plural feminine active of the past tense of the verb mÁna (to lie). Indeed, the third person plural feminine of that verb has the shape of mim and nÙn in its unvocalised form.


Figure 3

In this table, the different tags for the untagged form mn are given. The first letters of the tag indicate the following possibilities: N (noun), VNW (pronoun, either a relative pronoun or an interrogative), VZ (preposition, the right tag in this context) and then three times WW (verb).

‘Ašar gives three possibilities. Two are numerals: the cardinal ‘ašar (ten) and the fraction ‘ušr (one tenth). But ‘ašar can also be a noun, denoting the first ten days of the month of MuÎarram.

SÁÝÁt gives only one possibility, viz. a feminine noun, which is automatically tagged as such. Here we could have considered creating a specific tag to perform a semantic disambiguation. This, however, is quite a complex matter. The problem is where to draw the line, for some words have a great number of meanings. Of course, as far as this word is concerned the difference in meaning is quite important, the one being hour and the other watch. For pure POS-tagging, however, this distinction is not relevant.

MutaÝÁqibat gives only one possibility, viz. a feminine adjective and is

automatically tagged as such.

ÝalÁ gives only one tag, viz. a preposition, which is automatically assigned. However, other words might be taken into consideration, such as the noun Ýulan (height) and the plural of ÞaÝalÁ (higher), which is also Ýulan. Neither form occurred in the 4,000,000-word corpus, which is why they were not given. Again, these results show that in the future statistics about word frequency will play a crucial role in the tagging process.

Kursiyy gives two possibilities. The first is the correct one, viz. a masculine singular noun with the meaning of chair. But the same string of characters can also stand for the dual in the accusative construct state of the word kurs (bag), viz. kursay. This word does not occur in the dictionary of Hans Wehr but was found in the corpus.

Ëašabiyy also gives two possibilities: an adjective with nisba (wooden), which is the right choice here, and the noun Ìašab (timber) in the dual accusative of the construct state.

Wa-ÞanÁ gives three possibilities. The correct one is the conjunction wa (and) with the personal pronoun of the first person singular. The two alternatives are the conjunction wa with the noun anan (span of time) in the construct state, and the conjunction wa followed by the verb Þanna (to groan) in the dual form of the third person of the past tense.

MašdÙd is automatically tagged, because only one possibility is given

for this string of characters, viz. adjective singular masculine.

ÞIlÁ is also automatically tagged as a preposition or in the Arabic

category Îarf al-jarr, because there are no alternatives.

JihÁz is also automatically tagged as a noun masculine singular, because

no other alternatives match with this string of characters.

Al-kombyÙtar was marked as a new word. Apparently it had not yet appeared in the corpus or in the lexical database. Here the program gives the opportunity to add new words to the corpus and to add the corresponding tag. For foreign words we also note the problem of the many different transcriptions in Arabic. When the transcription differs from the one stored in the database, the word is, of course, not found. This means that for foreign words some kind of standardization seems to be necessary.

ÝalÁ: same remarks as above.

NaÎw gives four possibilities. The correct one in this context is the preposition. The other possibilities given are a masculine singular noun meaning direction or side, and two verbal forms: the first person plural of the jussive of the verb ÎawÁ (to contain), and the third person of the past tense of the verb naÎÎa (to remove) in the form it takes before a preposition.

For min the same remarks hold as above.

Al-walah gives six verb tags and one noun tag. Of course, this word is a noun meaning distraction, because it is preceded by the definite article. Here the program offers the option of recording, for all further tagging, that this string of characters automatically receives the noun tag whenever it is preceded by the definite article.

Aš-šayÔÁniyy gives two possibilities. The correct, realistic one is a definite adjective masculine singular with nisba meaning satanic. The other one is unrealistic: the program shows a definite noun with the dual accusative ending used in a construct state, which is, of course, impossible from a grammatical point of view. Here too the program has the facility to determine for the future that this string of characters always be interpreted as an adjective.
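
The facility to fix an interpretation "for the future" can be pictured as a small store of annotator decisions that is consulted before the candidates are shown. The sketch below is a hypothetical Python illustration, not the program's actual mechanism: `FIXED_TAGS`, `remember`, `resolve`, the ASCII keys and the tag labels are all assumed names. Keying the store on the full string, article included, is one simple way to express "when preceded by the definite article".

```python
# Hypothetical store of annotator decisions that should hold from now on.
# Keys are full strings including the article; tags are placeholder labels.
FIXED_TAGS: dict[str, str] = {}

# Placeholder candidate lists mirroring the counts given in the text:
# six verb tags and one noun tag for al-walah, two tags for the nisba form.
WALAH_CANDIDATES = [f"VERB.{i}" for i in range(1, 7)] + ["NOUN"]
SHAYTANIYY_CANDIDATES = ["ADJ.DEF.SG.M", "NOUN.DEF.DUAL.ACC.CONSTRUCT"]


def remember(token: str, tag: str) -> None:
    """Record that `token` must always receive `tag` in further tagging."""
    FIXED_TAGS[token] = tag


def resolve(token: str, candidates: list[str]) -> list[str]:
    """Narrow the candidates to the remembered tag, if one applies."""
    fixed = FIXED_TAGS.get(token)
    if fixed in candidates:
        return [fixed]
    return candidates
```

After a call such as `remember("al-walah", "NOUN")`, `resolve` returns a single tag for that string, so later occurrences no longer need manual disambiguation.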

Of the 30 strings of characters, 11 were tagged completely automatically and correctly, a rate of approximately 37%.

XIV. THE TAGGING PROCESS OF THE PRE-ENCODED CORPUS

When the tagging starts from the pre-encoded form of the words, 90% of the pre-encoded sentence (27 of the 30 strings of characters) is tagged automatically and correctly. Only three strings of characters again give multiple choices. Two of these are due to incorrect pre-tagging by the pre-tagger.

First of all, there is the word ‘iqÁb, which was not properly pre-tagged. The typist wrote the letters ‘ayn – qaf – alif – ba’ without vowels, whereas in our convention the first consonant of every noun is to be vocalized, because this is precisely the distinctive mark for the pre-tagging of a noun. If the tagger had stuck to the rule and written a kasra under the ‘ayn, the program would immediately have chosen the right singular noun, not only for POS-tagging but also semantically, because the noun ‘uqÁb (eagle) is stored in another file as a separate noun.
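
Why the vowel on the first consonant settles the matter can be shown with a small lookup. The sketch below is a simplified Python illustration and not the project's code: `NOUNS`, `strip_vowels` and `lookup` are hypothetical names, the two-entry noun list is invented, and the gloss "punishment" for the kasra form is supplied here because the text glosses only the damma form (eagle).

```python
import unicodedata

# Pre-encoded forms: the first consonant (ayn) carries its short vowel.
IQAB = "\u0639\u0650\u0642\u0627\u0628"  # ayn + kasra, qaf, alif, ba
UQAB = "\u0639\u064F\u0642\u0627\u0628"  # ayn + damma, qaf, alif, ba
RAW = "\u0639\u0642\u0627\u0628"         # same letters, no vowel mark

# Hypothetical noun list keyed by pre-encoded form.
NOUNS = {IQAB: "punishment", UQAB: "eagle"}


def strip_vowels(word: str) -> str:
    """Drop combining marks (the Arabic short vowels) to get the raw form."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


def lookup(word: str) -> list[str]:
    """Return the glosses that match `word`.

    A pre-encoded form matches exactly one entry; an unvocalized form
    falls back to every entry that shares the same raw letters.
    """
    if word in NOUNS:
        return [NOUNS[word]]
    raw = strip_vowels(word)
    return [gloss for form, gloss in NOUNS.items() if strip_vowels(form) == raw]
```

With the kasra written, `lookup(IQAB)` yields one noun; with the bare letters, `lookup(RAW)` yields both, which is the ambiguity the typist's omission caused.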

The second case due to misencoding by the tagger is the preposition naÎw. The rule for prepositions is always to vocalize the last consonant. Because the tagger did not vocalize this last consonant, the program had to base its analysis on all the possible raw forms, which gave us four possibilities. So if the pre-tagging had been done correctly, 29 of the 30 strings, or 97%, would have been tagged automatically and correctly.
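
The three percentages reported for the 30-string sample sentence follow directly from the counts. The short Python check below only restates that arithmetic; the function and variable names are chosen for this example.

```python
def rate(correct: int, total: int) -> int:
    """Percentage of strings tagged automatically and correctly, rounded."""
    return round(100 * correct / total)


TOTAL = 30                   # strings of characters in the sample sentence
RAW_AUTO = 11                # tagged automatically from the raw text
PRE_AUTO = 27                # tagged automatically from the pre-encoded text
PRE_AUTO_IF_CORRECT = PRE_AUTO + 2  # without the two pre-tagging mistakes

raw_rate = rate(RAW_AUTO, TOTAL)                 # approximately 37
pre_rate = rate(PRE_AUTO, TOTAL)                 # 90
ideal_rate = rate(PRE_AUTO_IF_CORRECT, TOTAL)    # approximately 97
```

The one string that remains ambiguous even under correct pre-tagging accounts for the gap between 97% and 100%.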

The last case is due to our own choice, which might have to be reconsidered. We provided two encodings for the particle qad: one for its use before a verb in the present tense, expressing a possibility, and one for its use before a verb in the past tense, expressing the accomplishment of an action. This of course goes beyond strict POS-tagging, because by doing so we enter the field of semantics.

XV. CONCLUSION

So far, a corpus of more than 12,000,000 words has been pre-tagged. I believe that the further definitive encoding of the corpus may help in the development of effective tagging software. Once the corpus is definitively tagged, the statistics we hope to generate may help to raise the accuracy of the tagging of raw Arabic corpora.

XVI. ACKNOWLEDGMENT

First of all I want to thank Dr. Hans Paulussen for his help in developing KWIC search tools outside the database and for his continuous support of my project. I also thank Mr. Koen Berghman for the many years he has worked with me on the lexical database, and Mrs. Amal Marogy, who added words to the database from her experience as a native speaker.

XVII. REFERENCES

[1] M. Van Mol and K. Berghman, “Learners’ dictionary Dutch Arabic”, Bulaaq, Amsterdam, 530 p., and M. Van Mol and K. Berghman, “Learners’ dictionary Arabic Dutch”, Bulaaq, Amsterdam, 2001, 506 p.

[2] H. Wehr, “A dictionary of Modern Written Arabic”, Otto Harrassowitz, Wiesbaden, 1979, 1301 p.

[3] M. Van Mol, “The semi-automatic tagging of Arabic corpora”, in Workshop proceedings Arabic language resources and evaluation – status and prospects, LREC 2002, pp. 40-44.

[4] M. Van Mol, “Variation in Modern Standard Arabic in radio news broadcasts, a synchronic descriptive investigation in the use of complementary particles”, Peeters Publishers, Leuven, 2003, 324 p.
