8/3/2019 Arabic Text Search
1/21
Arabic texts analysis for topic modeling evaluation
Abderrezak Brahmi Ahmed Ech-Cherif Abdelkader Benyettou
Received: 12 September 2010 / Accepted: 23 May 2011 Springer Science+Business Media, LLC 2011
Abstract Significant progress has been made in information retrieval covering text semantic
indexing and multilingual analysis. However, developments in Arabic information retrieval
did not follow the extraordinary growth of Arabic usage in the Web during the ten last years. In
the tasks relating to semantic analysis, it is preferable to directly deal with texts in their original
language. Studies on topic models, which provide a good way to automatically deal with
semantic embedded in texts, are not complete enough to assess the effectiveness of the
approach on Arabic texts. This paper investigates several text stemming methods for Arabictopic modeling. A new lemma-based stemmer is described and applied to newspaper articles.
The Latent Dirichlet Allocation model is used to extract latent topics from three Arabic real-
world corpora. For supervised classification in the topics space, experiments show an
improvement when comparing to classification in the full words space or with root-based
stemming approach. In addition, topic modeling with lemma-based stemming allows us to
discover interesting subjects in the press articles published during the 20072009 period.
Keywords Arabic stemming Topic model Linguistic analysis Classification Test collections
1 Introduction
Arabic is one among the six official languages of the United Nations organization where a
considerable work has been done to develop the multilingual United Nations Bibliographic
A. Brahmi (&)
Department of Computer Science, University of Abdelhamid Ibn Badis,
BP: 188, Mostaganem, Algeria
e-mail: [email protected]
A. Ech-Cherif (&) A. BenyettouDepartment of Computer Science, USTO-MB, BP: 1505, Oran, Algeria
e-mail: [email protected]
A. Benyettou
e-mail: [email protected]
Inf Retrieval
DOI 10.1007/s10791-011-9171-y
Information System Thesaurus. This UNBIS Thesaurus is used in subject analysis of
documents and other materials relevant to the United Nations programs and activities. In
addition, Arabic is one of the top ten languages on the Internet. For a global population of 350 million in the Arab world, Internet World Stats1 has reported the highest growth rate for Arabic Internet users, 2,501.2%, over the period 2000–2010.
Unfortunately, developments in Arabic information retrieval (IR) did not follow this
extraordinary growth. For most of the studies in the different IR tasks (such as categori-
zation, clustering and search), English was used as the main language. However, when
switching to Arabic, two approaches have been adopted for evaluating IR methods: either by retaining English as a pivot language, using parallel corpora in a cross-language context, or by processing the original Arabic text and analyzing the IR methods in a
mono-language context. Although the first approach allows an equitable evaluation, it
depends on the availability and the quality of parallel corpora. In the second approach, the
evaluation of IR methods requires standard corpora and appropriate linguistic prepro-
cessing. This approach becomes more interesting for the IR tasks with semantic analysis. It
avoids the loss of meaning caused by translation from languages with high inflectional
morphology such as Arabic (Oard and Gey 2002; Larkey et al. 2004). Unfortunately,
Arabic IR, including stemming methods, did not receive sufficient standard evaluation.
In this context, three major challenges face the developments of the Arabic IR and
generalization of existing methods on Arabic texts: (1) How to efficiently extract a good
stem from a morpheme that implies several segmentations and senses? (2) How to apply a
topic model to capture the semantics embedded in Arabic texts? (3) How to make Arabic resources for IR tasks more accessible, to benefit from the various developments in a non-commercial context?
This work aims to answer the first two questions raised above. On the one hand, a
lemma-based stemming approach is proposed and compared with other Arabic stemmers.
On the other hand, the Latent Dirichlet Allocation (LDA) model is used to extract Arabic
topics from newspaper articles. As regards the third question, our experiments are conducted on three real-world corpora automatically crawled from the Web. All results and
resources will be made freely available for the research community.
This paper presents first related works on Arabic stemming and topic modeling. Then,
the lemma-based stemmer is described and evaluated with other approaches for Arabic text
analysis. Afterwards, the generative process for LDA topic modeling is illustrated. Before
presenting the results of our experiments, three datasets of newspaper articles are described. Finally, we discuss the main results and conclude our study.
2 Related works
2.1 Arabic stemming
Among the successful approaches for Arabic stemming, a root-based stemmer has been
developed by (Khoja and Garside 1999). Based on predefined root lists and morphological analysis, the Khoja algorithm attempts to extract the true root. However, more than one root
can be found in an isolated word without diacritics. Although the Khoja2 stemmer has not
1 Internet World Stats, Usage and population statistics (Miniwatts Marketing Group). Updated for June 30, 2010, last visited in August 2010. http://www.internetworldstats.com/.
2 Freely available at http://www.zeus.cs.pacificu.edu/shereen/ArabicStemmerCode.zip.
been maintained since its first publication, it has been widely used and analyzed in later
works. For instance, the Al-Shammari lemma-based stemmer incorporated the Khoja algorithm for verb stemming (Al-Shammari and Lin 2008). The authors successfully combined light stemming, root stemming and dictionary lookup. In addition to its effectiveness in the clustering task, the Al-Shammari algorithm outperformed the Khoja and light stemmers in terms of over-stemming evaluation (Al-Shammari 2010).
For light stemming, several variants have been developed (Larkey et al. 2002). When applied to the AFP_ARB corpus, a light stemmer was found to be more effective for cross-language retrieval than a morphological stemmer. The authors deduced that it is
not essential for a stemmer to yield the correct root. Surprisingly, in a technical report
(Larkey and Connell 2001), the authors claim that these results, either in mono-lingual or
cross-language retrieval, were obtained with no prior experience with Arabic. Another
study confirmed the same result, favoring light stemming for Arabic retrieval tasks
(Moukdad 2006).
On the contrary, Brants et al. reported that, whether by stemming or using full forms, they obtained the same performance for document topic analysis (Brants et al. 2002).
A recent study on Arabic text categorization has highlighted this contradiction in the literature and attempted to analyze various stemming tools (Said et al. 2009). In (Darwish
et al. 2005), the authors showed that using context to improve the root extraction process
may enhance the IR process. However, the context root extraction is computationally
expensive compared with the light and root stemming. Similar to Khoja but without a root
dictionary, a good light stemmer was developed by (Taghva et al. 2005). The authors found
that stem lists are not required in an Arabic stemmer. They deduced that finding the true
grammatical root of a term should not be the goal of a stemmer for document retrieval.
Compared to English and other languages, research relating to Arabic text stemming is fairly limited (Taghva et al. 2005). The main efforts to build efficient Arabic IR
systems have been achieved in a commercial framework. The approaches used for these
systems as well as their performance accuracy are not known. As a significant example, the Siraj system3 from Sakhr classifies Arabic text and extracts named entities with human-satisfying responses. However, no technical documentation explains the method used or the system evaluation.
2.2 Topic modeling
The LDA model has been introduced within a general Bayesian framework where the
authors developed a variational method and EM algorithm for learning the model from
collections of discrete data (Blei et al. 2003). The authors applied their model to document
modeling, text classification and collaborative filtering. For document modeling, they
trained a number of latent variable models, including LDA, on two text corpora to compare their generalization performance, as
measured by likelihood of held-out test data. Based on different datasets with various
document numbers and vocabulary sizes, experiments show that the LDA model outper-
forms other models such as the unigram model and pLSI.
In (Blei et al. 2006), the LDA model was tested on CGC Bibliography items. Experiments
show that LDA had better predictive performance than two standard models (unigram and
mixture of unigrams). For the text classification problem, an SVM was trained on the low-dimensional representations produced by LDA from unlabeled documents. The authors
3 http://www.siraj.sakhr.com/.
conducted two binary classification experiments using the Reuters-21578 dataset. They
achieved a similar performance compared with SVM classification based on the full-word space.
It is worth pointing out that most datasets used for LDA evaluation are freely available and include a few thousand English documents (sometimes up to 20,000) with some 30,000 unique words. This was considered sufficient for analyzing and assessing the model performance, but this is not the case for topic modeling in other languages such as Arabic.
Since the original introduction of the LDA model, several contributions have been
proposed. However, few studies, on finding latent topics in Arabic context, have been
identified. In addition to the works related to Arabic topic detecting and tracking (Oard and
Gey 2002; Larkey et al. 2004), a segmentation method that uses Probabilistic Latent Semantic Analysis (Hofmann 1999) has been applied to the AFP_ARB corpus for monolingual Arabic document topic analysis (Brants et al. 2002). In (Larkey et al. 2004), the researchers compared different topic tracking methods. They claimed that it is preferable to use separate languages for building specific topic models. Good topic models
have been obtained when native Arabic stories are available. However, Arabic topic
tracking has not been improved in texts translated from English stories.
In fact, studies on Arabic IR are insufficient, and the few works carried out on topic modeling as well as text stemming lack rigorous evaluation. Considering the high inflectional morphology of Arabic, it seems more appropriate to learn the LDA model in a mono-language context, taking more care with linguistic aspects. However, a large investigation of stemming methods is required to assess Arabic topic modeling on real-world corpora.
3 Arabic text analysis
Unlike the Indo-European languages, Arabic belongs to the Semitic language family. Written from right to left, it includes 28 letters. Although different Arabic dialects are spoken in the Arab world, there is only one form of the written language found in printed works, known as Modern Standard Arabic, herein referred to as Arabic (Kadri and Nie 2006). In addition to its derivational morphology, the main characteristics of the Arabic language that complicate any automatic text analysis are agglutination and non-vocalization.
3.1 Arabic language features
Arabic is a highly inflected language due to its complex morphology. An Arabic word belongs to one of three morpho-syntactic categories: noun, verb or particle. Several works have
used other categories (such as prepositions and adverbs) with no good reason except that
they are taken from English (Larkey et al. 2002; Tuerlinckx 2004; Moukdad 2006).
The lemma is the dictionary entry, which is fully vocalized and relates to any form found in text. In particular, verbs are reduced to the third person masculine singular in the past tense. All nouns and verbs are derived from a non-vocalized root according to one of the Arabic patterns. The root is a linguistic unit carrying a semantic area. It is a non-vocalized word (more general than the lemma) and often consists of only 3 consonants (rarely 4 or 5 consonants) (Kadri and Nie 2006; Tuerlinckx 2004).
3.1.1 Morphological complexity
In Semitic languages, the root is an essential element from which various words may be derived according to specific patterns or schemes. The morphological complexity of Arabic is characterized by inflection and derivation.
Inflection modifies the word to express different grammatical categories for the same meaning, such as gender, number, place or tense. Some irregular inflection schemes do not simply prefix or suffix the root; they also apply infixation and complex affixation processes. The following examples illustrate Arabic inflection with the irregular plural (called the broken plural):
from the root [Elm]: the plural of [Eilm, science] is [Eulum, sciences];
from the root [ktb]: the plural of [kitAb, book] is [kutub, books].
Derivation, by contrast, is a root affixation process generating a new word with a different meaning but generally in the same semantic area. Examples of verb derivation from the root [Elm] are [aEolam, notify/inform] and [{isotaEolam, inquire].
3.1.2 Agglutination
In Arabic text, a lexical unit is not easily identifiable from a graphic unit (a word delimited by space characters or punctuation marks). The morphological affixation process becomes more complicated when extra affixes are agglutinated to a lemma. Indeed, a word can be extended by attaching four kinds of affixes (antefix, prefix, suffix and postfix). Table 1 shows an example of an agglutinated and inflected word, [wayaEolamuwnahu], where various kinds of affixes are attached to the core form [Elm].
This situation creates high ambiguity when extracting the right core (stem) from an agglutinated form. In non-vocalized texts, morphological analysis becomes even more difficult, as illustrated in Table 1. Other agglutinative languages exist, such as Japanese, Turkish and Finnish, but their non-vocalization problem is not as complicated as in Arabic.
3.1.3 Vocalization
Arabic words are vocalized with diacritics (short vowels), but unfortunately full or partial vocalization is found only in didactic documents or in Koranic text. This fact accentuates the ambiguity of words and requires every automatic analyzer to pay close attention to morphology and word context. In Table 2, the non-vocalized word [bsm] yields more than one segmentation with different meanings, mainly due to the diacritics missing from an agglutinated form.
Table 1 Segmentation of an Arabic agglutinated form meaning "and they know it"

           Antefix   Prefix     Lemma      Suffix     Postfix
           [wa]      [ya]       [Eolamu]   [wna]      [hu]
           and       they 1/2   know       they 2/2   it

(The antefix and postfix come from agglutination; the prefix, lemma and suffix carry the inflection.)
3.2 Stemming methods
Stemming is a process of conflating inflected or derived words to a unique base stem. It is an important way to reduce the collection vocabulary. In addition, stemming avoids dealing with the same word as different index entries. Two classes of Arabic stemming methods can be identified: (1) light stemmers, which remove the most common affixes, and (2) morphological analyzers, which extract each core (root or lemma) according to a scheme.
3.2.1 Light stemmers
Light stemmers truncate a reduced list of affixes from a word without trying to find roots. The effectiveness of this approach depends on the content of the prefix and suffix lists. Whereas in English a stem can be found mainly by removing conjugation suffixes, in Arabic texts we have to deal with ambiguous agglutinated forms that imply several morphological derivations. An analysis of such an approach can be found in (Larkey et al. 2002). The ISRI stemmer is another example of light stemming (Taghva et al. 2005). Without a root dictionary, the ISRI4 algorithm uses affix lists and the most common patterns to extract roots. Nevertheless, it keeps a normalized form for unfound stems.
This kind of stemmer can effectively deal with most practical cases, but in some of them the right word is lost. For example, in the word [wafiy], one reader can see two agglutinated prepositions, meaning "and in", while another will see a noun meaning "faithful/complete".
3.2.2 Morphological analyzers
In morphological analysis, we try to extract more complete forms according to knowledge of vocalization variation and derivation patterns. Two categories of analyzers can be distinguished according to the nature of the desired output unit: (1) root-based stemmers and (2) lemma-based stemmers. The choice between the two approaches depends on how the stemming results will subsequently be used, in IR tasks or in language modeling.
In the first category, the Khoja stemmer, which attempts to find the root of Arabic words, was proposed in (Khoja and Garside 1999). A list of roots and patterns is used to determine the right stem. This approach produces abstract roots, which significantly reduces the dimension of the document feature space, but it leads to a confusion of divergent meanings embedded in a unique non-vocalized stem. For example, stemming the word cited above in Table 1 should yield a root-stem whose possible meanings include the verbs "to know" or "to teach"; however, the same root can also mean the noun "flag".
Table 2 Four possible solutions for the word [bsm]

Solution   Morphology    Vocalization      English meaning
1          Noun          [basom]           smiling
2          Verb          [basam]           smile
3          Prep + noun   [bi] + [somi]     in/by (the) name of
4          Prep + noun   [bi] + [sam]      by/with poison
4 http://www.nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.isri-pysrc.html.
In the second category, a lemma-based stemmer has been developed and compared to the Khoja stemmer (Al-Shammari and Lin 2008). The authors combined light stemming with the Khoja algorithm for antefix removal and verb stemming before processing the remaining words as nouns. They use a stop list exceeding 2,200 words, with verb and noun dictionaries as linguistic resources. In addition to a clustering performance comparison, they used a collection of concept groups for under- and over-stemming evaluation. Unfortunately, neither these resources nor the test collections are available. As shown in Table 2, an Arabic word can induce more than one stem, and so more than one lemma. The above approaches do not support this aspect.
For this purpose, a set of Arabic lexicons5 has been developed with rules for legal combinations of lemma-stems and affix forms (Buckwalter 2002). P. Brihaye has developed AraMorph,6 a Java package for Arabic lemmatization based on the Buckwalter Arabic morphological analyzer. Several stemming solutions can be proposed for each word. From this analyzer, one can develop, under some considerations, a lemma-based stemmer. This approach is described hereafter.
3.3 Lemma-based stemmer
We propose an algorithm for a lemma-based stemmer called the Brahmi-Buckwalter Stemmer, referred to henceforth as BBw. Based on the resources of the Buckwalter morphological analyzer, two main contributions can be reported for the BBw stemmer: (1) normalization preprocessing and (2) stem selection with morphological
analysis.
3.3.1 Normalization
This step normalizes the input text; the obtained list of tokens is then processed by the Buckwalter morphological analyzer:

- Convert to UTF-8 encoding.
- Tokenize the text respecting standard punctuation.
- Remove diacritics and tatweel.
- Remove non-Arabic letters and stop-words.
- Replace initial alef with hamza (above or below) by bare alef.
- Replace final waw or yeh carrying hamza by hamza.
- Replace alef-maddah or alef-waslah by bare alef.
- Replace two consecutive bare alefs by alef-maddah.
- Replace final teh marbuta by heh.
- Remove final yeh when the remaining stem is valid.
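The steps above can be sketched in a few regular-expression substitutions. This is our minimal illustration using the Unicode codepoints of the Arabic characters named in the list; the exact character set, rule ordering and function names are our assumptions, not the published BBw specification:

```python
import re

# Arabic Unicode codepoints for the characters named in the steps above
# (our assumption of the rule set, not the published BBw specification).
DIACRITICS = re.compile('[\u064B-\u0652]')   # tanwin, short vowels, shadda, sukun
TATWEEL = '\u0640'

def normalize(token, stopwords=frozenset()):
    """Sketch of the BBw normalization steps for one raw token."""
    t = DIACRITICS.sub('', token).replace(TATWEEL, '')
    t = re.sub('[^\u0621-\u064A]', '', t)        # drop non-Arabic characters
    if t in stopwords:
        return ''
    t = re.sub('^[\u0623\u0625]', '\u0627', t)   # initial alef with hamza -> bare alef
    t = re.sub('[\u0624\u0626]$', '\u0621', t)   # final waw/yeh with hamza -> hamza
    t = t.replace('\u0622', '\u0627').replace('\u0671', '\u0627')  # maddah/waslah -> alef
    t = t.replace('\u0627\u0627', '\u0622')      # two bare alefs -> alef-maddah
    t = re.sub('\u0629$', '\u0647', t)           # final teh marbuta -> heh
    return t
```

For instance, a token carrying an initial alef-hamza and a short vowel is reduced to its bare, diacritic-free form before being handed to the morphological analyzer.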
3.3.2 Stem selection
When an input token (in-token) is processed by the Buckwalter morphological analyzer, three cases can be reported: (1) A unique solution is given according to a specific pattern.
5 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49.
6 http://www.nongnu.org/aramorph/english/index.html.
(2) Multiple solutions are found corresponding to several patterns and lexicon entries. (3) No solution can be attributed to the in-token. The actions that the BBw stemmer must undertake in each case are detailed below.
Unique solution. The BBw stemmer retains only the non-vocalized lemma-stem of the solution (without affixes). A solution without a noun or verb lemma (i.e., one containing only particles) is ignored, and the in-token is then considered a stop-word.
Multiple solutions. The BBw stemmer treats all the proposed solutions as a set of separate unique solutions and thus retains all the non-vocalized lemma-stems. Note that eliminating diacritics from lemmas may unify some stems and so reduce the multiplicity of solutions. For example, Table 2 gives four vocalized solutions for the token [bsm], but after removing diacritics from the output lemmas, the BBw stemmer will identify only two confused stems {[bsm], [sm]}. It is worth pointing out that most Arabic proper names can be derived regularly from roots. In this case, multiple solutions, including the in-token, must be considered.
No solution. When no solution can be attributed to the in-token, different reasons can be raised: (1) The in-token is wrong and does not imply any Arabic lemma. (2) The in-token corresponds to a proper name (person, city, etc.) that has no entry in the dictionary. (3) The in-token is a correct Arabic word but is not yet included in the current release of the Buckwalter morphological analyzer.
In this study, we have opted to improve the normalization preprocessing in the BBw algorithm based on the original Buckwalter lexicon. Three stemmer variants (BBw0, BBw1, BBw2) are developed and evaluated on different Arabic datasets. Table 3 summarizes the stem-selection approach of each BBwX stemmer.
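The selection rules of the three variants reduce to a small dispatch. A minimal sketch in Python, assuming the analyzer returns a (possibly empty) list of non-vocalized lemma-stems per in-token; the function name and interface are ours, not part of the published stemmer:

```python
def select_stems(solutions, in_token, variant="BBw2"):
    """Apply the BBwX stem-selection rules (hypothetical interface).

    solutions -- non-vocalized lemma-stems proposed by the analyzer
                 (empty list when no solution was found)
    Returns the set of stems emitted for this in-token.
    """
    stems = set(solutions)            # de-vocalized duplicates collapse here
    if not stems:                     # no-solution case
        return set() if variant == "BBw0" else {in_token}
    if len(stems) > 1 and variant == "BBw1":
        stems.add(in_token)           # BBw1 also keeps the surface form
    return stems
```

With this dispatch, BBw0 silently drops unanalyzable tokens, while BBw1 and BBw2 keep them in the vocabulary; only BBw1 adds the surface form alongside multiple lemma solutions.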
3.3.3 Confusion degree measure
When the morphological analysis of an in-token implies multiple solutions, each BBwX stemmer produces multiple lemma-stems. For a collection (S), we denote by L (L ≠ 0) the total number of in-tokens after stop-word removal, and by L_R the total number of stems obtained when stemming (S) with an algorithm (R). The confusion degree C(S|R) is then defined as:

    C(S|R) = L_R / L    (1)

For example, C(S|Khoja) = 1, since the Khoja stemmer gives at most one stem for each token in any dataset S. This is the ideal situation for a stemming process, but when applying the BBwX stemmers, the confusion degree C increases. We propose the measure C(S|R) for assessing the lexical ambiguity in Arabic texts. A human Arabic reader solves this ambiguity easily, with semantic considerations guided by the context.
For BBwX stemming, note that all possible stems are equitably related to their in-token. At this stage, we have no precise knowledge to select the right stem. Nevertheless, the
Table 3 BBwX outputs versus different cases of the in-token morphological analysis

Stemmer   Unique solution   Multiple solutions      No solution
BBw0      1 lemma-stem      all lemmas              (none)
BBw1      1 lemma-stem      all lemmas + in-token   in-token
BBw2      1 lemma-stem      all lemmas              in-token
relevant solution can be weighted later by a co-occurrence computation in the local context. We think that this will be possible with LDA topic modeling.
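The confusion degree of Eq. (1) is a single ratio over corpus-level counts. A trivial helper, shown here only to fix the convention (the function name is ours):

```python
def confusion_degree(n_stems, n_tokens):
    """Confusion degree C(S|R) = L_R / L from Eq. (1): the average number
    of stems produced per in-token after stop-word removal.  C = 1 is the
    ideal one-stem-per-token case (e.g. the Khoja stemmer); the BBwX
    stemmers yield C >= 1."""
    if n_tokens == 0:
        raise ValueError("L must be non-zero")
    return n_stems / n_tokens
```

For instance, a stemmer emitting 1,300 stems for 1,000 in-tokens has C = 1.3, i.e. 0.3 extra candidate stems per token on average.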
3.4 Stemming evaluation
Although the literature describes various Arabic stemmers, only a few of them have received a standard evaluation (Al-Shammari 2010; Said et al. 2009). One way to assess the effectiveness of stemming algorithms is to evaluate their performance in information retrieval tasks. This requires standard and representative test collections. Nevertheless, it is not certain that good performance in IR tasks results only from stemming quality (Paice 1996; Frakes 2003). Herein, we describe three stemming metrics used in the present work. Such metrics allow assessing some stemming aspects independently of IR task performance.
3.4.1 Index compression
The Index Compression Factor (ICF) represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the stemmer, the greater the index compression factor (Frakes 2003). With N the number of unique words before stemming and S the number of unique stems after stemming, it is calculated as:

    ICF = (N - S) / N    (2)

The ICF has been introduced as a strength measure of stemmer and compression performance. However, vocabulary compression does not imply ideal stemming: a good stemmer is one that stems all words to their correct roots. The following measures address this condition.
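Eq. (2) is a direct vocabulary-size ratio; a small helper makes the convention explicit (the function name is ours):

```python
def index_compression_factor(n_words, n_stems):
    """ICF = (N - S) / N from Eq. (2): the fraction of the unique-word
    vocabulary removed by stemming (N unique words before stemming,
    S unique stems after)."""
    if n_words == 0:
        raise ValueError("N must be non-zero")
    return (n_words - n_stems) / n_words
```

A vocabulary reduced from 10,000 unique words to 4,000 unique stems thus has ICF = 0.6, while a stemmer that changes nothing has ICF = 0.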
3.4.2 Under- and over-stemming
Under-stemming is the failure to conflate morphologically related words. This occurs when two words that should be stemmed to the same root are not. An example of under-stemming would be the words "adhere" and "adhesion" not being stemmed to the same root.
Over-stemming refers to words that should not be grouped together by stemming, but
are. For example, merging the words probe and probable after stemming would
constitute an over-stemming error.
Using a sample file of W grouped words, under-stemming errors are counted as described in (Paice 1996). A concept group contains forms which are both semantically and morphologically related to one another. For each group g containing n_g words, the number of pairs of different words defines the desired merge total (DMT_g):

    DMT_g = 0.5 n_g (n_g - 1)

Since a perfect stemmer should not merge any member of a group with words from other groups, every group also has a desired non-merge total (DNT_g):
    DNT_g = 0.5 n_g (W - n_g)

Summing these two totals over all groups gives the global desired merge total (GDMT) and the global desired non-merge total (GDNT), respectively. Stemming errors are then expressed through the Conflation Index (CI), the proportion of equivalent word pairs which were successfully grouped to the same stem, and the Distinctness Index (DI), the proportion of non-equivalent word pairs which remained distinct after stemming. The under-stemming index (UI) and the over-stemming index (OI) are given by:

    UI = 1 - CI    (3)
    OI = 1 - DI    (4)

In (Paice 1996), the author proposed computing the ratio of these two quantities as a measure of the stemming weight (SW):

    SW = OI / UI    (5)
The premise of Paice's error-counting approach is that, although it is advantageous to have the index of terms compressed, this is only useful up to a point: as conflation becomes heavier, the merging of distinct concepts becomes increasingly frequent. At that point, small increases in recall are gained at the expense of a major loss of precision (Frakes 2003).
One question mark over this approach concerns the validity of the grouped file against which the errors are assessed. These grouped files were constructed by human judgment, during scrutiny of sample word lists (Paice 1996; Frakes 2003). For Arabic stemming evaluation, Al-Shammari selected a sample of 419 words and divided them into 81 conceptual groups (i.e., close to 5 words per group). Compared to the Khoja and light stemmers, Al-Shammari's lemmatizer reduced over-stemming errors. However, no effective improvement was achieved for under-stemming counts (Al-Shammari 2010).
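The Paice counts above can be computed directly from a grouped sample file. A minimal sketch, assuming a list of concept groups and any word-to-stem function (both names are ours):

```python
from itertools import combinations

def paice_errors(groups, stem):
    """Compute UI, OI and SW (Eqs. 3-5) over a grouped sample file.

    groups -- list of concept groups, each a list of related words
    stem   -- function mapping a word to its stem
    """
    merged, gdmt = 0, 0       # within-group pairs actually merged / GDMT
    distinct, gdnt = 0, 0     # cross-group pairs kept distinct / GDNT
    for i, g in enumerate(groups):
        for a, b in combinations(g, 2):       # equivalent pairs (DMT_g of them)
            gdmt += 1
            merged += stem(a) == stem(b)
        for h in groups[i + 1:]:              # non-equivalent pairs
            for a in g:
                for b in h:
                    gdnt += 1
                    distinct += stem(a) != stem(b)
    ui = 1 - merged / gdmt                    # under-stemming index, Eq. (3)
    oi = 1 - distinct / gdnt                  # over-stemming index, Eq. (4)
    sw = oi / ui if ui else float("inf")      # stemming weight, Eq. (5)
    return ui, oi, sw
```

Counting each unordered cross-group pair once matches the summed DNT_g totals, since the sum of 0.5 n_g (W - n_g) over all groups equals the number of distinct cross-group pairs.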
4 LDA topic model
Latent Dirichlet allocation (LDA) is a generative topic model for text documents (Blei
et al. 2003). Based on the classical bag of words assumption, a topic model considers
each document as a mixture of topics where a topic is defined by a probability distribution
over words.
The distribution over words within a document (d) is given by:

    P(w_i | d) = sum_{j=1..T} P(w_i | z_i = j) P(z_i = j | d)

where P(w|z) defines the probability distribution over words w given topic z, and P(z|d) refers to the distribution over topics z in a document. More details and interpretations of topic models can be found in (Blei et al. 2003; Steyvers and Griffiths 2007).
For a given number of topics T, the LDA model is trained from a collection of documents defined as follows:

- N: number of words in the vocabulary
- M: number of documents in the corpus
- T: number of topics, given as an input value
- P(z|d): distribution over topics z in a particular document
- P(w|z): probability distribution over words w given topic z

The generative process is then defined as follows. For each document d = 1 to M (in the dataset):

1. Sample the mixing probability theta_d ~ Dir(alpha)
2. For each word position i in document d:
   (a) Choose a topic z_di in {1, ..., T} ~ Multinomial(theta_d)
   (b) Choose a word w_di in {1, ..., N} ~ Multinomial(beta_{z_di})

where alpha is a symmetric Dirichlet parameter and the {beta_i} are multinomial topic parameters. Each beta_i assigns a high probability to a specific set of words that are semantically related. This distribution over the vocabulary is referred to as a topic. In the present work, we use the LingPipe7 LDA implementation, which is based on Gibbs sampling for parameter estimation.
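The generative process above can be sketched directly. The following toy sampler is our own illustration (not the LingPipe implementation, which goes the other way and estimates the parameters by Gibbs sampling); a symmetric Dirichlet draw is obtained by normalizing independent Gamma(alpha, 1) variates:

```python
import random

def generate_document(n_words, topic_word, alpha=0.1, rng=random):
    """Sample one document from the LDA generative process (sketch).

    topic_word -- list of T topic distributions, each a list of N word
                  probabilities (the beta_i parameters)
    alpha      -- symmetric Dirichlet parameter for theta_d
    Returns a list of word indices.
    """
    T = len(topic_word)
    # theta_d ~ Dir(alpha): normalize T independent Gamma(alpha, 1) draws
    g = [rng.gammavariate(alpha, 1.0) for _ in range(T)]
    theta = [x / sum(g) for x in g]
    words = []
    for _ in range(n_words):
        z = rng.choices(range(T), weights=theta)[0]   # topic for this position
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]
        words.append(w)
    return words
```

With a small alpha, theta_d concentrates most of its mass on a few topics, so each sampled document draws its words mainly from one or two topic distributions.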
The choice of the number of topics can affect the interpretability of the results. A model with too few topics will generally result in very broad topics, while a model with too many topics will be uninterpretable (Steyvers and Griffiths 2007). Since the number of topics (T) is given as an input parameter for training the LDA model, several methods have been proposed to select a suitable T. An evident approach is to choose the T that leads to the best performance on target tasks (classification, clustering, etc.).
5 Building Arabic datasets
As raised above, developments in Arabic IR often face the problem of the unavailability of standard free resources. We have therefore opted to build our own experimental datasets. To this end, we developed a Web crawler8 to collect newspaper articles from several Arabic websites. In this study, we present three real-world corpora based on Echorouk,9 Reuters10 and Xinhua11 Web articles. Each article is saved with UTF-8 encoding in a separate text file whose first line is reserved for its title. A brief description is given in Table 4.
5.1 Datasets description
The Echorouk collection contains 11,313 documents from Echorouk newspaper articles relating to the 2008–2009 period. It is labeled according to eight categories. From the full corpus, Ech-11k, we build a subset, Ech-4000, of 4,000 documents for preliminary evaluations.
7 Available at http://www.alias-i.com/lingpipe/index.html.
8 Developed with Amine Roukh and Abdelkadir Sadouki at the Mostaganem University.
9 Algerian newspaper with online edition at http://www.echoroukonline.com/.
10 International news agency with Arabic online edition at http://www.ara.reuters.com/.
11 Chinese news agency with Arabic online edition at http://www.arabic.news.cn/.
Inf Retrieval
123
The Reuters collection, Rtr-41k, contains 41,251 Arabic documents relating to the 2007–2009 period. It is labeled according to six categories. A subset, Rtr-5251, of 5,251 documents is used for preliminary evaluations.
The Xinhua collection contains 36,696 Arabic documents relating to the 2008–2009 period. It is labeled according to eight categories. A subset, Xnh-4500, of 4,500 documents is used for preliminary evaluations. Table 5 describes the collected datasets with their distributions over published categories.
5.2 Arabic stemming
The three lemma-based stemmers (BBw0, BBw1 and BBw2) were applied to the three datasets described above. For comparison, we use the ISRI algorithm for light stemming and two variants of the Khoja algorithm for root-based morphological analysis. The first one, Khoja0, is the original Khoja algorithm, which outputs only the found roots. The second variant, Khoja1, also adds unfound words to the vocabulary. Furthermore, we use the raw text as a baseline characterization in the preliminary experiments. To align the comparison, the stop-word list used in the BBwX algorithms is removed from the other stemmers' output.
Table 4 Description of three datasets relating to Echorouk, Reuters and Xinhua Web-articles
Feature\dataset Ech-11k Rtr-41k Xnh-36k
# Articles 11,313 41,251 36,696
# Characters 48,247,774 111,478,849 85,696,969
# Tokens 4,388,426 10,093,707 7,532,955
# Arabic tokens 3,341,465 7,892,348 6,097,652
Average # tokens/article 387.9 244.7 205.3
# Categories 8 6 8
Table 5 Distribution of the three datasets over categories
Dataset source Echorouk Reuters Xinhua
Category\dataset Ech-11k Ech-4000 Rtr-41k Rtr-5251 Xnh-36k Xnh-4500
1 World 2,274 572 10,000 1,000 9,465 561
2 Economy 816 572 10,000 1,000 6,862 563
3 Sport 3,554 572 10,000 1,000 1,132 563
4 Middle-East 10,000 1,000 9,822 563
5 Science-health 889 889 1,993 563
6 Culture-education 566 566 1,508 562
7 Algeria 2,722 573
8 Society 808 572
9 Art 315 315
10 Religion 258 258
11 Entertainment 362 362
12 China 4,654 563
13 Tourism-ecology 1,260 562
Total 11,313 4,000 41,251 5,251 36,696 4,500
For each stemmer, the resulting corpus vocabulary sizes are reported in Table 6. As a main observation for the three datasets, the stemming results show that the ISRI algorithm produces high vocabulary dimensions. Close to raw text, ISRI does not provide a significant reduction of the feature space. In contrast, morphological analysis with Khoja and BBw decreases the vocabulary size by unifying tokens that share a common root or lemma. Retaining only the correct forms, the Khoja0 and BBw0 stemmers produce the smallest vocabularies.
In morphological analysis, it is clear that our lemma-based stemmer (BBw0) enhances the feature space compared to a root-based stemmer (Khoja0). However, the vocabularies produced when adding unrecognized tokens (Khoja1 and BBw2) reveal gaps in the underlying lexicons. We anticipated such cases when analyzing the BBwX variants in a previous section, where we discussed the causes of "no solution" outputs. To analyze the loss caused by this lack, we compute the ratio of unfound tokens with respect to all the words (found and unfound). Figure 1 shows that the BBw lexicon is more complete than Khoja's.
To complete our preliminary analysis and assess the performance of the BBwX stemmers when dealing with ambiguous forms, we compute the confusion degree as defined in (1). According to Table 7, the maximum confusion degree in the real-world corpora does not exceed 1.16. This indicates that when our stemming approach includes all multiple solutions in the vocabulary, it preserves all word senses without significant lexical ambiguity.
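To make the measure concrete, the sketch below computes a confusion degree under the assumption (ours, for illustration; the exact formulation is Eq. (1) in the paper) that it is the average number of retained stemming solutions per analyzed token; the analyzer output is hypothetical:

```python
# Hypothetical analyzer output: each token maps to its retained lemma
# solutions; multiple solutions are all kept in the vocabulary (BBw-style).
analyses = {
    "slAmh": ["slAm", "slAmh"],   # ambiguous: peace / safety
    "mAl":   ["mAl"],
    "ktAb":  ["ktAb"],
}

def confusion_degree(analyses):
    """Average number of solutions per token (assumed reading of Eq. (1));
    a value of 1.0 means no token received multiple solutions."""
    return sum(len(sols) for sols in analyses.values()) / len(analyses)

def max_mul(analyses):
    """Maximum number of solutions for any single token (Max-mul in Table 7)."""
    return max(len(sols) for sols in analyses.values())

print(confusion_degree(analyses), max_mul(analyses))
```

Under this reading, a corpus-level value near 1.1, as in Table 7, would mean roughly one extra solution per ten tokens.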
5.3 Stemming evaluation
The ICF and Paice metrics have been tested on the three datasets to analyze stemmer quality. For evaluation fairness, the index compression factor has been calculated after stop-word removal, as applied in the BBwX stemmers. In lemma-based stemming, a larger stop-list of 575 words is used. Table 8 summarizes the ICF values computed as defined in (2). It shows that the Khoja0 root-stemmer and the BBw0 lemma-stemmer achieve the best compression over the three datasets. However, light stemming with ISRI did not give a significant index reduction. Because they keep unknown tokens as proper stems, the other variants of Khoja and BBw reduced the index only slightly.
For Paice's evaluation, the main difficulty is building representative concept groups.
concept groups. Al-Shammari has used a moderate set of 81 groups for comparing her
lemmatizer to Khoja and light Arabic stemmers. Unfortunately, no test collections orconcept groups used in experiments were available (Al-Shammari 2010).
In this study, we have built a large real-world concept-groups by processing the article
titles in the three datasets (Ech-11k, Rtr-41k and Xnh-36k). After stop-words removal andpreliminary grouping, we have selected the groups that contained 10 words at least. Then,
we have revised the resulting groups with an Arabic expert before retaining a collection of
13,142 words distributed on 689 groups. Because all the selected words are recognized by
Khoja and BBw stemmers, it was not necessary to calculate stemming errors for the othervariants (KhojaX, BBwX). Table 9 gives results of the Paices evaluation.
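Paice's counting scheme can be sketched as follows, assuming the standard definitions from Paice (1996): the under-stemming index UI is the fraction of within-group word pairs not conflated to one stem, the over-stemming index OI is the fraction of cross-group pairs wrongly conflated, and SW = OI/UI; the concept groups and the truncating "stemmer" below are toy stand-ins, not our Arabic resources:

```python
from itertools import combinations

def paice_indexes(groups, stem):
    """Over-/under-stemming indexes of a stemmer against concept groups
    of semantically related words (assumed Paice 1996 definitions)."""
    gdmt = gumt = gdnt = gwmt = 0
    # Within-group pairs should conflate to the same stem.
    for group in groups:
        for w1, w2 in combinations(group, 2):
            gdmt += 1
            if stem(w1) != stem(w2):
                gumt += 1          # unachieved merge (under-stemming)
    # Cross-group pairs should keep distinct stems.
    for g1, g2 in combinations(groups, 2):
        for w1 in g1:
            for w2 in g2:
                gdnt += 1
                if stem(w1) == stem(w2):
                    gwmt += 1      # wrongly merged (over-stemming)
    ui, oi = gumt / gdmt, gwmt / gdnt
    return oi, ui, (oi / ui if ui else float("inf"))

# Toy groups and a naive 3-character truncation "stemmer" for illustration.
groups = [["run", "running", "ran"], ["runt", "runts"]]
oi, ui, sw = paice_indexes(groups, lambda w: w[:3])
print(oi, ui, sw)
```

A perfect stemmer would drive both OI and UI to zero; the truncation stemmer above both misses merges ("ran") and wrongly merges across groups.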
Table 6 The vocabulary sizes of three datasets according to different stemmers
Vocabulary Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
Ech-11k 179,225 144,442 3,172 22,411 17,558 60,148 48,789
Rtr-41k 183,510 150,977 3,153 38,644 17,020 73,382 62,843
Xnh-36k 175,410 140,192 3,089 10,712 11,352 25,835 22,456
The results show that our lemma-based stemmer provides the lowest over-stemming and under-stemming indexes (OI and UI). This is a significant performance compared to the Khoja and ISRI stemmers. Recall that the BBw stemmer can generate multiple stems for an input token. Theoretically, this increases the under-stemming errors, but empirically the results show that BBw improves sense distinctness.
6 Experiments and results
The experiments performed for text categorization and topic modeling are described and analyzed in this section. The support vector machine (SVM) is a kernel-based method introduced by Vapnik (1995) for binary and multi-class classification. In this work, the LIBSVM12 package is applied for multi-class text categorization. For SVM training, a simple linear kernel is used with the cost parameter set to 10. For performance evaluation, five-fold cross-validation is performed on each feature space of the datasets.
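This evaluation protocol can be sketched with scikit-learn, whose SVC class wraps LIBSVM; the synthetic feature matrix stands in for a real document representation (full-word or topic space):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a document-feature matrix with 4 category labels.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Linear kernel with cost parameter C = 10, as in the experiments;
# SVC handles multi-class problems via one-vs-one internally.
clf = SVC(kernel="linear", C=10)

scores = cross_val_score(clf, X, y, cv=5)   # five-fold cross-validation
print(scores.mean())
```

The mean of the five fold accuracies is the figure reported in the tables below.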
Fig. 1 The unfound-token rate for the two analyzers, Khoja and BBw, on the Ech-11k, Rtr-41k and Xnh-36k datasets (rates range up to about 14%)
Table 7 Confusion degree (Conf-deg) relating to the BBwX stemmers
Dataset Ech-11k Rtr-41k Xnh-36k
Stemmer Conf-deg Max-mul Conf-deg Max-mul Conf-deg Max-mul
BBw0 1.118 5 1.129 6 1.121 5
BBw1 1.146 6 1.155 6 1.141 6
BBw2 1.106 5 1.116 5 1.105 5
Max-mul refers to the maximum number of multiple solutions
Table 8 Comparison of index compression factors for different stemmers
Dataset\stemmer ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
Ech-11k 0.194 0.982 0.875 0.902 0.664 0.728
Rtr-41k 0.177 0.983 0.789 0.907 0.600 0.658
Xnh-36k 0.201 0.982 0.939 0.935 0.853 0.872
12 LIBSVM-2.89 is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm+zip.
It is worth pointing out that: (1) for each dataset, six stemmers (ISRI, Khoja0, Khoja1, BBw0, BBw1 and BBw2) are applied for categorization and topic modeling, (2) raw text is used as a baseline, and (3) for topic modeling, all datasets are used as unlabeled collections.
6.1 Classification in full word space
For the basic word-space definition, the TF and TF-IDF measures are applied to the stemmed datasets. TF_t refers to the term frequency of a term (t) in its document. The inverse document frequency (IDF_t) measures the general importance of the term over a set of documents (D). In this work, we use the simple formulation of the TF-IDF measure as follows:

TF-IDF(term t | D) = TF_t × IDF_t = TF_t × log( |D| / |{d : t ∈ d}| )    (6)
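Equation (6) translates directly into code; the mini-corpus below is illustrative:

```python
import math

# Toy stemmed corpus: each document is a list of index terms.
docs = [["mAl", "bnk", "mAl"], ["krh", "qdm"], ["bnk", "aqtSAd"]]

def tf_idf(term, doc, docs):
    """TF-IDF as in Eq. (6): TF_t * log(|D| / |{d : t in d}|)."""
    tf = doc.count(term)                       # term frequency in the document
    df = sum(1 for d in docs if term in d)     # document frequency over D
    return tf * math.log(len(docs) / df)

# "mAl" occurs twice in docs[0] and appears in one of three documents:
print(tf_idf("mAl", docs[0], docs))   # 2 * log(3)
```

A term occurring in every document gets weight zero, since log(|D|/|D|) = 0.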
The three subsets (Ech-4000, Rtr-5251 and Xnh-4500) are used for text categorization in the full word space. For the two term models (TF and TF-IDF) and the various stemming approaches, Fig. 2 gives a preliminary evaluation of text classification. As a main observation, the Khoja stemmers give the weakest performance. With abstract root stemming and an incomplete lexicon, the Khoja0 algorithm degrades text characterization. When unfound tokens are added, the Khoja1 variant slightly improves classification performance. Although the ISRI light stemmer produces huge vocabularies, it seems to give good classification performance in the full word space. The same can be said of raw text. However, the BBwX stemmers improve classification accuracy with a reasonable word-space dimension.
6.2 Classification in topics space
The three subsets (Ech-4000, Rtr-5251 and Xnh-4500), with the various stemming variants, are trained with the LDA algorithm. For several numbers of topics, we report in Tables 10, 11 and 12 the classification accuracy obtained by SVM cross-validation.
According to Tables 10, 11 and 12, we can deduce a suitable number of topics for each stemmed dataset. The best models are obtained when choosing, for LDA training, a number of topics between 100 and 400.
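The selection procedure implied by these tables — train LDA for several values of T, project documents into the topics space, and keep the T with the best cross-validated SVM accuracy — can be sketched as follows (synthetic counts stand in for a stemmed corpus; the candidate T values are illustrative):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic term-count matrix with two crude "themes" so labels are learnable.
n_docs, vocab = 120, 40
labels = rng.integers(0, 2, n_docs)
counts = rng.poisson(1.0, (n_docs, vocab))
counts[labels == 1, : vocab // 2] += rng.poisson(
    3.0, (int((labels == 1).sum()), vocab // 2))

accuracy = {}
for T in (5, 10, 20):                              # candidate topic numbers
    theta = LatentDirichletAllocation(
        n_components=T, random_state=0).fit_transform(counts)
    accuracy[T] = cross_val_score(
        SVC(kernel="linear", C=10), theta, labels, cv=5).mean()

best_T = max(accuracy, key=accuracy.get)           # T with best CV accuracy
print(best_T, accuracy)
```

On the real corpora this sweep is what produces Tables 10–12; here it merely illustrates the mechanics of the selection.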
Considering the stemming methods, the preliminary experiments show that the Khoja stemmers give the lowest classification performance. Unexpectedly for linguists, the results show that with raw text, topic modeling can perform as well as with morphological analysis. Recall that light stemming (ISRI) also generated large vocabularies in which different entries should have been conflated into a single Arabic stem.
Focusing on the morphological analyzers, the experiments show that the BBw lemma-based stemmers improve classification in the topics space compared to the Khoja root-based stemmers. For the Reuters collection, as an example, we highlight in Fig. 3 the difference between the
Table 9 Paice's evaluation for three Arabic stemmers

Stemmer OI UI SW
ISRI 0.019 × 10^-3 0.968 0.019 × 10^-3
Khoja 1.452 × 10^-3 0.098 14.87 × 10^-3
BBw 0.006 × 10^-3 0.060 0.096 × 10^-3
stemmers that retain only the correct Arabic forms. It is clear that lemma-based stemming enhances LDA modeling even with a low number of topics.
6.3 Finding topics in newspaper articles
In this section, we illustrate some results of LDA modeling in Arabic texts. The BBw2algorithm was used for stemming the three complete datasets (Echorouk, Reuters and
Xinhua). Note that a latent topic can be titled with human assessment according to itsrelevant terms. For example, we give in Table 13 topics distribution which are mostrelevant to the word [mAl, money] .
In addition, we propose to compute the categories' distribution over the learned topics. A confusion matrix (category\topic) can be obtained by summing the document distributions of the same category. In Table 14, we give an illustration of the categories' distribution over the
Table 10 Classification accuracy in topics space of the Ech-4000 corpus
# Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
32 76.3 77.4 71.7 76.5 79.0 78.9 79.1
64 79.2 79.4 75.4 78.7 79.4 79.7 80.2
100 80.2 80.3 77.3 79.2 80.9 80.0 80.8
200 80.9 81.2 77.2 79.6 80.6 80.6 81.1
300 81.8 80.4 76.9 78.2 80.3 80.8 81.3
400 80.9 80.5 76.7 79.2 80.5 80.5 80.7
500 81.3 80.7 77.3 77.5 80.0 80.9 81.0
600 80.9 80.6 76.0 77.7 80.3 80.9 80.3
700 80.6 79.7 75.6 76.4 79.3 80.1 79.6
Six stemmers are applied, with raw text as a baseline, for several numbers of topics
Bold values indicate the best classification accuracies
Fig. 2 Classification accuracy in the full-word space of the three datasets (Ech-4000, Rtr-5251 and Xnh-4500), for the TF and TF-IDF term models and seven characterizations (Raw, ISRI, Khoja0, Khoja1, BBw0, BBw1, BBw2); accuracies range roughly from 68 to 84%
eight main topics13 from the Reuters dataset. By setting a likelihood threshold at 10%, one can identify the relevant topics in each category. For example, one easily discovers that the main subjects in the sport category during 2007–2009 were football and tennis.
Table 11 Classification accuracy in topics space of the Rtr-5251 corpus
# Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
32 80.0 81.1 77.5 82.3 83.8 83.3 84.2
64 84.2 84.4 80.4 83.9 85.6 84.8 84.6
100 84.3 85.5 82.0 84.6 86.0 85.6 85.3
200 85.7 85.2 82.8 84.9 85.6 85.8 85.8
300 85.6 84.9 83.5 84.7 86.0 85.7 85.7
400 84.9 84.9 83.2 85.0 85.5 85.7 85.6
500 84.2 84.5 82.4 84.4 85.5 85.2 85.5
600 84.4 84.7 81.8 84.9 85.2 85.1 85.4
700 84.3 84.5 81.6 83.7 85.2 85.4 85.4
Bold values indicate the best classification accuracies
Table 12 Classification accuracy in topics space of the Xnh-4500 corpus
# Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
32 67.5 70.1 68.9 74.6 75.3 75.0 74.2
64 75.8 77.7 74.3 78.4 80.3 81.0 79.9
100 78.8 77.5 75.6 79.8 81.9 81.6 80.6
200 81.3 82.1 78.7 80.8 82.8 83.1 82.2
300 82.0 81.5 79.5 81.1 82.9 82.5 83.2
400 81.0 81.2 79.8 81.3 82.5 82.3 82.4
500 80.5 81.0 78.9 81.3 82.5 81.8 82.3
600 81.3 80.9 78.2 80.9 81.8 81.7 81.9
700 81.2 81.4 78.5 79.3 81.8 82.4 82.4
Bold values indicate the best classification accuracies
Fig. 3 Classification accuracy in the topics space of the Rtr-5251 corpus: comparison between root-based (Khoja0) and lemma-based (BBw0) stemming for topic numbers from 32 to 700
13 More details are available at https://sites.google.com/site/abderrezakbrahmi/.
Furthermore, words can be analyzed by finding their different contexts when training LDA on each collection. Among 100 latent topics, we report in Table 15 the topics relevant to some words. Two senses can be assigned to the first one, [slAm]: either peace or greeting/salutation.
It is clear in Table 15 that the different contexts related to this word imply the first sense, except for the topic "Coal mines in Australia". In fact, the fourth topic in the Xnh-36k dataset is related to another word. The input token [slAmh] has two BBw stemming solutions ([slAm] and [slAmh]). The second solution means safety, which is highly correlated with security and safety requirements in coal mines. However, the vocabularies obtained by Khoja stemming do not contain this entry because the word [slAm] was indexed by its root [slm]. This non-vocalized word induces various senses such as [sul*im, be conceded], [silom, peace] and [sul*am, stairs]. Furthermore, the Khoja algorithm indexes under the same root other lemmas such as [isolam, Islam] and [saliym, correct].
The second word, [mAl], means either money/capital/funds or lean/bend/incline/sympathize. However, Table 15 shows that several specific topics are related to the first sense, including some ways of financing the Al-Qaida organization.
7 Conclusion and future work
For information retrieval tasks, several stemming approaches have been proposed and applied to Arabic texts. The light stemmers try to remove the most common affixes from a word, whereas the morphological analyzers attempt to extract the correct root or lemma. The preference for a particular method depends on the nature of the subsequent IR task. Unfortunately, the literature does not give clear answers for choosing the appropriate stemming method. Using raw text for categorization or topic analysis can also lead to acceptable performance (Brants et al. 2002; Said et al. 2009).
Table 13 The four topics related to the word [mAl, money] in the Echorouk (Ech-11k) dataset
The word [mAl, money] is shown in bold
Two main contributions were presented in this study. Firstly, we proposed the BBw lemma-based stemming with specific text normalization and multiple-lemma indexing. By applying a confusion measure on three real-world corpora, we showed that our stemming approach preserves the semantics embedded in Arabic texts without compromising lexical characterization. Paice's evaluation was used to measure under- and over-stemming errors. The results showed a high effectiveness for our approach: the BBw lemma-based stemmer significantly reduces the vocabulary dimension as well as the under- and over-stemming errors. In addition, classification performance is slightly improved compared to the classification of raw and light-stemmed texts.
For morphological analysis, the three BBw variants were compared to the root-based stemmers (Khoja0 and Khoja1). The two variants tested for the Khoja algorithm revealed gaps in its lexicon (roots and patterns). This limitation was surmounted by adding the unfound words as new vocabulary entries. However, it would be judicious to permanently maintain the linguistic resources.
Secondly, three real-world corpora were tested for Arabic stemming and topic modeling. Tens of thousands of Web-articles were automatically crawled from Echorouk, Reuters and Xinhua. The variety of writing styles allowed us to validate the proposed lemma-based stemmer. For topic modeling evaluation, SVM classification was
Table 14 Distribution of the Reuters categories over eight latent topics

Topic\category Middle-East (%) Business (%) World (%) Sport (%) Entertainment (%) Science (%)
Military actions (USA, Iraq, Afghanistan) 25.4 1.8 21.1 0.8 2.8 2.0
Football 0.6 0.6 0.9 52.3 4.9 1.2
Economy 3.0 75.2 2.8 1.4 5.4 5.9
Health and science 12.3 5.6 17.4 4.0 50.3 59.6
Iranian nuclear 12.4 11.4 28.9 3.3 6.9 26.0
Iranian politics 21.0 3.8 26.9 3.2 22.2 4.1
Tennis 0.3 0.5 0.7 34.4 1.6 0.5
Middle-East 25.0 1.1 1.4 0.6 6.0 0.8
Bold values indicate relevant topics
Table 15 The topics related to two words ([slAm] and [mAl]) in each dataset

Word Ech-11k Rtr-41k Xnh-36k
[slAm] Arab league Somali actors UN missions
  Palestinian dialogue Nobel prize Jordan's peace effort
  – Middle-East negotiations Middle-East negotiations
  – Sudan events Coal mines in Australia
  – – Sudan events
[mAl] Monetary values Arabic gulf investments Financial crisis
  Al-Qaida operations Financial operations Financial operations
  Financial institutions Financial crisis Energy projects
  Government projects Government investments –
successfully tested with various stemming methods. In particular, the proposed BBw stemming approach proved its effectiveness in both text classification and Arabic topic modeling.
Furthermore, LDA topic modeling was applied to Arabic texts. A large investigation was carried out by varying the number of topics. Classification in the topics space was used to assess the performance of the LDA model with the different stemming methods. When training LDA on BBw vocabularies, interpreting the topics was easier than with those obtained from Khoja roots.
It is worth pointing out that it is difficult to see how one could assess semantic aspects of Arabic texts without sufficient linguistic knowledge (Larkey et al. 2002, 2004). This study shows that effective developments in Arabic IR and topic modeling cannot be achieved without close collaboration between computer scientists and Arabic language experts.
As future work, the BBw stemmer will be improved by handling additional irregular forms and extending the lexicons with proper names (person, location and organization). Further effort must be directed toward integrating the topic model with an end-user retrieval system.
Acknowledgments We gratefully acknowledge Professor Patrick Gallinari, director of the LIP6 laboratory (Paris, France), for his valuable advice in achieving this work. We would like to thank the reviewers of the INRT journal for their insightful and constructive comments.
References
Al-Shammari, E. (2010). Lemmatizing, stemming, and query expansion method and system. US Patent 20100082333, April 2010.
Al-Shammari, E., & Lin, J. (2008). A novel Arabic lemmatization algorithm. In Proceedings of the workshop on analytics for noisy unstructured data, Singapore, pp. 113–118.
Blei, D. M., Franks, K., Jordan, M. I., & Mian, I. S. (2006). Statistical modeling of biomedical corpora: Mining the Caenorhabditis Genetic Center bibliography for genes related to life span. BMC Bioinformatics, 7, 250. doi:10.1186/1471-2105-7-250.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Brants, T., Chen, F., & Farahat, A. (2002). Arabic document topic analysis. In LREC-2002 workshop on Arabic language resources and evaluation, Las Palmas, Spain.
Buckwalter, T. (2002). Buckwalter Arabic morphological analyzer version 1.0. Linguistic data consortium,
University of Pennsylvania. LDC catalog no. LDC2002L49.
Darwish, K., Hassan, H., & Emam, O. (2005). Examining the effect of improved context sensitive morphology on Arabic information retrieval. In Proceedings of the ACL workshop on computational approaches to Semitic languages, Ann Arbor, Michigan, pp. 25–30.
Frakes, W. B. (2003). Strength and similarity of affix removal stemming algorithms. SIGIR Forum, 37(1), 26–30.
Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence, pp. 289–296.
Kadri, Y., & Nie, J. (2006). Effective stemming for Arabic information retrieval. In The challenge of Arabic for NLP/MT, international conference at the British Computer Society (BCS), London, UK, pp. 68–74.
Khoja, S., & Garside, R. (1999). Stemming Arabic text. Technical report. Computing Department, Lancaster
University, Lancaster.
Larkey, L. S., Ballesteros, L., & Connell, M. E. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of SIGIR 2002, Tampere, Finland, pp. 275–282.
Larkey, L. S., & Connell, M. E. (2001). Arabic information retrieval at UMass in TREC-10. In TREC 2001, Gaithersburg, Maryland, USA, pp. 562–570.
Larkey, L. S., Feng, F., Connell, M. E., & Lavrenko, V. (2004). Language-specific models in multilingual topic tracking. In Proceedings of SIGIR 2004, Sheffield, UK, pp. 402–409.
Moukdad, H. (2006). Stemming and root-based approaches to the retrieval of Arabic documents on the Web. Webology, 3(1), article 22.
Oard, D. W., & Gey, F. (2002). The TREC-2002 Arabic/English CLIR track. In TREC 2002 notebook, pp. 81–93.
Paice, C. D. (1996). Method for evaluation of stemming algorithms based on error counting. Journal of the American Society for Information Science, 47(8), 632–649.
Said, D., Wanas, N., Darwish, N., & Hegazy, N. (2009). A study of text preprocessing tools for Arabic text classification. In Proceedings of the 2nd international conference on Arabic language resources and tools, Cairo, Egypt, pp. 230–236.
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In Handbook of latent semantic analysis. Mahwah, NJ: Lawrence Erlbaum.
Taghva, K., Elkoury, R., & Coombs, J. (2005). Arabic stemming without a root dictionary. In Proceedings of the international conference on information technology: Coding and computing, Vol. 1, pp. 152–157.
Tuerlinckx, L. (2004). La lemmatisation de l'arabe non classique. In JADT 2004, Journées internationales d'Analyse statistique des Données Textuelles, pp. 1069–1078.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York, NY, USA: Springer-Verlag.