8/3/2019 Arabic Text Search
1/21
Arabic texts analysis for topic modeling evaluation
Abderrezak Brahmi Ahmed Ech-Cherif Abdelkader Benyettou
Received: 12 September 2010 / Accepted: 23 May 2011 Springer Science+Business Media, LLC 2011
Abstract Significant progress has been made in information retrieval covering text semantic
indexing and multilingual analysis. However, developments in Arabic information retrieval
did not follow the extraordinary growth of Arabic usage in the Web during the ten last years. In
the tasks relating to semantic analysis, it is preferable to directly deal with texts in their original
language. Studies on topic models, which provide a good way to automatically deal with
semantic embedded in texts, are not complete enough to assess the effectiveness of the
approach on Arabic texts. This paper investigates several text stemming methods for Arabictopic modeling. A new lemma-based stemmer is described and applied to newspaper articles.
The Latent Dirichlet Allocation model is used to extract latent topics from three Arabic real-
world corpora. For supervised classification in the topics space, experiments show an
improvement when comparing to classification in the full words space or with root-based
stemming approach. In addition, topic modeling with lemma-based stemming allows us to
discover interesting subjects in the press articles published during the 20072009 period.
Keywords Arabic stemming Topic model Linguistic analysis Classification Test collections
1 Introduction
Arabic is one among the six official languages of the United Nations organization where a
considerable work has been done to develop the multilingual United Nations Bibliographic
A. Brahmi (&)
Department of Computer Science, University of Abdelhamid Ibn Badis,
BP: 188, Mostaganem, Algeria
e-mail: [email protected]
A. Ech-Cherif (&) A. BenyettouDepartment of Computer Science, USTO-MB, BP: 1505, Oran, Algeria
e-mail: [email protected]
A. Benyettou
e-mail: [email protected]
Inf Retrieval
DOI 10.1007/s10791-011-9171-y
Information System Thesaurus. This UNBIS Thesaurus is used in subject analysis of
documents and other materials relevant to the United Nations programs and activities. In
addition, Arabic is one of the top ten languages on the Internet. For a global population of 350 million in the Arab world, Internet World Stats1 has reported the highest growth rate for Arabic Internet users, 2,501.2%, over the period 2000–2010.
Unfortunately, developments in Arabic information retrieval (IR) did not follow this
extraordinary growth. For most of the studies in the different IR tasks (such as categori-
zation, clustering and search), English was used as the main language. However, when
switching to Arabic, two approaches have been adopted for evaluating IR methods: either by retaining English as a pivot language, using parallel corpora in a cross-language context, or by processing the original Arabic text and analyzing the IR methods in a
mono-language context. Although the first approach allows an equitable evaluation, it
depends on the availability and the quality of parallel corpora. In the second approach, the
evaluation of IR methods requires standard corpora and appropriate linguistic prepro-
cessing. This approach becomes more interesting for the IR tasks with semantic analysis. It
avoids the loss of meaning caused by translation from languages with high inflectional
morphology such as Arabic (Oard and Gey 2002; Larkey et al. 2004). Unfortunately,
Arabic IR, including stemming methods, did not receive sufficient standard evaluation.
In this context, three major challenges face the developments of the Arabic IR and
generalization of existing methods on Arabic texts: (1) How to efficiently extract a good
stem from a morpheme that implies several segmentations and senses? (2) How to apply a
topic model to capture the semantics embedded in Arabic texts? (3) How to make Arabic resources for IR tasks more accessible, to benefit from the various developments in a non-commercial context?
This work aims to answer the first two questions raised above. On the one hand, a
lemma-based stemming approach is proposed and compared with other Arabic stemmers.
On the other hand, the Latent Dirichlet Allocation (LDA) model is used to extract Arabic
topics from newspaper articles. As regards the third question, our experiments are conducted on three real-world corpora automatically crawled from the Web. All results and
resources will be made freely available for the research community.
This paper presents first related works on Arabic stemming and topic modeling. Then,
the lemma-based stemmer is described and evaluated with other approaches for Arabic text
analysis. Afterwards, the generative process for LDA topic modeling is illustrated. Before
presenting the results of our experiments, three datasets of newspaper articles are described. Finally, we discuss the main results and conclude our study.
2 Related works
2.1 Arabic stemming
Among the successful approaches for Arabic stemming, a root-based stemmer has been
developed by (Khoja and Garside 1999). Based on predefined root lists and morphological analysis, the Khoja algorithm attempts to extract the true root. However, more than one root
can be found in an isolated word without diacritics. Although the Khoja2 stemmer has not
1 Internet World Stats, Usage and population statistics (Miniwatts Marketing Group). Updated for June 30, 2010, last visited in August 2010. http://www.internetworldstats.com/.
2 Freely available at http://www.zeus.cs.pacificu.edu/shereen/ArabicStemmerCode.zip.
been maintained since its first publication, it has been widely used and analyzed in later
works. For instance, the Al-Shammari lemma-based stemmer incorporated the Khoja algorithm for verb stemming (Al-Shammari and Lin 2008). The authors successfully combined light stemming, root stemming and dictionary lookup. In addition to its effectiveness in the clustering task, the Al-Shammari algorithm outperformed the Khoja and light stemmers in terms of over-stemming evaluation (Al-Shammari 2010).
For light stemming, several variants have been developed (Larkey et al. 2002). When applied to the AFP_ARB corpus, a light stemmer was found to be more effective for cross-language retrieval than a morphological stemmer. The authors deduced that it is
not essential for a stemmer to yield the correct root. Surprisingly, in a technical report
(Larkey and Connell 2001), the authors claim that these results, either in mono-lingual or
cross-language retrieval, were obtained with no prior experience with Arabic. Another
study confirmed the same result, favoring light stemming for Arabic retrieval tasks
(Moukdad 2006).
On the contrary, Brants et al. reported that, whether by stemming or using full forms, they obtained the same performance for document topic analysis (Brants et al. 2002).
A recent study on Arabic text categorization has highlighted this contradiction in the literature and attempted to analyze various stemming tools (Said et al. 2009). In (Darwish
et al. 2005), the authors showed that using context to improve the root extraction process
may enhance the IR process. However, the context root extraction is computationally
expensive compared with the light and root stemming. Similar to Khoja but without a root
dictionary, a good light stemmer was developed by (Taghva et al. 2005). The authors found
that stem lists are not required in an Arabic stemmer. They deduced that finding the true
grammatical root of a term should not be the goal of a stemmer for document retrieval.
Compared to English and other languages, research relating to Arabic text stemming is fairly limited (Taghva et al. 2005). The main efforts to build efficient Arabic IR
systems have been achieved in a commercial framework. The approaches used for these
systems as well as their performance accuracy are not known. As a significant example, the Siraj system3 from Sakhr classifies Arabic text and extracts named entities with human-satisfying responses. However, no technical documentation explains the method used or the system evaluation.
2.2 Topic modeling
The LDA model has been introduced within a general Bayesian framework where the
authors developed a variational method and EM algorithm for learning the model from
collections of discrete data (Blei et al. 2003). The authors applied their model to document
modeling, text classification and collaborative filtering. For document modeling, they
trained a number of latent variable models, including LDA, on two text corpora to compare their generalization performance, as
measured by likelihood of held-out test data. Based on different datasets with various
document numbers and vocabulary sizes, experiments show that the LDA model outper-
forms other models such as the unigram model and pLSI.
In (Blei et al. 2006), the LDA model was tested on CGC Bibliography items. Experiments
show that LDA had better predictive performance than two standard models (unigram and
mixture of unigrams). For the text classification problem, an SVM was trained on the low-dimensional representations produced by LDA from unlabeled documents. The authors
3 http://www.siraj.sakhr.com/.
conducted two binary classification experiments using the Reuters-21578 dataset. They
achieved a similar performance compared with SVM classification based on the full-word space.
It is worth pointing out that most datasets used for LDA evaluation are freely available and include a few thousand English documents (sometimes up to 20,000) with some 30,000 unique words. This was considered sufficient for analyzing and assessing the model performance, but this is not the case for topic modeling in other languages such as Arabic.
Since the original introduction of the LDA model, several contributions have been
proposed. However, few studies, on finding latent topics in Arabic context, have been
identified. In addition to the works related to Arabic topic detecting and tracking (Oard and
Gey 2002; Larkey et al. 2004), a segmentation method that uses Probabilistic Latent Semantic Analysis (Hofmann 1999) has been applied to the AFP_ARB corpus for monolingual Arabic document topic analysis (Brants et al. 2002). In (Larkey et al. 2004), the researchers compared different topic tracking methods. They claimed that it is preferable to use separate languages for building specific topic models. Good topic models
have been obtained when native Arabic stories are available. However, Arabic topic
tracking has not been improved in texts translated from English stories.
In fact, studies on Arabic IR are insufficient, and the few works carried out on topic modeling as well as text stemming lack rigorous evaluation. Considering the high inflectional morphology of Arabic, it seems more appropriate to learn the LDA model in a mono-language context, taking more care with linguistic aspects. However, a large investigation of stemming methods is required to assess Arabic topic modeling on real-world corpora.
3 Arabic text analysis
Unlike the Indo-European languages, Arabic belongs to the Semitic language family. Written from right to left, it includes 28 letters. Although different Arabic dialects are spoken in the Arab world, there is only one form of the written language found in printed works, known as Modern Standard Arabic, herein referred to as Arabic (Kadri and Nie 2006). In addition to its derivational morphology, the main characteristics of the Arabic language that complicate any automatic text analysis are agglutination and non-vocalization.
3.1 Arabic language features
Arabic is a highly inflected language due to its complex morphology. An Arabic word belongs to one of three morpho-syntactic categories: noun, verb or particle. Several works have
used other categories (such as prepositions and adverbs) with no good reason except that
they are taken from English (Larkey et al. 2002; Tuerlinckx 2004; Moukdad 2006).
The lemma is the dictionary entry, which is fully vocalized and relates to any form found in text. In particular, verbs are reduced to the third person masculine singular in the past tense. All nouns and verbs are derived from a non-vocalized root according to one of the Arabic patterns. The root is a linguistic unit carrying a semantic area. It is a non-vocalized word (more general than the lemma) and often consists of only 3 consonants (rarely 4 or 5 consonants) (Kadri and Nie 2006; Tuerlinckx 2004).
3.1.1 Morphological complexity
In Semitic languages, the root is an essential element from which various words may be derived according to specific patterns or schemes. The morphological complexity of Arabic is characterized by inflection and derivation.
Inflection modifies the word to express different grammatical categories for the same meaning, such as gender, number, place or tense. Some irregular inflection schemes do not simply prefix or suffix the root; they also apply infixation and complex affixation processes. The following examples illustrate Arabic inflection with the irregular plural (called the broken plural):
from the root [Elm]: the plural of [Eilm, science] is [Eulum, sciences];
from the root [ktb]: the plural of [kitAb, book] is [kutub, books].
Derivation, by contrast, is a root affixation process generating a new word with a different meaning but generally in the same semantic area. Examples of verb derivation from the root [Elm] are [aEolam, notify/inform] and [{isotaEolam, inquire].
3.1.2 Agglutination
In Arabic text, a lexical unit is not easily identifiable from a graphic unit (a word delimited by space characters or punctuation marks). The morphological affixation process becomes more complicated when extra affixes are agglutinated to a lemma. Indeed, a word can be extended by attaching four kinds of affixes (antefix, prefix, suffix and postfix). Table 1 shows an example of an agglutinated and inflected word, [wayaEolamuwnahu], where various kinds of affixes are attached to the core form [Elm].
This situation creates high ambiguity when extracting the right core (stem) from an agglutinated form. In non-vocalized texts, morphological analysis becomes even more difficult, as illustrated in Table 1. Other agglutinative languages exist, such as Japanese, Turkish and Finnish, but their non-vocalization problem is not as complicated as in Arabic.
3.1.3 Vocalization
Arabic words are vocalized with diacritics (short vowels), but unfortunately full or partial vocalization is found only in didactic documents or in Koranic text. This fact accentuates the ambiguity of words and requires every automatic analyzer to pay close attention to morphology and word context. In Table 2, the non-vocalized word [bsm] yields more than one segmentation with different meanings, mainly due to the diacritics missing from an agglutinated form.
Table 1 Segmentation of an Arabic agglutinated form meaning "and they know it"

           Antefix   Prefix     Lemma      Suffix     Postfix
           [wa]      [ya]       [Eolamu]   [wna]      [hu]
           and       they 1/2   know       they 2/2   it

(The antefix and postfix come from agglutination; the prefix, lemma and suffix carry the inflection.)
3.2 Stemming methods
Stemming is a process of conflating inflected or derived words to a unique base stem. It is an important way to reduce the collection vocabulary. In addition, stemming avoids dealing with the same word as different index entries. Two classes of Arabic stemming methods can be identified: (1) light stemmers, which remove the most common affixes, and (2) morphological analyzers, which extract each core (root or lemma) according to a scheme.
3.2.1 Light stemmers
Light stemmers truncate a reduced list of affixes from a word without trying to find roots. The effectiveness of this approach depends on the content of the prefix and suffix lists. Whereas in English a stem can be found mainly by removing conjugation suffixes, in Arabic texts we have to deal with ambiguous agglutinated forms that imply several morphological derivations. An analysis of such an approach can be found in (Larkey et al. 2002). The ISRI stemmer is another example of light stemming (Taghva et al. 2005). Without a root dictionary, the ISRI4 algorithm uses affix lists and the most common patterns to extract roots. Nevertheless, it keeps a normalized form for unfound stems.
This kind of stemmer can effectively deal with most practical cases, but in some of them the right word is lost. For example, in the word [wafiy], one reader can see two agglutinated prepositions, meaning "and in", while another will see a noun meaning "faithful/complete".
3.2.2 Morphological analyzers
In morphological analysis, we try to extract more complete forms according to knowledge of vocalization variation and derivation patterns. Two categories of analyzers can be distinguished according to the nature of the desired output unit: (1) root-based stemmers and (2) lemma-based stemmers. The choice between the two approaches depends on how the stemming results will subsequently be used, in IR tasks or in language modeling.
In the first category, the Khoja stemmer, which attempts to find the root of Arabic words, was proposed in (Khoja and Garside 1999). A list of roots and patterns is used to determine the right stem. This approach produces abstract roots, which significantly reduces the dimension of the document feature space, but it leads to a confusion of divergent meanings embedded in a unique non-vocalized stem. For example, stemming the word cited above in Table 1 should yield a root-stem whose possible meanings include the verbs "to know" or "to teach"; however, the same root can also mean the noun "flag".
Table 2 Four possible solutions for the word [bsm]

Solution   Morphology    Vocalization      English meaning
1          Noun          [basom]           smiling
2          Verb          [basam]           smile
3          Prep + noun   [bi] + [somi]     in/by (the) name of
4          Prep + noun   [bi] + [sam]      by/with poison
4 http://www.nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.isri-pysrc.html.
In the second category, a lemma-based stemmer has been developed and compared to the Khoja stemmer (Al-Shammari and Lin 2008). The authors combined light stemming with the Khoja algorithm for antefix removal and verb stemming before processing the remaining words as nouns. They use a stop list exceeding 2,200 words, with verb and noun dictionaries as linguistic resources. In addition to a clustering performance comparison, they used a collection of concept groups for under- and over-stemming evaluation. Unfortunately, neither these resources nor the test collections are available. As shown in Table 2, an Arabic word can induce more than one stem, and so more than one lemma. The above approaches do not support this aspect.
For this purpose, a set of Arabic lexicons5 has been developed with rules for legal combinations of lemma-stems and affix forms (Buckwalter 2002). P. Brihaye has developed AraMorph,6 a Java package for Arabic lemmatization based on the Buckwalter Arabic morphological analyzer. Several stemming solutions can be proposed for each word. From this analyzer, one can develop, under some considerations, a lemma-based stemmer. This approach is described hereafter.
3.3 Lemma-based stemmer
We propose an algorithm for a lemma-based stemmer called the Brahmi-Buckwalter Stemmer, referred to henceforth as BBw. Based on the resources of the Buckwalter morphological analyzer, two main contributions can be reported for the BBw stemmer: (1) normalization preprocessing and (2) stem selection with morphological
analysis.
3.3.1 Normalization
This step normalizes the input text; the obtained list of tokens is then processed by the Buckwalter morphological analyzer:

- Convert to UTF-8 encoding.
- Tokenize the text respecting standard punctuation.
- Remove diacritics and tatweel.
- Remove non-Arabic letters and stop-words.
- Replace initial alef with hamza (above or below) by bare alef.
- Replace final waw or yeh carrying hamza by hamza.
- Replace alef-maddah or alef-waslah by bare alef.
- Replace two consecutive bare alefs by alef-maddah.
- Replace final teh marbuta by heh.
- Remove final yeh when the remaining stem is valid.
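The steps above can be sketched in a few regular-expression substitutions. This is our minimal illustration using the Unicode codepoints of the Arabic characters named in the list; the exact character set, rule ordering and function names are our assumptions, not the published BBw specification:

```python
import re

# Arabic Unicode codepoints for the characters named in the steps above
# (our assumption of the rule set, not the published BBw specification).
DIACRITICS = re.compile('[\u064B-\u0652]')   # tanwin, short vowels, shadda, sukun
TATWEEL = '\u0640'

def normalize(token, stopwords=frozenset()):
    """Sketch of the BBw normalization steps for one raw token."""
    t = DIACRITICS.sub('', token).replace(TATWEEL, '')
    t = re.sub('[^\u0621-\u064A]', '', t)        # drop non-Arabic characters
    if t in stopwords:
        return ''
    t = re.sub('^[\u0623\u0625]', '\u0627', t)   # initial alef with hamza -> bare alef
    t = re.sub('[\u0624\u0626]$', '\u0621', t)   # final waw/yeh with hamza -> hamza
    t = t.replace('\u0622', '\u0627').replace('\u0671', '\u0627')  # maddah/waslah -> alef
    t = t.replace('\u0627\u0627', '\u0622')      # two bare alefs -> alef-maddah
    t = re.sub('\u0629$', '\u0647', t)           # final teh marbuta -> heh
    return t
```

For instance, a token carrying an initial alef-hamza and a short vowel is reduced to its bare, diacritic-free form before being handed to the morphological analyzer.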
3.3.2 Stem selection
When an input token (in-token) is processed by the Buckwalter morphological analyzer, three cases can be reported: (1) A unique solution is given according to a specific pattern.
5 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49.
6 http://www.nongnu.org/aramorph/english/index.html.
(2) Multiple solutions are found corresponding to several patterns and lexicon entries. (3) No solution can be attributed to the in-token. The actions that the BBw stemmer must undertake in each case are detailed below.
Unique solution. The BBw stemmer retains only the non-vocalized lemma-stem of the solution (without affixes). A solution without a noun or verb lemma (i.e., one containing only particles) is ignored, and the in-token is then considered a stop-word.
Multiple solutions. The BBw stemmer treats all the proposed solutions as a set of separate unique solutions and thus retains all the non-vocalized lemma-stems. Note that eliminating diacritics from lemmas may unify some stems and so reduce the multiplicity of solutions. For example, Table 2 gives four vocalized solutions for the token [bsm], but after removing diacritics from the output lemmas, the BBw stemmer will identify only two confused stems {[bsm], [sm]}. It is worth pointing out that most Arabic proper names can be derived regularly from roots. In this case, multiple solutions, including the in-token, must be considered.
No solution. When no solution can be attributed to the in-token, different reasons can be raised: (1) The in-token is wrong and does not imply any Arabic lemma. (2) The in-token corresponds to a proper name (person, city, etc.) that has no entry in the dictionary. (3) The in-token is a correct Arabic word but is not yet included in the current release of the Buckwalter morphological analyzer.
In this study, we have opted to improve the normalization preprocessing in the BBw algorithm based on the original Buckwalter lexicon. Three stemmer variants (BBw0, BBw1, BBw2) are developed and evaluated on different Arabic datasets. Table 3 summarizes the stem-selection approach of each BBwX stemmer.
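The selection rules of the three variants reduce to a small dispatch. A minimal sketch in Python, assuming the analyzer returns a (possibly empty) list of non-vocalized lemma-stems per in-token; the function name and interface are ours, not part of the published stemmer:

```python
def select_stems(solutions, in_token, variant="BBw2"):
    """Apply the BBwX stem-selection rules (hypothetical interface).

    solutions -- non-vocalized lemma-stems proposed by the analyzer
                 (empty list when no solution was found)
    Returns the set of stems emitted for this in-token.
    """
    stems = set(solutions)            # de-vocalized duplicates collapse here
    if not stems:                     # no-solution case
        return set() if variant == "BBw0" else {in_token}
    if len(stems) > 1 and variant == "BBw1":
        stems.add(in_token)           # BBw1 also keeps the surface form
    return stems
```

With this dispatch, BBw0 silently drops unanalyzable tokens, while BBw1 and BBw2 keep them in the vocabulary; only BBw1 adds the surface form alongside multiple lemma solutions.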
3.3.3 Confusion degree measure
When the morphological analysis of an in-token implies multiple solutions, each BBwX stemmer produces multiple lemma-stems. For a collection (S), we denote by L (L ≠ 0) the total number of in-tokens after stop-word removal, and by L_R the total number of stems obtained when stemming (S) with an algorithm (R). The confusion degree C(S|R) is then defined as:

    C(S|R) = L_R / L    (1)

For example, C(S|Khoja) = 1, since the Khoja stemmer gives at most one stem for each token in any dataset S. This is the ideal situation for a stemming process, but when applying the BBwX stemmers, the confusion degree C increases. We propose the measure C(S|R) for assessing the lexical ambiguity in Arabic texts. A human Arabic reader solves this ambiguity easily, with semantic considerations guided by the context.
For BBwX stemming, note that all possible stems are equitably related to their in-token. At this stage, we have no precise knowledge to select the right stem. Nevertheless, the
Table 3 BBwX outputs versus different cases of the in-token morphological analysis

Stemmer   Unique solution   Multiple solutions      No solution
BBw0      1 lemma-stem      all lemmas              (none)
BBw1      1 lemma-stem      all lemmas + in-token   in-token
BBw2      1 lemma-stem      all lemmas              in-token
relevant solution can be weighted later by a co-occurrence computation in the local context. We think that this will be possible with LDA topic modeling.
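The confusion degree of Eq. (1) is a single ratio over corpus-level counts. A trivial helper, shown here only to fix the convention (the function name is ours):

```python
def confusion_degree(n_stems, n_tokens):
    """Confusion degree C(S|R) = L_R / L from Eq. (1): the average number
    of stems produced per in-token after stop-word removal.  C = 1 is the
    ideal one-stem-per-token case (e.g. the Khoja stemmer); the BBwX
    stemmers yield C >= 1."""
    if n_tokens == 0:
        raise ValueError("L must be non-zero")
    return n_stems / n_tokens
```

For instance, a stemmer emitting 1,300 stems for 1,000 in-tokens has C = 1.3, i.e. 0.3 extra candidate stems per token on average.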
3.4 Stemming evaluation
Although the literature describes various Arabic stemmers, only a few of them have received a standard evaluation (Al-Shammari 2010; Said et al. 2009). One way to assess the effectiveness of stemming algorithms is to evaluate their performance in information retrieval tasks. This requires standard and representative test collections. Nevertheless, it is not certain that good performance in IR tasks results only from stemming quality (Paice 1996; Frakes 2003). Herein, we describe three stemming metrics used in the present work. Such metrics allow assessing some stemming aspects independently of IR task performance.
3.4.1 Index compression
The Index Compression Factor (ICF) represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the stemmer, the greater the index compression factor (Frakes 2003). With N the number of unique words before stemming and S the number of unique stems after stemming, it is calculated as:

    ICF = (N - S) / N    (2)

The ICF has been introduced as a strength measure of stemmer and compression performance. However, vocabulary compression does not imply ideal stemming: a good stemmer is one that stems all words to their correct roots. The following measures address this condition.
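Eq. (2) is a direct vocabulary-size ratio; a small helper makes the convention explicit (the function name is ours):

```python
def index_compression_factor(n_words, n_stems):
    """ICF = (N - S) / N from Eq. (2): the fraction of the unique-word
    vocabulary removed by stemming (N unique words before stemming,
    S unique stems after)."""
    if n_words == 0:
        raise ValueError("N must be non-zero")
    return (n_words - n_stems) / n_words
```

A vocabulary reduced from 10,000 unique words to 4,000 unique stems thus has ICF = 0.6, while a stemmer that changes nothing has ICF = 0.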
3.4.2 Under- and over-stemming
Under-stemming is the failure to conflate morphologically related words. This occurs when two words that should be stemmed to the same root are not. An example of under-stemming would be the words "adhere" and "adhesion" not being stemmed to the same root.
Over-stemming refers to words that should not be grouped together by stemming, but
are. For example, merging the words probe and probable after stemming would
constitute an over-stemming error.
Using a sample file of W grouped words, under-stemming errors are counted as described in (Paice 1996). A concept group contains forms which are both semantically and morphologically related to one another. For each group g containing n_g words, the number of pairs of different words defines the desired merge total (DMT_g):

    DMT_g = 0.5 n_g (n_g - 1)

Since a perfect stemmer should not merge any member of a group with words from other groups, every group also has a desired non-merge total (DNT_g):
    DNT_g = 0.5 n_g (W - n_g)

Summing these two totals over all groups gives the global desired merge total (GDMT) and the global desired non-merge total (GDNT), respectively. Stemming errors are then expressed through the Conflation Index (CI), the proportion of equivalent word pairs which were successfully grouped to the same stem, and the Distinctness Index (DI), the proportion of non-equivalent word pairs which remained distinct after stemming. The under-stemming index (UI) and the over-stemming index (OI) are given by:

    UI = 1 - CI    (3)
    OI = 1 - DI    (4)

In (Paice 1996), the author proposed computing the ratio of these two quantities as a measure of the stemming weight (SW):

    SW = OI / UI    (5)
The premise of Paice's error-counting approach is that, although it is advantageous to have the index of terms compressed, this is only useful up to a point: as conflation becomes heavier, the merging of distinct concepts becomes increasingly frequent. At that point, small increases in recall are gained at the expense of a major loss of precision (Frakes 2003).
One question mark over this approach concerns the validity of the grouped file against which the errors are assessed. These grouped files were constructed by human judgment, during scrutiny of sample word lists (Paice 1996; Frakes 2003). For Arabic stemming evaluation, Al-Shammari selected a sample of 419 words and divided them into 81 conceptual groups (i.e., close to 5 words per group). Compared to the Khoja and light stemmers, Al-Shammari's lemmatizer reduced over-stemming errors. However, no effective improvement was achieved for under-stemming counts (Al-Shammari 2010).
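The Paice counts above can be computed directly from a grouped sample file. A minimal sketch, assuming a list of concept groups and any word-to-stem function (both names are ours):

```python
from itertools import combinations

def paice_errors(groups, stem):
    """Compute UI, OI and SW (Eqs. 3-5) over a grouped sample file.

    groups -- list of concept groups, each a list of related words
    stem   -- function mapping a word to its stem
    """
    merged, gdmt = 0, 0       # within-group pairs actually merged / GDMT
    distinct, gdnt = 0, 0     # cross-group pairs kept distinct / GDNT
    for i, g in enumerate(groups):
        for a, b in combinations(g, 2):       # equivalent pairs (DMT_g of them)
            gdmt += 1
            merged += stem(a) == stem(b)
        for h in groups[i + 1:]:              # non-equivalent pairs
            for a in g:
                for b in h:
                    gdnt += 1
                    distinct += stem(a) != stem(b)
    ui = 1 - merged / gdmt                    # under-stemming index, Eq. (3)
    oi = 1 - distinct / gdnt                  # over-stemming index, Eq. (4)
    sw = oi / ui if ui else float("inf")      # stemming weight, Eq. (5)
    return ui, oi, sw
```

Counting each unordered cross-group pair once matches the summed DNT_g totals, since the sum of 0.5 n_g (W - n_g) over all groups equals the number of distinct cross-group pairs.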
4 LDA topic model
Latent Dirichlet allocation (LDA) is a generative topic model for text documents (Blei
et al. 2003). Based on the classical bag of words assumption, a topic model considers
each document as a mixture of topics where a topic is defined by a probability distribution
over words.
The distribution over words within a document (d) is given by:

    P(w_i | d) = sum_{j=1..T} P(w_i | z_i = j) P(z_i = j | d)

where P(w|z) defines the probability distribution over words w given topic z, and P(z|d) refers to the distribution over topics z in a document. More details and interpretations of topic models can be found in (Blei et al. 2003; Steyvers and Griffiths 2007).
For a given number of topics T, the LDA model is trained from a collection of documents defined as follows:

- N: number of words in the vocabulary
- M: number of documents in the corpus
- T: number of topics, given as an input value
- P(z|d): distribution over topics z in a particular document
- P(w|z): probability distribution over words w given topic z

The generative process is then defined as follows. For each document d = 1 to M (in the dataset):

1. Sample the mixing probability theta_d ~ Dir(alpha)
2. For each word position i in document d:
   (a) Choose a topic z_di in {1, ..., T} ~ Multinomial(theta_d)
   (b) Choose a word w_di in {1, ..., N} ~ Multinomial(beta_{z_di})

where alpha is a symmetric Dirichlet parameter and the {beta_i} are multinomial topic parameters. Each beta_i assigns a high probability to a specific set of words that are semantically related. This distribution over the vocabulary is referred to as a topic. In the present work, we use the LingPipe7 LDA implementation, which is based on Gibbs sampling for parameter estimation.
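The generative process above can be sketched directly. The following toy sampler is our own illustration (not the LingPipe implementation, which goes the other way and estimates the parameters by Gibbs sampling); a symmetric Dirichlet draw is obtained by normalizing independent Gamma(alpha, 1) variates:

```python
import random

def generate_document(n_words, topic_word, alpha=0.1, rng=random):
    """Sample one document from the LDA generative process (sketch).

    topic_word -- list of T topic distributions, each a list of N word
                  probabilities (the beta_i parameters)
    alpha      -- symmetric Dirichlet parameter for theta_d
    Returns a list of word indices.
    """
    T = len(topic_word)
    # theta_d ~ Dir(alpha): normalize T independent Gamma(alpha, 1) draws
    g = [rng.gammavariate(alpha, 1.0) for _ in range(T)]
    theta = [x / sum(g) for x in g]
    words = []
    for _ in range(n_words):
        z = rng.choices(range(T), weights=theta)[0]   # topic for this position
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]
        words.append(w)
    return words
```

With a small alpha, theta_d concentrates most of its mass on a few topics, so each sampled document draws its words mainly from one or two topic distributions.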
The choice of the number of topics can affect the interpretability of the results. A model with too few topics will generally result in very broad topics, while a model with too many topics will be uninterpretable (Steyvers and Griffiths 2007). Since the number of topics (T) is given as an input parameter for training the LDA model, several methods have been proposed to select a suitable T. An evident approach is to choose the T that leads to the best performance on target tasks (classification, clustering, etc.).
5 Building Arabic datasets
As raised above, developments in Arabic IR often face the problem of the unavailability of standard free resources. We have therefore opted to build our own experimental datasets. To this end, we developed a Web crawler8 to collect newspaper articles from several Arabic websites. In this study, we present three real-world corpora based on Echorouk,9 Reuters10 and Xinhua11 Web articles. Each article is saved with UTF-8 encoding in a separate text file whose first line is reserved for its title. A brief description is given in Table 4.
5.1 Datasets description
The Echorouk collection contains 11,313 documents from Echorouk newspaper articles relating to the 2008–2009 period. It is labeled according to eight categories. From the full corpus, Ech-11k, we build a subset, Ech-4000, of 4,000 documents for preliminary evaluations.
7 Available at http://www.alias-i.com/lingpipe/index.html.
8 Developed with Amine Roukh and Abdelkadir Sadouki at the Mostaganem University.
9 Algerian newspaper with online edition at http://www.echoroukonline.com/.
10 International news agency with Arabic online edition at http://www.ara.reuters.com/.
11 Chinese news agency with Arabic online edition at http://www.arabic.news.cn/.
Inf Retrieval
123
The Reuters collection, Rtr-41k, contains 41,251 Arabic documents relating to the 2007–2009 period. It is labeled according to six categories. A subset, Rtr-5251, of 5,251 documents is used for preliminary evaluations.
The Xinhua collection contains 36,696 Arabic documents relating to the 2008–2009 period. It is labeled according to eight categories. A subset, Xnh-4500, of 4,500 documents is used for preliminary evaluations. Table 5 describes the collected datasets with their distributions over published categories.
5.2 Arabic stemming
The three lemma-based stemmers (BBw0, BBw1 and BBw2) were applied to the three datasets described above. For comparison, we use the ISRI algorithm for light stemming and two variants of the Khoja algorithm for root-based morphological analysis. The first one, Khoja0, is the original Khoja algorithm, which outputs only the found roots. The second variant, Khoja1, also adds unfound words to the vocabulary. Furthermore, we use the raw text as a baseline characterization in the preliminary experiments. To align the comparison, the stop-word list used in the BBwX algorithms is removed from the other stemmers' output.
Table 4 Description of three datasets relating to Echorouk, Reuters and Xinhua Web-articles
Feature\dataset Ech-11k Rtr-41k Xnh-36k
# Articles 11,313 41,251 36,696
# Characters 48,247,774 111,478,849 85,696,969
# Tokens 4,388,426 10,093,707 7,532,955
# Arabic tokens 3,341,465 7,892,348 6,097,652
Average # tokens/article 387.9 244.7 205.3
# Categories 8 6 8
Table 5 Distribution of the three datasets over categories
Dataset source Echorouk Reuters Xinhua
Category\dataset Ech-11k Ech-4000 Rtr-41k Rtr-5251 Xnh-36k Xnh-4500
1 World 2,274 572 10,000 1,000 9,465 561
2 Economy 816 572 10,000 1,000 6,862 563
3 Sport 3,554 572 10,000 1,000 1,132 563
4 Middle-East 10,000 1,000 9,822 563
5 Science-health 889 889 1,993 563
6 Culture-education 566 566 1,508 562
7 Algeria 2,722 573
8 Society 808 572
9 Art 315 315
10 Religion 258 258
11 Entertainment 362 362
12 China 4,654 563
13 Tourism-ecology 1,260 562
Total 11,313 4,000 41,251 5,251 36,696 4,500
For each stemmer, the resulting corpus vocabulary sizes are reported in Table 6. As a main observation for the three datasets, the stemming results show that the ISRI algorithm produces high vocabulary dimensions. Close to raw text, ISRI does not provide a significant reduction of the feature space. In contrast, morphological analysis with Khoja and BBw decreases the vocabulary size by unifying tokens that share a common root or lemma. Retaining only the correct forms, the Khoja0 and BBw0 stemmers produce the smallest vocabularies.
In morphological analysis, it is clear that our lemma-based stemmer (BBw0) enhances the feature space compared to a root-based stemmer (Khoja0). However, the vocabularies produced when adding unrecognized tokens (Khoja1 and BBw2) reveal gaps in the underlying lexicons. We anticipated such cases when analyzing the BBwX variants in a previous section, where we discussed the causes of "no solution" outputs. To analyze the loss caused by this lack, we compute the ratio of unfound tokens with respect to all the words (found and unfound). Figure 1 shows that the BBw lexicon is more complete than Khoja's.
To complete our preliminary analysis and assess the performance of the BBwX stemmers when dealing with ambiguous forms, we compute the confusion degree as defined in (1). According to Table 7, the maximum confusion degree in the real-world corpora does not exceed 1.16. This indicates that when our stemming approach includes all multiple solutions in the vocabulary, it preserves all word senses without significant lexical ambiguity.
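To make the measure concrete, the sketch below computes a confusion degree under the assumption (ours, for illustration; the exact formulation is Eq. (1) in the paper) that it is the average number of retained stemming solutions per analyzed token; the analyzer output is hypothetical:

```python
# Hypothetical analyzer output: each token maps to its retained lemma
# solutions; multiple solutions are all kept in the vocabulary (BBw-style).
analyses = {
    "slAmh": ["slAm", "slAmh"],   # ambiguous: peace / safety
    "mAl":   ["mAl"],
    "ktAb":  ["ktAb"],
}

def confusion_degree(analyses):
    """Average number of solutions per token (assumed reading of Eq. (1));
    a value of 1.0 means no token received multiple solutions."""
    return sum(len(sols) for sols in analyses.values()) / len(analyses)

def max_mul(analyses):
    """Maximum number of solutions for any single token (Max-mul in Table 7)."""
    return max(len(sols) for sols in analyses.values())

print(confusion_degree(analyses), max_mul(analyses))
```

Under this reading, a corpus-level value near 1.1, as in Table 7, would mean roughly one extra solution per ten tokens.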
5.3 Stemming evaluation
The ICF and Paice metrics have been tested on the three datasets to analyze stemmer quality. For evaluation fairness, the index compression factor has been calculated after stop-word removal, as applied in the BBwX stemmers. In lemma-based stemming, a larger stop-list of 575 words is used. Table 8 summarizes the ICF values computed as defined in (2). It shows that the Khoja0 root-stemmer and the BBw0 lemma-stemmer achieve the best compression over the three datasets. However, light stemming with ISRI did not give a significant index reduction. Because they keep unknown tokens as proper stems, the other variants of Khoja and BBw reduced the index only slightly.
For Paice's evaluation, the main difficulty is building representative concept groups.
concept groups. Al-Shammari has used a moderate set of 81 groups for comparing her
lemmatizer to Khoja and light Arabic stemmers. Unfortunately, no test collections orconcept groups used in experiments were available (Al-Shammari 2010).
In this study, we have built a large real-world concept-groups by processing the article
titles in the three datasets (Ech-11k, Rtr-41k and Xnh-36k). After stop-words removal andpreliminary grouping, we have selected the groups that contained 10 words at least. Then,
we have revised the resulting groups with an Arabic expert before retaining a collection of
13,142 words distributed on 689 groups. Because all the selected words are recognized by
Khoja and BBw stemmers, it was not necessary to calculate stemming errors for the othervariants (KhojaX, BBwX). Table 9 gives results of the Paices evaluation.
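Paice's counting scheme can be sketched as follows, assuming the standard definitions from Paice (1996): the under-stemming index UI is the fraction of within-group word pairs not conflated to one stem, the over-stemming index OI is the fraction of cross-group pairs wrongly conflated, and SW = OI/UI; the concept groups and the truncating "stemmer" below are toy stand-ins, not our Arabic resources:

```python
from itertools import combinations

def paice_indexes(groups, stem):
    """Over-/under-stemming indexes of a stemmer against concept groups
    of semantically related words (assumed Paice 1996 definitions)."""
    gdmt = gumt = gdnt = gwmt = 0
    # Within-group pairs should conflate to the same stem.
    for group in groups:
        for w1, w2 in combinations(group, 2):
            gdmt += 1
            if stem(w1) != stem(w2):
                gumt += 1          # unachieved merge (under-stemming)
    # Cross-group pairs should keep distinct stems.
    for g1, g2 in combinations(groups, 2):
        for w1 in g1:
            for w2 in g2:
                gdnt += 1
                if stem(w1) == stem(w2):
                    gwmt += 1      # wrongly merged (over-stemming)
    ui, oi = gumt / gdmt, gwmt / gdnt
    return oi, ui, (oi / ui if ui else float("inf"))

# Toy groups and a naive 3-character truncation "stemmer" for illustration.
groups = [["run", "running", "ran"], ["runt", "runts"]]
oi, ui, sw = paice_indexes(groups, lambda w: w[:3])
print(oi, ui, sw)
```

A perfect stemmer would drive both OI and UI to zero; the truncation stemmer above both misses merges ("ran") and wrongly merges across groups.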
Table 6 The vocabulary sizes of three datasets according to different stemmers
Vocabulary Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
Ech-11k 179,225 144,442 3,172 22,411 17,558 60,148 48,789
Rtr-41k 183,510 150,977 3,153 38,644 17,020 73,382 62,843
Xnh-36k 175,410 140,192 3,089 10,712 11,352 25,835 22,456
The results show that our lemma-based stemmer provides the lowest over-stemming and under-stemming indexes (OI and UI). This is a significant performance compared to the Khoja and ISRI stemmers. Recall that the BBw stemmer can generate multiple stems for an input token. Theoretically, this increases the under-stemming errors, but empirically the results show that BBw improves sense distinctness.
6 Experiments and results
The experiments performed for text categorization and topic modeling are described and analyzed in this section. The support vector machine (SVM) is a kernel-based method introduced by Vapnik (1995) for binary and multi-class classification. In this work, the LIBSVM12 package is applied for multi-class text categorization. For SVM training, a simple linear kernel is used with the cost parameter set to 10. For performance evaluation, five-fold cross-validation is performed on each feature space of the datasets.
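This evaluation protocol can be sketched with scikit-learn, whose SVC class wraps LIBSVM; the synthetic feature matrix stands in for a real document representation (full-word or topic space):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a document-feature matrix with 4 category labels.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Linear kernel with cost parameter C = 10, as in the experiments;
# SVC handles multi-class problems via one-vs-one internally.
clf = SVC(kernel="linear", C=10)

scores = cross_val_score(clf, X, y, cv=5)   # five-fold cross-validation
print(scores.mean())
```

The mean of the five fold accuracies is the figure reported in the tables below.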
Fig. 1 The unfound-token rate for the two analyzers, Khoja and BBw, on the Ech-11k, Rtr-41k and Xnh-36k datasets (rates range up to about 14%)
Table 7 Confusion degree (Conf-deg) relating to the BBwX stemmers
Dataset Ech-11k Rtr-41k Xnh-36k
Stemmer Conf-deg Max-mul Conf-deg Max-mul Conf-deg Max-mul
BBw0 1.118 5 1.129 6 1.121 5
BBw1 1.146 6 1.155 6 1.141 6
BBw2 1.106 5 1.116 5 1.105 5
Max-mul refers to the maximum number of multiple solutions
Table 8 Comparison of index compression factors for different stemmers
Dataset\stemmer ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
Ech-11k 0.194 0.982 0.875 0.902 0.664 0.728
Rtr-41k 0.177 0.983 0.789 0.907 0.600 0.658
Xnh-36k 0.201 0.982 0.939 0.935 0.853 0.872
12 LIBSVM-2.89 is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm+zip.
It is worth pointing out that: (1) for each dataset, six stemmers (ISRI, Khoja0, Khoja1, BBw0, BBw1 and BBw2) are applied for categorization and topic modeling, (2) raw text is used as a baseline, and (3) for topic modeling, all datasets are used as unlabeled collections.
6.1 Classification in full word space
For the basic word-space definition, the TF and TF-IDF measures are applied to the stemmed datasets. TF_t refers to the term frequency of a term (t) in its document. The inverse document frequency (IDF_t) measures the general importance of the term over a set of documents (D). In this work, we use the simple formulation of the TF-IDF measure as follows:

TF-IDF(term t | D) = TF_t × IDF_t = TF_t × log( |D| / |{d : t ∈ d}| )    (6)
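Equation (6) translates directly into code; the mini-corpus below is illustrative:

```python
import math

# Toy stemmed corpus: each document is a list of index terms.
docs = [["mAl", "bnk", "mAl"], ["krh", "qdm"], ["bnk", "aqtSAd"]]

def tf_idf(term, doc, docs):
    """TF-IDF as in Eq. (6): TF_t * log(|D| / |{d : t in d}|)."""
    tf = doc.count(term)                       # term frequency in the document
    df = sum(1 for d in docs if term in d)     # document frequency over D
    return tf * math.log(len(docs) / df)

# "mAl" occurs twice in docs[0] and appears in one of three documents:
print(tf_idf("mAl", docs[0], docs))   # 2 * log(3)
```

A term occurring in every document gets weight zero, since log(|D|/|D|) = 0.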
The three subsets (Ech-4000, Rtr-5251 and Xnh-4500) are used for text categorization in the full word space. For the two term models (TF and TF-IDF) and the various stemming approaches, Fig. 2 gives a preliminary evaluation of text classification. As a main observation, the Khoja stemmers give the weakest performance. With abstract root stemming and an incomplete lexicon, the Khoja0 algorithm degrades text characterization. When unfound tokens are added, the Khoja1 variant slightly improves classification performance. Although the ISRI light stemmer produces huge vocabularies, it seems to give good classification performance in the full word space. The same can be said of raw text. However, the BBwX stemmers improve classification accuracy with a reasonable word-space dimension.
6.2 Classification in topics space
The three subsets (Ech-4000, Rtr-5251 and Xnh-4500), with the various stemming variants, are trained with the LDA algorithm. For several numbers of topics, we report in Tables 10, 11 and 12 the classification accuracy obtained by SVM cross-validation.
According to Tables 10, 11 and 12, we can deduce a suitable number of topics for each stemmed dataset. The best models are obtained when choosing, for LDA training, a number of topics between 100 and 400.
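The selection procedure implied by these tables — train LDA for several values of T, project documents into the topics space, and keep the T with the best cross-validated SVM accuracy — can be sketched as follows (synthetic counts stand in for a stemmed corpus; the candidate T values are illustrative):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic term-count matrix with two crude "themes" so labels are learnable.
n_docs, vocab = 120, 40
labels = rng.integers(0, 2, n_docs)
counts = rng.poisson(1.0, (n_docs, vocab))
counts[labels == 1, : vocab // 2] += rng.poisson(
    3.0, (int((labels == 1).sum()), vocab // 2))

accuracy = {}
for T in (5, 10, 20):                              # candidate topic numbers
    theta = LatentDirichletAllocation(
        n_components=T, random_state=0).fit_transform(counts)
    accuracy[T] = cross_val_score(
        SVC(kernel="linear", C=10), theta, labels, cv=5).mean()

best_T = max(accuracy, key=accuracy.get)           # T with best CV accuracy
print(best_T, accuracy)
```

On the real corpora this sweep is what produces Tables 10–12; here it merely illustrates the mechanics of the selection.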
Considering the stemming methods, the preliminary experiments show that the Khoja stemmers give the lowest classification performance. Unexpectedly for linguists, the results show that with raw text, topic modeling can perform as well as with morphological analysis. Recall that light stemming (ISRI) also generated large vocabularies in which different entries should have been conflated into a single Arabic stem.
Focusing on the morphological analyzers, the experiments show that the BBw lemma-based stemmers improve classification in the topics space compared to the Khoja root-based stemmers. For the Reuters collection, as an example, we highlight in Fig. 3 the difference between the
Table 9 Paice's evaluation for three Arabic stemmers

Stemmer OI UI SW
ISRI 0.019 × 10^-3 0.968 0.019 × 10^-3
Khoja 1.452 × 10^-3 0.098 14.87 × 10^-3
BBw 0.006 × 10^-3 0.060 0.096 × 10^-3
stemmers that retain only the correct Arabic forms. It is clear that lemma-based stemming enhances LDA modeling even with a low number of topics.
6.3 Finding topics in newspaper articles
In this section, we illustrate some results of LDA modeling in Arabic texts. The BBw2algorithm was used for stemming the three complete datasets (Echorouk, Reuters and
Xinhua). Note that a latent topic can be titled with human assessment according to itsrelevant terms. For example, we give in Table 13 topics distribution which are mostrelevant to the word [mAl, money] .
In addition, we propose to compute the categories' distribution over the learned topics. A confusion matrix (category\topic) can be obtained by summing the document distributions of the same category. In Table 14, we give an illustration of the categories' distribution over the
Table 10 Classification accuracy in topics space of the Ech-4000 corpus
# Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
32 76.3 77.4 71.7 76.5 79.0 78.9 79.1
64 79.2 79.4 75.4 78.7 79.4 79.7 80.2
100 80.2 80.3 77.3 79.2 80.9 80.0 80.8
200 80.9 81.2 77.2 79.6 80.6 80.6 81.1
300 81.8 80.4 76.9 78.2 80.3 80.8 81.3
400 80.9 80.5 76.7 79.2 80.5 80.5 80.7
500 81.3 80.7 77.3 77.5 80.0 80.9 81.0
600 80.9 80.6 76.0 77.7 80.3 80.9 80.3
700 80.6 79.7 75.6 76.4 79.3 80.1 79.6
Six stemmers are applied, with raw text as a baseline, for several numbers of topics
Bold values indicate the best classification accuracies
Fig. 2 Classification accuracy in the full-word space of the three datasets (Ech-4000, Rtr-5251 and Xnh-4500), for the TF and TF-IDF term models and seven characterizations (Raw, ISRI, Khoja0, Khoja1, BBw0, BBw1, BBw2); accuracies range roughly from 68 to 84%
eight main topics13 from the Reuters dataset. By setting a likelihood threshold at 10%, one can identify the relevant topics in each category. For example, one easily discovers that the main subjects in the sport category during 2007–2009 were football and tennis.
Table 11 Classification accuracy in topics space of the Rtr-5251 corpus
# Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
32 80.0 81.1 77.5 82.3 83.8 83.3 84.2
64 84.2 84.4 80.4 83.9 85.6 84.8 84.6
100 84.3 85.5 82.0 84.6 86.0 85.6 85.3
200 85.7 85.2 82.8 84.9 85.6 85.8 85.8
300 85.6 84.9 83.5 84.7 86.0 85.7 85.7
400 84.9 84.9 83.2 85.0 85.5 85.7 85.6
500 84.2 84.5 82.4 84.4 85.5 85.2 85.5
600 84.4 84.7 81.8 84.9 85.2 85.1 85.4
700 84.3 84.5 81.6 83.7 85.2 85.4 85.4
Bold values indicate the best classification accuracies
Table 12 Classification accuracy in topics space of the Xnh-4500 corpus
# Topics\stemmer Raw ISRI Khoja0 Khoja1 BBw0 BBw1 BBw2
32 67.5 70.1 68.9 74.6 75.3 75.0 74.2
64 75.8 77.7 74.3 78.4 80.3 81.0 79.9
100 78.8 77.5 75.6 79.8 81.9 81.6 80.6
200 81.3 82.1 78.7 80.8 82.8 83.1 82.2
300 82.0 81.5 79.5 81.1 82.9 82.5 83.2
400 81.0 81.2 79.8 81.3 82.5 82.3 82.4
500 80.5 81.0 78.9 81.3 82.5 81.8 82.3
600 81.3 80.9 78.2 80.9 81.8 81.7 81.9
700 81.2 81.4 78.5 79.3 81.8 82.4 82.4
Bold values indicate the best classification accuracies
Fig. 3 Classification accuracy in the topics space of the Rtr-5251 corpus: comparison between root-based (Khoja0) and lemma-based (BBw0) stemming for topic numbers from 32 to 700
13 More details are available at https://sites.google.com/site/abderrezakbrahmi/.
Furthermore, words can be analyzed by finding their different contexts when training LDA on each collection. Among 100 latent topics, we report in Table 15 the topics relevant to some words. Two senses can be assigned to the first one, [slAm]: either peace or greeting/salutation.
It is clear in Table 15 that the different contexts related to this word imply the first sense, except for the topic "Coal mines in Australia". In fact, the fourth topic in the Xnh-36k dataset is related to another word. The input token [slAmh] has two BBw stemming solutions ([slAm] and [slAmh]). The second solution means safety, which is highly correlated with security and safety requirements in coal mines. However, the vocabularies obtained by Khoja stemming do not contain this entry because the word [slAm] was indexed by its root [slm]. This non-vocalized word induces various senses such as [sul*im, be conceded], [silom, peace] and [sul*am, stairs]. Furthermore, the Khoja algorithm indexes under the same root other lemmas such as [isolam, Islam] and [saliym, correct].
The second word, [mAl], means either money/capital/funds or lean/bend/incline/sympathize. However, Table 15 shows that several specific topics are related to the first sense, including some ways of financing the Al-Qaida organization.
7 Conclusion and future work
For information retrieval tasks, several stemming approaches have been proposed and applied to Arabic texts. The light stemmers try to remove the most common affixes from a word, whereas the morphological analyzers attempt to extract the correct root or lemma. The preference for a particular method depends on the nature of the subsequent IR task. Unfortunately, the literature does not give clear answers for choosing the appropriate stemming method. Using raw text for categorization or topic analysis can also lead to acceptable performance (Brants et al. 2002; Said et al. 2009).
Table 13 The four topics related to the word [mAl, money] in the Echorouk (Ech-11k) dataset
The word [mAl, money] is shown in bold
Two main contributions were presented in this study. Firstly, we proposed the BBw lemma-based stemming with specific text normalization and multiple-lemma indexing. By applying a confusion measure on three real-world corpora, we showed that our stemming approach preserves the semantics embedded in Arabic texts without compromising lexical characterization. Paice's evaluation was used to measure under- and over-stemming errors. The results showed a high effectiveness for our approach: the BBw lemma-based stemmer significantly reduces the vocabulary dimension as well as the under- and over-stemming errors. In addition, classification performance is slightly improved compared to the classification of raw and light-stemmed texts.
For morphological analysis, the three BBw variants were compared to the root-based stemmers (Khoja0 and Khoja1). The two variants tested for the Khoja algorithm revealed gaps in its lexicon (roots and patterns). This limitation was surmounted by adding the unfound words as new vocabulary entries. However, it would be judicious to permanently maintain the linguistic resources.
Secondly, three real-world corpora were tested for Arabic stemming and topic modeling. Tens of thousands of Web-articles were automatically crawled from Echorouk, Reuters and Xinhua. The variety of writing styles allowed us to validate the proposed lemma-based stemmer. For topic modeling evaluation, SVM classification was
Table 14 Distribution of the Reuters categories over eight latent topics

Topic\category Middle-East (%) Business (%) World (%) Sport (%) Entertainment (%) Science (%)
Military actions (USA, Iraq, Afghanistan) 25.4 1.8 21.1 0.8 2.8 2.0
Football 0.6 0.6 0.9 52.3 4.9 1.2
Economy 3.0 75.2 2.8 1.4 5.4 5.9
Health and science 12.3 5.6 17.4 4.0 50.3 59.6
Iranian nuclear 12.4 11.4 28.9 3.3 6.9 26.0
Iranian politics 21.0 3.8 26.9 3.2 22.2 4.1
Tennis 0.3 0.5 0.7 34.4 1.6 0.5
Middle-East 25.0 1.1 1.4 0.6 6.0 0.8
Bold values indicate relevant topics
Table 15 The topics related to two words ([slAm] and [mAl]) in each dataset

Word Ech-11k Rtr-41k Xnh-36k
[slAm] Arab league Somali actors UN missions
  Palestinian dialogue Nobel prize Jordan's peace effort
  – Middle-East negotiations Middle-East negotiations
  – Sudan events Coal mines in Australia
  – – Sudan events
[mAl] Monetary values Arabic gulf investments Financial crisis
  Al-Qaida operations Financial operations Financial operations
  Financial institutions Financial crisis Energy projects
  Government projects Government investments –
successfully tested with various stemming methods. In particular, the proposed BBw stemming approach proved its effectiveness in both text classification and Arabic topic modeling.
Furthermore, LDA topic modeling was applied to Arabic texts. A large investigation was carried out by varying the number of topics. Classification in the topics space was used to assess the performance of the LDA model with the different stemming methods. When training LDA on BBw vocabularies, interpreting the topics was easier than with those obtained from Khoja roots.
It is worth pointing out that it is difficult to see how one could assess semantic aspects of Arabic texts without sufficient linguistic knowledge (Larkey et al. 2002, 2004). This study shows that effective developments in Arabic IR and topic modeling cannot be achieved without close collaboration between computer scientists and Arabic language experts.
As future work, the BBw stemmer will be improved by handling additional irregular forms and extending the lexicons with proper names (person, location and organization). Further effort must be directed toward integrating the topic model with an end-user retrieval system.
Acknowledgments We gratefully acknowledge Professor Patrick Gallinari, director of the LIP6 laboratory (Paris, France), for his valuable advice in achieving this work. We would like to thank the reviewers of the INRT journal for their insightful and constructive comments.
References
Al-Shammari, E. (2010). Lemmatizing, stemming, and query expansion method and system. US Patent 20100082333, April 2010.
Al-Shammari, E., & Lin, J. (2008). A novel Arabic lemmatization algorithm. In Proceedings of the workshop on analytics for noisy unstructured data, Singapore, pp. 113–118.
Blei, D. M., Franks, K., Jordan, M. I., & Mian, I. S. (2006). Statistical modeling of biomedical corpora: Mining the Caenorhabditis Genetic Center bibliography for genes related to life span. BMC Bioinformatics, 7, 250. doi:10.1186/1471-2105-7-250.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Brants, T., Chen, F., & Farahat, A. (2002). Arabic document topic analysis. In LREC-2002 workshop on Arabic language resources and evaluation, Las Palmas, Spain.
Buckwalter, T. (2002). Buckwalter Arabic morphological analyzer version 1.0. Linguistic data consortium,
University of Pennsylvania. LDC catalog no. LDC2002L49.
Darwish, K., Hassan, H., & Emam, O. (2005). Examining the effect of improved context sensitive morphology on Arabic information retrieval. In Proceedings of the ACL workshop on computational approaches to Semitic languages, Ann Arbor, Michigan, pp. 25–30.
Frakes, W. B. (2003). Strength and similarity of affix removal stemming algorithms. SIGIR Forum, 37(1), 26–30.
Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence, pp. 289–296.
Kadri, Y., & Nie, J. (2006). Effective stemming for Arabic information retrieval. In The challenge of Arabic for NLP/MT, international conference at the British Computer Society (BCS), London, UK, pp. 68–74.
Khoja, S., & Garside, R. (1999). Stemming Arabic text. Technical report. Computing Department, Lancaster
University, Lancaster.
Larkey, L. S., Ballesteros, L., & Connell, M. E. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of SIGIR 2002, Tampere, Finland, pp. 275–282.
Larkey, L. S., & Connell, M. E. (2001). Arabic information retrieval at UMass in TREC-10. In TREC 2001, Gaithersburg, Maryland, USA, pp. 562–570.
Larkey, L. S., Feng, F., Connell, M. E., & Lavrenko, V. (2004). Language-specific models in multilingual topic tracking. In Proceedings of SIGIR 2004, Sheffield, UK, pp. 402–409.
Moukdad, H. (2006). Stemming and root-based approaches to the retrieval of Arabic documents on the Web. Webology, 3(1), article 22.
Oard, D. W., & Gey, F. (2002). The TREC-2002 Arabic/English CLIR track. In TREC 2002 notebook, pp. 81–93.
Paice, C. D. (1996). Method for evaluation of stemming algorithms based on error counting. Journal of the American Society for Information Science, 47(8), 632–649.
Said, D., Wanas, N., Darwish, N., & Hegazy, N. (2009). A study of text preprocessing tools for Arabic text classification. In Proceedings of the 2nd international conference on Arabic language resources and tools, Cairo, Egypt, pp. 230–236.
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In Handbook of latent semantic analysis. Mahwah, NJ: Lawrence Erlbaum.
Taghva, K., Elkoury, R., & Coombs, J. (2005). Arabic stemming without a root dictionary. In Proceedings of the international conference on information technology: Coding and computing, Vol. 1, pp. 152–157.
Tuerlinckx, L. (2004). La lemmatisation de l'arabe non classique. In JADT 2004, Journées internationales d'Analyse statistique des Données Textuelles, pp. 1069–1078.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York, NY, USA: Springer-Verlag.