Empirical Analysis of Affix Removal Stemmers - IJCTA · Empirical Analysis of Affix Removal Stemmers . ... using “-ies->i” suffix rule. ... With the help of some words,

Empirical Analysis of Affix Removal Stemmers

Ms. Rupam Gupta

Assistant Professor,MCA Department

SVIT, Vasad

Gujarat Technological University

[email protected]

Dr. Anjali Ganesh Jivani

Head, Department of Computer Science &

Engineering

The M. S. University of Baroda

[email protected]

Abstract

This paper contains analysis of four different affix-

removal stemmers after empirically executing them on different text data. The stemmers were Porter, Lovins,

Paice and Krovetz-Stemmer. Each algorithm’s

individual step’s and its functionality shows how each

one identifies affix for removing and recoding the

words to generate stem term. The strength of individual

stemmer and computational time are also calculated

here. This paper also focuses on variation in stem

terms generation by these four stemmers from same

input data-set. All set of stem terms shown in this

paper, are created by execution of our comparative

stemming tool implemented in java using standard

input data-set.

Keywords: stemming, Index Compression Factor

(ICF), suffix, Porter, Lovins, Paice, Krovetz

Stemmer(KStemmer)

1. Introduction

English text is a collection of words including stop

words where many words used in the text, are

morphological variants of a single basic root word e.g.

dove/dive, acceptance/acceptable/accept, ability

/abilities [1]. Stemming techniques help to reduce different semantically similar variant forms of a single

root word into a single stem term. Stemming

algorithms are applicable in natural language text

analysis e.g. summarization, categorization and any

information retrieval search-engine. If the searcher is

entering query “symbol of maths”, Search Engine (SE)

displays “mathematical symbols”, “symbol in

mathematics”, “symbolic maths”. If query is

“competition in Bank exam”, SE displays “competitive

exam of bank”. So, stemming is beneficial for

decreasing size of index files used in IR. Instead of

storing individual word, single stem term for all

morphological variants of a single root word is stored

and hence compression factor of over 50 percent is

achieved. Efficiency of any stemming algorithm is

measured based on degree of closeness in-between

stem term and original root morpheme of the word

“word” which to be stemmed and computational time

taken by different stemmer. This paper is focused on

generation of large variety of stem words using various

affix removal stemmers for some DUC standard text, some set of morphed word list

(http://www.comp.lancs.ac.uk/computing/research/stem

ming/Links/resources.htm) and also on computational

time variation in-between different stemmers. In this

paper, we address the working process of each step of

four affix removal stemmers Porter, Lovins, Paice,

KStemmer using above mentioned text and word list

that depicts the variation in-between different

stemming rules. We are also computing Index

Compression Factor (ICF) of each affix removal

stemmer for individual document and also for different

set of morphed word variant of single root word. I CF

represents the difference between number of unique

words before stemming and after stemming. Greater

ICF value indicates better strength of stemmer [3].

N = Number of unique words before stemming S = Number of unique stems after stemming

ICF = Index Compression factor

ICF = (N-S)/N

2. Stemming Process

To get single term from different morphological

variants of a single root word, conflation technique is

needed. Conflation is a process through which set of

words are fused into one term. Conflation can be either

manual or automatic. Automated conflation is called as

Stemming.

Automated Conflation method is categorized into four

techniques. Categories are as follows-

Rupam Gupta et al, Int.J.Computer Technology & Applications,Vol 5 (2),393-399

IJCTA | March-April 2014 Available [email protected]

393

ISSN:2229-6093

Affix Removal, Successor Variety, Table Lookup, n-gram. Affix Removal is also sub-categorized into two

techniques, Longest Match and Simple Removal.

Affix removal methods eliminates suffixes or prefixes

from actual word to get a stem form. Affix Removal

Stemmer applies a set of transformation rule on each

word for eliminating known prefix or suffix.

J.B Lovins first introduced affix removal stemming

algorithm in 1968 [4]. Other affix removal algorithms

are introduced by Porter and Chris D. Paice in 1980 [5,

6, and 7]. Krovetz stemmer is also one prominent

stemmer that removes inflectional suffix, developed by

Bob Krovetz, at the University of Massachusetts, in

1993 [8].

3. Working methods of Stemmers

3.1 Porter Stemmer Porter stemmer contains six steps which check the

existence of different suffixes within a word and based

on the of type of suffix, each step applies their own

rule to generate stem term.

1st step gets rid of plurals endings with “–sses”, “-ies”,

“-s” and also recodes to generate stem word e.g. actresses->actress, abilities->abiliti, pennies->penni,

rats->rat. This step also removes endings “-ed” used in

past tense of any word and “-ing” of present continuous

e.g. abbreviated->abbriviat , assembling->assembl.

“assembl” term is recoded into “assemble” word using

“-bl into “-ble” recode conversion rule. “-at” ending is

recoded into “-ate” in “abbriviat” to “abbreviate”

conversion.

2nd

step recodes terminal „y‟ into „i‟ in presence of

another vowel in stem word e.g. “penny” is recoded

into “penni”. So “pennies” and “penny”, these two

words are conflated into single stem word “penni”.

”Pennies” word is converted into “penni” in step-1

using “-ies->i” suffix rule. “ability” word is recoded

into “abiliti” stem word in step-2, so “abilities” and

“ability” both are conflated into single “abiliti” stem

word. 3

rd step maps double suffices into single suffix and

recodes “-ational” ending into “–ate” ending / “-tional”

into ”-tion”/”-izer” into ”-ize” / “-ation” into ”-ate” / “-

fullness” into ”-full” etc. This step recodes

“application” word into “applicate”.

4th

step handles “-ic“, “-icate “, “-full”, “-ness” ended

words by removing and recoding endings for

generating stem words. “Applicate” stem word

generated in step-3 from “application” word, is

stemmed into “applic” final stem word using “-

icate”>”-ic” conversion rule.

5th

step removes “-ance”, “-er”, “-able”, “-ant”, “-iti” endings from word with context <c>vcvc<v>.

“Acceptance” and “acceptable” both words are

conflated into “accept” stem word by removing “-ance”

and “-able” suffix respectively. “Reporter” word is

conflated into “report” by eliminating “-er” ending.

“Reporting” word is also stemmed into “report” word

using “-ing” suffix rule in step-1. “Abiliti” generated

from “ability” and ”abilities” word in step-2/step-1, is

conflated into “abil” term by removing “-iti” ending in

step-5. So, “ability” and “abilities” are stemmed into a

single stem “abil”.

6th

step removes “-e” ending from any stem word.

“Assemble” word is conflated into “assembl” stem

word. So “assembling” and “assemble” words are

conflated into same stem word “assembl”.

So porter stemmer is not using any stem-dictionary. It

only uses list of endings to be removed and list of

conversion rule for recoding to get stem word through step1 to step 5 and removes final –e from stem words

in step 6. This algorithm is already implemented in C,

JAVA, perl and is quite simple [7, 8].

Stem terms list which are generated after execution of

our comparative tool on some set of morphological

variants of some unique words, are shown below.

Table 1. First list of Stemmed Terms

Word

Stemmed terms

Porter Lovins Paice KStem

Reporters report reporter report reporter

Reporting report report report report

Reported report report report report

Report report report report report

Abbreviated abbrevi abbrevi abbrevy abbreviate

Abbreviate abbrevi abbrevi abbrevy abbreviate

Acceptance accept accept acceiv acceptance

Acceptable accept accept acceiv acceptable

Assemble assembl assembl assembl assemble

Accesses access acces access access

Actresses actress actress actress actress

Abilities abil abil abl abilities

Ability abil abil abl Ability

Applicant applic appl appl applicant

Applies appli appl apply apply

Apply appli ap apply apply

Application applic applic apply application

Applicable applic applic appl applicable



394

ISSN:2229-6093

3.2 Lovins Stemmer

It is the first developed, most popular algorithm

published in 1968 [4].

This algorithm [4] uses a large list of 294 suffices, 29 context sensitive conditions based on different suffix

and 35 word ending transformation rules for generating

stem word.

This algorithm is decomposed into two steps, first step

removes suffix from original word based on a condition

taken from 29 conditions list and then 2nd

step recodes

the ending of word based on a transformation rule from

35 context sensitive rules to generate final stem word.

“Believed” and “believe” words are considered as input

for generating stem words based on Lovins algorithm.

Step-1 [4]: Algorithm matches “-ed” suffix of

“believed” word in suffix list based on “ed E” ending

and then applies condition rule “E” from condition rule

list based on “-ed” suffix. 1st step generates stem word

“believ”.

Step-2 [4]: Recoding/transformation rule “2” [iev

ief] is applied and generate final stem word “belief” from “believ”. For “believe” word, “believ” term is

generated in 1st step based on “e A” ending and then

recoded into “belief” final stem word in 2nd

step. So,

“believe”, “believed”, “belief” words are conflated into

“belief” stem word.

Lovins algorithm also handles irregular singular-plural

e.g. .index/indices, formula/formulae. Word “index” is

recoded into “indic” stem word using recoding rule

“11” [dex-dic] in step 2. Word “indices” is stemmed

into “indic” based on “es E” ending in step 1 and

nothing is done in step 2.

This algorithm [4] also treats double letter word

(running, sitting) correctly. Recoding/transformation

rule „1‟[rule 1: eliminate one of double letters

b,d,g,l,m,n,p,r,s,t] is applied to get same stem word .

e.g. “running” and “run” both words are stemmed into

“run” stem word. Lovins algorithm generates single stem term for

individual set from some set of morphological variants

of single root words, where as other affix removal

stemmer can not generate unique stem term for the

same, which are shown below.

Table 2. Second list of Stemmed Terms Word Stemmed term


Believe believ belief believ believe

Belief belief belief believ belief

Running run run run running

Run run run run run

Index index indic index index

Indices indic indic ind indices

Formula formula forml formul formula

Formulae formula forml formula formulae

3.3 Paice Stemmer It is developed by Paice with assistance of Husk in

1980 [5, 6]. Paice algorithm [5, 6] is based on iteration

and it maintains one table of rules. Each rule is either

removing or replacement of an ending (suffix) of the

word unlike of “Lovins” recoding step.

Each rule consists five elemental parts, two of them are

optional [5, 6]:

a) An ending of one or more characters, is placed

in reverse order. b) An optional intact flag “*” is maintained.

c) A digit indicates total number of ending

characters which will be removed from

original/stem words.

d) Removal characters may be replaced by

one/more characters (optional).

e) “>” symbol indicates that iteration will be

continued and “.” Indicates ending of iteration.

With the help of some words, all components of some

rules from rule table are tested and are shown below.

Two words “center” and “central” are stemmed into

single term “cent” by Paice whereas Porter, Lovins and

KStemmer generate different stem words for “center”

and “central”. For “center” word, “re2> { -er > - }”

(1st component of a rule, placed in reverse order) [5]

rule is matched based on last ending character „r‟. Last two ending characters “er” are removed based on

“re2>”. “cent” stem word is generated and then no „t‟

ending rule is matched on “cent” term and then finally

“cent” stem term is developed for “center” word. For

“central” word, „l‟ character based rules are checked

and “la2> { -al > - }” rule is matched . According to

rule, last two characters are removed from “central”

word and generate “centr” stem word. Based on last

character „r‟, “rt1> { -tr > -t }” rule is applied and last

characters „tr‟ is replaced by „t‟ to generate “cent”

stem. No rule based on last character „t‟ is matched

from rule table and so “cent” will become final stem

word. From “woman” and “women” words, single stem

word “wom” is generated by Paice algorithm whereas

Porter, Lovins and KStemmer are generating two

different stem words from such type of irregular

singular-plural e.g. woman-women/ox-oxen. For “woman”, „n‟ based “na2> { -an > - }” rule is

applied and “wom” stem is developed. For “women”,

“wom” stem is created based on “ne2> { -en > - }”



395

ISSN:2229-6093

rule of rule table. Paice also generates single stem term for “ox, “oxen” irregular singular-plural.

Paice generates single stem word “distinct” from

“distinguish” and “distinguishing” words. Paice applies

rule “hsiug5ct. { -guish > -ct }" for “distinguish” word

from rule table where ending “guish " is replaced by

“ct” ending and then generates “distinct” stem term.

“Distinct” stem word is final stem word because last

character „.‟ of the rule indicates termination of

applying the rule. For “distinguishing” word, "gni3>

{ -ing > - }" rule is applied and ending “ing” will be

removed to generate “distinguish” stem word.

“Distinct” final stem is generated after applying

“hsiug5ct. { -guish > -ct }" rule. With the help of

exhaustive list of rules in Paice algo., most of the root

words and its‟ morphed words are stemmed correctly.

The above mentioned morphed words which are

correctly conflated into single stem term, are listed

below. Table 3. Third list of Stemmed Terms

Word Stemmed terms


Center center center cent center

Central central centr cent central

Woman woman woman wom woman

Women women wom wom women

Ox ox ox ox ox

Oxen oxen oxen ox oxen

Distinguish distinguish distingue distinct distinguish

Distinguishing distinguish distinguish distinct distinguish

Distinct Distinct distinct distinct distinct

3.4 K-Stemmer K-Stem algorithm perfectly eliminates inflectional

suffixed in three steps [9].

1st step converts plural form‟s word into its singular

form e.g. shops to shop, babies to baby, affixes to affix.

Then 2nd

one converts past, present continuous tense

word into present tense e.g. “.moved” to “move”,

“adding” to “add”. So, K-Stem first removes suffix and

then checks dictionary for regular recoding and also for

exceptional recoding to generate stem words. “Europe”,

“European” are conflated into “Europe” using

dictionary. K-Stem generates valid English word as a

stem word, not a truncated word forms whereas other

stemmer may generate meaningless stem term. “Carry”,

“carried”, “carrying” words are truncated into “carry”

stem word in K-stem, whereas “carri”, “car”

meaningless terms are generated as stem term using

Porter and Lovins stemmer respectively. In general, K-stem needs a word to be in the lexicon (basic list of

words that system knows) before word is reduced to

another form. If a variant is explicitly referred in the

lexicon, conflation is not done in K-stemming e.g.

“curly” is not conflated into “cur” in K-stemming,

whereas Porter, Lovins and Paice conflate “curly” into

“curli”, “cur”, “cur” respectively. Similarly, the word

“factorial” is not conflated into “factory” in K-

stemming whereas, Porter, Lovins and Paice stemmer

reduce “factorial” into “factori”, “factor”, “fact” term

respectively. Porter, Lovins and Paice incorrectly

conflate “factory” and “factorial” into same stem word

“fact”. K-stemmer avoids to reduce word “station” into

“state” and “authority” into “author”. But “immunity”

is reduced to “immune” due to presence of root word

“immune” in lexicon. Some files are maintained for

generating stem term in K-stem algo.. A direct-

conflation file is referred to over-ride the normal processing of conflation of the system e.g. “aging” is

conflated into “age”. File contains word-variants with

root word e.g. “lying”, “lie”/ “dying”, “die”.

Exceptional data-set which are stemmed correctly using

K-stemming, are listed below.

Table 4. Fourth list of Stemmed Terms

4. Steps of our comparative stemming tool

It is used for performing comparative analysis in

between Porter, Lovins, Paice and KStemmer Stemming algorithm:

1) Initially punctuation and two characters long

words are removed from text.

Word Stemmed terms


Authority author author auth authority

Author author author auth author

Authorize author author auth authorize

Factory factori factor fact factory

Factor factor fact fact factor

Factorial factori factor fact factorial

factorize factor factor fact factorize

State state st stat state

statement statement stat stat statement

Station station stat stat station

News new new new news

New new new new new

Age ag ag age age

Aging ag aging ag age

Lying ly lying lying lie

Lie lie li lie lie

Dying dy dying dying die

Die die di die die



396

ISSN:2229-6093

2) Numeric values and proper nouns used in text, are eliminated using Stanford Part-of-Speech

software tool.

3) Stop words which have length greater than

two, are removed.

4) Duplicate words used in text, are eliminated

and then only unique words of the text are

sorted.

5) Finally Porter, Lovins, Paice and KStemming

algorithm are executed on those above

mentioned filtered sorted unique words which

are acquired from text.

6) Execution time taken by Porter, Lovins, Paice

and KStemming algorithm on given text, are

computed using java thread.

7) For set of words, our comparative tool is

directly executed for four affix-removal

stemmers for generation of stem terms.

Filtration steps are not applicable for word set. Input:

1) DUC text documents ( http://www-

nlpir.nist.gov/projects/duc/data.html)

2) Common word List.

http://www.comp.lancs.ac.uk/computing/r

esearch/stemming/Links/resources.htm

(Grouped WordList A)

Output:

a) Input file:

DUC\ducdocs\d061j/AP880911-0016

Above mentioned DUC text‟s sample

Output is shown below.

Table 5. Fifth list of Stemmed Terms Word Stemmed terms


Alarm alarm alarm alarm alarm

Alert alert alers alert alert

Alerted alert alers alert alert

Approaching approach approach approach approach

Area area are are area

Around around around around around

Associated associ associ assocy associate

Briefly briefli brief brief brief

b) Input: Grouped Wordlist - A

Sample output is given below

Table 6. Sixth list of Stemmed Terms Word Stemmed terms


Aardvark aardvark aardvarkk aardvarkk aardvarkk

Aardvarks aardvark aardvark aardvark aardvark

Aback aback Aback aback aback

Abacus abacu Abac abac abacus

Abacuses abacus Abacus abacus abacus

Abalone abalon Abalone abalon abalone

Abalones abalone abalone abalone abalone

5. Comparative analysis of four stemmers

Seventh list of stemmed terms is shown with respect to

number of stem word generation for a set of wordlist-

http://www.comp.lancs.ac.uk/computing/research/stem

ming/Links/resources.htm (Grouped WordList A)

Table 7. Seventh list of Stemmed Terms List of

Morphed

Words

Number of Stemmed terms

Porte

r

Lovin

s

Paic

e

KS

tem

mer

Abilities

Ability

1 1 1 2

Abnormal

Abnormally

1 1 1 1

Abolish

Abolished

Abolishes

Abolishing

Abolition

2

3

2

2

Table-8 shows comparison in between stemmers with

respect to ICF value, are given below.



397

ISSN:2229-6093

Table 8. list of ICF values for text

Note: Effective size of text is total number of words

generated after removing stop words and duplicate

words from original text.

Shown below is the 3-D bar diagram for depicting ICF value of four stemmers.

Figure 1. Bar Chart for ICF

Shown below is the 3-D bar diagram showing time

computation value of four stemmers.

Figure 2. Bar Chart for time computation

Tex

t

Actu

al

Siz

e

Eff

ecti

ve S

ize

Index Compression

Factor (ICF)

ICF=(N-S)/N

AP

88

09

11

-00

16

32

2

10

3

Po

rter

Lo

vin

s

Pai

ce

KS

tem

mer

0.0

29

0.0

67

0.0

48

0.0

38

AP

88

0

91

2-

00

95

78

2

25

6

0.0

66

0.0

7

0.0

85

0.0

46

AP

880

912

-

0137

666

217

0.0

59

0.0

73

0.0

78

0.0

414

AP

88

0915

-

0003

1080

269

0.0

37

0.0

40

0.0

48

0.0

29

AP

88

0916

-

0060

527

174

0.0

63

0.0

5

0.0

74

0.0

45

WS

J88

0912

-

0064

362

108

0.0

46

0.0

55

0.0

55

0.0

46



398

ISSN:2229-6093

Table 9. List of time computational values

Note: Average execution time for individual stemming

algorithm is calculated based on average of 5 time‟s

computational time of each algorithm for single text.

6. Conclusion and Future Work

We have presented comparative study in between each

step of all four affix removal stemmer and output

generated from all these stemmers. According to our

observation, Paice algorithm performs better with

respect to ICF value and Lovins stemmer executes at a

faster rate as compared to other stemmers. But, these four stemmers are not able to generate correct stem

term for many set of morphed words. Some words are

given in Table 10.

Table 10. List of stemmed terms

All these stemmers apply some rules for removing affix

which are not always generating correct stem. But

Lemmatizer uses dictionary vocabulary and considers

the context of the word in text to remove inflectional

endings for generating dictionary based word, lemma.

In future, we will study on lemmatization and try to

minimize above mentioned errors using context

knowledge of word and part-of-speech of the word.

7. References

[1] A Comparative Study of Stemming Algorithms

.Ms. Anjali Ganesh Jivani et al, Int. J. Comp. Tech.

Appl., Vol 2 (6), 1930-1938. IJCTA | NOV-DEC 2011

[2] CHAPTER 8: STEMMING ALGORITHMS: W. B.

Frakes Software Engineering Guild, Sterling, VA

22170 [3] Frakes William B. “Strength and similarity of affix

removal stemming algorithms”. ACM SIGIR

Forum, Volume 37, No. 1. 2003, 26-30.

[4] J. B. Lovins, “Development of a stemming algorithm,” Mechanical Translation and Computer

Linguistic., vol.11, no.1/2, pp. 22-31, 1968.

[5] Paice Chris D. “Another stemmer”. ACM SIGIR

Forum, Volume 24, No. 3. 1990, 56-61. [6] Paice Chris D. “An evaluation method for

stemming algorithms”. Proceedings of the 17th

annual international ACM SIGIR conference on

Research and development in information retrieval. 1994, 42-50.

[7] Porter M.F. “An algorithm for suffix stripping”.

Program. 1980; 14, 130-137.

[8] Porter M.F. “Snowball: A language for stemming algorithms”. 2001.

[9] Krovetz Robert. “Viewing morphology as an

inference process”. Proceedings of the 16th annual

international ACM SIGIR conference on Research and development in information retrieval. 1993,

191-202.

Tex

t

Eff

ecti

ve S

ize

Average Time Computation

(milli seconds)

AP

88

09

11

-00

16

10

3

Po

rter

Lo

vin

s

Pai

ce

KS

tem

mer

21

.6

11

.6

10

.6

95

AP

88

0

91

2-

00

95

25

6

23

.2

19

.8

13

10

0.4

AP

880

912-

0137

217

26.2

12.8

13.4

100.2

AP

88

0915

-

0003

269

22.6

14.8

11

107

AP

880

916

-

0060

174

28.2

14.8

13.4

102.6

WS

J8

8091

2-

0064

108

19.6

10

7

99.6

Average

19.8

13.9

6

15.4

100.8

Word Stemmed terms


give give giv giv give

given given giv giv given

gave gave gav gav gave

man man man man man

men men men men men

half half half half half

halves halv halv halv halves

new new new new new

news new new new news



399

ISSN:2229-6093

Documents

Empirical Analysis of Affix Removal Stemmers - IJCTA · Empirical Analysis of Affix Removal Stemmers . ... using “-ies->i” suffix rule. ... With the help of some words,