Discussion Class 3

The Porter Stemmer

Course Administration

No class on Thursday

Discussion Classes

Format:

Question

Ask a member of the class to answer.

Provide opportunity for others to comment.

When answering:

Stand up.

Give your name. Make sure that the TA hears it.

Speak clearly so that all the class can hear.

Suggestions:

Do not be shy at presenting partial answers.

Differing viewpoints are welcome.

Question 1: Stemming

(a) Define the terms: stem, suffix, prefix, conflation

(b) What makes a good stemming algorithm? How would you measure it?

(c) Porter proposes a criterion for removing suffixes. What is it? Do you agree with it?

(d) The paper uses "recall cutoff" to measure effectiveness. What does it measure?

Question 2: Categories of Stemmer

The following diagram illustrates the various categories of stemmer. Porter's algorithm is shown by the red path. What do these terms mean?

Conflation methods
    Manual
    Automatic (stemmers)
        Affix removal
            Longest match
            Simple removal
        Successor variety
        Table lookup
        n-gram

Question 3: Mechanics Step 1a

The paper gives the following example of Step 1a. Explain what this step does.

Suffix   Replacement   Examples
sses     ss            caresses -> caress
ies      i             ponies -> poni; ties -> ti
ss       ss            caress -> caress
s        null          cats -> cat
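
One way to read the Step 1a table is as an ordered rule list in which only the longest matching suffix is applied. The Python sketch below illustrates that reading. The function name step_1a is hypothetical, and the code assumes lowercase input.

```python
def step_1a(word: str) -> str:
    """Apply the first Step 1a rule whose suffix matches the word."""
    # Rules are listed longest suffix first, so "sses" is tried before "ss" and "s".
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched; the word is left unchanged


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

The rule ss -> ss looks redundant, but it stops the final rule from stripping the last letter of words such as "caress".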

Question 4: Mechanics Step 1b

Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed; agreed -> agree
(*v*)        ed       null          plastered -> plaster; bled -> bled
(*v*)        ing      null          motoring -> motor; sing -> sing

(a) Explain this table

(b) How does this table apply to: "exceeding", "ringed"?
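
As an aid to discussion, the three rules in the table can be sketched in Python. The sketch assumes lowercase input, covers only the rules shown on this slide (not the follow-up rules that the full algorithm applies after removing "ed" or "ing"), and uses hypothetical helper names. Following the paper's convention, the rule with the longest matching suffix is selected, and if its condition fails the word is left unchanged.

```python
VOWELS = "aeiou"


def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def contains_vowel(stem: str) -> bool:
    """The *v* condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1b(word: str) -> str:
    # "eed" is listed before "ed" so that the longest suffix is matched first.
    rules = [
        ("eed", "ee", lambda stem: measure(stem) > 0),
        ("ed", "", contains_vowel),
        ("ing", "", contains_vowel),
    ]
    for suffix, replacement, condition in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Only the matched rule is considered, even if its condition fails.
            return stem + replacement if condition(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", step_1b(w))
```

Calling step_1b on the words in part (b) is one way to check an answer worked out by hand.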

Question 5: Mechanics Step 5a

Step 5a is defined as follows. What does this do and why?

(m>1) E ->               probate -> probat; rate -> rate
(m=1 and not *o) E ->    cease -> ceas
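
The two rules can be sketched in Python as follows. In Porter's paper, *o means that the stem ends consonant-vowel-consonant and the final consonant is not w, x or y. The helper names are hypothetical and the code assumes lowercase input.

```python
VOWELS = "aeiou"


def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def ends_cvc(stem: str) -> bool:
    """The *o condition: stem ends cvc and the last consonant is not w, x or y."""
    if len(stem) < 3:
        return False
    i = len(stem) - 1
    return (
        is_consonant(stem, i)
        and not is_consonant(stem, i - 1)
        and is_consonant(stem, i - 2)
        and stem[i] not in "wxy"
    )


def step_5a(word: str) -> str:
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    m = measure(stem)
    # Drop the final "e" when the stem is long enough (m > 1), or when
    # m == 1 and the stem does not end in a cvc pattern.
    if m > 1 or (m == 1 and not ends_cvc(stem)):
        return stem
    return word


for w in ["probate", "rate", "cease"]:
    print(w, "->", step_5a(w))
```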

Question 6: Ad hoc decisions

Discuss the following:

"The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix."

(a) What is m?

(b) Why is it a reasonable measure?

(c) What anomalies does it produce?
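
For part (a), Porter's paper writes every word in the form [C](VC)^m[V], where C and V stand for maximal runs of consonants and vowels, and m counts the VC pairs. The Python sketch below computes m under that definition. The function names are hypothetical and lowercase input is assumed.

```python
VOWELS = "aeiou"


def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Return m, the number of VC pairs in the form [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        # Each change from a vowel run to a consonant run closes one VC pair.
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


# Words taken from the examples in Porter's paper for m = 0, 1 and 2.
for w in ["tree", "by", "trouble", "oats", "private", "orrery"]:
    print(w, measure(w))
```

Trying short words with different vowel and consonant patterns is a quick way to look for the anomalies asked about in part (c).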

Question 7: Stemming in Web searching

(a) In Web search engines, the tendency is not to use stemming. Why? (There are several answers.)

(b) Does your answer to part (a) mean that stemming is no longer useful?