1
Discussion Class 3
The Porter Stemmer
2
Course Administration
No class on Thursday
3
Discussion Classes
Format:
Question
Ask a member of the class to answer.
Provide opportunity for others to comment.
When answering:
Stand up.
Give your name. Make sure that the TA hears it.
Speak clearly so that all the class can hear.
Suggestions:
Do not be shy at presenting partial answers.
Differing viewpoints are welcome.
4
Question 1: Stemming
(a) Define the terms: stem, suffix, prefix, conflation
(b) What makes a good stemming algorithm? How would you measure it?
(c) Porter proposes a criterion for removing suffixes. What is it? Do you agree with it?
(d) The paper uses "recall cutoff" to measure effectiveness. What does it measure?
5
Question 2: Categories of Stemmer
The following diagram illustrates the various categories of stemmer. Porter's algorithm is shown by the red path. What do these terms mean?
Conflation methods
    Manual
    Automatic (stemmers)
        Affix removal
            Longest match
            Simple removal
        Successor variety
        Table lookup
        n-gram
6
Question 3: Mechanics Step 1a
The paper gives the following example of Step 1a. Explain what this step does.
Suffix    Replacement    Examples
sses      ss             caresses -> caress
ies       i              ponies -> poni, ties -> ti
ss        ss             caress -> caress
s         null           cats -> cat
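The table above can be sketched in Python as an ordered rule list in which the first (longest) matching suffix wins. This is a minimal illustration, assuming lowercase input; the names STEP_1A_RULES and step_1a are hypothetical and do not come from the paper.

```python
# Step 1a rules, longest suffix first: (suffix, replacement).
# Names here are illustrative, not taken from Porter's paper.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step_1a(word):
    """Apply the first Step 1a rule whose suffix matches the word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched; leave the word unchanged


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

The "ss -> ss" rule changes nothing by itself; because it is tested before the "s" rule, it stops "caress" from being cut to "cares".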
7
Question 4: Mechanics Step 1b
Conditions    Suffix    Replacement    Examples
(m > 0)       eed       ee             feed -> feed, agreed -> agree
(*v*)         ed        null           plastered -> plaster, bled -> bled
(*v*)         ing       null           motoring -> motor, sing -> sing
(a) Explain this table
(b) How does this table apply to: "exceeding", "ringed"?
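A minimal Python sketch of the three rules in this table may help when working through part (b). It assumes lowercase input and covers only the rules shown on this slide, not the follow-up rules that the paper applies after a successful "ed" or "ing" removal. The helper names (is_consonant, measure, has_vowel, step_1b) are illustrative, not from the paper.

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vowel-run -> consonant-run transitions in the stem."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


def has_vowel(stem):
    """The *v* condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1b(word):
    """Test only the longest matching suffix, then check that rule's condition."""
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    if word.endswith("ed"):
        stem = word[:-2]
        return stem if has_vowel(stem) else word
    if word.endswith("ing"):
        stem = word[:-3]
        return stem if has_vowel(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
    print(w, "->", step_1b(w))
```

Calling step_1b on "exceeding" and "ringed" shows which rule fires for each word in part (b).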
8
Question 5: Mechanics Step 5a
Step 5a is defined as follows. What does this do and why?
Conditions            Suffix    Replacement    Examples
(m > 1)               E         null           probate -> probat, rate -> rate
(m = 1 and not *o)    E         null           cease -> ceas
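Step 5a can be sketched in Python as follows. This is an illustration under stated assumptions: lowercase input, and the paper's definition of *o (the stem ends consonant-vowel-consonant, where the final consonant is not w, x, or y). The helper names are hypothetical.

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of vowel-run -> consonant-run transitions in the stem."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


def ends_cvc(stem):
    """The *o condition: stem ends consonant-vowel-consonant, last letter not w, x, or y."""
    n = len(stem)
    if n < 3:
        return False
    return (
        is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def step_5a(word):
    """Drop a final 'e' when the remaining stem is long enough."""
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    m = measure(stem)
    if m > 1 or (m == 1 and not ends_cvc(stem)):
        return stem
    return word


for w in ["probate", "rate", "cease"]:
    print(w, "->", step_5a(w))
```

Here "rat" has m = 1 and ends consonant-vowel-consonant, so "rate" keeps its "e"; "ceas" has m = 1 but ends vowel-vowel-consonant, so the "e" of "cease" is dropped.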
9
Question 6: Ad hoc decisions
Discuss the following:
"The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix."
(a) What is m?
(b) Why is it a reasonable measure?
(c) What anomalies does it produce?
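As a starting point for part (a): the paper writes every word in the form [C](VC)^m[V], where C and V are maximal runs of consonants and vowels, and m counts the VC pairs. A minimal Python sketch, assuming lowercase input; the function names are illustrative, and the test words are the paper's own examples.

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC pairs in [C](VC)^m[V], i.e. vowel-run -> consonant-run transitions."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


examples = {
    0: ["tr", "ee", "tree", "y", "by"],
    1: ["trouble", "oats", "trees", "ivy"],
    2: ["troubles", "private", "oaten", "orrery"],
}
for expected_m, words in examples.items():
    for w in words:
        print(w, "m =", measure(w))
```

Trying other words with this function is one way to look for the anomalies asked about in part (c).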
10
Question 7: Stemming in Web searching
(a) In Web search engines, the tendency is not to use stemming. Why? (There are several answers.)
(b) Does your answer to part (a) mean that stemming is no longer useful?