ISP 433/633 Week 4
Text Operations, Indexing, and Search
Document Processing Steps
Example Collection
Documents
D1: It is a dog eat dog world!
D2: While the world sleeps.
D3: Let sleeping dogs lie.
D4: I will eat my hat.
D5: My dog wears a hat.
Step 1: Parse Text Into Words
• Break at spaces and punctuation
D1: IT IS A DOG EAT DOG WORLD
D2: WHILE THE WORLD SLEEPS
D3: LET SLEEPING DOGS LIE
D4: I WILL EAT MY HAT
D5: MY DOG WEARS A HAT
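This step is easy to sketch in code. Below is a minimal Python illustration; the regex-based splitter and the uppercase normalization are assumptions made for this sketch, not something the slides prescribe.

```python
import re

# Sketch of Step 1: break at spaces and punctuation by keeping only
# alphabetic runs, uppercased to match the slides' convention.
def parse(text: str) -> list[str]:
    return re.findall(r"[A-Z]+", text.upper())

docs = {
    "D1": "It is a dog eat dog world!",
    "D2": "While the world sleeps.",
    "D3": "Let sleeping dogs lie.",
    "D4": "I will eat my hat.",
    "D5": "My dog wears a hat.",
}
for doc_id, text in docs.items():
    print(doc_id, parse(text))
# D1 ['IT', 'IS', 'A', 'DOG', 'EAT', 'DOG', 'WORLD'] ...
```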
Step 2: Stop Words Elimination
• Remove non-distinguishing words
  – Pronouns, prepositions, articles, and forms of “to be”, “to have”, “to do”
  – I, MY, IT, YOUR, … OF, BY, ON, … A, THE, THIS, … IS, HAS, WILL, …
D1: DOG EAT DOG WORLD
D2: WORLD SLEEPS
D3: LET SLEEPING DOGS LIE
D4: EAT HAT
D5: DOG WEARS HAT
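A minimal sketch of this step; the tiny stop list is chosen just to reproduce the example above (real stop lists run to a few hundred words, as the next slide notes).

```python
# Illustrative stop list covering only the example collection.
STOP_WORDS = {"IT", "IS", "A", "WHILE", "THE", "I", "WILL", "MY"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["IT", "IS", "A", "DOG", "EAT", "DOG", "WORLD"]))
# ['DOG', 'EAT', 'DOG', 'WORLD']
```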
Stop Words List
• The 250-300 most common words in English account for 50% or more of a given text.
  – Example: “the” and “of” represent 10% of tokens; “and”, “to”, “a”, and “in” another 10%; the next 12 words another 10%.
• Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens).
  – Top 65 types cover 1132 tokens (> 50%).
  – Token/type ratio: 2256/859 = 2.63
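These statistics are easy to recompute. A small sketch (whitespace tokenization is a simplification; exact counts depend on how tokens are defined):

```python
from collections import Counter

def type_token_stats(text: str, top_k: int = 65) -> None:
    tokens = text.lower().split()        # simplistic tokenization
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    top_cover = sum(freq for _, freq in counts.most_common(top_k))
    print(f"{n_types} types, {n_tokens} tokens, "
          f"token/type ratio {n_tokens / n_types:.2f}, "
          f"top {top_k} types cover {top_cover / n_tokens:.0%} of tokens")

# Hypothetical usage, given a local plain-text copy of the chapter:
# type_token_stats(open("moby_dick_ch1.txt").read())
```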
Step 3: Stemming
• Goal: “normalize” similar words
D1: DOG EAT DOG WORLD
D2: WORLD SLEEP
D3: LET SLEEP DOG LIE
D4: EAT HAT
D5: DOG WEAR HAT
Stemming and Morphological Analysis
Morphology (the “form” of words)
– Inflectional morphology
  • E.g., inflected verb endings and noun number
  • Never changes grammatical class
    – dog, dogs
– Derivational morphology
  • Derives one word from another
  • Often changes grammatical class
    – build, building; health, healthy
Simple “S” stemming
• IF a word ends in “ies”, but not “eies” or “aies”
  – THEN “ies” → “y”
• IF a word ends in “es”, but not “aes”, “ees”, or “oes”
  – THEN “es” → “e”
• IF a word ends in “s”, but not “us” or “ss”
  – THEN “s” → NULL
(Harman, JASIS 1991)
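The three rules transcribe directly into code. A sketch (the function name is mine; input is assumed lowercase):

```python
def s_stem(word: str) -> str:
    """Simple 'S' stemmer: apply the first matching rule, if any."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ies" -> "y": queries -> query
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]         # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "s" -> NULL: dogs -> dog
    return word

print([s_stem(w) for w in ["dogs", "sleeps", "wears", "queries"]])
# ['dog', 'sleep', 'wear', 'query']
```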
Porter’s Algorithm
• An effective, simple and popular English stemmer
• Official URL http://www.tartarus.org/~martin/PorterStemmer/
• A demo http://snowball.tartarus.org/demo.php
Porter’s Algorithm
• 1. The measure, m, of a stem is a function of its sequences of vowels followed by consonants. If V is a sequence of vowels and C is a sequence of consonants, a stem has the form:
  C(VC)^m V
  where the initial C and the final V are optional and m is the number of VC repeats.
  – m=0: free, why
  – m=1: frees, whose
  – m=2: prologue, compute
• 2. *<X> – stem ends with the letter X
• 3. *v* – stem contains a vowel
• 4. *d – stem ends in a double consonant
• 5. *o – stem ends in a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
Porter, Program 1980
Porter’s Algorithm
• Suffix conditions take the form: current_suffix == pattern
• Actions take the form: old_suffix -> new_suffix
• Rules are divided into steps that define the order in which the rules are applied. Some example rules:

STEP  CONDITION  SUFFIX  REPLACEMENT  EXAMPLE
1a    NULL       sses    ss           stresses -> stress
1b    *v*        ing     NULL         making -> mak
1b1   NULL       at      ate          inflat(ed) -> inflate
1c    *v*        y       i            happy -> happi
2     m>0        aliti   al           formaliti -> formal
3     m>0        icate   ic           duplicate -> duplic
4     m>1        able    NULL         adjustable -> adjust
5a    m>1        e       NULL         inflate -> inflat
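As a concreteness check, here is a sketch of the measure m and three of the rules above. This is not the full five-step algorithm (for real work, use an existing implementation such as the one at the URL above), and the function names are mine.

```python
VOWELS = "aeiou"

def measure(stem: str) -> int:
    """Porter's m: the number of VC repeats in the C(VC)^m V form."""
    # Classify letters as vowel/consonant ('y' after a consonant acts
    # as a vowel), then count vowel-to-consonant transitions.
    kinds = ""
    for i, ch in enumerate(stem):
        is_v = ch in VOWELS or (ch == "y" and i > 0 and stem[i - 1] not in VOWELS)
        kinds += "v" if is_v else "c"
    return kinds.count("vc")

print([measure(w) for w in ["free", "frees", "prologue"]])  # [0, 1, 2]

def apply_some_rules(word: str) -> str:
    """Apply rules 1a, 1b, and 2 from the table above, in order."""
    if word.endswith("sses"):                              # 1a: sses -> ss
        return word[:-2]
    if word.endswith("ing") and any(c in VOWELS for c in word[:-3]):
        return word[:-3]                                   # 1b: (*v*) ing -> NULL
    if word.endswith("aliti") and measure(word[:-5]) > 0:
        return word[:-5] + "al"                            # 2: (m>0) aliti -> al
    return word

print([apply_some_rules(w) for w in ["stresses", "making", "formaliti"]])
# ['stress', 'mak', 'formal']
```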
Problems of Porter’s Algorithm
• Possible errors:

  Too Aggressive           Too Timid
  organization / organ     relatedness / related
  executive / execute      create / creation

• Unreadable results
• Does not handle some irregular verbs and adjectives
  – take/took
  – bad/worse
Step 4: Indexing
• Inverted Files
Vocabulary   Occurrences
DOG          D1, D3, D5
EAT          D1, D4
HAT          D4, D5
LET          D3
LIE          D3
SLEEP        D2, D3
WEAR         D5
WORLD        D1, D2
Inverted Files
• Occurrences can point to
  – Documents
  – Positions in a document
  – Weight
• Most commonly used indexing method
• Based on words
  – Queries such as phrases are expensive to solve
  – Some data does not have words
    • Genetic data
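Building a document-level inverted file for the example collection takes only a few lines. A sketch (postings here are document identifiers; positions or weights would be stored alongside them):

```python
from collections import defaultdict

# Stemmed documents from Step 3.
docs = {
    "D1": ["DOG", "EAT", "DOG", "WORLD"],
    "D2": ["WORLD", "SLEEP"],
    "D3": ["LET", "SLEEP", "DOG", "LIE"],
    "D4": ["EAT", "HAT"],
    "D5": ["DOG", "WEAR", "HAT"],
}

index = defaultdict(set)          # vocabulary term -> set of documents
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

for term in sorted(index):
    print(term, sorted(index[term]))
# DOG ['D1', 'D3', 'D5'], EAT ['D1', 'D4'], ...
```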
Suffix Trees
• Example text (the original slide shows a character-position ruler above it):
  “This is a text. A text has many words. Words are made from letters.”
• The index stores the suffixes beginning at the word starts chosen for indexing: positions 11 and 19 (“text …”), 28 (“many …”), 33 and 40 (“words …” / “Words …”), 50 (“made …”), and 60 (“letters.”).
• Shared prefixes such as “text” and “words” are collapsed, giving a Patricia tree (compact suffix trie).
[Figure: Patricia tree over the suffixes, with leaves labeled 11, 19, 28, 33, 40, 50, 60; not reproduced.]
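A full suffix-tree implementation is lengthy; the sketch below conveys the idea with a suffix array instead: collect the suffixes beginning at word starts and keep them in sorted order. Here every word start is indexed and positions are 0-based, unlike the figure's selected words and 1-based ruler.

```python
import re

text = "This is a text. A text has many words. Words are made from letters."

# Suffixes begin at word starts; sorting them groups shared prefixes,
# which is what the Patricia tree represents explicitly.
starts = [m.start() for m in re.finditer(r"\w+", text)]
suffix_array = sorted(starts, key=lambda p: text[p:])

for p in suffix_array:
    print(f"{p:2d}: {text[p:p + 12]}...")
```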
Text Compression
• Represent text in fewer bits
• Symbols to be compressed are words
• Method of choice
  – Huffman coding
Huffman Coding
• Developed by David Huffman (1952)
• Average of 5 bits per character
• Based on frequency distributions of symbols
• Idea: assign shorter codes to more frequent symbols
• Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols
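The iterative tree construction can be sketched with a priority queue. The tie-breaking below is arbitrary, so the resulting codes may differ from the example tree that follows, though the code lengths (and so the compressed size) come out the same.

```python
import heapq

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Repeatedly merge the two least frequent subtrees into one."""
    # Heap entries: (total frequency, unique tiebreaker, symbol -> code map).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent subtree
        f2, _, right = heapq.heappop(heap)   # second least frequent
        merged = {s: "0" + c for s, c in left.items()}          # left edge: 0
        merged.update({s: "1" + c for s, c in right.items()})   # right edge: 1
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
print(huffman_codes(freqs))
```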
An Example
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
[Figure: Huffman tree built from these frequencies; each merge joins the two least frequent subtrees, with left edges labeled 0, right edges labeled 1, and leaves a-j. Not reproduced.]
Example Coding
[Figure: the same Huffman tree; each symbol's code is read off the root-to-leaf path, as tabulated below.]
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
Exercise
• Consider the bit string: 011011011110001001100011101001110001101011010111
• Use the Huffman code from the example to decode it.
• Try inserting, deleting, or flipping a few bits at random locations, then decode again to see how errors propagate
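The prefix property makes decoding a single left-to-right scan. A sketch of a decoder for the exercise (the code table is copied from the example above):

```python
codes = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
         "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}
decode_table = {code: sym for sym, code in codes.items()}

def decode(bits: str) -> str:
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:   # prefix property: first hit is a symbol
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

print(decode("011011011110001001100011101001110001101011010111"))
# -> AIJGBADGGEDFIGJ
```

Flipping a single bit and re-running shows how one error can garble the remainder of the decoded string.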
Huffman Code
• Prefix property
  – No codeword in the code is a prefix of any other codeword
• Random access
  – Decompression can start from anywhere, given a codeword boundary
• Not the fastest
Sequential string searching
• Boyer-Moore algorithm
• Example: search for “cats” in “the catalog of all cats”
• Some preprocessing is needed.
• Demos:
  http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
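A compact sketch of the bad-character heuristic, the part of Boyer-Moore that lets the search skip ahead (the full algorithm adds a good-suffix rule; the function is mine, for illustration):

```python
def bm_search(text: str, pattern: str) -> list[int]:
    """Boyer-Moore search using only the bad-character rule."""
    m, n = len(pattern), len(text)
    # Preprocessing: rightmost index of each character in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    matches, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:   # compare right to left
            j -= 1
        if j < 0:
            matches.append(s)
            s += 1
        else:
            # Align the mismatched text character with its rightmost
            # occurrence in the pattern (or skip past it entirely).
            s += max(1, j - last.get(text[s + j], -1))
    return matches

print(bm_search("the catalog of all cats", "cats"))  # -> [19]
```

On this example the search rejects “catalog” after a single right-to-left comparison (“s” against “a”) and then skips ahead, which is the source of Boyer-Moore's sublinear behavior in practice.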