ISP 433/633 Week 4

Text operations, indexing, and search

Document Processing Steps

Example Collection

Documents

D1: It is a dog eat dog world!

D2: While the world sleeps.

D3: Let sleeping dogs lie.

D4: I will eat my hat.

D5: My dog wears a hat.

Step 1: Parse Text Into Words

• break at spaces and punctuation

D1: IT IS A DOG EAT DOG WORLD

D2: WHILE THE WORLD SLEEPS

D3: LET SLEEPING DOGS LIE

D4: I WILL EAT MY HAT

D5: MY DOG WEARS A HAT
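The parsing step above can be sketched in a few lines of Python. This is a minimal illustration, assuming that "break at spaces and punctuation" means splitting on every run of non-alphanumeric characters and that tokens are upper-cased as on the slide; the function name `tokenize` is chosen here for illustration.

```python
import re

def tokenize(text):
    """Split on runs of non-alphanumeric characters and upper-case each token."""
    return [tok.upper() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

docs = {
    "D1": "It is a dog eat dog world!",
    "D2": "While the world sleeps.",
    "D3": "Let sleeping dogs lie.",
    "D4": "I will eat my hat.",
    "D5": "My dog wears a hat.",
}

tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
for doc_id, toks in tokens.items():
    print(f"{doc_id}: {' '.join(toks)}")
```

A real tokenizer also has to decide what to do with hyphens, apostrophes, and numbers; this sketch simply treats all of them as separators.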

Step 2: Stop Words Elimination

• Remove non-distinguishing words
– Pronouns, … prepositions, … articles, … to Be, to Have, to Do
– I, MY, IT, YOUR, … OF, BY, ON, … A, THE, THIS, … IS, HAS, WILL, …

D1: DOG EAT DOG WORLD

D2: WORLD SLEEPS

D3: LET SLEEPING DOGS LIE

D4: EAT HAT

D5: DOG WEARS HAT
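Stop-word elimination is a simple set-membership filter. The sketch below uses only the stop words spelled out on the slide, plus WHILE, which is an assumption added here because the slide removes it from D2 without listing it; the name `remove_stop_words` is illustrative.

```python
# Stop words named on the slide; WHILE is an assumed addition (see lead-in).
STOP_WORDS = {
    "I", "MY", "IT", "YOUR", "OF", "BY", "ON",
    "A", "THE", "THIS", "IS", "HAS", "WILL", "WHILE",
}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]

parsed = {
    "D1": "IT IS A DOG EAT DOG WORLD".split(),
    "D2": "WHILE THE WORLD SLEEPS".split(),
    "D3": "LET SLEEPING DOGS LIE".split(),
    "D4": "I WILL EAT MY HAT".split(),
    "D5": "MY DOG WEARS A HAT".split(),
}

filtered = {doc_id: remove_stop_words(toks) for doc_id, toks in parsed.items()}
for doc_id, toks in filtered.items():
    print(f"{doc_id}: {' '.join(toks)}")
```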

Stop Words List

• The 250-300 most common words in English account for 50% or more of a given text.
– Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in”: another 10%. The next 12 words: another 10%.

• Moby Dick Ch. 1: 859 unique words (types), 2256 word occurrences (tokens).
– Top 65 types cover 1132 tokens (> 50%).
– Token/type ratio: 2256/859 = 2.63
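The type and token figures above can be computed for any token list with a frequency counter. The sketch below is illustrative: the helper name `type_token_stats` and the tiny sample sentence are assumptions, and the last two lines only recheck the arithmetic quoted for Moby Dick.

```python
from collections import Counter

def type_token_stats(tokens, top_n):
    """Return (types, tokens, token/type ratio, share of tokens covered by the top_n types)."""
    counts = Counter(tokens)
    n_types = len(counts)
    n_tokens = sum(counts.values())
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return n_types, n_tokens, n_tokens / n_types, covered / n_tokens

sample = "the dog and the cat and the bird".split()
print(type_token_stats(sample, top_n=2))   # (5, 8, 1.6, 0.625)

# Arithmetic check on the Moby Dick figures quoted above
print(round(2256 / 859, 2))    # 2.63
print(round(1132 / 2256, 3))   # 0.502
```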

Step 3: Stemming

• Goal: “normalize” similar words

D1: DOG EAT DOG WORLD

D2: WORLD SLEEP

D3: LET SLEEP DOG LIE

D4: EAT HAT

D5: DOG WEAR HAT

Stemming and Morphological Analysis

Morphology (“form” of words)

– Inflectional Morphology
• E.g., inflect verb endings and noun number
• Never change grammatical class
– dog, dogs

– Derivational Morphology
• Derive one word from another
• Often change grammatical class
– build, building; health, healthy

Simple “S” stemming

• IF a word ends in “ies”, but not “eies” or “aies”
– THEN “ies” → “y”

• IF a word ends in “es”, but not “aes”, “ees”, or “oes”
– THEN “es” → “e”

• IF a word ends in “s”, but not “us” or “ss”
– THEN “s” → NULL

Harman, JASIS 1991
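The three rules translate almost directly into code. In the sketch below, one interpretation is assumed: the rules are tried in order, a rule whose exception applies is skipped so the next rule gets a chance, and only the first rule that fires is used. The function name `s_stem` is illustrative.

```python
def s_stem(word):
    """Apply the first matching "S" stemming rule; leave the word alone otherwise."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"          # "ies" -> "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"          # "es" -> "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]                # "s" -> NULL
    return w

for word in ["dogs", "ponies", "horses", "trees", "glass", "corpus"]:
    print(word, "->", s_stem(word))
```

Under this reading, "trees" skips the second rule because of the "ees" exception and is then handled by the third rule, giving "tree".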

Porter’s Algorithm

• An effective, simple and popular English stemmer

• Official URL http://www.tartarus.org/~martin/PorterStemmer/

• A demo http://snowball.tartarus.org/demo.php

1. The measure, m, of a stem counts its vowel-consonant sequences. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
– m=0: free, why
– m=1: frees, whose
– m=2: prologue, compute

2. *<X> - stem ends with letter X

3. *v* - stem contains a vowel

4. *d - stem ends in a double consonant

5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y

Porter, Program 1980
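The measure can be computed by collapsing a stem into runs of consonants (C) and vowels (V) and counting the VC pairs. This is a minimal sketch, not the reference implementation; it follows the usual Porter convention that "y" counts as a vowel when it follows a consonant, and the function names are illustrative.

```python
def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of VC repeats in the form [C](VC)^m[V]."""
    stem = stem.lower()
    runs = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not runs or runs[-1] != kind:
            runs.append(kind)        # collapse repeats, e.g. "prologue" -> CVCVCV
    return "".join(runs).count("VC")

for stem in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(stem, measure(stem))
```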


• Suffix conditions take the form current_suffix == pattern

• Actions are in the form old_suffix -> new_suffix

• Rules are divided into steps to define the order of applying the rules. The following are some examples of the rules:

STEP  CONDITION  SUFFIX  REPLACEMENT  EXAMPLE
1a    NULL       sses    ss           stresses -> stress
1b    *v*        ing     NULL         making -> mak
1b1   NULL       at      ate          inflat(ed) -> inflate
1c    *v*        y       i            happy -> happi
2     m>0        aliti   al           formaliti -> formal
3     m>0        icate   ic           duplicate -> duplic
4     m>1        able    NULL         adjustable -> adjust
5a    m>1        e       NULL         inflate -> inflat
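A single rule from the table can be applied as "strip the suffix, test the condition on what remains, then append the replacement". The sketch below covers only the example rules listed here, one at a time, and is not a full Porter stemmer; `apply_rule` and its helpers are illustrative names, and the measure helper is repeated so the block runs on its own.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":                    # y is a vowel only after a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    kinds = ["C" if is_consonant(stem, i) else "V" for i in range(len(stem))]
    runs = "".join(k for i, k in enumerate(kinds) if i == 0 or kinds[i - 1] != k)
    return runs.count("VC")

def has_vowel(stem):                 # the *v* condition
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def apply_rule(word, suffix, replacement, condition=None):
    """Rewrite word if it ends in suffix and the remaining stem passes condition."""
    if not word.endswith(suffix):
        return word
    stem = word[: len(word) - len(suffix)]
    if condition is not None and not condition(stem):
        return word
    return stem + replacement

print(apply_rule("stresses", "sses", "ss"))                            # stress
print(apply_rule("making", "ing", "", has_vowel))                      # mak
print(apply_rule("happy", "y", "i", has_vowel))                        # happi
print(apply_rule("formaliti", "aliti", "al", lambda s: measure(s) > 0))  # formal
print(apply_rule("adjustable", "able", "", lambda s: measure(s) > 1))    # adjust
print(apply_rule("sing", "ing", "", has_vowel))                        # sing (stem "s" has no vowel)
```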

Problems of Porter’s Algorithm

• Unreadable results

• Does not handle some irregular verbs and adjectives
– Take/took
– Bad/worse

• Possible errors:

Too Aggressive       Too Timid
Organization/organ   Relatedness/related
Executive/execute    Create/creation

Step 4: Indexing

• Inverted Files

Vocabulary   Occurrences
DOG          D1 D3 D5
EAT          D1 D4
HAT          D4 D5
LET          D3
LIE          D3
SLEEP        D2 D3
WEAR         D5
WORLD        D1 D2
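The inverted file above can be built by mapping each stemmed term to the sorted list of documents that contain it. This is a minimal in-memory sketch; `build_inverted_index` and `search_and` are illustrative names, and the input is the Step 3 output for the example collection.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term (vocabulary) to the sorted list of document ids (occurrences)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):       # count each term once per document
            index[term].append(doc_id)
    return dict(sorted(index.items()))

def search_and(index, *terms):
    """Return the documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

stemmed = {
    "D1": ["DOG", "EAT", "DOG", "WORLD"],
    "D2": ["WORLD", "SLEEP"],
    "D3": ["LET", "SLEEP", "DOG", "LIE"],
    "D4": ["EAT", "HAT"],
    "D5": ["DOG", "WEAR", "HAT"],
}

index = build_inverted_index(stemmed)
for term, postings in index.items():
    print(f"{term:<6} {' '.join(postings)}")

print(search_and(index, "DOG", "HAT"))   # ['D5']
```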

Inverted Files

• Occurrences can point to
– Documents
– Positions in a document
– Weight

• Most commonly used indexing method

• Based on words
– Queries such as phrases are expensive to solve
– Some data does not have words
• Genetic data
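When occurrences point to positions rather than whole documents, phrase queries become possible, at the cost of checking adjacent positions for every candidate. The sketch below assumes positions are token offsets after stop-word removal and stemming; `build_positional_index` and `phrase_search` are illustrative names.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def phrase_search(index, phrase):
    """Return documents in which the terms of phrase occur at consecutive positions."""
    first, *rest = phrase
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        for start in positions:
            if all(start + k in index.get(term, {}).get(doc_id, ())
                   for k, term in enumerate(rest, start=1)):
                hits.append(doc_id)
                break                        # one match per document is enough
    return hits

docs = {
    "D1": ["DOG", "EAT", "DOG", "WORLD"],
    "D4": ["EAT", "HAT"],
    "D5": ["DOG", "WEAR", "HAT"],
}
pos_index = build_positional_index(docs)
print(pos_index["DOG"])                          # {'D1': [0, 2], 'D5': [0]}
print(phrase_search(pos_index, ["DOG", "WORLD"]))  # ['D1']
print(phrase_search(pos_index, ["DOG", "HAT"]))    # []
```

The nested position checks are what make phrase queries expensive compared with single-term lookups.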

Suffix Trees

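1234567890123456789012345678901234567890123456789012345678901234567
This is a text. A text has many words. Words are made from letters.

Index points (character positions of the indexed words): 11 (text), 19 (text), 28 (many), 33 (words), 40 (Words), 50 (made), 60 (letters)

[Figure: suffix trie over the suffixes that start at the index points. Branches: l → 60; m-a, then d → 50 or n → 28; t-e-x-t, then “.” → 11 or “ ” → 19; w-o-r-d-s, then “.” → 33 or “ ” → 40.]

[Figure: the corresponding Patricia tree]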
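The index points of the example text can also be explored without building a tree. The sketch below swaps in a different structure, a sorted suffix array over the same index points, which supports the same prefix lookups with binary search; it is not a suffix trie or Patricia tree. The 1-based positions are taken from the slide, case is folded as an assumption, and the names are illustrative.

```python
import bisect

text = "This is a text. A text has many words. Words are made from letters."
index_points = [11, 19, 28, 33, 40, 50, 60]      # 1-based positions from the slide

def suffix(pos):
    """Suffix of the text starting at a 1-based position, case-folded."""
    return text[pos - 1:].lower()

# Index points ordered by the suffix that starts at each of them
suffix_array = sorted(index_points, key=suffix)

def find(prefix):
    """Return the index points whose suffix starts with prefix."""
    prefix = prefix.lower()
    keys = [suffix(pos) for pos in suffix_array]
    i = bisect.bisect_left(keys, prefix)
    hits = []
    while i < len(keys) and keys[i].startswith(prefix):
        hits.append(suffix_array[i])
        i += 1
    return sorted(hits)

print(suffix_array)      # [60, 50, 28, 19, 11, 40, 33]
print(find("text"))      # [11, 19]
print(find("words"))     # [33, 40]
```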

Text Compression

• Represent text in fewer bits

• Symbols to be compressed are words

• Method of choice
– Huffman coding

Huffman Coding

• Developed by David Huffman (1952)

• Average of 5 bits per character

• Based on frequency distributions of symbols

• Idea: assign shorter codes to more frequent symbols

• Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols

An Example

Symbol Frequency

A 7

B 4

C 10

D 5

E 2

F 11

G 15

H 3

I 7

J 8

[Figure: Huffman tree built from the frequencies above; branches are labeled 0 and 1, and the leaves are the symbols a through j.]

Example Coding

[Figure: the same Huffman tree; reading the branch labels from the root down to a leaf gives that symbol's code.]

Symbol Code

A 0110

B 0010

C 000

D 0011

E 01110

F 010

G 10

H 01111

I 110

J 111
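The tree construction can be sketched with a priority queue: repeatedly merge the two least frequent nodes until one tree remains, then read the codes off the branches. This is a minimal illustration using the frequency table from the example; `huffman_codes` is an illustrative name. Ties between equal frequencies can be broken in different ways, so individual codes may differ from the table above, but the total encoded length is the same for every valid Huffman tree.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman tree from {symbol: frequency} and return {symbol: code}."""
    tiebreak = count()                       # keeps heap entries comparable on ties
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(total_bits)                            # 227
print(round(total_bits / sum(freqs.values()), 2))   # 3.15 bits per symbol
```

The code table above also costs 227 bits for these 72 symbol occurrences, so it is one of the optimal assignments.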

Exercise

• Consider the bit string: 011011011110001001100011101001110001101011010111

• Use the Huffman code from the example to decode it.

• Try inserting, deleting, and switching some bits at random locations and try decoding

Huffman Code

• Prefix property
– No codeword is a prefix of any other codeword

• Random access
– Decompression can start from anywhere

• Not the fastest
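The prefix property is what makes decoding unambiguous: bits can be read left to right, and a symbol is emitted as soon as the buffer matches a codeword. The sketch below uses the code table from the example; `encode` and `decode` are illustrative names, and the test string is deliberately not the one from the exercise.

```python
CODE = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
        "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}

def encode(symbols):
    """Concatenate the codeword of each symbol."""
    return "".join(CODE[s] for s in symbols)

def decode(bits):
    """Greedy decoding; correct because no codeword is a prefix of another."""
    inverse = {code: sym for sym, code in CODE.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)

bits = encode("BAD")
print(bits)                    # 001001100011
print(decode(bits))            # BAD
print(decode("1" + bits[1:]))  # GGAD  (first bit flipped)
```

In the last line a single flipped bit corrupts the first symbol, but the decoder falls back into step with the original codewords afterwards.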

Sequential string searching

• Boyer-Moore algorithm

• Example: search for “cats” in “the catalog of all cats”

• Some preprocessing is needed.

• Demos:

http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
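The idea of preprocessing the pattern and then skipping ahead in the text can be sketched with Horspool's simplification of Boyer-Moore, which keeps only a bad-character shift table; the full Boyer-Moore algorithm also uses a good-suffix rule that is omitted here. The function name `horspool_search` is illustrative.

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Preprocessing: how far the window may shift when its last character is ch
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:   # compare right to left
            j -= 1
        if j < 0:
            return i
        i += shift.get(text[i + m - 1], m)            # characters not in the pattern skip m
    return -1

print(horspool_search("the catalog of all cats", "cats"))   # 19
```

For "cats", the table is {c: 3, a: 2, t: 1}, so windows ending in a character that does not occur in the pattern are skipped four positions at a time.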