Chapter 2 Information Retrieval Part-1


Modern Information Retrieval

Document representation
• Using keywords
• Relative weight of keywords

Query representation
• Keywords
• Relative importance of keywords

Retrieval model
• Similarity between document and query
• Rank the documents

Performance evaluation of the retrieval process


Document Representation

Transforming a text document to a weighted list of keywords


Stopwords

Figure 2.2 A partial list of stopwords


Activity: Document Representation

Transform the text in the document given into a weighted list of keywords.
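As a starting point for this activity, the transformation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the stopword set is a tiny illustrative sample (not the list in Figure 2.2), the weight is simply the raw frequency, and the name `keyword_weights` is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "in",
             "is", "it", "of", "on", "or", "the", "to", "with"}

def keyword_weights(text):
    """Return (keyword, frequency) pairs, most frequent first."""
    # Lowercase the text and keep only alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords; whatever remains is treated as a keyword.
    keywords = [t for t in tokens if t not in STOPWORDS]
    # The weight used here is the raw frequency of each keyword.
    return Counter(keywords).most_common()

if __name__ == "__main__":
    sample = "The retrieval model ranks the documents by similarity to the query."
    for word, freq in keyword_weights(sample):
        print(word, freq)
```

Stemming and normalization, covered later in this chapter, would refine this list further.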


Stemming

A given word may occur in a variety of syntactic forms:
• plurals
• past tense
• gerund forms (a noun derived from a verb)

Example: the word connect may appear as connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.

A stem is what is left of a word after its affixes (prefixes and suffixes) are removed:
• Suffixes: connector, connection, connections, connected, connecting, connects
• Prefixes: preconnection and postconnection
• Stem: connect


Porter’s Algorithm

• Letters A, E, I, O, and U are vowels.
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y.
• The letter Y is a vowel if it is preceded by a consonant; otherwise it is a consonant. For example, Y in synopsis is a vowel, while in toy it is a consonant.
• In the algorithm description, a consonant is denoted by c and a vowel by v.
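The classification rules above can be sketched as a small function that maps each letter of a word to c or v. This is an illustrative sketch only; the names `is_consonant` and `cv_pattern` are hypothetical, and a Y at the start of a word (with no preceding letter) is assumed to be a consonant.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Return True if the letter at position i of word is a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y is a vowel only when the preceding letter is a consonant.
        # At the start of a word it is treated as a consonant (assumption).
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Map each letter of word to 'c' (consonant) or 'v' (vowel)."""
    word = word.lower()
    return "".join("c" if is_consonant(word, i) else "v"
                   for i in range(len(word)))

if __name__ == "__main__":
    print(cv_pattern("synopsis"))  # Y follows the consonant S, so it is a vowel
    print(cv_pattern("toy"))       # Y follows the vowel O, so it is a consonant
```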


Porter’s algorithm: Step 1

Step 1: plurals and past participles
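The rule table for this step is not reproduced in the text, so the sketch below is an assumption about what it covers: it implements only the plural-handling rules (step 1a) as given in Porter's published description (SSES to SS, IES to I, SS unchanged, S removed) and leaves out the past-participle rules. The function name `step_1a` is hypothetical.

```python
def step_1a(word):
    """Apply the plural rules of Porter's step 1a to a lowercase word.

    Rules are tried in order and only the first matching one is applied.
    """
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]      # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word           # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]      # S    -> ""   (cats -> cat)
    return word

if __name__ == "__main__":
    for w in ["caresses", "ponies", "caress", "cats", "connections"]:
        print(w, "->", step_1a(w))
```

The order matters: testing for the longest suffix first keeps "caresses" from being handled by the plain S rule.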



Steps 2–4: straightforward stripping of suffixes



Step 5: tidying-up


Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)

Porter’s algorithm


For the Tutorial
• Bring your laptop/lab
• Make sure you have Java installed
• Bring any English language text document; the extension must be .txt
• Number of words: no more than 1000


Document Representation


Term-Document Matrix
• A term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent the documents.
• Columns correspond to the index terms.
• Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).
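To make the row and column roles concrete, here is a minimal sketch that builds a frequency-based TDM from documents that have already been reduced to keyword lists. The function name `term_document_matrix` and the sample documents are hypothetical, and sorting the index terms alphabetically is just one possible column order.

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a TDM from a list of documents, each a list of keywords.

    Returns (terms, matrix): terms lists the column labels, and
    matrix[i][j] is the frequency of terms[j] in document i.
    """
    # Columns: every distinct index term in the collection, sorted.
    terms = sorted({t for doc in docs for t in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        # One row per document; a missing term counts as 0.
        matrix.append([counts[t] for t in terms])
    return terms, matrix

if __name__ == "__main__":
    docs = [["retrieval", "model", "retrieval"], ["query", "model"]]
    terms, matrix = term_document_matrix(docs)
    print(terms)
    for row in matrix:
        print(row)
```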


Term-Document matrix


Sparse Matrices: Triples


Sparse Matrices: Pairs
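Since most entries of a term-document matrix are zero, only the non-zero values need to be stored. The sketch below shows one plausible reading of the two layouts named above, which is an assumption because the original figures are not reproduced here: triples of (document index, term index, value) for the whole matrix, and a per-document list of (term index, value) pairs. The function names are hypothetical.

```python
def to_triples(matrix):
    """Store every non-zero entry as a (row, column, value) triple."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v != 0]

def to_pairs(matrix):
    """Store each row as a list of (column, value) pairs for non-zero entries."""
    return [[(j, v) for j, v in enumerate(row) if v != 0]
            for row in matrix]

if __name__ == "__main__":
    tdm = [[1, 0, 2],
           [0, 0, 0],
           [0, 3, 0]]
    print(to_triples(tdm))
    print(to_pairs(tdm))
```

In the pairs layout the row index is implicit in the position of each list, so it saves one number per entry compared with triples.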


Normalization
• Raw frequency values are not useful for a retrieval model.
• Normalized weights, usually between 0 and 1, are preferred for each term in a document.
• Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization.
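The max-frequency normalization described above can be sketched as follows. This is an illustrative sketch: the function name `normalize_rows` is hypothetical, and mapping an all-zero row to all zeros is an assumption made to avoid division by zero.

```python
def normalize_rows(matrix):
    """Divide each frequency by the largest frequency in its document (row)."""
    normalized = []
    for row in matrix:
        largest = max(row, default=0)
        if largest == 0:
            # A document with no index terms stays all zeros (assumption).
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([f / largest for f in row])
    return normalized

if __name__ == "__main__":
    tdm = [[1, 0, 2],
           [4, 2, 0]]
    for row in normalize_rows(tdm):
        print(row)
```

After this step every weight lies between 0 and 1, and the most frequent term in each document has weight 1.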


Normalized Term-Document Matrix


Vector Representation of document d1

(word, frequency, normalized frequency)


Mini project (Survey): Arabic language stemmer design
• Survey and compare existing Arabic language stemmers and write a research paper.
• Design an Arabic language stemmer.
• Reading: Hints on writing technical reports and papers

