
Chapter 2 Information Retrieval Part-1



Modern Information Retrieval

• Document representation
  • Using keywords
  • Relative weight of keywords
• Query representation
  • Keywords
  • Relative importance of keywords
• Retrieval model
  • Similarity between document and query
  • Rank the documents
• Performance evaluation of the retrieval process

Document Representation

Transforming a text document to a weighted list of keywords

Stopwords

Figure 2.2 A partial list of stopwords
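Stopword removal can be sketched as follows. The stopword set here is a small hypothetical sample, not the list from Figure 2.2, and the tokenization rule (runs of letters) is an assumption made for illustration.

```python
import re

# Hypothetical, abbreviated stopword list; Figure 2.2 is not reproduced here.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}

def remove_stopwords(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

keywords = remove_stopwords("The retrieval of a document is the goal.")
print(keywords)  # ['retrieval', 'document', 'goal']
```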

Activity: Document Representation

Transform the text in the document given into a weighted list of keywords.

Stemming

A given word may occur in a variety of syntactic forms:

• plurals
• past tense
• gerund forms (a noun derived from a verb)

Example: the word connect may appear as connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.

Stemming

A stem is what is left of a word after its affixes (prefixes and suffixes) are removed.

• Suffixes: connector, connection, connections, connected, connecting, connects
• Prefixes: preconnection, postconnection
• Stem: connect
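The idea of removing affixes to reach a stem can be illustrated with a deliberately naive stripper. The affix lists below are assumptions chosen only to cover the connect example; this is not Porter's algorithm.

```python
# Illustrative affix lists that only cover the "connect" example.
PREFIXES = ("pre", "post")
SUFFIXES = ("ions", "ion", "ing", "ed", "or", "s")

def naive_stem(word):
    """Strip at most one listed prefix and one listed suffix from the word."""
    for p in PREFIXES:
        if word.startswith(p):
            word = word[len(p):]
            break
    # Try longer suffixes first so "ions" wins over "s".
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s):
            word = word[:-len(s)]
            break
    return word

forms = ["connector", "connection", "connections", "connected",
         "connecting", "connects", "preconnection", "postconnection"]
print({naive_stem(w) for w in forms})  # {'connect'}
```

Unconditional stripping like this mangles many ordinary words (for example, "press" loses its first three letters), which is why practical stemmers attach conditions to each rule.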

Porter’s Algorithm

• Letters A, E, I, O, and U are vowels
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
• The letter Y is a vowel if it is preceded by a consonant, otherwise it is a consonant
• For example, Y in synopsis is a vowel, while in toy, it is a consonant
• A consonant in the algorithm description is denoted by c, and a vowel by v
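The vowel and consonant rules above can be checked with a short sketch that maps each letter of a word to c or v. The function names are illustrative, and treating a word-initial Y as a consonant follows from the "otherwise" clause.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Return True if word[i] is a consonant under the rules listed above."""
    ch = word[i].lower()
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Map every letter to 'c' (consonant) or 'v' (vowel)."""
    return "".join("c" if is_consonant(word, i) else "v" for i in range(len(word)))

print(cv_pattern("synopsis"))  # cvcvccvc
print(cv_pattern("toy"))       # cvc
```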

Porter’s algorithm: Step 1

Step 1: plurals and past participles
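The rule table for this slide is not reproduced here. As a hedged illustration, the plural-handling part of Step 1 (Step 1a in Porter's published description) can be sketched as below; the past-participle rules, which carry extra conditions, are left out.

```python
def step_1a(word):
    """Apply the first matching plural rule of Porter's Step 1a."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I
    if word.endswith("ss"):
        return word        # SS   -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S    -> removed
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```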

Porter’s algorithm: Step 2

Steps 2–4: straightforward stripping of suffixes

Porter’s algorithm: Step 3

Steps 2–4: straightforward stripping of suffixes

Porter’s algorithm: Step 4

Steps 2–4: straightforward stripping of suffixes

Porter’s algorithm: Step 5

Step 5: tidying-up

Porter’s algorithm

Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)

For the Tutorial

• Bring your laptop/lab
• Make sure you have Java installed
• Bring any English-language text document; the extension must be .txt
• Number of words: no more than 1000

Document Representation

Term-Document Matrix

• Term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent various documents
• Columns correspond to various index terms
• Values in the matrix can be either the frequency or weight of the index term (identified by the column) in the document (identified by the row).
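A minimal sketch of building such a matrix with raw frequencies as values. The two sample documents and the whitespace tokenization are assumptions for illustration; stopword removal and stemming are skipped.

```python
from collections import Counter

def build_tdm(docs):
    """Return (terms, matrix): one row per document, one column per index term."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

docs = ["information retrieval model",
        "retrieval of information and retrieval"]
terms, tdm = build_tdm(docs)
print(terms)  # ['and', 'information', 'model', 'of', 'retrieval']
print(tdm)    # [[0, 1, 1, 0, 1], [1, 1, 0, 1, 2]]
```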

Term-Document matrix

Sparse Matrices: Triples
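The slide's figure is not reproduced here. One common reading of the triples form, assumed in this sketch, is a list of (row, column, value) entries for the non-zero cells only.

```python
def to_triples(matrix):
    """List (row, column, value) for every non-zero cell of a dense matrix."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v != 0]

# Hypothetical 2-document, 5-term frequency matrix.
dense = [[0, 1, 1, 0, 1],
         [1, 1, 0, 1, 2]]
triples = to_triples(dense)
print(triples)
# [(0, 1, 1), (0, 2, 1), (0, 4, 1), (1, 0, 1), (1, 1, 1), (1, 3, 1), (1, 4, 2)]
```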

Sparse Matrices: Pairs
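The slide's figure is not reproduced here. This sketch assumes the pairs form keeps, for each document (row), a list of (column, value) pairs for its non-zero terms, so the row index is implied by position.

```python
def to_pairs(matrix):
    """For each row, list (column, value) pairs of its non-zero cells."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in matrix]

# Hypothetical 2-document, 5-term frequency matrix.
dense_rows = [[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 2]]
pairs = to_pairs(dense_rows)
print(pairs[0])  # [(1, 1), (2, 1), (4, 1)]
print(pairs[1])  # [(0, 1), (1, 1), (3, 1), (4, 2)]
```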

Normalization

• Raw frequency values are not useful for a retrieval model
• Prefer normalized weights, usually between 0 and 1, for each term in a document
• Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization
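The max-frequency normalization described above can be sketched as follows. The function name and the handling of an all-zero row are assumptions.

```python
def normalize_row(freqs):
    """Divide every frequency in a document row by the row's largest frequency."""
    largest = max(freqs, default=0)
    if largest == 0:
        # Empty or all-zero document: nothing to scale.
        return [0.0 for _ in freqs]
    return [f / largest for f in freqs]

print(normalize_row([1, 1, 0, 1, 2]))  # [0.5, 0.5, 0.0, 0.5, 1.0]
```

Every weight lands in the range 0 to 1, and the most frequent term in each document always gets weight 1.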

Normalized Term-Document Matrix

Vector Representation of document d1

(word, frequency, normalized frequency)
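A sketch of producing (word, frequency, normalized frequency) entries for one document. The token list is hypothetical and stands in for d1, whose contents are not reproduced here; normalization divides by the largest frequency in the document.

```python
from collections import Counter

def document_vector(tokens):
    """Return sorted (word, frequency, normalized frequency) triples."""
    counts = Counter(tokens)
    largest = max(counts.values(), default=1)
    return [(w, f, f / largest) for w, f in sorted(counts.items())]

# Hypothetical keyword list after stopword removal and stemming.
tokens = ["retrieval", "information", "retrieval", "model"]
for entry in document_vector(tokens):
    print(entry)
# ('information', 1, 0.5)
# ('model', 1, 0.5)
# ('retrieval', 2, 1.0)
```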

Mini project (Survey)

Arabic language stemmer design

• Survey and compare existing Arabic language stemmers and write a research paper.
• Design an Arabic language stemmer
• Reading: Hints on writing technical reports and papers