Ir 03

1. Lecture 03 Information Retrieval

2. Boolean Retrieval Model: Processing Boolean queries
To process a simple conjunctive query such as Brutus AND Calpurnia using an inverted index and the basic Boolean retrieval model, we follow these steps:
   1. Locate Brutus in the dictionary.
   2. Retrieve its postings.
   3. Locate Calpurnia in the dictionary.
   4. Retrieve its postings.
   5. Intersect the two postings lists.

3. Boolean Retrieval Model: Processing Boolean queries
The intersection operation is the crucial one: we need to intersect postings lists efficiently so that we can quickly find the documents that contain both terms. This operation is sometimes referred to as merging postings lists.
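The intersection (merge) step can be sketched as follows. This is a minimal illustration, assuming each postings list is a Python list of document IDs sorted in increasing order; the function name and the document IDs are made up for the example.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i, j = 0, 0
    # Walk both lists in step; advance the pointer at the smaller docID.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Hypothetical postings lists (document IDs invented for illustration).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each loop iteration advances at least one pointer, which is why sorted postings lists matter: without a shared ordering the single pass would not work.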

4. Boolean Retrieval Model: Processing Boolean queries
If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations.
How do we process more complex queries?
Example: (Brutus OR Caesar) AND NOT Calpurnia
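One possible way to handle the example query is to add two more linear merges, one for OR and one for AND NOT. The sketch below assumes sorted lists of document IDs; the helper names `union` and `and_not` and the document IDs are hypothetical.

```python
def union(p1, p2):
    """Merge two sorted postings lists for an OR query, in O(x + y)."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # One list is exhausted; whatever remains in the other belongs to the union.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer


def and_not(p1, p2):
    """Keep the docIDs of p1 that do not occur in p2, in O(x + y)."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            j += 1
    answer.extend(p1[i:])
    return answer


# Hypothetical postings lists.
brutus = [1, 2, 4, 11]
caesar = [1, 2, 5, 6]
calpurnia = [2, 6]
# (Brutus OR Caesar) AND NOT Calpurnia
print(and_not(union(brutus, caesar), calpurnia))  # [1, 4, 5, 11]
```

Evaluating AND NOT as a difference against an already computed list avoids materializing the full complement of Calpurnia, which could be nearly the whole collection.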

5. Boolean Retrieval Model: Processing Boolean queries
Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
Example: Brutus AND Caesar AND Calpurnia

6. Boolean Retrieval Model: Processing Boolean queries
Query: Brutus AND Caesar AND Calpurnia
A major element of query optimization is the order in which postings lists are accessed.
What is the best order for query processing?
For example: (Calpurnia AND Brutus) AND Caesar

7. Boolean Retrieval Model: Processing Boolean queries
If we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work.
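The ordering heuristic can be sketched as follows: sort the postings lists by length and intersect from the shortest upward. This is a minimal sketch assuming sorted lists of document IDs; `intersect_many` is a hypothetical name and the postings are invented.

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by document ID."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """Process a conjunctive query, shortest postings list first."""
    ordered = sorted(postings_lists, key=len)
    if not ordered:
        return []
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # the intersection is already empty, so stop early
        result = intersect(result, postings)
    return result


# Hypothetical postings lists for Brutus AND Caesar AND Calpurnia.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]
print(intersect_many([brutus, caesar, calpurnia]))  # [2]
```

With these lists the shortest one (Calpurnia) is processed first, so every intermediate result has at most four entries.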

8. The term vocabulary and postings lists: Choosing a Document Unit
Question: What is the document unit that should be used for indexing?
Email messages: text message, attachment (.doc file / .rar file)
Collection of books: individual books (entire book as a unit), each chapter as a unit, individual sentences
Precision vs. recall

9. The term vocabulary and postings lists: Determining the vocabulary of terms
Recall the major steps in inverted index construction:
   1. Collect the documents to be indexed.
   2. Tokenize the text.
   3. Do linguistic preprocessing of tokens.
   4. Index the documents that each term occurs in.
Tokenization: the process of chopping character streams into tokens, throwing away certain characters.
Linguistic preprocessing: deals with building equivalence classes of tokens, which are the set of terms that are indexed.
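The four construction steps can be sketched end to end. This is a toy illustration under simplifying assumptions: tokenization keeps only runs of letters and digits, linguistic preprocessing is reduced to lowercasing, and the two documents are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Step 2: chop the character stream into tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(token):
    """Step 3: a stand-in for linguistic preprocessing (lowercasing only)."""
    return token.lower()


def build_index(docs):
    """Step 4: map each term to the sorted list of docIDs it occurs in."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending docIDs keep postings sorted
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)


# Step 1: collect the documents (hypothetical two-document collection).
docs = {1: "Brutus killed Caesar.", 2: "Caesar praised Calpurnia!"}
index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```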

10. The term vocabulary and postings lists: Determining the vocabulary of terms
Token, type, or term?
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
A type is the class of all tokens containing the same character sequence.
A term is a type that is included in the IR system's dictionary (a
Tokenization: the process of chopping character streams into tokens, throwing away certain characters.
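A small worked example may help separate the three notions. The sketch below assumes whitespace tokenization and a one-word stop list, both chosen only for illustration; `count_units` is a hypothetical helper.

```python
def count_units(text, stop_list):
    """Return (number of tokens, number of types, number of terms)."""
    tokens = text.split()      # every occurrence counts as a token
    types = set(tokens)        # distinct character sequences
    terms = types - stop_list  # types that make it into the dictionary
    return len(tokens), len(types), len(terms)


# "to" occurs twice: two tokens, one type. With "to" on the stop list
# (an assumption made for this example), it is not a term at all.
print(count_units("to sleep perchance to dream", {"to"}))  # (5, 4, 3)
```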

11. The term vocabulary and postings lists: Determining the vocabulary of terms
What about apostrophes for possession and contractions?
doc_1: Dr. Thomas O'Daniel has been the President of Research since December 2006.
doc_2: Students' solutions weren't correct.
doc_3: Ahmad's notebook isn't cheap.
Example: Query = O'Daniel AND Research
Token 1: odaniel
Token 2: o'daniel
Token 3: o' daniel
Token 4: o daniel
Question: What are the correct tokens to use?
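The candidate tokenizations of a name written with an apostrophe, such as O'Daniel, correspond to different policies. The sketch below shows three of them; the function names are hypothetical, and none of the policies is claimed to be the correct one.

```python
def drop_apostrophe(word):
    """Delete the apostrophe and keep one token: O'Daniel -> odaniel."""
    return [word.replace("'", "").lower()]


def keep_apostrophe(word):
    """Keep the apostrophe inside one token: O'Daniel -> o'daniel."""
    return [word.lower()]


def split_on_apostrophe(word):
    """Split at the apostrophe into two tokens: O'Daniel -> o, daniel."""
    return [part for part in word.lower().split("'") if part]


for policy in (drop_apostrophe, keep_apostrophe, split_on_apostrophe):
    print(policy.__name__, policy("O'Daniel"))
```

Whichever policy is chosen, the query has to be tokenized the same way as the documents, otherwise O'Daniel AND Research cannot match doc_1.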

12. The term vocabulary and postings lists: Determining the vocabulary of terms
What about tokens associated with special characters?
doc_1: C# is a high-level, multi-paradigm, general-purpose programming language.
doc_2: C++ (pronounced cee plus plus) is a general purpose programming language.
doc_3: A+ is an array programming language descendent from the programming language A.
Example: Query = C AND programming
Token 1: C#
Token 2: C #
Question: What are the correct tokens to use?
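The effect of special characters can be seen by comparing two regular-expression tokenizers. This is a sketch under stated assumptions: the patterns are illustrative only, and the second one special-cases a trailing `#`, `+`, or `++`, which is a design choice made for this example rather than a general rule.

```python
import re

# Splits on anything that is not a letter or digit, so "C#" becomes "c".
NAIVE = re.compile(r"[A-Za-z0-9]+")
# Additionally keeps a trailing "++", "#" or "+" attached to the token.
AWARE = re.compile(r"[A-Za-z0-9]+(?:\+\+|[#+])?")


def naive_tokens(text):
    return [t.lower() for t in NAIVE.findall(text)]


def aware_tokens(text):
    return [t.lower() for t in AWARE.findall(text)]


text = "C++ and C# and A+"
print(naive_tokens(text))  # ['c', 'and', 'c', 'and', 'a']
print(aware_tokens(text))  # ['c++', 'and', 'c#', 'and', 'a+']
```

With the naive tokenizer the query C AND programming matches all three documents; with the second tokenizer it matches none of the tokens C#, C++, or A+.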

13. The term vocabulary and postings lists: Determining the vocabulary of terms
What about hyphenated tokens?
doc_1: C# is a high-level, multi-paradigm, general-purpose programming language.
doc_2: C++ (pronounced cee plus plus) is a general purpose programming language.
doc_3: A+ is an array programming language descendent from the programming language A.
Example: Query = general-purpose AND programming
Token 1: general-purpose
Token 2: general purpose
Question: What are the correct tokens to use?

14. The term vocabulary and postings lists: Determining the vocabulary of terms
What about tokens that should be regarded as a single token?
doc_1: The West Bank, including East Jerusalem, has a land area of 5,640 km².
doc_2: The West bank and Gaza Strip.
doc_3: There is a branch of the Arab Bank in Palestine in the West of Jenin City.
Example: Query = West Bank AND Palestine
Token 1: West Bank
Token 2: West
Token 3: Bank
Question: What are the correct tokens to use?

15. The term vocabulary and postings lists: Dropping Common Terms (Stop Word Removal)
Stop words: some extremely common words that would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely.
Using a stop list significantly reduces the number of postings that a system has to store.
Keyword searches with terms like "the" and "by" don't seem very useful. However, this is not true for phrase searches: the meaning of "flights to London" is likely to be lost if the word "to" is stopped out.
Example: The phrase query "President of the United States" or "Flights to London" is more precise than President AND "United States" or Flights AND London.
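The loss for phrase queries can be seen in a few lines. The stop list below is a hypothetical one chosen for illustration, and tokens are assumed to be lowercased already.

```python
# Hypothetical stop list; real stop lists vary in size and content.
STOP_WORDS = {"the", "of", "to", "by", "a", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears on the stop list."""
    return [t for t in tokens if t not in stop_words]


print(remove_stop_words("flights to london".split()))
# ['flights', 'london']
print(remove_stop_words("president of the united states".split()))
# ['president', 'united', 'states']
```

After filtering, "flights to london" and "flights from london" would both reduce to the tokens flights and london, so the index can no longer tell the two phrases apart.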

16. The term vocabulary and postings lists: Dropping Common Terms (Stop Word Removal)
The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms), to very small stop lists (7-12 terms), to no stop list whatsoever.
Question: Do we really need to use stop lists?
Question: How can we exploit the statistics of language so as to cope with common words in better ways?

17. The term vocabulary and postings lists: Normalization (equivalence classing of terms)
Token normalization is the process of canonicalizing (standardizing or normalizing) tokens so that matches occur despite superficial differences in the character sequences of the tokens.
The easy case is if tokens in the query just match tokens in the token list of the document. However, there are many cases when two character sequences are not quite the same but you would like a match to occur.
Query: Token 1, Token 2
Document: Token 1, Token 2

18. The term vocabulary and postings lists: Normalization (equivalence classing of terms)
Create equivalence classes, which are normally named after one member of the set.
Query                  Document
anti-discriminatory    antidiscriminatory
co-author              coauthor
U.S.A                  USA
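One simple way to implement such equivalence classes is a mapping rule applied to both query and document tokens. The sketch below assumes the rule is "delete hyphens and periods"; the rule and the function name are illustrative choices, not the only possible normalization.

```python
def normalize(token):
    """Map a token to its class representative by deleting hyphens and periods."""
    return token.replace("-", "").replace(".", "")


for query_token in ("anti-discriminatory", "co-author", "U.S.A"):
    print(query_token, "->", normalize(query_token))
```

Because the same function is applied at indexing time and at query time, both spellings end up under the same dictionary entry.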

19. The term vocabulary and postings lists: Normalization (equivalence classing of terms)
An alternative is to maintain relations between unnormalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile.
These term relationships can be achieved in two ways:
   1. The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term.
   2. The alternative is to perform the expansion during index construction. When the document contains automobile, we index it under car as well (and, usually, also vice-versa).
Use of either of these methods is considerably less efficient than equivalence classing, as there are more postings to store and
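The two ways of using term relations can be sketched side by side. This is a toy illustration: the synonym table, the function names, and the two documents are all hypothetical, and tokenization is plain whitespace splitting.

```python
# Hypothetical hand-constructed synonym list.
SYNONYMS = {"car": ["automobile"], "automobile": ["car"]}


def build_index(docs, synonyms=None):
    """Build an index; if synonyms are given, expand at index time (way 2)."""
    synonyms = synonyms or {}
    index = {}
    for doc_id in sorted(docs):
        expanded = set()
        for token in docs[doc_id].lower().split():
            expanded.add(token)
            expanded.update(synonyms.get(token, []))
        for term in expanded:
            index.setdefault(term, []).append(doc_id)
    return index


def search_with_expansion(index, term, synonyms=SYNONYMS):
    """Way 1: leave the index alone and expand the query term instead."""
    doc_ids = set()
    for candidate in [term] + synonyms.get(term, []):
        doc_ids.update(index.get(candidate, []))
    return sorted(doc_ids)


docs = {1: "red car", 2: "used automobile"}
plain = build_index(docs)
print(search_with_expansion(plain, "car"))   # [1, 2]
print(build_index(docs, SYNONYMS)["car"])    # [1, 2]
```

Query-time expansion looks up more dictionary entries per query, while index-time expansion stores extra postings; in this toy collection the expanded index holds six postings instead of four.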

20. The term vocabulary and postings lists: Accents and Diacritics
Diacritics: signs which, when written above or below a letter, indicate a difference in pronunciation from the same letter when unmarked or differently marked.
In English: naive and naïve. Such variants can be matched by normalizing tokens to remove diacritics.
What about other languages? It might be best to equate all words to a form without diacritics.
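Removing diacritics can be sketched with Python's standard `unicodedata` module: decompose each character into a base letter plus combining marks, then drop the marks. The function name is hypothetical, and letters that have no such decomposition (for example "ø") pass through unchanged.

```python
import unicodedata


def strip_diacritics(token):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("naïve"))   # naive
print(strip_diacritics("résumé"))  # resume
```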

21. The term vocabulary and postings lists: Capitalization/Case-folding
Case-folding refers to reducing all letters to lower case.
Naive -> naive
General Motors -> general motors
Drew University -> drew university
Drew West -> drew west

22. The term vocabulary and postings lists: Capitalization/Case-folding
Case-folding refers to reducing all letters to lower case.
C.A.T -> cat

23. The term vocabulary and postings lists: Capitalization/Case-folding
An alternative to making every token lowercase is to just make some tokens lowercase. The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase or in which most or all words are capitalized. Mid-sentence capitalized words are left as capitalized (which is usually correct).
However, trying to get capitalization right in this way probably doesn't help if your users usually use lowercase regardless of the correct case of words. Thus, lowercasing everything often remains the most practical solution.
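A rough sketch of the selective heuristic is given below. It is simplified on purpose: sentences are assumed to end in ".", "!" or "?", only the all-uppercase case of the title rule is handled, and the function name is hypothetical.

```python
import re


def heuristic_case_fold(text):
    """Lowercase sentence-initial words and all-uppercase runs only."""
    tokens = []
    # Split after sentence-final punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z]+", sentence)
        if not words:
            continue
        if all(w.isupper() for w in words):
            # Treated like an all-uppercase title: lowercase everything.
            tokens.extend(w.lower() for w in words)
        else:
            tokens.append(words[0].lower())  # sentence-initial word
            tokens.extend(words[1:])         # mid-sentence words keep their case
    return tokens


print(heuristic_case_fold("The company is General Motors. It builds cars."))
# ['the', 'company', 'is', 'General', 'Motors', 'it', 'builds', 'cars']
```

The sketch also shows a weakness of the heuristic: a sentence that begins with a proper noun, such as "Drew University is small.", has that noun lowercased as well.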

24. The term vocabulary and postings lists: Other issues in English
Other possible normalizations are quite idiosyncratic and particular to English. For instance, you might wish to equate:
   colour and color
   3/12/91 and Mar. 12, 1991
In the U.S., 3/12/91 is Mar. 12, 1991, whereas in Europe it is 3 Dec 1991.
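The date ambiguity can be made concrete with Python's standard `datetime` module. The sketch below assumes the caller knows which convention a document follows; the function name and the "US"/"EU" labels are hypothetical.

```python
from datetime import datetime


def parse_numeric_date(text, convention):
    """Normalize a d/m/y or m/d/y date string to ISO format (YYYY-MM-DD)."""
    if convention == "US":
        fmt = "%m/%d/%y"  # month first
    elif convention == "EU":
        fmt = "%d/%m/%y"  # day first
    else:
        raise ValueError("unknown convention: " + convention)
    return datetime.strptime(text, fmt).date().isoformat()


print(parse_numeric_date("3/12/91", "US"))  # 1991-03-12
print(parse_numeric_date("3/12/91", "EU"))  # 1991-12-03
```

The same character sequence normalizes to two different dates, so the convention has to come from outside the token itself, for example from the document's language or origin.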
