23
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

  • View
    219

  • Download
    1

Embed Size (px)

Citation preview

Page 1: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

1

CS 430: Information Discovery

Lecture 3

Inverted Files

and

Boolean Operations

Page 2: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

2

Course Administration

• Assignment 1 will be posted during the next couple of days. It is due on Friday, September 21 at 5 p.m.

Page 3: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

3

Inverted File (Basic Definition)

Inverted file: a list of the words in a set of documents and the documents in which they appear.

Word Document

abacus 3 19 22

actor 2 19 29

aspen 5 atoll 11

34 Stop words are removed before building the index.

Page 4: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

4

Inverted List

Inverted list: All the entries in an inverted file that apply to a specific word, e.g.

abacus 3 19 22

Posting: Entry in an inverted list, e.g., the postings for "abacus" are documents 3, 19, 22.

Page 5: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

5

Keywords and Controlled Vocabulary

Keyword:

A term that is used to describe the subject matter in a document. It is sometimes called an index term.

Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer.

Controlled vocabulary:

A list of words that can be used as keywords, e.g., in a medical system, a list of medical terms.

Inverted file (more complete definition):

A list of the keywords that apply to a set of documents and the documents in which they appear.

Page 6: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

6

Enhancements to Inverted Files

Location: The inverted file holds information about the location of each term within the document.

Uses

adjacency and near operators

user interface design -- highlight location of search term

Frequency: The inverted file includes the number of postings for each term.

Uses

term weighting

query processing optimization

user interface design

Page 7: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

7

Inverted File (Enhanced)

Word Postings Document Location

abacus 4 3 94 19 7 19 212

22 56actor 3 2 66

19 213 29 45

aspen 1 5 43atoll 3 11 3

11 70 34 40

Page 8: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

8

Example: Boolean Queries

Boolean query: two or more search terms, related by logical operators, e.g.,

and or not

Examples:

abacus and actor

abacus or actor

(abacus and actor) or (abacus and atoll)

not actor

Page 9: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

9

Boolean Diagram

A B

A and B

A or B

not (A or B)

Page 10: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

10

Evaluating a Boolean Query

3 19 22 2 19 29

To evaluate the and operator, merge the two inverted lists

with a logical AND operation.

Examples: abacus and actor

Postings for abacus

Postings for actor

Document 19 is the only document that contains both terms, "abacus" and "actor".

Page 11: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

11

Adjacent and Near Operators

abacus adj actor

Terms abacus and actor are adjacent to each other as in the string

"abacus actor"

abacus near 4 actor

Terms abacus and actor are near to each other as in the string

"the actor has an abacus"

Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

Page 12: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

12

Evaluating an Adjacency Operation

Examples: abacus adj actor

Postings for abacus

Postings for actor

Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.

3 94 19 719 212 22 56

2 66 19 213 29 45

Page 13: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

13

Evaluation of Boolean Operators

Precedence of operators must be defined:

adj, near high

and, not

or low

Example

A and B or C and B

is evaluated as

(A and B) or (C and B)

Page 14: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

14

Set Records UniqueTerms

A 2,653 5,123

B 38,304 c.25,000

Sizes of Inverted Files

Set A has an average of 14 postings per term and a maximum of over 2,000 postings per term.

Set B has an average of 88 postings per record.

Examples from Harman and Candela, 1990

Page 15: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

15

Representation of Inverted Files

Index (vocabulary) file: Stores list of terms (keywords). Designed for rapid searching and processing range queries. Often held in memory.

Postings file: Stores a list of postings for each term. Designed for rapid merging of lists. Each list may be stored sequentially.

Document file: [Repositories for the storage of document collections are covered in CS 502.]

Page 16: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

16

Organization of Inverted Files

Term Pointer topostings

ant

bee

cat

dog

elk

fox

gnu

hog

Inverted lists

Index (vocabulary) file Postings file Documents file

Page 17: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

17

Decisions in Building Inverted Files

• Underlying character set, e.g., printable ASCII, Unicode, UTF8.

• Whether to use a controlled vocabulary. If so, what words to include.

• List of stopwords.

• Rules to decide the beginning and end of words, e.g., spaces or punctuation.

• Character sequences not to be indexed, e.g., sequences of numbers.

Page 18: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

18

Efficiency Criteria

Storage

Inverted files are big, typically 10% to 100% the size of the collection of documents.

Update performance

It must be possible, with a reasonable amount of computation, to:

(a) Add a large batch of documents

(b) Add a single document

Retrieval performance

Retrieval must be fast enough to satisfy users and not use excessive resource.

Page 19: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

19

Index File

If an index is held on disk, search time is dominated by the number of disk accesses.

Suppose that an index has 1,000,000 distinct terms.

Each index entry consists of the term and a pointer to the inverted list, average 100 characters.

Size of index is 100 megabytes, which can easily be held in memory.

Page 20: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

20

Postings File

Since inverted lists may be very long, it is important to match postings efficiently.

Usually, the inverted lists will be held on disk. Therefore algorithms for matching posting use sequential file processing.

For efficient matching, the inverted lists should all be sorted in the same sequence, usually alphabetic order, "lexicographic index".

Merging inverted lists is the most computationally intensive task in many information retrieval systems.

Page 21: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

21

Efficiency and Query Languages

Some query options may require huge computation, e.g.,

Regular expressions

If inverted files are stored in alphabetical order,

comp* can be processed efficiently *comp cannot be processed efficiently

Boolean terms

If A and B are search terms

A or B can be processed by comparing two moderate sized lists (not A) or (not B) requires two very large lists

Page 22: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

22

Index File Structures: Linear Index

Term Pointer tolist of postings

ant

bee

cat

dog

elk

fox

gnu

hog

Inverted lists

Page 23: 1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

23

Linear Index

Advantages

Can be searched quickly, e.g., by binary search, O(log n)

Good for sequential processing, e.g., comp*

Convenient for batch updating

Economical use of storage

Disadvantages

Index must be rebuilt if an extra term is added