
Slide 1

Basics of Information Retrieval
Steven O. Kimbrough, 960803

Slide 2

The Information Retrieval Problem: Executive Summary, 1

• The basic information retrieval (IR) problem
– Given a task at hand and a collection of documents, find all information relevant to the task

• The IR problem is
– Universally present

– Often not recognized at all

– Often incorrectly thought to be solved

– Fundamentally very difficult

• Good solutions
– Require ranked retrieval
– Do not require user knowledge of all the relevant search terms

Slide 3

The Information Retrieval Problem: Executive Summary, 2

• Improved search engines/methods are becoming available
– Web search engines, Verity, Lotus Notes, etc.

• Little is known about the performance of standard methods
– And what IS known is very discouraging

• Less is known about newer methods

• The Internet exacerbates the problem--and is delivering some newer methods

• Expect good IR to be on the agenda for organizations wanting to operate at peak efficiency

Slide 4

Example of IR at Work: Litigation Support

• A large construction company is being sued over performance on a major project
– Find all documents relevant to defending against the suit

• Assume
– All 50,000+ documents are on line

– Unlimited availability of a powerful full-text retrieval system

» Search on given words

» Search on Boolean combinations of words

» Proximity searching

– Lawyers are experienced, highly skilled and incented, and are supported by first-rate assistants

• What percentage of the relevant documents are actually found?

Slide 5

The Information Retrieval Problem: Basics of IR, 1

• The corporate memory problem
– Information pertinent to the task at hand has passed through the organization.

» Was this information captured?

» If so, can it be effectively retrieved and brought to bear on the present task?

» If so, can it be understood and interpreted?

• The information retrieval (IR) problem
– Narrower than corporate memory

» Focused on documents

– Broader than corporate memory

» Individuals and organizations

• Our interest: the broadest sense(s)
– “the IR problem” or “the corporate memory problem” or “the organizational memory problem”

Slide 6

The Information Retrieval Problem: Basics of IR, 2

• The IR problem is very hard
– And will remain so

• Why? Many reasons, including:
– Documents are not (very) structured

» Compare: database searches vs document base searches

– Language is not (very) coöperative

» DNA: microbiology or Digital Equipment Corporation’s Network Architecture?

» free rider: game theory or urban transportation systems?

» corporate memory or organizational memory?

• Physical access vs logical access
– Physical: relatively easy

– Logical: terribly difficult

– Internet?

Slide 7

The Information Retrieval Problem: Basics of IR, 3

• Kinds of information searches
– Framework from David Blair, “Search Exhaustivity and Data Base Size as a Framework for Text Retrieval Systems”

• Distinctions
– Large vs small (document) data bases

– Exhaustive vs sample searches

– Content vs context searches

• And so... 2 × 2 × 2 = 8 kinds of searches

Slide 8

The Information Retrieval Problem: Basics of IR, 4

• Resulting framework of searches, ordered by difficulty
– Large, exhaustive, content

– Large, exhaustive, context

– Large, sample, content

– Large, sample, context

– Small, exhaustive, content

– Small, exhaustive, context

– Small, sample, content

– Small, sample, context

Slide 9

The Information Retrieval Problem: Basic IR Technology, 1

• Recall: IR is a hard problem
– And it is a real problem in real organizations for real people

• Examples from Blair
– “The Management of Information: Basic Distinctions”

• Example from the Coast Guard
– We know we have the document, but have no idea where it is

» Retrieve from the Congressional Record

Slide 10

The Information Retrieval Problem: Basic IR Technology, 2

• Your basic IR technology
– full text or keyword retrieval, with

– Boolean combinations and

– Location indicators

• Full text--has everything
– Or does it?

• Keyword indexing
– Requires work

• Boolean combination of words
– Usual Boolean operators: AND, OR, NOT

– This is a logically complete set

Slide 11

The Information Retrieval Problem: Basic IR Technology, 3

• Examples of Boolean search (sketched in code below)
– computer AND network

– (corporate OR organizational) AND memory

• Boolean formulæ:
– Natural for many queries

– Fundamentally difficult for many people

» Disjunctive normal form

» Conjunctive normal form

– Fundamentally deterministic

» And that’s a problem
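To make the set semantics concrete, here is a minimal Python sketch (mine, not from the slides; the documents and terms are invented for illustration) that runs the two example queries as set operations over a small inverted index:

```python
# Minimal sketch of Boolean retrieval as set operations over an
# inverted index. The four "documents" are invented for illustration.

docs = {
    1: "computer network design",
    2: "corporate memory and retrieval",
    3: "organizational memory in the computer age",
    4: "network protocols",
}

# Inverted index: term -> set of ids of documents containing the term.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    return index.get(term, set())

# computer AND network  ->  intersection
print(postings("computer") & postings("network"))          # {1}

# (corporate OR organizational) AND memory  ->  union, then intersection
print((postings("corporate") | postings("organizational"))
      & postings("memory"))                                # {2, 3}

# NOT network  ->  complement against the whole collection
print(set(docs) - postings("network"))                     # {2, 3}
```

Because AND, OR, NOT form a logically complete set, any subset of the collection definable from term occurrences can be selected this way; the all-or-nothing, deterministic answers are exactly what the last bullet flags as a problem.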

Slide 12

The Information Retrieval Problem: Basic IR Technology, 4

• Recall vs precision (see the sketch below)
– Everything: U
– What you want: A + B
– What you get: B + C
– Recall: B/(A + B)
– Precision: B/(B + C)

[Venn diagram: universe U contains the wanted set (A + B) and the retrieved set (B + C), which overlap in B]
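A minimal sketch (not from the slides; the document ids are hypothetical) of the two ratios as set computations:

```python
# Recall and precision as set ratios.

def recall_precision(wanted, retrieved):
    b = len(wanted & retrieved)                  # B: wanted AND retrieved
    return b / len(wanted), b / len(retrieved)   # B/(A + B), B/(B + C)

wanted = set(range(1, 11))                  # A + B: the 10 relevant documents
retrieved = {1, 2, 3, 4, 20, 21, 22, 23}    # B + C: the 8 documents returned
print(recall_precision(wanted, retrieved))  # (0.4, 0.5)
```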

Slide 13

The Information Retrieval Problem: Basic IR Technology, 5

• Recall our 8-way framework
– Large, exhaustive, content

– Large, exhaustive, context

– Large, sample, content

– Large, sample, context

– Small, exhaustive, content

– Small, exhaustive, context

– Small, sample, content

– Small, sample, context

• When and where and how does the recall vs precision distinction matter?

Slide 14

The Information Retrieval Problem: Basic IR Technology, 6

• So, how well does full text retrieval work?

• Hard to tell
– Recall the recall vs precision diagram

– How do we find A?

– Few good studies

– The Blair & Maron STAIRS study (1985), “An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System” is about the best

• STAIRS study
– BART

– Litigation support

– Results: bad news

Slide 15

The Information Retrieval Problem: IR Theory

• Why should IR be such a difficult problem?

• “We all know why we’re here.”

• Zipf word distributions

• Scale is the problem

• Concept: futility point(s)

• Demise of the library model

• Collection partitioning

• IR as communication

• Importance of context

Slide 16

The Information Retrieval Problem: IR Theory, Example

• “Operated on this morning. Diagnosis not yet complete but results seem satisfactory and already exceed expectations.”

Slide 17

The Information Retrieval Problem: Requirements for IR Systems

• Standard features
– Full text or keyword searches

– Boolean searches

– Partial matches

– Positional searches

• Ranked retrieval
– Note: distinguished from database queries

• Sensitivity to semantic latency

• Context sensitivity

• Ability to exploit partial structuring
– SGML, time lines, causal models, etc.

Slide 18

The Information Retrieval Problem: Non-Basic IR Technologies

• Ranking algorithms

• Latent semantic indexing

• Genetic searching

• Faceted Indexing

...for now, ranking algorithms only

Slide 19

Retrieval Algorithms/Approaches

• Three categories of approach:
– Boolean

– Vector space

– Probabilistic

• Boolean
– Find matches on words and combinations of words

– The standard approach

– Known not to work well

• Vector space
– Using the indexing, place documents and queries in a hyperspace and measure distance (a sketch follows this list)

– Example: DCB algorithm (follows)

• Probabilistic
– Like the vector space approach, but impose a probability distribution on the elements in the space
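As one way to make the vector space idea concrete, here is a small Python sketch (mine, not from the slides; the documents, query, and the choice of plain cosine similarity over term counts are illustrative assumptions) that places documents and a query in word space and ranks by closeness:

```python
# Sketch of vector-space ranking: documents and the query become
# term-count vectors; rank documents by cosine similarity to the query.
import math
from collections import Counter

docs = {
    "d1": "corporate memory and corporate records",
    "d2": "organizational memory in practice",
    "d3": "network protocols",
}
query = "corporate memory"

def vec(text):
    return Counter(text.split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = vec(query)
ranked = sorted(docs, key=lambda d: cosine(vec(docs[d]), q), reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']: d1 lies closest to the query
```

Unlike a Boolean match, every document gets a graded score, which is what makes ranked retrieval possible.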

Slide 20

The Information Retrieval Problem: The DCB Ranking Algorithm, 1

• Needed for good IR:
– Ranked retrieval

– Sensitivity to semantic latency

– (and other things)

• DCB: A vector space approach
– Ranking based on location of documents in “word space”

• How does it work?

Slide 21

The Information Retrieval Problem: The DCB Ranking Algorithm, 2

• Consider K, an array of 1s and 0s
– Rows: keywords

– Columns: documents

– Entries

» 1 if the document has the keyword

» 0 else

• Example, to illustrate:

K =
1 0 1 0 1 1
1 1 1 1 0 1
1 1 0 0 0 0
1 1 1 0 0 0
0 0 0 1 1 0
1 1 0 0 1 0

Slide 22

The Information Retrieval Problem: The DCB Ranking Algorithm, 3

• Obtain L by multiplying K by its transpose:

L = K Kᵀ =
4 3 1 2 1 2
3 5 2 3 1 2
1 2 2 2 0 2
2 3 2 3 0 2
1 1 0 0 2 1
2 2 2 2 1 3

Slide 23

The Information Retrieval Problem: The DCB Ranking Algorithm, 4

• Obtain M by multiplying L by K:

M = K Kᵀ K =
12  8  9  4  7  7
15 12 11  6  6  8
 9  8  5  2  3  3
12 10  8  3  4  5
 3  2  2  3  4  2
11  9  6  3  6  4

Slide 24

The Information Retrieval Problem: The DCB Ranking Algorithm, 5

• Intuition: Look at how the top-left element of M is obtained, by multiplying the top row of L by the left-most column of K:

(4 3 1 2 1 2) · (1 1 1 1 0 1)ᵀ = 4 + 3 + 1 + 2 + 0 + 2 = 12
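A short numpy sketch (mine, not from the slides) that reproduces K from Slide 21 and checks the L and M of Slides 22 and 23, including the element just worked out:

```python
# Reproduce the DCB matrices with numpy and check the worked example.
import numpy as np

# K from Slide 21: rows are keywords, columns are documents.
K = np.array([
    [1, 0, 1, 0, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0],
])

L = K @ K.T      # Slide 22: co-occurrence counts between keywords
M = L @ K        # Slide 23: M = K Kᵀ K, the DCB association scores

print(L)
print(M)
assert M[0, 0] == 12   # the top-left element computed on Slide 24
```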

Slide 25

The Information Retrieval Problem: The DCB Ranking Algorithm, 6

• The DCB algorithm for document retrieval ranks the documents in an (ex ante) plausible manner.

• Does it actually produce good rankings?
– From our experience to date, yes

– Experimental studies are much needed

» Initial study on Laughlin photos is very encouraging (cf., Hoque et al. 1995)

• And there is more....

Slide 26

The Information Retrieval Problem: The DCB Ranking Algorithm, 7: Resource Location

• Thinking more generally, the DCB algorithm for document retrieval ranks the documents by a sort of similarity of association

• K is a matrix of primary links (keywords to documents)

• DCB measures overall association for these primary links

• K need not be just keywords to documents

• Think of K as generally indicating primary links (1s) between individual objects, e.g.
– people

– meetings

– issues

Slide 27

The Information Retrieval Problem: The DCB Ranking Algorithm, 8

• Think of K as generally indicating primary links (1s and 0s) between individual objects, e.g.
– people

– meetings

– issues

– museum artifacts

– keywords

• Then, K will (typically) be square

• But the DCB algorithm works the same way

• Interpretation of M is essentially the same

• L has a useful interpretation as well

• Now an example

Slide 28

The Information Retrieval Problem: The DCB Ranking Algorithm, 9

• A resource location (mini)example

• Interpretation of K
– K(1) a person

– K(2) a person

– K(3) an issue

– K(4) an issue

– K(5) a meeting

– K(6) a meeting

• The problem (Coast Guard):
– Given a particular letter of inquiry, which CG employees have useful knowledge for the question at hand?

• Idea:
– The question identifies an issue. Find the employees most closely associated with that issue.

Slide 29

The Information Retrieval Problem: The DCB Ranking Algorithm, 10

• A new K:

K =
1 0 0 0 1 0
0 1 0 0 0 1
0 0 1 0 1 0
0 0 0 1 0 1
1 0 1 0 1 0
0 1 0 1 0 1

Slide 30

The Information Retrieval Problem: The DCB Ranking Algorithm, 11

• A new L:

L = K Kᵀ =
2 0 1 0 2 0
0 2 0 1 0 2
1 0 2 0 2 0
0 1 0 2 0 2
2 0 2 0 3 0
0 2 0 2 0 3

Slide 31

The Information Retrieval Problem: The DCB Ranking Algorithm, 12

• And a new M (a ranking sketch follows):

M = K Kᵀ K =
4 0 3 0 5 0
0 4 0 3 0 5
3 0 4 0 5 0
0 3 0 4 0 5
5 0 5 0 7 0
0 5 0 5 0 7

[Slide annotation: labels mark a Person row and an Issue column of M]
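A hedged sketch (mine, not from the slides) of the resource-location idea: compute M from the new K, then rank the persons by their association score with a chosen issue, using the person/issue/meeting assignments of Slide 28:

```python
# Rank persons by association with an issue, reading scores off M.
import numpy as np

K = np.array([
    [1, 0, 0, 0, 1, 0],   # K(1) a person
    [0, 1, 0, 0, 0, 1],   # K(2) a person
    [0, 0, 1, 0, 1, 0],   # K(3) an issue
    [0, 0, 0, 1, 0, 1],   # K(4) an issue
    [1, 0, 1, 0, 1, 0],   # K(5) a meeting
    [0, 1, 0, 1, 0, 1],   # K(6) a meeting
])
M = K @ K.T @ K

persons, issue = [0, 1], 2   # 0-based: persons K(1), K(2); issue K(3)
scores = {f"K({p + 1})": int(M[p, issue]) for p in persons}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
# [('K(1)', 3), ('K(2)', 0)] -> K(1) is the employee to ask about issue K(3)
```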

Slide 32

The Information Retrieval Problem: The DCB Ranking Algorithm, 13

• DCB
– Ranked retrieval by similarity of association

– Works for documents

– Works for arbitrary associations between arbitrary objects

• Future work
– Much is required to explore DCB systematically

– Experiments

– Tweaks

– Comparison with alternatives

– &c.

Slide 33

Experimental IR Test: PIRS and Laughlin

• PIRS: Picture Indexing and Retrieval System

– Developed as part of the Coast Guard KSS project

• Ranking based on
– DCB applied to

– Text associated with pictures

• Clarence Laughlin archive
– The Historic New Orleans Collection

– Photographer and writer

• Test
– 390 photos plus Collection records on them

– Parse and feed to DCB algorithm

• Very promising results
– Given 3 ranked photos, 83% of subjects agreed with 2 or 3 of the implicit 3 pairwise rankings (50% expected: by chance, agreement on at least 2 of 3 pairwise comparisons has probability 4/8 = 50%)

Slide 34

The Information Retrieval Problem: Representative Commercial Products

• grep et al.

• Verity--Topic

• Excalibur

• Lotus Notes

• AppleSearch

• Web search engines
– Yahoo
– Lycos
– WebCrawler

– etc.

Slide 35

The Information Retrieval Problem: Commercial Products

• Ranking often is provided
– But usually in a very limited way

– AppleSearch: *****, ****,...,*

• Retrieval and ranking algorithms typically not disclosed
– Be wary

• Quality of retrieval not known
– And when known, not disclosed

• Some systems require extensive maintenance and “hand holding”

• Batch update of indexing is the rule

Slide 36

Useful References

• David Blair, “Search Exhaustivity and Data Base Size as a Framework for Text Retrieval Systems (or, All You Wanted to Know about Document Retrieval but Were Afraid to Ask)”

• David Blair (1984). “The Management of Information: Basic Distinctions”

• David Blair and M. E. Maron (1985). “An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System”

• David Blair and Steven O. Kimbrough (1994). “Exemplary Documents”

• Michael D. Gordon and Robert K. Lindsay (1994). “Toward Discovery Support Systems: A Replication, Re-examination, and Extension of Swanson’s Work on Literature Based Discovery of a Connection Between Raynaud’s and Fish Oil.”

• Abeer Y. Hoque et al. (1995). “Report on an Experiment on Picture Retrieval Using the DCB Algorithm, The PIRS Software and Pictures from the Clarence Laughlin Archives at The Historic New Orleans Collection.”

• Steven O. Kimbrough, Stephen E. Kirk and Jim R. Oliver (1995) “On Relevance and Two Aspects of the Organizational Memory Problem”

Slide 37

This page unintentionally left blank