
Level Search Filtering for IR Model Reduction

Michael W. Berry, Xiaoyan (Kathy) Zhang, Padma Raghavan
Department of Computer Science, University of Tennessee

IMA Hot Topics Workshop: Text Mining, Apr 17, 2000


Computational Models for IR

1. Need a framework for designing concept-based IR models.
2. Can we draw upon the backgrounds and experiences of computer scientists and mathematicians?
3. Effective indexing should address issues of scale and accuracy.


The Vector Space Model

Represent terms and documents as vectors in k-dimensional space

Similarity computed by measures such as cosine or Euclidean distance

Early prototype: the SMART system, developed by Salton et al. in the 1970s and '80s
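As a concrete illustration (not part of the original slides), a minimal sketch of cosine scoring in the vector space model; the tiny matrix and query below are made up:

    import numpy as np

    # Toy term-by-document matrix: rows are terms, columns are documents;
    # entries are (weighted) term frequencies. All values are illustrative.
    A = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [3.0, 1.0, 0.0]])

    q = np.array([1.0, 0.0, 1.0])   # query in the same term space

    # Cosine similarity between the query and each document (column).
    sims = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
    ranking = np.argsort(-sims)     # documents, most similar first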


Motivation for LSI

Two fundamental query-matching problems:

Synonymy (image, likeness, portrait, facsimile, icon)

Polysemy (Adam's apple, patient's discharge, culture)


Motivation for LSI

Approach: treat word-to-document association data as an unreliable estimate of a larger set of applicable words.

Goal: cluster similar documents, which may share no terms, in a low-dimensional subspace (improve recall).


LSI Approach

Preprocessing: compute a low-rank approximation to the original (sparse) term-by-document matrix.

Vector space model: encode terms and documents using factors derived from the SVD (or ULV, SDD).

Postprocessing: rank the similarity of terms and documents to the query via Euclidean distances or cosines.


SVD Encoding

A_k is the best rank-k approximation to the term-by-document matrix A:

    A_k = U_k Σ_k V_k^T

(Figure: the rows of U_k are the term vectors and the rows of V_k are the document vectors of the terms-by-documents decomposition.)
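A hedged sketch of the rank-k encoding using SciPy's sparse SVD; the matrix dimensions, density, query term indices, and k below are illustrative placeholders, and the query is folded into the factor space via the standard projection q_hat = q^T U_k Σ_k^{-1}:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Illustrative sparse term-by-document matrix (terms x docs).
    A = sparse_random(500, 100, density=0.05, format='csc', random_state=0)

    k = 20                       # number of retained factors
    Uk, sk, VkT = svds(A, k=k)   # A_k = Uk @ diag(sk) @ VkT

    # Rows of V_k are the document vectors in the k-dimensional space.
    doc_vecs = VkT.T             # shape (docs, k)

    # Project a query (vector in term space) into the same space.
    q = np.zeros(500)
    q[[3, 42, 97]] = 1.0         # toy query terms
    q_hat = (q @ Uk) / sk        # q^T U_k inv(Sigma_k)

    # Rank documents by cosine similarity in the reduced space.
    sims = (doc_vecs @ q_hat) / (np.linalg.norm(doc_vecs, axis=1)
                                 * np.linalg.norm(q_hat) + 1e-12)
    top_docs = np.argsort(-sims)[:10]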


Vector Space Dimension

Want the minimum number of factors (k) that discriminates most concepts.

In practice, k ranges between 100 and 300, but it could be much larger.

Choosing the optimal k for different collections is challenging.


Strengths of LSI

Completely automatic: no stemming required; tolerates misspellings.

Multilanguage search capability: Landauer (Colorado), Littman (Duke).

Conceptual IR capability (recall): retrieve relevant documents that do not contain any of the search terms.


Changing the LSI Model

Updating:
Folding-in new terms or documents [Deerwester et al. '90]
SVD-updating [O'Brien '94], [Simon & Zha '97]

Downdating:
Modify the SVD w.r.t. term or document deletions [Berry & Witter '98]
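Continuing the SVD sketch above (reusing Uk, sk, and VkT as computed there), a sketch of folding-in a new document in the style of Deerwester et al.; the toy document vector is an assumption:

    import numpy as np

    # A new document d, given as a raw vector in term space, can be
    # folded into the existing k-dimensional space without recomputing
    # the SVD:  d_hat = d^T U_k inv(Sigma_k)
    d = np.zeros(500)
    d[[3, 8, 42]] = 1.0          # toy new document
    d_hat = (d @ Uk) / sk

    # Append it to the document vectors. Folding-in leaves Uk and sk
    # untouched, so representation quality degrades as many items are
    # folded in -- the motivation for true SVD-updating.
    doc_vecs = np.vstack([VkT.T, d_hat])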


Recent LSI-based Research

Implementation of kd-trees to reduce query-matching complexity (Hughey & Berry '00, Information Retrieval)

Unsupervised learning model for mining electronic commerce data (J. Jiang et al. '99, IDA)


Recent LSI-based Research

Nonlinear SVD approach for constraint-based feedback (E. Jiang & Berry '00, Linear Algebra and Its Applications)

Future incorporation of up- and downdating into LSI-based client/servers


Information Filtering

Concept: reduce a large document collection to a reasonably sized set of potentially retrievable documents.

Goal: produce a relatively small subset containing a high proportion of relevant documents.


Approach: Level Search

Reduce the sparse SVD computation cost by selecting a small submatrix of the original term-by-document matrix.

Use an undirected graph model:
Term or document: a vertex.
Term weight: an edge weight.
Term in a document (or document containing a term): an edge in the graph.
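A minimal sketch of the level-search expansion as I read it from these slides: a breadth-first sweep over the term-document bipartite graph, alternating document and term levels outward from the query terms, matching the four-level diagram below. The function name, level cutoff, and usage lines are assumptions for illustration, not the authors' code:

    from scipy.sparse import csr_matrix, csc_matrix

    def level_search(A, query_terms, max_level=4):
        """Breadth-first expansion over the term-document bipartite graph.
        Level 1 holds the query terms; even levels add documents that
        contain the current terms; odd levels add terms appearing in the
        current documents."""
        A_csr = csr_matrix(A)   # fast access to a term's documents (row)
        A_csc = csc_matrix(A)   # fast access to a document's terms (column)
        terms, docs = set(query_terms), set()
        frontier, expand_terms = set(query_terms), True
        for _ in range(max_level - 1):          # level 1 is the query itself
            nxt = set()
            if expand_terms:                    # terms -> containing documents
                for t in frontier:
                    nxt.update(A_csr.getrow(t).indices)
                nxt -= docs
                docs |= nxt
            else:                               # documents -> their terms
                for d in frontier:
                    nxt.update(A_csc.getcol(d).indices)
                nxt -= terms
                terms |= nxt
            frontier, expand_terms = nxt, not expand_terms
        return sorted(terms), sorted(docs)

    # rows, cols = level_search(A, query_terms=[3, 42, 97])
    # A_sub = csr_matrix(A)[rows, :][:, cols]   # submatrix fed to sparse SVD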


Level Search

(Figure: breadth-first expansion from the query. Level 1: the query terms; Level 2: documents containing those terms; Level 3: new terms appearing in those documents; Level 4: further documents containing the new terms.)


Evaluation Measures

Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents.

Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved.
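As a quick illustration (the function and the document-ID sets are assumed), the two measures in code:

    def recall_precision(retrieved, relevant):
        """retrieved, relevant: sets of document IDs."""
        hits = len(set(retrieved) & set(relevant))
        return hits / len(relevant), hits / len(retrieved)

    # Example: 4 documents retrieved, 3 of them relevant, out of 6
    # relevant documents overall -> recall 0.5, precision 0.75.
    # recall_precision({1, 2, 3, 9}, {1, 2, 3, 4, 5, 6})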


Test Collections

Collection   Docs    Terms    Nonzeros
MEDLINE      1,033   5,831    52,009
TIME         425     10,804   68,240
CISI         1,469   5,609    83,602
FBIS         4,974   42,500   1,573,306


Avg Recall & Submatrix Sizes for LS

Collection   Avg Recall   %Docs   %Terms   %Nonzeros
MEDLINE      85.7         24.8    63.2     27.8
TIME         69.4         15.3    61.9     22.7
CISI         55.1         21.4    64.1     25.2
FBIS         82.1         28.5    55.0     52.9
Mean         67.8         18.2    53.4     27.0


Results for MEDLINE

(Figure: precision vs. recall for LSI only and for Level Search plus LSI; 5,831 terms, 1,033 docs.)


Results for CISI

(Figure: precision vs. recall for LSI only and for Level Search plus LSI; 5,609 terms, 1,469 docs.)


Results for TIME

(Figure: precision vs. recall for LSI only and for Level Search plus LSI; 10,804 terms, 425 docs.)


Results for FBIS (TREC-5)

(Figure: precision vs. recall for LSI only and for Level Search plus LSI; 42,500 terms, 4,974 docs.)


Level Search with Pruning

(Figure: the same four-level search graph, with singleton terms deleted at the final term level.)

Prune terms to further reduce the submatrix while maintaining recall; the document set is unaffected.
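A sketch of the singleton-term pruning step under the same assumptions as the level-search sketch above; only the one-document threshold comes from the slides:

    import numpy as np
    from scipy.sparse import csr_matrix

    def prune_singleton_terms(A_sub):
        """Drop terms (rows) appearing in only one document of the
        level-search submatrix; the document set is untouched."""
        A_sub = csr_matrix(A_sub)
        docs_per_term = np.diff(A_sub.indptr)     # nonzeros per row
        keep = np.flatnonzero(docs_per_term > 1)  # terms in >= 2 docs
        return A_sub[keep, :], keep               # pruned matrix, kept term ids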


Effects of Pruning

(Figure: LSI input matrix density, as percentage of nonzeros, for MEDLINE, CISI, TIME, FBIS, and LATIMES, comparing LSI alone, LSI after level-search filtering (LSI&L), and LSI after level search plus pruning (LSI&LP). LATIMES: 17,903 terms, 1,086 docs (TREC-5).)


Effects of Pruning

(Figure: LSI average precision (%) for MEDLINE, CISI, TIME, FBIS, and LATIMES, with and without level search (L) and pruning (P): LSI, LSI&L, LSI&LP. Roughly 230 terms/doc and 29 terms/query.)


Impact

Level Search is a simple, cost-effective filtering method for LSI that supports scalable IR.

It can reduce the effective term-by-document matrix size by 75% with no significant loss of LSI precision (less than 5%).


Some Future Challenges for LSI

Agent-based software for indexing remote/distributed collections
Effective updating with global weighting
Incorporating phrases and proximity
Expanding cosine matching to incorporate other similarity-based data (e.g., images)
Determining the optimal number of dimensions


LSI Web Site

Investigators, Papers, Demos, Software

http://www.cs.utk.edu/~lsi


SIAM Book (June '99)

Document File Preparation
Vector Space Models
Matrix Decompositions
Query Management
Ranking & Relevance Feedback
User Interfaces
A Course Project
Further Reading


CIR00 Workshop
http://www.cs.utk.edu/cir00
October 22, 2000, Raleigh, NC

Invited Speakers: I. Dhillon (Texas), C. Ding (NERSC), K. Gallivan (FSU), D. Martin (UTK), H. Park (Minnesota), B. Pottenger (Lehigh), P. Raghavan (UTK), J. Wu (Boeing)