1

A Text Filtering Method For Digital Libraries

Mustafa Zafer BOLAT

Hayri SEVER

2

Introduction

• Information filtering (IF)
  - Incoming relevant documents are routed to profiles (queries).

• Information retrieval (IR)
  - Provides a list of documents ordered by their similarity to the user query.

3

Introduction (continued)

• Linear separation: partitions relevant and non-relevant documents into distinct blocks.

• Optimal queries: rank all relevant documents ahead of non-relevant ones.

• Steepest Descent Algorithm (SDA)

4

Preliminaries

• An information retrieval system S can be defined as a 5-tuple:

• S = (T, D, Q, V, f)

  - T: set of ordered index terms
  - D: set of documents
  - Q: set of queries
  - V: set of real numbers
  - f: D × Q → V, the retrieval function

5

Preliminaries (continued)

• Vector space model
  - Raw text is transformed into more computationally useful forms.
  - Documents and queries are represented as vectors of weighted terms.

• d = (t1, wd1; t2, wd2; ...; tn, wdn), ti ∈ T ∩ d

• q = (q1, wq1; q2, wq2; ...; qm, wqm), qi ∈ T ∩ q

6

Preliminaries (continued)

• Rnorm value for effectiveness: it measures how relevant documents are distributed among non-relevant ones in the ranked output.

• Rank matters.
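The slide does not give the Rnorm formula. A minimal sketch, assuming the commonly used definition Rnorm = 0.5 * (1 + (S+ - S-) / S+max), where S+ counts relevant/non-relevant pairs ranked in the right order, S- counts pairs ranked in the wrong order, and S+max is the total number of such pairs. The function name `rnorm` and the toy scores are illustrative only.

```python
def rnorm(scores, relevant):
    """Rnorm under the assumed definition 0.5 * (1 + (S+ - S-) / S+max)."""
    rel = [s for s, r in zip(scores, relevant) if r]
    non = [s for s, r in zip(scores, relevant) if not r]
    s_max = len(rel) * len(non)  # every relevant/non-relevant pair
    if s_max == 0:
        raise ValueError("need at least one relevant and one non-relevant document")
    # Pairs where the relevant document scores higher (S+) or lower (S-).
    s_plus = sum(1 for a in rel for b in non if a > b)
    s_minus = sum(1 for a in rel for b in non if a < b)
    return 0.5 * (1 + (s_plus - s_minus) / s_max)


# Relevant documents ranked ahead of all non-relevant ones.
perfect = rnorm([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
# Relevant documents ranked behind all non-relevant ones.
reversed_ranking = rnorm([0.1, 0.2, 0.8, 0.9], [True, True, False, False])
```

Under this definition a value of 1 means every relevant document is ranked ahead of every non-relevant one, which is the condition an optimal query has to satisfy.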

7

Preliminaries (continued)

Contingency table

                          actual relevant   actual non-relevant
predicted relevant        a                 b
predicted non-relevant    c                 d

• Precision = a / (a + b)
• Recall = a / (a + c)
• Breakeven point: the point where precision and recall are equal
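As a small worked example of the two formulas, the sketch below computes precision and recall from the contingency-table cells a, b and c. The function name and the sample counts are made up for illustration.

```python
def precision_recall(a, b, c):
    """Precision = a / (a + b), recall = a / (a + c), from contingency cells."""
    precision = a / (a + b) if (a + b) else 0.0  # nothing predicted relevant
    recall = a / (a + c) if (a + c) else 0.0     # nothing actually relevant
    return precision, recall


# Hypothetical counts: 8 correct hits, 2 false alarms, 8 misses.
p, r = precision_recall(a=8, b=2, c=8)
```

With these counts precision is 8/10 and recall is 8/16, so the classifier is not yet at its breakeven point.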

8

Overview of experiment

[Pipeline diagram. Components: Reuters-21578 data set (train, test); category labels; preprocessing (parsing, removing stop words, stemming, transform to vectors, reducing, normalizing); training with SDA; optimal query; effectiveness measures.]

9

Overview of experiment

Reuters-21578 data set

• Consists of 21,578 economic news stories that originally appeared on the Reuters newswire in 1987.
• Each story has been manually assigned one or more indexing labels from a fixed list.
• There are 135 TOPIC labels for classification.
• To use the text corpus for machine learning research, it is split into sets of training and testing examples.

10

Overview of experiment

Preprocessing

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"

OLDID="9944" NEWID="5031"><DATE>13-MAR-1987 15:45:35.38</DATE>

<TOPICS><D>livestock</D><D>carcass</D></TOPICS><PLACES><D>usa</D></PLACES>

<PEOPLE></PEOPLE><ORGS><D>ec</D></ORGS>

<EXCHANGES></EXCHANGES><COMPANIES></COMPANIES>

<TEXT>&#2;<TITLE>U.S. MEAT GROUP TO FILE TRADE COMPLAINTS</TITLE>

<DATELINE> WASHINGTON, March 13 - </DATELINE><BODY>The American Meat Institute, AME, said it intended to ask the U.S.

government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products.

Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition

under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require

U.S. meat processing plants to comply fully with EC standards.

Reuter&#3;</BODY></TEXT>

</REUTERS>

Sample Reuters-21578 document

11

Overview of experiment

Preprocessing: Parsing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: U.S. MEAT GROUP TO FILE TRADE COMPLAINTS
The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

12

Overview of experiment

Preprocessing: After parsing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: U S MEAT GROUP TO FILE TRADE COMPLAINTS
The American Meat Institute AME said it intended to ask the U S government to retaliate against a European Community meat inspection requirement AME President C Manly Molpus also said the industry would file a petition challenging Korea's ban of U S meat products Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section of the General Agreement on Tariffs and Trade against an EC directive that effective April will require U S meat processing plants to comply fully with EC standards
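Comparing the text before and after parsing, punctuation and digits disappear while apostrophes inside words ("Korea's") survive. A minimal sketch of that clean-up step, assuming a simple regular expression is enough; the function name is hypothetical and the authors' actual parser is not shown on the slides.

```python
import re


def strip_punctuation_and_digits(text):
    """Replace everything except letters, apostrophes and whitespace by a space."""
    cleaned = re.sub(r"[^A-Za-z'\s]", " ", text)
    # Collapse the runs of spaces left behind by the substitution.
    return " ".join(cleaned.split())


sample = "petition under Section 301 challenging Korea's ban of U.S. meat, effective April 30."
parsed = strip_punctuation_and_digits(sample)
```

On this sample the call yields "petition under Section challenging Korea's ban of U S meat effective April", matching the pattern visible in the parsed story.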

13

Overview of experiment

Preprocessing: Removing stop words

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: U.S. MEAT GROUP FILE TRADE COMPLAINTS
The American Meat Institute, AME, said it intended to ask the U.S. government to retaliate against a European Community meat inspection requirement. AME President C. Manly Molpus also said the industry would file a petition challenging Korea's ban of U.S. meat products. Molpus told a Senate Agriculture subcommittee that AME and other livestock and farm groups intended to file a petition under Section 301 of the General Agreement on Tariffs and Trade against an EC directive that, effective April 30, will require U.S. meat processing plants to comply fully with EC standards

14

Overview of experiment

Preprocessing: After removing stop words

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: . MEAT GROUP FILE TRADE COMPLAINTS
American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challenging Korea's ban U.S. meat products Molpus Senate Agriculture subcommittee AME livestock farm groups intended file petition Section General Agreement Tariffs Trade EC directive effective April require meat processing plants comply fully EC standards
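Stop-word removal can be sketched as a filter against a fixed word list. The list below is a tiny hypothetical sample chosen to cover the example sentence; the stop list actually used in the experiment is not given on the slides.

```python
# Hypothetical, deliberately small stop list (assumption, not the authors' list).
STOP_WORDS = {
    "a", "an", "the", "of", "to", "and", "that", "it", "said",
    "also", "would", "will", "against", "under", "with",
}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [w for w in text.split() if w.lower() not in stop_words]


kept = remove_stop_words("The American Meat Institute AME said it intended to ask the")
```

Matching on the lowercase form means "The" and "the" are both dropped while the surviving tokens keep their original case.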

15

Overview of experiment

Preprocessing: Stemming

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

Body: MEAT GROUP FILE TRADE COMPLAINT
American Meat Institute AME intended ask government retaliate European Community meat inspection requirement. AME President Manly Molpus industry file petition challeng Korea ban meat product Molpus Senate Agriculture subcommittee AME livestock farm group intended file petition Section General Agreement Tariff Trade EC direct effect April require meat process plant compli fulli EC standard

16

Overview of experiment

Preprocessing: Transform to vectors

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

meat       5
group      1
...        ...
Molpus     1
...        ...
...        ...
standard   1
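The vector for a document can be sketched as a term-to-frequency mapping. The example assumes case is folded before counting (the stemmed story contains "MEAT", "Meat" and "meat", and the slide reports a single count of 5); the function name is illustrative.

```python
from collections import Counter


def term_frequencies(tokens):
    """Count how often each term occurs, ignoring case (an assumption)."""
    return Counter(token.lower() for token in tokens)


tokens = "MEAT GROUP FILE TRADE COMPLAINT American Meat Institute meat inspection".split()
tf = term_frequencies(tokens)
```

A `Counter` returns 0 for terms that do not occur, which is convenient when the vector is later compared against a dictionary of terms.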

17

Overview of experiment

Preprocessing: Create dictionary (only in training)

approv     1236
chairman   1225
...        ...
...        ...
...        ...
...        ...
ptd        5

18

Overview of experiment

Preprocessing: Reducing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

...        ...
group      1
meat       5
Molpus     ...
...        ...
standard   1
...        ...

19

Overview of experiment

Preprocessing: After reducing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

...        ...
group      1
meat       5
...        ...
standard   1
...        ...

20

Overview of experiment

Preprocessing: Normalizing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

...        ...
group      1
meat       5
...        ...
standard   1
...        ...

$w_k = t_k \times \log(N_D / n_k)$

  $t_k$: term frequency
  $N_D$: number of documents in the collection
  $n_k$: number of documents containing term k

$w'_k = w_k / \sqrt{\sum_j w_j^2}$

  $w'_k$: normalized weight of term k
  $w_k$: unnormalized weight of term k
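The weighting and normalization formulas can be sketched as follows. The natural logarithm is an assumption (the slide does not state the base), and the function name and the sample document frequencies are invented for illustration.

```python
import math


def tfidf_normalized(term_freqs, doc_freqs, n_docs):
    """Weight each term by tf * log(N_D / n_k), then scale to unit length."""
    raw = {
        term: tf * math.log(n_docs / doc_freqs[term])
        for term, tf in term_freqs.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        # Every term occurs in every document, so all weights are zero.
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}


# Hypothetical document frequencies in a 1000-document collection.
weights = tfidf_normalized({"meat": 5, "group": 1}, {"meat": 10, "group": 100}, 1000)
```

After normalization the squared weights of a document sum to 1, so long and short stories become comparable.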

21

Overview of experiment

Preprocessing: After normalizing

HAS TOPICS=YES
LEWISSPLIT=TRAIN
TOPICS: livestock, carcass

...        ...
group      0.127
meat       0.278
...        ...
standard   0.012
...        ...

22

Overview of experiment

Training with SDA

1. Choose a starting query vector $Q_0$; let k = 0.

2. Let $Q_k$ be the query vector at the start of the (k+1)th iteration; identify the following set of difference vectors:
   $\Gamma(Q_k) = \{\, b = d - d' : d \succ d' \text{ and } f(Q_k, b) \le 0 \,\}$;
   if $\Gamma(Q_k) = \emptyset$, $Q_{opt} = Q_k$ is a solution; exit. Otherwise,

3. Let $Q_{k+1} = Q_k + \sum_{b \in \Gamma(Q_k)} b$.

4. Let k = k + 1; go back to Step 2.
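The four steps can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the retrieval function f is taken to be the inner product, every relevant document is taken to be preferred to every non-relevant one, and the names `sda`, `dot` and the default of 150 iterations are choices made for this example.

```python
def dot(u, v):
    """Inner product, used here as the retrieval function f (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def sda(relevant, non_relevant, q0, max_iter=150):
    """Return (query, converged) after running the update steps."""
    q = list(q0)
    # Difference vectors b = d - d' for each preferred pair (d relevant, d' not).
    diffs = [
        [x - y for x, y in zip(d, d_prime)]
        for d in relevant
        for d_prime in non_relevant
    ]
    for _ in range(max_iter):
        # Step 2: pairs the current query fails to order correctly.
        gamma = [b for b in diffs if dot(q, b) <= 0]
        if not gamma:
            return q, True
        # Step 3: add the sum of the violating difference vectors.
        for b in gamma:
            q = [qi + bi for qi, bi in zip(q, b)]
    return q, not any(dot(q, b) <= 0 for b in diffs)


relevant = [[1.0, 0.0], [0.9, 0.1]]
non_relevant = [[0.0, 1.0], [0.1, 0.9]]
q_opt, converged = sda(relevant, non_relevant, q0=[0.0, 0.0])
```

Enumerating all pairs costs time proportional to the number of relevant documents times the number of non-relevant ones per iteration, which is acceptable for a sketch but would need care on a full collection.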

23

Overview of experiment

Training with SDA

• All examples of the category are used as positive examples.
• A random 60% of the examples from other topics are used as negative examples.
• If the maximum Rnorm value (1) is not reached within the maximum of 150 iterations, the optimal query is set to the query that produced the highest Rnorm value obtained.

24

Overview of experiment

Category labels

There are 135 categories.

Topic      # of + (train)   # of + (test)
earn       2877             1087
acq        1650             719
money-fx   538              179
grain      433              149
crude      389              189
trade      369              118
interest   347              131
wheat      212              71
ship       197              89
corn       182              56

25

Overview of experiment

Effectiveness measures

• Create contingency tables
• Find breakeven points
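One simple way to locate a breakeven point, sketched here as an assumption since the slides do not say how it was computed: rank the test documents by score and cut the ranking at the number of relevant documents. At that cutoff the number retrieved equals the number relevant, so precision and recall coincide by construction. The function name and toy data are illustrative.

```python
def breakeven_point(scores, relevant):
    """Precision (= recall) when as many documents are retrieved as are relevant."""
    n_rel = sum(1 for r in relevant if r)
    if n_rel == 0:
        raise ValueError("no relevant documents")
    # Rank documents by decreasing score and keep the top n_rel.
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, r in ranked[:n_rel] if r)
    return hits / n_rel


# Four toy documents, two of them relevant.
bep = breakeven_point([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```

Here the top two documents contain one relevant story, so both precision and recall are 1/2 at the cutoff.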

26

Results

Breakeven points

Topic         Findsim   Nbayes   SDA     Bnets   Trees   SVM
earn          92.9      95.9     96.32   95.8    97.8    98.0
acq           64.7      87.8     85.26   88.3    89.7    93.6
money-fx      46.7      56.6     68.72   58.8    66.2    74.5
grain         67.5      78.8     71.81   81.4    85.0    94.6
crude         70.1      79.5     82.54   79.6    85.0    88.9
trade         65.1      63.5     65.25   69.0    72.5    75.9
interest      63.4      64.9     61.07   71.3    67.1    77.7
wheat         68.9      69.7     76.06   82.7    92.5    91.9
ship          49.2      85.4     65.17   84.4    74.2    85.6
corn          48.2      65.3     75.00   76.4    91.8    90.3
Avg. Top 10   64.6      81.5     84.54   85.0    88.4    92.0
Avg. All      61.7      75.2     76.37   80.0    N/A     87.0

27

Thank you!