
Page 1: Text Similarity Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu

Text Similarity

Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu

Page 2:

Word Length | Twain Sample 1 | Twain Sample 2 | Twain Sample 3 | Twain Sample 4 | Twain Sample 5 | Snodgrass

1 74 312 116 138 122 424

2 349 1146 496 532 466 2685

3 456 1394 673 741 653 2752

4 374 1177 565 591 517 2302

5 212 661 381 357 343 1431

6 127 442 249 258 207 992

7 107 367 185 215 152 896

8 84 231 125 150 103 638

9 45 181 94 83 92 465

10 27 109 51 55 45 276

11 13 50 23 30 18 152

12 8 24 8 10 12 101

13+ 9 12 8 9 9 61

[Two charts: word-length frequency distributions for the samples above, one in raw counts and one normalized to proportions, over word lengths 1–13.]

Page 3:


Page 4:

Information Retrieval

• Task Statement:

Build a system that retrieves documents that users are likely to find relevant to their queries.

• This assumption underlies the field of Information Retrieval.

Page 5:

Information need

Index

Pre-process

Parse

Collections

Rank

Query

text input

How is the query constructed? How is the text processed?

Evaluate

Page 6:

Terminology

Token: A natural-language word: “Swim”, “Simpson”, “92513”, etc.

Document: Usually a web page, but more generally any file.

Page 7:

Some IR History

– Roots in the scientific “Information Explosion” following WWII

– Interest in computer-based IR from the mid-1950s
• H.P. Luhn at IBM (1958)

• Probabilistic models at Rand (Maron & Kuhns) (1960)

• Boolean system development at Lockheed (‘60s)

• Vector Space Model (Salton at Cornell 1965)

• Statistical Weighting methods and theoretical advances (‘70s)

• Refinements and Advances in application (‘80s)
• User Interfaces, Large-scale testing and application (‘90s)

Page 8:

Relevance

• In what ways can a document be relevant to a query?
– Answer a precise question precisely.
– Who is Homer’s boss? Montgomery Burns.
– Partially answer a question.
– Where does Homer work? The Power Plant.
– Suggest a source for more information.
– What is Bart’s middle name? Look in Issue 234 of the Fanzine.
– Give background information.
– Remind the user of other knowledge.
– Others ...

Page 9:

Information need

Index

Pre-process

Parse

Collections

Rank

Query

text input

How is the query constructed? How is the text processed?

Evaluate

The section that follows is about

Content Analysis (transforming raw text into a computationally more manageable form)

Page 10:

Stemming and Morphological Analysis

• Goal: “normalize” similar words

• Morphology (“form” of words)
– Inflectional Morphology
• E.g., inflected verb endings and noun number
• Never changes grammatical class
– dog, dogs
– Bike, Biking
– Swim, Swimmer, Swimming

What about… build, building?

Page 11:

Examples of Stemming (using Porter’s algorithm)

Original words: consign, consigned, consigning, consignment, consist, consisted, consistency, consistent, consistently, consisting, consists

Stemmed words: consign, consign, consign, consign, consist, consist, consist, consist, consist, consist, consist

Porter’s algorithm is available in Java, C, Lisp, Perl, Python, etc. from

http://www.tartarus.org/~martin/PorterStemmer/
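To make the idea concrete, here is a toy suffix-stripping stemmer in Python. It is only a sketch of the general technique, far cruder than Porter’s actual rule cascade; the suffix list and the three-letter minimum stem are illustrative assumptions, not part of Porter’s algorithm.

```python
def crude_stem(word):
    # Strip one common suffix, keeping at least a 3-letter stem.
    # A toy illustration of suffix stripping, NOT the real Porter algorithm.
    for suffix in ("ment", "ness", "ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("consigned", "consisting", "consignment", "dogs"):
    print(w, "->", crude_stem(w))  # consign, consist, consign, dog
```

Note that, like Porter, even this toy version conflates forms without a dictionary, which is exactly what produces the kinds of errors shown on the next slide.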

Page 12:

Errors Generated by Porter Stemmer (Krovetz 93)

Too aggressive:
organization / organ
policy / police
execute / executive
arm / army

Too timid:
european / europe
cylinder / cylindrical
create / creation
search / searcher

Homework!! Play with the following URL:
http://fusion.scs.carleton.ca/~dquesnel/java/stuff/PorterApplet.html

Page 13:

Statistical Properties of Text

• Token occurrences in text are not uniformly distributed

• They are also not normally distributed

• They do exhibit a Zipf distribution

Page 14:

8164 the, 4771 of, 4005 to, 2834 a, 2827 and, 2802 in, 1592 The, 1370 for, 1326 is, 1324 s, 1194 that, 973 by,
969 on, 915 FT, 883 Mr, 860 was, 855 be, 849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE, 798 PAGE, 798 HEADLINE, 798 DOCNO,
1 ABC, 1 ABFT, 1 ABOUT, 1 ACFT, 1 ACI, 1 ACQUI, 1 ACQUISITIONS, 1 ACSIS, 1 ADFT, 1 ADVISERS, 1 AE

Government documents, 157734 tokens, 32259 unique

Page 15:

Plotting Word Frequency by Rank

• Main idea: count how many times each token occurs in the text, over all texts in the collection
• Now order the tokens by how often they occur. A token’s position in this ordering is called its rank.

Page 16:

The Corresponding Zipf Curve

Rank Freq Term
1    37   system
2    32   knowledg
3    24   base
4    20   problem
5    18   abstract
6    15   model
7    15   languag
8    15   implem
9    13   reason
10   13   inform
11   11   expert
12   11   analysi
13   10   rule
14   10   program
15   10   oper
16   10   evalu
17   10   comput
18   10   case
19   9    gener
20   9    form

Page 17:

Zipf Distribution

• The Important Points:
– a few elements occur very frequently
– a medium number of elements have medium frequency
– many elements occur very infrequently

Page 18:

Zipf Distribution
• The product of the frequency of words (f) and their rank (r) is approximately constant
– Rank = order of words’ frequency of occurrence

f ≈ C / r
C ≈ N / 10 (N = number of tokens in the collection)

• Another way to state this is with an approximately correct rule of thumb:
– Say the most common term occurs C times
– The second most common occurs C/2 times
– The third most common occurs C/3 times
– …
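The rule of thumb can be sketched in a few lines of Python; the count C below is a made-up number for illustration, not taken from any corpus in these slides.

```python
# Zipf's rule of thumb: if the most common term occurs C times,
# the r-th most common term occurs roughly C / r times.
C = 1200  # hypothetical frequency of the most common term
expected = [C // r for r in (1, 2, 3, 4)]
print(expected)  # [1200, 600, 400, 300]
```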

Page 19:

Zipf Distribution (linear and log scale)

Illustration by Jacob Nielsen

Page 20:

What Kinds of Data Exhibit a Zipf Distribution?

• Words in a text collection
– Virtually any language usage
• Library book checkout patterns
• Incoming web page requests
• Outgoing web page requests
• Document sizes on the Web
• City sizes
• …

Page 21:

Consequences of Zipf

• There are always a few very frequent tokens that are not good discriminators.
– Called “stop words” in IR
• English examples: to, from, on, and, the, ...
• There are always a large number of tokens that occur once and can mess up algorithms.
• Medium-frequency words are the most descriptive.
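A minimal sketch of stop-word removal in Python; the stop list and the sentence are invented for illustration, not a standard list.

```python
# Drop the very frequent tokens that carry little discriminating power.
stop_words = {"to", "from", "on", "and", "the", "a", "of", "is"}
tokens = "the cat sat on the mat and the dog ran to the gate".split()
content = [t for t in tokens if t not in stop_words]
print(content)  # ['cat', 'sat', 'mat', 'dog', 'ran', 'gate']
```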

Page 22:

Word Frequency vs. Resolving Power (from van Rijsbergen 79)

The most frequent words are not the most descriptive.

Page 23:

Statistical Independence

Two events x and y are statistically independent if the product of their probabilities of happening individually equals their probability of happening together:

P(x) P(y) = P(x, y)

Page 24:

Lexical Associations
• Subjects write the first word that comes to mind
– doctor/nurse; black/white (Palermo & Jenkins 64)

• Text corpora yield similar associations
• One measure: Mutual Information (Church and Hanks 89)

I(x, y) = log2 [ P(x, y) / ( P(x) P(y) ) ]

• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
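The measure can be computed directly from corpus counts. The sketch below plugs in the Doctors/Nurses row from the AP-corpus table a few slides ahead (f(x,y)=30, f(x)=1105, f(y)=241, N=15 million) and recovers the reported score of about 10.7.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    # I(x,y) = log2( P(x,y) / (P(x) * P(y)) ), with probabilities
    # estimated as raw counts over the collection size n.
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

score = mutual_information(30, 1105, 241, 15_000_000)
print(round(score, 1))  # 10.7
```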

Page 25:

Statistical Independence
• Compute for a window of words

P(x) ≈ f(x) / N

P(x) P(y) = P(x, y) if independent

We’ll approximate P(x, y) as follows:

P(x, y) ≈ (1/N) Σ_{i=1..N} w_i(x, y)

where:
|w| = length of window (say 5)
w_i = words within the window starting at position i
w_i(x, y) = number of times x and y co-occur in window w_i
N = number of words in the collection

w_1 … w_11 … w_21
a b c d e f g h i j k l m n o p

Page 26:

Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

I(x,y)  f(x,y)  f(x)  x         f(y)  y
11.3    12      111   Honorary  621   Doctor
11.3    8       1105  Doctors   44    Dentists
10.7    30      1105  Doctors   241   Nurses
9.4     8       1105  Doctors   154   Treating
9.0     6       275   Examined  621   Doctor
8.9     11      1105  Doctors   317   Treat
8.7     25      621   Doctor    1407  Bills

Page 27:

Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

I(x,y)  f(x,y)  f(x)    x       f(y)   y
0.96    6       621     doctor  73785  with
0.95    41      284690  a       1105   doctors
0.93    12      84716   is      1105   doctors

These associations were likely to happen because the non-doctor words shown here are very common, and therefore likely to co-occur with any noun.

Page 28:

Associations Are Important Because…

• We may be able to discover phrases that should be treated as a word, e.g. “data mining”.

• We may be able to automatically discover synonyms, e.g. “Bike” and “Bicycle”.

Page 29:

Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
– Word frequencies have a Zipf distribution
– Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
– Pre-processing includes tokenization, stemming, collocations/phrases

Page 30:
Page 31:

Information need

Index

Pre-process

Parse

Collections

Rank

Query

text input

How is the index constructed?

The section that follows is about

Index Construction

Evaluate

Page 32:

Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
– Invert documents into a big index

• Basic steps:
– Make a “dictionary” of all the tokens in the collection

– For each token, list all the docs it occurs in.

– Do a few things to reduce redundancy in the data structure

Page 33:

Inverted Indexes

We have seen “vector files” conceptually. An inverted file is a vector file “inverted” so that rows become columns and columns become rows:

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0

Page 34:

How Are Inverted Files Created

• Documents are parsed to extract tokens. These are saved with the Document ID.

Now is the time for all good men
to come to the aid of their country

Doc 1

It was a dark and stormy night in
the country manor. The time was past midnight

Doc 2

Term : Doc #
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2

Page 35:

How Inverted Files are Created

• After all documents have been parsed the inverted file is sorted alphabetically.

Term : Doc # (sorted alphabetically)
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

Page 36:

How Inverted Files are Created

• Multiple term entries for a single document are merged.

• Within-document term frequency information is compiled.

Term : Doc # : Freq
a 2 1, aid 1 1, all 1 1, and 2 1, come 1 1, country 1 1, country 2 1, dark 2 1, for 1 1, good 1 1, in 2 1, is 1 1, it 2 1, manor 2 1, men 1 1, midnight 2 1, night 2 1, now 1 1, of 1 1, past 2 1, stormy 2 1, the 1 2, the 2 2, their 1 1, time 1 1, time 2 1, to 1 2, was 2 2

Page 37:

How Inverted Files are Created

• Then the file can be split into
– A Dictionary file, and
– A Postings file

Page 38:

How Inverted Files are Created

Dictionary (Term : N docs : Tot Freq)
a 1 1, aid 1 1, all 1 1, and 1 1, come 1 1, country 2 2, dark 1 1, for 1 1, good 1 1, in 1 1, is 1 1, it 1 1, manor 1 1, men 1 1, midnight 1 1, night 1 1, now 1 1, of 1 1, past 1 1, stormy 1 1, the 2 4, their 1 1, time 2 2, to 1 2, was 1 2

Postings (Doc # : Freq)
2 1, 1 1, 1 1, 2 1, 1 1, 1 1, 2 1, 2 1, 1 1, 1 1, 2 1, 1 1, 2 1, 2 1, 1 1, 2 1, 2 1, 1 1, 1 1, 2 1, 2 1, 1 2, 2 2, 1 1, 1 1, 2 1, 1 2, 2 2

Page 39:

Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)

• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2

• Also used for statistical ranking algorithms

Page 40:

How Inverted Files are Used

Query on “time” AND “dark”

2 docs with “time” in the dictionary -> IDs 1 and 2 from the postings file

1 doc with “dark” in the dictionary -> ID 2 from the postings file

Therefore, only doc 2 satisfies the query.

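The query walkthrough above amounts to intersecting postings lists. A minimal Python sketch, using a hand-built fragment of the example index:

```python
index = {
    "time": {1: 1, 2: 1},   # term -> {doc_id: freq}, as built earlier
    "dark": {2: 1},
    "country": {1: 1, 2: 1},
}

def boolean_and(term1, term2, index):
    # AND = intersection of the doc-id sets of the two postings lists
    return sorted(set(index.get(term1, {})) & set(index.get(term2, {})))

print(boolean_and("time", "dark", index))     # [2]
print(boolean_and("time", "country", index))  # [1, 2]
```

OR would use a set union instead of an intersection, and NOT a set difference against the full collection.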

Page 41:
Page 42:

Information need

Index

Pre-process

Parse

Collections

Rank

Query

text input

How is the index constructed?

The section that follows is about

Querying (and ranking)

Evaluate

Page 43:

Simple Query Language: Boolean

– Terms + Connectors (or operators)

– terms
• words

• normalized (stemmed) words

• phrases

– connectors
• AND

• OR

• NOT

• NEAR (Pseudo Boolean)

Word    Doc
Cat     x
Dog
Collar  x
Leash

Page 44:

Boolean Queries

• Cat

• Cat OR Dog

• Cat AND Dog

• (Cat AND Dog)

• (Cat AND Dog) OR Collar

• (Cat AND Dog) OR (Collar AND Leash)

• (Cat OR Dog) AND (Collar OR Leash)

Page 45:

Boolean Searching

“Measurement of the width of cracks in prestressed concrete beams”

Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete

[Venn diagram over the four concepts: Cracks, Beams, Width measurement, Prestressed concrete]

Relaxed Query:
(C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)

Page 46:

Ordering of Retrieved Documents

• Pure Boolean has no ordering

• In practice:
– order chronologically
– order by total number of “hits” on query terms

• What if one term has more hits than others?

• Is it better to have one hit on each term, or many hits on one term?

Page 47:

Boolean Model
• Advantages
– simple queries are easy to understand
– relatively easy to implement

• Disadvantages
– difficult to specify what is wanted
– too much returned, or too little
– ordering not well determined

• Dominant language in commercial Information Retrieval systems until the WWW

Since the Boolean model is limited, let’s consider a generalization…

Page 48:

Vector Model
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
– A vector is like an array of floating-point numbers
– Has direction and magnitude
– Each vector holds a place for every term in the collection
– Therefore, most vectors are sparse

• Smithers secretly loves Monty Burns
• Monty Burns secretly loves Smithers

Both map to… [ Burns, loves, Monty, secretly, Smithers ]

Page 49:

Document Vectors
One location for each word

Terms: nova, galaxy, heat, h’wood, film, role, diet, fur
Document ids A–I, one row per document (each value sits in one of the term columns above):

A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3

Page 50:

We Can Plot the Vectors

[Plot with axes “Star” and “Diet”, showing a doc about astronomy, a doc about movie stars, and a doc about mammal behavior.]

Page 51:

Documents in 3D Vector Space

[Illustration from Jurafsky & Martin: documents D1–D11 plotted in the space spanned by terms t1, t2, t3.]

Page 52:

Vector Space Model

docs  Homer  Marge  Bart
D1      *             *
D2      *
D3             *      *
D4      *
D5      *      *      *
D6      *      *
D7             *
D8             *
D9                    *
D10            *      *
D11     *             *
Q              *

Note that the query is projected into the same vector space as the documents.

The query here is for “Marge”.

We can use a vector similarity model to determine the best match to our query (details in a few slides).

But what weights should we use for the terms?

Page 53:

Assigning Weights to Terms

• Binary Weights

• Raw term frequency

• tf x idf
– Recall the Zipf distribution
– Want to weight terms highly if they are:

• frequent in relevant documents … BUT

• infrequent in the collection as a whole

Page 54:

Binary Weights

• Only the presence (1) or absence (0) of a term is included in the vector

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1

We have already seen and discussed this model.

Page 55:

Raw Term Weights

• The frequency of occurrence for the term in each document is included in the vector

docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0  10   0
D9    0   0   1
D10   0   3   5
D11   4   0   1

This model is open to exploitation by websites: pages stuffed with a word repeated over and over (“sex sex sex sex sex …”) score highly on that word.

Counts can be normalized by document lengths.

Page 56:

tf * idf Weights

• tf * idf measure:
– term frequency (tf)
– inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution

• Goal: assign a tf * idf weight to each term in each document

Page 57:

tf * idf

w_ik = tf_ik * log(N / n_k)

where:
T_k = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C
N = total number of documents in collection C
n_k = number of documents in C that contain T_k

idf_k = log(N / n_k)

Page 58:

Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

idf_k = log(N / n_k)

log(10000 / 10000) = 0
log(10000 / 5000) = 0.301
log(10000 / 20) = 2.698
log(10000 / 1) = 4
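The idf values above, and the resulting tf * idf weight, can be checked in a few lines (base-10 log, matching the slide; Python's rounding gives 2.699 where the slide truncates to 2.698, and the tf of 3 in the last line is an invented example):

```python
import math

def idf(N, n_k):
    # idf_k = log(N / n_k)
    return math.log10(N / n_k)

N = 10000
for n_k in (10000, 5000, 20, 1):
    print(n_k, round(idf(N, n_k), 3))  # 0.0, 0.301, 2.699, 4.0

# tf * idf weight for a term occurring 3 times in a document
# and appearing in 20 of the 10000 documents:
print(round(3 * idf(N, 20), 2))  # 8.1
```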

Page 59:

Similarity Measures

Simple matching (coordination level match):
|Q ∩ D|

Dice’s Coefficient:
2 |Q ∩ D| / (|Q| + |D|)

Jaccard’s Coefficient:
|Q ∩ D| / |Q ∪ D|

Cosine Coefficient:
|Q ∩ D| / (|Q|^½ · |D|^½)

Overlap Coefficient:
|Q ∩ D| / min(|Q|, |D|)

Page 60:

Cosine

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

cos θ1 ≈ 0.74 (angle between Q and D1)
cos θ2 ≈ 0.98 (angle between Q and D2)

[Plot of Q, D1 and D2 as vectors in the plane, axes 0.2–1.0.]
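A direct Python sketch of the cosine computation on the vectors above (Python gives cos θ1 ≈ 0.73, which the slide rounds up to 0.74):

```python
import math

def cosine(u, v):
    # Cosine of the angle between u and v: dot product divided by
    # the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2))  # 0.73
print(round(cosine(Q, D2), 2))  # 0.98
```

So D2 points in nearly the same direction as Q and is ranked higher, even though other measures might disagree.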

Page 61:

Vector Space Similarity Measure

D_i = (w_i1, w_i2, …, w_it)
Q = (w_q1, w_q2, …, w_qt)
(w = 0 if a term is absent)

If term weights are normalized:

sim(Q, D_i) = Σ_{j=1..t} w_qj · w_ij

Otherwise, normalize in the similarity comparison:

sim(Q, D_i) = Σ_{j=1..t} w_qj · w_ij / ( sqrt(Σ_{j=1..t} w_qj²) · sqrt(Σ_{j=1..t} w_ij²) )

Page 62:

Problems with Vector Space

• There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model

• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms

Page 63:

Probabilistic Models

• Rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query

• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

• Relies on accurate estimates of probabilities

Page 64:
Page 65:

Relevance Feedback
• Main Idea:
– Modify the existing query based on relevance judgements
• Query Expansion: Extract terms from relevant documents and add them to the query
• Term Re-weighting: and/or re-weight the terms already in the query

– Two main approaches:
• Automatic (pseudo-relevance feedback)
• Users select relevant documents
– Users/system select terms from an automatically-generated list

Page 66:

Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.

Suppose you are interested in bovine agriculture on the banks of the river Jordan…

Search -> Display Results -> Gather Feedback -> Update Weights

Before feedback:
Term Vector  [Jordan, Bank, Bull, River]
Term Weights [  1   ,  1  ,  1  ,  1   ]

After feedback:
Term Vector  [Jordan, Bank, Bull, River]
Term Weights [ 1.1  , 0.1 , 1.3 , 1.2  ]

Page 67:

Rocchio Method

Q1 = Q0 + (β / n1) Σ_{i=1..n1} R_i − (γ / n2) Σ_{i=1..n2} S_i

where:
Q0 = the vector for the initial query
R_i = the vector for relevant document i
S_i = the vector for non-relevant document i
n1 = the number of relevant documents chosen
n2 = the number of non-relevant documents chosen

β and γ tune the importance of relevant and non-relevant terms (in some studies best to set β to 0.75 and γ to 0.25)
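A sketch of the update in Python. The three-term vectors and the relevance judgements below are invented for illustration, and β, γ take the 0.75/0.25 values mentioned above.

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    # Q1 = Q0 + (beta / n1) * sum(R_i) - (gamma / n2) * sum(S_i)
    n1, n2 = len(relevant), len(nonrelevant)
    q1 = []
    for j in range(len(q0)):
        r = sum(doc[j] for doc in relevant) / n1 if n1 else 0.0
        s = sum(doc[j] for doc in nonrelevant) / n2 if n2 else 0.0
        q1.append(q0[j] + beta * r - gamma * s)
    return q1

q0 = [1.0, 1.0, 0.0]                          # initial query vector
relevant = [[1.0, 0.0, 1.0], [0.5, 0.0, 0.5]]  # judged relevant
nonrelevant = [[0.0, 1.0, 0.0]]                # judged non-relevant
print(rocchio(q0, relevant, nonrelevant))  # [1.5625, 0.75, 0.5625]
```

Terms supported by the relevant documents (the first and third) gain weight, while the term that only appears in the non-relevant document loses weight.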

Page 68:

Rocchio Illustration

Although we usually work in vector space for text, it is easier to visualize Euclidean space.

[Three panels: the original query; term re-weighting (note that both the location of the center and the shape of the query have changed); query expansion.]

Page 69:

Rocchio Method

• Rocchio automatically
– re-weights terms
– adds in new terms (from relevant docs)

• Most methods perform similarly
– results heavily dependent on test collection

• Machine learning methods are proving to work better than standard IR approaches like Rocchio

Page 70

Using Relevance Feedback

• Known to improve results

• People don’t seem to like giving feedback!

Page 71

[Diagram: IR system pipeline — an information need becomes a query of text input, which is parsed and pre-processed; the document collections are parsed, pre-processed, and indexed; the query is matched against the index to rank results. How is the index constructed?]

The section that follows is about Evaluation.

Page 72

Evaluation

• Why Evaluate?

• What to Evaluate?

• How to Evaluate?

Page 73

Why Evaluate?

• Determine if the system is desirable

• Make comparative assessments

Page 74

What to Evaluate?

• How much of the information need is satisfied.

• How much was learned about a topic.

• Incidental learning:– How much was learned about the collection.– How much was learned about other topics.

• How inviting the system is.

Page 75

What to Evaluate?

What can be measured that reflects users’ ability to use the system? (Cleverdon 66)

– Coverage of Information
– Form of Presentation
– Effort required / Ease of Use
– Time and Space Efficiency
– Recall
  • proportion of relevant material actually retrieved
– Precision
  • proportion of retrieved material actually relevant

(Recall and Precision together measure effectiveness.)

Page 76

Relevant vs. Retrieved

[Venn diagram: within All docs, the set of Relevant docs and the set of Retrieved docs overlap.]

Page 77

Precision vs. Recall

Recall    = |RelRetrieved| / |Rel in Collection|

Precision = |RelRetrieved| / |Retrieved|
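The two set-based definitions are straightforward to compute over document ids; the document ids below are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall over document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant          # RelRetrieved: relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
print(p, r)  # 0.5 and one third: half of what we returned was good,
             # but we only found a third of what was out there
```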

Page 78

Why Precision and Recall?

Intuition:

Get as much good stuff while at the same time getting as little junk as possible.

Page 79

Retrieved vs. Relevant Documents

Very high precision, very low recall

Page 80

Retrieved vs. Relevant Documents

Very low precision, very low recall (0 in fact)

Page 81

Retrieved vs. Relevant Documents

High recall, but low precision

Page 82

Retrieved vs. Relevant Documents

High precision, high recall (at last!)

Page 83

Precision/Recall Curves

• There is a tradeoff between Precision and Recall

• So measure Precision at different levels of Recall

• Note: this is an AVERAGE over MANY queries

[Plot: precision (y-axis) vs. recall (x-axis), with measured points marked ×]
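For a single query, one common way to get such a curve is interpolated precision at fixed recall levels (the maximum precision at any recall at or above each level). A sketch, with a made-up ranking and relevance labels:

```python
def precision_at_recall_levels(ranking, relevant, levels=(0.25, 0.5, 0.75, 1.0)):
    """Interpolated precision at fixed recall levels for one ranked list:
    at each level, take the max precision at any recall >= that level."""
    relevant = set(relevant)
    hits = 0
    points = []  # (recall, precision) observed after each rank position
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return [max((p for r, p in points if r >= lvl), default=0.0)
            for lvl in levels]

print(precision_at_recall_levels(["d1", "d2", "d3", "d4"], relevant={"d1", "d4"}))
# [1.0, 1.0, 0.5, 0.5]
```

Averaging these per-level values over many queries gives the averaged curve the slide describes.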

Page 84

Precision/Recall Curves

• Difficult to determine which of these two hypothetical results is better:

[Plot: precision vs. recall curves for two hypothetical systems]

Page 85

Document Cutoff Levels

• Another way to evaluate:
  – Fix the number of documents retrieved at several levels:
    • top 5
    • top 10
    • top 20
    • top 50
    • top 100
    • top 500
  – Measure precision at each of these levels
  – Take a (weighted) average over results

• This is a way to focus on how well the system ranks the first k documents.
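Precision at a cutoff k is a one-liner over the top of the ranking; the ranking and relevance set below are made up for illustration:

```python
def precision_at_k(ranking, relevant, k):
    """Precision over the top-k retrieved documents."""
    top = ranking[:k]
    return sum(1 for d in top if d in relevant) / k

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d3", "d4"}
for k in (1, 3, 5):
    print(k, precision_at_k(ranking, relevant, k))
# 1 -> 1.0, 3 -> 0.666..., 5 -> 0.6
```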

Page 86

Problems with Precision/Recall

• Can’t know the true recall value
  – except in small collections
• Precision and Recall are related
  – a combined measure is sometimes more appropriate
• Assumes batch mode
  – interactive IR is important and has different criteria for successful searches
• Assumes a strict rank ordering matters

Page 87

Relation to Contingency Table

                       Doc is Relevant   Doc is NOT relevant
Doc is retrieved             a                   b
Doc is NOT retrieved         c                   d

• Accuracy:  (a + d) / (a + b + c + d)
• Precision: a / (a + b)
• Recall:    a / (a + c)
• Why don’t we use Accuracy for IR?
  – (Assuming a large collection)
  – Most docs aren’t relevant
  – Most docs aren’t retrieved
  – Inflates the accuracy value
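A quick numeric check of why accuracy misleads in IR; the collection sizes below are hypothetical:

```python
def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

def recall(a, c):
    return a / (a + c)

# A large collection: 10 relevant docs out of 1,000,000 total.
# A system that retrieves NOTHING gets a=0, b=0, c=10, d=999,990.
a, b, c, d = 0, 0, 10, 999_990
print(accuracy(a, b, c, d))  # 0.99999 -- near-perfect "accuracy"
print(recall(a, c))          # 0.0     -- yet it found nothing
```

Because almost every document is both non-relevant and non-retrieved, the huge d cell dominates accuracy, which is why IR uses precision and recall instead.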

Page 88

The E-Measure

Combine Precision and Recall into one number (van Rijsbergen 79):

E = 1 − (1 + b²) / (b²/R + 1/P)

P = precision
R = recall
b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.
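The formula is easy to check numerically; note that with b = 1, the quantity 1 − E reduces to the harmonic mean of P and R (the familiar F1 score):

```python
def e_measure(p, r, b=1.0):
    """van Rijsbergen's E: E = 1 - (1 + b^2) / (b^2/r + 1/p).
    Lower is better; b < 1 weights precision more heavily."""
    return 1.0 - (1.0 + b * b) / (b * b / r + 1.0 / p)

print(e_measure(0.5, 0.5))         # 0.5 (so 1 - E = 0.5, the F1 score here)
print(e_measure(0.9, 0.1, b=0.5))  # precision-heavy weighting
```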

Page 89

How to Evaluate? Test Collections

Page 90

TREC

• Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology)
  – 2004 (November) will be the 13th year

• Collection: >6 gigabytes (5 CD-ROMs), >1.5 million docs
  – Newswire & full-text news (AP, WSJ, Ziff, FT)
  – Government documents (Federal Register, Congressional Record)
  – Radio transcripts (FBIS)
  – Web “subsets”

Page 91

TREC (cont.)

• Queries + Relevance Judgments
  – Queries devised and judged by “Information Specialists”
  – Relevance judgments done only for those documents retrieved, not the entire collection!

• Competition
  – Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  – Results judged on precision and recall, going up to a recall level of 1000 documents

Page 92

TREC

• Benefits:
  – made research systems scale to large collections (pre-WWW)
  – allows for somewhat controlled comparisons

• Drawbacks:
  – emphasis on high recall, which may be unrealistic for what most users want
  – very long queries, also unrealistic
  – comparisons still difficult to make, because systems differ on many dimensions
  – focus on batch ranking rather than interaction
  – no focus on the WWW

Page 93

TREC is changing

• Emphasis on specialized “tracks”
  – Interactive track
  – Natural Language Processing (NLP) track
  – Multilingual tracks (Chinese, Spanish)
  – Filtering track
  – High-Precision
  – High-Performance

• http://trec.nist.gov/

Page 94

Homework…