
Page 1: CS336: Intelligent Information Retrieval Lecture 8: Indexing Models

CS336: Intelligent Information Retrieval

Lecture 8: Indexing Models

Page 2

Basic Automatic Indexing
• Parse documents to recognize structure
  – e.g. title, date, other fields
• Scan for word tokens
• Stopword removal
• Stem words
• Weight words
  – using frequency in documents and database
  – frequency data independent of retrieval model
• Optional
  – phrase indexing
  – thesaurus classes
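The pipeline above can be sketched in a few lines of Python; the tiny stoplist and the suffix-stripping "stemmer" are toy stand-ins for real components such as a Porter stemmer:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "to", "and", "in", "is"}  # tiny illustrative stoplist

def stem(word):
    # crude suffix stripping; a real system would use e.g. the Porter stemmer
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(text):
    """Tokenize, remove stopwords, stem, and count raw term frequencies."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # scan for word tokens
    terms = [stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)                             # within-document frequencies

print(index_document("The retrieval of information and the indexing of documents"))
```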

Page 3

Words vs. Terms vs. “Concepts”
• Concept-based retrieval
  – often used to imply something beyond word indexing
• In virtually all systems, a concept is a name given to a set of recognition criteria or rules
  – similar to a thesaurus class
• Words, phrases, synonyms, and linguistic relations can all be evidence used to infer presence of the concept
  – e.g. “information retrieval” can be inferred based on:
    • “information”
    • “retrieval”
    • “information retrieval”
    • “text retrieval”

Page 4

Phrases
• Both statistical and syntactic methods have been used to identify “good” phrases
• Proven techniques:
  – find all word pairs that occur more than n times
  – use a POS tagger to identify simple noun phrases
• 1,100,000 phrases extracted from all TREC data (>1,000,000 documents)
• 3,700,000 phrases extracted from PTO 1996 data
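The first technique, collecting word pairs that occur more than n times, can be sketched directly (the corpus and threshold here are toy illustrations):

```python
from collections import Counter

def frequent_pairs(documents, n=2):
    """Count adjacent word pairs across documents; keep those occurring more than n times."""
    pairs = Counter()
    for doc in documents:
        words = doc.lower().split()
        pairs.update(zip(words, words[1:]))  # adjacent pairs only
    return {p: c for p, c in pairs.items() if c > n}

docs = ["information retrieval systems",
        "information retrieval models",
        "models of information retrieval"]
print(frequent_pairs(docs, n=2))  # {('information', 'retrieval'): 3}
```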

Page 5

Phrases
• Phrases can have an impact on both effectiveness and efficiency
  – phrase indexing will speed up phrase queries
  – finding documents containing “Black Sea” is better than finding documents containing both words independently
  – the effect on effectiveness is not straightforward and depends on the retrieval model
    • e.g. for “information retrieval” or “green house”, how much should the individual words count?
  – effectiveness can also depend on the collection

Page 6

Top Phrases from TIPSTER

65824 United States
61327 Article Type
33864 Los Angeles
18062 Hong Kong
17788 North Korea
17308 New York
15513 San Diego
15009 Orange County
12869 prime minister
12799 first time
12067 Soviet Union
10811 Russian Federation
9912 United Nations
8127 Southern California
7640 South Korea
7620 end recording
7524 European Union
7436 South Africa
7362 San Francisco
7086 news conference
6792 City Council
6348 Middle East
6157 peace process
5955 human rights
5837 White House
5778 long time
5776 Armed Forces
5636 Santa Ana
5619 Foreign Ministry
5527 Bosnia-Herzegovina
5458 words indistinct
5452 international community
5443 vice president
5247 Security Council
5098 North Korean
5023 Long Beach
4981 Central Committee
4872 economic development
4808 President Bush
4652 press conference
4602 first half
4565 second half
4495 nuclear weapons
4448 UN Security Council
4426 South Korean
4219 first quarter
4166 Los Angeles County
4107 State Duma
4085 State Council
3969 market economy
3941 World War II

Page 7

Top Phrases from Patents

975362 present invention
191625 U.S. Pat
147352 preferred embodiment
95097 carbon atoms
87903 group consisting
81809 room temperature
78458 SEQ ID
75850 BRIEF DESCRIPTION
66407 prior art
59828 perspective view
58724 first embodiment
56715 reaction mixture
54619 DETAILED DESCRIPTION
54117 ethyl acetate
52195 Example 1
52003 block diagram
46299 second embodiment
41694 accompanying drawings
40554 output signal
37911 first end
35827 second end
34881 appended claims
33947 distal end
32338 cross-sectional view
30193 outer surface
29635 upper surface
29535 preferred embodiments
29252 present invention provides
29025 sectional view
28961 longitudinal axis
27703 title compound
27434 PREFERRED EMBODIMENTS
27184 side view
25903 inner surface
25802 Table 1
25047 lower end
25047 plan view
24513 third embodiment
24432 control signal
24296 upper end
24275 methylene chloride
24117 reduced pressure
23831 aqueous solution
23618 SEQUENCE DESCRIPTION
23616 SEQUENCE CHARACTERISTICS
22382 weight percent
22070 closed position
21356 light source
21329 image data
21026 flow chart
21003 PREFERRED EMBODIMENT

Page 8

Phrases from 50 TREC Queries

14 international criminal activity
9 international criminal
1436 criminal activity
84 hubble telescope
188 passenger vehicle
9086 civil war
255 hydroelectric project
5261 detailed description
183 rap music
1449 negative effect
8081 young people
297 radio wave
26 radio tower
404 car phone
135 brain cancer
5 theft of trade secret
1324 trade secret
573 sources of information
530 trade journal
334 business meet
506 patent office
1870 trade show
26 competitor's product
63 growing plant
41 magnetic levitate
38 commercial harvest
58 highway accident

Page 9

Collocation (Co-occurrence)

• Co-occurrence patterns of words and word classes reveal significant information about how a language is used
• Applications:
  – building dictionaries (lexicography)
  – IR tasks such as phrase detection, indexing, query expansion, and thesaurus construction
• Co-occurrence is measured over text windows
  – a typical window may be 100 words
  – smaller windows are used for lexicography, e.g. adjacent pairs or 5 words

Page 10

Collocation

• The typical measure used is the point version of the mutual information measure (compared to the expected value of I, sometimes called EMIM)

• The paired t-test is also used to compare collocation probabilities

• Other tests, such as chi-square, can also be used

I(x, y) = log [ p(x, y) / ( p(x) p(y) ) ]

t = ( x̄1 - x̄2 ) / sqrt( s1^2/n1 + s2^2/n2 )
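A minimal sketch of the point mutual information computation, using adjacent-pair co-occurrence counts over a toy token sequence (corpus and word choices are illustrative):

```python
import math
from collections import Counter

def pmi(tokens, x, y):
    """I(x,y) = log2( p(x,y) / (p(x) p(y)) ) from adjacent-pair co-occurrence."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_x = unigrams[x] / n
    p_y = unigrams[y] / n
    p_xy = bigrams[(x, y)] / (n - 1)   # n - 1 adjacent pairs in the sequence
    return math.log2(p_xy / (p_x * p_y))

tokens = "new york is not like new jersey but new york is big".split()
print(round(pmi(tokens, "new", "york"), 3))
```

A strongly positive value means the pair co-occurs far more often than chance, the signature of a collocation.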

Page 11

Indexing Models
• Which terms should index a document?
  – what terms describe the documents in the collection?
  – what terms are good for discriminating between documents?
• Different focus than the retrieval model, but related
• Sometimes seen as term weighting:
  – TF.IDF
  – Term Discrimination model
  – 2-Poisson model
  – Language models
  – Clumping model

Page 12

Indexing Models
• Term weighting has 2 components:
  – a term weight indicating the term's relative importance
  – a similarity measure that uses term weights to calculate the similarity between query and document
• How do we determine the importance of indexing terms?
  – We want to do this in a way that distinguishes relevant documents from non-relevant documents

Page 13

TF.IDF: Standard Approach
• TF component measures “aboutness”
  – frequency of a word within a document
    • if apple occurs frequently in a document, then the document is probably about apples
  – normalize term frequency, since it may vary with other variables
    • e.g. longer docs will have more words and higher word frequencies
  – normalization can be based on maximum term frequency or could include a document length component
• logs are used to smooth numbers for large collections
  – c is a constant, typically between 0.4 and 1, since a single occurrence is important
  – tf = term frequency in the doc
  – max_tf = the maximum term frequency in any document

e.g.) tf' = c + (1 - c) * log(tf + 0.5) / log(max_tf + 1.0)
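The normalized tf formula on this slide can be coded directly; c = 0.4 below is just one value from the suggested range:

```python
import math

def norm_tf(tf, max_tf, c=0.4):
    """Normalized tf: c + (1 - c) * log(tf + 0.5) / log(max_tf + 1.0)"""
    return c + (1 - c) * math.log(tf + 0.5) / math.log(max_tf + 1.0)

# a single occurrence still earns substantial weight because of the constant c
print(round(norm_tf(1, 10), 3))   # ≈ 0.501
print(round(norm_tf(10, 10), 3))  # ≈ 0.988
```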

Page 14

TF.IDF
• Inverse document frequency (IDF) measures “discrimination value”
  – if “apple” occurs in many documents, will it tell us anything about how those documents differ?
  – Sparck Jones (1972): IDF = log(N/df) + 1
    • N is the number of documents in the collection
    • df is the number of documents the term occurs in
• wt (term weight) = TF * IDF
  – reward a term for occurring frequently in a document (high tf)
  – penalize it for occurring frequently in the collection (low idf)

Page 15

Computing idf

• Assume 10,000 documents (logs here are base 2).
• If one term appears in 10 documents:
  idf = log(10,000/10) + 1 = log(1,000) + 1 ≈ 11
• If another term appears in 100 documents:
  idf = log(10,000/100) + 1 = log(100) + 1 ≈ 8

Page 16

Term Discrimination Model
• Based on the vector space model
  – documents and queries are vectors in an n-dimensional space for n terms
• Basic idea: compute the discrimination value of a term
  – the degree to which use of the term will help to distinguish documents
  – based on comparing the average similarity of documents with and without the index term

Page 17

Term Discrimination Model
• Compute the average similarity or “density” of the document space based on the discrimination value of terms
  – add a term to the model if it decreases the average document similarity (i.e. improves our ability to distinguish documents)

[Figure: document space density]
a) before term assignment
b) after assignment of a good discriminator: documents are less similar
c) after assignment of a poor discriminator: documents are more similar

Page 18

Term Discrimination Model

• Compute the average similarity or “density” of the document space
  – AVGSIM is the density
  – K is a normalizing constant (e.g. 1/(n(n-1)))
  – similar() is a similarity function such as the cosine correlation
• Can be computed more efficiently using an average document, or centroid
  – frequencies in the centroid vector are averages of frequencies in the document vectors

AVGSIM = K Σ_{i=1..n} Σ_{j=1..n, j≠i} similar(Di, Dj)

AVGSIM = K Σ_{i=1..n} similar(D̄, Di), where D̄ is the average (centroid) document

Page 19

Term Discrimination Model
• (AVGSIM)t = document density with term t removed from all docs
• DISCVALUEt = (AVGSIM)t - AVGSIM
• Good discriminators: DISCVALUEt > 0
  – using the term makes documents look less similar
  – typically medium-frequency terms
• Indifferent discriminators: DISCVALUEt ≈ 0
  – use of the term has no effect
  – typically low-frequency terms
• Poor discriminators: DISCVALUEt < 0
  – use of the term increases the density (docs look more similar)
  – typically high-frequency terms
• Criticism: discrimination of relevant from non-relevant documents is the important factor
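The whole computation can be sketched over a toy collection of term-frequency vectors, using cosine similarity for similar() as the slides suggest; the three documents below are made up for illustration:

```python
import math

def cosine(u, v):
    """Cosine correlation between two term-frequency dicts."""
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def avgsim(docs):
    """Density: K * sum over i != j of similar(Di, Dj), with K = 1/(n(n-1))."""
    n = len(docs)
    k = 1 / (n * (n - 1))
    return k * sum(cosine(di, dj) for i, di in enumerate(docs)
                   for j, dj in enumerate(docs) if i != j)

def discvalue(docs, term):
    """DISCVALUE_t = (AVGSIM with t removed) - AVGSIM; positive => good discriminator."""
    without = [{t: f for t, f in d.items() if t != term} for d in docs]
    return avgsim(without) - avgsim(docs)

# 'the' occurs everywhere with high frequency; 'jet' in a single document
docs = [{"the": 5, "jet": 2}, {"the": 4, "flutter": 3}, {"the": 6, "panel": 1}]
print(discvalue(docs, "the") < 0)   # True: high-frequency 'the' is a poor discriminator
print(discvalue(docs, "jet") > 0)   # True: 'jet' is a good discriminator
```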

Page 20

Discriminators for 3 Collections

                     Cranfield 424   MED 450        Time 425
Best discriminators: panel           marrow         Buddhist
                     flutter         Amyloidosis    Diem
                     jet             Lymphostasis   Lao
                     cone            Hepatitis      Arab
                     separate        Hela           Viet
                     shell           antigen        Kurd
                     yaw             chromosome     Wilson
                     nozzle          irradiate      Baath
                     transit         tumor          Park
                     degree          virus          Nenni

Worst discriminators: equate         clinic         work
                     theo            children       lead
                     bound           act            Red
                     effect          high           minister
                     solution        develop        nation
                     method          treat          party
                     press           increase       commune
                     result          result         U.S.
                     number          cell           govern
                     flow            patient        new

• Does this technique ensure that relevant documents are distinguishable from non-relevant documents?

Page 21

Summary
• The indexing model identifies how to represent documents
• Content-based indexing
  – typically uses features occurring within the document
• Identify the features used to represent documents
  – words, phrases, concepts, etc.
• Normalize if needed
  – stopping, stemming, etc.
• Assign index term weights (a measure of significance)
  – TF*IDF, discrimination value, etc.
• Other decisions are determined by the retrieval model
  – e.g. how to incorporate term weights

Page 22

Queries

• What is a query?
• Query languages
• Query formulation
• Query processing

Page 23

Queries and Information Needs

• Information need is specific to searcher

• Many different kinds of information needs
  – known item
  – known attribute
  – general content search
  – exhaustive literature review

• Information need often poorly understood
  – evolves during the search process
  – influenced by collection and system

• Serendipity

• Query is some interpretable form of information need

Page 24

Queries

• Inherent ambiguity!

• Form of query depends on the intended interpreter
  – NL statement for a colleague

– NL statement for a reference librarian

– free text statement for a retrieval system

– Boolean expression for a retrieval system

• Often multiple query ‘translations’:
  – Judge describes need to law clerk

– Clerk describes need to law librarian

– Librarian formulates free text query to Westlaw

– Westlaw translates query to internal form for search engine

– Westlaw translates for external systems (e.g. Dialog, Dow Jones)

Page 25

• Different IR systems generate different answers to different kinds of queries (e.g., a ranked full-text IR system vs. a Boolean system)

• Different IR models dictate which queries can be formulated

• For conventional IR models, natural language (NL) queries are the main type of query

• To find relevant answers to a query, techniques have been developed to enhance and preprocess it (e.g., synonyms of keywords, thesauri, stemming, stopwords)

Page 26

Query Formulation

• 2 basic query language types:
  – Boolean, structured
  – free text
• Many systems support some combination of both
• The user interface is a crucial part of query formulation
  – covered later
• Tools provided to support formulation:
  – query processing and weighting
  – query expansion
  – dictionaries and thesauri
  – relevance feedback

Page 27

Boolean Queries
• Queries that combine words and Boolean operators
• Syntax of a Boolean query:
  – Boolean operators:
    • OR, AND, and BUT (i.e., NOT) are common operators
• Retrieved documents satisfy the Boolean algebra; the result is a document set rather than a ranked list

e.g.) information AND retrieval

The result set contains all documents with at least one occurrence of both query words.
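With an inverted index mapping each term to the set of documents that contain it, the AND is just a set intersection; the tiny index below is illustrative:

```python
# inverted index: term -> set of document ids (toy example)
index = {
    "information": {1, 2, 4},
    "retrieval": {2, 3, 4},
    "boolean": {3},
}

def boolean_and(*terms):
    """Result set: documents containing at least one occurrence of every term."""
    sets = [index.get(t, set()) for t in terms]
    result = sets[0].copy()
    for s in sets[1:]:
        result &= s          # set intersection implements AND
    return result

print(boolean_and("information", "retrieval"))  # {2, 4}
```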

Page 28

Boolean Queries

• May sort the retrieved documents by some criterion

• Drawbacks of Boolean queries:
  – Users must be familiar with Boolean expressions
  – Classic IR systems based on Boolean queries provide no ranking (a document either matches or it doesn't) and hence no partial matching

• Extended Boolean systems: fuzzy Boolean operators
  – Relax the meaning of AND and OR using SOME
  – Rank documents according to the number of matched operands in a query
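Ranking by the number of matched operands can be sketched like this (the inverted index and query are toy illustrations):

```python
# toy inverted index: term -> set of document ids
toy_index = {
    "fuzzy": {1, 3},
    "boolean": {1, 2},
    "ranking": {1, 2},
}

def rank_by_matches(terms):
    """Score each document by how many of the query operands it matches."""
    scores = {}
    for t in terms:
        for doc in toy_index.get(t, set()):
            scores[doc] = scores.get(doc, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank_by_matches(["fuzzy", "boolean", "ranking"]))  # [(1, 3), (2, 2), (3, 1)]
```

Document 1 matches all three operands and ranks first; a strict Boolean AND would have returned it alone, with no partial matches.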

Page 29

Natural Language Querying

• NL queries are popular because they are intuitive, easy to express, and fast to rank

• Simplest form: a word or set of words

• Complex form: combinations of operations with words

• Basic queries:
  – queries composed of a set of individual words
  – multiple-word queries (including phrases or proximity)
  – pattern-matching queries

Page 30

Single-Word Querying

• Retrieved documents
  – contain at least one of the words in the query
  – are typically ranked according to their similarity to the query, based on tf and idf

• Supplement single-word querying by considering the proximity of words, i.e., their context

• Related words often appear together (co-occurrence)

Page 31

Queries Exploiting Context
• Phrase:
  – query contains groups of adjacent words
    • stopwords are typically eliminated
• Proximity:
  – relaxed-phrase queries
  – a sequence of single words and/or phrases with a maximum distance between them
  – distance can be measured in characters or words
  – the words/phrases specified in the query can appear in any order in the retrieved docs
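A sketch of a word-distance proximity check over token positions; the window size and sample text are illustrative:

```python
def proximity_match(text, w1, w2, max_distance):
    """True if w1 and w2 occur within max_distance words of each other, in any order."""
    words = text.lower().split()
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    return any(abs(i - j) <= max_distance for i in pos1 for j in pos2)

doc = "the telescope aboard the shuttle photographed hubble repairs"
print(proximity_match(doc, "hubble", "telescope", 5))  # True
print(proximity_match(doc, "hubble", "telescope", 3))  # False
```

Note the match succeeds even though the words appear in the opposite order from the query, which is exactly how a relaxed-phrase query differs from a phrase query.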

Page 32

Natural Language Queries

• Rank retrieved documents according to their degree of matching

• Negation: documents containing the negated words are penalized in the ranking computation

• Establish a threshold to eliminate low-weighted documents

Page 33

Query Processing
• Query processing steps are similar to automatic document indexing
  – text is less grammatical and shorter
• User interaction is possible and desirable
  – relevance feedback
• Query-based stemming and stopwords
• Most important steps:
  – identifying phrases
  – identifying negation
  – identifying “core” terms
  – expanding the query with related terms
    • either automatically or interactively with concept clusters

Page 34

Determining Core Concepts

• “What research is ongoing to reduce the effects of osteoporosis in existing patients as well as prevent the disease occurring in those unafflicted at this time?”– core concept: “osteoporosis”

• “Annual budget and/or cost involved with the management and upkeep of National Parks in the U.S.”– “National Parks”

• Use combination of linguistic analysis, weighting, and corpus analysis of query word relationships to identify core concepts and increase weight

Page 35

TREC Queries

#q307 = #WSUM ( 1 1.0 #WSUM ( 1.0
  1 project
  1 construct
  1 extent
  1 desire
  1 country
  1 consequence
  1 purpose
  1 nature
  1 hydroelectric
  1.5 #foreigncountry
  1 locate
  1 propose
  1.5 #passage25( #PHRASE( hydroelectric project ) ) )

1.25 #WSUM( 1.0
  1 project
  0.987143 construct
  0.974286 dam
  0.961429 #3( federal power act )
  0.948571 #3( power project )
  0.935714 #3( feasible study )
  0.922857 ferc
  0.91 #3( dam project )
  0.897143 turbine
  0.884286 #3( water manage )
  0.871429 #3( rio arriba county )
  0.858571 #3( mr. sharp )
  0.845714 electric
  0.832857 #3( construct license )
  0.82 #3( ferc project )
  0.807143 doe
  0.794286 reclamation
  0.781429 wcua
  0.768571 #3( federal energy regulatory commission )
  0.755714 commence
  0.742857 laos
  0.73 hungary
  0.717143 #3( vinh son )

Page 36

Clusters from Breast Cancer query

Group 1: breast cancer patient, breast exam, breast tissue, u.s. women, cancer kills, cancer society, cancer specialist, family history, mammogram, mammography

Group 2: chemotherapy, lumpectomy, lymph node, mastectomy, radiation therapy, recurrence, survival rate

Group 3: breast implant, implant, silicone gel, silicone gel breast implant, silicone implant

Group 4: birth control pill, breast cancer risk, menopause, sex hormone

Group 5: breast cancer surgery, cancer surgery

Group 6: national cancer institute, sloan kettering cancer center

Group 7: breast cancer research, self examination

Page 37

Other Representations
• N-grams
  – for spelling, Soundex, OCR errors
• Hypertext
  – citations
  – web links
• Reduced dimensionality
  – LSI
  – neural networks
• Natural language processing
  – semantic primitives, frames