LIS618 lecture 6 Vector Model and ProQuest Thomas Krichel 2011-11-01

LIS618 lecture 6

Vector Model and ProQuest

Thomas Krichel 2011-11-01

advantages of Boolean model

• supposedly easy to grasp by the user

• precise semantics of queries

• implemented in the majority of commercial systems

problems of Boolean model

• sharp distinction between relevant and irrelevant documents

• no ranking possible

• users find it difficult to formulate Boolean queries

• users find it difficult to resolve Boolean queries

vector model

• associates weights with each index term appearing in the query and in each database document.

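• relevance can be calculated as the cosine of the angle between the two vectors, i.e. their dot product divided by the product of their lengths, where the length of a vector is the square root of the sum of its squared weights. Because the weights are non-negative, this measure varies between 0 and 1.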
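
The cosine measure can be sketched in a few lines of Python. This is only an illustration: the function name cosine, the dictionary representation of the vectors and the weights used are assumptions, not part of any particular retrieval system.

```python
import math

def cosine(query, document):
    """Cosine of the angle between two term-weight vectors.

    Both arguments are dicts mapping an index term to its weight.
    """
    # Dot product: sum of the products of weights for the shared terms.
    dot = sum(w * document[t] for t, w in query.items() if t in document)
    # Length of each vector: square root of the sum of squared weights.
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_q * norm_d)

# Hypothetical weights, chosen only for illustration.
query = {"vector": 1.0, "model": 1.0}
doc = {"vector": 0.5, "model": 0.5, "boolean": 0.2}
print(cosine(query, doc))
```

A document that shares no term with the query gets a cosine of 0; a document whose weights are proportional to those of the query gets 1.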

tf/idf

• stands for term frequency / inverse document frequency

• This refers to a technique that gives a term a high weight in a document if
– the term appears frequently in the document
– the term does not appear frequently in other documents

• We will look at each component one at a time.

absolute & maximum term frequency

• Let F_t_d be the number of times term t appears in the document d. This is its absolute term frequency in the document.

• Let m_d be the maximum absolute term frequency achieved by any term in document d. Examples
– Document 1: a b a a b c c d; m_1 = 3, because "a" appears 3 times
– Document 2: a b a f f f e d f a a; m_2 = 4, because "a" and "f" each appear 4 times

relative document term frequency

• The relative term frequency f_t_d is given by
f_t_d = F_t_d / m_d
that is, the absolute term frequency of term t in document d divided by the maximum absolute term frequency of document d.

• This completes the "term frequency" part of the tf/idf formula.

• Let us look at this part through an example.

main example, part I

• Consider three documents
– 1: a b c a f o n l p o f t y x
– 2: a m o e e e n n n a n p l
– 3: r a e e f n l i f f f f x l

• First, look at the maximum frequency achieved by any term in a given document.
m_1 = 2 ("a", "f" and "o" are there twice)
m_2 = 4 ("n" is there four times)
m_3 = 5 ("f" is there five times)

main example part II

• Now look at some examples of absolute term frequency
F_a_1 = 2
F_e_2 = 3
F_x_3 = 1

• and some examples of relative term frequency
f_a_1 = F_a_1 / m_1 = 2 / 2 = 1
f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75
f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
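
The same numbers can be checked with a short Python sketch. The names docs and relative_tf are made up for this illustration; each letter of the example documents is treated as one term.

```python
from collections import Counter

# The three documents of the main example, one letter per term.
docs = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def relative_tf(term, doc_id):
    """Return f_t_d = F_t_d / m_d for the given term and document."""
    counts = Counter(docs[doc_id])   # absolute term frequencies F_t_d
    m_d = max(counts.values())       # maximum absolute term frequency m_d
    return counts[term] / m_d

print(relative_tf("a", 1), relative_tf("e", 2), relative_tf("x", 3))
```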

inverse document frequency

• Let N be the number of documents in the database. N = 3 in our example.

• Let n_t be the number of documents where the term t appears. In our example
n_a = 3
n_e = 2
n_x = 2

• N/n_t is an indication of the inverse document frequency of a term. The fewer documents in the database a term appears in, the larger it is.

intermezzo: the logarithm

• The logarithm, written log(), is a mathematical function. You should know that
– log() is an increasing function, i.e. the bigger x is, the bigger log(x) is.
– log(1) = 0
– log(x) > 0 if x > 1

• Your calculator will tell you what the logarithm of a number is.

tf/idf formula

• Term frequency and inverse document frequency have to be combined.

• The final formula for the weight combines the terms as follows

w_t_d = f_t_d * log( N / n_t )

main example part III

N = 3
w_a_1 = 1 * log(3/3) = log(1) = 0 !
w_e_2 = 0.75 * log(3/2)
w_x_3 = 0.2 * log(3/2)

where log(3/2) = 0.176, approximately (using the base-10 logarithm)
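
As a rough check, the weights of the main example can be computed with a small Python sketch. The names docs and weight are illustrative, and the base-10 logarithm is assumed because it reproduces log(3/2) = 0.176.

```python
import math
from collections import Counter

# The three documents of the main example, one letter per term.
docs = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def weight(term, doc_id):
    """Return w_t_d = f_t_d * log(N / n_t), using the base-10 logarithm."""
    counts = Counter(docs[doc_id])
    f_t_d = counts[term] / max(counts.values())               # relative term frequency
    n_t = sum(1 for words in docs.values() if term in words)  # documents containing t
    if n_t == 0:
        return 0.0  # the term appears nowhere in the database
    return f_t_d * math.log10(len(docs) / n_t)

for term, doc_id in [("a", 1), ("e", 2), ("x", 3)]:
    print(term, doc_id, weight(term, doc_id))
```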

practical operation

• The computer will search the documents for the query term and return the documents where the weight of the term in the index for that document is strictly positive, in order of weight, highest to lowest.

• If there are several query terms the computer will perform a more complicated operation that we will not further study here, so we limit ourselves to the case of one query term.
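
The one-term case can be sketched as follows. This is a toy illustration on the three documents of the main example, not how a real system is implemented; the names docs, weight and search are made up, and the base-10 logarithm is assumed.

```python
import math
from collections import Counter

# The three documents of the main example, one letter per term.
docs = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def weight(term, doc_id):
    """tf/idf weight of a term in a document (base-10 logarithm assumed)."""
    counts = Counter(docs[doc_id])
    n_t = sum(1 for words in docs.values() if term in words)
    if n_t == 0:
        return 0.0
    return counts[term] / max(counts.values()) * math.log10(len(docs) / n_t)

def search(term):
    """Return (document, weight) pairs with strictly positive weight, best first."""
    hits = [(doc_id, weight(term, doc_id)) for doc_id in docs]
    hits = [(doc_id, w) for doc_id, w in hits if w > 0]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Calling search("n"), for instance, returns an empty list, because a term that appears in every document has weight zero everywhere.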

practical tests

• You ask the computer to query the term "a" in our example. What documents are being returned? – Compare with the result of the Boolean model.

• You ask the computer to query the term "e". What documents are being returned, and in what order?

advantages of vector model

• term weighting improves performance

• sorting is possible

• easy to compute, therefore fast

• results are difficult to improve without
– query expansion
– user feedback cycle

ProQuest search targets

• ProQuest searches “citations” and “documents”.

• “citations” are descriptions of documents, such as author names, titles, journal etc.

• “documents” contain the full-text of documents.

• Target differences imply different behavior of an expression when matched against a candidate.

ProQuest search

• If you enter two search terms, they will be used as one phrase.

• If you enter three terms, they are searched for as appearing in proximity to each other.

• You can force phrase interpretation by placing the search expressions into double quotes.

terms

• A search term is something you type and that has a meaning on its own.

• For example: house, or krichel.

• Terms have a regular expression interpretation.

regular expressions

• ‘*’ is used as a right-hand truncation character only; it will find all forms of a word. For example, searching for “econom*” finds “economy”, “economics” and “economical”.

• ‘?’ is used to replace any single character, either inside the word or at the right end of the word. For example, searching for “wom?n” finds “woman” and “women”.

• ‘?’ cannot be used to begin a word.
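
To see what these two characters mean, here is a sketch that translates a term into a Python regular expression. It is not ProQuest's implementation: the function names are hypothetical, and treating a "character" as a word character (\w) and ignoring case are assumptions.

```python
import re

def wildcard_to_regex(term):
    """Translate a term using '*' and '?' into a compiled regular expression."""
    if term.startswith("?"):
        raise ValueError("'?' cannot be used to begin a word")
    if "*" in term[:-1]:
        raise ValueError("'*' is a right-hand truncation character only")
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")          # any run of word characters, possibly empty
        elif ch == "?":
            parts.append(r"\w")           # exactly one word character
        else:
            parts.append(re.escape(ch))   # every other character stands for itself
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term, word):
    """True if the whole word matches the term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

print(matches("econom*", "economics"), matches("wom?n", "women"))
```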

operators: and

• AND finds documents that contain all the words.

• When searching for keywords in "Citation and Document Text," AND finds documents in which the words occur in the same paragraph (within approx. 1000 characters) or the words appear in any citation field.

operator: and not, or

• “and not” is the same as “not” in Dialog.

• “or” is a normal Boolean or.

proximity operators

• W/number finds documents where these words are within the given number of words of each other (either before or after). Use when searching for keywords within "Citation and Document Text" or "Document Text."
Example: computer W/3 careers

• NOT W/number does the opposite.

proximity operators

• W/PARA finds documents where these words are within the same paragraph (within approx. 1000 characters). Use when searching for keywords within "Document Text."
Example: internet W/PARA web

proximity operators

• W/DOC finds documents where all the words appear within the document text. Use W/DOC in place of AND when searching for keywords within "Citation and Document Text" or "Document Text" to retrieve more comprehensive results.
Example: Internet W/DOC education

proximity operators

• PRE/number finds documents where the first word appears within the given number of words before the second word.

• Use when searching for keywords within "Citation and Document Text" or "Document Text."
Example: world pre/3 web
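
The difference between W/number and PRE/number can be illustrated with a toy Python sketch. It is not ProQuest's implementation: the function names are hypothetical, the text is split on whitespace only, and "within n words" is assumed to mean that the word positions differ by at most n.

```python
def positions(words, term):
    """Indexes at which the term occurs in the list of words."""
    return [i for i, w in enumerate(words) if w == term]

def within(text, first, second, n):
    """W/n: the two words occur at most n positions apart, in either order."""
    words = text.lower().split()
    return any(abs(i - j) <= n
               for i in positions(words, first)
               for j in positions(words, second))

def pre(text, first, second, n):
    """PRE/n: `first` occurs before `second`, at most n positions earlier."""
    words = text.lower().split()
    return any(0 < j - i <= n
               for i in positions(words, first)
               for j in positions(words, second))

print(within("careers in computer science", "computer", "careers", 3))
print(pre("the world wide web", "world", "web", 3))
```

The order of the words matters only for pre(): swapping the two words in the second call changes the result, while within() gives the same answer either way.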

field syntax

• It is possible to limit a search for a term to a field.

• This is done by writing field(term)

abstract

• ABS() searches article abstracts for your terms.

• Examples: ABS(customer delight) ABS(ozone)

appendix

• APX() searches the appendix of a document. The appendix usually comes at the end of the document, identified by a header.

• Use Keywords to search this field.

• Example: APX(Michigan)

author

• AU() is used to find articles written by an author or reviewer.

• Example: AU(Thomas Krichel)

Classification code (ABI)

• Use Classification Codes when searching business topics. Classification Codes are a fast way to precisely target a search by topic, industry or market, geographical area, or article type.

• Examples: CC(1120) for Economic Policy & Planning

• This only applies to a subset of data from ABI/INFORM, which has these codes.

Coden

• This is used to search the coden index. A coden is an alphanumeric code used for shelving/ordering books and journals in libraries, often based on a publication’s title.

• Example: CODEN(EDUSBI)

Column / Document Column Head

• The title of a column in a periodical or newspaper, such as “The Week in Review”. This search field finds all articles where the search words are in the column head.

• Examples: COL(futures) COL("The Week In Review")

company / organization

• CO() searches for an organization featured prominently in an article, such as
– associations and cooperatives
– companies and their divisions
– governmental organizations and political parties
– sports teams, music bands and churches
– Native American tribes

• Comes with LCO({}) option for full matches.

publication date

• PDN() searches the publication date in numeric format (mm/dd/yyyy).

• You can use the < and > signs to indicate dates before and after a date, or between specific dates.

• For example, PDN(>1/1/2002) AND PDN(<1/5/2002) will find results from publications with numeric dates between January 1 2002 and January 5 2002.

dateline

• DLN() searches article datelines. The dateline occurs frequently in newspapers, just after the article title, giving the date and place of the article's origin. You can use Boolean, proximity and truncation operators.

• Example: DLN(lebanon pre/1 ohio)

document features

• SF() is used to search document features, such as an index or auxiliary materials, that may be included in or accompany a document.

• The document features indexed are:
– Graphs and Illustrations
– Maps
– References
– Tables

search by proquest handle

• ID() searches the unique database ID for articles and documents in ProQuest.

• Examples: ID(356894)

document language

• LA() is used to search the language index. This field contains the language in which the document was originally published.

• Examples: LA(french) LN(french or english)

document text

• TEXT() searches only the full text of articles for your search terms. Article abstracts are not included in this search. AND, OR, and other search operators are treated as operators unless enclosed in quotes.

• Examples: TEXT(Kofi Annan) TEXT("North Sea oil")

title searches

• TI() searches the title of a document, such as “Seigniorage, Taxation and Myopia in EMU”

document type

• DT() is used to look for search words or phrases in documents of a certain type.

• Examples: DT(commentary) DT(editorial cartoon) DT(review) DT(arts/exhibits review) DT(television review-no opinion)

company number

• DUNS() searches the Dun & Bradstreet trading partner identification number. These numbers provide a universal system for computer identification of companies.

• Examples: DUNS(00 695 7856) DUN(03 575 3920)

footnote

• FOOT() searches the article footnotes for your terms.

• Examples: FOOT(326 U.S. 465)

volume

• Volume() searches the volume.

• Examples: VO(100)

word count

• WC() restricts the number of words in the article text. Use this search field to locate articles under (<) or over (>) a certain length.

• Examples:
– WC(<1000)
– WC(>500)
– WC(>750 AND <1000)

year

• YR() searches the publication year.

• Examples: YR(1986) YR(1986-1987) YR(>1998) YR(<1998)

location

• GEO() is used to look for articles in which a geographical area or location figures prominently in the text.

• Examples: GEO(Midwest) GN(UK) GEO(New South Wales) GN(Black Forest)

• Comes with LGEO({})

headnote

• HEAD() looks for words that occur in the headnotes of an article. Headnotes are short introductions, explanations, or comments at the beginning of an article. They are different from abstracts in that they do not attempt to summarize the content of the article.

• Examples: HEAD(escalator accidents) HDN(digital tv) HEAD(Global Economy)

caption texts

• CAP() looks for occurrences of search words in the caption text accompanying article illustrations, graphs, and photographs.

• Examples: CAP(Chart)

(additional) index

• INDEX() locates all occurrences of search words in any searchable index field. It does not find occurrences in the text of the articles.

• Examples: INDEX(starcore)

ISSN

• ISSN() looks for the eight-digit International Standard Serial Number (ISSN), where available. Hyphens are optional.

• Examples: ISSN(0011-4664) SN(00916358)

issue()

• ISSUE() is used to search the issue number.

• Valid forms: ISSUE, IS

• Examples: IS(10)

NAICS / SIC

• NAICS() or SIC() searches for industry codes. The NAICS/SIC code defines the economic activity of a business as defined by the US Census Bureau.

• Examples: SIC(4911) SIC(514210)

start page

• PAGE() is used for specific pages of a publication. Useful for finding front page articles.

• Example: PAG(A.1) AND PUB(wall street journal) AND PDN(1/10/2003)

person

• NAME() finds articles about a person. When the Personal Name field is displayed in an article citation, the life spans of historical figures follow their names.

• You can enter the name in any format. Searching for NA(John A Smith) will return the same results as NA(Smith, John A).

product name

• PROD() finds articles about a specific product.

• Examples: PROD(TiVo) PR(harley-davidson)

journal

• JN() is used to search by a specific publication or publications.

• Examples: JN(Forbes) JN(New York Times or Washington Post) JN(computing) — retrieves all periodicals with "computing" in their titles

section

• SECTION() finds articles that appear in a specific section of a publication. Use the SOURCE search field to specify a publication. You must specify the section name exactly as it appears in the publication.

• Examples:
– SOURCE(New York Times) AND SECTION(editorial) AND AU(Gore Vidal)
– SEC(sports) AND NA(Florence Griffith Joyner)

source type

• STYPE() is used to include or exclude the following source types from your search: dissertations, newspapers, periodicals and wire feeds.

• Examples:
– NA(Winston Churchill) AND STYPE(periodical)
– GEO(Japan) AND STYPE(wire feed)

subject terms

• SU() is used to look for articles about a specific subject. When searching Hoover's, this contains information on company type.

• Examples: SU(Music) SU(venture capital companies) SU(Health Care) SU(nonprofit)

• Comes with LSU({}) facility

combined search

• When you select “Citations and abstracts” from the drop-down menu, ProQuest searches the following fields: AU(), NAME(), ABS(), PN(), TI(), SU(), CO(), SO(), GEO()

http://openlib.org/home/krichel

Please shut down the computers when you are done.

Thank you for your attention!