Seamless Searching of Numeric and Textual Resources. Funded by a National Library Leadership Grant from the Institute of Museum and Library Services. Michael Buckland, Aitao Chen, Fredric Gey and Ray Larson. Friday Afternoon Seminar, Feb 14, 2003. http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf

Seamless Searching of Numeric and Textual Resources

Funded by a National Library Leadership Grant from

the Institute of Museum and Library Services

Michael Buckland, Aitao Chen, Fredric Gey and Ray Larson

Friday Afternoon Seminar, Feb 14, 2003

http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf

From numbers to texts:

Iritani, Evelyn. "Normalizing ties to Vietnam important steps for U.S. firms; California stands to profit handsomely when barriers fall to trade with fast-growing country." Los Angeles Times v114 (July 12, 1995):D1.

An article found using the keywords “Import” and “Vietnam” as the query.

From text to numbers:

"U.S. bans import of most European meat." Los Angeles Times v116, n14 (Dec 14, 1997):A22. (On fear of mad cow disease.)

"Ban on cattle and sheep is extended to all Europe." New York Times v147, sec1 (Dec 14, 1997):16(N), 42(L). (The U.S. Agriculture Department responds to threat of 'Mad Cow' disease.)

Topic of interest: imports of beef to the United States from Britain

The sources at http://govinfo.kerr.orst.edu/import/import.html show no reported edible beef imports from the United Kingdom.

Seamless Search Project Goals:

• Phase I: The development and demonstration of a library gateway supporting searches of both text and socio-economic numeric databases.

• Phase II: The demonstration of a library gateway supporting searches between text and numeric databases.

Data Sets to create Entry Vocabulary Indexes: MELVYL MARC Files

<RECORD>
<001> 73180254 </001>
<245><a>A study of operant conditioning under delayed reinforcement in early infancy</a></245>
<650><a>Infant psychology.</a></650>
<650><a>Operant conditioning.</a></650>
</RECORD>

Number of MARC records in the training data set: ~4,246,000.

Field 245: Book title

Fields 650: LC Subject Headings

A sample training record extracted from a MARC record.
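
A minimal sketch of how such a record could be turned into a training pair of title words and subject headings. The record text is copied from the slide, with every field closed by its matching end tag. The function names, the regular expression, and the lowercasing are assumptions for illustration, not the project's actual code.

```python
import re

# Sample training record from the slide (each field closed by its end tag).
SAMPLE = """<RECORD>
<001> 73180254 </001>
<245><a>A study of operant conditioning under delayed reinforcement in early infancy</a></245>
<650><a>Infant psychology.</a></650>
<650><a>Operant conditioning.</a></650>
</RECORD>"""


def subfields(record, tag):
    """Return the text of every <a> subfield inside fields with this tag."""
    pattern = rf"<{tag}>\s*<a>(.*?)</a>\s*</{tag}>"
    return [m.strip() for m in re.findall(pattern, record, flags=re.S)]


def training_pair(record):
    """Hypothetical helper: (title words, subject headings) for one record."""
    title = " ".join(subfields(record, "245"))
    words = re.findall(r"[a-z]+", title.lower())  # no stopword removal here
    headings = [h.rstrip(".").lower() for h in subfields(record, "650")]
    return words, headings


words, headings = training_pair(SAMPLE)
print(words)
print(headings)
```

Each of the roughly 4.2 million records would contribute one such pair.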

[Figure: title words (behavior, infant, infancy, psychology, child, attitude, baby, development) linked through document IDs (doc1 to doc5) to LCSHs (Infant psychology, Operant conditioning, Infant development, Psychology, Parent and child). Columns: Title Words, Doc IDs, LCSHs.]

Statistical association of title words and LCSH
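
The slides do not say which statistic produces the association weights. A minimal sketch, assuming a log-likelihood ratio computed over co-occurrence counts of title words and LCSHs; the function names and the toy records are invented for illustration.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    def h(*counts):
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)    # row sums
                  - h(k11 + k21, k12 + k22))   # column sums


def association_weights(pairs):
    """pairs: iterable of (title_words, headings), one entry per record."""
    n = 0
    word_df, head_df, joint = Counter(), Counter(), Counter()
    for words, headings in pairs:
        n += 1
        ws, hs = set(words), set(headings)
        word_df.update(ws)
        head_df.update(hs)
        joint.update((w, h) for w in ws for h in hs)
    weights = {}
    for (w, h), k11 in joint.items():   # only co-occurring pairs are scored
        k12 = word_df[w] - k11          # records with the word, not the heading
        k21 = head_df[h] - k11          # records with the heading, not the word
        k22 = n - k11 - k12 - k21       # records with neither
        weights[(w, h)] = llr(k11, k12, k21, k22)
    return weights


# Invented toy records, not data from the project.
toy = [
    (["infant", "behavior"], ["infant psychology"]),
    (["infancy", "psychology"], ["infant psychology"]),
    (["child", "development"], ["parent and child"]),
    (["baby", "attitude"], ["parent and child"]),
]
weights = association_weights(toy)
print(weights[("infant", "infant psychology")])
```

Sorting the headings for one word by weight gives a ranked list of the kind shown on the EVI slides.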

Word to LCSH Entry Vocabulary Index (EVI)

Rank  LCSH                              Weight
1     alcoholism                        7470.46
2     alcoholic                         1745.23
3     alcohol                           709.26
4     alcoholism and employment         318.26
5     drug abuse                        257.75
6     alcohol, ethyl                    235.13
7     drinking of alcoholic beverages   151.46
8     substance abuse                   146.04

List of the LCSHs that are most closely associated, statistically, with the query word: alcoholism.

Words to LCSH Entry Vocabulary Index (EVI)

Rank  LCSH                Weight
1     economic policy     756.90
2     german (west)       645.02
3     switzerland         97.70
4     regional planning   96.39
5     economics           92.14

List of LCSHs that are most closely associated, statistically, with the German query word: Wirtschaftspolitik.

Note: The top-ranked LCSH “economic policy” happens to be the English translation of the German word “Wirtschaftspolitik”.

Words to LCSH Entry Vocabulary Index (EVI)

Rank  LCSH                      Weight
1     peanut                    1343.90
2     cookery (peanut butter)   429.61
3     cookery (peanuts)         423.47
4     peanut industry           359.57
5     peanut butter             316.23
6     butter                    309.36
7     schulz, charles m         277.30
8     cookery                   197.08

List of LCSHs that are most closely associated, statistically, with the phrase peanut butter as a query.
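
How the weights of the individual words in a multi-word query are combined is not stated on the slides. A minimal sketch, assuming the per-word weights are simply summed; `rank_headings`, the `toy_evi` structure, and its numbers are invented and are not the values in the table above.

```python
from collections import defaultdict


def rank_headings(query, evi, top_n=5):
    """Rank headings by the summed weights of the query words (an assumption)."""
    scores = defaultdict(float)
    for word in query.lower().split():
        for heading, weight in evi.get(word, {}).items():
            scores[heading] += weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Invented word -> {heading: weight} index for illustration only.
toy_evi = {
    "peanut": {"peanut": 10.0, "peanut butter": 4.0, "peanut industry": 3.0},
    "butter": {"butter": 6.0, "peanut butter": 5.0},
}
ranked = rank_headings("peanut butter", toy_evi)
for rank, (heading, weight) in enumerate(ranked, start=1):
    print(rank, heading, weight)
```

Under this assumption, a heading associated with both words rises above headings associated with only one of them unless a single-word weight dominates, which is consistent with "peanut" outranking "peanut butter" in the table.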

Word to LCSH Entry Vocabulary Index (EVI)

Rank  LCSH                            Weight
1     world war, 1939-1945            16430.62
2     vietnamese conflict, 1961-1975  15388.68
3     united states                   13989.66
4     world war, 1914-1918            8055.60
5     vietnam                         6523.90

List of LCSHs that are most closely associated, statistically, with the query: Vietnam War.

Note: “Vietnam War” is not an established (authorized) LCSH. The established LCSH is “Vietnamese conflict”.

LCSH to Words Entry Vocabulary Index

Rank  Words       Weight
1     alcohol     13471.94
2     alcoholism  11715.56
3     abuse       3708.09
4     drug        3467.22
5     drink       2563.53
6     alcoholic   2534.91
7     treatment   2349.03
8     prevention  1263.94
9     problem     1148.03
10    addiction   886.81

List of words that are most closely associated, statistically, with the Library of Congress Subject Heading: Alcoholism.

EVI-based Access to MELVYL

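[Figure: architecture diagram with numbered steps 1 to 7. Components: Web browser, Web server (httpd), "evi access" and "gateway access" (CGI), EVI, HTTP/Z39.50 gateway, MELVYL Z39.50 server, other Z39.50 servers. Labeled flows: free-form query, ranked list of LCSHs, search results, full MARC record. Protocols: HTTP, CGI, Z39.50.]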

Counting California Database (http://countingcalifornia.cdlib.org/)

• A collection of some 3,000 numeric tables.

• Organized into 16 topics and 184 subtopics.

Sample topics:
• Banking, Finance and Insurance
• Elections
• Population and Demographics
• Social Services and Public Assistance

Sample subtopics under Agriculture and Natural Resources:
• Farms and Farming
• Fishing
• Forestry and Lumber
• Minerals

Enhanced Access to Counting California Database

• Conventional probabilistic retrieval of numeric tables using table captions, mapping query to text of captions.

• Access to numeric tables through the words-to-subtopic entry vocabulary index.

<table>
<topic> education </topic>
<subtopic> libraries </subtopic>
<caption>STATISTICS, STATEWIDE SUMMARY BY TYPE OF LIBRARY CALIFORNIA, 1992-93 TO 1997-98</caption>
</table>

A sample record created from http://countingcalifornia.cdlib.org.

Probabilistic Access to Counting California Database

The query “public libraries in California” returns a ranked list of captions.
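
The slides do not name the probabilistic model used to rank captions. A minimal sketch that uses BM25 as a stand-in; the tokenizer, the parameter values, and the second and third captions are assumptions for illustration (only the first caption comes from the sample record).

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, captions, k1=1.2, b=0.75):
    """Rank captions against a query with BM25 (stand-in for the real model)."""
    docs = [tokenize(c) for c in captions]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    query_terms = set(tokenize(query))
    scored = []
    for caption, doc in zip(captions, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, caption))
    # Keep only captions that match at least one query term, best first.
    return [c for s, c in sorted(scored, key=lambda x: -x[0]) if s > 0]


captions = [
    # From the sample record on the slide:
    "STATISTICS, STATEWIDE SUMMARY BY TYPE OF LIBRARY CALIFORNIA, 1992-93 TO 1997-98",
    # Invented captions for illustration:
    "PERSONAL INCOME TAX RETURNS BY COUNTY",
    "PUBLIC LIBRARIES, CIRCULATION AND EXPENDITURES",
]
ranked_captions = bm25_rank("public libraries in California", captions)
for caption in ranked_captions:
    print(caption)
```

Without stemming, "libraries" does not match "LIBRARY", so the sample caption matches only on "california"; a production system would normally normalize word forms.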

EVI-based Access to Counting California Database

Ranked list of subtopics that are most closely associated, statistically, with the query: personal/individual income tax.

Rank  Subtopic                              Weight
1     income                                542.53
2     government earnings and tax revenues  251.71
3     property tax                          156.67
4     property tax                          74.58
5     personal income tax                   59.99

Numeric Tables with Subtopic: Personal income tax.

[Figure: diagram with numbered steps 1 to 11. Labels: search interface 1, search interface 2, online catalog, numeric database, EVI, LCSH, MARC, new query, search results, captions, numeric table.]

Traverse Searching Between Online Catalogs and Numeric Databases

Melvyl MARC record as source of a query

Extract from MARC as a query

Any caption can become a query
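
A minimal sketch of how text taken from a MARC record or a table caption might be reduced to a new query. The stopword list and the `text_to_query` helper are hypothetical; the slides do not describe the actual extraction rules.

```python
import re

# Hypothetical stopword list, chosen only to cover the examples below.
STOPWORDS = {"a", "an", "and", "by", "in", "of", "the", "to", "under"}


def text_to_query(*fragments):
    """Join the distinct non-stopword tokens of the fragments into a query."""
    seen, terms = set(), []
    for fragment in fragments:
        for token in re.findall(r"[a-z0-9]+", fragment.lower()):
            if token not in STOPWORDS and token not in seen:
                seen.add(token)
                terms.append(token)
    return " ".join(terms)


# Title and subject headings from the sample MARC record.
marc_query = text_to_query(
    "A study of operant conditioning under delayed reinforcement in early infancy",
    "Infant psychology.",
    "Operant conditioning.",
)
# Caption from the sample Counting California record.
caption_query = text_to_query(
    "STATISTICS, STATEWIDE SUMMARY BY TYPE OF LIBRARY CALIFORNIA, 1992-93 TO 1997-98"
)
print(marc_query)
print(caption_query)
```

The first query could be sent to the numeric database and the second to the online catalog, which is the traversal the diagram describes.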

http://metadata.sims.berkeley.edu/papers/SeamlessSearchFinalReport.pdf

Final Report on “Seamless Searching of Numeric and Textual Resources” Project, 1999-2002.

Two sequels:

1. Adding search by place: “Going Places in the Catalog: Improved Geographic Access,” funded by a National Library Leadership Project from the Institute of Museum and Library Services, 2002-2004.

2. Multilingual Search Across Multiple Genres: Proposal submitted Feb 13, 2003!