Cataloging and Metadata: What does the Future Hold – Issues and Perspectives Michael Norman Head...

Cataloging and Metadata:

What does the Future Hold – Issues and Perspectives

Michael Norman

Head of Content Access Management

University of Illinois at Urbana-Champaign

1

http://netfiles.uiuc.edu/manorman/ILS.ppt

Please, not another how we do it good!

• It will be okay. I’ve got a lot of good things to show you and, hopefully, can advance the discussion on these important issues and solutions.

• I will recount some of the successes we have had – and detail some of the mistakes we have made this past year or so.

• Good, quality, shareable metadata is so damn important

3

@ UIUC Library

• We have learned a lot through the various digitization projects we have been involved with, including the Open Content Alliance, the Illinois Harvest project, and the Google Digitization Project, which starts in the next few months.

• We have learned quite a bit about cataloging and metadata, access systems, search and metasearch, digital preservation, and better ways to make all this information findable.

4

Conversation about the ILS

Today, I want to have a dialogue with you about where we think we are concerning online catalogs, metadata, other access options outside the library world, and metasearch, and about where we go from here to offer better search for our users. We are at a critical stage. Our online catalog is not very good at allowing users to find what they seek. There are better options.

5

6

Problems with our online catalogs

• Search is difficult

• Does not include many of the available resources in our collections, including images, digital collections, many of our electronic resources, archival materials

• Our metadata does not include much of the pertinent information needed to make a judgment about a resource

• Our metadata is hard to discern

7

8

Better Options than our Online CATS

I’m almost at a point where I’d advise our users at the University of Illinois at Urbana-Champaign to begin their search outside our online catalog (particularly at Microsoft Live Book Search, Amazon, and Google Book Search).

Then, after she or he gets results, come back and search our catalog to see if we have it (either the digital or print version).

Does not make me very happy to say that.

9

Why search elsewhere first?

• User can evaluate resource much better through search at Microsoft, Amazon and Google

• Information such as tables of contents, indices, bibliographies, cover images, cover data, summaries, biographies, reviews, etc. makes it easier to determine whether a resource helps one’s research or not

10

Amazon.com

11

Key Phrases – Amazon’s CAPs and SIPs

12

13

Amazon’s CAPs and SIPs

• Capitalized Phrases (CAPs) are people, places, events, or important topics mentioned frequently in a book.

• Statistically Improbable Phrases (SIPs) are the most distinctive phrases in the text of books in Amazon’s Search Inside the Book. To identify SIPs, they scan the text of all books in the Search Inside program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
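
The SIP definition above can be sketched as a simple ratio test: how often does a phrase occur in one book compared with the whole collection? This is only an illustration of the idea, not Amazon's actual method, which the definition does not spell out. The two-word phrase length, the `min_count` threshold, and the function names are all assumptions.

```python
from collections import Counter
import re


def bigrams(text):
    """Lowercase the text and return its list of two-word phrases."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def improbable_phrases(book_id, corpus, top_n=3, min_count=2):
    """Rank phrases in one book by how much more frequent they are
    there than in the corpus as a whole (a rough stand-in for SIPs).

    corpus maps a book identifier to that book's full text.
    """
    book_counts = Counter(bigrams(corpus[book_id]))
    corpus_counts = Counter()
    for text in corpus.values():
        corpus_counts.update(bigrams(text))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases that barely occur in this book
        book_rate = count / book_total
        # The book is part of the corpus, so this rate is never zero.
        corpus_rate = corpus_counts[phrase] / corpus_total
        scores[phrase] = book_rate / corpus_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A phrase that is frequent in every book scores close to 1 and drops out of the ranking, while a phrase concentrated in one book rises to the top.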

14

Machiavelli in Amazon

15

CAPS, Search in the Book, etc.

16

Interlinking of Citations

17

Amazon’s Inside the Book

18

Microsoft Live Books

19

Microsoft Live – Inside Book - Index

20

Microsoft Live Books - Bibliographies

21

Google Book Search

22

Google Book Search – Full text content

23

Google – Publisher supplied

24

Google – TOCs, Summaries

25

Google – References from books, articles, related books

26

Google’s Metadata Records

27

Google’s Metadata Records (continued)

28

Multiple sources of data

• Amazon, Microsoft, and Google are getting this data from various sources including from publishers, vendors such as Bowker, digitization of materials, and harvesting metadata from evaluative sources.

• Millions of items with full-text or partial full-text content

29

30

Still far behind in breadth of collection

• Amazon, Google and Microsoft still don’t have it right. When we do a search, we are searching everything. If you do a search in Microsoft, it is searching across the entire body of full-text content. It is hard to do an advanced search of title, author, series title, publisher, etc.

• They do not have the breadth of titles or sources we have or OCLC WorldCat has. We have a couple hundred years of collecting on them. In 5 to 6 years, yes, they probably will. Eventually, users may be able to search across 60 million full-text resources.

31

Why Amazon, Microsoft, Google?

• Why am I showing what Amazon, Microsoft and Google are doing with regard to search? To make us all feel bad. Maybe. Just a little.

• Really to show alternatives to our online catalogs. What is out there.

• But also to show us some of the opportunities, how we can do better.

• Central to this is metadata – creating surrogate records that help lead users to what they want

32

UIUC work with Open Content Alliance

33

Examples of digitized books

34

Downloading of resources

35

The present

36

NCSU Endeca Catalog

37

Vanderbilt’s Primo

38

Vanderbilt Primo title level

39

Oklahoma State’s Aquabrowser

40

Title level - Aquabrowser

41

Aquabrowser – Searchable TOCs and Summaries

42

UIUC Various Access Systems

• Voyager ILS system

• CONTENTdm – digital images

• Dspace – IDEALS, Illinois Institutional Repository

• DLXS – digital text

• Olive – Newspapers and Serials

• Online Research Resources (ORR) – local electronic resources management system

• Discover/SFX OpenURL knowledge base

43

Metasearch – Is it the answer?

44

UIUC’s Information Gateway

45

Easy Search Results (metasearch)

46

Illinois Harvest – metasearch across formats from OAI Harvesting

47

Illinois Harvest - results with images, learning objects, digitized books, and streaming audio
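
To make the OAI harvesting step more concrete, here is a minimal sketch of reading an OAI-PMH `ListRecords` response that carries simple Dublin Core and grouping the records by format. The sample response is made up for illustration and does not come from Illinois Harvest. The function name `group_by_type` is an assumption, and a real harvester would also fetch pages over HTTP and follow resumption tokens.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample data, trimmed to the elements used below.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Campus photograph</dc:title>
          <dc:type>Image</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Digitized county history</dc:title>
          <dc:type>Text</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def group_by_type(xml_text):
    """Return {dc:type: [(identifier, title), ...]} for harvested records."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        kind = record.findtext(".//dc:type", default="Unknown", namespaces=NS)
        grouped.setdefault(kind, []).append((identifier, title))
    return grouped
```

Grouping on `dc:type` is one simple way to present images, texts, and audio side by side in a single result list.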

48

Positives

• They are pulling in metadata from multiple sources, including the publishers, intermediate vendors and from digitization projects

• They are adding value such as Google maps and textual analysis

• We are still cataloging for a surrogate record environment and we have got to move beyond that quickly.

• We do not have the metadata structures to pull in and incorporate much of the data that is out there: the metadata that Amazon, Microsoft and Google are bringing to bear.

49

Possibilities

• We have access to the same sources of metadata.

• We can get ONIX feeds from publishers.

• We can harvest tables of contents, indexes and bibliographies from the works we are digitizing.

• We can add cover images, book reviews, summaries and abstracts.

• We can crunch data and perform data mining as well as they can.

• With the help of OCLC, we can layer such applications as WorldCat Identities and authority control on top of all this.

50

WorldCat Identities

51

WorldCat Identities - Machiavelli

52

WorldCat Identities Display

53

Identities - Continued

54

55

Metadata

• MARC records still have a role to play.

• MARC cannot be the only game in town anymore. It is not a flexible enough structure or standard to accommodate researchers’ needs, especially with the technological opportunities we have today.

• It cannot accommodate much of the data we need to produce interconnectivity (linking) between resources

56

MARC – Where are we at now?

• Libraries – we still do most of our cataloging in MARC

• Other viable schemas – Dublin Core (both Simple and Qualified), MODS, MARCXML

• Preservation metadata schemas (such as PREMIS)

• Content standards (such as AACR2 and CCO)

• Controlled vocabularies (such as LCSH, TGN, AAT and other applicable vocabularies)

• Transmission standards such as METS

57

ONIX (Online Information Exchange)

• ONIX is a standard format that publishers use to distribute electronic information about their books to wholesale, e-tail and retail booksellers, and other publishers.

• Standard XML template for organizing data storage
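
As a rough illustration of what reading an ONIX feed could look like, the sketch below pulls the title, contributors, and table-of-contents text out of a small hand-written fragment. The fragment is modeled on ONIX 2.1 reference tag names but is heavily simplified. The ISBN is a placeholder, and the code values used (15 for ISBN-13, A01 for author, 04 for table of contents) should be checked against the ONIX code lists before relying on them. The function name `onix_to_records` is an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified, invented sample; a real publisher feed carries far more detail.
SAMPLE_ONIX = """\
<ONIXMessage>
  <Product>
    <RecordReference>example-0001</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <Title>
      <TitleType>01</TitleType>
      <TitleText>The Prince</TitleText>
    </Title>
    <Contributor>
      <ContributorRole>A01</ContributorRole>
      <PersonName>Niccolo Machiavelli</PersonName>
    </Contributor>
    <OtherText>
      <TextTypeCode>04</TextTypeCode>
      <Text>1. How many kinds of principalities there are</Text>
    </OtherText>
  </Product>
</ONIXMessage>
"""


def onix_to_records(xml_text):
    """Return one plain dict per <Product> with the fields a catalog might reuse."""
    root = ET.fromstring(xml_text)
    records = []
    for product in root.iter("Product"):
        records.append({
            "isbn": product.findtext("ProductIdentifier/IDValue"),
            "title": product.findtext("Title/TitleText"),
            "authors": [c.findtext("PersonName")
                        for c in product.findall("Contributor")],
            # Keep only the OtherText blocks flagged as table of contents.
            "toc": [t.findtext("Text")
                    for t in product.findall("OtherText")
                    if t.findtext("TextTypeCode") == "04"],
        })
    return records
```

The resulting dictionaries are deliberately schema-neutral, so the same values could be mapped into MARC, Dublin Core, or a search index.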

58

Metadata Encoding & Transmission Standard (METS)

• The METS schema provides a flexible mechanism for encoding descriptive, administrative, and structural metadata for a digital library object, and for expressing the complex links between these various forms of metadata.

• Provides a useful standard for the exchange of digital library objects between repositories.

• METS provides the ability to associate a digital object with behaviours or services.
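
The three kinds of metadata named above can be seen in a very small METS wrapper. The sketch below builds one for a digitized book: a descriptive section holding a Dublin Core title, a file section listing page images, and a structural map tying pages to files. It is a minimal outline under assumptions: the function name `build_mets`, the ID values, and the example URLs are invented, administrative metadata is omitted, and the output is not validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS), ("xlink", XLINK), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def build_mets(title, page_urls):
    """Wrap a title and an ordered list of page image URLs in a METS element."""
    mets = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata: a Dublin Core title inside dmdSec/mdWrap/xmlData.
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="dmd1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(data, f"{{{DC}}}title").text = title

    # File inventory and structural map are filled in together, page by page.
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="page images")
    struct = ET.SubElement(mets, f"{{{METS}}}structMap", TYPE="physical")
    book = ET.SubElement(struct, f"{{{METS}}}div", TYPE="book", DMDID="dmd1")

    for number, url in enumerate(page_urls, start=1):
        file_id = f"page{number}"
        file_el = ET.SubElement(group, f"{{{METS}}}file", ID=file_id)
        ET.SubElement(file_el, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": url})
        page = ET.SubElement(book, f"{{{METS}}}div",
                             TYPE="page", ORDER=str(number))
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=file_id)
    return mets
```

Calling `ET.tostring(build_mets("The Prince", ["http://example.org/p1.jpg"]))` serializes the wrapper for exchange with another repository.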

59

Interconnectivity

• We can start to create the search environment that allows one to move from

• citation

• to full-text content

• to other works about or cited within a work

• continue to next full-text resource

• Each year over the next 7 years, we will be able to move from full-text content to full-text content

• Moving from bibliography to bibliography, citation to citation; OpenURL can show us the way
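
One way to picture the citation-to-resource hop is an OpenURL built from a citation's fields. The sketch below assembles a key/encoded-value OpenURL (Z39.88-2004) for a book. The resolver address in the usage note is a hypothetical placeholder, not the UIUC resolver, and the function name `book_openurl` is an assumption.

```python
from urllib.parse import urlencode


def book_openurl(resolver_base, title, author=None, isbn=None, year=None):
    """Build an OpenURL 1.0 link for a book citation.

    Fields left as None are omitted from the query string.
    """
    fields = {
        "url_ver": "Z39.88-2004",
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",
        "rft.btitle": title,
        "rft.au": author,
        "rft.isbn": isbn,
        "rft.date": year,
    }
    query = urlencode({k: v for k, v in fields.items() if v is not None})
    return f"{resolver_base}?{query}"
```

For example, `book_openurl("https://resolver.example.edu/openurl", "The Prince", author="Machiavelli, Niccolo")` yields a link that a resolver could turn into the full text or a holdings record for each entry in a bibliography.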

60

Automating Metadata Generation

• I’m the chair of the Automating Metadata Generation Task Force formed by the ALCTS Big Heads of Technical Services and we will have a white paper out this fall outlining the capabilities and possibilities of automating the creation of metadata records.

• And, yes, we can automate many of our processes for creating metadata.

Our structures and standards cannot support this presently

• Can’t fit a lot of this data into a MARC record

• No real standards for indexes, tables of contents, citations, bibliographies. The mark-up languages can accommodate this. To easily pull this valuable data from a resource, we need to be able to easily identify and harvest it

• Can get this data from publishers for recent publications and pull from digitization projects for older materials

• Pull together using metadata record, ONIX and METS wrapper

62

New Systems

• Need system that can read MARC and XML or has the ability to easily convert MARC to MARCXML

• Allows search across surrogate records and full-text content

• Relevancy ranking

• User can easily discern different formats pulled in through metasearch (monographs, articles, images, datasets, citations, etc.)

• Strong structured search and also powerful keyword indexing

• Easy to determine how best to get this piece of information (i.e. Open WorldCat)
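
The MARC to MARCXML conversion mentioned above is mostly a change of container, which the sketch below tries to show. It assumes the MARC record has already been parsed into plain Python tuples (for instance by a MARC library, which is not used here). The function name `to_marcxml` and the sample field values in the usage note are illustrative.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)


def to_marcxml(leader, control_fields, data_fields):
    """Serialize already-parsed MARC fields as a MARCXML <record> element.

    control_fields: list of (tag, value)
    data_fields:    list of (tag, ind1, ind2, [(subfield_code, value), ...])
    """
    record = ET.Element(f"{{{MARC_NS}}}record")
    ET.SubElement(record, f"{{{MARC_NS}}}leader").text = leader
    for tag, value in control_fields:
        ET.SubElement(record, f"{{{MARC_NS}}}controlfield", tag=tag).text = value
    for tag, ind1, ind2, subfields in data_fields:
        field = ET.SubElement(record, f"{{{MARC_NS}}}datafield",
                              tag=tag, ind1=ind1, ind2=ind2)
        for code, value in subfields:
            ET.SubElement(field, f"{{{MARC_NS}}}subfield", code=code).text = value
    return record
```

A call such as `to_marcxml("00000nam a2200000 a 4500", [("001", "example0001")], [("245", "1", "4", [("a", "The prince /"), ("c", "Niccolo Machiavelli.")])])` keeps every tag, indicator, and subfield code, so existing MARC data can sit alongside other XML metadata without loss.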

63

New Systems (continued)

• Ability to harvest data from multiple sources

• Ability to keep this data current and accurate

• Ability to track changes to this data, ensuring we always keep the best

• Have to automate a lot of these processes

• Technologies exist to allow us to do it

• Collaboration
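
Harvesting from several sources, tracking changes, and keeping the best value could be approached with a simple precedence rule, sketched below. The source names and their ranking in `SOURCE_RANK` are assumptions made for the example, not a recommended policy, and `MergedRecord` is a hypothetical helper rather than part of any existing system.

```python
from dataclasses import dataclass, field

# Assumed trust order: a higher number wins when sources disagree.
SOURCE_RANK = {"cataloger": 3, "publisher": 2, "harvested": 1}


@dataclass
class MergedRecord:
    values: dict = field(default_factory=dict)   # field name -> (value, source)
    history: list = field(default_factory=list)  # (field, old, new, source)

    def offer(self, source, fields):
        """Accept values from a source unless a more trusted source already set them."""
        for name, value in fields.items():
            current = self.values.get(name)
            if current is not None and SOURCE_RANK[current[1]] > SOURCE_RANK[source]:
                continue  # keep the value from the more trusted source
            if current is None or current[0] != value:
                old = current[0] if current else None
                self.history.append((name, old, value, source))
            self.values[name] = (value, source)
```

With this rule a cataloger's title survives a later publisher feed, while the publisher can still contribute fields the cataloger never supplied, and `history` records each change for review.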

64
