Cataloging and Metadata:
What does the Future Hold – Issues and Perspectives
Michael Norman
Head of Content Access Management
University of Illinois at Urbana-Champaign
1
http://netfiles.uiuc.edu/manorman/ILS.ppt
Please, not another how we do it good!
• It will be okay. I’ve got a lot of good things to show you. And, hopefully, advance the discussion on these important issues and solutions.
• I will recount some of the successes we have had – and detail some of the mistakes we have made this past year or so.
• Good, quality, shareable metadata is so damn important
3
@ UIUC Library
• We have learned a lot through the various digitization projects we have been involved with, including the Open Content Alliance, the Illinois Harvest project, and the Google Digitization Project, which starts in the next few months.
• We have learned quite a bit about cataloging and metadata, access systems, search and metasearch, digital preservation, and better ways to make all this information findable.
4
Conversation about the ILS
Today, I want to have a dialogue with you about where we think we are concerning online catalogs, metadata, other access options outside the library world, and metasearch, and about where we go from here to offer better search for our users. We are at a critical stage. Our online catalog is not very good at allowing users to find what they seek. There are better options.
5
6
Problems with our online catalogs
• Search is difficult
• Does not include many of the available resources in our collections, including images, digital collections, many of our electronic resources, and archival materials
• Our metadata does not include much of the pertinent information needed to make a judgment about a resource
• Our metadata is hard to discern
7
8
Better Options than our Online CATS
I’m almost at a point where I’d advise our users at the University of Illinois at Urbana-Champaign to begin their search outside our online catalog (particularly at Microsoft Live Book Search, Amazon, and Google Book Search).
Then, after she or he gets results, come back and search our catalog to see if we have the item (in either digital or print version).
Does not make me very happy to say that.
9
Why search elsewhere first?
• User can evaluate resource much better through search at Microsoft, Amazon and Google
• Information such as tables of contents, indexes, bibliographies, cover images, cover data, summaries, biographies, reviews, etc., makes it easier to determine whether a resource helps one’s research or not
10
Amazon.com
11
Key Phrases – Amazon’s CAPs and SIPs
12
13
Amazon’s CAPs and SIPs
• Capitalized Phrases (CAPs) are people, places, events, or important topics mentioned frequently in a book.
• Statistically Improbable Phrases (SIPs) are the most distinctive phrases in the text of books in Amazon’s Search Inside the Book. To identify SIPs, they scan the text of all books in the Search Inside program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
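The SIP idea described above (a phrase that is frequent in one book relative to the whole collection) can be sketched in a few lines. This is only an illustration, not Amazon's actual method, which is not given here: the function names, the use of two-word phrases, the `min_count` cutoff, and the add-one smoothing are all assumptions made for the example.

```python
import re
from collections import Counter


def bigrams(text):
    """Split text into lowercase words and return all adjacent two-word phrases."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def improbable_phrases(book_text, corpus_texts, top_n=3, min_count=2):
    """Rank a book's phrases by how much more frequent they are in the book
    than in the whole collection (hypothetical scoring, for illustration only)."""
    book_counts = Counter(bigrams(book_text))
    corpus_counts = Counter()
    for text in corpus_texts:
        corpus_counts.update(bigrams(text))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases that occur only once in the book
        book_rate = count / book_total
        # Add-one smoothing keeps the ratio defined even if the corpus
        # passed in does not contain the book itself.
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = book_rate / corpus_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented sample texts; the first one plays the role of the book being analyzed.
book = ("civil principality and civil principality again "
        "the people want the people need")
corpus = [
    book,
    "the people vote and the people decide what the people want",
    "the people of the city",
]
top_phrases = improbable_phrases(book, corpus, top_n=2)
```

A phrase such as "the people" is frequent in the sample book but also common everywhere else, so it scores lower than a phrase that is concentrated in the one book.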
14
Machiavelli in Amazon
15
CAPS, Search in the Book, etc.
16
Interlinking of Citations
17
Amazon’s Inside the Book
18
Microsoft Live Books
19
Microsoft Live – Inside Book - Index
20
Microsoft Live Books - Bibliographies
21
Google Book Search
22
Google Book Search – Full text content
23
Google – Publisher supplied
24
Google – TOCs, Summaries
25
Google – References from books, articles, related books
26
Google’s Metadata Records
27
Google’s Metadata Records (continued)
28
Multiple sources of data
• Amazon, Microsoft, and Google are getting this data from various sources including from publishers, vendors such as Bowker, digitization of materials, and harvesting metadata from evaluative sources.
• Millions of full-text or partial full-text content
29
30
Still far behind in breadth of collection
• Amazon, Google and Microsoft still don’t have it right. When we do a search, we are searching everything. If you do a search in Microsoft, it is searching across the entire body of full-text content. It is hard to do an advanced search of title, author, series title, publisher, etc.
• They do not have the breadth of titles or sources we have or OCLC WorldCat has. We have a couple hundred years of collecting on them. In 5 to 6 years, yes, they probably will. Eventually, it may be possible to search across 60 million full-text resources.
31
Why Amazon, Microsoft, Google?
• Why am I showing what Amazon, Microsoft and Google are doing with regard to search? To make us all feel bad. Maybe. Just a little.
• Really to show alternatives to our online catalogs. What is out there.
• But also to show us some of the opportunities, how we can do better.
• Central to this is metadata – creating surrogate records that help lead users to what they want
32
UIUC work with Open Content Alliance
33
Examples of digitized books
34
Downloading of resources
35
The present
36
NCSU Endeca Catalog
37
Vanderbilt’s Primo
38
Vanderbilt Primo title level
39
Oklahoma State’s Aquabrowser
40
Title level - Aquabrowser
41
Aquabrowser – Searchable TOCs and Summaries
42
UIUC Various Access Systems
• Voyager ILS system
• CONTENTdm – digital images
• DSpace – IDEALS, Illinois Institutional Repository
• DLXS – digital text
• Olive – Newspapers and Serials
• Online Research Resources (ORR) – local electronic resources management system
• Discover/SFX OpenURL knowledge base
43
Metasearch – Is it the answer?
44
UIUC’s Information Gateway
45
Easy Search Results (metasearch)
46
Illinois Harvest – metasearch across formats from OAI Harvesting
47
Illinois Harvest - results with images, learning objects, digitized books, and
streaming audio
48
Positives
• They are pulling in metadata from multiple sources, including the publishers, intermediate vendors and from digitization projects
• They are adding value such as Google maps and textual analysis
• We are still cataloging for a surrogate record environment and we have got to move beyond that quickly.
• We do not have the metadata structures to pull in and incorporate much of the data that is out there: the metadata that Amazon, Microsoft and Google are bringing to bear.
49
Possibilities
• We have access to the same sources of metadata.
• We can get ONIX feeds from publishers.
• We can harvest tables of contents, indexes and bibliographies from the works we are digitizing.
• We can add cover images, book reviews, summaries and abstracts.
• We can crunch data and perform data mining as well as they can.
• With the help of OCLC, we can layer such applications as WorldCat Identities and authority control on top of all this.
50
WorldCat Identities
51
WorldCat Identities - Machiavelli
52
WorldCat Identities Display
53
Identities - Continued
54
55
Metadata
• MARC records still have a role to play.
• MARC cannot be the only game in town anymore. It is not a flexible enough structure or standard to accommodate researchers’ needs, especially with the technological opportunities we have today.
• It cannot accommodate much of the data we need to produce interconnectivity (linking) between resources
56
MARC – Where are we at now?
• Libraries – we still do most of our cataloging in MARC
• Other viable schemas – Dublin Core (both Simple and Qualified), MODS, MARCXML
• Preservation metadata schemas (such as PREMIS)
• Content standards (such as AACR2 and CCO)
• Controlled vocabularies (such as LCSH, TGN, AAT and other applicable vocabularies)
• Transmission standards such as METS
57
ONIX (Online Information Exchange)
• ONIX is a standard format that publishers use to distribute electronic information about their books to wholesale, e-tail and retail booksellers, and other publishers.
• Standard XML template for organizing data storage
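To make the ONIX bullet concrete, here is a minimal sketch of pulling a few fields out of a publisher feed with Python's standard library. The sample record is invented and heavily simplified; it follows ONIX 2.1-style reference tag names, but real feeds carry many more elements, and the function name `product_summaries` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, simplified ONIX 2.1-style fragment (not from a real publisher feed).
ONIX_SAMPLE = """<ONIXMessage>
  <Product>
    <RecordReference>example-0001</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <Title>
      <TitleType>01</TitleType>
      <TitleText>An Invented Title</TitleText>
    </Title>
    <Contributor>
      <ContributorRole>A01</ContributorRole>
      <PersonName>Jane Example</PersonName>
    </Contributor>
    <OtherText>
      <TextTypeCode>04</TextTypeCode>
      <Text>Chapter 1. Chapter 2.</Text>
    </OtherText>
  </Product>
</ONIXMessage>"""


def product_summaries(onix_xml):
    """Return one dict per <Product> with the fields a catalog might reuse."""
    root = ET.fromstring(onix_xml)
    summaries = []
    for product in root.iter("Product"):
        summaries.append({
            "reference": product.findtext("RecordReference"),
            "identifier": product.findtext("ProductIdentifier/IDValue"),
            "title": product.findtext("Title/TitleText"),
            "contributors": [
                c.findtext("PersonName") for c in product.findall("Contributor")
            ],
            # In this sample, text type code 04 marks a table of contents.
            "toc": product.findtext("OtherText[TextTypeCode='04']/Text"),
        })
    return summaries


summaries = product_summaries(ONIX_SAMPLE)
```

The same loop could feed tables of contents and summaries into a local index alongside the catalog record.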
58
Metadata Encoding & Transmission Standard (METS)
• The METS schema provides a flexible mechanism for encoding descriptive, administrative, and structural metadata for a digital library object, and for expressing the complex links between these various forms of metadata.
• METS provides a useful standard for the exchange of digital library objects between repositories.
• METS provides the ability to associate a digital object with behaviours or services.
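As a rough illustration of how METS ties descriptive and structural metadata to the files of a digitized object, the sketch below builds a small wrapper with Python's standard library. The helper name `build_mets`, the identifiers, and the URLs are assumptions for the example, and the output is not validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)
ET.register_namespace("dc", DC_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets(object_id, title, page_urls):
    """Build a minimal METS document: one descriptive section, one file
    group, and a physical structure map linking pages to files."""
    root = ET.Element(q("mets"), {"OBJID": object_id})

    # Descriptive metadata: a single Dublin Core title wrapped in the record.
    dmd = ET.SubElement(root, q("dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, q("mdWrap"), {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, q("xmlData"))
    ET.SubElement(xml_data, f"{{{DC_NS}}}title").text = title

    # File inventory and structure map are filled in together, page by page.
    file_sec = ET.SubElement(root, q("fileSec"))
    group = ET.SubElement(file_sec, q("fileGrp"), {"USE": "page images"})
    struct = ET.SubElement(root, q("structMap"), {"TYPE": "physical"})
    book = ET.SubElement(struct, q("div"), {"TYPE": "book", "DMDID": "DMD1"})

    for n, url in enumerate(page_urls, start=1):
        file_id = f"FILE{n}"
        file_el = ET.SubElement(group, q("file"), {"ID": file_id})
        ET.SubElement(file_el, q("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url})
        page = ET.SubElement(book, q("div"), {"TYPE": "page", "ORDER": str(n)})
        ET.SubElement(page, q("fptr"), {"FILEID": file_id})
    return root


# Hypothetical object with two page images.
mets_root = build_mets(
    "example-object-1",
    "An Invented Title",
    ["https://example.org/pages/1.jpg", "https://example.org/pages/2.jpg"],
)
mets_xml = ET.tostring(mets_root, encoding="unicode")
```

The structure map is what lets a viewer walk the pages in order, while the descriptive section can hold or point to a MARC, MODS, or Dublin Core record.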
59
Interconnectivity
• We can start to create the search environment that allows one to move from
• citation
• to full-text content
• to other works about or cited within a work
• continue to next full-text resource
• Each year over the next 7 years, we will be able to move from full-text content to full-text content
• Moving from bibliography to bibliography, citation to citation; OpenURL can show us the way
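The role OpenURL plays in moving from a citation to a copy can be sketched as follows: a citation is encoded as key/value pairs and handed to a link resolver. The keys below follow the OpenURL 1.0 (Z39.88-2004) key/encoded-value book format; the resolver address and the function name `book_openurl` are made up for the example.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def book_openurl(resolver_base, title, author_last, isbn=None):
    """Encode a book citation as an OpenURL 1.0 query against a link resolver."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.genre", "book"),
        ("rft.btitle", title),
        ("rft.aulast", author_last),
    ]
    if isbn:
        params.append(("rft.isbn", isbn))
    return resolver_base + "?" + urlencode(params)


# Hypothetical resolver address; a real one would come from the library's
# own link resolver configuration.
link = book_openurl("https://resolver.example.edu/openurl",
                    "The Prince", "Machiavelli")
decoded = parse_qs(urlparse(link).query)
```

Each citation harvested from a bibliography could be turned into such a link, which is what lets a reader hop from one full-text resource to the next.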
60
Automating Metadata Generation
• I’m the chair of the Automating Metadata Generation Task Force formed by the ALCTS Big Heads of Technical Services and we will have a white paper out this fall outlining the capabilities and possibilities of automating the creation of metadata records.
• And, yes, we can automate many of our processes for creating metadata.
Our structures and standards cannot support this presently
• Can’t fit a lot of this data into a MARC record
• No real standards for indexes, tables of contents, citations, bibliographies. The mark-up languages can accommodate this. To easily pull these valuable data from a resource, we need to be able to easily identify and harvest them
• Can get this data from publishers for recent publications and pull from digitization projects for older materials
• Pull together using metadata record, ONIX and METS wrapper
62
New Systems
• Need system that can read MARC and XML or has the ability to easily convert MARC to MARCXML
• Allows search across surrogate records and full-text content
• Relevancy ranking
• User can easily discern different formats pulled in through metasearch (monographs, articles, images, datasets, citations, etc.)
• Strong structured search and also powerful keyword indexing
• Easy to determine how best to get this piece of information (i.e. Open WorldCat)
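The MARC-to-MARCXML requirement in the list above is mostly a change of container, which the following sketch tries to show. It assumes the MARC record has already been read into simple Python tuples (parsing the binary ISO 2709 format is left out), and the function name `to_marcxml` and the sample record are invented for illustration.

```python
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARCXML_NS)


def to_marcxml(leader, control_fields, data_fields):
    """Serialize already-parsed MARC fields as a MARCXML <record> string.

    control_fields: list of (tag, value)
    data_fields:    list of (tag, ind1, ind2, [(subfield_code, value), ...])
    """
    def q(tag):
        return f"{{{MARCXML_NS}}}{tag}"

    record = ET.Element(q("record"))
    ET.SubElement(record, q("leader")).text = leader
    for tag, value in control_fields:
        ET.SubElement(record, q("controlfield"), {"tag": tag}).text = value
    for tag, ind1, ind2, subfields in data_fields:
        field = ET.SubElement(
            record, q("datafield"), {"tag": tag, "ind1": ind1, "ind2": ind2}
        )
        for code, value in subfields:
            ET.SubElement(field, q("subfield"), {"code": code}).text = value
    return ET.tostring(record, encoding="unicode")


# Invented sample record, pared down to three fields.
marcxml = to_marcxml(
    leader="00000nam a2200000 a 4500",
    control_fields=[("001", "example0001")],
    data_fields=[
        ("100", "1", " ", [("a", "Machiavelli, Niccolò.")]),
        ("245", "1", "4", [("a", "The prince /"), ("c", "Niccolò Machiavelli.")]),
    ],
)
```

Because tags, indicators, and subfield codes carry over one to one, nothing is lost in the conversion, and the XML form can then sit beside ONIX or Dublin Core data inside a METS wrapper.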
63
New Systems (continued)
• Ability to harvest data from multiple sources
• Ability to keep this data current and accurate
• Ability to track changes to this data, ensuring we always keep the best
• Have to automate a lot of these processes
• Technologies exist to allow us to do it
• Collaboration
64