1
William Y. Arms
September 26, 2002
A Research Program for
Information Science with
the NSDL as an Example
2
A Scenario
A faculty member wished to find a paper for students to read in a class. He began by asking an expert. She suggested the original research paper as suitable.
Later, he typed a few terms into Google, browsed the hits, selected one that led to ResearchIndex, found the paper, and downloaded a PDF version from the author's web site.
3
Computer Science
Internet
Web
ResearchIndex
Computer Science
4
HCI
Browsing
Searching
User interface design
Human Computer Interaction
Computer Science
5
HCI: Eye Tracking
6
Roles of expert/instructor/student
Cognitive psychology
Linguistics
Natural language processing
Cognitive Studies, HCI
Cognitive Studies
Computer Science
7
8
Organizational change
Economics
Ethics
Social culture
Law
Society, Cognitive Studies, HCI
Society
Computer Science
9
Society, Cognitive Studies, HCI
Computer Science
Applications
Information Science
10
Open Access to Scientific, Scholarly and
Professional Information
11
Before the Web
Access to scientific, medical, legal information
In the United States:
excellent if you belonged to a rich organization (e.g., a major university)
very poor otherwise
In many countries of the world:
very poor for everybody
12
Some Light Reading
William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm
William Y. Arms, "Automated digital libraries." D-Lib Magazine, July/August 2000. http://www.dlib.org/dlib/july20/07contents.html
William Y. Arms, "What are the alternatives to peer review? Quality control in scholarly publishing on the web." Journal of Electronic Publishing, 8(1), August 2002. http://www.press.umich.edu/jep/08-01/arms.html
13
Research Libraries are Expensive
library materials
buildings & facilities
staff
14
Baumol's Cost Disease
[Figure: price versus year, 1900 to 2050, for manufactured goods, a bundle of goods and services, and labor-intensive services.]
15
Baumol's Cost Disease
[Figure: price versus year, 1900 to 2050, for manufactured goods, a bundle of goods and services, and labor-intensive services, with Moore's Law added.]
16
Brute Force Computing
Few people really understand Moore's Law
Computing power doubles every 18 months
Increases 100 times in 10 years
Increases 10,000 times in 20 years
Simple algorithms
plus
immense computing power
can outperform human intelligence
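The growth figures on this slide follow from compounding a doubling every 18 months. A minimal sketch of the arithmetic, assuming a fixed doubling period (the function name `growth_factor` is illustrative, not from the talk):

```python
def growth_factor(years, doubling_period_years=1.5):
    """Growth in computing power if it doubles every doubling_period_years."""
    return 2 ** (years / doubling_period_years)


for years in (10, 20):
    # 10 years is 6.67 doublings; 20 years is 13.33 doublings.
    print(f"{years} years: about {growth_factor(years):,.0f} times")
```

The results come out slightly above 100 and 10,000, which the slide rounds down.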
17
Example: Catalogs and Indexes
Cost disease: catalogs and indexes
Catalog, index and abstracting records are very expensive when created by skilled professionals
Moore's Law: automatic indexing of full text
Retrieval using automatic indexing can be at least as effective as retrieval using manual indexing with controlled vocabularies
(Cleverdon 1967, reporting on experiments by Salton)
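Automatic indexing of full text can be illustrated with an inverted index built directly from the words of each document, with no human cataloger involved. This is a minimal sketch under simplifying assumptions (lowercased word tokens, Boolean AND matching, no stemming or stop words); it does not reproduce the systems used in the experiments cited above, and the sample documents are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample documents, for illustration only.
documents = {
    1: "Automatic indexing of full text",
    2: "Manual indexing with controlled vocabularies",
    3: "Full text retrieval",
}
index = build_index(documents)
print(search(index, "full text"))
```

The cost of building the index is pure computation, which is why it falls with Moore's Law while the cost of professional cataloging does not.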
18
Resistance to Change
"I used to be a heavy user of INSPEC. Now I use Google instead."
19
Information Discovery: 1992 and 2002
                    1992          2002
Content             print         digital
Computing           expensive     inexpensive
Choice of content   selective     comprehensive
Index creation      human         automatic
Frequency           one time      monthly
Vocabulary          controlled    not controlled
Query               Boolean       ranked retrieval
Users               trained       untrained
20
Brute Force Computing: Substitutes for Human Intelligence
Automated algorithms for information discovery
Similarity of two documents
Vector space and statistical methods
(Salton, Sparck Jones, et al.)
Importance of digital object
Rank importance of web pages by analysis of the graph of web links
(Kleinberg, Page, et al.)
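Both techniques on this slide can be sketched in a few lines. The example below is illustrative only: `cosine_similarity` uses raw term counts as the document vectors (real systems usually weight the terms), and `link_rank` is a simplified power iteration in the spirit of link-based ranking, with an assumed damping factor of 0.85. Neither is presented as the exact method of the authors cited, and both function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_rank(links, damping=0.85, iterations=50):
    """Rank pages from the link graph; links maps a page to the pages it cites."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Share this page's rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


print(cosine_similarity("digital library research", "digital library services"))
print(link_rank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```

In the small graph above, page `c` receives links from both other pages and ends up with the highest rank.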
21
Brute Force Computing: Automated Metadata Extraction
Informedia (Carnegie Mellon)
Automatic processing of segments of video, e.g., television news.
Algorithms for:
dividing raw video into discrete items
generating short summaries
indexing the sound track using speech recognition
recognizing faces
(Wactlar, et al.)
22
23
Simple algorithms
plus
immense computing power
plus
the intelligence of the user
can replace labor-intensive services
Cognitive Studies, HCI
Low Cost Information
Computer Science
24
The National Science Digital Library (NSDL)
25
Scope
All digital information relevant to any level of education in any branch of science.
Scientific and technical information
Materials used in education
Materials tailored to education
26
How Big might the NSDL be?
All branches of science, all levels of education, very broadly defined:
Five-year targets
1,000,000 different users
10,000,000 digital objects
10,000 to 100,000 independent sites
27
Resources
Integration team
Budget $4-6 million
Staff 25-30
Management Diffuse
How can a small team, without direct management control, create a very large-scale digital library?
28
Philosophy
It is possible to build a very large digital library with a small staff.
But ...
Every aspect of the library must be planned with scalability in mind.
Some compromises will be made.
29
Basic Assumptions
The integration team will not manage any collections
The integration team will not create any metadata
30
The Integration Task ...
... to provide a coherent set of collections and services across great diversity
31
Interoperability
The Problem
Conventional approaches require partners to support agreements (technical, content, and business)
But NSDL needs thousands of very different partners
... most of whom are not directly part of the NSDL program
The challenge is to create incentives for independent digital libraries to adopt agreements
32
Function Versus Cost of Acceptance
[Figure: function plotted against cost of acceptance, with regions labeled "Many adopters" and "Few adopters".]
33
Example: Textual Mark-up
[Figure: function plotted against cost of acceptance for ASCII, HTML, XML, and SGML.]
34
The Spectrum of Interoperability
Level        Agreements                             Example
Federation   Strict use of standards                AACR, MARC, Z39.50
             (syntax, semantic, and business)
Harvesting   Digital libraries expose metadata;     Open Archives metadata harvesting
             simple protocol and registry
Gathering    Digital libraries do not cooperate;    Web crawlers and search engines
             services must seek out information
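The harvesting level can be made concrete with a small sketch of a client for the Open Archives metadata harvesting protocol. It assumes a repository that answers `ListRecords` requests with Dublin Core (`oai_dc`) records. The base URL, the sample response, and the helper names are hypothetical, and the response is parsed from an embedded string so that the example runs without a network connection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return ([(identifier, title), ...], resumption token or None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(f"{OAI}header/{OAI}identifier")
        title = next((e.text for e in record.iter(DC + "title")), None)
        records.append((identifier, title))
    token = root.findtext(f".//{OAI}resumptionToken")
    return records, (token or None)


# Invented response, shaped like a ListRecords reply, for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Plate tectonics lesson plan</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(parse_records(SAMPLE_RESPONSE))
```

The low cost of acceptance is visible here: a collection only has to answer simple HTTP requests with XML records, and the harvester does the rest.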
35
Searching
What to Index?
Full text indexing is excellent, but full text indexing is not possible for all materials (non-textual, no access for indexing).
Comprehensive metadata is an alternative, but available for very few of the materials.
What Architecture to Use?
Few collections support an established search protocol (e.g., Z39.50).
36
Broadcast Searching does not Scale
[Figure: a User, a User interface server, and Collections.]
37
The Metadata Repository
[Figure: Users, Services, Metadata repository, and Collections.]
The metadata repository is a resource for service providers.
It holds information about every collection and item known to the NSDL.
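As a rough illustration of the idea, a metadata repository can be pictured as a single store of records that services query locally, instead of sending each query to every collection. The record fields, class names, and sample data below are assumptions made for this sketch and do not describe the actual NSDL schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetadataRecord:
    """Hypothetical minimal record for one item known to the library."""
    item_id: str
    collection: str
    title: str
    url: str


@dataclass
class MetadataRepository:
    """Central store that service providers query instead of the collections."""
    records: dict = field(default_factory=dict)

    def add(self, record):
        self.records[record.item_id] = record

    def collections(self):
        """Names of all collections with at least one record."""
        return sorted({r.collection for r in self.records.values()})

    def search_titles(self, term):
        """Items whose title contains the term, ignoring case."""
        term = term.lower()
        return [r for r in self.records.values() if term in r.title.lower()]


repo = MetadataRepository()
repo.add(MetadataRecord("1", "Earth Science", "Plate tectonics lesson plan",
                        "http://example.org/1"))
repo.add(MetadataRecord("2", "Physics", "Introduction to optics",
                        "http://example.org/2"))
print(repo.collections())
print([r.item_id for r in repo.search_titles("optics")])
```

A search service built on such a store answers a query with one local lookup, however many collections have contributed records.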
38
Search Architecture
[Figure: three Portals, Search and Discovery Services, Collections, and the Metadata repository, with connections labeled SDLIP, OAI, and http.]
James Allan, Bruce Croft (University of Massachusetts, Amherst)
39
Other Topics
User interfaces: data driven portals using a channel architecture
Selection: selective web crawling, machine learning
Quality measures: ???
40
The Mortal behind the Portal
[This space left intentionally blank.]
41
Acknowledgement
The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research (Dave Fulker), Columbia University (Kate Wittenberg), and Cornell University (Bill Arms). The Technical Director is Carl Lagoze (Cornell University).
42
Society, Cognitive Studies, HCI
Computer Science
Applications
Information Science