1 William Y. Arms September 26, 2002 A Research Program for Information Science with the NSDL as an Example

1

William Y. Arms

September 26, 2002

A Research Program for

Information Science with

the NSDL as an Example

2

A Scenario

A faculty member wished to find a paper for students to read in a class. He began by asking an expert. She suggested the original research paper as suitable.

Later, he typed a few terms into Google, browsed the hits, selected one that led to ResearchIndex, found the paper, and downloaded a PDF version from the author's web site.

3

Computer Science

Internet

Web

Google

ResearchIndex

PDF

Computer Science

4

HCI

Browsing

Searching

User interface design

Human Computer Interaction

Computer Science

5

HCI: Eye Tracking

6

Roles of expert/instructor/student

Cognitive psychology

Linguistics

Natural language processing

Cognitive Studies

HCI

Computer Science

7

8

Organizational change

Economics

Ethics

Social culture

Law

Society

Cognitive Studies

HCI

Computer Science

9


Computer Science

Applications

Information Science

10

Open Access to Scientific, Scholarly and Professional Information

11

Before the Web

Access to scientific, medical, legal information

In the United States:

excellent if you belonged to a rich organization (e.g., a major university)

very poor otherwise

In many countries of the world:

very poor for everybody

12

Some Light Reading

William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm

William Y. Arms, "Automated digital libraries." D-Lib Magazine, July/August 2000. http://www.dlib.org/dlib/july20/07contents.html

William Y. Arms, "What are the alternatives to peer review? Quality control in scholarly publishing on the web." Journal of Electronic Publishing, 8(1), August 2002. http://www.press.umich.edu/jep/08-01/arms.html

13

Research Libraries are Expensive

library materials

buildings & facilities

staff

14

Baumol's Cost Disease

[Figure: price plotted against year (1900, 1950, 2000, 2050) for three series: bundle of goods and services, labor-intensive services, and manufactured goods.]

15

Baumol's Cost Disease

[Figure: price plotted against year (1900, 1950, 2000, 2050) for three series: bundle of goods and services, labor-intensive services, and manufactured goods; annotated with Moore's Law.]

16

Brute Force Computing

Few people really understand Moore's Law

Computing power doubles every 18 months

Increases 100 times in 10 years

Increases 10,000 times in 20 years

Simple algorithms

plus

immense computing power

can outperform human intelligence
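The growth figures on this slide follow from compounding a doubling every 18 months. The sketch below is only a check of that arithmetic; the function name `growth_factor` is made up for illustration, and the 18-month doubling period is the one stated on the slide.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


ten_years = growth_factor(10)     # 10 years is about 6.7 doublings
twenty_years = growth_factor(20)  # 20 years is about 13.3 doublings

print(f"10 years: about {ten_years:.0f} times")
print(f"20 years: about {twenty_years:.0f} times")
```

The results come out slightly above 100 and 10,000, so the slide's round numbers are a fair approximation.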

17

Example: Catalogs and Indexes

Cost disease: catalogs and indexes

Catalog, index and abstracting records are very expensive when created by skilled professionals

Moore's Law: automatic indexing of full text

Retrieval using automatic indexing can be at least as effective as retrieval using manual indexing with controlled vocabularies

(Cleverdon 1967, reporting on experiments by Salton)
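As a rough illustration of what automatic indexing of full text involves, the sketch below builds an inverted index (term to document identifiers) with no human cataloging step. It is a minimal example under stated assumptions, not the method used in the experiments cited above: the tokenizer, the sample documents, and the function names are all invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every term to the set of document ids whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    "d1": "Automatic indexing of full text",
    "d2": "Manual indexing with controlled vocabularies",
    "d3": "Full text retrieval on the web",
}
index = build_index(docs)
hits = search_all(index, "full text")
print(sorted(hits))
```

Every word of every document becomes an index entry, which is what makes the approach cheap to run at scale and independent of a controlled vocabulary.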

18

Resistance to Change

"I used to be a heavy user of INSPEC. Now I use Google instead."

19

Information Discovery: 1992 and 2002

                    1992          2002

Content             print         digital

Computing           expensive     inexpensive

Choice of content   selective     comprehensive

Index creation      human         automatic

Frequency           one time      monthly

Vocabulary          controlled    not controlled

Query               Boolean       ranked retrieval

Users               trained       untrained

20

Brute Force Computing: Substitutes for Human Intelligence

Automated algorithms for information discovery

Similarity of two documents

Vector space and statistical methods

(Salton, Spärck Jones, et al.)
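One common way to measure the similarity of two documents in the vector space model is the cosine of the angle between their term vectors. The sketch below is a minimal version that assumes raw term counts as weights; practical systems usually apply a weighting such as tf-idf, which is omitted here. The sample texts and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Represent a document as a bag of lowercase terms with raw counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc1 = term_vector("digital library search")
doc2 = term_vector("digital library metadata")
doc3 = term_vector("eye tracking study")

sim_12 = cosine_similarity(doc1, doc2)  # two of three terms shared
sim_13 = cosine_similarity(doc1, doc3)  # no terms shared
print(f"doc1 vs doc2: {sim_12:.3f}")
print(f"doc1 vs doc3: {sim_13:.3f}")
```

Documents that share more terms score closer to 1, and documents with no terms in common score 0, all without any human judgment of the content.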

Importance of digital object

Rank importance of web pages by analysis of the graph of web links

(Kleinberg, Page, et al.)
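Ranking pages by analysis of the link graph can be sketched as a power iteration in the style of PageRank. This is a simplified illustration, not the algorithm as deployed by any search engine, and it is not Kleinberg's hubs-and-authorities method, which works differently. The damping value of 0.85, the iteration count, and the four-page graph are assumptions made for the example.

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank by power iteration over a dict of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    ordered = sorted(pages)
    n = len(ordered)
    rank = {p: 1.0 / n for p in ordered}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in ordered}
        for page in ordered:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in ordered:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all link to c; c links to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c ranks highest because three pages link to it, and page a ranks second because its single incoming link comes from the highly ranked c.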

21

Brute Force Computing: Automated Metadata Extraction

Informedia (Carnegie Mellon)

Automatic processing of segments of video, e.g., television news.

Algorithms for:

dividing raw video into discrete items

generating short summaries

indexing the sound track using speech recognition

recognizing faces

(Wactlar, et al.)

22

23

Simple algorithms

plus

immense computing power

plus

the intelligence of the user

can replace labor-intensive services

Cognitive Studies

HCI

Low Cost Information

Computer Science

24

The National Science Digital Library (NSDL)

25

Scope

All digital information relevant to any level of education in any branch of science.

Scientific and technical information

Materials used in education

Materials tailored to education

26

How Big might the NSDL be?

All branches of science, all levels of education, very broadly defined:

Five year targets

1,000,000 different users

10,000,000 digital objects

10,000 to 100,000 independent sites

27

Resources

Integration team

Budget       $4-6 million

Staff        25-30

Management   Diffuse

How can a small team, without direct management control, create a very large-scale digital library?

28

Philosophy

It is possible to build a very large digital library with a small staff.

But ...

Every aspect of the library must be planned with scalability in mind.

Some compromises will be made.

29

Basic Assumptions

The integration team will not manage any collections

The integration team will not create any metadata

30

The Integration Task ...

... to provide a coherent set of collections and services across great diversity

31

Interoperability

The Problem

Conventional approaches require partners to support agreements (technical, content, and business)

But NSDL needs thousands of very different partners

... most of whom are not directly part of the NSDL program

The challenge is to create incentives for independent digital libraries to adopt agreements

32

Function Versus Cost of Acceptance

[Figure: function plotted against cost of acceptance, with regions labeled "Many adopters" and "Few adopters".]

33

Example: Textual Mark-up

[Figure: SGML, ASCII, HTML, and XML placed on a plot of function against cost of acceptance.]

34

The Spectrum of Interoperability

Level        Agreements                   Example

Federation   Strict use of standards      AACR, MARC,
             (syntax, semantic,           Z 39.50
             and business)

Harvesting   Digital libraries expose     Open Archives
             metadata; simple             metadata harvesting
             protocol and registry

Gathering    Digital libraries do not     Web crawlers
             cooperate; services must     and search engines
             seek out information
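The harvesting level can be made concrete with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The sketch below builds a `ListRecords` request URL and extracts identifiers and Dublin Core titles from a response. It is a simplified illustration: the base URL `http://example.org/oai` and the sample response are invented, no network request is made, and resumption tokens and error handling are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for a repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text: str) -> list[tuple[str, str]]:
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NAMESPACES):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NAMESPACES)
        title = record.findtext(".//dc:title", default="", namespaces=NAMESPACES)
        records.append((identifier, title))
    return records


# Hypothetical response, trimmed to the elements this sketch reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record one</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record two</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("http://example.org/oai")
records = parse_records(SAMPLE_RESPONSE)
print(url)
for identifier, title in records:
    print(identifier, "-", title)
```

The collection only has to expose simple metadata at a known URL; the harvesting service does the rest, which keeps the cost of acceptance low for the collection.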

35

Searching

What to Index?

Full text indexing is excellent, but it is not possible for all materials (some are non-textual, and some cannot be accessed for indexing).

Comprehensive metadata is an alternative, but it is available for very few of the materials.

What Architecture to Use?

Few collections support an established search protocol (e.g., Z39.50).

36

Broadcast Searching does not Scale

User interface server

User

Collections

37

The Metadata Repository

Users

Collections

Metadata repository

Services

The metadata repository is a resource for service providers.

It holds information about every collection and item known to the NSDL.
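The contrast between broadcast searching and a central metadata repository can be sketched with a toy model. In the broadcast case every user query is forwarded to every collection; with a repository, metadata is harvested once and queries are answered without contacting the collections. This is an illustration only and not the NSDL implementation: the class names, the sample data, and the substring match on titles are all assumptions.

```python
class Collection:
    """A toy collection that counts how many search requests it receives."""

    def __init__(self, name: str, items: dict[str, str]):
        self.name = name
        self.items = items  # item id -> title
        self.queries_received = 0

    def search(self, term: str) -> list[tuple[str, str]]:
        self.queries_received += 1
        term = term.lower()
        return [(self.name, item_id)
                for item_id, title in self.items.items()
                if term in title.lower()]


def broadcast_search(collections: list[Collection], term: str) -> list[tuple[str, str]]:
    """Forward the query to every collection: one request per collection per query."""
    hits = []
    for collection in collections:
        hits.extend(collection.search(term))
    return hits


class MetadataRepository:
    """Holds a copy of the metadata for every item in every known collection."""

    def __init__(self) -> None:
        self.records: list[tuple[str, str, str]] = []

    def harvest(self, collection: Collection) -> None:
        """Copy a collection's metadata once, ahead of any user query."""
        for item_id, title in collection.items.items():
            self.records.append((collection.name, item_id, title))

    def search(self, term: str) -> list[tuple[str, str]]:
        """Answer a query from the local copy without contacting any collection."""
        term = term.lower()
        return [(name, item_id)
                for name, item_id, title in self.records
                if term in title.lower()]


# Hypothetical collections.
collections = [
    Collection("physics", {"p1": "Introduction to waves", "p2": "Optics lab"}),
    Collection("biology", {"b1": "Cell structure", "b2": "Brain waves"}),
    Collection("math", {"m1": "Fourier series"}),
]

broadcast_hits = broadcast_search(collections, "waves")
requests_after_broadcast = sum(c.queries_received for c in collections)

repository = MetadataRepository()
for collection in collections:
    repository.harvest(collection)
repository_hits = repository.search("waves")
requests_after_repository = sum(c.queries_received for c in collections)

print("broadcast hits:", sorted(broadcast_hits))
print("repository hits:", sorted(repository_hits))
print("collection requests caused by broadcast:", requests_after_broadcast)
```

With thousands of collections, a broadcast search sends thousands of requests per query and is only as fast and reliable as the slowest collection, while the repository answers from its own copy of the metadata.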

38

Search Architecture

Portal

Portal

Portal

Search and Discovery Services

Collections

SDLIP OAI

http

Metadata repository

James Allan, Bruce Croft (University of Massachusetts, Amherst)

39

Other Topics

User interfaces: data-driven portals using a channel architecture

Selection: selective web crawling, machine learning

Quality measures: ???

40

The Mortal behind the Portal

[This space left intentionally blank.]

41

Acknowledgement

The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education.

The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research (Dave Fulker), Columbia University (Kate Wittenberg) and Cornell University (Bill Arms). The Technical Director is Carl Lagoze (Cornell University).

42


Computer Science

Applications

Information Science

Documents
