Denise Troll Covey Associate Dean, University Libraries, Carnegie Mellon Pennsylvania Library Association Conference Pittsburgh, PA – October 5, 2003 Understanding

Denise Troll CoveyAssociate Dean, University Libraries, Carnegie Mellon

Pennsylvania Library Association Conference

Pittsburgh, PA – October 5, 2003

Understanding & Assessing

The Million Book Project

Million Book Project Vision

“Attempt to understand &

solve the technical,

economic, & social policy

issues of providing online

access to all creative works

of the

human race.”

– Dr. Raj Reddy

What is the Million Book Project?

Effort to digitize & provide full-text searching

& free-to-read access to a million books by 2007

Collection development

Copyright permissions

Acquisitions & shipping

Scanning operations

Proposal writing

Why is the Million Book Project?

Democratize knowledge & empower citizenry

Address disparity in library size & accessibility

Facilitate new knowledge

Combining old & new, east & west, technical & humanistic

Enhance student learning & success of faculty research

Address copyright absurdities

Why is the Million Book Project?

Support digital library research Information distribution, management, & sustainability

Security, copyright, & digital rights management

Accuracy of optical character recognition (OCR)

OCR of non-Romanic languages & scripts

Automatic creation of structural metadata

Automatic summarization

Intelligent indexing

Machine translation

Storage formats

Search engines

Who is involved?

Carnegie Mellon University Libraries & School of Computer Science

Other U.S. libraries

Internet Archive

India & China

OCLC, DLF, & CRL

Archival Resource Company Inc.

Funding

Collection development NSF – $35,000 for initial

planning meeting 2001

Funded by partners

Copyright permission UC Merced – $35,000

Carnegie Mellon – ?

Project administration Carnegie Mellon – ?

Equipment & travel NSF – $3.6 million (discounts from Minolta)

Labor for scanning India – $1.5 million

China – ?

Acquisitions & shipping Internet Archive – ?

NSF – pilot shipment

Collection Development Strategies

Librarians as selectors

Best books – cited in bibliographies

Technical reports & government documents

Priorities of participants & funding agencies

Topics to get full collections

What can we acquire

Bulk, cheap, fast

Libraries weeding, closing, renovating

Initial Collection of Collections

In Copyright Indigenous Indian & Chinese materials

Public Domain

100,000200,000

700,000

Multi-lingual& multi-script

languageprocessing

November 2001

planning meeting

funded by NSF

Books forCollege Libraries& other selectedbibliographies

Current Collection of Collections

In Copyright Indigenous Indian & Chinese materials

Public Domain

500,000200,000

300,000

Multi-lingual& multi-script

languageprocessing

Books forCollege Libraries& other selectedbibliographies

New copyright

permission strategy

Scanning Underway in India

Multiple centers – each with terabyte storage

Indigenous materials

Shipments from U.S.

100,000 books by 2004

Above average wages

6000 Book Pilot Shipment to India

20 ft ocean container – 25 days NY to Chennai

243 boxes – 9 palettes – 11,298 lbs – duty free

Mostly public domain – government documents,

social science, biography, history, & literature

Approximate cost $2 per book round trip

4000 books did not have to be returned

2000 books were returned in good condition

August 2002 to August 2003

Lessons Learned from Pilot

Reduce shipping cost per book

Change packing method = cost $1 per book or 50 cents per book shipped one way

Reduce turn-around time

Learned procedures for clearing customs

Establish 5 international centers for U.S. shipments• Funded by Indian government

• To be inaugurated January 2004

• Receive books, scan, check quality, return

Distribution in India

Central Distribution Site Deemed University, Thanjavur

Pilot

Bangalore Hyderabad

Current

Allahabad Calcutta Delhi

Scanning Underway in China

Customs & content issues prohibit

shipping books to China

Scanning indigenous materials

& U.S.

copyrighted works already

in their libraries

(with permission granted)

Above average wages

Standards & Workflow

National standards for digital preservation

Developed by IMLS 2001 & endorsed by DLF 2002

http://www.imls.gov/pubs/forumframework.htm

National standards for cataloging

Carnegie Mellon University Libraries

Developed & documented workflow

Provided training

Digitization Workflow

Operators scan, post-process, & OCR

600 DPI TIFF (v5) images

ScanFix post-processing

Abby Fine Reader OCR

• 98% accuracy with English

• Some foreign languages

OCR being developed for

other languages & scripts

Optimum Scanner Throughput

4000 books per year per Minolta scanner

One scanner, two shifts daily = 16 books per day

250 work days per year = 4000 books per year

72,000 books per year with current 18 scanners

400,000 books per year with 100 scanners

Allowing 50% deterioration in throughput,

100 scanners can complete the project in 5 years

Metadata Workflow

Librarians capture

Bibliographic metadata - for delivery system

• MARC from OCLC or create Dublin Core

• Guest IDs provided by OCLC

Administrative metadata - for reporting system• Bibliographic metadata

• Source library

• Return requested

• Copyright status – check renewal records• Permission status – used by delivery system

Collaborative development by India &

Carnegie Mellon

Copyright Workflow #1

N o

U p d a te p u b lish e r da ta b a se

S c a n & O C RC a p tu re m e tad a taR e tu rn bo o k s as n e e d edS e n d co p ie s to C a rn e g ie M e llon

L o c a te , a c q u ire , & sh ip to In d ia

Y e s

N e g o tia te p e rm iss ion U p d a te p u b lish e r da ta b a se

Id e n tify & c o n ta c t p u b lish e rs

India

Carnegie Mellon

Copyright Workflow #2

U p d a te p u b l is h e rd a ta b a s e a s n e e d e d

N oD eliv ery sy stem w o n 't d isp lay

U p d a te p u b l is h e r d a ta b a seU p d a te a d m in is t r a t iv e m e ta d a ta

Y e s

N e g o t ia te p e rm iss io n sa s n e e d e d

C o p y r ig h t Y es & P erm iss io n U n k n o w nId e n ti fy & c o n ta c t p u b l is h e rs

S c a n & O C RC a p tu re m e ta d a taR e tu rn b o o k s a s n e e d e dS e n d c o p ie s to C a rn e g ie M e l lo n

L o c a te , a c q u ire , & s h ip to In d ia

India

Carnegie Mellon

Internet Archive &Archival Resource Company, Inc.

Contextual Searching

Legal to scan & create index without permission

When no permission granted

to display copyrighted book,

search returns query terms

in OCR context

Acquiring the Collection

Archival Resources Company Inc.

Packing, shipping, & tracking

Help locate & acquire books

• Weeded collections

• Closing libraries

Acquisitions web site

• Materials wanted

• Loaning & donating

• Insurance

Integrating the Collection

Transporting files to Carnegie Mellon

Inadequate Internet bandwidth

Expense of copying to & from gold CD or DVD

Physically transport files on disks

Sustaining the Collection

Goal is 10 organizations host the Collection India – Digital Library of India multiple locations

China – site(s) not yet known

U.S. – Carnegie Mellon, Internet Archive, & University of California Merced

Discussions with OCLC, Library of Congress, & Digital Library of Alexandria

Estimated cost one million $$ per host site

Estimated size is 20 terabytes

http://www.ulib.org/html/index.html

http://www.ulib.org/html/index.html

www.dli.gov.in

www.dli.gov.in

Next Steps

Ship 12,500 books one-way to Hyderabad, India from University of Washington @ $6000

Negotiations UMI/Proquest – print-on-demand service

OCLC – digital registry & identifying source libraries

CRL – supply or help acquire books

November 2003 – collection meeting

NSF proposal to create database of copyright renewal records

Print on Demand Service

UMI/ProQuest

Handle financial transactions

Print, bind, & send books to customers

Collaboration with Carnegie Mellon

Negotiate royalties with publishers

Develop suitable business model

Digital Rights Management – Lite

Free-to-read by any Internet user

Difficult to save or print books

One page at a time using browser

Secure servers restrict access

Discourage hacking by offering

affordable printed, bound books

Global Business Model

Hardback book = $30.00

Paperback book = $15.00

Digitalback book = cost of cup of coffee

Internet Archive POD = $1.00 paperback

India POD = $0.80 paperback; $2.00 hardback

Open Access Feasibility Study

Couldn’t locate publisher for 11% books

If located publisher, half didn’t respond

Even to second letter

If got response

Fewer than half gave permission

Often permission was restricted

22% permission granted

1999-2000 – statistically valid random sample

Success Rate

Scholarly associations 45%

University presses 37%

Museums & galleries 31%

Commercial publishers 12%

Permission by Publisher Type

0%

25%

50%

75%

No-Response

Rate

SuccessRate

LibraryContent

22%overall

Open Access Fine & Rare Books

367 titles in copyright (34% of collection)

Couldn’t locate copyright holder for 13% of titles

127 letters & 44 follow-up calls to date

56% titles permission granted

6% with restrictions

3% titles permission denied

Assumed if 3 contacts get no response2002-2003

Transaction Costs

$ 6,550 FTE labor

$ 225 Phone calls

$ 65 Paper & postage

$ 6,840 TOTAL

May 2003 through August 2003

Does not include legal fees, cost of Internet connectivityor administrator time

$37.00 per title

Copyright Negotiations

Educate Find online, but use print

Online access increases use

Open access doesn’t decrease, & can increase sales

Copyright absurdity

AskNon-exclusive permission to scan & provide open access

Minimal system functionality

Give Preservation-quality copies

Metadata & OCR

Motivate$$ Use in added-value, fee-based services

$$ Print on demand for out-of-print titles

$$ Buy button for in-print titles

Million Book Project

Initial Copyright Approach

Do not pay permission cost

Focus on out-of-print, in-copyright titles

Books for College Libraries has 50,000 titles

Begin with scholarly associations & university presses

Transaction cost per title is prohibitive

Identifying & inserting titles in letters

Negotiating & tracking permission per title

Epiphany & New Approach

Focus on publishers of quality books Treat bibliographies as approval plan of publishers

Books for College Libraries has 5600 publishers

Ask for permission to digitize All out-of-print, in-copyright titles

All titles published prior to a date of their choosing

All titles published # or more years ago

List of titles they provide

Follow-up phone call or visit

Current Statistics

5600 publishers in Books for College Libraries

Using intermittent labor

Couldn’t locate 30 publishers (so far)

184 letters & 24 follow-up calls to date

4% permission granted

5% permission denied

Full-time staff October 2003

Results of New Approach

Estimate transaction costs remain the same

But acquire more books for $$ spent

National Academy Press – 99% increase

• 26 titles in Books for College Libraries

• Permission for 3,046 titles

Brookings Institution – 96% increase

Rand McNally – 60% increase

“More Bang for the Buck”

Indigenous Materials

Public Domain

In Copyright

Initial Current

Projections

Success rate (# BCL publishers)

# of books per publisher

Million Book Collection

4% (224) 1500 336,000

6% (336) 1500 504,000

22% (1,232) 1500 1,848,000

We could need to negotiate

with India for more labor

Transaction log analysis – beginning 2004 Number of searches, browses, pages displayed

Use per title online & print-on-demand

Use by different geographic demographics

Use per time of day, day of week, month of the year

Usage Assessments

Outcomes assessment – 2006-2007 User demographics – age, gender, location

How users found the Collection

Why they used it & what they did with it

What difference the Collection made

Their assessment of the quality of the Collection & the usability & functionality of the system

Their view of the significance of the project

Copyright Assessments – 2006

Number of copyrighted books in the collection

Success rate of permission requests

Survey of participating publishers

Overall satisfaction

Quality of the copies

What they did or plan to do with the copies

Impact on revenue & view of open access

Dissemination

Million Book Collection

Books accessible via Google search

Libraries can link to collection from web site

Libraries can link books to catalog records

Publisher database

Successful negotiation strategies

Research test bed

Thank you!

Documents

Denise Troll Covey Associate Dean, University Libraries, Carnegie Mellon Pennsylvania Library Association Conference Pittsburgh, PA – October 5, 2003 Understanding