Upload
emery-logan
View
215
Download
0
Tags:
Embed Size (px)
Citation preview
Denise Troll CoveyAssociate Dean, University Libraries, Carnegie Mellon
Pennsylvania Library Association Conference
Pittsburgh, PA – October 5, 2003
Understanding & Assessing
The Million Book Project
Million Book Project Vision
“Attempt to understand &
solve the technical,
economic, & social policy
issues of providing online
access to all creative works
of the
human race.”
– Dr. Raj Reddy
What is the Million Book Project?
Effort to digitize & provide full-text searching
& free-to-read access to a million books by 2007
Collection development
Copyright permissions
Acquisitions & shipping
Scanning operations
Proposal writing
Why is the Million Book Project?
Democratize knowledge & empower citizenry
Address disparity in library size & accessibility
Facilitate new knowledge
Combining old & new, east & west, technical & humanistic
Enhance student learning & success of faculty research
Address copyright absurdities
Why is the Million Book Project?
Support digital library research Information distribution, management, & sustainability
Security, copyright, & digital rights management
Accuracy of optical character recognition (OCR)
OCR of non-Romanic languages & scripts
Automatic creation of structural metadata
Automatic summarization
Intelligent indexing
Machine translation
Storage formats
Search engines
Who is involved?
Carnegie Mellon University Libraries & School of Computer Science
Other U.S. libraries
Internet Archive
India & China
OCLC, DLF, & CRL
Archival Resource Company Inc.
Funding
Collection development NSF – $35,000 for initial
planning meeting 2001
Funded by partners
Copyright permission UC Merced – $35,000
Carnegie Mellon – ?
Project administration Carnegie Mellon – ?
Equipment & travel NSF – $3.6 million (discounts from Minolta)
Labor for scanning India – $1.5 million
China – ?
Acquisitions & shipping Internet Archive – ?
NSF – pilot shipment
Collection Development Strategies
Librarians as selectors
Best books – cited in bibliographies
Technical reports & government documents
Priorities of participants & funding agencies
Topics to get full collections
What can we acquire
Bulk, cheap, fast
Libraries weeding, closing, renovating
Initial Collection of Collections
In Copyright Indigenous Indian & Chinese materials
Public Domain
100,000200,000
700,000
Multi-lingual& multi-script
languageprocessing
November 2001
planning meeting
funded by NSF
Books forCollege Libraries& other selectedbibliographies
Current Collection of Collections
In Copyright Indigenous Indian & Chinese materials
Public Domain
500,000200,000
300,000
Multi-lingual& multi-script
languageprocessing
Books forCollege Libraries& other selectedbibliographies
New copyright
permission strategy
Scanning Underway in India
Multiple centers – each with terabyte storage
Indigenous materials
Shipments from U.S.
100,000 books by 2004
Above average wages
6000 Book Pilot Shipment to India
20 ft ocean container – 25 days NY to Chennai
243 boxes – 9 palettes – 11,298 lbs – duty free
Mostly public domain – government documents,
social science, biography, history, & literature
Approximate cost $2 per book round trip
4000 books did not have to be returned
2000 books were returned in good condition
August 2002 to August 2003
Lessons Learned from Pilot
Reduce shipping cost per book
Change packing method = cost $1 per book or 50 cents per book shipped one way
Reduce turn-around time
Learned procedures for clearing customs
Establish 5 international centers for U.S. shipments• Funded by Indian government
• To be inaugurated January 2004
• Receive books, scan, check quality, return
Distribution in India
Central Distribution Site Deemed University, Thanjavur
Pilot
Bangalore Hyderabad
Current
Allahabad Calcutta Delhi
Scanning Underway in China
Customs & content issues prohibit
shipping books to China
Scanning indigenous materials
& U.S.
copyrighted works already
in their libraries
(with permission granted)
Above average wages
Standards & Workflow
National standards for digital preservation
Developed by IMLS 2001 & endorsed by DLF 2002
http://www.imls.gov/pubs/forumframework.htm
National standards for cataloging
Carnegie Mellon University Libraries
Developed & documented workflow
Provided training
Digitization Workflow
Operators scan, post-process, & OCR
600 DPI TIFF (v5) images
ScanFix post-processing
Abby Fine Reader OCR
• 98% accuracy with English
• Some foreign languages
OCR being developed for
other languages & scripts
Optimum Scanner Throughput
4000 books per year per Minolta scanner
One scanner, two shifts daily = 16 books per day
250 work days per year = 4000 books per year
72,000 books per year with current 18 scanners
400,000 books per year with 100 scanners
Allowing 50% deterioration in throughput,
100 scanners can complete the project in 5 years
Metadata Workflow
Librarians capture
Bibliographic metadata - for delivery system
• MARC from OCLC or create Dublin Core
• Guest IDs provided by OCLC
Administrative metadata - for reporting system• Bibliographic metadata
• Source library
• Return requested
• Copyright status – check renewal records• Permission status – used by delivery system
Collaborative development by India &
Carnegie Mellon
Copyright Workflow #1
N o
U p d a te p u b lish e r da ta b a se
S c a n & O C RC a p tu re m e tad a taR e tu rn bo o k s as n e e d edS e n d co p ie s to C a rn e g ie M e llon
L o c a te , a c q u ire , & sh ip to In d ia
Y e s
N e g o tia te p e rm iss ion U p d a te p u b lish e r da ta b a se
Id e n tify & c o n ta c t p u b lish e rs
India
Carnegie Mellon
Copyright Workflow #2
U p d a te p u b l is h e rd a ta b a s e a s n e e d e d
N oD eliv ery sy stem w o n 't d isp lay
U p d a te p u b l is h e r d a ta b a seU p d a te a d m in is t r a t iv e m e ta d a ta
Y e s
N e g o t ia te p e rm iss io n sa s n e e d e d
C o p y r ig h t Y es & P erm iss io n U n k n o w nId e n ti fy & c o n ta c t p u b l is h e rs
S c a n & O C RC a p tu re m e ta d a taR e tu rn b o o k s a s n e e d e dS e n d c o p ie s to C a rn e g ie M e l lo n
L o c a te , a c q u ire , & s h ip to In d ia
India
Carnegie Mellon
Internet Archive &Archival Resource Company, Inc.
Contextual Searching
Legal to scan & create index without permission
When no permission granted
to display copyrighted book,
search returns query terms
in OCR context
Acquiring the Collection
Archival Resources Company Inc.
Packing, shipping, & tracking
Help locate & acquire books
• Weeded collections
• Closing libraries
Acquisitions web site
• Materials wanted
• Loaning & donating
• Insurance
Integrating the Collection
Transporting files to Carnegie Mellon
Inadequate Internet bandwidth
Expense of copying to & from gold CD or DVD
Physically transport files on disks
Sustaining the Collection
Goal is 10 organizations host the Collection India – Digital Library of India multiple locations
China – site(s) not yet known
U.S. – Carnegie Mellon, Internet Archive, & University of California Merced
Discussions with OCLC, Library of Congress, & Digital Library of Alexandria
Estimated cost one million $$ per host site
Estimated size is 20 terabytes
Next Steps
Ship 12,500 books one-way to Hyderabad, India from University of Washington @ $6000
Negotiations UMI/Proquest – print-on-demand service
OCLC – digital registry & identifying source libraries
CRL – supply or help acquire books
November 2003 – collection meeting
NSF proposal to create database of copyright renewal records
Print on Demand Service
UMI/ProQuest
Handle financial transactions
Print, bind, & send books to customers
Collaboration with Carnegie Mellon
Negotiate royalties with publishers
Develop suitable business model
Digital Rights Management – Lite
Free-to-read by any Internet user
Difficult to save or print books
One page at a time using browser
Secure servers restrict access
Discourage hacking by offering
affordable printed, bound books
Global Business Model
Hardback book = $30.00
Paperback book = $15.00
Digitalback book = cost of cup of coffee
Internet Archive POD = $1.00 paperback
India POD = $0.80 paperback; $2.00 hardback
Open Access Feasibility Study
Couldn’t locate publisher for 11% books
If located publisher, half didn’t respond
Even to second letter
If got response
Fewer than half gave permission
Often permission was restricted
22% permission granted
1999-2000 – statistically valid random sample
Success Rate
Scholarly associations 45%
University presses 37%
Museums & galleries 31%
Commercial publishers 12%
Permission by Publisher Type
0%
25%
50%
75%
No-Response
Rate
SuccessRate
LibraryContent
22%overall
Open Access Fine & Rare Books
367 titles in copyright (34% of collection)
Couldn’t locate copyright holder for 13% of titles
127 letters & 44 follow-up calls to date
56% titles permission granted
6% with restrictions
3% titles permission denied
Assumed if 3 contacts get no response2002-2003
Transaction Costs
$ 6,550 FTE labor
$ 225 Phone calls
$ 65 Paper & postage
$ 6,840 TOTAL
May 2003 through August 2003
Does not include legal fees, cost of Internet connectivityor administrator time
$37.00 per title
Copyright Negotiations
Educate Find online, but use print
Online access increases use
Open access doesn’t decrease, & can increase sales
Copyright absurdity
AskNon-exclusive permission to scan & provide open access
Minimal system functionality
Give Preservation-quality copies
Metadata & OCR
Motivate$$ Use in added-value, fee-based services
$$ Print on demand for out-of-print titles
$$ Buy button for in-print titles
Million Book Project
Initial Copyright Approach
Do not pay permission cost
Focus on out-of-print, in-copyright titles
Books for College Libraries has 50,000 titles
Begin with scholarly associations & university presses
Transaction cost per title is prohibitive
Identifying & inserting titles in letters
Negotiating & tracking permission per title
Epiphany & New Approach
Focus on publishers of quality books Treat bibliographies as approval plan of publishers
Books for College Libraries has 5600 publishers
Ask for permission to digitize All out-of-print, in-copyright titles
All titles published prior to a date of their choosing
All titles published # or more years ago
List of titles they provide
Follow-up phone call or visit
Current Statistics
5600 publishers in Books for College Libraries
Using intermittent labor
Couldn’t locate 30 publishers (so far)
184 letters & 24 follow-up calls to date
4% permission granted
5% permission denied
Full-time staff October 2003
Results of New Approach
Estimate transaction costs remain the same
But acquire more books for $$ spent
National Academy Press – 99% increase
• 26 titles in Books for College Libraries
• Permission for 3,046 titles
Brookings Institution – 96% increase
Rand McNally – 60% increase
Projections
Success rate (# BCL publishers)
# of books per publisher
Million Book Collection
4% (224) 1500 336,000
6% (336) 1500 504,000
22% (1,232) 1500 1,848,000
We could need to negotiate
with India for more labor
Transaction log analysis – beginning 2004 Number of searches, browses, pages displayed
Use per title online & print-on-demand
Use by different geographic demographics
Use per time of day, day of week, month of the year
Usage Assessments
Outcomes assessment – 2006-2007 User demographics – age, gender, location
How users found the Collection
Why they used it & what they did with it
What difference the Collection made
Their assessment of the quality of the Collection & the usability & functionality of the system
Their view of the significance of the project
Copyright Assessments – 2006
Number of copyrighted books in the collection
Success rate of permission requests
Survey of participating publishers
Overall satisfaction
Quality of the copies
What they did or plan to do with the copies
Impact on revenue & view of open access
Dissemination
Million Book Collection
Books accessible via Google search
Libraries can link to collection from web site
Libraries can link books to catalog records
Publisher database
Successful negotiation strategies
Research test bed