HATHI TRUST A Shared Digital Repository
HathiTrust, Collections, and Collaboration
COLD 2011 Spring Meeting
Jeremy York
May 20, 2011
Outline
• Overview
• Partnership
• Mission/Goals
• Collections
• Services
• Collaboration
Mission and Goals
Overview
Current Partners
Arizona State University
Baylor University
California Digital Library
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Harvard University Library
Indiana University
Johns Hopkins University
Library of Congress
Massachusetts Institute of Technology
Michigan State University
New York University
New York Public Library
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Texas A&M University
Universidad Complutense de Madrid
University of California
  Berkeley
  Davis
  Irvine
  Los Angeles
  Merced
  Riverside
  San Diego
  San Francisco
  Santa Barbara
  Santa Cruz
The University of Chicago
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Michigan
University of Minnesota
The University of North Carolina at Chapel Hill
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Yale University Library
Mission
• To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
Collections and Collaboration
• Comprehensive collection
• Preservation…with Access
• Shared strategies
  – Collection management, development
  – Copyright
  – Preservation
  – Efficient user services
• Public Good
Collections
What is in HathiTrust?
• 8,725,092 Total volumes
• 2,367,111 Public Domain
• 4,774,782 Book titles
• 211,688 Serial titles
* As of May 20, 2011
Content Sources
* As of May 1, 2011
Content Distribution
* As of May 1, 2011
Dates
* As of May 1, 2011
Breakdown of HathiTrust book corpus by publication date
Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011
Language Distribution (1)
The top 10 languages make up ~86% of all content
* As of May 1, 2011
Language Distribution (2)
The next 40 languages make up ~13% of total
* As of May 1, 2011
Content over time
* As of May 1, 2011
Content Growth
Services: Preservation, Access
Services (1)
• Ingest
  – Book and Journal content
    • Google
    • Internet Archive
    • In-house, other vendor digitization
  – Images, Audio, Born digital (coming soon…)
• Two parts
  – Bibliographic Data
  – Content
Services (2)
• Long-term preservation
  – Bit-level, migration
  – Standard and open formats (ITU G4 TIFF, JPEG2000, JPG, Unicode)
  – Validation, integrity, redundancy
  – OAIS
• How reliable is it?
  – DRAMBORA, TRAC
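The "validation, integrity" bullet can be illustrated with a bit-level fixity check. This is a minimal sketch, not HathiTrust's actual tooling: it assumes a manifest that maps file names to recorded MD5 digests, and the function names `md5_of_file` and `failed_fixity` are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_fixity(manifest, root):
    """Return the file names whose current MD5 differs from the recorded one.

    `manifest` (a hypothetical structure) maps a relative file name to its
    expected hex digest. A missing file is reported as a failure.
    """
    failures = []
    for name, expected in manifest.items():
        target = Path(root) / name
        if not target.is_file() or md5_of_file(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `failed_fixity` means every file still matches its recorded checksum. Redundancy (the replicated copies) is what makes a detected failure repairable.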
Technology - OAIS
[Diagram: Technology - OAIS. Component labels, as grouped on the slide:]
• GRIN; Internal Data Loading
• Google; Internet Archive; In-house Conversion
• MARC record extensions (Aleph); Rights DB
• Page Turner; HathiTrust API; OAI; GeoIP DB; CNRI Handles; [Solr]
• METS/PREMIS object; TIFF G4/JPEG2000; OCR; MD5 checksums
• METS object; PNG; OCR; PDF
• Isilon; Site Replication; TSM; MD5 checksum validation
• GROOVE (JHOVE)
Quality
• Partner Digitization
• Google Digitization
• Quality work / Volume certification
• [email protected]
Services (3)
• Preservation…with Access
  – As part of preservation, service to partners, and as public good
  – Discovery
    • Bibliographic (temporary catalog, OCLC/HathiTrust catalog)
    • Full-text
  – Reading
    • Interface optimized for users with print disabilities
  – Collections
• Descriptive headings added (hidden from GUI with CSS)
• Info about SSD service & link to accessibility page
• Images used for style are in CSS so no need to use alt tags
• Skip navigation link
• Access keys for navigating pages with keyboard
• Added labels & descriptive titles to forms & ToC table
Access Matrix
Type of work | Search (bib and full text) | View | Full-PDF download | Print on Demand | Print disabilities | Section 108 (preservation uses)
Public domain worldwide | World | World | World if no restrictions, Partners if restrictions | World | Partners worldwide | N/A
Public domain in the US | World | US | US if no restrictions, US partners if restrictions | US | US Partners | N/A
Open Access (+Creative Commons) | World | World | World if no restrictions | World with permission | Partners worldwide if no restrictions | N/A
In copyright (and undetermined) | World | Not available | Not available | Not available | Partners US and worldwide, where applicable | Partners US and worldwide, where applicable
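One way to read the "View" column of the access matrix is as a lookup from rights category to audience. The sketch below encodes only that column. The dictionary keys, the `can_view` name, and the `user_in_us` flag are illustrative assumptions; in a real system the flag would presumably come from something like the GeoIP lookup named on the architecture slide.

```python
# "View" column of the access matrix, transcribed as a lookup table.
# Keys and scope labels are illustrative, not an actual HathiTrust schema.
VIEW_ACCESS = {
    "public domain worldwide": "world",
    "public domain in the us": "us",
    "open access": "world",
    "in copyright": "none",
}


def can_view(work_type, user_in_us):
    """Return True if a reader may view the full work under the View column."""
    scope = VIEW_ACCESS[work_type.lower()]
    if scope == "world":
        return True
    if scope == "us":
        return user_in_us
    return False
```

The other columns (Full-PDF download, Print on Demand, and so on) add partner and restriction conditions, so they would need a richer rule than a single scope label.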
Services (4)
• Rights Management
  – Rights Database
  – Copyright review
    • IMLS grant awarded to University of Michigan in 2008 to determine copyright status of books published in the US between 1923 and 1963
    • 18 staff members, 4 institutions
      – Indiana University
      – University of Michigan
      – University of Minnesota
      – University of Wisconsin
    • 125k reviewed through CRMS
    • 67,000 (54%) in public domain
Services (5)
• Data Availability
  – Tab-delimited inventory files
  – Bibliographic API
  – Data API
  – OAI feed of public domain
  – SFX target
  – Summon
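As a rough illustration of working with a tab-delimited inventory file, the sketch below reads rows into dictionaries and filters on a rights code. The column names in `COLUMNS` and the `"pd"` code are assumptions made for the example; the real files define their own layout.

```python
import csv

# Hypothetical column layout; the real inventory files define their own fields.
COLUMNS = ["volume_id", "rights", "title", "pub_date"]


def read_inventory(handle, columns=COLUMNS):
    """Yield one dict per tab-delimited line, keyed by the assumed column names."""
    # QUOTE_NONE keeps quotation marks inside titles from being treated as quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield dict(zip(columns, row))


def public_domain_ids(records, pd_code="pd"):
    """Return the volume ids whose rights value equals the assumed public domain code."""
    return [rec["volume_id"] for rec in records if rec.get("rights") == pd_code]
```

A catalog loader could pass an open file to `read_inventory` and feed the result to `public_domain_ids` to pick the volumes it is allowed to link to in full view.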
Some Examples of Use
• Catalogs
  – UM loaded every record
  – Chicago links to public domain volumes owned in print
  – TROVE harvesting through OAI
  – OCLC loads records into OCLC
• Link Resolvers
  – UC created SFX target
• Vendors
  – H.W. Wilson database links to public domain volumes
  – ProQuest full-text index via Summon
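Harvesting through OAI, as in the TROVE example, generally means paging through OAI-PMH `ListRecords` responses with a resumption token. The sketch below shows that loop's two building blocks using only the standard protocol. The function names are illustrative, and the base URL passed in is the caller's assumption, not a documented HathiTrust endpoint.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords URL.

    In the protocol a resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return identifiers, token
```

A harvester would fetch `list_records_url(...)`, call `parse_page` on the body, and repeat with the returned token until it is `None`.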
Services (6)
• Collaborative Development Environment
  – Active repository development
• Support for Computational Research
  – Datasets
    • 120,000-volume set
    • Google-digitized public domain
  – Protocol-based access
  – Research Center
How Different from Google?
• Preservation
• Content
• Collective work
• Uses of materials
• Own trajectory
• Partnership
  – Not just about digital content or repository
  – Address challenges
  – Fulfill mission
  – Provide services for our communities
Collaboration: Print Storage
A global change in the library environment
June 2010: Median duplication 31%
June 2009: Median duplication 19%
Academic print book collection already substantially duplicated in mass digitized book corpus
Continuing growth of overlap …
• ARL overlap
  – 31% in June 2010
  – 33% in Dec (adjustment: adding little-held works)
  – ~ 1% per 225,000 vols
  – 38% in May, 2011; 45% by December, 2011
• Oberlin Group overlap
  – 41% in December, 2010
  – Higher rate of overlap per added volume?
  – Close to 50% in May, 2011
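The "~1% per 225,000 vols" rule of thumb can be turned into a simple linear projection. The sketch below assumes the relationship stays linear, which the slide itself only gives as an approximation; the function names are illustrative. Read literally, moving ARL median overlap from 38% to 45% implies roughly 1.575 million added volumes.

```python
# Rule of thumb from the slide: ~1 percentage point of ARL overlap per 225,000 volumes.
VOLUMES_PER_POINT = 225_000


def volumes_needed(current_pct, target_pct, volumes_per_point=VOLUMES_PER_POINT):
    """Volumes to add to move median overlap from current_pct to target_pct."""
    if target_pct < current_pct:
        raise ValueError("target must not be below current overlap")
    return round((target_pct - current_pct) * volumes_per_point)


def projected_overlap(current_pct, added_volumes, volumes_per_point=VOLUMES_PER_POINT):
    """Median overlap after adding volumes under the linear assumption, capped at 100%."""
    return min(100.0, current_pct + added_volumes / volumes_per_point)
```

The Oberlin Group bullet raises the possibility of a higher rate per added volume, which would correspond to passing a smaller `volumes_per_point`.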
Digitized Books in Shared Repositories
~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories
~3.5M titles
~2.5M
Cost Model
• Based on overlap with print collections
  – Public Domain / In-copyright
• Print Holdings Database
  – Costs
  – Lawful uses of materials
  – Complete picture
  – Volumes institutions own or have owned
    • OCLC number; Bib record ID; Condition; Holding Status
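To make the holdings record concrete, here is a minimal sketch of a row with the four fields named above, plus an overlap measure of the kind a cost model could build on. The class and function names, the free-text field types, and matching on OCLC number alone are all assumptions for illustration; the slide does not specify the actual schema or matching rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrintHolding:
    """One print-holdings row, using the four fields named on the slide."""
    oclc_number: str
    bib_record_id: str
    condition: str       # free text here; the real vocabulary is not given
    holding_status: str  # likewise an assumed free-text value


def overlap_fraction(holdings, digitized_oclc_numbers):
    """Share of a library's distinct OCLC numbers that also appear in the digitized set."""
    held = {h.oclc_number for h in holdings}
    if not held:
        return 0.0
    return len(held & set(digitized_oclc_numbers)) / len(held)
```

Splitting the digitized set into public domain and in-copyright OCLC numbers and calling `overlap_fraction` on each would give the two overlap figures the cost model distinguishes.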
Collaboration: Copyright
Copyright status of books published pre-1923 and US works published 1923-1963
Public domain, in-copyright, and orphan works, pre-1923 and 1923-1963
Breakdown by US/non-US and rights status, pre-1923, 1923-1963 and 1964-1977
Breakdown by US/non-US and rights status for all periods
Collaboration: Preservation
Technology - OAIS
[Diagram repeated from the earlier "Technology - OAIS" slide, with the same component labels]
Technology
How to find out more
• Web site “About” section: http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• RSS: http://www.hathitrust.org/updates_rss
• Monthly newsletter: http://www.hathitrust.org/updates
• Contact us: [email protected]
• Soon: Facebook, blog
Thank you!