Presentation for the Office of Strategic Initiatives (November 8, 2006) with Suzanne C. Pilsk
Suzanne C. Pilsk and Martin R. Kalfatovic, November 8, 2006
Biodiversity Heritage Library: A Conversation About a Collaborative Digitization Project
Biodiversity Heritage Library: A Conversation About a Collaborative Digitization Project
Suzanne C. Pilsk
Martin R. Kalfatovic
Smithsonian Institution Libraries
Biodiversity
What is Biodiversity?
• Genetic variability within species
• Diversity of species
• Ecosystems and landscapes
Biodiversity
• Wholesome food
• Drinkable water
• Breathable air
• Stable climate for
– Forestry
– Agriculture
– Fisheries
• Waste decomposition
• Bioremediation
• Invasive species
• Pest control
• Ecotourism
• Pharmaceuticals
• Genomics
• Proteomics
• Bioengineering
• Biotechnology
• Molecular design
• Imitating nature
• Designer organisms
• Renewable feedstocks
• Envirofriendly
• Manufacturing processes
Taxonomic Literature
• Over 250 years of systematic description of life
• Systema naturae (10th ed. 1758) by Carl von Linné
Taxonomic Literature
The cited half-life of publications in taxonomy is longer than in any other scientific discipline
* * * The decay rate is longer than in any scientific discipline
- Macro-economic case for open access, Tom Moritz
Taxonomic Impediment
• Specimen collections
• Databases
• Publications
• Observations
• ‘Gray’ literature
• Index cards
• Field notebooks
Taxonomic Impediment
Agatea violaris: type specimen from the U.S. National Herbarium (Smithsonian Institution), collected by the United States Exploring Expedition, 1838-1842
Taxonomic Impediment
- Specimen
- Plate or other visual image
- Taxonomic description
Taxonomic Literature
The essential requirements for accessing and utilising this global information are:
• that there is access to information held in national/regional/global collections
• that electronic data is efficiently captured and provided in useable form
• that existing information held in literature and by current experts is made available electronically
• that stability of scientific names of organisms, used to access this information, is promoted
- Darwin Declaration, 1998
Taxonomic Impediment
[Chart: vertical axis 0 to 8; categories: US & Canada, Europe, Mexico & C. America, South America]
Biologia Centrali-Americana. Edited by Frederick Ducane Godman and Osbert Salvin. London : Pub. for the editors by R. H. Porter, 1879-1915
Digital Divide?
Vishwas Chavan travels a lot. An informatician based at the National Chemical Laboratory in Pune, India, he collects data on what types of animal live where in India to enter into a biodiversity database … Much of the information Chavan seeks is in old, out-of-print tomes … To find them, Chavan has spent years trailing around libraries. He dreams of the day when books such as these are scanned and made available as digital files on the Internet.
“Science in the Web Age: The Real Death of Print” by Andreas von Bubnoff
Nature 438, 550-552 (1 December 2005)
Encyclopedia of Life
…imagine for a moment that all the diversity of the world were finally revealed and then described, say one page to a species. The description would contain the scientific name, a photograph or drawing, a brief diagnosis, and information of where the species is found. If published in conventional book form … this Great Encyclopedia of Life would occupy 60 meters of library shelf per million species … 100 million species of organisms … would extend through 6 kilometers of shelving …
E.O. Wilson (1992)
Biodiversity Heritage Library
• 2003, Telluride. Encyclopaedia of Life meeting
• February 2005, London. Library and Laboratory: the Marriage of Research, Data and Taxonomic Literature
• May 2005, Washington. Ground work for the Biodiversity Heritage Library
• June 2006, Washington. Organizational and Technical meeting
• October 2006, St. Louis/San Francisco. Technical meetings
Biodiversity Heritage Library
• Museums
– American Museum of Natural History (New York)
– Field Museum (Chicago)
– Natural History Museum (London)
– Smithsonian Institution (Washington)
Biodiversity Heritage Library
• Botanical Gardens
– Missouri Botanical Garden
– New York Botanical Garden
– Royal Botanic Gardens, Kew
Biodiversity Heritage Library
• University Libraries
– Botany Libraries, Harvard University
– Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
Biodiversity Heritage Library
• Bioinformatics Member
– Marine Biological Laboratory / Woods Hole Oceanographic Institution Library (MBL/WHOI)
– uBio project of MBL/WHOI
Biodiversity Heritage Library
Affiliated Partner: Internet Archive
Biodiversity Heritage Library
• Core literature pre-1923: 400,000 (80 million pages)
• All pre-1923: 600-750,000 (120-150 million pages)
• All literature: 1.4-1.6 million (280-320 million pages)
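The page totals on this slide are consistent with a flat average of 200 pages per item, the same figure the deck later uses for serial volumes. The sketch below reproduces the arithmetic under that assumption; the function name and the dictionary keys are illustrative, not taken from the presentation.

```python
# Assumption: every item averages 200 pages (the figure the deck
# states explicitly only for serial volumes).
PAGES_PER_ITEM = 200


def estimated_pages(items: int, pages_per_item: int = PAGES_PER_ITEM) -> int:
    """Return the estimated page count for a number of items."""
    return items * pages_per_item


# (low, high) item counts as quoted on the slide.
ITEM_ESTIMATES = {
    "core_pre_1923": (400_000, 400_000),
    "all_pre_1923": (600_000, 750_000),
    "all_literature": (1_400_000, 1_600_000),
}

# Page ranges derived from the item counts.
PAGE_ESTIMATES = {
    name: (estimated_pages(low), estimated_pages(high))
    for name, (low, high) in ITEM_ESTIMATES.items()
}

if __name__ == "__main__":
    for name, (low, high) in PAGE_ESTIMATES.items():
        print(f"{name}: {low:,} to {high:,} pages")
```

Changing `PAGES_PER_ITEM` shows how sensitive the totals are to the assumed average.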
Biodiversity Heritage Library
Mandates:
Open Access: all content can be reused, repurposed, reformatted, sliced, diced, scraped, and ???
Data Types
• CR2: Raw camera files (IA)
• JPEG 2000
• JPEG (IA)
• GIF (IA)
• Thumbnail (IA)
• Flippy Book (IA)
• PDF
• DjVu (IA)
Data Types
• OCR Text
– Raw OCR Text
– Structured OCR Text
– OCR Text w/embedded Taxonomic Intelligence
– Structured OCR w/embedded Taxonomic Intelligence
BHL Portal Prototype
View
9. Page View
10. Page View - Detail
11. Page View – Detail – Full Screen
12. Page View - Detail
Discover names
44. Names View
46. Names View
50. Names View
Taxonomic Intelligence
Vernacular terms
Link outs
Generated Taxa Lists
• http://namebank.ubio.org/bulletin/process.php
Biodiversity Heritage Library
Jacob Christian Schäffer, Elementa entomologica . . . 1766.
Metadata Repository
Store all bibliographic metadata for the member libraries; create volume, part, piece metadata; ingest page level metadata at scanning level for the creation of page level Globally Unique Identifiers (GUIDs) for linking to other taxonomic services
Preliminary First Steps
• Combined metadata from member libraries = “Dirty Metadata Repository”
• OCLC analysis
• Worthwhile? Verdict still out
Metadata Analysis
• Initial analysis showed:
– We have 1.3 million catalogue records
– 73% are monographs (the remainder are serials at title level)
– 63% is English-language material; the next most common language (9%) is German
– About 30% of the material was published before 1923
Metadata Analysis
• Record files were received from Smithsonian, MOBOT, NYBG, Kew, NHML, Harvard, and AMNH.
– Total records: 1,330,058
• From these files, all records describing language-based monographs were extracted (LDR/6 and LDR/7 equal to “a” and “m”, respectively).
– Total records: 981,703
• Assumed Serials
– Total: 256,962
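The monograph extraction described above tests two character positions of the MARC leader: position 6 (type of record, "a" for language material) and position 7 (bibliographic level, "m" for monograph). A minimal sketch of that test on plain leader strings follows; the sample leaders are made up for illustration, and this is not the tooling OCLC or the member libraries actually used.

```python
def is_language_monograph(leader: str) -> bool:
    """True if LDR/6 is 'a' (language material) and LDR/7 is 'm' (monograph)."""
    # Leader positions are zero-based, so LDR/6 is leader[6].
    return len(leader) >= 8 and leader[6] == "a" and leader[7] == "m"


# Hypothetical leaders, invented for this example.
SAMPLE_LEADERS = [
    "00714cam a2200205 a 4500",  # language material, monograph
    "01234cas a2200313 a 4500",  # language material, serial
    "00987cem a2200241 a 4500",  # cartographic material, monograph
]

MONOGRAPHS = [ldr for ldr in SAMPLE_LEADERS if is_language_monograph(ldr)]

if __name__ == "__main__":
    print(f"{len(MONOGRAPHS)} of {len(SAMPLE_LEADERS)} sample records are language monographs")
```

A filter like this depends entirely on consistent leader coding in the source catalogues.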
Metadata Analysis
• 757,430 Total Monograph records made up of
616,196 records with no matches (assumed unique)
141,234 records representing a cluster
Metadata Analysis
• Overlap analysis
• Of the 981,000 monograph records from all institutions, 378,000 matching pairs were found
• 616,000 had no matches at all and were unique to one institution.
• After de-duplication of the matching pairs, the final file contains 757,000 records.
Metadata Analysis
• 981,703 monograph records analyzed by OCLC’s duplicate detection software
– 378,579 pairs detected and then clustered by A=B and B=C => A=C
• 151,705 unique items
– BUT the grand total was too high (1,032,494, an increase of 50,791): the logic equation wasn’t quite right!
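The rule "A=B and B=C => A=C" amounts to taking the transitive closure of the matched pairs. One common way to do that is a union-find (disjoint-set) structure, sketched below on toy identifiers. This illustrates the general technique only and is not OCLC's duplicate detection software; the function and variable names are hypothetical.

```python
from collections import defaultdict


def cluster_pairs(record_ids, pairs):
    """Group records into clusters using the transitive closure of matched pairs."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the cluster root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for rid in record_ids:
        clusters[find(rid)].add(rid)
    return list(clusters.values())


# Toy data: A=B and B=C, so A, B and C form one cluster; D and E match nothing.
RECORDS = ["A", "B", "C", "D", "E"]
PAIRS = [("A", "B"), ("B", "C")]
CLUSTERS = cluster_pairs(RECORDS, PAIRS)

if __name__ == "__main__":
    unmatched = sum(1 for c in CLUSTERS if len(c) == 1)
    print(f"{len(CLUSTERS)} clusters, {unmatched} unmatched records")
```

Because each record ends up in exactly one cluster, the cluster sizes must sum to the number of input records. A grand total larger than the input, as reported on the slide, indicates that some records were counted in more than one group.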
Metadata Analysis
• Problems Problems Problems
– Natural History Museum, London: the fixed-field coding that OCLC used for the monograph vs. serial title-based match was not “consistent”
– Harvard catalog contained quite a few “monograph” records for analyzed, library-specific bound articles
Metadata Analysis
• Serials! Guesstimate!
– 60 million pages (300,000 volumes of 200 pages each)
Outline / Workflow
• Scanning centers
– 10 scanners in a pod
• REQUIRES food at approximately XXX volumes per YYY
– Boston
– NYC area
– DC
– London
– Single Scanning Station
Outline / Workflow
10 Natural History Libraries Scanning at Once
Who is to Scan What?
– OCLC analysis to assist in prioritizing
– Collection Managers’
– Gross general themes to begin
– No longer worried about “Registry of Intent to Scan”
Outline / Workflow
• Volumes are pulled and taken to the scanner
• Scanner operator wands the barcode and uses Z39.50 to fetch a title-level record from the ILS
Problem
• Multivolumes and Serials!
• Title-level descriptions, BUT no item-level metadata
Problem: Issue-ization
• Page scan data
• Title level data
• Missing is the in between
– Citation resolving
• CCS – some success but NOT open source
• Citeseer – Lee Giles at PSU
Outline / Workflow
• “Clean Metadata Repository”
– Title Level
– Intellectual Units to Some Granularity
– URL pointing to BHL “portal”
– Identifiers registered somewhere
• LSIDs
• DOIs
• BHL uniquely defined
Outline / Workflow
• Clean Metadata Repository as a Source
– For OCLC to pull and point
– For local ILS’ to pull and point
– For NSDL and other harvesters
BHL Metadata Repository
[Diagram: the BHL Metadata Repository (BHL MR) shown with the Internet Archive, the BHL Public Interface, and Taxonomic Web Services (e.g. CBOL, GBIF, ITIS, GenBank, INOTAXA documents, etc.)]
Timeline
• BHL Metadata Repository for currently scanned titles: January 2007
• BHL Portal for existing literature: March 2007
• Funding for Mass Scanning: Late Spring 2007?
Thanks to the following for input/content:
• Chris Freeland (Missouri Botanical Garden)
• Neil Thomson (Natural History Museum, London)
• Anna Weitzman (National Museum of Natural History)
• Chris Lyal (Natural History Museum, London)
• Scott Miller (Smithsonian Institution)
• Biodiversity Heritage Library (BHL): http://www.bhl.si.edu
• Universal Biological Indexer and Organizer (UBio): http://www.ubio.org/
• Consortium for the Barcode of Life (CBOL): http://barcoding.si.edu/
• Global Biodiversity Information Facility (GBIF): http://barcoding.si.edu/
• Taxonomic Databases Working Group (TDWG): http://www.nhm.ac.uk/hosted_sites/tdwg/
Conversation About a Collaborative Digitization Project
http://www.sil.si.edu/staff/2006-BHL4LC/