Biodiversity Heritage Library: A Conversation About a Collaborative Digitization Project

Suzanne C. Pilsk and Martin R. Kalfatovic, Smithsonian Institution Libraries

November 8, 2006

DESCRIPTION

Presentation for the Office of Strategic Initiatives (November 8, 2006) with Suzanne C. Pilsk

Page 1

Biodiversity Heritage Library: A Conversation About a Collaborative Digitization Project

Suzanne C. Pilsk

Martin R. Kalfatovic

Smithsonian Institution Libraries

November 8, 2006

Page 2

Biodiversity

What is Biodiversity?
• Genetic variability within species
• Diversity of species
• Ecosystems and landscapes

Page 3

Biodiversity

• Wholesome food
• Drinkable water
• Breathable air
• Stable climate for
  – Forestry
  – Agriculture
  – Fisheries
• Waste decomposition
• Bioremediation
• Invasive species
• Pest control
• Ecotourism
• Pharmaceuticals
• Genomics
• Proteomics
• Bioengineering
• Biotechnology
• Molecular design
• Imitating nature
• Designer organisms
• Renewable feedstocks
• Envirofriendly
• Manufacturing processes

Page 4

Taxonomic Literature

• Over 250 years of systematic description of life

• Systema naturae (10th ed. 1758) by Carl von Linné

Page 5

The cited half-life of publications in taxonomy is longer than in any other scientific discipline

* * * The decay rate is longer than in any scientific discipline

- Macro-economic case for open access, Tom Moritz

Taxonomic Literature

Page 6

Taxonomic Impediment

• Specimen collections
• Databases
• Publications
• Observations
• ‘Gray’ literature
• Index cards
• Field notebooks

Page 7

Agatea violaris
Type specimen from the U.S. National Herbarium (Smithsonian Institution), collected by the United States Exploring Expedition, 1838-1842

Taxonomic Impediment

Page 8

Taxonomic Impediment

Page 9

- Specimen
- Plate or other visual image
- Taxonomic description

Taxonomic Impediment

Page 10

Taxonomic Literature

The essential requirements for accessing and utilising this global information are:

• that there is access to information held in national/regional/global collections
• that electronic data is efficiently captured and provided in useable form
• that existing information held in literature and by current experts is made available electronically
• that stability of scientific names of organisms, used to access this information, is promoted

- Darwin Declaration, 1998

Page 11

Taxonomic Impediment

[Chart: axis scale 0 to 8; categories: US & Canada, Europe, Mexico & C. America, South America]

Biologia Centrali-Americana. Edited by Frederick Ducane Godman and Osbert Salvin. London : Pub. for the editors by R. H. Porter, 1879-1915

Page 12

Digital Divide?

Page 13

Vishwas Chavan travels a lot. An informatician based at the National Chemical Laboratory in Pune, India, he collects data on what types of animal live where in India to enter into a biodiversity database … Much of the information Chavan seeks is in old, out-of-print tomes … To find them, Chavan has spent years trailing around libraries. He dreams of the day when books such as these are scanned and made available as digital files on the Internet.

“Science in the Web Age: The Real Death of Print” by Andreas von Bubnoff

Nature 438, 550-552 (1 December 2005)

Digital Divide?

Page 14

Encyclopedia of Life

…imagine for a moment that all the diversity of the world were finally revealed and then described, say one page to a species. The description would contain the scientific name, a photograph or drawing, a brief diagnosis, and information of where the species is found. If published in conventional book form … this Great Encyclopedia of Life would occupy 60 meters of library shelf per million species … 100 million species of organisms … would extend through 6 kilometers of shelving …

E.O. Wilson (1992)

Page 15

Biodiversity Heritage Library

• 2003, Telluride. Encyclopaedia of Life meeting
• February 2005, London. Library and Laboratory: the Marriage of Research, Data and Taxonomic Literature
• May 2005, Washington. Ground work for the Biodiversity Heritage Library
• June 2006, Washington. Organizational and Technical meeting
• October 2006, St. Louis/San Francisco. Technical meetings

Page 16

Biodiversity Heritage Library

• Museums
  – American Museum of Natural History (New York)
  – Field Museum (Chicago)
  – Natural History Museum (London)
  – Smithsonian Institution (Washington)

Page 17

Biodiversity Heritage Library

• Botanical Gardens
  – Missouri Botanical Garden
  – New York Botanical Garden
  – Royal Botanic Gardens, Kew

Page 18

Biodiversity Heritage Library

• University Libraries
  – Botany Libraries, Harvard University
  – Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University

Page 19

Biodiversity Heritage Library

• Bioinformatics Member
  – Marine Biological Laboratory / Woods Hole Oceanographic Institution Library (MBL/WHOI)
  – uBio project of MBL/WHOI

Page 20

Biodiversity Heritage Library

Affiliated Partner: Internet Archive

Page 21

Biodiversity Heritage Library

Page 22

Biodiversity Heritage Library

• Core literature pre-1923: 400,000 volumes (80 million pages)
• All pre-1923: 600-750,000 volumes (120-150 million pages)
• All literature: 1.4-1.6 million volumes (280-320 million pages)

Page 23

Biodiversity Heritage Library

Mandates:

Open Access: all content can be reused, repurposed, reformatted, sliced, diced, scraped, and ???

Page 24

Data Types

• CR2: Raw camera files (IA)
• JPEG 2000
• JPEG (IA)
• GIF (IA)
• Thumbnail (IA)
• Flippy Book (IA)
• PDF
• DjVu (IA)

Page 25

Data Types

• OCR Text
  – Raw OCR Text
  – Structured OCR Text
  – OCR Text w/embedded Taxonomic Intelligence
  – Structured OCR w/embedded Taxonomic Intelligence

Page 26

BHL Portal Prototype

Page 27

- Specimen
- Plate or other visual image
- Taxonomic description

Taxonomic Impediment

Page 28

View

Page 29

9. Page View

Page 30

Page 31

9. Page View

Page 32

Page 33

Page 34

9. Page View

Page 35

10. Page View - Detail

Page 36

11. Page View – Detail – Full Screen

Page 37

12. Page View - Detail

Page 38

Page 39

Page 40

Page 41

12. Page View - Detail

Page 42

Discover names

Page 43

Page 44

44. Names View

Page 45

Page 46

46. Names View

Page 47

Page 48

Page 49

Page 50

50. Names View

Page 51

Page 52

Page 53

Taxonomic Intelligence

Page 54

Taxonomic Intelligence

Page 55

Taxonomic Intelligence

Page 56

Taxonomic Intelligence

Page 57

Vernacular terms

Link outs

Taxonomic Intelligence

Page 58

Generated Taxa Lists

Taxonomic Intelligence

Page 59

• http://namebank.ubio.org/bulletin/process.php

Taxonomic Intelligence

Page 60

Biodiversity Heritage Library

Jacob Christian Schäffer, Elementa entomologica . . . 1766.

Metadata Repository

Store all bibliographic metadata for the member libraries; create volume, part, piece metadata; ingest page level metadata at scanning level for the creation of page level Globally Unique Identifiers (GUIDs) for linking to other taxonomic services
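
One way to picture page level GUIDs is to derive them deterministically from the title, volume, and page sequence. This is a minimal sketch only: the namespace URL, the `page_guid` name, and the key layout are assumptions made for illustration, not the identifier scheme chosen for BHL.

```python
import uuid

# Hypothetical namespace for this sketch; not a real BHL identifier authority.
SKETCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/bhl-sketch")


def page_guid(title_id: str, volume: str, page_sequence: int) -> uuid.UUID:
    """Derive a stable identifier from title, volume, and page sequence."""
    key = f"{title_id}/{volume}/{page_sequence}"
    return uuid.uuid5(SKETCH_NAMESPACE, key)


if __name__ == "__main__":
    # Illustrative inputs only.
    print(page_guid("title-0001", "v.1", 1))
```

Because the identifier is computed from its inputs, re-ingesting the same page yields the same GUID, which is what makes it usable as a link target for other services.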

Page 61

Preliminary First Steps

• Combined metadata from member libraries = “Dirty Metadata Repository”

• OCLC analysis

• Worthwhile? Verdict still out

Page 62

Metadata Analysis

• Initial analysis showed:
  – We have 1.3 million catalogue records
  – 73% are monographs (remainder are serials at title-level)
  – 63% is English language material. The next most popular language (9%) is German.
  – About 30% of material was published before 1923.

Page 63

Metadata Analysis

• Record files were received from Smithsonian, MOBOT, NYBG, Kew, NHML, Harvard, and AMNH.
  – Total records: 1,330,058
• From these files, all records describing language-based monographs were extracted (LDR/6 and LDR/7 equal to “a” and “m”, respectively).
  – Total records: 981,703
• Assumed Serials
  – Total 256,962
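
The monograph extraction described above can be sketched as a filter on the MARC leader. The helper names are hypothetical, and leaders are passed as plain 24-character strings rather than read through any particular MARC library.

```python
def is_language_monograph(leader: str) -> bool:
    """True when LDR/06 is 'a' (language material) and LDR/07 is 'm' (monograph)."""
    return len(leader) > 7 and leader[6] == "a" and leader[7] == "m"


def split_records(leaders):
    """Split leaders into (monographs, everything else)."""
    monographs = [ldr for ldr in leaders if is_language_monograph(ldr)]
    others = [ldr for ldr in leaders if not is_language_monograph(ldr)]
    return monographs, others


if __name__ == "__main__":
    sample = [
        "00000nam a2200000 a 4500",  # language material, monograph
        "00000nas a2200000 a 4500",  # language material, serial
        "00000nem a2200000 a 4500",  # cartographic material, monograph
    ]
    monographs, others = split_records(sample)
    print(len(monographs), len(others))
```

A filter like this trusts the fixed-field coding in each catalog, so inconsistent coding at the source carries straight through into the counts.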

Page 64

Metadata Analysis

• 757,430 Total Monograph records made up of
  – 616,196 records with no matches (assumed unique)
  – 141,234 records representing a cluster

Page 65

Metadata Analysis

• Overlap analysis
• Of the 981,000 monograph records from all institutions, 378,000 matching pairs were found
• 616,000 had no matches at all and were unique to one institution.
• After de-duplication of the matching pairs, the final file contains 757,000 records.

Page 66

Metadata Analysis

• 981,703 monograph records analyzed by OCLC’s duplicate detection software
  – 378,579 pairs detected and then clustered by A=B and B=C => A=C
• 151,705 unique items
  – BUT grand total of too many (1,032,494, an increase of 50,791) ~ Logic equation wasn’t quite right!
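
The A=B and B=C => A=C step is transitive clustering, which is commonly implemented with a union-find structure. The sketch below is not OCLC's software; the function names and sample record IDs are illustrative. It also encodes the sanity check that the totals above failed: a deduplicated count can never exceed the number of input records.

```python
from collections import defaultdict


def cluster_pairs(pairs):
    """Group record IDs into clusters using the rule A=B and B=C => A=C."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for record in list(parent):
        clusters[find(record)].add(record)
    return list(clusters.values())


def deduplicated_count(total_records, pairs):
    """Unmatched records plus one representative per cluster."""
    clusters = cluster_pairs(pairs)
    matched = sum(len(cluster) for cluster in clusters)
    result = (total_records - matched) + len(clusters)
    assert result <= total_records, "deduplication cannot add records"
    return result


if __name__ == "__main__":
    pairs = [("A", "B"), ("B", "C"), ("D", "E")]
    print(cluster_pairs(pairs))
    print(deduplicated_count(7, pairs))
```

Counting each matched record exactly once, through its cluster, is what keeps the grand total from growing past the input size.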

Page 67

Metadata Analysis

• Problems Problems Problems
  – Natural History Museum, London: the fixed field coding that OCLC used for the monograph vs. serial title-based match was not “consistent”
  – Harvard: the catalog contained quite a few “monograph” records for analyzed, library-specific bound articles

Page 68

Metadata Analysis

• Serials! Guesstimate!
  – 60 million pages (300,000 volumes of 200 pages each)
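
The guesstimate is a single multiplication, shown here as a small sketch. The function name is illustrative; the 200 pages per volume figure is the slide's own. The same figure also reproduces the ratio between counts and page totals in the earlier literature estimates (for example 400,000 and 80 million pages).

```python
PAGES_PER_VOLUME = 200  # figure used in the serials guesstimate


def estimated_pages(volumes: int, pages_per_volume: int = PAGES_PER_VOLUME) -> int:
    """Estimate total pages to scan from a volume count."""
    return volumes * pages_per_volume


if __name__ == "__main__":
    print(estimated_pages(300_000))  # serials guesstimate
    print(estimated_pages(400_000))  # core pre-1923 literature
```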

Page 69

Outline / Workflow

• Scanning centers
  – 10 scanners in a pod
    • REQUIRES food at approximately XXX volumes per YYY
  – Boston
  – NYC area
  – DC
  – London
  – Single Scanning Station

Page 70

Outline / Workflow

10 Natural History Libraries Scanning at Once

Who is to Scan What?
– OCLC analysis assists in prioritizing
– Collection Managers’
– Gross general themes to begin
– No longer worried about “Registry of Intent to Scan”

Page 71

Outline / Workflow

• Volumes are pulled and taken to scanner
• Scanner wands barcode and uses Z39.50 to fetch a title level record from ILS

Problem
• Multivolumes and Serials!
• Title level descriptions – BUT – No item level metadata
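
The gap described above can be sketched as a scan job that pairs a barcode with whatever title level record comes back. Everything here is hypothetical: `fetch_title_record` is a caller-supplied stand-in for the Z39.50 lookup, not a real client API, and the class and field names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TitleRecord:
    title: str
    is_multipart: bool  # True for serials and multivolume sets


@dataclass
class ScanJob:
    barcode: str
    title: TitleRecord
    item_label: Optional[str] = None  # e.g. "v. 3"; absent from a title level record

    def needs_item_metadata(self) -> bool:
        """A multipart title scanned without a volume label cannot be told apart."""
        return self.title.is_multipart and self.item_label is None


def start_scan(
    barcode: str,
    fetch_title_record: Callable[[str], TitleRecord],
    item_label: Optional[str] = None,
) -> ScanJob:
    """Wand the barcode, fetch the title level record, and open a scan job."""
    return ScanJob(barcode, fetch_title_record(barcode), item_label)


if __name__ == "__main__":
    # In-memory stand-in for the ILS; the barcode is made up.
    catalog = {"0001": TitleRecord("Example serial title", True)}
    job = start_scan("0001", catalog.__getitem__)
    print(job.needs_item_metadata())
```

The sketch makes the problem concrete: the title level record alone cannot say which volume of a serial is on the scanner, so that label has to come from somewhere else in the workflow.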

Page 72

Problem: Issue-ization

• Page scan data
• Title level data
• Missing is the in between
  – Citation resolving
    • CCS – some success but NOT open source
    • Citeseer – Lee Giles at PSU

Page 73

Outline / Workflow

• “Clean Metadata Repository”
  – Title Level
  – Intellectual Units to Some Granularity
  – URL pointing to BHL “portal”
  – Identifiers registered somewhere
    • LSIDs
    • DOIs
    • BHL uniquely defined

Page 74

Outline / Workflow

• Clean Metadata Repository as a Source
  – For OCLC to pull and point
  – For local ILS’ to pull and point
  – For NSDL and other harvesters

Page 75

BHL Metadata Repository

[Diagram labels: Internet Archive; BHL MR; BHL Public Interface; Taxonomic Web Services (e.g. CBOL, GBIF, ITIS, GenBank, INOTAXA documents, etc.)]

Page 76

Timeline

• BHL Metadata Repository for currently scanned titles: January 2007
• BHL Portal for existing literature: March 2007
• Funding for Mass Scanning: Late Spring 2007?

Page 77

Biodiversity Heritage Library

Page 78

Biodiversity Heritage Library

Page 79

Biodiversity Heritage Library

Page 80

Biodiversity Heritage Library

Page 81

Biodiversity Heritage Library: A Conversation About a Collaborative Digitization Project

Suzanne C. Pilsk
Martin R. Kalfatovic
Smithsonian Institution Libraries

Thanks to the following for input/content:
Chris Freeland (Missouri Botanical Garden)
Neil Thomson (Natural History Museum, London)
Anna Weitzman (National Museum of Natural History)
Chris Lyal (Natural History Museum, London)
Scott Miller (Smithsonian Institution)

Page 82

Biodiversity Heritage Library (BHL)
http://www.bhl.si.edu

Universal Biological Indexer and Organizer (UBio)
http://www.ubio.org/

Consortium for the Barcode of Life (CBOL)
http://barcoding.si.edu/

Global Biodiversity Information Facility (GBIF)
http://barcoding.si.edu/

Taxonomic Databases Working Group (TDWG)
http://www.nhm.ac.uk/hosted_sites/tdwg/

Conversation About a Collaborative Digitization Project

http://www.sil.si.edu/staff/2006-BHL4LC/