CROSS-COMMUNITY USER REQUIREMENTS & THE BIODIVERSITY HERITAGE LIBRARY
Chris Freeland, 9 June 2011
My background
M.S., Biological Sciences, Eastern Illinois University, 1997
B.S., Environmental Biology, Eastern Illinois University, 1996
Director, Center for Biodiversity Informatics, Missouri Botanical Garden, 2007 – date
Technical Director, Biodiversity Heritage Library, 2007 – date
Application Development Manager, Missouri Botanical Garden, 2003 – 2007
Web Project Leader, Missouri Botanical Garden, 2000 – 2003
http://chrisfreeland.com
@chrisfreeland
Data sharing & integration
[Diagram: data types shared among systems: plant names, specimens, descriptions, citations]
Plant Sciences: Tropicos
Developed in-house at MOBOT since 1982
Originally developed to capture field notebook data & streamline printing herbarium sheet labels
Tool used by MOBOT staff, collaborators & a global audience of scholars & students
Core Components
Names: 1.2 million names + synonymy; objective view
Specimens: 3.9 million specimen records
Images: 160,000 (specimens, plants, drawings); IMLS National Leadership Grant, 1998
Literature: 1.2 million protologue citations, linked to BHL when available; 160,000 name-based citations
Projects: floras, checklists & data gathering; alternate classifications, project-specific views
http://tropicos.org
System Expansion
GIS integration: enhanced mapping & analysis; complements Analysis Unit; IMLS grant, 2009
Enhanced interfaces for keys: SDD export now available
Robust APIs, including names lookup service: services instead of scraping
djatoka for JPEG2000 (JP2) image delivery
MO Distribution: Caprifoliaceae
Tropicos as Data Provider
GBIF: 3.9 million records; 2.1 million georeferenced
Taxonomic Name Resolution Service: computed acceptance, synonymy
NameBank: contributed names
Zipcode Zoo: 20,000 images shared between systems
The Plant List, in collaboration with Kew
Users & Requirements
Plant Science Scholars & Students: status / history of name; links to BHL; specimens collected / specimens determined; distribution; multiple classifications; acceptance
General Public: common names, images, maps/distribution
Literature Repositories: BHL
Consortium of natural history museum & botanical garden libraries
Expanded to include technology partners and service providers
Goal of digitizing public domain biodiversity literature, and in-copyright materials where negotiable
Direct integration with Encyclopedia of Life (EOL)
BHL Partners http://www.biodiversitylibrary.org
The Biodiversity Heritage Library (BHL) is a global community of natural history libraries and research institutions who have formed a partnership to
digitize and make available the world's biodiversity literature.
Now online: 90,000+ volumes; 34 million+ pages
BHL is a research space
The BHL corpus as a whole is a biodiversity data set in its own right. Embedded in it are: predator/prey relationships, habitat/distribution data, host/parasite data, and pathogen/disease vector data.
Third party researchers and projects are interested in mining the BHL texts for multiple research needs.
One site for serving/accessing/downloading digital texts AND for data mining is messy. Separate the two and put a version of the corpus in a public-like cloud space.
http://biodiversitylibrary.org
BHL by the Book
OCR
XML
JP2
> 70TB, growing every day…
One 380 pg (avg) volume = multiple files, varying sizes, relationships among them
Internet Archive: Digitized content / files
MOBOT: Database & web application
MBL: Redundant cluster
Current distributed infrastructure
Metadata
Content
Access System – files, metadata & services needed to deliver content.
Data Ingest
Sync
BHL Vision: Global Infrastructure
Preservation System – multiple redundant copies of all digitized content.
Replicate
DuraCloud pilot
Community interest in cloud storage (Funding organizations, too!)
Wanted to evaluate applicability of cloud storage for large-scale digitization activities
Solutions for efficient transfer of 10s-100s of TB of data
Lower-cost alternatives to maintaining large data centers
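A rough back-of-the-envelope sketch of why transfers at this scale are slow. The 70 TB figure is the corpus size quoted earlier in the deck; the throughput values are hypothetical sustained rates chosen for illustration, not measurements from the pilot, and a terabyte is taken as 10**12 bytes.

```python
def transfer_days(terabytes, megabits_per_second):
    """Days needed to move `terabytes` (decimal, 10**12 bytes) at a sustained rate."""
    bits = terabytes * 10**12 * 8
    seconds = bits / (megabits_per_second * 10**6)
    return seconds / 86_400


# Hypothetical sustained throughputs, in Mbit/s (assumptions, not pilot data).
for rate in (100, 1000):
    print(f"70 TB at {rate} Mbit/s: {transfer_days(70, rate):.1f} days")
```

Under these assumptions the corpus takes about 64.8 days at 100 Mbit/s and about 6.5 days at 1 Gbit/s, before any retries, verification, or upload/download fees.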
BHL Policy Challenges
Money - At present in the US, one BHL member library (MBL) is willing to provide essentially free redundant hosting. This is a very attractive financial offer. Since MBL is a BHL member, it also provides a level of administrative commitment.
Skill level - Multiple global partners need all or some of the current holdings and have varying levels of technical skill. For some, shipping hard drives might be easier. For others, uploading to and downloading from the cloud might be preferable.
Control – there are no clear models for using the cloud in cultural-scientific digital projects. Early-adopter paranoia.
Data Transfer Methods & Limitations
[Diagram: two methods of moving data from Node A to Node B, compared]
Problems: Hardware failure, data loss, shipping fees
vs.
Problems: Available bandwidth, upload/download fees
Data transfer: Cloud vs. Cluster
Inventory & audit lists
Checksums for data integrity
Heavy lifting at BHL scale, regardless of endpoint: weeks->months, not minutes->days
Differences
In cluster environment, have to be intimately involved in hardware decisions, maintenance, troubleshooting
In cloud environment, those worries are part of your fee
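The inventory, audit and checksum steps listed above apply whether the endpoint is a cloud or a cluster. One way they might look in practice is sketched below. This is an illustration only, not BHL's actual tooling: the function names, the choice of SHA-256, and the report layout are all assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Inventory every file under root as {relative path: checksum}."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, manifest):
    """Compare the files at the receiving endpoint against the sender's manifest."""
    found = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(found)),
        "unexpected": sorted(set(found) - set(manifest)),
        "corrupt": sorted(
            name
            for name in manifest.keys() & found.keys()
            if manifest[name] != found[name]
        ),
    }
```

The sender would run `build_manifest` before shipping or uploading, and the receiver would run `audit` against the same manifest afterwards. An empty report means every file arrived intact.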
Challenges for adopting cloud storage
BHL is embedded in longstanding institutions with mega-infrastructure
Already support data storage & maintenance at BHL scale
Little funding for alternative infrastructure / storage
Current storage is (really, truly) free through Internet Archive
Costs associated with download / use of content
BHL is a global resource for a broad community
User community wants to “do things” with data
Lessons learned from pilot
Cloud infrastructure & applicability to BHL are no longer a mystery
Nothing is free
Except when it is
Cloud storage provides ability to quickly scale infrastructure
No lost time procuring & configuring hardware
Useful for the right kinds of datasets
It’s not the size of the corpus, it’s the size of the files
Huge files are problematic
More lessons learned
More possibilities than expected: features, movement, support available from commercial providers. Increasing menus of choices
There is no silver bullet
Cloud is just a different endpoint for file storage
It doesn’t solve all problems related to repository management
Global data sharing requires a social infrastructure
BHL Services & APIs
OpenURL
Facilitate links to citations: protologues, articles, references
Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx
Useful to nomenclators, reference systems: IPNI, Tropicos
Names Service
Return all occurrences of a name throughout BHL digitized corpus
Documentation: http://bit.ly/2e6sg9
Working out a strategy for obscure species
Algorithm improvements to detect nomenclatural & taxonomic acts
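A minimal sketch of how a nomenclator or reference system might assemble a BHL OpenURL link from a citation. The query parameters (`pid`, `volume`, `issue`, `spage`, `date`) are copied from the example link that appears in this deck. The function name and its arguments are hypothetical, and the OpenURL help page remains the authority on which parameters the resolver accepts.

```python
from urllib.parse import urlencode

BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"


def build_bhl_openurl(title_id, volume, start_page, year, issue=""):
    """Build a citation link in the same shape as the deck's example URL."""
    params = {
        "pid": f"title:{title_id}",
        "volume": str(volume),
        "issue": str(issue),      # left blank when the issue is unknown
        "spage": str(start_page),
        "date": str(year),
    }
    # Keep ':' unescaped so the pid reads title:3934, as in the example link.
    return f"{BHL_OPENURL}?{urlencode(params, safe=':')}"


print(build_bhl_openurl(3934, 14, 301, 1879))
```

With these arguments the function reproduces the example link shown in the deck: title 3934, volume 14, start page 301, date 1879.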
BHL + Tropicos
A unique platform for biodiversity research
Built to serve taxonomists’ & other scientists’ investigations
But now serve multiple disciplines
Enhanced by 250+ years of accumulated knowledge
Complicated by 250+ years of collegial disagreement
Complementary to physical libraries & herbaria
http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879
http://www.tropicos.org/Name/1200408
BHL OpenURL Disambiguation
[Screenshots: "Looking for:" (the requested citation) and "BHL returns:" (the candidate matches)]
Services: OpenURL Results
Conclusion
Questions?
Chris Freeland
Technical Director, Biodiversity Heritage Library
Director, Center for Biodiversity Informatics, Missouri Botanical Garden
Missouri Botanical Garden
4344 Shaw Blvd.
St. Louis, MO 63110 USA
Email: [email protected]
Twitter: @chrisfreeland
Blog / info: chrisfreeland.com