CROSS-COMMUNITY USER REQUIREMENTS & THE BIODIVERSITY HERITAGE LIBRARY Chris Freeland 9 June 2011

Cross-Community User Requirements and the Biodiversity Heritage Library

Page 1: Cross-Community User Requirements and the Biodiversity Heritage Library

CROSS-COMMUNITY USER REQUIREMENTS & THE BIODIVERSITY HERITAGE LIBRARY

Chris Freeland 9 June 2011

Page 2: Cross-Community User Requirements and the Biodiversity Heritage Library

My background

M.S., Biological Sciences, Eastern Illinois University, 1997

B.S., Environmental Biology, Eastern Illinois University, 1996

Director, Center for Biodiversity Informatics, Missouri Botanical Garden, 2007 – date

Technical Director, Biodiversity Heritage Library, 2007 – date

Application Development Manager, Missouri Botanical Garden, 2003 – 2007

Web Project Leader, Missouri Botanical Garden, 2000 – 2003

http://chrisfreeland.com

@chrisfreeland

Page 3: Cross-Community User Requirements and the Biodiversity Heritage Library

Data sharing & integration

[Diagram labels: Plant Names (repeated for each data set), Specimens, Descriptions, Citations]

Page 4: Cross-Community User Requirements and the Biodiversity Heritage Library

Plant Sciences: Tropicos

Developed in-house at MOBOT since 1982

Originally developed to capture field notebook data & streamline printing herbarium sheet labels

Tool used by MOBOT staff, collaborators & a global audience of scholars & students

Page 5: Cross-Community User Requirements and the Biodiversity Heritage Library

Core Components

Names: 1.2 million names + synonymy; objective view

Specimens: 3.9 million specimen records

Images: 160,000 specimens, plants, drawings; IMLS National Leadership Grant, 1998

Literature: 1.2 million protologue citations, linked to BHL when available; 160,000 name-based citations

Projects: floras, checklists & data gathering; alternate classifications, project-specific views

Page 6: Cross-Community User Requirements and the Biodiversity Heritage Library

http://tropicos.org

Page 7: Cross-Community User Requirements and the Biodiversity Heritage Library

System Expansion

GIS integration: enhanced mapping & analysis; complements Analysis Unit; IMLS grant, 2009

Enhanced interfaces for keys: SDD export now available

Robust APIs, including names lookup service: services instead of scraping

djatoka for JPEG2000 (JP2) image delivery

MO Distribution: Caprifoliaceae

Page 8: Cross-Community User Requirements and the Biodiversity Heritage Library

Tropicos as Data Provider

GBIF: 3.9 million records; 2.1 million georeferenced

Taxonomic Name Resolution Service: computed acceptance, synonymy

NameBank: contributed names

Zipcode Zoo: 20,000 images shared between systems

The Plant List, in collaboration with Kew

Page 9: Cross-Community User Requirements and the Biodiversity Heritage Library

Users & Requirements

Plant Science Scholars & Students: status / history of name; links to BHL; specimens collected / specimens determined; distribution; multiple classifications; acceptance

General Public: common names, images, maps/distribution

Page 10: Cross-Community User Requirements and the Biodiversity Heritage Library

Literature Repositories: BHL

Consortium of natural history museum & botanical garden libraries

Expanded to include technology partners and service providers

Goal of digitizing public domain biodiversity literature, and in-copyright materials where negotiable

Direct integration with Encyclopedia of Life (EOL)

Page 11: Cross-Community User Requirements and the Biodiversity Heritage Library

BHL Partners

http://www.biodiversitylibrary.org

The Biodiversity Heritage Library (BHL) is a global community of natural history libraries and research institutions who have formed a partnership to digitize and make available the world's biodiversity literature.

Now online: 90,000+ volumes; 34 million+ pages

Page 12: Cross-Community User Requirements and the Biodiversity Heritage Library

BHL is a research space

The BHL corpus as a whole is a biodiversity data set in its own right. Embedded in it are: predator/prey relationships; habitat/distribution data; host/parasite data; pathogen/disease vector data.

Third-party researchers and projects are interested in mining the BHL texts for multiple research needs.

Using one site both for serving/accessing/downloading digital texts and for data mining is messy. Separate the two and put a version of the corpus in a public-like cloud space.

Page 13: Cross-Community User Requirements and the Biodiversity Heritage Library

http://biodiversitylibrary.org

Page 14: Cross-Community User Requirements and the Biodiversity Heritage Library
Page 15: Cross-Community User Requirements and the Biodiversity Heritage Library

BHL by the Book

File types: PDF, OCR, XML, JP2

> 70TB, growing every day…

One 380 pg (avg) volume = multiple files, varying sizes, relationships among them
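
One way to picture "multiple files, varying sizes, relationships among them" is a small data model. The sketch below is illustrative only: the class names, field names, and file names are hypothetical and are not taken from BHL or Internet Archive systems, and it assumes that some files cover the whole volume while others belong to a single page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VolumeFile:
    """One file belonging to a digitized volume (hypothetical model)."""
    name: str                   # illustrative file name, not a real BHL path
    kind: str                   # one of "PDF", "OCR", "XML", "JP2"
    size_bytes: int
    page: Optional[int] = None  # page number for per-page files, None for whole-volume files


@dataclass
class Volume:
    """A digitized volume and the files that describe it."""
    identifier: str
    files: List[VolumeFile] = field(default_factory=list)

    def total_bytes(self) -> int:
        # Storage footprint of the volume across all of its files.
        return sum(f.size_bytes for f in self.files)

    def files_for_page(self, page: int) -> List[VolumeFile]:
        # The per-page relationship: every file tied to one page.
        return [f for f in self.files if f.page == page]
```

Multiplying a structure like this by tens of thousands of volumes is what makes inventory and transfer at this scale non-trivial.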

Page 16: Cross-Community User Requirements and the Biodiversity Heritage Library

Current distributed infrastructure

Internet Archive: Digitized content / files

MOBOT: Database & web application

MBL: Redundant cluster

[Diagram labels: Metadata, Content]

Page 17: Cross-Community User Requirements and the Biodiversity Heritage Library

BHL Vision: Global Infrastructure

Access System – files, metadata & services needed to deliver content.

Preservation System – multiple redundant copies of all digitized content.

[Diagram labels: Data Ingest (three times), Sync, Replicate]

Page 18: Cross-Community User Requirements and the Biodiversity Heritage Library

DuraCloud pilot

Community interest in cloud storage (funding organizations, too!)

Wanted to evaluate applicability of cloud storage for large-scale digitization activities: solutions for efficient transfer of 10s to 100s of TB of data; lower-cost alternatives to maintaining large data centers

Page 19: Cross-Community User Requirements and the Biodiversity Heritage Library

BHL Policy Challenges

Money - At present in the US, one BHL member library (MBL) is willing to provide essentially free redundant hosting. This is a very attractive financial offer. Since MBL is a BHL member, it also provides a level of administrative commitment.

Skill level - Multiple global partners, needing all or some of the current holdings, have varying levels of technical skill. For some, shipping hard drives might be easier. For others, uploading to and downloading from the cloud might be preferable.

Control - Cultural-scientific digital projects have no clear models for using the cloud. Early-adopter paranoia.

Page 20: Cross-Community User Requirements and the Biodiversity Heritage Library

Data Transfer Methods & Limitations

[Diagram: two ways of moving data between Node A and Node B, shown side by side ("vs")]

Problems: Hardware failure, data loss, shipping fees

Problems: Available bandwidth, upload/download fees
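
To put the bandwidth problem in numbers, here is a back-of-the-envelope sketch. The 70 TB figure comes from the "BHL by the Book" slide; the link speeds are assumptions chosen for illustration, not measured BHL values, and TB is taken as 10^12 bytes.

```python
def transfer_days(size_tb: float, mbit_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link of mbit_per_s megabits/second.

    efficiency is the fraction of nominal bandwidth actually sustained
    (1.0 means the link is fully and continuously used, which is optimistic).
    """
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (mbit_per_s * 1e6 * efficiency)
    return seconds / 86_400                        # seconds per day


if __name__ == "__main__":
    for speed in (100, 1000):                      # assumed link speeds in Mbit/s
        print(f"70 TB at {speed} Mbit/s: {transfer_days(70, speed):.1f} days")
```

Under these assumptions a fully used 100 Mbit/s link needs about 65 days for 70 TB and a 1 Gbit/s link about 6.5 days, before any retries, verification passes, or shared-link contention.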

Page 21: Cross-Community User Requirements and the Biodiversity Heritage Library

Data transfer: Cloud vs. Cluster

Inventory & audit lists

Checksums for data integrity

Heavy lifting at BHL scale, regardless of endpoint: weeks->months, not minutes->days

Differences

In cluster environment, have to be intimately involved in hardware decisions, maintenance, troubleshooting

In cloud environment, those worries are part of your fee
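
The inventory, audit list, and checksum steps named on this slide can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the tooling BHL used: SHA-256 is an assumed choice of hash, and `build_manifest` and `audit` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Inventory: map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Audit: compare the files under root against a manifest made at the source."""
    actual = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(actual)),
        "unexpected": sorted(set(actual) - set(manifest)),
        "corrupt": sorted(
            name for name in manifest.keys() & actual.keys()
            if manifest[name] != actual[name]
        ),
    }
```

The manifest would be built at the sending node and shipped alongside the data; running `audit` at the receiving node works the same way whether the endpoint is a cluster or a cloud store mounted as a file system.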

Page 22: Cross-Community User Requirements and the Biodiversity Heritage Library

Challenges for adopting cloud storage

BHL is embedded in longstanding institutions with megainfrastructure.

Already support data storage & maintenance at BHL scale

Little funding for alternative infrastructure / storage

Current storage is (really, truly) free through Internet Archive

Costs associated with download / use of content

BHL is a global resource for a broad community

User community wants to “do things” with data

Page 23: Cross-Community User Requirements and the Biodiversity Heritage Library

Lessons learned from pilot

Cloud infrastructure & applicability to BHL are no longer a mystery

Nothing is free. Except when it is.

Cloud storage provides ability to quickly scale infrastructure: no lost time procuring & configuring hardware

Useful for the right kinds of datasets: it’s not the size of the corpus, it’s the size of the files; huge files are problematic

Page 24: Cross-Community User Requirements and the Biodiversity Heritage Library

More lessons learned

More possibilities than expected: features; movement; support available from commercial providers; increasing menus of choices

There is no silver bullet: cloud is just a different endpoint for file storage; it doesn’t solve all problems related to repository management

Page 25: Cross-Community User Requirements and the Biodiversity Heritage Library

Global data sharing requires a social infrastructure

Page 26: Cross-Community User Requirements and the Biodiversity Heritage Library

BHL Services & APIs

OpenURL

Facilitate links to citations: protologues, articles, references

Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx

Useful to nomenclators, reference systems: IPNI, Tropicos

Names Service

Return all occurrences of a name throughout BHL digitized corpus

Documentation: http://bit.ly/2e6sg9

Working out a strategy for obscure species

Algorithm improvements to detect nomenclatural & taxonomic acts

Page 27: Cross-Community User Requirements and the Biodiversity Heritage Library

BHL + Tropicos

A unique platform for biodiversity research

Built to serve taxonomists’ & other scientists’ investigations, but now serve multiple disciplines

Enhanced by 250+ years of accumulated knowledge

Complicated by 250+ years of collegial disagreement

Complementary to physical libraries & herbaria

Page 28: Cross-Community User Requirements and the Biodiversity Heritage Library

http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879

http://www.tropicos.org/Name/1200408
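
The OpenURL link shown on this slide can be assembled programmatically. The sketch below uses only the query parameters visible in that example URL (`pid`, `volume`, `issue`, `spage`, `date`); the function name `bhl_openurl` is hypothetical, and any other parameters the resolver accepts are not covered here, so the documentation page listed on the Services & APIs slide remains the authority.

```python
from urllib.parse import urlencode

# Base address taken from the example link on the slide.
BHL_OPENURL = "http://www.biodiversitylibrary.org/openurl"


def bhl_openurl(title_id: int, volume: str = "", issue: str = "",
                spage: str = "", date: str = "") -> str:
    """Build a citation link in the same shape as the slide's example URL."""
    query = {
        "pid": f"title:{title_id}",  # title identifier, as in pid=title:3934
        "volume": volume,
        "issue": issue,              # left empty in the slide's example
        "spage": spage,              # start page
        "date": date,                # publication year
    }
    # safe=":" keeps the colon in "title:3934" unescaped, matching the example.
    return f"{BHL_OPENURL}?{urlencode(query, safe=':')}"


if __name__ == "__main__":
    print(bhl_openurl(3934, volume="14", spage="301", date="1879"))
```

A nomenclator such as Tropicos could generate a link of this kind from the protologue citation stored with each name record, which is the "services instead of scraping" approach mentioned earlier.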

Page 29: Cross-Community User Requirements and the Biodiversity Heritage Library

BHL OpenURL Disambiguation

Looking for:

BHL returns:

Page 30: Cross-Community User Requirements and the Biodiversity Heritage Library

Services: OpenURL Results

Page 31: Cross-Community User Requirements and the Biodiversity Heritage Library

Conclusion

Page 32: Cross-Community User Requirements and the Biodiversity Heritage Library

Questions?

Chris Freeland

Technical Director, Biodiversity Heritage Library

Director, Center for Biodiversity Informatics, Missouri Botanical Garden

Missouri Botanical Garden

4344 Shaw Blvd.

St. Louis, MO 63110 USA

Email: [email protected]

Twitter: @chrisfreeland

Blog / info: chrisfreeland.com