BHL Technologies: Review for BHL-Australia

DESCRIPTION

A review of technologies in use within the Biodiversity Heritage Library, as presented to BHL-Australia partners and the Atlas of Living Australia.

TECHNOLOGY

Chris Freeland, Technical Director

Biodiversity Heritage Library: http://biodiversitylibrary.org

Topics Covered

• Development History
• Usage
• Scanning & Content Acquisition
• Technologies
• Data Mining
• Services & APIs
• CiteBank
• Global BHL

http://www.biodiversitylibrary.org/item/38659

Tech History

• Preliminary work: MOBOT’s Botanicus http://www.botanicus.org
• Funded by Keck Foundation & IMLS
• Working demonstration of how nomenclators/databases (like Tropicos) can link into digitized scientific literature
• Codebase reused for BHL, then changed to fit requirements for EOL

Usage

Referrers: 2008 - 2009

Referrers: 2010

Jan 1 – Mar 15, 2010

SCANNING & CONTENT ACQUISITION

Workflow

• Selection
• Preparation
• Post Production
• (Re)publication
• Digitization
• Conservation

Complexities of distributed, mass scanning

from NYBG

from Smithsonian

BHL ScanList

http://bhl.nhm-wien.ac.at/scanlist/index.php

http://bhl.nhm-wien.ac.at/scanlist/index.php/Bibs/view/1018

Scanning = human work

Scan & Store: Internet Archive

Scanning on Scribes

Storage in Petaboxes

Scanning Derivatives

Master: XML, JP2

Derivatives: PDF, JPG, TXT, DjVu

[Figure labels: PDF, OCR, XML, JP2]

Ingest from other IA Partners

• Used mixture of subject analysis & other bibliographic metadata to identify content for inclusion in BHL
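A minimal sketch of what subject-based selection could look like. The record shape, the `subjects` field, and the term list are all assumptions made for illustration; the slides do not describe the actual selection rules.

```python
# Hypothetical subject terms; the real criteria are not given in the slides.
BIODIVERSITY_TERMS = {"botany", "zoology", "natural history", "entomology"}


def is_candidate(record, terms=BIODIVERSITY_TERMS):
    """Return True if any subject heading contains one of the terms."""
    subjects = [s.lower() for s in record.get("subjects", [])]
    return any(term in subject for subject in subjects for term in terms)


# Made-up records standing in for bibliographic metadata from IA partners.
records = [
    {"identifier": "item-a", "subjects": ["Botany -- Australia"]},
    {"identifier": "item-b", "subjects": ["Railroads -- History"]},
]

selected = [r["identifier"] for r in records if is_candidate(r)]
print(selected)
```

In practice a filter like this would only be a first pass, since the slide mentions other bibliographic metadata as well.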

BHL TECHNOLOGIES

Distributed (Somewhat)

• Internet Archive: Digitized content / files
• MOBOT: Database & web application
• MBL: Redundant cluster

BHL Development Team

http://biodiversitylibrary.org/page/10165550

Name Finding in action

• Image from scanner
• Converted to text via OCR
• Name finding via TaxonFinder
• Extract names
• Submit to NameBank
• SOAP response
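The name-finding steps can be sketched roughly as follows. This is not TaxonFinder: the regular expression is a deliberately naive stand-in, and the NameBank SOAP call is replaced by a local dictionary lookup, both assumptions made for illustration.

```python
import re

# Naive stand-in for TaxonFinder: a capitalized word followed by a lowercase
# word. The real tool is far more sophisticated.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidate_names(ocr_text):
    """Return candidate binomials in order of first appearance, no duplicates."""
    seen = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        name = f"{genus} {epithet}"
        if name not in seen:
            seen.append(name)
    return seen


def verify_names(candidates, known_names):
    """Split candidates into verified and unverified names.

    known_names stands in for the NameBank lookup; here it is a plain dict
    of name -> identifier (hypothetical).
    """
    verified = {n: known_names[n] for n in candidates if n in known_names}
    unverified = [n for n in candidates if n not in known_names]
    return verified, unverified


page_text = "Petalostigma banksii occurs near the coast. The shrub flowers in spring."
known = {"Petalostigma banksii": "example-id"}  # made-up identifier

candidates = find_candidate_names(page_text)
verified, unverified = verify_names(candidates, known)
print(verified)
print(unverified)
```

The naive pattern also picks up "The shrub", which is why the verification step against a name authority matters.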

with Taxonomic Intelligence…

http://biodiversitylibrary.org/page/10165550

http://biodiversitylibrary.org/name/Petalostigma_banksii

http://eol.org/pages/1153286

Name finding statistics

• 30 million pages scanned
• 70 million name strings found
• 60 million names verified with a NameBankID
• 1.5 million unique names with a NameBankID
• 3.5 million unique names *without* a NameBankID
• This is where the interesting data live!!!
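A quick back-of-envelope reading of these figures, assuming the rounded numbers on the slide and that the two groups of unique names do not overlap:

```python
# Rounded figures taken from the slide.
pages_scanned = 30_000_000
name_strings_found = 70_000_000
names_verified = 60_000_000
unique_with_id = 1_500_000
unique_without_id = 3_500_000

names_per_page = name_strings_found / pages_scanned
verified_share = names_verified / name_strings_found
# Assumes the "with" and "without" groups are disjoint.
unique_total = unique_with_id + unique_without_id
unmatched_unique_share = unique_without_id / unique_total

print(f"{names_per_page:.1f} name strings per page")
print(f"{verified_share:.0%} of name strings verified")
print(f"{unmatched_unique_share:.0%} of unique names lack a NameBankID")
```

On these numbers most name occurrences are verified, but most distinct names are not.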

Services & APIs

• OpenURL
  – Facilitate links to citations: protologues, articles, references
    • Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx
  – Useful to Nomenclators, Reference Systems
    • IPNI
    • Tropicos
• Names Service
  – Return all occurrences of a name throughout BHL digitized corpus
    • Documentation: http://bit.ly/2e6sg9
  – Working out a strategy for obscure species
  – Algorithm improvements to detect nomenclatural & taxonomic acts
• New API

http://www.biodiversitylibrary.org/openurl?pid=title:3934&volume=14&issue=&spage=301&date=1879

http://www.tropicos.org/Name/1200408
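A small sketch of how a linking system might assemble an OpenURL query like the one above. The parameter names are copied from the example URL; the helper function `build_openurl` is hypothetical, and the full parameter set is in the OpenURL documentation.

```python
from urllib.parse import urlencode

BASE = "http://www.biodiversitylibrary.org/openurl"


def build_openurl(title_id, volume="", issue="", spage="", date=""):
    """Build a query URL using the parameters seen in the example link."""
    params = {
        "pid": f"title:{title_id}",
        "volume": volume,
        "issue": issue,  # left empty when unknown, as in the example
        "spage": spage,
        "date": date,
    }
    # safe=":" keeps the colon in "title:3934" unescaped.
    return f"{BASE}?{urlencode(params, safe=':')}"


url = build_openurl(3934, volume=14, spage=301, date=1879)
print(url)
```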

Services: OpenURL Disambiguation

Looking for:

BHL returns:

Services: OpenURL Results

But where are the articles??

• BHL scans cover to cover for monographs & serials
• Have tested automated markup and article boundary extraction techniques
• Variety of typefaces & printing techniques make a wholly automated solution close to impossible
• So, when in need, crowdsource…

PDF Generation Stats

No, really, where are the articles?

http://www.citebank.org

http://citebank.org/search

http://citebank.org/node/47423

CiteBank boundaries

[Diagram labels: Scanned Books; Citation; Pageturning UI; PDF; OCR; eBook/Kindle; Stored *somewhere* & retrievable via HTTP URI; Citation (×3); Bibliography; CiteBank]

TOWARDS A GLOBAL BHL

Opportunities

• New technologies
  – BHL-Europe: Scan List
• New use cases & user communities
  – BHL-Europe: Cultural history
• New initiatives
  – Data mining, markup, text correction
• Redundancy, localization
• CONTENT!!

BHL is…

• A unique software tool
• Built to serve taxonomists’ & other scientists’ research
• Enhanced by 250+ years of accumulated knowledge
• Complementary to physical libraries
• A shared, global resource
• An unparalleled opportunity for collaboration

Thanks!

Chris Freeland
Technical Director, BHL
Director, Center for Biodiversity Informatics, Missouri Botanical Garden
chris.freeland@mobot.org
http://twitter.com/chrisfreeland