Digitization of Biodiversity Literature: an overview of the BHL for CONABIO

Digitizing Biodiversity Literature

(Technical Content)

CONABIO, Mexico

William Ulate, BHL Technical Director

17 December 2014

Digitization Workflow

Insert Smithsonian Macaw software here

Hardware & Software

Hardware

Using a Scribe station

Scanning by Internet Archive

Northeast Regional Scanning Facility (Boston)

New Jersey Facility

Natural History Museum, London

Fedscan (Library of Congress)

Internet Archive (San Francisco)

Smithsonian Libraries

Missouri Botanical Garden (Non-Scribe operation)

Hardware & Software

Hardware

Using a Scribe station

“Off-the-shelf” scanners or good-quality digital cameras

Software

Wonderfetch -> Partner Meta App (when using Scribe machines)

identifier

search_id

title

volume

creator

date

call_number

language

subject

publisher

description

page-progression

possible-copyright-status

licenseurl

rights

duediligence

rules_scanning

rules_republishing

Scanning Software: Macaw

Hardware & Software

Hardware

Using a Scribe station

“Off-the-shelf” scanners or good-quality digital cameras

Software

Wonderfetch -> Partner Meta App (when using Scribe machines)

Macaw

Uploading directly to Internet Archive (for example: MBG's Botanicus http://www.botanicus.org/)

Standards and formats to consider

The simplest way to contribute a text item to IA is currently as a single PDF file. IA creates a second PDF with a text layer if none exists.

Items can be submitted as a stack of image files, one image per page. The files can be in JPEG2000, JPG, or TIFF format, but with strict requirements for how the files in an image stack are to be named, and the stack needs to be packed into a single .zip or .tar file before submission.

When IA (Archive.org) scans a book for a Contributing Library, they use the custom-engineered "Scribe" workstation, but for many materials, adequate images can be made with off-the-shelf scanners or good-quality digital cameras.

For best results, it is recommended to use the highest resolution your device is capable of. Most images IA processes were produced at a resolution of 300-600 ppi.
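The image-stack option above can be sketched in Python as follows. This is a minimal illustration, not IA's own tooling: the "identifier + 4-digit sequence number, no gaps" pattern is taken from the upload slide later in this deck, while the underscore separator, the 1-based numbering, the `_images.zip` archive name and the helper names are assumptions.

```python
import zipfile
from pathlib import Path


def stack_names(identifier: str, page_count: int, ext: str = "jp2") -> list[str]:
    """Return gap-free page file names: identifier + 4-digit sequence number."""
    # The underscore separator and 1-based numbering are assumptions.
    return [f"{identifier}_{n:04d}.{ext}" for n in range(1, page_count + 1)]


def pack_stack(image_paths: list[Path], identifier: str, out_dir: Path) -> Path:
    """Store page images, in the order given, in one zip under sequential names."""
    if not image_paths:
        raise ValueError("image stack is empty")
    # Assumes every image in the stack has the same format as the first one.
    ext = image_paths[0].suffix.lstrip(".").lower()
    names = stack_names(identifier, len(image_paths), ext)
    archive = out_dir / f"{identifier}_images.zip"  # archive name is an assumption
    # ZIP_STORED: the page images are already compressed, so do not recompress.
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_STORED) as zf:
        for src, name in zip(image_paths, names):
            zf.write(src, arcname=name)
    return archive
```

Renaming inside the archive, rather than on disk, leaves the local master files untouched.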

Standards and formats to consider

BHL recommends following, in part, the DLF's "Benchmark for Faithful Digital Reproductions of Monographs and Serials" (available online at http://www.diglib.org/standards/bmarkfin.htm).

Bitonal: 600 dpi, 1-bit or bitonal TIFF images

Grayscale: 300 dpi, 8-bit grayscale uncompressed TIFF, or lossless compressed image (e.g. LZW, JPEG2000 [*.jp2]).

Color: 300 dpi, 24-bit color uncompressed TIFF, or lossless compressed images (e.g. LZW, JPEG2000 [*.jp2]).

NOTE: the above specifications are the preferred ones. BHL will, however, accept lossy files. In the case of JPEG2000, files with a compression level of 85% are acceptable.
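As a rough illustration, the preferred values above could be encoded as a small pre-upload check. The sketch treats the listed resolution as a minimum, which is an assumption (the slide only lists the values), and the names `ScanSpec`, `PREFERRED` and `check_scan` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSpec:
    min_dpi: int
    bit_depth: int


# Preferred values as listed on the slide.
PREFERRED = {
    "bitonal": ScanSpec(min_dpi=600, bit_depth=1),
    "grayscale": ScanSpec(min_dpi=300, bit_depth=8),
    "color": ScanSpec(min_dpi=300, bit_depth=24),
}


def check_scan(mode: str, dpi: int, bit_depth: int) -> list[str]:
    """Return a list of problems; an empty list means the preferred spec is met."""
    spec = PREFERRED[mode]
    problems = []
    if dpi < spec.min_dpi:
        problems.append(f"{mode}: {dpi} dpi is below the preferred {spec.min_dpi} dpi")
    if bit_depth != spec.bit_depth:
        problems.append(f"{mode}: {bit_depth}-bit image, expected {spec.bit_depth}-bit")
    return problems
```

Because BHL also accepts lossy files, a check like this is better used as a warning than as a hard gate.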

Standards and formats to consider

Currently, BHL data can be downloaded as MODS, EndNote and BibTeX. See our wiki page for more information: http://biodivlib.wikispaces.com/Data+Exports#x--MODS

Title metadata, as well as pagination, descriptive and page-order (structural) metadata, are being copied into METS files in the "biodiversity" collection at IA.

The purpose of these METS files is to accommodate our pagination data.

These METS files are pagination-specific and do not include the item/volume information.

If bibliographic metadata for BHL content is required, it can be found in the MODS files on the Data Exports page.

Standards and formats to consider

For the future, we are looking at serving OLEF as an envelope format to share information with other BHL Nodes.

See http://www.bhle.eu/bhl-schema/v0.3/ and http://www.slideshare.net/HeimoRainer/bhleuropemetadataharmonisationtdwg20111018kollerwhrainer/6

Metadata generation and indexing strategy

Each item to be uploaded needs a unique identifier within our central repository, currently Internet Archive (archive.org), and a folder with that name is created to hold the uploaded and generated (derivative) files.

Within BHL we record metadata at 3 levels of bibliographic granularity – Title, Item & Page – as well as metadata for the Creator(s) of the title.
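One way to picture the three levels plus the Creator is as nested records. The sketch below is only an illustration of that hierarchy: the class and attribute names are hypothetical and do not reflect the actual BHL database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Creator:
    name: str
    role: str = ""


@dataclass
class Page:
    sequence: int          # position of the page image within the item
    page_number: str = ""  # printed page number, if any


@dataclass
class Item:
    identifier: str        # unique identifier in the central repository
    volume: str = ""
    pages: list[Page] = field(default_factory=list)


@dataclass
class Title:
    full_title: str
    creators: list[Creator] = field(default_factory=list)
    items: list[Item] = field(default_factory=list)

    def page_count(self) -> int:
        """Total number of pages across all items of this title."""
        return sum(len(item.pages) for item in self.items)
```

A journal would be one `Title` with one `Item` per scanned volume, each holding its own `Page` list.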

Metadata generation and indexing strategy

Scanned material (jp2.zip), basic title-level metadata (marc.xml), item-level metadata (meta.xml) and page-level metadata (scandata.xml) are uploaded to Internet Archive (IA), in the 'biodiversity' collection.

JP2.zip: The compressed JP2 images (Compression Quality 15) that IA will use to deliver pages to the Read Online feature. The filenames follow a very specific naming convention: master image files are named with the local library identifier + a 4-digit sequence number (with no gaps).

MARC.xml: The MARC record for the title from the library catalog in MARCXML format

Title, *Abbreviation, *Creator, Description, Publisher, Start Date Published, End Date Published, Local Library Identifier, *OCLC Number, *ISSN, *ISBN, *Call Number, *Subject, *Language, Date Created, Date Last Modified, *Foreign Keys

Metadata generation and indexing strategy

META.xml: The item level information (even redundant with the title-level information) including the title, author, publisher, copyright information, digitizing sponsor, date published, type of item, and who originally uploaded it. IA may also update this XML file with information as it processes the pages of the item.

Barcode, Sequence, Local Library Identifier, +Start Volume, End Volume, +Start Date, End Date, *Language, Scanning Institution, *Scanning Contributor, *Scanning Sponsor, Date Created, Date Last Modified

SCANDATA.xml: An XML file (scandata.xml) recording information about each page image (handSide, cropBox, original width and height, etc.)

FileName, Sequence, *Page number, *Page Type, Year, Volume, IssuePrefix, Issue, Date Created, Date Last Modified
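To make the page-level file more concrete, here is a minimal sketch that reads per-page entries from a scandata-like document. The `SAMPLE` XML is a simplified, hypothetical stand-in: its element names follow the fields mentioned above (handSide, original width and height) but it should not be taken as the full scandata.xml schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; a real scandata.xml carries more fields.
SAMPLE = """<book>
  <pageData>
    <page leafNum="0">
      <pageType>Cover</pageType>
      <handSide>RIGHT</handSide>
      <origWidth>2000</origWidth>
      <origHeight>3000</origHeight>
    </page>
    <page leafNum="1">
      <pageType>Title</pageType>
      <handSide>LEFT</handSide>
      <origWidth>2010</origWidth>
      <origHeight>3005</origHeight>
    </page>
  </pageData>
</book>"""


def read_pages(xml_text: str) -> list[dict]:
    """Return one dict per <page> element, in document order."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        pages.append({
            "leaf": int(page.get("leafNum")),
            "type": page.findtext("pageType", default=""),
            "hand_side": page.findtext("handSide", default=""),
            "width": int(page.findtext("origWidth", default="0")),
            "height": int(page.findtext("origHeight", default="0")),
        })
    return pages
```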

Metadata generation and indexing strategy

CREATOR: A “Creator” is defined as a person or company responsible for the creation of the Title.

Name, *Role, Date of Birth, Date of Death, Biography

A detailed description of the contents of each one of these files and the whole process of Uploading content to IA is available at: http://biodivlib.wikispaces.com/Upload
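Before uploading, it can help to confirm that an item folder holds all four files described in this section. The sketch below assumes each file is named `<identifier>_<suffix>`; that naming pattern and the function name are assumptions, so check them against the Upload documentation linked above.

```python
from pathlib import Path

# Suffixes as named on the slides: jp2.zip, marc.xml, meta.xml, scandata.xml.
REQUIRED_SUFFIXES = ("jp2.zip", "marc.xml", "meta.xml", "scandata.xml")


def missing_files(item_dir: Path, identifier: str) -> list[str]:
    """Return the expected file names that are not present in item_dir."""
    # The "<identifier>_<suffix>" pattern is an assumption for illustration.
    expected = [f"{identifier}_{suffix}" for suffix in REQUIRED_SUFFIXES]
    return [name for name in expected if not (item_dir / name).is_file()]
```

An empty result means the folder is complete under these assumptions; it says nothing about whether the contents of each file are valid.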

Metadata generation and indexing strategy

Internet Archive runs the OCR process and generates “derivative files” that include:

The files resulting from the OCR process with ABBYY FineReader (djvu, djvu.txt, djvu.xml, abbyy.gz)

A 100x152 pixel GIF with a looping, animated thumbnail of the first 20 pages of a book.

The presentation version on BHL in PDF format.

The MARC record in binary and XML formats.

And others (for a more detailed description, see http://biodivlib.wikispaces.com/Download+All+File+Types+and+Descriptions)

Metadata generation and indexing strategy

The metadata from new items in the BHL collection is added to the database and indexed for use in searches through the Portal and API services.

Periodically, the OCR text of the pages is run through taxonomic name services to mine for new taxon names: currently TaxonFinder (ubio.org) and, soon, GNRDS (Global Names resolution tools and services: resolver.globalnames.org).

Taxa names are added to the database and written back into Internet Archive (names.xml)
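The general shape of that name-mining step (find candidate strings, then keep those that can be verified) can be sketched with a deliberately naive pattern. This is not TaxonFinder or any Global Names algorithm: the regular expression, the function names and the `known_names` set are all illustrative assumptions.

```python
import re

# Naive pattern: capitalized word followed by a lowercase word, e.g. "Quercus alba".
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def candidate_names(ocr_text: str) -> list[str]:
    """Return unique candidate name strings in order of first appearance."""
    seen = []
    for genus, epithet in BINOMIAL.findall(ocr_text):
        name = f"{genus} {epithet}"
        if name not in seen:
            seen.append(name)
    return seen


def verified_names(candidates: list[str], known_names: set[str]) -> list[str]:
    """Keep only the candidates found in a reference list of known names."""
    return [name for name in candidates if name in known_names]
```

A pattern this loose also matches ordinary phrases such as "The oak", which is why the verification step against a reference list matters.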

Online Platform

Capture System

Scribe machines

Macaw

Publication

BHL Portal

BookViewer

PDF Generator

Online Platform

Publication

BHL API (biodivlib.wikispaces.com/Developer+Tools+and+API)

The BHL Application Programming Interface (API) is a set of REST-like web services that can be invoked via HTTP queries (GET/POST requests) or SOAP.

Responses can be received in one of three formats: JSON, XML, or XML wrapped in a SOAP envelope.

We are currently developing a new API v3, closer to a RESTful design than previous versions, using resource-centric URLs (where possible) and GET/PUT/POST/DELETE verbs.
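As a hedged illustration of an HTTP GET query, the sketch below only builds a request URL and does not send it. The endpoint path, the `op`, `apikey` and `format` parameter names and the `NameSearch` operation are assumptions modeled on API v2; confirm them against the developer documentation linked above before use.

```python
from urllib.parse import urlencode

# Assumed API v2 endpoint; verify against the developer documentation.
API_BASE = "https://www.biodiversitylibrary.org/api2/httpquery.ashx"


def build_query(op: str, api_key: str, fmt: str = "json", **params: str) -> str:
    """Build a GET URL for one API operation (no network access here)."""
    if fmt not in ("json", "xml"):
        raise ValueError("format must be 'json' or 'xml'")
    query = {"op": op, **params, "apikey": api_key, "format": fmt}
    return f"{API_BASE}?{urlencode(query)}"


# Example: search for a name; "YOUR_KEY" is a placeholder, not a real key.
url = build_query("NameSearch", "YOUR_KEY", name="Chordata")
```

Keeping URL construction separate from the HTTP call makes the query easy to log and test without an API key.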

Online Platform

Publication

Data Exports (biodivlib.wikispaces.com/Data+Exports)

Online Platform

Management

BHL Admin Dashboard

Admin Functions (Alert Message, Image Server, Collections, Institutions, Languages, Page Types, PDF Requests, Segment Types)

Library Functions (Titles/Items/Segments /Pagination/Authors)

Science Functions (Names (Taxa) on a Page)

Library Statistics (Titles/Items/Pages/Names/Segments/Items with Segments, Names, Pages with Names)

Growth Statistics (Titles/Items/Pages/Names/Segments new this Month/Year)

Online Platform

Management

BHL Admin Dashboard

PDF Generation Statistics (Generated: 174,162)

Internet Archive Harvesting Statistics (Complete: 119,125 items)

BioStor Harvest Statistics (Published: 11,126 as of Aug. 29, 2013)

DOI Assignment Statistics (DOI Approved: 57,338 as of Aug 29, 2013)

Web Traffic Statistics (API v2, OpenURL)

Reports (Item Pagination, Title Import History, Character Encoding Problems, DOIs by Institution, Monographic Contributions, Items by Contributor)

Deduplication

• We try to avoid duplication where possible

• Tools

• Serials = Scanlist

• Monographs = Monographic deduper

• Check the BHL before you send for scanning

• We do our best but duplication happens

• Post-digitization, we merge titles as necessary
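The "check before you send for scanning" step amounts to matching a candidate against what is already recorded. The sketch below shows one simple way to do that with a normalized title key; it is not how Scanlist or the monographic deduper work internally, and the normalization rules and function names are assumptions.

```python
import re
import unicodedata

LEADING_ARTICLES = {"the", "a", "an", "el", "la", "los", "las"}


def title_key(title: str, year: str = "") -> str:
    """Normalize a title so trivially different records compare equal."""
    text = unicodedata.normalize("NFKD", title)
    # Drop accents, then case and punctuation.
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    words = re.sub(r"[^a-z0-9 ]+", " ", text).split()
    if words and words[0] in LEADING_ARTICLES:
        words = words[1:]
    return " ".join(words) + (f"|{year}" if year else "")


def find_duplicates(records: list[tuple[str, str, str]]) -> list[list[str]]:
    """records: (record_id, title, year). Return groups of ids sharing a key."""
    groups: dict[str, list[str]] = {}
    for rec_id, title, year in records:
        groups.setdefault(title_key(title, year), []).append(rec_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Exact-key matching misses variant editions and abbreviated titles, so a human review of near matches is still needed.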

Online Platform

Management

Monographic Deduping Tool

The MBLWHOI Library has been working on a tool that assists with de-duplicating the monographs that BHL members are sending to IA for scanning.

The application is ready for use and it's entirely web-based, requiring no client or user configuration.

The monographic deduper acts as a master database that contains records for all of the monographs that any BHL partner institution has scanned.

Online Platform

Management

Monographic Deduping Tool

In addition, a process is in place that allows material ingested from the Internet Archive, but not contributed by a BHL partner institution, to be added to the deduper database.

Ultimately, the monographic deduper database should be seen as a living record of accountability that communicates, to staff collaborating in the BHL network, a partner's promise to digitize a particular monographic title.

Online Platform

Management

Serials Bid List

It is a catalogue that allows users to browse and search Serials titles held by BHL member institutions using advanced filtering.

Technical Group at MBG

Mike Lichtenberg Developer

Trish Rose-Sandler Data Analyst

William Ulate Technical Director

Technical Support

MBG IT Division

Manages servers, systems and telecommunications.

Installs the software needed.

And others:

MBL

Smithsonian

Internet Archive

BHL-Australia

BHL-Europe

Technical Advisory Group

BHL Architecture: Window Seat Ed.

[Architecture diagram; labels recovered from the slide]

Storage: Internet Archive (Images (JP2), PDF, Coordinate-based OCR, XML metadata); BHL DB

Logic: Data Transform Utilities, Geocoding, Name Finding

Access: APIs, UI, Data Exports

Firewall

Projects

Global Names

Art of Life

Purposeful Gaming

Digging into Data

Scientific Name Extraction

TaxonFinder algorithm in production since 2008

More than 100 million candidate name strings

More than 1.5 million unique, verified names

Available through UI, APIs, Data Exports & Internet Archive

New collaboration with Global Names project

Improved algorithm, better precision & recall

More data with TaxonFinder and Neti Neti!

http://gnrd.globalnames.org/

Taxon Names

BEFORE

Name Instances 101,591,803 101,288,804

Unique Names 7,498,554 7,464,924

Verified Names 1,905,507 1,902,803

EOL Names 63,130,350 62,963,582

EOL Pages 13,579,868 13,532,684

AFTER

Name Instances 151,222,182 150,066,425

Unique Names 29,246,382 29,091,767

Verified Names 10,153,165 10,109,540

EOL Names 87,791,695 87,135,089

EOL Pages 15,466,713 15,342,867

Article-level metadata

Chapter-level metadata

Treatment-level metadata

Part-level metadata

Articles in the BHL UI

See also:

Related Titles

Digitization workflow

1. Titles vs. Items vs. Segments

2. Metadata we need: • MARC for book and journal titles • Volume information • Page data

BHL Term     | Titles                 | Items         | Segments
Library Term | Book or Journal Titles | Volume, Piece | Articles, Book chapters,
Meaning      | Conceptual unit        | Object        | Section of consecutive pages

Art of Life

Reviewing Metadata

Manually built:

1,714 sets

89,457 images

Purposeful Gaming

*E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cix bIa � S &3rn~ 41X a�m cv(f b1air�'o�et ert oiensr �; �', :�hlrfc�c wa ff�4am.diug bist a 6aiw~s ff oJrJtwt nof bL4ecImt& blfafra mem b t wag `wr 4 cn wiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tif vrmr Waff C * t6rmnli an `tn�ciblatGteaM w ?ffoaifrn w4wmeu nu weib e , wpiteI voE5teiri ct c ober gtUcr cit cm` 91 cLi biar J ' >bSciatl�Oiff ;Bruet wacfttc n qmcx b1a bl: bt5c lttmtt bb9 lkr w.llr#e iti ncn xoa ff cu :r trtuft *e t � B Rn "� trv W1Rt' ?Cm c blas waIwutr Ober �ci ti 1V Ces ' wt gbtiemwwajfu tpctt, afferain 9 c: b�titbfof �r f eran m rs bra wlg auig4;f aer�m *mc vrt blatcabtfm wfru an'deg~m rt blas Iaum bwWt� run f ncmai b14ianf tJobrrfan ebrut4net vnber Brwt Ober awawi*m.crriii btafwfm uww c on$ 'it ttu wttkc 5,10 $ m~C fca trc* cx u W�e�&mcyfbq4 Mabtt mmw rc a iiu bc Jcn ncI.end.*, blat s. a\ u:�rprd3 rw4ftf wm c ii,+ ttCC tn wa frr9fr orfab fcfbt enb c optiti bt -r9 ceDa ttDcn i34M sn Sem i

OCR Improvements

Gaming

Transcription

OCR Improvements

Transcription

Purposeful Gaming

Looking at…

Crowdsourcing Markup & Annotation

Purposeful Gaming DIGITALKOOT

Joint project run by the National Library of Finland and Microtask to index the library's enormous archives so that they are searchable on the Internet for easier access to the Finnish cultural heritage.

Purposeful Gaming DIGITALKOOT

Launched on Feb 8, 2011; by Nov 29, 2012, nearly 110,000 participants had completed over 8 million word-fixing tasks.

DigiTalkoot enabled volunteers to participate in this fixing work by playing games.

Purposeful gaming and BHL: engaging the public in improving and enhancing access to digital texts

IMLS Grant Program: National Leadership Grants for Libraries

Partners:

Missouri Botanical Garden

Harvard University

Cornell University

New York Botanical Garden

P.I.: Trish Rose-Sandler, Missouri Botanical Garden

Dates: Dec 2013 – Nov. 2015

Project objectives and benefits

Test new means of crowdsourcing to support the enhancement of content in BHL

Demonstrate whether digital games are an effective tool for analyzing and improving digital outputs from OCR and transcription

Benefits of gaming include:

improved access to content by providing richer and more accurate data;

an extension of limited staff resources; and

exposure of library content to communities who may not know about the collections otherwise.

OCR Improvements

German text interpreted by the OCR process as:

“unb auf ben ©elnrgen be6 fublic{)en”

AOCR Improvements

Different resulting texts from parsing the phrase: “und auf den Gebirgen des südlichen Deutschlands”

(“and on the mountains of southern Germany”)

# | IA OCR          | OCR 2        | Transcription 1 | Transcription 2 | Status
1 | unb             | und          | und             | und             | Ok
2 | den             | ben          | den             | den             | Ok
3 | ©elnrgen        | ©ebirgen     | Bebirgen        | Gebirgen        | X
4 | be6             | des          | de5             | des             | Chk
5 | fublic{)en      | fublichen    | Füdlichen       | Südlichen       | X
6 | £)eittfc{)(anb6 | Deutfchlanbs | Deutfchlands    | Deutschlands    | X
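The Ok / Chk / X marks in the comparison above are consistent with a simple agreement rule across the four readings of each word: three or more identical readings give "Ok", exactly two give "Chk" (check), and no agreement gives "X". That rule is inferred from the six rows shown, not stated on the slide, and the sketch below is only one way to express it.

```python
from collections import Counter

# The six rows from the comparison: IA OCR, OCR 2, Transcription 1, Transcription 2.
ROWS = [
    ["unb", "und", "und", "und"],
    ["den", "ben", "den", "den"],
    ["©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"],
    ["be6", "des", "de5", "des"],
    ["fublic{)en", "fublichen", "Füdlichen", "Südlichen"],
    ["£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"],
]


def classify(readings: list[str]) -> str:
    """Label a word by how many of its readings agree exactly (inferred rule)."""
    top_count = Counter(readings).most_common(1)[0][1]
    if top_count >= 3:
        return "Ok"
    if top_count == 2:
        return "Chk"
    return "X"


labels = [classify(row) for row in ROWS]
```

Words labeled "Chk" or "X" are the ones worth routing to a game or a human reviewer, since the automated sources cannot settle them.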

Purposeful Gaming

iDigBio's aOCR Hackathon

Improve OCR parsing of labels with clear metrics (datasets, output formats, scoring algorithm)

Libraries of regular expressions to clean up each field (different error correction for latitude/longitude coordinates than for personal names or herbarium catalog numbers)

Tool for classifying segments of the image before submitting to OCR

Do a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR
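The idea of field-specific cleanup can be sketched as one small cleaner per field type. This is a toy illustration, not code from the hackathon: the character substitutions, field names and function names are assumptions chosen to show why a numeric field and a name field need opposite corrections.

```python
import re

# Numeric fields: letters that OCR confuses with digits become digits.
TO_DIGITS = str.maketrans("OolI", "0011")
# Name fields: the reverse, stray digits become letters.
TO_LETTERS = str.maketrans("01", "ol")


def clean_decimal_coordinate(raw: str) -> str:
    """Repair digit look-alikes, then keep only sign, digits and decimal point."""
    fixed = raw.translate(TO_DIGITS).replace(",", ".")
    return re.sub(r"[^0-9.\-]", "", fixed)


def clean_person_name(raw: str) -> str:
    """Map stray digits back to letters and collapse whitespace."""
    fixed = raw.translate(TO_LETTERS)
    return re.sub(r"\s+", " ", fixed).strip()


CLEANERS = {
    "latitude": clean_decimal_coordinate,
    "longitude": clean_decimal_coordinate,
    "collector": clean_person_name,
}


def clean_record(record: dict[str, str]) -> dict[str, str]:
    """Apply the cleaner registered for each field; just strip unknown fields."""
    return {key: CLEANERS.get(key, str.strip)(value) for key, value in record.items()}
```

Keeping the cleaners in a per-field table makes it easy to add, for example, a catalog-number rule without touching the others.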

iDigBio's CITScribe Hackathon

1. Interoperability between public participation tools and biodiversity data systems

2. Transcription quality assessment/quality control (QA/QC) and the reconciliation of replicate transcriptions

3. Integration of optical character recognition (OCR) into the transcription workflow

4. User engagement

NfN & iDigBio's CITScribe Hackathon

Jason Best's DarwinScore

Ben Brumfield's Handwriting Gibberish Detector

Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names)

Word clouds created using n-gram scoring, faceting, and Solr for indexing, plus Carrot2 for specimen selection (to visualize and explore how a word of interest from the word cloud is used), and a data-cleaning step (the system highlights infrequent words).

NESCent EOL-BHL Research Sprint

There is no place like home: Defining “habitat” for biodiversity science

Robert D. Stevenson

UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393

Carl Nordman (Natureserve) and

Evangelos Pafilis

Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece

NESCent EOL-BHL Research Sprint

Assessing Risk Status of Mexican Amphibians Through Data Mining.

Esther Quintero and Bárbara Ayala

National Commission for Knowledge and Use of Biodiversity (CONABIO)

and

Anne Thessen

Marine Biological Laboratory and Arizona State University

Planning for global change: using species interactions in conservation

Nicole F. Angeli, Emma P. Gomez, Margot A. Wood,

Applied Biodiversity Sciences Program, Texas A&M University, College Station, Texas

Tweet me @auratus_nicole

and

Javier Otegui

University of Colorado-Boulder

There is no place like home: Defining “habitat” for biodiversity science

Robert D. Stevenson

UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393

Carl Nordman (Natureserve)

Evangelos Pafilis

Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece

http://epafilis.info/

Evolution in the usage of anatomical concepts in the biodiversity literature

Todd Vision, Prashanti Manda, and Dongye Meng

University of North Carolina at Chapel Hill

Some preliminary observations…

Our API seemed to work fine

Access via a taxon (or a group), for example:

“I want to harvest all pages with names from this taxon (Chordata) or this common name (Vertebrate)”.

Groups started getting results after 2.5 days.

The structure of BHL was explained so researchers could understand the title, item, page and part levels and define what they wanted. For example, one group was looking for terms in the titles and in the parts' titles.

Others said they would harvest the OCR from IA, although they would not be able to harvest the text at page-by-page granularity (only at item level).

Mining Biodiversity

Mining Biodiversity: Enriching Biodiversity Heritage with Text Mining and Social Media

One of the international projects that won in the third round of the 2013 Digging Into Data Challenge

Promote the development of innovative computational techniques to apply to big data in the humanities and social sciences

The National Centre for Text Mining (UK)

Missouri Botanical Garden (US)

Dalhousie University's Big Data Analytics Institute (Canada)

Social Media Lab (Canada)

MiBIO: Mining Biodiversity

1. Automatic error correction of OCR text errors.

2. Crowdsource annotation of legacy texts with semantic metadata.

3. Adapt text mining techniques to extract terminology, entities and significant events automatically and to track terminology evolution over time.

4. Use interactive visualization techniques to help users manage search results through next-generation browsing capabilities, assisted by a semantic similarity network of important terms and entities.

5. Design of a social media layer, serving as an environment for diverse users to interact and collaborate on science, public education, awareness and outreach.

Crowdsource Markup

Display text                               | Species Profile Model category
General/summary                            | TaxonBiology
Geographic range                           | Distribution
Habitat                                    | Habitat
Food sources and feeding behavior          | TrophicStrategy
Physical description (general)             | Description
Physical description (detailed morphology) | DiagnosticDescription
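In a markup tool, the mapping above would typically be a lookup from the label shown to the volunteer to the category stored with the annotation. The sketch below shows that lookup; the dictionary contents come from the slide, while the function name and the shape of the returned record are hypothetical.

```python
# Display text shown to volunteers -> Species Profile Model category (from the slide).
SPM_CATEGORY = {
    "General/summary": "TaxonBiology",
    "Geographic range": "Distribution",
    "Habitat": "Habitat",
    "Food sources and feeding behavior": "TrophicStrategy",
    "Physical description (general)": "Description",
    "Physical description (detailed morphology)": "DiagnosticDescription",
}


def tag_passage(text: str, display_label: str) -> dict[str, str]:
    """Attach the SPM category for the chosen display label to a text passage."""
    if display_label not in SPM_CATEGORY:
        raise KeyError(f"unknown display label: {display_label}")
    return {"text": text, "category": SPM_CATEGORY[display_label]}
```

Showing plain-language labels while storing the model category keeps the interface approachable without losing the controlled vocabulary.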

Visit to NaCTeM, Feb. 17, 2014

NaCTeM's biodiversity-relevant tools

ANNOTATION PLATFORM

Remote Processing Workflows processed on remote machines. No attendance needed

Workflows GUI for creating single-flow and multi-branch workflows

Workflow Designer

User Interaction Annotation Editor allows for making changes while processing Annotator/Curator

Web Service

Third-party applications

Processing Components Data (de)serialisation, search engines, NLP, NER, etc.

Developers

Workflows view

Processes View

Documents view

Workflow editor

Workflow as a Web service

http://argo.nactem.ac.uk/test/services/webservice/314

INPUT

OUTPUT

NAMED ENTITY RECOGNISERS AND NORMALISERS

Automatically recognised named entities

Linking to external dictionaries

Species and habitat recognition

EVENT EXTRACTORS

Events: associations between entities

SEMANTIC SEARCH

TERM EXTRACTION

Ryerson University SocialLab's Netlytic.org

http://miningbiodiversity.com/ http://miningbiodiversity.org/

Thank you William Ulate

BHL Technical Director

Missouri Botanical Garden

Skype: william_ulate_r

Thank you!

And thanks to Bianca Crowley for the workflow slides