BIS TDWG Conference, New Orleans, 2011

GBIF: Issues in providing federated access to digital information related to biological specimens

David Remsen
Senior Programme Officer
Global Biodiversity Information Facility (GBIF)

3 issues

Issue #1: The consequences of scale

Issue #2: Geospatial integration

Issue #3: Taxonomic integration

Issue #1: The consequences of scale

Goal – Provide timely access to a large federated network of biodiversity databases

About GBIF

• 341 publishers
• 9290 datasets
• 310M records

The mission of the Global Biodiversity Information Facility (GBIF) is to facilitate free and open access to biodiversity data worldwide via the Internet to underpin sustainable development.

• 57 countries
• 45 organisations

“Wrapper” Software

Your database (e.g., Insect Collection, Bird Observations, Herbarium)

Install one of these ‘wrappers’:

• PyWrapper (Python)
• TAPIR Link (PHP)
• DiGIR (PHP)

Data formats: DarwinCore, ABCD

The promise of federation

Providers: Insect Collection, Herbarium, Bird Observations, Herbarium

User: Any specimens from Thailand?

GBIF Data Portal: I will ask!

Providers: I do! I do! I do! Nope!

GBIF Data Portal as a Gateway

The challenge of federation

Providers: Insect Collection, Herbarium, Bird Observations, Herbarium

GBIF Data Portal: Hello?

Replies: Server Not Available / Server Not Available / Hi!

The rise of Indexing

Providers: Insect Collection, Herbarium, Bird Observations, Herbarium

User: Any data records from Thailand?

GBIF Data Portal (now with Data!): Send me a copy of your data

GBIF Data Portal as a Data Index

The wrong tools for the job

Providers: Insect Collection, Herbarium, Bird Observations, Herbarium

User: Any data records from Thailand?

GBIF Data Portal (now with Data!): Send me a copy of your data once per month

Providers: Here is page one. If I go offline, start again. Not too fast! You ask the same questions every time.

TAPIR request example

• dataset of 260,000 specimens

• 200 records retrieved per request

• requires 1300 request/response pairs

• over 9 hours to complete

• 500 MB of XML data is transferred

• becomes 32 MB text file in the GBIF server

• 32 MB is compressible to 3 MB zip file
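The figures above can be checked with a little arithmetic. The sketch below uses only the numbers quoted on the slide; the variable names are illustrative, and because the slide says "over 9 hours", the per-request time is a lower bound rather than a measured value.

```python
import math

# Figures quoted on the slide.
total_records = 260_000
records_per_request = 200
total_hours = 9                      # "over 9 hours", so a lower bound
xml_mb, text_mb, zip_mb = 500, 32, 3

# Request/response pairs needed to page through the whole dataset.
requests_needed = math.ceil(total_records / records_per_request)

# Average wall-clock time per request/response pair (lower bound).
seconds_per_request = total_hours * 3600 / requests_needed

# How much larger the XML transfer is than the text and zipped forms.
xml_vs_text = xml_mb / text_mb
xml_vs_zip = xml_mb / zip_mb

print(f"{requests_needed} requests, about {seconds_per_request:.0f} s each")
print(f"XML is about {xml_vs_text:.0f}x the text file and {xml_vs_zip:.0f}x the zip")
```

Roughly 25 seconds per page of 200 records, and a transfer well over a hundred times larger than the zipped text, is what makes paged XML harvesting a poor fit for indexing large datasets.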

Darwin Core Archives

A text-based solution to publishing biodiversity data

A Refined Approach

Providers: Insect Collection, Herbarium, Bird Observations, Herbarium

Any data records from Thailand?

This is fast!

GBIF Data Portal (now with Data!)

This is easy

URL, URL, URL, URL

- index very large data sets

- reduce latency
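As a rough illustration of why a zipped text file is easy to index, the sketch below builds a tiny stand-in archive in memory and filters it by country. It is a simplification under stated assumptions: the file name `occurrence.txt`, the three columns, and the rows are made up for the example, and a real Darwin Core Archive also carries a descriptor file that maps columns to terms, which is omitted here.

```python
import csv
import io
import zipfile

# Made-up rows for illustration only; the column names follow Darwin Core terms.
rows = [
    ["occurrenceID", "scientificName", "country"],
    ["1", "Oenanthe javanica", "Thailand"],
    ["2", "Trochilidae", "Brazil"],
    ["3", "Oenanthe javanica", "Thailand"],
]

# Write the rows as a tab-delimited text file inside an in-memory zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    text = io.StringIO()
    csv.writer(text, delimiter="\t", lineterminator="\n").writerows(rows)
    archive.writestr("occurrence.txt", text.getvalue())


def records_from(archive_bytes, country):
    """Return rows of the zipped tab-delimited file whose country matches."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        with archive.open("occurrence.txt") as raw:
            lines = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(lines, delimiter="\t")
            return [row for row in reader if row["country"] == country]


thai_records = records_from(buffer.getvalue(), "Thailand")
print([record["occurrenceID"] for record in thai_records])
```

The whole dataset arrives in one download from one URL, so there is no paging protocol to restart when a provider goes offline.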

Growth

[Chart: time axis 2007, 2008, 2009, 2010, Today; data labels 70 million, 147 million, 180 million, 201 million, 302 million; annotation: "Need for a new standard identified"]

Issue #2: Geospatial Integration

Goal – Provide accurate reporting of nationally-bound data

Challenge – Inaccurate recording of geospatial coordinates

Geo-referenced USA data

Verbatim data as shared on the network

Issue #2: Geospatial Integration

Remediation includes:

• Use of country boundary shapefiles to verify that coordinates fall within them
– Including EEZ boundaries
– Including islands
• Outliers identified
• Nature of the error qualified (e.g., “coordinates inverted”)
• Offending records marked and omitted from display
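The remediation steps above can be sketched as a point-in-polygon test followed by a few "what if" corrections. This is an illustration, not GBIF's implementation: `BOUNDARY` is a hypothetical, very coarse box standing in for a real country shapefile, and the two sign-flip checks are assumptions added alongside the "coordinates inverted" case named on the slide.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon [(lon, lat), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal ray running east from the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical coarse box (lon, lat) used in place of a country shapefile.
BOUNDARY = [(-125.0, 25.0), (-66.0, 25.0), (-66.0, 49.0), (-125.0, 49.0)]


def qualify(lat, lon, boundary=BOUNDARY):
    """Return "ok", a label naming the likely error, or "outside boundary"."""
    if point_in_polygon(lon, lat, boundary):
        return "ok"
    # Try simple corrections; the first one that lands inside names the error.
    candidates = {
        "coordinates inverted": (lon, lat),     # lat and lon swapped
        "longitude sign flipped": (lat, -lon),  # assumed extra check
        "latitude sign flipped": (-lat, lon),   # assumed extra check
    }
    for label, (fixed_lat, fixed_lon) in candidates.items():
        if point_in_polygon(fixed_lon, fixed_lat, boundary):
            return label
    return "outside boundary"


print(qualify(40.0, -100.0))
print(qualify(-100.0, 40.0))
print(qualify(40.0, 100.0))
```

Records that come back with anything other than "ok" would be the ones marked and omitted from display.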

Geo-referenced USA data

Data following interpretation
- Coastal regions recognised
- Offshore islands recognised

Issue #3: Taxonomic Integration

• Goal – Provide access to biodiversity data according to taxonomic groups and concepts
• Challenge –
– Heterogeneous and sometimes inaccurate classification
  • Same taxon appearing in different classifications
– Presence of homonyms that complicate reconciling the above
– Misspellings
– Wide range of orthographies for the same name

Enabling authoritative taxonomic data to be published through GBIF

Trochilidae (Hummingbirds) (today)

Misinterpretations (Hummingbirds are restricted to the Americas)

Trochilidae (Hummingbirds) (next month)

Improved interpretation

Search for Oenanthe (water dropwort plant or wheatear bird)

Today: Difficult for user to interpret

Next month: Accurate search results, resolution of homonyms

Improved means to match names to authority files
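One way to picture matching names to an authority file is fuzzy string matching plus a context field to separate homonyms. The sketch below is an assumption-laden illustration, not GBIF's matching service: `AUTHORITY` is a hypothetical three-entry authority file, `match_name` is a made-up helper, and the similarity cutoff of 0.8 is arbitrary.

```python
import difflib

# Hypothetical miniature authority file: (name, kingdom) pairs.
AUTHORITY = [
    ("Oenanthe", "Plantae"),     # water dropwort
    ("Oenanthe", "Animalia"),    # wheatear
    ("Trochilidae", "Animalia"),
]


def match_name(verbatim, kingdom=None, cutoff=0.8):
    """Return authority entries matching a possibly misspelled name."""
    # Normalise whitespace and capitalisation before comparing.
    cleaned = " ".join(verbatim.split()).capitalize()
    known = sorted({name for name, _ in AUTHORITY})
    close = difflib.get_close_matches(cleaned, known, n=1, cutoff=cutoff)
    if not close:
        return []
    hits = [(name, k) for name, k in AUTHORITY if name == close[0]]
    # A homonym yields several hits; the kingdom narrows them down.
    if kingdom is not None:
        hits = [(name, k) for name, k in hits if k == kingdom]
    return hits


print(match_name("trochillidae"))
print(match_name("Oenanthe"))
print(match_name("Oenanthe", kingdom="Animalia"))
```

Without the kingdom, a search for Oenanthe returns both the plant and the bird, which is the ambiguity the slide describes as difficult for a user to interpret.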

In summary

• GBIF has had to deploy different data access strategies in order to effectively scale
• Darwin Core Archive offers a scalable solution that has led to rapid growth in data published through GBIF
• Geospatial filtering via shapefiles provides a basis for more accurate national reporting
– Basis for additional services later (e.g., ecosystem shapefiles, protected areas, etc.)
• Heterogeneous taxonomy inherent to collections data is nearly impossible to consolidate into a taxonomically accurate structure.
– Comprehensive authoritative taxonomic data is a key organisational component of collections data

Thank you