BIS TDWG Conference, New Orleans, 2011
GBIF: Issues in providing federated access to digital information related to biological specimens
David Remsen, Senior Programme Officer, Global Biodiversity Information Facility (GBIF)
3 issues
Issue #1: The consequences of scale
Issue #2: Geospatial integration
Issue #3: Taxonomic integration
Issue #1: The consequences of scale
Goal – Provide timely access to a large federated network of biodiversity databases
About GBIF
• 341 publishers
• 9290 datasets
• 310M records
The mission of the Global Biodiversity Information Facility (GBIF) is to facilitate free and open access to biodiversity data worldwide via the Internet to underpin sustainable development.
• 57 countries
• 45 organisations
“Wrapper” Software
PyWrapper (Python)
TAPIR Link (PHP)
DiGIR (PHP)
Your database
Insect Collection
Install one of these ‘wrappers’
ABCD
Bird Observations
Herbarium
Data
DarwinCore
DarwinCore
The promise of federation
Insect Collection, Herbarium, Bird Observations, Herbarium
Any specimens from Thailand?
GBIF Data Portal
I will ask!
I do! I do! I do! Nope!
GBIF Data Portal as a Gateway
The challenge of federation
Insect Collection, Herbarium, Bird Observations, Herbarium
Hello?
Server Not Available
Server Not Available
GBIF Data Portal
Hi!
The rise of Indexing
Insect Collection, Herbarium, Bird Observations, Herbarium
Any data records from Thailand?
Send me a copy of your data
GBIF Data Portal (now with Data!)
GBIF Data Portal as a Data Index
The wrong tools for the job
Insect Collection, Herbarium, Bird Observations, Herbarium
Any data records from Thailand?
Send me a copy of your data once per month
Here is page one.
If I go offline, start again
Not too fast!
You ask the same questions every time
GBIF Data Portal (now with Data!)
TAPIR request example
• dataset of 260,000 specimens
• 200 records retrieved per request
• requires 1300 request/response pairs
• over 9 hours to complete
• 500 MB of XML data is transferred
• becomes a 32 MB text file on the GBIF server
• 32 MB is compressible to a 3 MB zip file
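The figures above can be checked with a few lines of arithmetic. This is a minimal sketch, not GBIF code: the function name and its parameters are hypothetical, and the 9-hour figure is treated as a lower bound because the slide says "over 9 hours".

```python
import math


def paged_harvest_cost(total_records, page_size, total_hours):
    """Return (number of requests, average seconds per request) for a paged harvest."""
    requests = math.ceil(total_records / page_size)
    seconds_per_request = total_hours * 3600 / requests
    return requests, seconds_per_request


# Figures taken from the slide.
requests, seconds_per_request = paged_harvest_cost(260_000, 200, 9)

# Size ratios between the three representations of the same data.
xml_mb, text_mb, zip_mb = 500, 32, 3
xml_vs_text = xml_mb / text_mb   # overhead of the XML envelope
xml_vs_zip = xml_mb / zip_mb     # overall saving of a zipped text file

print(f"{requests} requests, about {seconds_per_request:.1f} s each")
print(f"XML is {xml_vs_text:.1f}x the text size and {xml_vs_zip:.1f}x the zip size")
```

On these numbers each request/response pair takes roughly 25 seconds on average, and the XML transfer is more than a hundred times larger than the zipped text it ends up as.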
A Refined Approach
Insect Collection, Herbarium, Bird Observations, Herbarium
Any data records from Thailand?
This is fast!
GBIF Data Portal (now with Data!)
This is easy
URL URL URL URL
- index very large data sets
- reduce latency
Growth
Years: 2007, 2008, 2009, 2010, Today
Records: 70 million, 147 million, 180 million, 201 million, 302 million
Need for a new standard identified
Issue #2: Geospatial Integration
Goal – Provide accurate reporting of nationally-bound data
Challenge – Inaccurate recording of geospatial coordinates
Issue #2: Geospatial Integration
Remediation includes:
• Use of country boundary shapefiles to verify that coordinates fall within them
– Including EEZ boundaries
– Including islands
• Outliers identified
• Nature of the error qualified (e.g., “coordinates inverted”)
• Offending records marked and omitted from display
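The boundary check and the "coordinates inverted" diagnosis can be sketched as follows. This is an illustration under stated assumptions, not GBIF's implementation: the function names are hypothetical, the polygon is a crude rectangle standing in for a real country shapefile, and only the one error type named on the slide is diagnosed.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def classify_record(lon, lat, polygon):
    """Label a record as ok, inverted, or outside the claimed country."""
    if point_in_polygon(lon, lat, polygon):
        return "ok"
    # If swapping latitude and longitude lands inside, qualify the error.
    if point_in_polygon(lat, lon, polygon):
        return "coordinates inverted"
    return "outside boundary"


# Assumption: a rough bounding box, NOT a real national boundary.
ROUGH_THAILAND_BOX = [(97.0, 5.0), (106.0, 5.0), (106.0, 21.0), (97.0, 21.0)]

print(classify_record(100.5, 13.75, ROUGH_THAILAND_BOX))
print(classify_record(13.75, 100.5, ROUGH_THAILAND_BOX))
print(classify_record(0.0, 0.0, ROUGH_THAILAND_BOX))
```

Records that come back with anything other than "ok" would be the ones marked and omitted from display. Real boundaries with islands and EEZ zones need multi-polygon data rather than a single ring.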
Geo-referenced USA data
Data following interpretation
- Coastal regions recognised
- Offshore islands recognised
Issue #3: Taxonomic Integration
• Goal – Provide access to biodiversity data according to taxonomic groups and concepts
• Challenge –
– Heterogeneous and sometimes inaccurate classification
• Same taxon appearing in different classifications
– Presence of homonyms that complicate reconciling above
– Misspellings
– Wide range of orthographies for the same name
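Two of the challenges above, misspellings and homonyms, can be sketched with the Python standard library. This is an illustration only: the reference list, the function name, and the 0.8 similarity cutoff are all assumptions, and a real name index would be far larger and carry authorship data.

```python
import difflib

# Hypothetical reference list: genus name -> kingdoms in which it is used.
# Oenanthe is a genuine homonym (a plant genus and a bird genus).
REFERENCE = {
    "Oenanthe": ["Plantae", "Animalia"],
    "Quercus": ["Plantae"],
}


def resolve(name, reference=REFERENCE, cutoff=0.8):
    """Return (status, matched_name, kingdoms) for a supplied genus name."""
    if name in reference:
        matched = name
    else:
        # Fuzzy lookup to catch misspellings and orthographic variants.
        close = difflib.get_close_matches(name, list(reference), n=1, cutoff=cutoff)
        if not close:
            return ("unmatched", None, [])
        matched = close[0]
    kingdoms = reference[matched]
    if len(kingdoms) > 1:
        # The name alone cannot decide which taxon is meant.
        return ("homonym", matched, kingdoms)
    return ("exact" if matched == name else "fuzzy", matched, kingdoms)


print(resolve("Oenanthe"))
print(resolve("Qercus"))
print(resolve("Zzzz"))
```

A "homonym" result signals that the record needs extra context, such as the supplied kingdom or family, before it can be placed in a classification.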
Search for Oenanthe (water dropwort plant or wheatear bird)
Difficult for user to interpret
Accurate search results
Today
Next month
resolution of homonyms
In summary
• GBIF has had to deploy different data access strategies in order to effectively scale
• Darwin Core Archive offers a scalable solution that has led to rapid growth in data published through GBIF
• Geospatial filtering via shapefiles provides a basis for more accurate national reporting
– Basis for additional services later (e.g., ecosystem shapefiles, protected areas, etc.)
• Heterogeneous taxonomy inherent to collections data is nearly impossible to consolidate into a taxonomically accurate structure.
– Comprehensive authoritative taxonomic data is a key organisational component of collections data