Presentation given at the Biodiversity Informatics Horizons Meeting in Rome, Italy, 3-6 September 2013.
Vince Smith
The biodiversity informatics landscape: a systematics perspective
Biodiversity Informatics Horizons Rome, 3-6 Sept 2013
Overview
1. Background – the biodiversity informatics domain
• The problem (i.e. why are we here)
• Representations of the domain (data, infrastructures, projects…)
• Toward an integrated view (strategy)
2. Social challenges
• Openness
• Collaboration and communities
• Standards, identifiers & protocols
3. (Big) data challenges
• Mobilizing existing data (metadata, literature, collections)
• New forms of data ([meta]genomics & observatories)
4. Synthetic challenges
• Data aggregation & linking
• Visualisation
• Modeling
5. Next steps (data infrastructures & funding)
• Lessons learned: new informatics opportunities in H2020
1. Background
The problem – integrating biodiversity research
How do we join up these activities? How do we use this as a tool?
Species conservation & protected areas
Impacts of human development
Biodiversity & human health
Impacts of climate change
Food, farming & biofuels
Invasive alien species
What infrastructures do we need? (technologies, tools, standards…)
What processes do we need? (Modelling, workflows…)
What data do we need? (Genes, localities…)
Natural History – the foundation
“It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, …, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us.”
C. Darwin, “On the Origin of Species”, 1859
Darwin’s “tangled bank”… Systematics, a foundational “law”
Ecological interactions
A granular understanding of biodiversity
Figure: levels of biological organisation, from genes, individuals and populations (local populations) through species (global biodiversity) to interactions (biological networks).
GenBank
Key problems
• Landscape is complex, fragmented & hard to navigate
• Many audiences (policy makers, scientists, amateurs, citizen scientists)
• Many scales (global solutions to local problems)
Figure adapted from Peterson et al 2010
Genotype Phenotype Biotic Interactions Environment Human Effects
Niche & Pop. Ecology
Biodiversity Loss
Phylogenetic Trees
Taxonomy
Geographic Distributions
Range Maps Forecasts of Change
Conservation & management
Products
Data
GenBank MorphBank Interactions Geospatial Census
IUCN
TreeBase
IPNI, Zoobank
Pop. data
GBIF
Extent of Occurrence AquaMaps
AquaMaps
Systems
An informatician’s view of biodiversity
A project-centric view of biodiversity
Nomenclators: Index Fungorum, ZooBank, IPNI (Kew/AUS/Harvard), ING, AFD/APC/APUI, NZOR, CoL (Sp2000 & ITIS), ZooRecord
PESI:
ERMS
Fauna Europaea
Euro+Med Plantbase
ORBIS
WoRMS
Flora Europaea
Checklists
Phylogenetic: Tree of Life, TreeBase, CIPRES
Molecular Databases: NCBI/EMBL/DDBJ, CBoL, Barcode of Life Initiative
Biodiversity: ALA
CONABIO
CRIA (Brazil)
IUCN
SEEK
OPAL
DAISIE
iNaturalist
uBio
PLAZI
Inotaxa
BHL
eFloras
Scan / Mark-up
Identification: Key2Nature, IdentifyLife
Inter-Institutional: Synthesis, BCI, BioCASE, GeoCASE, MaNIS
Institutional: EMu (=MOA)
Recorder
TDWG
LifeWatch
GBIF
CDM, GNA (NameBank), IPNI
Bibliographic: Google Scholar, Connotea, ViTaL, ISI
Descriptive / classification: EoL, Scratchpads, CATE, MorphoBank, Wikipedia
A snapshot from 2009, “the dance of the initiatives”
The strategic view: community informatics challenges
GBIF GBIC Report (Coming soon)
EU Biodiversity Strategy (2011)
Biodiv. Inf. Challenges (2013)
Grand Challenges for Biodiversity Informatics (integrating activities for H2020)
2. Social challenges - Openness - Collaboration and communities - Standards, identifiers & links
Openness in biodiversity informatics
E. Archambault et al., Proportion of Open Access Peer-Reviewed Papers at the European and World Levels: 2004-2011, June 2013, Science-Metrix Inc.
“One-half of all papers are now freely available within a year or two of publication”
“A piece of data or content is open if anyone is free to use, reuse, and redistribute it - subject, at most, to the requirement to attribute and/or share-alike.” http://opendefinition.org/
Many kinds of openness: • Open Access • Open Data • Open Science • Open Source
• Sharing data is a foundation for our activities
• Normal practice in some communities (molecular)
• Mandated by some funders & governments
Need to continue to incentivise openness
Incentivise through credit via citation (e.g. BDJ)
What are Scratchpads? (http://scratchpads.eu)
Taxa Projects Regions Societies
544 Scratchpad Communities
by 6,644 active registered users
covering 91,631 taxa
in 535,317 pages. 81 paper citations in 2012
In total more than 1,300,000 visitors
e.g., Scratchpad Virtual Research Communities
Collaboration & communities
Making taxonomy a team sport
Our infrastructures need to facilitate collaboration
Standards, identifiers & protocols
Standards can’t be developed in isolation – they must be used
Key requirements:
• Need to be inclusive, practical & extensible
• Readable by humans & machines
• Widely used
Good examples:
• Darwin Core
• CrossRef & DataCite DOIs
• ORCID author identifiers
Gaps / Problems
• Reuse & persistence of identifiers
• Vocabularies & ontologies (time consuming / little reward)
Potential solutions
• Build them into our credit systems
• Show semantic reasoning potential (LOD & RDF demonstrators)
A foundation for integration
Facilitating data sharing across communities
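As a minimal sketch of what "readable by humans & machines" can look like in practice, the Python below holds one specimen occurrence as a flat record keyed by Darwin Core term names and serialises it to JSON. The keys (occurrenceID, basisOfRecord, scientificName, eventDate, decimalLatitude, decimalLongitude) are Darwin Core terms; the identifier, the values, the choice of which terms to treat as required and the helper names are assumptions made for illustration, not part of the standard or of this talk.

```python
import json

# Darwin Core term names used as keys. Treating these four as "required"
# is an assumption for this sketch, not a rule of the standard.
REQUIRED_TERMS = ("occurrenceID", "basisOfRecord", "scientificName", "eventDate")


def missing_terms(record):
    """Return the required terms that are absent or empty in a record."""
    return [term for term in REQUIRED_TERMS if not record.get(term)]


def to_json(record):
    """Serialise a record with sorted keys so output is stable and diffable."""
    return json.dumps(record, sort_keys=True, indent=2)


# Hypothetical specimen record: identifier, date and coordinates are made up.
record = {
    "occurrenceID": "urn:example:specimen:0001",
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Pediculus humanus",
    "eventDate": "2013-09-03",
    "decimalLatitude": 41.9,
    "decimalLongitude": 12.5,
}

if __name__ == "__main__":
    print(missing_terms(record))  # required terms that still need filling in
    print(to_json(record))
```

The same flat key-value shape is easy for a person to read and for a portal to validate, which is the sense in which a widely used vocabulary lowers the cost of sharing.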
3. (Big) data challenges - Mobilising existing data - New forms of data
Mobilising existing data
Collections
• 1.5-3B specimens in collections worldwide
• Fragmented efforts / heterogeneity of process
• Needs ambition (NHM: 20M in 5 yrs.) & coordination
Literature
• >300M pages of biodiversity literature
• BHL (41M pp.) an example of what can be done
• Needs sustainability & article metadata
Metadata registries
• Data about data (cheaper & scalable)
• e.g. bibliographic data, dataset portals
Informatics challenges
• Storage & persistence
• Automation & annotation
• Incentives to digitise & fitness for use
Collections, literature & metadata
How can we quickly, efficiently and cost effectively mobilise biological data at scale?
Bibliography of Life (RefFinder & RefBank)
BHL literature
NHM Digitisation
Mobilising & managing new forms of data
New molecular approaches
• Molecular detection & monitoring of organisms is routine
• Metagenomics (env. sequencing) commonplace
• Becoming the primary route to understanding biodiversity
Ecological observatories
• Automated biodiversity detection
• Remote sensing (e.g. satellite & acoustic data, drones, camera traps)
• Monitoring conspicuous, rare or invasive spp. (algal blooms, palms)
• Monitoring human activity
Informatics challenges
• Very large quantities of data (2.5-10TB per researcher per yr.)
• Doesn’t map well to existing data infrastructures
• Challenges current networking & storage capacity
• Digital and physical collections become equally important?
3-4 June 2013, NHM
22 July, 2013
Metagenomics & ecological observatories
These new data types do not depend on traditional taxonomy & systematics
4. Synthetic challenges - Data aggregation & linking - Visualisation - Modeling
Aggregation & linking
Portals bringing together distributed & diverse forms of data
Giving consistent and comprehensive access to all biological data
Several approaches, with different advantages
• Tightly coupled to a few data sources (e.g. eMonocot, CDM)
• Loosely coupled to many sources (e.g. BioNames, Wikipedia)
• Hybrid forms (e.g. Canadensys, EOL, GBIF)
Informatics challenges
• Portals are hard to sustain
• New methods of data discovery & access
• Create new windows (views) on content
• New data structures, new types of database
Scalable but less accurate (3M taxon names, 93k phylogenies & 28k articles)
BioNames
Selective & accurate but hard to scale (276k taxa, 8k images, 13 keys & 3 phylogenies)
eMonocot
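The trade-off between loose and tight coupling can be illustrated with a short Python sketch of the loosely coupled case: records from any number of sources are linked purely by a normalised name string. This is not how BioNames or any other portal named here is implemented; the source labels, record fields and function names are hypothetical and chosen only to show why name-based linking scales easily but loses accuracy.

```python
from collections import defaultdict


def normalise_name(name):
    """Collapse whitespace and case so trivially different spellings match."""
    return " ".join(name.split()).lower()


def link_by_name(sources):
    """Group records from many sources under a shared, normalised taxon name.

    `sources` maps a source label to a list of records, each a dict with a
    "name" key. Matching on name strings alone scales to many sources but
    will conflate homonyms and miss synonyms, which is the accuracy cost
    of loose coupling.
    """
    linked = defaultdict(list)
    for source, records in sources.items():
        for rec in records:
            linked[normalise_name(rec["name"])].append((source, rec))
    return dict(linked)


# Hypothetical inputs: source labels and records are invented.
sources = {
    "names_index": [{"name": "Pediculus  humanus", "id": "n1"}],
    "literature": [
        {"name": "pediculus humanus", "article": "a7"},
        {"name": "Pthirus pubis", "article": "a9"},
    ],
}

if __name__ == "__main__":
    for name, hits in link_by_name(sources).items():
        print(name, [src for src, _ in hits])
```

A tightly coupled portal would instead agree a shared schema and curated identifiers with each source up front, which is more accurate but means every new source needs integration work.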
Visualisation
Visually synthesizing large, linked biodiversity datasets
Making biodiversity data accessible & understandable
NHM specimen records
http://data.nhm.ac.uk/globe/
Research opportunities
• Tools integration (e.g. GeoCat, CartoDB)
• Span multiple audiences
Outreach opportunities
• Visually compelling story telling
• Crowdsourcing tools (e.g. Notes From Nature)
Exploiting new technologies
• Touch screens
• Mobile
• Location awareness
Informatics challenges
• Very specific to individual use cases
• Sustainability issues
Modeling the biosphere: a (the) 30 year goal?
Conceptually has many potential uses
• Identifying trends
• Explaining patterns
• Making predictions
• Real time alerts (when data contradicts current knowledge)
• The ultimate policy tool
Major informatics challenges
• Technically very difficult (many years off)
• Needs effective prototypes & platforms
• Some first steps e.g. OBOE, LEFT
Nature 2013, doi:10.1038/493295a
Reasoning across large, linked biodiversity datasets
A clear, singular, long-term vision, which biodiversity data can contribute to
5. Next steps
Lessons learned: new opportunities in H2020
PATHWAYS TO INTEGRATION (by addressing these social, data & synthetic challenges)
• Break out of the discipline, technical & project-centric activities (it is unsustainable, inefficient & bad for science)
• Integrate & build on existing programmes where possible (LifeWatch is a potential umbrella for these activities)
• Bridge the disconnect between informaticians & users (make the users informaticians & informaticians users)
• Our products are well suited to address these challenges
• Use H2020 as a mechanism to achieve integration
How do we join up these activities?
QUESTIONS
Possible biodiversity informatics design principles*
1. Start with needs - focus on real user needs (not just the ‘official process’)
2. Do less - if someone else is doing it, link to it or use it
3. Design with data - prototype and test with real users on the live website
4. Do the hard work to make it simple - let the computer take the strain
5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable
6. Build for inclusion - it’s easier in the long run
7. Understand context - we are designing for people, not a screen or a brand
8. Build digital services, not websites - there is life beyond the website
9. Be consistent, not uniform - every circumstance is different
10. Make things open: it makes things better - it’s more sustainable
= experience from 7 years with the Scratchpads = lessons for infrastructures in H2020?
*https://www.gov.uk/designprinciples
Mobilising existing data: how to prioritise
Nick Poole, UK Collections Trust
CONTENT
METADATA
A LITTLE A LOT
Digitise a few things & invest in depth, description & promotion
Digitise lots of things, put little effort into description & promotion
FUN
OUTREACH LEARNING
RESEARCH
AGGREGATION DATA MINING
COLLECTIONS MANAGEMENT
Collaboration & communities
• Very few recent single author papers
• Most (fundable) science is cross-disciplinary
• Need to incentivise data curation & annotation
• Need mechanisms to share annotations
Our infrastructures need to facilitate collaboration
Joppa et al, 2011
CONE SNAILS BIRDS MAMMALS AMPHIBIANS SPIDERS PLANTS
Average dates when increasing numbers of taxonomists were involved in describing species
Making taxonomy a team sport