Vince Smith The biodiversity informatics landscape: a systematics perspective Biodiversity Informatics Horizons Rome, 3-6 Sept 2013

Presented by V. Smith at the Biodiversity Informatics Horizons conference, Sapienza – Università di Roma, Rome, Italy. 3-6 Sept. 2013.

Vince Smith

The biodiversity informatics landscape: a systematics perspective

Biodiversity Informatics Horizons, Rome, 3-6 Sept 2013

Overview

1. Background – the biodiversity informatics domain
• The problem (i.e. why are we here)
• Representations of the domain (data, infrastructures, projects…)
• Toward an integrated view (strategy)

2. Social challenges
• Openness
• Collaboration and communities
• Standards, identifiers & protocols

3. (Big) data challenges
• Mobilizing existing data (metadata, literature, collections)
• New forms of data ([meta]genomics & observatories)

4. Synthetic challenges
• Data aggregation & linking
• Visualisation
• Modeling

5. Next steps (data infrastructures & funding)
• Lessons learned: new informatics opportunities in H2020

1. Background

The problem – integrating biodiversity research

How do we join up these activities? How do we use this as a tool?

• Species conservation & protected areas
• Impacts of human development
• Biodiversity & human health
• Impacts of climate change
• Food, farming & biofuels
• Invasive alien species

What infrastructures do we need? (technologies, tools, standards…)
What processes do we need? (modelling, workflows…)
What data do we need? (genes, localities…)

Natural History – the foundation

“It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, …, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us.”

C. Darwin, “On the Origin of Species”, 1859

Darwin’s “tangled bank”… Systematics, a foundational “law”

Ecological interactions

A granular understanding of biodiversity

[Figure: levels at which biodiversity is described, shown as panels labelled Genes (example sequence GCGCGTACCTAG; GenBank), Individuals, Populations (local populations), Species (global biodiversity) and Interactions (a species-by-species matrix of + and - signs; biological networks).]

Key problems
• Landscape is complex, fragmented & hard to navigate
• Many audiences (policy makers, scientists, amateurs, citizen scientists)
• Many scales (global solutions to local problems)

Figure adapted from Peterson et al 2010

An informatician’s view of biodiversity

A project-centric view of biodiversity

A snapshot from 2009, “the dance of the initiatives”

The strategic view: community informatics challenges

GBIF GBIC Report (coming soon)

EU Biodiversity Strategy (2011)

Biodiv. Inf. Challenges (2013)

Grand Challenges for Biodiversity Informatics (integrating activities for H2020)

2. Social challenges
- Openness
- Collaboration and communities
- Standards, identifiers & links

Openness in biodiversity informatics

E. Archambault et al., Proportion of Open Access Peer-Reviewed Papers at the European and World Levels--2004-2011, June 2013, Science-Metrix Inc.

“One-half of all papers are now freely available within a year or two of publication”

“A piece of data or content is open if anyone is free to use, reuse, and redistribute it - subject, at most, to the requirement to attribute and/or share-alike.” http://opendefinition.org/

Many kinds of openness:
• Open Access
• Open Data
• Open Science
• Open Source

• Sharing data is a foundation for our activities

• Normal practice in some communities (molecular)

• Mandated by some funders & governments

Need to continue to incentivise openness

Incentivise through credit via citation (e.g. BDJ)

What are Scratchpads? (http://scratchpads.eu)

Taxa Projects Regions Societies

544 Scratchpad communities by 6,644 active registered users, covering 91,631 taxa in 535,317 pages. 81 paper citations in 2012. In total, more than 1,300,000 visitors.

e.g., Scratchpad Virtual Research Communities

Collaboration & communities

Making taxonomy a team sport

Our infrastructures need to facilitate collaboration

Standards, identifiers & protocols

Standards can’t be developed in isolation – they must be used

Key requirements:
• Need to be inclusive, practical & extensible
• Readable by humans & machines
• Widely used

Good examples:
• Darwin Core
• CrossRef & DataCite DOIs
• ORCID author identifiers

Gaps / problems:
• Reuse & persistence of identifiers
• Vocabularies & ontologies (time consuming / little reward)

Potential solutions:
• Build them into our credit systems
• Show semantic reasoning potential (LOD & RDF demonstrators)

A foundation for integration

Facilitating data sharing across communities
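As a rough illustration of "readable by humans & machines", the sketch below builds one occurrence record keyed by Darwin Core term names and checks the checksum of an ORCID author identifier (ISO 7064 MOD 11-2). The record values, the `urn:example:` identifier and the helper names are assumptions made up for this example; the ORCID shown is the sample iD used in ORCID's own documentation, not a real collector.

```python
import json
import re


def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an ORCID iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True if the iD has the 0000-0000-0000-000X shape and a correct checksum."""
    if not re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_digit(compact[:15]) == compact[15]


# Hypothetical occurrence record; keys are Darwin Core term names,
# values are invented for illustration only.
record = {
    "occurrenceID": "urn:example:specimen:0001",
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Poa annua L.",
    "eventDate": "2013-09-03",
    "decimalLatitude": 41.9028,
    "decimalLongitude": 12.4964,
    "recordedByID": "https://orcid.org/0000-0002-1825-0097",
}

# The same structure is legible to a person and parseable by a program.
orcid = record["recordedByID"].rsplit("/", 1)[-1]
print(json.dumps(record, indent=2))
print("ORCID checksum valid:", is_valid_orcid(orcid))
```

A checksum only catches transcription errors; it says nothing about whether an identifier persists or is reused, which is the gap the slide points to.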

3. (Big) data challenges
- Mobilising existing data
- New forms of data

Mobilising existing data

Collections
• 1.5-3B specimens in collections worldwide
• Fragmented efforts / heterogeneity of process
• Needs ambition (NHM: 20M in 5 yrs.) & coordination

Literature
• >300M pages of biodiversity literature
• BHL (41M pp.) an example of what can be done
• Needs sustainability & article metadata

Metadata registries
• Data about data (cheaper & scalable)
• e.g. bibliographic data, dataset portals

Informatics challenges
• Storage & persistence
• Automation & annotation
• Incentives to digitise & fitness for use
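To give a sense of scale, the figures on this slide can be put through some back-of-envelope arithmetic. The sketch below assumes, purely for illustration, a single programme working at the NHM target rate (20M specimens in 5 years) and treats 300M pages as the size of the literature, although the slide gives it only as a lower bound. The variable and function names are invented for this example.

```python
# Figures quoted on the slide.
SPECIMENS_WORLDWIDE = (1_500_000_000, 3_000_000_000)  # low and high estimates
NHM_TARGET_SPECIMENS = 20_000_000
NHM_TARGET_YEARS = 5
LITERATURE_PAGES = 300_000_000  # the slide says ">300M", so this is a lower bound
BHL_PAGES = 41_000_000


def years_to_digitise(total: int, rate_per_year: float) -> float:
    """Years needed to digitise `total` items at a constant yearly rate."""
    return total / rate_per_year


# Assumption: one programme running at the NHM target rate.
nhm_rate = NHM_TARGET_SPECIMENS / NHM_TARGET_YEARS
years_low, years_high = (
    years_to_digitise(n, nhm_rate) for n in SPECIMENS_WORLDWIDE
)

# Upper bound on the share of the literature already in BHL.
bhl_share = BHL_PAGES / LITERATURE_PAGES

print(f"NHM target rate: {nhm_rate:,.0f} specimens per year")
print(f"Worldwide collections at that rate: {years_low:.0f}-{years_high:.0f} years")
print(f"BHL share of the literature: at most {bhl_share:.1%}")
```

Under these assumptions a single effort at that rate would need centuries, which is one way to read the slide's call for coordination across institutions rather than isolated projects.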

Collections, literature & metadata

How can we quickly, efficiently and cost effectively mobilise biological data at scale?

Bibliography of Life (RefFinder & RefBank)

BHL literature

NHM Digitisation

Mobilising & managing new forms of data

New molecular approaches
• Molecular detection & monitoring of organisms is routine
• Metagenomics (environmental sequencing) is commonplace
• Becoming the primary route to understanding biodiversity

Ecological observatories
• Automated biodiversity detection
• Remote sensing (e.g. satellite & acoustic data, drones, camera traps)
• Monitoring conspicuous, rare or invasive spp. (algal blooms, palms)
• Monitoring human activity

Informatics challenges
• Very large quantities of data (2.5-10TB per researcher per yr.)
• Doesn’t map well to existing data infrastructures
• Challenges current networking & storage capacity
• Digital and physical collections become equally important?

3-4 June 2013, NHM

22 July, 2013

Metagenomics & ecological observatories

These new data types do not depend on traditional taxonomy & systematics

4. Synthetic challenges
- Data aggregation & linking
- Visualisation
- Modeling

Aggregation & linking

Portals bringing together distributed & diverse forms of data

Giving consistent and comprehensive access to all biological data

Several approaches, with different advantages
• Tightly coupled to a few data sources (e.g. eMonocot, CDM)
• Loosely coupled to many sources (e.g. BioNames, Wikipedia)
• Hybrid forms (e.g. Canadensys, EOL, GBIF)

Informatics challenges
• Portals are hard to sustain
• New methods of data discovery & access
• Create new windows (views) on content
• New data structures, new types of database

BioNames: scalable but less accurate (3M taxon names, 93k phylogenies & 28k articles)

eMonocot: selective & accurate but hard to scale (276k taxa, 8k images, 13 keys & 3 phylogenies)
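The trade-off between the two styles can be sketched in a few lines. The example below is one possible reading of "loosely coupled" aggregation, not how BioNames or any other portal is actually built: records from several invented sources are linked only by a crudely normalised name string. The source names, records and the `normalise` rule are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical records from three made-up sources; every value is illustrative.
RECORDS = [
    {"source": "checklist", "name": "Poa annua L.", "kind": "taxon"},
    {"source": "literature", "name": "poa annua", "kind": "article"},
    {"source": "images", "name": "Poa  annua", "kind": "image"},
    {"source": "literature", "name": "Poa trivialis", "kind": "article"},
]


def normalise(name: str) -> str:
    """Crude linking key: lower-case, collapse whitespace, keep genus + epithet."""
    parts = name.lower().split()
    return " ".join(parts[:2])


def aggregate(records):
    """Group records from any number of sources under a shared name key."""
    linked = defaultdict(list)
    for rec in records:
        linked[normalise(rec["name"])].append((rec["source"], rec["kind"]))
    return dict(linked)


linked = aggregate(RECORDS)
for key, items in sorted(linked.items()):
    print(key, "->", items)
```

String matching like this needs no agreement between sources, so it scales to many of them, but it can silently merge homonyms or miss synonyms. A tightly coupled portal avoids that by curating a shared taxonomic backbone, at the cost of covering fewer sources.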

Visualisation

Visually synthesizing large, linked biodiversity datasets

Making biodiversity data accessible & understandable

NHM specimen records

http://data.nhm.ac.uk/globe/

Research opportunities
• Tools integration (e.g. GeoCat, CartoDB)
• Span multiple audiences

Outreach opportunities
• Visually compelling storytelling
• Crowdsourcing tools (e.g. Notes From Nature)

Exploiting new technologies
• Touch screens
• Mobile
• Location awareness

Informatics challenges
• Very specific to individual use cases
• Sustainability issues

Modeling the biosphere: a (the) 30-year goal?

Conceptually has many potential uses
• Identifying trends
• Explaining patterns
• Making predictions
• Real-time alerts when data contradict current knowledge
• The ultimate policy tool

Major informatics challenges
• Technically very difficult (many years off)
• Needs effective prototypes & platforms
• Some first steps, e.g. OBOE, LEFT

Nature 2013, doi:10.1038/493295a

Reasoning across large, linked biodiversity datasets

A clear, singular, long-term vision, which biodiversity data can contribute to

5. Next steps

Lessons learned: new opportunities in H2020

PATHWAYS TO INTEGRATION (by addressing these social, data & synthetic challenges)

• Break out of discipline-centric, technology-centric & project-centric activities (they are unsustainable, inefficient & bad for science)

• Integrate & build on existing programmes where possible (LifeWatch is a potential umbrella for these activities)

• Bridge the disconnect between informaticians & users (make the users informaticians & the informaticians users)

• Our products are well suited to address these challenges

• Use H2020 as a mechanism to achieve integration

How do we join up these activities?

QUESTIONS

Possible biodiversity informatics design principles*

1. Start with needs - focus on real user needs (not just the ‘official process’)

2. Do less - if someone else is doing it, link to it or use it

3. Design with data - prototype and test with real users on the live website

4. Do the hard work to make it simple - let the computer take the strain

5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable

6. Build for inclusion – it’s easier in the long run

7. Understand context - we are designing for people, not a screen or a brand

8. Build digital services, not websites - there is life beyond the website

9. Be consistent, not uniform - every circumstance is different

10. Make things open: it makes things better - it’s more sustainable

= experience from 7 years with the Scratchpads
= lessons for infrastructures in H2020?

*https://www.gov.uk/designprinciples

Mobilising existing data: how to prioritise

Nick Poole, UK Collections Trust

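[Figure: a chart with axes labelled CONTENT and METADATA and a scale from A LITTLE to A LOT. Two strategies are contrasted:
• Digitise a few things & invest in depth, description & promotion
• Digitise lots of things, put little effort into description & promotion
Uses marked on the chart: fun, outreach, learning, research, aggregation, data mining, collections management.]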

Collaboration & communities

• Very few recent single-author papers
• Most (fundable) science is cross-disciplinary
• Need to incentivise data curation & annotation
• Need mechanisms to share annotations

Our infrastructures need to facilitate collaboration

Joppa et al, 2011

[Figure: average dates when increasing numbers of taxonomists were involved in describing species, for cone snails, birds, mammals, amphibians, spiders and plants.]

Making taxonomy a team sport
