Vince Smith The biodiversity informatics landscape: a systematics perspective Biodiversity Informatics Horizons Rome, 3-6 Sept 2013

The Biodiversity Informatics Landscape


DESCRIPTION

Presentation given at the Biodiversity Informatics Horizons meeting in Rome, Italy, 3-6 September 2013.


Page 1: The Biodiversity Informatics Landscape

Vince Smith

The biodiversity informatics landscape: a systematics perspective

Biodiversity Informatics Horizons Rome, 3-6 Sept 2013

Page 2: The Biodiversity Informatics Landscape

Overview

1. Background: the biodiversity informatics domain
• The problem (i.e. why are we here)
• Representations of the domain (data, infrastructures, projects…)
• Toward an integrated view (strategy)

2. Social challenges
• Openness
• Collaboration and communities
• Standards, identifiers & protocols

3. (Big) data challenges
• Mobilising existing data (metadata, literature, collections)
• New forms of data ([meta]genomics & observatories)

4. Synthetic challenges
• Data aggregation & linking
• Visualisation
• Modeling

5. Next steps (data infrastructures & funding)
• Lessons learned: new informatics opportunities in H2020

Page 3: The Biodiversity Informatics Landscape

1.  Background  

Page 4: The Biodiversity Informatics Landscape

The problem – integrating biodiversity research

How do we join up these activities?
How do we use this as a tool?

• Species conservation & protected areas
• Impacts of human development
• Biodiversity & human health
• Impacts of climate change
• Food, farming & biofuels
• Invasive alien species

What infrastructures do we need? (technologies, tools, standards…)
What processes do we need? (modelling, workflows…)
What data do we need? (genes, localities…)

Page 5: The Biodiversity Informatics Landscape

Natural History – the foundation

"It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, …, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us."

C. Darwin, "On the Origin of Species", 1859

Darwin's "tangled bank"…
Systematics, a foundational "law"

Page 6: The Biodiversity Informatics Landscape

Ecological interactions

Page 7: The Biodiversity Informatics Landscape

A granular understanding of biodiversity

[Figure: levels at which biodiversity is described. Labels: Genes; Individuals; Populations; Local populations; Species; Global biodiversity; Interactions; Biological networks. Data source named in the figure: GenBank.]

Page 8: The Biodiversity Informatics Landscape

Key problems
• Landscape is complex, fragmented & hard to navigate
• Many audiences (policy makers, scientists, amateurs, citizen scientists)
• Many scales (global solutions to local problems)

Figure adapted from Peterson et al. 2010

[Figure labels, as extracted]
Row labels: Products; Data; Systems
Other labels: Genotype; Phenotype; Biotic Interactions; Environment; Human Effects; Niche & Pop. Ecology; Biodiversity Loss; Phylogenetic Trees; Taxonomy; Geographic Distributions; Range Maps; Forecasts of Change; Conservation & management; GenBank; MorphBank; Interactions; Geospatial; Census; IUCN; TreeBase; IPNI, ZooBank; Pop. data; GBIF; Extent of Occurrence; AquaMaps

An informatician's view of biodiversity

Page 9: The Biodiversity Informatics Landscape

A project-centric view of biodiversity

[Diagram of initiatives by category; grouping reconstructed from the extracted label order]
Nomenclators: Index Fungorum; ZooBank; IPNI (Kew/AUS/Harvard); ING; AFD/APC/APUI; NZOR; CoL (Sp2000 & ITIS); ZooRecord
PESI: ERMS; Fauna Europaea; Euro+Med PlantBase
Checklists: ORBIS; WoRMS; Flora Europaea
Phylogenetic: Tree of Life; TreeBase; CIPRES
Molecular databases: NCBI/EMBL/DDBJ; CBoL; Barcode of Life Initiative
Biodiversity: ALA; CONABIO; CRIA (Brazil); IUCN; SEEK; OPAL; DAISIE; iNaturalist
Scan / mark-up: uBio; PLAZI; Inotaxa; BHL; eFloras
Identification: Key2Nature; IdentifyLife
Inter-institutional synthesis: BCI; BioCASE; GeoCASE; MaNIS
Institutional: EMu (=MOA); Recorder
Bibliographic: Google Scholar; Connotea; ViTaL; ISI
Descriptive / classification: EoL; Scratchpads; CATE; MorphoBank; Wikipedia
Other labels: TDWG; LifeWatch; GBIF; CDM; GNA (NameBank); IPNI

A snapshot from 2009, "the dance of the initiatives"

Page 10: The Biodiversity Informatics Landscape

The strategic view: community informatics challenges

GBIF GBIC Report (coming soon)

EU Biodiversity Strategy (2011)

Biodiv. Inf. Challenges (2013)

Grand Challenges for Biodiversity Informatics (integrating activities for H2020)

Page 11: The Biodiversity Informatics Landscape

2. Social challenges
- Openness
- Collaboration and communities
- Standards, identifiers & links

Page 12: The Biodiversity Informatics Landscape

Openness in biodiversity informatics

E. Archambault et al., Proportion of Open Access Peer-Reviewed Papers at the European and World Levels--2004-2011, June 2013, Science-Metrix Inc.

"One-half of all papers are now freely available within a year or two of publication"

"A piece of data or content is open if anyone is free to use, reuse, and redistribute it - subject, at most, to the requirement to attribute and/or share-alike."
http://opendefinition.org/

Many kinds of openness:
• Open Access
• Open Data
• Open Science
• Open Source

• Sharing data is a foundation for our activities
• Normal practice in some communities (molecular)
• Mandated by some funders & governments

Page 13: The Biodiversity Informatics Landscape

Openness in biodiversity informatics


Need to continue to incentivise openness


Incentivise through credit via citation (e.g. BDJ)

Page 14: The Biodiversity Informatics Landscape

What are Scratchpads? (http://scratchpads.eu)

Taxa; Projects; Regions; Societies

544 Scratchpad communities by 6,644 active registered users, covering 91,631 taxa in 535,317 pages.
81 paper citations in 2012.
In total, more than 1,300,000 visitors.

e.g. Scratchpad Virtual Research Communities

Collaboration & communities

Making taxonomy a team sport

Our infrastructures need to facilitate collaboration

Page 15: The Biodiversity Informatics Landscape

Standards, identifiers & protocols

Standards can't be developed in isolation: they must be used

Key requirements:
• Need to be inclusive, practical & extensible
• Readable by humans & machines
• Widely used

Good examples:
• Darwin Core
• CrossRef & DataCite DOIs
• ORCID author identifiers

Gaps / problems:
• Reuse & persistence of identifiers
• Vocabularies & ontologies (time consuming / little reward)

Potential solutions:
• Build them into our credit systems
• Show semantic reasoning potential (LOD & RDF demonstrators)

A foundation for integration
Facilitating data sharing across communities
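As a hedged illustration of "readable by humans & machines", the sketch below holds one occurrence record keyed by Darwin Core term names (occurrenceID, basisOfRecord, scientificName, eventDate, decimalLatitude, decimalLongitude) and serialises it to JSON. The record values, the REQUIRED_TERMS set and the helper names are assumptions made for this example; they do not come from the presentation, and the standard itself does not mandate this set of required terms.

```python
import json

# Darwin Core term names. Treating these three as required is an
# assumption of this sketch, not a rule of the standard.
REQUIRED_TERMS = {"occurrenceID", "basisOfRecord", "scientificName"}


def missing_terms(record):
    """Return the required terms that are absent or empty, sorted by name."""
    return sorted(term for term in REQUIRED_TERMS if not record.get(term))


def to_json(record):
    """Serialise a record with stable key order so output is easy to diff."""
    return json.dumps(record, sort_keys=True, indent=2)


# Hypothetical specimen record: every value here is invented for illustration.
record = {
    "occurrenceID": "urn:example:specimen:0001",
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Pediculus humanus",
    "eventDate": "1998-07-14",
    "decimalLatitude": 51.4967,
    "decimalLongitude": -0.1764,
}

if __name__ == "__main__":
    print(missing_terms(record))  # an empty list means nothing is missing
    print(to_json(record))
```

The same dictionary is legible to a curator and parseable by an aggregator, which is the property the key requirements ask of a shared standard.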

Page 16: The Biodiversity Informatics Landscape

3. (Big) data challenges
- Mobilising existing data
- New forms of data

Page 17: The Biodiversity Informatics Landscape

Mobilising existing data

Collections
• 1.5-3B specimens in collections worldwide
• Fragmented efforts / heterogeneity of process
• Needs ambition (NHM: 20M in 5 yrs.) & coordination

Literature
• >300M pages of biodiversity literature
• BHL (41M pp.) an example of what can be done
• Needs sustainability & article metadata

Metadata registries
• Data about data (cheaper & scalable)
• e.g. bibliographic data, dataset portals

Informatics challenges
• Storage & persistence
• Automation & annotation
• Incentives to digitise & fitness for use

Collections, literature & metadata
How can we quickly, efficiently and cost effectively mobilise biological data at scale?

Bibliography of Life (RefFinder & RefBank)

BHL literature

NHM Digitisation
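The scale of the task can be made concrete with a little arithmetic on the figures quoted on this slide. The sketch below is illustrative only: it assumes a constant digitisation rate and a single programme working alone, neither of which the presentation claims, and the constant and function names are invented for the example.

```python
# Figures quoted on the slide.
SPECIMENS_LOW = 1_500_000_000   # lower estimate of specimens worldwide
SPECIMENS_HIGH = 3_000_000_000  # upper estimate of specimens worldwide
NHM_TARGET = 20_000_000         # NHM ambition: 20M specimens ...
NHM_YEARS = 5                   # ... in 5 years
BHL_PAGES = 41_000_000          # pages already in BHL
LITERATURE_PAGES = 300_000_000  # slide says ">300M", so this is a lower bound


def rate_per_year(total, years):
    """Average number of items digitised per year."""
    return total / years


def years_needed(total, rate):
    """Years to digitise `total` items at a constant yearly `rate`."""
    return total / rate


nhm_rate = rate_per_year(NHM_TARGET, NHM_YEARS)
years_low = years_needed(SPECIMENS_LOW, nhm_rate)
years_high = years_needed(SPECIMENS_HIGH, nhm_rate)
# Because the literature total is a lower bound, this share is an upper bound.
bhl_share = BHL_PAGES / LITERATURE_PAGES

if __name__ == "__main__":
    print(f"NHM target rate: {nhm_rate:,.0f} specimens per year")
    print(f"One programme at that rate: {years_low:.0f} to {years_high:.0f} years")
    print(f"BHL share of the literature: at most {bhl_share:.1%}")
```

Under these assumptions a single programme at the NHM target rate would need centuries to cover the world's collections, which is one way to read the slide's call for coordination as well as ambition.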

Page 18: The Biodiversity Informatics Landscape

Mobilising & managing new forms of data

 

New molecular approaches
• Molecular detection & monitoring of organisms is routine
• Metagenomics (env. sequencing) commonplace
• Becoming the primary route to understanding biodiversity

Ecological observatories
• Automated biodiversity detection
• Remote sensing (e.g. satellite & acoustic data, drones, camera traps)
• Monitoring conspicuous, rare or invasive spp. (algal blooms, palms)
• Monitoring human activity

Informatics challenges
• Very large quantities of data (2.5-10TB per researcher per yr.)
• Doesn't map well to existing data infrastructures
• Challenges current networking & storage capacity
• Digital and physical collections become equally important?

3-4 June 2013, NHM

22 July, 2013

Metagenomics & ecological observatories
These new data types do not depend on traditional taxonomy & systematics

Page 19: The Biodiversity Informatics Landscape

4. Synthetic challenges
- Data aggregation & linking
- Visualisation
- Modeling

Page 20: The Biodiversity Informatics Landscape

Aggregation & linking

Portals bringing together distributed & diverse forms of data
Giving consistent and comprehensive access to all biological data

Several approaches, with different advantages
• Tightly coupled to a few data sources (e.g. eMonocot, CDM)
• Loosely coupled to many sources (e.g. BioNames, Wikipedia)
• Hybrid forms (e.g. Canadensys, EOL, GBIF)

Informatics challenges
• Portals are hard to sustain
• New methods of data discovery & access
• Create new windows (views) on content
• New data structures, new types of database

BioNames: scalable but less accurate (3M taxon names, 93k phylogenies & 28k articles)

eMonocot: selective & accurate but hard to scale (276k taxa, 8k images, 13 keys & 3 phylogenies)
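The trade-off between loosely and tightly coupled aggregation can be sketched in a few lines. The example below links records from independent sources on a crudely normalised taxon name, which is one simple form of loose coupling. It is not how BioNames or eMonocot work internally; the source names, record fields, identifiers and both functions are hypothetical.

```python
import re
from collections import defaultdict


def normalise_name(name):
    """Reduce a scientific name to a lower-case key of its first two words.

    Deliberately naive: authorship, dates and punctuation are dropped, and
    synonyms, homonyms and subgenera are not handled.
    """
    words = re.findall(r"[A-Za-z]+", name)
    return " ".join(word.lower() for word in words[:2])


def link_by_name(*sources):
    """Group (source, record id) pairs from many sources under one name key."""
    linked = defaultdict(list)
    for source_name, records in sources:
        for rec in records:
            linked[normalise_name(rec["name"])].append((source_name, rec["id"]))
    return dict(linked)


# Hypothetical records from two unrelated sources.
articles = [{"id": "art-1", "name": "Pediculus humanus Linnaeus, 1758"}]
images = [
    {"id": "img-9", "name": "PEDICULUS HUMANUS"},
    {"id": "img-10", "name": "Pthirus pubis"},
]

if __name__ == "__main__":
    for key, hits in link_by_name(("articles", articles), ("images", images)).items():
        print(key, hits)
```

Matching on a string key scales to any number of sources because no source has to change, but it is also where accuracy is lost: a name written with a subgenus, such as "Pediculus (Pediculus) humanus", normalises to the wrong key here, the kind of error a curated, tightly coupled system is designed to avoid.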

Page 21: The Biodiversity Informatics Landscape

Visualisation

Visually synthesising large, linked biodiversity datasets
Making biodiversity data accessible & understandable

NHM specimen records
http://data.nhm.ac.uk/globe/

Research opportunities
• Tools integration (e.g. GeoCat, CartoDB)
• Span multiple audiences

Outreach opportunities
• Visually compelling storytelling
• Crowdsourcing tools (e.g. Notes From Nature)

Exploiting new technologies
• Touch screens
• Mobile
• Location awareness

Informatics challenges
• Very specific to individual use cases
• Sustainability issues

Page 22: The Biodiversity Informatics Landscape

Modeling the biosphere: a (the) 30 year goal?

Conceptually has many potential uses
• Identifying trends
• Explaining patterns
• Making predictions
• Real-time alerts - when data contradicts current knowledge
• The ultimate policy tool

Major informatics challenges
• Technically very difficult (many years off)
• Needs effective prototypes & platforms
• Some first steps e.g. OBOE, LEFT

Nature 2013, doi:10.1038/493295a

Reasoning across large, linked biodiversity datasets
A clear, singular, long-term vision, which biodiversity data can contribute to

Page 23: The Biodiversity Informatics Landscape

5.  Next  steps  

Page 24: The Biodiversity Informatics Landscape

Lessons learned: new opportunities in H2020

PATHWAYS TO INTEGRATION (by addressing these social, data & synthetic challenges)

• Break out of discipline-, technical- & project-centric activities (they are unsustainable, inefficient & bad for science)
• Integrate & build on existing programmes where possible (LifeWatch is a potential umbrella for these activities)
• Bridge the disconnect between informaticians & users (make the users informaticians & the informaticians users)
• Our products are well suited to address these challenges
• Use H2020 as a mechanism to achieve integration

How do we join up these activities?

Page 25: The Biodiversity Informatics Landscape

QUESTIONS  

Page 26: The Biodiversity Informatics Landscape

Possible biodiversity informatics design principles*

1. Start with needs - focus on real user needs (not just the 'official process')
2. Do less - if someone else is doing it, link to it or use it
3. Design with data - prototype and test with real users on the live website
4. Do the hard work to make it simple - let the computer take the strain
5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable
6. Build for inclusion - it's easier in the long run
7. Understand context - we are designing for people, not a screen or a brand
8. Build digital services, not websites - there is life beyond the website
9. Be consistent, not uniform - every circumstance is different
10. Make things open: it makes things better - it's more sustainable

= experience from 7 years with the Scratchpads
= lessons for infrastructures in H2020?

*https://www.gov.uk/designprinciples

Page 27: The Biodiversity Informatics Landscape

Mobilising existing data: how to prioritise

Nick Poole, UK Collections Trust

[Figure: prioritisation matrix. Axis labels: CONTENT; METADATA; A LITTLE; A LOT. Other labels: FUN; OUTREACH; LEARNING; RESEARCH; AGGREGATION; DATA MINING; COLLECTIONS MANAGEMENT.]

Digitise a few things & invest in depth, description & promotion

Digitise lots of things, put little effort into description & promotion

Page 28: The Biodiversity Informatics Landscape

Collaboration & communities

• Very few recent single-author papers
• Most (fundable) science is cross-disciplinary
• Need to incentivise data curation & annotation
• Need mechanisms to share annotations

Our infrastructures need to facilitate collaboration

[Figure: Joppa et al., 2011. Panels: CONE SNAILS; BIRDS; MAMMALS; AMPHIBIANS; SPIDERS; PLANTS. Caption: Average dates when increasing numbers of taxonomists were involved in describing species.]

Making taxonomy a team sport

Page 29: The Biodiversity Informatics Landscape