Small Data, or: Bridging the Gap Between Specific and Generic Research Repositories. April 11, 2013. Anita de Waard, VP Research Data Collaborations, [email protected], http://researchdata.elsevier.com/

Small Data: Bridging the Gap Between Generic and Specific Repositories


DESCRIPTION

My presentation for the http://iannotate.org// meeting in San Francisco, April 11th 2013


Page 1: Small Data: Bridging the Gap Between Generic and Specific Repositories

Small Data, or: Bridging the Gap Between Specific and Generic Research Repositories

April 11, 2013, Anita de Waard

VP Research Data Collaborations, [email protected]

http://researchdata.elsevier.com/

Page 2: Small Data: Bridging the Gap Between Generic and Specific Repositories

There are many efforts to enhance data storing and sharing...

• Many different research databases, both generic (Dryad, Dataverse, …) and specific (NIF, IEDA, PDB, …)
• Many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever, etc.)
• Many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc.)
• Scores of projects, committees, standards, bodies, grants, initiatives, conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc.)
• You can make a living out of this ;-)! (and many of us do…)

Page 3: Small Data: Bridging the Gap Between Generic and Specific Repositories

…but this is what scientists do:

Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of this, and writes a paper. End of story.

Page 4: Small Data: Bridging the Gap Between Generic and Specific Repositories

Why save research data?

A. Data Preservation:
– Preserve record of scientific process, provenance
– Enable reproducible research

B. Data Use:
– Use results obtained by others
– Do better science!
– Improve interdisciplinary work

Page 5: Small Data: Bridging the Gap Between Generic and Specific Repositories

Where the data goes now:

> 50 My Papers, 2 M scientists, 2 M papers/year

• Majority of data (90%?) is stored on local hard drives
• Some data (8%?) stored in large, generic data repositories (Dryad: 7,631 files; Dataverse: 0.6 M; Datacite: 1.5 M)
• A small portion of data (1-2%?) stored in small, topic-focused data repositories (MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k)

Page 6: Small Data: Bridging the Gap Between Generic and Specific Repositories

So this needs to happen:

INCREASE DATA PRESERVATION

[Figure: same breakdown and repository counts as on page 5: majority of data (90%?) on local hard drives; some data (8%?) in large, generic data repositories; a small portion (1-2%?) in small, topic-focused data repositories.]

Page 7: Small Data: Bridging the Gap Between Generic and Specific Repositories

Data Preservation Issues:

Objection: “Our lab notebooks are all on paper – it’s how we do things”
Response: Graft tools closely on scientists’ daily practice
Example: create tailored metadata collection tools on mini-tablets in labs to replace paper notebooks

Page 8: Small Data: Bridging the Gap Between Generic and Specific Repositories

Data Preservation Issues:

Objection: “I need to see a direct benefit of any effort I put in.”
Response: Create tools to allow better insight into one’s own and others’ results.
Example: ‘PI-Dashboard’: allow immediate access/analysis of shared data: new science!

Page 9: Small Data: Bridging the Gap Between Generic and Specific Repositories

Data Use Issues:

Objection: “I don’t really trust anyone else’s data – and don’t think they’ll trust mine”
Response: Create social networking context; allow data owner to provide granular access control.
Example:
• In Urban Lab app, data stored by researcher name.
• PI decides who gets to see which data
• Match up with NIF and Eagle-I ontologies on back end so export of (part of) data is possible at any time.
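The granular access control in the example above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `LabStore` and `Dataset` names, the in-memory storage, and the rule that the PI and the producing researcher can always view a dataset. It is not the Urban Lab app's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset filed under the researcher who produced it."""
    name: str
    researcher: str
    viewers: set = field(default_factory=set)  # people granted access by the PI


class LabStore:
    """Hypothetical in-memory store where only the PI grants access."""

    def __init__(self, pi):
        self.pi = pi
        self.datasets = {}

    def add(self, name, researcher):
        # Data is stored by researcher name, as in the slide's example.
        self.datasets[name] = Dataset(name, researcher)

    def grant(self, actor, name, viewer):
        # Assumed rule: the PI decides who gets to see which data.
        if actor != self.pi:
            raise PermissionError("only the PI decides who sees which data")
        self.datasets[name].viewers.add(viewer)

    def can_view(self, user, name):
        ds = self.datasets[name]
        return user == self.pi or user == ds.researcher or user in ds.viewers


store = LabStore(pi="pi")
store.add("blot-01", researcher="alice")
store.grant("pi", "blot-01", "bob")
print(store.can_view("bob", "blot-01"))    # True
print(store.can_view("carol", "blot-01"))  # False
```

Keeping the grant decision in one place makes it easy to audit who can see what, which is the kind of control the trust objection asks for.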

Page 10: Small Data: Bridging the Gap Between Generic and Specific Repositories

Data Use Issues:

• Objection: “I am afraid other people might scoop my discoveries”
• Response: Reward system needs to move from direct competition to a ‘shared mission’ approach (cf. Mars)
• Example: Data Rescue Challenge in the geosciences: collect and reward stories/practices of data preservation, enable cross-disciplinary access and use of all data.

The 2013 International Data Rescue Award in the Geosciences, organised by IEDA and Elsevier Research Data Services: http://researchdata.elsevier.com/datachallenge

Page 11: Small Data: Bridging the Gap Between Generic and Specific Repositories

Data Preservation and Annotation: Fine, I’ll do it – but where the hell do I put it?

Funding Agency, University, Collaborators, Domain of study

Domain-Specific Data Repository, Local Data Repository, Institutional Data Repository, Generic Data Repository

AND THEY ALL WANT DIFFERENT METADATA!!!!

Page 12: Small Data: Bridging the Gap Between Generic and Specific Repositories

Comparing Repository Types:

Repository | Advantages | Disadvantages
Local data repository | Easy! No one steals your data. | No one sees it. Not compliant with requirements
Institutional Repository | Not very difficult. Administrators are happy. | Data can’t easily be reused. Credit?
Generic data repository | Not very hard to do. Have complied! | Data can’t be easily reused. Credit…
Domain-specific data repository | Data can be reused. Credit! | Lot of work – for curators

Effort, Reuse, Credit, Compliance

Habit, Ease, Privacy, Control

MORE ANNOTATION

Page 13: Small Data: Bridging the Gap Between Generic and Specific Repositories

Conclusions for data annotation:

“Instead of building newer and larger weapons of mass destruction, I think mankind should try to get more use out of the ones we have”

Deep Thoughts by Jack Handey

• Let’s use the data standards we already have – and agree on using the same ones
• Work with existing data repositories in a field to come to a lowest common denominator of metadata
• Tailor the systems to be optimally easy to use for scientists in terms of metadata: add as little as you have to, as few times as you can.
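One way to read the "lowest common denominator" idea is as a set intersection over the metadata fields each repository asks for. The sketch below assumes that reading; the repository labels and field names are made up for illustration and do not reflect any real repository's schema.

```python
def common_fields(schemas):
    """Fields that every repository asks for (the shared minimum)."""
    field_sets = [set(fields) for fields in schemas.values()]
    return set.intersection(*field_sets) if field_sets else set()


def extra_fields(schemas):
    """Fields each repository asks for beyond the shared minimum."""
    shared = common_fields(schemas)
    return {repo: set(fields) - shared for repo, fields in schemas.items()}


# Hypothetical schemas, invented for this example.
schemas = {
    "local": ["title", "creator", "date"],
    "institutional": ["title", "creator", "date", "department"],
    "generic": ["title", "creator", "date", "license"],
    "domain_specific": ["title", "creator", "date", "sample_id", "method"],
}

print(sorted(common_fields(schemas)))
# ['creator', 'date', 'title']
print(sorted(extra_fields(schemas)["domain_specific"]))
# ['method', 'sample_id']
```

Under this reading, a scientist would enter the shared fields once and supply only the extras per repository, in the spirit of "add as little as you have to, as few times as you can".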

Page 14: Small Data: Bridging the Gap Between Generic and Specific Repositories

Summary:

• Data Preservation:
– Tailor tools to fit scientists’ workflow – follow the experiment!
– We are creating repositories of shared experiments: Enable demonstrably better science!

• Data Use:
– Allow owner full control over who sees which data - create social networking context
– Collectively pioneer long-term funding options; support/develop ‘shared mission’ funding challenges

• How annotation can help reuse:
– Collaborate between (generic/specific, institutional, cross-national) data facilities to integrate repositories, enable cross-repository usage and reuse existing metadata.

Page 15: Small Data: Bridging the Gap Between Generic and Specific Repositories

Questions?

Anita de Waard, VP Research Data Collaborations

[email protected]

http://researchdata.elsevier.com/

Page 16: Small Data: Bridging the Gap Between Generic and Specific Repositories

Elsevier Research Data Services Goals:

1. Increase Data Preservation: Help increase the amount and quality of data preserved and shared

2. Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability

3. Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.

Page 17: Small Data: Bridging the Gap Between Generic and Specific Repositories

Guiding Principles of RDS:

• In principle, all open data stays open and URLs, front end etc. stay where they are (i.e. with repository)
• Collaboration is tailored to data repositories’ unique needs/interests - ‘service-model’ type:
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2/3 people) department; immediate communication; instant deployment of ideas.

Page 18: Small Data: Bridging the Gap Between Generic and Specific Repositories

“But aren’t you guys in it for the money?”

• Yes, we are - like most businesses…
• Is your real question perhaps: ‘Does no one want to work with you anymore because of the Open Access debate?’
• The OA debate focuses on three issues:
– IPR and Access issues. E.g. BY-NC-SA? GitHub? ..?
– Opaque business models. E.g. Gold Open Access? Shared funding model? Commercial analytics with shared royalties?
– Lack of perceived added value. We offer a service: only use it if it’s any good!