Spooky Spreadsheets Carly Strasser | California Digital Library UCSB/Bren Oct 2013 From Flickr by Jeff Golden

Bren - UCSB - Spooky spreadsheets


DESCRIPTION

Talk for Jim Frew's grad class at Bren School, UC Santa Barbara. Oct 31, 2013. All about things you can do wrong (and right) with spreadsheets.


Page 1: Bren - UCSB - Spooky spreadsheets

Spooky  Spreadsheets  

Carly  Strasser  |  California  Digital  Library  UCSB/Bren  Oct  2013  

From  Flickr  by  Jeff  Golden  

Page 2: Bren - UCSB - Spooky spreadsheets

Roadmap

1. Background

2. Best practices

3. Toolbox

Page 3: Bren - UCSB - Spooky spreadsheets

From  Flickr  by  robertpaulyoung  

Scientists  are  bad  at  data  management.  

Page 4: Bren - UCSB - Spooky spreadsheets

Many  tables  

Page 5: Bren - UCSB - Spooky spreadsheets

Embedded  figures  

Page 6: Bren - UCSB - Spooky spreadsheets

my  spreadsheet  

No  headings  

Page 7: Bren - UCSB - Spooky spreadsheets

my  spreadsheet  

Page 8: Bren - UCSB - Spooky spreadsheets

my  spreadsheet  

Page 9: Bren - UCSB - Spooky spreadsheets
Page 10: Bren - UCSB - Spooky spreadsheets

?

Page 11: Bren - UCSB - Spooky spreadsheets

Reproducibility  Transparency  Reuse  NO  

Didn’t  share  the  data  Didn’t  document  the  data  (metadata)  Didn’t  document  provenance/workflow  

www.petshaming.net

Page 12: Bren - UCSB - Spooky spreadsheets

From  Flickr  by  johntrainor  

Why  should  I  care?  

Page 13: Bren - UCSB - Spooky spreadsheets

Because  they  care:  

From Flickr by Redden-McAllister

Page 14: Bren - UCSB - Spooky spreadsheets
Page 15: Bren - UCSB - Spooky spreadsheets

data management

From Flickr by Big Swede Guy

Best  Practices  

Page 16: Bren - UCSB - Spooky spreadsheets

From  Flickr  by  Mark  Sardella  

Plan  before  data  collection  

Page 17: Bren - UCSB - Spooky spreadsheets

• Create a key (data dictionary)
• Make sure names are unique
• Define codes

From Flickr by zebbie

Planning  Design  sample  naming  scheme  

Page 18: Bren - UCSB - Spooky spreadsheets

PhDcomics.com  

Planning  Design  file  naming  scheme  

Page 19: Bren - UCSB - Spooky spreadsheets

Use descriptive file names
• Unique
• Reflect contents

From R Cook, ESA Best Practices Workshop 2010

Bad: Mydata.xls, 2001_data.csv, best version.txt

Better: Eaffinis_nanaimo_2010_counts.xls
(Eaffinis = study organism, nanaimo = site name, 2010 = year, counts = what was measured)

*Not for everyone

Planning  Design  file  naming  scheme  
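The naming pattern on this slide (study organism, site name, year, what was measured) can be generated rather than typed by hand. The following is a minimal sketch, not a prescribed tool; the function name `make_file_name` and its parameters are assumptions made for illustration.

```python
import re


def make_file_name(organism: str, site: str, year: int, measurement: str, ext: str = "csv") -> str:
    """Build a name of the form organism_site_year_measurement.ext."""
    parts = [organism, site, str(year), measurement]
    # Drop spaces and punctuation so the name is safe on most file systems.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if any(not part for part in cleaned):
        raise ValueError("every part of the file name must contain letters or digits")
    return "_".join(cleaned) + "." + ext


print(make_file_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
# prints Eaffinis_nanaimo_2010_counts.xls
```

Generating names from the same function keeps them unique and consistent across a project, as long as the four parts together identify one file.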

Page 20: Bren - UCSB - Spooky spreadsheets

From  S.  Hampton  

Planning  Design  file  organization  

Page 21: Bren - UCSB - Spooky spreadsheets

Biodiversity  

Lake  

Experiments  

Field  work  

Grassland  

Biodiv_H20_heatExp_2005to2008.csv
Biodiv_H20_predatorExp_2001to2003.csv
…
Biodiv_H20_PlanktonCount_2001toActive.csv
Biodiv_H20_ChlAprofiles_2003.csv
…

From  S.  Hampton  

Planning  Design  file  organization  

Consider…
• Dependencies?
• File formats?
• Time of collection?
• Order of analysis?

Workflows!

Page 22: Bren - UCSB - Spooky spreadsheets

Planning  

Design your spreadsheet

• Constrain entries
• Atomize
• Break down spreadsheets

From  Flickr  by  Ulleskelf  

Page 23: Bren - UCSB - Spooky spreadsheets

A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables

An RDB provides
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Reduced redundancy & entry errors

 

From  Mark  Schildhauer  

Planning  Consider  a  database  
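To make the three parts of that definition concrete (tables, relationships, a query language), here is a small sketch using Python's built-in `sqlite3` module. The table names, column names, and values are all hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables; counts.site_id relates each count to a row in sites.
conn.executescript("""
CREATE TABLE sites (
    site_id   TEXT PRIMARY KEY,
    site_name TEXT NOT NULL
);
CREATE TABLE counts (
    site_id       TEXT NOT NULL REFERENCES sites(site_id),
    year          INTEGER NOT NULL,
    n_individuals INTEGER NOT NULL
);
""")

# Hypothetical example rows.
conn.executemany("INSERT INTO sites VALUES (?, ?)",
                 [("S1", "Lake"), ("S2", "Grassland")])
conn.executemany("INSERT INTO counts VALUES (?, ?, ?)",
                 [("S1", 2001, 12), ("S1", 2002, 15), ("S2", 2001, 7)])

# SQL is the language used to specify and query the tables.
rows = conn.execute("""
SELECT s.site_name, SUM(c.n_individuals)
FROM counts AS c
JOIN sites AS s ON s.site_id = c.site_id
GROUP BY s.site_name
ORDER BY s.site_name
""").fetchall()

print(rows)  # prints [('Grassland', 7), ('Lake', 27)]
```

The site name is stored once in `sites` and referenced by key, which is where the reduced redundancy and fewer entry errors come from.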

Page 24: Bren - UCSB - Spooky spreadsheets

You should invest time in learning databases if
• your data sets are large or complex

Consider investing time in learning databases if
• your data are small and humble
• you ever intend to share your data
• you are < 30 years old

Planning  Consider  a  database  

From  Mark  Schildhauer  

Page 25: Bren - UCSB - Spooky spreadsheets

Store  your  data  in  a  repository  

Institutional  archive  

Discipline/specialty  archive  

   

 

Pick  a  data  repository  

From  Flickr  by  torkildr  

Ask  a  librarian  

Repos  of  repos:  

databib.org  

re3data.org  

Planning  

Page 26: Bren - UCSB - Spooky spreadsheets

From Flickr by sepa synod

From  Flickr  by    taberandrew  

From  Flickr    by  withassociates  

What  software?  What  hardware?  What  personnel?  

How  often?  Set  up  reminders!  

Test  system    

Decide  on  preservation/backup   Planning  

Page 27: Bren - UCSB - Spooky spreadsheets

…document that describes what you will do with your data throughout the research project

From  Flickr  by  Barbies  Land  

Write  a  data  management  plan!  

Planning  

Page 28: Bren - UCSB - Spooky spreadsheets

DMP  components  

But they all have different requirements and express them in different ways

• What will be collected
• Methods
• Standards
• Metadata
• Sharing/access
• Long-term storage

Planning  

From  Flickr  by  Barbies  Land  

Page 29: Bren - UCSB - Spooky spreadsheets

Step-by-step wizard for generating DMP

create | edit | re-use | share

Free  &  open  to  community    

dmptool.org                    Planning  

Page 30: Bren - UCSB - Spooky spreadsheets

During  Data  Collection  &  Entry  

From  Flickr  by  Julia  Manzerova  

Page 31: Bren - UCSB - Spooky spreadsheets

Realistically:
• Archive .csv version of raw data
• Make a “raw” tab in working data file
• Do all work on other tabs

During  collection  Keep  raw  data  raw  

Page 32: Bren - UCSB - Spooky spreadsheets

Raw  data  as  .csv  

R  script  for  processing  &  analysis  

During  collection  Keep  raw  data  raw  

Ideally:  •  Use  scripts  to  process  data    •  Save  them  with  data    
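One way to read "keep raw data raw" in script form: the raw .csv is only ever opened for reading, all changes go into a separate file, and a checksum confirms the raw file was not altered. This is a sketch under assumptions; the file names, column names, and the whitespace-stripping step are hypothetical stand-ins for real processing.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum used to confirm a file has not changed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def process(raw_path: Path, out_path: Path) -> None:
    """Read the raw file and write a cleaned copy; the raw file is never written."""
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Hypothetical cleaning step: strip stray whitespace from each value.
            writer.writerow({key: value.strip() for key, value in row.items()})


# Demonstration in a temporary folder with made-up data.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.csv"
raw.write_text("site,temp_c\nA, 12.5\nB,13.1 \n")
checksum_before = sha256_of(raw)

clean = workdir / "clean.csv"
process(raw, clean)

# The raw file is byte-for-byte identical after processing.
assert sha256_of(raw) == checksum_before
```

Saving a script like this alongside the data means the cleaned file can always be regenerated from the raw one.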

Page 33: Bren - UCSB - Spooky spreadsheets

During  collection  Document  your  workflow  

Inputs: Temperature data, Salinity data

Steps: Data import into Excel → Quality control & data cleaning → Analysis: mean, SD → Graph production

Products: Data in spreadsheet → “Clean” T & S data → Summary statistics

Workflow:  how  you  get  from  the  raw  data  to  the  final  products  of  your  research  

 

Simple  workflow:  flow  chart  

Page 34: Bren - UCSB - Spooky spreadsheets

During  collection  Document  your  workflow  

Workflow:  how  you  get  from  the  raw  data  to  the  final  products  of  your  research  

 

Simple  workflow:  commented  script  

• R, SAS, MATLAB…
• Well-documented code is
  - Easier to review
  - Easier to share
  - Easier to use for repeat analysis
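A commented script doubles as workflow documentation. The sketch below follows the temperature and salinity example from the flow chart slide (quality control, then mean and SD), written in Python rather than R, SAS, or MATLAB. The readings and the -999.0 missing-value code are hypothetical.

```python
import statistics

# Step 1: raw readings. Hypothetical values; in practice these would be
# read from the archived raw .csv file.
temperature_c = [12.1, 12.4, -999.0, 12.9, 13.0]
salinity_psu = [33.1, 33.4, 33.2, -999.0, 33.3]

# Assumed code for a missing reading; it should be defined in the metadata.
MISSING = -999.0


def quality_control(values):
    """Step 2: drop readings flagged with the missing-value code."""
    return [value for value in values if value != MISSING]


def summarize(values):
    """Step 3: summary statistics (count, mean, sample standard deviation)."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


clean_t = quality_control(temperature_c)
clean_s = quality_control(salinity_psu)

summary = {
    "temperature_c": summarize(clean_t),
    "salinity_psu": summarize(clean_s),
}

for name, stats in summary.items():
    print(name, stats)
```

Each comment names the workflow step it implements, so a reviewer can map the code back to the flow chart.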

Page 35: Bren - UCSB - Spooky spreadsheets

Fancy  schmancy  workflows  Resulting  output  

https://kepler-project.org

During  collection  Document  your  workflow  

Page 36: Bren - UCSB - Spooky spreadsheets

Workflows enable
• Reproducibility
• Transparency
• Reuse

From  Flickr  by  merlinprincesse  

During  collection  Document  your  workflow  

Page 37: Bren - UCSB - Spooky spreadsheets

Constrain data entries
• Excel lists
• Data validation
• Google docs forms

Modified  from  K.  Vanderbilt    

During  collection  
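The same idea as Excel lists and data validation can be applied in a script after entry: check each row against an allowed list and a plausible range. This is only a sketch; the site codes, the temperature range, and the column names are assumptions for illustration.

```python
# Hypothetical controlled vocabulary and plausible range.
ALLOWED_SITES = {"lake", "grassland"}
TEMP_RANGE_C = (-5.0, 40.0)


def validate_row(row: dict) -> list:
    """Return a list of problems found in one row (empty list means the row passes)."""
    problems = []

    # Like a drop-down list: the site must be one of the allowed codes.
    if row.get("site") not in ALLOWED_SITES:
        problems.append(f"unknown site code: {row.get('site')!r}")

    # Like a numeric validation rule: the value must be a number within range.
    try:
        temp = float(row.get("temp_c", ""))
    except (TypeError, ValueError):
        problems.append(f"temp_c is not a number: {row.get('temp_c')!r}")
    else:
        if not TEMP_RANGE_C[0] <= temp <= TEMP_RANGE_C[1]:
            problems.append(f"temp_c out of range: {temp}")

    return problems


print(validate_row({"site": "lake", "temp_c": "12.5"}))
print(validate_row({"site": "Lake ", "temp_c": "warm"}))
```

The second row fails both checks, because "Lake " (capitalized, trailing space) is not the same code as "lake". That is exactly the kind of entry error constrained input prevents.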

Page 38: Bren - UCSB - Spooky spreadsheets

Atomize   During  collection  

One  piece  of  information  per  cell  
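As an illustration of atomizing, a cell that mixes a value and its unit can be split into two fields. This is a sketch that assumes entries look like "12.5 degC" (a number, a space, then the unit); the function name and field names are hypothetical.

```python
def atomize_measurement(cell: str) -> dict:
    """Split a combined 'value unit' cell into one piece of information per field."""
    # Raises ValueError if the cell does not contain both a value and a unit.
    value_text, unit = cell.strip().split(maxsplit=1)
    return {"value": float(value_text), "unit": unit}


print(atomize_measurement("12.5 degC"))
# prints {'value': 12.5, 'unit': 'degC'}
```

Once the value is stored on its own as a number, it can be sorted, averaged, and range-checked without further parsing.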

Page 39: Bren - UCSB - Spooky spreadsheets

 Create  parameter  table  

From  doi:10.3334/ORNLDAAC/777  

From  doi:10.3334/ORNLDAAC/777  

From  R  Cook,  ESA  Best  Practices  Workshop  2010  

During  collection  Break  down  spreadsheets  

Fake  a  relational  database  

Create  a  site  table  
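"Faking" a relational database can be sketched as three small tables linked by keys: a site table, a parameter table, and a data table that refers to both. The table contents and field names below are hypothetical and not taken from the cited data set.

```python
# Site table: one row per site, keyed by a short site ID.
site_table = {
    "S1": {"site_name": "Lake"},
    "S2": {"site_name": "Grassland"},
}

# Parameter table: one row per measured parameter, with its description and unit.
parameter_table = {
    "temp": {"description": "water temperature", "unit": "degC"},
    "sal": {"description": "salinity", "unit": "PSU"},
}

# Data table: stores only keys and values, never repeated descriptions.
data_table = [
    {"site_id": "S1", "parameter": "temp", "value": 12.5},
    {"site_id": "S1", "parameter": "sal", "value": 33.2},
    {"site_id": "S2", "parameter": "temp", "value": 18.0},
]


def expand(record: dict) -> dict:
    """Look up the site and parameter details for one data record."""
    site = site_table[record["site_id"]]
    parameter = parameter_table[record["parameter"]]
    return {**record, "site_name": site["site_name"], "unit": parameter["unit"]}


for record in data_table:
    print(expand(record))
```

In a spreadsheet, the same layout is three tabs (or three .csv files), with the data tab holding only the IDs.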

Page 40: Bren - UCSB - Spooky spreadsheets

Why  are  you  promoting  Excel?  

During  collection  Create  metadata  

Page 41: Bren - UCSB - Spooky spreadsheets

   Metadata:  data  reporting    

WHO  created  the  data?  

WHAT is the content of the data set?

WHEN  was  it  created?  

WHERE  was  it  collected?  

HOW  was  it  developed?  

WHY  was  it  developed?  

From Flickr by /\/\ichael Patric|{

During  collection  Create  metadata  

Page 42: Bren - UCSB - Spooky spreadsheets

Digital  context  

•  Name  of  the  data  set  

•  The  name(s)  of  the  data  file(s)  in  the  data  set  

•  Date  the  data  set  was  last  modified  

•  Example  data  file  records  for  each  data  type  file  

•  Pertinent  companion  files  

•  List  of  related  or  ancillary  data  sets  

•  Software  (including  version  number)  used  to  prepare/read    the  data  set  

•  Data  processing  that  was  performed  

Personnel  &  stakeholders  

•  Who  collected    

•  Who  to  contact  with  questions  

•  Funders  

Scientific  context  

•  Scientific  reason  why  the  data  were  collected  

•  What  data  were  collected  

•  What  instruments  (including  model  &  serial  number)  were  used  

•  Environmental  conditions  during  collection  

•  Temporal  &  spatial  resolution    

•  Standards  or  calibrations  used  

Information  about  parameters  

•  How  each  was  measured  or  produced  

•  Units  of  measure  

•  Format  used  in  the  data  set  

•  Precision  &  accuracy  if  known  

Information  about  data  

•  Definitions  of  codes  used  

•  Quality  assurance  &  control  measures  

•  Known  problems  that  limit  data  use  (e.g.  uncertainty,  sampling  problems)    

During  collection  Create  metadata  
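A lightweight way to start on this checklist is a plain-text metadata record saved next to the data file. The sketch below uses the who/what/when/where/how/why questions from the earlier slide as required fields. Every value is a hypothetical placeholder, and this is not a formal metadata standard.

```python
import json

# The six questions from the "data reporting" slide.
REQUIRED_FIELDS = ["who", "what", "when", "where", "how", "why"]

# All values below are hypothetical placeholders.
metadata = {
    "who": "A. Researcher (hypothetical)",
    "what": "Zooplankton counts",
    "when": "2010",
    "where": "Hypothetical sampling site",
    "how": "Net tows, counted under a microscope",
    "why": "Illustrative example only",
    "units": {"count": "individuals per sample"},
    "codes": {"NA": "not measured"},
}


def missing_fields(record: dict) -> list:
    """List the required questions that have no answer yet."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Text that could be saved as a sidecar file, e.g. next to the .csv it describes.
sidecar_text = json.dumps(metadata, indent=2, sort_keys=True)

print(missing_fields(metadata))  # prints []
print(sidecar_text)
```

A record like this is easy to convert later into a formal standard with a dedicated tool.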

Page 43: Bren - UCSB - Spooky spreadsheets

• Provide structure to describe data: common terms | definitions | language | structure

• Come in many flavors: EML, FGDC, ISO 19115, DarwinCore, …

• Can be met using software tools: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)

   

What  is  metadata?  

Metadata  standards…  

During  collection  Create  metadata  


Page 44: Bren - UCSB - Spooky spreadsheets

Back  up  daily   During  collection  

From  Flickr  by  lippo  

From  Flickr  by  see  phar  

Original  

Near  

Far  
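The original/near/far scheme can be scripted so the daily backup is one command. This sketch copies a file to two backup folders and verifies each copy with a checksum. The folder names are hypothetical, and a real "far" copy would live on a different machine or service rather than in a local folder.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    """SHA-256 of a file's contents, used to verify each copy."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(original: Path, backup_dirs: list) -> list:
    """Copy the original into each backup folder and confirm the copies match."""
    copies = []
    for backup_dir in backup_dirs:
        backup_dir.mkdir(parents=True, exist_ok=True)
        copy_path = backup_dir / original.name
        shutil.copy2(original, copy_path)  # copy2 also preserves timestamps
        if checksum(copy_path) != checksum(original):
            raise OSError(f"backup at {copy_path} does not match the original")
        copies.append(copy_path)
    return copies


# Demonstration in a temporary folder with made-up data.
root = Path(tempfile.mkdtemp())
original = root / "original" / "counts.csv"
original.parent.mkdir()
original.write_text("site,count\nA,3\n")

copies = back_up(original, [root / "near", root / "far"])
print([copy.parent.name for copy in copies])  # prints ['near', 'far']
```

Verifying the checksum is the scripted equivalent of testing the backup system rather than assuming it works.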

Page 45: Bren - UCSB - Spooky spreadsheets

During  collection  

From  Flickr  by  Barbies  Land  

Remember  that  data  management  plan?  

Revisit  Review  Revise  

Page 46: Bren - UCSB - Spooky spreadsheets

During  collection  

Schedule  a  time  each  week  or  month  

Revisit  Review  Revise  

From  Flickr  by  purplemattfish  

Page 47: Bren - UCSB - Spooky spreadsheets

From Flickr by dipster1

Toolbox  

Page 48: Bren - UCSB - Spooky spreadsheets

Step-by-step wizard for generating DMP

create | edit | re-use | share

Free  &  open  to  community    

dmptool.org                    

Write  a  DMP  

Page 49: Bren - UCSB - Spooky spreadsheets

databib.org  

Where  should  I  put  my  data?  

Find  a  repository  

Page 50: Bren - UCSB - Spooky spreadsheets

• Help researchers manage, describe, and share tabular data
• Free
• Add-in for Excel & web application

Manage  &  share  

Page 51: Bren - UCSB - Spooky spreadsheets

Features
1. Best practices check
2. Generate metadata
3. Get identifier & citation
4. Post data to repository

Manage  &  share  

Page 52: Bren - UCSB - Spooky spreadsheets

Create  metadata  

Page 53: Bren - UCSB - Spooky spreadsheets

Create  metadata  

Page 54: Bren - UCSB - Spooky spreadsheets

Clean  data  

Open Refine = Google Refine

• Open source desktop application
• Used for data cleanup and transformation to other formats
• Works with spreadsheets but behaves like a database
• User can filter the rows to display using facets that define filtering criteria

Page 55: Bren - UCSB - Spooky spreadsheets


Page 56: Bren - UCSB - Spooky spreadsheets

DCXL  blog:  dcxl.cdlib.org  

Toolbox:    

Get  help  

Page 57: Bren - UCSB - Spooky spreadsheets

From Flickr by twm1340

Culture  Shift  Ahead  

Page 58: Bren - UCSB - Spooky spreadsheets

science  source  notebook  content  access  data  government  knowledge  

From Flickr by cdsessums

Page 59: Bren - UCSB - Spooky spreadsheets

From  Flickr  by  Andy  Graulund  

Make a resolution
• Triage on current projects
• Get advisor, lab mates, collaborators on board
• Do better next time

Page 60: Bren - UCSB - Spooky spreadsheets

Website: carlystrasser.net
Email: [email protected]
Twitter: @carlystrasser
Slides: slideshare.net/carlystrasser