Using DCO Data (Infrastructure, Management, Analysis, Visualization, …) Peter Fox @taswegian, [email protected] (Marshall Ma) and the Data Science [email protected]

  • Slide 1
  • Using DCO Data (Infrastructure, Management, Analysis, Visualization, …). Peter Fox @taswegian, [email protected] (Marshall Ma) and the Data Science [email protected]. Tetherless World Constellation, Rensselaer Polytechnic Institute. DCO Summer School, July 14, 2014, Big Sky, MT. https://deepcarbon.net/group/dco-summer-school-2014
  • Slide 2
  • Deep Carbon Observatory: a global community of carbon scientists (~1000 from ~40 countries) contributing to a Deep Earth Computer (data legacy) comprising: Global Earth Mineral Laboratory; Global Census of Deep Fluids; Global Volcano Gas Emissions; Global Census of Deep Microbial Life; Global State of High Pressure and Temperature Carbon and Related Materials; Global Inventory of Diamonds with Inclusions
  • Slide 3
  • Data Science is: doing science with someone else's data, across datasets, with models; multi-dimensional, multi-scale, multi-mode, complex data types needing new analytic and visual approaches, especially in multiple dimensions (functional). E.g. detection/attribution methods/algorithms; visual exploration
  • Slide 4
  • You may see many diagrams like [diagram not preserved]
  • Slide 5
  • Physical quantity versus measured-as quantity. Value and units? Reference frame? Reference units? Value and units?
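The questions on this slide (value, units, reference frame) can be carried with each number rather than left implicit. A minimal sketch in Python, assuming a hypothetical `Measurement` record and an illustrative pressure conversion table; none of these names come from the slides.

```python
from dataclasses import dataclass

# Illustrative conversion factors to pascals (an assumption for this sketch).
TO_PASCAL = {"Pa": 1.0, "kPa": 1.0e3, "MPa": 1.0e6, "GPa": 1.0e9, "bar": 1.0e5}


@dataclass(frozen=True)
class Measurement:
    """A measured value that keeps its units and reference frame attached."""
    quantity: str        # what was measured, e.g. "pressure"
    value: float         # the number as recorded
    units: str           # units of `value`
    reference: str = ""  # reference frame or datum, if known

    def to_pascal(self) -> float:
        """Convert a pressure value to pascals; fail loudly on unknown units."""
        if self.units not in TO_PASCAL:
            raise ValueError(f"unknown pressure units: {self.units!r}")
        return self.value * TO_PASCAL[self.units]


p = Measurement("pressure", 2.5, "GPa", reference="absolute")
print(p.to_pascal())
```

Raising an error on unknown units, instead of guessing, keeps a missing or misspelled unit from silently changing the result.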
  • Slide 6
  • Use case: How DCO Finds Out About Data. A scientist bringing new data (spreadsheet, diagram, digital map, report); a data manager transforming data; transformed data ready for import; repository staff / data librarian; importing tool; a data repository; Internet (Fleischer, 2011)
  • Slide 7
  • Data-Information-Knowledge Ecosystem: Data, Information, Knowledge; Producers, Consumers; Context, Presentation, Organization, Integration, Conversation, Creation, Gathering, Experience
  • Slide 8
  • Producers, Consumers; Quality Control; Fitness for Purpose; Fitness for Use; Quality Assessment; Trustee; Trustor
  • Slide 9
  • Spreadsheets, e.g. Excel: import data
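Importing spreadsheet data usually means deciding what to do with empty and non-numeric cells. A minimal sketch, assuming the spreadsheet has been exported to CSV (the slide only mentions Excel) and using hypothetical column names:

```python
import csv
import io

# Hypothetical export of a small spreadsheet; column names are illustrative.
SAMPLE_CSV = """sample_id,depth_m,temperature_C
S1,120.5,14.2
S2,,15.0
S3,310.0,not measured
"""


def to_float(cell):
    """Return the cell as a float, or None if it is empty or not numeric."""
    try:
        return float(cell)
    except ValueError:
        return None


def load_rows(text):
    """Parse CSV text into dicts, converting the numeric columns."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "sample_id": record["sample_id"],
            "depth_m": to_float(record["depth_m"]),
            "temperature_C": to_float(record["temperature_C"]),
        })
    return rows


rows = load_rows(SAMPLE_CSV)
print(rows[1])
```

Mapping unreadable cells to `None` keeps gaps visible downstream instead of turning them into zeros.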
  • Slide 10
  • Documentation?
  • Slide 11
  • Census of Deep Life: substantial metadata. How to visualize THIS?
  • Slide 12
  • Bias: To incline to one side; to give a particular direction to; to influence; to prejudice; to prepossess. [1913 Webster] A partiality that prevents objective consideration of an issue or situation [syn: prejudice, preconception]. For acquisition, sampling bias is your enemy. Cognitive bias is (due to) YOU!
  • Slide 13
  • Provenance*: origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility. Internal; External
  • Slide 14
  • How do you find DCO data? http://deepcarbon.net/dco_datasets will soon be a window into community-based sources: http://metpetdb.rpi.edu http://earthchem.org/ http://www.earthchem.org/petdb http://vamps.mbl.edu/portals/deep_carbon/cdl.php
  • Slide 15
  • Browser
  • Slide 16
  • All information is linked and traceable!
  • Slide 17
  • Slide 18
  • E.g. Deep Life (CoDL). New tools: R (statistics, visualization, modeling), D3.js (visualization). NOT just of the data, but of all types of information, knowledge! iPython Notebooks?
  • Slide 19
  • When You Use Data Science 2.0: Version/subsetting and converting to a format you are familiar with is very common but mysterious. Take notes: document provenance. Software: what did you use and how? Derived products: what did you create, how, why, etc. Use the metadata every chance you get, e.g. filenames! Place them in a Web-accessible folder; consider getting an identifier. Use social media, blogs, etc. to discuss it.
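The note-taking advice above (source, software, derived products) can be captured as a small machine-readable record kept next to the derived file. A minimal sketch, assuming a hypothetical `provenance_record` helper and field names chosen for illustration; the source URL is the dataset listing from the slides, used here only as a placeholder:

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def provenance_record(source_url, data_bytes, steps, derived_name):
    """Build a plain-dict provenance note for a derived product."""
    return {
        "source": source_url,
        # Checksum of the bytes actually used, so the input can be verified later.
        "source_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
        "software": {
            "python": platform.python_version(),
            "platform": sys.platform,
        },
        "steps": list(steps),            # what was done, in order
        "derived_product": derived_name,  # what was created
    }


record = provenance_record(
    source_url="http://deepcarbon.net/dco_datasets",  # placeholder source
    data_bytes=b"sample_id,depth_m\nS1,120.5\n",      # stand-in for downloaded data
    steps=["subset to depth_m > 100", "converted to CSV"],
    derived_name="subset_depth_gt_100.csv",           # hypothetical file name
)
print(json.dumps(record, indent=2))
```

Writing the JSON to a sidecar file beside the derived product keeps the notes with the data when the folder is shared.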
  • Slide 20
  • 4 Rs Goble and others
  • Slide 21
  • Slide 22
  • Exercise 1 Search for and access a dataset that you are not familiar with: Can you read it? Can you make sense of it? Can you assess quality, uncertainty? Any sources of bias? What would you need to do to make it useful?
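A first pass at the exercise questions (can you read it, are there gaps, do the ranges look plausible) can be automated. A minimal sketch, assuming the dataset is available as CSV text; the `profile` helper and the sample columns are hypothetical:

```python
import csv
import io

# Hypothetical stand-in for an unfamiliar dataset.
SAMPLE_CSV = """sample_id,depth_m,temperature_C
S1,120.5,14.2
S2,,15.0
S3,310.0,not measured
"""


def profile(text):
    """Summarize each column: missing cells, numeric cells, numeric range."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    summary = {
        name: {"missing": 0, "numeric": 0, "min": None, "max": None}
        for name in fieldnames
    }
    for record in reader:
        for name in fieldnames:
            cell = record.get(name)
            stats = summary[name]
            if cell is None or cell.strip() == "":
                stats["missing"] += 1
                continue
            try:
                number = float(cell)
            except ValueError:
                continue  # free text: neither missing nor numeric
            stats["numeric"] += 1
            stats["min"] = number if stats["min"] is None else min(stats["min"], number)
            stats["max"] = number if stats["max"] is None else max(stats["max"], number)
    return summary


summary = profile(SAMPLE_CSV)
for name, stats in summary.items():
    print(name, stats)
```

A summary like this does not assess quality or bias by itself, but it shows where to look: columns with many gaps, free text in numeric fields, or ranges that are physically implausible.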
  • Slide 23
  • When You Generate Data Science 2.0: How the data was generated, why, for what, when and in what format. Take notes: document provenance. Software: what did you use and how? Derived products: what did you create, how, why, etc. Use the metadata every chance you get, e.g. filenames! Place them in a Web-accessible folder; consider getting an identifier. Use social media, blogs, etc. to discuss it.
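One way to act on "use the metadata every chance you get, e.g. filenames" is to build file names from fixed fields so they can be parsed back later. A minimal sketch, assuming a hypothetical naming convention (project, site, variable, date, version); the convention is illustrative, not a DCO standard:

```python
import re
from datetime import date

# Hypothetical convention: project_site_variable_YYYYMMDD_vN.ext
FILENAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<site>[a-z0-9]+)_(?P<variable>[a-z0-9]+)"
    r"_(?P<date>\d{8})_v(?P<version>\d+)\.(?P<ext>[a-z0-9]+)$"
)


def build_filename(project, site, variable, collected, version, ext="csv"):
    """Compose a file name whose parts carry the key metadata."""
    for part in (project, site, variable, ext):
        # Underscores separate fields, so they are not allowed inside one.
        if not re.fullmatch(r"[a-z0-9]+", part):
            raise ValueError(f"use lowercase letters and digits only: {part!r}")
    return f"{project}_{site}_{variable}_{collected:%Y%m%d}_v{version}.{ext}"


def parse_filename(name):
    """Recover the metadata fields from a name made by build_filename."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    return fields


name = build_filename("dco", "site1", "co2flux", date(2014, 7, 14), 2)
print(name)  # dco_site1_co2flux_20140714_v2.csv
print(parse_filename(name))
```

Restricting each field to letters and digits keeps the underscore unambiguous as a separator, which is what makes the name reliably parseable.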
  • Slide 24
  • Share: make it visible to DCO (can be private). https://deepcarbon.net/dco/dco-open-access-and-data-policies https://deepcarbon.net/page/submit-community-data You get an identifier! A DCO-ID can be cited, rewarded and much more
  • Slide 25
  • DCO checklist: what people have to do (courtesy UC3). Your data management plan; funding agency requirements; creating your data; organizing your data; managing your data; sharing your data. Domain Scientist; Data manager; Repository staff; Data Scientist; Curation Services & Tools. Domain scientists often also take up these two roles, which, however, is neither efficient nor effective (i.e., the 80-20 rule).
  • Slide 26
  • DCO checklist: a service & tool perspective. Your data management plan; AP Sloan requirements+; creating your data; organizing your data; managing your data; sharing your data. E.g., NSF New Proposal and Award Policies and Procedures Guide (effective January 14, 2013). Object Modeling; Identity Services; Storage Services; Ingest Services; Discovery Service; Characterization Services; Access Services. CKAN, community; faceted search and Drupal, etc.; DCO-ID (Handle+DOI) + Linked Data, community; Schema.org, etc.; use cases, info. model
  • Slide 27
  • Exercise 2: Begin with a recent dataset that you generated or were involved in generating. Can someone else read it? Can someone make sense of it? Have you asserted quality, uncertainty? Have you described known sources of bias? What else would you now do to make it more useful?
  • Slide 28
  • Further reading. Data Science course at RPI: http://tw.rpi.edu/web/Courses/DataScience/2013 Fourth Paradigm: http://research.microsoft.com/en-us/collaboration/fourthparadigm/ Data Management Planning tools: http://tw.rpi.edu/web/project/DCO-DS/WorkingGroups/DMP http://www.iedadata.org/compliance/plan https://dmp.cdlib.org/
  • Slide 29
  • Breakout Session Today: Exercises 1 and 2; Discussion
  • Slide 30
  • Friday: Marshall (Xiaogang) Ma will round out the data discussion. DCO goal for data: in the interim, help you become data scientists (as well as your specialty). Then, in time, you can drop "data" because you will handle data as easily as you do field work, use instruments, etc.