Using DCO Data (Infrastructure, Management, Analysis, Visualization, …)
Peter Fox @taswegian, [email protected] (Marshall Ma) and the Data Science [email protected]
Tetherless World Constellation, Rensselaer Polytechnic Institute
DCO Summer School, July 14, 2014, Big Sky, MT
Data Science
https://deepcarbon.net/group/dco-summer-school-2014
Slide 2
Deep Carbon Observatory
Global community of carbon scientists (~1000 from ~40 countries) contributing to a Deep Earth Computer (data legacy) comprising:
Global Earth Mineral Laboratory
Global Census of Deep Fluids
Global Volcano Gas Emissions
Global Census of Deep Microbial Life
Global State of High Pressure and Temperature Carbon and Related Materials
Global Inventory of Diamonds with Inclusions
Slide 3
Data Science is:
Doing science with someone else's data
Across datasets, with models
Multi-dimensional, multi-scale, multi-mode, complex data types needing new analytic and visual approaches
Especially in multiple dimensions (functional), e.g. detection/attribution methods and algorithms
Visual exploration
Slide 4
You may see many diagrams like this [diagram not preserved]
Slide 5
Physical quantity versus measured as quantity
Value and units?
Reference frame?
Reference units?
Value and units?
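One way to make the questions on this slide concrete is to keep the units and the reference frame attached to every value. The following is a minimal sketch in Python, not a DCO tool: the `Measurement` class, its field names, the conversion table, and the sample depth are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical conversion factors to metres; extend as needed.
_TO_METRES = {"m": 1.0, "km": 1000.0, "cm": 0.01}


@dataclass(frozen=True)
class Measurement:
    quantity: str         # physical quantity, e.g. "depth"
    value: float          # measured value
    unit: str             # unit the value is expressed in
    reference_frame: str  # what the value is measured relative to

    def to(self, unit: str) -> "Measurement":
        """Return the same measurement expressed in another length unit."""
        metres = self.value * _TO_METRES[self.unit]
        return Measurement(
            self.quantity, metres / _TO_METRES[unit], unit, self.reference_frame
        )


# Illustrative value only: a depth quoted in km, relative to the sea floor.
depth = Measurement("depth", 2.5, "km", "below sea floor")
depth_m = depth.to("m")
print(depth_m)
```

Because the reference frame is part of the record, a depth "below sea floor" cannot be silently mixed with a depth "below sea level" after a unit conversion.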
Slide 6
Data
A scientist bringing new data: spreadsheet, diagram, digital map, report
A data manager transforming data
Transformed data ready for import
Repository staff / data librarian (Fleischer, 2011)
Importing tool
A data repository
Internet
Use case: How DCO Finds Out About Data
Slide 8
Producers: Quality Control, Fitness for Purpose, Trustee
Consumers: Quality Assessment, Fitness for Use, Trustor
Slide 9
Spreadsheets
E.g. Excel: import data
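A short sketch of what importing spreadsheet data can look like in practice, assuming the sheet has first been exported to CSV. The column names, the sample rows, and the set of missing-value markers are hypothetical; real spreadsheets will need their own list.

```python
import csv
import io

# Hypothetical CSV export of a spreadsheet; column names are illustrative only.
exported = io.StringIO(
    "sample_id,depth_m,temperature_C\n"
    "S1,120.5,14.2\n"
    "S2,,15.1\n"
    "S3,300.0,n/a\n"
)

MISSING = {"", "n/a", "NA"}  # assumed markers for empty cells


def parse_cell(text):
    """Return a float for numeric cells and None for missing ones."""
    text = text.strip()
    return None if text in MISSING else float(text)


rows = []
for record in csv.DictReader(exported):
    rows.append({
        "sample_id": record["sample_id"],
        "depth_m": parse_cell(record["depth_m"]),
        "temperature_C": parse_cell(record["temperature_C"]),
    })

# Count missing cells so gaps are reported rather than silently dropped.
missing = sum(value is None for row in rows for value in row.values())
print(f"{len(rows)} rows, {missing} missing cells")
```

Making missing values explicit at import time keeps a blank cell from being mistaken for a zero later in the analysis.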
Slide 10
Documentation?
Slide 11
Substantial metadata: how to visualize THIS?
Census of Deep Life
Slide 12
Bias
To incline to one side; to give a particular direction to; to influence; to prejudice; to prepossess. [1913 Webster]
A partiality that prevents objective consideration of an issue or situation [syn: prejudice, preconception]
For acquisition, sampling bias is your enemy
Cognitive bias is (due to) YOU!
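To illustrate why sampling bias matters at acquisition time, here is a small simulation sketch. It assumes a synthetic population and a hypothetical detection threshold below which nothing gets recorded; none of the numbers come from DCO data.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values of some measured property.
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]


def mean(values):
    return sum(values) / len(values)


true_mean = mean(population)

# Unbiased acquisition: a simple random sample.
random_sample = random.sample(population, 500)

# Biased acquisition: only values above a detection threshold are recorded.
threshold = 55.0
biased_sample = [x for x in population if x > threshold][:500]

print(f"population mean:    {true_mean:.1f}")
print(f"random sample mean: {mean(random_sample):.1f}")
print(f"biased sample mean: {mean(biased_sample):.1f}")
```

The random sample tracks the population mean, while the thresholded sample overestimates it no matter how many values are collected. More data does not fix a biased acquisition.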
Slide 13
Provenance*
Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility
Internal
External
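The elements of this definition map naturally onto a structured record. The sketch below is one possible shape, assuming Python and JSON output. The `ProvenanceRecord` class and every field value are hypothetical; this is not a DCO schema or an established provenance standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    source: str        # origin or source the data comes from
    intended_use: str  # intention for use
    generated_by: str  # who or what generated it
    method: str        # manner of manufacture
    place: str         # sense of place ...
    created: str       # ... and time (ISO 8601 date)
    history: list = field(default_factory=list)  # subsequent owners and changes

    def add_step(self, who, what, when):
        """Append one event to the history of the dataset."""
        self.history.append({"who": who, "what": what, "when": when})


# All values below are placeholders for illustration.
record = ProvenanceRecord(
    source="hypothetical field campaign",
    intended_use="exploratory analysis",
    generated_by="hypothetical instrument operator",
    method="instrument readout exported to CSV",
    place="hypothetical sampling site",
    created="2014-07-14",
)
record.add_step("data manager", "converted units to SI", "2014-07-15")

print(json.dumps(asdict(record), indent=2))
```

Storing the record next to the data file keeps the history with the dataset as it changes hands.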
Slide 14
How do you find DCO data?
http://deepcarbon.net/dco_datasets
Will soon be a window into community-based sources:
http://metpetdb.rpi.edu
http://earthchem.org/
http://www.earthchem.org/petdb
http://vamps.mbl.edu/portals/deep_carbon/cdl.php
Slide 15
Browser
Slide 16
All information is linked and traceable!
Slide 17
Slide 18
E.g. Deep Life (CoDL)
New tools: R (statistics, visualization, modeling), D3.js (visualization)
NOT just of the data, but of all types of information, knowledge!
iPython Notebooks?
Slide 19
When You Use Data Science 2.0
Version/subsetting and converting to a format you are familiar with is very common but mysterious
Take notes: document provenance
Software: what did you use and how?
Derived products: what did you create, how, why, etc.
Use the metadata every chance you get, e.g. filenames!
Place them in a Web-accessible folder, consider getting an identifier
Use social media, blogs, etc. to discuss it.
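The advice to put metadata in filenames can be sketched as a naming convention that is both generated and parsed by code, so derived products stay identifiable. The convention below (`<project>_<variable>_<YYYYMMDD>_v<version>.<ext>`) is an assumption made for this example, not a DCO standard, and the helper names are hypothetical.

```python
import re
from datetime import date

# Assumed convention (illustrative only):
#   <project>_<variable>_<YYYYMMDD>_v<version>.<ext>
PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<variable>[a-z0-9]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d+)\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, variable, day, version, ext):
    """Compose a filename that carries its own metadata."""
    return f"{project}_{variable}_{day:%Y%m%d}_v{version}.{ext}"


def parse_name(filename):
    """Recover the metadata fields from a filename, or fail loudly."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"filename does not follow the convention: {filename}")
    return match.groupdict()


name = build_name("codl", "subset", date(2014, 7, 14), 2, "csv")
print(name)
print(parse_name(name))
```

Rejecting names that do not parse is a deliberate choice here: a file with an undocumented name is exactly the "mysterious" derived product the slide warns about.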
Slide 20
4 Rs Goble and others
Slide 21
Slide 22
Exercise 1
Search for and access a dataset that you are not familiar with:
Can you read it?
Can you make sense of it?
Can you assess quality, uncertainty?
Any sources of bias?
What would you need to do to make it useful?
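A first pass at the exercise questions can be partly automated: does the file parse, which columns are numeric, and how much is missing? The sketch below assumes a CSV file and uses a small in-memory stand-in; the column names, values, and the `profile` helper are hypothetical.

```python
import csv
import io
import statistics

# Hypothetical stand-in for a downloaded file;
# for a real file use open(path, newline="") instead.
raw = io.StringIO(
    "site,carbon_wt_pct\n"
    "A,1.2\n"
    "A,1.4\n"
    "B,\n"
    "B,3.1\n"
)


def profile(handle):
    """Summarise each column: cells present, cells missing, basic statistics."""
    reader = csv.DictReader(handle)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            columns[name].append((row[name] or "").strip())

    report = {}
    for name, cells in columns.items():
        present = [cell for cell in cells if cell]
        summary = {"present": len(present), "missing": len(cells) - len(present)}
        try:
            numbers = [float(cell) for cell in present]
        except ValueError:
            summary["type"] = "text"
        else:
            summary["type"] = "numeric"
            if numbers:
                summary["min"] = min(numbers)
                summary["max"] = max(numbers)
                summary["mean"] = statistics.mean(numbers)
        report[name] = summary
    return report


report = profile(raw)
for column, summary in report.items():
    print(column, summary)
```

A profile like this answers only "can you read it?" and hints at quality. Judging uncertainty and bias still requires the documentation and provenance that came with the dataset.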
Slide 23
When You Generate Data Science 2.0
How the data was generated, why, for what, when and in what format
Take notes: document provenance
Software: what did you use and how?
Derived products: what did you create, how, why, etc.
Use the metadata every chance you get, e.g. filenames!
Place them in a Web-accessible folder, consider getting an identifier
Use social media, blogs, etc. to discuss it.
Slide 24
Make it visible to DCO (can be private)
https://deepcarbon.net/dco/dco-open-access-and-data-policies
https://deepcarbon.net/page/submit-community-data
You get an identifier! DCO-ID: can be cited, rewarded and much more
Slide 25
DCO checklist: what people have to do (courtesy UC3)
Your data management plan
Funding agency requirements
Creating your data
Organizing your data
Managing your data
Sharing your data
Domain Scientist
Data manager
Repository staff
Data Scientist
Curation Services & Tools
Domain scientists often also take up these two roles, which is neither efficient nor effective (i.e., the 80-20 rule).
Slide 26
DCO checklist: a service & tool perspective
Your data management plan
AP Sloan requirements+
Creating your data
Organizing your data
Managing your data
Sharing your data
e.g., NSF New Proposal and Award Policies and Procedures Guide (effective January 14, 2013)
Object Modeling
Identity Services
Storage Services
Ingest Services
Discovery Service
Characterization Services
Access Services
CKAN, community Faceted search and Drupal etc. DCO-ID (Handle+DOI) + Linked Data, community Schema.org, etc. Use cases, info. model
Slide 27
Exercise 2
Begin with a recent dataset that you generated or were involved in generating:
Can someone else read it?
Can someone make sense of it?
Have you asserted quality, uncertainty?
Have you described known sources of bias?
What else would you now do to make it more useful?
Slide 28
Further reading
Data Science course at RPI: http://tw.rpi.edu/web/Courses/DataScience/2013
Fourth Paradigm: http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Data Management Planning tools:
http://tw.rpi.edu/web/project/DCO-DS/WorkingGroups/DMP
http://www.iedadata.org/compliance/plan
https://dmp.cdlib.org/
Slide 29
Breakout Session Today
Exercises 1 and 2
Discussion
Slide 30
Friday
Marshall (Xiaogang) Ma will round out the data discussion
DCO goal for data: in the interim, help you become data scientists (as well as your specialty)
Then, in time, you can drop "data" because you will handle data as easily as you do field work, use instruments, etc.