Motivation: Water/hydrology research is a team sport
• requires integration of information from multiple sources
• is data and computationally intensive
• requires collaboration and working as a team/community
Data
Analysis
Models
• Advancing Hydrologic Understanding
CyberInfrastructure Challenges
• The data deluge
– Large datasets, data heterogeneity, inadequate metadata
• Data organization and model input preparation
• Reproducibility
• Software installation and configuration
– Platform dependencies, library dependencies, licensing
• Computational resources
– Memory, disk and processing
Outline
• Data Management 101 (Many slides from Jeff Horsburgh Research Scholar’s presentation)
• HydroShare Overview
• HydroShare Hands on
The Steven Hall Story
With a little help, Steven deposited his dataset in the online HydroShare repository
Steven collected his data in the field and transformed it into a sharable format
Steven verified his data and metadata were correct but kept the data private
Steven submitted his paper for publication and responded to reviews
Steven published his paper and cited published data in HydroShare
Steven published his data in HydroShare and received a DOI
From Jeff Horsburgh
Data Management 101
• How are you managing your data?
• There are simple guidelines to improve data management
• Benefits
– Improved data organization – facilitates analysis
– Improved reproducibility
– Improved capacity for data re-use
Borer, E.T., E.W. Seabloom, M.B. Jones, and M. Schildhauer (2009). Some simple guidelines for effective data management, ESA Bulletin, 90(2):205-214, http://dx.doi.org/10.1890/0012-9623-90.2.205
From Jeff Horsburgh
1. Don’t Mess with the Raw Data
• Always store uncorrected data with all of its “bumps and warts”
• Do not make any corrections to the raw file
– You could change something that was actually correct
– You could make mistakes while correcting other mistakes
• Script QA/QC procedures and write results to a new file/copy of the data
From Jeff Horsburgh
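The last guideline above (script the QA/QC steps and write the results to a new file) might look something like the following minimal sketch. The column names (`datetime`, `value`), the valid range, and the function names are assumptions made for illustration; they do not come from the original slides.

```python
import csv
import io

# Hypothetical valid range for the measured variable (assumed for illustration).
VALID_MIN, VALID_MAX = 0.0, 40.0


def qc_rows(rows, valid_min=VALID_MIN, valid_max=VALID_MAX):
    """Return new rows with a 'qc_flag' column; the input rows are not modified."""
    checked = []
    for row in rows:
        new_row = dict(row)  # work on a copy so the raw record stays untouched
        try:
            in_range = valid_min <= float(row["value"]) <= valid_max
        except (TypeError, ValueError):
            in_range = False  # missing or non-numeric value
        new_row["qc_flag"] = "ok" if in_range else "out_of_range"
        checked.append(new_row)
    return checked


def run_qc(raw_file, out_file):
    """Read raw data from one file object and write flagged data to another."""
    rows = list(csv.DictReader(raw_file))
    checked = qc_rows(rows)
    writer = csv.DictWriter(out_file, fieldnames=["datetime", "value", "qc_flag"])
    writer.writeheader()
    writer.writerows(checked)
    return checked


# In-memory stand-ins for files; on disk, open the raw file read-only
# and write the result to a differently named file.
raw = io.StringIO("datetime,value\n2016-01-01 00:00,12.5\n2016-01-01 00:15,-9999\n")
out = io.StringIO()
result = run_qc(raw, out)
```

Flagging values rather than deleting them keeps every raw observation recoverable, and because the procedure is a script it can be rerun if the rules change.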
An Example
From Jeff Horsburgh
An Example
Removal of a calibration shift
From Jeff Horsburgh
An Example
Removal of anomalous, out of range values
From Jeff Horsburgh
An Example
Removal of “bad data” – sensor malfunction
From Jeff Horsburgh
2. Use Descriptive File Names
• Use only plain ASCII characters
• Brief, but descriptive of content
• Generally – avoid spaces in file names
• Include a “readme” file when using many files in a directory
From Jeff Horsburgh
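The file-naming guidelines above can be checked automatically. The sketch below is one possible way to do it; the function name, the 64-character limit used as a stand-in for "brief", and the example file names are all assumptions, not part of the original slides.

```python
def filename_problems(name, max_length=64):
    """Return a list of guideline violations for a file name (empty if none)."""
    problems = []
    if not name.isascii():
        problems.append("non-ASCII characters")
    if " " in name:
        problems.append("contains spaces")
    if len(name) > max_length:  # assumed threshold for "brief"
        problems.append("longer than %d characters" % max_length)
    return problems


# Hypothetical file names used only to exercise the checks.
for candidate in ["river_watertemp_2016.csv", "data final (2).csv", "Temp\u00e9rature.csv"]:
    print(candidate, filename_problems(candidate))
```

A check like this cannot tell whether a name is descriptive of the content; that part still needs a human reader.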
This might not be the best system…
How could we make this better?
From Jeff Horsburgh
Streamflow Data from USGS
From Jeff Horsburgh
4. Do Not Mix Data Types in Table Columns
• Numeric, strings, date/time, boolean
• Different software packages will handle mixed data types inconsistently
• Can be more difficult to detect errors in the data
• Can cause erroneous results
From Jeff Horsburgh
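To illustrate guideline 4, the sketch below scans a CSV table and reports columns that contain more than one data type. It is only a rough illustration: the type categories follow the slide, but the classification rules, the assumed timestamp format, and the sample data are assumptions.

```python
import csv
import io
from datetime import datetime


def classify(cell):
    """Classify one cell as 'empty', 'boolean', 'numeric', 'datetime' or 'string'."""
    text = (cell or "").strip()
    if text == "":
        return "empty"
    if text.lower() in ("true", "false"):
        return "boolean"
    try:
        float(text)
        return "numeric"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d %H:%M")  # assumed timestamp format
        return "datetime"
    except ValueError:
        return "string"


def mixed_columns(csv_file):
    """Map each column holding more than one non-empty type to the types found."""
    seen = {}
    for row in csv.DictReader(csv_file):
        for column, cell in row.items():
            if column is None:  # extra cells beyond the header are ignored here
                continue
            kind = classify(cell)
            if kind != "empty":
                seen.setdefault(column, set()).add(kind)
    return {col: kinds for col, kinds in seen.items() if len(kinds) > 1}


# Made-up sample: the text "N/A" is mixed into an otherwise numeric column.
sample = io.StringIO(
    "datetime,temp_c\n"
    "2016-01-01 00:00,4.2\n"
    "2016-01-01 00:15,N/A\n"
    "2016-01-01 00:30,4.4\n"
)
report = mixed_columns(sample)
```

Here the `temp_c` column is reported because a text placeholder sits among numbers, which is the kind of mixing that different software packages handle inconsistently.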
5. Archive Data in Non-Proprietary Data Formats
• Microsoft Excel is widely available and used now, but what about in 10 years? 20 years?
• How many other software programs can open your data?
• Will your data disappear if the file format/software becomes obsolete?
From Jeff Horsburgh
6. Preservation/Backup Media
How are you preserving your data now?
• Does your office look like this?
• What are the potential problems?
• What are some potential solutions?
From Jeff Horsburgh
Data Loss
• Natural disaster
• Facilities infrastructure failure
• Storage failure
• Server hardware/software failure
• Application software failure
• External dependencies
• Format obsolescence
• Legal encumbrance
• Human error
• Malicious attack by human or automated agents
• Loss of staffing competencies
• Loss of institutional commitment
• Loss of financial stability
• Changes in user expectations and requirements
CC image by Sharyn Morrow on Flickr
CC image by momboleum on Flickr
Slide courtesy DataONE.
From Jeff Horsburgh
To the Cloud!
• Convenience
• Accessibility anywhere
• Cross platform
• Enhanced sharing
• Low cost
• But…
– Privacy?
– Delay (slow or non-existent internet)
– Storage, but not much else
– File formats and semantics still matter
– No community of similar experts
From Jeff Horsburgh
Why store your model on HydroShare (where your data is also located)?
• Model creates reproducible results
• Models/code can be shared by simply giving permission (no need to copy)
• Models can be re-executed at any time
From Jeff Horsburgh
Reproducible Visualization in Python
From Jeff Horsburgh
8. Maintain Metadata (Information about Data)
Borer et al.: “Do not underestimate your ability to forget details about a study!”
– WHO created the data?
– WHAT is the content of the data?
– WHEN were the data created?
– WHERE is it geographically?
– WHY were the data developed?
– HOW were the data developed?
From Jeff Horsburgh
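One lightweight way to act on the who/what/when/where/why/how questions above is to keep a small metadata record next to each data file and check it for gaps. The sketch below is an illustration only: the key names mirror the slide, while the function name, the JSON sidecar idea, and every value shown are placeholders rather than anything from the original presentation.

```python
import json

REQUIRED_KEYS = ("who", "what", "when", "where", "why", "how")


def missing_metadata(metadata):
    """Return the who/what/when/where/why/how keys that are absent or blank."""
    return [k for k in REQUIRED_KEYS if not str(metadata.get(k) or "").strip()]


# All values below are placeholders, not real study details.
metadata = {
    "who": "Example Researcher, Example University",
    "what": "Water temperature time series, 15-minute interval, degrees C",
    "when": "2016-01-01 to 2016-12-31",
    "where": "Example River gauge (coordinates in WGS 84)",
    "why": "Collected for a hypothetical stream temperature study",
    "how": "",  # left blank on purpose to show the check
}

gaps = missing_metadata(metadata)
sidecar_text = json.dumps(metadata, indent=2)  # could be saved next to the data file
print("Missing metadata:", gaps)
```

Plain JSON is used here because it is a non-proprietary text format; a repository's own metadata forms would normally replace a hand-written record like this one.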
Sharing Data: The Golden Rule
• When you provide data to someone else, what types of information would you want to include with the data?
• When you receive a dataset from an external source, what types of details do you want to know about the data?
From Jeff Horsburgh
Sharing Data
• Providing data:
– Why were the data created?
– What limitations do the data have?
– What do the data mean?
– How should the data be cited if they are re-used in a new study?
• Receiving data:
– What are the data gaps?
– What processes were used for creating the data?
– Are there any fees associated with the data?
– In what scale were the data created?
– What do the values in the tables mean?
– What software do I need in order to read the data?
– What projection are the data in?
– Can I give these data to someone else?
From Jeff Horsburgh
Necessary Meta/data Structure
The degree of metadata format and structure necessary for different levels of projected secondary data utilization. (adapted from Michener et al., 1997).
From Jeff Horsburgh
Summary
1. Don’t mess with the raw data
2. Use descriptive file names
3. Use descriptive file headers
4. Do not mix data types in table columns
5. Archive data in non-proprietary data formats
6. Consider media
7. Ensure reproducibility
8. Maintain metadata
From Jeff Horsburgh
Data and models used by hydrologists are diverse…
• Time series
• Geographic rasters
• Geographic features
• Multidimensional space/time
• Model programs
• Model instances
• …
[Figure: multidimensional space/time data cube of numbered cells, with axes X, Y and Time]
http://www.unidata.ucar.edu
http://www.usgs.gov
http://www.esri.com
From Jeff Horsburgh
HydroShare can hold data in a wide variety of formats, and data in any other format can be stored as “generic”
How do people share other content now?
• YouTube
• Facebook
• Instagram
• Dropbox
• Google Drive
• ArcGIS Online
• Hydrologic data?
HydroShare is a platform for sharing Hydrologic Resources and Collaborating
DropBox-ish Functionality (dropbox.com)
• File Storage
Value Added Functionality
• Metadata Descriptions
• Data Access API
• Web Apps
• Social Functions
• DOI Data Publication
The goal of HydroShare is to advance hydrologic science by enabling the scientific community to more easily and freely share products resulting from their research - not just the scientific publication summarizing a study, but also the data and models used to create the scientific publication.
Collaborative data sharing
Add content to HydroShare to share with your colleagues or formally publish to document result reproducibility
Resources (data and models) in HydroShare are objects of collaboration (social objects)
For each resource you can
- Manage who has access
  - To edit
  - To view
- Comment or rate
- Get unique identifier
- Describe with metadata
- Organize into collections
- Formally publish
- Version
- Open with compatible web app
Resources formally published receive a citable digital object identifier (DOI) and are made immutable to changes
...
Formal data publication
Automatic and natural metadata gathering eases some of the pain of metadata entry
For geographic rasters, WGS 84 coverage information is automatically harvested from the GeoTIFF coordinate system information
For multidimensional netCDF data with CF convention metadata, the HydroShare metadata can be fully and automatically completed
Summary
1. A new, web-based system for advancing model and data sharing
2. Access multiple types of hydrologic data using standards compliant data formats and interfaces
3. Flexible discovery functionality
4. Model sharing and execution
5. Facilitate and ease access to use of high performance computing
6. Social media and collaboration functionality
7. Links to other data and modeling systems
8. Enable more rapid advances in hydrologic understanding through collaborative data sharing, analysis and modeling
9. Much of the functionality has applicability to other geosciences beyond hydrology