This talk was given to the Atlantic Estuarine Research Society at its 2014 Spring meeting in Ocean City, Maryland, USA.
Data Infrastructures for Estuarine and Coastal Science
Anne E. Thessen
http://www.slideshare.net/[email protected]
Photo Credit: NASA/ GSFC/ NOAA/ USGS
Outline
• Why are we talking about data infrastructures?
• What are the challenges?
• What are the requirements?
• What parts are already available?
• How do we get there?
• PSA
Data Type            Important   Easy
Atmospheric Data     52.2%       21.6%
Climate Data         56.0%       23.3%
Oceanographic Data   42.5%       18.9%
Geophysical Data     55.5%       22.0%
Geological Data      56.3%       19.8%
Critical Zone Data   19.3%       8.2%
Hydrology Data       48.4%       20.1%
Results from EarthCube Stakeholder Alignment Survey
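The distance between the two columns can be made explicit with a small calculation. The sketch below is illustrative only: it assumes the two columns are the share of respondents rating each data type as important and as easy (the exact survey wording is not given here), and the function name `importance_ease_gap` is a hypothetical helper, not part of the survey or the talk.

```python
# Percentages transcribed from the survey table above:
# (share answering "important", share answering "easy").
survey = {
    "Atmospheric Data": (52.2, 21.6),
    "Climate Data": (56.0, 23.3),
    "Oceanographic Data": (42.5, 18.9),
    "Geophysical Data": (55.5, 22.0),
    "Geological Data": (56.3, 19.8),
    "Critical Zone Data": (19.3, 8.2),
    "Hydrology Data": (48.4, 20.1),
}


def importance_ease_gap(table):
    """Return {data type: important - easy} in percentage points (1 decimal)."""
    return {name: round(imp - easy, 1) for name, (imp, easy) in table.items()}


gaps = importance_ease_gap(survey)

# Print the data types from the largest gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {gap:5.1f}")
```

In every row the "important" share is more than double the "easy" share, which is the mismatch the slide is pointing at.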
Why Are We Talking About Data Infrastructure?
Working with multiple data sets from many disciplines?
Working with multiple data sets within a discipline?
88.1% say it is important; 23.5% say it is easy
70.7% say it is important; 9.8% say it is easy
Results from EarthCube Stakeholder Alignment Survey
Why Are We Talking About Data Infrastructure?
• “Data Deluge”
• Large-scale problems
• Maturation of the internet
• Increased investment (e.g., EarthCube)
• Estuarine and coastal science has an interdisciplinary nature and a strong sharing culture
Where Do We Start?
User Needs
Available Technology
Existing Infrastructure
Incentives
Sociological
• Data sharing
• Incentives
• Data cultures
• Science practices
• Massive heterogeneity
Technological
• Storage capacity
• Moving data around
• Efficient query
• Processing speed
• Knowledge representation
Stakeholder Assessment
Data producers
Photo Credit: The University of Nottingham Photo Credit: Kay Nietfeld/EPA
Data consumers
What is the current state of sharing?
• Data sharing varies widely by discipline
  – No universal rules or agreements
  – Sharing in marine science is 40%
  – Other disciplines: 10% to 100%
• Far more scientists say they are willing to share data than actually do
  – Time to prepare
  – Concerns about misuse
• Lack of access to data is a major impediment
If sharing is so important, why aren’t more people doing it?
The large proportion of researchers who claim to be willing to share data and the low numbers of researchers who actually make their data easily available suggests that data sharing would increase substantially if the proper infrastructure were in place.
Reasons for Not Sharing
• Not enough time or funding
• No place to put the data
• No standards or policies for sharing
• Others have no need for the data
• Loss of control
• No way to get credit
• Sensitive data cannot be shared
• Errors will be exposed
• Loss of competitiveness
Social Infrastructure Requirements
• Repository capability
• Place conditions on access
• Mechanisms for data citation and credit
• Data sharing policy
• Value added services
• Requirements from publishers and funders
• Respect for confidentiality
• Ease of use
We need a system that can
• Share
• Preserve
• Digitize
• Automate
• Integrate
  – Data
  – Infrastructure
Data Set Size
Data Set Heterogeneity
• Data format
• Data file format
• Data quality and completeness
• Physical samples
What Will We Do With the Data?
• Preserve Data
  – Format migration
  – Redundancy
  – Self-Repair
• Serve Data
  – Discoverable
  – Accessible
  – Usable
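As an illustration of how the "Redundancy" and "Self-Repair" items under Preserve Data can work together, here is a minimal sketch of a checksum-based fixity check over several copies of one data file. It is an assumption about one common approach, not a description of any particular repository; the function names `sha256_hex` and `repair_copies` and the sample data are hypothetical.

```python
import hashlib


def sha256_hex(data):
    """Return the SHA-256 checksum of a bytes object as a hex string."""
    return hashlib.sha256(data).hexdigest()


def repair_copies(copies, expected_digest):
    """Replace every copy that fails the checksum with an intact copy.

    copies: list of bytes objects, one per redundant copy (hypothetical setup).
    expected_digest: checksum recorded when the data was first deposited.
    """
    intact = next((c for c in copies if sha256_hex(c) == expected_digest), None)
    if intact is None:
        raise ValueError("no intact copy left to repair from")
    return [c if sha256_hex(c) == expected_digest else intact for c in copies]


# Hypothetical example: three copies, one of which has been corrupted.
original = b"station,salinity\nA1,12.4\n"
digest_at_deposit = sha256_hex(original)
copies = [original, b"station,salinity\nA1,1?.4\n", original]

repaired = repair_copies(copies, digest_at_deposit)
```

After the call, every entry in `repaired` matches the checksum recorded at deposit. Repair is only possible while at least one copy is still intact, which is why redundancy and self-repair appear together.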
Technical Infrastructure Requirements
• Preservation
• Layered service architecture
• Repository functions
• Accommodate heterogeneity
• Bridge digital and physical
Review Requirements
Sociological
• Repository capability
• Place conditions on access
• Mechanisms for data citation and credit
• Data sharing policy
• Value added services
• Requirements from publishers and funders
• Respect for confidentiality
• Ease of use
Technological
• Preservation
• Layered service architecture
• Repository functions
• Accommodate heterogeneity
• Bridge digital and physical
What is Available?
• Repositories
• Citation
• Preservation
• Quality Control and Usage Metrics
  – Crowd Sourcing
  – Web 2.0
• Integration
  – Web 3.0
• Mobilization
• Access Protocols
  – Web Services
  – Data Brokers
• Standards
How Can it all Fit Together?
Quality and Metrics
Access
Citation
Preservation
Mobilization
Integration
Repositories
Standards
Who Should Be Doing All This Work?
• Librarians
• Data Scientists
• Informaticians
• Ontologists
• Computer Scientists
• Software Developers
• Standards Groups
Image by Michael Krigsman
PSA
Why Share Data?
• Increased recognition
• Increased economic opportunities
• Improved data set
• Improved science
• Time and money saved
Photo Credit: Emergency Cleaning Solutions
Photo Credit: The Collared Sheep
Acknowledgements
• Benjamin Fertig
• David Patterson
• Mike Kemp
• John Milliman
• Melissa Cragin
• Sayeed Choudhury
• Tim DiLauro
• Carol Palmer
• Nathan Wilson
• Alan Renear
• Ruth Duerr
• Cyndy Chandler
• Peter Fox
• Krishna Sinha
• Janet Fredericks
• Carl Lagoze
Questions?