Demystifying Data & Scholarly Communication Presented by Aaron Collie and Hailey Mooney In collaboration with Lucas Mak Tuesday, April 26, 2011


Demystifying Data &

Scholarly Communication

Presented by Aaron Collie and Hailey Mooney

In collaboration with Lucas Mak

Tuesday, April 26, 2011

Science Paradigms

• Thousand years ago: science was empirical, describing natural phenomena
• Last few hundred years: theoretical branch using models, generalizations
• Last few decades: a computational branch simulating complex phenomena
• Today: data exploration (eScience): unify theory, experiment, and simulation
– Data captured by instruments or generated by simulator
– Processed by software
– Information/Knowledge stored in computer
– Scientist analyzes database / files using data management and statistics

Slide credit: Gray, J. & Szalay, A. (11 January 2007). eScience Talk at NRC-CSTB meeting. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt

Scholarly communication lifecycle model from Western Libraries: http://www.lib.uwo.ca/scholarship/scholarlycommunication.html

What is/are data?

Definition

Examples

Research Data

“A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen.”

Consultative Committee for Space Data Systems. 2002. Reference Model for an Open Archival Information System (OAIS). Washington, DC: National Aeronautics and Space Administration, p. 1-9. Available at http://public.ccsds.org/publications/archive/650x0b1.pdf

“[I]nformation used in scientific, engineering, and medical research as inputs to generate research conclusions. This usage encompasses a wide variety of information. It includes textual information, numeric information, instrumental readouts, equations, statistics, images (whether fixed or moving), diagrams, and audio recordings. It includes raw data, processed data, published data, and archived data. It includes the data generated by experiments, by models and simulations, and by observations of natural phenomena at specific times and locations. It includes data gathered specifically for research as well as information gathered for other purposes that is then used in research.”

National Academy of Sciences (U.S.), National Academy of Engineering, and Institute of Medicine (U.S.). (2009). Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. Washington, D.C.: National Academies Press.

NSF Data Collection Categories

• Research data collections
– Products of one or more focused research projects
– Novel data types
– Small user community
– May not conform to standards (file formats, metadata)
– Often no intention for preservation
– Small budgets

• Resource or community data collections
– Serve a single science community
– Often establish community level standards
– Intermediate budgets
– Unclear if sustained support for preservation will be maintained

• Reference data collections
– Serve large segments of the scientific community
– Broad scope and diverse set of user communities
– Conforms to or creates robust universal standards
– Large budgets
– Long-term support for preservation

National Science Board (U.S.) & National Science Foundation (U.S.). (2005). Long-lived digital data collections enabling research and education in the 21st century. Washington, D.C.: National Science Foundation. http://www.nsf.gov/pubs/2005/nsb0540/

Expanding Roles

Scholarly Communication

Researchers; Funding agencies; Libraries; Publishers

You ARE managing your data... RIGHT?

• Good Science
– “Sponsors of university research, federal and state oversight agencies, or journals and other colleagues in the field may need or be legally entitled to review primary research data well after publication or dissemination of results.” http://rio.msu.edu/research_data.htm
• “Research Data Management”
– Same tune, new requirements.

Mandating Data Management

NSF DMP Requirements

• The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
• The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
• Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
• Policies and provisions for re-use, re-distribution, and the production of derivatives; and
• Plans for archiving data, samples, and other research products, and for preservation of access to them.
• More Info: http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#dmp
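The five elements above can double as a checklist while drafting a plan. The following is a minimal sketch of that idea in Python. The key names, the paraphrased descriptions, and the `missing_sections` helper are hypothetical illustrations, not part of the NSF guidance.

```python
from typing import Dict, List

# Hypothetical checklist keys paraphrasing the five NSF DMP elements listed above.
NSF_DMP_ELEMENTS: Dict[str, str] = {
    "products": "Types of data, samples, collections, software, and other materials produced",
    "standards": "Standards for data and metadata format and content",
    "access_sharing": "Policies for access and sharing, including privacy and IP protections",
    "reuse": "Policies and provisions for re-use, re-distribution, and derivatives",
    "archiving": "Plans for archiving and for preservation of access",
}


def missing_sections(draft: Dict[str, str]) -> List[str]:
    """Return the checklist keys whose text is absent or blank in a draft plan."""
    return [key for key in NSF_DMP_ELEMENTS if not draft.get(key, "").strip()]


if __name__ == "__main__":
    # Made-up draft used only to exercise the helper.
    draft_plan = {
        "products": "Lake survey measurements and field notes",
        "standards": "",
    }
    for key in missing_sections(draft_plan):
        print(f"Still to write: {NSF_DMP_ELEMENTS[key]}")
```

A librarian consulting on a plan could run a draft through such a check before reviewing the prose itself.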

Why should researchers care?

• Secure funding specifically for research data management
– Improve and standardize data management practice and policy in your lab
• Improve the impact and visibility of your research
– Facilitate collaboration, increase research efficiency, and make new discoveries
• Assure a greater return on investment by adopting a value chain model (one analogy is the value added by journal publishers)

What makes data management possible?

• Policy
– Carrots: Data Citations, Data Papers
– Sticks: Grant Funding
– Standards… a whole lot of them
• Infrastructure
– Short-term storage
– Long-term storage
• Humans
– Use Cases for expanded roles for Librarians (aka Services!)
  • “Reference Specialists” → Reference, Evaluating, and Promoting
  • “Subject Specialists” → Subject Expertise, Liaising, and Awareness
  • “Technology Specialists” → Training, Advice, and Support

Policies and Standards for MSU

• Research Data Guidelines (VPRGS)
– http://www.vprgs.msu.edu/dataguidelines
• Data management for research: Preparing for new requirements (VPRGS)
– http://www.vprgs.msu.edu/node/1439
• Research Data: Management, Control, and Access (RIO)
– http://rio.msu.edu/research_data.htm
• http://www.lib.msu.edu/about/diginfo/ldmp.jsp

Infrastructure at MSU

Adapted from: http://www.vprgs.msu.edu/files_vprgs/Data%20Management%20for%20Research.pdf

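Limited-term (analysis)
– Individual (e.g. 1 FTE researcher): Internal Hard Drive; External Hard Drive; Flash Drive; DVD; Dropbox
– Group (e.g. Department, College): Network Share; Net Files; External Hard Drive; HPCC Scratch
– Institution (e.g. Collaboration, Shared Resources): Network Share; Net Files; HPCC Scratch
– Institution (e.g. Collaboration, Shared Resources): Network Share

Short-term (project duration)
– Individual (e.g. 1 FTE researcher): Internal Hard Drive; External Hard Drive; Network Share
– Group (e.g. Department, College): Network Share
– Institution (e.g. Collaboration, Shared Resources): Network Share
– Institution (e.g. Collaboration, Shared Resources): Cloud Storage

Long-term (archival storage)
– Individual (e.g. 1 FTE researcher): Disciplinary Repository
– Group (e.g. Department, College): Disciplinary Repository
– Institution (e.g. Collaboration, Shared Resources): Disciplinary Repository
– Institution (e.g. Collaboration, Shared Resources): Disciplinary Repository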

Use Cases from MSU Libraries

Ecology / E.S. Professor

• Present: P.I., Digital Curation Librarian, Metadata Librarian, A.D. for Digital Information
• Situation: Researcher requested help writing Data Management Plan
• Collaborative project
• Asked: Integrating disparate landscape limnology data
• Advised: Retention, access/sharing, embargoes, metadata standards, disciplinary repositories, archival format

History Professor

• Present: P.I., Digital Curation Librarian, Metadata Librarian, Bibliographer, Web Services, A.D. for Digital Information
• Situation: Emeritus faculty exploring options for capstone project
• Asked: Enhancing web presence, upgrading flat database, converting extensive code book to metadata schema
• Advised: databases, metadata schema, metadata mapping, complex queries and relational data, file formats
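As a rough illustration of what "converting a code book to a metadata schema" and "metadata mapping" can involve, the sketch below maps code book entries onto a few Dublin Core element names (`identifier`, `title`, `description`). The code book fields, the sample entries, and the crosswalk are all made up for this example; they do not describe the History Professor's actual project.

```python
from typing import Dict, List

# Made-up code book entries: variable name, human-readable label, and value codes.
CODEBOOK: List[dict] = [
    {"variable": "OCC", "label": "Occupation", "codes": {"1": "Farmer", "2": "Merchant"}},
    {"variable": "REG", "label": "Region", "codes": {"N": "North", "S": "South"}},
]

# Assumed crosswalk from code book fields to Dublin Core element names.
CROSSWALK: Dict[str, str] = {"variable": "identifier", "label": "title"}


def to_metadata(entry: dict) -> Dict[str, str]:
    """Map one code book entry to a flat metadata record using the crosswalk."""
    record = {element: entry[field] for field, element in CROSSWALK.items()}
    # Flatten the coded values into one human-readable description string.
    record["description"] = "; ".join(
        f"{code} = {meaning}" for code, meaning in sorted(entry["codes"].items())
    )
    return record


if __name__ == "__main__":
    for entry in CODEBOOK:
        print(to_metadata(entry))
```

Keeping the crosswalk as data rather than code means the mapping can be reviewed and revised by a metadata librarian without touching the conversion logic.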

Core Service

Image From: http://www.admin.ox.ac.uk/rdm/

OpenContext.org

• Data Publisher (for archeology data)!
• Forming an editorial board
– Vetted data
– Editorial process cleans data
• Sends clean and bundled data off to the CDL for long-term preservation
• Distributes the processes which support data management across the publication lifecycle

The value chain

• Registration, which allows claims of precedence for a scholarly finding.
• Certification, which establishes the validity of a registered scholarly claim.
• Awareness, which allows actors in the scholarly system to remain aware of new claims and findings.
• Archiving, which preserves the scholarly record over time.
• Rewarding, which rewards actors for their performance in the communication system based on metrics derived from that system.

Roosendaal and Geurts 1997

What roles do you see for libraries and librarians to support data management, sharing and publication?

What have you heard from faculty about data management, sharing and publication?

What are the norms for data management, sharing and publication in your disciplines?

Data Sharing Cultures

Research Information Network. (2008). To Share or not to share: Publication and quality assurance of research data outputs. Research Information Network, June 2008. As cited in Griffiths, A. (2009). The publication of research data: Researcher attitudes and behaviour. The International Journal of Digital Curation, 4(1). http://www.ijdc.net/index.php/ijdc/article/viewFile/101/76

The American Journal of Human Genetics

Guide for Authors

Distribution of Materials and Data

An implicit term and condition of publishing in AJHG is that authors be willing to distribute any materials and protocols used in the published experiments to qualified researchers for their own use. Materials include but are not limited to cells, DNA, antibodies, reagents, organisms, and mouse strains, or if necessary the relevant ES cells. These materials must be made available with minimal restrictions and in a timely manner, but it is acceptable to request reasonable payment to cover the cost of provision and transport of materials. If there are restrictions to the availability of any materials, data, or information, these must be disclosed in the cover letter and the Material and Methods section of the manuscript at the time of submission.

Nucleic acid and protein sequences, single-nucleotide polymorphisms (SNPs), copy number variants (CNVs), microarray data, and macromolecular structures determined by X-ray crystallography (along with structure factors) must be deposited in the appropriate public database and must be accessible without restriction from the date of publication. The URL of the databases used must be included in the Web Resources section of the manuscript. All entry names and/or accession numbers must be included in the Material and Methods section. Microarray data should be MIAME compliant (for guidelines see http://www.mged.org/Workgroups/MIAME/miame.html).

Although AJHG does not require authors to deposit genotype data to a public database, we do encourage this practice. We do ask that authors include genotype data in their supplemental materials or that a website is provided at which readers would be able to gain access to such data. If such data presentations are not possible, we ask that AJHG authors accommodate legitimate requests for population-genetics data provided that there are no IRB restrictions.

Newly described SNPs should be submitted to an appropriate database such as dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) prior to submission of revised manuscripts. The identification numbers should be used to describe the SNPs in the manuscript.

All copy number variants (CNVs) identified in control samples should be submitted to one of two public data archives, the Database of Genomic Variants Archive (DGVa; http://www.ebi.ac.uk/dgva/page.php?page=data_submission) or the Database of Genomic Structural Variation (dbVAR; http://www.ncbi.nlm.nih.gov/dbvar/content/submission/), prior to submission of revised manuscripts. The associated identification numbers should be used to describe the CNVs in the manuscript.

Please provide a figure or table that summarizes the full results of your genome-wide scan.

In addition to the information that must be deposited in public databases as detailed above, authors are encouraged to contribute additional information to the appropriate databases. Authors are also encouraged to deposit materials used in their studies to the appropriate repositories for distribution to researchers.

http://download.cell.com/images/edimages/AJHG/AJHG_Information_for_Authors.pdf

Demography

Author Instructions

Authors of accepted manuscripts will be asked to preserve the data used in their analysis and to make the data available to others at reasonable cost from a date six months after the publication date for the paper and for a period of three years thereafter. Authors wishing to request an exemption from this requirement (e.g., because the analysis is based on a proprietary data set) should notify the editors at the time of manuscript submission or after receiving this notice; otherwise, authors will be assumed to accept the requirement.
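The timing in this policy can be worked out with simple date arithmetic. The sketch below is one reading of it, assuming that "six months after the publication date" means six calendar months and that "three years thereafter" means 36 months counted from that start date. The function names are hypothetical, and the journal's own interpretation may differ.

```python
import calendar
from datetime import date
from typing import Tuple


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping the day to the month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def sharing_window(published: date) -> Tuple[date, date]:
    """Return (start, end) of the period in which data should be available."""
    start = add_months(published, 6)   # assumed: six calendar months after publication
    end = add_months(start, 36)        # assumed: three years counted from the start date
    return start, end


if __name__ == "__main__":
    start, end = sharing_window(date(2011, 4, 26))
    print(f"Data available from {start} until {end}")
```

Under these assumptions, a paper published on the date of this talk would need its data available from 26 October 2011 until 26 October 2014.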

http://www.populationassociation.org/publications/demography/

Data Sharing

What are reasons for publishing and sharing research data?

What are reasons for NOT publishing and sharing research data?

Image courtesy of http://biomat2010-8.wikispaces.com, licensed under a Creative Commons Attribution Share-Alike 3.0 License.

What needs to change in order for data management, sharing and publication to become a common practice?