
Data Management Lab: Session 3 Slides



Data Management Lab: Session 3 slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)

What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.


Page 1: Data Management Lab: Session 3 Slides

Research Data Management

Spring 2014: Session 3

Practical strategies for better results

University Library Center for Digital Scholarship

Page 2: Data Management Lab: Session 3 Slides

QUALITY ASSURANCE & CONTROL MODULE 3

Page 3: Data Management Lab: Session 3 Slides

LEARNING OUTCOMES
• Develop procedures for quality assurance and quality control activities.

Page 4: Data Management Lab: Session 3 Slides

Data Integrity

1. Data have integrity if they have been maintained without unauthorized alteration or destruction.

2. Data integrity means the data have a complete or whole structure. (http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Data_integrity.html)
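The first definition suggests a simple, general check that is not part of the workshop materials: record a checksum for each data file when it is deposited and verify it later, so unauthorized alteration can be detected. A minimal sketch in Python, assuming a hypothetical raw_data directory of CSV files:

# Illustrative only: detect alteration by recording and re-verifying checksums.
# The raw_data directory and file layout are hypothetical.
import hashlib
from pathlib import Path

def sha256_checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record checksums at deposit time...
manifest = {p.name: sha256_checksum(p) for p in Path("raw_data").glob("*.csv")}

# ...and verify them later; any mismatch signals the file was altered.
for p in Path("raw_data").glob("*.csv"):
    status = "OK" if sha256_checksum(p) == manifest.get(p.name) else "ALTERED"
    print(p.name, status)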

Page 5: Data Management Lab: Session 3 Slides

Data Quality

• Fitness for use (depends on context of your questions)
• Data quality is the most important aspect of data management
• Ensured by
  – Sufficient resources and expertise
  – Paying close attention to the design of data collection instruments
  – Creating appropriate entry, validation, and reporting processes
  – Ongoing QC processes
  – Understanding the data collected

Chapman, 2005; Dept of Biostatistics – Data Management, IUSM

Page 6: Data Management Lab: Session 3 Slides

Data Quality Standards

• Check data for its logical consistency.
• Check data for reasonableness.
• Ensure adherence to sound estimation methodologies.
• Ensure adherence to monetary submission standards for stolen and recovered property.
• Ensure that other statistical edit functions are processed within established parameters.

FBI: http://www.fbi.gov/about-us/cjis/ucr/data_quality_guidelines
Dept of Biostatistics – Data Management, IUSM
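Consistency and reasonableness checks like these can be scripted. A minimal sketch in Python (pandas); the file name, column names, and plausible ranges are hypothetical examples, not part of the workshop materials:

# Illustrative consistency and reasonableness checks on a hypothetical file.
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical file name

checks = {
    "age in plausible range": df["age"].between(18, 110),
    "height (cm) reasonable": df["height_cm"].between(120, 230),
    "bmi consistent with height/weight":
        (df["weight_kg"] / (df["height_cm"] / 100) ** 2 - df["bmi"]).abs() < 0.5,
}

for name, passed in checks.items():
    print(f"{name}: {(~passed).sum()} record(s) flagged")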

Page 7: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Strategies for preventing errors from entering a dataset
• Activities to ensure quality of data before collection
• Activities that involve monitoring and maintaining the quality of data during the study

Page 8: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Define & enforce standards
  ◦ Formats
  ◦ Codes
  ◦ Measurement units
  ◦ Metadata
• Assign responsibility for data quality
  ◦ Be sure assigned person is educated in QA/QC
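One way to enforce defined codes at entry time is to validate each value against a codebook. A minimal sketch, assuming a hypothetical codebook with two fields (the field names and codes are invented for illustration):

# Illustrative enforcement of a coding standard against a hypothetical codebook.
CODEBOOK = {
    "sex": {"1": "male", "2": "female", "9": "not reported"},
    "satisfaction": {str(v): v for v in range(1, 6)},  # hypothetical 1-5 scale
}

def validate_value(field, raw_value):
    """Return True if the raw value is an allowed code for this field."""
    allowed = CODEBOOK.get(field)
    if allowed is None:
        raise KeyError(f"No standard defined for field '{field}'")
    return raw_value in allowed

print(validate_value("sex", "2"))           # True
print(validate_value("satisfaction", "7"))  # False: outside the defined codes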

Page 9: Data Management Lab: Session 3 Slides

Quality Assurance v. Control

• QA: set of processes, procedures, and activities that are initiated prior to data collection to ensure the expected level of quality will be reached and data integrity will be maintained.

• QC: a system for verifying and maintaining a desired level of quality in a product or service.

http://c2.com/cgi/wiki?QualityAssuranceIsNotQualityControl

Page 10: Data Management Lab: Session 3 Slides

Quality Assurance in Practice

• CRF (data collection instrument) review & validation
• System/process testing & validation
• Training, education, communication of a team
• Standard Operating Procedures, Standard Operating Guidelines
• Site audits

Dept of Biostatistics – Data Management, IUSM

Page 11: Data Management Lab: Session 3 Slides

Quality Control in Practice

• Set of processes, procedures, and activities associated with monitoring, detection, and action during and after data collection.

• Examples:
  – Errors in individual data fields
  – Systematic errors
  – Violation of protocol
  – Staff performance issues
  – Fraud or scientific misconduct

Dept of Biostatistics – Data Management, IUSM

Page 12: Data Management Lab: Session 3 Slides

Activity

Define data quality standards for the following variables:
• Age
• Height
• BMI
• Life satisfaction scale
• Number of close friends

Don’t forget to upload this to Box. Suggested file name “Data Quality Standards”

Page 13: Data Management Lab: Session 3 Slides

References
1. Department of Biostatistics – Data Management Team, Indiana University School of Medicine (2013). Data Management including REDCap. (provided via email)
2. Chapman, A. D. (2005). Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8. http://www.gbif.org/resources/2829
3. DataONE Education Module: Data Quality Control and Assurance. DataONE. From http://www.dataone.org/sites/all/documents/L05_DataQualityControlAssurance.pptx

Page 14: Data Management Lab: Session 3 Slides

DATA COLLECTION MODULE 3

Page 15: Data Management Lab: Session 3 Slides

LEARNING OUTCOMES
• Describe key considerations for selecting data collection tools.

Page 16: Data Management Lab: Session 3 Slides

Choose your tools wisely

Page 17: Data Management Lab: Session 3 Slides

Choose your tools wisely

Allie Brosh, 2010

Page 18: Data Management Lab: Session 3 Slides

Activity

Draft data collection instrument
See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX”

Don’t forget to upload this to Box. Suggested file name “Data Collection Tool”

Page 19: Data Management Lab: Session 3 Slides

References
1. Brosh, A. (2010). Boyfriend doesn’t have ebola. Probably. http://hyperboleandahalf.blogspot.com/2010/02/boyfriend-doesnt-have-ebola-probably.html

Page 20: Data Management Lab: Session 3 Slides

DATA CODING & ENTRY MODULE 3

Page 21: Data Management Lab: Session 3 Slides

LEARNING OUTCOMES
• Use best practices for coding.
• Use best practices for data entry.

Page 22: Data Management Lab: Session 3 Slides
Page 23: Data Management Lab: Session 3 Slides

Goals of Data Entry

• Publishable results!
  – Valid data that are organized to support smooth analysis
• Easy to import into analytical program
• Minimize manipulations and errors
• Has a logical [data] structure

Page 24: Data Management Lab: Session 3 Slides
Page 25: Data Management Lab: Session 3 Slides

Activity

Draft data coding scheme for data entry
• Review data entry best practices document in Box

Don’t forget to upload this to Box. Suggested file name “Coding Scheme”

Page 26: Data Management Lab: Session 3 Slides

References
1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx

2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist presented at the AGU Workshop. From http://wiki.esipfed.org/index.php/2011AGUworkshop

3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt CRC Research Skills Workshop Series. From http://www.mc.vanderbilt.edu/gcrc/workshop_files/2012-09-07.pdf

Page 27: Data Management Lab: Session 3 Slides

DATA SCREENING & CLEANING MODULE 3

Page 28: Data Management Lab: Session 3 Slides

LEARNING OUTCOMES
• Develop a screening and cleaning protocol and/or checklist.

Page 29: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

Data Contamination
• Process or phenomenon, other than the one of interest, that affects the variable value
• Erroneous values

CC image by Michael Coghlan on Flickr

Page 30: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Errors of Commission
  ◦ Incorrect or inaccurate data entered
  ◦ Examples: malfunctioning instrument, mistyped data
• Errors of Omission
  ◦ Data or metadata not recorded
  ◦ Examples: inadequate documentation, human error, anomalies in the field

CC image by Nick J Webb on Flickr

Page 31: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Double entry
  ◦ Data keyed in by two independent people
  ◦ Check for agreement with computer verification

• Record a reading of the data and transcribe from the recording

• Use text-to-speech program to read data back

CC image by weskriesel on Flickr
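Computer verification of double entry can be as simple as comparing the two keyed files cell by cell. A minimal sketch in Python (pandas), assuming two hypothetical files that contain the same record IDs and columns:

# Illustrative double-entry verification; file and column names are hypothetical.
import pandas as pd

entry1 = pd.read_csv("entry_operator1.csv").set_index("record_id").sort_index()
entry2 = pd.read_csv("entry_operator2.csv").set_index("record_id").sort_index()

# compare() returns only the cells where the two independent entries disagree
# (it assumes both files cover the same record_id values and columns)
discrepancies = entry1.compare(entry2)
print(f"{len(discrepancies)} record(s) need reconciliation")
print(discrepancies)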

Page 32: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Design data storage well
  ◦ Minimize the number of times items must be entered repeatedly
  ◦ Use consistent terminology
  ◦ Atomize data: one cell per piece of information
• Document changes to data
  ◦ Avoids duplicate error checking
  ◦ Allows undo if necessary
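A small illustration of atomizing data: splitting a compound cell into one column per piece of information. The values and separator below are invented for the sketch, not workshop data:

# Illustrative atomization of a compound cell into separate columns.
import pandas as pd

# Non-atomized: site, plot, and date crammed into a single cell
raw = pd.DataFrame({"sample": ["SiteA-Plot3-2014-02-01", "SiteB-Plot1-2014-02-02"],
                    "value": [12.4, 9.8]})

# Atomized: each piece of information gets its own column
parts = raw["sample"].str.split("-", n=2, expand=True)
tidy = raw.assign(site=parts[0], plot=parts[1], date=pd.to_datetime(parts[2]))
tidy = tidy.drop(columns="sample")
print(tidy)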

Page 33: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Make sure data line up in proper columns
• No missing, impossible, or anomalous values
• Perform statistical summaries

CC image by chesapeakeclimate on Flickr
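These screening steps are easy to script. A minimal sketch, assuming a hypothetical CSV file:

# Illustrative screening pass: column alignment, missing values, summaries.
import pandas as pd

df = pd.read_csv("field_measurements.csv")  # hypothetical file name

print(df.dtypes)        # a text dtype in a numeric column often means misaligned data
print(df.isna().sum())  # count missing values per column
print(df.describe())    # summary statistics reveal impossible or anomalous values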

Page 34: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Look for outliers
  ◦ Outliers are extreme values for a variable given the statistical model being used
  ◦ The goal is not to eliminate outliers but to identify potential data contamination

[Scatter plot illustrating outliers]

Page 35: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Methods to look for outliers
  ◦ Graphical
    • Normal probability plots
    • Regression
    • Scatter plots
  ◦ Maps
  ◦ Subtract values from mean
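Two of these methods sketched in Python: subtracting values from the mean (standardized deviations) and a normal probability plot. The file name, column name, and 3-standard-deviation cutoff are hypothetical examples:

# Illustrative outlier screening on a hypothetical column.
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

values = pd.read_csv("field_measurements.csv")["height_cm"]

# Flag values far from the mean (here, more than 3 standard deviations)
deviation = (values - values.mean()).abs() / values.std()
print(values[deviation > 3])

# Normal probability plot: points that stray from the line are candidate outliers
stats.probplot(values.dropna(), plot=plt)
plt.show()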

Page 36: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

• Data contamination is data whose values have been altered by a process or factor not examined by the study
• Data error types: commission or omission
• Quality assurance and quality control are strategies for
  ◦ preventing errors from entering a dataset
  ◦ ensuring data quality for entered data
  ◦ monitoring and maintaining data quality throughout the project

• Identify and enforce quality assurance and quality control measures throughout the Data Life Cycle

Page 37: Data Management Lab: Session 3 Slides

Discussion

Using the Data Review Checklist, evaluate the HBSC codebook “DataMgmtLab-Spr14_DataReviewChecklist_EX”

What screening & cleaning procedures were used?

Page 38: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

1. D. Edwards, in Ecological Data: Design, Management and Processing, WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-91. Available at www.ecoinformatics.org/pubs

2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for preparing ecological data sets to share and archive. Bull. Ecol. Soc. Amer. 82, 138-141 (2001).

3. A. D. Chapman, “Principles of Data Quality. Report for the Global Biodiversity Information Facility” (Global Biodiversity Information Facility, Copenhagen, 2004). Available at http://www.gbif.org/communications/resources/print-and-online-resources/download-publications/bookelets/

Page 39: Data Management Lab: Session 3 Slides

References
1. Cook (2013). NACP Best Data Management Practices Workshop. From http://daac.ornl.gov/NACP_AIM_2013/04_data_management_cook_2013.02.03.ppt

2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data provenance in e-Science. SIGMOD Record, 34(3), 31-36. From http://www.sigmod.org/publications/sigmod-record/0509/p31-special-sw-section-5.pdf

3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and Management. From http://www.slideshare.net/INSITEUA/provenance-management-to-enable-data-sharing

Page 40: Data Management Lab: Session 3 Slides

AUTOMATION MODULE 3

Page 41: Data Management Lab: Session 3 Slides

LEARNING OUTCOMES
• Explain why automation provides better provenance than manual processes.
• Identify effective tools for automating data processing and analysis.

Page 42: Data Management Lab: Session 3 Slides

Choose your tools wisely
• Documents
• Excel
• Access
• SPSS, Minitab
• Mathematica, MATLAB, Scilab
• SAS, Stata
• R
• MapReduce
• NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc.

http://www.dataone.org/all-software-tools

Page 43: Data Management Lab: Session 3 Slides

Data Formats; Version 1.0

Overview

• Spreadsheets are amazingly flexible, and are commonly used for data collection, analysis and management

• Spreadsheets are seldom self-documenting, and seldom well-documented

• Subtle (and not so subtle) errors are easily introduced during entry, manipulation and analysis

• Spreadsheet conventions – often ad hoc and evolutionary – may change or be applied inconsistently

• Spreadsheet file formats are proprietary and thus generally unacceptable for long-term archival purposes

Page 44: Data Management Lab: Session 3 Slides

Data Entry and Manipulation

Spreadsheets
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column can contain numbers or text
• Lack record integrity (a column can be sorted independently of all others)
• Easy to use, but harder to maintain as the complexity and size of the data grow

Databases
• Easy to query to select portions of data
• Data fields are typed (for example, only integers are allowed in integer fields)
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet

Page 45: Data Management Lab: Session 3 Slides

NACP Best Data Management Practices, February 3, 2013

5. Preserve information (cont.)
• Use a scripted language to process data
  – R Statistical package (free, powerful)
  – SAS
  – MATLAB
• Processing scripts are records of processing
  – Scripts can be revised, rerun
• Graphical User Interface-based analyses may seem easy, but don’t leave a record

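For illustration only (not from the workshop materials), a minimal Python sketch of scripted processing whose steps double as a record of what was done; the file names and cleaning rules are hypothetical:

# Illustrative processing script: the code itself documents each cleaning step.
import pandas as pd

def process(raw_path, clean_path):
    df = pd.read_csv(raw_path)
    df = df.dropna(subset=["height_cm"])        # step 1: drop records missing height
    df = df[df["height_cm"].between(120, 230)]  # step 2: remove impossible heights
    df.to_csv(clean_path, index=False)
    return df

# Rerunning the script reproduces the cleaned file exactly; editing cells by
# hand in a GUI leaves no comparable record.
process("field_measurements.csv", "field_measurements_clean.csv")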

Page 46: Data Management Lab: Session 3 Slides

Provenance, Audit Trails, etc.

• “…information that helps determine the derivation history of a data product, starting from its original sources.” (Simmhan et al., 2005)
  – Ancestral data products from which the data evolved
  – Process of transformation of these ancestral data products

• Uses: data quality, audit trail, replication recipe, attribution, informational
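One lightweight way to capture this derivation history is to write a small provenance record next to each output file, listing the ancestral inputs and the script that transformed them. A minimal sketch, assuming hypothetical file names:

# Illustrative provenance record written alongside an output file.
import hashlib, json, sys
from datetime import datetime, timezone

def file_sha256(path):
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()

def write_provenance(inputs, output):
    record = {
        "output": output,
        "inputs": {p: file_sha256(p) for p in inputs},  # ancestral data products
        "script": sys.argv[0],                          # the transformation applied
        "created": datetime.now(timezone.utc).isoformat(),
    }
    with open(output + ".provenance.json", "w") as handle:
        json.dump(record, handle, indent=2)

write_provenance(["field_measurements.csv"], "field_measurements_clean.csv")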

Page 47: Data Management Lab: Session 3 Slides

More Considerations

• Field names & descriptions
• Structured entry
• Validation
• Record integrity
• Missing data
• Data/field types
• File types: common, open documented standard
• Output required for analysis and visualization

Page 48: Data Management Lab: Session 3 Slides

Demonstration & Discussion

Run [analysis] in Excel and Stata. Compare output.
• What features does Stata have that Excel does not?
• How do these features support provenance and data integrity?

Page 49: Data Management Lab: Session 3 Slides

References
1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx