Data Management Lab: Session 3 Data Coding Best Practices

IUPUI University Library Center for Digital Scholarship Data Management Lab: Pilot [January 2014]

Data Coding Best Practices

Data Coding Guidelines adapted from the ICPSR Guide to Social Science Data Preparation and Archiving

1. Use common coding conventions

   a. Ensure that all statistical software packages can handle the data

   b. Promote greater measurement comparability

2. Standard coding schemes - check for established schemes such as the Federal Information Processing Standards (FIPS) codes.

3. Identification variables - provide fields at the beginning of each record to accommodate all identification variables (e.g., unique study number and respondent number).

4. Code categories - should be mutually exclusive, exhaustive, and precisely defined.

5. Preserving original information - code as much detail as possible; recording original data, such as age and income, is more useful than collapsing or bracketing the information.

6. Closed-ended questions - responses to survey questions that are pre-coded in the questionnaire should retain the coding scheme to avoid errors and confusion.

7. Open-ended questions - either use a predetermined coding scheme or review the initial survey responses to construct a coding scheme based on major categories that emerge; any coding scheme and its derivation should be reported in study documentation.

8. User-coded responses - must be reviewed for disclosure risk and, if necessary, treated to protect confidentiality prior to dissemination.

9. Check-coding - it's a good idea to verify or check-code some cases during the coding process; i.e., repeat the process with an independent coder.

10. Series of responses - if a series of responses requires more than one field, it is helpful to organize the responses into meaningful major classifications; this permits analysis of the data using either broad groupings or more detailed categories.

11. Missing data

   a. Codes should match the content of the field (i.e., numeric or alphanumeric).

   b. Codes should be standardized such that the same code is used for each type of missing data for all variables.

   c. Blanks should not be used as missing data codes unless there is no need to differentiate types of missing data such as "don't know" or "refused".

   d. If an entire sequence of variables is blank due to inapplicability or another reason, an indicator field should be used.

   e. Skip patterns & "not applicable" - codes for "not applicable" or "inapplicable" should be distinct from other missing data codes.
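As an illustration of the missing data guidelines in item 11, the sketch below shows one way standardized codes might be applied in Python. The code values (-7, -8, -9), the variable names, and the sample records are all hypothetical and chosen only for this example; they are not taken from the ICPSR guide. A real study would define its own codes and document them in the codebook.

```python
# Hypothetical missing-data codes, used identically for every variable (11b).
# Numeric codes match the numeric content of the fields (11a), and
# "not applicable" has its own code, distinct from other missing types (11e).
MISSING_CODES = {
    -7: "not applicable",
    -8: "don't know",
    -9: "refused",
}


def decode(value):
    """Return (valid_value, missing_reason); exactly one of the two is None."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None


def summarize(records, variable):
    """Count valid responses and each type of missing data for one variable."""
    counts = {"valid": 0}
    for record in records:
        _, reason = decode(record[variable])
        key = "valid" if reason is None else reason
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical records; the identification variable comes first (item 3),
# and age and income are stored as original values, not brackets (item 5).
records = [
    {"respondent_id": 1, "age": 34, "income": 52000},
    {"respondent_id": 2, "age": -9, "income": -8},
    {"respondent_id": 3, "age": 41, "income": -7},
]

age_summary = summarize(records, "age")
income_summary = summarize(records, "income")
```

Because no field is left blank, each type of missing data can be counted separately instead of being lumped together, which is the reason guideline 11c advises against blanks.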

References

1. ICPSR. (2012). Guide to Social Science Data Preparation and Archiving, University of Michigan, Ann Arbor, MI. From http://www.icpsr.umich.edu/files/deposit/dataprep.pdf.

Heather Coates, 2013
