Data Integrity
Michelle A. Detry
Department of Biostatistics and Medical Informatics
University of Wisconsin - Madison
ICTR Short Course – June 9, 2010
Data Integrity
• What is Data Integrity?
• Learning objective is to “Maintain the integrity of data when collecting, recording, analyzing and reporting it”
• Success of research depends on data collected
• Need to collect quality data to assure integrity of the results
• Depends on careful attention to detail, from planning until publication
• Analyses at end of study cannot “fix” data quality
Data Quality
• Two fundamental measures of data quality
– Completeness
– Accuracy
• Poor data quality can result in bias and increased variability
• Bias – systematic error that would result in erroneous conclusions given a sufficiently large sample
• Increased variability – decreases power for detecting differences between groups or increases uncertainty in differences identified
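As a rough illustration of the variability point above, the following Python sketch uses the usual normal approximation for a two-group comparison of means. The function name `approx_power` and the numbers (a true difference of 5 units, 64 subjects per group) are assumptions chosen for illustration and do not come from this course.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a mean difference.

    Uses the normal approximation and ignores the negligible chance of
    rejecting in the wrong direction.
    """
    std_normal = NormalDist()
    standard_error = sigma * sqrt(2.0 / n_per_group)
    critical_value = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return std_normal.cdf(delta / standard_error - critical_value)


# Same true difference and sample size; only the outcome variability changes.
power_clean = approx_power(delta=5.0, sigma=10.0, n_per_group=64)
power_noisy = approx_power(delta=5.0, sigma=15.0, n_per_group=64)
print(f"sigma = 10: power is about {power_clean:.2f}")
print(f"sigma = 15: power is about {power_noisy:.2f}")
```

Under these assumed numbers, noisier measurements alone are enough to turn a reasonably powered comparison into one that misses the difference about half the time.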
Data Quality
• Close attention must be paid to data collection process and design of data collection forms
• Primary goal should be to minimize potential for bias through completeness of data
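Completeness can be monitored while the study is running rather than discovered at analysis time. The sketch below is one possible way to do that in Python; the field names (`study_id`, `visit_date`, `sbp`) and the records are hypothetical.

```python
def completeness_by_field(records, fields):
    """Return the fraction of records with a non-missing value for each field."""
    total = len(records)
    result = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = present / total if total else 0.0
    return result


# Hypothetical records: field names and values are made up for illustration.
records = [
    {"study_id": "S001", "visit_date": "09-JUN-2010", "sbp": 128},
    {"study_id": "S002", "visit_date": "", "sbp": 141},
    {"study_id": "S003", "visit_date": "11-JUN-2010", "sbp": None},
    {"study_id": "S004", "visit_date": "12-JUN-2010", "sbp": 119},
]

report = completeness_by_field(records, ["study_id", "visit_date", "sbp"])
for field, fraction in report.items():
    print(f"{field}: {fraction:.0%} complete")
```

A report like this only measures completeness; accuracy still needs source verification or range and consistency checks.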
Data Collection
• Needs to be thoroughly planned before study begins
• Difficult to make changes after study begins
• How much data do you collect?
• Not so much that it is a burden on the study team
• But enough to answer the study question
Data Collection
• Need to have well specified study objectives (primary, secondary, exploratory) before study begins
• Need to have well defined outcomes
• Careful planning will allow for collection of enough data to answer the study questions
Data Collection – Keys to Success
• Defining variables and recordable events clearly
• Communicating these definitions clearly to research nurses, clinic personnel, data managers, and statistical analysts
• Developing strategies for consistent data collection
Data Collection – Issues to Consider
• What data can we collect…where…and when?
– During Clinic Visit?
– During Surgery?
– Data from Pathology and Radiology Reports?
– Data associated with outside treatments and previous disease events?
Data Collection – Issues to Consider
• Who is responsible for collecting and recording each type of data?
• Who is responsible for quality control of data?
• Which data sources override one another?
• What is the frequency with which data are reviewed?
Data Collection – Case Report Forms (CRFs)
• Well designed forms are crucial
• Must be clear and easy to use
• Can include directions, but prefer forms to be self-explanatory
• Consistency in how forms are filled out is crucial
• Consider testing forms prior to implementation
Data Collection – Case Report Forms (CRFs)
• Each subject should have a unique study ID that appears on every form
• Data should be entered in coded fields, with boxes to check for the appropriate categories
• Specify whether more than one box can be checked
• For a yes/no variable, include boxes for both yes AND no
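The reason for separate yes and no boxes is that a single unchecked "yes" box cannot distinguish "no" from "not answered". A minimal Python sketch of an entry-time check along these lines is shown below; the record layout and the function `check_form` are hypothetical, not part of any particular data system.

```python
def check_form(form):
    """Return a list of problems found on one hypothetical CRF record."""
    problems = []
    if not form.get("study_id"):
        problems.append("missing study ID")
    # A yes/no item is represented by two check boxes; exactly one must be ticked.
    for item, boxes in form.get("yes_no_items", {}).items():
        ticked = [label for label in ("yes", "no") if boxes.get(label)]
        if len(ticked) != 1:
            problems.append(f"{item}: expected exactly one of yes/no, got {ticked}")
    return problems


good_form = {"study_id": "S001", "yes_no_items": {"recurrence": {"yes": False, "no": True}}}
blank_item = {"study_id": "S002", "yes_no_items": {"recurrence": {"yes": False, "no": False}}}
no_id = {"study_id": "", "yes_no_items": {"recurrence": {"yes": True, "no": True}}}

for form in (good_form, blank_item, no_id):
    print(form["study_id"] or "<no id>", check_form(form))
```

With two boxes, the blank item is flagged as unanswered instead of being silently read as "no".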
Data Collection – Case Report Forms (CRFs)
• Date formats should be clearly specified
– DD-MON-YYYY or MM/DD/YYYY or DD/MM/YYYY
• Open-ended text fields are problematic
• Missing data – is it missing, was it not done, was it not applicable?
• In addition to a yes/no outcome, collect dates if applicable
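To see why the date format has to be stated on the form, the Python sketch below parses the same all-numeric date under two conventions and contrasts it with a spelled-out month. It also shows one possible way to record why a value is absent. The `MissingReason` codes and the helper `parse_crf_date` are hypothetical, and the month abbreviation parsing assumes an English locale.

```python
from datetime import datetime
from enum import Enum


class MissingReason(Enum):
    """Hypothetical codes that keep 'why is it blank?' in the data."""
    UNKNOWN = "missing"
    NOT_DONE = "not done"
    NOT_APPLICABLE = "not applicable"


def parse_crf_date(text):
    """Parse a DD-MON-YYYY date such as 09-JUN-2010; raise ValueError otherwise."""
    return datetime.strptime(text, "%d-%b-%Y").date()


# A spelled-out month leaves no doubt about which number is the day.
print(parse_crf_date("03-APR-2010"))

# An all-numeric date parses under both conventions, giving different dates.
as_us = datetime.strptime("03/04/2010", "%m/%d/%Y").date()
as_eu = datetime.strptime("03/04/2010", "%d/%m/%Y").date()
print(as_us, as_eu, as_us == as_eu)

# Record a reason instead of leaving the field empty.
lab_result = {"value": None, "reason": MissingReason.NOT_DONE}
print(lab_result["reason"].value)
```

Neither numeric parse raises an error, so an unspecified format produces wrong dates silently rather than visibly.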
Data Integrity
• Problems could be due to fraudulent activity
• Most commonly, problems are due to poor design or lack of planning
• Science is based on replication of results
• Clinical trials are sometimes repeated
• Cannot be perfect, but want to do the best you can
Data Integrity - Incompetence
• Need to have competent investigators
• Need to have timely collection of data
• Need to collect high quality data
• Need to have a competent lab if samples are collected
• Can the lab handle the volume of samples it will receive?
Data Integrity - Misunderstanding
• Data to be collected must be clearly and specifically defined
• Cannot have variation due to interpretation of definition
• Eligibility criteria need to be well thought out and clearly defined
• Outcome measures need to be specifically defined
• Death is clear, recurrence may not be clear
• Impact of misunderstanding could be serious
Data Integrity - Misunderstanding
• Train personnel who will be collecting data
• Train data entry personnel
• Train personnel who will be assessing eligibility
• Impact of misunderstanding could be serious
– e.g., misunderstanding of an outcome
• Collect all data needed to determine the outcome
• If an outcome has multiple components, collect the details of each component
Data Integrity - Errors
• Errors may be random or systematic
• No way of predicting random errors, but they are unlikely to be repeated in the same way
• Systematic errors are more problematic
• High probability errors will happen again in a similar situation
• Random errors add variability “noise” to study but most likely will not invalidate results
• Systematic errors may affect results and credibility of study
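The difference between the two kinds of error can be seen in a small simulation. The Python sketch below is only an illustration: the "true" values, the noise level, and the +3 offset (standing in for something like a miscalibrated device) are all assumed numbers.

```python
import random
from statistics import mean, stdev

random.seed(2010)  # fixed seed so the sketch is repeatable

# Hypothetical "true" values for 10,000 subjects.
true_values = [random.gauss(120.0, 10.0) for _ in range(10_000)]

# Random error: unpredictable noise centered on zero.
with_random_error = [v + random.gauss(0.0, 5.0) for v in true_values]

# Systematic error: the same +3 offset every time (e.g., a miscalibrated device).
with_systematic_error = [v + 3.0 for v in true_values]

for label, data in [
    ("true", true_values),
    ("random error", with_random_error),
    ("systematic error", with_systematic_error),
]:
    print(f"{label:>16}: mean = {mean(data):7.2f}, sd = {stdev(data):5.2f}")
```

Random error leaves the mean essentially where it was and inflates the standard deviation. The systematic offset leaves the spread untouched and moves the mean, which is why it does not average out with more data.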
Data Integrity - Errors
• Subjects may have been enrolled/randomized but do not meet eligibility criteria
• What do you do?
• Still follow subjects for study measurements and outcomes?
• YES!
• In a randomized trial, random errors will be balanced between the study groups and will add variability but will not invalidate the trial
Data Integrity - Errors
• Important to monitor for systematic errors
• May not be possible to go back and correct errors
Data Integrity - Bias
• Prejudices, conscious or unconscious, can introduce bias
• Blinding is important for both subjects and investigators where possible
• Patient knowledge of treatment could affect actions
• Investigator’s knowledge of treatment assignment can subconsciously affect evaluation of outcomes
Data Integrity - Bias
• Could be introduced by excluding randomized/enrolled patients from analysis because they did not complete therapy or did not meet eligibility criteria
• Bias in primary outcomes can be very problematic
• Effort should be given to eliminate bias in design, conduct, and analysis
Data Integrity - Intention to Treat
• Intention-to-treat (ITT) principle:
– All subjects meeting admission criteria and subsequently randomized should be counted in their originally assigned treatment groups without regard to deviations from assigned treatment (Fisher et al., 1990)
Data Integrity - Intention to Treat
• The intention-to-treat (ITT) principle means:
– All subjects randomized should be counted in their originally assigned treatment groups without regard to deviations from assigned treatment
– No exceptions
Data Integrity - Intention to Treat
• Examples of subjects that are frequently excluded:
– Failed to meet compliance/adherence requirements
– Discontinued treatment due to adverse effects
– Received no treatment
– Received the wrong treatment (e.g., due to record keeping error)
– Failed inclusion/exclusion criteria after randomization
Data Integrity – Intention to Treat
• Deviations from ITT may bias analyses
• Treatment discontinuations, non-compliance with protocol, and non-adherence to treatment are frequently treatment and outcome dependent
• Example: a drug whose only effect is to cause a severe reaction in the sickest subjects: if you exclude from the analysis those who discontinue, the drug appears to make subjects better
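The example above can be worked through with made-up counts. In the Python sketch below, every number is hypothetical: each arm has 100 subjects, the drug has no effect on survival, and the 20 sickest subjects in the drug arm have the severe reaction and discontinue.

```python
# Hypothetical trial with 100 subjects per arm. The drug does not change
# survival at all: both arms have the same deaths in each stratum.
strata = {
    "sickest": {"n": 20, "deaths": 10},
    "others": {"n": 80, "deaths": 4},
}


def rate(groups):
    """Mortality over the listed strata."""
    n = sum(strata[g]["n"] for g in groups)
    deaths = sum(strata[g]["deaths"] for g in groups)
    return deaths / n


# Intention-to-treat: everyone randomized is analyzed in the assigned arm.
itt_drug = rate(["sickest", "others"])
itt_placebo = rate(["sickest", "others"])

# Flawed analysis: the sickest drug-arm subjects had the severe reaction,
# discontinued, and are dropped. Nobody is dropped from the placebo arm.
completers_drug = rate(["others"])
completers_placebo = rate(["sickest", "others"])

print(f"ITT:             drug {itt_drug:.0%} vs placebo {itt_placebo:.0%}")
print(f"Completers only: drug {completers_drug:.0%} vs placebo {completers_placebo:.0%}")
```

The intention-to-treat comparison correctly shows no difference. Dropping the discontinuers removes the highest-risk subjects from one arm only, so a drug with no benefit appears to reduce mortality.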
References
• Cook T and DeMets DL. Introduction to Statistical Methods for Clinical Trials. Chapman & Hall/CRC; Taylor & Francis Group, LLC, Boca Raton, FL, 2008.
• DeMets DL. Distinctions between fraud, bias, errors, misunderstanding, and incompetence. Controlled Clinical Trials 1997;18:637-650.
• Steneck NH. Introduction to Responsible Research. Office of Research Integrity, Department of Health and Human Services. http://ori.dhhs.gov/documents/rcrintro.pdf
Data Integrity – ITT example
• Anturane Reinfarction Trial (ART):
• Trial of the clotting inhibitor anturane
• Subjects with recent myocardial infarction
• Primary outcome: mortality
• 1629 randomized
• Re-evaluation of eligibility identified 71 ineligible subjects
Data Integrity – ITT example
• Anturane Reinfarction Trial (ART):
• Initial mortality analysis excluded the ineligible subjects
• Rationale: pre-specified eligibility criteria, based on data measured prior to randomization
Data Integrity – ITT example
Reference: Temple and Pledger (1980) NEJM p. 1488
Subgroup                 Anturane         Placebo          P-value
ITT (all randomized)     74/813 (9.1%)    89/816 (10.9%)   0.20
Eligible                 64/775 (8.3%)    85/783 (10.9%)   0.07
Ineligible               10/38 (26.3%)    4/33 (12.1%)     0.12
Eligible vs. Ineligible  p = 0.0001       p = 0.98
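The percentages in the table can be recomputed from the counts, which also confirms that the eligible and ineligible subgroups add up to everyone randomized. The Python sketch below uses only the counts shown on the slide; it does not attempt to reproduce the p-values, since the test behind them is not stated here.

```python
# Deaths and subjects per arm, as reported on the slide.
art = {
    "ITT (all randomized)": {"anturane": (74, 813), "placebo": (89, 816)},
    "Eligible": {"anturane": (64, 775), "placebo": (85, 783)},
    "Ineligible": {"anturane": (10, 38), "placebo": (4, 33)},
}


def pct(deaths, n):
    """Mortality as a percentage, rounded to one decimal place."""
    return round(100.0 * deaths / n, 1)


# Eligible and ineligible subjects should add up to everyone randomized.
for arm_name in ("anturane", "placebo"):
    deaths = art["Eligible"][arm_name][0] + art["Ineligible"][arm_name][0]
    n = art["Eligible"][arm_name][1] + art["Ineligible"][arm_name][1]
    assert (deaths, n) == art["ITT (all randomized)"][arm_name]

for subgroup, arms in art.items():
    a, p = pct(*arms["anturane"]), pct(*arms["placebo"])
    print(f"{subgroup:<22} anturane {a:>5}%  placebo {p:>5}%  difference {p - a:+.1f}")
```

Removing the 71 ineligible subjects widens the apparent mortality difference because the excluded anturane subjects had much higher mortality than the excluded placebo subjects.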
Data Integrity – Intention to Treat
• Exclusions prior to randomization are not the problem; these subjects should, by definition, be excluded from analyses
• Withdrawals after randomization are the concern
• Need to specifically define when a subject has been randomized