Planning and Managing your Data Analyses
Sharon R. Ghazarian, PhD
Director, BEAD Core
Questions and Solutions
•What is a data analysis plan?
•Why and when do I need a data analysis plan?
•What belongs in a data analysis plan?
•When/how will it be used?
•A static document versus a dynamic moving part?
•Translating the analysis plan to a tracking document
•What analyses can I propose for my data?
What is a data analysis plan?
The place to:
•Justify – show how you will use the data you collected/obtained and why (the importance) you put the participants through the study
•Assure – demonstrate that your objectives can/will be achieved
•Identify the parameters of interest
•Identify the associations of interest
•Identify group comparisons
Looking back to move forward…
•Remember your aims, hypotheses, research questions
•Should be tied VERY closely to your data analysis plan
•A disconnect between aims, hypotheses, research questions and analyses is a FATAL FLAW
•A disconnect can be “formidable”
•Revision in one requires revision in the other
Research questions, aims, hypotheses, study design, and analyses are all intertwined as one
Attention to one may require revision of the others: an iterative process
An example:
Hypothesis 1: Females will be more likely to report experiencing migraine headaches compared to males.
Hypothesis 2: Females living in the eastern region of the US will be more likely to report frequent headaches compared to those living in the western region.
Hypothesis 3: Women and men who report high levels of depressive symptoms will be more likely to report high levels of anxiety compared to those who report low levels of depressive symptoms.
So, the analysis plan should include:
•Group comparisons
•Identify parameters of interest
•Identify associations of interest
•But NOT introduce new parameters that have not been mentioned previously (intro AND measures)
•But NOT propose associations that the reader is not prepared for
•But NOT propose group associations that have not been justified
Why and when do I need a data analysis plan?
•A data analysis plan is a place to:
▫Organize your thoughts about analyses
▫Match up your proposed analyses to your aims, hypotheses, and/or research questions
▫Head off reviewer (grant or manuscript) comments/concerns
▫Provide a place to bring together the front and back ends of your grant/manuscript
•Put together your data analysis plan as early as possible – ideally as soon as you have all the necessary information – you can revise later.
What belongs in a data analysis plan?
•ALL parameters of interest – ALL variables in the measures section
•Say something about your demographic variables
•What preliminary analyses will you complete?
▫Descriptive analyses
  Central tendency (mean, median, mode)
  Variation (SD, range, variance)
▫Missing data
  Planned missingness
  Informative missingness
  Imputation
▫Sensitivity analyses
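As an illustration of the descriptive step listed above, here is a minimal sketch using only Python's standard library. The pain scores are made up for the example and are not data from this seminar.

```python
import statistics

# Hypothetical pain scores for five participants (illustrative only).
pain = [5, 4, 2, 9, 4]

summary = {
    # Central tendency
    "mean": statistics.mean(pain),
    "median": statistics.median(pain),
    "mode": statistics.mode(pain),
    # Variation (sample formulas, n - 1 in the denominator)
    "sd": statistics.stdev(pain),
    "variance": statistics.variance(pain),
    "range": max(pain) - min(pain),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Running the same summary on every variable in the measures section is one way to confirm that each parameter named in the plan actually exists in the data.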
What belongs in a data analysis plan?
•What study design features do you need to attend to in preliminary analyses?
▫Matching
▫Clustering
▫Repeated Measures
▫Intervention model
▫Treatment groups
•Hypothesis-testing Analyses
▫Baseline, interim, longitudinal
▫Subgroup
▫Unadjusted vs adjusted
What belongs in a data analysis plan?
•Need description of statistical methods for EACH objective, aim, hypothesis of the study
•What statistical test will be used for each analysis – stay tuned….
•What level of statistical significance will be used for all analyses
•Is there a level of clinical significance that needs to be stated?
▫Ghazarian, S. R. (2011). Distinguishing statistical significance from clinical importance. Pediatrics in Review, 32, 73-74.
What belongs in a data analysis plan?
•What software package(s) will be used to complete the analyses
•Any additional information the reader might need before proceeding to the Results section
When/how will it be used?
•Initially – this is a working document that will go through numerous revisions
▫Be sure to revise when/if the study design changes (i.e. during data collection)
•Use this document as a guide for any interim analyses
•Use this document as a guide when working with a data manager or planning your database
•When you are ready to begin analyses, go to this document first
Translating the analysis plan to a tracking document
•Use the bones of your data analysis plan as a guide to create a tracking (syntax) document for EACH and EVERY step that is completed in your analyses
•DOCUMENT EVERYTHING!!!!
•Annotate everything
•Use simple language so it can be understood by your entire team
•Stay organized – otherwise this document becomes useless
******* This series of commands will produce a correlation matrix and avoids the missing data problem *********
To create the correlation matrix, just have the appropriate data file loaded up into SPSS and run this program. The end result will be a new data file (in the spreadsheet). Just save that file with an appropriate name
(((The following is for Autonomy Cross Cultural Validation Paper:
Mother model:
temp.
select if (Gender eq 2).
CORRELATION freedomm friendsm rwm clothesm datingm confidnm encourgm careerm educm personm talkm
  schoolm groupm educatnm marrym livem choosem wantsm needm approvem lovemem niceem hard ontime educimpt likeschl gpa work trying goals noend avoid bother stick now success handle newthing failure insecure reliant giveup
  / MISSING=LISTWISE
  / MATRIX OUT (*) .
EXECUTE .
Father model:
temp.
select if (Gender eq 2).
CORRELATION freedomf friendsf rwf clothesf datingf confidnf encourgf careerf educf personf talkf
  schoolf groupf educatnf marryf livef choosef wantsf needf approvef lovemef niceef hard ontime educimpt likeschl gpa work trying goals noend avoid bother stick now success handle newthing failure insecure reliant giveup
  / MISSING=LISTWISE
  / MATRIX OUT (*) .
EXECUTE .
FILTER OFF.
Example Syntax for Tracking:
-------------------------------------------------------------AGE------------------------------------------------------------------
*focal child age, wave 1.
RECODE ags01248 (else=copy) INTO agew1.
RECODE ags29019 (ELSE=COPY) INTO agew2.
execute.
-------------------------------------------------------------EDUCATION--------------------------------------------------------
*parent education, wave 1.
RECODE eds01249 (1=0) (2 thru 3 = 1) (4 thru 6 = 2) (7 thru 10=3) INTO peduc.
VARIABLE LABELS peduc 'parent education, degree'.
ADD VALUE LABELS peduc 0 '< hs' 1 'hs/ged' 2 'hs+' 3 'college +'.
RECODE eds01281 (1=0) (2 thru 3 = 1) (4 thru 6 = 2) (7 thru 10=3) INTO speduc.
VARIABLE LABELS speduc 'spouse/partner education, degree'.
ADD VALUE LABELS speduc 0 '< hs' 1 'hs/ged' 2 'hs+' 3 'college +'.
EXECUTE.
COMPUTE educw1 = 999.
IF (SYSMIS(speduc)) educw1=peduc.
IF (~(sysmis(speduc)) & (speduc>peduc)) educw1=speduc.
IF (~(sysmis(speduc)) & (speduc<=peduc)) educw1 = peduc.
VARIABLE LABELS educw1 'highest education in hh of parent/spouse'.
ADD VALUE LABELS educw1 0 '< hs' 1 'hs/ged' 2 'hs+' 3 'college +'.
EXECUTE.
-------------------------------------------------------------INCOME--------------------------------------------------------------
RECODE wfs03459 (else=copy) INTO hhincw1.
VARIABLE LABELS hhincw1 'income-to-needs ratio, wave 1, imputed values by investigators'.
FREQUENCIES VARIABLES = hhincw1.
EXECUTE.
Example Syntax for Tracking:
* Housekeeping
* Add labels to variable names
label variable pt_id "number assigned to patient upon entry into database"
label variable dob "Date of Birth (dd/month/yy)"
label var sex "sex"
label define sex_ 0 "Male" 1 "Female"
label values sex sex_
label define dead 0 "Alive" 1 "Dead"
label values dead_ dead
*********************** data consistency / baseline information ************************************
* see the total number of participants
count if visit==1
* see race / gender proportion
tab sex if visit==1
tab race if visit==1
ranksum race if visit==1, by(sex)
* See whether every death of patients had exact death date (2 patients died without date)
count if dead_==1 & visit==1
count if datedied2 !=. & visit==1
* See how many late entries if time metric is "Time since SSc is diagnosed"
count if datediagnosis- datefirstseen<0 & visit==1 & datediagnosis!=. & datefirstseen!=.
Example Syntax for Tracking:
What analyses can I propose for my data?
BEAD Core Seminar II – October 2011
Clinical Epidemiology and Biostatistics for the Practitioner
•Jacky M. Jennings, PhD, MPH
•Sharon R. Ghazarian, PhD
Matching Statistical Tools to your Data
•Cross-sectional Data
▫Categorical outcomes
▫Continuous outcomes
•Multi-level Data
•Longitudinal Data
•Working through an example
Cross-Sectional Data
data at one point in time for each participant
ID Pain Depression Fatigue Stress
1 5 13 11 8
2 4 8 12 14
3 2 7 4 8
4 9 17 12 15
Categorical Outcomes (e.g. mortality yes/no)
An Example: Mortality data (yes/no) for patients with and without a diagnosis of Hepatitis C
Looking at the proportions of patients with and without Hepatitis C, and examining the proportions of those who died 20 years later versus those who did not
Outcome: mortality (yes/no)
Predictor: disease (Hepatitis C; yes/no)
Categorical Outcomes Statistical Tools
Cross-Sectional Data
Testing differences in proportions
  Chi-square; Fisher’s Exact
  Example: Are there differences in the proportion of those who died versus survived in a sample of patients where some had Hepatitis C and some did not?
Calculating differences in odds of expected outcomes
  Logistic Regression
  Example: Are the odds of death different based on whether the patient is diagnosed with Hepatitis C?
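To make the two tools above concrete, here is a small sketch on a 2x2 table using only Python's standard library. The counts are invented for illustration, not results from any study. The chi-square statistic is computed without a continuity correction, and the odds ratio shown is what a logistic regression with this single yes/no predictor would report as exp(coefficient).

```python
import math

# Hypothetical 2x2 counts (illustrative only):
#                   died   survived
# Hepatitis C        30       70
# no Hepatitis C     10       90
a, b = 30, 70
c, d = 10, 90
n = a + b + c + d

# Pearson chi-square statistic for a 2x2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Upper-tail p-value for a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

# Odds of death with Hepatitis C divided by odds of death without it.
odds_ratio = (a * d) / (b * c)

print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}, odds ratio = {odds_ratio:.2f}")
```

With small expected cell counts, Fisher's Exact test would be the usual alternative to the chi-square statistic shown here.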
Continuous Outcomes Cross-Sectional Data
An Example: examining the effects of resistance exercise on pain, fatigue, functional ability, and quality of life
•Looking at whether there is a difference in mean levels of pain, fatigue, functional ability and quality of life for those who engage in resistance exercise versus not
Outcome: pain, fatigue, functional ability and quality of life levels (continuous)
Predictor: resistance exercise (categorical yes/no)
Continuous Outcomes Statistical Tools
Cross-Sectional Data
Testing differences in means
  Independent t-test – two separate groups
    Example: Two groups (exercise/no exercise); look for differences in mean levels of fatigue and pain
  Paired t-test – two measures of the same person (e.g. pre/post)
    Example: Pre-exercise pain test and post-exercise pain test
    Look for differences pre/post in the same person for the entire sample
  ANOVA; ANCOVA; MANOVA
    Example: Comparing mean levels of pain, fatigue, quality of life across exercise groups after adjusting for covariates
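The difference between the independent and paired t-tests is easiest to see in how the test statistic is built. The sketch below uses invented pain scores and only the standard library; it computes the t statistics only, not p-values. The independent version uses Welch's form (separate group variances), which is an assumption of this example rather than something the slides specify.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """t statistic for two separate groups, using each group's own variance."""
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

def paired_t(before, after):
    """t statistic for two measures on the same people (before minus after)."""
    diffs = [x - y for x, y in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

# Hypothetical pain scores (illustrative only).
exercise_group = [4, 5, 3, 6, 4]
no_exercise_group = [7, 6, 8, 7, 9]
pre_exercise = [7, 6, 8, 7, 9]
post_exercise = [5, 6, 6, 5, 8]

t_independent = welch_t(exercise_group, no_exercise_group)
t_paired = paired_t(pre_exercise, post_exercise)
print(f"independent t = {t_independent:.3f}, paired t = {t_paired:.3f}")
```

The paired version works on within-person differences, which is why it is the right choice for pre/post measures on the same participants.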
Continuous Outcomes Statistical Tools
Cross-Sectional Data
Linear regression
  Same objective of comparing means but gives you the opportunity to look at a slope value for each association
General Linear Models (GLM) – a catch-all term for multiple types of analyses
  Allows you to perform several types of regression models all in one method
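A short sketch can show why linear regression has the "same objective of comparing means": with a single yes/no predictor, the least-squares slope equals the difference between the two group means. The data below are invented, and `ols_fit` is a helper written for this example, not a library function.

```python
import statistics

def ols_fit(x, y):
    """Least-squares slope and intercept for one predictor."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: exercise coded 1 = yes, 0 = no; pain is continuous.
exercise = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pain = [4, 5, 3, 6, 4, 7, 6, 8, 7, 9]

slope, intercept = ols_fit(exercise, pain)

mean_yes = statistics.mean(p for e, p in zip(exercise, pain) if e == 1)
mean_no = statistics.mean(p for e, p in zip(exercise, pain) if e == 0)

print(f"slope = {slope:.2f}, difference in means = {mean_yes - mean_no:.2f}")
print(f"intercept = {intercept:.2f}, mean of no-exercise group = {mean_no:.2f}")
```

Here the intercept is the mean of the group coded 0 and the slope is the shift in mean pain associated with exercise.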
Multi-Level Data Methods
▫Utilize for grouped (or nested) data where there is more than one category of grouping
  Examples: neonates enrolled from 30 clinic sites OR adolescents enrolled from 63 neighborhoods OR patients with multiple measures at multiple time points
▫Allow for:
  Study of effects that vary by groups
  Estimation of group level averages
▫Advantages:
  Traditional regression ignores the average variation between groups
  Individual-level effects can be explored
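One way to see what "variation between groups" means is to split the total variation in an outcome into a between-site part and a within-site part. The sketch below does only that decomposition on invented data for three hypothetical clinic sites; it is not a multi-level model, just the quantity such a model is designed to account for.

```python
import statistics

# Hypothetical outcome scores grouped by clinic site (illustrative only).
sites = {
    "A": [3, 4, 5],
    "B": [6, 7, 8],
    "C": [1, 2, 3],
}

site_means = {name: statistics.mean(values) for name, values in sites.items()}
all_values = [v for values in sites.values() for v in values]
grand_mean = statistics.mean(all_values)

# Variation of site averages around the grand mean.
ss_between = sum(
    len(values) * (site_means[name] - grand_mean) ** 2
    for name, values in sites.items()
)
# Variation of individuals around their own site average.
ss_within = sum(
    (v - site_means[name]) ** 2
    for name, values in sites.items()
    for v in values
)
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

print(f"site means: {site_means}")
print(f"share of variation between sites: {ss_between / ss_total:.2f}")
```

When a large share of the variation sits between sites, treating all participants as independent observations ignores real structure in the data.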
Longitudinal Data
multiple data points for each participant over time
ID PainT1 PainT2 PainT3 PainT4
1 7 8 7 9
2 5 4 6 5
3 4 5 5 6
4 8 7 5 4
ID TiterT1 TiterT2 TiterT3 TiterT4
1 0.5 0.6 0.8 0.5
2 1.2 1.0 1.5 1.4
3 1.8 1.5 1.7 1.4
4 0.4 0.3 0.5 0.7
Longitudinal Data
Alternate Data Structure (Long)
ID Time Pain Fatigue
5 1 5 12
5 2 6 10
5 3 8 11
5 4 7 9
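Moving from the wide layout (one row per participant) to the long layout (one row per participant per time point) is a mechanical reshaping step. A minimal sketch, using the wide pain values shown earlier and a helper function named `wide_to_long` that is introduced here for illustration:

```python
# Wide layout: one row per participant, one column per time point.
wide = [
    {"ID": 1, "PainT1": 7, "PainT2": 8, "PainT3": 7, "PainT4": 9},
    {"ID": 2, "PainT1": 5, "PainT2": 4, "PainT3": 6, "PainT4": 5},
    {"ID": 3, "PainT1": 4, "PainT2": 5, "PainT3": 5, "PainT4": 6},
    {"ID": 4, "PainT1": 8, "PainT2": 7, "PainT3": 5, "PainT4": 4},
]

def wide_to_long(rows, prefix="PainT", n_times=4):
    """Return one record per participant per time point."""
    long_rows = []
    for row in rows:
        for t in range(1, n_times + 1):
            long_rows.append(
                {"ID": row["ID"], "Time": t, "Pain": row[f"{prefix}{t}"]}
            )
    return long_rows

long_rows = wide_to_long(wide)
for record in long_rows[:4]:
    print(record["ID"], record["Time"], record["Pain"])
```

Most statistical packages offer a built-in reshape command for this step; the loop above only shows what that command does.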
Options for Data Analysis
Summary statistics or derived variables
  Average across time (pain averaged from T1-T4)
  Change scores (pain T2 - pain T1)
  Area under the curve (AUC)
Problematic:
▫ Unbalanced cases (varying number of observations across subjects) lead to variability in information going into these statistics across cases
▫ Reducing repeated measurements to one summary measure leads to a loss of statistical power
▫ Use of time varying covariates (data captured over time for covariates) can be problematic
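The three summary measures named above can be sketched in a few lines. The example uses the four pain measurements for participants 1 and 4 from the wide pain table and assumes equally spaced visits (one time unit apart) for the trapezoid-rule AUC; that spacing is an assumption of this example.

```python
import statistics

def summarize(scores):
    """Collapse repeated measures into single summary values."""
    average = statistics.mean(scores)
    change_t1_t2 = scores[1] - scores[0]
    # Trapezoid rule with one time unit between consecutive visits.
    auc = sum((scores[i] + scores[i + 1]) / 2 for i in range(len(scores) - 1))
    return {"average": average, "change": change_t1_t2, "auc": auc}

# Pain at T1..T4 for two participants.
participant_1 = summarize([7, 8, 7, 9])
participant_4 = summarize([8, 7, 5, 4])

print(participant_1)
print(participant_4)
```

Each participant is reduced to one number per measure, which is exactly where the loss of information described in the list above comes from.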
Longitudinal Data
Common statistical tools for longitudinal data
1) Repeated measures ANOVA
Example: Measures of pain at 4 clinic visits over 6 months
Examining the difference in mean levels of pain over time, possibly accounting for additional covariates
  Assumption of constant variance and covariance over time; this is typically violated with longitudinal data since we expect variability over time
  Not helpful for categorical outcomes
  Not robust for missing data points
  Does not handle irregularly spaced data intervals very well
Longitudinal Data
2) Generalized mixed-effects regression models
Robust for continuous, categorical, and other non-normal distributions (e.g. Poisson; Negative Binomial)
Robust ability to handle missing data points
Can handle irregularly spaced data intervals
Can account for time-invariant (e.g. collected only at baseline) and time-varying covariates (e.g. collected over the 4 time points)
Longitudinal Data
3) Generalized Estimating Equations (GEE)
Used to estimate parameters of a generalized linear model with a possible unknown correlation between outcomes
Estimates the average response over the population rather than the effect of changing one or more covariates on a given individual
Robust parameter estimates (most often) even when the variance structure is misspecified
Longitudinal Data
4) Structural Equation Models (SEM)
▫Model-based analyses examining measurement and hypothesis-testing models at the same time
▫A synergy of factor analysis, reliability, validity, multiple linear regression, mediation, and moderation
▫Appropriate for continuous, categorical, and non-normally distributed outcomes
▫Can account for time-varying and time-invariant covariates
▫Can handle balanced or unbalanced data
▫Provides the ability to examine several regression models at the same time – but be mindful of collinearity
Structural Equation Models (SEM)
An Example
[Figure: path diagram with the constructs Pain, Fatigue, and Quality of Life, indicators Item1 through Item9, and error terms e1 through e9]
A breath…….
The marathon is over….questions before we work through an example?
A Submitted Example
Patients entered a cohort study at different time points and were followed for five years. We want to see the differences in rates of certain events (e.g. stroke, MI, etc) among different patient groups (e.g. male vs female, White vs African American, etc) after some adjustments (e.g. age at entry etc.).
Things to think about…
•What would these data look like?
•Cross-sectional or longitudinal?
•Time-varying versus time-invariant predictors?
•Continuous versus categorical outcomes?
•Sample size?
•Balanced or unbalanced data?
Now we need to match our statistical tools to our data.
Working through the decision process
•Would we want to use a summary statistic?
•Would repeated measures ANOVA work here?
• Generalized mixed-effects regression models?
•Do we need to use a multi-level model?
•Would SEM be appropriate here?
No, you are NOT expected to digest it all in one seminar!!!!!!
•Use your resources
•It often takes hearing these concepts multiple times before you really feel comfortable with the topic