Planning and Managing your Data Analyses Sharon R. Ghazarian, PhD Director, BEAD Core


Questions and Solutions
•What is a data analysis plan?
•Why and when do I need a data analysis plan?
•What belongs in a data analysis plan?
•When/how will it be used?
•A static document versus a dynamic moving part?
•Translating the analysis plan to a tracking document
•What analyses can I propose for my data?

What is a data analysis plan?
The place to:
•Justify – show how you will use the data you collected/obtained and why (the importance) you put the participants through the study
•Assure – demonstrate that your objectives can/will be achieved
•Identify the parameters of interest
•Identify the associations of interest
•Identify group comparisons

Looking back to move forward…

•Remember your aims, hypotheses, research questions

•Should be tied VERY closely to your data analysis plan

•A disconnect between aims, hypotheses, research questions and analyses is a FATAL FLAW

•A disconnect can be “formidable”
•Revision in one requires revision in the other

Research questions, aims, hypotheses, study design, analyses all intertwined as one

Attention to one may require revision of the other: an iterative process

An example:
Hypothesis 1: Females will be more likely to report experiencing migraine headaches compared to men.

Hypothesis 2: Females living in the eastern region of the US will be more likely to report frequent headaches compared to those living in the western region.

Hypothesis 3: Women and men who report high levels of depressive symptoms will be more likely to report high levels of anxiety compared to those who report low levels of depressive symptoms.

So, the analysis plan should include:

•Group comparisons
•Identify parameters of interest
•Identify associations of interest
•But NOT introduce new parameters that have not been mentioned previously (intro AND measures)

•But NOT propose associations that the reader is not prepared for

•But NOT propose group associations that have not been justified

Why and when do I need a data analysis plan?
•A data analysis plan is a place to:
▫Organize your thoughts about analyses
▫Match up your proposed analyses to your aims, hypotheses, and/or research questions
▫Head off reviewer (grant or manuscript) comments/concerns
▫Provide a place to bring together the front and back ends of your grant/manuscript
•Put together your data analysis plan as early as possible – ideally as soon as you have all the necessary information – you can revise later.

What belongs in a data analysis plan?
•ALL parameters of interest – ALL variables in the measures section
•Say something about your demographic variables
•What preliminary analyses will you complete?
▫Descriptive analyses
    Central tendency (mean, median, mode)
    Variation (SD, range, variance)
▫Missing data
    Planned missingness
    Informative missingness
    Imputation
▫Sensitivity analyses
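A minimal sketch of the descriptive and missing-data checks listed above, using only the Python standard library. The `pain` values are hypothetical and chosen for illustration; `None` is assumed to mark a missing response.

```python
import statistics

# Hypothetical pain scores; None marks a missing response (illustrative data only).
pain = [5, 4, 2, 9, None, 4, 7]

# Separate observed values from missing ones before summarizing.
observed = [x for x in pain if x is not None]
n_missing = len(pain) - len(observed)

summary = {
    "n": len(observed),
    "n_missing": n_missing,
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "mode": statistics.mode(observed),
    "sd": statistics.stdev(observed),           # sample SD (n - 1 denominator)
    "variance": statistics.variance(observed),  # sample variance
    "range": (min(observed), max(observed)),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Reporting the count of missing values alongside each summary makes it easier to spot variables that may need imputation or a sensitivity analysis.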

What belongs in a data analysis plan?

•What study design features do you need to attend to in preliminary analyses?
▫Matching
▫Clustering
▫Repeated Measures
▫Intervention model
▫Treatment groups

•Hypothesis-testing Analyses
▫Baseline, interim, longitudinal
▫Subgroup
▫Unadjusted vs adjusted

What belongs in a data analysis plan?
•Need description of statistical methods for EACH objective, aim, hypothesis of the study
•What statistical test will be used for each analysis – stay tuned….
•What level of statistical significance will be used for all analyses
•Is there a level of clinical significance that needs to be stated?
▫Ghazarian, S. R. (2011). Distinguishing statistical significance from clinical importance. Pediatrics in Review, 32, 73-74.

What belongs in a data analysis plan?

•What software package(s) will be used to complete the analyses

•Any additional information the reader might need before proceeding to the Results section

When/how will it be used?
•Initially – this is a working document that will go through numerous revisions
▫Be sure to revise when/if the study design changes (i.e. during data collection)
•Use this document as a guide for any interim analyses
•Use this document as a guide when working with a data manager or planning your database

•When you are ready to begin analyses, go to this document first

Translating the analysis plan to a tracking document

•Use the bones of your data analysis plan as a guide to create a tracking (syntax) document for EACH and EVERY step that is completed in your analyses

•DOCUMENT EVERYTHING!!!!
•Annotate everything
•Use simple language so it can be understood by your entire team
•Stay organized – otherwise this document becomes useless

* This series of commands will produce a correlation matrix and avoids the missing data problem.
* To create the correlation matrix, just have the appropriate data file loaded up into SPSS and run this program.
* The end result will be a new data file (in the spreadsheet). Just save that file with an appropriate name.
* The following is for Autonomy Cross Cultural Validation Paper.

* Mother model.
temp.
select if (Gender eq 2).
CORRELATION freedomm friendsm rwm clothesm datingm confidnm encourgm careerm educm personm talkm
  schoolm groupm educatnm marrym livem choosem wantsm needm approvem lovemem niceem
  hard ontime educimpt likeschl gpa work trying goals noend avoid bother stick now
  success handle newthing failure insecure reliant giveup
  / MISSING=LISTWISE
  / MATRIX OUT (*) .
EXECUTE .

* Father model.
temp.
select if (Gender eq 2).
CORRELATION freedomf friendsf rwf clothesf datingf confidnf encourgf careerf educf personf talkf
  schoolf groupf educatnf marryf livef choosef wantsf needf approvef lovemef niceef
  hard ontime educimpt likeschl gpa work trying goals noend avoid bother stick now
  success handle newthing failure insecure reliant giveup
  / MISSING=LISTWISE
  / MATRIX OUT (*) .
EXECUTE .
FILTER OFF.

Example Syntax for Tracking:

* ------------------------------ AGE ------------------------------.
*focal child age, wave 1.
RECODE ags01248 (else=copy) INTO agew1.
RECODE ags29019 (ELSE=COPY) INTO agew2.
execute.

* ------------------------------ EDUCATION ------------------------------.
*parent education, wave 1.
RECODE eds01249 (1=0) (2 thru 3 = 1) (4 thru 6 = 2) (7 thru 10=3) INTO peduc.
VARIABLE LABELS peduc 'parent education, degree'.
ADD VALUE LABELS peduc 0 '< hs' 1 'hs/ged' 2 'hs+' 3 'college +'.

RECODE eds01281 (1=0) (2 thru 3 = 1) (4 thru 6 = 2) (7 thru 10=3) INTO speduc.
VARIABLE LABELS speduc 'spouse/partner education, degree'.
ADD VALUE LABELS speduc 0 '< hs' 1 'hs/ged' 2 'hs+' 3 'college +'.
EXECUTE.

COMPUTE educw1 = 999.
IF (SYSMIS(speduc)) educw1=peduc.
IF (~(sysmis(speduc)) & (speduc>peduc)) educw1=speduc.
IF (~(sysmis(speduc)) & (speduc<=peduc)) educw1 = peduc.
VARIABLE LABELS educw1 'highest education in hh of parent/spouse'.
ADD VALUE LABELS educw1 0 '< hs' 1 'hs/ged' 2 'hs+' 3 'college +'.
EXECUTE.

* ------------------------------ INCOME ------------------------------.
RECODE wfs03459 (else=copy) INTO hhincw1.
VARIABLE LABELS hhincw1 'income-to-needs ratio, wave 1, imputed values by investigators'.
FREQUENCIES VARIABLES = hhincw1.
EXECUTE.


Housekeeping
* Add labels to variable names
label variable pt_id "number assigned to patient upon entry into database"
label variable dob "Date of Birth (dd/month/yy)"
label var sex "sex"
label define sex_ 0 "Male" 1 "Female"
label values sex sex_
label define dead 0 "Alive" 1 "Dead"
label values dead_ dead

*********************** data consistency / baseline information ************************************

* see the total number of participants
count if visit==1

* see race / gender proportion
tab sex if visit==1
tab race if visit==1
ranksum race if visit==1, by(sex)

* See whether every death of patients had exact death date (2 patients died without date)
count if dead_==1 & visit==1
count if datedied2 !=. & visit==1

* See how many late entries if time metric is "Time since SSc is diagnosed"
count if datediagnosis- datefirstseen<0 & visit==1 & datediagnosis!=. & datefirstseen!=.


What analyses can I propose for my data?

BEAD Core Seminar II – October 2011
Clinical Epidemiology and Biostatistics for the Practitioner

•Jacky M. Jennings, PhD, MPH
•Sharon R. Ghazarian, PhD

Matching Statistical Tools to your Data

•Cross-sectional Data
▫Categorical outcomes
▫Continuous outcomes
•Multi-level Data
•Longitudinal Data
•Working through an example

Cross-Sectional Data
data at one point in time for each participant

ID   Pain   Depression   Fatigue   Stress
1    5      13           11        8
2    4      8            12        14
3    2      7            4         8
4    9      17           12        15

Categorical Outcomes (e.g. mortality yes/no)

An Example: Mortality data (yes/no) for patients with and without a diagnosis of Hepatitis C

Looking at proportions of those with Hepatitis C and without and examining the proportions of those who died 20 years later versus those who did not
Outcome: mortality (yes/no)
Predictor: disease (Hepatitis C; yes/no)

Categorical Outcomes Statistical Tools
Cross-Sectional Data
•Testing differences in proportions
▫Chi-square; Fisher’s Exact
▫Example: Are there differences in the proportion of those who died versus survived in a sample of patients where some had Hepatitis C and some did not?
•Calculating differences in odds of expected outcomes
▫Logistic Regression
▫Example: Are the odds of death different based on whether the patient is diagnosed with Hepatitis C?
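The two tools above can be sketched on a single 2x2 table. The counts below are hypothetical and are not from any study; the Pearson chi-square is computed without a continuity correction, and the p-value uses the fact that a chi-square with 1 degree of freedom is a squared standard normal. The odds ratio shown is the quantity that a logistic regression with one yes/no predictor would report as exp(coefficient).

```python
import math

# Hypothetical 2x2 counts (illustrative only, not real study data).
#                     died  survived
# Hepatitis C: yes     30      70
# Hepatitis C: no      15      85
a, b = 30, 70   # Hepatitis C: died, survived
c, d = 15, 85   # no Hepatitis C: died, survived
n = a + b + c + d

# Pearson chi-square for a 2x2 table (no continuity correction).
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# p-value for 1 degree of freedom: P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

# Odds of death with Hepatitis C divided by odds of death without it.
odds_ratio = (a * d) / (b * c)

print(f"chi-square = {chi_square:.3f}, p = {p_value:.4f}")
print(f"odds ratio = {odds_ratio:.3f}")
```

With small expected cell counts, Fisher's Exact test is usually preferred over the chi-square approximation.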

Continuous Outcomes Cross-Sectional Data

An Example: examining the effects of resistance exercise on pain, fatigue, functional ability, and quality of life

• Looking at whether there is a difference in mean levels of pain, fatigue, functional ability and quality of life for those who engage in resistance exercise versus not
Outcome: pain, fatigue, functional ability and quality of life levels – continuous
Predictor: resistance exercise (categorical yes/no)

Continuous Outcomes Statistical Tools
Cross-Sectional Data
Testing differences in means
•Independent t-test – two separate groups
▫Example: Two groups (Exercise/no exercise) look for differences in mean levels of fatigue and pain
•Paired t-test – two measures of the same person (e.g. pre/post)
▫Example: Pre-exercise pain test and post-exercise pain test
▫Look for differences pre/post in the same person for the entire sample
•ANOVA; ANCOVA; MANOVA
▫Example: Comparing mean levels of pain, fatigue, quality of life across exercise groups after adjusting for covariates
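As a rough illustration of how the independent and paired t-tests differ, the sketch below computes both test statistics by hand. All scores are hypothetical, and the function names `independent_t` and `paired_t` are made up for this example. The independent version assumes equal variances (pooled Student's t); in practice a statistics package would also report the p-value.

```python
import math
import statistics

# Hypothetical pain scores (illustrative only, not study data).
exercise = [4, 5, 3, 6, 4]
no_exercise = [7, 6, 8, 5, 7]


def independent_t(group_a, group_b):
    """Student's t statistic for two separate groups (pooled variance)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(group_a) - statistics.mean(group_b)) / std_err
    return t_stat, n_a + n_b - 2


def paired_t(before, after):
    """Paired t statistic computed on within-person differences."""
    diffs = [post - pre for pre, post in zip(before, after)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err, len(diffs) - 1


# Hypothetical pre/post scores for the same five people.
pre_pain = [7, 6, 8, 5, 7]
post_pain = [5, 5, 6, 4, 6]

t_ind, df_ind = independent_t(exercise, no_exercise)
t_pair, df_pair = paired_t(pre_pain, post_pain)
print(f"independent t = {t_ind:.3f} (df = {df_ind})")
print(f"paired t = {t_pair:.3f} (df = {df_pair})")
```

The paired version works on one difference score per person, which is why its degrees of freedom depend on the number of people rather than the number of measurements.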

Continuous Outcomes Statistical Tools
Cross-Sectional Data
•Linear regression
▫Same objective of comparing means but gives you the opportunity to look at a slope value for each association
•General Linear Models (GLM) – a catch phrase for multiple types of analyses
▫Allows you to perform several types of regression models all in one method
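To make the link between comparing means and linear regression concrete, the sketch below fits a one-predictor least-squares line by hand. The fatigue scores are hypothetical. With a yes/no predictor coded 0/1, the slope equals the difference in group means and the intercept equals the mean of the group coded 0.

```python
import statistics

# Hypothetical fatigue scores (illustrative only).
no_exercise = [7, 6, 8, 5, 7]
exercise = [4, 5, 3, 6, 4]

# Code the predictor as 0 = no exercise, 1 = exercise.
x = [0] * len(no_exercise) + [1] * len(exercise)
y = no_exercise + exercise

# Ordinary least squares for a single predictor.
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

mean_difference = statistics.mean(exercise) - statistics.mean(no_exercise)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"difference in group means = {mean_difference:.2f}")
```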

Multi-Level Data Methods

▫Utilize for grouped (or nested) data where there is more than one category of grouping
    Examples: Neonates enrolled from 30 clinic sites OR adolescents enrolled from 63 neighborhoods OR patients with multiple measures at multiple time points
▫Allow for:
    Study of effects that vary by groups
    Estimation of group level averages
▫Advantages:
    Traditional regression ignores the average variation between groups
    Individual-level effects can be explored

Longitudinal Data
multiple data points for each participant over time

ID   PainT1   PainT2   PainT3   PainT4
1    7        8        7        9
2    5        4        6        5
3    4        5        5        6
4    8        7        5        4

ID   TiterT1   TiterT2   TiterT3   TiterT4
1    0.5       0.6       0.8       0.5
2    1.2       1.0       1.5       1.4
3    1.8       1.5       1.7       1.4
4    0.4       0.3       0.5       0.7

Longitudinal Data

Alternate Data Structure (Long)

ID Time Pain Fatigue

5 1 5 12

5 2 6 10

5 3 8 11

5 4 7 9
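A short sketch of moving from the wide layout (one row per participant) to the long layout (one row per participant per time point). The values are copied from the PainT1 to PainT4 table shown earlier; the variable names `wide` and `long_rows` are chosen for this example only.

```python
# Wide layout: one row per participant, one column per time point.
wide = [
    {"ID": 1, "PainT1": 7, "PainT2": 8, "PainT3": 7, "PainT4": 9},
    {"ID": 2, "PainT1": 5, "PainT2": 4, "PainT3": 6, "PainT4": 5},
    {"ID": 3, "PainT1": 4, "PainT2": 5, "PainT3": 5, "PainT4": 6},
    {"ID": 4, "PainT1": 8, "PainT2": 7, "PainT3": 5, "PainT4": 4},
]

# Long layout: one row per participant per time point.
long_rows = []
for row in wide:
    for time in range(1, 5):
        long_rows.append(
            {"ID": row["ID"], "Time": time, "Pain": row[f"PainT{time}"]}
        )

print("ID Time Pain")
for r in long_rows:
    print(r["ID"], r["Time"], r["Pain"])
```

Many longitudinal modeling routines expect the long layout, so this reshaping step is worth recording in the tracking document.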

Options for Data Analysis

Summary statistics or derived variables
•Average across time (pain averaged from T1-T4)
•Change scores (pain T2 – pain T1)
•Area under the curve (AUC)

Problematic:
▫ Unbalanced cases (varying number of observations across subjects) leads to variability in information going into these statistics across cases
▫ Reducing repeated measurements to one summary measure leads to a loss of statistical power
▫ Use of time varying covariates (data captured over time for covariates) can be problematic
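The three summary measures above can be sketched as follows, using the pain values from the longitudinal example table. Two assumptions are made for illustration: visits are equally spaced (times 1 to 4), and the AUC is computed with the trapezoid rule. The helper name `trapezoid_auc` is hypothetical.

```python
import statistics

# Pain at T1..T4 for each participant (values from the longitudinal example table).
pain_by_id = {
    1: [7, 8, 7, 9],
    2: [5, 4, 6, 5],
    3: [4, 5, 5, 6],
    4: [8, 7, 5, 4],
}
times = [1, 2, 3, 4]  # assumption: equally spaced visits


def trapezoid_auc(time_points, values):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (t1 - t0) * (v0 + v1) / 2
        for t0, t1, v0, v1 in zip(time_points, time_points[1:], values, values[1:])
    )


summaries = {}
for pid, values in pain_by_id.items():
    summaries[pid] = {
        "mean": statistics.mean(values),           # average across T1-T4
        "change_t2_t1": values[1] - values[0],     # pain T2 minus pain T1
        "auc": trapezoid_auc(times, values),
    }

for pid, s in summaries.items():
    print(pid, s)
```

In this small example, participants 2 and 3 end up with the same average and the same AUC even though their trajectories differ, which illustrates how a single summary measure can discard information.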

Longitudinal DataCommon statistical tools for longitudinal data

1) Repeated measures ANOVA
Example: Measures of pain at 4 clinic visits over 6 months
Examining the difference in mean levels of pain over time, possibly accounting for additional covariates
•Assumption of constant variance and covariance over time; this is typically violated with longitudinal data since we expect variability over time
•Not helpful for categorical outcomes
•Not robust for missing data points
•Does not handle irregularly spaced data intervals very well

Longitudinal Data

2) Generalized mixed-effects regression models

•Robust for continuous, categorical, and other non-normal distributions (e.g. Poisson; Negative Binomial)
•Robust ability to handle missing data points
•Can handle irregularly spaced data intervals
•Can account for time-invariant (e.g. collected only at baseline) and time-varying covariates (e.g. collected over the 4 time points)

Longitudinal Data
3) Generalized Estimating Equations (GEE)
•Used to estimate parameters of a generalized linear model with a possible unknown correlation between outcomes
•Estimates the average response over the population rather than the effect of changing one or more covariates on a given individual
•Robust parameter estimates (most often) even when the variance structure is misspecified

Longitudinal Data
4) Structural Equation Models (SEM)

▫Model-based analyses examining measurement and hypothesis-testing models at the same time

▫A synergy of factor analysis, reliability, validity, multiple linear regression, mediation, and moderation

▫Appropriate for continuous, categorical, and non-normally distributed outcomes

▫Can account for time-varying and time-invariant covariates

▫Can handle balanced or unbalanced data
▫Provides the ability to examine several regression models at the same time – but be mindful of collinearity

Structural Equation Models (SEM)
An Example

[Path diagram: Pain, Fatigue, and Quality of Life, with indicators Item1 to Item9 and error terms e1 to e9]

A breath…….

The marathon is over….questions before we work through an example?

A Submitted Example

Patients entered a cohort study at different time points and were followed for five years. We want to see the differences in rates of certain events (e.g. stroke, MI, etc) among different patient groups (e.g. male vs female, White vs African American, etc) after some adjustments (e.g. age at entry etc.).

Things to think about…
•What would these data look like?
•Cross-sectional or longitudinal?
•Time-varying versus time-invariant predictors?
•Continuous versus categorical outcomes?
•Sample size?
•Balanced or unbalanced data?

Now we need to match our statistical tools to our data.

Working through the decision process

•Would we want to use a summary statistic?

•Would repeated measures ANOVA work here?

• Generalized mixed-effects regression models?

•Do we need to use a multi-level model?

•Would SEM be appropriate here?

No, you are NOT expected to digest it all in one seminar!!!!!!

• Use your resources
• It often takes multiple times of hearing these concepts before you really feel comfortable with the topic
