View
22.054
Download
2
Embed Size (px)
Citation preview
Basic BiostatisticsAssoc. Prof. Dr. Jamalludin Ab Rahman MD MPH
Department of Community Medicine
May
1, 2
023
2
Learning outcomesAt the end of this workshop, you should be able to1. Describe about population & sample, causation,
level of measurement, distribution of data2. Summarise categorical & numerical data3. Use appropriate statistical test for bi-variable
analyses using SPSS
Bas
ic B
iost
atis
tics
May
1, 2
023
Bas
ic B
iost
atis
tics
3
We observe, we believe.What we (can) observe might
not be the truth
May
1, 2
023
4
Population vs. Sample
Parameter
Statistics
Bas
ic B
iost
atis
tics
May
1, 2
023
5
Parameter vs. Statistics Parameter
characteristic of the whole population Statistics
characteristic of a sample, presumably measurable.
Bas
ic B
iost
atis
tics
May
1, 2
023
6
Statistics estimate parameters But how ‘representative’ a sample to the population When sample statistics ≠ population statistics =
sampling error A different sample statistics from the same population
may yield different estimates If sampling is repeated many times, the statistics will
approach the actual value of the population parameters
Bas
ic B
iost
atis
tics
May
1, 2
023
7
N = 35Red = 15/35 = 42.8%
N = 6Red = 3/6 = 50%
N = 6Red = 2/6 = 33.3%
N = 6Red = 4/6 = 66.7%
Average % Red = = 50%
50% 42.8%Parameter
Statistics
Statistics Parameter
Bas
ic B
iost
atis
tics
May
1, 2
023
8
Variable & its role A value and whose associated value may be
changed
DependentIndependent
Bas
ic B
iost
atis
tics
May
1, 2
023
9
Causation Relation of events (cause and effect) But correlation (between two events) does not
(always) imply causation Rooster's crow does not cause the sun to rise Switch does not cause the bulb to light
Bas
ic B
iost
atis
tics
May
1, 2
023
10
Time & causation
Time
Exposure Exposure
Exposure
Outcome
Bas
ic B
iost
atis
tics
May
1, 2
023
11
Time & causation (example)
Time
AgeExposure to
silica
Smoking
Lung Cancer
Bas
ic B
iost
atis
tics
May
1, 2
023
12
Causal Model
OutcomeExposure Exposure Exposure
Exposure
Bas
ic B
iost
atis
tics
May
1, 2
023
13
Study design & causality Not all design can prove causality Cross sectional can’t determine causality Must have temporality – exposure precedes
outcome
Bas
ic B
iost
atis
tics
14
May
1, 2
023
Bas
ic B
iost
atis
tics
Type of data(Level of measurement)
Categorical
Nominal Ordinal
Numerical
Discrete Continuous
e.g. Gender, Race e.g. Cancer staging, Severity of CXR for PTB
e.g. Parity, Gravida
e.g. Hb, RBS, cholesterol.
May
1, 2
023
15
Distribution (shape) of data Applicable to numerical value Discrete or Continuous Discrete ~ Binomial, Poisson, Negative Binomial,
Hypergeometry, Multinomial etc. Continuous ~ Normal, t, chi-square, F etc.
Bas
ic B
iost
atis
tics
May
1, 2
023
16
Normal Distribution
Bas
ic B
iost
atis
tics
𝑓 (𝑥 ;𝜇 ,𝜎2 )= 1𝜎√2𝜋
𝑒− 1
2( 𝑥−𝜇𝜎
)2
¿
May
1, 2
023
17
Characteristics Smooth, symmetrical (around the mean, not
skewed, appropriate height), uni-modal, bell shaped curve
Mean = Median = Mode The total area under the curve
(AUC) = 1 Asymptotic to the x-axis – never touch x-axis
Bas
ic B
iost
atis
tics
May
1, 2
023
18
Test of Normality Anderson–darling Test Corrected Kolmogorov–Smirnov Test (Lilliefors Test) Cramér–von-mises Criterion D'agostino's K-squared Test Jarque–bera Test Pearson's Chi-square Test Shapiro–francia Shapiro–wilk Test
Bas
ic B
iost
atis
tics
May
1, 2
023
19
Use Normality test with caution Small samples almost always pass a normality test.
Normality tests have little power to tell whether or not a small sample of data comes from a Gaussian distribution.
With large samples, minor deviations from normality may be flagged as statistically significant, even though small deviations from a normal distribution won’t affect the results of a t test or ANOVA.
Bas
ic B
iost
atis
tics
May
1, 2
023
Bas
ic B
iost
atis
tics
20
Why run statistical test?1. Determine presence of difference (or similarity)2. Determine degree of difference3. Determine the direction of changes (outcome)4. Predict changes (outcomes)
May
1, 2
023
Bas
ic B
iost
atis
tics
21
A B C
Is there any difference between A & B?
How big is the difference between A & B?
Which one is taller? A or B?
Is C different from A & B?
Is there any pattern now?
If there is D, can you predict how tall is D?
May
1, 2
023
Bas
ic B
iost
atis
tics
22
Statistical analysis
AnalyticalDescriptive
e.g. Describe socio-demographic characteristics - Age, Sex, Race etc.
e.g. Prevalence of hypertension.
e.g. Compare demographic characteristics between two population – Compare age between male & female
e.g. Distribution of gender by hypertension status
e.g. How demographic characteristics (more than one factors) explain hypertension
Univariable Bivariable MultivariableIV DV IV DV IV DV IV
May
1, 2
023
23
Descriptive Statistics Explain one variable at one time Method based on type of measure
CategoricalFrequency (Percentage)
NumericalCentral measures (e.g. mean, median) & Dispersion (e.g. variance, standard deviation, range, min-max, interquartile range)
Bas
ic B
iost
atis
tics
May
1, 2
023
24
Data
CategoricalFrequency (count) &
Percentage
Numerical
Normal Mean (SD)
Not Normal Median (Range/IQR)
How to describe a data
Bas
ic B
iost
atis
tics
May
1, 2
023
25
DifferenceA
B C
Which of the following shows true difference between two populations?
I II
IIII II
Bas
ic B
iost
atis
tics
May
1, 2
023
26
P in hypothesis testing The mean glucose from 9 rabbits decreased from 7.3
mmol/L to 7.0 (0.9) mmol/L within 24 hours. Using t-test, taking critical value of P=0.05, is the difference significant?
Ho: No difference i.e. 7.3= 7.0 z = (7.0-7.3)/(0.9/9) = -1 Based on t-table, z at -1, df=9, two-tailed, P is between -
0.4 & -0.3 Therefore, P > 0.05
Bas
ic B
iost
atis
tics
May
1, 2
023
27
One-tailed vs. two-tailed Is there a difference between Hb 14 g% vs. Hb 12
g% in male & female respectively?Ho: HbM – HbF = 0H1: HbM = HbFH2: HbM > HbFH3: HbM < HbM
Note: Should be determined a priori
Bas
ic B
iost
atis
tics
May
1, 2
023
28
One-tailed vs. two-tailed
Two-sided Right-sidedLeft-sided
Bas
ic B
iost
atis
tics
May
1, 2
023
29
Hypothesis testingTruth
Ho True Ho False
Test
Do notreject Ho Correct Type II error
()
Reject Ho
Type I error () Correct
P-value is the probability to make Type I error (based on frequentist inference)
Bas
ic B
iost
atis
tics
May
1, 2
023
30
P & Sample Size
Bas
ic B
iost
atis
tics
May
1, 2
023
31
The truth about P value Currently effectiveness is measured by P value (even by US
FDA) < P means statistically significant, NOT clinical significant P is affected BOTH by effectiveness AND sample size P can be < 0.05 even though the effectiveness is marginal
when sample size is huge Compare Ps between studies only appropriate if the sampling
& sample size is the same Therefore, be careful when interpreting P value
Bas
ic B
iost
atis
tics
May
1, 2
023
32
P < 0.05 Why 5%? Cut-off point proposed by Sir Ronald A. Fisher
1925 to reject or not to reject a hypothesis If P < 0.05 = Probability to make Type I error is
less than 5% If P > 0.05, > 5% of the difference occurred by
chance & not due to the TRUE difference
Bas
ic B
iost
atis
tics
May
1, 2
023
33
Hypothesis Testing using bivariable analysis Try to prove that Exposure causes the Disease
e.g. Smoking causing Lung Cancer Ho: No difference of risk to get Lung Cancer
between smoker and non-smoker
Bas
ic B
iost
atis
tics
May
1, 2
023
Bas
ic B
iost
atis
tics
34
Lung Cancer No Lung Cancer
Smoking 20 (18.2%) 90 (81.8%)
Not Smoking 5 (4.5%) 105 (95.5%)
The occurrence of lung cancer is significantly higher (18.2%) among smokers compared to non-smokers (4.5%) (2 (df=1)= 10.15, P =0.001, OR = 4.7 (CI95% 1.7 – 13.0))
May
1, 2
023
Bas
ic B
iost
atis
tics
36
Confidence Interval Range of plausible values2 Narrow interval high precision
Wide interval poor precision How narrow is narrow?
And how wide is wide? Base of your clinical judgment
May
1, 2
023
37
Interpret Confidence Interval Compare with the null value
i.e. can be 0 for % or 1 for risk Compare with practical significance or the clinical
significance/indifference
Null
Null
A
B
Source: http://www.childrens-mercy.org/stats/journal/confidence.asp
Null
Null
C
D
Bas
ic B
iost
atis
tics
May
1, 2
023
38
Effect size The measure of effect irrespective of sample size Cohen (1988) classify ES into
Low (<0.3)Medium (0.3-0.7)Large (> 0.7)
Manual calculation or web based calculation
Bas
ic B
iost
atis
tics
May
1, 2
023
39
Statistical Test Univariate ~ One dependent & one independent Multivariate ~ Multiple dependent & multiple
independent variable
Bas
ic B
iost
atis
tics
May
1, 2
023
40
What test to use?Variable 1 Variable 2 Test
Categorical Categorical Chi-square
Categorical (2 pop) Numerical (Normal) Independent sample t-test
Categorical (2 pop) Numerical (Not Normal) Mann-Whitney U test
Categorical (> 2 pop) Numerical (Normal) One-way ANOVA
Categorical (> 2 pop) Numerical (Not Normal) Kruskal-Wallis test
Numerical (Normal) Numerical (Normal) Pearson Correlation Coefficient Test
Numerical (Normal/ Not Normal)
Numerical (Not Normal) Spearman Correlation Coefficient Test
Numerical (Normal) Numerical (Normal) – Paired
Paired t-test
Numerical (Not Normal) Numerical (Not Normal) – Paired
Wilcoxon Signed Rank Test
Bas
ic B
iost
atis
tics
May
1, 2
023
41
Univariate Analyses Compare means
Independent sample t-test (Unpaired t-test) ~ Two unrelated means
Paired t-test ~ Two related meansOne-way ANOVA ~ More than 2 means
2 Test ~ Between categorical variables Non-parametric tests (Kruskall-Wallis, Man-Whitney
U tests) ~ If data is not normally distributed
Bas
ic B
iost
atis
tics
42
Writing plan for statistical analysis #1
May 1, 2023Basic Biostatistics
Data were analyzed using the complex sample function of SPSS (version 13.0). Sampling errors were estimated using the primary sampling units and strata provided in the data set. Sampling weights were used to adjust for nonresponse bias and the oversampling of blacks, Mexican Americans, and the elderly in NHANES. The prevalence of hypertension, as well as the awareness, treatment, and control rates, were age adjusted by direct standardization to the US 2000 standard population.10 To analyze differences over time, the 2003–2004 data were compared with the 1999–2000 data. Estimates with a coefficient of variation >0.3 were considered unreliable. A 2-tailed P value <0.05 was considered statistically significant. (Ong et al. 2009)
43
Writing plan for statistical analysis #2
May 1, 2023Basic Biostatistics
To assess the effect of the selection process on the characteristics of the cases, we compared cases included in the final analysis to the rest of the cases. Since controls included in the present analysis were different from the rest of the diabetes free participants by design, no similar comparisons were performed for that group. To compare baseline characteristics of cases and controls appropriate univariate statistics were used. Similar binary logistic and multiple linear regression models were built with incident diabetes or HbA1c as respective outcomes and additive block entry of adiponectin and potential confounders. For linear regression CRP and triglycerides were log transformed. Since HbA1c could be modified by drug treatment, we ran a sensitivity analysis excluding all participants on antidiabetic medication. A p-value of <0.05 was considered significant. Analyses were performed with SPSS 14.0 for Windows.
May
1, 2
023
Bas
ic B
iost
atis
tics
44
Reporting analysis (example)
May
1, 2
023
45
Reporting analysis (example)
May
1, 2
023
Bas
ic B
iost
atis
tics
46
Reporting analysis (example)
May
1, 2
023
47
Summary1. Identify & define variables2. Type – independent vs. dependent3. Level of measurements – nominal, ordinal or
continuous4. Check distribution – Normal vs. Not Normal5. Decide what to do - descriptive vs. analytical
Bas
ic B
iost
atis
tics
Chapter 2Introduction to SPSSIBM SPSS Statistics v21 for WindowsJamalludin Ab Rahman MD MPHDepartment of Community MedicineKulliyyah of Medicine
May
1, 2
023
Bas
ic B
iost
atis
tics
49
IBM SPSS Statistics
IBM Corporation Software Group Route 100 Somers, NY 10589Produced in the United States of AmericaMay 2012
May
1, 2
023
Basic Biostatistics 50
SPSS Layout
May
1, 2
023
Bas
ic B
iost
atis
tics
51
LayoutMain menu
Toolbar
Variables
May
1, 2
023
Bas
ic B
iost
atis
tics
52
Data editor
Rows = variablesRows = each data
Define & describe your variables here
Enter your data here
May
1, 2
023
Bas
ic B
iost
atis
tics
53
Viewer
The output of analyses will be displayed here.
Output is separated from
data
May
1, 2
023
Bas
ic B
iost
atis
tics
54
Syntax
We can compile all the steps of the analyses here. Extend the programming
function in SPSS. Ability to perform complex steps e.g.
“looping”
May
1, 2
023
Basic Biostatistics 55
Creating Dataset
May
1, 2
023
Bas
ic B
iost
atis
tics
56
Before even you start SPSS! You must identify & define relevant variables Define means
1. Name – preferably short single name, begins with alphabet, no special character, no space
2. Type of data – e.g. Numeric, Date, String3. Width & Decimal Places (if numeric)4. Label – description for the Name (will be displayed in Viewer)
5. Values – labels for value e.g. 1=Male, 2=Female6. Missing – define missing value e.g. 999 for N/A
May
1, 2
023
Bas
ic B
iost
atis
tics
57
Define your variables
May
1, 2
023
Bas
ic B
iost
atis
tics
58
Variable Types
May
1, 2
023
Bas
ic B
iost
atis
tics
59
Variable Type
Decide the suitable
variable type
For numeric, determine Width
& Decimal. Decimal < Width
For String, no option for
Decimal Place
New option for Numerics with leading zeros
May
1, 2
023
Bas
ic B
iost
atis
tics
60
Value Labels
May
1, 2
023
Bas
ic B
iost
atis
tics
61
Missing ValueThis question is
Not Applicable to male
e.g. Assign 999 to represent N/A value
& this won’t be included in any
analysis
Chapter 3Descriptive StatisticsIBM SPSS Statistics v21 for WindowsJamalludin Ab Rahman MD MPHDepartment of Community MedicineKulliyyah of Medicine
May
1, 2
023
Bas
ic B
iost
atis
tics
63
Exercise data Hypothetical Study to describe factors related to HbA1c &
Homocystein (HCY) N=301 13 variables
May
1, 2
023
Bas
ic B
iost
atis
tics
64
Retrieve file information
May
1, 2
023
Bas
ic B
iost
atis
tics
65
healthstatus001
May
1, 2
023
Bas
ic B
iost
atis
tics
66
Task 11. Describe socio-demographic characteristics of
the respondent (age, gender & race)2. Describe the explanatory variables
1. Exercise2. smoking status3. BMI status &4. BP status
3. Describe HbA1c (taking cut-off for Poor HbA1c ≥ 6.5%) & HCY
May
1, 2
023
Basic Biostatistics 67
Describe numerical data
May
1, 2
023
Bas
ic B
iost
atis
tics
68
Explore
May
1, 2
023
Bas
ic B
iost
atis
tics
69
May
1, 2
023
Bas
ic B
iost
atis
tics
70
Results Check for Normality.Is Age data distributed Normally?
May
1, 2
023
Bas
ic B
iost
atis
tics
71
Is this Normaldistribution?
May
1, 2
023
Bas
ic B
iost
atis
tics
72
Describe ageNormal The subjects distributed between 23-67 years old
with the average of 34 (SD=8) years.If not Normal The subjects distributed between 23-67 years old
with the median of 33 (IQR=11) years
May
1, 2
023
Basic Biostatistics 73
Describe categorical data
May
1, 2
023
Bas
ic B
iost
atis
tics
74
Frequency
May
1, 2
023
Bas
ic B
iost
atis
tics
75
May
1, 2
023
76
Results
Basic Biostatistics
May
1, 2
023
Basic Biostatistics 77
TRANSFORM
May
1, 2
023
Bas
ic B
iost
atis
tics
78
Compute
May
1, 2
023
Bas
ic B
iost
atis
tics
79
weight / ((height / 100) ** 2)
May
1, 2
023
Bas
ic B
iost
atis
tics
80
Visual binning
May
1, 2
023
Bas
ic B
iost
atis
tics
81
Normal < 23Overweight 23 to < 27.5Obese >= 27.5
May
1, 2
023
Bas
ic B
iost
atis
tics
82
Chapter 4Bivariable analysesIBM SPSS Statistics v21 for WindowsJamalludin Ab Rahman MD MPHDepartment of Community MedicineKulliyyah of Medicine
May
1, 2
023
Bas
ic B
iost
atis
tics
84
To check association of two variables?
HbA1cAge
May
1, 2
023
Bas
ic B
iost
atis
tics
85
The steps1. Determine which is dependant & which is
independent2. Determine level of measurements3. Determine Normality of the numerical
measurement4. Determine the suitable statistical test
May
1, 2
023
Bas
ic B
iost
atis
tics
86
What are the tests?Variable 1 Variable 2 Test
Categorical Categorical Chi-square
Categorical (2 pop) Numerical (Normal) Independent sample t-test
Categorical (2 pop) Numerical (Not Normal) Mann-Whitney U test
Categorical (> 2 pop) Numerical (Normal) One-way ANOVA
Categorical (> 2 pop) Numerical (Not Normal) Kruskal-Wallis test
Numerical (Normal) Numerical (Normal) Pearson Correlation Coefficient Test
Numerical (Normal/ Not Normal)
Numerical (Not Normal) Spearman Correlation Coefficient Test
Numerical (Normal) Numerical (Normal) – Paired Paired t-test
Numerical (Not Normal) Numerical (Not Normal) – Paired
Friedman test
May
1, 2
023
Bas
ic B
iost
atis
tics
87
Tasks 21. Determine association between socio-demographic
characteristics & all the risk factors with HbA1c2. Determine association between socio-demographic
characteristics & all the risk factors with HCY
Note: It would be good if you could construct dummy table for the answers even before the analyses started
May
1, 2
023
Bas
ic B
iost
atis
tics
88
HCY normal range
May
1, 2
023
Basic Biostatistics 89
Independent sample t-TestComparing Two Means
May
1, 2
023
Bas
ic B
iost
atis
tics
90
Age vs. BP
May
1, 2
023
Bas
ic B
iost
atis
tics
91
May
1, 2
023
Bas
ic B
iost
atis
tics
92
Results
Levene’s test check equality between variances. Ho: There is no difference of variances. So if P is significant, we reject Ho, and therefore equal variances assumed.
The original t-test (Student’s t-test) assumes equal variances for equal sample sizes. However if the variances are equal, it is robust for different sizes.
Welch's correction
May
1, 2
023
Bas
ic B
iost
atis
tics
93
Table – Distribution of age by blood pressure status
N Mean SD Statistics df P
Normal BP 156 33.9 7.9 t=0.431 299 0.667
High BP 145 34.4 8.9
May
1, 2
023
Basic Biostatistics 94
Chi-squared TestDifference of Two Proportions
May
1, 2
023
Bas
ic B
iost
atis
tics
95
Gender vs. BP
May
1, 2
023
Bas
ic B
iost
atis
tics
96
May
1, 2
023
Bas
ic B
iost
atis
tics
97
Results
Describe this table first. What is your impression? 49% women vs. 47% men with high BP
Some books may suggest the use of Continuity Correction at ALL time, but recent simulations showed that CC (or Yate’s correction) is OVERCONSERVATIVE. Hence, use Pearson 2 when < 20% of cells have expected count < 5
When 20% of cells have EC < 5, use Fisher’s Exact Test
This is given because we code the variables using numbers. Can be used to measure P-trend
May
1, 2
023
Basic Biostatistics 98
One-way ANOVAComparing More than Two Means
May
1, 2
023
Bas
ic B
iost
atis
tics
99
Race vs. HbA1c
May
1, 2
023
Bas
ic B
iost
atis
tics
100
May
1, 2
023
Bas
ic B
iost
atis
tics
101
ResultsDescribe these results first. What is your impression? HbA1c between races? 6.4 (SD 2.1) vs. 6.7 (SD 2.2) vs. 6.5 (SD 2.2)
The F test shows that there is no single significant difference between any two groups
May
1, 2
023
Bas
ic B
iost
atis
tics
102
Results – BMI status vs. HbA1c
The F test shows that at least there is ONE pair with significant different. Either N vs. OW, N vs. OB or OW vs. OB
We need to run Post-hoc test to determine which of the PAIR is significant.
To decide which Post-hoc test to choose, we have to test for equality of variances i.e. Homogeneity of variances (Levene’s test)
May
1, 2
023
Bas
ic B
iost
atis
tics
103
May
1, 2
023
Bas
ic B
iost
atis
tics
104
Results – Post hoc
The significant difference is only for Normal vs. Obese (P=0.002)
May
1, 2
023
Bas
ic B
iost
atis
tics
105
ReportThere is a significant association between BMI Status and HbA1c (F(2,298)=13.129, P<0.001). Post-hoc test showed that Obese subjects have significantly higher HbA1c compared to Normal and Overweight subjects (P=0.001 and P < 0.001 respectively).
May
1, 2
023
Basic Biostatistics 106
Mann-whitney UNon Parametric Tests
May
1, 2
023
Bas
ic B
iost
atis
tics
107
Gender vs. HCY
May
1, 2
023
Bas
ic B
iost
atis
tics
108
ResultsThis ranks table is not to be cited in the research paper. Instead, describe their MEDIAN
May
1, 2
023
Basic Biostatistics 109
KRUSKALL WALLISNon-Parametric TESTS
May
1, 2
023
Bas
ic B
iost
atis
tics
110
Race vs. HCY
May
1, 2
023
Bas
ic B
iost
atis
tics
111
Results
May
1, 2
023
Basic Biostatistics 112
Correlation TestRELATIONSHIP OF TWO NUMERICAL DATA
May
1, 2
023
Bas
ic B
iost
atis
tics
113
Age vs. HbA1c
May
1, 2
023
Bas
ic B
iost
atis
tics
114
Results
May
1, 2
023
Bas
ic B
iost
atis
tics
115
Age vs. HCY
May
1, 2
023
Bas
ic B
iost
atis
tics
116