Basic Biostatistics

Preview:

Citation preview

Basic BiostatisticsAssoc. Prof. Dr. Jamalludin Ab Rahman MD MPH

Department of Community Medicine

May

1, 2

023

2

Learning outcomesAt the end of this workshop, you should be able to1. Describe about population & sample, causation,

level of measurement, distribution of data2. Summarise categorical & numerical data3. Use appropriate statistical test for bi-variable

analyses using SPSS

Bas

ic B

iost

atis

tics

May

1, 2

023

Bas

ic B

iost

atis

tics

3

We observe, we believe.What we (can) observe might

not be the truth

May

1, 2

023

4

Population vs. Sample

Parameter

Statistics

Bas

ic B

iost

atis

tics

May

1, 2

023

5

Parameter vs. Statistics Parameter

characteristic of the whole population Statistics

characteristic of a sample, presumably measurable.

Bas

ic B

iost

atis

tics

May

1, 2

023

6

Statistics estimate parameters But how ‘representative’ a sample to the population When sample statistics ≠ population statistics =

sampling error A different sample statistics from the same population

may yield different estimates If sampling is repeated many times, the statistics will

approach the actual value of the population parameters

Bas

ic B

iost

atis

tics

May

1, 2

023

7

N = 35Red = 15/35 = 42.8%

N = 6Red = 3/6 = 50%

N = 6Red = 2/6 = 33.3%

N = 6Red = 4/6 = 66.7%

Average % Red = = 50%

50% 42.8%Parameter

Statistics

Statistics Parameter

Bas

ic B

iost

atis

tics

May

1, 2

023

8

Variable & its role A value and whose associated value may be

changed

DependentIndependent

Bas

ic B

iost

atis

tics

May

1, 2

023

9

Causation Relation of events (cause and effect) But correlation (between two events) does not

(always) imply causation Rooster's crow does not cause the sun to rise Switch does not cause the bulb to light

Bas

ic B

iost

atis

tics

May

1, 2

023

10

Time & causation

Time

Exposure Exposure

Exposure

Outcome

Bas

ic B

iost

atis

tics

May

1, 2

023

11

Time & causation (example)

Time

AgeExposure to

silica

Smoking

Lung Cancer

Bas

ic B

iost

atis

tics

May

1, 2

023

12

Causal Model

OutcomeExposure Exposure Exposure

Exposure

Bas

ic B

iost

atis

tics

May

1, 2

023

13

Study design & causality Not all design can prove causality Cross sectional can’t determine causality Must have temporality – exposure precedes

outcome

Bas

ic B

iost

atis

tics

14

May

1, 2

023

Bas

ic B

iost

atis

tics

Type of data(Level of measurement)

Categorical

Nominal Ordinal

Numerical

Discrete Continuous

e.g. Gender, Race e.g. Cancer staging, Severity of CXR for PTB

e.g. Parity, Gravida

e.g. Hb, RBS, cholesterol.

May

1, 2

023

15

Distribution (shape) of data Applicable to numerical value Discrete or Continuous Discrete ~ Binomial, Poisson, Negative Binomial,

Hypergeometry, Multinomial etc. Continuous ~ Normal, t, chi-square, F etc.

Bas

ic B

iost

atis

tics

May

1, 2

023

16

Normal Distribution

Bas

ic B

iost

atis

tics

𝑓 (𝑥 ;𝜇 ,𝜎2 )= 1𝜎√2𝜋

𝑒− 1

2( 𝑥−𝜇𝜎

)2

¿

May

1, 2

023

17

Characteristics Smooth, symmetrical (around the mean, not

skewed, appropriate height), uni-modal, bell shaped curve

Mean = Median = Mode The total area under the curve

(AUC) = 1 Asymptotic to the x-axis – never touch x-axis

Bas

ic B

iost

atis

tics

May

1, 2

023

18

Test of Normality Anderson–darling Test Corrected Kolmogorov–Smirnov Test (Lilliefors Test) Cramér–von-mises Criterion D'agostino's K-squared Test Jarque–bera Test Pearson's Chi-square Test Shapiro–francia Shapiro–wilk Test

Bas

ic B

iost

atis

tics

May

1, 2

023

19

Use Normality test with caution Small samples almost always pass a normality test.

Normality tests have little power to tell whether or not a small sample of data comes from a Gaussian distribution.

With large samples, minor deviations from normality may be flagged as statistically significant, even though small deviations from a normal distribution won’t affect the results of a t test or ANOVA.

Bas

ic B

iost

atis

tics

May

1, 2

023

Bas

ic B

iost

atis

tics

20

Why run statistical test?1. Determine presence of difference (or similarity)2. Determine degree of difference3. Determine the direction of changes (outcome)4. Predict changes (outcomes)

May

1, 2

023

Bas

ic B

iost

atis

tics

21

A B C

Is there any difference between A & B?

How big is the difference between A & B?

Which one is taller? A or B?

Is C different from A & B?

Is there any pattern now?

If there is D, can you predict how tall is D?

May

1, 2

023

Bas

ic B

iost

atis

tics

22

Statistical analysis

AnalyticalDescriptive

e.g. Describe socio-demographic characteristics - Age, Sex, Race etc.

e.g. Prevalence of hypertension.

e.g. Compare demographic characteristics between two population – Compare age between male & female

e.g. Distribution of gender by hypertension status

e.g. How demographic characteristics (more than one factors) explain hypertension

Univariable Bivariable MultivariableIV DV IV DV IV DV IV

May

1, 2

023

23

Descriptive Statistics Explain one variable at one time Method based on type of measure

CategoricalFrequency (Percentage)

NumericalCentral measures (e.g. mean, median) & Dispersion (e.g. variance, standard deviation, range, min-max, interquartile range)

Bas

ic B

iost

atis

tics

May

1, 2

023

24

Data

CategoricalFrequency (count) &

Percentage

Numerical

Normal Mean (SD)

Not Normal Median (Range/IQR)

How to describe a data

Bas

ic B

iost

atis

tics

May

1, 2

023

25

DifferenceA

B C

Which of the following shows true difference between two populations?

I II

IIII II

Bas

ic B

iost

atis

tics

May

1, 2

023

26

P in hypothesis testing The mean glucose from 9 rabbits decreased from 7.3

mmol/L to 7.0 (0.9) mmol/L within 24 hours. Using t-test, taking critical value of P=0.05, is the difference significant?

Ho: No difference i.e. 7.3= 7.0 z = (7.0-7.3)/(0.9/9) = -1 Based on t-table, z at -1, df=9, two-tailed, P is between -

0.4 & -0.3 Therefore, P > 0.05

Bas

ic B

iost

atis

tics

May

1, 2

023

27

One-tailed vs. two-tailed Is there a difference between Hb 14 g% vs. Hb 12

g% in male & female respectively?Ho: HbM – HbF = 0H1: HbM = HbFH2: HbM > HbFH3: HbM < HbM

Note: Should be determined a priori

Bas

ic B

iost

atis

tics

May

1, 2

023

28

One-tailed vs. two-tailed

Two-sided Right-sidedLeft-sided

Bas

ic B

iost

atis

tics

May

1, 2

023

29

Hypothesis testingTruth

Ho True Ho False

Test

Do notreject Ho Correct Type II error

()

Reject Ho

Type I error () Correct

P-value is the probability to make Type I error (based on  frequentist inference)

Bas

ic B

iost

atis

tics

May

1, 2

023

30

P & Sample Size

Bas

ic B

iost

atis

tics

May

1, 2

023

31

The truth about P value Currently effectiveness is measured by P value (even by US

FDA) < P means statistically significant, NOT clinical significant P is affected BOTH by effectiveness AND sample size P can be < 0.05 even though the effectiveness is marginal

when sample size is huge Compare Ps between studies only appropriate if the sampling

& sample size is the same Therefore, be careful when interpreting P value

Bas

ic B

iost

atis

tics

May

1, 2

023

32

P < 0.05 Why 5%? Cut-off point proposed by Sir Ronald A. Fisher

1925 to reject or not to reject a hypothesis If P < 0.05 = Probability to make Type I error is

less than 5% If P > 0.05, > 5% of the difference occurred by

chance & not due to the TRUE difference

Bas

ic B

iost

atis

tics

May

1, 2

023

33

Hypothesis Testing using bivariable analysis Try to prove that Exposure causes the Disease

e.g. Smoking causing Lung Cancer Ho: No difference of risk to get Lung Cancer

between smoker and non-smoker

Bas

ic B

iost

atis

tics

May

1, 2

023

Bas

ic B

iost

atis

tics

34

Lung Cancer No Lung Cancer

Smoking 20 (18.2%) 90 (81.8%)

Not Smoking 5 (4.5%) 105 (95.5%)

The occurrence of lung cancer is significantly higher (18.2%) among smokers compared to non-smokers (4.5%) (2 (df=1)= 10.15, P =0.001, OR = 4.7 (CI95% 1.7 – 13.0))

May

1, 2

023

Bas

ic B

iost

atis

tics

36

Confidence Interval Range of plausible values2 Narrow interval high precision

Wide interval poor precision How narrow is narrow?

And how wide is wide? Base of your clinical judgment

May

1, 2

023

37

Interpret Confidence Interval Compare with the null value

i.e. can be 0 for % or 1 for risk Compare with practical significance or the clinical

significance/indifference

Null

Null

A

B

Source: http://www.childrens-mercy.org/stats/journal/confidence.asp

Null

Null

C

D

Bas

ic B

iost

atis

tics

May

1, 2

023

38

Effect size The measure of effect irrespective of sample size Cohen (1988) classify ES into

Low (<0.3)Medium (0.3-0.7)Large (> 0.7)

Manual calculation or web based calculation

Bas

ic B

iost

atis

tics

May

1, 2

023

39

Statistical Test Univariate ~ One dependent & one independent Multivariate ~ Multiple dependent & multiple

independent variable

Bas

ic B

iost

atis

tics

May

1, 2

023

40

What test to use?Variable 1 Variable 2 Test

Categorical Categorical Chi-square

Categorical (2 pop) Numerical (Normal) Independent sample t-test

Categorical (2 pop) Numerical (Not Normal) Mann-Whitney U test

Categorical (> 2 pop) Numerical (Normal) One-way ANOVA

Categorical (> 2 pop) Numerical (Not Normal) Kruskal-Wallis test

Numerical (Normal) Numerical (Normal) Pearson Correlation Coefficient Test

Numerical (Normal/ Not Normal)

Numerical (Not Normal) Spearman Correlation Coefficient Test

Numerical (Normal) Numerical (Normal) – Paired

Paired t-test

Numerical (Not Normal) Numerical (Not Normal) – Paired

Wilcoxon Signed Rank Test

Bas

ic B

iost

atis

tics

May

1, 2

023

41

Univariate Analyses Compare means

Independent sample t-test (Unpaired t-test) ~ Two unrelated means

Paired t-test ~ Two related meansOne-way ANOVA ~ More than 2 means

2 Test ~ Between categorical variables Non-parametric tests (Kruskall-Wallis, Man-Whitney

U tests) ~ If data is not normally distributed

Bas

ic B

iost

atis

tics

42

Writing plan for statistical analysis #1

May 1, 2023Basic Biostatistics

Data were analyzed using the complex sample function of SPSS (version 13.0). Sampling errors were estimated using the primary sampling units and strata provided in the data set. Sampling weights were used to adjust for nonresponse bias and the oversampling of blacks, Mexican Americans, and the elderly in NHANES. The prevalence of hypertension, as well as the awareness, treatment, and control rates, were age adjusted by direct standardization to the US 2000 standard population.10 To analyze differences over time, the 2003–2004 data were compared with the 1999–2000 data. Estimates with a coefficient of variation >0.3 were considered unreliable. A 2-tailed P value <0.05 was considered statistically significant. (Ong et al. 2009)

43

Writing plan for statistical analysis #2

May 1, 2023Basic Biostatistics

To assess the effect of the selection process on the characteristics of the cases, we compared cases included in the final analysis to the rest of the cases. Since controls included in the present analysis were different from the rest of the diabetes free participants by design, no similar comparisons were performed for that group. To compare baseline characteristics of cases and controls appropriate univariate statistics were used. Similar binary logistic and multiple linear regression models were built with incident diabetes or HbA1c as respective outcomes and additive block entry of adiponectin and potential confounders. For linear regression CRP and triglycerides were log transformed. Since HbA1c could be modified by drug treatment, we ran a sensitivity analysis excluding all participants on antidiabetic medication. A p-value of <0.05 was considered significant. Analyses were performed with SPSS 14.0 for Windows.

May

1, 2

023

Bas

ic B

iost

atis

tics

44

Reporting analysis (example)

May

1, 2

023

45

Reporting analysis (example)

May

1, 2

023

Bas

ic B

iost

atis

tics

46

Reporting analysis (example)

May

1, 2

023

47

Summary1. Identify & define variables2. Type – independent vs. dependent3. Level of measurements – nominal, ordinal or

continuous4. Check distribution – Normal vs. Not Normal5. Decide what to do - descriptive vs. analytical

Bas

ic B

iost

atis

tics

Chapter 2Introduction to SPSSIBM SPSS Statistics v21 for WindowsJamalludin Ab Rahman MD MPHDepartment of Community MedicineKulliyyah of Medicine

May

1, 2

023

Bas

ic B

iost

atis

tics

49

IBM SPSS Statistics

IBM Corporation Software Group Route 100 Somers, NY 10589Produced in the United States of AmericaMay 2012

May

1, 2

023

Basic Biostatistics 50

SPSS Layout

May

1, 2

023

Bas

ic B

iost

atis

tics

51

LayoutMain menu

Toolbar

Variables

May

1, 2

023

Bas

ic B

iost

atis

tics

52

Data editor

Rows = variablesRows = each data

Define & describe your variables here

Enter your data here

May

1, 2

023

Bas

ic B

iost

atis

tics

53

Viewer

The output of analyses will be displayed here.

Output is separated from

data

May

1, 2

023

Bas

ic B

iost

atis

tics

54

Syntax

We can compile all the steps of the analyses here. Extend the programming

function in SPSS. Ability to perform complex steps e.g.

“looping”

May

1, 2

023

Basic Biostatistics 55

Creating Dataset

May

1, 2

023

Bas

ic B

iost

atis

tics

56

Before even you start SPSS! You must identify & define relevant variables Define means

1. Name – preferably short single name, begins with alphabet, no special character, no space

2. Type of data – e.g. Numeric, Date, String3. Width & Decimal Places (if numeric)4. Label – description for the Name (will be displayed in Viewer)

5. Values – labels for value e.g. 1=Male, 2=Female6. Missing – define missing value e.g. 999 for N/A

May

1, 2

023

Bas

ic B

iost

atis

tics

57

Define your variables

May

1, 2

023

Bas

ic B

iost

atis

tics

58

Variable Types

May

1, 2

023

Bas

ic B

iost

atis

tics

59

Variable Type

Decide the suitable

variable type

For numeric, determine Width

& Decimal. Decimal < Width

For String, no option for

Decimal Place

New option for Numerics with leading zeros

May

1, 2

023

Bas

ic B

iost

atis

tics

60

Value Labels

May

1, 2

023

Bas

ic B

iost

atis

tics

61

Missing ValueThis question is

Not Applicable to male

e.g. Assign 999 to represent N/A value

& this won’t be included in any

analysis

Chapter 3Descriptive StatisticsIBM SPSS Statistics v21 for WindowsJamalludin Ab Rahman MD MPHDepartment of Community MedicineKulliyyah of Medicine

May

1, 2

023

Bas

ic B

iost

atis

tics

63

Exercise data Hypothetical Study to describe factors related to HbA1c &

Homocystein (HCY) N=301 13 variables

May

1, 2

023

Bas

ic B

iost

atis

tics

64

Retrieve file information

May

1, 2

023

Bas

ic B

iost

atis

tics

65

healthstatus001

May

1, 2

023

Bas

ic B

iost

atis

tics

66

Task 11. Describe socio-demographic characteristics of

the respondent (age, gender & race)2. Describe the explanatory variables

1. Exercise2. smoking status3. BMI status &4. BP status

3. Describe HbA1c (taking cut-off for Poor HbA1c ≥ 6.5%) & HCY

May

1, 2

023

Basic Biostatistics 67

Describe numerical data

May

1, 2

023

Bas

ic B

iost

atis

tics

68

Explore

May

1, 2

023

Bas

ic B

iost

atis

tics

69

May

1, 2

023

Bas

ic B

iost

atis

tics

70

Results Check for Normality.Is Age data distributed Normally?

May

1, 2

023

Bas

ic B

iost

atis

tics

71

Is this Normaldistribution?

May

1, 2

023

Bas

ic B

iost

atis

tics

72

Describe ageNormal The subjects distributed between 23-67 years old

with the average of 34 (SD=8) years.If not Normal The subjects distributed between 23-67 years old

with the median of 33 (IQR=11) years

May

1, 2

023

Basic Biostatistics 73

Describe categorical data

May

1, 2

023

Bas

ic B

iost

atis

tics

74

Frequency

May

1, 2

023

Bas

ic B

iost

atis

tics

75

May

1, 2

023

76

Results

Basic Biostatistics

May

1, 2

023

Basic Biostatistics 77

TRANSFORM

May

1, 2

023

Bas

ic B

iost

atis

tics

78

Compute

May

1, 2

023

Bas

ic B

iost

atis

tics

79

weight / ((height / 100) ** 2)

May

1, 2

023

Bas

ic B

iost

atis

tics

80

Visual binning

May

1, 2

023

Bas

ic B

iost

atis

tics

81

Normal < 23Overweight 23 to < 27.5Obese >= 27.5

May

1, 2

023

Bas

ic B

iost

atis

tics

82

Chapter 4Bivariable analysesIBM SPSS Statistics v21 for WindowsJamalludin Ab Rahman MD MPHDepartment of Community MedicineKulliyyah of Medicine

May

1, 2

023

Bas

ic B

iost

atis

tics

84

To check association of two variables?

HbA1cAge

May

1, 2

023

Bas

ic B

iost

atis

tics

85

The steps1. Determine which is dependant & which is

independent2. Determine level of measurements3. Determine Normality of the numerical

measurement4. Determine the suitable statistical test

May

1, 2

023

Bas

ic B

iost

atis

tics

86

What are the tests?Variable 1 Variable 2 Test

Categorical Categorical Chi-square

Categorical (2 pop) Numerical (Normal) Independent sample t-test

Categorical (2 pop) Numerical (Not Normal) Mann-Whitney U test

Categorical (> 2 pop) Numerical (Normal) One-way ANOVA

Categorical (> 2 pop) Numerical (Not Normal) Kruskal-Wallis test

Numerical (Normal) Numerical (Normal) Pearson Correlation Coefficient Test

Numerical (Normal/ Not Normal)

Numerical (Not Normal) Spearman Correlation Coefficient Test

Numerical (Normal) Numerical (Normal) – Paired Paired t-test

Numerical (Not Normal) Numerical (Not Normal) – Paired

Friedman test

May

1, 2

023

Bas

ic B

iost

atis

tics

87

Tasks 21. Determine association between socio-demographic

characteristics & all the risk factors with HbA1c2. Determine association between socio-demographic

characteristics & all the risk factors with HCY

Note: It would be good if you could construct dummy table for the answers even before the analyses started

May

1, 2

023

Bas

ic B

iost

atis

tics

88

HCY normal range

May

1, 2

023

Basic Biostatistics 89

Independent sample t-TestComparing Two Means

May

1, 2

023

Bas

ic B

iost

atis

tics

90

Age vs. BP

May

1, 2

023

Bas

ic B

iost

atis

tics

91

May

1, 2

023

Bas

ic B

iost

atis

tics

92

Results

Levene’s test check equality between variances. Ho: There is no difference of variances. So if P is significant, we reject Ho, and therefore equal variances assumed.

The original t-test (Student’s t-test) assumes equal variances for equal sample sizes. However if the variances are equal, it is robust for different sizes.

Welch's correction

May

1, 2

023

Bas

ic B

iost

atis

tics

93

Table – Distribution of age by blood pressure status

  N Mean SD Statistics df P

Normal BP 156 33.9 7.9 t=0.431 299 0.667

High BP 145 34.4 8.9      

May

1, 2

023

Basic Biostatistics 94

Chi-squared TestDifference of Two Proportions

May

1, 2

023

Bas

ic B

iost

atis

tics

95

Gender vs. BP

May

1, 2

023

Bas

ic B

iost

atis

tics

96

May

1, 2

023

Bas

ic B

iost

atis

tics

97

Results

Describe this table first. What is your impression? 49% women vs. 47% men with high BP

Some books may suggest the use of Continuity Correction at ALL time, but recent simulations showed that CC (or Yate’s correction) is OVERCONSERVATIVE. Hence, use Pearson 2 when < 20% of cells have expected count < 5

When 20% of cells have EC < 5, use Fisher’s Exact Test

This is given because we code the variables using numbers. Can be used to measure P-trend

May

1, 2

023

Basic Biostatistics 98

One-way ANOVAComparing More than Two Means

May

1, 2

023

Bas

ic B

iost

atis

tics

99

Race vs. HbA1c

May

1, 2

023

Bas

ic B

iost

atis

tics

100

May

1, 2

023

Bas

ic B

iost

atis

tics

101

ResultsDescribe these results first. What is your impression? HbA1c between races? 6.4 (SD 2.1) vs. 6.7 (SD 2.2) vs. 6.5 (SD 2.2)

The F test shows that there is no single significant difference between any two groups

May

1, 2

023

Bas

ic B

iost

atis

tics

102

Results – BMI status vs. HbA1c

The F test shows that at least there is ONE pair with significant different. Either N vs. OW, N vs. OB or OW vs. OB

We need to run Post-hoc test to determine which of the PAIR is significant.

To decide which Post-hoc test to choose, we have to test for equality of variances i.e. Homogeneity of variances (Levene’s test)

May

1, 2

023

Bas

ic B

iost

atis

tics

103

May

1, 2

023

Bas

ic B

iost

atis

tics

104

Results – Post hoc

The significant difference is only for Normal vs. Obese (P=0.002)

May

1, 2

023

Bas

ic B

iost

atis

tics

105

ReportThere is a significant association between BMI Status and HbA1c (F(2,298)=13.129, P<0.001). Post-hoc test showed that Obese subjects have significantly higher HbA1c compared to Normal and Overweight subjects (P=0.001 and P < 0.001 respectively).

May

1, 2

023

Basic Biostatistics 106

Mann-whitney UNon Parametric Tests

May

1, 2

023

Bas

ic B

iost

atis

tics

107

Gender vs. HCY

May

1, 2

023

Bas

ic B

iost

atis

tics

108

ResultsThis ranks table is not to be cited in the research paper. Instead, describe their MEDIAN

May

1, 2

023

Basic Biostatistics 109

KRUSKALL WALLISNon-Parametric TESTS

May

1, 2

023

Bas

ic B

iost

atis

tics

110

Race vs. HCY

May

1, 2

023

Bas

ic B

iost

atis

tics

111

Results

May

1, 2

023

Basic Biostatistics 112

Correlation TestRELATIONSHIP OF TWO NUMERICAL DATA

May

1, 2

023

Bas

ic B

iost

atis

tics

113

Age vs. HbA1c

May

1, 2

023

Bas

ic B

iost

atis

tics

114

Results

May

1, 2

023

Bas

ic B

iost

atis

tics

115

Age vs. HCY

May

1, 2

023

Bas

ic B

iost

atis

tics

116

Recommended