
Outline Statistics in medicine - Yale School of Medicine …tbl.med.yale.edu/files/variation_2/slides.pdf11/4/2016 1 S L I D E 0 Statistics in medicine Lecture 1- part 1: Describing


11/4/2016


S L I D E 0

Statistics in medicine

Lecture 1- part 1: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD

Assistant Professor

Chronic Disease Epidemiology Department

Yale School of Public Health

[email protected]

S L I D E 1

Outline

• Sources of variation

• Types of variables

S L I D E 2

Readings and resources

• Chapter 9, p105-118: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).

S L I D E 3

• Almost every characteristic that is measured on a patient varies

• THAT IS WHY IT IS CALLED A VARIABLE

• EXAMPLES

• Blood glucose level

• Blood pressure

• Diet

• Electrolytes

• etc.…


S L I D E 4

There are different sources of variation

Let us consider blood pressure as an example

• Biologic differences

– Age, race, and diet affect blood pressure

• Older patients, patients of African descent, and those who consume a high-salt diet tend to have higher blood pressure

• Measurement conditions

– Time of the day, anxiety, fatigue…etc.

• High blood pressure is observed following exercise, and with anxiety

S L I D E 5

There are different sources of variation

Let us consider blood pressure as an example

• Measurement error

– Systematic error

• Distorts the data in one direction, leading to bias that obscures the truth

• Ex. A defective BP cuff that tends to give high readings

– Random error

• Slight, inevitable inaccuracies

• Not systematic because it makes some readings too high, and some too low

Statistics can adjust for random error, but cannot fix systematic error

S L I D E 6

To understand variation, you have to describe it

• Descriptive statistics definition:

–Statistics, such as the mean, the standard deviation, the proportion, and the rate, used to describe attributes of a set of data

S L I D E 7

Variables can be quantitative or qualitative

• Qualitative

• Skin color

• Jaundice

• Heart murmurs

• Quantitative

–Blood pressure

–Electrolytes levels

http://clinicalgate.com/wp-content/uploads/2015/06/B9781437729306000483_f48-02-9781437729306.jpg


S L I D E 8

There are different types of variables

–Nominal

–Dichotomous (binary)

–Ordinal (ranked)

–Continuous (interval)

–Continuous (ratio)

–Risks and proportions

–Counts and units of observation

–Combining data

S L I D E 9

Nominal variables (qualitative)

• Nominal are “naming” variables

• Definition:

– The simplest scale of measurement. Used for characteristics that have no numerical values, no measurement scales and no rank order. It is also called a categorical or qualitative scale.

• Ex. Skin color

– A different number can be assigned to each color

• E.g. 1: purple, 2: black, 3: white, 4: blue, 5: tan

– It makes no difference to the statistical analysis which number is assigned to which color, because the number is merely a numerical name for a color

• Percentages and proportions are commonly used to summarize the data

S L I D E 10

Dichotomous variables (qualitative)

• Dichotomous (from the Greek for “cut into two”) variables have only two categories

• Ex.: Normal/abnormal skin color, living/dead

• Sometimes it is not enough to describe the data as two categories (living/dead); it is also important to know how long the patient survived → survival analysis

S L I D E 11

Ordinal “ranked” variables

• Definition:

– Used for characteristics that have an underlying order to their values; that have clearly implied direction from better to worse.

• Are categorical (qualitative) scales

• Three or more levels

• Although there is an order among the categories, the difference between two adjacent categories is not necessarily the same throughout the scale


S L I D E 12

Ordinal “ranked” variables

• Percentages and proportions are commonly used to summarize the data

• Medians are sometimes used to describe the whole data set

https://openclipart.org/detail/218053/pain-scale http://biology-forums.com/gallery/2137_18_05_12_2_25_00.jpeg

Ex. Pain scale: “0- no pain” - “10- worst imaginable pain”

Ex. Pitting edema grading scale: “0- no edema” - “4+- severe edema”

S L I D E 13

Numerical scales (quantitative)

• Definition:

– The highest level of measurement. It is used for characteristics that can be given numerical values; the differences between numbers have meaning (e.g., BMI, height).

• Types

• Interval

• Ratio

• Discrete

• Measures of central tendency are usually used to summarize the data: means, medians

S L I D E 14

Numerical scales (continuous)

• Has a value on a continuum

• Interval: arbitrary zero point

• Ex. Centigrade temperature scale

• Ratio: absolute zero point

• Ex. Kelvin temperature scale

https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwiuo6nf8sjOAhUEkh4KHXTZAnUQjRwIBw&url=http%3A%2F%2Fwww.livescience.com%2F39994-kelvin.html&psig=AFQjCNFGVvg1wdLx78W2V44wDlZQDQB17A&ust=1471538633651130

S L I D E 15

Numerical scales (Discrete)

• Has values equal to integers

• Units of observation: person, animal, thing, etc.…

• Presented in frequency tables

• One characteristic in the x-axis, one characteristic in the y-axis, and counts in the cells

Frequency table of gender by whether serum total cholesterol was checked or not

            Cholesterol level
Gender      Checked      Not checked    Total
Female      17 (63%)     10 (37%)       27 (100%)
Male        25 (57%)     19 (43%)       44 (100%)
Total       42 (59%)     29 (41%)       71 (100%)

Source: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).


S L I D E 16

Risks and proportions

• Risk is the conditional probability of an event (e.g. death) in a defined population in a defined period.

• Share some characteristics of discrete and some characteristics of continuous variables

• Ex. A discrete event (e.g., death) occurred in a fraction of population

• Calculated by the ratio of counts in the numerator to counts in denominator

S L I D E 17

Combining data

• A continuous variable can be converted to an ordinal variable

• When data is converted to categories individual information is lost

• The fewer the number of categories the greater is the amount of information lost

[Figure: Histogram of neonatal mortality rate per 1000 live births, by birth weight group, United States, 1980. x-axis: Birth weight (g); y-axis ticks from 0 to 120.]

Source: Buehler W et al. Public Health Rep 102:151-161, 1987

S L I D E 18

Statistics in medicine

Lecture 1- part 2: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD

Assistant Professor

Chronic Disease Epidemiology Department

Yale School of Public Health

[email protected]

S L I D E 19

Outline

• Frequency distributions

–Frequency distribution of continuous data

–Frequency distribution of binary data


S L I D E 20

Readings and resources

• Chapter 9, p105-118: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).

S L I D E 21

Frequency distribution is

S L I D E 22

Frequency distribution is

TABLE of data displaying the VALUE of each data point (or range of data points) in one column and the FREQUENCY with which that value occurs in the other column

PLOT of data displaying the VALUE of each data point (or range of data points) on one axis and the FREQUENCY with which that value occurs on the other axis

S L I D E 23

Frequency tables

• Definition

– A table showing the number and/or the percentage of observations occurring at different values (or ranges of values) of a variable.

• Steps of creating frequency table

– Decide on the number of non-overlapping intervals

• It is better to have equal width intervals

• Usually 6 to 14 intervals are adequate to demonstrate the shape of the distribution

• Creating intervals means: continuous variable converted to ordinal variable

– Information on individual level is lost

– Count the number of observations in each interval

• Percentages could be calculated as well

– Percentage = the number of observations in the interval divided by the total number of observations, multiplied by 100

• Presented graphically by histogram


S L I D E 24

Frequency tables

Categories of glucose level of 180 participants

Category    Count    %
<=70          14      7.78
71-100       104     57.78
101-125       26     14.44
>=126         36     20.00

Glucose level of 180 participants

Glucose level  Count    Glucose level  Count    Glucose level  Count
 52             1        88             2       140             4
 66             1        89             5       143             1
 67             1        90             8       145             5
 68             2        92             3       149             2
 69             2        95            11       150             4
 70             7        96             1       155             2
 71             1        98             1       158             1
 72             2       100            12       160             1
 73             1       103             4       165             4
 75            12       108             1       168             1
 76             2       110            11       170             1
 77             4       115             1       172             1
 78             6       120             6       220             1
 79             4       121             1
 80            11       122             1
 82             2       124             1
 83             2       130             3
 85             9       133             1
 86             4       135             3
 87             1       139             1
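The steps for building a frequency table can be checked against these two tables: binning the individual glucose values into the four categories should reproduce the category counts and percentages. A minimal Python sketch, assuming the value/count pairs as transcribed from the slide; the names `glucose_counts` and `category` are illustrative and not part of the lecture.

```python
# Glucose value -> number of participants, transcribed from the slide's table.
glucose_counts = {
    52: 1, 66: 1, 67: 1, 68: 2, 69: 2, 70: 7, 71: 1, 72: 2, 73: 1, 75: 12,
    76: 2, 77: 4, 78: 6, 79: 4, 80: 11, 82: 2, 83: 2, 85: 9, 86: 4, 87: 1,
    88: 2, 89: 5, 90: 8, 92: 3, 95: 11, 96: 1, 98: 1, 100: 12, 103: 4,
    108: 1, 110: 11, 115: 1, 120: 6, 121: 1, 122: 1, 124: 1, 130: 3,
    133: 1, 135: 3, 139: 1, 140: 4, 143: 1, 145: 5, 149: 2, 150: 4,
    155: 2, 158: 1, 160: 1, 165: 4, 168: 1, 170: 1, 172: 1, 220: 1,
}


def category(value):
    """Assign a glucose value to one of the four non-overlapping intervals."""
    if value <= 70:
        return "<=70"
    if value <= 100:
        return "71-100"
    if value <= 125:
        return "101-125"
    return ">=126"


# Count the observations in each interval.
table = {"<=70": 0, "71-100": 0, "101-125": 0, ">=126": 0}
for value, count in glucose_counts.items():
    table[category(value)] += count

# Percentage = count in the interval / total number of observations * 100.
total = sum(table.values())
percent = {name: round(100 * n / total, 2) for name, n in table.items()}

for name in table:
    print(f"{name:>8}  {table[name]:>4}  {percent[name]:6.2f}")
```

Once the values are grouped this way, the individual readings can no longer be recovered from the category table, which is the loss of information noted in the steps above.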

S L I D E 25

There are REAL and THEORETICAL frequency distributions

• Real

– Obtained from the actual data

• Theoretical

– Calculated using certain assumptions

– The most commonly used is “NORMAL (GAUSSIAN) DISTRIBUTION”

• Most statistical methods assume that the data is normally distributed

• Real data are seldom perfectly normally distributed

• Based on the central limit theorem, if the sample size is large, the sampling distribution of the mean is approximately normal even if the data are skewed, so the assumption of normality usually holds

S L I D E 26

Normal (Gaussian) distribution

• Continuous distribution

• Used if the population standard deviation (σ) is known

• A symmetric bell-shaped probability distribution with a shape that is determined by mean (µ) and standard deviation (σ)

[Figure: two panels of normal curves. Left: different µ, same σ. Right: same µ, different σ.]

S L I D E 27

Normal (Gaussian) distribution

• Properties:

–Bell shape

–Depends on mean (µ) and standard deviation (σ)

–Symmetric about the mean (µ)

–Mean=median=mode


S L I D E 28

Normal (Gaussian) distribution

• The area under the curve is the probability (relative frequency) of the values comprising the normal distribution.

– The area under the whole curve = 1

• 68% within µ ± 1σ

• 95% within µ ± 2σ (more precisely, ± 1.96σ)

• 99.7% within µ ± 3σ (99% lies within µ ± 2.58σ)
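These areas can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_within` is illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the curve between -k and +k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 1.96, 2, 2.58, 3):
    print(f"within ±{k}σ: {area_within(k):.4f}")
```

Because the area depends only on the number of standard deviations from the mean, the same proportions hold for any normal distribution, whatever its µ and σ.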

S L I D E 29

Normal (Gaussian) distribution, example

• If the math test scores are normally distributed with a mean of 10 and a standard deviation of 3, what is the range of scores in which 68% of the student scores will lie?

– 68% of the students will have a score within µ ± 1σ

– 10 ± 3 = between 7 and 13

S L I D E 30

Standard normal distribution (z)

• The normal distribution with mean 0 and standard deviation 1

• If the mean ≠ 0 or the SD ≠ 1, apply the z transformation, which allows use of the standard normal table

– $z = \dfrac{x - \mu}{\sigma}$, where x is the value of the variable, µ is the mean, σ is the SD

• A positive z means the value is above the mean

• A negative z means the value is below the mean

• If the z is known you can get the x

– x= µ + zσ

Graph generated by R

S L I D E 31

Standard normal distribution (z)

• Properties:

– Bell shape

– Symmetric about the mean

– Mean=median=mode

– Mean=0

– Standard deviation=1

– The area under the curve = 1

• 68% within µ ± 1σ

• 95% within µ ± 2σ

• 99.7% within µ ± 3σ

Graph generated by R


S L I D E 32

Standard normal distribution (z) tables

Areas under the standard normal curve (z scores)

• Can be used to find the proportion above, below, or between any z scores

• The first column includes the stem of the z value

• The top row includes the second and third digit of the z value

Source: http://image.slidesharecdn.com/copyofz-table-130515110049-phpapp02/95/copy-of-z-table-1-638.jpg?cb=1368615687

[Figure: standard normal (z) tables for positive and negative z. Each entry is the area under the curve to the left of (i.e., below) the z score.]

S L I D E 33

Standard normal distribution (z), example

• If students’ test scores are normally distributed with a mean of 80 and a standard deviation of 10, what test score separates the highest 5% of scores (i.e., what is the 95th percentile)?

• Solution:

– Find the z score that marks the upper 5% → 1.645

– The test score= µ + 1.645σ= 80+1.645*10=96.45

– Conclusion: the upper 5% has a test score >96.45

https://i.ytimg.com/vi/SSHCPCS5cys/maxresdefault.jpg
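The same cutoff can be obtained without a printed table. A short sketch using `statistics.NormalDist.inv_cdf` from the Python standard library, assuming (as the example does) that the scores are normally distributed; the variable names are illustrative.

```python
from statistics import NormalDist

# z score below which 95% of the standard normal distribution lies.
z_95 = NormalDist().inv_cdf(0.95)

# Convert back to the test-score scale: x = mu + z * sigma.
mu, sigma = 80, 10
cutoff = mu + z_95 * sigma

print(f"z = {z_95:.3f}, cutoff score = {cutoff:.2f}")
```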

S L I D E 34

Standard normal distribution (z) tables

• If HDL cholesterol is normally distributed with a mean of 45 mg/dL and a standard deviation of 5 mg/dL, what proportion of the population has HDL values > 40 mg/dL?

• Solution:

– Find the z score equivalent to 40 mg/dL: $z = \dfrac{x - \mu}{\sigma} = \dfrac{40 - 45}{5} = -1$

– P(HDL > 40) = P(z > −1) = 1 − P(z ≤ −1)

– Find the area (probability) below z = −1 (HDL = 40) in the table: 0.1587

– P(HDL > 40) = 1 − 0.1587 = 0.8413

– Conclusion: 84.13% of people in the population are expected to have an HDL value > 40 mg/dL

Source: http://www.gridgit.com/postpic/2014/10/negative-z-score-table-pdf_287337.png

[Figure: negative z table. Each entry is the area under the curve to the left of (i.e., below) the z score.]
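The HDL calculation can be reproduced in a few lines. This is a sketch using `statistics.NormalDist` from the Python standard library, under the example's assumption that HDL is normally distributed; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 45, 5          # mean and SD of HDL cholesterol (mg/dL)
x = 40                     # threshold of interest (mg/dL)

# Step 1: z transformation.
z = (x - mu) / sigma

# Step 2: area to the left of z, i.e. the value read from the z table.
area_below = NormalDist().cdf(z)

# Step 3: the proportion above the threshold is the complement.
proportion_above = 1 - area_below

print(f"z = {z}, P(z <= {z}) = {area_below:.4f}, P(HDL > {x}) = {proportion_above:.4f}")
```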

S L I D E 35

T-distribution

• A symmetric distribution with mean 0 and standard deviation larger than that for the normal distribution for small sample sizes.

• Used if the population standard deviation is unknown

• Needed when the sample size is small

• t and z distributions are very similar if n>30

• Properties:

– Symmetric

– Bell shape

– Shape changes with the degrees of freedom (k)

– Mean=median=mode=0

– Standard deviation > 1

Z & t almost identical when sample size ~30

Graph generated by R


S L I D E 36

T-distribution

• Degrees of freedom (df)

– Is the number of observations that are free to vary

– When calculating the mean, the sum of the observations is fixed. Therefore, when adding up the N observations, each observation can vary except the last one, because the total has to be fixed. Only N−1 observations can vary if one mean is to be estimated (one-sample), and (N1+N2)−2 observations can vary if two means are to be estimated (two-sample)

– df= total sample size-number of means that are calculated

S L I D E 37

T-distribution

• Table of critical values of t distribution

Source: http://elvers.us/psy216/tables/tvalues.htm

      Levels of Significance for a One-Tailed Test
df    0.2500  0.2000  0.1500  0.1000  0.0500  0.0250  0.0100  0.0050  0.0005
      Levels of Significance for a Two-Tailed Test
      0.5000  0.4000  0.3000  0.2000  0.1000  0.0500  0.0200  0.0100  0.0010
1     1.000   1.376   1.963   3.078   6.314   12.706  31.821  63.657  636.619
2     0.816   1.061   1.386   1.886   2.920   4.303   6.964   9.925   31.599
3     0.765   0.978   1.250   1.638   2.353   3.182   4.541   5.841   12.924
4     0.741   0.941   1.189   1.533   2.132   2.776   3.747   4.604   8.610
5     0.727   0.920   1.156   1.476   2.015   2.570   3.365   4.032   6.869
6     0.718   0.906   1.134   1.440   1.943   2.447   3.143   3.707   5.959
7     0.711   0.896   1.119   1.415   1.895   2.365   2.998   3.499   5.408
8     0.706   0.889   1.108   1.397   1.859   2.306   2.896   3.355   5.041
9     0.703   0.883   1.100   1.383   1.833   2.262   2.821   3.250   4.781
10    0.700   0.879   1.093   1.372   1.812   2.228   2.764   3.169   4.587

S L I D E 38

Binomial distribution is used to describe the frequency distribution of dichotomous data

• The probability distribution that describes the number of successes X observed in n independent trials, each with the same probability of occurrence

• For binary variables

• Defined by n and π

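• The z (normal) distribution can be used as an approximation when the sample is large; the approximation works at smaller sample sizes when the proportion is close to 0.5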

Graphs generated by R
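To make the definition concrete, the probability of observing exactly x successes in n independent trials is C(n, x)·π^x·(1−π)^(n−x). A minimal Python sketch; the function name `binomial_pmf` and the choice n = 10, π = 0.5 are illustrative assumptions, not values from the lecture.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials,
    each with success probability pi."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


# Illustrative parameters: 10 trials, success probability 0.5.
n, pi = 10, 0.5
pmf = [binomial_pmf(x, n, pi) for x in range(n + 1)]

# The probabilities over all possible outcomes sum to 1,
# and the mean of the distribution is n * pi.
total_probability = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))

for x, p in enumerate(pmf):
    print(f"P(X = {x:2d}) = {p:.4f}")
```

With π = 0.5 the probabilities are symmetric around n·π, which is why the normal curve fits well in that case.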

S L I D E 39

Chi-square distribution (χ²) is used for analysis of counts

• The distribution used to analyze counts in frequency tables.

• A nonsymmetrical distribution with mean (µ) and variance (σ²)

• Used for categorical (nominal) data

• Properties:

– Degrees of freedom = υ

– µ = υ

– σ² = 2υ

– Approaches normal distribution with the increase in df

Graph generated by R


S L I D E 40

Statistics in medicine

Lecture 1- part 3: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD

Assistant Professor

Chronic Disease Epidemiology Department

Yale School of Public Health

[email protected]

S L I D E 41

Readings and resources

• Chapter 9, p105-118: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).

S L I D E 42

Summarizing numerical data

• Continuous variable

– Measures of central tendency

– Measures of dispersion

• Nominal data

– Proportions

– Percentages

– Ratios

– Rates

S L I D E 43

Measures of central tendency

• Definition:

– Index or summary numbers that describe the middle of a distribution

• Types:

– Mean

– Median

– Mode


S L I D E 44

The mean

• Types

–Arithmetic

–Geometric

S L I D E 45

The arithmetic mean

• The most commonly used statistic

• Definition:

– The arithmetic average of the observations, denoted by µ in the population and by $\bar{X}$ in the sample. In a sample, the mean is the sum of the X values divided by the number n in the sample

• Arithmetic mean’s calculation: $\bar{X} = \dfrac{\sum X}{n}$

• Sensitive to extreme values

• Could be used with numerical scales

• Should NOT be used with ordinal scales

S L I D E 46

Example of arithmetic mean’s calculation

Arithmetic mean $= \dfrac{88 + 86 + 93 + \cdots + 106}{20} = \dfrac{1781}{20} = 89.05$

Subject Glucose

1 88

2 86

3 93

4 79

5 83

6 98

7 74

8 96

9 95

10 78

11 75

12 98

13 90

14 108

15 81

16 108

17 76

18 97

19 72

20 106

Basic Statistical Measures

Location              Variability
Mean     89.05000     Std Deviation          11.58254
Median   89.00000     Variance              134.15526
Mode     98.00000     Range                  36.00000
                      Interquartile Range    19.00000

SAS 9.4 output
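Most of these summary measures can be reproduced from the 20 glucose values with the Python standard library. This is a sketch for checking the arithmetic, not the procedure used to produce the slide; the variable names are illustrative.

```python
import statistics

# Glucose values for subjects 1 to 20, as listed on the slide.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

mean = statistics.mean(glucose)            # sum of values / n
median = statistics.median(glucose)        # mean of the two middle values (n is even)
modes = statistics.multimode(glucose)      # every value tied for most frequent
sd = statistics.stdev(glucose)             # sample SD (divides by n - 1)
variance = statistics.variance(glucose)    # sample variance
value_range = max(glucose) - min(glucose)

print(mean, median, modes, round(sd, 5), round(variance, 5), value_range)
```

Note that `multimode` reports both 98 and 108, since each occurs twice, whereas the output table above lists a single mode of 98.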

S L I D E 47

The geometric mean

• Less commonly used than arithmetic mean

• Definition:

– The nth root of the product of n observations

• Geometric mean’s calculation

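– Log GM = the mean of the log values, i.e., $\log GM = \dfrac{\sum \log X}{n}$

– Exponentiate the mean of the logs to obtain the GM: $GM = e^{\log GM}$

• Used with skewed distributions or with data measured on a logarithmic scale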


S L I D E 48

Example of geometric mean’s calculation

• Geometric mean

Subject Glucose Log glucose

1 88 4.477337

2 86 4.454347

3 93 4.532599

4 79 4.369448

5 83 4.418841

6 98 4.584967

7 74 4.304065

8 96 4.564348

9 95 4.553877

10 78 4.356709

11 75 4.317488

12 98 4.584967

13 90 4.49981

14 108 4.682131

15 81 4.394449

16 108 4.682131

17 76 4.330733

18 97 4.574711

19 72 4.276666

20 106 4.663439

Sum 1781 89.62306

Arithmetic mean 89.05 4.481153

Geometric mean 88.33649
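The log-then-exponentiate route in this table can be followed step by step. A minimal Python sketch using natural logarithms, which match the tabulated log values; the variable names are illustrative, and `statistics.geometric_mean` is used only as a cross-check.

```python
import math
import statistics

glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

# Step 1: take the natural log of every observation.
log_values = [math.log(x) for x in glucose]

# Step 2: average the logs (this is log GM).
mean_log = sum(log_values) / len(log_values)

# Step 3: exponentiate the mean of the logs to get back to the original scale.
geometric_mean = math.exp(mean_log)

print(f"mean of logs = {mean_log:.6f}, geometric mean = {geometric_mean:.5f}")
```

The geometric mean comes out slightly below the arithmetic mean of 89.05, as it does for any set of positive values that are not all equal.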


S L I D E 49

The median

• Definition:

– A measure of central tendency. It is the middle observation, i.e., the one that divides the distribution of values into halves. It is also equal to the 50th percentile

• Median’s calculation

– Arrange the observations in ascending or descending order

– Count in to find the middle:

• Odd number of observations: the middle value

• Even number of observations: the mean of the two middle values

• Less sensitive to extreme values than the mean

• Could be used with numerical scales

• Could be used with ordinal scales

S L I D E 50

Example of median’s calculation

• Median

(88+90)/2 =89

Subject Glucose

19 72

7 74

11 75

17 76

10 78

4 79

15 81

5 83

2 86

1 88

13 90

3 93

9 95

8 96

18 97

6 98

12 98

20 106

14 108

16 108

Basic Statistical Measures

Location              Variability
Mean     89.05000     Std Deviation          11.58254
Median   89.00000     Variance              134.15526
Mode     98.00000     Range                  36.00000
                      Interquartile Range    19.00000

SAS 9.4 output

S L I D E 51

The mode

• Definition:

– The value of a numerical variable that occurs the most frequently

• Mode’s calculation

– Count the number of times each value occurs

– The mode is the value that is most frequent

• Some data might not have a mode

• Some data might have two modes → bimodal

• Some data might have more than two modes → multimodal

• Modal class could be estimated, which is the interval that has the largest number of observations


S L I D E 52

Example of mode’s calculation

• Modes

98 and 108

Subject Glucose

19 72

7 74

11 75

17 76

10 78

4 79

15 81

5 83

2 86

1 88

13 90

3 93

9 95

8 96

18 97

6 98

12 98

20 106

14 108

16 108

S L I D E 53

Use of measures of central tendency

• What is the best measure for a particular dataset?

– The choice depends on:

• Type of scale

–Numerical → arithmetic mean or median

–Ordinal → median

–Logarithmic scale → geometric mean

• Distribution

–Symmetrical: the same shape on both sides of the mean → arithmetic mean or median

–Skewed: outliers in one direction → median

–Bimodal → mode

S L I D E 54

Measures of spread (dispersion)

• Definition:

– Index or summary numbers that describe the spread of observations about the middle value.

• Types

– Range

– Standard deviation

– Coefficient of variation

– Percentiles

– Interquartile range

S L I D E 55

The range

• Definition:

– The difference between the largest and the smallest observation

• Range’s calculation

– Rank the data

– Range=largest value – smallest value

• Sometimes, minimum and maximum values are displayed instead of the range


S L I D E 56

Example of range’s calculation

• Range

– 108-72=36

– Or present the lower and upper values (72,108)

Subject Glucose

19 72

7 74

11 75

17 76

10 78

4 79

15 81

5 83

2 86

1 88

13 90

3 93

9 95

8 96

18 97

6 98

12 98

20 106

14 108

16 108

Basic Statistical Measures

Location              Variability
Mean     89.05000     Std Deviation          11.58254
Median   89.00000     Variance              134.15526
Mode     98.00000     Range                  36.00000
                      Interquartile Range    19.00000

SAS 9.4 output

S L I D E 57

The standard deviation

• Definition:

– The most common measure of spread, denoted by σ in the population and SD or s in the sample. It can be used with the mean to describe the distribution of observations. It is the square root of the average of the squared deviations of the observations from their mean

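• SD’s calculation (sample SD, dividing by n − 1 as in the worked example): $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$

• Other computational formulas exist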

S L I D E 58

The standard deviation

• SD is used in many statistical tests

• Could be used with the mean to describe the distribution of observation

– If mean − 2SD reaches zero or below (i.e., the mean is smaller than 2SD) → the observations are probably skewed

• Characteristics of SD:

– If the distribution is bell shaped

• About 68% of observations lie within mean ± 1SD

• 95% of observations lie within mean ± 2SD

• 99.7% of observations lie within mean ± 3SD

– Regardless of the shape

• At least 75% of observations lie within mean ± 2SD

S L I D E 59

Example of SD’s calculation

SD’s calculation

Subject Glucose (X) Deviation (X − mean) Squared deviation

1 88 -1.05 1.1025

2 86 -3.05 9.3025

3 93 3.95 15.6025

4 79 -10.05 101.0025

5 83 -6.05 36.6025

6 98 8.95 80.1025

7 74 -15.05 226.5025

8 96 6.95 48.3025

9 95 5.95 35.4025

10 78 -11.05 122.1025

11 75 -14.05 197.4025

12 98 8.95 80.1025

13 90 0.95 0.9025

14 108 18.95 359.1025

15 81 -8.05 64.8025

16 108 18.95 359.1025

17 76 -13.05 170.3025

18 97 7.95 63.2025

19 72 -17.05 290.7025

20 106 16.95 287.3025

Sum 1781 2548.95

Mean 89.05

SD 11.58254
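The columns of this table correspond to the steps of the calculation: deviation from the mean, squared deviation, sum of squares, then division by n − 1 and a square root. A minimal Python sketch of those steps; the variable names are illustrative.

```python
import math

glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

n = len(glucose)
mean = sum(glucose) / n

# Column 3 of the table: deviation of each observation from the mean.
deviations = [x - mean for x in glucose]

# Column 4 of the table: squared deviations.
squared_deviations = [d**2 for d in deviations]

# Sum of squares, sample variance (divide by n - 1), and sample SD.
sum_of_squares = sum(squared_deviations)
variance = sum_of_squares / (n - 1)
sd = math.sqrt(variance)

print(f"sum of squares = {sum_of_squares:.2f}, variance = {variance:.5f}, SD = {sd:.5f}")
```

The raw deviations always sum to zero, which is why they are squared before being averaged.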

Basic Statistical Measures

Location              Variability
Mean     89.05000     Std Deviation          11.58254
Median   89.00000     Variance              134.15526
Mode     98.00000     Range                  36.00000
                      Interquartile Range    19.00000

SAS 9.4 output


S L I D E 60

The coefficient of variation

• Definition:

– The standard deviation divided by the mean. It is used to obtain a measure of relative variation i.e. variation relative to the size of the mean

• CV’s calculation (usually expressed as a percentage): $CV = \dfrac{SD}{\bar{X}} \times 100\%$

• Commonly used in quality control
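As an illustration, the coefficient of variation of the 20 glucose values used in the earlier examples can be computed directly from the definition. This is a sketch; expressing the result as a percentage is a common convention assumed here, and the variable names are illustrative.

```python
import statistics

glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

mean = statistics.mean(glucose)
sd = statistics.stdev(glucose)      # sample standard deviation

# CV = SD / mean, expressed as a percentage of the mean.
cv_percent = sd / mean * 100

print(f"CV = {cv_percent:.1f}%")
```

Because the SD and the mean share the same units, the CV is unit-free, which is what makes it usable for comparing variation across measurements on different scales.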

S L I D E 61

Percentiles

• Definition:

– A number that indicates the percentage of a distribution that is less than or equal to that number

• Commonly used to compare individual values to norm

– Growth charts

• Used to determine normal laboratory ranges

– The interval between the 2.5th and 97.5th percentiles contains the central 95% of the distribution

Quantiles

Level Quantile

100% Max 108.0

99% 108.0

95% 108.0

90% 107.0

75% Q3 97.5

50% Median 89.0

25% Q1 78.5

10% 74.5

5% 73.0

1% 72.0

0% Min 72.0

SAS 9.4 output
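Statistical packages use several different conventions for computing percentiles from a sample, and they can give slightly different quartiles. The sketch below assumes one such convention (take the next observation, or average two neighbours when n × p lands exactly on an observation); it was chosen because it reproduces the quantiles listed above, and it is not claimed to be the routine that produced the slide. The function name `percentile` is illustrative.

```python
import math

glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]


def percentile(values, percent):
    """Percentile using an 'average at exact positions' convention (assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    position = n * percent / 100      # how many observations lie at or below
    j = math.floor(position)
    if j == 0:
        return ordered[0]             # below the first observation: use the minimum
    if j >= n:
        return ordered[-1]            # at the top: use the maximum
    if position == j:
        # Lands exactly on the j-th observation: average it with the next one.
        return (ordered[j - 1] + ordered[j]) / 2
    return ordered[j]                 # otherwise take the next observation


q1 = percentile(glucose, 25)
median = percentile(glucose, 50)
q3 = percentile(glucose, 75)
iqr = q3 - q1

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
```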

S L I D E 62

Interquartile range

• Definition:

– The difference between the 25th percentile (first quartile) and the 75th percentile (third quartile)

• It contains the central 50% of the distribution

• Some authors present the first and third quartile values instead of the difference

S L I D E 63

Interquartile range

• Interquartile range

97.5-78.5=19

Or present the first and third quartile (78.5,97.5)

Subject Glucose

19 72

7 74

11 75

17 76

10 78

4 79

15 81

5 83

2 86

1 88

13 90

3 93

9 95

8 96

18 97

6 98

12 98

20 106

14 108

16 108

Basic Statistical Measures

Location              Variability
Mean     89.05000     Std Deviation          11.58254
Median   89.00000     Variance              134.15526
Mode     98.00000     Range                  36.00000
                      Interquartile Range    19.00000

SAS 9.4 output


S L I D E 64

Use of measures of spread

• What is the best measure for a particular dataset?

– The choice depends on:

• Type of measure of central tendency

– Mean → standard deviation

– Median → interquartile range

• Distribution

– Symmetrical: the same shape on both sides of the mean → standard deviation or interquartile range

– Skewed: outliers in one direction → interquartile range

• Purpose

– Compare to norms → percentiles

– Compare distributions measured on different scales → coefficient of variation

– Describe the central 50% of the distribution → interquartile range

– Emphasize the extreme values → range

S L I D E 65

Error bar plots

• Definition:

– A graph that displays the mean and a measure of spread for one or more groups

• Deciphering the error bar plot

– The circle

• The mean

– The bars

• The standard deviation

–Some authors present the standard error

S L I D E 66

The proportions and percentages

• Proportion definition:

– The number of observations with the characteristic of interest divided by the total number of observations.

• Proportion’s calculation

– If the data contain two groups a and b, then the proportion of a is $\dfrac{a}{a + b}$

• Could be used with

– Nominal scales

– Ordinal scales

– Numerical scales

• Percentage: is the proportion multiplied by 100%

S L I D E 67

The ratios

• Definition:

– A part divided by another part. It is the number of observations WITH the characteristic of interest divided by the number of observations WITHOUT the characteristic of interest.

• Ratio’s calculation

– If the data contain two groups a and b, then the ratio of a to b is $\dfrac{a}{b}$
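The difference between a proportion and a ratio is easiest to see on the same counts. The sketch below reuses the female row of the cholesterol frequency table from the discrete-scales slide (17 checked, 10 not checked); treating "checked" as group a and "not checked" as group b is an illustrative choice.

```python
# Female row of the cholesterol table: a = checked, b = not checked.
a = 17
b = 10

# Proportion: the part divided by the whole.
proportion = a / (a + b)

# Percentage: the proportion multiplied by 100.
percentage = proportion * 100

# Ratio: one part divided by the other part.
ratio = a / b

print(f"proportion = {proportion:.2f}, percentage = {percentage:.0f}%, ratio = {ratio:.1f}")
```

A proportion is bounded between 0 and 1 because the numerator is included in the denominator; a ratio has no such upper bound.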


S L I D E 68

The rates

• Definition:

– A proportion associated with a multiplier, called the base (e.g., 1000, 100,000) and computed over a specified period

• Rate’s calculation

– If the data contain two groups a and b, then the rate of a is $\dfrac{a}{a + b} \times \text{base}$

• Use of rates in epidemiology and medicine:

– Mortality rates

– Cause-specific mortality rates

– Morbidity rates

S L I D E 69

The rates

• Adjusting rates:

– Why might a crude rate not be suitable?

• When comparing populations with dissimilar characteristics such as age, gender, race

– Types:

• Direct adjustment

• Indirect adjustment

– Details of calculations will be covered in the epidemiology and public health thread class

S L I D E 70

One of the problems in the analysis of frequency distribution is SKEWNESS

• Horizontal stretching of the distribution → the right and left sides of the distribution are not mirror images, i.e., one tail is longer than the other

• The tail indicates the direction and type of skewed distribution

– Tail is pointing to the right → skewed to the right (positively skewed)

– Tail is pointing to the left → skewed to the left (negatively skewed)

• The mean follows the tail regardless of the type of skewed distribution

– The sequence from the tail to the apex is mean, median, mode (note that this is alphabetical order)

• Mean > median > mode → skewed to the right (positively skewed)

• Mean < median < mode → skewed to the left (negatively skewed)

Graph source: http://www.statisticshowto.com/wp-content/uploads/2014/02/pearson-mode-skewness.jpg
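The pull of the tail on the mean can be seen in the 180 glucose values from the frequency-table slide, whose distribution has a long right tail. A minimal Python sketch, assuming the value/count pairs as transcribed from that slide; the variable names are illustrative.

```python
import statistics

# Glucose value -> number of participants, transcribed from the frequency table.
glucose_counts = {
    52: 1, 66: 1, 67: 1, 68: 2, 69: 2, 70: 7, 71: 1, 72: 2, 73: 1, 75: 12,
    76: 2, 77: 4, 78: 6, 79: 4, 80: 11, 82: 2, 83: 2, 85: 9, 86: 4, 87: 1,
    88: 2, 89: 5, 90: 8, 92: 3, 95: 11, 96: 1, 98: 1, 100: 12, 103: 4,
    108: 1, 110: 11, 115: 1, 120: 6, 121: 1, 122: 1, 124: 1, 130: 3,
    133: 1, 135: 3, 139: 1, 140: 4, 143: 1, 145: 5, 149: 2, 150: 4,
    155: 2, 158: 1, 160: 1, 165: 4, 168: 1, 170: 1, 172: 1, 220: 1,
}

# Expand the table back into one entry per participant.
values = [value for value, count in glucose_counts.items() for _ in range(count)]

mean = statistics.mean(values)
median = statistics.median(values)

# In a right-skewed distribution the high values pull the mean above the median.
print(f"n = {len(values)}, mean = {mean:.2f}, median = {median}")
```

This is also why the median is the preferred measure of central tendency for skewed data: it is not dragged toward the long tail.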

S L I D E 71

Statistics in medicine

Lecture 1- part 4: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD

Assistant Professor

Chronic Disease Epidemiology Department

Yale School of Public Health

[email protected]


S L I D E 73

There are several ways to depict the frequency distribution of a continuous variable

• Histogram

• Frequency polygons

• Line graphs

• Stem and leaf diagrams

• Quantiles

• Boxplots

S L I D E 74

Frequency distribution is usually presented with a histogram

• Definition:

– A bar graph of a frequency distribution of numerical observations

• Steps of creating a histogram

– Decide on the number of non-overlapping intervals (statistical software might determine this automatically)

– Put the intervals on the x-axis

– Put the numbers or percentages on the y-axis

• Percentages are used to compare two histograms based on different sample sizes

– Present the frequencies or percentages with bars

• The area of each bar is in proportion to the percentage of individuals in that interval

• Combining observations into intervals gives a smoother curve than a histogram of individual values
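The steps above can be sketched without any plotting library. The function names, the interval settings (start 50, width 30, five intervals), and the glucose values below are illustrative assumptions, not the lecture data.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins non-overlapping intervals [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the intervals are ignored
            counts[index] += 1
    return counts


def to_percentages(counts):
    """Convert counts to percentages so that samples of different sizes compare."""
    total = sum(counts)
    return [100 * c / total for c in counts]


# Hypothetical fasting glucose values (mg/dl).
glucose = [52, 66, 70, 75, 78, 80, 85, 88, 90, 95, 100, 110, 125, 140, 172]
counts = histogram_counts(glucose, start=50, width=30, n_bins=5)

# Text histogram: one row per interval, one '#' per observation.
for i, (c, p) in enumerate(zip(counts, to_percentages(counts))):
    lo = 50 + 30 * i
    print(f"{lo:>3}-{lo + 29:<3} {'#' * c:<8} {c:>2} ({p:.0f}%)")
```

Statistical software normally chooses the intervals automatically; fixing them by hand here keeps the counting step visible.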

S L I D E 75

[Figure: Histogram of Glucose level (Minitab 17 output). x-axis: Glucose level, ticks at 60, 90, 120, 150, 180, 210; y-axis: Frequency, ticks from 2 to 44.]

Interpreting the graph

Most participants had a fasting blood glucose level of 65 to 125 mg/dl. Only two participants had a blood glucose level below 60 mg/dl. Additionally, the distribution is skewed to the right (positively skewed); several participants had a fasting blood glucose level much higher than the target of ≤ 125 mg/dl.

S L I D E 76

Frequency polygons are another presentation of the frequency distribution

• Definition:

– A line graph connecting the midpoints of the tops of the columns of a histogram. It is useful in comparing two frequency distributions

• Steps of creating a frequency polygon

– Create a histogram

– Connect the midpoints of the tops of the columns of the histogram

[Figure: Frequency polygon of Glucose level (Minitab 17 output). x-axis: Glucose level, ticks at 60, 90, 120, 150, 180, 210; y-axis: Frequency, ticks from 2 to 44.]

S L I D E 77

Percentage polygons

• Definition:

– A line graph connecting the midpoints of the tops of the columns of a histogram based on percentages instead of counts. It is useful in comparing two or more frequency distributions when the total frequencies (sample sizes) are not equal

• Steps of creating a percentage polygon

– Create a histogram based on percentages

– Connect the midpoints of the tops of the columns of the histogram

– Extend the line from the midpoints of the first and last columns to the x-axis

[Figure: Percentage polygon (Minitab 17 output). x-axis: Glucose level, ticks at 60, 90, 120, 150, 180, 210; y-axis: Percent, ticks from 0 to 25.]
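A percentage polygon reduces to a list of (x, y) vertices, which the sketch below computes from interval counts. It assumes equal-width intervals and closes the polygon at the midpoints of the empty intervals just before the first column and just after the last one, which is one common convention; the function name and both sets of counts are hypothetical.

```python
def percentage_polygon(counts, start, width):
    """Return (x, y) vertices: interval midpoints against percentages."""
    total = sum(counts)
    # Start on the x-axis, half an interval before the first column.
    points = [(start - width / 2, 0.0)]
    for i, c in enumerate(counts):
        midpoint = start + width * (i + 0.5)
        points.append((midpoint, 100 * c / total))
    # End on the x-axis, half an interval after the last column.
    points.append((start + width * (len(counts) + 0.5), 0.0))
    return points


# Two hypothetical samples of different sizes (15 and 40 observations).
group_a = [5, 6, 2, 1, 1]
group_b = [10, 20, 6, 3, 1]

for name, counts in (("A", group_a), ("B", group_b)):
    vertices = percentage_polygon(counts, start=50, width=30)
    print(name, [(x, round(y, 1)) for x, y in vertices])
```

Because the y-values are percentages, the two lists of vertices can be drawn on the same axes even though the samples differ in size.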

S L I D E 78

Stem-and-leaf plots

• Definition:

– A graphical display for numerical data. It is similar to both a frequency table and a histogram

• For tallying observations

• Steps of creating stem-and-leaf plot

– Decide on the number of non-overlapping intervals

– Draw a vertical line

– Put the first digits of each interval on the left side of the vertical line (the “stem”)

– For each individual, put the last digit on the right side of the vertical line (the “leaves”)

• If the observation is one digit, that digit is the leaf

– Reorder leaves from lowest to highest within each interval

– Count from either end to locate the median
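The steps above can be sketched in Python. This is a minimal illustration that assumes non-negative whole-number observations with a leaf unit of 1; the function name and the values are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):  # sorting first keeps the leaves ordered
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)


# Hypothetical glucose values (mg/dl).
glucose = [52, 66, 67, 68, 70, 71, 75, 80, 82, 90, 90, 105, 120]
plot = stem_and_leaf(glucose)

# Print every stem in the range, including stems with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

A one-digit observation such as 7 gets stem 0 and leaf 7, which matches the rule that the single digit is the leaf.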

S L I D E 79

Stem-and-Leaf Display: Glucose level

Stem-and-leaf of Glucose level N = 180
Leaf Unit = 1.0

   n  Stem | Leaf
   1     5 | 2
   7     6 | 678899
  46     7 | 000000012235555555555556677778888889999
  82     8 | 000000000002233555555555666678899999
(24)     9 | 000000002225555555555568
  74    10 | 00000000000033338
  57    11 | 000000000005
  45    12 | 000000124
  36    13 | 00035559
  28    14 | 000035555599
  16    15 | 0000558
   9    16 | 055558
   3    17 | 02
   1    18 |
   1    19 |
   1    20 |
   1    21 |
   1    22 | 0

The median is in the line marked (24); median = 91.

The vertical line was added manually.

Minitab 17 output

S L I D E 80

Box plots (box-and-whisker plot)

• Definition:

– A graph that summarizes the data by displaying the minimum, first quartile, median, third quartile, and maximum

• It could be created from the information displayed in a stem-and-leaf plot or a frequency table
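The five statistics a box plot displays can be computed with Python's standard `statistics.quantiles`. The sketch below uses hypothetical data; note that statistical packages differ in how they interpolate quartiles, so other software may report slightly different values for the same sample.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters;
    # the default "exclusive" method is used here.
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


# Hypothetical glucose values (mg/dl).
sample = [52, 66, 70, 75, 80, 85, 90, 95, 110, 140, 172]
minimum, q1, med, q3, maximum = five_number_summary(sample)
print("minimum:", minimum)
print("first quartile:", q1)
print("median:", med)
print("third quartile:", q3)
print("maximum:", maximum)
print("interquartile range:", q3 - q1)
```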

S L I D E 81

Box plots (box-and-whisker plot)

• Deciphering the box-and-whisker plot

– The box

• The top of the box is the third quartile

• The bottom of the box is the first quartile

• The length of the box is the interquartile range

• The median is presented with a horizontal line in the box

• The mean is presented with a plus sign in the box (some programs)

– The whiskers

• Depict the minimum and the maximum values

Source: http://www.physics.csbsju.edu/stats/simple.box.defs.gif

S L I D E 82

[Figure: Boxplot of Glucose level. y-axis: Glucose level, ticks from 50 to 225; the median (91) and the mean (101.033) are labeled on the plot.]

Interpreting the results

The boxplot shows:

• The whiskers extend from 52 to 172

• The longer upper whisker and the larger box area above the median indicate that the data are skewed to the right (positively skewed)

• The median is 91

• The mean is 101.033

• The box (interquartile range) extends from 79 to 118.75

• One outlier is present
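The slide does not say how the outlier was identified. A common convention, assumed here, flags any observation lying more than 1.5 times the interquartile range beyond the box. The sketch below applies that rule to the quartiles reported above; the function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences; observations outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported in the interpretation of the boxplot.
q1, q3 = 79, 118.75
lower, upper = tukey_fences(q1, q3)

largest = 220       # largest value in the stem-and-leaf display (stem 22, leaf 0)
whisker_top = 172   # upper end of the whisker reported for the boxplot

print("fences:", lower, upper)
print("220 flagged as outlier:", largest > upper)
print("172 within the fences:", lower <= whisker_top <= upper)
```

Under this assumption the upper whisker stops at 172, the largest observation inside the fences, and 220 is drawn separately as the single outlier.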

S L I D E 83

Tabular and graphical presentation of nominal and ordinal data

• Contingency frequency tables:

– A table used to display counts and/or frequencies for two or more nominal or quantitative variables

Gender    Post graduate    College    High school
Male            1             3            3
Female          5             6            2
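A contingency table is a tally of how often each combination of categories occurs. The sketch below builds one with `collections.Counter`; the individual records are hypothetical and were chosen only so that the tallies reproduce the counts in the table above.

```python
from collections import Counter

# Hypothetical individual records of (gender, education).
records = (
    [("Male", "Post graduate")] * 1
    + [("Male", "College")] * 3
    + [("Male", "High school")] * 3
    + [("Female", "Post graduate")] * 5
    + [("Female", "College")] * 6
    + [("Female", "High school")] * 2
)

cells = Counter(records)  # one count per (row, column) combination
rows = ["Male", "Female"]
cols = ["Post graduate", "College", "High school"]

# Print the table with one row per gender and one column per education level.
print(f"{'Gender':<8}" + "".join(f"{c:>15}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{cells[(r, c)]:>15}" for c in cols))
```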

S L I D E 84

Tabular and graphical presentation of nominal and ordinal data

• Dot plots

– A graphical presentation using dots

• Bar charts

– A graph used with nominal characteristics to display the numbers or percentages of observations with the characteristic of interest

• The categories are placed on the x-axis

• The numbers or percentages are placed on the y-axis

S L I D E 85

Graphs for two characteristics

• Two characteristics are nominal

– Bar charts

– Dot plots

S L I D E 86

Graphs for two characteristics

• One characteristic is nominal and the other is numerical:

• Box plots

• Error plots

[Figures: Box plots (SAS 9.4 output); Error plots (SAS 9.4 output).]

S L I D E 87

Graphs for two characteristics

• Two characteristics are numerical:

• Scatterplots (bivariate plots)

– A two-dimensional graph displaying the relationship between two numerical characteristics or variables

• Creating a scatterplot

– If the data do not have an outcome and a predictor

• Choice of the x and y axis does not matter

– If the data have an outcome and a predictor

• Put the explanatory variable (risk factor, predictor) on the x-axis

• Put the outcome on the y-axis

– Put a circle for each observation at the point of intersection of its x and y values

[Figure: Scatter plots (SAS 9.4 output).]

S L I D E 88

Quiz

A pharmaceutical company tested the effect of sofosbuvir (a new HCV drug) on sustained viral response (SVR) in four HCV genotypes. In genotypes 1, 2, 3, and 4, the drug was shown to cause SVR in 90%, 93%, 84%, and 96% of the patients, respectively. What type of graphical depiction is best suited to show the data?

A. Pie chart

B. Venn diagram

C. Bar diagram

D. Histogram