Epidemiology 9509 Descriptive Statistics
Epidemiology 9509 Principles of Biostatistics
textbook - The Wonders of Biostatistics, Chapter 2
Descriptive Statistics
John Koval
Department of Epidemiology and Biostatistics, University of Western Ontario
What we are learning today
How to describe data
1. numerically
2. pictorially
statistics for discrete data
{current smoker, former smoker, never smoker}
eg C, F, N, C, N
1. frequency: two C's, one F, two N's
2. relative frequency: C 0.40, F 0.20, N 0.40
3. percentage: C 40%, F 20%, N 40%
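The three summaries above can be sketched in a few lines of Python. This is only an illustration on the five-value smoking example; the function name `frequency_table` is hypothetical and not part of the course material.

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (frequency, relative frequency, percentage)."""
    counts = Counter(values)
    n = len(values)
    return {
        category: (count, count / n, 100 * count / n)
        for category, count in counts.items()
    }


# C = current smoker, F = former smoker, N = never smoker
smoking = ["C", "F", "N", "C", "N"]
table = frequency_table(smoking)
for category, (freq, rel, pct) in sorted(table.items()):
    print(f"{category}: frequency={freq}, relative={rel:.2f}, percentage={pct:.0f}%")
```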
statistics for ordinal data
in addition to frequency
cumulative frequency
{non-smoker, occasional smoker, regular smoker}
eg N, O, R, N, R
1. cumulative frequency (only makes sense for ordinal scale)
two N's, three N's or O's, five N's, O's or R's
2. cumulative relative frequency (only makes sense for ordinal scale)
0.40 N's, 0.60 N's or O's, 1.00 N's, O's or R's
3. cumulative percentage (only makes sense for ordinal scale)
40%, 60%, 100%
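A minimal sketch of the cumulative summaries, assuming the ordering of the levels is supplied explicitly (the ordering is what makes the running total meaningful). The name `cumulative_table` is illustrative only.

```python
from collections import Counter
from itertools import accumulate


def cumulative_table(values, ordered_levels):
    """Return (level, cumulative frequency, cumulative relative frequency,
    cumulative percentage) for each level, in the given order."""
    counts = Counter(values)
    n = len(values)
    running = list(accumulate(counts[level] for level in ordered_levels))
    return [
        (level, cum, cum / n, 100 * cum / n)
        for level, cum in zip(ordered_levels, running)
    ]


# N = non-smoker < O = occasional smoker < R = regular smoker
levels = ["N", "O", "R"]
data = ["N", "O", "R", "N", "R"]
cumulative = cumulative_table(data, levels)
for row in cumulative:
    print(row)
```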
statistics for continuous data
70, 80, 90, 100, 110
◮ mean
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{70+80+90+100+110}{5} = 90$
◮ variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{20^2+10^2+0^2+10^2+20^2}{4} = \frac{400+100+0+100+400}{4} = 250$
◮ standard deviation
$s = \sqrt{s^2} = \sqrt{250} = 15.8$
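The arithmetic above can be checked with a short sketch. The function names are illustrative; the variance uses the n − 1 divisor, as in the formula on this slide.

```python
import math


def sample_mean(xs):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def sample_sd(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


data = [70, 80, 90, 100, 110]
print(sample_mean(data), sample_variance(data), round(sample_sd(data), 1))
```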
nonparametric statistics
70, 80, 90, 100, 110
◮ median
1. rank data: put in ascending order
$x_{[1]}, x_{[2]}, \ldots, x_{[n]}$
2. if n odd, use $x_{[(n+1)/2]}$
if n even, compute average $\left( x_{[n/2]} + x_{[n/2+1]} \right)/2$
example: n = 5, median is $x_{[3]} = 90$
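The two-case rule for the median translates directly into code. This is a sketch only; note that Python lists are 0-indexed, so the 1-indexed rank x[k] on the slide becomes `ranked[k - 1]`.

```python
def median(xs):
    """Median by the rank rule: middle value if n is odd,
    average of the two middle values if n is even."""
    ranked = sorted(xs)  # step 1: rank the data
    n = len(ranked)
    if n % 2 == 1:
        return ranked[(n + 1) // 2 - 1]  # x[(n+1)/2]
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2  # (x[n/2] + x[n/2+1]) / 2


print(median([70, 80, 90, 100, 110]))
```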
nonparametric spread
◮ range
70, 80, 90, 100, 110
1. rank data
2. minimum and maximum: x[1], x[n]
3. range: (minimum, maximum) = (x[1], x[n])
for example: x[1] = 70, x[n] = 110; range is (70, 110)
quartile
1. first (lower) quartile : Q1
25% of the data are less than Q1
2. second quartile - median : Q2
50% of the data are less than Q2
3. third (upper) quartile : Q3
75% of the data are less than Q3
computing quartiles
1. q1 = 0.25 (first quartile)
q2 = 0.50 (median)
q3 = 0.75 (third quartile)
2. rank data
3. compute
3.1 if nq is an integer
quartile = $\left( x_{[nq]} + x_{[nq+1]} \right)/2$
3.2 if nq is not an integer
quartile = $x_{[\lfloor nq \rfloor + 1]}$
where $\lfloor m \rfloor$ means the greatest integer less than or equal to m
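A minimal sketch of this quartile rule, assuming 0 < q < 1. The name `quantile_by_rank` is hypothetical. Other software (including common statistics packages) may use different interpolation rules and so give slightly different quartiles.

```python
import math


def quantile_by_rank(xs, q):
    """Quantile by the rank rule on this slide (1-indexed ranks x[1..n])."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    ranked = sorted(xs)  # step 2: rank the data
    n = len(ranked)
    nq = n * q
    if nq == int(nq):
        k = int(nq)
        # step 3.1: average of x[nq] and x[nq+1]
        return (ranked[k - 1] + ranked[k]) / 2
    # step 3.2: x[floor(nq) + 1], which is index floor(nq) when 0-indexed
    return ranked[math.floor(nq)]


five = [70, 80, 90, 100, 110]
print([quantile_by_rank(five, q) for q in (0.25, 0.50, 0.75)])
```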
example
70, 80, 90, 100, 110
1. q1 = .25, nq1 = 1.25, ⌊nq1⌋ = 1
first quartile is x[2] = 80
2. q2 = .50, nq2 = 2.5, ⌊nq2⌋ = 2
median is x[3] = 90
3. q3 = .75, nq3 = 3.75, ⌊nq3⌋ = 3
third quartile is x[4] = 100
Ranges
1. quartile range (Q1, Q3)
example: (80, 100)
2. Interquartile range (IQR)
IQR = Q3 − Q1
example: IQR = 100 − 80 = 20
example two
65, 70, 80, 85, 90, 100, 110, 120
1. mean
$\bar{x} = \frac{65 + 70 + 80 + 85 + 90 + 100 + 110 + 120}{8} = 90$
2. variance
$s^2 = \frac{25^2+20^2+10^2+5^2+0^2+10^2+20^2+30^2}{7} = \frac{625+400+100+25+0+100+400+900}{7} = \frac{2550}{7} = 364.2857$
3. standard deviation
$s = \sqrt{364.2857} = 19.1$
nonparametrics
65, 70, 80, 85, 90, 100, 110, 120
1. first quartile (Q1): q1 = 0.25, so nq = 2
$Q_1 = \left( x_{[2]} + x_{[3]} \right)/2 = \frac{70+80}{2} = 75$
2. median (Q2): q2 = 0.5, so nq = 4
$\text{median} = \left( x_{[4]} + x_{[5]} \right)/2 = \frac{85+90}{2} = 87.5$
3. third quartile (Q3): q3 = 0.75, so nq = 6
$Q_3 = \left( x_{[6]} + x_{[7]} \right)/2 = \frac{100+110}{2} = 105$
nonparametrics (continued)
IQR = Q3 − Q1 = 105 − 75 = 30
Empirical Intervals
◮ x̄ ± sd
eg 90.0 ± 19.1 = (70.9, 109.1)
68% of data (if data normally distributed)
◮ x̄ ± 2sd
eg 90.0 ± 38.2 = (51.8, 128.2)
95% of data (if data normally distributed)
◮ x̄ ± 3sd
eg 90.0 ± 57.3 = (32.7, 147.3)
almost all data (if data normally distributed)
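The three intervals can be reproduced with a short sketch, using the mean 90.0 and standard deviation 19.1 from example two. The function name is illustrative; the coverage figures (68%, 95%, almost all) hold only under the normality assumption stated above and are not computed here.

```python
def empirical_intervals(mean, sd):
    """Return {k: (mean - k*sd, mean + k*sd)} for k = 1, 2, 3."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


intervals = empirical_intervals(90.0, 19.1)
for k, (low, high) in intervals.items():
    print(f"mean ± {k} sd: ({low:.1f}, {high:.1f})")
```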
Empirical Intervals - continued
1. human data are not usually normally distributed
2. these intervals are often confused with the distribution of the sample mean (to be discussed later)
Graphical Summaries
1. histograms (bar charts)
2. stem-and-leaf plots
3. box-and-whisker (box) plots
show distribution of the data graphically
histogram/bar chart
plot:
◮ y-axis is frequency/relative frequency/percentage
◮ x-axis is range of values of data
for discrete data:
at each possible value, draw a box whose height is proportional to frequency/relative frequency/percentage
sample bar chart
[Figure: bar chart. x-axis: Smoking Status (C, F, N); y-axis: Frequency (ticks at 1 and 2)]
histogram
continuous data:
like a bar chart, but the bars are contiguous
1. choose number of bins (5-10), or about log2(n)
2. create boundaries for bins that DO NOT coincide with any data values
3. count number of data points in each bin
4. plot as for bar chart
example two
65, 70, 80, 85, 90, 100, 110, 120
1. n=8, use 3 bins
2. range is 65 to 120, a width of 55
3 bins of width 20: 63-83; 83-103; 103-123
3. 3 in first bin, 3 in second bin, 2 in last bin
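The binning steps for this example can be sketched as follows. The lower boundary 63 and width 20 are taken from the slide; `bin_counts` is a hypothetical helper, and it relies on no data value falling exactly on a bin boundary.

```python
import math


def bin_counts(xs, lower, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins
    starting at `lower`."""
    counts = [0] * n_bins
    for x in xs:
        index = int((x - lower) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


data = [65, 70, 80, 85, 90, 100, 110, 120]
n_bins = round(math.log2(len(data)))  # log2(8) = 3
counts = bin_counts(data, lower=63, width=20, n_bins=n_bins)
print(n_bins, counts)
```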
sample histogram
[Figure: histogram. x-axis: Blood pressure (bin boundaries 63, 83, 103, 123); y-axis: Frequency (ticks at 1, 2, 3)]
Stem-and-leaf plot
suggested by John Tukey
quick histogram on its side
1. sort (order, rank) data
2. count number of similar first, or first and second, digits
65, 70, 80, 85, 90, 100, 110, 120
6| 5
7| 0
8| 0 5
9| 0
10| 0
11| 0
12| 0
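A minimal sketch of building such a plot, assuming non-negative integer data with the last digit as the leaf and the remaining leading digits as the stem. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(xs):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = units).
    Stems with no leaves are still listed so gaps stay visible."""
    ranked = sorted(xs)  # step 1: sort the data
    leaves = defaultdict(list)
    for x in ranked:
        stem, leaf = divmod(x, 10)
        leaves[stem].append(leaf)
    lines = []
    for stem in range(ranked[0] // 10, ranked[-1] // 10 + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem}| {row}".rstrip())
    return lines


for line in stem_and_leaf([65, 70, 80, 85, 90, 100, 110, 120]):
    print(line)
```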
second attempt
6-7| 5 0
8-9| 0 5 0
10-11| 0 0
12| 0
Tukey box-and-whisker plot
1. centred at median
2. lower hinge = Q1 ; upper hinge = Q3
3. rectangle from lower hinge to upper hinge
4. IQR is distance between lower and upper hinges
5. lower inner fence = Q1 − 1.5(IQR)
6. upper inner fence = Q3 + 1.5(IQR)
7. lower outer fence = Q1 − 3.0(IQR)
8. upper outer fence = Q3 + 3.0(IQR)
9. lower adjacent value is smallest value greater than the lower inner fence
10. upper adjacent value is largest value lower than the upper inner fence
11. whiskers drawn from hinges to the adjacent values
12. values between inner and outer fences: possible outliers; marked with an asterisk
13. values beyond the outer fences: probable outliers; marked with a zero
example two
65, 70, 80, 85, 90, 100, 110, 120
1. median is 87.5
2. lower hinge is 75; upper hinge is 105
3. IQR is 30
4. lower inner fence is 75 - 1.5(30) = 30
5. upper inner fence is 105 + 1.5(30) = 150
6. lower outer fence is 75 - 3.0(30) = -15
7. upper outer fence is 105 + 3.0(30) = 195
8. lower adjacent value is 65
9. upper adjacent value is 120
10. NO values between inner and outer fences: no possible outliers
11. NO values beyond outer fences: no probable outliers
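The fence and outlier rules can be sketched as below, taking the hinges Q1 = 75 and Q3 = 105 as given. The name `tukey_summary` is hypothetical, and treating a value that lies exactly on a fence as belonging to the band outside it is an assumption of this sketch, since the slides do not address ties.

```python
def tukey_summary(xs, q1, q3):
    """Fences, adjacent values and outlier lists for a Tukey boxplot."""
    iqr = q3 - q1
    lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_outer, upper_outer = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    inside, possible, probable = [], [], []
    for x in xs:
        if lower_inner < x < upper_inner:
            inside.append(x)      # strictly within the inner fences
        elif lower_outer <= x <= upper_outer:
            possible.append(x)    # between inner and outer fences
        else:
            probable.append(x)    # beyond the outer fences
    return {
        "inner_fences": (lower_inner, upper_inner),
        "outer_fences": (lower_outer, upper_outer),
        "adjacent_values": (min(inside), max(inside)),
        "possible_outliers": possible,
        "probable_outliers": probable,
    }


data = [65, 70, 80, 85, 90, 100, 110, 120]
summary = tukey_summary(data, q1=75, q3=105)
for name, value in summary.items():
    print(name, value)
```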
sample Tukey boxplot
SAS Proc BOXPLOT schematic
uni-dimensional
[Figure: schematic boxplot. axis: Blood pressure (60 to 130 in steps of 10)]
Modern box-and-whisker plot
1. centred at median;
2. lower hinge = Q1 ; upper hinge = Q3
3. rectangle from lower hinge to upper hinge
4. IQR is distance between lower and upper hinges
5. whiskers drawn from the hinges out to the minimum and maximum
SAS Proc BOXPLOT skeletal
sample modern boxplot
[Figure: skeletal boxplot. axis: Blood pressure (60 to 130 in steps of 10)]
Why graphical display
1. is data distribution symmetrical (Normal)?
2. is data distribution lop-sided? skewed? left or right? negative or positive?
3. are there outliers? (not handled by modern boxplot)
Histogram with right skew
[Figure: histogram with right skew. x-axis: Blood pressure (63 to 183 in steps of 20); y-axis: Frequency (ticks at 4, 8, 12)]
Histogram with left skew
[Figure: histogram with left skew. x-axis: Blood pressure (63 to 183 in steps of 20); y-axis: Frequency (ticks at 4, 8, 12)]
Histogram with possible outlier
[Figure: histogram with possible outlier. x-axis: Blood pressure (63 to 183 in steps of 20); y-axis: Frequency (ticks at 4, 8, 12)]
Using stem-and-leaf plots
65, 70, 80, 85, 90, 100, 110, 120, 140, 160
6| 5
7| 0
8| 0 5
9| 0
10| 0
11| 0
12| 0
13|
14| 0
15|
16| 0
second attempt
6-7| 5 0
8-9| 0 5 0
10-11| 0 0
12-13| 0
14-15| 0
16-17| 0
boxplot with right skew
[Figure: boxplot with right skew. axis: Blood pressure (60 to 130 in steps of 10)]
boxplot with left skew
[Figure: boxplot with left skew. axis: Blood pressure (60 to 130 in steps of 10)]
boxplot with outliers
[Figure: boxplot with outliers. axis: Blood pressure (60 to 130 in steps of 10)]
skewness
measure of asymmetry (lopsidedness) of data
1. for symmetric data, sample skewness is close to zero
2. for data skewed to the right, skewness has positive value
3. for data skewed to the left, skewness has negative value
calculating sample skewness
1. accurate value
$\frac{\sum (x_i - \bar{x})^3}{n s^3}$
2. approximation
$3 \left( \frac{\bar{x} - \text{Median}}{s} \right)$
3. second approximation (poorer than preceding)
$\frac{\bar{x} - \text{Mode}}{s}$
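The first two formulas can be sketched in Python as below. The function names are illustrative; `s` is the sample standard deviation with the n − 1 divisor, as defined earlier in these slides, and the median comes from the standard library's `statistics.median`.

```python
import math
import statistics


def sample_sd(xs):
    """Sample standard deviation with the n - 1 divisor."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))


def sample_skewness(xs):
    """'Accurate' value: sum of cubed deviations divided by n * s^3."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 3 for x in xs) / (n * sample_sd(xs) ** 3)


def median_skewness(xs):
    """Approximation: 3 * (mean - median) / s."""
    mean = sum(xs) / len(xs)
    return 3 * (mean - statistics.median(xs)) / sample_sd(xs)


symmetric = [70, 80, 90, 100, 110]
print(sample_skewness(symmetric), median_skewness(symmetric))
```

The mode-based approximation is omitted from the sketch because a mode is not well defined when every value occurs once, as in these example datasets.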
Example
For the dataset {70, 80, 90, 100, 110}, sample skewness is
$\frac{\sum (x_i - \bar{x})^3}{n s^3} = \frac{(-20)^3 + (-10)^3 + 0^3 + 10^3 + 20^3}{5(250)(15.8)} = 0$
Example two
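For the dataset {65, 70, 80, 85, 90, 100, 110, 120}, sample skewness is
$\frac{\sum (x_i - \bar{x})^3}{n s^3} = \frac{(-25)^3 + (-20)^3 + (-10)^3 + (-5)^3 + 0^3 + 10^3 + 20^3 + 30^3}{8(364.2857)(19.086)}$
$= \frac{-15625 - 8000 - 1000 - 125 + 0 + 1000 + 8000 + 27000}{55622.8416} = \frac{11250}{55622.8416} = 0.202$
or, using the median approximation,
$3 \left( \frac{90 - 87.5}{19.086} \right) = \frac{3(2.5)}{19.086} = 0.393$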