7/30/2019 (8) Measures of Dispersion
Applied Statistics and Computing Lab
MEASURES OF DISPERSION
Applied Statistics and Computing Lab
Indian School of Business
Learning goals
- To understand the need for studying dispersion
- To understand the idea behind measures of dispersion
- To study different measures of dispersion

Additional topics
- Standardization of a variable
- Skewness and kurtosis
- Five-point summary
Need to study dispersion
Two patients are admitted to the Intensive Care Unit of a hospital. The night before their operation, the doctor makes a last visit at 9pm; the blood pressure of Patient 1 is 110/80 and that of Patient 2 is 120/70. Although both readings are normal, the doctor asks the nurse, as a precaution, to check their blood pressure every 2 hours. At 7.30 the next morning, the nurse reports that the average blood pressure of both patients was normal, 120/80. The chart of their actual blood pressures was:

Time       11pm    1am     3am     5am     7am
Patient 1  120/80  100/80  100/60  130/80  150/100
Patient 2  110/60  100/60  100/70  130/90  160/120
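A minimal R sketch of the point this chart makes: both patients share the same average reading, yet their readings spread differently. The data frame layout and the names `bp`, `bp_mean` and `bp_range` are illustrative assumptions; the numbers are copied from the chart.

```r
# Readings copied from the chart (systolic / diastolic).
bp <- data.frame(
  patient   = rep(c("Patient 1", "Patient 2"), each = 5),
  systolic  = c(120, 100, 100, 130, 150, 110, 100, 100, 130, 160),
  diastolic = c(80, 80, 60, 80, 100, 60, 60, 70, 90, 120)
)

# Mean per patient: the same for both patients.
bp_mean <- aggregate(cbind(systolic, diastolic) ~ patient, data = bp, FUN = mean)

# Range (maximum - minimum) per patient: the spread differs.
bp_range <- aggregate(cbind(systolic, diastolic) ~ patient, data = bp,
                      FUN = function(v) max(v) - min(v))

print(bp_mean)
print(bp_range)
```

The averages alone would not tell the doctor that Patient 2's readings swing more widely than Patient 1's.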
Need to study dispersion (contd.)
- What if the doctor decides to operate on the patients without looking at the blood pressure chart?
- What if someone decides to visit a tourist destination next week based only on last week's average temperature, as given in our data?
- What if I am interested in working for company X (which is visiting our campus) and I am given only the mean salary of its employees?
- In an extreme case, the same measure of central tendency could equally describe a dataset in which every value is the same constant
[Figure: data values on a number line from 0 to 13, with the median and the mode marked]
Examples
- Variability in temperature through the week
- Scatter of horsepower capacities among the cars available
- Spread of the prices at which varieties of a single product (say, rice varieties) are available
- Variability in returns on investments
Need for measures of dispersion (contd.)
- Helps determine the reliability of the measure of central tendency
- Facilitates comparison of two sets of data
- Useful for building further statistical measures
Desired properties
- A good measure should not be highly affected if the data change slightly
- A good measure should be representative of the majority of the data
- A good measure should allow us to declare an interval within which most of the values lie, with a certain degree of confidence
Dataset
- Body measurements on 507 individuals: 247 men and 260 women
- Primarily in their 20s and 30s, with some exceptions
- All individuals exercise several hours a week
- From the 28 variables in this dataset, we consider Gender (1 = Male, 0 = Female) and Weight (in kg)
Data source: Measurements collected by authors Grete Heinz and Louis J. Peterson for their study
Dataset (contd.)

                    Female  Male   Overall
Min. weight (kg)    42      53.9   42
Max. weight (kg)    105.2   116.4  116.4
Mean weight (kg)    60.6    78.14  69.15
Median weight (kg)  59      77.3   68.2
Evaluating dispersion
1. Consider the boundaries (measures based on selected values)
   - Report the extreme values
2. Consider distance from a central tendency (measures based on all the values)
   - Build an absolute measure
   - Calculate a coefficient
     High coefficient: large spread, high variability
     Small coefficient: small spread, less variability
1. Considering the boundaries
- These measures consider and report only the boundaries of the data
- They try to capture how far the values of the variable reach
- The spread of the data is not considered relative to any central tendency
- These measures overlook the patterns of values within the boundaries
Minimum and maximum values
ADVANTAGES:
- Useful when a range of tolerance exists, i.e. if values beyond a certain limit are harmful or unacceptable
- Easy to compute and understand
DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Range = (Maximum value) - (Minimum value)
ADVANTAGES:
- Easy comparison of variability across datasets
- Easy to compute and understand
DISADVANTAGES:
- Ignores any pattern in the data
- Ignores most of the data

Inter-quartile range = (3rd quartile) - (1st quartile)
ADVANTAGES:
- Highlights the middle portion of the distribution of values
- Easy to understand
DISADVANTAGES:
- More difficult to compute than min-max and range
- Ignores irregularities at the extremes
- Ignores 25% of the data on each side

                             Female       Male           Overall
(Min. weight, Max. weight)   (42, 105.2)  (53.9, 116.4)  (42, 116.4)
Weight range                 63.2         62.5           74.4
Weight inter-quartile range  11.1         14.55          20.45
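The boundary-based measures can be sketched in base R as follows. The vector `w` is a hypothetical set of weights used only for illustration (it is not the body measurement data), and `quantile()` is left at R's default quartile method, which may differ slightly from other textbook conventions.

```r
# Hypothetical weights in kg, for illustration only.
w <- c(48.5, 52.0, 55.3, 58.4, 60.1, 63.7, 68.2, 71.9, 90.6)

extremes  <- range(w)                            # c(minimum, maximum)
w_range   <- diff(extremes)                      # maximum - minimum
quartiles <- quantile(w, probs = c(0.25, 0.75))  # 1st and 3rd quartiles
w_iqr     <- IQR(w)                              # 3rd quartile - 1st quartile

# Adding one very large value stretches the range far more than the IQR.
w_out <- c(w, 140)
print(c(range = w_range, iqr = w_iqr))
print(c(range = diff(range(w_out)), iqr = IQR(w_out)))
```

This mirrors the trade-off listed for each measure: the range reacts to a single extreme value, while the inter-quartile range ignores the outer 25% on each side.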
2. Considering distance from central tendency
- Consider the deviations of the values from the measure of central tendency
- What if we simply sum all these deviations?
- Consider a hypothetical dataset (1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
- Mean = Median = 4
- Consider $\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - 4) = 0$: the positive and negative deviations cancel out
- Taking absolute values or squares of the deviations ensures that we consider only their magnitudes
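As a quick check of the cancellation argument, the following R sketch uses the hypothetical dataset from this slide; the variable names are illustrative.

```r
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)  # hypothetical dataset from the slide
dev <- x - mean(x)                             # deviations from the mean (4)

sum_dev     <- sum(dev)        # signed deviations cancel out
sum_abs_dev <- sum(abs(dev))   # magnitudes only
sum_sq_dev  <- sum(dev^2)      # squared magnitudes

print(c(sum_dev = sum_dev, sum_abs_dev = sum_abs_dev, sum_sq_dev = sum_sq_dev))
```

The signed sum is zero for any dataset when deviations are taken from the mean, which is why the measures that follow use absolute or squared deviations.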
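Absolute deviations
For a dataset consisting of n observations $x_1, \dots, x_n$ with mean $\bar{x}$ and median $M$:
- Absolute deviations: $|x_i - \bar{x}|$ or $|x_i - M|$
- Mean absolute deviation from mean = $\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
- Mean absolute deviation from median = $\frac{1}{n} \sum_{i=1}^{n} |x_i - M|$
- Median absolute deviation from median = $\operatorname{median}(|x_i - M|)$

                                        Female weights  Male weights
Mean absolute deviation from mean       7.33            8.58
Mean absolute deviation from median     7.19            8.57
Median absolute deviation from median   5.1             7.2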
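The three absolute-deviation measures can be computed directly in base R. This sketch reuses the deck's hypothetical dataset (1, 1, 2, ..., 7, 7) rather than the weight data; the variable names are illustrative.

```r
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

mean_ad_mean     <- mean(abs(x - mean(x)))      # mean absolute deviation from mean
mean_ad_median   <- mean(abs(x - median(x)))    # mean absolute deviation from median
median_ad_median <- median(abs(x - median(x)))  # median absolute deviation from median

# Base R's mad() multiplies by 1.4826 by default; constant = 1 gives the raw value.
median_ad_builtin <- mad(x, constant = 1)

print(c(mean_ad_mean, mean_ad_median, median_ad_median, median_ad_builtin))
```

Because the mean and median coincide for this dataset, the first two measures agree; for skewed data such as the female weights they generally differ.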
Measures based on squared deviation
For a dataset consisting of n observations,
Variance = $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
To obtain a measure in the same units as the original data, we take the square root:
Standard deviation = $\sigma = \sqrt{\sigma^2}$

Variance (Weight, males) = 110.52
Variance (Weight, females) = 92.46
Standard deviation (Weight, males) = 10.51
Standard deviation (Weight, females) = 9.62
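One practical caveat when moving from the definition to R: the deck defines variance as the sum of squared deviations divided by n, whereas base R's `var()` and `sd()` divide by n - 1. A small sketch on the deck's hypothetical dataset shows both conventions; the variable names are illustrative.

```r
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
n <- length(x)

var_n <- mean((x - mean(x))^2)  # divides by n
sd_n  <- sqrt(var_n)

var_n1 <- var(x)                # base R divides by n - 1
sd_n1  <- sd(x)

# The two conventions differ only by the factor (n - 1) / n.
print(c(var_n = var_n, var_n1 = var_n1, rescaled = var_n1 * (n - 1) / n))
```

For a dataset as large as the 507 body measurements the two values are nearly identical, but for small samples the difference is visible.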
Relative measures of dispersion
- Coefficient of range: $\frac{\max - \min}{\max + \min}$
  - Always lies in [0, 1] for non-negative data
  - The higher the coefficient, the broader the range
  - Weight, females = 0.43; Weight, males = 0.37
- Coefficient of variation: $\frac{\sigma}{\bar{x}} \times 100$
  - Computes the variability per unit mean
  - Indicates how consistent the data are with respect to their mean
  - The higher the coefficient, the more spread out the observations
  - Weight, females = 15.87; Weight, males = 13.45
- Relative to their mean, the weights of females are more spread out than those of males
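The reported coefficients can be reproduced from the summary values given on the earlier slides (minimum, maximum, mean and standard deviation of weight by gender). The helper functions `coef_range()` and `coef_var()` are illustrative names, not functions from any package.

```r
# Illustrative helpers for the two relative measures.
coef_range <- function(lo, hi) (hi - lo) / (hi + lo)
coef_var   <- function(s, m) 100 * s / m

# Summary values as reported on the slides (weights in kg).
females <- list(min = 42.0, max = 105.2, mean = 60.60, sd = 9.62)
males   <- list(min = 53.9, max = 116.4, mean = 78.14, sd = 10.51)

cr <- c(females = coef_range(females$min, females$max),
        males   = coef_range(males$min, males$max))
cv <- c(females = coef_var(females$sd, females$mean),
        males   = coef_var(males$sd, males$mean))

print(round(cr, 2))
print(round(cv, 2))
```

Although males have the larger standard deviation in kg, dividing by the mean reverses the ordering, which is the comparison the slide draws.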
Comparing measures of dispersion
All the measures that consider distance from a central tendency are based on all the values.

ADVANTAGES:
- Absolute deviations are less affected by extreme values than squared deviations
- Absolute deviations are easy to understand and interpret
- Median absolute deviation is the least affected by slight changes in the data, across all measures of dispersion
- Variance and standard deviation are the most popular measures of dispersion because they are algebraically amenable and play an important part in building and evaluating further statistical measures
- Standard deviation is easier to understand than variance, as it is in the same units as the original data

DISADVANTAGES:
- Algebraic manipulation of measures based on absolute deviations is difficult
- Variance is the most affected by extreme values, as it is based on squared deviations
- Standard deviation is not very easy to compute
- Standard deviation cannot be calculated for data with open-ended classes

COEFFICIENTS:
- Coefficients are free of units and therefore facilitate comparison
- Useful even when two variables are measured in different units
Standardization
- Standardized variable of X: $Z = \frac{X - \bar{x}}{\sigma}$
- Mean of the standardized variable = 0
- Variance of the standardized variable = 1
- Standardized variables are free of units
- Therefore, measures of variation of standardized variables are comparable
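A short R sketch of these properties, under two assumptions: the data are hypothetical weights, and the standard deviation is R's `sd()` (which divides by n - 1), matching the standardization one-liner in the deck's R code table. The function name `standardize` is illustrative.

```r
standardize <- function(x) (x - mean(x)) / sd(x)

# Hypothetical measurements of the same quantity in two different units.
weight_kg <- c(52.0, 58.4, 60.1, 68.2, 71.9, 90.6)
weight_lb <- weight_kg * 2.20462

z_kg <- standardize(weight_kg)
z_lb <- standardize(weight_lb)

print(c(mean = mean(z_kg), variance = var(z_kg)))  # approximately 0 and 1
print(all.equal(z_kg, z_lb))                       # the unit of measurement drops out
```

Base R's `scale()` performs the same centring and scaling and returns a matrix, which can be convenient when standardizing several columns at once.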
Example
- How is the weight of a newborn affected by whether the mother smokes? Further, does smoking affect the perinatal mortality rate, which varies with birth weight?
- Yerushalmy J. found in his 1971 paper that, although low birth weight is associated with an increase in the number of babies who die shortly after birth, the babies of smokers tended to have much lower death rates than the babies of nonsmokers.*
- In this study, he compared perinatal death rates after grouping babies by birth weight
- In 1986 and 1993, Wilcox & Russell and Wilcox (respectively) strongly recommended that babies be grouped by their relative (or standardized) birth weight, rather than by their absolute weight (in kg)
- What happened then?
[Table: from Yerushalmy J. (1971)** (weights measured in grams)]
* and ** taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications
Example (contd.)
[Graphs taken from Deborah Nolan and Terry Speed's Stat Labs: Mathematical Statistics Through Applications]
Further to deviations
- Variance = $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$ is the sum of squares of deviations from the mean divided by n, or the expected value of the squared deviation of X from its mean
- Expected values of higher powers of deviations from the mean give additional information about the distribution of the data
- The expected value of any power (say the r-th power) of the deviations of a variable X from its mean is called the r-th central moment of that variable: $\mu_r = E[(X - E(X))^r] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$
- Central moments depict the spread and shape of the data
- Variance is the 2nd central moment
- Measures using the 3rd and 4th central moments are useful for understanding the shape of the distribution
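Central moments are straightforward to compute from their definition. The helper `central_moment()` below is an illustrative name, and the data are the deck's hypothetical dataset (1, 1, 2, ..., 7, 7).

```r
# r-th central moment: average of the r-th powers of deviations from the mean.
central_moment <- function(x, r) mean((x - mean(x))^r)

x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

moments_1_to_4 <- sapply(1:4, function(r) central_moment(x, r))
names(moments_1_to_4) <- paste0("mu", 1:4)
print(moments_1_to_4)
```

The first central moment is always zero, the second is the variance (with divisor n), and the third is zero here because this dataset is symmetric about its mean.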
Skewness
- Skewness is a measure of symmetry (or the lack of it) in a dataset
- A distribution is right-skewed or positively skewed if it stretches asymmetrically to the right
- It is left-skewed or negatively skewed if the asymmetric stretch is on the left
- Measuring skewness using moments: coefficient of skewness = $\frac{\mu_3}{\sigma^3} = \frac{\mu_3}{\mu_2^{3/2}}$
- Note that if a distribution is perfectly symmetric, $\mu_3 = 0$
- The sign of the coefficient = the sign of $\mu_3$
- A coefficient of skewness closer to zero indicates a more symmetric distribution
Visuals from Aczel A., Sounderpandian J., Complete Business Statistics
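A moment-based skewness coefficient can be sketched in base R as the third central moment divided by the 3/2 power of the second. The function name `skewness_moment` and the skewed vectors are illustrative assumptions; only the symmetric vector comes from the deck.

```r
skewness_moment <- function(x) {
  m2 <- mean((x - mean(x))^2)  # 2nd central moment
  m3 <- mean((x - mean(x))^3)  # 3rd central moment
  m3 / m2^(3 / 2)
}

symmetric    <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)  # from the deck
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 15)          # hypothetical
left_skewed  <- -right_skewed                             # mirror image

skews <- sapply(list(symmetric = symmetric,
                     right_skewed = right_skewed,
                     left_skewed = left_skewed),
                skewness_moment)
print(skews)
```

Mirroring the data flips the sign of the coefficient without changing its magnitude, consistent with the sign rule stated on the slide.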
Kurtosis
- Kurtosis is a measure of the peakedness of a dataset
- The reference value for kurtosis is 3, the kurtosis of the normal distribution; such a curve is called mesokurtic
- A value larger than 3 indicates a more sharply peaked distribution with heavier (longer) tails; such a curve is called leptokurtic
- A value smaller than 3 indicates a flatter distribution with lighter (shorter) tails; such a curve is called platykurtic
- Measuring kurtosis using moments: kurtosis = $\frac{\mu_4}{\sigma^4} = \frac{\mu_4}{\mu_2^2}$
[Figure: three frequency curves. The red line is a long-tailed distribution, the blue line is a short-tailed distribution, and the black line is the standard bell curve]
Visual from http://whatilearned.wikia.com/wiki/File:Kurtosis.jpg
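The moment-based kurtosis can be sketched the same way, as the fourth central moment divided by the square of the second, and compared against the reference value 3. The function name `kurtosis_moment` is illustrative; the data are the deck's hypothetical dataset.

```r
kurtosis_moment <- function(x) {
  m2 <- mean((x - mean(x))^2)  # 2nd central moment
  m4 <- mean((x - mean(x))^4)  # 4th central moment
  m4 / m2^2
}

x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)
k <- kurtosis_moment(x)

# Compare with the reference value 3.
shape <- if (k > 3) "leptokurtic" else if (k < 3) "platykurtic" else "mesokurtic"
print(c(kurtosis = k))
print(shape)
```

This nearly uniform dataset has a kurtosis well below 3, so it falls in the platykurtic category.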
Example
Table of the gender-wise skewness and kurtosis of weights:

                Skewness  Kurtosis
Female          1.14      5.59
Male            0.29      3.15
Entire dataset  0.40      2.65
Example (contd.)
- Skewness and kurtosis capture numerically the information presented in a histogram
- The histogram of weights of females is highly stretched to the right, leading to a high positive skewness of 1.14
- The stretch of the histogram for weights of the entire dataset is moderate and much less than that for weights of females. This is reflected in the lower skewness of 0.40
- The weights of males are stretched almost equally on both sides of the centre, giving a skewness as close to zero as 0.29
- Skewness and kurtosis shed light on important characteristics such as symmetry and peakedness
- They give information about the distribution of the data beyond what the measures of central tendency and dispersion provide
Point summary
- A very useful and practical application of measures of central tendency and dispersion
- 5-point summary: Minimum, 1st quartile, Median, 3rd quartile, Maximum
- 6-point summary: Minimum, 1st quartile, Median, Mean, 3rd quartile, Maximum
- Gives an idea of the extreme values, the values within which the middle 50% of the values lie, and the centrality of the data
- 6-point summary of Weights in the body measurement data:

Min.  1st Qu.  Median  Mean   3rd Qu.  Max.
42    58.4     68.2    69.15  78.85    116.4
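Both summaries are available in base R. This sketch applies them to the deck's hypothetical dataset (1, 1, 2, ..., 7, 7) rather than to the weight data; note that `fivenum()` uses Tukey's hinges, which can differ slightly from the quartiles that `summary()` and `quantile()` report.

```r
x <- c(1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7)

five_point <- fivenum(x)  # minimum, lower hinge, median, upper hinge, maximum
six_point  <- summary(x)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

print(five_point)
print(six_point)

# The middle 50% of the values lie between the 1st and 3rd quartiles.
middle_half <- quantile(x, probs = c(0.25, 0.75))
print(middle_half)
```

For this dataset the hinges and the quartiles coincide, so the two summaries differ only by the mean.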
Measure                                R code (x is a numeric vector)
Minimum                                min(x)
Maximum                                max(x)
Range                                  diff(range(x))   # range(x) alone returns c(min, max)
Inter-quartile range                   IQR(x)
Mean absolute deviation about mean     mean(abs(x - mean(x)))
Mean absolute deviation about median   mean(abs(x - median(x)))
Median absolute deviation about median median(abs(x - median(x)))
Variance                               var(x)           # divides by n - 1
Standard deviation                     sd(x)            # divides by n - 1
Coefficient of range                   (max(x) - min(x)) / (max(x) + min(x))
Coefficient of variation               library(raster); cv(x)
Standardization of a variable          function(x) {(x - mean(x)) / sqrt(var(x))}
Skewness and kurtosis                  library(moments); skewness(x); kurtosis(x)
6-point summary                        summary(x)
Thank you