dana-whitehead
Measures of Dispersion
What is Dispersion?
Dispersion refers to the way in which quantitative data values are dispersed or spread out in a dataset.
The most powerful dispersion statistics calculate the quantitative spread of the data values around the arithmetic mean and are called
measures of deviation.
The various measures of deviation calculate the arithmetic differences between each data value
and the arithmetic mean of the dataset.
Why bother with measuring deviation?
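Consider the following datasets:
3, 3, 3, 3, 3
1, 1, 1, 2, 10
First we calculate their arithmetic means using:
\[ \bar{x} = \frac{\sum x}{n} \]
First dataset: \( \bar{x} = 3 \). Second dataset: \( \bar{x} = 3 \).
Are they the same? According to the mean they are.
Then we calculate their standard deviations using:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
First dataset: \( s = 0 \). Second dataset: \( s = 3.94 \).
Same means, very different standard deviations. So are the datasets the same, or not?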
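The comparison of the two five-value datasets can be checked with a short sketch. It assumes Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, stdev

# The two five-value datasets from the slide.
dataset_a = [3, 3, 3, 3, 3]
dataset_b = [1, 1, 1, 2, 10]

mean_a, mean_b = mean(dataset_a), mean(dataset_b)

# stdev() is the sample standard deviation (n - 1 in the denominator).
s_a, s_b = stdev(dataset_a), stdev(dataset_b)

print(f"means: {mean_a} and {mean_b}")
print(f"standard deviations: {s_a:.2f} and {s_b:.2f}")
```

The means agree, while the standard deviations do not, which is the point of measuring deviation at all.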
Measures of Dispersion and Deviation
The Range (a measure of dispersion):
The range is the difference between the lowest value (called MIN) and the highest value (called MAX) in a dataset.
The Standard Deviation (a measure of deviation):
Measures the typical difference between a data value and the arithmetic mean of all data values.
The Variance (a measure of deviation):
Averages the squared differences between each data value and the arithmetic mean of the dataset. Thus it is the standard deviation squared.
The Range (Range = MAX – MIN)
The range describes the span of your dataset, from the minimum value (MIN) to the maximum value (MAX), using:
Range = MAX – MIN
Used as a measure of data dispersion NOT deviation, because deviation implies a difference between your data values and something, e.g. the arithmetic mean.
The Range is used in finding histogram (or bar chart) classes.
               Dataset #1       Dataset #2       Dataset #3
               $45,000.00       $80,000.00       $80,000.00
               $45,000.00       $45,000.00       $45,000.00
               $45,000.00       $45,000.00       $45,000.00
               $43,000.00       $43,000.00       $43,000.00
               $41,000.00       $41,000.00       $41,000.00
               $40,000.00       $40,000.00       $40,000.00
               $37,000.00       $37,000.00       $37,000.00
               $35,000.00       $35,000.00            $1.00
Sum           $331,000.00      $366,000.00      $331,001.00
n                       8                8                8
Mean           $41,375.00       $45,750.00       $41,375.13
Median         $42,000.00       $42,000.00       $42,000.00
Mode           $45,000.00       $45,000.00       $45,000.00
MAX            $45,000.00       $80,000.00       $80,000.00
MIN            $35,000.00       $35,000.00            $1.00
Range          $10,000.00       $45,000.00       $79,999.00
Even the range is telling us more about the data than just the
central tendency measures do.
Compare dataset #1 with #3.
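As a minimal sketch of the range calculation for the three datasets in the table, assuming plain Python lists (the helper name `value_range` is hypothetical):

```python
# The three salary datasets from the table, in dollars.
dataset_1 = [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]
dataset_2 = [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]
dataset_3 = [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1]


def value_range(values):
    """Range = MAX - MIN."""
    return max(values) - min(values)


ranges = [value_range(d) for d in (dataset_1, dataset_2, dataset_3)]
print(ranges)
```

Datasets #1 and #3 have nearly the same mean and the same median and mode, yet their ranges differ by almost a factor of eight.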
The Standard Deviation (s)
The standard deviation measures the typical difference between a data value and the arithmetic mean of all data values. It is given by:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Where:
s is the sample standard deviation
x is a value in the dataset
\( \bar{x} \) is the arithmetic mean of the dataset
n is the number of values in the dataset
The standard deviation and the variance are related insofar as s is the square root of the variance (or the variance is s²). s is the most widely used measure of deviation, though it should always be used in conjunction with the variance.
Interpreting the Standard Deviation Formula
Subtract the arithmetic mean \( \bar{x} \) from each data value x and sum the differences:
\[ \sum (x - \bar{x}) \]
But this returns a set of plus and minus differences that add to zero. So to remove the signs we square each difference and sum the squared differences …
\[ \sum (x - \bar{x})^2 \]
… then divide by n-1 and take the square root to return to the magnitude (and units) of the original values:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
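The steps of the formula can be traced one at a time. This is only a sketch, reusing the small dataset 1, 1, 1, 2, 10 from earlier in the deck; the variable names are illustrative.

```python
import math

data = [1, 1, 1, 2, 10]
n = len(data)
x_bar = sum(data) / n  # arithmetic mean

# Step 1: differences from the mean; the plus and minus values cancel.
deviations = [x - x_bar for x in data]
sum_of_deviations = sum(deviations)

# Step 2: square each difference to remove the signs, then sum.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide by n - 1 and take the square root.
s = math.sqrt(sum_of_squares / (n - 1))

print(sum_of_deviations, sum_of_squares, round(s, 2))
```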
A reminder of the effect of squaring…
[Chart: the numbers 1 to 20 and their squares; vertical axis from 0 to 450; series: numbers, squares. The numbers form an arithmetic progression; the squares form a quadratic progression.]
  #    #²      #    #²
  1     1     11   121
  2     4     12   144
  3     9     13   169
  4    16     14   196
  5    25     15   225
  6    36     16   256
  7    49     17   289
  8    64     18   324
  9    81     19   361
 10   100     20   400
… it emphasizes higher values
   x    x - mean    (x - mean)²    √((x - mean)²)
   1      -9.5         90.25            9.5
   2      -8.5         72.25            8.5
   3      -7.5         56.25            7.5
   4      -6.5         42.25            6.5
   5      -5.5         30.25            5.5
   6      -4.5         20.25            4.5
   7      -3.5         12.25            3.5
   8      -2.5          6.25            2.5
   9      -1.5          2.25            1.5
  10      -0.5          0.25            0.5
  11       0.5          0.25            0.5
  12       1.5          2.25            1.5
  13       2.5          6.25            2.5
  14       3.5         12.25            3.5
  15       4.5         20.25            4.5
  16       5.5         30.25            5.5
  17       6.5         42.25            6.5
  18       7.5         56.25            7.5
  19       8.5         72.25            8.5
  20       9.5         90.25            9.5
Mean of x: 10.5    Sum of (x - mean): 0.0
Why Squares and Roots?
The difference \( x - \bar{x} \) produces negative numbers and a sum of zero, but …
… the square of a number is never negative, and …
… differences between squares increase more rapidly than differences between original numbers, so …
… taking the square root of each squared difference simply returns it to its original magnitude, and also removes the sign.
[Chart: the numbers 1 to 20 on the horizontal axis; vertical axis from 0 to 450; series: numbers, squares, num diff, square diffs; labels: number, square]
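The claim that differences between squares grow faster than differences between the original numbers can be illustrated with a brief sketch. The list names below are illustrative and are only assumed to correspond to the chart's series.

```python
numbers = list(range(1, 21))
squares = [n * n for n in numbers]

# Differences between neighbouring values in each list.
number_diffs = [b - a for a, b in zip(numbers, numbers[1:])]
square_diffs = [b - a for a, b in zip(squares, squares[1:])]

print(number_diffs[:5], number_diffs[-1])  # constant step
print(square_diffs[:5], square_diffs[-1])  # steadily growing step
```

The step between neighbouring numbers stays at 1, while the step between neighbouring squares keeps growing, which is why squaring emphasizes the more extreme differences.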
This is a list of numbers, x.
               Dataset #1       Dataset #2       Dataset #3
               $45,000.00       $80,000.00       $80,000.00
               $45,000.00       $45,000.00       $45,000.00
               $45,000.00       $45,000.00       $45,000.00
               $43,000.00       $43,000.00       $43,000.00
               $41,000.00       $41,000.00       $41,000.00
               $40,000.00       $40,000.00       $40,000.00
               $37,000.00       $37,000.00       $37,000.00
               $35,000.00       $35,000.00            $1.00
Sum           $331,000.00      $366,000.00      $331,001.00
n                       8                8                8
Mean           $41,375.00       $45,750.00       $41,375.13
Median         $42,000.00       $42,000.00       $42,000.00
Mode           $45,000.00       $45,000.00       $45,000.00
MAX            $45,000.00       $80,000.00       $80,000.00
MIN            $35,000.00       $35,000.00            $1.00
Range          $10,000.00       $45,000.00       $79,999.00
s               $3,852.18       $14,290.36       $21,559.86
Standard Deviation and the ‘Shape’ of Data (Review Slide)
[Figure: three frequency curves centred on the mean \( \bar{x} \), labelled ‘Small’ standard deviation, ‘Normal’ standard deviation and ‘Large’ standard deviation; vertical axis: Frequency]
Low s means that the data are clustered around the mean (the curve looks leptokurtic or ‘peaked’).
High s means that the data are spread out around the mean (the curve looks platykurtic or ‘flat’).
REMEMBER: s values do not indicate skewness. They do indicate kurtosis.
This ‘peakedness’ of the distribution is called kurtosis. Use the kurtosis statistic to test for normality.
The Variance (s²)
The variance averages the squared differences between each data value and the arithmetic mean of the dataset. It is given by:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
Where:
s² is the sample variance
x is a value in the dataset
\( \bar{x} \) is the arithmetic mean of the dataset
n is the number of values in the dataset
Since it uses the arithmetic mean, it is subject to the same effect of extreme values – except much more because of
the effect of squaring.
Interpreting the Variance Formula
Subtract the arithmetic mean \( \bar{x} \) from each data value x and sum the differences:
\[ \sum (x - \bar{x}) \]
But this returns a set of plus and minus differences that add to zero.
So to remove the signs we square each difference and sum the squared differences:
\[ \sum (x - \bar{x})^2 \]
Finally, divide by n-1:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
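A hedged sketch of the variance formula applied to Dataset #1 from the salary table follows. The manual calculation mirrors the formula; the standard library's `statistics.variance` is used only as a cross-check, and the variable names are illustrative.

```python
import math
from statistics import variance

dataset_1 = [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]

n = len(dataset_1)
x_bar = sum(dataset_1) / n

# Sum of squared differences from the mean, divided by n - 1.
sum_of_squares = sum((x - x_bar) ** 2 for x in dataset_1)
s_squared = sum_of_squares / (n - 1)

# Cross-check against the library's sample variance.
library_value = variance(dataset_1)

print(round(s_squared, 2), math.isclose(s_squared, library_value))
```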
Variance and SD Compared
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
By squaring the differences you remove the negative signs and exaggerate more extreme differences to make them more obvious for analysis.
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
By taking the square root you return the differences to their original magnitude, but the signs are removed so the differences no longer sum to zero.
In comparing the two, when s is small, the difference between the variance (s²) and s is smaller than when s is large: that’s what happens when you square numbers.
               Dataset #1       Dataset #2       Dataset #3
               $45,000.00       $80,000.00       $80,000.00
               $45,000.00       $45,000.00       $45,000.00
               $45,000.00       $45,000.00       $45,000.00
               $43,000.00       $43,000.00       $43,000.00
               $41,000.00       $41,000.00       $41,000.00
               $40,000.00       $40,000.00       $40,000.00
               $37,000.00       $37,000.00       $37,000.00
               $35,000.00       $35,000.00            $1.00
Sum           $331,000.00      $366,000.00      $331,001.00
n                       8                8                8
Mean           $41,375.00       $45,750.00       $41,375.13
Median         $42,000.00       $42,000.00       $42,000.00
Mode           $45,000.00       $45,000.00       $45,000.00
MAX            $45,000.00       $80,000.00       $80,000.00
MIN            $35,000.00       $35,000.00            $1.00
Range          $10,000.00       $45,000.00       $79,999.00
s               $3,852.18       $14,290.36       $21,559.86
s²         $14,839,285.71  $204,214,285.71  $464,827,464.41
Note that the highest s is 5.6 times the lowest whereas the highest s2 is 31 times the lowest – this is the effect of
squaring extreme values
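The two ratios quoted in the note can be reproduced with a short sketch, assuming the standard `statistics` module and the Dataset #1 and Dataset #3 values from the table (variable names are illustrative):

```python
from statistics import stdev, variance

dataset_1 = [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]
dataset_3 = [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1]

# Highest versus lowest standard deviation, and the same for the variance.
s_ratio = stdev(dataset_3) / stdev(dataset_1)
s2_ratio = variance(dataset_3) / variance(dataset_1)

print(round(s_ratio, 1), round(s2_ratio))
```

Because the variance is s squared, its ratio is simply the square of the ratio of the standard deviations.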
N and n-1
Why do the sample standard deviation and sample variance (in fact, sample anything) formulas have n-1 as the denominator?
Because n-1 gives a more conservative estimate of deviation by increasing the standard deviation and variance values.
If you have a larger standard deviation or variance, you have a higher standard to pass in making your case. Why?
Because if you are testing to see if a data value is 1.96 s away from the mean of its dataset, then a larger s means the data value has to
meet a stricter test – i.e. it has to be higher.
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Sample versus population: n-1 versus N
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Columns:
(1) Sample size (n)
(2) Value of numerator in standard deviation formula
(3) Biased estimate of population standard deviation (i.e. dividing by N)
(4) Unbiased estimate of population standard deviation (dividing by n-1)
(5) Difference between biased and unbiased estimates
(6) Difference (%)

 (1)     (2)    (3)                        (4)                            (5)      (6)
   10    500    √(500/10)   = 7.07         √(500/(10-1))   = 7.45         .38      5.0%
  100    500    √(500/100)  = 2.24         √(500/(100-1))  = 2.25         .01      0.4%
 1000    500    √(500/1000) = 0.7071       √(500/(1000-1)) = 0.7075       .0004    0.056%
Source: After Salkind, page 40.
Note:
1. With n-1 the standard deviation is higher.
2. The larger the sample, the smaller the effect of n-1.
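The comparison of dividing by N and by n-1 can be recomputed with a few lines. This is a sketch under the same assumption as the table, namely a fixed numerator of 500 for every sample size; the names are illustrative.

```python
import math

numerator = 500  # fixed sum of squared differences, as in the table

rows = []
for n in (10, 100, 1000):
    biased = math.sqrt(numerator / n)          # dividing by N
    unbiased = math.sqrt(numerator / (n - 1))  # dividing by n - 1
    rows.append((n, biased, unbiased, unbiased - biased))

for n, biased, unbiased, difference in rows:
    print(f"{n:>5} {biased:.4f} {unbiased:.4f} {difference:.4f}")
```

The gap between the two estimates shrinks quickly as n grows, matching the second note.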
Interpreting Variance & Standard Deviation
s gives the typical difference between each data value and the mean of a dataset, and s² squares it and so exaggerates it.
The larger the values of s and s², the more spread out the data values are and the larger the differences between them.
If s and s² are equal to zero then there are no differences between your data values.
The standard deviation and the variance each require an arithmetic mean to work, not the median or the
mode. Therefore they require the same rigour as the mean and are sensitive to extreme values as well,
especially the variance.
The Coefficient of Variation (Cv)
Calculating the Coefficient Of Variation
The equation for the sample coefficient of variation is:
\[ C_v = \frac{s}{\bar{x}} \times 100 \]
And, for the population:
\[ C_v = \frac{\sigma}{\mu} \times 100 \]
Interpreting The Coefficient Of Variation
The coefficient of variation expresses the standard deviation as a percentage
of the mean.
Allows easy comparison of standard deviations with one another.
By way of example:
Compare an s of $2,400 on a per capita average income of $55,000 against an s of $300 on a per capita average income of $2,000. How should we interpret them?
Here the coefficients of variation are 4.4% and 15%, indicating much greater relative variability in the poorer nation, that is, a much wider gap between rich and poor.
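The two percentages in the income example follow directly from dividing s by the mean. A minimal sketch, with a hypothetical helper name `coefficient_of_variation`:

```python
def coefficient_of_variation(s, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return s / mean * 100


cv_richer = coefficient_of_variation(2400, 55000)
cv_poorer = coefficient_of_variation(300, 2000)

print(f"{cv_richer:.1f}% versus {cv_poorer:.1f}%")
```

The poorer nation has the smaller s in dollars but the larger Cv, which is why the coefficient is the fairer basis for comparison here.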
Case in point: the coefficient of variation for global GNI is 108.9%! This indicates an extraordinary gap
between rich and poor nations.
               Dataset #1       Dataset #2       Dataset #3
               $45,000.00       $80,000.00       $80,000.00
               $45,000.00       $45,000.00       $45,000.00
               $45,000.00       $45,000.00       $45,000.00
               $43,000.00       $43,000.00       $43,000.00
               $41,000.00       $41,000.00       $41,000.00
               $40,000.00       $40,000.00       $40,000.00
               $37,000.00       $37,000.00       $37,000.00
               $35,000.00       $35,000.00            $1.00
Sum           $331,000.00      $366,000.00      $331,001.00
n                       8                8                8
Mean           $41,375.00       $45,750.00       $41,375.13
Median         $42,000.00       $42,000.00       $42,000.00
Mode           $45,000.00       $45,000.00       $45,000.00
MAX            $45,000.00       $80,000.00       $80,000.00
MIN            $35,000.00       $35,000.00            $1.00
Range          $10,000.00       $45,000.00       $79,999.00
s               $3,852.18       $14,290.36       $21,559.86
s²         $14,839,285.71  $204,214,285.71  $464,827,464.41
Cv                  9.31%           31.24%           52.11%
Note that the highest Cv is 5.6 times the lowest, indicating that dataset #3 is considerably more variable than dataset #1. The effect of the two extreme values is evident.
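The Cv row of the table can be recomputed from the raw values. This sketch assumes the standard `statistics` module and the sample form of the coefficient (s divided by the mean, times 100); the names are illustrative.

```python
from statistics import mean, stdev

datasets = {
    "Dataset #1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}

# Sample coefficient of variation for each dataset, in percent.
cv = {name: stdev(values) / mean(values) * 100
      for name, values in datasets.items()}

for name, value in cv.items():
    print(f"{name}: {value:.2f}%")

highest_to_lowest = max(cv.values()) / min(cv.values())
print(round(highest_to_lowest, 1))
```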
Summary Stats So Far
Arithmetic mean and standard deviation are fundamental to statistics.
Form the heart of descriptive statistics.
Are the essential building blocks of all other statistical methods – look for them as
elements in future formulas.
Other measures of dispersion have their roles, are more robust, but not as powerful.
All Geography students are deviants.
All Geography students are above average deviants.
mg!