Measures of Dispersion

Measures of Dispersion. What is Dispersion? Refers to the way in which quantitative data values are dispersed or spread out in a dataset. The most powerful


What is Dispersion?

Dispersion refers to the way in which quantitative data values are dispersed, or spread out, in a dataset.

The most powerful dispersion statistics calculate the quantitative spread of the data values around the arithmetic mean and are called measures of deviation.

The various measures of deviation calculate the arithmetic differences between each data value and the arithmetic mean of the dataset.


Why bother with measuring deviation?

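Consider the following datasets:

First dataset: 3, 3, 3, 3, 3
Second dataset: 1, 1, 1, 2, 10

First we calculate their arithmetic means using:

$\bar{x} = \frac{\sum x}{n}$

First dataset: $\bar{x} = (3+3+3+3+3)/5 = 3$
Second dataset: $\bar{x} = (1+1+1+2+10)/5 = 3$

Are they the same? According to the mean they are.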


Then we calculate their standard deviations using:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

First dataset: $s = 0$
Second dataset: $s = 3.94$

Same means, very different standard deviations. So are the datasets the same, or not?
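The comparison above can be checked with a few lines of Python. This is a minimal sketch using the standard library's `statistics` module; the variable names `first` and `second` are illustrative only and do not come from the slides.

```python
import statistics

# The two five-value datasets from the slide.
first = [3, 3, 3, 3, 3]
second = [1, 1, 1, 2, 10]

for name, data in (("first", first), ("second", second)):
    mean = statistics.mean(data)   # arithmetic mean: sum(x) / n
    s = statistics.stdev(data)     # sample standard deviation (n - 1 denominator)
    print(f"{name}: mean = {mean}, s = {s:.2f}")
```

Both datasets report a mean of 3, while the sample standard deviations come out as 0 and roughly 3.94.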


Measures of Dispersion and Deviation

The Range (a measure of dispersion):
The range is the difference between the lowest value (called MIN) and the highest value (called MAX) in a dataset.

The Standard Deviation (a measure of deviation):
Measures the typical difference between a data value and the arithmetic mean of all data values.

The Variance (a measure of deviation):
The average of the squared differences between each data value and the arithmetic mean of the dataset. Thus it is the standard deviation squared.


The Range

The range describes the span of your dataset, from the minimum value (MIN) to the maximum value (MAX), using:

Range = MAX - MIN

The range is used as a measure of data dispersion, NOT deviation, because deviation implies a difference between your data values and some reference value, e.g. the arithmetic mean.

The Range is used in finding histogram (or bar chart) classes.


               Dataset #1       Dataset #2       Dataset #3
               $45,000.00       $80,000.00       $80,000.00
               $45,000.00       $45,000.00       $45,000.00
               $45,000.00       $45,000.00       $45,000.00
               $43,000.00       $43,000.00       $43,000.00
               $41,000.00       $41,000.00       $41,000.00
               $40,000.00       $40,000.00       $40,000.00
               $37,000.00       $37,000.00       $37,000.00
               $35,000.00       $35,000.00            $1.00
Sum           $331,000.00      $366,000.00      $331,001.00
n                       8                8                8
Mean           $41,375.00       $45,750.00       $41,375.13
Median         $42,000.00       $42,000.00       $42,000.00
Mode           $45,000.00       $45,000.00       $45,000.00
MAX            $45,000.00       $80,000.00       $80,000.00
MIN            $35,000.00       $35,000.00            $1.00
Range          $10,000.00       $45,000.00       $79,999.00

Even the range is telling us more about the data than the central tendency measures alone do.

Compare dataset #1 with #3.
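The summary rows of the table can be reproduced from the raw values. The sketch below is one possible way to do so in Python; the helper name `data_range` and the `datasets` dictionary are assumptions made for illustration, not part of the slides.

```python
import statistics

# Raw values from the three datasets in the table.
datasets = {
    "Dataset #1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}


def data_range(values):
    """Range = MAX - MIN."""
    return max(values) - min(values)


for name, values in datasets.items():
    print(
        name,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "mode:", statistics.mode(values),
        "range:", data_range(values),
    )
```

The median and mode are identical across all three datasets, so only the range (and, for dataset #2, the mean) reveals that the extreme values differ.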


The Standard Deviation (s)

The standard deviation measures the typical difference between a data value and the arithmetic mean of all data values. It is given by:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Where:
s is the sample standard deviation
x is a value in the dataset
$\bar{x}$ is the arithmetic mean of the dataset
n is the number of values in the dataset

The standard deviation and the variance are related insofar as s is the square root of the variance (or the variance is s²). s is the most widely used measure of deviation, though it should always be used in conjunction with the variance.
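To make the formula concrete, here is a minimal from-scratch sketch of the sample standard deviation in Python. The function name `sample_std` is hypothetical; the input is the first salary dataset from the table earlier in the deck.

```python
import math


def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed for a sample standard deviation")
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))


salaries = [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]
print(round(sample_std(salaries), 2))  # 3852.18
```

In practice `statistics.stdev` from the standard library computes the same quantity; writing it out only serves to mirror the symbols in the formula.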


Interpreting the Standard Deviation Formula

Subtract the arithmetic mean $\bar{x}$ from each data value x and sum the differences:

$\sum (x - \bar{x})$

But this returns a set of plus and minus differences that add to zero. So to remove the signs we square each difference and sum the squared differences …

$\sum (x - \bar{x})^2$

… then divide by n - 1 and take the square root to return to the magnitude of the original values:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
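The steps of the formula can be traced one at a time. The following sketch uses the small dataset 1, 1, 1, 2, 10 from the start of the deck; all variable names are illustrative.

```python
import math

data = [1, 1, 1, 2, 10]
n = len(data)

mean = sum(data) / n                     # 3.0
deviations = [x - mean for x in data]    # [-2.0, -2.0, -2.0, -1.0, 7.0]
sum_dev = sum(deviations)                # 0.0: plus and minus differences cancel

squared = [d ** 2 for d in deviations]   # [4.0, 4.0, 4.0, 1.0, 49.0]
sum_sq = sum(squared)                    # 62.0

s = math.sqrt(sum_sq / (n - 1))          # square root of 15.5, about 3.94
print(sum_dev, sum_sq, round(s, 2))
```

Note how the single extreme value (10) contributes 49 of the 62 squared units.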


A reminder of the effect of squaring…

[Figure: chart of the numbers 1 to 20 and their squares; horizontal axis 1 to 20, vertical axis 0 to 450; series: numbers, squares]

 #     #²        #     #²
 1      1       11    121
 2      4       12    144
 3      9       13    169
 4     16       14    196
 5     25       15    225
 6     36       16    256
 7     49       17    289
 8     64       18    324
 9     81       19    361
10    100       20    400

… it emphasizes higher values.

The numbers follow an arithmetic progression; their squares follow a quadratic progression, growing ever faster.
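A short sketch of why squaring emphasizes higher values: the gap between consecutive numbers stays constant, while the gap between consecutive squares keeps growing. Variable names here are illustrative.

```python
numbers = list(range(1, 21))
squares = [n ** 2 for n in numbers]

# Differences between neighbouring entries in each list.
num_diffs = [b - a for a, b in zip(numbers, numbers[1:])]
square_diffs = [b - a for a, b in zip(squares, squares[1:])]

print(num_diffs)     # every gap is 1
print(square_diffs)  # gaps grow by 2 each step: 3, 5, 7, ..., 39
```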


Why Squares and Roots?

This is a list of numbers, x.

  x     x - mean    (x - mean)²    √((x - mean)²)
  1       -9.5         90.25            9.5
  2       -8.5         72.25            8.5
  3       -7.5         56.25            7.5
  4       -6.5         42.25            6.5
  5       -5.5         30.25            5.5
  6       -4.5         20.25            4.5
  7       -3.5         12.25            3.5
  8       -2.5          6.25            2.5
  9       -1.5          2.25            1.5
 10       -0.5          0.25            0.5
 11        0.5          0.25            0.5
 12        1.5          2.25            1.5
 13        2.5          6.25            2.5
 14        3.5         12.25            3.5
 15        4.5         20.25            4.5
 16        5.5         30.25            5.5
 17        6.5         42.25            6.5
 18        7.5         56.25            7.5
 19        8.5         72.25            8.5
 20        9.5         90.25            9.5

Mean of x = 10.5; sum of (x - mean) = 0.0

The difference $x - \bar{x}$ produces negative numbers and a sum of zero, but the square of a number is never negative, and differences between squares increase more rapidly than differences between the original numbers. Taking the square root of each squared difference then simply returns it to its original magnitude, with the sign removed.

[Figure: chart for the numbers 1 to 20; horizontal axis: number (1 to 20); vertical axis: square (0 to 450); series: numbers, squares, num diff, square diffs]
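The table for the numbers 1 to 20 can be regenerated with a few lines of Python. This is a sketch only; the tuple layout in `rows` is an assumption chosen to mirror the four columns.

```python
import math

xs = list(range(1, 21))
mean = sum(xs) / len(xs)  # 10.5

# One row per value: x, x - mean, (x - mean) squared, square root of that square.
rows = []
for x in xs:
    d = x - mean
    rows.append((x, d, d ** 2, math.sqrt(d ** 2)))

sum_dev = sum(r[1] for r in rows)  # signed differences cancel out
sum_sq = sum(r[2] for r in rows)   # squared differences do not

for row in rows:
    print(*row)
print("mean:", mean, "sum of differences:", sum_dev, "sum of squares:", sum_sq)
```

The last column equals the absolute value of the second: squaring and then taking the root keeps the magnitude and drops the sign.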


               Dataset #1       Dataset #2       Dataset #3
               $45,000.00       $80,000.00       $80,000.00
               $45,000.00       $45,000.00       $45,000.00
               $45,000.00       $45,000.00       $45,000.00
               $43,000.00       $43,000.00       $43,000.00
               $41,000.00       $41,000.00       $41,000.00
               $40,000.00       $40,000.00       $40,000.00
               $37,000.00       $37,000.00       $37,000.00
               $35,000.00       $35,000.00            $1.00
Sum           $331,000.00      $366,000.00      $331,001.00
n                       8                8                8
Mean           $41,375.00       $45,750.00       $41,375.13
Median         $42,000.00       $42,000.00       $42,000.00
Mode           $45,000.00       $45,000.00       $45,000.00
MAX            $45,000.00       $80,000.00       $80,000.00
MIN            $35,000.00       $35,000.00            $1.00
Range          $10,000.00       $45,000.00       $79,999.00
s               $3,852.18       $14,290.36       $21,559.86

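Low s means that the data are clustered around the mean, so the distribution looks 'peaked' (loosely, leptokurtic).

High s means that the data are spread out around the mean, so the distribution looks 'flat' (loosely, platykurtic).

REMEMBER: s values do not indicate skewness. They do indicate how peaked or flat the distribution looks when plotted on a common scale; kurtosis in the strict sense is a separate statistic that does not depend on s.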


Standard Deviation and the ‘Shape’ of Data (Review Slide)

[Figure: three frequency curves centred on $\bar{x}$, labelled ‘Small’, ‘Normal’ and ‘Large’ standard deviation; vertical axis: Frequency]

This ‘peakedness’ of the distribution is called kurtosis. Use the kurtosis statistic to test for normality.


The Variance (s²)

The variance averages the squared differences between each data value and the arithmetic mean of the dataset. It is given by:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Where:
s² is the sample variance
x is a value in the dataset
$\bar{x}$ is the arithmetic mean of the dataset
n is the number of values in the dataset

Since it uses the arithmetic mean, it is subject to the same effect of extreme values, except much more so because of the effect of squaring.
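As a rough illustration of how strongly one extreme value moves the variance, the sketch below computes the sample variance by hand for two of the salary datasets, which differ in a single value. The function name `sample_variance` is hypothetical; the result is cross-checked against `statistics.variance`.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance: sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed for a sample variance")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


dataset_1 = [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]
dataset_2 = [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]  # one value raised to 80,000

v1 = sample_variance(dataset_1)
v2 = sample_variance(dataset_2)
print(round(v1, 2), round(v2, 2))

# The hand-written version agrees with the standard library.
assert math.isclose(v1, statistics.variance(dataset_1))
```

Changing one salary from $45,000 to $80,000 raises the variance from about 14.8 million to about 204.2 million.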


Interpreting the Variance Formula

Subtract the arithmetic mean $\bar{x}$ from each data value x and sum the differences:

$\sum (x - \bar{x})$

But this returns a set of plus and minus differences that adds to zero.

So to remove the signs we square each difference and sum the squared differences:

$\sum (x - \bar{x})^2$

Finally, divide by n - 1:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$


Variance and SD Compared

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

By squaring the differences you remove the negative signs and exaggerate more extreme differences to make them more obvious for analysis.

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

By taking the square root you return the differences to their original magnitude, but the signs are removed so the differences no longer sum to zero.

In comparing the two, when s is small, the difference between the variance (s²) and s is smaller than when s is large: that's what happens when you square numbers.


               Dataset #1       Dataset #2       Dataset #3
               $45,000.00       $80,000.00       $80,000.00
               $45,000.00       $45,000.00       $45,000.00
               $45,000.00       $45,000.00       $45,000.00
               $43,000.00       $43,000.00       $43,000.00
               $41,000.00       $41,000.00       $41,000.00
               $40,000.00       $40,000.00       $40,000.00
               $37,000.00       $37,000.00       $37,000.00
               $35,000.00       $35,000.00            $1.00
Sum           $331,000.00      $366,000.00      $331,001.00
n                       8                8                8
Mean           $41,375.00       $45,750.00       $41,375.13
Median         $42,000.00       $42,000.00       $42,000.00
Mode           $45,000.00       $45,000.00       $45,000.00
MAX            $45,000.00       $80,000.00       $80,000.00
MIN            $35,000.00       $35,000.00            $1.00
Range          $10,000.00       $45,000.00       $79,999.00
s               $3,852.18       $14,290.36       $21,559.86
s²         $14,839,285.71  $204,214,285.71  $464,827,464.41

Note that the highest s is 5.6 times the lowest, whereas the highest s² is 31 times the lowest: this is the effect of squaring extreme values.
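The two ratios quoted in the note can be checked directly from the raw data. A minimal sketch, assuming the three datasets are keyed by illustrative names in a dictionary:

```python
import statistics

datasets = {
    "Dataset #1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}

# Sample standard deviation and sample variance per dataset (n - 1 denominator).
s = {name: statistics.stdev(values) for name, values in datasets.items()}
s2 = {name: statistics.variance(values) for name, values in datasets.items()}

ratio_s = max(s.values()) / min(s.values())
ratio_s2 = max(s2.values()) / min(s2.values())
print(round(ratio_s, 1), round(ratio_s2))  # 5.6 31
```

Because the variance is the square of the standard deviation, the variance ratio is simply the square of the standard deviation ratio.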


N and n-1

Why do the sample standard deviation and sample variance formulas (and those of many other sample statistics) have n-1 as the denominator?

Because n-1 gives a more conservative estimate of deviation by increasing the standard deviation and variance values.

If you have a larger standard deviation or variance, you have a higher standard to pass in making your case. Why?

Because if you are testing to see if a data value is 1.96 s away from the mean of its dataset, then a larger s means the data value has to meet a stricter test, i.e. it has to lie further from the mean.

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$


Sample versus population: n-1 versus N

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ (sample, n-1) versus the same expression with N in the denominator (population)

Sample     Value of numerator    Biased estimate of       Unbiased estimate of        Difference between     Difference
size (n)   in standard           population standard      population standard         biased and unbiased    (%)
           deviation formula     deviation (dividing      deviation (dividing         estimates
                                 by N)                    by n-1)
10         500                   √(500/10) = 7.07         √(500/(10-1)) = 7.45        0.38                   5.0%
100        500                   √(500/100) = 2.24        √(500/(100-1)) = 2.25       0.01                   0.4%
1000       500                   √(500/1000) = 0.7071     √(500/(1000-1)) = 0.7075    0.0004                 0.056%

Source: After Salkind, page 40.

Note:
1. With n-1 the standard deviation is higher.
2. The larger the sample, the smaller the effect of n-1.
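The figures in the table can be recomputed with a short loop. This sketch holds the numerator fixed at 500, as the table does; the names `numerator`, `biased` and `unbiased` simply follow the column headings.

```python
import math

numerator = 500  # value of the sum of squared differences, held fixed as in the table

results = []
for n in (10, 100, 1000):
    biased = math.sqrt(numerator / n)          # dividing by N
    unbiased = math.sqrt(numerator / (n - 1))  # dividing by n - 1
    results.append((n, biased, unbiased, unbiased - biased))

for n, biased, unbiased, diff in results:
    print(f"n = {n:>4}: N gives {biased:.4f}, n-1 gives {unbiased:.4f}, difference {diff:.4f}")
```

The n-1 result is always the larger of the two, and the gap shrinks quickly as the sample size grows.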


Interpreting Variance & Standard Deviation

s gives the typical difference between each data value and the mean of a dataset; s² is its square and so exaggerates it.

The larger the values of s and s², the more spread out the data values are and the larger the differences between them.

If s and s² are equal to zero then there are no differences between your data values.

The standard deviation and the variance each require an arithmetic mean to work, not the median or the mode. Therefore they require the same rigour as the mean and are sensitive to extreme values as well, especially the variance.


The Coefficient of Variation (Cv)

Calculating the Coefficient Of Variation

The equation for the sample coefficient of variation is:

$C_v = \frac{s}{\bar{x}} \times 100$

And, for the population:

$C_v = \frac{\sigma}{\mu} \times 100$


Interpreting The Coefficient Of Variation

The coefficient of variation expresses the standard deviation as a percentage of the mean.

It allows easy comparison of standard deviations with one another, even when the means differ.

By way of example:

Compare an s of $2,400 on a per capita average income of $55,000 against an s of $300 on a per capita average income of $2,000. How should we interpret them?

Here the coefficients of variation are 4.4% and 15%, indicating a much wider range of variability in the poorer nation, that is, a much wider gap between rich and poor.

Case in point: the coefficient of variation for global GNI is 108.9%! This indicates an extraordinary gap between rich and poor nations.
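The two income figures can be checked with a one-line function. A minimal sketch; the function name `coefficient_of_variation` and the variable names are illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a mean of zero")
    return std_dev / mean * 100


richer = coefficient_of_variation(2400, 55000)
poorer = coefficient_of_variation(300, 2000)
print(f"{richer:.1f}% versus {poorer:.1f}%")  # 4.4% versus 15.0%
```

Although $2,400 is the larger standard deviation in absolute terms, it is the smaller one relative to its mean.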


               Dataset #1       Dataset #2       Dataset #3
               $45,000.00       $80,000.00       $80,000.00
               $45,000.00       $45,000.00       $45,000.00
               $45,000.00       $45,000.00       $45,000.00
               $43,000.00       $43,000.00       $43,000.00
               $41,000.00       $41,000.00       $41,000.00
               $40,000.00       $40,000.00       $40,000.00
               $37,000.00       $37,000.00       $37,000.00
               $35,000.00       $35,000.00            $1.00
Sum           $331,000.00      $366,000.00      $331,001.00
n                       8                8                8
Mean           $41,375.00       $45,750.00       $41,375.13
Median         $42,000.00       $42,000.00       $42,000.00
Mode           $45,000.00       $45,000.00       $45,000.00
MAX            $45,000.00       $80,000.00       $80,000.00
MIN            $35,000.00       $35,000.00            $1.00
Range          $10,000.00       $45,000.00       $79,999.00
s               $3,852.18       $14,290.36       $21,559.86
s²         $14,839,285.71  $204,214,285.71  $464,827,464.41
Cv                  9.31%           31.24%           52.11%

Note that the highest Cv is 5.6 times the lowest (52.11 / 9.31), indicating that dataset #3 is considerably more variable than dataset #1. The effect of the two extreme values is evident.
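The Cv row can be recomputed from the raw salaries. The sketch below assumes the sample standard deviation (n - 1 denominator) is used, which matches the s row of the table; the dictionary and variable names are illustrative.

```python
import statistics

datasets = {
    "Dataset #1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "Dataset #3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}

# Cv = sample standard deviation / mean * 100, per dataset.
cv = {
    name: statistics.stdev(values) / statistics.mean(values) * 100
    for name, values in datasets.items()
}

for name, value in cv.items():
    print(f"{name}: Cv = {value:.2f}%")

print("highest / lowest:", round(max(cv.values()) / min(cv.values()), 1))
```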


Summary Stats So Far

Arithmetic mean and standard deviation are fundamental to statistics.

They form the heart of descriptive statistics.

They are the essential building blocks of all other statistical methods; look for them as elements in future formulas.

Other measures of dispersion have their roles and are more robust, but are not as powerful.


All Geography students are deviants.


All Geography students are above average deviants.

mg!