
Measures of Dispersion

There are many, many model probability distributions

Here’s a link to a map of 50+ probability distributions, showing how they all relate!

http://www.math.wm.edu/~leemis/chart/UDR/UDR.html

Features of Probability Distributions

● The center is the mean, median, or mode.

● The spread is the variability of the data.

● Shape can be described by symmetry, skewness, number of peaks (modes), etc.

Image source: http://stattrek.com/statistics/charts/data-patterns.aspx?Tutorial=AP

Descriptive Statistics

● Measures of central tendency

○ Mean, median, mode

● Measures of spread

○ Variance, standard deviation

○ Range, max, min, quartiles

● Measures of shape

○ Skew, modalities, etc.

http://onlinestatbook.com/2/summarizing_distributions/shapes.html

Spread

● Measurements vary for one of two reasons:

○ Systematic bias: underlying problems with how the data were collected,

leading to inaccuracies (e.g., thermometer off by a degree)

○ Randomness: natural fluctuations in measurements (aka “noise”)

● You can and will get different results taking the same measurements over

and over again, as a result of the randomness.

Measurements Vary

● Measures of central tendency tell us something about the center of a

probability distribution or a histogram.

● But they don’t tell us anything about spread.

Spread

Data Source

https://collegescorecard.ed.gov/data/

One Measure of Spread

● Data: Change in the valuation of different financial sectors.

● Let’s think up a metric we can use to measure the spread of a distribution.

● A starting point: what is the average deviation of the values from the mean?

○ [(1.66 - 0.58) + (1.08 - 0.58) + (0.89 - 0.58) + … + (0.28 - 0.58) + (0.13 - 0.58) + (-0.28 - 0.58)] / 10 = 0

● What is the problem with this measure?

Data Source: https://markets.ft.com/data/sectors

● The problem is that we get both positive and negative deviations, and they

end up cancelling each other out, so the average deviation is 0!

● How can we make all deviations positive?

○ We can take absolute values, or

○ We can square everything!

● Our measure of spread will be the average squared deviation of values from

the mean. This is called the sample variance.
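The cancellation described above can be checked directly. The following is a minimal sketch using the ten sector values from these slides; the variable names are illustrative assumptions, not part of the original deck.

```python
# Sector valuation changes as listed in these slides.
changes = [1.66, 1.08, 0.89, 0.69, 0.52, 0.47, 0.36, 0.28, 0.13, -0.28]

mean = sum(changes) / len(changes)
deviations = [x - mean for x in changes]

# Signed deviations cancel, so their average is 0 (up to floating-point error).
avg_deviation = sum(deviations) / len(deviations)

# Two ways to make every deviation non-negative before averaging.
avg_abs_deviation = sum(abs(d) for d in deviations) / len(deviations)
avg_sq_deviation = sum(d * d for d in deviations) / len(deviations)

print(f"mean                       = {mean:.2f}")
print(f"average deviation          = {avg_deviation:.6f}")
print(f"average absolute deviation = {avg_abs_deviation:.3f}")
print(f"average squared deviation  = {avg_sq_deviation:.3f}")
```

Both alternatives give a strictly positive number for this data, whereas the plain average of the deviations is zero no matter how spread out the values are.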

One Measure of Spread

● The total sum of the squared deviations:

○ (1.66 - 0.58)² + (1.08 - 0.58)² + (0.89 - 0.58)² + (0.69 - 0.58)² + (0.52 - 0.58)² + (0.47 - 0.58)² + (0.36 - 0.58)² + (0.28 - 0.58)² + (0.13 - 0.58)² + (-0.28 - 0.58)² ≈ 2.62

● To find the average, we divide by 10.

● So, the sample mean is ≈ 0.58.

● And the sample variance is ≈ 0.262.

● Why such a big spread? Well, because we squared the deviations!

● Arguably of more interest, since it is of the same magnitude as the values, is

the square root of the sample variance: i.e., the sample standard deviation.
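As a rough check of the arithmetic above, the sketch below uses Python's standard `statistics` module. One assumption to flag: the slides divide by n (here 10), which corresponds to `statistics.pvariance` and `statistics.pstdev`. Many textbooks and software packages instead divide by n - 1 for a sample variance, which is what `statistics.variance` computes. It is shown here only for comparison.

```python
import statistics

# Sector valuation changes as listed in these slides.
changes = [1.66, 1.08, 0.89, 0.69, 0.52, 0.47, 0.36, 0.28, 0.13, -0.28]

# Divide-by-n variance, matching the "average squared deviation" on the slide.
var_n = statistics.pvariance(changes)
sd_n = statistics.pstdev(changes)

# Divide-by-(n - 1) variant, included only to show the difference.
var_n_minus_1 = statistics.variance(changes)

print(f"variance (divide by n):     {var_n:.3f}")
print(f"std. deviation (from n):    {sd_n:.3f}")
print(f"variance (divide by n - 1): {var_n_minus_1:.3f}")
```

The standard deviation comes out near 0.51, which is on the same scale as the original values and therefore easier to interpret than the variance.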

Calculating Sample Variance

Interquartile Range (IQR)

Maximum, minimum, and range

● The maximum is an outcome that is greater than or equal to all others in our sample

● The minimum is an outcome that is less than or equal to all others in our sample

● The range is the difference between the maximum and the minimum

● The maximum and minimum are sensitive to outliers

○ If an outcome is added to our sample that is less than the minimum, then the minimum

changes

○ If an outcome is added to our sample that is greater than the maximum, then the

maximum changes

● To determine if the maximum and the minimum in our data set are indeed outliers we

can use a rule of thumb called the interquartile range rule
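To illustrate how sensitive the range is to a single extreme value, here is a small sketch. The base sample is one used elsewhere in these slides; the added value 40 and the helper name `value_range` are hypothetical, chosen only for illustration.

```python
def value_range(values):
    """Return the range: the maximum minus the minimum."""
    return max(values) - min(values)

sample = [3, 4, 4, 5, 6, 8, 8]

# Hypothetical extreme observation appended to the sample.
with_outlier = sample + [40]

print("range without the extra value:", value_range(sample))
print("range with the extra value:   ", value_range(with_outlier))
```

A single added observation moves the maximum, and with it the range, even though the other seven values are unchanged.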

Quartiles

● Quartiles within an ordered set of data are three points that divide the dataset into

four equal groups, each containing a quarter of the data

● The first quartile (Q1) is the middle value between the minimum and the median, i.e., the median of the lower half of the data

● The second quartile (Q2) is the median

● The third quartile (Q3) is the middle value between the median and the maximum, i.e., the median of the upper half of the data

● The interquartile range (IQR) is the difference between Q3 and Q1 (IQR = Q3 - Q1)

Image source

https://www.mathsisfun.com/data/images/interquartile-range.gif

Computing quartiles

● Find the median, and use it to divide the dataset into two halves

○ If there is an odd number of data points in the dataset, do not include the median in either half

○ If there is an even number of data points in the dataset, split it exactly in half

● The lower quartile value is the median of the lower half of the data

● The upper quartile value is the median of the upper half of the data
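The median-of-halves procedure above can be sketched as follows. The function name `quartiles` is an assumption made for illustration, and the two samples are the ones used in the worked examples in these slides.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    With an odd number of points the median is left out of both halves;
    with an even number the data are split exactly in half.
    """
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

for sample in ([3, 4, 4, 5, 6, 8, 8], [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]):
    q1, q2, q3 = quartiles(sample)
    print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={q3 - q1}")
```

Library routines such as `statistics.quantiles` use interpolation conventions, so they may return slightly different quartiles for the same data. This sketch follows only the rule described in these slides.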

Example of computing quartiles

● Computing quartiles when the sample size is odd

● Ordered sample: 3, 4, 4, 5, 6, 8, 8

First Quartile (Q1): 4

Second Quartile (Q2) (Median): 5

Third Quartile (Q3): 8

Interquartile Range (IQR = Q3 - Q1): 4

Another example of computing quartiles

● Computing quartiles when the sample size is even

● Ordered sample: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8

First Quartile (Q1): 3

Second Quartile (Q2) (Median): 5.5

Third Quartile (Q3): 7

Interquartile Range (IQR = Q3 - Q1): 4

Box and whisker plots: for visualizing quartiles

● Depict minimum, first quartile, median, third quartile

and maximum

● The upper whisker (from the maximum to the third

quartile) represents the upper 25% of the distribution

● The interquartile range (IQR) represents the middle

50% of the data

● The lower whisker (from the first quartile to the

minimum) represents the lower 25% of the distribution

Image source

http://bolt.mph.ufl.edu/files/2012/07/images-mod1-boxplot7.gif

Do pets relieve stress?

● Does someone experience a different level of stress when

doing tasks with a pet, a good friend, or alone?

● Allen et al. had 45 people count backwards by 13s and 17s

● The people were randomly assigned to 3 different groups:

pet (P), friend (F), and alone (C, for control)

● The dependent variable measured was the subject’s average

heart rate during the task

Study Results

● The task was most stressful

around friends and least

stressful around pets

● We are comparing levels of a

quantitative variable (heart

rate) across levels of a

categorical (qualitative)

variable (treatment)

Image source

http://www2.stat.duke.edu/~gp42/sta101/notes/FPP7_9_6pp.pdf

Side-by-side box and whisker plots (double the fun!)

● Side-by-side box and whisker plots are a

method for visualizing data when one

variable is categorical (qualitative) and

the other is quantitative

● They can be used to compare the

distributions associated with a

quantitative variable across the levels of

a categorical variable

● In this plot, the stars are outliers

Image source

http://bolt.mph.ufl.edu/files/2012/07/images-mod1-boxplot7.gif

Interquartile range rule for detecting outliers

1. Calculate the interquartile range (Q3-Q1).

2. Multiply the interquartile range (IQR) by 1.5.

3. Add 1.5*IQR to the third quartile. This value is called the upper fence.

Values greater than this are suspected outliers.

4. Subtract 1.5*IQR from the first quartile. This value is called the lower fence.

Values less than this are suspected outliers.
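The four steps above can be sketched in a few lines. The function names and the extra observation 20 are hypothetical; the quartiles Q1 = 3 and Q3 = 7 are the ones computed for the even-sized sample in these slides and are held fixed here for illustration.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the interquartile range rule."""
    iqr = q3 - q1        # step 1: interquartile range
    margin = k * iqr     # step 2: multiply the IQR by 1.5
    return q1 - margin, q3 + margin  # steps 4 and 3

def suspected_outliers(values, q1, q3):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

sample = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]
q1, q3 = 3, 7  # quartiles from the worked example in these slides

print("fences:", iqr_fences(q1, q3))
print("suspected outliers in the sample:", suspected_outliers(sample, q1, q3))

# Hypothetical new observation checked against the same fences.
print("suspected outliers with 20 added:", suspected_outliers(sample + [20], q1, q3))
```

With these quartiles the fences are -3 and 13, so nothing in the original sample is flagged, while the hypothetical value 20 lies above the upper fence.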

Image source

http://bolt.mph.ufl.edu/files/2012/07/images-mod1-spread10.gif

Image source

http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/

IQR rule for detecting outliers: visualization

Image source: https://taps-graph-review.wikispaces.com/Box+and+Whisker+Plots

Image source: http://blog.contextures.com/archives/2013/06/11/create-a-simple-box-plot-in-excel

Summary

Descriptive statistics don’t tell the whole story

[Figure: six histogram shapes: symmetric unimodal, skewed right, skewed left, bimodal, multimodal, symmetric]

Image source: https://en.wikipedia.org/wiki/Histogram

Summary

● We can summarize data sets by the features of their frequency

distributions, such as their center, dispersion, shape, etc.

○ Mean, median, and mode are three measures of central tendency

○ Variance, standard deviation, and IQR are three measures of dispersion

● Descriptive statistics can be more informative than raw data, but

they do not tell the whole story, so they can also be misleading
