Measures of Dispersion
There are many, many model probability distributions
Here’s a link to a map of 50+ probability distributions, showing how they all relate!
● The center is the mean, median, or mode.
● The spread is the variability of the data.
● Shape can be described by symmetry, skewness, number of peaks (modes), etc.
Features of Probability Distributions
Descriptive Statistics
● Measures of central tendency
○ Mean, median, mode
● Measures of spread
○ Variance, standard deviation
○ Range, max, min, quartiles
● Measures of shape
○ Skew, modalities, etc.
Spread
● Measurements vary for one of two reasons:
○ Systematic bias: underlying problems with how the data were collected,
leading to inaccuracies (e.g., thermometer off by a degree)
○ Randomness: natural fluctuations in measurements (aka “noise”)
● You can and will get different results taking the same measurements over
and over again, as a result of the randomness.
Measurements Vary
● Measures of central tendency tell us something about the center of a
probability distribution or a histogram.
● But they don’t tell us anything about spread.
Spread
One Measure of Spread
● Data: Change in the valuation of different financial sectors.
● Let’s think up a metric we can use to measure the spread of a distribution.
● A starting point: what is the average deviation of the values from the mean?
○ [(1.66 - 0.58) + (1.08 - 0.58) + (0.89 - 0.58) + … + (0.28 - 0.58) + (0.13 - 0.58) + (-0.28 - 0.58)] / 10 = 0
● What is the problem with this measure?
● The problem is that we get both positive and negative deviations, and they
end up cancelling each other out, so the average deviation is 0!
● How can we make all deviations positive?
○ We can take absolute values, or
○ We can square everything!
● Our measure of spread will be the average squared deviation of values from
the mean. This is called the sample variance. (Here we divide by the number
of values, n; many textbooks divide by n - 1 instead.)
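The cancellation argument can be checked numerically. The Python sketch below uses the ten sector values listed in this deck's variance calculation; the variable names are illustrative and not taken from the slides.

```python
import math

# Ten changes in sector valuation, as listed in the deck's worked example.
values = [1.66, 1.08, 0.89, 0.69, 0.52, 0.47, 0.36, 0.28, 0.13, -0.28]

n = len(values)
mean = sum(values) / n

# Signed deviations cancel, so their average is zero (up to rounding error).
deviations = [v - mean for v in values]
avg_deviation = sum(deviations) / n

# Two ways to make every deviation non-negative before averaging.
avg_abs_deviation = sum(abs(d) for d in deviations) / n
avg_sq_deviation = sum(d * d for d in deviations) / n

print("average deviation is ~0:", math.isclose(avg_deviation, 0.0, abs_tol=1e-12))
print(f"average absolute deviation = {avg_abs_deviation:.3f}")
print(f"average squared deviation  = {avg_sq_deviation:.3f}")
```

Both alternatives give a non-zero measure of spread; the slides continue with the squared version.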
One Measure of Spread
● The total sum of the squared deviations:
○ (1.66 - 0.58)² + (1.08 - 0.58)² + (0.89 - 0.58)² + (0.69 - 0.58)² + (0.52 - 0.58)² + (0.47 - 0.58)² + (0.36 - 0.58)² + (0.28 - 0.58)² + (0.13 - 0.58)² + (-0.28 - 0.58)² ≈ 2.62
● To find the average, we divide by 10.
● So, the sample mean is ≈ 0.58.
● And the sample variance is ≈ 0.262.
● Note the units: because we squared the deviations, the variance is in squared
units, so it is not directly comparable to the values themselves.
● Arguably of more interest, since it is in the same units as the values, is
the square root of the sample variance: i.e., the sample standard deviation
(here, √0.262 ≈ 0.51).
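The same calculation can be sketched with Python's standard `statistics` module. The slides divide by the number of values (10), which corresponds to `pvariance` and `pstdev`. The `variance` call, which divides by n - 1, is shown only for comparison and is not part of the slides.

```python
import math
import statistics

# Ten changes in sector valuation, as listed in the worked example.
values = [1.66, 1.08, 0.89, 0.69, 0.52, 0.47, 0.36, 0.28, 0.13, -0.28]

mean = statistics.mean(values)

# Divide-by-n versions: these match the calculation in the slides.
variance_n = statistics.pvariance(values)
std_n = statistics.pstdev(values)

# Divide-by-(n - 1) version, shown only for comparison (not from the slides).
variance_n1 = statistics.variance(values)

print(f"mean               = {mean:.2f}")
print(f"variance (/ n)     = {variance_n:.3f}")
print(f"std. dev. (/ n)    = {std_n:.3f}")
print(f"variance (/ n - 1) = {variance_n1:.3f}")
```

The standard deviation comes out on the same scale as the data, which is why it is usually the easier number to interpret.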
Calculating Sample Variance
Interquartile Range (IQR)
Maximum, minimum, and range
● The maximum is an outcome that is greater than or equal to all others in our sample
● The minimum is an outcome that is less than or equal to all others in our sample
● The range is the difference between the maximum and the minimum
● The maximum and minimum are sensitive to outliers
○ If an outcome is added to our sample that is less than the minimum, then the minimum changes
○ If an outcome is added to our sample that is greater than the maximum, then the maximum changes
● To determine whether the maximum and the minimum in our data set are indeed outliers, we
can use a rule of thumb called the interquartile range rule
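A small Python sketch of how a single extreme value moves the maximum and the range. The sample is the ordered sample from this deck's odd-size quartile example; the helper `value_range` and the extra value 40 are hypothetical, added only for illustration.

```python
def value_range(sample):
    """Return (minimum, maximum, range) for a non-empty sample."""
    lo, hi = min(sample), max(sample)
    return lo, hi, hi - lo


sample = [3, 4, 4, 5, 6, 8, 8]
print(value_range(sample))        # (3, 8, 5)

# Hypothetical extra measurement far above the rest (not from the slides).
with_outlier = sample + [40]
print(value_range(with_outlier))  # (3, 40, 37)
```

One added value changes the range from 5 to 37, even though the other seven values are unchanged.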
Quartiles
● Quartiles within an ordered set of data are three points that divide the dataset into
four equal groups, each containing a quarter of the data
● The first quartile (Q1) is the median of the lower half of the data (the midpoint in rank, not in value, between the minimum and the median)
● The second quartile (Q2) is the median
● The third quartile (Q3) is the median of the upper half of the data (the midpoint in rank, not in value, between the median and the maximum)
● The interquartile range (IQR) is the difference between Q3 and Q1 (IQR = Q3 - Q1)
Computing quartiles
● Find the median, and use it to divide the dataset into two halves
○ If there is an odd number of data points in the dataset, do not include the median in either half
○ If there is an even number of data points in the dataset, split it exactly in half
● The lower quartile value is the median of the lower half of the data
● The upper quartile value is the median of the upper half of the data
Example of computing quartiles
● Computing quartiles when the sample size is odd
● Ordered sample: 3, 4, 4, 5, 6, 8, 8
First Quartile (Q1): 4
Second Quartile (Q2) (Median): 5
Third Quartile (Q3): 8
Interquartile Range (IQR = Q3 - Q1): 4
Another example of computing quartiles
● Computing quartiles when the sample size is even
● Ordered sample: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8
First Quartile (Q1): 3
Second Quartile (Q2) (Median): 5.5
Third Quartile (Q3): 7
Interquartile Range (IQR = Q3 - Q1): 4
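The median-of-halves procedure can be sketched in Python and checked against the two worked examples. The function name `quartiles` is illustrative, and the sketch assumes a sample with at least two values. Statistical libraries often use other interpolation conventions, so their quartiles may differ slightly from these.

```python
from statistics import median


def quartiles(sample):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(sample)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Odd n: leave the median out of both halves. Even n: split exactly in half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


odd_sample = [3, 4, 4, 5, 6, 8, 8]
even_sample = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]

for sample in (odd_sample, even_sample):
    q1, q2, q3 = quartiles(sample)
    print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={q3 - q1}")
```

Both samples give an IQR of 4, matching the values in the examples.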
Box and whisker plots: for visualizing quartiles
● Depict minimum, first quartile, median, third quartile
and maximum
● The upper whisker (from the maximum to the third
quartile) represents the upper 25% of the distribution
● The interquartile range (IQR) represents the middle
50% of the data
● The lower whisker (from the first quartile to the
minimum) represents the lower 25% of the distribution
Do pets relieve stress?
● Does someone experience different levels of stress when
doing tasks with a pet, a good friend, or alone?
● Allen et al. had 45 people count backwards by 13s and 17s
● The people were randomly assigned to 3 different groups:
pet (P), friend (F), and alone (C, for control)
● The dependent variable measured was the subject’s average
heart rate during the task
Study Results
● The task was most stressful around friends and least stressful around pets
● We are comparing levels of a quantitative variable (heart rate) across levels of a categorical (qualitative) variable (treatment)
Side-by-side box and whisker plots (double the fun!)
● Side-by-side box and whisker plots are a
method for visualizing data when one
variable is categorical (qualitative) and
the other is quantitative
● They can be used to compare the
distributions associated with a
quantitative variable across the levels of
a categorical variable
● In this plot, the stars are outliers
Interquartile range rule for detecting outliers
1. Calculate the interquartile range (Q3-Q1).
2. Multiply the interquartile range (IQR) by 1.5.
3. Add 1.5*IQR to the third quartile. This value is called the upper fence.
Values greater than this are suspected outliers.
4. Subtract 1.5*IQR from the first quartile. This value is called the lower fence.
Values less than this are suspected outliers.
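The four steps above can be sketched in Python. The quartiles (Q1 = 4, Q3 = 8) come from the odd-size worked example in this deck. The function names and the extra value 15 are hypothetical, and the extra value is checked against the original fences without recomputing the quartiles.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the IQR rule."""
    iqr = q3 - q1           # step 1: interquartile range
    margin = k * iqr        # step 2: multiply by 1.5
    return q1 - margin, q3 + margin  # steps 4 and 3


def suspected_outliers(sample, q1, q3):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(q1, q3)
    return [x for x in sample if x < lower or x > upper]


sample = [3, 4, 4, 5, 6, 8, 8]
q1, q3 = 4, 8

print(iqr_fences(q1, q3))                         # (-2.0, 14.0)
print(suspected_outliers(sample, q1, q3))         # []
# Hypothetical extra measurement (not from the slides).
print(suspected_outliers(sample + [15], q1, q3))  # [15]
```

None of the original values is flagged, so in this sample neither the maximum nor the minimum is a suspected outlier.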
IQR rule for detecting outliers: visualization
Summary
Descriptive statistics don’t tell the whole story
[Figure: six example distribution shapes, labeled symmetric unimodal, skewed right, skewed left, bimodal, multimodal, and symmetric]
Summary
● We can summarize data sets by the features of their frequency distributions, such as their center, dispersion, shape, etc.
○ Mean, median, and mode are three measures of central tendency
○ Variance, standard deviation, and IQR are three measures of dispersion
● Descriptive statistics can be more informative than raw data, but they do not tell the whole story, so they can also be misleading