brendan-buck
5-Minute Check on Lesson 1-2
1. What 4 terms are used to describe data sets or distributions?
2. Which type of graph can our calculators do (bar or histogram)?
3. How many classes should a histogram have?
4. What needs to be looked for in time-series graphs?
5. What is the major difference between a histogram and a stem-plot?
6. Name a possible graphical error in a histogram.
Answers:
1. Shape, Outliers, Center, Spread (SOCS)
2. histogram
3. classes = square root (number of observations)
4. seasonal trends
5. histogram summarizes the data; stem-plot maintains the data
6. overlapping categories
Lesson 1 - 3
Describing Quantitative Data with Numbers
adapted from Mr. Molesky’s TPS 4E slides
Objectives
• Calculate and interpret measures of center (mean, median, mode)
• Calculate and interpret measures of spread (IQR, standard deviation, range)
• Identify outliers using the 1.5 x IQR rule
• Make a boxplot
• Select appropriate measures of center and spread
• Use appropriate graphs and numerical summaries to compare distributions of quantitative variables
Vocabulary
• Boxplot – graphs the five number summary and any outliers
• Degrees of freedom – the number of independent pieces of information that are included in your measurement
• Five-number summary – the minimum, Q1, Median, Q3, maximum
• Interquartile range – the range of the middle 50% of the data; (IQR) – IQR = Q3 – Q1
• Mean – the average value (balance point); x-bar
• Median – the middle value (in an ordered list); M
• Mode – the most frequent data value
Vocabulary cont
• Outlier – a data value that lies outside the interval [Q1 – 1.5 IQR, Q3 + 1.5 IQR]
• Pth percentile – p percent of the observations (in an ordered list) fall at or below this number
• Quartile – multiples of 25th percentile (Q1 – 25th; Q2 –50th or median; Q3 – 75th)
• Range – difference between the largest and smallest observations
• Resistant measure – a measure (statistic or parameter) that is not sensitive to the influence of extreme observations
• Standard Deviation– the square root of the variance
• Variance – the average of the squares of the deviations from the mean
Measures of Center
A numerical description of a distribution begins with a measure of its “center”.
If you could summarize the data with one number, what would it be?
Mean: The “average” value of a dataset
Median: The “middle” value of an ordered dataset
1. Arrange observations in order min to max
2. Locate the middle observation, average if needed
Mean vs Median
The mean and the median are the most common measures of center
If a distribution is perfectly symmetric, the mean and the median are the same
The mean is not resistant to outliers
The mode, the data value that occurs the most often, is a common measure of center for categorical data
You must decide which number is the most appropriate description of the center...
MeanMedian Applet
Use the mean on symmetric data and the median on skewed data or data with outliers
Distributions Parameters
Skewed Left (tail to the left): Mean substantially smaller than median (tail pulls mean toward it)
Mean < Median < Mode
Symmetric: Mean roughly equal to median
Mean ≈ Median ≈ Mode
Skewed Right (tail to the right): Mean substantially greater than median (tail pulls mean toward it)
Mean > Median > Mode
Central Measures Comparisons
Measure of Central Tendency | Computation | Interpretation | When to use
Mean | μ = (∑xi) / N; x‾ = (∑xi) / n | Center of gravity | Data are quantitative and frequency distribution is roughly symmetric
Median | Arrange data in ascending order and divide the data set into half | Divides into bottom 50% and top 50% | Data are quantitative and frequency distribution is skewed
Mode | Tally data to determine most frequent observation | Most frequent observation | Data are categorical or the most frequent observation is the desired measure of central tendency
Measuring Center: Example 1
• Use the data below to calculate the mean and median of the commuting times (in minutes) of 20 randomly selected New York workers.
Example, page 53
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
\[ \bar{x} = \frac{10 + 30 + 5 + 25 + \cdots + 40 + 45}{20} = 31.25 \text{ minutes} \]
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
Key: 4|5 represents a New York worker who reported a 45-minute travel time to work.
\[ M = \frac{20 + 25}{2} = 22.5 \text{ minutes} \]
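The mean and median from this example can be checked with a short Python sketch. It assumes only the standard-library `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median

# Commuting times (minutes) of the 20 New York workers from the example.
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

x_bar = mean(times)   # sum of the observations divided by n
m = median(times)     # middle of the ordered list; averages the two middle values when n is even

print(f"mean = {x_bar} minutes, median = {m} minutes")
```

The mean sits well above the median here, which is what the slides predict for a distribution with a tail to the right.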
Example 2
Which of the following measures of central tendency are resistant?
1. Mean: Not resistant
2. Median: Resistant
3. Mode: Resistant
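One way to see what "resistant" means is to corrupt a single value and watch which measures move. The sketch below reuses the commuting times from Example 1 and replaces the 85 with 850. That replacement is a hypothetical data-entry error invented for illustration and does not appear in the slides.

```python
from statistics import mean, median

# Commuting times (minutes) from Example 1.
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

# Hypothetical error: the largest value, 85, is mistyped as 850.
with_error = [850 if t == 85 else t for t in times]

mean_before, mean_after = mean(times), mean(with_error)
median_before, median_after = median(times), median(with_error)

print("mean:  ", mean_before, "->", mean_after)      # the mean is pulled toward the extreme value
print("median:", median_before, "->", median_after)  # the median does not move
```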
Example 3
Given the following set of data:
70, 56, 48, 48, 53, 52, 66, 48, 36, 49, 28, 35, 58, 62, 45, 60, 38, 73, 45, 51, 56, 51, 46, 39, 56, 32, 44, 60, 51, 44, 63, 50, 46, 69, 53, 70, 33, 54, 55, 52
What is the mean? 51.125
What is the median? 51
What is the mode? 48, 51, 56
What is the shape of the distribution? Symmetric (tri-modal)
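The three answers can be verified in Python. This is a minimal sketch that assumes Python 3.8 or later, where `statistics.multimode` is available; it returns every value tied for most frequent, which matters here because the data set has three modes.

```python
from statistics import mean, median, multimode

data = [70, 56, 48, 48, 53, 52, 66, 48, 36, 49,
        28, 35, 58, 62, 45, 60, 38, 73, 45, 51,
        56, 51, 46, 39, 56, 32, 44, 60, 51, 44,
        63, 50, 46, 69, 53, 70, 33, 54, 55, 52]

data_mean = mean(data)
data_median = median(data)
data_modes = sorted(multimode(data))  # all values tied for the highest frequency

print(data_mean, data_median, data_modes)
```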
Example 4
Given the following types of data and sample sizes, list the measure of central tendency you would use and explain why.
Type of data | Sample of 50 | Sample of 200
Hair color | mode | mode
Height | mean | mean
Weight | mean | mean
Parent’s Income | median | median
Number of Siblings | mean | mean
Age | mean | mean
Does sample size affect your decision?
Not in this case, but a larger sample size might allow us to use the mean rather than the median.
Day 1 Summary and Homework
• Summary
– Four characteristics must be used to describe distributions (from histograms or similar charts)
  • Shape (uniform, symmetric, bi-modal, etc)
  • Outliers (rule next lesson)
  • Center (mean, median, mode measures)
  • Spread (IQR, variance – next lesson)
– Median is resistant to outliers; mean is not!
– Use Mean for symmetric data
– Use Median for skewed data (or data with outliers)
– Use Mode for categorical data
• Homework
– pg 70-74; prob 79, 81, 83, 87, 89
5-Minute Check on Lesson 1-3a
1. What are the two quantitative measures of center?
2. When do we use one versus the other?
3. Which one is resistant to outliers?
4. Which measure of center is used for qualitative data?
5. Find the mean, median and mode of the following data set: 7, 15, 4, 8, 16, 17, 2, 5, 11, 8, 12, 6
Answers:
1. Mean and median
2. Mean for symmetric data and median for skewed
3. Median
4. Mode
5. Mean: 9.25; Median: 8; Mode: 8
Measures of Spread
Variability is the key to Statistics. Without variability, there would be no need for the subject.
When describing data, never rely on center alone.
Measures of Spread:
Range - {rarely used ... why?}
Quartiles - InterQuartile Range {IQR=Q3-Q1}
Variance and Standard Deviation {var and sx}
Like Measures of Center, you must choose the most appropriate measure of spread.
Standard Deviation
Another common measure of spread is the Standard Deviation: a measure of the “average” deviation of all observations from the mean.
To calculate Standard Deviation:
1. Calculate the mean.
2. Determine each observation’s deviation (x - xbar).
3. “Average” the squared deviations by dividing the total squared deviation by (n-1). This quantity is the Variance.
4. Square root the result to determine the Standard Deviation.
Standard Deviation Properties
s measures spread about the mean and should be used only when the mean is used as the measure of center
s = 0 only when there is no spread/variability. This happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger
s, like the mean x-bar, is not resistant. A few outliers can make s very large
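Standard Deviation
Variance:
\[ \text{var} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \]
Standard Deviation:
\[ s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
Example 1.16 (p.85 of TPS 3E): Metabolic Rates
1792 1666 1362 1614 1460 1867 1439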
Standard Deviation
Metabolic Rates: mean=1600
x | (x - x̄) | (x - x̄)²
1792 | 192 | 36864
1666 | 66 | 4356
1362 | -238 | 56644
1614 | 14 | 196
1460 | -140 | 19600
1867 | 267 | 71289
1439 | -161 | 25921
Totals: | 0 | 214870
Total Squared Deviation: 214870
Variance: var = 214870/6 = 35811.67
Standard Deviation: s = √35811.67 = 189.24 cal
What does this value, s, mean?
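The metabolic-rate calculation can be retraced step by step in Python. This is a sketch that follows the procedure on the slides (mean, deviations, squared deviations, divide by n - 1, square root) and then compares the result with the standard library's `statistics.stdev`, which also divides by n - 1. Variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

# Metabolic rates (cal) from Example 1.16.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
x_bar = sum(rates) / n                    # step 1: the mean
deviations = [x - x_bar for x in rates]   # step 2: deviation of each observation
squared = [d ** 2 for d in deviations]    # squared deviations
variance = sum(squared) / (n - 1)         # step 3: divide the total by (n - 1)
s = sqrt(variance)                        # step 4: square root

print(f"mean = {x_bar}, total squared deviation = {sum(squared)}")
print(f"variance = {variance:.2f}, s = {s:.2f} cal")
print(f"statistics.stdev gives {stdev(rates):.2f} cal")
```

The deviations always sum to zero, which is why they are squared before being averaged.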
The Interquartile Range (IQR)
– A measure of center alone can be misleading.
– A useful numerical description of a distribution requires both a measure of center and a measure of spread.
To calculate the quartiles:
1) Arrange the observations in increasing order and locate the median M.
2) The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
3) The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as:
IQR = Q3 – Q1
How to Calculate the Quartiles and the Interquartile Range
Quartiles Q1 and Q3 represent the 25th and 75th percentiles.
To find them, order data from min to max.
Determine the median - average if necessary.
The first quartile is the middle of the ‘bottom half’.
The third quartile is the middle of the ‘top half’.
19 22 23 23 23 26 26 27 28 29 30 31 32: Q1 = 23, med = 26, Q3 = 29.5
45 68 74 75 76 82 82 91 93 98: Q1 = 74, med = 79, Q3 = 91
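The "middle of each half" procedure can be written as a small Python function. This is a sketch; the function name `quartiles` is made up for illustration. Library routines such as `statistics.quantiles` interpolate between observations and may return slightly different values, so the method from the slides is coded directly here.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-halves method described above."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations to the left of the median
    upper = ordered[half + n % 2:]   # observations to the right (the middle value is skipped when n is odd)
    return median(lower), median(ordered), median(upper)

first = quartiles([19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 30, 31, 32])
second = quartiles([45, 68, 74, 75, 76, 82, 82, 91, 93, 98])

print(first)
print(second)
print("IQR of the second data set:", second[2] - second[0])
```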
Example 1
Which of the following measures of spread are resistant?
1. Range: Not Resistant
2. Variance: Not Resistant
3. Standard Deviation: Not Resistant
4. Interquartile Range (IQR): Resistant
Example 2
• Travel times to work for 20 randomly selected New Yorkers
Example, page 57
Data: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Ordered: 5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Q1 = 15, M = 22.5, Q3 = 42.5
IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes
Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.
Determining Outliers
InterQuartile Range “IQR”: Distance between Q1 and Q3. Resistant measure of spread...only measures middle 50% of data.
IQR = Q3 - Q1 {width of the “box” in a boxplot}
1.5 IQR Rule: If an observation falls more than 1.5 IQRs above Q3 or below Q1, it is an outlier.
“1.5 IQR Rule”
Why 1.5? According to John Tukey, 1 IQR seemed like too little and 2 IQRs seemed like too much...
Outliers: 1.5 IQR Rule
To determine outliers:
1. Find 5 Number Summary
2. Determine IQR
3. Multiply the IQR by 1.5
4. Set up “fences”
A. Lower Fence: Q1 - (1.5 IQR)
B. Upper Fence: Q3 + (1.5 IQR)
5. Observations “outside” the fences are outliers.
Example 2 part 2
• In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
Definition:
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.
Example, page 57
In the New York travel time data, we found Q1=15 minutes, Q3=42.5 minutes, and IQR=27.5 minutes.
For these data, 1.5 x IQR = 1.5(27.5) = 41.25
Q1 - 1.5 x IQR = 15 – 41.25 = -26.25
Q3+ 1.5 x IQR = 42.5 + 41.25 = 83.75
Any travel time shorter than -26.25 minutes or longer than 83.75 minutes is considered an outlier.
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
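The fence calculation for the travel-time data can be sketched in Python as follows. The quartiles Q1 = 15 and Q3 = 42.5 are taken from the example rather than recomputed; variable names are illustrative.

```python
# Travel times (minutes) for the 20 New Yorkers in the example.
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, q3 = 15, 42.5                # quartiles found in the example
iqr = q3 - q1                    # 27.5 minutes
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# An observation is an outlier when it falls outside the fences.
outliers = [t for t in times if t < lower_fence or t > upper_fence]

print(f"fences: [{lower_fence}, {upper_fence}]")
print("outliers:", outliers)
```

No travel time can be negative, so only the upper fence matters in practice for these data.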
5-Number Summary, Boxplots
The 5 Number Summary provides a reasonably complete description of the center and spread of a distribution.
We can visualize the 5 Number Summary with a boxplot.
[Boxplot of Quiz Scores, axis from 45 to 100: min=45, Q1=74, med=79, Q3=91, max=98. Outlier?]
Drawing a Boxplot
The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.
• Draw and label a number line that includes the range of the distribution.
• Draw a central box from Q1 to Q3.
• Note the median M inside the box.
• Extend lines (whiskers) from the box out to the minimum and maximum values that are not outliers
Example 2 part 3
• Boxplot
Data: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Ordered: 5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Min = 5, Q1 = 15, M = 22.5, Q3 = 42.5, Max = 85
[Boxplot of TravelTime, axis from 0 to 90]
Recall, Max = 85 is an outlier by the 1.5 x IQR rule.
Example 3
Consumer Reports did a study of ice cream bars (sigh, only vanilla flavored) in their August 1989 issue. Twenty-seven bars having a taste-test rating of at least “fair” were listed, and calories per bar was included. Calories vary quite a bit partly because bars are not of uniform size. Just how many calories should an ice cream bar contain?
Construct a boxplot for the data below.
342 377 319 353 295 234 294 286
377 182 310 439 111 201 182 197
209 147 190 151 131 151
Example 3 - Answer
Q1 = 182 Q2 = 221.5 Q3 = 319
Min = 111 Max = 439 Range = 328
IQR = 137 UF = 524.5 LF = -23.5
[Boxplot of Calories, axis from 100 to 500]
Example 4
The weights of 20 randomly selected juniors at MSHS are recorded below:
a) Construct a boxplot of the data
b) Determine if there are any mild or extreme outliers
c) Comment on the distribution
121 126 130 132 143 137 141 144 148 205
125 128 131 133 135 139 141 147 153 213
Example 4 - Answer
Q1 = 130.5 Q2 = 138 Q3 = 145.5
Min = 121 Max = 213 Range = 92
IQR = 15 UF = 168 LF = 108
Mean = 143.6
StDev = 23.91
[Boxplot of Weight (lbs), axis from 100 to 220; the two points marked * are extreme outliers (> 3 IQR from Q3)]
Shape: somewhat symmetric
Outliers: 2 extreme outliers
Center: Median = 138
Spread: IQR = 15
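The answer to Example 4 can be checked with a short Python sketch. The slides define an extreme outlier as more than 3 IQR beyond a quartile; the "mild" band used below (between 1.5 and 3 IQR) is the usual convention and is an assumption here, since the slides do not spell it out. The helper name `classify` is illustrative.

```python
from statistics import median

# Weights (lbs) of the 20 juniors from Example 4.
weights = [121, 126, 130, 132, 143, 137, 141, 144, 148, 205,
           125, 128, 131, 133, 135, 139, 141, 147, 153, 213]

ordered = sorted(weights)
half = len(ordered) // 2
q1 = median(ordered[:half])                      # median of the bottom half
q3 = median(ordered[half + len(ordered) % 2:])   # median of the top half
iqr = q3 - q1

def classify(x):
    """Label x by how far it lies beyond the nearer quartile."""
    distance = max(q1 - x, x - q3)   # negative or small when x is inside the box
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:         # assumed convention for a mild outlier
        return "mild"
    return "not an outlier"

extreme = [x for x in ordered if classify(x) == "extreme"]
mild = [x for x in ordered if classify(x) == "mild"]
inside = [x for x in ordered if classify(x) == "not an outlier"]
whiskers = (min(inside), max(inside))   # whiskers stop at the last non-outlier on each side

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print("extreme:", extreme, "mild:", mild, "whiskers:", whiskers)
```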
Example 5
Consider the following test scores for a small class:
75 76 82 93 45 68 74 82 91 98
Plot the data and describe the SOCS:
[Dot plot of scores, axis from 40 to 100]
Shape? skewed left
Outliers? maybe 45
Center? M = 79
Spread? IQR = 91 - 74 = 17
Why use the median to describe the “center”? data skewed
Why use the IQR to describe the “spread”? data skewed
Choosing Measures of Center & Spread
• We now have a choice between two descriptions for center and spread
– Mean and Standard Deviation
– Median and Interquartile Range
•The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
•Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
•NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!
Using the TI-83
• Enter the test data into List, L1
– STAT, EDIT enter data into L1
• Calculate 5 Number Summary
– Hit STAT, go over to CALC, select 1-Var Stats, and hit 2nd 1 (L1)
• Use 2nd Y= (STAT PLOT) to graph the box plot
– Turn plot1 ON
– Select BOX PLOT (4th option, first in second row)
– Xlist: L1
– Freq: 1
– Hit ZOOM 9:ZoomStat to graph the box plot
• Copy graph with appropriate labels and titles
Day 2 Summary and Homework
• Summary
– Sample variance is found by dividing by (n – 1) to keep it an unbiased estimator of population variance (since we estimate the population mean, μ, by using the sample mean, x-bar)
– The larger the standard deviation, the more dispersion the distribution has
– Boxplots can be used to check outliers and distributions
– Use comparative boxplots for two datasets
– Identifying a distribution from boxplots or histograms is subjective!
– Use standard deviation with mean and IQR with median
• Homework
– pg 82: prob 33; pg 89 probs 40, 41; pg 97 probs 45, 46
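The summary's point about dividing by (n – 1) can be illustrated with a tiny exhaustive check in Python. The population [2, 4, 6] and the sample size of 2 are hypothetical values chosen only to keep the enumeration small; they do not come from the slides. The sketch lists every equally likely sample drawn with replacement and averages the sample variance computed both ways.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6]   # hypothetical toy population, for illustration only
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N   # population variance divides by N

def sample_var(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample) / divisor

n = 2
samples = list(product(population, repeat=n))   # all 9 ordered samples, drawn with replacement

avg_n_minus_1 = sum(sample_var(s, n - 1) for s in samples) / len(samples)
avg_n = sum(sample_var(s, n) for s in samples) / len(samples)

print("population variance:       ", pop_var)
print("average using (n - 1):     ", avg_n_minus_1)   # matches the population variance
print("average using n:           ", avg_n)           # too small on average
```

Exact fractions are used so that the comparison is not blurred by floating-point rounding.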