Numerical Summaries of Center & Variation


Momentary detour...

Ideas for collecting data from our classroom; what would YOU like to collect?

So far, social media, piercings, # pets, first pet, etc.

Turn in your 3 x 5 card by the end of the night; don't feel like you have to put your name on it; your choice.

Bring 3 x 5 cards to COC for this.

Last chapter...

Four Corners: Go to your corner based on whether your birthday falls in the Winter, Spring, Summer, or Fall; 1 minute

In your group, come to a consensus about the three most important topics we learned and list them on the board. 5 minutes.

Do you think there will be equal numbers of students in each corner? Similar numbers in each corner? We will come back to this concept when we discuss hypothesis testing (via the goodness-of-fit test). Afterwards, students can go back to their own seats; discuss and share out.

Last chapter, we learned...

Appropriate graphical representations (numerical & categorical data)

Always graph the data; always.

Describing numerical distributions/data sets via SOCS (the basics; we will get more sophisticated with our descriptions soon); do we use SOCS to describe categorical data distributions? Why or why not?

This is what I came up with; perhaps students' main/big ideas will differ slightly.

SOCS...

Shape, outlier(s), center, spread

We loosely defined center and spread

Now we will be much more specific & detailed

... And remember, always embed context

Here we go ...

Word association time...

When I say a word, you immediately write down what you think it means; don't think, just write.

Ready?

Each student shares out. The idea will come out that people think of average as mean, median, or mode; all are measures of central tendency, but they are very different.

Word association time...

Average

Each student shares out. The idea will come out that people think of average as mean, median, or mode; all are measures of central tendency, but they are very different. So why do we have 3 different measures of central tendency?

Bill Gates walks into a diner...

The annual salaries of 7 patrons in a diner are listed below.

Find the mean and the median using Minitab

Are the mean and the median similar? Would they represent a typical or average customer's salary?

Should we use the mean or the median in this case?

Graph the data (let's choose a histogram) using Minitab. What shape is the distribution?

$45,000, $50,000, $43,000, $40,000, $35,000, $55,000, $46,000

What about the mode? Is that considered typical?

Now, Bill Gates walks into the diner...

Find the mean and the median using Minitab

Are the mean and the median similar? Would they represent a typical or average customer's salary?

Should we use the mean or the median in this case?

Graph the data (let's choose a histogram) using Minitab. What shape is the distribution?

$45,000, $50,000, $43,000, $40,000, $35,000, $55,000, $46,000, $3,710,000,000

What about the mode? Do we have one? So ask what the moral of the story is. Have them come up with resistant vs. non-resistant. NOTE: Gates's salary as of 2012.

What's the moral of this story?

Means are excellent measures of central tendency if the data is (fairly) symmetric

However, means are highly influenced by outlier(s)

So, if the data has an outlier(s), then a better measure of central tendency is the median, which is not influenced by outliers; a measure with this property is called resistant

So, consider the shape of your data/distribution, then wisely choose an appropriate measure of central tendency

Which measure of central tendency should we use?
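Returning to the diner for a moment: here is a minimal sketch, in Python rather than Minitab, of how the mean and median react when Bill Gates walks in. The salary figures are the ones listed on the slides; the variable names are only illustrative.

```python
from statistics import mean, median

# Annual salaries of the 7 diner patrons, as listed on the slide
patrons = [45_000, 50_000, 43_000, 40_000, 35_000, 55_000, 46_000]

# The same diner after Bill Gates walks in (figure as given on the slide)
with_gates = patrons + [3_710_000_000]

for label, salaries in [("Without Gates", patrons), ("With Gates", with_gates)]:
    # The mean uses every value, so one extreme salary drags it upward;
    # the median only depends on the middle of the ordered list.
    print(f"{label}: mean = {mean(salaries):,.0f}, median = {median(salaries):,.0f}")
```

The median barely moves (from $45,000 to $45,500) while the mean jumps to hundreds of millions of dollars, which is what "resistant" vs. "not resistant" looks like in numbers.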

Note: all these data sets have means and medians... But which should we use? Which better describes the data? Note: labeled, scaled, bins of equal width; all good.

Which measure of central tendency should we use?

Not EXACTLY symmetric, but almost symmetric. So the mean is ok. Again, labeled, scaled, bins of equal width; good.

For this distribution, which is larger: mean or median?

The median is just the middle; so with maybe 56 puzzles solved, the median would be about the 28th puzzle or so... So the median would be between 26 and 50 seconds. But the mean would be dragged to the right by the few outlier observations near 126+ seconds to solve the puzzle, so it would be larger. Also think of Bill Gates. Right-skewed distributions lead to mean > median.

Left skewed; how does mean compare to median?

I hope our test scores look something like this; many of you getting A's and B's! What's the value of the median (about)? 36 values, so the median is between the 18th and 19th values. So, the median value is between 81 and 90. But what is the mean? We don't know exactly (because histograms do not retain the original values of the distribution; remember, that is the bad thing about histograms). But we do know that those low test scores are going to drag down our mean (not resistant), so mean < median in left-skewed distributions.

Right skewed, then mean > median; left skewed, then mean < median; (fairly) symmetric, then mean approx. = median. There are many other shapes of distributions, though (bimodal, uniform, etc.). If a distribution is bimodal, the mean and median may be misleading. Again, it is always good practice to graph the data and do your numeric analysis of the data set.

What is the mean of each data set?

Use Minitab and calculate the mean of each of the following data sets:

(13, 19, 14, 23, 10)

(11, 17, 18, 1, 32)

Are they the same distribution/data set?

Another characteristic that is helpful in describing distributions/data sets is standard deviation, which is the typical distance from center (mean)

Standard deviation is usually paired with the mean (FYI, the median is usually paired with the IQR... But more on this later)
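As a quick check on the two data sets above, the following sketch uses Python's standard `statistics` module in place of Minitab. It assumes the sample standard deviation (n - 1 divisor) is the one wanted; the names `set_a` and `set_b` are just labels for this example.

```python
from statistics import mean, stdev

# The two data sets from the slide
set_a = [13, 19, 14, 23, 10]
set_b = [11, 17, 18, 1, 32]

for name, data in [("A", set_a), ("B", set_b)]:
    # stdev() is the sample standard deviation (divides by n - 1)
    print(f"Set {name}: mean = {mean(data)}, SD = {stdev(data):.2f}")
```

Both sets have a mean of 15.8, but set B's values sit much farther from that mean, so its standard deviation is roughly twice as large.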

So data sets can have equal means, but be very different data sets/distributions. Maybe demo a boxplot for students (even though we haven't done boxplots yet). We need another characteristic to help us better describe a given data set. The more spread out the data is, the bigger the SD; the closer together the data is, the smaller the SD.

Game time...

Stand up and line up from shortest to tallest, with absolutely no talking. 2 minutes. Go.

Partner people up by 2s from shortest to tallest. Go to one of their computers for the game. No need to move your stuff. You will return to your own computer in a few minutes.

Let's play the standard deviation game...

Your team's task: Create a data set of four numbers (from 1 to 10) with the lowest standard deviation value possible

Input your four numbers (again, use numbers from 1 to 10 only) into Minitab, then calculate the standard deviation

Change a value or values until you get the lowest possible standard deviation you can. 3 minutes. Go.

Share out and discuss.

In Minitab, use Summary, descriptive statistics. In general, for any data set, what is THE lowest SD possible? Is there just one set of data that has THE lowest SD? Can SD ever be negative? Does that make sense? After share out, return to seats if you want.

Which has the largest SD?

Then draw a histogram with SD = 0; typical distance from center = 0. In these graphs there is no label; be sure to include labels & embed context in your description (SOCS).

Calculating the standard deviation...
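The formula from this slide did not survive in these notes, so what follows is only a sketch of the usual sample standard deviation calculation (square root of the sum of squared deviations from the mean, divided by n - 1). It assumes that is the version intended, and it reuses one of the earlier data sets purely as an example.

```python
from math import sqrt
from statistics import stdev

data = [13, 19, 14, 23, 10]  # example values borrowed from an earlier slide

n = len(data)
x_bar = sum(data) / n                             # step 1: the mean
squared_devs = [(x - x_bar) ** 2 for x in data]   # step 2: squared distances from the mean
variance = sum(squared_devs) / (n - 1)            # step 3: divide by n - 1 (sample variance)
sd_by_hand = sqrt(variance)                       # step 4: take the square root

print(f"by hand: {sd_by_hand:.3f}, statistics.stdev: {stdev(data):.3f}")

# When every value is identical, every distance from the mean is 0, so SD = 0
print(stdev([5, 5, 5, 5]))
```

Because every step squares a distance and then takes a square root, the result can never be negative; 0 is the floor.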

Briefly explain; but emphasize we will never use it; never calculate by hand; always use technology (i.e., Minitab)

Variance... Another measure of spread

Not used very often; usually, if we use a mean as a measure of central tendency, we use the standard deviation as our measure of spread

Variance is related to standard deviation

$\text{variance} = (\text{standard deviation})^2$

$\text{standard deviation} = \sqrt{\text{variance}}$

So if SD = 7, variance = 49; if variance = 16, SD = 4; go to Minitab. It only gives you the SD, but some software packages give both; if you have one, it is very easy to calculate the other.

Data collection time...

Think of two female friends (not family members).

Text them and get their heights, in inches. On the board, write their height (and yours if you are female)

Input your data into Minitab

Find the mean, the median, create a histogram, & describe our data (5 minutes)

If they give their height to you in fractions of an inch (like 68-1/2), round to the nearest inch. Is our distribution symmetric? Unimodal? Mean approx. = median? If so, we have an approx. Normal distribution and can use the Empirical Rule.

The Empirical Rule...

When distributions are unimodal, symmetric, & mean ≈ median, then ... life is beautiful

Distribution is said to be Normal

68% of data within 1 standard deviation of mean

95% of data within 2 standard deviations of mean

99.7% of data within 3 standard deviations of mean

Normal means we can use the Empirical Rule; and it is very helpful in calculating probabilities, in hypothesis testing, and in confidence intervals

68-95-99.7 Rule (Empirical Rule)

For (approximately) Normal Distributions Only
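To see where 68, 95, and 99.7 come from, here is a small sketch using Python's built-in `statistics.NormalDist`. It assumes an exactly Normal model (real data such as our heights will only be approximately Normal), and `share_within` is a helper name made up for this example.

```python
from statistics import NormalDist

# Standard Normal model: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a Normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

The printed proportions round to the 68%, 95%, and 99.7% figures in the rule.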

Empirical Model ...

What's the mean of this model? What was the mean of our sample data? What is the SD of this model? What was the SD of our sample data? Talk about some extreme heights in our data and how rare they are; # of SDs away from center; talk direction as well; my height.

Graph could be labeled better; heights of what? Who? Also, will get to standardizing in next slide.

New topic... (not female heights anymore)...

Is 120 big or small? Think - Pair - Share

Think: 30 seconds; Pair: 1 minute; Share: a few minutes.

TPS... Is 120 big or small?

Big if ... day's temperature in LA in degrees Fahrenheit or # units a student takes during a semester (really big!)

Small if ... monthly rent paid for an apartment in LA

Usual or average if ... Weight in pounds for a 15-year-old girl or systolic blood pressure

Nearly impossible to answer how unusual 120 is unless we know what we are comparing 120 to.

Have students guess # SDs away from center for each; just to give an idea of direction and distance from center

Something else to consider...

A student's ACT score was 25.9; their SAT score was 1172. Which is a better score?

ACT scores (national) mean = 21, standard deviation = 4.7

SAT (national) mean (critical reading & math) = 1010, standard deviation = 163

Let's assume both are approx. Normally distributed. Both are above average; which is better? ACT is a little more than 1 SD above; SAT is a little less than 1 SD above.

Z-Scores, standardizing...

Standardizing data (computing z-scores) means converting raw data into the # of SDs away from the mean
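The ACT vs. SAT question can be worked as a short sketch, assuming the usual z-score definition, z = (value - mean) / SD. The means and SDs are the national figures quoted above; `z_score` is a helper name invented for this example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

act_z = z_score(25.9, mean=21, sd=4.7)
sat_z = z_score(1172, mean=1010, sd=163)

print(f"ACT z = {act_z:.2f}, SAT z = {sat_z:.2f}")
```

The ACT score is about 1.04 SDs above its mean and the SAT score about 0.99 SDs above its mean, so on this scale the ACT score is (slightly) the better one.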

Let's practice with our data...

Calculate z-scores for some tall, short, and average girls in our data set; discuss Empirical Rule. Z-scores make most sense with (fairly) symmetric distributions (z-scores go along with mean, SD).

What about skewed distributions?

Page 93 in text; would the mean describe the center of this distribution well? No, due to it being skewed to the right; this distribution is neither Normal nor approx. Normal. We don't want to use the mean, and can't use z-scores (based on Normal or near-Normal distributions); we need another measure of central tendency and another way to say how far away from center a typical value in this data set is.

Remember Bill Gates?

Median: the center value when the data is organized from smallest to largest value

Consider the distribution: 0, 0, 0, 0, 1, 1, 1, 2, 6

By the way, what could be the context?

Median =

Consider the distribution: 0, 0, 0, 0, 1, 2, 2, 2, 2, 6

Median =

With Bill Gates, it made more sense for us to use the median as a measure of central tendency, as including Bill made our distribution very right skewed. However, sometimes we don't have one center value but rather 2 middle values; Minitab handles this automatically, but if you were calculating by hand, you would find the mean of the two middle values; that would be the median.

Data gathering time again...

# pets you currently have on board & enter into Minitab

Numerical analysis (descriptive statistics in Minitab) and graphical representation

Describe the distribution

Make it skewed if I need to do so.

Skewed? Shouldn't use mean & SD

But we still need to describe the center and the spread of the distribution

Use median and IQR (Inter-quartile Range)

Median & IQR are not affected by outlier(s) (resistant)

IQR = Q3 - Q1

IQR is the amount of space the middle 50% of the data occupy

We already know about using the median; but now we are learning about IQR as a measure of spread; look at our data; find IQR; the middle 50% of us have between xx and xx # of pets now.

Range of data...

Another measure of variability (used with any distribution) is range

Range = maximum value - minimum value

Range for our data =

Graphical representations using median & IQR... Boxplots

This is a boxplot vs. a modified boxplot, which shows (potential) outlier(s)

Boxplots ...

Basic boxplot vs. modified boxplot; when creating boxplots, we use the five-number summary ... Minimum, Q1, median, Q3, maximum. Know this.

Boxplots...

Can be oriented vertically or horizontally. Again, this is a basic boxplot (vs. modified). This one indicates where the median would be (right skewed, so mean > median)

Modified boxplot shows outlier(s)

Notice scale, label, units; excellent boxplot; what we don't know is the exact values of the data; also, it didn't include the # in this study; what is n = ? In general, modified boxplots are more helpful than basic boxplots.

Two modified boxplots...

Can compare two sets easily; use SOCS. Do it. 3 minutes; share out. What is our n = ? Do we know exact values of median? IQR? Do we want to use mean and SD?

What are outliers?

Boxplots are the only graphical representation where we specifically define an outlier

Potential outliers are values that are more than 1.5 IQRs below Q1 or above Q3

IQR x 1.5; add that product to Q3; any value(s) beyond that point is an outlier to the right

IQR x 1.5; subtract that product from Q1; any value(s) beyond that point is an outlier to the left

Go back to our pet data...

Using Minitab, calculate descriptive statistics

Let's calculate (by hand) to see if we have any outliers

Q3 - Q1 = IQR

IQR x 1.5; add this product to Q3; are there any values in our data set beyond this point to the right?

IQR x 1.5; subtract product from Q1; are there any values in our data set beyond this point to the left?
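The by-hand steps above can be sketched in Python. The pet counts below are HYPOTHETICAL (the class's real numbers are not recorded in these notes), and quartile conventions differ between software packages, so the Q1 and Q3 produced by `statistics.quantiles` may not match another package exactly on every data set.

```python
from statistics import quantiles

# HYPOTHETICAL number-of-pets data, invented for illustration only
pets = [0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 12]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3
q1, med, q3 = quantiles(pets, n=4)

iqr = q3 - q1                   # Q3 - Q1 = IQR
upper_fence = q3 + 1.5 * iqr    # IQR x 1.5, added to Q3
lower_fence = q1 - 1.5 * iqr    # IQR x 1.5, subtracted from Q1

# Any value beyond either fence is a potential outlier
outliers = [x for x in pets if x < lower_fence or x > upper_fence]
data_range = max(pets) - min(pets)   # range = maximum - minimum

print(f"five-number summary: {min(pets)}, {q1}, {med}, {q3}, {max(pets)}")
print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence}), range = {data_range}")
print(f"potential outliers: {outliers}")
```

With this made-up data, only the student with 12 pets falls beyond the upper fence; nothing can fall below the lower fence because it is negative.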

Now use Minitab to create a boxplot; are our calculations confirmed with our boxplot?

Be careful with outliers...

Are they really an outlier?

Is your data correct? Was it input accurately?

COC's recent 99-year-old graduate

Don't automatically throw out an unusual piece of data; investigate

Be careful... one more thing...

Re: boxplots; best used with unimodal data sets; boxplots hide bimodality (or multimodality). See above, pg 107 of text. Also remember the five-number summary. Boxplots should not be used with very small data sets; it takes at least 5 numbers in a distribution to create a boxplot

Chapters 1, 2, 3...

Review

Exam