This lecture presentation complements Khan’s tutorials.
1
In this lecture we will discuss the different methods to measure central tendency and dispersion in a statistical sample.
2
Central tendency is just a technical way of saying, what’s typical of this sample? For example, out of all Carlow students, which gender is the more typical one? Male or female? Out of all the products listed on Amazon, which is the best seller? And out of all the eBay listings of “Tickle Me Elmo,” which price is the most common one?
3
These three different measures (mean, median and mode) are discussed in detail by Khan Academy. Here are some brief summaries. We will also discuss the normal distribution.
One key idea is this: If the sample is normally distributed, meaning it looks like a symmetrical bell curve, then the mean, median and mode will be essentially the same number. However, if the sample is skewed either to the left or to the right, then these three numbers will take on different values.
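To make the key idea concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration and are not from the lecture data; they are simply arranged symmetrically around 3.

```python
from statistics import mean, median, mode

# Hypothetical, perfectly symmetric sample (invented for illustration).
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

sym_mean = mean(symmetric)      # arithmetic average
sym_median = median(symmetric)  # middle value when sorted
sym_mode = mode(symmetric)      # most frequent value

# For a symmetric sample like this one, all three measures coincide.
print(sym_mean, sym_median, sym_mode)
```

Real samples are rarely this tidy, so in practice the three numbers are usually close rather than identical.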
4
Concepts like “mean” and “standard deviation” are really based on the theory of the normal curve.
Note it’s a theory, a conceptualization of how data should be distributed in an ideal world.
In reality, distributions are often not perfectly normal.
The next slide is an example.
Note that in a normal distribution, the “mean” falls at the 50th percentile.
5
Look at this distribution of salary data. It’s heavy on the left side, with a long skinny tail on the right.
Definitely not symmetrical.
6
When we impose the normal curve on top of the salary distribution, we see that the normal curve only captures the right tail well. For the left tail, the normal curve doesn’t describe the actual distribution very well.
This is because the salary data is positively skewed.
In skewed data, “mode” and “median” describe the central tendency better than the “mean”.
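The effect of positive skew on the three measures can be sketched in a few lines of Python. The salary figures below are hypothetical (in thousands of dollars) and are not the data shown on the slide; they only mimic its shape, with most values on the left and one very large value in the right tail.

```python
from statistics import mean, median, mode

# Hypothetical salaries in thousands of dollars (invented for illustration).
# Most values cluster on the left; one large salary forms the right tail.
salaries = [30, 30, 30, 35, 40, 45, 50, 60, 200]

sal_mode = mode(salaries)      # most common salary
sal_median = median(salaries)  # middle salary
sal_mean = mean(salaries)      # pulled upward by the 200

print(sal_mode, sal_median, round(sal_mean, 1))
```

In this made-up sample the single large salary drags the mean well above the median and the mode, which is why the mean is a poor summary of what is "typical" here.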
7
In addition to central tendency, we also need a way to describe how spread out the distribution is, and how weird a case is (relative to the mean).
When a case is very close to the mean, we have an average joe. When a case is far off from the mean, on the tip of a long tail, we have a weirdo!
In real life, we often discuss dispersion without realizing it. For example: In which percentile is my child’s height? How many people in this class will get an A? Is the customer’s credit score above or below average? By how much? Is a donation of $30,000 pretty common or very rare? How rare is it?
This slide illustrates the distribution of total purchase after a customer clicks on a link. Look at the data, the mean, the distribution, and reflect on the following questions:
How likely is an average customer to spend $200 per order? Very unlikely: it’s at the end of the curve, in a tail.
How about $35? Much more likely: it’s the average order.
In what percentile is a $67 order? The 84th. We know because it’s one standard deviation (34%) above the mean (50%).
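The percentile reasoning above can be checked with `statistics.NormalDist` from the Python standard library. This is a sketch under two assumptions: that order totals are treated as normally distributed, and that the standard deviation is $32, which is inferred from the statement that a $67 order sits one standard deviation above the $35 mean (it is not stated directly).

```python
from statistics import NormalDist

# Assumed model: mean order of $35, standard deviation of $32
# (inferred from $67 being one standard deviation above the mean).
orders = NormalDist(mu=35, sigma=32)

# cdf(x) gives the share of orders at or below x, i.e. the percentile / 100.
pct_35 = orders.cdf(35) * 100    # the mean sits at the 50th percentile
pct_67 = orders.cdf(67) * 100    # one standard deviation above the mean
pct_200 = orders.cdf(200) * 100  # far out in the right tail

print(round(pct_35), round(pct_67), round(pct_200, 2))
```

The exact figure for one standard deviation above the mean is about 84.1%, which the lecture rounds to 50% + 34% = 84%.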
The next slide explains what a standard deviation is.
8
Standard deviation is a standardized measure of dispersion. It tells you whether the distribution is short and fat (with a big standard deviation) or tall and skinny (with a small standard deviation).
The calculation is explained well by Khan (see the Khan Academy video clips linked in this session).
The basic idea to take away is: The standard deviation tells you, on average, how far away the data points are from the mean.
For example, let’s say that the Steelers have an average score of 25 per game, and the standard deviation is 1. Let’s also say that the Green Bay Packers have an average score of 25 per game, and a standard deviation of 7.
In this example, both teams are comparable in terms of average scores, but the Steelers have a much smaller standard deviation. This means the Steelers’ performance is pretty consistent over time: their scores may be above or below 25, but only by 1-2 points on average. If you plot their scores on a chart, you would see that most of them pack around 25, with a nice narrow distribution that peaks around 25.
In contrast, the Packers may average around 25, but their performance varies widely from game to game. One day they may score 18 (25-7) and the next day they may score 32 (25+7). If you plot their widely varied scores on a chart, you would get a short and fat distribution.
(Go Steelers Go!)
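The two-team comparison can be reproduced with a small Python sketch. The game scores below are hypothetical, chosen only so that both teams average 25 with standard deviations of 1 and 7; they are not real results. The sketch uses `statistics.pstdev`, the population standard deviation, which treats the listed games as the whole set of interest; `statistics.stdev` (the sample version) would give slightly larger values.

```python
from statistics import mean, pstdev

# Hypothetical game scores (invented so the numbers match the example).
steelers = [24, 26, 24, 26]
packers = [18, 32, 18, 32]

steelers_mean, steelers_sd = mean(steelers), pstdev(steelers)
packers_mean, packers_sd = mean(packers), pstdev(packers)

# Same average, very different spread around that average.
print(steelers_mean, steelers_sd)
print(packers_mean, packers_sd)
```

Both teams have the same mean, so the mean alone cannot tell them apart; only the standard deviation shows which team is the consistent one.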
9
What are practical ways to use the standard deviation? With a normal distribution, the mean divides it up evenly in the middle. The portion below the mean covers 50% of the population, and the portion above the mean also covers 50% of the population.
The first standard deviation on either side of the mean covers about 34% of the distribution. In other words, 1 standard deviation above the mean = 50% + 34% = 84% = 84th percentile.
Let’s say that the average weight for a one-year-old is 25 lbs, with a standard deviation of 2 lbs. Connor is 23 lbs. That’s 1 standard deviation below the mean. In other words, he is at 50% - 34%, or in the 16th percentile of the population. Nardia is 27 lbs. That’s 1 standard deviation above the mean. In other words, she is at 50% + 34%, or in the 84th percentile of the population.
Nearly the entire distribution (about 99.7%) is covered by 6 standard deviations: 3 above the mean and 3 below the mean. Hence the name of the quality management program “Six Sigma”.
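The weight example can be verified with `statistics.NormalDist`, assuming (as the lecture does) that one-year-olds' weights follow a normal distribution with mean 25 lbs and standard deviation 2 lbs. The variable names are chosen here for illustration.

```python
from statistics import NormalDist

# Assumed model from the example: mean 25 lbs, standard deviation 2 lbs.
weights = NormalDist(mu=25, sigma=2)

# Percentile = share of the population at or below a given weight.
connor_pct = weights.cdf(23) * 100  # 1 standard deviation below the mean
nardia_pct = weights.cdf(27) * 100  # 1 standard deviation above the mean

# Share of the population within 3 standard deviations of the mean
# (between 19 and 31 lbs), i.e. the "six sigma" span.
within_3sd = weights.cdf(31) - weights.cdf(19)

print(round(connor_pct), round(nardia_pct), round(within_3sd, 4))
```

The rounded percentiles match the 16th and 84th percentiles worked out by hand, and the six-standard-deviation span covers roughly 99.7% of the distribution rather than literally all of it.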
10
More examples:
Given a mean and a standard deviation score, you have a pretty good idea of what the distribution is like: is it fat and short, or tall and skinny?
We can then map out individual scores on the distribution and tell the average joes from the weirdos!
11
The Z score is the number of standard deviations from the mean. With our previous example, Connor would have a Z score of negative 1 (that is, 1 standard deviation below the mean), while Nardia has a Z score of 1 (that is, 1 standard deviation above the mean).
The average joes would have z scores close to zero (e.g., 0.0006, -0.0029), whereas the weirdos have extremely large or small z scores (e.g., 3.07, -2.99).
Again: The z score is the number of standard deviations that a data point is away from the mean. Let's say that the average weight for all American women is 150 lbs, and the standard deviation is 20 lbs. If your weight is 130, then your z score is -1, because you're exactly 1 standard deviation below the mean. If Peggy's weight is 170, then her z score is 1, because she is exactly 1 standard deviation above the mean.
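The z score definition translates directly into a one-line calculation: subtract the mean, then divide by the standard deviation. The sketch below uses a helper named `z_score`, a name chosen here for illustration, and plugs in the numbers from the two weight examples.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / sd

# Adult weight example: mean 150 lbs, standard deviation 20 lbs.
your_z = z_score(130, 150, 20)   # 1 standard deviation below the mean
peggy_z = z_score(170, 150, 20)  # 1 standard deviation above the mean

# One-year-old example: mean 25 lbs, standard deviation 2 lbs.
connor_z = z_score(23, 25, 2)
nardia_z = z_score(27, 25, 2)

print(your_z, peggy_z, connor_z, nardia_z)
```

A negative result means the value is below the mean and a positive result means it is above, so the sign and size of the z score together locate a case on the distribution.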
12
Questions? Schedule a chat/phone meeting with the instructor for more assistance.
13