Chapter 3 Descriptive Statistics Numerical Methods

Page 1: Chapter 3 Descriptive Statistics Numerical Methods

Chapter 3

Descriptive Statistics

Numerical Methods

Page 2: Chapter 3 Descriptive Statistics Numerical Methods

Our goal? Numbers to help us answer simple questions.

What is a typical value?

How variable are the data?

How extreme is a particular value?

Given data on two variables, how closely do they move together?

Page 3: Chapter 3 Descriptive Statistics Numerical Methods

Measures of Central Tendency

Here are three ways to identify a “typical” observation:

Mean: the arithmetic average

Median: the middlemost value

Mode: the most common value

Page 4: Chapter 3 Descriptive Statistics Numerical Methods

There are formulas, but . . .

One confusing thing about the formulas is all the notation they use.

To explain why we need the notation, and why you need to know it, let me remind you of an important distinction . . .

Page 5: Chapter 3 Descriptive Statistics Numerical Methods

Population vs. Sample

A population is the set of all data that characterize some phenomenon, and a number computed from population data is called a parameter.

A sample is a subset of a population, and a number computed from sample data is called a statistic.

Page 6: Chapter 3 Descriptive Statistics Numerical Methods

An Example

Population: All registered voters.

Parameter: The fraction of all registered voters who prefer John McCain to Hillary Clinton.

Sample: 2500 voters surveyed by Gallup.

Statistic: The fraction of voters in the poll who prefer McCain to Clinton.

Page 7: Chapter 3 Descriptive Statistics Numerical Methods

Another Example

Population: All Duracell AA batteries.

Parameter: The average lifetime of all Duracell AA batteries in a particular toy.

Sample: A hundred batteries being tested by the manufacturer.

Statistic: The average lifetime of the 100 tested batteries in the particular toy.

Page 8: Chapter 3 Descriptive Statistics Numerical Methods

Why is the distinction important?

Sample statistics are very different from population parameters. Parameters are fixed numbers. Before the sample is drawn, statistics depend on the elements that may be selected, and are random.

Once a sample is drawn, the statistic is likely to differ from the parameter; for example, 48% of the population but 51% of the sample may prefer Clinton.

Therefore, our notation must clearly distinguish sample statistics from population parameters.

Page 9: Chapter 3 Descriptive Statistics Numerical Methods

Now let us return . . .

To those measures of central tendency.

Page 10: Chapter 3 Descriptive Statistics Numerical Methods

The Sample Mean

The sample mean is the arithmetic mean of some sample data.

The notation for a sample mean is X-bar.

The notation for sample size is a lower case n.

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Page 11: Chapter 3 Descriptive Statistics Numerical Methods

The Population Mean

The population mean is the arithmetic mean of some population data.

The notation for a population mean is the Greek letter mu.

The notation for population size is an upper case N.

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Page 12: Chapter 3 Descriptive Statistics Numerical Methods

And THIS is the summation operator

Here it is just telling us to add the observations.

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$$

Page 13: Chapter 3 Descriptive Statistics Numerical Methods

Don’t be intimidated by the summation operator

It is just shorthand; it saves space.

The summation operator is just the Greek letter Sigma.

Sigma is Greek for S, and S stands for Sum.

S for Sum, meaning “add them all up.”

Page 14: Chapter 3 Descriptive Statistics Numerical Methods

Formulas with summation operators confuse you?

Consult Anderson, Sweeney, and Williams, Appendix C, and memorize the rules listed there, or

Do what I do, which is figure it out as you go along. For example . . .

Page 15: Chapter 3 Descriptive Statistics Numerical Methods

Is this a valid operation?

$$\sum_{i=1}^{n} a x_i \overset{?}{=} a \sum_{i=1}^{n} x_i$$

Page 16: Chapter 3 Descriptive Statistics Numerical Methods

Just undo the shorthand to see!

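$$\sum_{i=1}^{n} a x_i = a x_1 + a x_2 + a x_3 + \cdots + a x_n$$

$$a \sum_{i=1}^{n} x_i = a \left( x_1 + x_2 + x_3 + \cdots + x_n \right)$$

So

$$\sum_{i=1}^{n} a x_i = a \sum_{i=1}^{n} x_i$$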

Page 17: Chapter 3 Descriptive Statistics Numerical Methods

How about this? Is it ok?

$$\sum_{i=1}^{n} \left( a x_i + b y_i \right) \overset{?}{=} a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i$$

Page 18: Chapter 3 Descriptive Statistics Numerical Methods

Undo the shorthand to check

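$$\sum_{i=1}^{n} \left( a x_i + b y_i \right) = (a x_1 + b y_1) + (a x_2 + b y_2) + (a x_3 + b y_3) + \cdots + (a x_n + b y_n)$$

$$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i = a \left( x_1 + x_2 + x_3 + \cdots + x_n \right) + b \left( y_1 + y_2 + y_3 + \cdots + y_n \right)$$

So, yes,

$$\sum_{i=1}^{n} \left( a x_i + b y_i \right) = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i$$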
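Both rules can also be checked numerically. The short Python sketch below uses made-up values for x, y, a, and b (none of them come from the slides); it only illustrates the identities and is not a proof.

```python
# Hypothetical values, chosen only to illustrate the two summation rules.
x = [2, 3, 3, 4, 6]
y = [1, 5, 2, 8, 7]
a, b = 3, -2

# Rule 1: a constant factor can be pulled outside the sum.
lhs_scale = sum(a * xi for xi in x)
rhs_scale = a * sum(x)

# Rule 2: the sum of (a*x + b*y) splits into a*sum(x) + b*sum(y).
lhs_linear = sum(a * xi + b * yi for xi, yi in zip(x, y))
rhs_linear = a * sum(x) + b * sum(y)

print(lhs_scale, rhs_scale)
print(lhs_linear, rhs_linear)
```

Each pair of printed numbers should match, whatever values are substituted.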

Page 19: Chapter 3 Descriptive Statistics Numerical Methods

Let’s go back to our formulas

Suppose you are given the following data

2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17

Page 20: Chapter 3 Descriptive Statistics Numerical Methods

Sample or population data?

It depends on the context. Suppose this data came from asking 13 people the number of computer games they own.

If you are investigating the number of games owned by these particular 13 people, then this is population data.

If you are investigating the number of games owned by a larger group, and these 13 people are members of that group, then this is sample data.

In homework problems, the default is sample data.

Page 21: Chapter 3 Descriptive Statistics Numerical Methods

The mean in this example

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\bar{X} = \frac{2 + 3 + 3 + 4 + 6 + 7 + 8 + 11 + 12 + 13 + 15 + 16 + 17}{13} = \frac{117}{13} = 9$$
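The same calculation as a minimal Python sketch. The function name `mean` is just an illustrative choice; the data are the 13 observations from the example.

```python
# The 13 observations from the example (computer games owned).
data = [2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17]

def mean(values):
    """Arithmetic mean: add the observations, then divide by how many there are."""
    return sum(values) / len(values)

x_bar = mean(data)
print(sum(data), len(data), x_bar)
```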

Page 22: Chapter 3 Descriptive Statistics Numerical Methods

The median in this example

Since the median is the middlemost observation, one way to find it is to order the observations and throw away observations one at a time from each end, until one is left in the middle; in this case, 8.

2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17

2, 3, 3, 4, 6, 7, [8], 11, 12, 13, 15, 16, 17 (everything except the bracketed 8 has been thrown away)

Page 23: Chapter 3 Descriptive Statistics Numerical Methods

Suppose you have an even number of observations!

In that case, you will be left with two numbers in the middle when you have finished eliminating numbers from both the top and bottom.

The median is found by adding those two numbers and dividing by two.
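Both cases, odd and even, can be sketched in a few lines of Python. The 12-value list below is simply the example data with the 17 dropped, an assumption made here only to show the even case.

```python
def median(values):
    """Middle value of the ordered data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # one value left in the middle
    return (ordered[mid - 1] + ordered[mid]) / 2  # two left: add them and divide by two

odd_data = [2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17]
even_data = odd_data[:-1]  # hypothetical: same data without the 17

print(median(odd_data))
print(median(even_data))
```

With 12 observations the two middle values are 7 and 8, so the median is their average.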

Page 24: Chapter 3 Descriptive Statistics Numerical Methods

The mode in this example

Here is our data. Note that the only observation that appears twice is “3” making it the mode, or most common observation.

But 3 is not a very typical observation, which is why the mode is hardly ever used.

2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17
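Counting how often each value occurs makes the mode easy to find. This is one possible sketch using `collections.Counter` from the Python standard library; it returns a list because a data set can have more than one mode.

```python
from collections import Counter

data = [2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17]

counts = Counter(data)              # value -> number of occurrences
top_count = max(counts.values())    # the highest frequency observed
modes = [value for value, c in counts.items() if c == top_count]

print(modes, top_count)
```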

Page 25: Chapter 3 Descriptive Statistics Numerical Methods

Mean vs. Median

In 1983, the average starting salary of Rhetoric and Communications majors at the University of Virginia was approximately $35,000 a year, far more than that of other majors in the college of Arts and Science.

Can you guess why?

Page 26: Chapter 3 Descriptive Statistics Numerical Methods

Here is your answer!

Ralph Sampson, a Rhetoric and Communications major, was the first pick in the NBA draft. The Houston Rockets paid him $2,000,000 a year.

Page 27: Chapter 3 Descriptive Statistics Numerical Methods

Robust Statistics

A statistic is said to be robust if it is not dramatically affected by a small number of extreme observations.

The median is robust, the mean is not.

Therefore the median is usually a better indication of a “typical value.”
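A small Python sketch shows the effect. The nine ordinary salaries below are invented for illustration (they are not the actual 1983 figures); only the $2,000,000 outlier echoes the Ralph Sampson story.

```python
import statistics

# Hypothetical starting salaries for nine graduates (invented numbers).
salaries = [18000, 19000, 20000, 21000, 22000, 23000, 24000, 25000, 26000]
with_star = salaries + [2_000_000]   # add one extreme observation

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_star)
median_after = statistics.median(with_star)

print(mean_before, median_before)
print(mean_after, median_after)
```

One extreme value multiplies the mean roughly tenfold, while the median moves by only $500.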

Page 28: Chapter 3 Descriptive Statistics Numerical Methods

How the mean and median differ

If a distribution is not symmetric . . .

And there are a handful of extremely large or small values . . .

The mean will be pulled in the direction of the extreme values.

The Ralph Sampson story illustrates the problem.

Page 29: Chapter 3 Descriptive Statistics Numerical Methods

Look at the income distribution in the USA in 1992

Page 30: Chapter 3 Descriptive Statistics Numerical Methods

Asymmetric, skewed to the right

The median income is marked on the graph, at about $22,000 a year.

The mean is not reported, but it appears to be about $30,000 a year.

Page 31: Chapter 3 Descriptive Statistics Numerical Methods

Many people use statistics the way a drunk uses a lamp post

For support . . .

Not for Illumination.

Page 32: Chapter 3 Descriptive Statistics Numerical Methods

And they play games with the mean and median

An incumbent politician will boast of how well the economy is doing, and use mean income numbers as evidence.

The challenger will complain of how badly the economy is doing, and use median income numbers as evidence.

Confusing voters.

Page 33: Chapter 3 Descriptive Statistics Numerical Methods

Measures of Variability

Range

Interquartile Range

Variance

Standard Deviation

Coefficient of Variation

Page 34: Chapter 3 Descriptive Statistics Numerical Methods

The Range

The Range of a data set is just the difference between the biggest and smallest observation.

The Range is easy to compute, but it is not robust, and therefore may be misleading.

As in: “Starting salaries for Rhetoric and Communications majors range from $18,000 to $2,000,000 a year.”

Page 35: Chapter 3 Descriptive Statistics Numerical Methods

An Example

Let's use the earlier data to illustrate.

2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17

The Range is 17 - 2 = 15.

Page 36: Chapter 3 Descriptive Statistics Numerical Methods

Interquartile Range (IQR)

This is the spread of the middle 50% of the observations.

It is defined as Q3 - Q1.

Q3 is the third quartile, or 75th percentile. 75% of all observations are smaller than Q3.

Q1 is the first quartile, or 25th percentile. 25% of all observations are smaller than Q1.

(Q2 is the second quartile, or median.)

Page 37: Chapter 3 Descriptive Statistics Numerical Methods

How do you find quartiles?

Basically, to find Q3, the 75th percentile, order the data and throw away 3 observations from the bottom for every one from the top. Here, Q3 is 13.

2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17

2, 3, 3, 4, 6, 7, 8, 11, 12, [13], 15, 16, 17 (everything except the bracketed 13 has been thrown away)

Page 38: Chapter 3 Descriptive Statistics Numerical Methods

Q1 works the same way

To find Q1, the 25th percentile, order the data and throw away 1 observation from the bottom for every 3 from the top. Here Q1 is 4, so IQR = Q3 - Q1 = 13 - 4 = 9.

2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17

2, 3, 3, [4], 6, 7, 8, 11, 12, 13, 15, 16, 17 (everything except the bracketed 4 has been thrown away)
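The range and the throw-away shortcut for the quartiles can be sketched in Python. The sketch assumes the number of observations has the form 4k + 1 (here 13, so k = 3), which is the only case where the shortcut leaves exactly one survivor; the variable names are illustrative.

```python
data = [2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17]
ordered = sorted(data)
n = len(ordered)

data_range = ordered[-1] - ordered[0]   # biggest minus smallest

# Assumption: n = 4k + 1, so discarding leaves a single observation.
if (n - 1) % 4 != 0:
    raise ValueError("shortcut needs 4k + 1 observations")
k = (n - 1) // 4

q1 = ordered[k]       # discard k from the bottom and 3k from the top
q3 = ordered[3 * k]   # discard 3k from the bottom and k from the top
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```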

Page 39: Chapter 3 Descriptive Statistics Numerical Methods

I cheated a bit to make it simple

With 13 observations, eliminating observations in this way leaves you with just one observation remaining.

If the number of observations you have is not of the form 4k+1 for some whole number k, there will be two, three, or four observations remaining.

Then you must round or interpolate.

Page 40: Chapter 3 Descriptive Statistics Numerical Methods

A, S, & W propose this solution

Arrange data in ascending order.

Compute i, where p is the percentile you seek and n is the sample size:

$$i = \left( \frac{p}{100} \right) n$$

If i is an integer, average the ith and (i+1)st observations.

If i is not an integer, round up; the observation in that position is the percentile.

Page 41: Chapter 3 Descriptive Statistics Numerical Methods

An Example finding Q3

2, 3, 6, 7, 9, 10

Here p = 75, n = 6.

Which gives

$$i = \left( \frac{75}{100} \right) 6 = 4.5$$

Which is not an integer. So round up to 5.

The 5th observation is 9, so Q3 = 9.

Page 42: Chapter 3 Descriptive Statistics Numerical Methods

An Example finding Q1

2, 5, 6, 6, 7, 8, 9, 10

Here p = 25, n = 8.

Which gives

$$i = \left( \frac{25}{100} \right) 8 = 2$$

Which IS an integer. So average the second and third observations.

To get (5+6)/2 = 5.5.

So Q1 = 5.5.
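The A, S, & W rule and the two worked examples can be sketched in Python as follows. The function name `asw_percentile` is an illustrative choice, and the sketch assumes p is a whole number strictly between 0 and 100.

```python
def asw_percentile(values, p):
    """Percentile by the A, S, & W rule: i = (p/100) * n, then average or round up.

    Assumes p is an integer with 0 < p < 100.
    """
    ordered = sorted(values)
    n = len(ordered)
    product = p * n                  # work with p*n to avoid floating-point fuzz
    if product % 100 == 0:           # i is an integer
        i = product // 100
        # average the ith and (i+1)st observations (1-based positions)
        return (ordered[i - 1] + ordered[i]) / 2
    i = product // 100 + 1           # i is not an integer: round up
    return ordered[i - 1]

print(asw_percentile([2, 3, 6, 7, 9, 10], 75))
print(asw_percentile([2, 5, 6, 6, 7, 8, 9, 10], 25))
```

Applied to the 13 observations used earlier, the same function gives Q1 = 4 and Q3 = 13, matching the throw-away method.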

Page 43: Chapter 3 Descriptive Statistics Numerical Methods

But this is an arbitrary convention

Minitab uses a different rule.

In our first example, where we got Q3 = 9, Minitab gets Q3 = 9.25.

In our second example, where we got Q1 = 5.5, Minitab gets Q1 = 5.25.
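The slides do not say what Minitab's rule is. The Python sketch below assumes one widely used convention: place the percentile at position (n+1)p/100 in the ordered data and interpolate linearly between neighbours. That assumption does reproduce the 9.25 and 5.25 quoted above, but treat it as an illustration rather than a statement of Minitab's documented behaviour.

```python
def interpolated_percentile(values, p):
    """Percentile at 1-based position (n + 1) * p / 100, with linear interpolation.

    Assumed convention; not taken from the slides.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p / 100
    lower = int(position)            # whole part of the position
    fraction = position - lower      # how far to move toward the next observation
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

print(interpolated_percentile([2, 3, 6, 7, 9, 10], 75))
print(interpolated_percentile([2, 5, 6, 6, 7, 8, 9, 10], 25))
```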

Page 44: Chapter 3 Descriptive Statistics Numerical Methods

Variance of a Population

The variance is the average size of a squared deviation about the mean.

Lower-case sigma squared is population variance.

Note the use of mu and N in the formula: all these are population parameters.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Page 45: Chapter 3 Descriptive Statistics Numerical Methods

Variance of a Sample

Lower-case s-squared denotes sample variance.

Note the use of X-bar and n in the formula: these are sample statistics.

Also note the funky denominator, n-1, where you would expect to see n.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

Page 46: Chapter 3 Descriptive Statistics Numerical Methods

Why use n-1 with sample data?

A sophisticated explanation is coming in Chapter 7, but think of it as a fudge factor.

Having to compute squared deviations around the sample mean instead of the true population mean makes the numerator too small.

Dividing by n-1 corrects for this.
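The fudge factor can be seen at work in a tiny example. The Python sketch below uses a made-up population {1, 2, 3} and lists every equally likely sample of size 2 drawn with replacement (both are assumptions for illustration). It then averages the two candidate formulas over all samples.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]                      # hypothetical tiny population
N = len(population)
mu = Fraction(sum(population), N)
sigma_sq = sum((x - mu) ** 2 for x in population) / N   # true population variance

n = 2
samples = list(product(population, repeat=n))   # all 9 samples, drawn with replacement

def sum_sq_dev(sample):
    """Sum of squared deviations about the SAMPLE mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)

# Average each version of the formula over every possible sample.
avg_divide_by_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma_sq, avg_divide_by_n, avg_divide_by_n_minus_1)
```

In this example, dividing by n gives an average that is too small, while dividing by n-1 hits the population variance exactly.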

Page 47: Chapter 3 Descriptive Statistics Numerical Methods

Example of the Calculation

The heart of the calculation is evaluating the numerator. Here is our example. Remember, the mean is 9.

2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = (2-9)^2 + (3-9)^2 + (3-9)^2 + \cdots$$

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 49 + 36 + 36 + \cdots = 338$$

Page 48: Chapter 3 Descriptive Statistics Numerical Methods

Finishing the Variance calculation

Given the sum of squared deviations from the mean, the calculation is as follows:

Divide by n-1 for sample data.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{338}{12} = 28.1667$$

Divide by N for population data.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{338}{13} = 26$$
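The whole variance calculation, as a short Python sketch. The hand-rolled steps mirror the slides; the closing comparison uses `statistics.variance` and `statistics.pvariance` from the standard library as a cross-check.

```python
import statistics

data = [2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17]
n = len(data)
x_bar = sum(data) / n

# Numerator: sum of squared deviations about the mean.
sum_sq = sum((x - x_bar) ** 2 for x in data)

sample_variance = sum_sq / (n - 1)    # treat the data as a sample
population_variance = sum_sq / n      # treat the data as a population

print(sum_sq, round(sample_variance, 4), population_variance)
print(statistics.variance(data), statistics.pvariance(data))
```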

Page 49: Chapter 3 Descriptive Statistics Numerical Methods

The Standard Deviation

The variance measures variability in nonsense units; in this case, number of computer games squared.

To correct this, we introduce the standard deviation, which is just the square root of the variance.

The standard deviation can be thought of as the size of a typical deviation from the mean.

$$s = \sqrt{s^2} = \sqrt{28.1667} = 5.31$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{26} = 5.099$$

Page 50: Chapter 3 Descriptive Statistics Numerical Methods

Coefficient of Variation

Seldom used in this course.

Answers: “The standard deviation is what percent of the average?”

Why is this useful?

An inch more or less in the height of a skyscraper is meaningless.

An inch more or less in the length of your nose is a big deal.

$$\text{coefficient of variation} = \frac{s}{\bar{x}} \times 100$$

$$\frac{s}{\bar{x}} \times 100 = \frac{5.31}{9} \times 100 = 59$$
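A Python sketch of the standard deviation and coefficient of variation for the same data, using the standard library's `statistics` module. The variable names are illustrative.

```python
import statistics

data = [2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17]

x_bar = statistics.mean(data)
s = statistics.stdev(data)        # sample standard deviation (n - 1 divisor)
sigma = statistics.pstdev(data)   # population standard deviation (N divisor)

# Standard deviation as a percent of the mean.
coefficient_of_variation = s / x_bar * 100

print(round(s, 2), round(sigma, 3), round(coefficient_of_variation))
```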

Page 51: Chapter 3 Descriptive Statistics Numerical Methods

Can you show us a use for the standard deviation?

Many real world data sets have an approximate bell shape, as you no doubt have been told.

[Figure: The Famous Bell Curve, otherwise known as the Normal Distribution]

Page 52: Chapter 3 Descriptive Statistics Numerical Methods

A Rule for such variables

68% of all observations are found within one standard deviation of the mean.

95.5% of all observations are found within two standard deviations of the mean.

99.7% of all observations are found within three standard deviations of the mean.

So any observation more than 2s from the mean is unusual, and one more than 3s from the mean is very unusual.
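These percentages come from the normal distribution and can be reproduced with the error function in Python's `math` module, using the identity that the share within k standard deviations is erf(k/√2). The function name `share_within` is illustrative.

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

The rule only applies to data that are approximately bell shaped.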

Page 53: Chapter 3 Descriptive Statistics Numerical Methods

The z-Score

This measures how many standard deviations above or below the mean a particular observation is.

Positive values are above the mean, negative ones below.

Any z greater than 2 in absolute value is a mild outlier.

Any z greater than 3 in absolute value is a substantial outlier.

$$z_i = \frac{x_i - \bar{x}}{s}$$
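A sketch of the z-score calculation in Python, applied to the 13 observations from the earlier examples and treating them as sample data. The cutoff of 2 follows the mild-outlier rule above.

```python
import statistics

data = [2, 3, 3, 4, 6, 7, 8, 11, 12, 13, 15, 16, 17]

x_bar = statistics.mean(data)
s = statistics.stdev(data)   # sample standard deviation

# Standard deviations above (+) or below (-) the mean, one per observation.
z_scores = [(x - x_bar) / s for x in data]

# Flag observations whose z-score exceeds 2 in absolute value.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 2]

print([round(z, 2) for z in z_scores])
print(outliers)
```

Even the largest observation, 17, is only about 1.5 standard deviations above the mean, so nothing in this data set is flagged.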

Page 54: Chapter 3 Descriptive Statistics Numerical Methods

Why should we care about outliers?

Depends on the circumstances, but outliers often require investigation.

An outlier may signal a data entry error!

An outlier may identify a particularly poor outcome that needs to be corrected.

An outlier may identify a particularly good outcome that needs to be emulated.

Page 55: Chapter 3 Descriptive Statistics Numerical Methods

Measures of Association between two variables

Often we want to know whether variables are positively or negatively associated, and if so, how strong the association is.

Some examples will illustrate what I mean.

Page 56: Chapter 3 Descriptive Statistics Numerical Methods

Positive association

Page 57: Chapter 3 Descriptive Statistics Numerical Methods

No association

Page 58: Chapter 3 Descriptive Statistics Numerical Methods

Negative association

Page 59: Chapter 3 Descriptive Statistics Numerical Methods

No linear association, but . . .

Page 60: Chapter 3 Descriptive Statistics Numerical Methods

Covariance

A measure of the degree of linear association.

It is an “average size” of a cross product.

First formula is the population covariance.

Second formula is the sample covariance.

$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

Page 61: Chapter 3 Descriptive Statistics Numerical Methods

Why this cross product?

$$(x_i - \bar{x})(y_i - \bar{y})$$

When x and y are simultaneously bigger than their means, both terms are positive, contributing a positive term to the sum.

When x and y are both simultaneously smaller than their means, both terms are negative, once again contributing a positive term to the sum.

Positively related variables will have many + terms in the sum.

Page 62: Chapter 3 Descriptive Statistics Numerical Methods

Conversely. . .

$$(x_i - \bar{x})(y_i - \bar{y})$$

If a big x and small y are paired, the first term is positive, while the second is negative, contributing a negative term to the sum.

If a small x and big y are paired, the first term is negative, while the second is positive, contributing a negative term to the sum.

Negatively related variables will have many negative terms in the sum.

Page 63: Chapter 3 Descriptive Statistics Numerical Methods

And if x and y are unrelated

The sum will consist of offsetting positive and negative terms.

Therefore:

A positive covariance means positive association.

A negative covariance means negative association.

A zero covariance means no (linear) association.

Page 64: Chapter 3 Descriptive Statistics Numerical Methods

An example

Here are some data: on casual observation they seem positively associated.

First we need the mean of each of the two variables.

The mean of x is 16. The mean of y is 10.

X Y

6 6

11 9

15 6

21 17

27 12

Page 65: Chapter 3 Descriptive Statistics Numerical Methods

Computing the numerator

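$$\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) &= (6-16)(6-10) + (11-16)(9-10) + (15-16)(6-10) \\
&\quad + (21-16)(17-10) + (27-16)(12-10) \\
&= (-10)(-4) + (-5)(-1) + (-1)(-4) + (5)(7) + (11)(2) \\
&= 40 + 5 + 4 + 35 + 22 = 106
\end{aligned}$$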

Page 66: Chapter 3 Descriptive Statistics Numerical Methods

Completing the calculation

The numerator is the same for both the sample and population covariance.

The only difference is in the denominator, because of the n-1 divisor.

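$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} = \frac{106}{5} = 21.2$$

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{106}{4} = 26.5$$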
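The covariance calculation for this example, sketched in Python. The variable names are illustrative; the data are the five (X, Y) pairs from the example.

```python
x = [6, 11, 15, 21, 27]
y = [6, 9, 6, 17, 12]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Numerator: sum of cross products of deviations from the means.
cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

population_covariance = cross_products / n      # divide by N
sample_covariance = cross_products / (n - 1)    # divide by n - 1

print(x_bar, y_bar, cross_products)
print(population_covariance, sample_covariance)
```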

Page 67: Chapter 3 Descriptive Statistics Numerical Methods

Hmmm . . .

We can see that the relationship is positive because the covariance is positive, but what are we to make of 26.5? Is that big or small?

Page 68: Chapter 3 Descriptive Statistics Numerical Methods

The Covariance is Flawed

There are really TWO problems.

The covariance lacks a scale, so we have no way to judge its size.

The covariance depends on units. Measuring x and y in inches we’d get one answer. Someone measuring x and y in centimeters would get a different answer, even though the degree of association is exactly the same!

Page 69: Chapter 3 Descriptive Statistics Numerical Methods

Correlation is superior

Top formula defines the population correlation.

Bottom formula defines the sample correlation.

Correlation is unit free and always between –1 and +1.

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
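The unit-free claim can be checked with a short Python sketch. It assumes, purely for illustration, that the example's X and Y values are measured in inches, converts them to centimeters, and compares covariance and correlation before and after. The helper names `sample_cov` and `sample_corr` are hypothetical.

```python
import math

def sample_cov(u, v):
    """Sample covariance: cross products of deviations, divided by n - 1."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / (n - 1)

def sample_corr(u, v):
    """Sample correlation: covariance divided by both standard deviations."""
    return sample_cov(u, v) / math.sqrt(sample_cov(u, u) * sample_cov(v, v))

INCH_TO_CM = 2.54
x_in = [6, 11, 15, 21, 27]   # assumption: treat the example's X as inches
y_in = [6, 9, 6, 17, 12]     # assumption: treat the example's Y as inches
x_cm = [INCH_TO_CM * v for v in x_in]
y_cm = [INCH_TO_CM * v for v in y_in]

cov_in, cov_cm = sample_cov(x_in, y_in), sample_cov(x_cm, y_cm)
corr_in, corr_cm = sample_corr(x_in, y_in), sample_corr(x_cm, y_cm)

print(cov_in, cov_cm)      # covariance changes with the units
print(corr_in, corr_cm)    # correlation does not
```

The covariance grows by a factor of 2.54 squared, while the correlation is unchanged.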

Page 70: Chapter 3 Descriptive Statistics Numerical Methods

Interpreting the correlation

A positive correlation implies a positive association, and conversely, since the correlation has the same sign as the covariance.

A zero correlation implies no linear association.

A correlation near one in absolute value is a very strong relationship; one near zero, weak.

Page 71: Chapter 3 Descriptive Statistics Numerical Methods

For Example . . .

Page 72: Chapter 3 Descriptive Statistics Numerical Methods

Our second example . . .

Page 73: Chapter 3 Descriptive Statistics Numerical Methods

Our third example . . .

Page 74: Chapter 3 Descriptive Statistics Numerical Methods

Remember, correlation measures linear association!

Page 75: Chapter 3 Descriptive Statistics Numerical Methods

Returning to the example . . .

Here is our data.

To compute a correlation coefficient, we need to first compute standard deviations of both X and Y.

X Y

6 6

11 9

15 6

21 17

27 12

Page 76: Chapter 3 Descriptive Statistics Numerical Methods

Standard deviation of X

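$$\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= (6-16)^2 + (11-16)^2 + (15-16)^2 + (21-16)^2 + (27-16)^2 \\
&= 100 + 25 + 1 + 25 + 121 = 272
\end{aligned}$$

$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{272}{4}} = 8.246$$

$$\sigma_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu_x)^2}{N}} = \sqrt{\frac{272}{5}} = 7.376$$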

Page 77: Chapter 3 Descriptive Statistics Numerical Methods

Standard deviation of Y

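$$\begin{aligned}
\sum_{i=1}^{n} (y_i - \bar{y})^2 &= (6-10)^2 + (9-10)^2 + (6-10)^2 + (17-10)^2 + (12-10)^2 \\
&= 16 + 1 + 16 + 49 + 4 = 86
\end{aligned}$$

$$s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{86}{4}} = 4.637$$

$$\sigma_y = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu_y)^2}{N}} = \sqrt{\frac{86}{5}} = 4.147$$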

Page 78: Chapter 3 Descriptive Statistics Numerical Methods

Completing the computation

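$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{21.2}{7.376 \times 4.147} = .693$$

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{26.5}{8.246 \times 4.637} = .693$$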
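Both versions of the correlation can be computed side by side in Python. This sketch builds them from the three sums used in the example; the variable names are illustrative.

```python
import math

x = [6, 11, 15, 21, 27]
y = [6, 9, 6, 17, 12]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))  # cross products
sum_xx = sum((a - x_bar) ** 2 for a in x)                      # squared x deviations
sum_yy = sum((b - y_bar) ** 2 for b in y)                      # squared y deviations

# Sample version: every divisor is n - 1.
r = (sum_xy / (n - 1)) / (math.sqrt(sum_xx / (n - 1)) * math.sqrt(sum_yy / (n - 1)))

# Population version: every divisor is N.
rho = (sum_xy / n) / (math.sqrt(sum_xx / n) * math.sqrt(sum_yy / n))

print(round(r, 3), round(rho, 3))
```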

Page 79: Chapter 3 Descriptive Statistics Numerical Methods

A few comments

It is not an accident that the numbers are the same. The only difference in the population and sample formulas is in divisors: n-1 vs. N.

Those different divisors appear in both numerator and denominator and cancel out.

The conclusion, a correlation of .693, implies a moderately strong positive association.
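To see the cancellation explicitly, write the sample correlation out in full, using the sums 106, 272, and 86 from the example. The n-1 in the numerator is matched by the two square roots of n-1 in the denominator:

$$r_{xy} = \frac{\frac{1}{n-1} \sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2} \, \sqrt{\frac{1}{n-1} \sum (y_i - \bar{y})^2}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \, \sqrt{\sum (y_i - \bar{y})^2}} = \frac{106}{\sqrt{272} \, \sqrt{86}} = .693$$

Replacing n-1 with N throughout gives the same final fraction, which is why the population and sample correlations agree.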

Page 80: Chapter 3 Descriptive Statistics Numerical Methods

Grouped Data

Here is a problem using grouped data.

Original observations lost through grouping.

Treat this as follows: 4 observations of 5, 7 observations of 10, 9 observations of 15, and 5 observations of 20.

Class    Midpoint   Freq
3-7      5          4
8-12     10         7
13-17    15         9
18-22    20         5
Total               25

ASW, #53, p. 119

Page 81: Chapter 3 Descriptive Statistics Numerical Methods

Existing formulas work but are tedious!

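$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\overbrace{5 + \cdots + 5}^{4 \text{ times}} + \overbrace{10 + \cdots + 10}^{7 \text{ times}} + \overbrace{15 + \cdots + 15}^{9 \text{ times}} + \overbrace{20 + \cdots + 20}^{5 \text{ times}}}{25}$$

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{325}{25} = 13$$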

Page 82: Chapter 3 Descriptive Statistics Numerical Methods

This is why one multiplies

The top formula is for sample data.

The bottom formula is for population data.

M sub-i is midpoint of the ith category.

f sub-i is the frequency or count of the ith category.

$$\bar{X} = \frac{\sum f_i M_i}{n}$$

$$\mu = \frac{\sum f_i M_i}{N}$$

Page 83: Chapter 3 Descriptive Statistics Numerical Methods

So the calculation works this way

Just take each midpoint.

Multiply by the corresponding frequency.

Add up the products, and you have the numerator.

Midpoint   Frequency   f_i M_i
5          4           20
10         7           70
15         9           135
20         5           100

$$\sum f_i M_i = 325$$

Page 84: Chapter 3 Descriptive Statistics Numerical Methods

The last step

Just the same for the population mean.

Warning: the sample size is 25 (the sum of the frequencies), not 4 (the number of categories).

$$\bar{X} = \frac{\sum f_i M_i}{n}$$

$$\bar{X} = \frac{325}{25} = 13$$
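The grouped-data mean as a Python sketch. It also expands the table back into 25 individual values to confirm that the shortcut matches the ordinary formula; the variable names are illustrative.

```python
midpoints = [5, 10, 15, 20]
frequencies = [4, 7, 9, 5]

n = sum(frequencies)   # sample size is the sum of the frequencies, not 4

# Shortcut: multiply each midpoint by its frequency, add, divide by n.
numerator = sum(f * m for f, m in zip(frequencies, midpoints))
grouped_mean = numerator / n

# Tedious way: write out every observation and average as usual.
expanded = [m for f, m in zip(frequencies, midpoints) for _ in range(f)]
ordinary_mean = sum(expanded) / len(expanded)

print(n, numerator, grouped_mean, ordinary_mean)
```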

Page 85: Chapter 3 Descriptive Statistics Numerical Methods

Variance formulas for grouped data

Same rationale as formulas for the mean; while previous formulas work, these are easier, because multiplication replaces addition.

Top formula is for sample data; bottom for population data.

$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n-1}$$

$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$

Page 86: Chapter 3 Descriptive Statistics Numerical Methods

Computing the numerator

Midpoint   Frequency   $(M_i - \bar{x})^2$   $f_i (M_i - \bar{x})^2$
5          4           64                    256
10         7           9                     63
15         9           4                     36
20         5           49                    245

$$\sum f_i (M_i - \bar{x})^2 = 600$$

Page 87: Chapter 3 Descriptive Statistics Numerical Methods

Completing the calculation

Calculation of the numerator is identical for both the population and sample variance.

The only difference: for the sample measure, divide by n-1; for the population measure, divide by N.

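$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n-1} = \frac{600}{24} = 25$$

$$s = \sqrt{s^2} = \sqrt{25} = 5$$

$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N} = \frac{600}{25} = 24$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{24} = 4.899$$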
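The grouped-data variance and standard deviation, sketched in Python with the same midpoints and frequencies. The variable names are illustrative.

```python
import math

midpoints = [5, 10, 15, 20]
frequencies = [4, 7, 9, 5]

n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Numerator: frequency-weighted squared deviations of the midpoints from the mean.
numerator = sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints))

sample_variance = numerator / (n - 1)
sample_sd = math.sqrt(sample_variance)

population_variance = numerator / n
population_sd = math.sqrt(population_variance)

print(numerator, sample_variance, sample_sd)
print(population_variance, round(population_sd, 3))
```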

Page 88: Chapter 3 Descriptive Statistics Numerical Methods

That is it for today!