Lecture 3 Preview: Interval Estimates and the Central Limit Theorem Review Populations, Samples, Estimation Procedures, and the Estimate’s Probability

Lecture 3 Preview: Interval Estimates and the Central Limit Theorem

Review

Populations, Samples, Estimation Procedures, and the Estimate’s Probability Distribution

Why Is the Mean of the Estimate’s Probability Distribution Important?

Why Is the Variance of the Estimate’s Probability Distribution Important?

Normal Distribution: A Way to Estimate Probabilities

Relative Frequency Interpretation of Probability

Random Variables

Clint’s Dilemma and His Opinion Poll

Interval Estimates

Central Limit Theorem

Properties of the Normal Distribution

Using the Normal Distribution Table: An Example

Justifying the Use of the Normal Distribution

Normal Distribution’s Rules of Thumb

Mean and Variance of the Estimate’s Probability Distribution for a Sample Size of T

Review

Populations, Samples, and Estimation Procedures

Question: How can we use sample information to draw inferences about a population?

Random Variables: Before the experiment is conducted:

Bad news. What we do not know: We cannot determine the numerical value of the random variable with certainty before the experiment is conducted.

Good news. What we do know: On the other hand, we can often calculate the random variable’s probability distribution, which tells us how likely it is for the random variable to equal each of its possible numerical values.

Relative Frequency Interpretation of Probability: After many, many repetitions of the experiment the distribution of the numerical values from the experiment mirrors the random variable’s probability distribution.

The mean reflects the center of the distribution. The variance reflects the spread of the distribution.

An example, Clint’s poll: 12 of the 16 individuals polled support Clint. EstFrac = 12/16 = .75

Question: Does this poll definitely prove that Clint is ahead?

Answer: No. It is possible for 12 (or more) individuals to support Clint in one poll even when the election is a toss-up.

Question: How do we describe a distribution?

Answer: Center (Mean) and Spread (Variance)

[Figure: distribution of the numerical values after many, many repetitions, shown alongside the probability distribution]

Opinion Poll: Sample Size Equals T

Write the name of each individual in the population on a card.

Perform the following procedure T times, where T = Sample Size:

    Thoroughly shuffle the cards.
    Randomly draw one card.
    Ask that individual if he/she supports Clint; the answer determines the numerical value of vt: vt equals 1 if the tth individual polled supports Clint; 0 otherwise.
    Replace the card.

Calculate the fraction of those polled supporting Clint.

Question: What do we know about the vt’s?

From our last class (sample size of 2): Mean[v1] = Mean[v2] = p
Sample size of T: Mean[vt] = p for each t; that is, Mean[v1] = Mean[v2] = … = Mean[vT] = p

From our last class (sample size of 2): Var[v1] = Var[v2] = p(1 − p)
Sample size of T: Var[vt] = p(1 − p) for each t; that is, Var[v1] = Var[v2] = … = Var[vT] = p(1 − p)

From our last class (sample size of 2): v1 and v2 are independent; their covariance equals 0
Sample size of T: The vt’s are independent; hence, their covariances equal 0.

where p = ActFrac = Actual fraction of the population supporting Clint
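The polling procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's lab software: the function name `run_poll`, the seed, and the choice to model "shuffle, draw, replace" as independent draws that equal 1 with probability ActFrac are all assumptions made for the example.

```python
import random

def run_poll(act_frac, sample_size, rng):
    """Simulate one poll and return EstFrac, the fraction polled who support Clint."""
    # Drawing a card and replacing it each time is modeled as independent
    # draws: v_t = 1 with probability act_frac, 0 otherwise.
    v = [1 if rng.random() < act_frac else 0 for _ in range(sample_size)]
    return sum(v) / sample_size

rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable
est_frac = run_poll(act_frac=0.5, sample_size=16, rng=rng)
print("EstFrac from one poll of 16:", est_frac)
```

Running the function again with a different seed generally gives a different EstFrac, which is the sense in which the estimated fraction is a random variable.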

Distribution Center: Mean of the Estimate’s Probability Distribution

The estimated fraction, EstFrac, is a random variable:

EstFrac = (v1 + v2 + … + vT)/T = (1/T)(v1 + v2 + … + vT)

Mean[EstFrac] = Mean[(1/T)(v1 + v2 + … + vT)]
        Mean[cx] = cMean[x]
    = (1/T) Mean[v1 + v2 + … + vT]
        Mean[x + y] = Mean[x] + Mean[y]
    = (1/T)(Mean[v1] + Mean[v2] + … + Mean[vT])
        Mean[v1] = Mean[v2] = … = Mean[vT] = p
    = (1/T)(p + p + … + p)
        How many p terms are there? T
    = (1/T)(T × p)
    = p


Distribution Spread: Variance of the Estimate’s Probability Distribution

Var[vt] = p(1 − p) for each t; that is, Var[v1] = Var[v2] = … = Var[vT] = p(1 − p)

The vt’s are independent; hence, all their covariances equal 0

where p = ActFrac = Actual fraction of the population supporting Clint

Var[EstFrac] = Var[(1/T)(v1 + v2 + … + vT)]
        Var[cx] = c²Var[x]
    = (1/T²) Var[v1 + v2 + … + vT]
        Var[x + y] = Var[x] + 2Cov[x, y] + Var[y]; the covariances equal 0, so Var[x + y] = Var[x] + Var[y]
    = (1/T²)(Var[v1] + Var[v2] + … + Var[vT])
        Var[v1] = Var[v2] = … = Var[vT] = p(1 − p)
    = (1/T²)(p(1 − p) + p(1 − p) + … + p(1 − p))
        How many p(1 − p) terms are there? T
    = (1/T²)(T × p(1 − p))
    = p(1 − p)/T

Simulations: Confirming the equations.

Mean[EstFrac] = ActFrac = p
Var[EstFrac] = p(1 − p)/T

          Mean of     Variance of                 Mean (Average) of    Variance of
          EstFrac’s   EstFrac’s                   Numerical Values     Numerical Values
Sample    Prob        Prob          Simulation    of EstFrac from      of EstFrac from
Size      Dist        Dist          Repetitions   the Experiments      the Experiments
1         .50         .25           >1,000,000    .50                  .25
2         .50         .125          >1,000,000    .50                  .125
25        .50         .01           >1,000,000    .50                  .01
100       .50         .0025         >1,000,000    .50                  .0025
400       .50         .000625       >1,000,000    .50                  .000625

Two Questions

Why is the distribution center (mean) important?

Why is the distribution spread (variance) important?

Relative Frequency Interpretation of Probability: After many, many repetitions of the experiment, the distribution of the actual numerical values mirrors the probability distribution of the random variable. Both distributions have the same mean and variance.
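A small simulation can illustrate this claim. The sketch below is not the lab's simulator: it assumes each person polled supports Clint independently with probability p = .5, uses far fewer repetitions than the lab, and the names `simulate_est_fracs`, `P`, and `REPETITIONS` are invented for the example. It compares the mean and variance of the simulated EstFrac values with Mean[EstFrac] = p and Var[EstFrac] = p(1 − p)/T.

```python
import random
from statistics import fmean, pvariance

def simulate_est_fracs(act_frac, sample_size, repetitions, rng):
    """Return the numerical value of EstFrac from each repetition of the poll."""
    est_fracs = []
    for _ in range(repetitions):
        # Each person polled supports Clint with probability act_frac (v_t = 1).
        supporters = sum(rng.random() < act_frac for _ in range(sample_size))
        est_fracs.append(supporters / sample_size)
    return est_fracs

P = 0.5                 # ActFrac
REPETITIONS = 10_000    # assumption: far fewer than the lab's >1,000,000
rng = random.Random(1)  # arbitrary fixed seed

results = {}
for T in (1, 2, 25, 100, 400):
    values = simulate_est_fracs(P, T, REPETITIONS, rng)
    results[T] = (fmean(values), pvariance(values))
    print(f"T={T:3d}  equations: mean={P:.2f} var={P * (1 - P) / T:.6f}  "
          f"simulation: mean={results[T][0]:.3f} var={results[T][1]:.6f}")
```

With only 10,000 repetitions the simulated values will wobble around the equation values; increasing `REPETITIONS` should tighten the agreement, at the cost of a longer run.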

More specifically, Mean[EstFrac] = ActFrac. Why is this important?

Lab 3.1: http://www3.amherst.edu/~fwesthoff/MITLinks/MIT-Lab-03-01.html

Unbiased Estimation Procedure

Question: Why is the mean of the estimate’s probability distribution important?

A mean describes the center of its probability distribution.

Mean[EstFrac] = ActFrac

Formally, an estimation procedure is unbiased whenever the mean of the estimated fraction’s probability distribution equals the actual population fraction. So, we have already shown that Clint’s estimation procedure is unbiased.

Conceptually, an estimation procedure is unbiased whenever it does not systematically underestimate or overestimate the actual population fraction.

Relative Frequency Interpretation of Probability:

Average of the estimate’s numerical values after many, many repetitions = Mean[EstFrac] = ActFrac

Now we have some intuition.

If the probability distribution is symmetric, we have even more intuition. In one poll, the chances that the estimated fraction is too low equal the chances that the estimated fraction is too high.

[Figure: probability distribution of EstFrac, centered at Mean[EstFrac] = ActFrac; horizontal axis: EstFrac]

Lab 3.2


Question: Why is the variance of the estimate’s probability distribution important when the estimation procedure is unbiased?

Claim: When the estimation procedure is unbiased, the reliability of the estimated fraction depends on the variance of the estimated fraction’s probability distribution.

Quantifying Reliability

Interval Estimate Question: What is the probability that the estimated fraction from a single poll lies close to the actual value?

    Small probability: Estimate is unreliable
    Large probability: Estimate is reliable

Decide on a “close to” criterion: .05

Interval Estimate Question: What is the probability that the estimated fraction from a single poll lies close to, within .05 of, the actual value?

Strategy: Use a simulation and apply the relative frequency interpretation of probability.

Question: After many, many repetitions, how frequently is the estimated fraction close to, within .05 of, the actual population fraction?

Lab 3.3

Population Fraction = ActFrac = p = .50

          Variance of                     Simulations: Percent of Repetitions
Sample    Random Variable   Simulation    in which the Numerical Value of
Size      EstFrac           Repetitions   EstFrac Lies between .45 and .55
25        .01               >1,000,000    39%
100       .0025             >1,000,000    69%
400       .000625           >1,000,000    95%


Interval Estimate Question: What is the probability that the numerical value of the estimated fraction from one repetition of the experiment lies close to, within .05 of, the actual population fraction?

How can we use the simulation results to answer the interval estimate question?

Relative Frequency Interpretation of Probability: After many, many repetitions of the experiment, the distribution of the numerical values mirrors the probability distribution.

The portion of estimates that lie within .05 of the actual value, between .45 and .55, after many, many repetitions
equals
the probability that the estimate lies within .05 of the actual value, between .45 and .55, in a single poll (one repetition).

ActFrac = .50

                                         Simulations: Percent of       Probability that the
          Variance of                    Repetitions in which the      Numerical Value of EstFrac
          EstFrac’s                      Numerical Value of EstFrac    Lies between .45 and .55
Sample    Probability     Simulation     Lies between .45 and .55      in a Single Poll
Size      Distribution    Repetitions                                  (One Repetition)
25        .01             >1,000,000     39%                           .39
100       .0025           >1,000,000     69%                           .69
400       .000625         >1,000,000     95%                           .95
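The strategy of counting how often EstFrac lands between .45 and .55 can be sketched as follows. This is a minimal illustration under stated assumptions, not the lab's simulator: each person polled is modeled as an independent draw with probability ActFrac = .5, the interval endpoints are treated as inclusive, only 5,000 repetitions are run, and the name `share_close` is invented here. Because EstFrac can take only the values 0, 1/T, 2/T, …, 1, the share falling in a fixed interval is sensitive to how endpoints are handled, so the percentages this sketch prints need not reproduce the lab's table.

```python
import random

def share_close(act_frac, sample_size, repetitions, rng, low=0.45, high=0.55):
    """Fraction of repetitions in which EstFrac lies between low and high (inclusive)."""
    hits = 0
    for _ in range(repetitions):
        supporters = sum(rng.random() < act_frac for _ in range(sample_size))
        if low <= supporters / sample_size <= high:
            hits += 1
    return hits / repetitions

rng = random.Random(2)  # arbitrary fixed seed
shares = {T: share_close(0.5, T, 5_000, rng) for T in (25, 100, 400)}
for T, share in shares.items():
    print(f"T={T:3d}: EstFrac within .05 of ActFrac in {share:.1%} of repetitions")
```

By the relative frequency interpretation, each printed share is an estimate of the probability that a single poll of that size lands within .05 of the actual fraction; the share should rise as the sample size grows and the variance shrinks.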

Reconsider the interval estimate question:

          Variance of EstFrac’s       In a Single Poll (One Repetition):
Sample    Probability Distribution    Prob[.45 ≤ Numerical Value ≤ .55]
Size
25        .01                         .39
100       .0025                       .69
400       .000625                     .95

Generalizing, when an estimation procedure is unbiased:

Variance large: Small probability that the numerical value of the estimated fraction, EstFrac, from one repetition of the experiment will be close to the actual population fraction, ActFrac. Estimate is unreliable.

Variance small: Large probability that the numerical value of the estimated fraction, EstFrac, from one repetition of the experiment will be close to the actual population fraction, ActFrac. Estimate is reliable.

[Figure: two probability distributions of EstFrac, one with large variance and one with small variance, each centered at Mean[EstFrac] = ActFrac; horizontal axis: EstFrac]

Summary: When the estimation procedure is unbiased, the variance tells us how reliable the estimate is.

Strategy for Motivating and Illustrating the Central Limit Theorem: Four Steps

Central Limit Theorem: As the sample size becomes larger and larger, we can use the normal distribution to calculate better and better approximations of interval estimates.

Step 1: Use the equations to calculate the mean, variance, and standard deviation of EstFrac’s probability distribution for three sample sizes, 25, 100, and 400.
Step 2: Use simulations to calculate the percent of repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac], the mean of EstFrac’s probability distribution.
Step 3: Observe an interesting similarity.
Step 4: Introduce the normal distribution and use it to calculate the percent of repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac].

Step 1: Mean, variance, and SD for three sample sizes (p = .50)

Sample Size = T = 25:    Mean[EstFrac] = p = .50    Var[EstFrac] = p(1 − p)/T = .25/25 = .01         SD[EstFrac] = √.01 = .100
Sample Size = T = 100:   Mean[EstFrac] = p = .50    Var[EstFrac] = p(1 − p)/T = .25/100 = .0025      SD[EstFrac] = √.0025 = .050
Sample Size = T = 400:   Mean[EstFrac] = p = .50    Var[EstFrac] = p(1 − p)/T = .25/400 = .000625    SD[EstFrac] = √.000625 = .025

Summary of Mean and SD Calculations

Sample Size       25      100     400
Mean[EstFrac]     .500    .500    .500
SD[EstFrac]       .100    .050    .025
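The Step 1 calculations can be checked with a short script. This is a sketch that applies Mean[EstFrac] = p and Var[EstFrac] = p(1 − p)/T with p = .5; the function name `est_frac_moments` is invented for the example.

```python
import math

def est_frac_moments(p, sample_size):
    """Mean, variance, and standard deviation of EstFrac's probability distribution."""
    mean = p
    variance = p * (1 - p) / sample_size
    return mean, variance, math.sqrt(variance)

for T in (25, 100, 400):
    mean, variance, sd = est_frac_moments(0.5, T)
    print(f"T={T:3d}: Mean={mean:.3f}  Var={variance:.6f}  SD={sd:.3f}")
# T= 25: Mean=0.500  Var=0.010000  SD=0.100
# T=100: Mean=0.500  Var=0.002500  SD=0.050
# T=400: Mean=0.500  Var=0.000625  SD=0.025
```

Quadrupling the sample size cuts the variance to a quarter and the standard deviation in half, which is why the intervals in the next step narrow as T grows.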

Sample Size                    25           100          400

Interval: 1 SD
    From-To Values             .400-.600    .450-.550    .475-.525
    Percent of Repetitions     69.2%        68.5%        68.3%

Interval: 2 SD’s
    From-To Values             .300-.700    .400-.600    .450-.550
    Percent of Repetitions     96.3%        95.6%        95.5%

Interval: 3 SD’s
    From-To Values             .200-.800    .350-.650    .425-.575
    Percent of Repetitions     99.9%        99.8%        99.7%

Central Limit Theorem Motivation: Role of the Standard Deviation

Central Limit Theorem: As the sample size becomes larger and larger, the normal distribution provides better and better approximations of interval estimates.

Step 2: Use simulations to calculate the percent of repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac], the mean of EstFrac’s probability distribution.

Step 3: Observe an interesting similarity.

Question: What do these results suggest?

Answer: The standard deviations, the SD’s, appear to be critical.

Lab 3.4


Normal Distribution: The Famed Bell-Shaped Curve

The variable z: the “normalized” value of the random variable.

z equals the number of standard deviations the value lies from the random variable’s mean:

z = (Value of Random Variable − Distribution Mean) / Distribution Standard Deviation

Normal Distribution: Three Important Properties

The normal distribution is bell shaped.

The normal distribution is symmetric around its mean (center).

The area beneath the normal curve equals 1.

Normal Distribution Table

The row specifies the z value’s whole number and its tenths.

The column specifies the z value’s hundredths.

The number in the body of the table estimates the probability that the random variable lies more than z standard deviations above its mean.

For example, suppose that z = 1.53: What is the probability that the random variable would lie more than 1.53 standard deviations above its mean? .0630

[Figure: normal curve; the area to the right of z SD’s above the distribution mean is the probability of being more than z standard deviations above the distribution mean (.0630 when z = 1.53)]

Normal Distribution

http://www3.amherst.edu/~fwesthoff/MITLinks/MIT-Normal-00-00.html
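For readers without the table at hand, the same lookup can be sketched in Python. This is an illustration rather than the course's tool: it computes the upper-tail area of the standard normal distribution with `math.erfc` from the standard library, and the helper names `z_value` and `upper_tail` are invented for the example.

```python
import math

def z_value(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

def upper_tail(z):
    """Probability that a normal random variable lies more than z SDs above its mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# The table example: z = 1.53
print(f"{upper_tail(1.53):.4f}")  # 0.0630

# A hypothetical value .60 drawn from a distribution with mean .50 and SD .10
print(f"{z_value(0.60, 0.50, 0.10):.2f}")  # 1.00
```

The function plays the role of the body of the table: the row and column together pick out z, and the returned number is the area under the curve to the right of z.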

Normal Distribution Rules of Thumb

Standard Deviations within     Probability of
Random Variable’s Mean         being within
1                              .68
2                              .95
3                              >.99

                                   Simulations: Percent of
Interval:                          Repetitions within Interval      Normal
Standard Deviations within         Sample Size                      Distribution
Random Variable’s Mean             25        100       400          Percentages
1                                  69.2%     68.5%     68.3%        68.26%
2                                  96.3%     95.6%     95.5%        95.44%
3                                  99.9%     99.8%     99.7%        99.74%

Normal distribution table excerpts:

z      0.00      0.01
0.9    0.1841    0.1814
1.0    0.1587    0.1562
1.1    0.1357    0.1335

z      0.00      0.01
1.9    0.0287    0.0281
2.0    0.0228    0.0222
2.1    0.0179    0.0174

z      0.00      0.01
2.9    0.0019    0.0018
3.0    0.0013    0.0013

The area beneath the normal curve equals 1, and the normal distribution is symmetric around its mean (center). Hence:

Within 1 SD:  1 − (.1587 + .1587) = .6826
Within 2 SD’s: 1 − (.0228 + .0228) = .9544
Within 3 SD’s: 1 − (.0013 + .0013) = .9974
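The symmetry argument can be sketched numerically. The snippet below is an illustration only: it computes the tail areas with `math.erfc` instead of reading them from the table, and the names `upper_tail` and `prob_within` are invented here. Because the table entries are rounded to four places, the last digit can differ slightly from the table-based arithmetic.

```python
import math

def upper_tail(z):
    """Area under the normal curve more than z SDs above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def prob_within(k):
    """Probability of lying within k SDs of the mean.

    The total area is 1 and the curve is symmetric, so the two tails
    have equal area and are subtracted from 1.
    """
    return 1 - 2 * upper_tail(k)

for k in (1, 2, 3):
    print(k, f"{prob_within(k):.4f}")
# 1 0.6827
# 2 0.9545
# 3 0.9973
```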

Summary

Central Limit Theorem: As the sample size becomes larger and larger, we can use the normal distribution to calculate better and better approximations of interval estimates.
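As a rough illustration of how the normal distribution can approximate an interval estimate, the sketch below computes the probability that EstFrac lies between .45 and .55 when p = .5, using SD[EstFrac] = √(p(1 − p)/T). The function name `normal_interval_prob` is invented for the example, and the results are normal approximations, so they are expected to be near, not identical to, simulated percentages.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def normal_interval_prob(p, sample_size, low, high):
    """Normal approximation of Prob[low <= EstFrac <= high]."""
    sd = math.sqrt(p * (1 - p) / sample_size)
    z_low = (low - p) / sd    # SDs below the mean (negative)
    z_high = (high - p) / sd  # SDs above the mean
    return normal_cdf(z_high) - normal_cdf(z_low)

for T in (25, 100, 400):
    print(f"T={T:3d}: {normal_interval_prob(0.5, T, 0.45, 0.55):.4f}")
```

For these three sample sizes the interval .45 to .55 spans 0.5, 1, and 2 standard deviations on each side of the mean, so the last two results coincide with the 1 SD and 2 SD rules of thumb.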

Revisiting Clint’s Dilemma

On the eve of the election, Clint must decide whether or not to hold a pre-election party:

If he is comfortably ahead, he will not hold the party; he will save his campaign funds for a future political endeavor (or a trip to Cancun).

If he is not comfortably ahead, he will hold the party hoping to capture more votes.

There is not enough time to canvass everyone, however. What should he do?

Econometrician’s Philosophy: If you lack the information to determine the value directly, estimate the value to the best of your ability using the information you do have.

Clint’s Estimation Procedure

Questionnaire: Are you voting for Clint?

Procedure: Clint selects 16 students at random.

Results: 12 students report that they will vote for Clint and 4 against Clint.

Estimated fraction of population supporting Clint = 12/16 = .75

Clint uses the information collected from the sample to draw inferences about the entire population. Seventy-five percent, .75, of the sample support Clint.

This poll suggests that Clint leads.

Question: Should Clint be confident that he has the election in hand or should he fund the party?
