
Stat 204, Part 4 Inference

Chapter 8: Sampling Variability and Sampling Distributions

These notes reflect material from our text, Statistics, Learning from Data, First Edition, by Roxy Peck, published by CENGAGE Learning, 2015.

Sampling distributions

Three distributions : population, data, sampling

Sampling distribution of the sample proportion

Sampling distribution of the sample mean

[Figure: Population distribution vs. sampling distribution of sample mean. Overlaid histograms; vertical axis: Frequency; legend: population, sample means.]

LLN and CLT

LLN : X̄n → µ as n gets larger.

CLT : the distribution of X̄n → Normal Distribution as n gets larger.

Both theorems require certain conditions to be satisfied for the theorem to be applicable. The CLT requires

• independent sample elements (from less than 10% of the population)

• relatively large sample size (for instance, at least 30 elements)

• data which are not highly skewed and without extreme outliers

in order to conclude that the sample mean is approximately normally distributed with standard deviation approximately SE (OIS, p.168).
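The LLN statement above can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes a hypothetical exponential population with rate 1 (so µ = 1) and a fixed seed, and the helper name running_means is introduced here only for illustration.

```python
import random

def running_means(draws):
    """Return the sample mean after each successive draw."""
    total = 0.0
    means = []
    for i, x in enumerate(draws, start=1):
        total += x
        means.append(total / i)
    return means

rng = random.Random(204)  # fixed seed so the run is repeatable

# ASSUMPTION: hypothetical population, exponential with rate 1, so mu = 1.
draws = [rng.expovariate(1.0) for _ in range(100_000)]
means = running_means(draws)

# The sample mean should settle toward mu = 1 as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: sample mean = {means[n - 1]:.4f}")
```

The printed means wander for small n and should settle close to 1 for large n, which is what the LLN describes.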

Spring 2016 Page 1 of 6


Sampling distributions of the sample mean

Uniform distribution.

[Figure: Uniform Distribution, U[0,1] (Probability vs. xs), followed by histograms of the sample mean (Frequency vs. Sample mean) for n=2, n=5, and n=30, each with n.samples=10000.]

Exponential distribution.

[Figure: Exponential Distribution (Probability vs. xs), followed by histograms of the sample mean (Frequency vs. Sample mean) for n=2, n=5, and n=30, each with n.samples=10000.]
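The simulations behind histograms like these can be sketched as follows. This is an illustrative reconstruction, not the code used to produce the original figures. It assumes a U[0,1] population and an exponential population with rate 1, uses the same sample sizes (n = 2, 5, 30) and 10,000 samples per setting, and summarizes each set of sample means by its standard deviation and skewness instead of drawing a histogram. The helper names sample_means and skewness are introduced here for illustration.

```python
import math
import random
import statistics

def sample_means(draw, n, n_samples, rng):
    """Draw n_samples samples of size n and return the list of sample means."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(n_samples)]

def skewness(values):
    """Moment-based skewness; it is 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

# ASSUMPTION: populations chosen for illustration, name -> (draw function, population sigma).
populations = {
    "uniform U[0,1]": (lambda rng: rng.random(), math.sqrt(1 / 12)),
    "exponential, rate 1": (lambda rng: rng.expovariate(1.0), 1.0),
}

rng = random.Random(2016)  # fixed seed so the run is repeatable
results = {}
for name, (draw, sigma) in populations.items():
    for n in (2, 5, 30):
        means = sample_means(draw, n, 10_000, rng)
        results[name, n] = (statistics.stdev(means), sigma / math.sqrt(n), skewness(means))
        sd, se, skew = results[name, n]
        print(f"{name:20s} n={n:2d}  SD of means={sd:.4f}  "
              f"sigma/sqrt(n)={se:.4f}  skewness={skew:+.3f}")
```

In both cases the standard deviation of the sample means should track σ/√n. For the exponential population the skewness should shrink toward 0 as n increases, which matches the histograms becoming more symmetric.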


Normal distribution. When the population is normally distributed, the sampling distribution of the sample mean is also normal, even for small sample sizes.

[Figure: Normal Distribution (Probability vs. xs), followed by histograms of the sample mean (Frequency vs. Sample mean) for n=2, n=5, and n=30, each with n.samples=10000.]

Sampling distributions of counts and proportions

Imagine taking a sample of size 100 from a Bernoulli population with p = 0.6. Do this 1000 times and make a histogram of the counts of successes. Take another 1000 samples of size 100 and make a histogram of the proportions of successes in the samples. What do you observe?

[Figure: three panels. Population Distribution (p=0.6): Probability vs. Outcome (0, 1). Sampling Distribution, Counts: Frequency vs. results. Sampling Distribution, Proportions: Frequency vs. results.]
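The experiment described above can be sketched in a few lines. This is a minimal illustration under the stated settings (p = 0.6, samples of size 100, 1000 samples for each histogram); the seed and the helper name success_counts are assumptions made here, and numerical summaries stand in for the histograms.

```python
import math
import random
import statistics

def success_counts(p, n, n_samples, rng):
    """Count successes in each of n_samples Bernoulli(p) samples of size n."""
    return [sum(rng.random() < p for _ in range(n)) for _ in range(n_samples)]

rng = random.Random(8)  # fixed seed so the run is repeatable
p, n = 0.6, 100

# First 1000 samples: record counts. Second 1000 samples: record proportions.
counts = success_counts(p, n, 1000, rng)
proportions = [c / n for c in success_counts(p, n, 1000, rng)]

print(f"counts:      mean={statistics.fmean(counts):.2f}  "
      f"SD={statistics.stdev(counts):.3f}  "
      f"theory: {n * p:.0f}, {math.sqrt(n * p * (1 - p)):.3f}")
print(f"proportions: mean={statistics.fmean(proportions):.4f}  "
      f"SD={statistics.stdev(proportions):.4f}  "
      f"theory: {p}, {math.sqrt(p * (1 - p) / n):.4f}")
```

Because a proportion is a count divided by n, the two sampling distributions should have essentially the same shape. The counts are centered near np and the proportions near p, with standard deviations that differ by the factor n.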

Distribution of a statistic (mean or proportion)

Two scenarios : (1) If we are studying a quantitative variable and we know the population parameters µ and σ, what can be said of the sample statistics? (2) If we know the sample statistics x̄ and s, what can be said of the population parameters? The first question is the easier to answer, but we rarely know the population parameters in practice. The second question leads to much of contemporary statistics.


We might be studying a quantitative variable in a population (normal or not) with population parameters µ and σ, and we wish to know the sampling distribution of the sample mean x̄. Or we might be studying a categorical variable in a population with proportion p, and we wish to know the sampling distribution of the sample proportion p̂. The following table summarizes the sampling distributions in both cases (Probability and Statistics, Open Learning Initiative, CMU).

Variable                   Statistic   Shape    Center   Standard Error   Conditions
quantitative (σ known)     x̄           Normal   µ        σ/√n             n ≥ 30 or approx. normal
quantitative (σ unknown)   x̄           t        µ        s/√n             n ≥ 30 or approx. normal
categorical                p̂           Normal   p        √(p(1−p)/n)      min(np, n(1−p)) ≥ 15
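The standard-error column and the condition for proportions translate directly into code. The sketch below is illustrative only; the function names se_mean, se_proportion, and proportion_condition_met, and the numbers in the usage lines, are hypothetical and not taken from the notes.

```python
import math

def se_mean(sd, n):
    """Standard error of the sample mean: sigma/sqrt(n), or s/sqrt(n) if sigma is unknown."""
    return sd / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def proportion_condition_met(p, n, threshold=15):
    """Check the table's condition min(np, n(1-p)) >= 15."""
    return min(n * p, n * (1 - p)) >= threshold

# Hypothetical inputs, chosen only to exercise the formulas.
print(se_mean(10, 100))                     # sigma = 10, n = 100
print(se_proportion(0.6, 100))              # p = 0.6, n = 100
print(proportion_condition_met(0.6, 100))   # np = 60, n(1-p) = 40
print(proportion_condition_met(0.05, 100))  # np = 5, too few expected successes
```

The same se_mean function serves both quantitative rows; what changes between them is whether the standard deviation passed in is σ or s, and therefore whether the reference distribution is Normal or t.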

Barrels of marbles and beans

Three distributions : population, data, sampling

Here is a more leisurely view of sampling distributions. Suppose that in the middle of the classroom we have a big barrel of red and white marbles. A proportion p of this large population of marbles is red. If we actually knew exactly how many red and how many white marbles were in the barrel, we could make a small table displaying the number of red and white marbles in the barrel. Agresti and Franklin would call this table the population distribution. Beside that barrel is another large barrel full of bright green lima beans. The beans vary in size from large to small, but most are medium-sized. We are actually interested in the weights of the beans, so it is interesting to know that the average weight of the beans in this barrel is µ and the standard deviation of their weights is σ. A graph of the actual distribution of weights of the lima beans in the barrel would be an illustration of a population distribution.

Sampling distribution of the sample proportion

Next to the barrels is a small scoop. We begin by sampling from the barrel with the marbles. Scoop out a small but fixed number of marbles, n, and calculate the proportion of red marbles in your scoop, p̂. Make a small table displaying the number of red and the number of white marbles in your scoop. Agresti and Franklin would call this table the data distribution of the marbles in your sample. Another student takes a scoop of the same number n of marbles, and calculates the statistic p̂ for the marbles in her scoop. We might expect these two statistics to be similar, but probably not exactly equal. The question now is, if a large number of students took similar samples of size n from the barrel of marbles, and calculated the proportion p̂ of red marbles in each scoop, what would the collection of statistics p̂ look like? The answer to that question is called the sampling distribution of p̂. Since our samples are independent, the CLT says that, for sufficiently large n, the sampling distribution is approximately normal. It has mean p and standard deviation √(p(1 − p)/n).
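The marble experiment can be imitated with a simulated barrel. The sketch below is a hedged illustration: the barrel size (10,000 marbles), the proportion of red marbles (35%), the scoop size (n = 50), and the number of students (5,000) are all hypothetical values chosen here, and scoop_proportion is a helper name introduced for illustration. Each scoop is drawn without replacement, as a real scoop would be, and takes far less than 10% of the barrel.

```python
import math
import random
import statistics

rng = random.Random(2016)  # fixed seed so the run is repeatable

# ASSUMPTION: hypothetical barrel of 10,000 marbles, 35% of them red.
barrel = ["red"] * 3_500 + ["white"] * 6_500
p = barrel.count("red") / len(barrel)
n = 50  # scoop size; np = 17.5 and n(1-p) = 32.5 are both at least 15

def scoop_proportion(barrel, n, rng):
    """Take one scoop of n marbles without replacement and return p-hat."""
    scoop = rng.sample(barrel, n)
    return scoop.count("red") / n

# Each of 5,000 hypothetical students takes one scoop.
p_hats = [scoop_proportion(barrel, n, rng) for _ in range(5_000)]

print(f"mean of p-hat = {statistics.fmean(p_hats):.4f}   (p = {p})")
print(f"SD of p-hat   = {statistics.stdev(p_hats):.4f}   "
      f"(sqrt(p(1-p)/n) = {math.sqrt(p * (1 - p) / n):.4f})")
```

The collection p_hats is an empirical version of the sampling distribution of p̂: its mean should sit near p and its standard deviation near √(p(1 − p)/n).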

Sampling distribution of the sample mean

Now turn to the barrel of lima beans. We want to study the weights of the beans in our samples, so we borrow a precision balance from the chemistry department. Scoop out a small but fixed number of lima beans, n, weigh each bean in your scoop, and calculate the average weight of the beans in your scoop, x̄. Make a stripchart (dot plot) of the weights of the beans in your sample. This strip chart is an illustration of the data distribution of the beans in your sample. Now the next student repeats the process, and calculates the average weight of the beans in the second scoop, x̄. We might expect these two statistics to be similar, but probably not exactly equal. The question now is, if a large number of students took similar samples of size n from the barrel of lima beans, and calculated the average weight of the beans x̄ in each scoop, what would the collection of statistics x̄ look like? The answer to that question is called the sampling distribution of x̄. Since our samples are independent, the CLT says that the sampling distribution is approximately normal, and in fact if the distribution of the population weights is actually normal, then the sampling distribution is also normal (not just approximately normal). In either case, the sampling distribution has mean µ and standard deviation σ/√n.
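The claim that a normal population gives a normal sampling distribution even for small n can be probed with a quick simulation. The sketch below rests on assumptions made here: bean weights are taken to be normal with hypothetical values µ = 1.2 g and σ = 0.3 g, the scoop holds only n = 4 beans, and 10,000 scoops are simulated. If the sampling distribution is normal with standard deviation σ/√n, roughly 68% of the sample means should fall within one standard error of µ and roughly 95% within two.

```python
import math
import random
import statistics

rng = random.Random(8)  # fixed seed so the run is repeatable

# ASSUMPTION: hypothetical normal population of bean weights (grams).
mu, sigma = 1.2, 0.3
n = 4                        # deliberately small scoop
se = sigma / math.sqrt(n)    # standard deviation of x-bar, sigma/sqrt(n)

# 10,000 scoops, each summarized by its mean weight.
x_bars = [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
          for _ in range(10_000)]

within_1 = sum(abs(x - mu) <= se for x in x_bars) / len(x_bars)
within_2 = sum(abs(x - mu) <= 2 * se for x in x_bars) / len(x_bars)

print(f"SD of x-bar = {statistics.stdev(x_bars):.4f}   (sigma/sqrt(n) = {se:.4f})")
print(f"fraction within 1 SE = {within_1:.3f}, within 2 SE = {within_2:.3f}")
```

Replacing rng.gauss with a skewed population and keeping n = 4 is a simple way to see the contrast: the within-one-SE and within-two-SE fractions are then no longer expected to match the normal values closely.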

Standard Error

The standard deviation of a statistic is called a standard error, so in the sequel we will write

SEx̄ = σ/√n

or

SEp̂ = √(p(1 − p)/n).

Cherry Blossom Run, 2012

16,924 runners participated in the Cherry Blossom 10 Mile Run in 2012. Take the population to be all of the runners, and estimate their average time to complete the race and the average age of the participants by taking 1000 samples of 100 runners each and calculating the corresponding sample statistics. The results are illustrated in the population and sampling distributions shown below. (OpenIntro Statistics, 2nd ed., pp.159-164).

[Figure: four histograms. Population Distribution (time): Frequency vs. Time (Min). Sampling Distribution (time): Frequency vs. Sample mean. Population Distribution (age): Frequency vs. Age. Sampling Distribution (age): Frequency vs. Sample mean.]
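The resampling procedure used for the race data (1000 samples of 100 runners each) can be sketched as below. The actual race results are not reproduced here, so the population is a SYNTHETIC stand-in: 16,924 hypothetical right-skewed ages generated from a shifted gamma distribution. Its numbers should not be read as the real data. The function name sampling_distribution is introduced here for illustration and works on any list of values.

```python
import math
import random
import statistics

rng = random.Random(2012)  # fixed seed so the run is repeatable

# SYNTHETIC population (ASSUMPTION): 16,924 hypothetical, right-skewed ages.
# These are NOT the real Cherry Blossom Run data.
population = [18 + rng.gammavariate(3.0, 6.0) for _ in range(16_924)]

def sampling_distribution(population, n, n_samples, rng):
    """Means of n_samples simple random samples (without replacement) of size n."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]

sample_means = sampling_distribution(population, 100, 1_000, rng)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
print(f"population mean = {pop_mean:.2f}, "
      f"mean of sample means = {statistics.fmean(sample_means):.2f}")
print(f"sigma/sqrt(n) = {pop_sd / math.sqrt(100):.3f}, "
      f"SD of sample means = {statistics.stdev(sample_means):.3f}")
```

Even though the synthetic population is skewed, the sample means should cluster tightly and roughly symmetrically around the population mean, with spread close to σ/√n. That is the pattern the population and sampling histograms are meant to show.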


Inference

Outline for one-variable inference for sample data (Peck, chapters 7–13):

The goal is to generalize from a sample to learn about a population.

• categorical variable
  - one proportion
    - confidence interval - one-sample z CI for a proportion
    - hypothesis testing - one-sample z HT for a proportion
  - difference of two proportions
    - confidence interval - two-sample z CI for a difference in proportions
    - hypothesis testing - two-sample z HT for a difference in proportions

• continuous variable
  - one mean
    - confidence interval - one-sample t CI for a mean
    - hypothesis testing - one-sample t HT for a mean
  - difference of paired means
    - confidence interval - paired t CI for a difference in means
    - hypothesis testing - paired t HT for a difference in means
  - difference of independent means
    - confidence interval - two-sample t CI for a difference in means
    - hypothesis testing - two-sample t HT for a difference in means

Exercises

We will attempt to solve some of the following exercises as a community project in class today. Finish these solutions as homework exercises, write them up carefully and clearly, and hand them in at the beginning of class next Friday.

Homework 8a – sampling distributions

Exercises from Chapter 8: 8.2 (histograms), 8.3 (imports), 8.15 (sampling distribution), 8.16 (normal), 8.26 (sampling distribution)

Homework 8b – sampling distributions

Exercises from Chapter 8: 8.27 (sampling distribution), 8.29 (polling), 8.30 (credit card), 8.50 (sample size), 8.52 (hurricane)
