BU255: Final Exam-AID (updated) Taught by Greg Overholt

Chapter 9: Sampling Distribution

• Populations are usually TOO large to calculate accurate parameters.
• SO, take samples, calculate statistics related to the parameter, and make inferences based on that!
– This is where the sampling distributions come in!
• To get more accurate results, increase the sample size!

Sampling Distribution

• This is reflected in the formula: the standard deviation of the sampling distribution shrinks as the sample size grows.

The standard deviation of the sampling distribution is called the standard error: $\sigma_{\bar{x}} = \sigma / \sqrt{n}$

($\mu_{\bar{x}} = \mu$: the mean of the sampling distribution is the same as the pop mean)

Central Limit Theorem

• If pop = normal, then the sampling distribution of the sample mean = normal for all values of n.
• If pop = non-normal, then the sampling distribution of the sample mean = approximately normal only for larger values of n.
• In most practical situations, a sample size of 30 may be sufficiently large to allow us to use the normal distribution as an approximation for the sampling distribution of the sample mean.

Central Limit Theorem: states that the sampling distribution of the sample mean will be approximately normal for sufficiently large sample sizes.

Sampling Distribution of the Mean

Assuming the population is infinitely large (so no finite population correction is needed):

$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Any pop that is at least 20 times the sample size is to be considered LARGE!

Questions!

• If a customer buys one bottle, what is the probability that the bottle will contain more than 32 ounces?
– Want P(X > 32) with u = 32.2 and σ = .3
Z > (32 – 32.2) / .3
Z > -.67, so P = .7486 (TABLE!)

• If a customer buys four bottles, what is the probability that the mean of the 4 bottles will be more than 32 ounces?
– Want P(X̄ > 32) with u = 32.2 and standard error = .3 / √4
Z > (32 – 32.2) / (.3 / √4)
Z > -.2 / .15
Z > -1.33, so P = .9082 (TABLE!)
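
The bottle questions can be checked without the table. This is a minimal sketch using Python's standard library; the helper name `prob_mean_above` is made up for illustration, and it assumes the bottle fills are normally distributed (which the Z-table approach already assumes). Tiny differences from the table answers come from rounding z to two decimals.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 32.2, 0.3        # population mean and st dev from the question
std_normal = NormalDist()    # standard normal: mean 0, st dev 1

def prob_mean_above(limit, n):
    """P(mean of n bottles > limit), using the standard error sigma / sqrt(n)."""
    z = (limit - mu) / (sigma / sqrt(n))
    return 1 - std_normal.cdf(z)

p_one = prob_mean_above(32, 1)    # a single bottle (n = 1)
p_four = prob_mean_above(32, 4)   # mean of four bottles
print(round(p_one, 4), round(p_four, 4))
```

With n = 4 the standard error is half as large, which is why the same 0.2 ounce gap gives a much higher probability.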

Sample Dist for Inference

• You can rearrange the formula to find the interval the sample mean should land in:

$P(\mu - z_{\alpha/2}\,\sigma/\sqrt{n} < \bar{X} < \mu + z_{\alpha/2}\,\sigma/\sqrt{n}) = 1 - \alpha$

Question: A census indicated that people under 24 use Facebook 7 hours a week. You polled 100 university students and found that on average they used Facebook 8 hours a week, with a standard deviation of 3 hours. Does the census average seem right, using 95% confidence?

P(7 – 1.96(3/10) < X̄ < 7 + 1.96(3/10)) = .95

P(6.412 < X̄ < 7.588) = .95

If the census data were correct, the sample mean should land between 6.41 and 7.59, 95% of the time. Yours is 8, which indicates that either the pop mean is wrong or there were errors in your data/calculations/polling.

Normal Approximation to Binomial

• A binomial distribution (especially with ‘p’ close to .5) can be approximated by a normal distribution.
• If you have n = 20 and p = .5 (heads/tails) you can determine u and σ:
– u = np = 20(.5) = 10
– σ² = np(1-p) = 20(.5)(.5) = 5
– σ = √5 = 2.24
• For the approximation to provide good results two conditions should be met:
1) np ≥ 5
2) n(1–p) ≥ 5
• You can use the normal curve to estimate P(X = 10) by taking the area between 9.5 and 10.5 under the normal curve.
– In fact: P(X = 10) = .176
– While P(9.5 < Y < 10.5) = .1742
– The approximation is quite good.
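
To see how close the approximation is, here is a small sketch (standard library only; variable names are illustrative) that computes the exact binomial P(X = 10) and the normal area from 9.5 to 10.5. The normal area comes out slightly different from the .1742 table answer because the table rounds z to two decimals.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 20, 0.5

# Exact binomial probability of exactly 10 heads in 20 flips
exact = comb(n, 10) * p**10 * (1 - p)**(n - 10)

# Normal curve with the same mean (np) and st dev (sqrt(np(1-p)))
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
approx = approx_dist.cdf(10.5) - approx_dist.cdf(9.5)   # area between 9.5 and 10.5

print(round(exact, 4), round(approx, 4))
```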

Sampling Distribution of a Sample Prop

• When you have a sample proportion (so votes, heads/tails, ...) you can determine the same set of statistics:
– E(P̂) = p
– V(P̂) = σ² = p(1-p)/n
– σ = √(p(1-p)/n)
• These can also be standardized to the standard normal dist:

$Z = \frac{\hat{P} - p}{\sqrt{p(1-p)/n}}$

Example: Suppose 60% of students use latex vs saran wrap for their condom choice. What is the probability of taking a random sample of 120 students and finding that 50% or less use latex?

p = .60   p(hat) = .50   n = 120

Z = (.50 - .60) / √(.60(.40)/120)
= -.1 / .0447 = -2.24

We aren’t done: we want the probability of less than -2.24, so a left tail. The table says 2.24 leaves .0125 in the tail, so the answer is .0125 or 1.25%.

Sampling Dist: Difference of 2 means

• When you have independent samples from 2 populations, you can compare them!
• The difference between two sample means will be (approximately) normally distributed if:
– The two populations are both normally distributed, or
– The two populations are not both normally distributed BUT the sample sizes are “large” (>30)
• You can find the distribution of the difference of their means!
– Need independent samples
– If so, the difference X̄1 - X̄2 is (approximately) normal, with:

$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$

$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$

EXAMPLE

• There are two species of green beings on Mars. The mean height of Species 1 is 32 while the mean height of Species 2 is 22. The standard deviations of the two species are 60 and 70 respectively, and the heights of both species are normally distributed.

• You randomly sample 10 members of Species 1 and 14 members of Species 2. What is the probability that the mean of the 10 members of Species 1 will exceed the mean of the 14 members of Species 2 by 5 or more?

$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$ = sqrt(3600/10 + 4900/14) = sqrt(710) = 26.65 (the variances ADD, even though we are looking at a difference)

Z > (5 – (32 – 22)) / 26.65

Z > -5 / 26.65

Z > -.19

Look up on normal dist table!!

= .5753, or about a 57.5% chance that the mean of species 1 will be greater than the mean of species 2 by 5 or more.

Chapter 10: Estimation

• Binomial, Poisson, normal, and exponential distributions allow us to make probability statements about X (an individual member of the population).
• To do so we need the population parameters.
– Binomial: p
– Poisson: μ
– Normal: μ and σ
– Exponential: λ

Chapter 10

• Introduction to Statistical Inference
• Estimation
– Point and Interval Estimators
– Properties of Estimators
• Interval Estimation [confidence intervals]
• Determining Sample Size

Estimation

• Estimation: determining the approximate value of a pop parameter based on a sample statistic.
• 2 types:
– Point Estimator
• A single value. No good on its own: the chance that it exactly equals the parameter is tiny, and it says nothing about how close it is.
– Interval Estimator
• Used almost all the time.
• Uses an interval to estimate the population parameter.
• Provides % confidence that the parameter is between a lower and upper bound

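Estimating u when σ known

In Chapter 9 we saw this, providing confidence in the sample mean:

$P(\mu - z_{\alpha/2}\,\sigma/\sqrt{n} < \bar{X} < \mu + z_{\alpha/2}\,\sigma/\sqrt{n}) = 1 - \alpha$

You can rearrange it, to be a confidence interval for the population mean!

$\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$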

Desirable Qualities of Estimators…Great MC q!

1. Unbiased: the expected value of the estimator is equal to the parameter it estimates.

2. Consistent: the difference between the estimator and the parameter grows smaller as the sample size grows larger.

3. Relative Efficiency: If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to be relatively efficient.

EXAMPLE

• Diageo sampled 85 Laurier students and determined that the sample mean of alcohol consumption was 510 drinks a term. They previously calculated that the population standard deviation was 46. Create an interval estimate of the population mean with 95% confidence.

X(bar) = 510   n = 85   σ = 46   Za/2: not given, but we want 95% confidence.

95% in the middle, so that’s 2.5% in each tail, so we want the Z value for an area of .475 (between 0 and Z) = 1.96

510 – 1.96(46/√85) < u < 510 + 1.96(46/√85)

510 – 9.78 < u < 510 + 9.78

500.22 < u < 519.78

95% confident that the average number of drinks for the population is between 500.22 and 519.78
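
The same interval can be reproduced in a few lines. A minimal sketch, assuming σ is known; the function name `z_interval` is hypothetical, and `NormalDist().inv_cdf` from the standard library stands in for the Z table lookup.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for the population mean when sigma is known."""
    # z value that leaves (1 - confidence) / 2 in each tail, e.g. 1.96 for 95%
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

lower, upper = z_interval(xbar=510, sigma=46, n=85, confidence=0.95)
print(round(lower, 2), round(upper, 2))
```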

Selecting the sample size

• The difference between the sample mean and the population mean is called the error of estimation.
• You can make sure you stay within a chosen bound, by another freaking formula:

$n = \left( z_{\alpha/2}\,\sigma / B \right)^2$

B = Bound on the error (given in q; written W in the example)

EXAMPLE

• I want to know how many students I need to interview to find out how many times a Laurier student Facebook-stalks in 1 day. I want to be 95% confident that the error is within 2. It turns out the standard deviation of this stat is 5. GO:

n = ? (what we want to find out)

σ = 5

W = 2

Z a/2: 95% confidence, which is 2.5% in each tail, which is a z value of 1.96

n = (1.96 * 5 / 2)²

n = 24.01 (always round up, so need 25 people)
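
A quick sketch of the sample size calculation, including the round-up step. The function name `sample_size_for_mean` is illustrative only, and the z value is taken from the standard library instead of the table.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, bound, confidence):
    """Smallest n that keeps the error of estimation within the bound."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Always round UP: rounding down would leave the error slightly too large
    return ceil((z * sigma / bound) ** 2)

n_needed = sample_size_for_mean(sigma=5, bound=2, confidence=0.95)
print(n_needed)
```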

Chapter 11: Intro Hypothesis Testing

Hypothesis Testing

• There are two procedures for making inferences:
– Estimation
– Hypothesis testing
• The purpose of hypothesis testing is to determine whether there is enough statistical evidence in favor of a certain belief about a parameter.

Hypothesis Testing

• There are two hypotheses:
– Null Hypothesis (H0)
• Assumed to be true
• Ex. The defendant is innocent
– Alternative (or research) Hypothesis (H1)
• Opposite of H0
• Ex. The defendant is guilty
• NOTE: The null always states that the parameter equals the value specified in the alternative.

Hypothesis Testing Process

• Step 1: State the Null and Alternative
– Eg: You want to see if the exam average will be greater than 75%.
• H0: u = 75
• H1: u > 75
• Step 2: Randomly sample the pop and calculate a test statistic (in this case based on the sample mean)
– The procedure begins with the NULL BEING TRUE (and the goal is to see if there is enough evidence to say that the alternative is true).
• Step 3: Make a statement about the hypotheses
– If the test statistic’s value is inconsistent with the null hypo, we reject the null and conclude that the alternative is true.

Hypo Testing Decisions

1. Reject the null in favour of the alternative
– Sufficient evidence to support the alternative
2. Do not reject the null in favour of the alternative
– Does not mean ‘accepting the null’ (just not enough evidence)
– Ex. Failing to prove that the defendant is guilty does not mean that he is innocent

Hypo Testing Errors

• Two types of errors are possible when making the decision whether to reject H0 (the null hypothesis)
• Type 1 error (probability = alpha): reject a true null hypothesis. Send an innocent man to jail. MOST SERIOUS OF THE TWO!
• Type 2 error (probability = beta): don’t reject a false null hypothesis (going with the safe null assumption instead of rejecting it). Guilty man goes free. It can be calculated (later).

[Figure: two sampling distributions, one under our original hypothesis and one under our new assumption]

THIS EXAMPLE IS TESTING IF HYDRO BILLS ARE > estimated mean of 170.

Sample bills were taken to get x bar, etc.

2 ways to Test: Rejection Region

• Depending on whether you are looking for <, >, or not equal to, you define the rejection region
• Level of significance = α

Test It: P-value

• The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed given that the null hypothesis is true.

• It is also the smallest value of α for which H0 can be rejected.

[Figure: p-value as the tail area beyond the test statistic] For Z = 2.46, p-value = P(Z > 2.46) = .0069
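
The p-value on this slide can be reproduced directly from the Z value. A minimal sketch for a right-tail test; the 5% significance level is an assumption for illustration, not something stated on the slide.

```python
from statistics import NormalDist

z = 2.46                              # test statistic from the slide
p_value = 1 - NormalDist().cdf(z)     # area in the right tail beyond z

alpha = 0.05                          # assumed significance level
reject_null = p_value < alpha         # reject H0 when p-value < alpha

print(round(p_value, 4), reject_null)
```

Rejecting whenever the p-value is below α gives the same decision as checking whether Z falls in the rejection region.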

Type II Error Example

Example:
• H0: µ = 170
• H1: µ > 170
• At a significance level of 5% we rejected H0 in favor of H1 since our sample mean (178) was greater than the critical value (175.34).

In the question they will have to give you the new mean to test ($180 mean).

• β = P(x̄ < 175.34, given that µ = 180), thus…

[Figure: the distribution under our original hypothesis (µ = 170) next to the one under our new assumption (µ = 180); the area of the new curve below 175.34 is β, the chance we set a guilty man free]

Changing your confidence requirement (α) moves the critical value, which changes β. To reduce β without raising α: INCREASE THE SAMPLE SIZE!

Chapter 12: Inferences about a Population

What will we ‘infer’?

• Inference About:
– Population Mean
– Population Variance
– Population Proportion
• Inference About:
– 1 population
– 2 or more pop’s

What’s different?

• In the past, we have known the standard deviation of the population (which is unrealistic)
– With it, we can use the Z stat to make inferences
• NOW, we don’t know the st dev. So, we have to use the ‘sample st dev’, which is why we use the T-stat
– GOT to have a normal (or approx normal) population dist!

t-Distribution

• Created by: William Sealy Gosset (MC?)
• It has one parameter: degrees of freedom (df), written ν
• Degrees of freedom: number of observations that are free to vary after the sample mean has been found
– How many degrees??? n-1!!
• (if 5 items, 4 degrees)

EXCEL (good MC)

• T-dist calculations can be done using Excel:
– TDIST(x,degrees_freedom,tails)
• This is when you want the % in the tail(s).
• TDIST(1.3,60,1)
– 1.3 is your t-value (like your z-value), the curve is drawn with 60 degrees of freedom, and you want the 1 tail test (vs 2). (ANSWER = 0.0992, so about 10% in the one tail)
– TINV(p-value,degrees_freedom)
• This is the inverse. Give it the % in the tails and it will give you the T-value.
• NOTE: it expects the % for a 2-tail test!!!!
– SO, if they wanted you to do the inverse of the q above to get a t-value of 1.3:
– TINV(0.1984, 60): you double the percentage for 2 tail!

Estimating MEAN

T-dist instead of normal & sample stdev and not population stdev:

$\bar{x} \pm t_{\alpha/2,\,n-1}\; s/\sqrt{n}$

QUESTION: Tiger Woods is rumoured to pay his ‘girls’ $1 million per year to stay quiet. A random sample of 7 of them was taken, and the mean was $800,000 with a stdev of $100,000. Find the 95% interval estimate of the population mean. (assume pop is normal..)

Degrees of freedom = 6

= 800K ± t(.025) (100K/√7)

= 800K ± 2.447(37,796)

= 800,000 ± 92,487. RANGE between $707,513 and $892,487

Estimating MEAN

• T-statistic: Same as z-stat, just using the sample stdev in place of the population stdev!

QUESTION: Can we conclude with 95% confidence that the mean that Cheetah pays his girls is not $1,000,000?

H0: u = 1 million
H1: u ≠ 1 million (two tail test)

T = (800,000 – 1,000,000) / (100,000 / √7)

T = -200,000 / 37,796 = -5.29

T-critical with 2.5% in each tail (6 df) is at -/+ 2.447.

-5.29 is definitely past -2.447, so reject the null: the mean isn’t 1 million!
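
As a check on the arithmetic, here is a short sketch of the one-sample t statistic. The helper name `one_sample_t` is made up for illustration; the critical value 2.447 is typed in from the t table (6 df, 2.5% in each tail) because the standard library has no t-distribution.

```python
from math import sqrt

def one_sample_t(xbar, mu0, s, n):
    """t statistic for H0: population mean = mu0, using the sample st dev s."""
    return (xbar - mu0) / (s / sqrt(n))

t_stat = one_sample_t(xbar=800_000, mu0=1_000_000, s=100_000, n=7)

t_critical = 2.447                       # from the t table: 6 df, two tail, 5%
reject_null = abs(t_stat) > t_critical   # two tail: compare the absolute value

print(round(t_stat, 2), reject_null)
```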

Inference about VARIATION

• The sample variance (s²) is an unbiased, consistent and efficient point estimator for σ².
• Use the chi-squared distribution: $\chi^2 = (n-1)s^2 / \sigma^2$ with n-1 degrees of freedom
• BUT, it’s not symmetrical. So you NEED to look up both values.

Inference about VARIATION

• Question: Cheetah says he can hit a golf ball 300 with a variance of 2. We asked him to hit it 10 times. The sample mean was 297.5 with a var of 2.9. Can we say his variance assumption is wrong with 95% confidence?

H0: variance = 2
H1: variance ≠ 2

We need: chi-squared @ .025 and chi-squared @ .975 (2 values) with 9 degrees of f.

For 9 df’s: chi-squared(.025) = 19.0228, chi-squared(.975) = 2.7004

Compare those with (n-1)*2.9 / 2

chi-stat = 9*2.9 / 2 = 13.05

13.05 is not < 2.70 or > 19.02, so we can’t reject the null

Inference about VARIATION

• Question: Find the 95% confidence limits for the variance of his driving based on our sample info:

We have everything: s² = 2.9 and for 9 df’s: chi-squared(.025) = 19.0228, chi-squared(.975) = 2.7004

LCL = 9 * 2.9 / 19.0228 = 1.372

UCL = 9 * 2.9 / 2.7004 = 9.665

The range is so big because we only took 10 shots. Increase the sample size to decrease the interval!

Inference about PROPORTION

• Proportion is all about binary (yes, no).
– It is binomially distributed.
– Make sure that n*p > 5 and n(1-p) > 5
• USES THE Z-TEST!!! (says ‘z’ in the formulas) Test by comparing to the critical value:

$z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$

PROPORTION Q.

• Analysts say that the Leafs will win a dismal 25% of all games. So far the Leafs have played 25 games and won 8 (32%). Is this claim correct at the 0.05 significance level?

H0: win proportion = .25
H1: win proportion ≠ .25 (two-tail)

Z = (.32 - .25) / √(.25(.75) / 25)

Z = .8082

Z critical for .025 in each tail is +/- 1.96. .8082 is not past 1.96, so there is not enough evidence to say the analysts are wrong. Yet.
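
The Leafs test can be sketched as follows. The function name `proportion_z_test` is hypothetical; it returns the z statistic and the two-tail p-value, so the decision can be made either way (critical value or p-value).

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """z statistic and two-tail p-value for H0: population proportion = p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # st error uses p0, the null value
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tail

z_leafs, p_leafs = proportion_z_test(successes=8, n=25, p0=0.25)
print(round(z_leafs, 4), round(p_leafs, 3))
```

The two-tail p-value is far above 0.05, which matches the conclusion from the critical value.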

Number required for PROPORTION

To prove that the analysts are right: how many games would they need to play, winning 25% of them, to be 90% confident that the estimate is no more than 5% up or down from 25%?

B = .05 (the bound is 5% up or down; the full interval is .10 wide)

P(hat) is .25

Z(a/2) = 1.65

N = (1.65 √(.25*.75) / .05)²

N = 204.2 (round up, so 205 games)

RECAP!!

With 1 population, we talked about:
– Mean: t-stat
– Variance: chi-square
– Proportion: z-stat

Know how to use each (all very similar). Know the tables. AND know the Excel slide!

Chapter 13: Inference about comparing Two Populations

• With two populations, we can be:
– Comparing the variances
– Comparing the means
– Comparing paired observations
– Comparing proportions

Comparing two Variances

To infer info about two variances we use this test statistic (degrees of freedom n1-1 and n2-1 apply):

$F = s_1^2 / s_2^2$

The distribution / table to use is the F DISTRIBUTION (F-test). Like the chi-squared, it is not negative and not symmetrical (so need to look up 2 numbers again!):

F-TABLE QUESTION:

For example, what is the value of F for 5% of the area under the right hand “tail” of the curve, with a numerator degree of freedom of 4 and a denominator degree of freedom of 9? F .05,4,9

F-Dist

IF they want a LEFT-TAIL!!! You need to swap the degrees of freedom (so the row and the column) and divide 1 by that number!

Same question as last, but they want 5% in the left tail:

F .95,4,9 = 1 / F .05,9,4

EXCEL PRINTOUT QUESTIONS:

• This is to see if the variances of 2 populations are different. The null is always σ1²/σ2² = 1.
• If you want to show they are different, you want the two tail ‘does not equal’ test.
– PRINTOUT will give you the one tail p-value (will need to double it to get the two tail p-value)

F-Test Two-Sample for Variances

                      Consumers   Nonconsumers
Mean                  604.02      633.23
Variance              4103        10670
Observations          43          107
df                    42          106
F                     0.38
P(F<=f) one-tail      0.0004
F Critical one-tail   0.6371

0.0004 * 2 = 0.0008 is still much less than the significance level.

COMPARING TWO MEANS

• Similar to dealing with 1 mean, now we are looking at the difference of means:

1. If you know the stdev’s?? Plug-and-play:

$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$

Two means, unknown variances!

• MOST times you don’t know the stdev’s, so it is not so straightforward. You need the t-test, and there are 2 cases:
– When they have equal variances
– When they have unequal variances
• TO FIND OUT IF EQUAL/UNEQUAL:
1. The question will tell you
2. Look at the F-test of the ratio of the 2 variances (table)!

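Type 1. Assumed equal variances

• With equal variances, you can use that property to pool the variances, weighted by their degrees of freedom:

First get a combined (pooled) estimate value:

$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

Use it to find your t-value (df = n1 + n2 - 2):

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 (1/n_1 + 1/n_2)}}$

Type 2. Unequal variances

• With UNEQUAL variances, you have this intense degree of freedom formula (NEED TO CHAT ABOUT: what do you guys need to know about these formulas??):

$\nu = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$

Use it to find your t-value:

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$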

Question: people are tested to see if those who eat bran for breakfast eat more or fewer calories at lunch. The survey is trying to show that people who ate bran ate less at lunch.

H0: u1 – u2 = 0 (1 is consumers of bran, 2 is not)

H1: u1 – u2 < 0

Step 1: what test? We don’t know the variances of the pops, and we don’t know if they are equal or unequal (and it didn’t say in the q), so: F-TEST of variances!

TABLE QUESTION!: F-TEST!

• FROM PREVIOUS SECTION: the F-Test Two-Sample for Variances printout (Consumers vs Nonconsumers) gives F = 0.38 with a one tail p-value of 0.0004.
• The null is always σ1²/σ2² = 1, and ‘different’ means a two tail test, so double the p-value: 0.0004 * 2 = 0.0008 is still much less than the significance level. YES THEY DIFFER! So use the unequal variances t-test.

t-Test: Two-Sample Assuming Unequal Variances

                               Consumers     Nonconsumers
Mean                           604.02        633.23
Variance                       4102.97564    10669.76565
Observations                   43            107
Hypothesized Mean Difference   0
df                             123
t Stat                         -2.09
P(T<=t) one-tail               0.0193
t Critical one-tail            1.6573
P(T<=t) two-tail               0.0386
t Critical two-tail            1.9794

t-test: Equal/unequal variances t-test of Q. Calories eaten at lunch by 2 separate populations: do they differ?
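
The t Stat and df lines of this printout can be rebuilt from the means, variances and sample sizes alone. A minimal sketch of the unequal variances calculation; the function name `welch_t` is illustrative, and Excel shows the df rounded to a whole number.

```python
from math import sqrt

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """t statistic and degrees of freedom for two samples with unequal variances."""
    a, b = var1 / n1, var2 / n2            # each sample's variance of the mean
    t = (mean1 - mean2) / sqrt(a + b)      # hypothesized difference of 0
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

t_stat, df = welch_t(604.02, 4102.97564, 43, 633.23, 10669.76565, 107)
print(round(t_stat, 2), round(df))
```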

t-Test: Two-Sample Assuming Equal Variances

                               Method A   Method B
Mean                           6.47       6.17
Variance                       1.30       1.36
Observations                   25         25
Pooled Variance                1.33
Hypothesized Mean Difference   0
df                             48
t Stat                         0.9198
P(T<=t) one-tail               0.1811
t Critical one-tail            1.6772
P(T<=t) two-tail               0.3623
t Critical two-tail            2.0106

Equal Variances Chart (times to assemble. Q: do they differ?)

Matched Pairs Experiment (t-test and estimator of u_D)

• If you can find a way to pair the observations in the two samples, then you can use this method. Just because the samples have the same number of observations doesn’t mean they are matched, even if they are ordered; they NEED to be matched on another variable (GPA buckets etc).

Matched Pairs Experiment (t-test and estimator of u_D)

t-Test: Paired Two Sample for Means

                               Finance        Marketing
Mean                           65,438         60,374
Variance                       444,981,810    469,441,785
Observations                   25             25
Pearson Correlation            0.9520
Hypothesized Mean Difference   0
df                             24
t Stat                         3.81
P(T<=t) one-tail               0.0004
t Critical one-tail            1.7109
P(T<=t) two-tail               0.0009
t Critical two-tail            2.0639

Q1. Do finance majors make more than marketing majors? You could take 25 random people from each and do the t-test of equal/unequal variances. BUT here they took 25 buckets of GPA’s and took 1 random person per major from each range of GPA = matched pairs.

Matched Pairs Experiment (t-test and estimator of u_D)

Q2. Do finance majors make more than marketing majors by $5000? Same matched pairs data as Q1, but now the Hypothesized Mean Difference is 5000 instead of 0.

The mean difference is 65,438 – 60,374 = 5,064, and the printout’s t stat of 3.81 implies a standard error of about 5,064 / 3.81 = 1,329. So the t stat here would be about (5,064 – 5,000) / 1,329 = 0.05, and the p-value would not be < 0.05.

3) Inference about the difference between population proportions (with nominal data): z-test of p1 – p2
– Using nominal data, so win/lose categories.
– Same restriction of n*p and n*(1-p) > 5 (but now for both populations)
– Depending on the null hypothesis, there are 2 different formulas (one for a difference = 0 and one for a difference = D (not 0)): look at the Hypothesized Difference line in the table!!

Z-test of p1 – p2 type 1
- eg: testing for the proportion of a certain product being sold in 2 different stores, with a difference of 0 (so seeing if supermarket 1 sold more than supermarket 2)

z-Test: Two Proportions

                          Supermarket 1   Supermarket 2
Sample Proportions        0.1991          0.1493
Observations              904             1038
Hypothesized Difference   0
z Stat                    2.90
P(Z<=z) one tail          0.0019
z Critical one-tail       1.6449
P(Z<=z) two-tail          0.0038
z Critical two-tail       1.96

Z-test of p1 – p2 type 2
- eg: testing for the proportion of a certain product being sold in 2 different stores, with a difference of 3% (so seeing if supermarket 1 sold 3% more than supermarket 2)

z-Test: Two Proportions

                          Supermarket 1   Supermarket 2
Sample Proportions        0.1991          0.1493
Observations              904             1038
Hypothesized Difference   0.03
z Stat                    1.14
P(Z<=z) one tail          0.1261
z Critical one-tail       1.6449
P(Z<=z) two-tail          0.2522
z Critical two-tail       1.96
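
Both supermarket printouts can be approximated with one small function. This is a sketch that assumes the usual textbook pair of formulas, which the slides mention but do not show: a pooled proportion in the standard error when the hypothesized difference is 0, and the two separate sample proportions otherwise. The function name `two_proportion_z` is hypothetical.

```python
from math import sqrt

def two_proportion_z(p1, n1, p2, n2, hypothesized_diff=0.0):
    """z statistic for H0: p1 - p2 = hypothesized_diff."""
    if hypothesized_diff == 0:
        # Case 1 (difference = 0): pool the two samples to estimate one common p
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    else:
        # Case 2 (difference = D): keep the two sample proportions separate
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2 - hypothesized_diff) / se

z_case1 = two_proportion_z(0.1991, 904, 0.1493, 1038)
z_case2 = two_proportion_z(0.1991, 904, 0.1493, 1038, hypothesized_diff=0.03)
print(round(z_case1, 2), round(z_case2, 2))
```

Small differences from the printed z Stat values can come from the sample proportions being rounded to four decimals.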

EXAM PRACTICE Q (see package)

• For the NA / Asian / Africa death example:
• They ask: were there more than 10% more NA’s than Asians that died from Heart Disease? (2 columns with the nominal info)

PRINTOUT # 28: Z-Test: Two Proportions, Variable 1: I1:I51; Variable 2: J1:J51; Code for Success: 3

z-Test: Two Proportions

                          Asia-disease   N.America-disease
Sample Proportions        0.3            0.34
Observations              50             50
Hypothesized Difference   -0.1
z Stat                    0.6437
P(Z<=z) one tail          0.2599
z Critical one-tail       1.6449
P(Z<=z) two-tail          0.5198
z Critical two-tail       1.96

PRINTOUT # 30: Z-Test: Two Proportions, Variable 1: I1:I51; Variable 2: J1:J51; Code for Success: 3

z-Test: Two Proportions

                          Asia-disease   N.America-disease
Sample Proportions        0.3            0.34
Observations              50             50
Hypothesized Difference   0.1
z Stat                    -1.502
P(Z<=z) one tail          0.0665
z Critical one-tail       1.6449
P(Z<=z) two-tail          0.133
z Critical two-tail       1.96

EXAM PRACTICE Q: which printout?

• The z-test is of p1 – p2 (Asia minus N. America), so the sample difference is .3 - .34 = -.04.
• You want NA to be > 10% higher, so you would want it to be like .30 - .40 = -.10, which is a Hypothesized Difference of -0.1. So you want PRINTOUT # 28!

RECAP!!

• Two pop’s: looked at their variances, their means, matched pairs, and their proportions.

Chapter 14: Analysis of Variance

Analysis of Variance (ANOVA)

• Comparing 2 or more populations of INTERVAL data
• Determine whether differences exist between population means
– Done by analyzing sample variance, using the ANOVA technique:
• Single Factor (or 1-way): For populations which have only 1 factor that you are comparing them against, you use the ANOVA: Single Factor. This is like comparing sales from 3 cities with the factor being the marketing strategy.

1. SINGLE FACTOR

FINAL Condition:
• MST/MSE = F. Compare this value with F crit on the chart. If the F value is greater than F-crit then the means differ.
• REQUIRED CONDITIONS: the random variables must be normally distributed with equal variances.

One-Factor ANOVA

$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$

$H_a$: Not all $\mu_i$ are the same

[Figure: when all means are the same, the null hypothesis is not rejected; when at least one mean is different, the null hypothesis is rejected]

One-Factor ANOVA: Partitioning the Variation

Total Variation = the aggregate dispersion of the individual data values across the various populations

Within-Sample Variation (SSE) = dispersion that exists among the data values within a particular population

Between-Sample Variation (SST) = dispersion among the sample means

[Figure: response values for samples 1, 2 and 3, with the sample means x̄1, x̄2, x̄3 and the grand mean marked]

RECAP: Total Variation = Between Group Variation (SST) + Within Group Variation (SSE)

$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{\bar{x}})^2 = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{\bar{x}})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$

Where:

k = number of populations
n_i = sample size from population i
x_ij = jth measurement from population i
$\bar{\bar{x}}$ = grand mean (mean of all data values)

One-Way ANOVA Table

Source of Variation   df       SS          MS                     F ratio
Between Samples       C - 1    SST         MST = SST / (C - 1)    F = MST / MSE
Within Samples        nT - C   SSE         MSE = SSE / (nT - C)
Total                 nT - 1   SST + SSE

C = number of populations
nT = sum of the sample sizes from all populations
df = degrees of freedom

One-Factor ANOVA F-Test Statistic

H0: μ1 = μ2 = … = μk
Ha: At least two population means are different

• Test statistic: F = MST / MSE
– MST is the mean square between samples
– MSE is the mean square within samples
• Degrees of freedom
– dfC = k – 1 (k = number of populations)
– dfE = nT – k (nT = sum of sample sizes from all populations)

Single Factor:
- Comparing 3 independent populations, with the factor being marketing strategy.
Q. Is there enough evidence to support that the sales of this product differ?

Anova: Single Factor

SUMMARY
Groups        Count   Sum     Average   Variance
Convenience   20      11551   577.6     10775.0
Quality       20      13060   653.0     7238.1
Price         20      12173   608.7     8670.2

ANOVA
Source of Variation   SS         df   MS        F      P-value   F crit
Between Groups        57512.23   2    28756.1   3.23   0.0468    3.16
Within Groups         506983.5   57   8894.4
Total                 564495.7   59

F (3.23) is greater than F crit (3.16) and the p-value (0.0468) is below 0.05, so reject the null. All this says is that at least 2 of the means differ!
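
The F value in the ANOVA printout can be rebuilt from the SUMMARY block alone (count, average, variance per group). A minimal sketch; the function name `one_way_anova_f` is illustrative, and the averages are recomputed from Sum / Count to avoid the rounding in the printout.

```python
def one_way_anova_f(groups):
    """F = MST / MSE from a list of (count, mean, variance) tuples, one per sample."""
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / n_total

    sst = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)  # between samples
    sse = sum((n - 1) * var for n, _, var in groups)                  # within samples

    mst = sst / (k - 1)          # df between = k - 1
    mse = sse / (n_total - k)    # df within  = nT - k
    return mst / mse

f_stat = one_way_anova_f([
    (20, 11551 / 20, 10775.0),   # Convenience
    (20, 13060 / 20, 7238.1),    # Quality
    (20, 12173 / 20, 8670.2),    # Price
])
print(round(f_stat, 2))
```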

Analysis of Variance

Must be normal with equal variances:
- If nonnormal, replace the test with the Kruskal-Wallis test (making the numbers ordinal): not covered
- If unequal variances: we CANNOT DO!

NOTES / THOUGHTS:
- Why do we need it? If 4 pop’s, there are LOTS of pairs of means to compare (6 pairs), so with separate t-tests the potential for a type 1 error is huge. But we still need the t-test for 2 pop’s because ANOVA only says that means ‘differ’ (not < or >)

Chapter 15: Chi-Squared Tests

NOTE: EVERY CATEGORY NEEDS AN EXPECTED FREQUENCY OF AT LEAST 5! IF LESS, sample more, or combine groups.

Goodness of Fit Test

Goodness of Fit Test, used in 2 ways:
• Used to describe one population of data with more than 2 nominal options (not heads/tails, but rock/paper/scissors)
– trials must be independent
– must have expected frequency of at least 5 for each (n*p)
– the null hypothesis says the expected frequency is n*p for each option, and the goodness of fit test determines if the actual results differ from them.
• Used to determine if classifications of a pop are different than stated frequencies. The null hypothesis is that they equal certain frequencies. The alternative hypothesis says that at least one is not equal.

Given proportions of .45, .40, and .15, expected frequency = n*p.

This test took 200 samples, compares the expected to the actual, and gives the p-value / critical values.

Category   Expected (e)    Actual - expected   (f - e)² / e
1          200(.45) = 90   12                  1.6
2          200(.40) = 80   2                   0.05
3          200(.15) = 30   -14                 6.53
Total      200                                 8.18

Chi-squared stat = 8.18. Chi Critical (2 df, 5% significance) = 5.99, so the actual results differ from the stated frequencies.
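
The goodness of fit statistic is just a sum of (actual - expected)² / expected over the categories. A small sketch follows; the function name `chi_squared_gof` is made up, and the observed counts 102, 82 and 16 are reconstructed here from the slide's expected counts and differences (90 + 12, 80 + 2, 30 - 14), so treat them as an assumption.

```python
def chi_squared_gof(observed, proportions):
    """Chi-squared goodness of fit statistic against the stated proportions."""
    n = sum(observed)
    expected = [n * p for p in proportions]      # expected frequency = n * p
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

chi_stat = chi_squared_gof(observed=[102, 82, 16], proportions=[0.45, 0.40, 0.15])
print(round(chi_stat, 2))
```

The degrees of freedom for the table lookup are the number of categories minus 1, so 2 here.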

Chi-Squared Test of a Cont. Table

• TEST to see: is there enough evidence to infer that two nominal variables are related, or to infer that differences exist between two or more populations of nominal data?
• H0 is that the two variables are independent. H1 is that they are dependent.

METHOD 1: TABLE GIVEN

Contingency Table

            MBA Major
Degree      1     2     3     TOTAL
1           31    13    16    60
2           8     16    7     31
3           12    10    17    39
4           10    5     7     22
TOTAL       61    44    47    152

chi-squared Stat       14.7019
df                     6
p-value                0.0227
chi-squared Critical   12.5916

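Chi-Squared Test of a Cont. Table

METHOD 2: BY HAND

Test statistic (same as goodness of fit), with (rows – 1)(columns – 1) degrees of freedom:

$\chi^2 = \sum (f_i - e_i)^2 / e_i$

Expected value of a cell = row total × column total / n. For example:
– Degree 1, Major 1: 60 × 61 / 152 = 24.08
– Degree 3, Major 2: 39 × 44 / 152 = 11.29
– Degree 1, Major 3: 60 × 47 / 152 = 18.55

• NOTE: if the expected value of a particular cell is less than 5, you need to combine rows/columns to satisfy this rule. (ALSO, the degrees of freedom must be changed!)
– Small part question..?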
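
The by-hand method can be sketched in code using the Degree by MBA Major table from the printout. The function name `chi_squared_contingency` is illustrative; it builds each expected value as row total × column total / n and sums the contributions.

```python
def chi_squared_contingency(table):
    """Chi-squared statistic and df for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)   # (rows - 1)(columns - 1)
    return stat, df

chi_stat_table, df_table = chi_squared_contingency([
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
])
print(round(chi_stat_table, 2), df_table)
```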

Chapter 16: Linear Regression and Correlation

Linear Regression

REGRESSION: USED TO analyze the relationship between interval variables.
• Is there a linear relationship between one variable (the dependent variable) and other variables (the independent variables)?
• Predict the dependent variable based on the independent ones.

Least Squares Method

• Least Squares Method
– The objective of the scatter diagram is to measure the strength and direction of the linear relationship
– Both can be more easily judged by drawing a straight line through the data.
– How to draw that line? LSM!
• This line has the smallest sum of squared vertical distances to all the points on the plot.
• LSM: It creates a line, and it is created by:

$b_1 = s_{xy} / s_x^2$

$b_0 = \bar{y} - b_1 \bar{x}$

$\hat{y} = b_0 + b_1 x$

You calculate b1, then for b0 sub in the mean values of x and y, and then write the line like the bottom one here.

OBAMA EXAMPLE:

B1 = 34.75 / 16.5 = 2.1 (covariance over the VARIANCE of x, not the stdev)

B0 = 44 – (2.1)*7 = 29.3

FINAL LINE:

Y = 29.3 + 2.1x

At 0 mins, he has 29%, and with every extra minute of her skit, Obama gains about 2% of the supporters.
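
The two least squares formulas translate directly into code. A minimal sketch with a hypothetical helper `least_squares_line`, fed with the example's covariance (34.75), variance of x (16.5) and the two means (7 and 44). The intercept comes out slightly below the slide's 29.3 because the slide rounds the slope to 2.1 first.

```python
def least_squares_line(cov_xy, var_x, mean_x, mean_y):
    """Slope and intercept of the least squares line."""
    b1 = cov_xy / var_x            # slope: covariance over the variance of x
    b0 = mean_y - b1 * mean_x      # intercept: the line passes through the means
    return b0, b1

b0, b1 = least_squares_line(cov_xy=34.75, var_x=16.5, mean_x=7, mean_y=44)
print(round(b0, 2), round(b1, 2))

# Predicted support after 10 minutes of the skit (10 is an arbitrary example value)
prediction = b0 + b1 * 10
```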


Linear Regression and Correlation

• Eg: resell value of car with x miles on the odometer

METHOD 1: Standard Error of Estimate
SSE = sum of the squared errors over all the points; the standard error of estimate is built from it. It tells you how good the fit is: the smaller, the better. Judge it relative to the sample mean of y: here 0.32 vs. a sample mean of 15 ($15,000 for a car on average) is small.

Linear Regression and Correlation

• Eg: resell value of car with x miles on the odometer
METHOD 2: TEST SLOPE (part 1)
If we want to see if there is a relationship (if no relationship, the line is horizontal, so slope = 0):

H0: β = 0
H1: β ≠ 0

Can reject if you know the t-critical, or can just look at the p-value.

P-values here ASSUME a two-tail test!! So with this, compare 0.000 with 0.05 or whatever significance level you are given!

• Eg: resell value of car with x miles on the odometer
METHOD 2: TEST SLOPE (part 2)
If we want to see if it has a negative relationship (eg: slope < 0):

H0: β = 0
H1: β < 0

Can reject if you know the t-critical, or can just look at the p-value.

SINCE the printed p-value assumes two tails, you need to divide the p-value by 2!! So compare 0.000/2 with 0.05. This one is easy, but if the printout said 0.06 you would have to be careful. And if you are testing for > 0 when the coefficient is negative, the p-value you need is 1 – (p-value / 2)!
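The conversion from a printed two-tail p-value to a one-tail p-value can be sketched as a small helper. The function name and the example coefficient of -0.09 are hypothetical; only the halving rule and the "1 minus" rule come from the slide.

```python
def one_tail_p_value(two_tail_p, coefficient, alternative):
    """Convert a two-tail p-value to a one-tail p-value.

    alternative is 'less' (H1: beta < 0) or 'greater' (H1: beta > 0).
    """
    half = two_tail_p / 2
    if alternative == "less":
        sign_matches = coefficient < 0
    else:
        sign_matches = coefficient > 0
    # If the estimate points the same way as H1, use half the two-tail value;
    # otherwise the one-tail p-value is the complement.
    return half if sign_matches else 1 - half


# Hypothetical printout: negative slope with a two-tail p-value of 0.06.
print(one_tail_p_value(0.06, -0.09, "less"))
print(one_tail_p_value(0.06, -0.09, "greater"))
```

In the hypothetical case, the two-tail value 0.06 would not reject at 0.05, but the one-tail value for H1: β < 0 would.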


Linear Regression and Correlation

• Eg: resell value of car with x miles on the odometer
METHOD 3: Coefficient of Correlation
If you want to see if there is a linear relationship between the variables. NOTE: r is between -1 and +1.

H0: ρ = 0
H1: ρ ≠ 0

0.8052 = pretty good positive relationship!

BUT… how much can be explained by this model???

Linear Regression and Correlation

• Eg: resell value of car with x miles on the odometer
METHOD 3.5: Coefficient of Determination
Remember: when we square r, we get the coefficient of determination, which is how much of the variation in the dependent variable is explained by the independent variable. If it is 1, there is no error and all variation is due to the independent variable; if it is 0, there is no linear relationship between the variables and all variation is error.

Here, about 65% of the variation in the price is explained by the mileage (0.8052² ≈ 0.648)!!
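As a quick check of the step from correlation to determination, the squaring can be done directly. This uses only the 0.8052 quoted in the example.

```python
r = 0.8052            # coefficient of correlation from the example
r_squared = r ** 2    # coefficient of determination

print(round(r_squared, 4))
```

The result is roughly 0.648, i.e. a little under two thirds of the variation in price is explained by the odometer reading.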


Linear Regression and Correlation

Prediction Interval VS Confidence Interval??

• Use a prediction interval to estimate the value of y for an individual item, and a confidence interval estimate to estimate the expected (mean) value of y for all items with the given x.

• The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.

Regression Diagnostics

• Three conditions required in order to perform a regression analysis:
– Error variable must be normally distributed
– Error variable must have a constant variance
   • Heteroscedasticity: when the error variable does not have a constant variance
– Errors must be independent of each other
   • Can’t be correlated (time series data usually have errors that are correlated)
   • Autocorrelated or serially correlated: error terms that are correlated over time

• How to diagnose? – Residual Analysis.

Chapter 17 & 18: Multiple Regression

Multiple Regression (multiple variables, all first-order)

Many different variables that go into these types.

Eg: A Hotel is looking to expand – but they don’t know where in this 1 particular city. Should it be close to the airport? Close to high income housing? To the university?

TO find out, sample hotels in the area, find out their details for the various questions you think are factors towards profitability, and then RUN EXCEL to see which factors actually affect profit!!

Required Conditions…

For these regression methods to be valid the following four conditions for the error variable must be met:

1. The distribution of the error variable is normal.
2. The mean of the error variable is 0.
3. The standard deviation of the error variable is σε, which is a constant.
4. The errors are independent.

Eg: Hotel profit margin – based on 6 factors.

Multiple Regression (multiple variables, all first-order)

Eg: Hotel profit margin – based on 6 factors.
ANOVA (used most for multiple reg.)
The printout gives us the overall quality of this model: Significance F! You could compare F with F-critical, or look at Significance F and compare it to the significance level. Here 0.00 < 0.05, so the model is valid. If Significance F were greater than 0.05, we could not conclude that any of the factors is related to profit margin.

Multiple Regression (multiple variables, all first-order)

Eg: Hotel profit margin – based on 6 factors.

R Square / Adjusted R Square:

R square = coefficient of determination.
Adjusted R square = coefficient of determination adjusted for the number of variables. Use it when there is MORE THAN 1 VARIABLE to determine how much of the variation is explained by the variables!!

Multiple Regression (multiple variables, all first-order)

Eg: Hotel profit margin – based on 6 factors.

Variable Relationships?:

Look to the P-values. If a variable's p-value is less than the significance level (using 0.05), it is good (reject the null hypothesis that there is no relationship). The other variables can be removed.


Multiple Regression (multiple variables, all first-order)

Eg: Hotel profit margin – based on 6 factors.

Interpret the Coefficients!

A + coefficient means that the profit margin goes up as that variable increases (holding the others constant). A - coefficient means that the profit margin goes down.

Example Q

• You can predict the outcome if they built where:
– There are 3815 rooms within 3 miles of the site.
– The closest other hotel or motel is .9 miles away.
– The amount of office space is 476,000 square feet.
– Census data indicates the median household income in the area (rounded to the nearest thousand) is $35,000

Multicollinearity
• When 2 or more of your variables are not just related to the dependent variable (eg: profit margin), but are also correlated with each other (so if distance to university goes down, then income of houses goes up/down). There will always be some of this, but a strong correlation = multicollinearity.

• WHAT DOES THIS MEAN?: The overall model can still be tested for a relationship (Significance F is ok!), but you cannot tell which individual variables are related (the individual t-tests are not reliable!).

• FIX?: You can build the model one variable at a time, or delete some variables. Not needed for the final.

Multiple Regression (multiple variables, all first-order)

• Indicator Variables (or dummy variables)
– Either 0 or 1
– If there are 3 different colours of cars (and colour may affect the price of cars), then you need 2 dummy variables (1 less than the number of options)
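The "one less than the number of options" rule can be illustrated with a small encoder. The colour names and the function name are hypothetical; the first category acts as the baseline, represented by all zeros.

```python
def dummy_encode(value, categories):
    """Return len(categories) - 1 indicator variables (0 or 1).

    The first category is the baseline and is encoded as all zeros.
    """
    return [1 if value == category else 0 for category in categories[1:]]


# Hypothetical colour options: 3 options need only 2 dummy variables.
colours = ["white", "silver", "other"]
for colour in colours:
    print(colour, dummy_encode(colour, colours))
```

A third dummy would add nothing: once the two indicators are known, the baseline colour is implied.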


STATS LIGHTNING ROUND!!!

Which technique to use?

• The bookstore has a policy that the proportion of books returned should be less than 10%. To see if the policy is working, a random sample of book titles was drawn, and the fraction of the total originally ordered that are returned is recorded. Can we infer at the 10% significance level that the mean proportion of returns is less than 10%.

t-test of μ

• Data type: interval (the fraction of the total originally ordered that is returned).

Which technique to use?

• Has the recent drop in airplane passengers resulted in better on time performance? Before the recent downturn, one airline bragged that 92% of its flights were on time. A random sample of 165 flights reveals that 153 were on time. Can we conclude at the 5% significance level that the airline’s on time performance has improved?

z-test of p

• Data type: nominal (each flight is either on time or not).

Which Technique?

• FROM PRACTICE EXAM: c) Do Asians who die due to any one of the 3 given diseases live longer than the Africans?


ANSWER: 2 populations, central location, interval data, independent samples. Are the variances equal? YOU DON'T KNOW, so run the F-TEST first!!


PRINTOUT # 8: F-Test Two-Sample for Variances (Variable 1: A1:A51; Variable 2: B1:B51)

                       Asia-age    Africa-age
Mean                   43.94       49.44
Variance               135.1596    88.74122
Observations           50          50
df                     49          49
F                      1.523076
P(F<=f) one-tail       0.07218
F Critical one-tail    1.60729

0.0721 × 2 (two-tail!!) = 0.1442 > 0.05: not enough evidence to reject the 'equal variances' hypothesis, so use the EQUAL-variances t-test printout.
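The decision between the equal-variances and unequal-variances printouts can be sketched as follows. The F statistic is just the ratio of the two sample variances; the one-tail p-value is read from the printout rather than computed, and the 0.05 significance level is the one used on the slide.

```python
# Sample variances from the printout
var_asia = 135.1596
var_africa = 88.74122

f_stat = var_asia / var_africa   # F = ratio of sample variances

one_tail_p = 0.07218             # read from the printout, not computed here
two_tail_p = 2 * one_tail_p      # H1 is "variances differ", so two-tail
alpha = 0.05

# Fail to reject equal variances -> use the equal-variances t-test
use_equal_variances = two_tail_p > alpha

print(round(f_stat, 4), round(two_tail_p, 4), use_equal_variances)
```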

Which Technique?

PRINTOUT # 15: t-Test: Two-Sample Assuming Equal Variances (Variable 1 Range: A1:A51; Variable 2 Range: B1:B51)

                                Asia-age    Africa-age
Mean                            43.94       49.44
Variance                        135.1596    88.74122
Observations                    50          50
Pooled Variance                 111.9504
Hypothesized Mean Difference    0
df                              98
t Stat                          -2.59908
P(T<=t) one-tail                0.005395
t Critical one-tail             1.660551
P(T<=t) two-tail                0.01079
t Critical two-tail             1.984467

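The one-tail p-value is the CLOSEST match to the question!

0.0054 is the left-tail p-value (t Stat is negative), but you want the right tail (H1: Asians live longer)!!

1 - 0.0054 = 0.9946!! That is far above 0.05, so there is no evidence that Asians live longer.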
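The equal-variances t statistic in the printout can be reproduced from the summary rows, which also makes the tail flip explicit. This is a sketch: the function name is hypothetical, and the left-tail p-value is taken from the printout rather than computed.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Equal-variances t statistic for mu1 - mu2 = 0.

    Returns (t statistic, pooled variance, degrees of freedom).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, pooled_var, df


t_stat, pooled_var, df = pooled_t(43.94, 135.1596, 50, 49.44, 88.74122, 50)

left_tail_p = 0.005395            # one-tail value read from the printout
right_tail_p = 1 - left_tail_p    # H1 is mu1 - mu2 > 0, but t is negative

print(round(t_stat, 5), round(pooled_var, 4), df, round(right_tail_p, 4))
```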

Which Technique?

• Guggul is a popular remedy in India to lower cholesterol levels. 103 Philadelphia area adults were divided into 3 groups. Group 1 took placebos, 2 took 1000 milligrams of guggul, group 3 took 2000 milligrams of guggul. The changes in low-density cholesterol were recorded. Can we infer that there are differences in the reduction of cholesterol between the groups? Histograms are bell shaped and similar.

ANOVA (one-way)

• -PO: Compare 2 or more populations: groups 1, 2 and 3.
• -DT: Interval: reduction in cholesterol levels.
• -Samples are independent: nothing relating the samples.
• -Normally distributed: stated in question.

Which technique?

• Who spends more on vacations, golfers or skiers? A travel agency surveyed 15 customers who regularly take their spouses on either a golfing or skiing vacation and asked how much money they spend. Can we infer that golfers and skiers differ in their vacation expenses. Assume variance is a 1:1.5 ratio between the golfer and skier spending. Normally distributed.

Equal-variances t-test of μ1 − μ2

• The variance ratio is 1:1.5 (rule of thumb: a ratio of less than 1:2 means the variances can be treated as equal).

Which Technique?

• Do waitresses or waiters earn larger tips? To answer this, a study was done involving a measure of the percentage of the total bill left as a tip. One randomly selected waiter, and waitress was selected in each of the 50 restaurants during a one-week period. What conclusions can be drawn from the data? Normal data

t-test of μD

• 2 pop’s: waiters and waitresses. Data type is interval because percentages were recorded. We are measuring central location (average tip percentage), and the pairs are matched (one male and one female from each restaurant).

Which technique to use?

• During the winter some grape vines die from the extreme cold. In the spring the vines are pruned; if it is brown the plant is dead, and green means healthy. A random sample of vines is selected to see how well it survived the winter. Each vine is considered (1) alive, or (2) dead. Estimate with 90% confidence the degree of winter kill for this vineyard.

Estimator of p

Which technique to use?

• A fast food franchiser wants to build a restaurant downtown, but based on financial analysis, the location is only acceptable if the number of pedestrians passing by the location averages more than 200 per hour. To help decide whether to build or not, a statistics practitioner observes the number of pedestrians who pass the site each hour over a 40-hour work week. Should the franchiser build on the street?

t-test of μ

• The question is asking us to describe one population: we want to know if the population mean is larger than 200 per hour.

t-Test: Mean

                       Pedestrians
Mean                   209.125
Standard Deviation     60.0078
Hypothesized Mean      200
df                     39
t Stat                 0.9617
P(T<=t) one-tail       0.1711
t Critical one-tail    1.6849
P(T<=t) two-tail       0.3422
t Critical two-tail    2.0227

P-value = .1711
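The t Stat in this printout follows from the summary rows, as the sketch below shows. The function name is hypothetical; n = 40 comes from the 40 hourly observations (df 39).

```python
import math


def one_sample_t(sample_mean, sample_sd, n, hypothesized_mean):
    """t statistic for a test of one population mean."""
    std_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / std_error


t_stat = one_sample_t(209.125, 60.0078, 40, 200)
print(round(t_stat, 4))
```

Because the question asks whether the mean is more than 200, the one-tail p-value is the one to read off the printout.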

Which technique?

• In textbook example 12.52, we described the problem with changing light bulbs. We decided they need to be fixed, but there are 2 brands of bulbs to use. The mean and variance of the lengths of life are important, we therefore randomly sample both brands by leaving them on until they burn out. The times were recorded. Can we conclude that the variances differ?

F-test of σ1² / σ2²

• 2 pop’s: the two brands of bulbs, sampled independently. The question asks whether the variances differ.

What technique to use?

• To determine the effect of full-page advertisements, the owner of a store asked 200 randomly selected people who visited the store whether or not they had seen the ad. He also determined whether or not the customers bought anything and if so, how much they spent.

• A) Can the owner conclude that customers who see the ad are more likely to make a purchase than those who do not see the ad?

z-test of p1 − p2

-Compare 2 populations (those who saw the ad and those who didn’t)

-Nominal data (Buy or not buy)

Excel output
• Can the owner conclude that customers who see the ad are more likely to make a purchase than those who do not see the ad? (one-tail)
• Can we conclude that the purchase tendency between people who see the ad and people who don’t is different? (two-tail)

z-test of the Difference Between 2 Proportions

                       Sample 1 (Saw ad)    Sample 2 (Didn’t)
Sample Proportion      0.4336               0.2414
Sample Size            113                  87
Alpha                  0.05
z stat                 2.83
P(Z<=z) one tail       0.0024
z critical one tail    1.6449
P(Z<=z) two-tail       0.0052
Z Critical two tail    1.96
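The z stat in this printout can be reproduced from the two sample proportions and sample sizes. This is a sketch using the pooled-proportion standard error, which is the usual choice when the hypothesized difference is 0; the function name is hypothetical.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for H0: p1 - p2 = 0, using the pooled proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_error


# Saw ad: 0.4336 of 113; didn't see ad: 0.2414 of 87 (from the printout)
z = two_proportion_z(0.4336, 113, 0.2414, 87)
print(round(z, 2))
```

The value is then compared with the one-tail critical value 1.6449 for the "more likely" question, or with 1.96 for the "different" question.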

Which technique to use?

• 20 people were recruited who were more than 50 pounds overweight to compare 4 diets. The people were matched by age in groups of 4. The number of pounds that each person lost were recorded. Can we infer that there are differences between the four diets? All histograms are bell shaped and similar.

Analysis of variance (randomized blocks)

• -PO: Compare 2 or more populations (different diets)
• -DT: Interval (amount of weight lost)
• -Blocks (according to age)
• -Distributed normally

What technique to use?

• To determine the effect of full-page advertisements, the owner of a store asked 200 randomly selected people who visited the store whether or not they had seen the ad. He also determined whether or not the customers bought anything and if so, how much they spent.

• B) Can the owner conclude that customers who see the ad spend more than those who do not see the ad? (normal dist)

t-test of μ1 − μ2 (assume variances are equal)

-Compare 2 pop’s (see the ad and don’t see the ad)

-Interval data (Amount of money Spent)

-Independent Samples, normal

EQUAL or UNEQUAL????

GO to the F test!

F-Test Two-Sample For Variances

                       AD        No-AD
Mean                   97.38     92.01
Variances              621.97    283.26
Observations           49        21
df                     48        106
F Critical one-tail    1.3
P(F<=f) one-tail       0.008
F Critical one-tail    1.876

H0: var1 = var2
H1: var1 ≠ var2

Excel Output

B) Can the owner conclude that customers who see the ad spend more than those who do not see the ad?

t-test: Two Sample Assuming Equal Variance

                       Ad        No Ad
Mean                   97.38     92.01
Variance               621.97    283.26
Observations           49        21
Pooled Variance        522.35
df                     68
t stat                 0.9
P(T<=t) one tail       0.1853
t Critical one tail    1.6676
P(T<=t) two tail       0.3705
t Critical two tail    1.9955

P value= 0.1853

T-test: assuming unequal variances

Which Technique to use?

• A spokesperson for the postal service said it had a success rate of more than 95% in delivering priority mail letters within a 2 day deadline. Angry mailman decided to conduct an experiment to test this statement by sending letters by priority mail and ordinary mail from his hometown to Waterloo. Letters that arrived on time were recorded with a 2 and late letters were recorded as a 1.

• A) Does the data provide sufficient evidence to support the spokesperson's claim?

z-test of p

• Describe a population: do they deliver letters within the 2 day deadline over 95% of the time?

• Nominal data: 1 for late letters, 2 for on time.

Excel Output

z-Test: Proportion

                           2
Sample Proportion          0.9713
Observations               244
Hypothesized Proportion    0.95
z Stat                     1.5274
P(Z<=z) one-tail           0.0633
z Critical one-tail        1.6449
P(Z<=z) two-tail           0.1266
z Critical two-tail        1.96

P value = 0.0633
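The z Stat for a single proportion can be checked from the printout's summary rows. This is a sketch with a hypothetical function name; because the sample proportion on the printout is rounded to four decimals, the result matches the printed 1.5274 only approximately.

```python
import math


def one_proportion_z(p_hat, n, p0):
    """z statistic for a test of one proportion against p0."""
    std_error = math.sqrt(p0 * (1 - p0) / n)   # uses the hypothesized p0
    return (p_hat - p0) / std_error


z = one_proportion_z(0.9713, 244, 0.95)
print(round(z, 2))
```

Since z is below the one-tail critical value 1.6449 (p-value 0.0633 > 0.05), the data do not support the "more than 95%" claim at the 5% level.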

Which technique to use?

• A spokesperson for the postal service said it had a success rate of more than 95% in delivering priority mail letters within a 2 day deadline. Angry mailman decided to conduct an experiment to test this statement by sending letters by priority mail and ordinary mail from his hometown to Waterloo. Letters that arrived on time were recorded with a 2 and late letters were recorded as a 1.

• B) Does the data provide sufficient evidence to prove that Priority mail met the 2 day deadline more frequently than ordinary mail?

z-test of p1 − p2

• Compare two populations: priority mail and ordinary mail.
• Nominal data: 1 = fail, 2 = success.
• 2 categories: on time or not.

Excel Output

z-Test: Two Proportions

                           Priority    Ordinary
Sample Proportions         0.9714      0.9101
Observations               245         378
Hypothesized Difference    0
z Stat                     3.018
P(Z<=z) one tail           0.0013
z Critical one-tail        2.3263
P(Z<=z) two-tail           0.0026
z Critical two-tail        2.5758

P value= .0013

What technique?

• A popular game of craps is based on the probabilities of rolling certain sums with a pair of dice. Ex. Probability of rolling a sum of 3= 2/36, 4=3/36… A stats professor suspects that the dice are not fairly balanced and records each of 1000 throws. A) Does the data allow us to infer that the dice are not fair?

χ² goodness-of-fit test

• -PO: Describe a population: sums when rolling dice.
• -DT: Nominal: sum of the rolls.
• -Multinomial experiment: more than 2 categories (this is why it is not a z-test of p).

Which technique to use?

• CFC’s are banned because they damage the ozone layer. The new legislation in Ontario will affect those who use CFC’s in their air conditioners in cars. To see how many vehicles will be affected by the new ban on CFC’s, a survey of 650 vehicles was taken and each car was identified as either (1) use CFC or (2) do not. If there are 5 million vehicles registered in Ontario, estimate with 95% confidence the number of vehicles affected by the new law.

Estimator of p


Which Technique to use?

• To examine whether age is a factor in determining who drinks alcohol, 1054 adults were polled and asked “do you ever drink alcohol?” Responses were recorded as 1=yes, 2=no. They were also asked to report their age category where 1=18-29, 2=30-49, 3=50+. Can we infer that differences exist among age categories with respect to alcohol use?

χ²-test of a contingency table

-PO: Compare 2 or more populations
-DT: Nominal

Which technique?

• To test if a computer came with enough memory the age of the computers was tested. Random samples were taken of computer users where each was asked the brand of the computer and its age in months. Does the data provide sufficient evidence to conclude that there are differences in age between the computer brands. All histograms are bell shaped and similar. Brands identified were Dell, Hewlett Packard, IBM and other.

ANOVA (one-way)

• -PO: Compare 2 or more populations: HP, Dell, IBM, other.
• -DT: Interval: months.
• -Samples are independent: nothing relating the samples.
• -Normally distributed: stated in question.

Which technique?

• It was decided to upgrade the skills of workers because they were unable to master required skills. Experts identified 6 skills and each worker was rated based on these skills and also on the quality of work done on the machine. All data is interval. Identify the skills that affect the quality of work.

Multiple Regression t-test of βi

• -PO: Analyze the relationship between two or more variables (skills and quality of work).

• -DT: Interval

What technique?

• A professor of statistics hands back his graded midterms in class by calling out the name of each student and personally handing the exam over to its owner. At the end of the process he notes that there are several exams left over, the result of students missing that class. He forms the theory that the absence is caused by a poor performance by those students on the test. If the theory is correct, the leftover papers will have lower marks than those papers handed back. He recorded the marks (out of 100) for the leftover papers and the marks of the returned papers. Do the data support the professor's theory? Histograms are bell shaped and one is more than three times the width of the other.

Unequal-variances t-test of μ1 − μ2

• 2 populations: leftover papers and returned papers.
• Interval data: marks out of 100.
• Histograms are bell shaped (normal), and one is more than three times the width of the other (unequal variances).
