BU255 FINAL Exam-AID
Taught by: Greg Overholt
What are we doing??
• Stats
• Lectures 10 to 20!!
– All of it.
Lecture 10 & 11: Estimation
Chapters 5 and 6:
• Binomial, Poisson, normal, and exponential distributions allow us to make probability statements about X (an individual member of the population).
• To do so we need the population parameters.
– Binomial: p
– Poisson: μ
– Normal: μ and σ
– Exponential: λ
Chapter 7:
• Sampling distributions allow us to make probability statements about sample statistics.
• We need the population parameters.
– Sample mean: μ and σ
– Sample proportion: p
However, in almost all realistic situations the parameters are unknown. We will use the sampling distribution to draw inferences about the unknown population parameters.
• Introduction to Statistical Inference
• Estimation
– Point and Interval Estimators
– Properties of Estimators
• Interval Estimation [confidence intervals]
• Determining Sample Size
Estimation
• Estimation: determining the approximate value of a population parameter based on a sample statistic.
• 2 types:
– Point Estimator
• Estimates the parameter with a single value. No good on its own: a single point is too small a target.
– Interval Estimator
• Used almost all the time.
• Uses an interval to estimate the population parameter.
• Provides a % confidence that the parameter is between a lower and an upper bound.
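Estimating μ when σ is known
You typically want to know μ. And if you have σ?
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
\[ P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \]
100(1-α)% Confidence Interval of μ when σ is known:
\[ \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]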
EXAMPLE
• Diageo sampled 85 Laurier students and determined that the sample mean of alcohol consumption was 510 drinks a term. They previously calculated that the population standard deviation was 46. Create an interval estimate of the population mean with 95% confidence.
X(bar) = 510, n = 85, σ = 46, Zα/2 = unknown so far, but we want 95% confidence.
95% in the middle, so that’s 2.5% in each tail, so we want the Z value with an area of .475 between 0 and Z: Z = 1.96
= 510 – 1.96(46/√85) < μ < 510 + 1.96(46/√85)
= 510 – 9.78 < μ < 510 + 9.78
= 500.22 < μ < 519.78
We are 95% confident that the average number of drinks for the population is between 500.22 and 519.78.
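A quick way to double-check this kind of interval is a few lines of Python. This is only a sketch: the function name `z_interval` is made up for illustration, and it uses the standard library's `NormalDist` in place of the z-table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for a mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    margin = z * sigma / sqrt(n)             # z * (sigma / sqrt(n))
    return xbar - margin, xbar + margin

# Numbers from the Diageo example: x-bar = 510, sigma = 46, n = 85
low, high = z_interval(xbar=510, sigma=46, n=85, confidence=0.95)
print(round(low, 2), round(high, 2))
```

The printed bounds should agree with the hand calculation to two decimals.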
Intro – T dist: What’s different?
• In the past, we have known the standard deviation of the population (which is unrealistic)
– With it, we can use the Z stat to make inferences
• NOW, we don’t know the st dev. So we have to use the ‘sample st dev’, which is why we use the T-stat
– The population dist has GOT to be normal (or approximately normal)!
T-Distribution
Degrees of Freedom
• It is the number of items that are free to vary once the mean is fixed. The best way to think of it: assume one of the numbers is pinned down by your mean, and the rest are simply numbers around the mean that determine the shape (so the degrees of freedom are the number of items that determine the shape: n minus the one center value, or n – 1).
• (The normal distribution is the t-distribution with df = infinity.)
EXCEL (good MC)
• T-dist calculations can be done using Excel:
– TDIST(x, degrees_freedom, tails)
• Use this when you want the % in the tail(s).
• TDIST(1.3,60,1)
– 1.3 is your t-value (like your z-value), the curve is drawn with 60 degrees of freedom, and you want the 1-tail area (vs 2). ANSWER = 0.0992 (so about 9.9% in the one tail)
– TINV(probability, degrees_freedom)
• This is the inverse. Give it the % in the tails and it will give you the t-value.
• NOTE: it expects the % in BOTH tails (2-tail)!!!!
– SO, if they wanted you to do the inverse of the question above to get a t-value of 1.3:
– TINV(0.1984, 60): you double the one-tail percentage for 2 tails!
Formula / example
• Assume the pop is relatively normal
– Test statistic, with n – 1 degrees of freedom:
\[ t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \]
– Confidence interval formula:
\[ \bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \]
QUESTION: The researched average cost of a standing-room-only ticket to a Leafs game from a scalper is $168.
A random sample of 16 tickets bought from different scalpers resulted in xbar = $172.50, s = $15.40. Find the 95% interval estimate. Assume the population distribution is relatively normal.
Degrees of Freedom?: (n-1) = 15.
What is the t value? t(.025, 15) = 2.131
SO the formula is 172.50 – 2.131(15.40/√16) < μ < 172.50 + 2.131(15.40/√16), where √16 = 4
INTERVAL of 95% confidence: 164.3 to 180.7 … YES, this includes 168
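The same check works for the t-interval. In this sketch the critical value is typed in from the t-table (2.131 for 15 df), because the Python standard library has no t-distribution; `t_interval` is a made-up helper name.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Confidence interval for a mean using the sample st dev and a table t-value."""
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin

# Leafs ticket example: x-bar = 172.50, s = 15.40, n = 16, t(.025, 15) = 2.131
low, high = t_interval(xbar=172.50, s=15.40, n=16, t_crit=2.131)
print(round(low, 1), round(high, 1))
```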
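Estimating the Population Proportion
Assumption: \(n\hat{p} > 5\) and \(n(1-\hat{p}) > 5\)
Formula (did this in midterm):
\[ Z = \frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \]
Confidence Interval of p (new, but simply rearranging letters):
\[ \hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]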
Example
What proportion of male students in Canada have a violent case of ‘Bieber Fever’?
A random sample of 1,350 Laurier students was taken, and 250 of them revealed they had ‘Bieber Fever’. What is the 98% confidence interval for the population proportion?
P (hat) = 250/1350 = .185
Zα/2 = the Z value which has 1% in each tail (2% in total for 98% confidence) = 2.325 from the table
RESULT = 0.185 +/– 2.325 × √( .185 (1 – .185) / 1350 ) = 0.185 +/– 0.0246
98% confidence range = 0.1604 to 0.2096
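Here is a minimal sketch of the proportion interval in Python. It carries p-hat at full precision and takes z from `NormalDist` (about 2.326) instead of the table, so the bounds differ from the hand calculation in the fourth decimal. `proportion_interval` is an illustrative name only.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence):
    """Confidence interval for a population proportion (needs n*p > 5 and n*(1-p) > 5)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 250 of 1,350 sampled students, 98% confidence
low, high = proportion_interval(250, 1350, 0.98)
print(round(low, 3), round(high, 3))
```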
Selecting the sample size
• The difference between the sample mean and the population mean is called the error of estimation.
• You can make sure you stay within a chosen bound on it, by another freaking formula:
\[ n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 \]
E = bound on the error (given in the question)
EXAMPLE
• I want to know how many students I need to interview to find out how many times a Laurier student Facebook-stalks in 1 day. I want to be 95% confident that the estimate is within 2 of the true mean. It turns out the standard deviation of this stat is 5. GO:
n = ? (what we want to find out)
σ = 5
E = 2
Zα/2 = 95% confidence, which is 2.5% in each tail, which is a z value of 1.96
n = (1.96 × 5 / 2)²
n = 24.01 (so need 25 people: always round up)
Determining n when Estimating p
\[ n = \frac{p(1-p)\,z_{\alpha/2}^2}{E^2} \]
1. Use the historical min or max of p (whichever is closer to 0.5), if available.
2. To be safe, use p = 0.5 if p is totally unknown.
What proportion of students in statistics actually open their textbook? To estimate this proportion within 5% and be 95% confident, how large a sample should you take?
Case 1: historically <15% of students ever do. Case 2: NO historical information is available.
Case 1 (p = .15):
n = .15(.85) × 1.96² / .05²
n = 195.92
n = 196 (MUST round up!)
Case 2 (p = .5):
n = .5(.5) × 1.96² / .05²
n = 384.16
n = 385
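Both sample-size formulas are easy to wrap in small helpers. This is a sketch with made-up function names; `ceil` does the "MUST round up" step, and z comes from `NormalDist` instead of the table.

```python
from math import ceil
from statistics import NormalDist

def z_value(confidence):
    """Two-sided z critical value, e.g. about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def n_for_mean(sigma, error, confidence):
    """Sample size to estimate a mean within +/- error."""
    return ceil((z_value(confidence) * sigma / error) ** 2)

def n_for_proportion(p, error, confidence):
    """Sample size to estimate a proportion within +/- error."""
    return ceil(p * (1 - p) * z_value(confidence) ** 2 / error ** 2)

n_mean = n_for_mean(sigma=5, error=2, confidence=0.95)            # Facebook example
n_hist = n_for_proportion(p=0.15, error=0.05, confidence=0.95)    # historical p = .15
n_safe = n_for_proportion(p=0.5, error=0.05, confidence=0.95)     # p totally unknown
print(n_mean, n_hist, n_safe)
```

Note how the "safe" choice p = 0.5 gives the largest required sample, because p(1-p) peaks at 0.5.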
Class 12 & 13: Intro Hypothesis Testing for single populations
Hypothesis Testing
• There are two procedures for making inferences:
– Estimation.
– Hypothesis testing.
• The purpose of hypothesis testing is to determine whether there is enough statistical evidence in favor of a certain belief about a parameter.
Hypothesis Testing
• There are two hypotheses:
– Null Hypothesis (H0)
• Assumed to be true
• Ex. The defendant is innocent
– Alternative (or research) Hypothesis (H1)
• Opposite of H0
• Ex. The defendant is guilty
• NOTE: The null will always state that the parameter equals the value specified in the alternative.
Hypothesis Testing Process
• Step 1: State the Null and Alternative
– Eg: You want to see if the exam average will be greater than 75%.
• H0: μ = 75
• H1: μ > 75
• Step 2: Randomly sample the pop and compute a test statistic (in this case based on the sample mean)
– The procedure begins with the NULL BEING TRUE (and the goal is to see if there is enough evidence to say that the alternative is true).
• Step 3: Make a statement about the hypotheses
– If the test statistic value is inconsistent with the null hypo, we reject the null and conclude the alternative is true.
Hypo Testing Decisions
1. Reject the null in favour of the alternative
– Sufficient evidence to support the alternative
2. Do not reject the null in favour of the alt.
– Does not mean ‘accepting the null’ (just not enough evidence)
– Ex. Not being able to prove that the defendant is guilty does not mean that he is innocent
Hypo Testing Errors
• Two types of errors are possible when making the decision whether to reject H0 (the null hypothesis)
• Type 1 error (probability = alpha): reject a true null hypothesis, i.e. send an innocent man to jail (reject null when null is true!). MOST SERIOUS OF THE TWO!
• Type 2 error (probability = beta): don’t reject a false null hypothesis (go with the safe null assumption; don’t have the balls to reject it!!). Guilty man goes free (null not rejected when null is actually false). It can be calculated (later).
[Figure labels: Our original hypothesis… / our new assumption…]
Hypo Testing Errors
THIS EXAMPLE IS TESTING AVG HYDRO BILLS, with a hypothesized mean of 170.
Sample bills were taken to get x bar, and we are trying to figure out the critical value that starts the rejection region.
2 ways to Test: Rejection Region
• Depending on whether you are looking for <, >, or not equal to, you define the rejection region
• Level of significance = α
Test It: P-value
• The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed given that the null hypothesis is true.
• The smallest value of α for which H0 can be rejected
p-value
QUESTION: Testing hydro bills. The company thinks the average customer’s hydro bill is $170 (with a standard deviation of $65), and they want to test whether bills are actually larger than that. The company sampled 400 customers and found an average of $178. Should you reject or not reject the null hypothesis? (use a 5% significance level)
Z = (178 – 170) / (65/√400) = 8 / 3.25 = 2.46
P-value = P(Z > 2.46) = .0069
.0069 < .05, so reject the null: there is enough evidence that the average bill is larger than $170.
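To see where Z = 2.46 and the p-value of .0069 come from, here is a small sketch of the upper-tail z-test. The helper name `z_test_upper` is invented for this note, and `NormalDist` replaces the z-table.

```python
from math import sqrt
from statistics import NormalDist

def z_test_upper(xbar, mu0, sigma, n):
    """Upper-tail z-test (H1: mu > mu0) with a known population sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_value = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_value

# Hydro example: H0 mu = 170, sigma = 65, n = 400, x-bar = 178
z, p_value = z_test_upper(xbar=178, mu0=170, sigma=65, n=400)
print(round(z, 2), round(p_value, 4))
```

Reject the null whenever the p-value is below the significance level you were given.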
Type II Error Example
Example:
• H0: µ = 170
• H1: µ > 170
• At a significance level of 5% we rejected H0 in favor of H1 since our sample mean (178) was greater than the critical value (175.34).
If they want you to calculate a Type 2 error, the question will have to give you the new mean to test ($180 mean).
• β = P( x̄ < 175.34, given that µ = 180), thus…
Our original hypothesis…
our new assumption…
Chance we send a guilty man free
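The β calculation set up above can be sketched in Python. This assumes the same hydro numbers (σ = 65, n = 400, α = 0.05) and the alternative mean of 180; `type2_error` is a hypothetical helper, not something from the slides.

```python
from math import sqrt
from statistics import NormalDist

def type2_error(mu0, mu_alt, sigma, n, alpha):
    """Beta for an upper-tail z-test: P(x-bar lands below the critical value | mu = mu_alt)."""
    se = sigma / sqrt(n)
    x_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # start of the rejection region
    beta = NormalDist(mu_alt, se).cdf(x_crit)            # area left of x_crit under the new curve
    return x_crit, beta

x_crit, beta = type2_error(mu0=170, mu_alt=180, sigma=65, n=400, alpha=0.05)
print(round(x_crit, 2), round(beta, 3))  # beta comes out at roughly 0.076
```

Rerunning it with a larger n shows β shrinking, which is the point of increasing the sample size.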
Want a smaller β without changing your confidence requirement (α)?
INCREASE THE SAMPLE SIZE!
Estimating MEAN with T-stat (refresher)
Use the t-dist instead of the normal, and the sample stdev instead of the population stdev.
QUESTION: Tiger Woods is rumoured to pay his ‘girls’ $1 million per year to stay quiet. A random sample of 7 of them was taken; the mean was $800,000 with a stdev of $100,000. Find the 95% interval estimate of the population mean. (assume pop is normal)
Degrees of freedom = n – 1 = 6
= 800K +/- t(.025, 6) × (100K/√7)
= 800K +/- 2.447(37,796)
= 800,000 +/- 92,487, so the RANGE is between $707,513 and $892,487
Hypo testing with T-stat
• T-statistic: Same as the z-stat, just using the sample stdev in place of the population stdev!
QUESTION: SO, can we conclude at the 5% significance level that the mean that Cheetah pays his girls is not $1,000,000?
H0: μ = 1 million
H1: μ ≠ 1 million (two-tail test)
T = (800,000 – 1,000,000) / (100,000 / √7)
T = -200,000 / 37,796 = -5.29
T-critical with 2.5% in each tail (6 df) is at -/+ 2.447.
-5.29 is definitely past -2.447, so reject the null: the mean isn’t 1 million!
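The rejection-region check for this t-test is short enough to sketch. The critical value 2.447 is typed in from the t-table (6 df, 2.5% in each tail); `t_statistic` is a made-up name.

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """One-sample t statistic using the sample standard deviation."""
    return (xbar - mu0) / (s / sqrt(n))

t = t_statistic(xbar=800_000, mu0=1_000_000, s=100_000, n=7)
t_crit = 2.447                  # table value, assumed for 6 df and alpha = .05 two-tail
reject_null = abs(t) > t_crit   # two-tail test: compare |t| with the critical value
print(round(t, 2), reject_null)
```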
Third type: proportion
\[ z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \]
Assumption: np > 5, n(1-p) > 5
The high school student council believes that 11% of its students will come to the school dance wasted, and they want to test their belief. In a sample of 200 students, 28 indicated that they will, in fact, be tanked before they arrive. Use a probability of a Type I error of 0.10.
H0: p = .11 will be drunk
H1: p ≠ .11 will be drunk (have to use a 2-tail test: we don’t know which way they are testing)
P hat = 28/200 or .14
Z = (.14 – .11) / √( .11(.89) / 200 )
Z = .03 / √.0004895
Z = .03 / .0221
Z = 1.36
Is 1.36 far enough? On the z-table, we need to find the tail area for 1.36 and compare it to .05 (half of the significance level of .10)
Z = 1.36 has .0869 in the right tail.
BUT this is a two-tail test, so the tail area needs to be < .05 to reject. .0869 > .05, so we cannot reject the null in favour of the alternative.
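A sketch of the same proportion test in Python, reporting the two-tail p-value directly (so you compare it with the full 0.10 instead of comparing one tail with 0.05). `proportion_z_test` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Two-tail z-test for a proportion; the standard error uses the null value p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tail

# School dance example: 28 of 200, H0 p = .11, alpha = 0.10
z, p_two_tail = proportion_z_test(successes=28, n=200, p0=0.11)
print(round(z, 2), round(p_two_tail, 3))
```

The two-tail p-value is above 0.10, which matches the "cannot reject" conclusion.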
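RECAP!!
With 1 population, we talked about:
• Mean (known σ): standard z-stat, \( Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \)
• Mean (unknown σ): T-statistic, \( t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \)
• Proportion: z-stat for prop., \( z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \)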
Lectures 14 & 15: Inference about comparing Two Populations
Inference about comparing two populations
• With two populations, we can be:
– Comparing the means
– Comparing paired observations
– Comparing proportions
COMPARING TWO MEANS
• Similar to dealing with 1 mean, now we are looking at the difference of two pop means:
1. If you know the stdevs?? Plug-and-play:
Confidence Formula:
\[ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
A random sample of 32 business students from Laurier is asked how often they party so hard they don’t remember what happened the previous night. A similar random sample is taken of 34 science students. The results and the population SDs are given below.
Q. Is there enough evidence to say that they differ at a 5% significance level?
Business: n1 = 32, x̄1 = 70.700, σ1 = 16.253
Science: n2 = 34, x̄2 = 62.187, σ2 = 12.900
Question Example
• You have st devs, so you can use this formula:
a) z = [(70.7 – 62.187) – 0] / √(16.253²/32 + 12.9²/34)
z = 8.513 / 3.626
z = 2.35 (what tail area is that on the z table?)
Tail area = .0094
(yes, less than the 2.5% allowed in each tail, so they DO DIFFER!!!)
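The two-mean z-test above can be checked with a short sketch. The function name `two_sample_z` is made up, and the p-value reported is the two-tail one, to be compared with the full 5%.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, sigma1, n1, x2, sigma2, n2):
    """z-test for the difference of two means when both population SDs are known."""
    se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = (x1 - x2) / se               # hypothesized difference is 0
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tail

# Business vs science students
z, p_two_tail = two_sample_z(70.700, 16.253, 32, 62.187, 12.900, 34)
print(round(z, 2), round(p_two_tail, 3))
```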
Two means, unknown variances!
• MOST times you don’t know the stdevs, so it’s not so straightforward. You need the t-test, and there are 2 cases:
– When they have equal variances
– When they have unequal variances.
Type 1. Assumed equal variances
• With equal variances, you can use that property to pool the two sample variances, weighted by their degrees of freedom. Assume the pops are relatively normal. This is a T-statistic.
First get a combined (pooled) variance estimate:
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]
Use it to find your t-value (df = n1 + n2 – 2):
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
Population 1: Average amount (per present) for those who buy presents at Walmart
Population 2: Average amount (per present) for those who buy presents at Value Village
Assume both populations are Normal.
n1 = 13, n2 = 15
x̄1 = 9.95, x̄2 = 8.99
s1 = 1.20, s2 = 1.42
Example
Is there enough evidence to conclude that people at Walmart spend more than those at Value Village? α = 0.10
H0: μ1 = μ2
H1: μ1 > μ2 (or μ1 – μ2 > 0): 1-tail test!!!
Step 1: Get Pooled Variance
Sp² = [(13–1)(1.20²) + (15–1)(1.42²)] / (13+15–2)
Sp² = 45.51 / 26 = 1.75
Step 2: Get the T statistic
t = [(9.95 – 8.99) – 0] / [√1.75 × √(1/13 + 1/15)]
t = .96 / .50128 = 1.915, with 13+15–2 = 26 df
Now need to check against the critical value at .10 significance: t(.10, 26) = 1.315
YES! 1.915 > 1.315, so reject the null in favour of the alternative that Walmart shoppers spend more.
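The two steps (pooled variance, then t) can be sketched as one helper. `pooled_t` is a name invented here; the critical value 1.315 is the table value quoted in the example.

```python
from math import sqrt

def pooled_t(x1, s1, n1, x2, s2, n2):
    """Equal-variances t-test: returns pooled variance, standard error, t and df."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df  # Step 1: pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (x1 - x2) / se                                     # Step 2: hypothesized difference is 0
    return sp2, se, t, df

# Walmart (group 1) vs Value Village (group 2)
sp2, se, t, df = pooled_t(9.95, 1.20, 13, 8.99, 1.42, 15)
reject_null = t > 1.315  # one-tail table value for alpha = .10, 26 df
print(round(sp2, 2), round(t, 3), df, reject_null)
```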
Construct a 90% confidence interval on the difference between the average spend at Walmart vs The Village.
\[ (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]
We need t(.05, 13+15–2) now. Back to the table!!
t(.05, 26) = 1.706
Result = .96 +/- 1.706 × 0.50128 (the standard error from the last example) = .96 +/- 0.855
The difference (Walmart – Value Village) is between $0.105 and $1.815
Type 2. Unequal variances
• With UNEQUAL variances, you can’t pool: each sample keeps its own variance in the t-statistic, and the degrees of freedom come from an intense formula. NEED TO CHAT ABOUT: what do you guys need to know about these formulas??
Use it to find your t-value:
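The slide does not show the unequal-variances formulas, so the sketch below uses the standard textbook (Welch) form as an assumption: t = (x̄1 – x̄2) / √(s1²/n1 + s2²/n2), with df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1–1) + (s2²/n2)²/(n2–1) ]. Check it against the formula sheet you are actually given. The Walmart numbers are reused purely for illustration.

```python
from math import sqrt

def welch_t(x1, s1, n1, x2, s2, n2):
    """Unequal-variances t statistic and its (usually non-integer) degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (x1 - x2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

t, df = welch_t(9.95, 1.20, 13, 8.99, 1.42, 15)
print(round(t, 3), round(df, 2))
```

When the two sample variances are similar, as here, the result lands very close to the pooled test.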
Paired Observation
Matched Pairs Experiment (t-test and estimator of μD)
• If you can find a way to pair the observations across the two samples, then you can use this method. Just because the samples have the same number of observations doesn’t mean they are matched. Even if they are ordered, they NEED to be matched on another variable (GPA buckets, same people on different dates, etc.).
• We are actually making inferences about the mean difference between matched pairs of the two populations: μD = μ1 – μ2
• Most common hypotheses:
– H0: μD = 0
– Ha: μD (<, >, ≠) 0
Matched Pairs Test: Mean Difference Between Two Dependent Samples
• Additional Assumption: the differences between matching pairs of the two populations are Normal.
Formula for t-test (df = n – 1):
\[ t = \frac{\bar{d} - \mu_D}{s_d/\sqrt{n}} \]
n = number of pairs
\(\bar{d}\) = mean sample difference
\(\mu_D\) = mean population difference
\(s_d\) = SD of sample differences
100(1 – α)% CI of μD:
\[ \bar{d} \pm t_{\alpha/2,\,n-1}\frac{s_d}{\sqrt{n}} \]
Example
• Your study group of 8 people is crazy competitive, and they decide to go to the SOS Exam-AID to see if the session helps their average marks. Below are the results:
Person    Midterm   Final (with Exam-AID!)   Difference
1         65        68                       3
2         72        72                       0
3         80        81                       1
4         75        85                       10
5         87        92                       5
6         69        67                       -2
7         70        72                       2
8         81        86                       5
Average   74.875    77.875                   3
SD of the differences = 3.70328
Is there enough evidence to conclude that their marks went up from the Exam-AID? Use α = 0.01
H0: μD = 0
H1: μD > 0
t = (77.875 – 74.875) / (3.703 / √8)
t = 3 / 1.309 = 2.291
Is it enough?? TO THE TABLE!!!
t(.01, 7) is 2.998, and this example is 2.291… not far enough to be 99% confident that the Exam-AID helped.
• What about a confidence interval for 95% confidence?
• t(.025, 7) = 2.365
Interval = 3 +/- 2.365 × (3.703 / √8) = 3 +/- 3.10
With 95% confidence, their post-Exam-AID marks will be between -0.10 and 6.10 higher.
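The matched-pairs numbers can be reproduced from the raw marks. This is a sketch using only the standard library; the t-table value 2.365 (7 df, 95% confidence) is typed in by hand.

```python
from math import sqrt
from statistics import mean, stdev

midterm = [65, 72, 80, 75, 87, 69, 70, 81]
final = [68, 72, 81, 85, 92, 67, 72, 86]

# Work on the pairwise differences (final minus midterm), not the two columns separately
diffs = [f - m for m, f in zip(midterm, final)]
n = len(diffs)
d_bar = mean(diffs)   # mean sample difference
s_d = stdev(diffs)    # sample SD of the differences (divides by n - 1)

t = (d_bar - 0) / (s_d / sqrt(n))   # H0: mean difference = 0

t_crit_95 = 2.365                   # table value, assumed for 7 df
margin = t_crit_95 * s_d / sqrt(n)
ci = (d_bar - margin, d_bar + margin)
print(d_bar, round(s_d, 5), round(t, 3), ci)
```

The interval includes 0, which fits with the test not being able to confirm that marks went up.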
Chapter 13 (inference about comparing two populations)
• With two populations, we can be:
– Comparing the variances
– Comparing the means
– Comparing paired observations
– Comparing proportions
Examples of two proportions
• Comparing the market share of a product in two different markets
• Studying the proportion of female customers in two different geographic areas such as Quebec and Ontario.
• Comparing the proportion of defective products from one period to another
3) Inference about the difference between population proportions (with nominal data)
– Using nominal data, so win/lose categories.
– Z statistic
– Same restriction that n·p and n·(1–p) > 5 (but now for both populations)
• If over 5, then p̂1 – p̂2 is approximately NORMAL; use these formulas:
Formulas!!!!
\[ \hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}, \qquad z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
A random sample survey of 300 stats students reveals that 120 won’t study for this exam more than 1 day before it. A sample survey of 250 accounting students revealed that 90 of them won’t study more than a day before it. Is the proportion of stats procrastinators higher? Use α = 0.01.
H0: p1 = p2
H1: p1 > p2
P1 hat = 120/300 = .4
P2 hat = 90/250 = .36
Pooled P hat = (300 × .4 + 250 × .36) / 550 = .3818
Z = (.4 – .36) / √( .3818 × .6182 × (1/300 + 1/250) )
Z = .04 / .0416
Z = .9615
TO THE TABLE!! What is the p value?
Z of .96 has .1685 (.5 – .3315) in the tail, which is MUCH greater than .01. So there is not enough evidence to reject the null: we can’t conclude that Stats students are bigger procrastinators.
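A sketch of the pooled two-proportion test, working from the raw counts. `two_proportion_z` is a made-up helper name and the p-value is the upper-tail one, matching the "is stats higher?" question.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Upper-tail z-test for p1 > p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_upper = 1 - NormalDist().cdf(z)
    return z, p_upper

# 120 of 300 stats students vs 90 of 250 accounting students
z, p_upper = two_proportion_z(120, 300, 90, 250)
print(round(z, 3), round(p_upper, 3))
```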
RECAP of 2 populations!!
• We looked at TWO populations now:
– Population Means (with stdev): z-stat
– Population Means (no stdev): t-stat (equal or unequal variances)
– Matched Pairs: t-stat on the differences
– Two proportions: z-stat
Lecture 16: Analysis of Variance
Analysis of Variance (ANOVA)
• Comparing 2 or more populations of INTERVAL data
• Determine whether differences exist between population means
– Done by analyzing sample variance, with the ANOVA technique:
• Single Factor (or 1-way): For populations that differ on only 1 factor, you use the ANOVA: Single Factor. This is like comparing sales from 3 cities with the factor being the marketing strategy.
One-Factor ANOVA
H0: μ1 = μ2 = μ3 = … = μk
Ha: Not all μi are the same
• All means are the same: the null hypothesis is not rejected (μ1 = μ2 = μ3)
• At least one mean is different: the null hypothesis is rejected (e.g. μ1 = μ2 ≠ μ3, or μ1 ≠ μ2 ≠ μ3)
One-Factor ANOVA
Partitioning the Variation
Total Variation = the aggregate dispersion of the individual data values across the various populations
Within-Sample Variation (SSE) = dispersion that exists among the data values within a particular population
Between-Sample Variation (SSC) = dispersion among the sample means (sometimes referred to as SST)
[Figure: response values for groups 1, 2 and 3, showing the group means x̄1, x̄2, x̄3 and the grand mean]
RECAP: Total Variation = Between Group Variation (SSC) + Within Group Variation (SSE)
Total Sum of Squares:
\[ SS(\text{Total}) = \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij} - \bar{\bar{x}}\right)^2 = SSC + SSE \]
Where:
k = number of populations
ni = sample size from population i
xij = jth measurement from population i
\(\bar{\bar{x}}\) = grand mean (mean of all data values)
One-Way ANOVA Table
Source of Variation   df       SS        MS                     F ratio
Between Samples       C - 1    SST       MST = SST / (C - 1)    F = MST / MSE
Within Samples        nT - C   SSE       MSE = SSE / (nT - C)
Total                 nT - 1   SST+SSE
C = number of populations
nT = sum of the sample sizes from all populations
df = degrees of freedom
One-Factor ANOVA F-Test Statistic
H0: μ1 = μ2 = … = μk
Ha: At least two population means are different
• Test statistic: F = MST / MSE
– MST is the mean square between samples
– MSE is the mean square within samples
• Degrees of freedom
– dfC = k – 1 (k = number of populations)
– dfE = nT – k (nT = sum of sample sizes from all populations)
Single Factor:
- Comparing 3 independent populations, with the factor being marketing strategy.
Q. Is there enough evidence to support that the sales of this product differ?
Anova: Single Factor
SUMMARY
Groups        Count   Sum     Average   Variance
Convenience   20      11551   577.6     10775.0
Quality       20      13060   653.0     7238.1
Price         20      12173   608.7     8670.2
ANOVA
Source of Variation   SS         df   MS        F      P-value   F crit
Between Groups        57512.23   2    28756.1   3.23   0.0468    3.16
Within Groups         506983.5   57   8894.4
Total                 564495.7   59
F = 3.23 is past F crit = 3.16 (and the P-value of 0.0468 is under 0.05), so reject the null. All this says is that at least 2 of the means differ!
Analysis of Variance
Example?
• Want to see if there is any difference between the speed of shot for three top brands of hockey sticks.
Bauer (mph): 50, 55, 49 (avg = 51.33)
Nike (mph): 58, 53, 57 (avg = 56)
Easton (mph): 48, 52, 55 (avg = 51.67)
BIG average = 53
SOLVE!
Between groups SS = 3(51.33–53)² + 3(56–53)² + 3(51.67–53)²
= 40.667
Within groups SS = (50–51.33)² + (55–51.33)² + (49–51.33)² + (58–56)² + (53–56)² + (57–56)² + (48–51.67)² + (52–51.67)² + (55–51.67)²
= 59.333
Between groups df = number of manufacturers – 1 = 2
Within groups df = number of all trials – number of manufacturers = 6
MST = SST / df = 40.6667 / 2 = 20.333
MSE = SSE / df = 59.3333 / 6 = 9.8889
F = MST / MSE = 20.333 / 9.8889 = 2.056
Anova: Single Factor
SUMMARY
Groups   Count   Sum   Average    Variance
Bauer    3       154   51.33333   10.33333
Nike     3       168   56         7
Easton   3       155   51.66667   12.33333
ANOVA
Source of Variation   SS         df   MS       F       P-value   F crit
Between Groups        40.66667   2    20.333   2.056   0.20888   5.143253
Within Groups         59.33333   6    9.8889
P value is .21. If the significance level is .05, then you would not reject the null: there is not enough evidence that the three sticks differ.
[Figure: F distribution with the F-value at 2.056 and the critical value at 5.14]
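The sums of squares and the F ratio for the hockey stick example can be rebuilt from the raw shot speeds. This is a sketch: `one_way_anova` is a made-up name, and the p-value and F crit still come from Excel or a table, since the standard library has no F-distribution.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between: each group mean vs the grand mean, weighted by group size
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within: each value vs its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio

bauer, nike, easton = [50, 55, 49], [58, 53, 57], [48, 52, 55]
ss_b, ss_w, df_b, df_w, f_ratio = one_way_anova([bauer, nike, easton])
print(round(ss_b, 3), round(ss_w, 3), df_b, df_w, round(f_ratio, 3))
```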
Populations must be normal with equal variances:
- If nonnormal, replace the test with the Kruskal-Wallis test (making the numbers ordinal): not covered
- If unequal variances: we CANNOT DO!
NOTES / THOUGHTS:
- Why do we need it? (With 4 pops there are LOTS of pairs of means to compare, 6 pairs in fact, so the potential for a type 1 error across all those t-tests is huge.) But you still need the t-test for 2 pops because ANOVA only says that the means ‘differ’ (not < or >).
Lecture 17 & 18: Correlation and Simple Regression Analysis
Linear Regression
REGRESSION:
USED TO: analyze the relationship between interval variables.
• Is there a linear relationship between one variable (the dependent variable) and other variables (the independent variables)?
• Predict the dependent variable based on the independent ones.
Least Squares Method
• Least Squares Method
– The objective of the scatter diagram is to measure the strength and direction of the linear relationship
– Both can be more easily judged by drawing a straight line through the data.
– How to draw that line? LSM!
• This line has the smallest sum of squared vertical distances to all the points on the plot.
• LSM: It creates the line ŷ = b0 + b1x. You calculate b1, then for b0 sub in the mean values of x and y, solve for b0 and rewrite.
Least Squares Method
\[ b_1 = \frac{SS_{XY}}{SS_{XX}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1\bar{X} \]
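The slope-then-intercept recipe can be sketched as follows. The data set here is invented purely for illustration (points lying exactly on y = 1 + 2x), and `least_squares` is a hypothetical helper name.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx        # slope first
    b0 = y_bar - b1 * x_bar   # then sub in the means to get the intercept
    return b0, b1

# Made-up data on the exact line y = 1 + 2x, so the answer is easy to eyeball
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = least_squares(xs, ys)
print(b0, b1)
```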
Linear Regression and Correlation
• Eg: resale value of a car with x miles on the odometer
METHOD 1: Standard Error of Estimate
SSE = sum of the squared errors (residuals) over all the points; the standard error of estimate is \( s_e = \sqrt{SSE/(n-2)} \). It tells you how big the typical error around the line is. Judge it relative to the mean of y: 0.32 vs a sample mean of 15 ($15,000 for a car on average) is small.
• Eg: resale value of a car with x miles on the odometer
METHOD 2: TEST SLOPE (part 1)
If we want to see if there is a relationship (if no relationship, slope = horizontal = 0):
H0: β = 0
H1: β ≠ 0
Can reject if you know the t-critical, or can just look at the p-value:
P-values in the regression output ASSUME a two-tail test!! So with this, compare 0.000 with 0.05 or whatever significance you are given!
• Eg: resale value of a car with x miles on the odometer
METHOD 2: TEST SLOPE (part 2)
If we want to see if it has a negative relationship (eg: slope < 0):
H0: β = 0
H1: β < 0
Can reject if you know the t-critical, or can just look at the p-value:
SINCE the p-value assumes two tails, you need to divide the p-value by 2!! So compare 0.000/2 with .05. This one is easy, but if it was 0.06 you’d have to think: 0.06/2 = 0.03, which is below .05. And if you are testing for > when the coefficient is negative, the one-tail p-value is 1 – (p-value/2).
• Eg: resale value of a car with x miles on the odometer
METHOD 3: Coefficient of Correlation
If you want to see if there is a linear relationship between the variables. NOTE: r is between -1 and +1
H0: ρ = 0
H1: ρ ≠ 0
0.8052 in magnitude = pretty good relationship! (The slope is negative, so the relationship is negative: r = -0.8052.)
BUT… how much can be explained by this model???
• Eg: resale value of a car with x miles on the odometer
METHOD 3.5: Coefficient of Determination
Remember: when we square r, we get the coefficient of determination (how much of the variation in y is explained by the independent variable). If 1: no error, and all variation is due to the indep variable. If 0: no linear relationship between the variables, and it is all error.
Here, 0.8052² = 0.648, so about 65% of the variation in the price is explained by the mileage!!
Confidence Interval
\[ b_1 \pm t_{\alpha/2,\,n-2}\; s_{b_1} \]
What if they ask you for a 95% confidence interval for the slope?
Use the above formula:
b0 = 17.25 (intercept)
b1 = -0.0669 (slope)
Sb1 = .005 (Standard Error of the slope)
n = 15 prices were sampled.
tα/2,n-2 = t.025,13 = 2.160 (to the table!!)
Result = -0.0669 +/- 2.160 × .005 = -0.0669 +/- 0.0108
RANGE: -0.0777 to -0.0561
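As a quick arithmetic check of the slope interval, here is a tiny sketch. The slope, its standard error and the table value t(.025, 13) = 2.160 are the ones quoted in this example; `slope_interval` is a made-up name.

```python
def slope_interval(b1, se_b1, t_crit):
    """Confidence interval for the regression slope: b1 +/- t * s_b1."""
    margin = t_crit * se_b1
    return b1 - margin, b1 + margin

low, high = slope_interval(b1=-0.0669, se_b1=0.005, t_crit=2.160)
slope_is_significant = not (low <= 0 <= high)  # an interval that excludes 0 backs up the slope test
print(round(low, 4), round(high, 4), slope_is_significant)
```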
With regression done, what can you do?
Once the regression line is confirmed to be valid, it is suitable as an estimation and prediction tool:
1. Point estimate of y for xo
2. Prediction interval of individual y for xo
3. Confidence interval of average y for xo
POINT estimate
A point estimate is used to find an estimate of y for ONE particular value of x.
Using the sample, simple regression analysis gives \(\hat{Y} = 1.57 + 0.0407X\)
Q. What is the point estimate of the cost for 73 passengers?
1.57 + 0.0407(73) = 4.5411
Linear Regression and Correlation
Prediction Interval VS Confidence Interval??
• Do you want the value of an individual item (prediction interval) or the expected value of the mean of a population (confidence interval estimate)?
• The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.
Construct the 95% confidence interval for the mean cost for a flight with 73 passengers. Xbar=77.5, SSxx=1689, Se=0.177.
(assume this regression was done with 20 samples)
Confidence Interval for average y given xo
$\hat{Y} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{X})^2}{SS_{xx}}}$
What’s Ŷ? We found it in the last question: 4.5411
What’s t a/2,n-2? t .025,18 = 2.10
Result? 4.5411 +/- 2.10*0.177* √ [ 1/20 + (73-77.5)^2 / 1689 ]
4.5411 +/- 0.0925, or a 95% confidence interval of 4.448 to 4.633
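A hedged Python sketch of the same arithmetic, handy for checking the square-root step. The function name `mean_confidence_interval` is hypothetical, and the t value 2.10 is taken from the table as on the slide, not computed.

```python
import math

def mean_confidence_interval(y_hat, t_crit, s_e, n, x0, x_bar, ss_xx):
    """Interval for the average y at x0:
    y_hat +/- t * s_e * sqrt(1/n + (x0 - x_bar)^2 / SSxx)."""
    margin = t_crit * s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_xx)
    return y_hat - margin, y_hat + margin

# Inputs from the question: 20 samples, 73 passengers, Xbar = 77.5, SSxx = 1689.
ci_low, ci_high = mean_confidence_interval(
    y_hat=4.5411, t_crit=2.10, s_e=0.177, n=20, x0=73, x_bar=77.5, ss_xx=1689
)
print(f"{ci_low:.4f} to {ci_high:.4f}")
```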
Construct the 95% prediction interval for the cost of a particular flight with 73 passengers. Xbar=77.5, SSxx=1689, Se=0.177. (assume this regression was done with 20 samples)
PREDICTION Interval for individual y given xo
Only difference? The 1+ in the square root..
Result? 4.5411 +/- 2.10*0.177* √ [ 1 + 1/20 + (73-77.5)^2 / 1689 ]
4.5411 +/- 0.383, so the 95% PREDICTION Interval for the cost of 1 flight is between 4.158 and 4.924
$\hat{Y} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{SS_{xx}}}$
NOTE… the range for the Confidence Interval was much smaller: 4.448 to 4.633
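To see the effect of the extra "1 +" side by side, here is a minimal Python sketch that computes both intervals from the same inputs. The function name `interval_at_x0` and its `individual` flag are assumptions made for illustration; the t value is again read from the table.

```python
import math

def interval_at_x0(y_hat, t_crit, s_e, n, x0, x_bar, ss_xx, individual):
    """Prediction interval if individual is True, else confidence interval for the mean."""
    extra = 1.0 if individual else 0.0  # the "1 +" term under the square root
    margin = t_crit * s_e * math.sqrt(extra + 1 / n + (x0 - x_bar) ** 2 / ss_xx)
    return y_hat - margin, y_hat + margin

inputs = dict(y_hat=4.5411, t_crit=2.10, s_e=0.177, n=20, x0=73, x_bar=77.5, ss_xx=1689)
pi_low, pi_high = interval_at_x0(**inputs, individual=True)
mean_low, mean_high = interval_at_x0(**inputs, individual=False)
print(f"prediction: {pi_low:.3f} to {pi_high:.3f}")
print(f"confidence: {mean_low:.3f} to {mean_high:.3f}")
```

The prediction interval always comes out wider, because one individual flight varies more than the average of all flights with 73 passengers.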
Regression Diagnostics
• Three conditions required in order to perform a regression analysis. (all on upcoming slides)
1. Error variable must be normally distributed
2. Error variable must have a constant variance
– Heteroscedasticity: when the error variable does not have a constant variance
3. Errors must be independent of each other
– Can’t be correlated (Time series data usually have errors that are correlated)
• How to diagnose? – Residual Analysis. (look at the difference between the actual and predicted results)
Normality…
We can take the residuals and put them into a histogram to
visually check for normality…
…we’re looking for a bell shaped histogram with the mean close to zero.
Constant Variance…
When the requirement of a constant variance is violated, we
have a condition of heteroscedasticity.
Independence
When the requirement of independence is violated, there may be a trend in residuals.
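Residual analysis starts with computing the residuals themselves. The sketch below assumes a tiny made-up data set (the `xs` and `ys` values are hypothetical, not from the course), fits the least-squares line by hand, and lists actual minus predicted for each point. You would then put these into a histogram or plot them against the predicted values or time.

```python
import statistics

# Hypothetical (x, y) observations, made up purely for illustration.
xs = [60, 65, 70, 75, 80, 85]
ys = [4.0, 4.3, 4.4, 4.6, 4.9, 5.0]

# Least-squares slope and intercept computed from the sample.
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Residual = actual y minus predicted y.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print([round(e, 3) for e in residuals])
```

With a least-squares fit the residuals always average out to zero, so the interesting part is their shape (bell curve?), their spread (constant?) and any pattern over time.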
Lecture 19 & 20: Multiple Regression
Multiple Regression (multiple variables, all first-order)
Many different variables that go into these types.
Eg: A Hotel is looking to expand – but they don’t know where in this 1 particular city. Should it be close to the airport? Close to high income housing? To the university?
To find out, sample hotels in the area, collect their details on the factors you think affect profitability, and then RUN EXCEL to see which factors actually affect profit!!
Required Conditions…
For these regression methods to be valid the following four conditions for the error variable must be met:
1. The distribution of the error variable is normal. (draw histogram of errors)
2. The mean of the error variable is 0. (calculate error variable)
3. The standard deviation of error is constant. (plot residuals vs Y to see if constant)
4. The errors are independent. (plot residuals vs time periods to see if connected)
Eg: Hotel profit margin – based on 6 factors.
ANOVA (used most for multiple reg.)
The printout gives us the overall quality of this model – Significance F! Could compare F to F-critical, or look at Significance F and compare it to the significance level! 0.00 < 0.05, so the model is valid. If it were greater than 0.05, we could not conclude that any of the factors has a linear relationship with profit margin.
R Square / Adjusted R Square:
R square = coefficient of determination.
Adjusted R square = coefficient of determination adjusted for the number of variables. Use it if there is MORE THAN 1 VARIABLE! Use this one to determine how much of the variation is explained by the variables!!
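Excel reports adjusted R square for you, but the standard adjustment is simple enough to sketch in Python. The function name and the inputs below are hypothetical (they are not taken from the hotel printout); n is the sample size and k the number of independent variables.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical inputs, not taken from the hotel printout.
adj = adjusted_r_squared(r_squared=0.50, n=100, k=6)
print(round(adj, 4))
```

The adjusted value is always a bit lower than plain R square, and the gap grows as you add more variables to a small sample.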
Variable Relationships?:
Look at the P-values. If a variable's P-value is < the significance level (using 0.05), then it is good (reject the null saying that there is no relationship). The others can be removed.
Interpret the Coefficients!
+ means that the profit margin goes up as that variable increases! - means that the profit margin goes down.
Example Q
• You can predict the outcome if they built where:
– There are 3815 rooms within 3 miles of the site.
– The closest other hotel or motel is .9 miles away.
– The amount of office space is 476,000 square feet.
– Census data indicates the median household income in the area (rounded to the nearest thousand) is $35,000
Multicollinearity
• When 2 or more of your variables are not just related to the dependent variable (eg: profit margin), but are also correlated with each other (so if distance to university goes down, then income of houses goes up/down). There will always be some of this, but a strong correlation = multicollinearity.
• WHAT DOES THIS MEAN?: The overall model can still be tested for a relationship (significance F ok!), but you cannot tell which individual variables are related (the individual t-tests are not reliable!!)
• FIX?: You can build the model one variable at a time, or delete some.. Not needed for final.
• Indicator Variables (or dummy var)
– Either 0 or 1
– If 3 different colours of cars (and colour may affect price of cars) then you need 2 dummy variables (1 less than the number of options)
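A quick Python sketch of the "1 less than the number of options" rule. The function name `dummy_encode` and the three colour names are hypothetical; the first colour in the list is treated as the baseline and gets all zeros.

```python
def dummy_encode(value, categories):
    """Encode value as len(categories) - 1 zero/one indicators.

    The first category is the baseline and maps to all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return [1 if value == cat else 0 for cat in categories[1:]]

colours = ["white", "silver", "other"]  # hypothetical colour options
for colour in colours:
    print(colour, dummy_encode(colour, colours))
```

Three colours need only two columns because the third is implied: if both dummies are 0, the car must be the baseline colour.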
Bayes Theorem
• Start with your initial or prior probabilities.
• You get new info.
• So now with new info, you calculate revised or posterior probabilities
• This process is Bayes Theorem
Bayes Theorem
• Bayes’ theorem is applicable when the events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space
Conditional Probability:
$P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{P(B)}$, where $P(B) = P(A_1)\,P(B \mid A_1) + \dots + P(A_k)\,P(B \mid A_k)$
KEY DIFFERENCE: On the bottom, you are now adding up all the partitions that contain B, since you have them all split up.
Bayes Theorem
• Example:
– Two printer cartridge companies, Alamo and Jersey.
– Alamo makes 65% of the cartridges
– Jersey makes 35%.
– Alamo has a defective rate of 8%
– Jersey has a defective rate of 12%
a) Customer purchases a cartridge. What is the probability that Alamo made it?
– The cartridge is tested, and it is defective.
b) What is the probability that Alamo made the cartridge?
c) What is the probability that Jersey made the cartridge?
ANSWER
• The knowledge of the producer breakdown is the prior probability:
– Alamo = 65% P(E1)
– Jersey = 35% P(E2)
• We know the conditional probabilities of the defective rates:
– Alamo = 8% P(D|E1)
– Jersey = 12% P(D|E2)
ANSWER 1: TABLE
                  Prior    Conditional    Joint    Posterior
Alamo             .65      .08            .052     .052/.094 = .553
Jersey            .35      .12            .042     .042/.094 = .447
Total defective                           .094     1.000
Joint = chance of getting a cartridge that is both Alamo’s and defective if you bought one at random at Futureshop.
Given that you got a defective cartridge: since there is a 9.4% chance of getting a defective one, and 5.2 of those 9.4 percentage points are Alamo’s, there is a 55.3% chance of it being Alamo’s!
ANSWER 2: TREE
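Alamo (.65)
– Defective (.08): .65 × .08 = .052
– Acceptable (.92): .65 × .92 = .598
Jersey (.35)
– Defective (.12): .35 × .12 = .042
– Acceptable (.88): .35 × .88 = .308
Total defective: .052 + .042 = .094
Revised Probability: Alamo = .052 / .094 = .553
Revised Probability: Jersey = .042 / .094 = .447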
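The table and the tree both do the same three steps: multiply prior by conditional, add up the joints, then divide. A minimal Python sketch of those steps is below; the function name `posteriors` and the dictionary layout are assumptions for illustration, while the numbers are the ones from the cartridge example.

```python
def posteriors(priors, likelihoods):
    """Return (P(B), {event: P(event | B)}) from priors and P(B | event)."""
    joints = {e: priors[e] * likelihoods[e] for e in priors}
    total = sum(joints.values())  # P(B): add up every partition's joint probability
    return total, {e: joints[e] / total for e in joints}

priors = {"Alamo": 0.65, "Jersey": 0.35}
defect_rates = {"Alamo": 0.08, "Jersey": 0.12}
p_defective, revised = posteriors(priors, defect_rates)
print(round(p_defective, 3), {e: round(p, 3) for e, p in revised.items()})
```

The revised probabilities must add to 1, which is a quick sanity check on any Bayes answer.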
REMINDERS!
• Financial Accounting Exam-AID Wednesday.
• STATS sessions again Thursday and Friday.
– TELL YOUR FRIENDS!!!
• Interested in going on an outreach trip? SOS is running at least 2 trips (May & August), and other trips nationally. E-mail: [email protected]