Lecture 3 HSPM J716. Efficiency in an estimator Efficiency = low bias and low variance Unbiased with high variance – not very useful Biased with low variance

Lecture 3

HSPM J716

Efficiency in an estimator

• Efficiency = low bias and low variance

• Unbiased with high variance – not very useful

• Biased with low variance -- worthless

A no-variance, reliable estimator?

• The 0 estimator
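A minimal simulation sketch of this trade-off. The normal data, the true mean of 10, the sample size, and the function name `simulate` are illustrative assumptions, not part of the lecture. It compares the sample mean with the 0 estimator, which answers 0 no matter what the data say.

```python
import random
import statistics

def simulate(true_mean=10.0, sd=5.0, n=20, trials=2000, seed=0):
    """Return {estimator name: (bias, variance)} over repeated samples."""
    rng = random.Random(seed)
    mean_estimates = []
    zero_estimates = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean_estimates.append(statistics.fmean(sample))  # uses the data
        zero_estimates.append(0.0)                       # ignores the data
    results = {}
    for name, est in (("sample mean", mean_estimates), ("zero", zero_estimates)):
        bias = statistics.fmean(est) - true_mean
        variance = statistics.pvariance(est)
        results[name] = (bias, variance)
    return results

results = simulate()
for name, (bias, variance) in results.items():
    print(f"{name:12s} bias={bias:8.3f} variance={variance:8.3f}")
```

The 0 estimator never varies from sample to sample, but it is off by the full true mean unless the true value happens to be 0.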

Eyeball vs. Least squares for assignment 1

• http://hspm.sph.sc.edu/COURSES/J716/demos/StudentLines/StudentLines.html


Hypothesis testing – parallels among the coin toss, card trick, and assignment 1A experiments

• A statistic calculated from our data

• A critical value for that statistic calculated theoretically based on a hypothesis about how the data were generated

• If our statistic were greater than the critical value, we would reject the hypothesis.

Hypothesis testing – all about calculating the probability of what you got and drawing an inference

• With the coin toss experiment

    – A statistic calculated from our data

        • Counted how many tails came up

    – A critical value for that statistic calculated theoretically based on the hypothesis that the coin was fair

        • 5 consecutive results that are all the same

    – When our statistic was greater than the critical value, we rejected the hypothesis
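A short sketch of where a critical value of 5 can come from, assuming the event of interest is a run of one named face (for example, all tails) from a fair coin. The function name `prob_run_of_one_face` is hypothetical. If a run of either face counted, each probability below would double.

```python
def prob_run_of_one_face(n, p=0.5):
    """Probability that n independent tosses all land on one named face."""
    return p ** n

for n in range(1, 7):
    print(n, prob_run_of_one_face(n))

# Smallest run length whose probability under a fair coin is below 0.05.
first_n = next(n for n in range(1, 100) if prob_run_of_one_face(n) < 0.05)
print("critical run length:", first_n)
```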

Hypothesis testing – all about calculating the probability of what you got and drawing an inference

• With the card experiment

    – A statistic calculated from our data

        • Counted how many times I guessed the card

    – A critical value for that statistic calculated theoretically based on the hypothesis that any of 52 cards could come up

        • Even one right guess has a probability less than 0.05, so the critical value is 1.

    – When our statistic was as big as the critical value, we rejected the hypothesis
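The probability behind that critical value can be checked with a few lines. This is a sketch assuming each guess is an independent 1-in-52 chance; the helper `prob_at_least` is a hypothetical name.

```python
from math import comb

def prob_at_least(k, n, p):
    """Binomial probability of k or more successes in n independent tries."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# One guess, one card: the chance of being right by luck alone.
p_one_guess = prob_at_least(1, 1, 1 / 52)
print(round(p_one_guess, 4), p_one_guess < 0.05)
```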

T statistic hypothesis tests calculate a probability and draw an inference

• With the assignment 1A spreadsheet

    – A statistic calculated from our data

        • The estimated coefficient divided by its standard error

    – A critical value for that statistic calculated theoretically based on the hypothesis that the true line’s slope is 0.

        • 2.571

    – When our statistic is greater than the critical value, we reject the hypothesis
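A sketch of the same test in code, using made-up data rather than the assignment's numbers. It assumes 7 data points, because 2.571 is the two-tailed 5% critical value of t with 5 degrees of freedom (7 points minus 2 estimated coefficients).

```python
import math

# Hypothetical data: 7 points, so degrees of freedom = 7 - 2 = 5.
x = [1, 2, 3, 4, 5, 6, 7]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual variance uses n - 2 because two coefficients were estimated.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(r * r for r in residuals) / (n - 2)
se_slope = math.sqrt(s2 / sxx)

t_stat = slope / se_slope
T_CRITICAL = 2.571  # from a t table: 5 degrees of freedom, two-tailed 5%
reject = abs(t_stat) > T_CRITICAL
print(f"slope={slope:.4f} se={se_slope:.4f} t={t_stat:.2f} reject={reject}")
```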

Not rejecting a false hypothesis

• Type II error in assignment 1A part 2

How the assumptions apply to the eyeball line and the least squares line

Assumption 1 is that there is a true line and that what you see differs from the true line because of random errors up or down for each point.

• Eyeball line: It's why you drew a line through the points, instead of using a curve or a wiggly line that goes from one point to the next.

• Least squares: It’s why you built a spreadsheet that calculates the slope and intercept of a line.

Assumption 2 is that the errors have an expected value of 0.

• Eyeball line: it's why you try to draw the line through the middle of the points, rather than off to one side or tilting differently.

• Least squares: The average of the residuals is 0.

• (The residuals are your estimates of the errors.)
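A quick numerical check that least squares residuals average to 0 when the line includes an intercept. The data and the helper name `fit_line` are made up for illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single X variable."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data.
x = [100, 200, 300, 400, 500, 600]
y = [12.0, 19.5, 33.0, 38.5, 52.0, 57.0]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)
print(f"mean residual = {mean_residual:.2e}")  # zero up to rounding error
```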

Assumption 3 is that the errors all have the same variance.

• Eyeball line: It's why you don't favor one point over another in drawing the line.

• Least squares: The spreadsheet’s sum and average rows are simple sums and averages. No data row gets a different weight from another.

Assumption 4 is that the errors are independent, not correlated with each other.

• Eyeball line: It's why you predict for X=800 using a point on the line.

• Least squares: It's why you predict for X=800 with 800*slope + intercept.

Confidence interval for a coefficient

• Coefficient ± its standard error × t from table

• 95% probability that the true coefficient is in the 95% confidence interval?

• If you do a lot of studies, you can expect that, for 95% of them, the true coefficient will be in the 95% confidence interval.

• If 0 is in the confidence interval, then the coefficient is not significant.
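A small sketch of the interval calculation and the "is 0 inside?" check. The coefficient and standard error values are hypothetical, the t value 2.571 assumes 5 degrees of freedom, and the function names are illustrative.

```python
def confidence_interval(coef, std_err, t_crit):
    """Coefficient plus or minus its standard error times t from the table."""
    half_width = std_err * t_crit
    return coef - half_width, coef + half_width

def is_significant(coef, std_err, t_crit):
    """Significant exactly when 0 lies outside the confidence interval."""
    low, high = confidence_interval(coef, std_err, t_crit)
    return not (low <= 0 <= high)

T_CRIT = 2.571  # assumed: 5 degrees of freedom, two-tailed 5%

low, high = confidence_interval(2.0, 0.5, T_CRIT)
print(f"[{low:.4f}, {high:.4f}] significant={is_significant(2.0, 0.5, T_CRIT)}")

low2, high2 = confidence_interval(1.0, 0.5, T_CRIT)
print(f"[{low2:.4f}, {high2:.4f}] significant={is_significant(1.0, 0.5, T_CRIT)}")
```

Checking whether 0 is in the interval gives the same answer as comparing the t statistic (coefficient divided by standard error) with the critical value.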

Assignment 2

• All regression results are the same

• Graphs differ

• Need reason to use or doubt least squares prediction

• The reason is in the form of rejecting one or more of the assumptions

Durbin-Watson statistic

• Serial correlation

    – Finds significant pattern for clinic 2

\[
DW = \frac{\sum_{i=2}^{N} (u_i - u_{i-1})^2}{\sum_{i=1}^{N} u_i^2}
\]
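The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch, using made-up residual sequences and a hypothetical function name:

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(u * u for u in residuals)
    return numerator / denominator

# Residuals that flip sign every time: negative serial correlation.
alternating = [1, -1, 1, -1]
# Residuals that stay on one side and then the other: positive serial correlation.
trending = [1, 1, 1, -1, -1, -1]

print(durbin_watson(alternating))
print(durbin_watson(trending))
```

Values near 2 suggest no first-order serial correlation; values toward 0 suggest positive correlation and values toward 4 suggest negative correlation.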

Confidence interval for prediction

• The hyperbolic outline
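A sketch of why the outline is hyperbolic, assuming the interval is for a single new observation in a one-X regression. The half-width grows with the distance of the prediction point from the mean of X. The X values, the residual standard error `s`, and the function name are hypothetical.

```python
import math

def prediction_half_width(x0, x, s, t_crit):
    """Half-width of the prediction interval at x0 for a new observation."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

x = [1, 2, 3, 4, 5, 6, 7]   # hypothetical X data, mean 4
s = 1.0                     # hypothetical residual standard error
t_crit = 2.571              # assumed: 5 degrees of freedom, two-tailed 5%

widths = {x0: prediction_half_width(x0, x, s, t_crit) for x0 in (0, 2, 4, 6, 8)}
for x0, w in widths.items():
    print(x0, round(w, 3))
```

The interval is narrowest at the mean of X and widens symmetrically on either side, which traces the curved outline around the fitted line.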

Formal outlier test?

• Use confidence interval of prediction

    – With and without the suspect point?

• How do you predict when your data have an outlier?

    – Totally ignoring it seems wrong.

    – So does letting it sway your results too much.

    – Investigate and use judgment.

Multiple regression

• 3 or more dimensions

• 2 or more X variables

• Y = α + βX + γZ + error

• Y = α + β1X1 + β2X2 + … + βpXp + error

Fitting a plane in 3D space

• Linear assumption

    – Now a flat plane

    – The effect of a change in X1 on Y is the same at all levels of X1 and X2 and any other X variables.

• Residuals are vertical distances from the plane to the data points floating in space.
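The flat-plane assumption can be illustrated with a hypothetical fitted plane. The coefficients below are made up, and `predict` is an illustrative name. Raising X1 by 1 changes the prediction by the same amount wherever you start.

```python
# Hypothetical coefficients for Y = alpha + beta1*X1 + beta2*X2.
ALPHA, BETA1, BETA2 = 1.0, 2.0, 3.0

def predict(x1, x2):
    """Height of the plane above the point (x1, x2)."""
    return ALPHA + BETA1 * x1 + BETA2 * x2

# Effect of a one-unit increase in X1 at several starting points.
effects = [
    predict(x1 + 1, x2) - predict(x1, x2)
    for x1 in (0, 5, 10)
    for x2 in (0, 7, 20)
]
print(effects)  # every entry equals BETA1
```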

Multiple regression

• Separating effects

    – Example from literature

    – Example from handout

β interpretation

• in Y = α + βX + γZ + error

• β is the effect on Y of changing X by 1, holding Z constant.

• When X is one unit bigger than you would predict it to be from what Z is, then we expect Y to be β more than what you would predict it would be from what Z is.

    – Those predictions are based on linear relationships.
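This reading of β can be checked numerically. In the sketch below, with made-up, error-free data where β is 2 by construction, X and Y are each predicted linearly from Z, and the leftover part of Y is then regressed on the leftover part of X. The slope of that last regression recovers β. This is often called partialling out (the Frisch-Waugh-Lovell result), a name the lecture itself does not use.

```python
def simple_fit(x, y):
    """Least squares intercept and slope for a single X variable."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(x, y):
    """Part of y not predicted by a straight line in x."""
    a, b = simple_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical data with alpha = 1, beta = 2, gamma = 3 and no error term.
Z = [1, 2, 3, 4, 5, 6]
X = [2, 1, 4, 3, 6, 5]
Y = [1 + 2 * x + 3 * z for x, z in zip(X, Z)]

x_unexpected = residuals(Z, X)  # X minus what Z predicts for X
y_unexpected = residuals(Z, Y)  # Y minus what Z predicts for Y
_, beta = simple_fit(x_unexpected, y_unexpected)
print(round(beta, 6))
```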

β-hat formula

•

LS

• Spreadsheet as front end

• Word processor as back end

• Interpretation of results

    – Coefficients

    – Standard errors

    – T-statistics

    – P-values

• Prediction
