Example

x   1  2  3  4  5  6  7  8
y   6  7  5  2  9  6  7  6

We wish to check for a non-zero correlation.

We know that:

$$ \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} $$

Let the true correlation coefficient be ρ. Then test the hypotheses:

H0: ρ = 0

H1: ρ ≠ 0

It has already been shown that r = 0.1458

Thus,

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.1458\sqrt{6}}{\sqrt{1-0.0213}} = 0.3610 $$

The cut-off points of the t distribution with 6 degrees of freedom, leaving 2.5% in each tail, are ±2.447.

[Figure: t distribution with cut-off points −2.447 and 2.447]

The t value of 0.3610 lies between these cut-off points, so H0 is accepted.

There is no evidence at the 5% level of a non-zero correlation between x and y.
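The correlation test can be reproduced in R. The following is a minimal sketch using the x and y values from the example; the variable names (`t_stat`, `t_crit`, `ct`) are illustrative and not taken from the slides.

```r
# Data from the example
x <- 1:8
y <- c(6, 7, 5, 2, 9, 6, 7, 6)
n <- length(x)

# Sample correlation and the statistic r * sqrt(n - 2) / sqrt(1 - r^2)
r <- cor(x, y)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided 5% cut-off point of the t distribution with n - 2 degrees of freedom
t_crit <- qt(0.975, df = n - 2)

# Built-in equivalent: cor.test() computes the same t statistic plus a p-value
ct <- cor.test(x, y)

cat("r =", round(r, 4), " t =", round(t_stat, 4),
    " cut-off =", round(t_crit, 3), "\n")
cat("Reject H0:", abs(t_stat) > t_crit, "\n")
```

Here `cor.test()` is used only as a cross-check on the hand calculation; its default method is the Pearson correlation with a two-sided alternative.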

Similarly, we can check whether the slope b is significantly different from 0.

So the value of b is 0.1190.

Now carry out a hypothesis test.

H0: b = 0

H1: b ≠ 0

The standard error of b is

$$ \left(\hat{\sigma}^2 / S_{xx}\right)^{1/2} $$

This is calculated in R as 0.3298.

The test statistic is

$$ t = \frac{\hat{b} - b}{\left(\hat{\sigma}^2 / S_{xx}\right)^{1/2}} $$

This calculates as (0.1190 − 0)/0.3298 = 0.3608.

Again, t tables with 6 degrees of freedom give a cut-off point of 2.447 for the top 2.5% (and −2.447 for the bottom 2.5%).

Since the test statistic t (0.3608) is less than this cut-off point, we accept the null hypothesis H0.

There is no evidence at the 5% level of a non-zero value of b.

To confirm this, the 95% CI is:

0.1190 ± 2.447 × 0.3298 = (−0.688, 0.926)

Notice that this includes zero.
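The slope test and confidence interval can be checked in R. This is a sketch under the assumption that the slides fitted the model with `lm(y ~ x)`; the object names (`fit`, `se_by_hand`, `ci`) are illustrative.

```r
# Data from the example
x <- 1:8
y <- c(6, 7, 5, 2, 9, 6, 7, 6)

fit <- lm(y ~ x)

# Coefficient table columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
b    <- coefs["x", "Estimate"]
se_b <- coefs["x", "Std. Error"]
t_b  <- coefs["x", "t value"]

# Standard error by hand: sqrt(sigma_hat^2 / Sxx)
Sxx <- sum((x - mean(x))^2)
sigma2_hat <- sum(resid(fit)^2) / (length(x) - 2)
se_by_hand <- sqrt(sigma2_hat / Sxx)

# 95% confidence interval for the slope
ci <- confint(fit, "x", level = 0.95)

print(round(c(b = b, se = se_b, se_by_hand = se_by_hand, t = t_b), 4))
print(round(ci, 3))
```

R works with unrounded values, so its t statistic can differ in the last decimal place from the 0.3608 obtained by dividing the rounded figures 0.1190 and 0.3298.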

Confidence Intervals for Variance

We quoted earlier that

$$ \frac{(n-2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2} $$

This can be used to obtain a confidence interval for σ².

Recall the earlier example

y   3.5  3.2  3.0  2.9  4.0  2.5  2.3
x   3.1  3.4  3.0  3.2  3.9  2.8  2.2

Estimate of the error variance σ²:

$$ \hat{\sigma}^2 = SS_{RES}/(n-2) = 0.39418/5 = 0.07884 $$

$$ \frac{5\hat{\sigma}^2}{\sigma^2} \sim \chi^2_5 $$

Now $\chi^2_5$ is equal to 0.8312 for the “bottom” 2.5% and 12.83 for the “top” 2.5%.

95% CI for σ² is (5 × 0.07884/12.83, 5 × 0.07884/0.8312), i.e. (0.031, 0.474).
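The interval for the error variance can be reproduced in R from the chi-squared quantiles. This is a minimal sketch; the names `ss_res`, `chi_lo`, `chi_hi` and `ci_sigma2` are illustrative.

```r
# Data from the earlier example
y <- c(3.5, 3.2, 3.0, 2.9, 4.0, 2.5, 2.3)
x <- c(3.1, 3.4, 3.0, 3.2, 3.9, 2.8, 2.2)

fit <- lm(y ~ x)
df_res <- df.residual(fit)        # n - 2 = 5
ss_res <- sum(resid(fit)^2)       # residual sum of squares
sigma2_hat <- ss_res / df_res     # estimate of the error variance

# Bottom and top 2.5% points of the chi-squared distribution
chi_lo <- qchisq(0.025, df = df_res)
chi_hi <- qchisq(0.975, df = df_res)

# Dividing by the larger quantile gives the lower limit, and vice versa
ci_sigma2 <- c(lower = df_res * sigma2_hat / chi_hi,
               upper = df_res * sigma2_hat / chi_lo)

print(round(c(ss_res = ss_res, sigma2_hat = sigma2_hat), 5))
print(round(ci_sigma2, 3))
```

The interval is not symmetric about the estimate because the chi-squared distribution is skewed.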

Trees Example

More than one variable

The residual plot suggests that the linear model is satisfactory. The R squared value seems quite high though, so from physical arguments we force the line to pass through the origin.

The R squared value is higher now, but the residual plot is not so random.

We might now ask if we can find a model with both explanatory variables height and girth. Physical considerations suggest that we should explore the very simple model

Volume = b1 × height × (girth)² + ε

This is basically the formula for the volume of a cylinder.

So the equation is:

Volume = 0.002108 × height × (girth)² + ε
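A fit of this form can be sketched in R. The sketch assumes that the example uses R's built-in `trees` data set (columns `Girth`, `Height`, `Volume`), which the slides do not state explicitly; the object name `fit4` is illustrative.

```r
# Assumption: the example uses R's built-in `trees` data set
data(trees)

# Regression through the origin on the single product term height * girth^2.
# `0 +` removes the intercept; I() protects the arithmetic inside the formula.
fit4 <- lm(Volume ~ 0 + I(Height * Girth^2), data = trees)

b1 <- unname(coef(fit4))
print(b1)

# Residuals can be inspected against each explanatory variable
plot(trees$Girth, resid(fit4), xlab = "girth", ylab = "residual")
plot(trees$Height, resid(fit4), xlab = "height", ylab = "residual")
```

If the data set is the one used in the slides, `b1` should be close to the 0.002108 quoted above.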

The residuals are considerably smaller than those from any of the previous models considered. Further graphical analysis fails to reveal any further obvious dependence on either of the explanatory variables, girth or height.

Further analysis also shows that inclusion of a constant term in the model does not significantly improve the fit. Model 4 is thus the most satisfactory of those models considered for the data.

However, this is regression “through the origin”, so it may be more satisfactory to rewrite Model 4 as

volume / (height × (girth)²) = b1 + ε

so that b1 can then just be regarded as the mean of the observations of

volume / (height × (girth)²)

(recall that ε is assumed to have location measure (here mean) 0).

Compare with 0.002108 found earlier
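The comparison between the two estimates of b1 can be sketched in R. As before, this assumes the built-in `trees` data set is the one used in the slides; the names `ratio`, `b1_mean` and `b1_ls` are illustrative.

```r
# Assumption: the example uses R's built-in `trees` data set
data(trees)

# Observations of volume / (height * girth^2), one per tree
ratio <- with(trees, Volume / (Height * Girth^2))

# b1 regarded as the mean of these ratios
b1_mean <- mean(ratio)

# b1 from the least-squares fit through the origin, for comparison
b1_ls <- unname(coef(lm(Volume ~ 0 + I(Height * Girth^2), data = trees)))

print(c(mean_of_ratios = b1_mean, through_origin = b1_ls))
```

The two values need not agree exactly, since the least-squares fit weights trees with large height × girth² more heavily than a simple mean of ratios does.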

Practical Question 2

y     x1    x2
3.5   3.1   30
3.2   3.4   25
3.0   3.0   20
2.9   3.2   30
4.0   3.9   40
2.5   2.8   25
2.3   2.2   30

So y = -0.2138 + 0.8984x1 + 0.01745x2 + e
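The fitted equation can be reproduced in R from the table. This is a minimal sketch; the object name `multregress` follows the name used later in the slides, and the rest is illustrative.

```r
# Data from the table
y  <- c(3.5, 3.2, 3.0, 2.9, 4.0, 2.5, 2.3)
x1 <- c(3.1, 3.4, 3.0, 3.2, 3.9, 2.8, 2.2)
x2 <- c(30, 25, 20, 30, 40, 25, 30)

# Multiple regression of y on both explanatory variables
multregress <- lm(y ~ x1 + x2)

# Intercept and the two slopes
print(round(coef(multregress), 4))

# Full summary: standard errors, t values and R squared
print(summary(multregress))
```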

Use

> plot(multregress)

> ynew=c(y,12)
> x1new=c(x1,20)
> x2new=c(x2,100)

> multregressnew=lm(ynew~x1new+x2new)

Very large influence

Second Example

> ynew=c(y,40)
> x1new=c(x1,10)
> x2new=c(x2,50)

> multregressnew=lm(ynew~x1new+x2new)
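The influence of each added point can also be put into numbers rather than read off the diagnostic plots. The sketch below uses leverage (`hatvalues`) and Cook's distance (`cooks.distance`) from base R; the helper `influence_of_new_point` is a hypothetical name introduced here for illustration, not something from the slides.

```r
# Original data from Practical Question 2
y  <- c(3.5, 3.2, 3.0, 2.9, 4.0, 2.5, 2.3)
x1 <- c(3.1, 3.4, 3.0, 3.2, 3.9, 2.8, 2.2)
x2 <- c(30, 25, 20, 30, 40, 25, 30)

# Hypothetical helper: append one observation, refit, and return
# the leverage and Cook's distance of every point (new point is row 8)
influence_of_new_point <- function(y_add, x1_add, x2_add) {
  ynew  <- c(y, y_add)
  x1new <- c(x1, x1_add)
  x2new <- c(x2, x2_add)
  multregressnew <- lm(ynew ~ x1new + x2new)
  data.frame(leverage = hatvalues(multregressnew),
             cooks    = cooks.distance(multregressnew))
}

first  <- influence_of_new_point(12, 20, 100)   # first added point
second <- influence_of_new_point(40, 10, 50)    # second added point

print(round(first, 3))
print(round(second, 3))
```

A leverage close to 1 or a Cook's distance far above those of the other points marks the added observation as highly influential, which is what the plots from `plot(multregressnew)` display graphically.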
