Ch8 Regression Revby Rao

Medical Statistics (full English class)

Shaoqi Rao, PhD

School of Public Health

Sun Yat-Sen University

Slides adapted from Dr. Ji-Qian Fang’s

Chapter 8 Linear Regression

How does the value of one variable depend on that of another one?

How does the son’s height depend on the father’s height?

How does the death rate of animals depend on the drug dosage?

How does the infant’s weight depend on age in months?

How does the body surface area depend on the height?

---- To explore linear dependence quantitatively between two continuous variables.

8.1 Statistical Description of Linear Regression

8.1.1 Linear regression equation

Initial meaning of “regression”: Galton noted that if the father is tall, his son will be relatively tall; if the father is short, his son will be relatively short. But if the father is very tall, his son will usually not be taller than his father; if the father is very short, his son will usually not be shorter than his father.

Otherwise, ……?!

Galton called this phenomenon “regression to the mean”.

Independent variable (explanatory variable), X

randomly changing

or fixed by the researcher

Dependent variable (response variable), Y

randomly following a linear equation

What is regression in statistics?

To find out the track of the means

[Figure: scatter plot of son’s height (cm) against father’s height (cm); both axes run from 100 to 220.]

Given the value of X, Y varies around a center ($\mu_{y|x}$).

All the centers lie on a line, the regression line.

The relationship between the center $\mu_{y|x}$ and X is described by a linear equation:

$$\mu_{y|x} = \alpha + \beta X$$

Linear regression

Try to estimate $\alpha$ and $\beta$, getting

$$\hat{Y} = a + bX$$

where

a -- estimate of $\alpha$, intercept

b -- estimate of $\beta$, slope

$\hat{Y}$ -- estimate of $\mu_{y|x}$

8.1.2 Regression coefficient and its calculation

To find a straight line to best fit the points.

Residual: $y - \hat{y}$

Fitness of the regression line: $\sum (y - \hat{y})^2$

Principle of least squares: To find a straight line that minimizes the sum of squared residuals.

Under such a principle, it is easy to get the formulas for a and b by calculus:

$$b = \frac{l_{xy}}{l_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \qquad (8.3)$$

$$a = \bar{y} - b\bar{x} \qquad (8.4)$$

Such a line must go through the point $(\bar{x}, \bar{y})$, and cross the vertical axis at a ---- Why?
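One way to answer the "Why?" is to write out the calculus step behind formulas (8.3) and (8.4). The sketch below is a standard derivation, not taken from the slides; the symbol $Q$ for the sum of squared residuals is introduced here only for illustration.

```latex
\begin{align*}
Q(a, b) &= \sum (y_i - a - b x_i)^2 \\
\frac{\partial Q}{\partial a} &= -2 \sum (y_i - a - b x_i) = 0
  \;\Longrightarrow\; a = \bar{y} - b\bar{x} \\
\frac{\partial Q}{\partial b} &= -2 \sum x_i (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \sum (x_i - \bar{x})(y_i - \bar{y}) = b \sum (x_i - \bar{x})^2
  \;\Longrightarrow\; b = \frac{l_{xy}}{l_{xx}} \\
\hat{y}\,\big|_{x = \bar{x}} &= a + b\bar{x} = \bar{y}, \qquad
\hat{y}\,\big|_{x = 0} = a
\end{align*}
```

The last line shows both properties: substituting $x = \bar{x}$ returns $\bar{y}$, so the line passes through $(\bar{x}, \bar{y})$, and substituting $x = 0$ returns the intercept a.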

a

Example 8.1 Calculate the regression equation of the height of son Y on the height of father X.

No.                   1    2    3    4    5    6    7    8    9    10
Father’s height, X    150  153  155  158  161  164  165  167  168  169
Son’s height, Y       159  157  163  166  169  170  169  167  169  170

No.                   11   12   13   14   15   16   17   18   19   20
Father’s height, X    170  171  172  174  175  177  178  181  183  185
Son’s height, Y       173  170  170  176  178  174  173  178  176  180

$\bar{x} = 168.8$, $\bar{y} = 170.35$, $l_{xx} = 1859.2$, $l_{xy} = 1059.4$

$$b = \frac{l_{xy}}{l_{xx}} = \frac{1059.4}{1859.2} = 0.5698$$

$$a = 170.35 - (0.5698)(168.8) = 74.17$$

$$\hat{Y} = 74.17 + 0.5698X$$
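The calculation in Example 8.1 can be checked with a few lines of code. This is a minimal sketch using only the Python standard library; the function name `least_squares` is a hypothetical helper, not something from the slides.

```python
# Father's (x) and son's (y) heights in cm, copied from Example 8.1.
x = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]


def least_squares(x, y):
    """Return (a, b, l_xx, l_xy) for the fitted line y_hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx          # slope, formula (8.3)
    a = y_bar - b * x_bar    # intercept, formula (8.4)
    return a, b, l_xx, l_xy


a, b, l_xx, l_xy = least_squares(x, y)
print(f"l_xx = {l_xx:.1f}, l_xy = {l_xy:.1f}")
print(f"Y_hat = {a:.2f} + {b:.4f} X")
```

Up to rounding, the printed values should agree with the hand calculation (slope about 0.5698, intercept about 74.17).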

8.2 Statistical Inference on Regression

8.2.1 Hypothesis tests

8.2.1.1 The t-test for regression coefficient

b is the sample regression coefficient, changing from sample to sample

There is a population regression coefficient, denoted by $\beta$

Question: Whether $\beta = 0$ or not?

$H_0$: $\beta = 0$, $H_1$: $\beta \ne 0$, $\alpha = 0.05$

Statistic:

$$t_b = \frac{b - 0}{s_b}, \qquad \nu = n - 2$$

Standard deviation of regression coefficient:

$$s_b = \frac{s}{\sqrt{\sum (X - \bar{X})^2}}$$

Standard deviation of residual:

$$s = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$

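For Example 8.1

$H_0$: $\beta = 0$, $H_1$: $\beta \ne 0$

$$s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{94.92}{18}} = 2.2964$$

$$s_b = \frac{s}{\sqrt{l_{xx}}} = \frac{2.2964}{\sqrt{1859.2}} = 0.05326$$

$$t_b = \frac{b - 0}{s_b} = \frac{0.5698}{0.05326} = 10.70, \qquad \nu = 20 - 2 = 18$$

p < 0.001.

Reject $H_0$ ---- the regression of the son’s height on the father’s height is statistically significant.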
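The t-test for the slope in Example 8.1 can be reproduced as follows. This is a sketch using only the standard library. The critical value 3.922 (two-sided, significance level 0.001, 18 degrees of freedom) is taken from a standard t table rather than computed, and the variable names are illustrative.

```python
import math

# Data from Example 8.1 (heights in cm).
x = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in x)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = l_xy / l_xx
a = y_bar - b * x_bar

# Residual standard deviation with n - 2 degrees of freedom.
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(ss_res / (n - 2))

# Standard error of the slope and the t statistic for H0: beta = 0.
s_b = s / math.sqrt(l_xx)
t_b = (b - 0) / s_b

T_CRIT_0001 = 3.922  # assumed table value: two-sided 0.001, 18 df
print(f"s = {s:.4f}, s_b = {s_b:.5f}, t_b = {t_b:.2f}")
print("reject H0 at the 0.001 level:", abs(t_b) > T_CRIT_0001)
```

Because the code carries full precision through every step, s and s_b can differ from the hand-rounded slide values in the last digit or two.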

8.2.1.2 Analysis of variance

$H_0$: The contribution of the linear regression is 0

$H_1$: The contribution of the linear regression is not 0

(1) Before regression, we can only use $\bar{y}$ to estimate $\mu_{y|x}$

(2) After regression, we can use $\hat{Y}$ to estimate $\mu_{y|x}$

(3) The regression makes the sum of squared deviations decline:

$$SS_{Regression} = SS_{Total} - SS_{Residual}, \qquad \nu_{Regression} = \nu_{Total} - \nu_{Residual} = 1$$

(4) To test $H_0$: the contribution of regression is 0, the F-statistic is used:

$$F = \frac{MS_{Regression}}{MS_{Residual}}$$

For Example 8.1

Source        SS       DF    MS       F        P
Regression    603.63   1     603.63   114.54   < 0.01
Residual      94.92    18    5.27
Total         698.55   19

Conclusion: the regression of the son’s height on the father’s height is statistically significant.
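The ANOVA table for Example 8.1 can be rebuilt from the raw data. The sketch below computes each sum of squares directly from its definition, so the decomposition of the total can be checked numerically; variable names are illustrative.

```python
# Data from Example 8.1 (heights in cm).
x = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in x)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = l_xy / l_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# Sums of squares, each computed from its own definition.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Mean squares and the F statistic with (1, n - 2) degrees of freedom.
ms_reg = ss_reg / 1
ms_res = ss_res / (n - 2)
f_stat = ms_reg / ms_res

print(f"Regression  SS = {ss_reg:7.2f}  DF = 1   MS = {ms_reg:7.2f}  F = {f_stat:.2f}")
print(f"Residual    SS = {ss_res:7.2f}  DF = {n - 2}  MS = {ms_res:7.2f}")
print(f"Total       SS = {ss_total:7.2f}  DF = {n - 1}")
```

The results differ from the slide table in the second decimal place because the slide rounds intermediate values. In simple linear regression F equals the square of the slope's t statistic, and $10.70^2 \approx 114.5$ matches the F value here.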

The slight difference between these two approaches:

• The t test can be used for both one-sided and two-sided problems;

• ANOVA is for two-sided problems only. However, the idea of ANOVA can easily be extended to nonlinear regression and multiple regression.

8.2.2 Determination coefficient

Determination coefficient: Contribution of regression by %

$$R^2 = \frac{SS_{Regression}}{SS_{Total}}, \qquad 0 \le R^2 \le 1$$

• It reflects the percentage of the total sum of squared deviations that can be explained by the regression.

• If both X and Y are random variables, $R^2$ = square of correlation coefficient.

For Example 8.1

$SS_{Regression} = 603.63$, $SS_{Total} = 698.55$

$$R^2 = \frac{SS_{Regression}}{SS_{Total}} = \frac{603.63}{698.55} = 0.8641$$

$$r^2 = 0.9296^2 = 0.8641$$
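The equality between the determination coefficient and the squared correlation coefficient can be checked on Example 8.1. In this sketch $R^2$ is computed from the residuals and r from the product-moment formula, so the two routes are independent; variable names are illustrative.

```python
import math

# Data from Example 8.1 (heights in cm).
x = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in x)
l_yy = sum((yi - y_bar) ** 2 for yi in y)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Route 1: determination coefficient from the fitted line's residuals.
b = l_xy / l_xx
a = y_bar - b * x_bar
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - ss_res / l_yy   # equals SS_Regression / SS_Total

# Route 2: correlation coefficient from the product-moment formula.
r = l_xy / math.sqrt(l_xx * l_yy)

print(f"R^2 = {r_squared:.4f}, r = {r:.4f}, r^2 = {r * r:.4f}")
```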

In practice, it is suggested to report the value of the determination coefficient after a regression analysis to describe how good the regression is.

Here is a story:

X and Y: an index of liver function and a score for psychological status

$r = 0.2$, $b = 0.01$

The regression is statistically significant. Claimed: “the index for liver function can be improved by psychological consultation”.

Is it wrong?

Why?
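As a hint for the story, the determination coefficient implied by the reported correlation can be worked out directly. This uses the relation $R^2 = r^2$ stated earlier in this section; the interpretation is a reader's sketch, not the slide author's own answer.

```latex
\[
R^2 = r^2 = 0.2^2 = 0.04
\]
```

Only about 4% of the total sum of squared deviations would be explained by the regression, however small the P value is.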

8.3 The Application of Linear Regression

8.3.1 Two interval estimations

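8.3.1.1 Confidence interval for $\mu_{y|x}$

$$\hat{Y}_0 \pm t_{\alpha,\nu}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

8.3.1.2 Prediction interval for Y

$$\hat{Y}_0 \pm t_{\alpha,\nu}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

Here $\hat{Y}_0$ is the fitted value at $X = x_0$, and $t_{\alpha,\nu}$ is the two-sided critical value of the t distribution with $\nu = n - 2$.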
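The two intervals can be compared numerically on the data of Example 8.1. In this sketch the father's height $x_0 = 170$ cm is a hypothetical value chosen for illustration, the function name `interval_half_widths` is made up, and the critical value 2.101 (two-sided 0.05, 18 degrees of freedom) is taken from a standard t table.

```python
import math

# Data from Example 8.1 (heights in cm).
x = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in x)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = l_xy / l_xx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

T_CRIT_005 = 2.101  # assumed table value: two-sided 0.05, 18 df


def interval_half_widths(x0, t_crit):
    """Return (y0_hat, ci_half, pi_half) at X = x0."""
    y0_hat = a + b * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / l_xx
    ci_half = t_crit * s * math.sqrt(leverage)        # for the mean of Y at x0
    pi_half = t_crit * s * math.sqrt(1 + leverage)    # for one individual Y at x0
    return y0_hat, ci_half, pi_half


x0 = 170  # hypothetical father's height, inside the observed range
y0_hat, ci_half, pi_half = interval_half_widths(x0, T_CRIT_005)
print(f"Y0_hat = {y0_hat:.2f}")
print(f"95% CI for the mean: {y0_hat - ci_half:.2f} to {y0_hat + ci_half:.2f}")
print(f"95% PI for one son:  {y0_hat - pi_half:.2f} to {y0_hat + pi_half:.2f}")
```

The prediction interval is always the wider of the two, because it adds the variation of an individual Y around the mean to the uncertainty of the estimated mean itself.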

8.3.3 On the basic assumptions ---- LINE

(1) Linear: There exists a linear tendency between the dependent variable and the independent variable

(2) Independent: The individual observations are independent of each other

(3) Normal: Given the value of X, the corresponding Y follows a normal distribution

(4) Equal variances: The variances of Y for different values of X are all equal, denoted by $\sigma^2$.

In practice, one may use a scatter diagram to observe whether the basic assumptions are met.

The assumption of linearity is essential, since using a linear model to describe a curvilinear relationship is obviously inappropriate;

The assumption of independence is also essential;

Violation of the assumptions of normal distribution and equal variance might not seriously affect the least squares estimates, though the formulas introduced for statistical inference might no longer be valid.

When assumptions (1), (3) or (4) are violated, some transformations are worth trying.
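Besides the scatter diagram, the residuals themselves can be inspected. The sketch below lists the residuals of Example 8.1 in order of X and compares their mean square in the lower and upper halves of the X range. This split is only a crude informal look at the equal-variance assumption, chosen here for illustration; it is not a formal test and not a method from the slides.

```python
# Data from Example 8.1 (heights in cm), already sorted by X.
x = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in x)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = l_xy / l_xx
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Residuals of a least-squares line with an intercept sum to zero
# (up to floating-point error).
print(f"sum of residuals = {sum(residuals):.2e}")

# Informal check: mean squared residual for small X versus large X.
half = n // 2
ms_low = sum(e * e for e in residuals[:half]) / half
ms_high = sum(e * e for e in residuals[half:]) / (n - half)
print(f"mean squared residual, lower half of X: {ms_low:.2f}")
print(f"mean squared residual, upper half of X: {ms_high:.2f}")

# Listing residuals against X; a curved or fan-shaped pattern would be a warning sign.
for xi, e in zip(x, residuals):
    print(f"{xi:5d} {e:+6.2f}")
```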

Summary: Regression and Correlation

1. Distinction and connection

Distinction:

Correlation: Both X and Y are random

Regression: Y must be random; X could be random or not random

Connection: When both X and Y are random

1) Same sign for correlation coefficient and regression coefficient

2) t tests are equivalent: $t_r = t_b$

3) Determination coefficient:

$$\text{Determination coefficient} = \frac{SS_{Regression}}{SS_{Total}} = r^2$$
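The equivalence $t_r = t_b$ can be verified numerically on Example 8.1. The sketch assumes the usual t statistic for a correlation coefficient, $t_r = r\sqrt{n-2}/\sqrt{1-r^2}$, which is a standard formula that does not appear in this chapter's slides.

```python
import math

# Data from Example 8.1 (heights in cm).
x = [150, 153, 155, 158, 161, 164, 165, 167, 168, 169,
     170, 171, 172, 174, 175, 177, 178, 181, 183, 185]
y = [159, 157, 163, 166, 169, 170, 169, 167, 169, 170,
     173, 170, 170, 176, 178, 174, 173, 178, 176, 180]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
l_xx = sum((xi - x_bar) ** 2 for xi in x)
l_yy = sum((yi - y_bar) ** 2 for yi in y)
l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# t statistic for the regression coefficient.
b = l_xy / l_xx
a = y_bar - b * x_bar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
t_b = b / (s / math.sqrt(l_xx))

# t statistic for the correlation coefficient (assumed standard formula).
r = l_xy / math.sqrt(l_xx * l_yy)
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(f"t_b = {t_b:.4f}, t_r = {t_r:.4f}")
print("same sign for r and b:", (r > 0) == (b > 0))
```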

2. Cautions for regression and correlation

1) Don’t put any two variables together for correlation and regression. They must have some relation in subject matter;

2) Correlation and regression do not necessarily mean causality ---- sometimes the relation may be indirect, or there may be no real relation at all;

3) A big value of r does not necessarily mean a big regression coefficient b;

4) Rejecting $H_0: \rho = 0$ does not necessarily mean that the correlation is strong, only that $\rho \ne 0$;

5) A statistically significant regression equation does not necessarily mean that one can predict Y well by X, only that $\beta \ne 0$; whether the prediction is good depends on the coefficient of determination;

6) A scatter diagram is useful before working with linear correlation and linear regression;

7) The regression equation should not be applied beyond the range of the data set.