Principles of Biostatistics Chapter 18 Simple Linear Regression
36
4/9/2005 11:38 AM Department of Epidemiology and Health Statistics,Tongji Medical College http://statdtedm.6to23 (Dr. Chuanhua Yu) 18-1 Principles of Biostatistics Chapter 18 Simple Linear Regression 宇 宇 传 http://statdtedm.6to23.com
Principles of Biostatistics Chapter 18 Simple Linear Regression
Principles of Biostatistics Chapter 18 Simple Linear Regression. 宇传华 http://statdtedm.6to23.com. Terminology. Linear regression 线性回归 Response (dependent) variable 反应 ( 应 ) 变量 Explanatory (independent) variable 解释 ( 自 ) 变量 Linear regression model 线性回归模型 - PowerPoint PPT Presentation
Citation preview
Simple Linear Regression18-*
18-*
Terminology
Error sum of squares or residual sum of squares
18-*
18.3 Estimation: The Method of Least Squares
18.4 Error Variance and the Standard Errors of Regression
Estimators
18.5 Confidence Intervals for the Regression Parameters
18.6 Hypothesis Tests about the Regression Relationship
18.7 How Good is the Regression?
18.8 Analysis of Variance Table and an F Test of the Regression
Model
18.9 Residual Analysis
Contents
18-*
8.1 An example
Table18.1 IL-6 levels in brain and serum (pg/ml) of 10 patients
with subarachnoid hemorrhage ()
Patient i
18-*
Scatterplot
This scatterplot locates pairs of observations of serum IL-6 on the
x-axis and brain IL-6 on the y-axis. We notice that:
Larger (smaller) values of brain IL-6 tend to be associated with
larger (smaller) values of serum IL-6 .
The scatter of points tends to be distributed around a positively
sloped straight line.
The pairs of values of serum IL-6 and brain IL-6 are not located
exactly on a straight line. The scatter plot reveals a more or less
strong tendency rather than a precise linear relationship. The line
represents the nature of the relationship on average.
Chart3
22.4
51.6
58.1
25.1
65.9
79.7
75.3
32.4
96.4
85.7
134
167
132.3
80.2
100
139.1
187.2
97.2
192.3
199.4
Sheet2
1
22.4
134
2
51.6
167
3
58.1
132.3
4
25.1
80.2
5
65.9
100
6
79.7
139.1
7
75.3
187.2
8
32.4
97.2
9
96.4
192.3
10
85.7
199.4
Sheet2
Sheet1
Patient
x
y
y^
18-*
18-*
Model Building
The inexact nature of the relationship between serum and brain
suggests that a statistical model might be useful in analyzing the
relationship.
A statistical model separates the systematic component of a
relationship from the random component.
Data
Statistical
model
Systematic
component
Random
errors
In ANOVA, the systematic component is the variation of means
between samples or treatments (SSTR) and the random component is
the unexplained variation (SSE).
In regression, the systematic component is the overall linear
relationship, and the random component is the variation around the
line.
4/9/2005 11:38 AM
18-*
y= a + b x + or my|x=a+b x
Nonrandom or Random
Component
Where y is the dependent (response) variable, the variable we wish
to explain or predict; x is the independent (explanatory) variable,
also called the predictor variable; and is the error term, the only
random component in the model, and thus, the only source of
randomness in y.
my|x is the mean of y when x is specified, all called the
conditional mean of Y.
a is the intercept of the systematic component of the regression
relationship.
is the slope of the systematic component.
4/9/2005 11:38 AM
18-*
Picturing the Simple Linear Regression Model
The simple linear regression model posits () an exact linear
relationship between the expected or average value of Y, the
dependent variable Y, and X, the independent or predictor
variable:
my|x= a+b x
Actual observed values of Y (y) differ from the expected value
(my|x ) by an unexplained or random error(e):
y = my|x +
= a+b x +
18-*
The relationship between X and Y is a straight-Line
relationship.
The values of the independent variable X are assumed fixed (not
random); the only randomness in the values of Y comes from the
error term .
The errors are uncorrelated (i.e. Independent) in successive
observations. The errors are Normally distributed with mean 0 and
variance 2(Equal variance). That is: ~ N(0,2)
X
Y
LINE assumptions of the Simple Linear Regression Model
Identical normal distributions of errors, all centered on the
regression line.
my|x=a + x
18-*
$
Estimation of a simple linear regression relationship involves
finding estimated or predicted values of the intercept and slope of
the linear regression line.
The estimated regression equation:
y= a+ bx + e
where a estimates the intercept of the population regression line,
a ;
b estimates the slope of the population regression line, ;
and e stands for the observed errors ------- the residuals from
fitting the estimated regression line a+ bx to a set of n
points.
The estima
ted regres
sion line:
18-*
X
Y
e
X
4/9/2005 11:38 AM
18-*
18-*
18-*
18-*
Sheet1
Patient
x
y
y^
18-*
Example 18-1: Using Computer-Excel
The results on the bottom are the output created by selecting
REGRESSION option from the DATA ANALYSIS toolkit.
Office“”“”“”
Sheet2
1
22.4
134
2
51.6
167
3
58.1
132.3
4
25.1
80.2
5
65.9
100
6
79.7
139.1
7
75.3
187.2
8
32.4
97.2
9
96.4
192.3
10
85.7
199.4
Sheet2
Sheet1
Patient
x
y
y^
18-*
Y
X
What you see when looking at the total variation of Y.
X
What you see when looking along the regression line at the error
variance of Y.
Y
18-*
18.4 Error Variance and the Standard Errors of Regression
Estimators
X
Y
4/9/2005 11:38 AM
18-*
4/9/2005 11:38 AM
18-*
4/9/2005 11:38 AM
18-*
Constant Y
Unsystematic Variation
Nonlinear Relationship
A hypothes
18-*
4/9/2005 11:38 AM
18-*
18.7 How Good is the Regression?
.
{
R2=
18-*
134
167
132.3
80.2
100
139.1
187.2
97.2
192.3
199.4
Sheet2
1
22.4
134
2
51.6
167
3
58.1
132.3
4
25.1
80.2
5
65.9
100
6
79.7
139.1
7
75.3
187.2
8
32.4
97.2
9
96.4
192.3
10
85.7
199.4
Sheet2
Sheet1
Patient
x
y
y^
18-*
18.8 Analysis of Variance Table and an F Test of the Regression
Model
4/9/2005 11:38 AM
18-*
0
Residuals
0
Residuals
Time
0
Residuals
4/9/2005 11:38 AM
18-*
Chart4
22.4
25.1
32.4
51.6
58.1
65.9
75.3
79.7
85.7
96.4
0
0
0
0
0
0
0
0
0
0
Sheet1
Patient
x
y
x2
y2
134
167
132.3
80.2
100
139.1
187.2
97.2
192.3
199.4
Sheet2
1
22.4
134
2
51.6
167
3
58.1
132.3
4
25.1
80.2
5
65.9
100
6
79.7
139.1
7
75.3
187.2
8
32.4
97.2
9
96.4
192.3
10
85.7
199.4
Sheet2
Sheet1
Patient
x
y
y^
18-*
Point Prediction
A single-valued estimate of Y for a given value of X obtained by
inserting the value of X in the estimated regression
equation.
Prediction Interval
For a value of Y given a value of X
Variation in regression line estimate
Variation of points around regression line
For confidence interval of an average value of Y given a value of
X
Variation in regression line estimate
4/9/2005 11:38 AM
18-*
4/9/2005 11:38 AM
18-*
4/9/2005 11:38 AM
18-*
Confidence Interval for the Average Value of Y and Prediction
Interval for the Individual Value of Y
Chart5
22.4
22.4
22.4
22.4
22.4
25.1
25.1
25.1
25.1
25.1
32.4
32.4
32.4
32.4
32.4
51.6
51.6
51.6
51.6
51.6
58.1
58.1
58.1
58.1
58.1
65.9
65.9
65.9
65.9
65.9
75.3
75.3
75.3
75.3
75.3
79.7
79.7
79.7
79.7
79.7
85.7
85.7
85.7
85.7
85.7
96.4
96.4
96.4
96.4
96.4
serum IL-6
brain IL-6
Sheet1
Patient
x
y
y^
serum IL-6
brain IL-6
18-*
Summary
1. Regression analysis is applied for prediction while control
effect of independent variable X.
2. The principle of least squares in solution of regression
parameters is to minimize the residual sum of squares.
3. The coefficient of determination, R2, is a descriptive measure
of the strength of the regression relationship.
4. There are two confidence bands: one for mean predictions and the
other for individual prediction values
5. Residual analysis is used to check goodness of fit for
models
4/9/2005 11:38 AM
18-*
Assignments
1. What is the main distinctions and assossiations between
correlation analysis and simple linear regression?
2. What is the least squares method to estimate regression
line?
3. Please describe the main steps for fitting a simple linear
regression model with data.
4/9/2005 11:38 AM
18-*
1. Data source:
correlation analysis is required that both x and y follow normal
distribution; but for simple linear regression, only y is required
following normal distribution.
2. application:
correlation analysis is employed to measure the association between
two random variables (both x and y are treated symmetrically)
simple linear regression is employed to measure the change in y for
x (x is the independent varible, y is the dependent variable)
3. r is a dimensionless number, it has no unit of measurement; but
b has its unit which relate to y.
4/9/2005 11:38 AM
18-*
3.
YY
XX
l
l
b
r
yx
Y
ˆ
Intercept72.96072725.57022272.8533473370.02136880313.9956876131.925766
x1.17970420.398262342.9621285150.0180872980.2613096332.09809885
0
0
2
0
0
0
/2,2
2
7
()
egree of freedom
for each paramete
r estimated ( and ) )
An
2
SSE
: MSE=
-2
ˆ
a
a
b
b
n
n
ts
a
ts
b
a
a
a
a
0
1.180
2.962
0.398
value0.018(1028)
b
b
HH
t
t
H
b
s
p
bba
n
=¹=
Total
mean value of Y:
Actual observationslower of 95% CL for y
upper of 95% CL for ylower of 95% CL for mean
upper of 95% CL for mean
2
2
where s = MSE
1
xx
xx
x
a
x
b
x
s
nl
s
s
l
s
s
n
a
b
l
x
brain IL-6