Ch4 Describing Relationships Between Variables


Page 1: Ch4 Describing Relationships

Ch4 Describing Relationships Between Variables

Page 2: Ch4 Describing Relationships

[Figure: scatterplot of Density vs. Pressure, "Ceramic Item Page 125"]

Page 3: Ch4 Describing Relationships

Section 4.1: Fitting a Line by Least Squares

• Often we want to fit a straight line to data.

• For example, from an experiment we might have the following data showing the density of specimens made from a ceramic compound at different pressures.

• By fitting a line to the data we can predict what the average density would be for specimens made at any given pressure, even pressures we did not investigate experimentally.

Page 4: Ch4 Describing Relationships
Page 5: Ch4 Describing Relationships

• For a straight line we assume a model which says that, on average in the whole population of possible specimens, the average density, y, is related to pressure, x, by the equation

$$y \approx \beta_0 + \beta_1 x$$

• The population (true) intercept and slope are represented by Greek symbols, just like µ and σ.

Page 6: Ch4 Describing Relationships

How to choose the best line? The principle of least squares

• To apply the principle of least squares in the fitting of an equation for y to an n-point data set, values of the equation parameters are chosen to minimize

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_1, y_2, \ldots, y_n$ are the observed responses and $\hat{y}_1, \ldots, \hat{y}_n$ are the corresponding responses predicted or fitted by the equation.
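To make the criterion concrete, here is a minimal Python sketch of the quantity being minimized; the helper name and toy data are hypothetical, not from the text.

```python
# A minimal sketch (plain Python) of the quantity being minimized: the sum
# of squared gaps between observed responses and a candidate line's fits.
def sum_squared_error(xs, ys, b0, b1):
    """Return sum of (y_i - y_hat_i)^2 for the candidate line y_hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Toy data, made up for illustration; least squares chooses the b0, b1
# that make this value as small as possible.
print(sum_squared_error([1, 2, 3], [1.1, 1.9, 3.2], b0=0.0, b1=1.0))
```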

Page 7: Ch4 Describing Relationships

In other words

• We want to choose a slope and intercept so as to minimize the sum of squared vertical distances from the data points to the line in question.

Page 8: Ch4 Describing Relationships

• A least squares fit minimizes the sum of squared deviations from the fitted line

minimize $\sum_i (y_i - \hat{y}_i)^2 = \sum_i \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2$

• Deviations from the fitted line are called “residuals”

• We are minimizing the sum of squared residuals, called the "residual sum of squares."

Page 9: Ch4 Describing Relationships

Come again

We need to minimize

$$S(\beta_0, \beta_1) = \sum_i \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2$$

over all possible values of β0 and β1.

This is a calculus problem (take partial derivatives).
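As an illustration of that calculus step, a small sketch using sympy (assumed available; the data values are made up, not the ceramic data) sets both partial derivatives to zero and solves:

```python
import sympy as sp

b0, b1 = sp.symbols('b0 b1')
xs = [1, 2, 3, 4]          # hypothetical data, for illustration only
ys = [2.1, 3.9, 6.2, 7.8]

S = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
# Setting both partial derivatives to zero gives the "normal equations";
# solving them yields the least squares intercept and slope.
print(sp.solve([sp.diff(S, b0), sp.diff(S, b1)], [b0, b1]))
```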

Page 10: Ch4 Describing Relationships

• The resulting formulas for the least squares estimates of the intercept and slope are

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

• Notice the notation. We use b1 and b0 to denote some particular values for the parameters β1 and β0.

Page 11: Ch4 Describing Relationships

The fitted line

• For the measured data we fit a straight line

$$\hat{y} = b_0 + b_1 x$$

• For the ith point, the fitted value or predicted value is

$$\hat{y}_i = b_0 + b_1 x_i$$

which represents likely y behavior at that x value.

Page 12: Ch4 Describing Relationships

Ceramic Compound data

x       y       (x - x_bar)   (y - y_bar)   (x - x_bar)(y - y_bar)   (x - x_bar)^2
2000    2.486   -4000         -0.181        724                      16000000
2000    2.479   -4000         -0.188        752                      16000000
2000    2.472   -4000         -0.195        780                      16000000
4000    2.558   -2000         -0.109        218                      4000000
4000    2.570   -2000         -0.097        194                      4000000
4000    2.580   -2000         -0.087        174                      4000000
6000    2.646   0             -0.021        0                        0
6000    2.657   0             -0.010        0                        0
6000    2.653   0             -0.014        0                        0
8000    2.724   2000          0.057         114                      4000000
8000    2.774   2000          0.107         214                      4000000
8000    2.808   2000          0.141         282                      4000000
10000   2.861   4000          0.194         776                      16000000
10000   2.879   4000          0.212         848                      16000000
10000   2.858   4000          0.191         764                      16000000
                                    sum:    5840             sum:    120000000

x_bar = 6000    b1 = 4.87E-05
y_bar = 2.667   b0 = 2.375
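A short Python sketch (standard library only, with x and y transcribed from the table above) reproduces these hand computations:

```python
x = [2000]*3 + [4000]*3 + [6000]*3 + [8000]*3 + [10000]*3
y = [2.486, 2.479, 2.472, 2.558, 2.570, 2.580, 2.646, 2.657, 2.653,
     2.724, 2.774, 2.808, 2.861, 2.879, 2.858]

x_bar = sum(x) / len(x)                                      # 6000
y_bar = sum(y) / len(y)                                      # about 2.667
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))   # 5840
sxx = sum((a - x_bar) ** 2 for a in x)                       # 120,000,000

b1 = sxy / sxx               # about 4.87e-05
b0 = y_bar - b1 * x_bar      # about 2.375
print(b0, b1)
```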

Page 13: Ch4 Describing Relationships

$$\hat{y} = 2.375 + 0.0000487\,x$$

[Figure: Density vs. Pressure with the fitted line, "Ceramic Item Page 125"; legend: Density, Linear (Density)]

Page 14: Ch4 Describing Relationships

Interpolation

• Consider the situation at x = 5000 psi; there are no data with this x value.

• If interpolation is sensible from a physical viewpoint, the fitted value

$$\hat{y} = 2.375 + 0.0000487(5000) \approx 2.6183 \text{ g/cc}$$

can be used to represent the density for 5,000 psi pressure.
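In code, the interpolation is a single evaluation of the fitted line; a tiny sketch using the estimates from the table two slides back:

```python
b0, b1 = 2.375, 5840 / 120_000_000   # least squares estimates from above
print(b0 + b1 * 5000)                # about 2.6183 g/cc at 5000 psi
```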

Page 15: Ch4 Describing Relationships

• "Least squares" is the optimal method to use for fitting the line if:
– The relationship is in fact linear.
– For a fixed value of x, the resulting values of y are
  • normally distributed, with
  • the same constant variance at all x values.

Page 16: Ch4 Describing Relationships

• If these assumptions are not met, then we are not using the best tool for the job.

• For any statistical tool, know when that tool is the right one to use.

Page 17: Ch4 Describing Relationships

4.1.2 The Sample Correlation and Coefficient of Determination

The sample (linear) correlation coefficient, r, is a measure of how "correlated" the x and y variables are. The correlation is between -1 and 1.

+1 means perfect positive linear correlation
0 means no linear correlation
-1 means perfect negative linear correlation

Page 18: Ch4 Describing Relationships

• The sample correlation is computed by

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$

• It is a measure of the strength of an apparent linear relationship.
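A sketch of this formula in Python (standard library only, ceramic data as transcribed earlier) gives r of about 0.991 for the ceramic data; r is unit-free, so either density scale works:

```python
import math

x = [2000]*3 + [4000]*3 + [6000]*3 + [8000]*3 + [10000]*3
y = [2.486, 2.479, 2.472, 2.558, 2.570, 2.580, 2.646, 2.657, 2.653,
     2.724, 2.774, 2.808, 2.861, 2.879, 2.858]
xb, yb = sum(x) / len(x), sum(y) / len(y)

num = sum((a - xb) * (b - yb) for a, b in zip(x, y))
den = math.sqrt(sum((a - xb)**2 for a in x) * sum((b - yb)**2 for b in y))
print(num / den)   # about 0.991
```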

Page 19: Ch4 Describing Relationships
Page 20: Ch4 Describing Relationships
Page 21: Ch4 Describing Relationships

Coefficient of Determination

It is another measure of the quality of a fitted equation.

$$R^2 = \frac{\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Page 22: Ch4 Describing Relationships

Interpretation of R2

R2 = fraction of variation accounted for (explained) by the fitted line.

$$R^2 = \frac{\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

[Figure: scatterplot of Density vs. Pressure, "Ceramic Items Page 124"]

Page 23: Ch4 Describing Relationships

(Density here is recorded as g/cc × 1000, so these are the same data as before.)

Pressure    y = Density   y - mean   (y - mean)^2
2000        2486          -181       32761
2000        2479          -188       35344
2000        2472          -195       38025
4000        2558          -109       11881
4000        2570          -97        9409
4000        2580          -87        7569
6000        2646          -21        441
6000        2657          -10        100
6000        2653          -14        196
8000        2724          57         3249
8000        2774          107        11449
8000        2808          141        19881
10000       2861          194        37636
10000       2879          212        44944
10000       2858          191       36481

mean        6000   2667    sum: 0    sum: 289366
st dev      2927.7 143.767
correlation 0.991
correl^2    0.982

Page 24: Ch4 Describing Relationships

Observation   Predicted Density   Residuals   Residual^2
1             2472.333            13.667      186.778
2             2472.333            6.667       44.444
3             2472.333            -0.333      0.111
4             2569.667            -11.667     136.111
5             2569.667            0.333       0.111
6             2569.667            10.333      106.778
7             2667.000            -21.000     441.000
8             2667.000            -10.000     100.000
9             2667.000            -14.000     196.000
10            2764.333            -40.333     1626.778
11            2764.333            9.667       93.444
12            2764.333            43.667      1906.778
13            2861.667            -0.667      0.444
14            2861.667            17.333      300.444
15            2861.667            -3.667      13.444
                                  sum:        5152.666667

Page 25: Ch4 Describing Relationships

• If we don't use the pressures to predict density:
– We use $\bar{y}$ to predict every $y_i$.
– Our sum of squared errors is $\sum (y_i - \bar{y})^2 = 289{,}366$ = SS Total.

• If we do use the pressures to predict density:
– We use $\hat{y}_i = b_0 + b_1 x_i$ to predict $y_i$.
– Our sum of squared errors is $\sum (y_i - \hat{y}_i)^2 = 5152.67$ = SS Residual.

Page 26: Ch4 Describing Relationships

The percent reduction in our error sum of squares is

$$R^2 = 100 \cdot \frac{\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 100 \cdot \frac{\text{SS Total} - \text{SS Residual}}{\text{SS Total}}$$

where SS Total − SS Residual = SS Regression. Numerically,

$$R^2 = 100 \cdot \frac{289{,}366 - 5152.67}{289{,}366} = 100 \cdot \frac{284{,}213.33}{289{,}366} = 98.2\%$$

Using x to predict y decreases the error sum of squares by 98.2%.
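As a quick check, the same arithmetic in two lines of Python (sums of squares taken from the slides above):

```python
ss_total, ss_residual = 289_366, 5152.67           # from the tables above
print(100 * (ss_total - ss_residual) / ss_total)   # about 98.2 (percent)
```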

Page 27: Ch4 Describing Relationships

The reduction in error sum of squares from using x to predict y is:
– the sum of squares explained by the regression equation
– 284,213.33 = SS Regression

This is also the correlation squared.

$r^2 = 0.991^2 = 0.982 = R^2$

Page 28: Ch4 Describing Relationships

For a perfectly straight line:
• All residuals are zero.
– The line fits the points exactly.
• SS Residual = 0
• SS Regression = SS Total
– The regression equation explains all variation.
• R2 = 100%
• r = ±1
• r2 = 1

If r = 0, then there is no linear relationship between x and y:
• R2 = 0%
• Using x to predict y with a linear model does not help at all.

Page 29: Ch4 Describing Relationships

4.1.3 Computing and Using Residuals

• Does our linear model extract the main message of the data?

• What is left behind? (Hopefully only small fluctuations explainable as random variation.)

• Residuals! $e_i = y_i - \hat{y}_i$

Page 30: Ch4 Describing Relationships

Good Residuals: Patternless

• They should look randomly scattered if a fitted equation is telling the whole story.

• When there is a pattern, something has gone unaccounted for in the fitting.

Page 31: Ch4 Describing Relationships

Plotting residuals will be most crucial in Section 4.2, with multiple x variables.
– But residual plots are still of use here.

Plot residuals:
– Versus predicted values
– Versus x
– In run order
– Versus other potentially influential variables, e.g. technician
– Normal plot of residuals

Read the book, page 135, for more residual plots.
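As one example, a minimal matplotlib sketch (matplotlib assumed available) of the first of these plots, residuals versus predicted values, for the ceramic data:

```python
# Ceramic data with density in g/cc x 1000, matching the tables above.
import matplotlib.pyplot as plt

x = [2000]*3 + [4000]*3 + [6000]*3 + [8000]*3 + [10000]*3
y = [2486, 2479, 2472, 2558, 2570, 2580, 2646, 2657, 2653,
     2724, 2774, 2808, 2861, 2879, 2858]
b0, b1 = 2375, 5_840_000 / 120_000_000   # the least squares fit in these units

pred = [b0 + b1 * xi for xi in x]
resid = [yi - pi for yi, pi in zip(y, pred)]

plt.scatter(pred, resid)
plt.axhline(0, linestyle='--')   # a patternless band around zero is the hope
plt.xlabel('Predicted density')
plt.ylabel('Residual')
plt.show()
```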

Page 32: Ch4 Describing Relationships

Checking Model Adequacy

With only a single x variable, we can tell most of what we need from a plot with the fitted line.

[Figure: "Original Scale" scatterplot of Y vs. X]

Page 33: Ch4 Describing Relationships

A residual plot gives us a magnified view of the increasing variance and curvature.

[Figure: residual plot, Residual vs. Predicted]

This residual plot indicates 2 problems with this linear least squares fit:

• The relationship is not linear.
– Indicated by the curvature in the residual plot.

• The variance is not constant.
– So the least squares method isn't the best approach, even if we handle the nonlinearity.

Page 34: Ch4 Describing Relationships

Some Cautions

• Don't fit a linear function to these data directly with least squares.

• With increasing variability, not all squared errors should count equally.

Page 35: Ch4 Describing Relationships

Some Study Questions

• What does it mean to say that a line fit to data is the "least squares" line? Where do the terms least and squares come from?

• We are fitting data with a straight line. What 3 assumptions (conditions) need to be true for a linear least squares fit to be the optimal way of fitting a line to the data?

• What does it mean if the correlation between x and y is -1? What is the residual sum of squares in this situation?

• If the correlation between x and y is 0, what is the regression sum of squares, SS Regression, in this situation?

Page 36: Ch4 Describing Relationships

• Consider the following data.

[Figure: scatterplot of Y vs. X (Series1)]

ANOVA

             df   SS         MS       F       Significance F
Regression   1    124.0333   124.03   15.85   0.016383072
Residual     4    31.3       7.825
Total        5    155.3333

             Coefficients   Standard Error   t Stat    P-value   Lower 95%     Upper 95%
Intercept    -0.5           2.79732          -0.1787   0.867     -8.26660583   7.2666058
X            1.525          0.383039         3.9813    0.016     0.461513698   2.5884863

Page 37: Ch4 Describing Relationships

• What is the value of R2?

• What is the least squares regression equation?

• How much does y increase on average if x is increased by 1.0?

• What is the sum of squared residuals? Do not compute the residuals; find the answer in the Excel output.

• What is the sum of squared deviations of y from y bar?

• By how much is the sum of squared errors reduced by using x to predict y, compared to using only y bar to predict y?

• What is the residual for the point with x = 2?