Introduction to Linear and Logistic Regression


Page 1

Introduction to Linear and Logistic Regression

Page 2

• Basic Ideas
• Linear Transformation
• Finding the Regression Line
• Minimizing the Sum of Squared Residuals
• Curve Fitting
• Logistic Regression
• Odds and Probability

Page 3

Basic Ideas: Jargon

• IV = X = predictor (pl. predictors)
• DV = Y = criterion (pl. criteria)
• "Regression of Y on X"
• Linear model = the relation between IV and DV is represented by a straight line

A score on Y has two parts: (1) a linear function of X and (2) error.

Y_i = α + βX_i + ε_i   (population values)

Page 4

Basic Ideas (2)

Sample values:

Y_i = a + bX_i + e_i

• Intercept a: the value of Y where X = 0
• Slope b: the change in Y if X changes by 1 unit

If the error is removed, we have a predicted value for each person at X (the line):

Y' = a + bX

Suppose on average houses are worth about 50.00 Euro a square meter. Then the equation relating price to size would be Y' = 0 + 50X. The predicted price for a 2000 square meter house would be 100,000 Euro.
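A quick check of this arithmetic in Python (a minimal sketch; predict_price is an illustrative name, not from the slides):

    def predict_price(size_m2, a=0.0, b=50.0):
        """Predicted price Y' = a + b*X for a house of size_m2 square meters."""
        return a + b * size_m2

    print(predict_price(2000))  # 100000.0 Euro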

Page 5

Linear Transformation

• A 1-to-1 mapping of variables via a line
• Permissible operations are addition and multiplication (interval data)

[Figure: "Changing the Y Intercept": the lines Y=5+2X, Y=10+2X, and Y=15+2X plotted for X from 0 to 10. Adding a constant shifts the line up.]

[Figure: "Changing the Slope": the lines Y=5+.5X, Y=5+X, and Y=5+2X plotted for X from 0 to 10. Multiplying by a constant changes the slope.]

Y' = a + bX

Page 6

Linear Transformation (2)

• Centigrade to Fahrenheit; note the 1-to-1 map
• Intercept? Slope?

[Figure: Degrees F plotted against Degrees C; the line passes through (0 C, 32 F) and (100 C, 212 F).]

The intercept is 32: when X (Cent) is 0, Y (Fahr) is 32.

The slope is 1.8: when Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise), and 212 - 32 = 180. Then 180/100 = 1.8, and rise over run is the slope. Y = 32 + 1.8X, i.e., F = 32 + 1.8C.

Y' = a + bX
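The same transformation as a minimal Python sketch, with the intercept and slope named explicitly:

    def c_to_f(c, intercept=32.0, slope=1.8):
        """Fahrenheit as a linear transformation of Centigrade: F = 32 + 1.8*C."""
        return intercept + slope * c

    print(c_to_f(0))    # 32.0
    print(c_to_f(100))  # 212.0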

Page 7

Standard Deviation and Variance

The standard deviation is the square root of the variance, which is the sum of squared distances between each value and the mean, divided by the population size (finite population):

σ = sqrt( (1/N) · Σ_{i=1..N} (x_i - x̄)² )

Example: 1, 2, 15. Mean = 6.

σ² = ( (1 - 6)² + (2 - 6)² + (15 - 6)² ) / 3 = 40.67

σ = 6.38
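A quick numeric check of this example in Python (population variance, dividing by N):

    values = [1, 2, 15]
    n = len(values)
    mean = sum(values) / n                           # 6.0
    var = sum((x - mean) ** 2 for x in values) / n   # about 40.67
    sd = var ** 0.5                                  # about 6.38
    print(mean, var, sd)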

Page 8

Correlation Analysis

Correlation coefficient (also called Pearson's product-moment coefficient):

r_XY = Σ_{i=1..n} (x_i - x̄)(y_i - ȳ) / ( (n - 1) σ_X σ_Y )

If r_XY > 0, X and Y are positively correlated (Y's values increase as X's do); the higher r_XY, the stronger the correlation. r_XY = 0 means uncorrelated (no linear relation); r_XY < 0 means negatively correlated.
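A minimal Python sketch of this coefficient, using sample standard deviations (dividing by n - 1) to match the formula above:

    def pearson_r(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        # Sample standard deviations (divide by n - 1).
        sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
        sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        return cov / ((n - 1) * sx * sy)

    print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect positive correlation)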

Page 9

Regression of Weight on Height

Ht   Wt
61   105
62   120
63   120
65   160
65   120
68   145
69   175
70   160
72   185
75   210

N = 10 for both variables; mean Ht = 67, SD = 4.57; mean Wt = 150, SD = 33.99.

[Figure: "Regression of Weight on Height": scatterplot of Weight in Lbs against Height in Inches, with the regression line and its rise over run marked.]

Correlation: r = .94

Regression equation: Y' = -316.86 + 6.97X

Y' = a + bX
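The fit can be reproduced with NumPy; a minimal sketch (np.polyfit with degree 1 returns the slope and intercept):

    import numpy as np

    ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75], dtype=float)
    wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210], dtype=float)

    b, a = np.polyfit(ht, wt, 1)      # slope, intercept
    r = np.corrcoef(ht, wt)[0, 1]     # Pearson correlation
    print(a, b, r)                    # about -316.86, 6.97, .94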

Page 10

Predicted Values & Residuals

N     Ht     Wt       Y'       RS
1     61     105      108.19   -3.19
2     62     120      115.16    4.84
3     63     120      122.13   -2.13
4     65     160      136.06   23.94
5     65     120      136.06  -16.06
6     68     145      156.97  -11.97
7     69     175      163.94   11.06
8     70     160      170.91  -10.91
9     72     185      184.84    0.16
10    75     210      205.75    4.25

mean  67     150      150.00    0.00
SD    4.57   33.99    31.85    11.89
V     20.89  1155.56  1014.37  141.32

Numbers for the linear part and the error:

• Y' is called the predicted value
• Y - Y' is the residual (RS)
• The residual is the error
• The mean of Y' equals the mean of Y
• The variance of Y equals the variance of Y' plus the variance of RS

Y' = a + bX
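Continuing the NumPy sketch above, the last two facts can be checked numerically (sample variances, dividing by n - 1, to match the table):

    y_hat = a + b * ht    # predicted values Y'
    rs = wt - y_hat       # residuals Y - Y'
    print(y_hat.mean(), rs.mean())                             # 150.0 and ~0.0
    print(wt.var(ddof=1), y_hat.var(ddof=1) + rs.var(ddof=1))  # both about 1155.6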

Page 11

Finding the Regression Line

You need to know the correlation, the standard deviations, and the means of X and Y.

Slope: b = r_XY · σ_Y / σ_X

To find the intercept, use: a = Ȳ - bX̄

Suppose r_XY = .50, σ_X = .5, mean X = 10, σ_Y = 2, mean Y = 5. Then:

b = .50 · (2 / .5) = 2   (slope)

a = 5 - 2(10) = -15   (intercept)

Y' = -15 + 2X
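The same computation as a minimal Python sketch:

    def regression_line(r_xy, sd_x, sd_y, mean_x, mean_y):
        """Intercept and slope from the correlation, SDs, and means."""
        b = r_xy * sd_y / sd_x
        a = mean_y - b * mean_x
        return a, b

    print(regression_line(.50, .5, 2, 10, 5))  # (-15.0, 2.0)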

Page 12

Line of Least Squares

Assume a linear relation is reasonable, so the two variables can be represented by a line. Where should the line go?

• Place the line so the errors (residuals) are small.
• The line we calculate has a sum of errors equal to 0.
• It has a sum of squared errors that is as small as possible; the line provides the smallest sum of squared errors, or least squares.

Page 13

Minimize the Sum of the Squared Residuals

Take the partial derivatives and set them equal to 0:

SRS = Σ_{i=1..n} (RS_i)²,  where RS_i = a + bX_i - Y_i

min SRS = min Σ_{i=1..n} (a + bX_i - Y_i)²

∂/∂a Σ_{i=1..n} (a + bX_i - Y_i)² = 0

∂/∂b Σ_{i=1..n} (a + bX_i - Y_i)² = 0

Page 14

∂/∂a Σ_{i=1..n} (a + bX_i - Y_i)² = 0

2 Σ_{i=1..n} (a + bX_i - Y_i) · ∂/∂a (a + bX_i - Y_i) = 0

2 Σ_{i=1..n} (a + bX_i - Y_i) = 0

Σ_{i=1..n} a + b Σ_{i=1..n} X_i = Σ_{i=1..n} Y_i

a·n + b Σ_{i=1..n} X_i = Σ_{i=1..n} Y_i

Page 15

∂/∂b Σ_{i=1..n} (a + bX_i - Y_i)² = 0

2 Σ_{i=1..n} (a + bX_i - Y_i) · ∂/∂b (a + bX_i - Y_i) = 0

2 Σ_{i=1..n} (a + bX_i - Y_i) X_i = 0

Σ_{i=1..n} aX_i + b Σ_{i=1..n} X_i² = Σ_{i=1..n} X_iY_i

a Σ_{i=1..n} X_i + b Σ_{i=1..n} X_i² = Σ_{i=1..n} X_iY_i

Page 16

The coefficients a and b are found by solving the following system of linear equations (all sums run over i = 1..n):

[ n       Σ X_i  ] [ a ]   [ Σ Y_i    ]
[ Σ X_i   Σ X_i² ] [ b ] = [ Σ X_iY_i ]
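A minimal NumPy sketch that builds and solves exactly this system:

    import numpy as np

    def fit_line(x, y):
        """Solve the 2x2 normal equations for the intercept a and slope b."""
        n = len(x)
        A = np.array([[n,       x.sum()],
                      [x.sum(), (x ** 2).sum()]])
        rhs = np.array([y.sum(), (x * y).sum()])
        a, b = np.linalg.solve(A, rhs)
        return a, b

With the height/weight data above, fit_line(ht, wt) reproduces a ≈ -316.86 and b ≈ 6.97.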

Page 17

Curve Fitting

Linear regression: Y = a + bX

Exponential curve: Y = a·e^(bX), a > 0

Logarithmic curve: Y = a + b·ln(X)

Power curve: Y = a·X^b, a > 0

Page 18

The coefficients â and b are found by solving the same system of linear equations in the transformed variables (all sums run over i = 1..n):

[ n       Σ X̂_i  ] [ â ]   [ Σ Ŷ_i    ]
[ Σ X̂_i   Σ X̂_i² ] [ b ] = [ Σ X̂_iŶ_i ]

Page 19

with

Linear regression: â := a, X̂_i := X_i, Ŷ_i := Y_i

Exponential curve: â := ln(a), X̂_i := X_i, Ŷ_i := ln(Y_i)

Logarithmic curve: â := a, X̂_i := ln(X_i), Ŷ_i := Y_i

Power curve: â := ln(a), X̂_i := ln(X_i), Ŷ_i := ln(Y_i)
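For example, the exponential curve Y = a·e^(bX) can be fitted by regressing ln(Y) on X and transforming the intercept back; a minimal sketch:

    import numpy as np

    def fit_exponential(x, y):
        """Fit Y = a * exp(b*X) via linear regression of ln(Y) on X (requires y > 0)."""
        b, a_hat = np.polyfit(x, np.log(y), 1)  # ln(Y) = ln(a) + b*X
        return np.exp(a_hat), b                 # back-transform a = exp(a_hat)

Note that this minimizes the squared residuals on the log scale, which is what the substitution table above implies.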

Page 20

Multiple Linear Regression

T_i = a + bX_i + cY_i

The coefficients a, b and c are found by solving the following system of linear equations (all sums run over i = 1..n):

[ n       Σ X_i     Σ Y_i    ] [ a ]   [ Σ T_i    ]
[ Σ X_i   Σ X_i²    Σ X_iY_i ] [ b ] = [ Σ X_iT_i ]
[ Σ Y_i   Σ X_iY_i  Σ Y_i²   ] [ c ]   [ Σ Y_iT_i ]
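A minimal NumPy sketch of these 3x3 normal equations (here T is the dependent variable and X, Y are the two predictors):

    import numpy as np

    def fit_two_predictors(x, y, t):
        """Solve for a, b, c in T = a + b*X + c*Y via the normal equations."""
        n = len(x)
        A = np.array([[n,       x.sum(),      y.sum()],
                      [x.sum(), (x**2).sum(), (x*y).sum()],
                      [y.sum(), (x*y).sum(),  (y**2).sum()]])
        rhs = np.array([t.sum(), (x*t).sum(), (y*t).sum()])
        return np.linalg.solve(A, rhs)  # array([a, b, c])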

Page 21

Polynomial Regression

Y_i = a + bX_i + cX_i²

The coefficients a, b and c are found by solving the following system of linear equations (all sums run over i = 1..n):

[ n       Σ X_i   Σ X_i² ] [ a ]   [ Σ Y_i     ]
[ Σ X_i   Σ X_i²  Σ X_i³ ] [ b ] = [ Σ X_iY_i  ]
[ Σ X_i²  Σ X_i³  Σ X_i⁴ ] [ c ]   [ Σ X_i²Y_i ]
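The same quadratic fit is available directly in NumPy; a minimal sketch with made-up data (roughly Y = 1 + 2X²):

    import numpy as np

    x = np.array([0., 1., 2., 3., 4.])
    y = np.array([1., 3., 9., 19., 33.])  # illustrative data only
    c, b, a = np.polyfit(x, y, 2)         # coefficients, highest power first
    print(a, b, c)                        # Y' = a + b*X + c*X², about 1, 0, 2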

Page 22

Logistic Regression

The dependent variable is binary (a categorical variable that has two values, such as "yes" and "no") rather than continuous.

The binary DV (Y) is either 0 or 1. For example, we might code a successfully kicked field goal as 1 and a missed field goal as 0; or yes as 1 and no as 0; or admitted as 1 and rejected as 0; or Cherry Garcia flavor ice cream as 1 and all other flavors as 0.

Page 23

If we code like this, then the mean of the distribution is equal to the proportion of 1s in the distribution.

For example, if there are 100 people in the distribution and 30 of them are coded 1, then the mean of the distribution is .30, which is the proportion of 1s.

The mean of a binary distribution so coded is denoted P, the proportion of 1s. The proportion of 0s is (1 - P), which is sometimes denoted Q. The variance of such a distribution is PQ, and the standard deviation is sqrt(PQ).
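A quick numeric check of these formulas in Python, using the example above (30 ones among 100 observations):

    data = [1] * 30 + [0] * 70
    n = len(data)
    p = sum(data) / n                          # mean = proportion of 1s = 0.30
    q = 1 - p
    var = sum((x - p) ** 2 for x in data) / n  # population variance
    print(p, var, p * q)                       # 0.3, 0.21, 0.21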

Page 24

Suppose we want to predict whether someone is male or female (DV; M = 1, F = 0) using height in inches (IV).

We could plot the relation between the two variables as we customarily do in regression. The plot might look something like this:

Page 25

None of the observations (data points) fall on the regression line; they are all zero or one.

[Figure: scatterplot of the binary DV (0/1) against the predictor, with the fitted regression line.]

Page 26

Predicted values (DV = Y) correspond to probabilities. If linear regression is used, the predicted values will become greater than one or less than zero if one moves far enough along the X-axis. Such values are theoretically inadmissible. The logistic function keeps the predicted values between 0 and 1:

P := Y' = e^(a+bX) / (1 + e^(a+bX)) = 1 / (1 + e^-(a+bX))
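A minimal Python sketch of this function (the coefficients a and b here are made up for illustration):

    import math

    def logistic(x, a, b):
        """P = 1 / (1 + exp(-(a + b*x))), always strictly between 0 and 1."""
        return 1.0 / (1.0 + math.exp(-(a + b * x)))

    print(logistic(70, a=-40.0, b=0.6))  # about 0.88 with these illustrative values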

Page 27

Linear vs. Logistic regression

[Figures: a linear fit and a logistic fit to the same binary data.]

Page 28

Odds and Probability

odds = P / (1 - P)

log(odds) = logit(P) = ln( P / (1 - P) )

logit(P) = a + bX

P / (1 - P) = e^(a+bX)

P = e^(a+bX) / (1 + e^(a+bX))

Linear regression! (The logit of P is a linear function of X.)
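The round trip between probability, odds, and logit, as a small Python sketch:

    import math

    def logit(p):
        """Probability -> log-odds: ln(p / (1 - p))."""
        return math.log(p / (1 - p))

    def inv_logit(z):
        """Log-odds -> probability: exp(z) / (1 + exp(z))."""
        return math.exp(z) / (1 + math.exp(z))

    p = 0.75
    print(p / (1 - p))          # odds = 3.0
    print(inv_logit(logit(p)))  # back to 0.75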

Page 29


Page 30

• Basic Ideas
• Linear Transformation
• Finding the Regression Line
• Minimizing the Sum of Squared Residuals
• Curve Fitting
• Logistic Regression
• Odds and Probability