Experimental Statistics - week 13

Page 1: Experimental Statistics           - week 13

Experimental Statistics - week 13

Multiple Regression Miscellaneous Topics

Page 2: Experimental Statistics           - week 13

Setting:

We have a dependent variable Y and several candidate independent variables.

Question: Should we use all of them?

Page 3: Experimental Statistics           - week 13

Why do we run Multiple Regression?

1. Obtain estimates of individual coefficients in a model (+ or -, etc.)

2. Screen variables to determine which have a significant effect on the model

3. Arrive at the most effective (and efficient) prediction model

Page 4: Experimental Statistics           - week 13

The problem:

Collinearity among the independent variables

-- high correlation between 2 independent variables
-- one independent variable nearly a linear combination of other independent variables
-- etc.

Page 5: Experimental Statistics           - week 13

Effects of Collinearity

• parameter estimates are highly variable and unreliable
  - parameter estimates may even have the opposite sign from what is reasonable
• may have significant F but none of the t-tests are significant
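
One common way to look for collinearity is to request variance inflation factors from PROC REG. The sketch below is illustrative only: it assumes a data set named survival holding the variables from the survival example later in these slides. VIF, TOL, and COLLIN are options on the MODEL statement.

```sas
/* Sketch only: data set and variable names are borrowed from the
   survival example later in the deck. */
proc reg data=survival;
  /* VIF and TOL print variance inflation factors and tolerances;
     COLLIN adds eigenvalue-based collinearity diagnostics. */
  model survival = clot prog enzyme liver age / vif tol collin;
run;
```

A frequently quoted rule of thumb treats VIF values above about 10 as a warning sign. That cutoff is a convention, not something stated in these slides.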

Variable Selection Techniques

Techniques for “being careful” about which variables are put into the model

Page 6: Experimental Statistics           - week 13

Variable Selection Procedures

• Forward selection

• Backward Elimination

• Stepwise

• Best subset
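
In SAS, each of these procedures can be requested through the SELECTION= option of the MODEL statement in PROC REG. The sketch below assumes the survival data set and variable names from the example later in these slides. The 0.05 entry and stay levels and BEST=5 are illustrative choices, not values taken from the deck.

```sas
/* Sketch only: thresholds and BEST= are illustrative assumptions. */
proc reg data=survival;
  forward:  model survival = clot prog enzyme liver age gender alch1 alch2
            / selection=forward slentry=0.05;
  backward: model survival = clot prog enzyme liver age gender alch1 alch2
            / selection=backward slstay=0.05;
  stepwise: model survival = clot prog enzyme liver age gender alch1 alch2
            / selection=stepwise slentry=0.05 slstay=0.05;
  /* Best subset: rank candidate models by adjusted R-square. */
  best:     model survival = clot prog enzyme liver age gender alch1 alch2
            / selection=adjrsq best=5;
run;
```

The labels before each MODEL statement (forward, backward, and so on) only tag the output. The four procedures need not agree on a final model, which is one reason to be wary of any single result.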

Page 7: Experimental Statistics           - week 13

Multiple Regression – Analysis Suggestions

1. Include only variables that make sense
2. Force important variables into the model
3. Be wary of variable selection results
   - especially forward selection
4. Examine pairwise correlations among variables
5. Examine pairwise scatterplots among variables
   - identify nonlinearity
   - identify unequal variance problems
   - identify possible outliers
6. Try transformations of variables for
   - correcting nonlinearity
   - stabilizing the variances
   - inducing normality of residuals
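
The pairwise correlations and scatterplots suggested above could be produced in SAS roughly as follows. The data set name cars and the variable names hp, cmpg, hmpg, and weight are hypothetical stand-ins for horsepower, city MPG, highway MPG, and weight.

```sas
/* Hypothetical data set "cars" with variables hp, cmpg, hmpg, weight. */
proc corr data=cars;
  var hp cmpg hmpg weight;       /* pairwise Pearson correlations */
run;

proc sgscatter data=cars;
  matrix hp cmpg hmpg weight;    /* scatterplot matrix of all pairs */
run;
```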

Page 8: Experimental Statistics           - week 13

SPSS Output from INFANT Data Set

[Figure: pairwise scatterplot matrix of length, age, lengthb, weightb, and chestb]

Page 9: Experimental Statistics           - week 13

SPSS Output from CAR Data Set

[Figure: pairwise scatterplot matrix of Horsepower, City MPG, Highway MPG, and Weight in Pounds]

Page 10: Experimental Statistics           - week 13

Examples of Nonlinear Data “Shapes” and Linearizing Transformations

Page 11: Experimental Statistics           - week 13

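Exponential Transformation (Log-Linear)

Original Model: $Y_i = e^{\beta_0 + \beta_1 X_{1i}} \varepsilon_i$

Transformed Into: $\ln Y_i = \beta_0 + \beta_1 X_{1i} + \ln \varepsilon_i$

[Figure: Y plotted against X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$]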

Page 12: Experimental Statistics           - week 13

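Transformed Multiplicative Model (Log-Log)

Original: $Y_i = \beta_0 X_{1i}^{\beta_1} \varepsilon_i$

Transformed: $\ln Y_i = \ln \beta_0 + \beta_1 \ln X_{1i} + \ln \varepsilon_i$

[Figure: two panels of Y plotted against X1, showing the curve shapes for different ranges of $\beta_1$]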
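
The step from the multiplicative model to its linear form uses only the rules for logarithms of products and powers. A short worked version, assuming $\beta_0 > 0$, $X_{1i} > 0$, and a positive multiplicative error $\varepsilon_i$:

```latex
\begin{align*}
\ln Y_i &= \ln\!\left(\beta_0 \, X_{1i}^{\beta_1} \, \varepsilon_i\right) \\
        &= \ln \beta_0 + \ln\!\left(X_{1i}^{\beta_1}\right) + \ln \varepsilon_i \\
        &= \ln \beta_0 + \beta_1 \ln X_{1i} + \ln \varepsilon_i
\end{align*}
```

Regressing $\ln Y$ on $\ln X_1$ therefore estimates $\beta_1$ as the slope. The fitted intercept estimates $\ln \beta_0$, so exponentiating it gives an estimate of $\beta_0$.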

Page 13: Experimental Statistics           - week 13

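Square Root Transformation

$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \varepsilon_i$

[Figure: Y plotted against X1, with curves for $\beta_1 > 0$ and $\beta_1 < 0$]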

Page 14: Experimental Statistics           - week 13

Note:

- transforming Y using the log or square root transformation can help with unequal variance problems

- these transformations may also help induce normality
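
A minimal sketch of how such transformations might be set up in SAS, assuming a hypothetical data set cars with variables hp (horsepower) and hmpg (highway MPG). The transformed variable names are made up for illustration.

```sas
/* Hypothetical input data set "cars" with variables hp and hmpg. */
data cars2;
  set cars;
  sqrthp  = sqrt(hp);     /* square root of X */
  loghp   = log(hp);      /* log() is the natural log in SAS */
  loghmpg = log(hmpg);    /* log of Y */
run;

proc reg data=cars2;
  model hmpg    = sqrthp;   /* square root transformation */
  model loghmpg = hp;       /* log-linear (exponential) model */
  model loghmpg = loghp;    /* log-log (multiplicative) model */
run;
```

Comparing the residual plots from the three fits is one way to judge which transformation best corrects nonlinearity and unequal variance.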

Page 15: Experimental Statistics           - week 13

[Figure: four scatterplot panels: hmpg vs hp; hmpg vs sqrt(hp); log(hmpg) vs hp; log(hmpg) vs log(hp)]

Page 16: Experimental Statistics           - week 13

Polynomial Regression:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_p x^p + \varepsilon$

- basically a multiple regression where the independent variables are powers of a single independent variable

- use SAS to compute the independent variables $x^2, x^3, \dots, x^p$
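
For instance, a cubic fit could be set up as sketched below. The data set name mydata and the variable names x and y are hypothetical.

```sas
/* Hypothetical input data set "mydata" with variables x and y. */
data poly;
  set mydata;
  x2 = x**2;    /* ** is the exponentiation operator in SAS */
  x3 = x**3;
run;

proc reg data=poly;
  model y = x x2 x3;    /* cubic polynomial fit as a multiple regression */
run;
```

Powers of the same variable tend to be highly correlated with each other, so the collinearity concerns raised earlier in the deck apply here too.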

Page 17: Experimental Statistics           - week 13

Outlier Detection

- there are tests for outliers

- throwing away outliers should technically be done only when there is evidence that the values “do not belong”
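
One way to flag candidate outliers for a closer look is to save residual diagnostics from PROC REG. This is a sketch under assumptions: the data set mydata and variables x and y are hypothetical, and the cutoff of 3 is an illustrative convention rather than a formal test.

```sas
/* Hypothetical data set "mydata" with variables x and y. */
proc reg data=mydata;
  model y = x;
  /* Save predicted values, residuals, studentized deleted residuals,
     Cook's D, and leverage for each observation. */
  output out=diag p=pred r=resid rstudent=rstud cookd=cooksd h=leverage;
run;

proc print data=diag;
  where abs(rstud) > 3;   /* illustrative cutoff only */
run;
```

Observations listed this way are candidates for investigation, not automatic deletions.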

Page 18: Experimental Statistics           - week 13

Use of Dummy Variables in Regression

Page 19: Experimental Statistics           - week 13

Example 6.1, Text page 268-269

Does a drug retain its potency after 1 year of storage?

2 groups: 1) fresh product 2) product stored for 1 year
(n = 10 observations from each group -- indep. samples)

Fresh   Stored
10.2    9.8
10.5    9.6
 .       .
 .       .
 .       .

Variable measured is potency reading

Question: How would you compare groups?

Page 20: Experimental Statistics           - week 13

1-Factor ANOVA Model

$y_{ij} = \mu_i + \varepsilon_{ij}$

where $\mu_F$ = mean of fresh product
      $\mu_S$ = mean of 1-year old product

We want to test:

$H_0: \mu_F = \mu_S$
$H_1: \mu_F \neq \mu_S$

We could use:
- independent groups t-test
- 1-factor ANOVA (with 2 levels of the factor)

Page 21: Experimental Statistics           - week 13

data ott269;
input type$ y;
datalines;
F 10.2
F 10.5
F 10.3
F 10.8
F 9.8
. . .
S 9.6
S 9.8
S 9.9
;
proc glm;
class type;
model y=type;
means type/lsd;
title 'ANOVA -- Potency Data - page 269 (t-test)';
run;
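
The independent groups t-test mentioned as an alternative could be run as sketched below, assuming the ott269 data set created by the data step above. PROC TTEST reports the pooled (equal variance) test that corresponds to the 2-level ANOVA.

```sas
/* Assumes the ott269 data set with variables type (F or S) and y. */
proc ttest data=ott269;
  class type;   /* grouping variable with exactly two levels */
  var y;        /* potency reading */
run;
```

With two groups, the square of the pooled t statistic equals the ANOVA F statistic. The deck reports t = -4.24 for the equivalent regression and F = 17.95, and (-4.24)^2 is about 17.98, which agrees up to rounding.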

Page 22: Experimental Statistics           - week 13

ANOVA -- Potency Data - page 269 (t-test)

The GLM Procedure

Class Level Information

Class   Levels   Values
type    2        F S

The GLM Procedure

Dependent Variable: y

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   1.45800000       1.45800000    17.95     0.0005
Error             18   1.46200000       0.08122222
Corrected Total   19   2.92000000

R-Square   Coeff Var   Root MSE   potency Mean
0.499315   2.821734    0.284995   10.10000

Source   DF   Type I SS    Mean Square   F Value   Pr > F
type      1   1.45800000   1.45800000    17.95     0.0005

Source   DF   Type III SS   Mean Square   F Value   Pr > F
type      1   1.45800000    1.45800000    17.95     0.0005

Page 23: Experimental Statistics           - week 13

Since p = .0005 we reject

$H_0: \mu_F = \mu_S$ in favor of $H_1: \mu_F \neq \mu_S$

and conclude that storage time does make a difference.

t Tests (LSD) for y

NOTE: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

Alpha                          0.05
Error Degrees of Freedom       18
Error Mean Square              0.081222
Critical Value of t            2.10092
Least Significant Difference   0.2678

Means with the same letter are not significantly different.

t Grouping   Mean      N    type
A            10.3700   10   F
B             9.8300   10   S

Fresh product has higher potency on average.

Also: estimated difference in means = 10.37 - 9.83 = .54

Page 24: Experimental Statistics           - week 13

Regression analysis – requires the independent variables to be quantitative

Let’s consider recoding the group membership variable (i.e. F and S) into the numeric scores:

0 = fresh 1 = stored one year

and running a regression analysis with this new “dummy” variable as a “quantitative” independent variable - let’s call the “dummy” variable x.

Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$

Page 25: Experimental Statistics           - week 13

data ott269;
input x y;
datalines;
0 10.2
0 10.5
0 10.3
0 10.8
0 9.8
. . .
1 9.6
1 9.8
1 9.9
;
proc reg;
model y=x;
title 'Regression Analysis -- Potency Data - page 269';
run;

Page 26: Experimental Statistics           - week 13

The REG Procedure

Dependent Variable: y

Number of Observations Read   20
Number of Observations Used   20

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   1.45800          1.45800       17.95     0.0005
Error             18   1.46200          0.08122
Corrected Total   19   2.92000

Root MSE          0.28500   R-Square   0.4993
Dependent Mean   10.10000   Adj R-Sq   0.4715
Coeff Var         2.82173

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   10.37000             0.09012          115.06    <.0001
x            1   -0.54000             0.12745           -4.24    0.0005

Regression Equation: $\hat{y} = 10.37 - .54x$

Page 27: Experimental Statistics           - week 13

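Note: the regression model

$y = \beta_0 + \beta_1 x + \varepsilon$

On the basis of this model:

$\hat{\mu}_F = \hat{\beta}_0 = 10.37$ (x = 0)

$\hat{\mu}_S = \hat{\beta}_0 + \hat{\beta}_1 = 10.37 - .54 = 9.83$ (x = 1)

$\hat{\mu}_F - \hat{\mu}_S = -\hat{\beta}_1 = .54$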

Page 28: Experimental Statistics           - week 13

Dummy Variables with More than 2 Groups

Example: Balloon Data - 4 groups

Page 29: Experimental Statistics           - week 13

Balloon Data

Col. 1-2 - observation number
Col. 3 - color (1=pink, 2=yellow, 3=orange, 4=blue)
Col. 4-7 - inflation time in seconds

 1122.4
 2324.6
 3120.3
 4419.8
 5324.3
 6222.2
 7228.5
 8225.7
 9320.2
10119.6
11228.8
12424.0
13417.1
14419.3
15324.2
16115.8
17218.3
18117.5
19418.7
20322.9
21116.3
22414.0
23416.6
24218.1
25218.9
26416.0
27220.1
28322.5
29316.0
30119.3
31115.9
32320.3

“Research Question”: Is the average time required to inflate the balloons the same for each color?

Recall:

Page 30: Experimental Statistics           - week 13

Analysis using 1-factor ANOVA Model with 4 Groups

GLM Procedure
ANOVA --- Balloon Data

Dependent Variable: time

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3   126.1512500      42.0504167    3.85      0.0200
Error             28   305.6475000      10.9159821
Corrected Total   31   431.7987500

R-Square   Coeff Var   Root MSE   time Mean
0.292153   16.31069    3.303934   20.25625

Source   DF   Type I SS     Mean Square   F Value   Pr > F
color     3   126.1512500   42.0504167    3.85      0.0200

LSD Results

Grouping   Mean     N   color
A          22.575   8   2 (yellow)
A
A          21.875   8   3 (orange)

B          18.388   8   1 (pink)
B
B          18.188   8   4 (blue)

Page 31: Experimental Statistics           - week 13

Dummy Variables

For 4 groups -- 3 dummy variables needed.

$x_1 = 1$ if obs. is in group 2, $x_1 = 0$ otherwise

$x_2 = 1$ if obs. is in group 3, $x_2 = 0$ otherwise

$x_3 = 1$ if obs. is in group 4, $x_3 = 0$ otherwise

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$

$(x_1, x_2, x_3)$:
0, 0, 0 → group 1
1, 0, 0 → group 2
0, 1, 0 → group 3
0, 0, 1 → group 4
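
In SAS, dummy variables of this kind can be built in a data step with logical expressions, which evaluate to 1 when true and 0 when false. The sketch below assumes a hypothetical data set balloon that already holds the variables color (coded 1 to 4) and time.

```sas
/* Hypothetical input data set "balloon" with variables color and time. */
data balloon2;
  set balloon;
  x1 = (color = 2);   /* 1 if yellow, 0 otherwise */
  x2 = (color = 3);   /* 1 if orange, 0 otherwise */
  x3 = (color = 4);   /* 1 if blue,   0 otherwise */
run;

proc reg data=balloon2;
  model time = x1 x2 x3;   /* group 1 (pink) is the reference group */
run;
```

The group with all three dummies equal to 0 acts as the baseline against which the other groups are compared.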

Page 32: Experimental Statistics           - week 13

Dummy Variables for 4 Groups:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$

The model says:

The mean for color 1 (i.e. x1 = 0, x2 = 0, x3 = 0) is $\beta_0$ (notation: $\mu_1$)

The mean for color 2 (i.e. x1 = 1, x2 = 0, x3 = 0) is $\beta_0 + \beta_1$ (notation: $\mu_2$)

The mean for color 3 (i.e. x1 = 0, x2 = 1, x3 = 0) is $\beta_0 + \beta_2$ (notation: $\mu_3$)

The mean for color 4 (i.e. x1 = 0, x2 = 0, x3 = 1) is $\beta_0 + \beta_3$ (notation: $\mu_4$)

Page 33: Experimental Statistics           - week 13

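$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$

$\mu_1 = \beta_0$, so $H_0: \beta_0 = 0$ is equivalent to $H_0: \mu_1 = 0$

$\mu_2 = \beta_0 + \beta_1$, so $H_0: \beta_1 = 0$ is equivalent to $H_0: \mu_1 = \mu_2$

$\mu_3 = \beta_0 + \beta_2$, so $H_0: \beta_2 = 0$ is equivalent to $H_0: \mu_1 = \mu_3$

$\mu_4 = \beta_0 + \beta_3$, so $H_0: \beta_3 = 0$ is equivalent to $H_0: \mu_1 = \mu_4$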

Page 34: Experimental Statistics           - week 13

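Dummy Variables for 4 Groups:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$

$\mu_1 = \beta_0$               so   $\beta_0 = \mu_1$
$\mu_2 = \beta_0 + \beta_1$     so   $\beta_1 = \mu_2 - \mu_1$
$\mu_3 = \beta_0 + \beta_2$     so   $\beta_2 = \mu_3 - \mu_1$
$\mu_4 = \beta_0 + \beta_3$     so   $\beta_3 = \mu_4 - \mu_1$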

Page 35: Experimental Statistics           - week 13

Balloon Data Set with Dummy Variables:

Col. 1-2 - observation number
Col. 3 - color (1=pink, 2=yellow, 3=orange, 4=blue)
Col. 5 - X1
Col. 6 - X2
Col. 7 - X3
Col. 9-12 - inflation time in seconds

 11 000 22.4
 23 010 24.6
 31 000 20.3
 44 001 19.8
 53 010 24.3
 62 100 22.2
 72 100 28.5
 82 100 25.7
 93 010 20.2
101 000 19.6
112 100 28.8
124 001 24.0
134 001 17.1
144 001 19.3
153 010 24.2
161 000 15.8
172 100 18.3
181 000 17.5
194 001 18.7
203 010 22.9
211 000 16.3
224 001 14.0
234 001 16.6
242 100 18.1
252 100 18.9
264 001 16.0
272 100 20.1
283 010 22.5
293 010 16.0
301 000 19.3
311 000 15.9
323 010 20.3

Page 36: Experimental Statistics           - week 13

ANOVA --- Balloon Data using Dummy Variables

The REG Procedure

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3   126.15125        42.05042      3.85      0.0200
Error             28   305.64750        10.91598
Corrected Total   31   431.79875

Root MSE          3.30393   R-Square   0.2922
Dependent Mean   20.25625   Adj R-Sq   0.2163
Coeff Var        16.31069

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   18.38750             1.16812          15.74     <.0001
x1           1    4.18750             1.65197           2.53     0.0171
x2           1    3.48750             1.65197           2.11     0.0438
x3           1   -0.20000             1.65197          -0.12     0.9045
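
These estimates can be checked by hand against the group means in the LSD results shown earlier (22.575 for yellow, 21.875 for orange, and 18.388 and 18.188 for pink and blue, the last two being rounded values of 18.3875 and 18.1875 computed from the raw data). A worked check using only numbers from the slides, with 8 observations per group and error mean square 10.91598:

```latex
\begin{align*}
\hat\beta_0 &= \bar y_1 = 18.3875 \\
\hat\beta_1 &= \bar y_2 - \bar y_1 = 22.575 - 18.3875 = 4.1875 \\
\hat\beta_2 &= \bar y_3 - \bar y_1 = 21.875 - 18.3875 = 3.4875 \\
\hat\beta_3 &= \bar y_4 - \bar y_1 = 18.1875 - 18.3875 = -0.2000 \\[4pt]
SE(\hat\beta_0) &= \sqrt{\frac{10.91598}{8}} \approx 1.16812 \\
SE(\hat\beta_k) &= \sqrt{\frac{2(10.91598)}{8}} \approx 1.65197, \quad k = 1, 2, 3
\end{align*}
```

Each slope is therefore a difference between a group mean and the pink (reference) mean, and its t-test is the pooled comparison of that group with pink.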

Page 37: Experimental Statistics           - week 13

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   18.38750             1.16812          15.74     <.0001
x1           1    4.18750             1.65197           2.53     0.0171
x2           1    3.48750             1.65197           2.11     0.0438
x3           1   -0.20000             1.65197          -0.12     0.9045

We conclude $\mu_2 - \mu_1 \neq 0$, i.e. reject $H_0: \mu_1 = \mu_2$ (i.e. “pink” ≠ “yellow”)

We reject $H_0: \mu_1 = \mu_3$, i.e. conclude “pink” ≠ “orange”

We do not reject $H_0: \mu_1 = \mu_4$, i.e. we cannot conclude “pink” and “blue” are different

Recall LSD Results

Grouping   Mean     N   color
A          22.575   8   2 (yellow)
A
A          21.875   8   3 (orange)

B          18.388   8   1 (pink)
B
B          18.188   8   4 (blue)

Page 38: Experimental Statistics           - week 13

We showed that 1-factor ANOVA can be run using regression analysis with dummy variables.

Question: What’s the real benefit of dummy variables?

Answer: Dummy variables can be mixed in with quantitative independent variables to give a combination of regression and ANOVA analyses.

Dummy Variables

Page 39: Experimental Statistics           - week 13

Survival Data

• study using 108 patients in a surgical unit.
• researchers interested in predicting the survival time (in days) of patients undergoing a type of liver operation

Independent Variables

clot = blood clotting score
prog = prognostic index
enzyme = enzyme function test score
liver = liver function test score
age = age in years
gender (0 = male, 1 = female)
alch1, alch2 = indicator of alcohol usage
    None: alch1 = 0, alch2 = 0
    Moderate: alch1 = 1, alch2 = 0
    Heavy: alch1 = 0, alch2 = 1

Page 40: Experimental Statistics           - week 13

Survival Data

DATA survival;
INPUT clot prog enzyme liver age gender alch1 alch2 survival;
DATALINES;
6.7 62 81 2.59 50 0 1 0 695
5.1 59 66 1.70 39 0 0 0 403
7.4 57 83 2.16 55 0 0 0 710
6.5 73 41 2.01 48 0 0 0 349
7.8 65 115 4.30 45 0 0 1 2343
5.8 38 72 1.42 65 1 1 0 348
. . .
;

PROC reg;
MODEL survival=clot prog enzyme liver age/selection=adjrsq;
output out=new r=ressurv p=predsurv;
RUN;

PROC reg;
MODEL lgsurv=clot prog enzyme liver age/selection=adjrsq;
output out=new r=ressvlg p=predsvlg;
RUN;

Gender: 0=male, 1=female

Alcohol Use   alch1   alch2
None          0       0
Moderate      1       0
Heavy         0       1
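
The code on this slide does not show where lgsurv is created, and its MODEL statements list five candidate variables while the selection output that follows also includes gender, alch1, and alch2. A sketch of one way to produce such a run, assuming lgsurv is the natural log of survival and that all eight candidates were offered to the selection procedure (both are assumptions, and survival2 is a made-up data set name):

```sas
/* Assumption: lgsurv is the natural log of survival time. */
data survival2;
  set survival;
  lgsurv = log(survival);   /* log() is the natural log in SAS */
run;

proc reg data=survival2;
  /* Assumption: all eight candidate variables are offered. */
  model lgsurv = clot prog enzyme liver age gender alch1 alch2
        / selection=adjrsq;
run;
```

The natural log is consistent with the reported dependent mean of 6.36909 for lgsurv, since survival times of several hundred days have natural logs near 6.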

Page 41: Experimental Statistics           - week 13

Adjusted R-Square Selection Method

Dependent Variable: survival

Number in   Adjusted
Model       R-Square   R-Square   Variables in Model
6           0.7611     0.7745     clot prog enzyme liver alch1 alch2
5           0.7606     0.7718     clot prog enzyme liver alch2
7           0.7592     0.7749     clot prog enzyme liver age alch1 alch2
7           0.7591     0.7748     clot prog enzyme liver gender alch1 alch2
6           0.7587     0.7723     clot prog enzyme liver age alch2
6           0.7587     0.7722     clot prog enzyme liver gender alch2
8           0.7571     0.7753     clot prog enzyme liver age gender alch1 alch2
7           0.7568     0.7727     clot prog enzyme liver age gender alch2
5           0.7416     0.7536     clot prog enzyme alch1 alch2

Dependent Variable: log(survival)

Number in   Adjusted
Model       R-Square   R-Square   Variables in Model
6           0.7649     0.7781     clot prog enzyme liver gender alch2
7           0.7634     0.7789     clot prog enzyme liver gender alch1 alch2
5           0.7628     0.7738     clot prog enzyme liver alch2
7           0.7627     0.7782     clot prog enzyme liver age gender alch2
6           0.7614     0.7747     clot prog enzyme liver alch1 alch2
8           0.7612     0.7790     clot prog enzyme liver age gender alch1 alch2
6           0.7605     0.7740     clot prog enzyme liver age alch2
7           0.7591     0.7749     clot prog enzyme liver age alch1 alch2

Page 42: Experimental Statistics           - week 13

6 variable model for log(survival) selected by adjusted $R^2$

Dependent Variable: lgsurv

Number of Observations Read   108
Number of Observations Used   108

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               6   20.33867         3.38978       59.04     <.0001
Error             101    5.79922         0.05742
Corrected Total   107   26.13789

Root MSE         0.23962   R-Square   0.7781
Dependent Mean   6.36909   Adj R-Sq   0.7649
Coeff Var        3.76224

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   3.91009              0.17992          21.73     <.0001
clot         1   0.06227              0.02023           3.08     0.0027
prog         1   0.01321              0.00158           8.38     <.0001
enzyme       1   0.01387              0.00141           9.84     <.0001
liver        1   0.06695              0.03547           1.89     0.0620
gender       1   0.06659              0.04766           1.40     0.1654
alch2        1   0.28922              0.05983           4.83     <.0001

Page 43: Experimental Statistics           - week 13

5 variable model for log(survival) selected by Backward Elimination

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               5   20.22658         4.04532       69.80     <.0001
Error             102    5.91132         0.05795
Corrected Total   107   26.13789

Root MSE         0.24074   R-Square   0.7738
Dependent Mean   6.36909   Adj R-Sq   0.7628
Coeff Var        3.77976

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   3.92845              0.18027          21.79     <.0001
clot         1   0.05942              0.02022           2.94     0.0041
prog         1   0.01342              0.00158           8.50     <.0001
enzyme       1   0.01387              0.00142           9.80     <.0001
liver        1   0.07362              0.03531           2.08     0.0396
alch2        1   0.28799              0.06010           4.79     <.0001

Page 44: Experimental Statistics           - week 13

None:     (0,0)   mean survival = 640.5
Moderate: (1,0)   mean survival = 608.4
Heavy:    (0,1)   mean survival = 815.2

What is the role of the variable “alch2” in the model?

alch2 = 1 implies heavy
alch2 = 0 implies none or moderate
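
Because the dependent variable is log(survival), the alch2 coefficient acts multiplicatively on the original scale. A worked reading of the 5 variable model's estimate of 0.28799, holding the other variables fixed (this interpretation is an addition, not taken from the slides):

```latex
\[
\frac{\widehat{\text{survival}}(\text{alch2} = 1)}
     {\widehat{\text{survival}}(\text{alch2} = 0)}
  = e^{\hat\beta_{\text{alch2}}}
  = e^{0.28799} \approx 1.33
\]
```

Under this model, heavy alcohol use is associated with a back-transformed predicted survival time about 33% higher than for none or moderate use, which points the same way as the raw group means listed above.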