Other Regression Models
Andy Wang
CIS 5930-03: Computer Systems Performance Analysis
*Regression With Categorical Predictors
The regression methods discussed so far assume numerical variables
What if some of your variables are categorical in nature?
If all are categorical, use techniques discussed later in the course
Levels: the number of values a category can take
*Handling Categorical Predictors
If only two levels, define bi as follows:
  bi = 0 for the first value
  bi = 1 for the second value
  (This definition is missing from the book in Section 15.2)
Can use +1 and -1 as the values instead
Need k-1 predictor variables for k levels
  To avoid implying an order among the categories
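The k-1 indicator coding above can be sketched as follows; the function name and the level values are hypothetical, chosen just for illustration:

```python
import numpy as np

def dummy_code(values, levels):
    """Encode a categorical column as k-1 indicator (0/1) columns.

    The first level is the reference: it maps to all-zero columns,
    so no artificial ordering is implied among the k levels.
    """
    cols = []
    for level in levels[1:]:          # k-1 columns for k levels
        cols.append([1 if v == level else 0 for v in values])
    return np.array(cols).T

# A two-level category needs just a single 0/1 predictor:
x = dummy_code(["hdd", "ssd", "hdd", "ssd"], levels=["hdd", "ssd"])
# x is [[0], [1], [0], [1]]
```

A three-level category would yield two columns, with the reference level encoded as (0, 0).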
*Categorical Variables Example
Which is a better predictor of a high rating in the movie database: winning an Oscar, winning the Golden Palm at Cannes, or winning the New York Critics Circle award?
*Choosing Variables
Categories are not mutually exclusive
x1 = 1 if Oscar, 0 otherwise
x2 = 1 if Golden Palm, 0 otherwise
x3 = 1 if Critics Circle Award, 0 otherwise
y = b0 + b1x1 + b2x2 + b3x3
*A Few Data Points
Title                  Rating  Oscar  Palm  NYC
Gentleman's Agreement  7.5     X            X
Mutiny on the Bounty   7.6     X
Marty                  7.4     X      X     X
If....                 7.8            X
La Dolce Vita          8.1            X
Kagemusha              8.2            X
The Defiant Ones       7.5                  X
Reds                   6.6                  X
High Noon              8.1                  X
*And Regression Says . . .
How good is that?
R² shows 34% of variation explained
  Better than age and length
  But still no great shakes
Are the regression parameters significant at the 90% level?
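A least-squares fit of the indicator model above can be sketched with NumPy. The 0/1 assignments follow my reading of which award each X in the table marks, and these few rows are only a sample; the slide's R² of about 34% refers to the regression on the full movie database:

```python
import numpy as np

# Ratings and award indicators (Oscar, Golden Palm, NY Critics Circle),
# transcribed from the sample table; column assignments are my reading.
y  = np.array([7.5, 7.6, 7.4, 7.8, 8.1, 8.2, 7.5, 6.6, 8.1])
x1 = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0])   # Oscar
x2 = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])   # Golden Palm
x3 = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1])   # Critics Circle

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(y), x1, x2, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient of determination for this fit.
yhat = X @ b
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)
```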
*Curvilinear Regression
Linear regression assumes a linear relationship between predictor and response
What if it isn't linear?
You need to fit some other type of function to the relationship
*When To Use Curvilinear Regression
Easiest to tell by sight: make a scatter plot
  If the plot looks non-linear, try curvilinear regression
Or if a non-linear relationship is suspected for other reasons
The relationship should be convertible to a linear form
*Types of Curvilinear Regression
Many possible types, based on a variety of relationships, e.g.:
  y = a + b/x
  y = a·e^(bx)
  y = a·x^b
  y = a + b·ln(x)
Many others
*Transform Them to Linear Forms
Apply logarithms, multiplication, division, whatever to produce something in linear form
  I.e., y = a + b × (something)
  Or a similar form
If a predictor appears in more than one transformed predictor variable, correlation is likely!
*Sample Transformations
For y = a·e^(bx), take the logarithm of y:
  ln(y) = ln(a) + b·x
  Let y′ = ln(y), b0 = ln(a), b1 = b
  Do regression on y′ = b0 + b1·x
For y = a + b·ln(x), let x′ = ln(x) (i.e., x = e^(x′))
  Do regression on y = a + b·x′
*Sample Transformations
For y = a·x^b, take the log of both x and y:
  ln(y) = ln(a) + b·ln(x)
  Let y′ = ln(y), b0 = ln(a), b1 = b, x′ = ln(x)
  Do regression on y′ = b0 + b1·x′
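As a minimal sketch of the exponential case, with made-up data generated from known a and b (not from the lecture), the log transform recovers the parameters via an ordinary linear fit:

```python
import numpy as np

# Synthetic data from y = a * exp(b*x) with a = 2, b = 0.5 (illustrative).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
a_true, b_true = 2.0, 0.5
y = a_true * np.exp(b_true * x)

# Transform: ln(y) = ln(a) + b*x, then ordinary linear regression.
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

a_hat = np.exp(b0)   # back-transform the intercept to recover a
b_hat = b1           # slope is b directly
```

Because the data here are noise-free, the fit recovers a = 2 and b = 0.5 up to floating-point error.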
*Corrections to Jain p. 257
[Nonlinear-vs.-linear comparison table not recovered from the original slide]
*General Transformations
Use some function of the response variable y in place of y itself
Curvilinear regression is one example
But the techniques are more generally applicable
*When To Transform?
If known properties of the measured system suggest it
If the data's range covers several orders of magnitude
If the homogeneous-variance assumption on the residuals (homoscedasticity) is violated
*Transforming Due To Homoscedasticity
If the spread in the scatter plot of residuals vs. predicted response isn't homogeneous,
then the residuals are still functions of the predictor variables
A transformation of the response may solve the problem
*What Transformation To Use?
Compute the standard deviation of the residuals at each predicted value ŷ
  Assumes multiple residuals at each predicted value
Plot it as a function of the mean of the observations
  Assuming multiple experiments for a single set of predictor values
Check for linearity: if linear, use a log transform
*Other Tests for Transformations
If variance against mean of observations is linear, use a square-root transform
If standard deviation against mean squared is linear, use an inverse (1/y) transform
If standard deviation against the mean raised to a power is linear, use a power transform
More are covered in the book
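The check behind these rules can be sketched as follows, with illustrative replicated observations whose spread grows linearly with the mean (so the log-transform rule applies); the group values are made up:

```python
import numpy as np

# Replicated observations at each of three predictor settings
# (illustrative data; spread grows linearly with the mean).
groups = [
    np.array([ 9.0, 10.0, 11.0]),
    np.array([18.0, 20.0, 22.0]),
    np.array([36.0, 40.0, 44.0]),
]
means = np.array([g.mean() for g in groups])          # 10, 20, 40
sds   = np.array([g.std(ddof=1) for g in groups])     # 1, 2, 4

# If sds vs. means is linear through the origin (constant ratio),
# the log transform stabilizes the variance.
ratios = sds / means
```

Here every ratio is 0.1, so the standard deviation is linear in the mean and a log transform of y is indicated.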
*General Transformation Principle
For some observed relation between the standard deviation of the residuals and the mean,
  s = g(ȳ)
let
  h(y) = ∫ dy / g(y)
transform to
  w = h(y)
and regress on w

*Example: Log Transformation
If the standard deviation against the mean is linear, then g(y) = a·y
So
  h(y) = ∫ dy / (a·y) = (1/a)·ln(y)
and, ignoring the constant factor, the log transform w = ln(y) is appropriate
*Confidence Intervals for Nonlinear Regressions
For nonlinear fits using general (e.g., exponential) transformations:
  Confidence intervals apply to the transformed parameters
  It is not valid to apply the inverse transformation to the intervals
  Must express confidence intervals in the transformed domain
*Outliers
Atypical observations might be outliers
  Measurements that are not truly characteristic
  By chance, several standard deviations out
  Or mistakes might have been made in measurement
Which leads to a problem: do you include outliers in the analysis or not?
*Deciding How To Handle Outliers
1. Find them (by looking at the scatter plot)
2. Check carefully for experimental error
3. Repeat the experiments at the predictor values for each outlier
4. Decide whether to include or omit each outlier
   Or do the analysis both ways
Question: is the first point in last lecture's example an outlier on the rating-vs.-age plot?
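Step 1 can be automated as a first pass: flag points whose residual lies several standard deviations out. This is a sketch (the function name, the k = 2 threshold, and the data are mine, not the lecture's); flagged points get the scrutiny of steps 2-4, they are not automatically dropped:

```python
import numpy as np

def flag_outliers(y, yhat, k=2.0):
    """Flag points whose residual is more than k residual standard
    deviations in magnitude. Candidates for inspection, not removal."""
    r = np.asarray(y, float) - np.asarray(yhat, float)
    s = r.std(ddof=1)
    return np.abs(r) > k * s

# One point sits far off an otherwise good fit:
y    = [1.0, 2.1, 2.9, 4.0, 9.0]
yhat = [1.0, 2.0, 3.0, 4.0, 5.0]
flags = flag_outliers(y, yhat)   # only the last point is flagged
```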
*Rating vs. Age
[Embedded spreadsheets and charts omitted. The recoverable content: the rating-vs.-age and rating-vs.-length scatter-plot data from last lecture's example, and the worked computation b = (XᵀX)⁻¹·Xᵀy for the regression of rating on age and length, giving b = (8.373, 0.0051, -0.0086), SSE = 1.081, R² = 0.233, MSR/MSE = 0.758, confidence intervals b0: (5.997, 10.748), b1: (-0.0103, 0.0205), b2: (-0.0265, 0.0093), and correlation R = -0.252 between age and length.

Age  Rating    Length  Rating
5    8.1       118     8.1
13   6.8       132     6.8
20   7.0       119     7.0
28   7.4       153     7.4
41   7.7       91      7.7
49   7.5       118     7.5
61   7.6       132     7.6
62   8.0       105     8.0]
*Common Mistakes in Regression
Generally based on taking shortcuts
Or not being careful
Or not understanding some fundamental principle of statistics
*Not Verifying Linearity
Draw the scatter plot
If it's not linear, check for curvilinear possibilities
It is misleading to use linear regression when the relationship isn't linear
*Relying on Results Without Visual Verification
Always check the scatter plot as part of the regression
Examine the predicted line vs. the actual points
Particularly important if the regression is done automatically
*Some Nonlinear Examples
*Attaching Importance To Values of Parameters
The numerical values of regression parameters depend on the scale of the predictor variables
So a parameter's value seeming small or large is not necessarily an indication of its importance
E.g., converting seconds to microseconds doesn't change anything fundamental
  But the magnitude of the associated parameter changes
*Not Specifying Confidence Intervals
Samples of observations are random
Thus, regression yields parameters with random properties
Without a confidence interval, it is impossible to understand what a parameter really means
*Not Calculating the Coefficient of Determination
Without R², it is difficult to determine how much of the variance is explained by the regression
Even if R² looks good, it is safest to also perform an F-test
  Not that much extra effort
*Using the Coefficient of Correlation Improperly
The coefficient of determination is R²
The coefficient of correlation is R
R², not R, gives the percentage of variance explained by the regression
E.g., if R is 0.5, R² is 0.25
  And the regression explains 25% of the variance, not 50%!
*Using Highly Correlated Predictor Variables
If two predictor variables are highly correlated, using both degrades the regression
E.g., there is likely to be correlation between an executable's on-disk and in-core sizes
  So don't use both as predictors of run time
You need to understand your predictor variables as well as possible
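A quick check before including both predictors can be sketched as follows; the size numbers are made up for illustration, and the 0.9 threshold is a rule of thumb, not from the lecture:

```python
import numpy as np

# Hypothetical predictors: an executable's on-disk and in-core sizes
# (illustrative numbers; nearly proportional, hence highly correlated).
disk = np.array([100.0, 220.0, 310.0, 400.0, 520.0])
core = np.array([120.0, 250.0, 330.0, 430.0, 560.0])

r = np.corrcoef(disk, core)[0, 1]

# |r| near 1 means the two columns carry nearly the same information;
# keep only one (or combine them) before regressing run time on them.
highly_correlated = abs(r) > 0.9
```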
*Using Regression Beyond the Range of Observations
Regression is based on observed behavior in a particular sample
It is most likely to predict accurately within the range of that sample
  Far outside that range, who knows?
E.g., a regression on the run time of executables smaller than main memory may not predict the performance of executables that need VM activity
*Using Too Many Predictor Variables
Adding more predictors does not necessarily improve the model!
More likely to run into multicollinearity problems
So which variables to choose?
  That is the subject of much of this course
*Measuring Too Little of the Range
Regression only predicts well near the range of observations
If you don't measure the commonly used range, the regression won't predict much
E.g., if many programs are bigger than main memory, measuring only those that are smaller is a mistake
*Assuming a Good Predictor Is a Good Controller
Correlation isn't necessarily control
Just because variable A is related to variable B, you may not be able to control the values of B by varying A
E.g., the number of hits on a Web page may be correlated with server bandwidth, but you might not boost hits by increasing bandwidth
Often, a goal of regression is finding control variables
This belongs in the advanced regression lecture.