Other Regression Models
Andy Wang
CIS 5930-03: Computer Systems Performance Analysis
*Regression With Categorical Predictors
The regression methods discussed so far assume numerical variables
What if some of your variables are categorical in nature?
If all are categorical, use techniques discussed later in the course
Levels: the number of values a category can take
*Handling Categorical Predictors
If only two levels, define bi as follows:
  bi = 0 for the first value
  bi = 1 for the second value
  (This definition is missing from the book in Section 15.2)
Can use +1 and -1 as the values instead
Need k-1 predictor variables for k levels
  To avoid implying an order among the categories
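The k-1 indicator coding above can be sketched as follows; the function name and the level values are hypothetical, chosen just for illustration:

```python
import numpy as np

def dummy_code(values, levels):
    """Encode a categorical column as k-1 indicator (0/1) columns.

    The first level is the reference: it maps to all-zero columns,
    so no artificial ordering is implied among the k levels.
    """
    cols = []
    for level in levels[1:]:          # k-1 columns for k levels
        cols.append([1 if v == level else 0 for v in values])
    return np.array(cols).T

# A two-level category needs just a single 0/1 predictor:
x = dummy_code(["hdd", "ssd", "hdd", "ssd"], levels=["hdd", "ssd"])
# x is [[0], [1], [0], [1]]
```

A three-level category would yield two columns, with the reference level encoded as (0, 0).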
*Categorical Variables Example
Which is a better predictor of a high rating in the movie database: winning an Oscar, winning the Golden Palm at Cannes, or winning the New York Critics Circle award?
*Choosing Variables
Categories are not mutually exclusive
x1 = 1 if Oscar, 0 otherwise
x2 = 1 if Golden Palm, 0 otherwise
x3 = 1 if Critics Circle Award, 0 otherwise
y = b0 + b1x1 + b2x2 + b3x3
*A Few Data Points
Title                  Rating  Oscar  Palm  NYC
Gentleman's Agreement  7.5     X            X
Mutiny on the Bounty   7.6     X
Marty                  7.4     X      X     X
If....                 7.8            X
La Dolce Vita          8.1            X
Kagemusha              8.2            X
The Defiant Ones       7.5                  X
Reds                   6.6                  X
High Noon              8.1                  X
*And Regression Says . . .
How good is that?
R² shows 34% of variation explained
  Better than age and length
  But still no great shakes
Are the regression parameters significant at the 90% level?
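A least-squares fit of the indicator model above can be sketched with NumPy. The 0/1 assignments follow my reading of which award each X in the table marks, and these few rows are only a sample; the slide's R² of about 34% refers to the regression on the full movie database:

```python
import numpy as np

# Ratings and award indicators (Oscar, Golden Palm, NY Critics Circle),
# transcribed from the sample table; column assignments are my reading.
y  = np.array([7.5, 7.6, 7.4, 7.8, 8.1, 8.2, 7.5, 6.6, 8.1])
x1 = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0])   # Oscar
x2 = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])   # Golden Palm
x3 = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1])   # Critics Circle

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(y), x1, x2, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient of determination for this fit.
yhat = X @ b
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)
```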
*Curvilinear Regression
Linear regression assumes a linear relationship between predictor and response
What if it isn't linear?
You need to fit some other type of function to the relationship
*When To Use Curvilinear Regression
Easiest to tell by sight: make a scatter plot
  If the plot looks non-linear, try curvilinear regression
Or if a non-linear relationship is suspected for other reasons
The relationship should be convertible to a linear form
*Types of Curvilinear Regression
Many possible types, based on a variety of relationships, e.g.:
  y = a + b/x
  y = a·e^(bx)
  y = a·x^b
  y = a + b·ln(x)
Many others
*Transform Them to Linear Forms
Apply logarithms, multiplication, division, whatever to produce something in linear form
  I.e., y = a + b × (something)
  Or a similar form
If a predictor appears in more than one transformed predictor variable, correlation is likely!
*Sample Transformations
For y = a·e^(bx), take the logarithm of y:
  ln(y) = ln(a) + b·x
  Let y′ = ln(y), b0 = ln(a), b1 = b
  Do regression on y′ = b0 + b1·x
For y = a + b·ln(x), let x′ = ln(x) (i.e., x = e^(x′))
  Do regression on y = a + b·x′
*Sample Transformations
For y = a·x^b, take the log of both x and y:
  ln(y) = ln(a) + b·ln(x)
  Let y′ = ln(y), b0 = ln(a), b1 = b, x′ = ln(x)
  Do regression on y′ = b0 + b1·x′
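As a minimal sketch of the exponential case, with made-up data generated from known a and b (not from the lecture), the log transform recovers the parameters via an ordinary linear fit:

```python
import numpy as np

# Synthetic data from y = a * exp(b*x) with a = 2, b = 0.5 (illustrative).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
a_true, b_true = 2.0, 0.5
y = a_true * np.exp(b_true * x)

# Transform: ln(y) = ln(a) + b*x, then ordinary linear regression.
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

a_hat = np.exp(b0)   # back-transform the intercept to recover a
b_hat = b1           # slope is b directly
```

Because the data here are noise-free, the fit recovers a = 2 and b = 0.5 up to floating-point error.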
*Corrections to Jain p. 257
[Nonlinear-vs.-linear comparison table not recovered from the original slide]
*General Transformations
Use some function of the response variable y in place of y itself
Curvilinear regression is one example
But the techniques are more generally applicable
*When To Transform?
If known properties of the measured system suggest it
If the data's range covers several orders of magnitude
If the homogeneous-variance assumption on the residuals (homoscedasticity) is violated
*Transforming Due To Homoscedasticity
If the spread in the scatter plot of residuals vs. predicted response isn't homogeneous,
then the residuals are still functions of the predictor variables
A transformation of the response may solve the problem
*What Transformation To Use?
Compute the standard deviation of the residuals at each predicted value ŷ
  Assumes multiple residuals at each predicted value
Plot it as a function of the mean of the observations
  Assuming multiple experiments for a single set of predictor values
Check for linearity: if linear, use a log transform
*Other Tests for Transformations
If variance against mean of observations is linear, use a square-root transform
If standard deviation against mean squared is linear, use an inverse (1/y) transform
If standard deviation against the mean raised to a power is linear, use a power transform
More are covered in the book
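The check behind these rules can be sketched as follows, with illustrative replicated observations whose spread grows linearly with the mean (so the log-transform rule applies); the group values are made up:

```python
import numpy as np

# Replicated observations at each of three predictor settings
# (illustrative data; spread grows linearly with the mean).
groups = [
    np.array([ 9.0, 10.0, 11.0]),
    np.array([18.0, 20.0, 22.0]),
    np.array([36.0, 40.0, 44.0]),
]
means = np.array([g.mean() for g in groups])          # 10, 20, 40
sds   = np.array([g.std(ddof=1) for g in groups])     # 1, 2, 4

# If sds vs. means is linear through the origin (constant ratio),
# the log transform stabilizes the variance.
ratios = sds / means
```

Here every ratio is 0.1, so the standard deviation is linear in the mean and a log transform of y is indicated.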
*General Transformation Principle
For some observed relation between the standard deviation of the residuals and the mean,
  s = g(ȳ)
let
  h(y) = ∫ dy / g(y)
transform to
  w = h(y)
and regress on w

*Example: Log Transformation
If the standard deviation against the mean is linear, then g(y) = a·y
So
  h(y) = ∫ dy / (a·y) = (1/a)·ln(y)
and, ignoring the constant factor, the log transform w = ln(y) is appropriate
*Confidence Intervals for Nonlinear Regressions
For nonlinear fits using general (e.g., exponential) transformations:
  Confidence intervals apply to the transformed parameters
  It is not valid to apply the inverse transformation to the intervals
  Must express confidence intervals in the transformed domain
*Outliers
Atypical observations might be outliers
  Measurements that are not truly characteristic
  By chance, several standard deviations out
  Or mistakes might have been made in measurement
Which leads to a problem: do you include outliers in the analysis or not?
*Deciding How To Handle Outliers
1. Find them (by looking at the scatter plot)
2. Check carefully for experimental error
3. Repeat the experiments at the predictor values for each outlier
4. Decide whether to include or omit each outlier
   Or do the analysis both ways
Question: is the first point in last lecture's example an outlier on the rating-vs.-age plot?
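Step 1 can be automated as a first pass: flag points whose residual lies several standard deviations out. This is a sketch (the function name, the k = 2 threshold, and the data are mine, not the lecture's); flagged points get the scrutiny of steps 2-4, they are not automatically dropped:

```python
import numpy as np

def flag_outliers(y, yhat, k=2.0):
    """Flag points whose residual is more than k residual standard
    deviations in magnitude. Candidates for inspection, not removal."""
    r = np.asarray(y, float) - np.asarray(yhat, float)
    s = r.std(ddof=1)
    return np.abs(r) > k * s

# One point sits far off an otherwise good fit:
y    = [1.0, 2.1, 2.9, 4.0, 9.0]
yhat = [1.0, 2.0, 3.0, 4.0, 5.0]
flags = flag_outliers(y, yhat)   # only the last point is flagged
```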
*Rating vs. Age
[Embedded spreadsheets and charts omitted. The recoverable content: the rating-vs.-age and rating-vs.-length scatter-plot data from last lecture's example, and the worked computation b = (XᵀX)⁻¹·Xᵀy for the regression of rating on age and length, giving b = (8.373, 0.0051, -0.0086), SSE = 1.081, R² = 0.233, MSR/MSE = 0.758, confidence intervals b0: (5.997, 10.748), b1: (-0.0103, 0.0205), b2: (-0.0265, 0.0093), and correlation R = -0.252 between age and length.

Age  Rating    Length  Rating
5    8.1       118     8.1
13   6.8       132     6.8
20   7.0       119     7.0
28   7.4       153     7.4
41   7.7       91      7.7
49   7.5       118     7.5
61   7.6       132     7.6
62   8.0       105     8.0]
*Common Mistakes in Regression
Generally based on taking shortcuts
Or not being careful
Or not understanding some fundamental principle of statistics
*Not Verifying Linearity
Draw the scatter plot
If it's not linear, check for curvilinear possibilities
It is misleading to use linear regression when the relationship isn't linear
*Relying on Results Without Visual Verification
Always check the scatter plot as part of the regression
Examine the predicted line vs. the actual points
Particularly important if the regression is done automatically
*Some Nonlinear Examples
*Attaching Importance To Values of Parameters
The numerical values of regression parameters depend on the scale of the predictor variables
So a parameter's value seeming small or large is not necessarily an indication of its importance
E.g., converting seconds to microseconds doesn't change anything fundamental
  But the magnitude of the associated parameter changes
*Not Specifying Confidence Intervals
Samples of observations are random
Thus, regression yields parameters with random properties
Without a confidence interval, it is impossible to understand what a parameter really means
*Not Calculating the Coefficient of Determination
Without R², it is difficult to determine how much of the variance is explained by the regression
Even if R² looks good, it is safest to also perform an F-test
  Not that much extra effort
*Using the Coefficient of Correlation Improperly
The coefficient of determination is R²
The coefficient of correlation is R
R², not R, gives the percentage of variance explained by the regression
E.g., if R is 0.5, R² is 0.25
  And the regression explains 25% of the variance, not 50%!
*Using Highly Correlated Predictor Variables
If two predictor variables are highly correlated, using both degrades the regression
E.g., there is likely to be correlation between an executable's on-disk and in-core sizes
  So don't use both as predictors of run time
You need to understand your predictor variables as well as possible
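A quick check before including both predictors can be sketched as follows; the size numbers are made up for illustration, and the 0.9 threshold is a rule of thumb, not from the lecture:

```python
import numpy as np

# Hypothetical predictors: an executable's on-disk and in-core sizes
# (illustrative numbers; nearly proportional, hence highly correlated).
disk = np.array([100.0, 220.0, 310.0, 400.0, 520.0])
core = np.array([120.0, 250.0, 330.0, 430.0, 560.0])

r = np.corrcoef(disk, core)[0, 1]

# |r| near 1 means the two columns carry nearly the same information;
# keep only one (or combine them) before regressing run time on them.
highly_correlated = abs(r) > 0.9
```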
*Using Regression Beyond the Range of Observations
Regression is based on observed behavior in a particular sample
It is most likely to predict accurately within the range of that sample
  Far outside that range, who knows?
E.g., a regression on the run time of executables smaller than main memory may not predict the performance of executables that need VM activity
*Using Too Many Predictor Variables
Adding more predictors does not necessarily improve the model!
More likely to run into multicollinearity problems
So which variables to choose?
  That is the subject of much of this course
*Measuring Too Little of the Range
Regression only predicts well near the range of observations
If you don't measure the commonly used range, the regression won't predict much
E.g., if many programs are bigger than main memory, measuring only those that are smaller is a mistake
*Assuming a Good Predictor Is a Good Controller
Correlation isn't necessarily control
Just because variable A is related to variable B, you may not be able to control the values of B by varying A
E.g., the number of hits on a Web page may be correlated with server bandwidth, but you might not boost hits by increasing bandwidth
Often, a goal of regression is finding control variables
This belongs in the advanced regression lecture.