Lecture 8: Experiment Design

CS 239, Spring 2007: Experimental Methodologies for System Software

Peter Reiher

May 1, 2007


Designing Experiments

• Introduction

• A digression into multilinear regression

• Types of experiment designs

• Simple experimental design

• Full factorial design


Introduction To Experiment Design

• You know your metrics

• You know your factors

• You know your levels

• You’ve got your instrumentation and test loads

• Now what?


Goals in Experiment Design

• Obtain maximum information

• With minimum work

– Typically meaning minimum number of experiments

• More experiments aren’t better if you’re the one who has to perform them

• Well-designed experiments are also easier to analyze


Basic Experiment Design Terminology

• Response variable – quantitative description of outcome of experiment

• Factors – Variables that affect the response variable (aka predictors)

• Levels – Values a factor can assume


More Terminology

• Primary factors – the most important ones (i.e., those you test)

• Secondary factors – the less important ones (i.e., those you don’t test)

• Replication – one run of an experiment

– Multiple replications for statistical validation


What Is Experiment Design?

• Specification of:

– Number of experiments

– Factor level combinations for experiments

– Number of replications of each experiment


An Example

• Consider Time Warp

• The response variable is run time

• The factors are number of nodes used, dynamic load management, lazy vs. aggressive cancellation, maybe others

• The primary factors will be number of nodes and use of dynamic load management


Continuing the Example

• What are the levels?

– For number of nodes, if we have a 64-node parallel processor, could choose any integer between 1 and 64

• Dynamic load management is either on or off


So What Might Be the Experiment Design?

• 6 experiments

• All combinations of:

– Nodes chosen from (8,32,64)

– Dynamic load management (off,on)

• Three replications of each experiment


Interactions

• If there is more than one primary factor, factors might interact

• Might be relationships between how various factors affect performance

• E.g., in Time Warp case, perhaps dynamic load management is more effective with more nodes

• Vital to understand factor interactions

– By end of experiments, if not before

• Clearly complicates experiment design


Basic Problem in Designing Experiments

• You have chosen some number of factors

• They may or may not interact

• How can you design an experiment that captures the full range of the levels?

– With minimum amount of work

• Which combination or combinations of the levels of the factors do you measure?


Common Mistakes in Experimentation

• Ignoring experimental error

• Uncontrolled parameters

• Not isolating effects of different factors

• One-factor-at-a-time experiment designs

• Interactions ignored

• Designs requiring too many experiments


A Digression Into Multilinear Regression

• Like linear regression, but assumes multiple control variables

• Solution generally uses simple matrix math

• Introduces an issue not present in simple linear regression: multicollinearity


Multiple Linear Regression

• Models with more than one predictor variable

• But each predictor variable has a linear relationship to the response variable

• Conceptually, plotting a regression line in n-dimensional space, instead of 2-dimensional


Basic Multiple Linear Regression Formula

Response y is a function of k predictor variables $x_1, x_2, \ldots, x_k$:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + e$$


A Multiple Linear Regression Model

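Given a sample of n observations

$$(x_{11}, x_{21}, \ldots, x_{k1}, y_1), \ldots, (x_{1n}, x_{2n}, \ldots, x_{kn}, y_n)$$

the model consists of n equations:

$$y_1 = b_0 + b_1 x_{11} + b_2 x_{21} + \cdots + b_k x_{k1} + e_1$$

$$y_2 = b_0 + b_1 x_{12} + b_2 x_{22} + \cdots + b_k x_{k2} + e_2$$

$$\vdots$$

$$y_n = b_0 + b_1 x_{1n} + b_2 x_{2n} + \cdots + b_k x_{kn} + e_n$$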


Looks Like It’s Matrix Arithmetic Time

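$$y = Xb + e$$

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$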


Analysis of Multiple Linear Regression

• The analysis formulas are listed in Box 15.1 of Jain

• Not terribly important (for our purposes) how they were derived

– This isn’t a class on statistics

• But you need to know how to use them

• Mostly matrix analogs to simple linear regression results


Multiple Linear Regression Example

• Internet Movie Database keeps popularity ratings of movies (in numerical form)

• Postulate popularity of Academy Award winning films is based on two factors

– Age

– Running time

• Produce a regression

rating = b0 + b1(age) + b2(length)


Some Sample Data

• I selected 35 Oscar winners spanning 78 years

• Used IMDB data to get run times and user rating

– As of 4/29/07

• User rating is our y

• Run time and age are our control variables

• What’s the regression, and how good is it?

• Complete data in spreadsheet posted on class web page


Now For Some Tedious Matrix Arithmetic

• We need to calculate $X$, $X^T$, $X^T X$, $(X^T X)^{-1}$, and $X^T y$

Because

$$b = (X^T X)^{-1} X^T y$$

I shall spare you the details, but

b = (7.34, -.009, .005)

Meaning the regression predicts

rating = 7.34 - .009*age + .005*length
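The matrix step can be sketched in a few lines of standard-library Python. This is only an illustration: the movie data is not reproduced in these slides, so the sketch uses made-up, noise-free data generated from y = 1 + 2*x1 + 3*x2, and the helper names (`transpose`, `matmul`, `solve`, `fit`) are assumptions of this sketch. It solves the normal equations $(X^T X) b = X^T y$ by elimination, which gives the same b as forming the inverse explicitly.

```python
def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, rhs):
    """Solve a * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit(predictors, y):
    """Least-squares b from the normal equations (X^T X) b = X^T y."""
    X = [[1.0] + list(row) for row in predictors]  # leading 1s give b0
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * yi for x, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical data (NOT the movie data): exact values of 1 + 2*x1 + 3*x2.
predictors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]
b = fit(predictors, y)
print([round(v, 6) for v in b])
```

Because the made-up data has no noise, the fit should recover the generating coefficients; with real data such as the movie sample, b would instead be the least-squares estimate.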


How Good Is This Regression Model?

• How accurately does the model predict the rating of a film based on its age and running time?

• Best way to determine this analytically is to calculate the errors

$$SSE = \sum e_i^2 = 15.03$$

or

$$SSE = y^T y - b^T X^T y$$


Now Calculate R2

• SSY = $\sum y_i^2 = 2107.42$

• SS0 = $n \bar{y}^2 = 2089.03$

• SST = SSY – SS0 = 2107.42 – 2089.03 = 18.39

• SSR = SST – SSE = 18.39 – 15.03 = 3.36

$$R^2 = \frac{SSR}{SST} = \frac{3.36}{18.39} = .18$$

• In other words, the regression stinks
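The arithmetic on this slide can be replayed as a short sketch. The three input values are the ones quoted on the slides; the variable names are just this sketch's choices.

```python
# Values quoted on the slides for the movie regression.
ssy = 2107.42  # sum of y_i^2
ss0 = 2089.03  # n * (mean of y)^2
sse = 15.03    # sum of squared errors

sst = ssy - ss0        # total variation
ssr = sst - sse        # variation explained by the regression
r_squared = ssr / sst  # fraction of variation explained

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, R^2 = {r_squared:.2f}")
```

An R^2 of about .18 means the two predictors together account for less than a fifth of the variation in the ratings.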


Why Does It Stink?

• Let’s look at the properties of the regression parameters

$$s_e = \sqrt{\frac{SSE}{n-3}} = \sqrt{\frac{15.03}{32}} = .69$$

• Now calculate standard deviations of the regression parameters


Calculating STDEV of Regression Parameters

• Estimations only, since we’re working with a sample

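• Estimated stdevs of $b_0$, $b_1$, and $b_2$:

$$s_{b_0} = s_e c_{00} = (.69)(.97) = .67$$

$$s_{b_1} = s_e c_{11} = (.69)(.000044) = .00003$$

$$s_{b_2} = s_e c_{22} = (.69)(.000041) = .00003$$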


Calculating Confidence Intervals of STDEVs

• At the 90% level, for instance

• Confidence intervals for $b_0$, $b_1$, and $b_2$:

$$b_0 = 7.34 \pm 1.311(.67) = (6.46, 8.23)$$

$$b_1 = -.009 \pm 1.311(.00003) = (-.0087, -.0086)$$

$$b_2 = .005 \pm 1.311(.000028) = (.0052, .0053)$$

All three are statistically significant


So Why Does Regression “Fail”?

• Look at magnitude of standard deviations of regression parameters

• Almost all deviation is within the b0 parameter

• The other parameters are minimally predictive of the value

• Trying to fit a line to “noise” in the variable

– Or to some other parameter not yet identified


Analysis of Variance

• Are the predictor parameters significant predictors of the variance in the data?

• F-tests can be used to answer this question

– Determine if the SSR is significantly higher than the SSE

– Equivalent to testing that y does not depend on any of the predictor variables

• Discussed in detail in book


Running an F-Test

• Need to calculate SSR and SSE

• From those, calculate mean squares of the regression (MSR) and the errors (MSE)

• MSR/MSE has an F distribution

• If MSR/MSE > F-table, predictors explain a significant fraction of response variation

– F-tables in appendices A6-A8

• Note typo in book’s table 15.3– SST matches not

y y y y


F-Test for Our Example

• SSR = 3.36

• SSE = 15.03

• MSR = SSR/k = 3.36/2 = 1.68

• MSE = SSE/(n-k-1) = 15.03/(35 - 2 - 1) = .47

• F-computed = MSR/MSE = 3.58

• F[90; 2,32] = 2.268 (at 90%)

• So it passes the F-test at 90%

• Which means predictor variables explain some of the variation with 90% confidence
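The F-test steps above can be sketched as plain arithmetic. All inputs are taken from the slides, including the table value 2.268, which is used here exactly as quoted rather than looked up independently.

```python
n, k = 35, 2             # observations and predictor variables
ssr, sse = 3.36, 15.03   # sums of squares from the earlier slides

msr = ssr / k            # mean square of the regression
mse = sse / (n - k - 1)  # mean square of the errors
f_computed = msr / mse

f_table = 2.268          # F[90; 2,32] as quoted on the slide
passes = f_computed > f_table

print(f"MSR = {msr:.2f}, MSE = {mse:.2f}, F = {f_computed:.2f}, passes = {passes}")
```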


Multicollinearity

• If two predictor variables are linearly dependent, they are collinear

– Meaning they are related

– And thus the second does not improve the regression

– In fact, it can make it worse

• Typical symptom is inconsistent results from various significance tests


Finding Multicollinearity

• Must test correlation between predictor variables

• If it’s high, eliminate one and repeat the regression without it

• If the significance of regression improves, it’s probably due to collinearity between the variables
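A minimal sketch of the first step, testing correlation between predictor variables, might look like the following. The three predictor columns are hypothetical (made up for illustration, not drawn from the movie data), and `pearson` is a helper written for this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical predictor columns: x2 is roughly 2 * x1, x3 is unrelated to x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5, 1, 4, 2, 6, 3]

r12 = pearson(x1, x2)
r13 = pearson(x1, x3)
print(f"corr(x1, x2) = {r12:.3f}, corr(x1, x3) = {r13:.3f}")
```

Here x1 and x2 are nearly collinear, so one of them would be a candidate to drop before repeating the regression; x1 and x3 are not.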


Why Didn’t Regression Work Well Here?

• Check the scatter plots

– Rating vs. age

– Rating vs. length

• Regardless of how good or bad regressions look, always check the scatter plots


Rating Vs. Age

[Figure: scatter plot of ranking (y-axis, 4 to 10) vs. age (x-axis, 0 to 90)]


Rating vs. Length

[Figure: scatter plot of ranking (y-axis, 4 to 10) vs. length (x-axis, 90 to 210)]


What Do These Charts Tell Us?

• Trends in age and length don’t explain much

– A little, but not much

• Not much difference between lower and upper ranges of each

• Great deal of variation across small sub-ranges of each


Types of Experimental Designs

• Simple designs

• Full factorial design

• Fractional factorial design


Simple Designs

• Vary one factor at a time

• For k factors with ith factor having $n_i$ levels, the number of experiments is

$$n = 1 + \sum_{i=1}^{k} (n_i - 1)$$

• Perhaps with multiple replications of experiments


A Simple Design for TW Experiment

• 2 factors (# of nodes, DLM)

• Factor 1 has three levels

• Factor 2 has two levels

• Run following 4 experiments:

– (8 nodes, DLM off), (32 nodes, DLM off), (64 nodes, DLM off)

– (8 nodes, DLM on)

• Might be better to use (64 nodes, DLM on), instead


Critique of Simple Designs

• Assumes factors don’t interact

• Often more effort than required

• Don’t use it, usually

• In TW case, clearly you don’t determine whether DLM works better or worse with differing numbers of nodes


Full Factorial Designs

• For k factors with ith factor having $n_i$ levels, the number of experiments is

$$n = \prod_{i=1}^{k} n_i$$

• Again, possibly with multiple replications per experiment


Full Factorial Design of TW Experiment

• 2 factors (# of nodes, DLM)

• Factor 1 has three levels

• Factor 2 has two levels

• Run the following 6 experiments:

– (8 nodes, DLM off), (32 nodes, DLM off), (64 nodes, DLM off)

– (8 nodes, DLM on), (32 nodes, DLM on), (64 nodes, DLM on)
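The two designs for the TW experiment can be enumerated mechanically. The sketch below is illustrative: the function names and the dictionary layout are assumptions, and the simple design arbitrarily takes the first listed level of each factor as its baseline.

```python
from itertools import product
from math import prod

# Factors and levels from the TW example.
levels = {"nodes": [8, 32, 64], "dlm": ["off", "on"]}


def full_factorial(levels):
    """Every combination of every factor's levels."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]


def simple_design(levels):
    """Baseline run, then vary one factor at a time from the baseline."""
    baseline = {name: vals[0] for name, vals in levels.items()}
    runs = [baseline]
    for name, vals in levels.items():
        for value in vals[1:]:
            runs.append({**baseline, name: value})
    return runs


full = full_factorial(levels)
simple = simple_design(levels)

# Counts agree with the formulas: product of n_i, and 1 + sum of (n_i - 1).
print(len(full), prod(len(v) for v in levels.values()))
print(len(simple), 1 + sum(len(v) - 1 for v in levels.values()))
```

The simple design never runs DLM on with 32 or 64 nodes, which is exactly why it cannot reveal an interaction between the two factors.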


Critique of Full Factorial Design

• Test every possible combination of factors’ levels

• Captures full information about interaction

• A hell of a lot of work, though

– Usually more than needed


Reducing the Work in Full Factorial Designs

• Reduce number of levels per factor

– Generally a good choice

– Especially if you know which factors are most important - use more levels for them

• Reduce the number of factors

– But don’t drop important ones

• Use fractional factorial designs


Fractional Factorial Designs

• Only measure some combination of the levels of the factors

• Must design carefully to best capture any possible interactions

• Less work, but more chance of inaccuracy

• Especially useful if some factors are known not to interact


2^k Factorial Designs

• Used to determine the effect of k factors

– Each with two alternatives or levels

• Often used as a preliminary to a larger performance study

– Each factor measured at its maximum and minimum level

– Perhaps offering insight on importance and interaction of various factors


Unidirectional Effects

• Effects that only increase as the level of a factor increases

– Or vice versa

• If this characteristic is known to apply, a 2^k factorial design at minimum and maximum levels is useful

• Shows whether the factor has a significant effect


2^2 Factorial Designs

• Two factors with two levels each

• Simplest kind of factorial experiment design

• Concepts developed here generalize

• A form of regression can be easily used here

• Simplest to show with an example


2^2 Factorial Design Example

• Reduce the TW experiment we discussed earlier

• Use two levels of nodes (8,64)

• And two levels of DLM (off, on)

• Will require 4 experiments

– Possibly with multiple replications

• Analyze with simple regression model


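Defining Variables for the 2^2 Factorial TW Example

$$x_A = \begin{cases} -1 & \text{if 8 nodes} \\ 1 & \text{if 64 nodes} \end{cases}$$

$$x_B = \begin{cases} -1 & \text{if no dynamic load management} \\ 1 & \text{if dynamic load management used} \end{cases}$$

• Values are set to -1 and 1 for ease in modeling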


Sample Data For Example

• Single runs of one benchmark simulation

              8 Nodes (-1)    64 Nodes (1)

DLM (1)           776             197

No DLM (-1)       820             217


The Sign Table Method

Another way to look at it is shown in this table:

Experiment    A     B     y

1            -1    -1     y1

2             1    -1     y2

3            -1     1     y3

4             1     1     y4


Regression Model for Example

• y = q0 + qAxA + qBxB + qABxAxB

• Note this is a nonlinear model

– Third term describes interactions between variables

820 = q0 - qA - qB + qAB

217 = q0 + qA - qB - qAB

776 = q0 - qA + qB - qAB

197 = q0 + qA + qB + qAB

• 4 equations in 4 unknowns


Solving the Equations

q0 = 1/4(820 + 217 + 776 + 197) = 502.5

qA = 1/4(-820 + 217 - 776 + 197) = -295.5

qB = 1/4(-820 - 217 + 776 + 197) = -16

qAB = 1/4(820 - 217 - 776 + 197) = 6

So,

y = 502.5 - 295.5xA - 16xB + 6xAxB


The Sign Table Method

I        A        B      AB      y

1       -1       -1       1     820

1        1       -1      -1     217

1       -1        1      -1     776

1        1        1       1     197

2010    -1182    -64      24    Total

502.5   -295.5   -16      6     Total/4

Same results as using equations: y = 502.5 - 295.5xA - 16xB + 6xAxB
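The sign table method amounts to a handful of signed sums, which the following sketch spells out. The run data comes from the example; the function names are this sketch's own.

```python
# (xA, xB, y) for the four experiments in the example.
runs = [(-1, -1, 820), (1, -1, 217), (-1, 1, 776), (1, 1, 197)]


def effects(runs):
    """Multiply y by each sign column, total the column, divide by 4."""
    n = len(runs)
    q0 = sum(y for _, _, y in runs) / n
    qA = sum(a * y for a, _, y in runs) / n
    qB = sum(b * y for _, b, y in runs) / n
    qAB = sum(a * b * y for a, b, y in runs) / n
    return q0, qA, qB, qAB


q0, qA, qB, qAB = effects(runs)


def predict(a, b):
    """Evaluate y = q0 + qA*xA + qB*xB + qAB*xA*xB."""
    return q0 + qA * a + qB * b + qAB * a * b


print(q0, qA, qB, qAB)
```

With four parameters and four observations the model reproduces every measurement exactly, so `predict(a, b)` returns the observed y for each run.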


Allocation of Variation for 2^2 Model

• Calculate the sample variance of y

$$s_y^2 = \frac{\sum_{i=1}^{2^2} (y_i - \bar{y})^2}{2^2 - 1}$$

Numerator is the SST - total variation

$$SST = 2^2 q_A^2 + 2^2 q_B^2 + 2^2 q_{AB}^2$$

• We can use this to explain what causes the variation in y


Terms in the SST

• $2^2 q_A^2$ is part of variation explained by the effect of A - SSA

• $2^2 q_B^2$ is part of variation explained by the effect of B - SSB

• $2^2 q_{AB}^2$ is part of variation explained by the effect of the interaction of A and B - SSAB

SST = SSA + SSB + SSAB


Variations in Our Example

• SST = 350449

• SSA = 349281

• SSB = 1024

• SSAB = 144

• We can now calculate the fraction of the total variation caused by each effect


Fractions of Variation in Our Example

• Fraction explained by A is 99.67%

• Fraction explained by B is 0.29%

• Fraction explained by the interaction of A and B is 0.04%

• So almost all the variation comes from the number of nodes

• So if you want to run faster, apply more nodes, don’t turn on dynamic load management
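These fractions follow directly from the effects computed earlier. The sketch below recomputes them and, as a cross-check, also computes SST from its definition as the sum of squared deviations from the mean. Variable names are this sketch's own.

```python
# Effects and observations from the 2^2 example.
q = {"A": -295.5, "B": -16.0, "AB": 6.0}
ys = [820, 217, 776, 197]

# SSA, SSB, SSAB: each is 2^2 times the squared effect.
ss = {name: 2**2 * value**2 for name, value in q.items()}
sst = sum(ss.values())

# Cross-check against the definition of SST.
mean_y = sum(ys) / len(ys)
sst_direct = sum((y - mean_y) ** 2 for y in ys)

percent = {name: 100 * s / sst for name, s in ss.items()}
for name in ss:
    print(f"SS{name} = {ss[name]:.0f} ({percent[name]:.2f}% of SST)")
```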


2^2 r Factorial Designs

• 2 factors, 2 levels each, with r replications at each of the four combinations

• y = q0 + qAxA + qBxB + qABxAxB + e

• Now we need to compute effects, estimate the errors, and allocate variation

• We can also produce confidence intervals for effects and predicted responses


Computing Effects for 2^2 r Factorial Experiments

• We can use the sign table method

• But instead of single observations, regress off the mean of the r observations

• Compute errors for each replication using similar tabular method

• Similar methods used for allocation of variance and calculating confidence intervals


Example of 2^2 r Factorial Design With Replications

• Same Time Warp system as before, but with 4 replications at each point (r=4)

• No DLM, 8 nodes: 820, 822, 813, 809

• DLM, 8 nodes: 776, 798, 750, 755

• No DLM, 64 nodes: 217, 228, 215, 221

• DLM, 64 nodes: 197, 180, 220, 185


2^2 r Factorial Example Analysis Matrix

I         A        B        AB      y                     Mean

1        -1       -1        1      (820,822,813,809)     816

1         1       -1       -1      (217,228,215,221)     220.25

1        -1        1       -1      (776,798,750,755)     769.75

1         1        1        1      (197,180,220,185)     195.5

2001.5   -1170    -71       21.5                         Total

500.4    -292.5   -17.75    5.4                          Total/4

q0 = 500.4, qA = -292.5, qB = -17.75, qAB = 5.4


Estimation of Errors for 2^2 r Factorial Example

• Figure differences between predicted and observed values for each replication

$$e_{ij} = y_{ij} - \hat{y}_i = y_{ij} - q_0 - q_A x_{Ai} - q_B x_{Bi} - q_{AB} x_{Ai} x_{Bi}$$

• Now calculate SSE

$$SSE = \sum_{i=1}^{2^2} \sum_{j=1}^{r} e_{ij}^2 = 2606$$
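The effects and the SSE for the replicated example can be reproduced with a short sketch. The observations are the ones listed for the example; keying the dictionary by (xA, xB) is just a convenience of this sketch. Unrounded, the effects are 500.375 and 5.375 where the slides show 500.4 and 5.4, and the sum of squared errors comes to 2606.5 where the slide shows 2606.

```python
# Observations keyed by (xA, xB): xA = -1 for 8 nodes, 1 for 64 nodes;
# xB = -1 for no DLM, 1 for DLM.
data = {
    (-1, -1): [820, 822, 813, 809],
    (1, -1): [217, 228, 215, 221],
    (-1, 1): [776, 798, 750, 755],
    (1, 1): [197, 180, 220, 185],
}

# Sign table method applied to the mean of the r observations per cell.
means = {cell: sum(ys) / len(ys) for cell, ys in data.items()}
n_cells = len(means)
q0 = sum(means.values()) / n_cells
qA = sum(a * m for (a, b), m in means.items()) / n_cells
qB = sum(b * m for (a, b), m in means.items()) / n_cells
qAB = sum(a * b * m for (a, b), m in means.items()) / n_cells


def predict(a, b):
    """Predicted response for one combination of factor levels."""
    return q0 + qA * a + qB * b + qAB * a * b


# e_ij = observed minus predicted; SSE is the sum of their squares.
errors = {cell: [y - predict(*cell) for y in ys] for cell, ys in data.items()}
sse = sum(e**2 for es in errors.values() for e in es)

print(q0, qA, qB, qAB, sse)
```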


Allocating Variation

• We can determine the percentage of variation due to each factor’s impact

– Just like 2^2 designs without replication

• But we can also isolate the variation due to experimental errors

• Methods are similar to other regression techniques for allocating variation


Variation Allocation in Example

• We’ve already figured SSE

• We also need SST, SSA, SSB, and SSAB

• Also, SST = SSA + SSB + SSAB + SSE

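• Use same formulae as before for SSA, SSB, and SSAB, with an extra factor of r for the replications (e.g., $SSA = 2^2 r q_A^2$)

$$SST = \sum_{i,j} (y_{ij} - \bar{y})^2$$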


Sums of Squares for Example

• SST = SSY - SS0 = 1,377,009.75

• SSA = 1,368,900

• SSB = 5041

• SSAB = 462.25

• Percentage of variation for A is 99.4%

• Percentage of variation for B is 0.4%

• Percentage of variation for A/B interaction is 0.03%

• And 0.2% (apx.) is due to experimental errors
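The sums of squares above can be checked with a sketch that works from the raw observations. It computes SST from its definition, scales each squared effect by 2^2 r, and takes SSE as what is left over. The unrounded effects (for example qAB = 21.5/4 = 5.375) are used instead of the rounded values shown in the analysis matrix.

```python
# Observations keyed by (xA, xB), r = 4 replications per cell.
data = {
    (-1, -1): [820, 822, 813, 809],
    (1, -1): [217, 228, 215, 221],
    (-1, 1): [776, 798, 750, 755],
    (1, 1): [197, 180, 220, 185],
}
r = 4
all_y = [y for ys in data.values() for y in ys]
grand_mean = sum(all_y) / len(all_y)

# Total variation, straight from the definition.
sst = sum((y - grand_mean) ** 2 for y in all_y)

# Effects from the cell means, then each SS = 2^2 * r * q^2.
means = {cell: sum(ys) / r for cell, ys in data.items()}
qA = sum(a * m for (a, b), m in means.items()) / 4
qB = sum(b * m for (a, b), m in means.items()) / 4
qAB = sum(a * b * m for (a, b), m in means.items()) / 4
ssa, ssb, ssab = (2**2 * r * q**2 for q in (qA, qB, qAB))

# Whatever the effects do not explain is experimental error.
sse = sst - ssa - ssb - ssab

for name, value in [("A", ssa), ("B", ssb), ("AB", ssab), ("error", sse)]:
    print(f"{name}: {value:.2f} ({100 * value / sst:.2f}% of SST)")
```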


Confidence Intervals For Effects

• Computed effects are random variables

• Thus, we would like to specify how confident we are that they are correct

• Using the usual confidence interval methods

• First, must figure Mean Square of Errors

$$s_e^2 = \frac{SSE}{2^2 (r - 1)}$$


Calculating Variances of Effects

• Variance of all effects is the same -

$$s_{q_0}^2 = s_{q_A}^2 = s_{q_B}^2 = s_{q_{AB}}^2 = \frac{s_e^2}{2^2 r}$$

• So standard deviation is also the same

• In calculations, use t- or z-value for $2^2 (r-1)$ degrees of freedom


Calculating Confidence Intervals for Example

• At 90% level, using the t-value for 12 degrees of freedom, 1.782

• And standard deviation of effects is 3.68

• Confidence intervals are $q_i \pm (1.782)(3.68)$

• q0 - (493.8,506.9)

• qA - (-299.1,-285.9)

• qB - (-24.3,-11.2)

• qAB - (-1.2,11.9)
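The interval arithmetic can be replayed in a few lines. The SSE and the t-value are used as quoted on the slides, and the effects are the unrounded values from the analysis matrix (500.375 and 5.375 where the slides show 500.4 and 5.4). Flagging an interval that contains zero as not significant at this level is the usual reading and is this sketch's addition; the slides do not state it.

```python
from math import sqrt

r = 4
sse = 2606               # as reported on the slide
dof = 2**2 * (r - 1)     # degrees of freedom for the errors
t_value = 1.782          # t-value quoted on the slide for 12 degrees of freedom

s_e = sqrt(sse / dof)        # standard deviation of errors
s_q = s_e / sqrt(2**2 * r)   # standard deviation shared by all effects

effects = {"q0": 500.375, "qA": -292.5, "qB": -17.75, "qAB": 5.375}
half_width = t_value * s_q
intervals = {name: (q - half_width, q + half_width) for name, q in effects.items()}

# An interval that spans zero gives no evidence the effect differs from zero.
spans_zero = {name: lo <= 0 <= hi for name, (lo, hi) in intervals.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name}: ({lo:.1f}, {hi:.1f}) spans zero: {spans_zero[name]}")
```

Only the interaction term qAB has an interval that includes zero, which fits the earlier finding that the interaction explains about 0.03% of the variation.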


Visual Tests for Verifying Assumptions

• What assumptions have we been making?

– Model errors are statistically independent

– Model errors are additive

– Errors are normally distributed

– Errors have constant standard deviation

– Effects of errors are additive

• Which boils down to independent, normally distributed observations with constant variance


Testing for Independent Errors

• Compute residuals and make a scatter plot

• Trends indicate a dependence of errors on factor levels

– But if residuals are an order of magnitude below the predicted response, trends can be ignored

• Sometimes a good idea to plot residuals vs. experiment number


Example Plot of Residuals vs. Predicted Response

[Figure: scatter plot of residuals (y-axis, -30 to 40) vs. predicted response (x-axis, 0 to 900)]


Example Plot of Residuals Vs. Experiment Number

[Figure: scatter plot of residuals (y-axis, -30 to 40) vs. experiment number (x-axis, 0 to 5)]


Testing for Normally Distributed Errors

• As usual, do a quantile-quantile chart

– Against the normal distribution

• If it’s close to linear, this assumption is good
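The points for such a chart can be generated as follows. This is a sketch under two assumptions that are not in the slides: residuals are taken as each observation minus its cell mean (which is what the 2^2 r model predicts), and the i-th smallest residual is paired with the normal quantile at probability (i - 0.5)/n, one common plotting-position choice.

```python
from statistics import NormalDist

# Replicated Time Warp observations from the 2^2 r example.
data = {
    (-1, -1): [820, 822, 813, 809],
    (1, -1): [217, 228, 215, 221],
    (-1, 1): [776, 798, 750, 755],
    (1, 1): [197, 180, 220, 185],
}

# Residual = observation minus the mean of its cell, sorted ascending.
residuals = sorted(y - sum(ys) / len(ys) for ys in data.values() for y in ys)
n = len(residuals)

# Standard normal quantiles at the assumed plotting positions (i - 0.5) / n.
normal_quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

qq_pairs = list(zip(residuals, normal_quantiles))
for residual, quantile in qq_pairs:
    print(f"{residual:8.2f} {quantile:8.3f}")
```

Plotting the pairs and judging how close they fall to a straight line is the visual test described above.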


Quantile-Quantile Plot for Example

[Figure: quantile-quantile plot of residuals (x-axis, -30 to 40) vs. normal quantiles (y-axis, -2.5 to 2.5), with fitted line y = 0.0731x + 4E-17, R^2 = 0.9426]


Assumption of Constant Variance

• Checking homoscedasticity

• Go back to the scatter plot and check for an even spread


The Scatter Plot, Again

[Figure: scatter plot of residuals (y-axis, -30 to 40) vs. predicted response (x-axis, 0 to 900)]


Example Shows Residuals Are Function of Predictors

• What to do about it?

• Maybe apply a transform?

• To determine if we should, plot standard deviation of errors vs. various transformations of the mean

• Here, dynamic load management seems to introduce greater variance

– Transforms not likely to help

– Probably best not to describe with regression
