
Page 1: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

CHAPTER 13: MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

“All models are wrong; some are useful.” George E. P. Box

• Organization of chapter in ISSO

– Bias-variance tradeoff

– Model selection: Cross-validation

– Fisher information matrix: Definition, examples, and efficient computation

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

Page 2: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Model Definition and MSE

• Assume model z = h(θ, x) + v, where z is output, h(·) is some function, x is input, v is noise, and θ is vector of model parameters
– h(·) may represent simulation model

– h(·) may represent “metamodel” (response surface) of existing simulation

• A fundamental goal is to take n data points and estimate θ, forming the estimate θ̂n

• A common measure of effectiveness for estimate θ̂n is mean of squared model error (MSE) at fixed x:

E{[h(θ̂n, x) − E(z|x)]² | x}

Page 3: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Bias-Variance Decomposition

• The MSE of the model at a fixed x can be decomposed as:

E{[h(θ̂n, x) − E(z|x)]² | x}

= E{[h(θ̂n, x) − E(h(θ̂n, x))]² | x} + [E(h(θ̂n, x)) − E(z|x)]²

= variance at x + (bias at x)²

where expectations are computed w.r.t. θ̂n
• Above implies:

Model too simple ⇒ High bias / low variance

Model too complex ⇒ Low bias / high variance

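For reference, a short sketch of why the decomposition holds, using the standard add-and-subtract argument. Write h = h(θ̂n, x), h̄ = E(h | x), and μ = E(z | x); the cross term vanishes because the expectations are with respect to θ̂n, so E{h − h̄ | x} = 0:

```latex
\begin{align*}
E\{(h-\mu)^2 \mid x\}
  &= E\{[(h-\bar h) + (\bar h-\mu)]^2 \mid x\}\\
  &= E\{(h-\bar h)^2 \mid x\}
     + 2(\bar h-\mu)\,\underbrace{E\{h-\bar h \mid x\}}_{=\,0}
     + (\bar h-\mu)^2\\
  &= \underbrace{E\{(h-\bar h)^2 \mid x\}}_{\text{variance at } x}
     + \underbrace{(\bar h-\mu)^2}_{(\text{bias at } x)^2}.
\end{align*}
```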

Page 4: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Unbiased Estimator May Not Be Best (Example 13.1 from ISSO)

• Unbiased estimator is such that E[h(θ̂n, x) | x] = E(z | x) (i.e., mean of prediction is same as mean of data z)

• Example: Let θ̂n denote sample mean of scalar i.i.d. data as estimator of true mean θ (h(θ, x) = θ in notation above)

• Alternative biased estimator of θ is r θ̂n, where 0 < r < 1

• MSE of biased and unbiased estimators generally satisfy

E[(r θ̂n − θ)²] < E[(θ̂n − θ)²]

• Biased estimate better in MSE sense
– However, optimal value of r requires knowledge of unknown (true) θ
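A minimal Monte Carlo sketch of this example in Python; the Gaussian distribution, sample size, true mean, noise level, and shrinkage factor r below are illustrative assumptions, not values from ISSO:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, r = 2.0, 3.0, 10, 0.8   # assumed values for illustration
reps = 100_000

# Each row is one i.i.d. data set of size n; row mean = sample-mean estimate
samples = rng.normal(theta, sigma, size=(reps, n))
theta_hat = samples.mean(axis=1)                 # unbiased estimator
print(np.mean((theta_hat - theta) ** 2))         # MSE ~ sigma^2/n = 0.9
print(np.mean((r * theta_hat - theta) ** 2))     # biased estimator: ~0.74
```

For this setup the MSE-optimal shrinkage works out to r = θ²/(θ² + σ²/n), which depends on the unknown θ, illustrating the caveat in the last bullet above.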

Page 5: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Bias-Variance Tradeoff in Model Selection in Simple Problem

Page 6: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Example 13.2 in ISSO: Bias-Variance Tradeoff

• Suppose true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1

• Compare linear, quadratic, and cubic approximations

• Table below gives average bias, variance, and MSE

• Overall pattern of decreasing bias and increasing variance; optimal tradeoff is quadratic model

Model            bias²      variance    Overall MSE
Linear           510.6      10.0        520.6
Quadratic        0.53       20.0        20.53
Cubic            0.005      30.0        30.005
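The qualitative pattern in the table (bias falling and variance rising with model order) can be reproduced by Monte Carlo. The sketch below uses an assumed design grid and noise level, so the numbers will differ from ISSO's table:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 20)        # assumed input design points
f_true = (x + x**2) ** 1.1           # true mean function from Example 13.2
sigma, reps = 1.0, 2000              # assumed noise std dev, MC replications

for deg, name in [(1, "Linear"), (2, "Quadratic"), (3, "Cubic")]:
    preds = np.empty((reps, x.size))
    for i in range(reps):
        z = f_true + rng.normal(0.0, sigma, x.size)       # one noisy data set
        preds[i] = np.polyval(np.polyfit(x, z, deg), x)   # fitted model at x
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # average bias^2
    var = np.mean(preds.var(axis=0))                      # average variance
    print(f"{name:9s}  bias^2={bias2:8.4f}  var={var:6.4f}  MSE={bias2 + var:8.4f}")
```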

Page 7: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Model Selection

• The bias-variance tradeoff provides conceptual framework for determining a good model
– Bias-variance tradeoff not directly useful

• Need a practical method for optimizing bias-variance tradeoff

• Practical aim is to pick a model that minimizes a criterion:

f1(fitting error from given data) + f2(model complexity)

where f1 and f2 are increasing functions

• All methods based on a tradeoff between fitting error (high variance) and model complexity (low bias)

• Criterion above may/may not be explicitly used in given method

Page 8: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Methods for Model Selection

• Among many popular methods are:

– Akaike Information Criterion (AIC) (Akaike, 1974)
• Popular in time series analysis

– Bayesian selection (Akaike, 1977)

– Bootstrap-based selection (Efron and Tibshirani, 1997)

– Cross-validation (Stone, 1974)

– Minimum description length (Rissanen, 1978)

– V-C dimension (Vapnik and Chervonenkis, 1971)
• Popular in computer science

• Cross-validation appears to be most popular model selection method

Page 9: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Cross-Validation

• Cross-validation is simple, general method for comparing candidate models
– Other specialized methods may work better in specific problems

• Cross-validation uses the training set of data

• Method is based on iteratively partitioning the full set of training data into training and test subsets

• For each partition, estimate model from training subset and evaluate model on test subset

– Number of training (or test) subsets = number of model fits required

• Select model that performs best over all test subsets

Page 10: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Choice of Training and Test Subsets

• Let n denote total size of data set, nT denote size of test subset, nT < n

• Common strategy is leave-one-out: nT = 1

– Implies n test subsets during cross-validation process

• Often better to choose nT > 1
– Sometimes more efficient (sampling w/o replacement)
– Sometimes more accurate model selection

• If nT > 1, sampling may be with or without replacement
– “With replacement” indicates that there are “n choose nT” possible test subsets

– With replacement may be prohibitive in practice: e.g., n = 30, nT = 6 implies nearly 600K model fits (see the snippet at the end of this slide)!

• Sampling without replacement reduces number of test subsets to n/nT (disjoint test subsets)

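The subset counts above can be verified directly; a quick sketch using only Python's standard library:

```python
from math import comb

n, nT = 30, 6
print(comb(n, nT))   # 593775 possible test subsets, i.e., nearly 600K fits
print(n // nT)       # 5 disjoint test subsets when sampling w/o replacement
```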

Page 11: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Conceptual Example of Sampling Without Replacement: Cross-Validation with 3 Disjoint Test Subsets

Page 12: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Typical Steps for Cross-Validation

Step 0 (initialization): Determine size of test subsets and candidate model. Let i be counter for test subset being used.

Step 1 (estimation): For the i th test subset, let the remaining data be the i th training subset. Estimate θ from this training subset.

Step 2 (error calculation): Based on estimate for θ from Step 1 (i th training subset), calculate MSE (or other measure) with data in i th test subset.

Step 3 (new training and test subsets): Update i to i + 1 and return to Step 1. Form mean of MSE when all test subsets have been evaluated.

Step 4 (new model): Repeat Steps 1 to 3 for next model. Choose model with lowest mean MSE as best.
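A minimal Python sketch of Steps 0 to 4 using disjoint test subsets; it borrows the sine-function setup of Example 13.4 on the next slides, with an assumed noise standard deviation of 0.3:

```python
import numpy as np
from numpy.polynomial import Polynomial

def cv_mean_mse(x, z, degree, n_subsets):
    """Steps 1-3: mean test-subset MSE for a polynomial model of given degree."""
    idx = np.arange(x.size)
    mse = []
    for test in np.array_split(idx, n_subsets):     # i-th disjoint test subset
        train = np.setdiff1d(idx, test)             # remaining data (Step 1)
        model = Polynomial.fit(x[train], z[train], degree)
        mse.append(np.mean((z[test] - model(x[test])) ** 2))   # Step 2
    return np.mean(mse)                             # Step 3

# Step 0: 30 data points, 5 test subsets; Step 4: compare candidate models
rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0 * np.pi, 30)
z = np.sin(x) + rng.normal(0.0, 0.3, x.size)        # assumed noise level
scores = {deg: cv_mean_mse(x, z, deg, 5) for deg in (1, 3, 10)}
print(scores, "-> chosen degree:", min(scores, key=scores.get))
```

With this setup the 3rd-order polynomial typically gives the lowest mean MSE, consistent with the conclusion on the next slide.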

Page 13: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Numerical Illustration of Cross-Validation (Example 13.4 in ISSO)

• Consider true system corresponding to a sine function of the input with additive normally distributed noise

• Consider three candidate models

– Linear (affine) model

– 3rd-order polynomial

– 10th-order polynomial

• Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement)

• Based on RMS error (equiv. to MSE) over test subsets, 3rd-order polynomial is preferred

• See following plot

Page 14: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Numerical Illustration (cont’d): Relative Fits for 3 Models with Low-Noise Observations

[Figure: fitted linear, 3rd-order, and 10th-order models plotted against the sine wave (process mean)]

Page 15: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Fisher Information Matrix

• Fundamental role of data analysis is to extract information from data

• Parameter estimation for models is central to process of extracting information

• The Fisher information matrix plays a central role in parameter estimation as a measure of information

Information matrix summarizes the amount of information in the data relative to the parameters being estimated

Page 16: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Problem Setting

• Consider the classical statistical problem of estimating parameter vector θ from n data vectors z1, z2, …, zn

• Suppose have a probability density and/or mass function associated with the data

• The parameters θ appear in the probability function and affect the nature of the distribution

– Example: zi ~ N(mean(θ), covariance(θ)) for all i

• Let l(θ | z1, z2, …, zn) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data

Page 17: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Information Matrix—Definition

• Recall likelihood function l(θ | z1, z2, …, zn)

• Information matrix defined as

Fn(θ) = E[(∂ log l/∂θ) (∂ log l/∂θ)ᵀ]

where expectation is w.r.t. z1, z2, …, zn

• Equivalent form based on Hessian matrix:

Fn(θ) = −E[∂² log l / (∂θ ∂θᵀ)]

• Fn(θ) is positive semidefinite of dimension p × p (p = dim(θ))
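As a sanity check on the two equivalent forms, consider the scalar case zi ~ i.i.d. N(θ, σ²) with σ known: log l(θ) = −Σ(zi − θ)²/(2σ²) + const, the score is Σ(zi − θ)/σ², the Hessian is the constant −n/σ², and both forms give Fn(θ) = n/σ². A short numerical sketch (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, reps = 1.0, 2.0, 5, 200_000

z = rng.normal(theta, sigma, size=(reps, n))      # reps data sets of size n
score = np.sum(z - theta, axis=1) / sigma**2      # d(log l)/d(theta)
print(np.mean(score**2))   # outer-product form: ~ n/sigma^2 = 1.25
print(n / sigma**2)        # Hessian form: -E[d^2 log l/d theta^2] = n/sigma^2
```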

Page 18: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Information Matrix—Two Key Properties

• Connection of Fn(θ) and uncertainty in estimate θ̂n is rigorously specified via two famous results (θ* = true value of θ):

1. Asymptotic normality:

√n (θ̂n − θ*) → N(0, F̄⁻¹) in distribution

where F̄ = lim_{n→∞} Fn(θ*)/n

2. Cramér-Rao inequality:

cov(θ̂n) ≥ Fn(θ*)⁻¹ for all n (for unbiased θ̂n)

• Above two results indicate: greater variability of θ̂n ⟺ “smaller” Fn(θ) (and vice versa)
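Continuing the scalar Gaussian sketch from the previous slide, the sample mean is unbiased with variance σ²/n, so it attains the Cramér-Rao bound Fn(θ*)⁻¹ = σ²/n exactly; a quick Monte Carlo confirms this:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n, reps = 1.0, 2.0, 5, 200_000

z = rng.normal(theta, sigma, size=(reps, n))
theta_hat = z.mean(axis=1)            # unbiased estimator of theta
print(np.var(theta_hat))              # ~ sigma^2/n = 0.8
print(sigma**2 / n)                   # Cramer-Rao bound F_n(theta*)^{-1}
```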

Page 19: CHAPTER 13 MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

Selected Applications

• Information matrix is measure of performance for several applications. Four uses are:

1. Confidence regions for parameter estimation
– Uses asymptotic normality and/or Cramér-Rao inequality

2. Prediction bounds for mathematical models

3. Basis for “D-optimal” criterion for experimental design
– Information matrix serves as measure of how well θ can be estimated for a given set of inputs (see the sketch after this list)

4. Basis for “noninformative prior” in Bayesian analysis
– Sometimes used for “objective” Bayesian inference
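As a small illustration of use 3: for the linear model z = θ₀ + θ₁x + noise with independent N(0, σ²) errors, Fn(θ) = XᵀX/σ², so the D-optimal criterion amounts to maximizing det(XᵀX) over candidate input designs. A sketch comparing two assumed four-point designs on [0, 1]:

```python
import numpy as np

def d_criterion(x):
    """det(X'X) for the two-parameter model z = th0 + th1*x + noise."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix
    return np.linalg.det(X.T @ X)

spread = np.array([0.0, 0.0, 1.0, 1.0])    # inputs pushed to the extremes
bunched = np.array([0.4, 0.5, 0.5, 0.6])   # inputs bunched in the middle
print(d_criterion(spread), d_criterion(bunched))   # 4.0 vs 0.08: spread wins
```

The design with inputs at the extremes wins, matching the classical result that spreading inputs improves how well the slope and intercept can be estimated.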