To Explain or To Predict? Explanatory vs. Predictive Modeling in Scientific Research Galit Shmuéli Georgetown University October 30, 2009


Page 1: To Explain Or To Predict?

To Explain or To Predict? Explanatory vs. Predictive Modeling in Scientific Research

Galit Shmuéli

Georgetown University

October 30, 2009

Page 2: To Explain Or To Predict?

The path to discovery

Page 3: To Explain Or To Predict?

Predict

Explain

Page 4: To Explain Or To Predict?

What are

“explaining”?

“predicting”?

Page 5: To Explain Or To Predict?

Statistical modeling in social science research

Purpose: test causal theory (“explain”)

Association-based statistical models

Prediction nearly absent

Page 6: To Explain Or To Predict?

Lesson #1:

Whether statisticians like it or not, in the social sciences, association-based statistical models are used for testing causal theory.

Justification: a strong underlying theoretical model provides the causality.

Page 7: To Explain Or To Predict?

Definition: Explanatory Model

A statistical model used for testing causal theory

(“proper” or not)

Page 8: To Explain Or To Predict?

Definition: Predictive Model

An empirical model used for predicting new records/scenarios

Page 9: To Explain Or To Predict?
Page 10: To Explain Or To Predict?

Multi-page sections with theoretical justifications of each hypothesis

Page 11: To Explain Or To Predict?

Concept operationalization

4 pages of such tables

Anger

Economic stability

Trust

Well-being

Poverty

Page 12: To Explain Or To Predict?

Statistical model (here: path analysis)

Page 13: To Explain Or To Predict?

“Statistical” conclusions

Page 14: To Explain Or To Predict?

Research conclusions

Page 15: To Explain Or To Predict?

Lesson #2

In the social sciences, empirical analysis is mainly used for testing causal theory.

Empirical prediction is considered un-academic.

Some statisticians share this view:

“The two goals in analyzing data... I prefer to describe as ‘management’ and ‘science’. Management seeks profit... Science seeks truth.”

Parzen, Statistical Science 2001

Page 16: To Explain Or To Predict?

Prediction in the Information Systems literature

Page 17: To Explain Or To Predict?

Predictive goal stated?

Predictive power assessed?

Page 18: To Explain Or To Predict?

“Examples of [predictive] theory in IS do not come readily to hand, suggesting that they are not common” Gregor, MISQ 2006

1072 articles

of which

52 empirical with predictive claims

Page 19: To Explain Or To Predict?

Breakdown of the 52 “predictive” articles

Page 20: To Explain Or To Predict?

Why Predict?

Scientific use of empirical models

To Explain: test causal theory

To Predict: (utility), relevance, new theory, predictability

Page 21: To Explain Or To Predict?

Why are statistical explanatory models different from predictive models?

Page 22: To Explain Or To Predict?

Theory vs. its manifestation

?

Page 23: To Explain Or To Predict?

“The goal of finding models that are predictively accurate differs from the goal of finding models that are true.”

Page 24: To Explain Or To Predict?

Given the research environment in the social sciences, two critically important points are:

1. Explanatory power and predictive accuracy cannot be inferred from one another.

2. The “best” explanatory model is (nearly) never the “best” predictive model, and vice versa.

Page 25: To Explain Or To Predict?

Point #1

Explanatory Power ≠ Predictive Power

Cannot infer one from the other

Page 26: To Explain Or To Predict?

What is R²?

Page 27: To Explain Or To Predict?

In-sample vs. out-of-sample evaluation
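
A minimal sketch of the contrast on this slide, not taken from the talk: the data-generating process, the sample sizes, and the helper names `r_squared` and `simulate` are all assumptions made for illustration. An ordinary least squares model with many irrelevant predictors is fit on a small training sample, then scored on the same sample (in-sample R²) and on a large fresh sample (out-of-sample R²).

```python
import numpy as np

def r_squared(y, y_hat):
    """1 - SSE/SST, with SST taken around the mean of the y being scored."""
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

def simulate(n, n_noise, rng):
    """Hypothetical data: one real predictor plus n_noise irrelevant ones."""
    X = rng.normal(size=(n, 1 + n_noise))
    y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=2.0, size=n)
    return np.column_stack([np.ones(n), X]), y  # prepend an intercept column

rng = np.random.default_rng(0)
X_train, y_train = simulate(30, 20, rng)      # small training sample
X_hold, y_hold = simulate(10_000, 20, rng)    # large sample of new records

# Least squares fit on the training sample only.
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

r2_in = r_squared(y_train, X_train @ beta_hat)
r2_out = r_squared(y_hold, X_hold @ beta_hat)
print(f"in-sample R^2:     {r2_in:.3f}")
print(f"out-of-sample R^2: {r2_out:.3f}")
```

With so few training records per estimated coefficient, the in-sample figure is typically far more flattering than the out-of-sample one, which can even be negative.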

Page 28: To Explain Or To Predict?

Performance Evaluation

Explanatory: goodness-of-fit, R², p-values, interpretation

Danger: type I, II errors

Predictive: out-of-sample prediction accuracy, costs, run time

Danger: over-fitting

Page 29: To Explain Or To Predict?

Suggestion for social scientists:

Report predictive accuracy in addition to explanatory power

Page 30: To Explain Or To Predict?

Explanatory Power (horizontal axis) vs. Predictive Power (vertical axis)

Page 31: To Explain Or To Predict?

Point #2

Best explanatory model

Best predictive model

Page 32: To Explain Or To Predict?

Predict ≠ Explain

+ ?

“We should mention that not all data features were found to be useful. For example, we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However, we concluded that they could not help at all for improving the accuracy of well tuned collaborative filtering models.”

Bell et al., 2008

Page 33: To Explain Or To Predict?

Predict ≠ Explain

The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to brand formulation is within 80%-125%

“We are planning to… develop predictive models for bioavailability and bioequivalence”

Lester M. Crawford, 2005

Acting Commissioner of Food & Drugs

Page 34: To Explain Or To Predict?

Let’s dig in

Page 35: To Explain Or To Predict?

Explanatory goal:

minimize model bias

Predictive goal:

minimize MSE (model bias + sampling variance)

Page 36: To Explain Or To Predict?

What is Optimized?

Bias or Prediction MSE

Var(Y) = uncontrollable

bias² = model misspecification

estimation (sampling variance)
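
As a worked reference, the three components named on this slide can be combined in the standard decomposition of prediction MSE. The notation is an assumption made here, not taken from the slide: Y = f(x) + ε with E(ε) = 0, and the estimate f̂(x) comes from a training sample independent of the new ε.

```latex
\begin{align*}
E\big(Y - \hat{f}(x)\big)^2
  &= E\big(Y - f(x)\big)^2
   + \big(E\hat{f}(x) - f(x)\big)^2
   + E\big(\hat{f}(x) - E\hat{f}(x)\big)^2 \\
  &= \underbrace{\mathrm{Var}(Y)}_{\text{uncontrollable}}
   + \underbrace{\mathrm{bias}^2}_{\text{model misspecification}}
   + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{estimation (sampling variance)}}
\end{align*}
```

The cross terms vanish because ε has mean zero and is independent of f̂(x). An explanatory analysis aims to drive the middle term to zero; a predictive analysis minimizes the sum, so it may accept some bias in exchange for lower sampling variance.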

Page 37: To Explain Or To Predict?

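Linear Regression Example

True model: $f(x) = \beta_1 x_1 + \beta_2 x_2$

Estimated model: $\hat{f}(x) = \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

$$MSE_1 = E\big(Y - \hat{f}(x)\big)^2 = \sigma^2 + 0 + Var(\hat{\beta}_1 x_1 + \hat{\beta}_2 x_2)$$

Underspecified model: $f^*(x) = \gamma_1 x_1$

Estimated model: $\hat{f}^*(x) = \hat{\gamma}_1 x_1$

$$MSE_2 = E\big(Y - \hat{f}^*(x)\big)^2 = \sigma^2 + (x_1 A \beta_2 - x_2 \beta_2)^2 + Var(\hat{\gamma}_1 x_1)$$

where $A = (x_1' x_1)^{-1} x_1' x_2$

MSE2 < MSE1 when:

σ² large

|β2| small

corr(x1, x2) high

limited range of x’s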
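
The comparison on this slide can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's code: it uses no intercept, predicts at the training x's, averages the per-point MSE over those points, and uses a made-up design in which x2 is nearly a multiple of x1. The function name `avg_mse_full_vs_reduced` is hypothetical. It evaluates MSE1 = σ² + 0 + Var(β̂1x1 + β̂2x2) and MSE2 = σ² + (x1Aβ2 − x2β2)² + Var(γ̂1x1), with A = (x1'x1)⁻¹x1'x2.

```python
import numpy as np

def avg_mse_full_vs_reduced(x1, x2, beta2, sigma):
    """Average prediction MSE at the training x's for the correctly specified
    model (x1 and x2) and the underspecified model (x1 only), no intercept."""
    X = np.column_stack([x1, x2])
    XtX_inv = np.linalg.inv(X.T @ X)
    # Var(b1*x1 + b2*x2) at each row x is sigma^2 * x'(X'X)^{-1}x.
    var_full = sigma**2 * np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    mse1 = sigma**2 + 0.0 + var_full            # the full model is unbiased

    A = (x1 @ x2) / (x1 @ x1)                   # (x1'x1)^{-1} x1'x2, a scalar here
    bias = x1 * A * beta2 - x2 * beta2          # bias from leaving out x2
    var_reduced = sigma**2 * x1**2 / (x1 @ x1)  # Var(g1 * x1)
    mse2 = sigma**2 + bias**2 + var_reduced
    return float(mse1.mean()), float(mse2.mean())

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=50)       # highly correlated with x1

small = avg_mse_full_vs_reduced(x1, x2, beta2=0.5, sigma=1.0)
large = avg_mse_full_vs_reduced(x1, x2, beta2=5.0, sigma=1.0)
print("beta2 = 0.5: MSE1 = %.4f, MSE2 = %.4f" % small)
print("beta2 = 5.0: MSE1 = %.4f, MSE2 = %.4f" % large)
```

Under these assumptions the averaged difference MSE2 − MSE1 equals (β2²·r'r − σ²)/n, where r is the part of x2 not explained by x1. That single expression shows why large σ², small |β2|, high correlation, and a limited range of x all favor the underspecified model for prediction.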

Page 38: To Explain Or To Predict?

Two statistical modeling paths

China's Diverging Paths, photo by Clark Smith

Page 39: To Explain Or To Predict?

Goal Definition

Design & Collection

Data Preparation

EDA

Variables?

Methods?

Evaluation, Validation & Model Selection

Model Use & Reporting

Page 40: To Explain Or To Predict?

Study design & data collection

Hierarchical data

Observational or experiment?

Primary or secondary data?

Instrument (reliability + validity vs. measurement accuracy)

How much data?

How to sample?

Page 41: To Explain Or To Predict?

Data preparation

reduced-feature models

missing

partitioning

Page 42: To Explain Or To Predict?

outliers

PCA

SVD

trends

Interactive visualization

summary stats plots

Page 43: To Explain Or To Predict?

Which variables?

Multicollinearity?

theory, associations, ex-post availability

A, B, A*B?

Page 44: To Explain Or To Predict?

Methods / Models

Blackbox / interpretable

Mapping to theory

variance, bias

ensembles

PLS

ridge regression

PCR

boosting

Page 45: To Explain Or To Predict?

Evaluation, Validation & Model Selection

Training data, Empirical model, Holdout data

Predictive power

Over-fitting analysis

Theoretical model

Empirical model

Data

Validation, Model fit ≠ Explanatory power

Page 46: To Explain Or To Predict?

Model Use

Inference

Test causal theory

(utility) Predictions, Relevance, New theory, Predictability

Predictive performance

Over-fitting analysis

Null hypothesis

Naïve/baseline
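
A minimal sketch of the predictive side of this slide, under made-up assumptions: the simulated data, the 70/30 partition, and the choice of the training mean as the naive benchmark are illustrative and not taken from the talk. Predictive performance is measured on holdout records, compared with a naive/baseline prediction, and compared with training performance as a simple over-fitting check.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: one predictor with a clear linear signal.
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 3.0 * x + rng.normal(scale=1.0, size=n)

# Random partition into training (70%) and holdout (30%) records.
idx = rng.permutation(n)
train, hold = idx[:140], idx[140:]

# Fit a straight line on the training partition only.
slope, intercept = np.polyfit(x[train], y[train], deg=1)

def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

rmse_train = rmse(y[train], intercept + slope * x[train])
rmse_hold = rmse(y[hold], intercept + slope * x[hold])
# Naive benchmark: always predict the training-set mean.
rmse_naive = rmse(y[hold], np.full(hold.size, y[train].mean()))

print(f"training RMSE:       {rmse_train:.3f}")
print(f"holdout RMSE:        {rmse_hold:.3f}")
print(f"naive holdout RMSE:  {rmse_naive:.3f}")
```

A holdout RMSE well below the naive figure indicates predictive power; a holdout RMSE much larger than the training RMSE would point to over-fitting.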

Page 47: To Explain Or To Predict?

Goal Definition

Design & Collection

Data Preparation

EDA

Variables?

Methods?

Evaluation, Validation & Model Selection

Model Use & Reporting

Page 48: To Explain Or To Predict?

How does all this impact

research

in the (social) sciences?

Page 49: To Explain Or To Predict?

Three Current Problems

“While the value of scientific prediction… is beyond question… the inexact sciences [do not] have…the use of predictive expertise well in hand.”

Helmer & Rescher, 1959

Distinction blurred

Inappropriate modeling/assessment

Prediction underappreciated

Page 50: To Explain Or To Predict?

Why?

What can be done?

Statisticians should acknowledge the difference and teach it!

Page 51: To Explain Or To Predict?

It’s time for Change

To Predict

To Explain