Case Studies and Regression Analysisfaculty.wcas.northwestern.edu/~jws780/essexslides/day3.pdfCase...

Preview:

Citation preview

Case Studies and RegressionAnalysis

Jason Seawright

j-seawright@northwestern.edu

August 11, 2010

J. Seawright (PolSci) Essex 2010 August 11, 2010 1 / 44

Examples of Strengths andWeaknesses

J. Seawright (PolSci) Essex 2010 August 11, 2010 2 / 44

Case Selection

J. Seawright (PolSci) Essex 2010 August 11, 2010 3 / 44

Case Selection

1 Study the entire population.

J. Seawright (PolSci) Essex 2010 August 11, 2010 3 / 44

Case Selection

1 Study the entire population.

2 Take a random sample.

J. Seawright (PolSci) Essex 2010 August 11, 2010 3 / 44

Case Selection

1 Study the entire population.

2 Take a random sample.

3 Follow some rule for deliberate case

selection.

J. Seawright (PolSci) Essex 2010 August 11, 2010 3 / 44

Mill’s Methods

J. Seawright (PolSci) Essex 2010 August 11, 2010 4 / 44

Mill’s Methods

Method of Agreement

J. Seawright (PolSci) Essex 2010 August 11, 2010 4 / 44

Mill’s Methods

Method of Agreement

Method of Difference

J. Seawright (PolSci) Essex 2010 August 11, 2010 4 / 44

Mill’s Methods

Method of Agreement

Method of Difference

Etc..

J. Seawright (PolSci) Essex 2010 August 11, 2010 4 / 44

Crucial Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 5 / 44

Crucial Cases

Theory assigns high likelihood to an

outcome in a particular case that, for (all,

or most, or a major) competing theory has

low likelihood.

J. Seawright (PolSci) Essex 2010 August 11, 2010 5 / 44

Regression Analysis

In a survey of 1000 articles in 10 leading

political science journals, 49% used

statistics (Bennett, Barth, and Rutherford

2003).

Presumably, most of them involve somevariant of regression.

J. Seawright (PolSci) Essex 2010 August 11, 2010 6 / 44

Regression Analysis

In a survey of 1000 articles in 10 leading

political science journals, 49% used

statistics (Bennett, Barth, and Rutherford

2003).

Presumably, most of them involve somevariant of regression.

A search in JStor for the word “regression”

finds 10,404 relevant articles.

J. Seawright (PolSci) Essex 2010 August 11, 2010 6 / 44

Regression Analysis

Almost no matter what you work on, you

will have to interact with regression-based

studies.

J. Seawright (PolSci) Essex 2010 August 11, 2010 7 / 44

Choosing Cases

Case-selection rules:

J. Seawright (PolSci) Essex 2010 August 11, 2010 8 / 44

Choosing Cases

Case-selection rules:

Random sampling

J. Seawright (PolSci) Essex 2010 August 11, 2010 8 / 44

Choosing Cases

Case-selection rules:

Random samplingTypical cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 8 / 44

Choosing Cases

Case-selection rules:

Random samplingTypical casesDiverse cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 8 / 44

Choosing Cases

Case-selection rules:

Random samplingTypical casesDiverse casesExtreme cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 8 / 44

Choosing Cases

Case-selection rules:

Random samplingTypical casesDiverse casesExtreme casesDeviant cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 8 / 44

Choosing Cases

Case-selection rules:

Random samplingTypical casesDiverse casesExtreme casesDeviant casesInfluential cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 8 / 44

Choosing Cases

Case-selection rules:

Random samplingTypical casesDiverse casesExtreme casesDeviant casesInfluential casesMost-similar cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 8 / 44

Running Example

J. Seawright (PolSci) Essex 2010 August 11, 2010 9 / 44

Typical Cases

Typicalityi = −abs[yi − E(yi |x1,i , x2,i , . . . , xk ,i)] (1)

J. Seawright (PolSci) Essex 2010 August 11, 2010 10 / 44

Typical Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 11 / 44

Extreme Cases

Extremityi = |xi − x

s| (2)

J. Seawright (PolSci) Essex 2010 August 11, 2010 12 / 44

Deviant Cases

Deviantnessi = −Typicalityi (3)

J. Seawright (PolSci) Essex 2010 August 11, 2010 13 / 44

Influential Cases

Cook’s distance is a statistical measure of

how much the overall regression result

would change if a given case is deleted.

J. Seawright (PolSci) Essex 2010 August 11, 2010 14 / 44

Influential Cases

Cook’s distance is a statistical measure of

how much the overall regression result

would change if a given case is deleted.

A Cook’s distance score of 1 or more

usually is regarded as representing

substantial influence.

J. Seawright (PolSci) Essex 2010 August 11, 2010 14 / 44

Influential Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 15 / 44

Most-Similar Cases

Matching techniques are an automated way

of finding most similar cases.

J. Seawright (PolSci) Essex 2010 August 11, 2010 16 / 44

Measurement Error in Y

Y∗i = Yi + δY ,i

J. Seawright (PolSci) Essex 2010 August 11, 2010 17 / 44

Measurement Error in Y

Y∗i = Yi + δY ,i

Random Sampling

J. Seawright (PolSci) Essex 2010 August 11, 2010 17 / 44

Measurement Error in Y

Typical/Deviant Cases:

ei = Yi −Hi ,·Y + δY ,i

J. Seawright (PolSci) Essex 2010 August 11, 2010 18 / 44

Measurement Error in Y

Influential Cases Strategy:

Maximizes the product of the error term and the

weighted average distance of the right-hand-side

variables from their means.

J. Seawright (PolSci) Essex 2010 August 11, 2010 19 / 44

Measurement Error in Y

Extreme Cases:

Y∗i = Yi + δY ,i

J. Seawright (PolSci) Essex 2010 August 11, 2010 20 / 44

Measurement Error in Y

Most-Similar Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 21 / 44

Measurement Error in Y

Most-Similar Cases

Most-Different Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 21 / 44

Measurement Error in X

X∗i = Xi + δX ,i

J. Seawright (PolSci) Essex 2010 August 11, 2010 22 / 44

Measurement Error in X

X∗i = Xi + δX ,i

Random Sampling

J. Seawright (PolSci) Essex 2010 August 11, 2010 22 / 44

Measurement Error in X

Typical/Deviant Cases:

ei = Yi − Xi β∗ − δX ,i β

J. Seawright (PolSci) Essex 2010 August 11, 2010 23 / 44

Measurement Error in X

Influential Cases Strategy:

Maximizes the product of the error term and the

weighted average distance of the right-hand-side

variables from their means.

J. Seawright (PolSci) Essex 2010 August 11, 2010 24 / 44

Measurement Error in X

Extreme Cases:

X∗i = Xi + δX ,i

J. Seawright (PolSci) Essex 2010 August 11, 2010 25 / 44

Measurement Error in X

Most-Similar Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 26 / 44

Measurement Error in X

Most-Similar Cases

Most-Different Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 26 / 44

Omitted Variables

ei = di + γZi , where Zi = ZI − E(Zi |Xi)

J. Seawright (PolSci) Essex 2010 August 11, 2010 27 / 44

Omitted Variables

ei = di + γZi , where Zi = ZI − E(Zi |Xi)

Random Sampling

J. Seawright (PolSci) Essex 2010 August 11, 2010 27 / 44

Omitted Variables

Typical/Deviant Cases:

ei = di + γZi

J. Seawright (PolSci) Essex 2010 August 11, 2010 28 / 44

Omitted Variables

Influential Cases Strategy:

Maximizes the product of the error term and the

weighted average distance of the right-hand-side

variables from their means.

J. Seawright (PolSci) Essex 2010 August 11, 2010 29 / 44

Omitted Variables

Extreme Cases:

J. Seawright (PolSci) Essex 2010 August 11, 2010 30 / 44

Omitted Variables

Extreme Cases:

For confounders, extreme on X may be a good

strategy.

J. Seawright (PolSci) Essex 2010 August 11, 2010 30 / 44

Omitted Variables

Extreme Cases:

For confounders, extreme on X may be a good

strategy.

Extreme on Y maximizes:

Yi + di + γZi

J. Seawright (PolSci) Essex 2010 August 11, 2010 30 / 44

Omitted Variables

Most-Similar Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 31 / 44

Omitted Variables

Most-Similar Cases

Most-Different Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 31 / 44

Pathway Variables

Wi = ν + µXi + ωi

Yi = α + τWi + σi

J. Seawright (PolSci) Essex 2010 August 11, 2010 32 / 44

Pathway Variables

Wi = ν + µXi + ωi

Yi = α + τWi + σiRandom Sampling

J. Seawright (PolSci) Essex 2010 August 11, 2010 32 / 44

Pathway Variables

Typical/Deviant Cases:

ei = τωi + σi

J. Seawright (PolSci) Essex 2010 August 11, 2010 33 / 44

Pathway Variables

Influential Cases Strategy:

Maximizes the product of the error term and the

weighted average distance of the right-hand-side

variables from their means.

J. Seawright (PolSci) Essex 2010 August 11, 2010 34 / 44

Pathway Variables

Extreme Cases:

J. Seawright (PolSci) Essex 2010 August 11, 2010 35 / 44

Pathway Variables

Extreme Cases:

Wi = ν + µXi + ωi

J. Seawright (PolSci) Essex 2010 August 11, 2010 35 / 44

Pathway Variables

Extreme Cases:

Wi = ν + µXi + ωi

Extreme on Y maximizes:

Yi = α + τWi + σi

J. Seawright (PolSci) Essex 2010 August 11, 2010 35 / 44

Pathway Variables

Most-Similar Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 36 / 44

Pathway Variables

Most-Similar Cases

Most-Different Cases

J. Seawright (PolSci) Essex 2010 August 11, 2010 36 / 44

Summary: Analytic Arguments

Deviant Influential Ext. X Ext. Y

Error in Y Good Mixed Poor Good

Error in X Mixed Mixed Good Poor

Confound Good Mixed Mixed Good

Pathway Good Mixed Good Mixed

J. Seawright (PolSci) Essex 2010 August 11, 2010 37 / 44

Monte Carlo for Case Selection

Simulate case selection for the same problem

10,000 times.

J. Seawright (PolSci) Essex 2010 August 11, 2010 38 / 44

Monte Carlo for Case Selection

Simulate case selection for the same problem

10,000 times.

Analysis of presidential vote shares and the

economy in Latin America, 1980-2000.

J. Seawright (PolSci) Essex 2010 August 11, 2010 38 / 44

Monte Carlo for Case Selection

Simulate case selection for the same problem

10,000 times.

Analysis of presidential vote shares and the

economy in Latin America, 1980-2000.

Add measurement error, omitted variables,

etc.

J. Seawright (PolSci) Essex 2010 August 11, 2010 38 / 44

Monte Carlo for Case Selection

Simulate case selection for the same problem

10,000 times.

Analysis of presidential vote shares and the

economy in Latin America, 1980-2000.

Add measurement error, omitted variables,

etc.

2 SD Rule

J. Seawright (PolSci) Essex 2010 August 11, 2010 38 / 44

Simulation Results

Random Typical Influential Deviant Extreme Y Extreme X Similar Different Contrast

0.00

0.05

0.10

0.15

0.20

Figure: Case Selection for Finding Confounder.

J. Seawright (PolSci) Essex 2010 August 11, 2010 39 / 44

Simulation Results

Random Typical Influential Deviant Extreme Y Extreme X Similar Different Contrast

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Figure: Case Selection for Other Causes.

J. Seawright (PolSci) Essex 2010 August 11, 2010 40 / 44

Simulation Results

Random Typical Influential Deviant Extreme Y Extreme X Similar Different Contrast

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

Figure: Case Selection for Exploring Mechanisms.

J. Seawright (PolSci) Essex 2010 August 11, 2010 41 / 44

Simulation Results

Random Typical Influential Deviant Extreme Y Extreme X Similar Different Contrast

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Figure: Case Selection for Error in X .

J. Seawright (PolSci) Essex 2010 August 11, 2010 42 / 44

Simulation Results

Random Typical Influential Deviant Extreme Y Extreme X Similar Different Contrast

0.0

0.2

0.4

0.6

0.8

Figure: Case Selection for Error in Y .

J. Seawright (PolSci) Essex 2010 August 11, 2010 43 / 44

Simulation Results

Random Typical Influential Deviant Extreme Y Extreme X Similar Different Contrast

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Figure: Case Selection for Estimating Overall Slope.

J. Seawright (PolSci) Essex 2010 August 11, 2010 44 / 44

Case-selection software in R

J. Seawright (PolSci) Essex 2010 August 11, 2010 45 / 44

Assignment

Implement each case-selection technique for a

data set of interest, or off my website. Be

prepared to discuss what kinds of cases you get,

and whether they seem on first glance to be

useful, tomorrow.

J. Seawright (PolSci) Essex 2010 August 11, 2010 46 / 44

Recommended