
Estimators - University of Warwick


Page 1

Estimators

• Any statistic whose values are used to estimate a parameter theta is defined to be an estimator of theta. If a parameter is estimated by an estimator, we usually write theta-hat, where the hat indicates that we are dealing with an estimator of theta. Some textbooks use Greek letters for the unknown parameters and Roman letters for their estimators.

• Estimators are random variables with a fixed component (mean) and a random component (disturbance).

• Estimators are useful since we normally cannot observe the true underlying population and the characteristics of its distribution/density.

• The sample mean is an estimator/estimate of the expected population value and the sample variance is an estimator/estimate of the true population variance.

• The formula/rule for calculating the mean/variance (characteristic) from a sample is called the estimator; the resulting value is called the estimate. Thus the realization of an estimator, i.e. the realization of the random variable, is called an estimate of theta.
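The estimator/estimate distinction can be sketched in a few lines of Python (an illustration, not from the slides; the population and sample sizes are arbitrary choices): the function is the estimator, each returned number is an estimate, and different samples yield different estimates.

```python
import random

def sample_mean(sample):
    """The estimator: a rule that maps any sample to a number."""
    return sum(sample) / len(sample)

random.seed(1)
# Hypothetical population with true mean 10 (normally unobservable).
population = [random.gauss(10, 2) for _ in range(100_000)]

# Two different random samples give two different realizations
# (estimates) of the same estimator.
estimate_1 = sample_mean(random.sample(population, 50))
estimate_2 = sample_mean(random.sample(population, 50))
print(estimate_1, estimate_2)  # both close to 10, but not equal to each other
```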

Page 2

Testing

• In empirical research, hypothesis testing is at least as important as estimation.

• An estimate of a parameter might have occurred by chance.

• Since an estimator is a random variable, the estimate is just ONE realization of the random variable. Thus, an alternative random sample of the same size can possibly generate an estimate of the parameter with completely different implications.

• Therefore we need confidence in the precision of our estimate.

• If we can derive the distribution of the estimator, we can make statements about the variability of the estimates using the variance of the estimator.

• Tests are tools that allow us to discriminate between the estimates for different models.

Page 3

Definition

A test is a decision problem with respect to an unknown parameter or a relationship between different parameters.

We condense the information in the sample by a function of the data to solve the decision problem. This function is called a test statistic.

Page 4

Test Statistic

A test statistic is a known function of random variables whose purpose is to test the null hypothesis. The test statistic has a known distribution under the null hypothesis, i.e. its distribution is known if the null is true. The test statistic follows another, unknown distribution if the alternative hypothesis is true.

Example: two variables x and y.

H0: mean(x) = mean(y)

Ha: mean(x) ≠ mean(y)

Page 5

Types of Hypotheses

Let theta be a parameter and r a constant which is known under the null hypothesis:

1. Two-sided test of a simple hypothesis about parameter theta:
H0: theta = r
Ha: theta ≠ r

2. One-sided test of a hypothesis about parameter theta:
H0: theta ≥ r
Ha: theta < r

3. Two-sided test of a simple hypothesis about more than one parameter:

H0: R1*theta + R2*theta’ = r
Ha: R1*theta + R2*theta’ ≠ r

4. Joint hypothesis:
H0: theta1 = c1, theta2 = c2, theta3 = c3, …, theta_k = c_k
Ha: at least one of the equalities under the null does not hold

Page 6

Intuition

Tests can be designed as a kind of distance measure which measures the distance between the hypothesized parameter values and the evidence produced by the data.

If an estimate is close to the hypothesized value r we would tend not to reject the null. Large deviations of the estimate from r would lead us to reject the null.

But what is ‘close’ and what is ‘far’? Classical inference divides the outcome space of the test statistic into disjoint regions, an acceptance region and a critical region. If the outcome of the test lies within the acceptance region, we do not reject the null.

Since by definition the distribution of the test statistic is known under the null, we define the critical region by the probability of extreme outcomes of the test statistic, given the null is true. For example, we define the 95 percent quantile, which indicates that only 5 percent of all outcomes of the test are larger than this critical value.
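A minimal simulation (illustrative, not from the slides) of how a critical value arises from the known null distribution of a test statistic; the sample size of 20 and the one-sample t-statistic are assumptions made for the sketch:

```python
import math
import random
import statistics

def t_stat(sample, c=0.0):
    """One-sample t-statistic for H0: mu = c."""
    n = len(sample)
    return (statistics.fmean(sample) - c) / (statistics.stdev(sample) / math.sqrt(n))

random.seed(42)
# Draw the test statistic many times with the null (mu = 0) actually true:
null_draws = sorted(
    t_stat([random.gauss(0, 1) for _ in range(20)]) for _ in range(10_000)
)
# Empirical 95 percent quantile: only 5 percent of outcomes exceed it.
critical_value = null_draws[int(0.95 * len(null_draws))]
print(critical_value)  # close to the tabulated t_{19, 0.95} of roughly 1.73
```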

Page 7

Power of a test

The inference based on the outcome of the test statistic may be wrong. The probability of rejecting the null for a given critical region, as a function of the parameter, is called the power function.

For a given test statistic and a critical region of a given significance level we define the probability of rejecting the null hypothesis as the power of a test.

The power would be optimal if the probability of rejecting the null were 1 whenever the alternative is true and 0 whenever the null is true.

This is, however, not the case in reality. There is always a positive probability of drawing the wrong conclusions from the results:

• One can reject the null hypothesis even though it is true (type I error, alpha error):
alpha = Pr[Type I Error] = Pr[rejecting H0 | H0 is true]

• Or not reject the null hypothesis even though it is wrong (type II error, beta error):
beta = Pr[Type II Error] = Pr[accepting H0 | Ha is true]
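Both error probabilities and the power can be approximated by simulation (an illustrative sketch, not part of the slides; the sample size of 20, the effect size 0.5 and the critical value t_{19, 0.975} = 2.093 are assumptions):

```python
import math
import random
import statistics

def t_stat(sample, c=0.0):
    """One-sample t-statistic for H0: mu = c."""
    n = len(sample)
    return (statistics.fmean(sample) - c) / (statistics.stdev(sample) / math.sqrt(n))

random.seed(7)
CRITICAL = 2.093   # two-sided 5% critical value of the t-distribution, 19 df
REPS = 4000

# Type I error rate: H0 (mu = 0) is true, yet we sometimes reject.
alpha_hat = sum(
    abs(t_stat([random.gauss(0.0, 1) for _ in range(20)])) > CRITICAL
    for _ in range(REPS)
) / REPS

# Power: the alternative is true (mu = 0.5); how often do we correctly reject?
power_hat = sum(
    abs(t_stat([random.gauss(0.5, 1) for _ in range(20)])) > CRITICAL
    for _ in range(REPS)
) / REPS
print(alpha_hat, power_hat)  # alpha_hat near 0.05; power clearly below 1
```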

Page 8

Type I and Type II errors

Alpha and beta errors, an example:

A criminal trial: taking as the null hypothesis that the defendant is innocent, a type I error occurs when the jury wrongly decides that the defendant is guilty. A type II error occurs when the jury wrongly acquits the defendant.

In a significance test:

H0: beta is insignificant, i.e. beta = 0:

Type I error: wrongly rejecting the null hypothesis.

Type II error: wrongly accepting the null that the coefficient is zero.

The choice of significance level increases or decreases the probabilities of type I and type II errors.

The smaller the significance level (5%, 1%), the lower the probability of a type I error and the higher the probability of a type II error.

Page 9

Accepting Hypotheses?

From a research-strategy point of view, we do not speak about acceptance of a hypothesis. We only infer that the data do not contradict the null hypothesis. This terminology emphasizes that we do not know the real world after testing; the hypothesis about the world is just a working hypothesis which we use as long as we do not find a model that describes the world more appropriately.

Page 10

One-Sample T-Test

Assumptions: Y1,…, Yn – iid (independent, identically distributed) sample, Y ~ N(mu, sigma²); if Y is not normally distributed, appeal to the central limit theorem.

H0: mu = c

Ha: mu ≠ c (thus mu>c or mu<c)

test statistic: Student’s t-statistic, which follows a t-distribution under the null with n-1 degrees of freedom:

t = (ybar - c) / (s / sqrt(n))

H0 is rejected at the level alpha if:

|t| > t_{n-1, 1-alpha/2}

Since this is a two-sided test, alpha defines the critical values to which the test statistic is compared at the 2.5th and 97.5th percentiles of the t-distribution.
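The one-sample t-test above can be sketched directly in Python (the numeric sample is made up for illustration):

```python
import math
import statistics

def one_sample_t(y, c):
    """t = (ybar - c) / (s / sqrt(n)); compare |t| with t_{n-1, 1-alpha/2}."""
    n = len(y)
    return (statistics.fmean(y) - c) / (statistics.stdev(y) / math.sqrt(n))

y = [5.1, 4.8, 5.6, 5.3, 4.9, 5.2, 5.4, 5.0]   # hypothetical sample, n = 8
t = one_sample_t(y, c=5.0)                      # H0: mu = 5
print(round(t, 2))  # 1.72
# The critical value t_{7, 0.975} = 2.365, so |t| < 2.365
# and H0 is not rejected at the 5% level.
```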

Page 11

Two-Sample T-Test

Assumptions: Y1,…, Yn1; X1,…, Xn2 – iid (independent, identically distributed) samples, Y ~ N(mu_Y, sigma²), X ~ N(mu_X, sigma²)

H0: mu_Y = mu_X

Ha: mu_Y ≠ mu_X

test statistic: Student’s t-statistic, which follows a t-distribution under the null with n1+n2-2 degrees of freedom:

T = (ybar - xbar) / sqrt(s1²/n1 + s2²/n2), T ~ t_{n1+n2-2}

H0 is rejected at the level alpha if:

|T| > t_{n1+n2-2, 1-alpha/2}
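And the two-sample version, again with made-up numbers for illustration:

```python
import math
import statistics

def two_sample_t(y, x):
    """T = (ybar - xbar) / sqrt(s_y^2/n1 + s_x^2/n2);
    under H0, T follows a t-distribution with n1 + n2 - 2 df."""
    n1, n2 = len(y), len(x)
    return (statistics.fmean(y) - statistics.fmean(x)) / math.sqrt(
        statistics.variance(y) / n1 + statistics.variance(x) / n2
    )

y = [12.1, 11.4, 13.0, 12.6, 11.8, 12.3]   # hypothetical sample 1
x = [10.9, 11.2, 10.4, 11.0, 10.7, 11.5]   # hypothetical sample 2
T = two_sample_t(y, x)
print(round(T, 2))  # 4.46
# The critical value t_{10, 0.975} = 2.228, so |T| > 2.228
# and H0: mu_Y = mu_X is rejected at the 5% level.
```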

Page 12

In a two-sided test, alpha defines the critical values to which the test statistic is compared at the 2.5th and 97.5th percentiles of the t-distribution.

The area under the curve captures the probability of all outcomes of the test statistic T and has to sum to 1.

[Figure: probability density f(t) of the test statistic T. The acceptance region around 0, where H0 is not rejected, lies between the two critical values; each tail beyond them carries probability alpha/2. Reject H0 if |T| > t_{n1+n2-2, 1-alpha/2}.]

Page 13

The simple linear Regression Model

• The correlation coefficient is non-parametric and just indicates that two variables are associated with one another, but it does not give any idea of the kind of relationship.

• Regression models help investigate bivariate and multivariate relationships between variables, where we can hypothesize that one variable depends on another variable or on a combination of other variables.

• Normally, relationships between variables in political science and economics are not exact – unless true by definition – but most often include a non-structural or random component, due to the probabilistic nature of theories and hypotheses in PolSci, measurement errors etc.

• Regression analysis enables us to find average relationships that may not be obvious by just „eye-balling“ the data – an explicit formulation of the structural and random components of a hypothesized relationship between variables.

• Example: positive relationship between unemployment and government spending

Page 14

Simple linear regression analysis

• Linear relationship between x (explanatory variable) and y (dependent variable)

• Epsilon describes the random component of the linear relationship between x and y

[Figure: scatter plot of y against x with the fitted regression line.]

y_i = alpha + beta*x_i + epsilon_i

Page 15

• y_i is the value of the dependent variable (spending) in observation i (e.g. in the UK)

• y_i is determined by 2 components:

1. the non-random/structural component alpha + beta*x_i
– where x_i is the independent/explanatory variable (unemployment) in observation i (UK) and alpha and beta are fixed quantities, the parameters of the model; alpha is called the constant or intercept and measures the value where the regression line crosses the y-axis; beta is called the coefficient/slope and measures the steepness of the regression line.

2. the random component, called the disturbance or error term, epsilon_i in observation i

y_i = alpha + beta*x_i + epsilon_i

Page 16

A simple example:
• x has 10 observations: 0,1,2,3,4,5,6,7,8,9
• The true relationship between y and x is y = 5 + 1*x; thus the true y takes on the values 5,6,7,8,9,10,11,12,13,14
• There is some disturbance, e.g. a measurement error, which is standard normally distributed: thus the y we can measure takes on the values 6.95, 5.22, 6.36, 7.03, 9.71, 9.67, 10.69, 13.85, 13.21, 14.82 – which are close to the true values, but for any given observation the observed value is a little larger or smaller than the true value.
• The relationship between x and y should hold true on average but is not exact.
• When we do our analysis, we don‘t know the true relationship and the true y; we just have the observed x and y.
• We know that the relationship between x and y should have the following form: y = alpha + beta*x + epsilon (we hypothesize a linear relationship)
• The regression analysis „estimates“ the parameters alpha and beta by using the given observations for x and y.
• The simplest form of estimating alpha and beta is called ordinary least squares (OLS) regression.
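The example can be reproduced in a few lines of Python, using the OLS formulas derived later in these slides:

```python
def ols(x, y):
    """betahat = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2),
    alphahat = ybar - betahat * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    betahat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
        (xi - xbar) ** 2 for xi in x
    )
    return ybar - betahat * xbar, betahat

x = list(range(10))                      # 0, 1, ..., 9
y = [6.95, 5.22, 6.36, 7.03, 9.71,       # the observed y from the example
     9.67, 10.69, 13.85, 13.21, 14.82]
alphahat, betahat = ols(x, y)
print(round(alphahat, 2), round(betahat, 2))  # 4.97 1.06
```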

Page 17

OLS-Regression:

• Draw a line through the scatter plot in a way that minimizes the deviations of the single observations from the line:

• Minimize the sum of all squared deviations from the line (squared residuals)

• This is done mathematically by the statistical program at hand

• The values of the dependent variable on the line are called the predicted values of the regression (yhat): 4.97, 6.03, 7.10, 8.16, 9.22, 10.28, 11.34, 12.41, 13.47, 14.53 – these are very close to the „true values“; the estimated alpha = 4.97 and beta = 1.06

[Figure: two panels. Left: scatter plot of y against x. Right: y and the fitted values against x; alpha marks where the line crosses the y-axis, and for observations 1 and 7 the residuals are shown as vertical distances between y1 and yhat1 and between y7 and yhat7 (epsilon7).]

yhat_i = alphahat + betahat*x_i ; epsilonhat_i = y_i - alphahat - betahat*x_i

Page 18

OLS regression

Ordinary least squares regression minimizes the squared residuals.

Components:
• DV: y; at least 1 IV: x
• Constant or intercept term: alpha
• Regression coefficient, slope: beta
• Error term, residuals: epsilon

yhat_i = alphahat + betahat*x_i

y_i = alpha + beta*x_i + epsilon_i

sum_{i=1}^{n} (y_i - yhat_i)² = sum_{i=1}^{n} epsilonhat_i² → min

[Figure: component-plus-residual values of y plotted against x with the fitted regression line.]

Page 19

Derivation of the OLS parameters alpha and beta:

The relationship between x and y is described by the function:

y_i = alpha + beta*x_i + epsilon_i

The difference between the dependent variable y and the estimated systematic influence of x on y is called the residual:

e_i = y_i - alphahat - betahat*x_i

To obtain the optimal estimates for alpha and beta we need a choice criterion; in the case of OLS this criterion is the sum of squared residuals: we calculate alphahat and betahat for the case in which the sum of all squared deviations (residuals) is minimal:

min over (alphahat, betahat) of sum_{i=1}^{n} e_i² = sum_{i=1}^{n} (y_i - alphahat - betahat*x_i)² = S(alphahat, betahat)

Squaring the residuals is necessary so that a) positive and negative deviations do not cancel each other out, and b) positive and negative estimation errors enter with the same weight; due to the squaring it is irrelevant whether the expected value for observation y_i is underestimated or overestimated.

Since the measure is additive, no single value is of utmost relevance.

Large residuals, however, receive a stronger weight due to the squaring.

Page 20

Minimizing the function requires calculating the first-order conditions with respect to alphahat and betahat and setting them to zero:

I: dS(alphahat, betahat)/d(alphahat) = -2 * sum_{i=1}^{n} (y_i - alphahat - betahat*x_i) = 0

II: dS(alphahat, betahat)/d(betahat) = -2 * sum_{i=1}^{n} (y_i - alphahat - betahat*x_i) * x_i = 0

This is just a linear system of two equations with two unknowns alphahat and betahat, which we can solve for alphahat:

I: sum_{i=1}^{n} (y_i - alphahat - betahat*x_i) = 0

sum y_i - n*alphahat - betahat * sum x_i = 0  =>  alphahat = ybar - betahat*xbar

Page 21

… and for betahat:

II: sum_{i=1}^{n} (y_i - alphahat - betahat*x_i) * x_i = 0

Substituting alphahat = ybar - betahat*xbar:

sum (y_i - ybar + betahat*xbar - betahat*x_i) * x_i = 0

sum (y_i - ybar)*x_i - betahat * sum (x_i - xbar)*x_i = 0

betahat = sum (y_i - ybar)*x_i / sum (x_i - xbar)*x_i
        = sum_{i=1}^{n} (x_i - xbar)(y_i - ybar) / sum_{i=1}^{n} (x_i - xbar)²
        = Cov(x, y) / Var(x)

In matrix notation: betahat = (X'X)^(-1) X'y

Page 22

Naturally we still have to verify whether alphahat and betahat really minimize the sum of squared residuals and satisfy the second-order conditions of the minimization problem. Thus we need the second derivatives of the two functions with respect to alphahat and betahat, which are given by the so-called Hessian matrix (matrix of second derivatives). (I spare you the mathematical derivation.)

The Hessian matrix has to be positive definite (the determinant must be larger than 0) so that alphahat and betahat globally minimize the sum of squared residuals. Only in this case are alphahat and betahat optimal estimates for the relationship between the dependent variable y and the independent variable x.

Page 23

Regression coefficient:

Betahat equals the covariance between y and x divided by the variance of x:

betahat_yx = sum_{i=1}^{n} (x_i - xbar)(y_i - ybar) / sum_{i=1}^{n} (x_i - xbar)²

Page 24

Interpretation of regression results:

reg y x

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =   89.78
       Model |  1248.96129     1  1248.96129           Prob > F      =  0.0000
    Residual |   1363.2539    98  13.9107541           R-squared     =  0.4781
-------------+------------------------------           Adj R-squared =  0.4728
       Total |  2612.21519    99   26.386012           Root MSE      =  3.7297

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.941914   .2049419     9.48   0.000     1.535213    2.348614
       _cons |   .8609647   .4127188     2.09   0.040     .0419377    1.679992
------------------------------------------------------------------------------

If x increases by 1 unit, y increases by 1.94 units: the interpretation is linear and straightforward.

Page 25

Interpretation: example

• alpha = 4.97, beta = 1.06

• Education and earnings: no education gives you a minimal hourly wage of around 5 pounds. Each additional year of education increases the hourly wage by approx. 1 pound:

[Figure: y and fitted values plotted against x (0 to 10); the line crosses the y-axis at alpha and the slope is beta = 1.06.]

Page 26

Properties of the OLS estimator:

• Since alphahat and betahat are estimates of the unknown parameters, yhat_i = alphahat + betahat*x_i estimates the mean function or the systematic part of the regression equation. Since a random variable is predicted best by the mean function (under the mean squared error criterion), yhat can be interpreted as the best prediction of y. The difference between the dependent variable y and its least squares prediction is the least squares residual: e = y - yhat = y - (alphahat + betahat*x).

• A large residual e can either be due to a poor estimation of the parameters of the model or to a large unsystematic part of the regression equation.

• For the OLS model to be the best estimator of the relationship between x and y, several conditions (full ideal conditions, Gauss-Markov conditions) have to be met.

• If the „full ideal conditions“ are met, one can argue that the OLS estimator imitates the properties of the unknown model of the population. This means e.g. that the explanatory variables and the error term are uncorrelated.

Page 27

Gauss-Markov Assumptions, Full Ideal Conditions of OLS

The full ideal conditions consist of a collection of assumptions about the true regression model and the data generating process and can be thought of as a description of an ideal data set. The ideal conditions have to be met in order for OLS to be a good estimator (BLUE: unbiased and efficient).

Most real data do not satisfy these conditions, since they are not generated by an ideal experiment. However, the linear regression model under full ideal conditions can be thought of as the benchmark case with which other models assuming a more realistic DGP should be compared.

One has to be aware of the ideal conditions and their violation to be able to control for deviations from these conditions and render results unbiased or at least consistent:

1. Linearity in the parameters alpha and beta: the DV is a linear function of a set of IVs and a random error component.

→ Problems: non-linearity, wrong determinants, wrong estimates; a relationship that is actually there cannot be detected with a linear model.

2. The expected value of the error term is zero for all observations:

E(epsilon_i) = 0

→ Problem: the intercept is biased.
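A small simulation (not from the slides) illustrating why a non-zero error mean biases the intercept while, with a deterministic x, the slope estimate stays intact; the true parameters and the error mean of 1 are arbitrary choices for the sketch:

```python
import random

def ols(x, y):
    """Simple OLS: returns (alphahat, betahat)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    betahat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
        (xi - xbar) ** 2 for xi in x
    )
    return ybar - betahat * xbar, betahat

random.seed(3)
x = [i / 10 for i in range(200)]
# True model: y = 2 + 0.5 x, but E(epsilon) = 1 instead of 0.
y = [2 + 0.5 * xi + random.gauss(1.0, 0.3) for xi in x]
alphahat, betahat = ols(x, y)
print(round(alphahat, 2), round(betahat, 2))
# The error mean is absorbed into the intercept (near 3 = 2 + 1),
# while the slope stays near the true 0.5.
```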

Page 28

3. Homoskedasticity: the conditional variance of the error term is constant in all x and over time. The error variance is a measure of model uncertainty; homoskedasticity implies that the model uncertainty is identical across observations:

Var(epsilon_i) = E(epsilon_i²) = sigma² = constant

→ Problem: heteroskedasticity – the variance of the error term differs across observations – model uncertainty varies from observation to observation – often a problem in cross-sectional data, omitted variables bias.

4. The error term is independently distributed and not correlated; no correlation between observations of the DV:

Cov(epsilon_i, epsilon_j) = E(epsilon_i * epsilon_j) = 0, i ≠ j

→ Problem: spatial correlation (panel and cross-sectional data), serial correlation/autocorrelation (panel and time-series data).

Page 29

5. x_i is deterministic: x is uncorrelated with the error term since x_i is deterministic:

Cov(x_i, epsilon_i) = E(x_i * epsilon_i) - E(x_i) * E(epsilon_i)
                    = x_i * E(epsilon_i) - x_i * E(epsilon_i)   (since x_i is deterministic)
                    = 0

→ Problems: omitted variable bias, endogeneity and simultaneity

6. Other problems: measurement errors, multicollinearity

If all Gauss-Markov assumptions are met then the OLS estimators alphahat and betahat are BLUE – best linear unbiased estimators:

best: the variance of the OLS estimator is minimal, smaller than the variance of any other linear unbiased estimator.

linear: if the relationship is not linear, OLS is not applicable.

unbiased: the expected values of the estimated beta and alpha equal the true values describing the relationship between x and y.

Page 30

Inference

Is it possible to generalize the regression results for the sample under observation to the universe of cases (the population)?

Can you draw conclusions for individuals, countries, time-points beyond those observations in your data-set?

• Significance tests are designed to answer exactly these questions.

• If a coefficient is significant (p-value < 0.10, 0.05, 0.01) then you can draw conclusions for observations beyond the sample under observation.

But…

• Only if the sample matches the characteristics of the population.

• This is normally the case if all (Gauss-Markov) assumptions of OLS regression are met by the data under observation.

• If this is not the case, the standard errors of the coefficients might be biased and therefore the result of the significance test might be wrong as well, leading to false conclusions.

Page 31

Significance test: the t-test

Under the Gauss-Markov assumptions the OLS slope estimator is normally distributed:

betahat ~ N(beta, sigma² / (N * Var(x)))

Page 32

The t-test:

• T-test for significance: testing the H0 (null hypothesis) that beta equals zero: H0: beta = 0; Ha: beta ≠ 0

• The test statistic follows a Student t-distribution under the null:

t = (betahat - r) / SE(betahat), with SE(betahat) = sqrt( (SSR/(n-2)) / (N * Var(x)) )

where r is the hypothesized value (here r = 0) and SSR is the sum of squared residuals.

• t_crit is the critical value of a t-distribution for a specific number of observations and a specific level of significance: the convention in statistics is a significance level of 5% (2.5% on each side of the t-distribution for a two-sided test). The probability of obtaining a test statistic at least as extreme as the observed one, given the null is true, is called the p-value.
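Applying this formula to the earlier example data (x = 0…9 and the observed y from the simple example above) gives a clearly significant slope:

```python
import math

def ols_t(x, y):
    """t = betahat / SE(betahat),
    SE = sqrt((SSR/(n-2)) / sum((xi - xbar)^2))."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)   # equals N * Var(x)
    betahat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    alphahat = ybar - betahat * xbar
    ssr = sum((yi - alphahat - betahat * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt((ssr / (n - 2)) / sxx)
    return betahat / se, se

x = list(range(10))
y = [6.95, 5.22, 6.36, 7.03, 9.71, 9.67, 10.69, 13.85, 13.21, 14.82]
t, se = ols_t(x, y)
print(round(t, 2), round(se, 3))
# |t| is far above 2, so beta is significant at the 5% level (rule of thumb).
```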

Page 33

Assume betahat is 1 and the estimated standard error is 0.8. The critical value of the two-sided symmetric Student t-distribution for n = ∞ and alpha = 5% is 1.96.

Acceptance at the 5% level:

The null (no significant relationship) will not be rejected if:

-1.96 ≤ t ≤ 1.96

This condition can be expressed in terms of beta by substituting t = (betahat - 0)/SE(betahat):

-1.96 ≤ (betahat - 0)/SE(betahat) ≤ 1.96

Multiplying through by the SE of betahat:

0 - 1.96*SE(betahat) ≤ betahat ≤ 0 + 1.96*SE(betahat)

Substituting betahat = 1 and SE(betahat) = 0.8:

0 - 1.96*0.8 ≤ 1 ≤ 0 + 1.96*0.8

-1.568 ≤ 1 ≤ 1.568

Since this inequality holds, the null hypothesis is not rejected. Thus we accept that there is rather no significant relationship between x and y; at the 5% level we cannot rule out that beta equals zero.

Page 34

[Figure: probability density function of betahat with the acceptance region from -1.568 to 1.568 around 0 and 2.5% in each tail; the estimate betahat = 1 lies inside the acceptance region.]

Page 35

Now assume that the standard error of betahat is 0.4 instead of 0.8. We get:

0 - 1.96*0.4 ≤ 1 ≤ 0 + 1.96*0.4

-0.784 ≤ 1 ≤ 0.784

This inequality does not hold, therefore we reject the null hypothesis that beta equals zero and decide in favour of the alternative hypothesis that there is a significant positive relationship between x and y.

[Figure: probability density function of betahat with the acceptance region from -0.784 to 0.784 around 0 and 2.5% in each tail; the estimate betahat = 1 lies outside the acceptance region.]

Page 36

Significance test – rule of thumb:

If the regression coefficient (beta) is at least twice as large as the corresponding standard error of beta, the result is statistically significant at the 5% level.

Page 37

Power of a test

For a given test statistic and a critical region of a given significance level we define the probability of rejecting the null hypothesis as the power of a test.

The power would be optimal if the probability of rejecting the null were 1 whenever the alternative is true and 0 whenever the null is true.

This is, however, not the case in reality. There is always a positive probability of drawing the wrong conclusions from the results:

• One can reject the null hypothesis even though it is true (type I error, alpha error):
alpha = Pr[Type I Error] = Pr[rejecting H0 | H0 is true]

• Or not reject the null hypothesis even though it is wrong (type II error, beta error):
beta = Pr[Type II Error] = Pr[accepting H0 | Ha is true]

Page 38

Type I and Type II errors

Alpha and beta errors, an example:

A criminal trial: taking as the null hypothesis that the defendant is innocent, a type I error occurs when the jury wrongly decides that the defendant is guilty. A type II error occurs when the jury wrongly acquits the defendant.

In a significance test:

H0: beta is insignificant, i.e. beta = 0:

Type I error: wrongly rejecting the null hypothesis.

Type II error: wrongly accepting the null that the coefficient is zero.

The choice of significance level increases or decreases the probabilities of type I and type II errors.

The smaller the significance level (5%, 1%), the lower the probability of a type I error and the higher the probability of a type II error.

Page 39

Confidence Intervals

Significance tests assume that hypotheses come before the test: beta ≠ 0. However, the significance test leaves us with some vacuum, since we know that beta is different from zero but, since we have a probabilistic theory, we are not sure what the exact value should be.

Confidence intervals give us a range of numbers that are plausible and compatible with the hypothesis.

As for significance tests, the researcher has to choose the level of confidence (95% is the convention).

Using the same example again: the estimated beta is 1 and SE(beta) is 0.4; the critical values of the two-sided t-distribution are 1.96 and -1.96.


Calculation of the confidence interval: the question is how far a hypothetical value can differ from the estimated result before the two become incompatible.

The regression coefficient b and the hypothetical value beta are incompatible if either

That is if beta satisfies the double inequality:

Any hypothetical value of beta that satisfies this inequality will therefore automatically be compatible with the estimate b, that is, will not be rejected. The set of all such values, given by the interval between the lower and upper limits of the inequality, is known as the confidence interval for b. The centre of the confidence interval is the estimated b.

If the 5% significance level is adopted, the corresponding confidence interval is known as the 95% confidence interval (similarly, the 1% level gives the 99% interval).

\frac{b - \beta}{SE(b)} > t_{crit} \quad \text{or} \quad \frac{b - \beta}{SE(b)} < -t_{crit}

b - SE(b) \cdot t_{crit} \le \beta \le b + SE(b) \cdot t_{crit}


Since the critical value of the t-distribution is greater for the 1% level than for the 5% level, for any given number of degrees of freedom, it follows that the 99% interval is wider than the 95% interval and encompasses all the hypothetical values of beta in the 95% confidence interval plus some more on either side.

Example: b = 1, se(b) = 0.4, 95% confidence interval, t_critical = ±1.96:

1-0.4*1.96 ≤ beta ≤ 1+0.4*1.96

95% confidence interval:

0.216 ≤ beta ≤ 1.784

Thus, all values between 0.216 and 1.784 are theoretically possible and would not be rejected. They are compatible with the estimated b = 1, which is the central value of the confidence interval.
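A minimal sketch of this computation, using the slide's numbers (b = 1, SE(b) = 0.4, t_crit = 1.96):

```python
# 95% confidence interval: b - SE(b)*t_crit <= beta <= b + SE(b)*t_crit
b, se_b, t_crit = 1.0, 0.4, 1.96
lower = b - t_crit * se_b
upper = b + t_crit * se_b
print(round(lower, 3), round(upper, 3))  # 0.216 1.784
```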


Interpretation of regression results:

reg y x

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =   89.78
       Model |  1248.96129     1  1248.96129           Prob > F      =  0.0000
    Residual |   1363.2539    98  13.9107541           R-squared     =  0.4781
-------------+------------------------------           Adj R-squared =  0.4728
       Total |  2612.21519    99   26.386012           Root MSE      =  3.7297

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.941914   .2049419     9.48   0.000     1.535213    2.348614
       _cons |   .8609647   .4127188     2.09   0.040     .0419377    1.679992
------------------------------------------------------------------------------

Degrees of freedom: number of observations minus number of estimated parameters, in this case alpha and beta: 100 − 2 = 98. If we had 2 explanatory variables the number of degrees of freedom would decrease to 97; with 3, to 96; etc.

The concept of DoF implies that you cannot have more explanatory variables than observations!
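The degrees-of-freedom arithmetic and the t-statistic of a bivariate fit can be mimicked in pure Python. A minimal sketch (simulated data; the true slope of 2 and the noise are made up for illustration, not the slide's data):

```python
import math
import random

random.seed(1)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 + 2.0 * xi + random.gauss(0, 1) for xi in x]  # true beta = 2

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                       # OLS slope
b0 = ybar - b1 * xbar                # OLS intercept
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

df = n - 2                           # n minus 2 estimated parameters -> 98
sigma2 = sum(e ** 2 for e in resid) / df
se_b1 = math.sqrt(sigma2 / sxx)
t_stat = b1 / se_b1
print(df)                            # 98
```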


Definitions

Total Sum of Squares (SST):

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2

Explained (Estimation) Sum of Squares (SSE):

SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2

Residual Sum of Squares or Sum of Squares Residuals (SSR):

SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2


Goodness of Fit

How well does the explanatory variable explain the dependent variable?

How well does the regression line fit the data?

The R-squared (coefficient of determination) measures how much of the variation of the dependent variable can be explained by the explanatory variables.

The R² is the ratio of the explained variation to the total variation: it is interpreted as the fraction of the sample variation in y that is explained by x.

Explained variation of y / total variation of y:

R^2 = \frac{SSE}{SST} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{SSR}{SST}
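These definitions can be checked numerically. A minimal pure-Python sketch (simulated data with made-up coefficients) verifying the decomposition SST = SSE + SSR and the equality R² = SSE/SST = 1 − SSR/SST:

```python
import random

random.seed(4)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

# bivariate OLS fit
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (c - ybar) for a, c in zip(x, y)) / \
     sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)
sse = sum((fi - ybar) ** 2 for fi in fitted)
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r2 = sse / sst
print(abs(sst - (sse + ssr)) < 1e-6)       # True: decomposition holds
print(abs(r2 - (1 - ssr / sst)) < 1e-9)    # True: both R² formulas agree
```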


Properties of R²:

• 0 ≤ R² ≤ 1; often the R² is multiplied by 100 to get the percentage of the sample variation in y that is explained by x

• If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case the R² equals 1 or 100%.

• A value of R² that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the y is captured by the variation in the ŷ (which all lie on the regression line)

• R² = (corr(y, ŷ))²

• The R² follows a complex distribution which depends on the explanatory variables

• Adding further explanatory variables leads to an increase in the R²

• The R² can have a reasonable size in spurious regressions if the regressors are non-stationary

• Linear transformations of the regression model change the value of the R² coefficient

• The R² is not bounded between 0 and 1 in models without an intercept


Properties of an Estimator

1. Finite Sample Properties

There is often more than one possible estimator of the relationship between x and y (e.g. OLS or Maximum Likelihood).

How do we choose between two estimators? The two most commonly used selection criteria are bias and efficiency.

Bias and efficiency are finite sample properties, because they describe how an estimator behaves when we only have a finite sample (even though the sample might be large).

In comparison, so-called “asymptotic properties” of an estimator have to do with the behaviour of estimators as the sample size grows without bound.

Since we always deal with finite samples and it is hard to say whether asymptotic properties translate to finite samples, examining the behaviour of estimators in finite samples seems to be more important.


Unbiasedness

Unbiasedness: the estimated coefficient is on average true:

That is: in repeated samples of size n, the mean outcome of the estimate equals the true, but unknown, value of the parameter to be estimated.

If an estimator is unbiased, then its probability distribution has an expected value equal to the parameter it is supposed to be estimating. Unbiasedness does not mean that the estimate we get with any particular sample is equal to the true parameter, or even close. Rather, the mean of all estimates from infinitely many random samples equals the true parameter.

E(\hat{\beta}) - \beta = 0, \quad \text{i.e. } E(\hat{\beta}) = \beta

[Figure: simulated density of the estimates b1 from repeated samples; the distribution is centred on the true parameter value.]
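A minimal simulation sketch of unbiasedness (the sample mean as an estimator of a hypothetical population mean of 1; the sample size and number of replications are illustrative):

```python
import random

random.seed(42)
true_mu = 1.0
sample_means = []
for _ in range(2000):                 # 2000 repeated samples of size 30
    sample = [random.gauss(true_mu, 1) for _ in range(30)]
    sample_means.append(sum(sample) / len(sample))

# the mean of the estimates is, up to simulation noise, the true parameter
avg_estimate = sum(sample_means) / len(sample_means)
print(round(avg_estimate, 2))
```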


Sampling Variance of an estimator

Efficiency is a relative measure between two estimators: it compares the sampling variance of an estimator, V(beta).

Let \hat{\beta} and \tilde{\beta} be two unbiased estimators of the true parameter \beta, with variances V(\hat{\beta}) and V(\tilde{\beta}). Then \hat{\beta} is said to be relatively more efficient than \tilde{\beta} if V(\hat{\beta}) is smaller than V(\tilde{\beta}).

The property of relative efficiency only helps us to rank two unbiased estimators.

V(\hat{\beta}) < V(\tilde{\beta})

[Figure: simulated sampling densities of two unbiased estimators; b1 is spread over roughly −20 to 20, b2 only over −4 to 4, so b2 is the relatively more efficient estimator.]
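A sketch of relative efficiency, comparing two unbiased estimators of the centre of a normal distribution (sample mean vs. sample median; for normal data the mean is the more efficient one — the choice of estimators and sample sizes here is illustrative):

```python
import random
import statistics

random.seed(0)
means, medians = [], []
for _ in range(3000):                      # repeated samples of size 51
    s = [random.gauss(0, 1) for _ in range(51)]
    means.append(statistics.fmean(s))      # estimator 1: sample mean
    medians.append(statistics.median(s))   # estimator 2: sample median

v_mean = statistics.pvariance(means)
v_median = statistics.pvariance(medians)
print(v_mean < v_median)  # True: the mean is relatively more efficient
```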


Trade-off between Bias and Efficiency

With real-world data and the related problems we sometimes have only the choice between a biased but efficient and an unbiased but inefficient estimator. Then another criterion can be used to choose between the two estimators: the root mean squared error (RMSE). The RMSE is a combination of bias and efficiency and gives us a measure of the overall performance of an estimator.

RMSE:

RMSE(\hat{\beta}) = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(\hat{\beta}_k - \beta_{true}\right)^2}

where K measures the number of experiments, trials or simulations. Equivalently, in terms of the mean squared error:

MSE(\hat{\beta}) = E\left[\left(\hat{\beta} - \beta_{true}\right)^2\right]

MSE(\hat{\beta}) = Var(\hat{\beta}) + Bias\left(\hat{\beta}, \beta_{true}\right)^2

[Figure: simulated sampling densities of the coefficient of z3 for two estimators, xtfevd and fixed effects.]
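The MSE decomposition can be checked numerically. A sketch with a deliberately biased hypothetical estimator (bias 0.3, noise SD 0.5; all numbers are made up for illustration):

```python
import math
import random
import statistics

random.seed(3)
beta_true = 2.0
# K = 5000 simulated estimates from a biased but fairly precise estimator
estimates = [beta_true + 0.3 + random.gauss(0, 0.5) for _ in range(5000)]

K = len(estimates)
rmse = math.sqrt(sum((b - beta_true) ** 2 for b in estimates) / K)
bias = statistics.fmean(estimates) - beta_true
variance = statistics.pvariance(estimates)

# MSE = Var + Bias^2 is an exact identity for the simulated draws
print(abs(rmse ** 2 - (variance + bias ** 2)) < 1e-9)  # True
```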


Asymptotic Properties of Estimators

We can rule out certain silly estimators by studying the asymptotic or large-sample properties of estimators.

We can say something about estimators that are biased and whose variances are not easily found.

Asymptotic analysis involves approximating the features of the sampling distribution of an estimator.

Consistency: how far is the estimator likely to be from the true parameter as we let the sample size increase indefinitely?

If N → ∞ the estimated beta equals the true beta:

Unlike unbiasedness, consistency requires that the variance of the estimator collapses around the true value as N approaches infinity.

Thus unbiased estimators are not necessarily consistent, but those whose variances shrink to zero as the sample size grows are consistent.

\lim_{n\to\infty}\Pr\left(\left|\hat{\beta}_n - \beta\right| > \varepsilon\right) = 0, \qquad \mathrm{plim}\,\hat{\beta}_n = \beta

\lim_{n\to\infty}E(\hat{\beta}_n) = \beta \quad \text{with} \quad \lim_{n\to\infty}Var(\hat{\beta}_n) = 0
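A sketch of consistency: the simulated sampling variance of the sample mean collapses as n grows (standard-normal data; sample sizes chosen for illustration):

```python
import random
import statistics

random.seed(7)

def var_of_mean(n, reps=1000):
    """Simulated sampling variance of the mean of n standard-normal draws."""
    means = [statistics.fmean(random.gauss(0, 1) for _ in range(n))
             for _ in range(reps)]
    return statistics.pvariance(means)

v10, v100, v1000 = var_of_mean(10), var_of_mean(100), var_of_mean(1000)
print(v10 > v100 > v1000)  # True: the variance shrinks toward zero
```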


Multiple Regressions

• In most cases the dependent variable y is not just a function of a single explanatory variable but a combination of several explanatory variables.

• THUS: a drawback of simple (bivariate) regression: it is impossible to draw ceteris paribus conclusions about how x affects y (omitted variable bias).

• Models with k independent variables:

• Control for omitted variable bias

• But: increases the inefficiency of the regression, since explanatory variables might be collinear.

y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i


Obtaining OLS Estimates in Multiple Regressions

The intercept is the predicted value of y when all explanatory variables equal zero.

The estimated betas have partial effect or ceteris paribus interpretations.

We can obtain the predicted change in y given the changes in each x: when x_2 is held fixed, beta_1 gives the change in y if x_1 changes by one unit.

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_{i2}-\bar{x}_2)^2 \sum_{i=1}^{n}(x_{i1}-\bar{x}_1)(y_i-\bar{y}) - \sum_{i=1}^{n}(x_{i1}-\bar{x}_1)(x_{i2}-\bar{x}_2) \sum_{i=1}^{n}(x_{i2}-\bar{x}_2)(y_i-\bar{y})}{\sum_{i=1}^{n}(x_{i1}-\bar{x}_1)^2 \sum_{i=1}^{n}(x_{i2}-\bar{x}_2)^2 - \left(\sum_{i=1}^{n}(x_{i1}-\bar{x}_1)(x_{i2}-\bar{x}_2)\right)^2}

In matrix notation: \hat{\beta} = (X'X)^{-1}X'y
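The two-regressor slope formula can be implemented directly; solving the two normal equations by Cramer's rule is equivalent to (X'X)^{-1}X'y for this case. A minimal sketch with simulated data (the true coefficients 1, 2 and 3 are made up for illustration):

```python
import random

random.seed(5)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]          # correlated regressors
y = [1.0 + 2.0 * a + 3.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def dev(v):
    """Deviations from the mean."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

d1, d2, dy = dev(x1), dev(x2), dev(y)
s11 = sum(a * a for a in d1)
s22 = sum(b * b for b in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
s1y = sum(a * c for a, c in zip(d1, dy))
s2y = sum(b * c for b, c in zip(d2, dy))

# Cramer's rule on the two normal equations in deviation form
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = (sum(y) - b1 * sum(x1) - b2 * sum(x2)) / n
print(round(b1, 1), round(b2, 1))   # close to the true values 2 and 3
```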


“Holding Other Factors Fixed”

• The power of multiple regression analysis is that it provides a ceteris paribus interpretation even though the data have not been collected in a ceteris paribus fashion.

• Example: multiple regression coefficients tell us what effect an additional year of education has on personal income if we hold social background, intelligence, sex, number of children, marital status and all other factors constant that also influence personal income.


Standard Error and Significance in Multiple Regressions

Var(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{SST_1\,(1 - R_1^2)}, \qquad \hat{\sigma}^2 = \frac{1}{n-k-1}\sum_{i=1}^{n}\hat{u}_i^2

SST_1 = \sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2

R_1^2 \text{ (the R² for the regression of } x_1 \text{ on } x_2\text{):} \quad R_1^2 = \frac{SSE_1}{SST_1}

SE(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{SST_1\,(1 - R_1^2)}}


F – Test: Testing Multiple Linear Restrictions

• A t-test (as significance test) is associated with any OLS coefficient.

• We also want to test multiple hypotheses about the underlying parameters beta_0 … beta_k.

• The F-test tests multiple restrictions, e.g. that all coefficients jointly equal zero:

H0: beta_0 = beta_1 = … = beta_k = 0

Ha: H0 is not true, thus at least one beta differs from zero

• The F-statistic (or F-ratio) is defined as:

F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)} \sim F_{q,\,n-k-1}

• The F-statistic is F-distributed under the Null-Hypothesis.

• F-test for overall significance of a regression, H0: all coefficients are jointly zero. In this case we can also compute the F-statistic by using the R² of the regression:

F = \frac{R^2/k}{(1-R^2)/(n-k-1)}


• SSR_r: Sum of Squared Residuals of the restricted model (constant only)

• SSR_ur: Sum of Squared Residuals of the unrestricted model (all regressors)

• SSR_r can never be smaller than SSR_ur, so F is always non-negative

• k: number of explanatory variables (regressors); n: number of observations; q: number of exclusion restrictions (q of the variables have zero coefficients): q = df_r − df_ur (difference in degrees of freedom between the restricted and unrestricted models; df_r > df_ur)

• The F-test is a one-sided test, since the F-statistic is always non-negative

• We reject the Null at a given significance level if F > F_critical for this significance level.

• If H0 is rejected, then we say that all explanatory variables are jointly statistically significant at the chosen significance level.

• THUS: the F-test only allows us to not reject H0 if the t-tests for all single variables are insignificant too.
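A sketch of the R²-based overall F-statistic, using the numbers from the regression output shown earlier (R² = 0.4781, k = 1, n = 100); it reproduces Stata's F(1, 98) = 89.78:

```python
def f_from_r2(r2, k, n):
    """Overall-significance F statistic computed from R-squared."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

F = f_from_r2(0.4781, 1, 100)
print(round(F, 2))  # 89.78
```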


Goodness of Fit in multiple Regressions:

As with simple bivariate regressions we can define SST, SSE and SSR, and we can calculate the R² in the same way.

BUT: the R² never decreases; it tends to increase with the number of explanatory variables.

THUS, the R² is a poor tool for deciding whether one variable or several variables should be added to a model.

We want to know whether a variable has a nonzero partial effect on y in the population.

Adjusted R²: takes the number of explanatory variables into account, since the R² increases with the number of regressors:

k is the number of explanatory variables and n the number of observations.

R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}
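Sketch, again using the earlier output (R² = 0.4781, n = 100, k = 1); it reproduces Stata's Adj R-squared = 0.4728:

```python
def adj_r2(r2, n, k):
    """Adjusted R-squared: penalizes additional regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adj_r2(0.4781, 100, 1), 4))  # 0.4728
```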


Comparing Coefficients

The size of the slope parameters depends on the scaling of the variables (the scale on which a variable is measured), e.g. population in thousands or in millions, etc.

To be able to compare the size effects of different explanatory variables in a multiple regression we can use standardized coefficients:

Standardized coefficients take the standard deviations of the dependent and explanatory variables into account. They describe how much y changes if x changes by one standard deviation instead of one unit: if x changes by 1 SD, y changes by b_hat SD. This makes the scale of the regressors irrelevant, and we can compare the magnitude of the effects of different explanatory variables (the variable with the largest standardized coefficient is the most important in explaining changes in the dependent variable).

\hat{b}_j = \frac{\hat{\sigma}_{x_j}}{\hat{\sigma}_y}\,\hat{\beta}_j \qquad \text{for } j = 1,\dots,k
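A sketch of the standardization (the example numbers are hypothetical, loosely inspired by the earlier output where sd(y) ≈ 5.14):

```python
def standardized_coef(beta_hat, sd_x, sd_y):
    """Effect of a one-SD change in x_j, expressed in SDs of y."""
    return beta_hat * sd_x / sd_y

# hypothetical example: slope 1.94, sd(x) = 1.8, sd(y) = 5.14
print(round(standardized_coef(1.94, 1.8, 5.14), 2))  # 0.68
```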


Problems in Multiple Regressions:

1. Multicollinearity

• Perfect multicollinearity leads to one of the variables being dropped: if x1 and x2 are perfectly correlated (correlation of 1), the statistical program at hand does the job.

• The higher the correlation, the larger the population variance of the coefficients, the less efficient the estimation, and the higher the probability of erratic point estimates. Multicollinearity can result in numerically unstable estimates of the regression coefficients (small changes in X can result in large changes to the estimated regression coefficients).

• Trade-off between omitted variable bias and inefficiency due to multicollinearity.


Testing for Multicollinearity

Correlation between explanatory variables: pairwise collinearity can be determined by viewing a correlation matrix of the independent variables. However, correlation matrices will not reveal higher-order collinearity.

Variance Inflation Factor (vif): measures the impact of collinearity among the x in a regression model on the precision of estimation. The vif detects higher-order multicollinearity: one or more x is/are close to a linear combination of the other x.

• Variance inflation factors are a scaled version of the multiple correlation coefficient between variable j and the rest of the independent variables. Specifically,

VIF_j = \frac{1}{1 - R_j^2}

where R_j is the multiple correlation coefficient.

• Variance inflation factors are often given as the reciprocal of the above formula. In this case, they are referred to as tolerances.

• If R_j equals zero (i.e., no correlation between X_j and the remaining independent variables), then VIF_j equals 1. This is the minimum value. Neter, Wasserman, and Kutner (1990) recommend looking at the largest VIF value. A value greater than 10 is an indication of potential multicollinearity problems.
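A sketch of the VIF and tolerance computations described above:

```python
def vif(r2_j):
    """Variance inflation factor from the R² of regressing x_j on the other regressors."""
    return 1.0 / (1.0 - r2_j)

def tolerance(r2_j):
    """Tolerance: the reciprocal of the VIF."""
    return 1.0 - r2_j

print(vif(0.0))            # 1.0, the minimum value
print(round(vif(0.9), 1))  # 10.0 -> rule-of-thumb threshold for concern
```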


Possible Solutions

• Reduce the overall error: by including explanatory variables not correlated with other variables but explaining the dependent variable

• Drop variables which are highly multi-collinear

• Increase the variance by increasing the number of observations

• Increase the variance of the explanatory variables

• If variables are conceptually similar, combine them into a single index, e.g. by factor or principal component analysis


2. Omitted Variable Bias

• The effect of omitted variables that ought to be included:

• Suppose the dependent variable y depends on two explanatory variables:

• But you are unaware of the importance of x2 and only include x1

• If x2 is omitted from the regression equation, x1 will have a “double” effect on y (a direct effect and one mimicking x2)

• The mimicking effect depends on the ability of x1 to mimic x2 (the correlation) and on how much x2 would explain y

• Beta1 in the second equation is biased upwards in case x1 and x2 are positively correlated, and downward biased otherwise

• Beta1 is only unbiased if x1 and x2 are not related (corr(x1, x2) = 0)

• However: if a variable is included that is unnecessary, because it does not explain any variation in y, the regression becomes inefficient and the reliability of the point estimates decreases.

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i

y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i
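The direction of the bias can be seen in a small simulation. A sketch (made-up data with corr(x1, x2) > 0 and both true coefficients equal to 1; the short regression's slope then converges to beta1 + beta2 · cov(x1, x2)/var(x1) = 1.8):

```python
import random

random.seed(11)
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + random.gauss(0, 1) for a in x1]   # positively correlated with x1
y = [1.0 * a + 1.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# "short" regression of y on x1 alone (x2 omitted)
xbar = sum(x1) / n
ybar = sum(y) / n
b_short = sum((a - xbar) * (c - ybar) for a, c in zip(x1, y)) / \
          sum((a - xbar) ** 2 for a in x1)
print(round(b_short, 1))   # close to 1.8: upward-biased away from the true 1.0
```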


Testing for Omitted Variables

• Heteroskedasticity of the error term with respect to the observations of a specific independent variable is a good indication of omitted variable bias:

• Plot the error term against all explanatory variables

• Ramsey RESET F-test for omitted variables in the whole model: tests for wrong functional form (if e.g. an interaction term is omitted):

– Regress Y on the X’s and keep the fitted value Y_hat;

– Regress Y on the X’s, and Y_hat² and Y_hat³.

– Test the significance of the fitted value terms using an F-test.

• Szroeter test for monotonic variance of the error term in the explanatory variables

Solutions:

• Include variables that are theoretically important and have a high probability of being correlated with one or more variables in the model and of explaining significant parts of the variance in the DV.

• Fixed unit effects for unobserved unit heterogeneity (time-invariant unmeasurable characteristics of e.g. countries: culture, institutions)


3. Heteroskedasticity

The variance of the error term is not constant across observations but dependent on unobserved effects; not controlling for this problem violates one of the basic assumptions of linear regression and renders the estimation results inefficient.

Possible causes:

• Omitted variables, for example: spending might vary with the economic size of a country, but size is not included in the model.

Test:

• plot the error term against all independent variables

• White test if the form of heteroskedasticity is unknown

• Breusch-Pagan Lagrange Multiplier test if the form is known

Solutions:

• Robust Huber-White sandwich estimator (GLS)

• White heteroskedasticity-consistent VC estimate: manipulates the variance-covariance matrix of the error term.

• More substantially: include omitted variables

• Dummies for groups of individuals or countries that are assumed to behave more similarly than others


Tests for Heteroskedasticity:

a. Breusch-Pagan LM test for known (groupwise) form of heteroskedasticity:

LM = \frac{T}{2}\sum_{i=1}^{n}\left(\frac{s_i^2}{s^2} - 1\right)^2

s_i^2 = sum of group-specific squared residuals; s^2 = overall variance of the OLS residuals

H0: homoskedasticity; under H0, LM ~ Chi² with n − 1 degrees of freedom.

The LM test assumes normality of the residuals and is not appropriate if this assumption is not met.

b. Likelihood Ratio Statistic

-2\ln\lambda = NT\,\ln\hat{\sigma}^2 - T\sum_{i=1}^{n}\ln\hat{\sigma}_i^2 \;\sim\; \chi^2(n-1)

Residuals are computed using MLE (e.g. iterated FGLS; with OLS there is a loss of power).


c. White test if the form of heteroskedasticity is unknown:

• H0: V(\varepsilon_i \mid x_i) = \sigma^2

• Ha: V(\varepsilon_i \mid x_i) = \sigma_i^2

1. Estimate the model under H0

2. Compute the squared residuals e_i^2

3. Use the squared residuals as dependent variable of an auxiliary regression; RHS: all regressors, their quadratic forms and their interaction terms:

e_i^2 = \delta_0 + \delta_1 x_{i1} + \dots + \delta_k x_{ik} + \delta_{k+1} x_{i1}^2 + \dots + \delta_q x_{i,k-1} x_{ik} + \nu_i

4. Compute the White statistic from the R² of the auxiliary regression: n \cdot R^2 \overset{a}{\sim} \chi^2_q

5. Use a one-sided test and check whether n·R² is larger than the 95% quantile of the \chi^2 distribution


Robust White Heteroskedasticity-Consistent Variance-Covariance Estimator:

Normal variance of beta:

V(\hat{\beta}) = \sigma^2 (X'X)^{-1}

Robust White VC matrix:

\hat{V}(\hat{\beta}) = (X'X)^{-1}\,X'\hat{D}X\,(X'X)^{-1}, \qquad \hat{D} = \mathrm{diag}(\hat{e}_i^2)

D is an n×n matrix with off-diagonals = 0 and the squared residuals on the diagonal.

The normal variance-covariance matrix is weighted by the non-constant error variance.

Robust standard errors therefore tend to be larger.


Generalized Least Squares Approaches

• The structure of the variance-covariance matrix Omega is used not just to adjust the standard errors but also the estimated coefficients.

• GLS can be an econometric solution to many violations of the G-M conditions (autocorrelation, heteroskedasticity, spatial correlation, …), since the Omega matrix can be flexibly specified.

• Since the Omega matrix is not known, it has to be estimated, and GLS becomes FGLS (Feasible Generalized Least Squares).

• All FGLS approaches are problematic if the number of observations is limited: they are very inefficient, since the Omega matrix has to be estimated.

Beta:

\hat{\beta}_{GLS} = \left(\sum_{i=1}^{N} X_i'\Omega_i^{-1}X_i\right)^{-1}\sum_{i=1}^{N} X_i'\Omega_i^{-1}y_i

FGLS replaces \Omega by its estimate:

\hat{\beta}_{FGLS} = \left(\sum_{i=1}^{N} X_i'\hat{\Omega}_i^{-1}X_i\right)^{-1}\sum_{i=1}^{N} X_i'\hat{\Omega}_i^{-1}y_i

Estimated covariance matrix:

\hat{V}(\hat{\beta}) = \left(X'\hat{\Omega}^{-1}X\right)^{-1}


Omega matrix with heteroskedastic error structure and contemporaneously correlated errors; in principle FGLS can handle all different correlation structures:

\Omega = \begin{pmatrix} \sigma_1^2 & \sigma_{21} & \sigma_{31} & \cdots & \sigma_{n1} \\ \sigma_{12} & \sigma_2^2 & \sigma_{32} & \cdots & \sigma_{n2} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 & \cdots & \sigma_{n3} \\ \vdots & & & \ddots & \vdots \\ \sigma_{1n} & \sigma_{2n} & \sigma_{3n} & \cdots & \sigma_n^2 \end{pmatrix}


4. Autocorrelation

The observation of the residual in t1 is dependent on the observation in t0: not controlling for autocorrelation violates one of the basic assumptions of OLS and may bias the estimation of the beta coefficients.

Options:

• lagged dependent variable

• differencing the dependent variable

• differencing all variables

• Prais-Winsten transformation of the data

• HAC consistent VC matrix

Tests:

• Durbin-Watson, Durbin’s m, Breusch-Godfrey test

• Regress e on lag(e)

\varepsilon_{it} = \rho\,\varepsilon_{i,t-1} + \nu_{it}


Autocorrelation

The error term in t1 is dependent on the error term in t0: not controlling for autocorrelation violates one of the basic assumptions of OLS and may bias the estimation of the beta coefficients.

The residual of a regression model picks up the influences of those variables affecting the DV that have not been included in the regression equation. Thus, persistence in excluded variables is the most frequent cause of autocorrelation.

Autocorrelation makes no prediction about a trend, though a trend in the DV is often a sign of serial correlation.

Positive autocorrelation: rho is positive: it is more likely that a positive value of the error term is followed by a positive one and a negative by a negative one.

Negative autocorrelation: rho is negative: it is more likely that a positive value of the error term is followed by a negative one and vice versa.

\varepsilon_{it} = \rho\,\varepsilon_{i,t-1} + \nu_{it}


DW test for first order AC:

• Regression must have an intercept

• Explanatory variables have to be deterministic

• Inclusion of LDV biases statistic towards 2

The efficiency problem of serial correlation can be fixed by the Newey-West HAC consistent VC matrix for heteroskedasticity of unknown form and AC of order p. Problem: the VC matrix is consistent, but the coefficients can still be biased! (HAC is possible with “ivreg2” in Stata.)

d = \frac{\sum_{t=2}^{T}(e_t - e_{t-1})^2}{\sum_{t=1}^{T}e_t^2}

\hat{V}_{NW} = T\,(X'X)^{-1}\,\hat{S}^{*}\,(X'X)^{-1}

\hat{S}^{*} = \frac{1}{T}\sum_{t=1}^{T}e_t^2\,x_t x_t' + \frac{1}{T}\sum_{l=1}^{p}\left(1 - \frac{l}{p+1}\right)\sum_{t=l+1}^{T}e_t e_{t-l}\left(x_t x_{t-l}' + x_{t-l}x_t'\right)

where p is the chosen maximum lag.
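The d statistic is easy to compute directly. A sketch comparing white-noise residuals (d near 2) with strongly autocorrelated AR(1) residuals (for rho = 0.8, d is near 2(1 − rho) = 0.4; the simulated series are made up for illustration):

```python
import random

def durbin_watson(e):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 mean no first-order AC."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(et ** 2 for et in e)

random.seed(2)
white = [random.gauss(0, 1) for _ in range(500)]      # independent errors
ar = [0.0]
for _ in range(499):                                  # AR(1) errors, rho = 0.8
    ar.append(0.8 * ar[-1] + random.gauss(0, 1))

print(round(durbin_watson(white), 1))   # near 2
print(round(durbin_watson(ar), 1))      # well below 2: positive autocorrelation
```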


OR a simpler Test:

• Estimate the model by OLS

• compute the residuals

• Regress the residuals on all independent variables (including the LDV if present) and the lagged residuals

• If the coefficient on the lagged residual is significant (with the usual t-test), we can reject the null of independent errors.


Lagged Dependent Variable

• The interpretation of the LDV as a measure of time-persistency is misleading.

• The LDV captures an average dynamic effect; this can be shown by Cochrane-Orcutt distributed lag models. Thus the LDV assumes that all x-variables have a one-period lagged effect on y.

Make sure the interpretation is correct by calculating the real effect of the x-variables.

• Is an insignificant coefficient really insignificant if the coefficient of lagged y is highly significant?

y_{it} = \beta_0 + \gamma\,y_{i,t-1} + \beta_k x_{it} + \varepsilon_{it}

Rearranging for the implied long-run relationship (setting y_{it} = y_{i,t-1}):

y_{it}(1-\gamma) = \beta_0 + \beta_k x_{it} + \varepsilon_{it}, \qquad y_{it} = \frac{\beta_0 + \beta_k x_{it} + \varepsilon_{it}}{1-\gamma}


First Difference models

• Differencing only the dependent variable: only if theory predicts effects of levels on changes

• The FD estimator assumes that the coefficient of the LDV is exactly 1, which is often not true

• Theory predicts effects of changes on changes

• Suggested remedy if the time series is non-stationary (has a single unit root); asymptotic analysis for T → ∞.

• Consistent

y_{it} - y_{i,t-1} = \sum_{k=1}^{K}\beta_k\left(x_{k,it} - x_{k,i,t-1}\right) + \left(\varepsilon_{it} - \varepsilon_{i,t-1}\right)

\Delta y_{it} = \sum_{k=1}^{K}\beta_k\,\Delta x_{k,it} + \Delta\varepsilon_{it}


Prais-Winsten Transformation

• Models the serial correlation in the error term; regression results for the X variables are more straightforwardly interpretable:

y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}, \quad \text{with} \quad \varepsilon_{it} = \rho\,\varepsilon_{i,t-1} + \nu_{it}

The \nu_{it} are iid, with \nu_{it} \sim N(0, \sigma_\nu^2).

• The VC matrix of the error term is

\Omega = \frac{\sigma^2}{1-\rho^2}\begin{pmatrix}1 & \rho & \rho^2 & \cdots & \rho^{T-1}\\ \rho & 1 & \rho & \cdots & \rho^{T-2}\\ \rho^2 & \rho & 1 & \cdots & \rho^{T-3}\\ \vdots & & & \ddots & \vdots\\ \rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1\end{pmatrix}

• The matrix is stacked for N units. Diagonals are 1.

• Prais-Winsten is estimated by GLS. It is derived from the AR(1) model for the error term. The first observation is preserved.


1. Estimation of a standard linear regression:

y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}

2. An estimate of the correlation in the residuals is then obtained by the following auxiliary regression:

\hat{\varepsilon}_{it} = \rho\,\hat{\varepsilon}_{i,t-1} + \nu_{it}

3. A Cochrane-Orcutt transformation is applied for observations t = 2, …, n:

y_{it} - \hat{\rho}\,y_{i,t-1} = \alpha(1-\hat{\rho}) + (x_{it} - \hat{\rho}\,x_{i,t-1})\beta + \nu_{it}

4. And the transformation for t = 1 is as follows:

\sqrt{1-\hat{\rho}^2}\,y_{1} = \alpha\sqrt{1-\hat{\rho}^2} + \sqrt{1-\hat{\rho}^2}\,x_{1}\beta + \sqrt{1-\hat{\rho}^2}\,\varepsilon_{1}

5. Iterating to convergence: the whole process is repeated until the change in the estimate of rho is within a specified tolerance; the new estimates are used to produce fitted values for y, and rho is re-estimated by regressing the residuals on their lag:

y_{it} - \hat{y}_{it} = \rho\,(y_{i,t-1} - \hat{y}_{i,t-1}) + \nu_{it}


Distributed Lag Models

• Simplest form is Cochrane-Orcutt – the dynamic structure of all independent variables is captured by 1 parameter, either in the error term or as LDV

• If dynamics are that easy – LDV or Prais-Winsten is fine – saves degrees of freedom

• Problem: if theory predicts different lags for different right-hand-side variables – then a mis-specified model necessarily leads to bias

• Test down – start with a relatively large number of lags for potential candidates:

$$y_{it} = \alpha + \beta_1 x_{it} + \beta_2 x_{i,t-1} + \beta_3 x_{i,t-2} + \cdots + \beta_{n+1} x_{i,t-n} + \varepsilon_{it}, \qquad n = 1, \ldots, t-1$$
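Building the regressor matrix for such a test-down specification is mostly bookkeeping; a sketch (illustrative numpy helper, names invented):

```python
import numpy as np

def lag_matrix(x, n_lags):
    """Rows are t = n_lags .. T-1; columns are [x_t, x_{t-1}, ..., x_{t-n_lags}].

    The first n_lags observations are lost, which is why testing down
    from many lags costs degrees of freedom."""
    T = len(x)
    return np.column_stack([x[n_lags - k : T - k] for k in range(n_lags + 1)])

M = lag_matrix(np.arange(6.0), 2)   # 4 usable rows: columns x_t, x_{t-1}, x_{t-2}
```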


Specification Issues in Multiple Regressions: 1. Non-Linearity

One or more explanatory variables have a non-linear effect on the dependent variable: estimating a linear model would lead to wrong and/or insignificant results. Thus, even though a relationship between an explanatory variable and the dependent variable exists in the population, this relationship cannot be detected due to the strict linearity assumption of OLS.

Test:

• The Ramsey RESET F-test gives a first indication for the whole model

• In general, we can use acprplot to verify the linearity assumption against an explanatory variable – though this is just "eye-balling"

• Theoretical expectations should guide the inclusion of squared terms.


[Figure: augmented component-plus-residual plots (acprplot). Left panel: against institutional openness to trade, standardized (0–1). Right panel: against level of democracy (0–10).]


Solutions

Handy solutions without leaving the linear regression framework:

• Logarithmize the IV and DV: this gives you the elasticity, and higher values are weighted less (Engel curve – income elasticity of demand). This model is called a log-log model or a log-linear model.

– Different functional forms give parameter estimates that have different substantive interpretations. The parameters of the linear model have an interpretation as marginal effects; their implied elasticities will vary depending on the data. In contrast, the parameters of the log-log model have an interpretation as elasticities, so the log-log model assumes a constant elasticity over all values of the data set. Therefore the coefficients of a log-linear model can be interpreted as percentage changes – if the explanatory variable changes by one percent, the dependent variable changes by beta percent.

– The log transformation is only applicable when all the observations in the data set are positive. This can be guaranteed by using a transformation like log(X+k), where k is a positive scalar chosen to ensure positive values. However, careful thought has to be given to the interpretation of the parameter estimates.

– For a given data set there may be no particular reason to assume that one functional form is better than the other. A model selection approach is to estimate competing models by OLS and choose the model with the highest R-squared.

• Include an additional squared term of the IV to test for U-shaped and inverse U-shaped relationships. Careful with the interpretation! The size of the two coefficients (linear and squared) determines whether there is indeed a U-shaped or inverse U-shaped relationship.

$$\log y_i = \log\alpha + \beta \log x_i + \log\varepsilon_i$$

$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$$
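A quick simulation of the constant-elasticity point (illustrative Python/numpy; all numbers are invented): data generated with elasticity 0.8 are recovered by the slope of the log-log regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n, elasticity = 500, 0.8

# Constant-elasticity DGP: y = 3 * x^0.8 * exp(noise)
x = rng.uniform(1, 10, size=n)
y = 3.0 * x ** elasticity * np.exp(rng.normal(0, 0.05, size=n))

# OLS of log(y) on log(x): the slope is the elasticity, the intercept log(3)
X = np.column_stack([np.ones(n), np.log(x)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
```

A one-percent increase in x is then associated with roughly a `coef[1]`-percent increase in y, at every point of the data.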


Hausken, Martin, Plümper 2004: Government Spending and Taxation inDemocracies and Autocracies, Constitutional Political Economy 15, 239-59.


polity_sqr   polity   govcon
0            0        20.049292
0.25         0.5      18.987946
1            1        18.024796
2.25         1.5      17.159842
4            2        16.393084
6.25         2.5      15.724521
9            3        15.154153
12.25        3.5      14.681982
16           4        14.308005
20.25        4.5      14.032225
25           5        13.85464
30.25        5.5      13.775251
36           6        13.794057
42.25        6.5      13.911059
49           7        14.126257
56.25        7.5      14.43965
64           8        14.851239
72.25        8.5      15.361024
81           9        15.969004
90.25        9.5      16.67518
100          10       17.479551
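Fitting a quadratic to the govcon column locates the turning point at −β₁/(2β₂) (illustrative Python/numpy; the data are the fitted values from the table above):

```python
import numpy as np

polity = np.arange(0, 10.5, 0.5)
govcon = np.array([20.049292, 18.987946, 18.024796, 17.159842, 16.393084,
                   15.724521, 15.154153, 14.681982, 14.308005, 14.032225,
                   13.85464, 13.775251, 13.794057, 13.911059, 14.126257,
                   14.43965, 14.851239, 15.361024, 15.969004, 16.67518,
                   17.479551])

# Fit govcon = c + b * polity + a * polity^2; the extremum is at -b / (2a)
a, b, c = np.polyfit(polity, govcon, 2)
turning_point = -b / (2 * a)
# a > 0 confirms a u-shape: spending is lowest at mid-levels of democracy
```

The minimum falls between polity 5.5 and 6, consistent with the table's lowest govcon values.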


The "u"-shaped relationship between democracy and government spending:

[Figure: government consumption in % of GDP (13–21) against degree of democracy (0–10), showing the u-shaped relationship.]


2. Interaction Effects

Two explanatory variables do not only have a direct effect on the dependent variable but also a combined effect

Interpretation: combined effect: b1*SD(x1) + b2*SD(x2) + b3*SD(x1*x2)

Example: the monetary policy of a currency union has a direct effect on monetary policy in outsider countries, but this effect is increased by import shares.

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \varepsilon_i$$
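With an interaction term, the marginal effect of x1 is not β1 alone but β1 + β3·x2, which is what marginal-effect plots display. A minimal sketch (illustrative Python/numpy; coefficients invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
b1, b2, b3 = 0.5, -0.3, 0.8                 # invented true coefficients

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = b1 * x1 + b2 * x2 + b3 * x1 * x2 + rng.normal(0, 0.1, size=n)

# OLS with the interaction term included
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

def marginal_effect_x1(x2_val):
    """dE[y]/dx1 = b1 + b3 * x2 -- it depends on where x2 is evaluated."""
    return coefs[1] + coefs[3] * x2_val
```

Evaluating `marginal_effect_x1` over a grid of x2 values reproduces the kind of sloped marginal-effect line shown in the plots that follow.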


Example:

[Figure: government spending]


[Figure: government spending in % of GDP (35–50) for low vs. high unemployment, at low vs. high Christian democratic portfolio.]


[Figure: government spending against unemployment (0–20), plotted at trade +1 SD, trade mean, and trade −1 SD.]


[Figure: government spending against trade (0–150), plotted at unem +1 SD, unem mean, and unem −1 SD.]


Interaction Effects of Continuous Variables

[Figure: marginal effect of unemployment on spending as trade openness changes (0–160), with 95% confidence interval. Dependent variable: government spending.]


[Figure: marginal effect of unemployment on government spending against international trade exposure (0–150), with the mean of international trade exposure marked. Thick dashed lines give the 95% confidence interval; the thin dashed line is a kernel density estimate of trade.]


3. Dummy variables

An explanatory variable that takes on only the values 0 and 1

Example: DV: spending, IV: whether a country is a democracy (1) or not (0).

Alpha then gives the expected spending for non-democracies and alpha + beta the expected spending for democracies.

$$y_i = \alpha + \beta D_i + \varepsilon_i$$
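With a single dummy, OLS just reproduces the two group means: α is the mean for D = 0 and α + β the mean for D = 1. A check (illustrative Python/numpy; numbers invented):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
D = rng.integers(0, 2, size=n)                 # e.g. a democracy dummy
y = 10.0 + 3.0 * D + rng.normal(size=n)        # invented intercept and effect

X = np.column_stack([np.ones(n), D])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# OLS with one dummy reproduces the two group means exactly:
# alpha_hat == mean(y | D=0), alpha_hat + beta_hat == mean(y | D=1)
```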


4. Outliers

Problem: The OLS principle implies the minimization of squared residuals. From this it follows that extreme cases can have a strong impact on the regression line. Inclusion/exclusion of extreme cases might change the results significantly.

The slope and intercept of the least squares line are very sensitive to data points which lie far from the true regression line. These points are called outliers, i.e. extreme values of observed variables that can distort estimates of regression coefficients.


Test for Outliers

• Symmetry (symplot) and normality (dotplot) of the dependent variable give a first indication of outlier cases

• Residual-vs.-fitted plots (rvfplot) indicate which observations of the DV are far away from the predicted values

• lvr2plot is the leverage-against-residual-squared plot. The upper left corner of the plot will be points that are high in leverage, and the lower right corner will be points that are high in the absolute value of residuals. The upper right portion will be those points that are both high in leverage and in the absolute value of residuals.

• DFBETA: how much would the coefficient of an explanatory variable change if we omitted one observation?

The statistic that measures how much impact each observation has on a particular coefficient is DFBETA. The DFBETA for an explanatory variable and a particular observation is the difference between the regression coefficient calculated for all of the data and the regression coefficient calculated with the observation deleted, scaled by the standard error calculated with the observation deleted. The cut-off value for DFBETAs is 2/sqrt(n), where n is the number of observations.
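A sketch of the leave-one-out computation behind DFBETA, following the definition above (illustrative Python/numpy; the planted high-leverage outlier is invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
x[0] = 4.0                        # high-leverage point ...
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[0] -= 8.0                       # ... pulled far off the regression line

X = np.column_stack([np.ones(n), x])

def dfbeta_slope(X, y, i):
    """DFBETA for the slope: (b - b_(i)) / se(b_(i)),
    with b_(i) and its standard error computed with observation i deleted."""
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]
    Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
    b_del = np.linalg.lstsq(Xd, yd, rcond=None)[0]
    resid = yd - Xd @ b_del
    s2 = resid @ resid / (len(yd) - Xd.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(Xd.T @ Xd)[1, 1])
    return (b_full[1] - b_del[1]) / se

cutoff = 2 / np.sqrt(n)           # the 2/sqrt(n) rule of thumb
```

The planted observation's DFBETA is far above the cutoff, while typical observations stay well below it.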


[Figure: leverage-versus-residual-squared plot (lvr2plot), leverage (0–0.8) against normalized residual squared (0–0.025); observations labelled by country (Australia, Austria, Belgium, Canada, Denmark, Finland, France, Germany, Ireland, Italy, Japan, Netherlands, Norway, Sweden, Switzerland, UK, US).]


Solutions: Outliers

• Include or exclude obvious outlier cases and check their impact on the regression coefficients.

• Logarithmize the dependent variable and possibly the explanatory variables as well – this reduces the impact of larger values.

jackknife, bootstrap:

• Are both tests and solutions at the same time: they show whether single observations have an impact on the results. If so, one can use the jackknifed and bootstrapped coefficients and standard errors, which are more robust to outliers than normal OLS results.

• Jackknife: takes the original dataset and runs the same regression N times, each time leaving one observation out. Example command in Stata: „jackknife _b _se, eclass: reg spend unem growthpc depratio left cdem trade lowwage fdi skand“

• Bootstrapping is a re-sampling technique: for the specified number of repetitions, the same regression is run for a different sample randomly drawn (with replacement) from the original dataset. Example command: „bootstrap _b _se, reps(1000): reg spend unem growthpc depratio left cdem trade lowwage fdi skand“
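A pair-bootstrap sketch of what such a bootstrap command does under the hood (illustrative Python/numpy; data invented, with heavy-tailed errors to make robustness relevant):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)   # heavy-tailed errors

X = np.column_stack([np.ones(n), x])
b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Pair bootstrap: resample rows with replacement, re-run the regression
reps = 1000
boot = np.empty((reps, 2))
for r in range(reps):
    idx = rng.integers(0, n, size=n)
    boot[r] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

se_boot = boot.std(axis=0, ddof=1)                 # bootstrap standard errors
```

The jackknife is the deterministic analogue: loop over i = 0..n−1, refit with row i deleted, and use the spread of the n coefficient vectors.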


Differences-in-Differences Models

• Experiment

– Rats’ weight depends on drug

– Effectiveness of medication etc.

• OLS regression

$$\text{Weight} = \beta_0 + \beta_1\,\text{Treatment} + \beta_2\,\text{Post} + \beta_{12}\,(\text{Treatment} \times \text{Post}) + \varepsilon$$

$$\text{Outcome} = \beta_0 + \beta_1\,\text{Treatment} + \beta_2\,\text{Post} + \beta_{12}\,(\text{Treatment} \times \text{Post}) + \varepsilon$$


Interpretation of DD Models

              Control     Treatment              Difference
Pre-Period    β0          β0 + β1                β1
Post-Period   β0 + β2     β0 + β1 + β2 + β12     β1 + β12
Difference    β2          β2 + β12               β12
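The DD algebra can be verified numerically: in the saturated regression the interaction coefficient equals the difference-in-differences of the four cell means (illustrative Python/numpy; coefficients invented):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
treat = rng.integers(0, 2, size=n)
post = rng.integers(0, 2, size=n)
b0, b1, b2, b12 = 5.0, 1.0, 0.5, 2.0               # invented true parameters
y = b0 + b1 * treat + b2 * post + b12 * treat * post + rng.normal(0, 0.1, size=n)

# Saturated OLS with the Treatment x Post interaction
X = np.column_stack([np.ones(n), treat, post, treat * post])
coefs = np.linalg.lstsq(X, y, rcond=None)[0]

def cell_mean(t, p):
    return y[(treat == t) & (post == p)].mean()

# Difference-in-differences of cell means, as in the table above
dd = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
```

Because the model is saturated, `coefs[3]` and `dd` agree to floating-point precision.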