Class 4
Ordinary Least Squares
CERAM February-March-April 2008
Lionel Nesta
Observatoire Français des Conjonctures Economiques
Introduction to Regression
Ideally, the social scientist is interested not only in knowing the intensity of a relationship, but also in quantifying the magnitude of the variation of one variable associated with a one-unit variation of another variable.
Regression analysis is a technique that examines the relation of a dependent variable to independent, or explanatory, variables.
Simple regression: y = f(X)
Multiple regression: y = f(X, Z)
Let us start with simple regressions.
[Figure: Scatter plot of fertilizer and production, annotated with the prediction $\hat{Y}_i$ and the error $Y_i - \hat{Y}_i$]
Objective of Regression
It is time to ask: “What is a good fit?”
“A good fit is what makes the error small”
“The best fit is what makes the error smallest”
Three candidates
1. To minimize the sum of all errors
2. To minimize the sum of absolute values of errors
3. To minimize the sum of squared errors
To minimize the sum of all errors
$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
[Figure: X-Y plots marking positive (+) and negative (–) errors around candidate lines]
Problem of sign
To minimize the sum of absolute values of errors
$$\min \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
[Figure: X-Y plot with errors –1, –1 and +2]
Problem of middle point
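To minimize the sum of squared errors
$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
[Figure: X-Y plot marking positive (+) and negative (–) errors]
Solves both problems
$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} \varepsilon_i^2$$
[Figure: ε² plotted against ε]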
Overcomes the sign problem
Goes through the middle point
Squaring emphasizes large errors
Easily manageable
Has a unique minimum
Has a unique, and best, solution
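The sign problem can be checked numerically. The sketch below uses made-up data (the values of x and y are hypothetical, not taken from the course data) and evaluates the three candidate criteria for two lines that both pass through the point of means: a sensible upward-sloping line and an absurd downward-sloping one.

```python
# Hypothetical toy data: points lying roughly on an upward-sloping line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]


def criteria(alpha, beta):
    """Return (sum of errors, sum of absolute errors, sum of squared errors)."""
    errors = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    return (
        sum(errors),
        sum(abs(e) for e in errors),
        sum(e * e for e in errors),
    )


x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Both lines go through (x_bar, y_bar); only the slope differs.
good = criteria(alpha=y_bar - 2.0 * x_bar, beta=2.0)
bad = criteria(alpha=y_bar + 2.0 * x_bar, beta=-2.0)

print("sensible line (errors, |errors|, errors^2):", good)
print("absurd line   (errors, |errors|, errors^2):", bad)
```

Both lines have a sum of errors that is zero up to rounding, so the first criterion cannot tell them apart, whereas the absolute and squared criteria clearly penalize the second line.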
[Figure: Scatter plot of fertilizer and production]
[Figure: Scatter plot of R&D and patents (log)]
The Simple Regression Model
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
$$E(y_i) = \alpha + \beta x_i$$
$y_i$: Dependent variable (to be explained)
$x_i$: Independent variable (explanatory)
$\alpha$: First parameter of interest
$\beta$: Second parameter of interest
$\varepsilon_i$: Error term
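The Simple Regression Model
$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$
$\hat{\alpha}$ and $\hat{\beta}$ are estimates of the true, but unknown, $\alpha$ and $\beta$.
$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
[Figure: ε² plotted against ε]
$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2$$
$$\frac{\partial \sum_{i=1}^{n} \varepsilon_i^2}{\partial \hat{\alpha}} = 0 \qquad \frac{\partial \sum_{i=1}^{n} \varepsilon_i^2}{\partial \hat{\beta}} = 0$$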
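As a sketch of the intermediate step, the two first-order conditions of the least-squares problem can be worked out as follows. The symbol S for the sum of squared errors is a notation introduced here, not one used in the slides; the last equality uses the fact that deviations from a mean sum to zero.

```latex
\begin{align*}
S(\hat{\alpha}, \hat{\beta}) &= \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 \\
\frac{\partial S}{\partial \hat{\alpha}} &= -2 \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0
  \;\Longrightarrow\; \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \\
\frac{\partial S}{\partial \hat{\beta}} &= -2 \sum_{i=1}^{n} x_i (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - \hat{\beta} (x_i - \bar{x}) \right] = 0 \\
\hat{\beta} &= \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```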
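To minimize the sum of squared errors
$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
[Figure: ε² plotted against ε]
$$\hat{\beta} = \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2}$$
$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$$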
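The least-squares solution is easy to compute by hand: the slope is the sum of cross-products of deviations from the means divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. A minimal sketch in plain Python, assuming hypothetical data (the function name ols_simple and the values of x and y are illustrative only):

```python
def ols_simple(x, y):
    """Return (alpha_hat, beta_hat) for the simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: cross-products of deviations from the means.
    numerator = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    # Denominator: squared deviations of x from its mean.
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = numerator / denominator
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical toy data, not the CERAM_BIO data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

alpha_hat, beta_hat = ols_simple(x, y)
print("alpha_hat =", round(alpha_hat, 3), "beta_hat =", round(beta_hat, 3))
```

The numerator and denominator are the same two column sums that a spreadsheet computation of the slope relies on.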
Application to CERAM_BIO Data using Excel

                             Deviation to the mean
lnpat_assets   lnrd_assets   of y      of x      Numerator Beta_Hat   Denominator Beta_Hat
-12.77         -2.28         -0.61      0.01     -0.01                0.00
-12.51         -2.24         -0.35      0.05     -0.02                0.00
-12.74         -2.20         -0.58      0.09     -0.05                0.01
-12.52         -2.31         -0.36     -0.02      0.01                0.00
-12.12         -2.25          0.04      0.04      0.00                0.00
-12.53         -2.26         -0.37      0.03     -0.01                0.00
-12.09         -2.25          0.07      0.04      0.00                0.00
...

Mean of y      Mean of x                         Sum                  Sum
-12.16         -2.29                             448.75               256.55

Alpha_hat  -8.148
Beta_hat    1.749
$$\ln\left(\frac{\text{Patent}}{\text{Assets}}\right)_i = -8.148 + 1.748 \, \ln\left(\frac{\text{R\&D}}{\text{Assets}}\right)_i$$
Interpretation
$$\ln\left(\frac{\text{Patent}}{\text{Assets}}\right)_i = -8.148 + 1.748 \, \ln\left(\frac{\text{R\&D}}{\text{Assets}}\right)_i$$
When the log of R&D (per asset) increases by one unit, the log of patents per asset increases by 1.748.
Remember! A change in the log of x is a relative change in x itself.
A 1% increase in R&D (per asset) is associated with an increase of about 1.748% in the number of patents (per asset).
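The percentage reading of a log-log slope is an approximation that works well for small changes. The sketch below takes the slope of 1.748 from the estimated equation and compares the exact implied change in y with the rule of thumb "slope times percentage change in x"; the function name pct_change_in_y is illustrative.

```python
BETA_HAT = 1.748  # slope of the log-log regression reported in the slides


def pct_change_in_y(pct_change_in_x, beta=BETA_HAT):
    """Exact percentage change in y implied by a log-log slope beta."""
    growth_factor_x = 1.0 + pct_change_in_x / 100.0
    return (growth_factor_x ** beta - 1.0) * 100.0


for pct_x in (1.0, 10.0):
    exact = pct_change_in_y(pct_x)
    rule_of_thumb = BETA_HAT * pct_x
    print(f"x +{pct_x:.0f}%: exact {exact:.3f}%, rule of thumb {rule_of_thumb:.3f}%")
```

For a 1% change the two readings nearly coincide; for larger changes the rule of thumb drifts away from the exact value.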
Application to Data using SPSS
Analyze > Regression > Linear

Coefficients (a)
               Unstandardized coefficients   Standardized coefficients
Model 1        B          Std. error         Beta                        t         Sig.
(Constant)     -8.151     .244                                           -33.392   .000
lnrd_assets     1.748     .101               .642                         17.323   .000
a. Dependent variable: lnpat_assets
Assessing the Goodness of Fit
It is important to ask whether a specification provides a good prediction of the dependent variable, given values of the independent variable.
Ideally, we want an indicator of the proportion of variance of the dependent variable that is accounted for, or explained, by the statistical model.
This indicator is built from the variance of the predictions (ŷ) and the variance of the residuals (ε), since by construction the two sum to the overall variance of the dependent variable (y).
[Figure: Overall variance]
[Figures: Decomposing the overall variance (1) and (2)]
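Coefficient of determination R²
R² is a statistic which provides information on the goodness of fit of the model.
$$SS_{tot} = \sum_i (y_i - \bar{y})^2$$
$$SS_{fit} = \sum_i (\hat{y}_i - \bar{y})^2 \qquad SS_{tot} = SS_{fit} + SS_{res}$$
$$SS_{res} = \sum_i (y_i - \hat{y}_i)^2$$
$$R^2 = \frac{SS_{fit}}{SS_{tot}}$$
$$0 \le R^2 \le 1$$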
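The decomposition of the total sum of squares into a fitted part and a residual part can be verified on a small example. The sketch below assumes hypothetical data (not the CERAM_BIO data), fits the least-squares line, and computes the three sums of squares and R².

```python
# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates of the slope and the intercept.
beta_hat = (
    sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
alpha_hat = y_bar - beta_hat * x_bar
y_hat = [alpha_hat + beta_hat * xi for xi in x]

# Total, fitted and residual sums of squares.
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
ss_fit = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = ss_fit / ss_tot
print("SS_tot =", ss_tot, "SS_fit + SS_res =", ss_fit + ss_res)
print("R^2 =", r_squared)
```

The identity between the total and the sum of the two parts holds only for the least-squares line with an intercept; for an arbitrary line the two sides generally differ.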
Fisher’s F Statistic
Fisher’s statistic is relevant as a form of ANOVA on SS_fit, which tells us whether the regression model brings significant (in a statistical sense) information.

Model (1)   SS (2)                             df (3)    MSS = (2)/(3)   F
Fitted      $\sum_i (\hat{y}_i - \bar{y})^2$   p         MSS_fit         MSS_fit / MSS_res
Residual    $\sum_i (y_i - \hat{y}_i)^2$       N–p–1     MSS_res
Total       $\sum_i (y_i - \bar{y})^2$         N–1

p: number of parameters (excluding the constant)
N: number of observations
Application to Data using SPSS
Analyze > Regression > Linear

ANOVA (b)
Model 1       Sum of squares   df    Mean square   F         Sig.
Regression    784.132          1     784.132       300.090   .000 (a)
Residual      1120.970         429   2.613
Total         1905.102         430
a. Predictors: (Constant), lnrd_assets
b. Dependent variable: lnpat_assets

Model summary
Model   R          R²     Adjusted R²   Std. error of the estimate
1       .642 (a)   .412   .410          1.61647
a. Predictors: (Constant), lnrd_assets
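Most figures in this SPSS output can be reproduced from the three sums of squares alone. The sketch below takes the sums of squares from the ANOVA table; the sample size N = 431 is inferred from the total degrees of freedom (430 = N - 1), and the adjusted R² uses the standard textbook formula, which the slides do not spell out.

```python
# Sums of squares as reported in the SPSS ANOVA table.
ss_fit = 784.132
ss_res = 1120.970
ss_tot = 1905.102

p = 1    # one explanatory variable (lnrd_assets)
n = 431  # assumption: inferred from total df = N - 1 = 430

# Mean squares: sum of squares divided by degrees of freedom.
ms_fit = ss_fit / p
ms_res = ss_res / (n - p - 1)

f_stat = ms_fit / ms_res
r_squared = ss_fit / ss_tot
# Standard adjusted R-squared formula (not given in the slides).
adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)
# Standard error of the estimate: square root of the residual mean square.
std_error_estimate = ms_res ** 0.5

print("F =", round(f_stat, 3))
print("R^2 =", round(r_squared, 3), "adjusted R^2 =", round(adj_r_squared, 3))
print("std. error of the estimate =", round(std_error_estimate, 5))
```

Small differences in the last decimal can appear because the inputs are themselves rounded to three decimals.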
What the R² is not
The R² does not tell us whether:
Independent variables are a true cause of the changes in the dependent variable
The correct regression was used
The most appropriate set of independent variables has been chosen
There is collinearity present in the data
The model could be improved by using transformed versions of the existing set of independent variables
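Inference on β
We have estimated
$$E(y_i) = \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$
If $\hat{\beta} = 0$, $E(y_i) = \hat{\alpha}$; if $\hat{\beta} \neq 0$, $E(y_i) = \hat{\alpha} + \hat{\beta} x_i$.
Therefore we must test whether the estimated parameter is significantly different from 0 and, consequently, we must say something about the distribution (the mean and variance) of the estimator $\hat{\beta}$ of the true but unobserved β*.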
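The mean and variance of β
It is possible to show that $\hat{\beta}$ is a good approximation, i.e. an unbiased estimator, of the true parameter β*.
$$E(\hat{\beta}) = \beta^*$$
$$\mathrm{VAR}(\hat{\beta}) = \sigma_{\hat{\beta}}^2 = \frac{\sigma_{\varepsilon}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{where} \quad \sigma_{\varepsilon}^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}$$
The variance of $\hat{\beta}$ is defined as the ratio of the mean square of errors over the sum of squares of the explanatory variable.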
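This ratio can be checked against numbers already reported in the slides. The sketch below assumes that the mean square of errors is the residual mean square from the SPSS ANOVA table (2.613) and that the sum of squares of the explanatory variable is the "Denominator Beta_Hat" sum from the Excel sheet (256.55).

```python
# Residual mean square from the SPSS ANOVA table.
mean_square_error = 2.613
# Sum of squared deviations of lnrd_assets from its mean (Excel sheet).
sum_sq_x = 256.55

var_beta_hat = mean_square_error / sum_sq_x
se_beta_hat = var_beta_hat ** 0.5

print("variance of beta_hat =", round(var_beta_hat, 5))
print("standard error of beta_hat =", round(se_beta_hat, 3))
```

The resulting standard error rounds to the value shown next to lnrd_assets in the SPSS coefficients table (.101).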
The confidence interval of β
We must now define the confidence interval of β, at 95%. To do so, we use the mean and variance of $\hat{\beta}$ and define the t value as follows:
$$t = \frac{\hat{\beta} - \beta^*}{s_{\hat{\beta}}}$$
Therefore, the 95% confidence interval of β is:
$$\beta^* = \hat{\beta} \pm t_{.025} \, \frac{\sigma_{\varepsilon}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
If the 95% CI does not include 0, then β is significantly different from 0.
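A minimal sketch of the interval for the slope estimated earlier, using the coefficient (1.748) and standard error (.101) reported by SPSS. One assumption is made loudly here: the critical value is taken from the standard normal distribution as a large-sample stand-in for the Student t value with 429 degrees of freedom, because the Python standard library has no t quantile function.

```python
from statistics import NormalDist

beta_hat = 1.748  # slope reported by SPSS
se_beta = 0.101   # its standard error, as reported by SPSS

# Assumption: normal approximation to t_.025 with 429 degrees of freedom.
critical = NormalDist().inv_cdf(0.975)

lower = beta_hat - critical * se_beta
upper = beta_hat + critical * se_beta
contains_zero = lower <= 0.0 <= upper

print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
print("interval contains 0:", contains_zero)
```

With an exact t quantile the interval would be very slightly wider, which does not change the conclusion here.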
Student t Test for β
We are also in a position to make inferences about β.
H0: β* = 0
H1: β* ≠ 0
Rule of decision
Accept H0 if | t | < tα/2
Reject H0 if | t | ≥ tα/2
$$t = \frac{\hat{\beta} - \beta^*}{s_{\hat{\beta}}} = \frac{\hat{\beta}}{s_{\hat{\beta}}}$$
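The decision rule can be sketched as a small function. The function name t_test and its arguments are illustrative, and the critical value again uses the standard normal distribution as a large-sample approximation to the Student t distribution (an assumption, reasonable with more than 400 observations).

```python
from statistics import NormalDist


def t_test(beta_hat, se_beta, beta_null=0.0, level=0.05):
    """Return (t statistic, critical value, reject H0?) for H0: beta* = beta_null."""
    t_stat = (beta_hat - beta_null) / se_beta
    # Assumption: normal approximation to the two-sided t critical value.
    critical = NormalDist().inv_cdf(1.0 - level / 2.0)
    return t_stat, critical, abs(t_stat) >= critical


# Slope and standard error as reported by SPSS for lnrd_assets.
t_stat, critical, reject_h0 = t_test(beta_hat=1.748, se_beta=0.101)
print(f"t = {t_stat:.3f}, critical value = {critical:.3f}, reject H0: {reject_h0}")
```

The t value computed from the rounded coefficient and standard error is close to, but not exactly, the 17.323 printed by SPSS, which works with unrounded figures.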
Application to Data using SPSS
Analyze > Regression > Linear

Coefficients (a)
               Unstandardized coefficients   Standardized coefficients
Model 1        B          Std. error         Beta                        t         Sig.
(Constant)     -8.151     .244                                           -33.392   .000
lnrd_assets     1.748     .101               .642                         17.323   .000
a. Dependent variable: lnpat_assets
Assignments on CERAM_BIO
Regress the number of patents on R&D expenses and consider:
1. The quality of the fit
2. The significance and direction of R&D expenses
3. The interpretation of the result in an economic sense
Repeat steps 1 to 3 using:
R&D expenses divided by one million (you need to generate a new variable for that)
The log of R&D expenses
What do you observe? Why?