P101 Applied Econometrics - SOAS University of London · 2019-09-23

© SOAS | 3740

Centre for Development, Environment and Policy

P101

Applied Econometrics

Prepared by Francesca Di Nuzzo

This module is partially based on the earlier module ‘Applied

Econometrics for the Agricultural and Food Sector’ prepared for the

University of London’s External Programme by Alison Burrell.

Applied Econometrics Module Introduction

© SOAS CeDEP 2

ABOUT THIS MODULE

This module is about econometric methods and how they are applied to estimate and

test the unknown parameters of economic relationships. Priority is given to both the

statistical reasoning underlying the methodology and the practical considerations

involved in using this methodology with a variety of models and real data.

The focus of the module is on the classical linear regression model. This is the basis

for much econometric methodology and it provides the framework for organising the

module.

The module covers:

• the principles of regression analysis and its statistical foundations

• the simple linear regression model

• the multiple linear regression model

• violations of the assumptions of classical linear regression.

There is a limit to the distance that can be covered in the study time available. In an econometrics module there is little scope for trading depth for breadth since, without good groundwork and sufficient information at each stage, ideas may be misunderstood and techniques misapplied. This module follows the standard itinerary of most econometrics textbooks.

The practical exercises, which are designed to be done with the help of the free software package R, are an important element of the module.



STRUCTURE OF THE MODULE

Each unit in the study guide follows the same format.

Each unit will always start with a section on ideas or issues, whose purpose is to

explain, in simple words and with a minimum of technical notation, the basic

substance of the unit. The aim is to give you an intuitive feel for the subject matter

before going into technical detail. If you feel that mathematics and statistics are not

your strongest suit, this regular section will give you a few ‘analytical handles’ to

hold on to when studying relevant techniques. But even if you are confident with

mathematics and statistics, it is important not to skip this section.

Technical expertise is not just a question of one’s ability to work out the steps in a

technical procedure or to understand a mathematical derivation. It also involves

understanding the type of questions a technique tries to address and the

assumptions on which it is based as well as judging the appropriateness of particular

technical procedures in specific conditions.

Next, the module units contain a Key Concepts section, which guides your study in

further detail. The purpose of this section is to highlight the main concepts as well as

to structure your reading of the textbook.

Following on from this, you will find a section containing an example (except in Units 6 and 10). The purpose of this section is twofold.

• First, the example highlights a specific aspect of the topic under study in a particular unit of the module.

• Second, the example also gives you a glimpse of econometrics in action.

The examples aim to highlight the links between economic theory and empirical

investigation, and to illustrate the problems that can arise when we work with real

data.

Often, you will find a set of self-assessment questions at the end of each section.

It is important that you work through all of these. Their purpose is threefold:

• to check your understanding of basic concepts and ideas

• to verify your ability to execute technical procedures in practice

• to develop your skills in interpreting the results of empirical analysis.

Also, you will find additional exercises at the end of each unit, which aim to give you

hands-on examples of how to carry out empirical analysis; in many cases, you will be

asked to work with actual data and to interpret results. Answers to the self-

assessment questions are provided.

For each unit (except in Units 6 and 10) you will find a guide that explains how to

use R – the software package you will use to carry out econometric exercises. This

guide will help you to master this particular econometrics software package. You may

find it more convenient to study each R guide individually at the end of each unit

(before the section ‘Unit Self-Assessment Questions’).



When studying each unit, we suggest that you study the text at your own pace and then work through the questions, which are designed to test your understanding of the unit material. If these questions reveal some weak spots, revise the relevant material before going on to the applied, data-based questions, which you will need to answer in conjunction with the R guide. The Summary at the end of each unit briefly describes the topics covered, and the table of Key Terms and Concepts lists all the important concepts you should have learned through your study.

Applied statistics and econometrics are subjects with a great deal of specialised

jargon. It can be disconcerting to be faced with a number of unfamiliar new terms,

many of them quite long and often rather similar to each other. We recommend very

strongly that, right from the beginning of the module, you keep a glossary in which

you list each new term as you encounter it, together with an explanation of the

term in your own words. You should read through this glossary every few weeks,

updating your definitions if you find that, as the module progresses, your

understanding of the term develops along with your familiarity with the concept.



WHAT YOU WILL LEARN

Module Aims

The specific aims of the module are:

• To explain the principles of econometric estimation and its statistical foundations.

• To present the theory of the classical linear regression model and explain why the conditions in such a model provide an ideal environment for ordinary least squares regression.

• To develop practical skills of data analysis, use of regression techniques and interpretation of regression results.

• To explain the procedures of interval estimation and hypothesis testing in the classical normal linear regression model.

• To show how econometric models can be made more realistic through the use of dummy variables.

• To explain how linear restrictions can be imposed on parameters during estimation and how these restrictions can be tested.

• To investigate the consequences of heteroscedasticity of the disturbances and endogeneity of the regressors.

• To encourage an appreciation of what constitutes a ‘good’ econometric model, and how to test that a model is well specified.

Module Learning Outcomes

By the end of this module students should be able to:

• understand, and selectively and critically apply, the basic principles of regression analysis and statistical inference in the context of a single-equation regression model

• formulate a single-equation regression model, estimate its parameters, carry out a variety of tests relating to model specification and critically interpret all results

• test hypotheses about economic behaviour and critically interpret the results of these tests

• specify and interpret models using dummy variables and different types of dynamic specification, and incorporate and test linear restrictions

• test for heteroscedasticity and endogeneity, and take appropriate action when these conditions are found to be present.



ASSESSMENT

This module is assessed by:

• an examined assignment (EA) worth 40%

• a written examination worth 60%.

Since the EA is an element of the formal examination process, please note the

following:

(a) The EA questions and submission date will be available on the Virtual Learning

Environment (VLE).

(b) The EA is submitted by uploading it to the VLE.

(c) The EA is marked by the module tutor and students will receive a percentage

mark and feedback.

(d) Answers submitted must be entirely the student’s own work and not a product

of collaboration.

(e) Plagiarism is a breach of regulations. To ensure compliance with the specific

University of London regulations, all students are advised to read the

guidelines on referencing the work of other people. For more detailed

information, see the FAQ on the VLE.



STUDY MATERIALS

Except for Unit 9, for which separate material will be provided, the single textbook

for the module is:

Gujarati, D. & Porter, D. (2010) Essentials of Econometrics. 4th edition.

International Edition. McGraw-Hill.

This book has been chosen for this self-study module because of its attention to full

explanations of concepts and procedures, its long introductory section presenting the

basic statistical concepts used in regression modelling, and its avoidance of

unnecessary algebra and difficult notation.

You are also encouraged to read those parts that are not specifically identified in the

module texts, since all the material here should be within your grasp and will

reinforce your understanding of the subject. By the end of the module, you will know

this textbook well and will be ready for other more advanced readings.

The following are not compulsory; however, they are more advanced in terms of

notation and explanations, should you wish to go into more depth about specific

concepts. Please note that these texts are recommendations and are not provided.

Intermediate

Greene, W. (2000) Econometric Analysis. 4th edition. New Jersey, Prentice Hall.

Gujarati, D. (1979) Basic Econometrics. Singapore, McGraw-Hill.

Gujarati, D. (2011) Econometrics by Example. Palgrave Macmillan.

Advanced

Maddala, G.S. & Lahiri, K. (2009) Introduction to Econometrics. 4th edition.

Chichester, John Wiley & Sons.

Other

Hallam, D. (1990) Econometric Modelling of Agricultural Commodity Markets.

London, Routledge.

Discussion of econometric modelling in agricultural economics.



For each of the module units, the following are provided.

Key Readings

These are drawn mainly from the textbook and relevant academic journals and

internationally respected reports. Key Readings are provided to add breadth and

depth to the unit materials, as appropriate, and are required reading as they contain

material on which students may be examined. The notes under each reading indicate

the scope and relevance of the reading.

Further Readings

These texts are not provided in hard copy, but weblinks have been included. Further

Readings are NOT examinable and are provided to enable students to pursue their

own areas of interest.

Multimedia

The e-version of the study guide includes a number of interviews with rural

development managers in which they discuss particular aspects of their management

experience. These interviews can be treated rather like audio case studies to

illustrate concepts and arguments presented in the module text.

References

Each unit contains a full list of all material cited in its text. This is primarily a matter of good academic practice: to show where points made in the text can be substantiated.

Students are not expected to consult these references as part of their study of this

module.

Self-Assessment Questions

Often, you will find a set of Self-Assessment Questions at the end of each section

within a unit. It is important that you work through all of these. Their purpose is

threefold:

• to check your understanding of basic concepts and ideas

• to verify your ability to execute technical procedures in practice

• to develop your skills in interpreting the results of empirical analysis.

Also, you will find additional Unit Self-Assessment Questions at the end of each

unit, which aim to help you assess your broader understanding of the unit material.

Answers to the Self-Assessment Questions are provided in the Answer Booklet.



In-text Questions

This icon invites you to answer a question for which an answer is

provided. Try not to look at the answer immediately; first write down

what you think is a reasonable answer to the question before reading

on. This is equivalent to lecturers asking a question of their class and

using the answers as a springboard for further explanation.

In-text Activities

This symbol invites you to halt and consider an issue or engage in a

practical activity.

Key Terms and Concepts

At the end of each unit you are provided with a list of Key Terms and Concepts which

have been introduced in the unit. The first time these appear in the study guide they

are set in bold italics. Some key terms are very likely to be used in examination

questions, and an explanation of the meaning of relevant key terms will nearly

always gain you credit in your answers.

Acronyms and Abbreviations

As you progress through the module you may need to check unfamiliar acronyms

that are used. A full list of these is provided for you at the end of the Introduction.



TUTORIAL SUPPORT

There are two opportunities for receiving support from tutors during your study, and

you are strongly advised to take advantage of both. These opportunities involve:

(a) participating in the Virtual Learning Environment (VLE)

(b) completing the examined assignment (EA).

Virtual Learning Environment (VLE)

The Virtual Learning Environment provides an opportunity for you to interact with

both other students and tutors. A discussion forum is provided through which you

can post questions regarding any study topic that you have difficulty with, or for

which you require further clarification. You can also discuss more general issues on

the News forum within the CeDEP Programme Area.



INDICATIVE STUDY CALENDAR

Part/unit Unit title Study time (hours)

PART I INTRODUCTORY IDEAS AND STATISTICAL CONCEPTS

Unit 1 Introduction to Econometrics 10

Unit 2 Statistical Review 15

PART II THE SIMPLE REGRESSION MODEL

Unit 3 The Classical Linear Regression Model 15

Unit 4 Hypothesis Testing 15

PART III THE MULTIPLE REGRESSION MODEL

Unit 5 The Multiple Regression Model 15

Unit 6 Dummy Variables 10

Unit 7 Linear Parameter Restrictions 15

PART IV NON-CLASSICAL DISTURBANCES AND ENDOGENEITY

Unit 8 Heteroscedasticity 15

Unit 9 Causality and Instrumental Variables 15

PART V MODULE SUMMARY

Unit 10 Module Summary 10

Examined Assignment (check the VLE for submission deadline) 15

Examination entry July

Revision and examination preparation Jul–Sep

End-of-module examination Late Sep–early Oct



ACRONYMS AND ABBREVIATIONS

BLUE best linear unbiased estimator

CLRM classical linear regression model

CLT central limit theorem

CNLRM classical normal linear regression model

CPI consumer price index

d.f. degrees of freedom

ESS explained sum of squares

FAO Food and Agriculture Organization

FDI foreign direct investment

GDP gross domestic product

GNP gross national product

LM Lagrange multiplier

MBA Master of Business Administration

ML maximum likelihood

MPC marginal propensity to consume

MSE mean square error

MT Metical; the currency of Mozambique

OLS ordinary least squares

PDF probability density function

PRF population regression function

PRL population regression line

RLS restricted least squares

RSS residual sum of squares

SRF sample regression function

SUR seemingly unrelated regression

TSS total sum of squares

VIF variance inflation factor

WLS weighted least squares

Unit One: Introduction to Econometrics

Unit Information
  Unit Overview
  Unit Aims
  Unit Learning Outcomes
  Unit Interdependencies

Key Readings

References

1.0 Ideas and issues
  Section Overview
  Section Learning Outcomes
  1.1 What is econometrics?
  1.2 Econometrics in theory and in practice

2.0 Key concepts: the concept of regression
  Section Overview
  Section Learning Outcomes
  2.1 What is regression?
  2.2 Linearity and log-linearity
  2.3 Correlation versus regression analysis
  2.4 Data and regression
  2.5 A word of caution on notation
  Section 2 Self-Assessment Question

3.0 Example: The Keynesian consumption function

Unit Summary

Unit Self-Assessment Questions

Key Terms and Concepts

Applied Econometrics Unit 1


UNIT INFORMATION

Unit Overview

This unit introduces you to the study of econometrics. It begins by defining

econometrics and then explains how econometrics relates to and differs from other

branches of economics. The important roles of economic theory and data in

econometric work are emphasised. Regression analysis is identified as the basis of

econometric procedure. The aims and purpose of regression analysis are explained.

The main steps of a typical econometric investigation are described and illustrated

with an example.

Unit Aims

• To define the nature and scope of econometrics.

• To identify the special characteristics of econometrics as a tool of applied economics.

• To describe and illustrate the main steps of an econometric investigation.

• To identify some characteristics of economic data.

• To practise some basic techniques of data investigation.

Unit Learning Outcomes

By the end of this unit, students should:

• have an appreciation of econometrics as a method of empirical investigation

• have an understanding of major differences between econometric models and economic models

• have an understanding of the main steps of an econometric investigation

• have a knowledge of essential terminology relating to regression analysis

• know how to perform basic data analysis in R.



Unit Interdependencies

Unit 3

In Unit 3, you will take the first steps into linear regression analysis, which is one

technique to deal with non-experimental data.

Unit 4

In this unit, we discuss the methodology of econometrics, which involves testing our

maintained hypothesis; this is what you will be introduced to in Unit 4.

Unit 5

In Unit 5, we expand linear regression to multivariate regression, which is commonly

used to perform empirical analysis. Also, we will make use of correlation analysis in

dealing with the issue of multicollinearity.

Unit 8

In this unit, you will learn that we can use a variety of functional forms to specify our model. In Unit 8 you will review the logarithmic transformation and employ it to mitigate the effect of non-constant variance in the residuals, i.e. heteroscedasticity.

Unit 9

In this unit we introduce the concept of regression analysis. This kind of analysis

implies that a causal nexus is identified between two or more variables. In Unit 9 the

issues of causality are discussed more extensively.



KEY READINGS

Gujarati, D. & Porter, D. (2010) The nature and scope of econometrics. In: Essentials

of Econometrics. 4th edition. International Edition, McGraw-Hill. pp. 1–13.

Gujarati and Porter (2010) is the main textbook for the study of this module. This introductory

chapter will give you an overview of the course and it will define the field of econometrics, its

methodologies and types of data. The chapter is straightforward and can be read fairly quickly.

Gujarati, D. & Porter, D. (2010) Basic ideas of linear regression: the two-variable

model. In: Essentials of Econometrics. 4th edition. International Edition, McGraw-

Hill. Sections 2.1–2.6, pp. 21–32.

Sections 2.1 to 2.6 provide a brief, clear explanation of the key introductory concepts of

linear regression. Make sure that you refer to these definitions also as you go through the

following units.



REFERENCES

Maddala, G.S. & Lahiri, K. (2009) Introduction to Econometrics. 4th edition. Chichester, John Wiley & Sons.

ONS. (1991) Economic Trends Annual Supplement 1991. Unipub.

Tsai, P.L. (1991) Determinants of foreign direct investment in Taiwan: an alternative

approach with time series data. World Development, 19 (2–3), 275–285.



1.0 IDEAS AND ISSUES

Section Overview

In this section, we begin to explore the subject of econometrics, what it does and

how it is used in applied research.

Section Learning Outcomes

By the end of this section, students should be able to:

• understand what econometrics does and how it is employed in empirical research

• understand the nature of the data and methods used in econometrics.

1.1 What is econometrics?

Welcome to this module. Its aim is to give you an introduction to econometric

methods or, more specifically, to linear regression, which is the main statistical

foundation for econometric work. Throughout the module you will be working with

data; we hope you will find this interesting.

Economic theory is concerned with relationships between variables. You might have

already met some of these, such as demand and supply functions for agricultural

products, production functions, labour supply and demand functions, and so on.

Economic theory aims to explain economic behaviour; this involves studying the

relationship between economic variables and the factors that influence them.

The purpose of econometrics is to quantify economic relationships. Econometrics

can provide numerical estimates of the parameters of these relationships and a

framework for testing hypotheses about them. Broadly defined, econometrics is

‘ … the application of statistical and mathematical methods to the

analysis of economic data, with the purpose of giving empirical content to

economic theories and verifying them or refuting them …’

Source: Maddala & Lahiri (2009) p. 3.

Other definitions are possible: you will come across a number of definitions, each

with a slightly different emphasis. Common to all definitions, however, is the stress

on the empirical nature of econometric work.

The process of econometrics involves the confrontation between economic

theory and economic data in quantifying economic relationships.

Econometrics is not just a branch of mathematical economics. Mathematical

economics need not have any empirical content at all whereas in econometrics the

emphasis is on empirical analysis. At the same time, econometrics is not just a

‘box of tools’ to work with data. It requires, undoubtedly, a good training in statistical

techniques but these techniques need to be deployed in an interactive process

between theory and the data.



This module can be studied in its own right, but normally we would expect you to

take it as part of the MSc programme where, in Part I, you will have studied various

economic theories and models. You should therefore be familiar with a range of

questions raised in theoretical discussions and with the results of some applied

empirical studies. These are good foundations on which to build the study of

econometrics. If up to now you have approached empirical studies from the point of

view of theory or the consequences for policy making, we now invite you to look at

them from the point of view of an econometrician. What is the difference?

To give empirical content to economic theories, the econometrician is confronted with

four problems that hardly concern the economic theorist. These four problems are

explained below.

(a) Non-experimental data

Economic theory develops models using a priori reasoning applied to relatively simple

assumptions. This procedure involves abstracting from secondary complications by

assuming that ‘other things remain equal’ (or ceteris paribus), in order to

investigate the links between a few key economic variables.

For example, in demand theory we say that the quantity demanded of a commodity

(that is not a Giffen good) will fall if its price rises, other things being equal. These

‘other things’ which we assume are held constant include consumers’ incomes and

income distribution, and the prices of substitutes and complementary goods.

This method is fruitful in economic theory but, unfortunately, it is rarely possible to

carry out controlled experiments to test such statements. Therefore, in empirical

economics the scope for observing such behaviour is severely limited. A researcher

cannot alter a commodity’s price, holding other things constant, in order to see what

happens to its demand.

In general, economic data are not the outcome of experiments but rather are

observed and recorded in a non-experimental world where other things are

never equal. Therefore, econometrics involves untangling the effects of different

factors that act simultaneously rather than analysing the results of a laboratory

experiment.
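The difficulty of working without ‘other things being equal’ can be made concrete with a small simulation. The following is only a sketch under invented assumptions: it is written in Python purely for illustration (the module's own exercises use R), the demand relationship and all numbers are hypothetical, and the helper names `ols_simple` and `residuals` are made up for this example. Quantity demanded is assumed to fall with price and rise with income, while price and income tend to move together in the observed data.

```python
import random


def ols_simple(x, y):
    """Return (slope, intercept) from an OLS regression of y on one regressor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """Residuals of y after removing the part linearly explained by x."""
    slope, intercept = ols_simple(x, y)
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]


random.seed(1)  # fixed seed so the illustration is reproducible
n_obs = 2000

# Invented data: income varies across observations and price tends to rise with it.
income = [random.gauss(100, 10) for _ in range(n_obs)]
price = [10 + 0.2 * (inc - 100) + random.gauss(0, 1) for inc in income]

# Assumed 'true' demand: -2 per unit of price, +0.5 per unit of income, plus noise.
quantity = [50 - 2.0 * p + 0.5 * inc + random.gauss(0, 1)
            for p, inc in zip(price, income)]

# (1) Naive regression of quantity on price alone: income is NOT held constant.
naive_slope, _ = ols_simple(price, quantity)

# (2) Remove the influence of income from both variables, then regress the
#     residuals on each other (the Frisch-Waugh-Lovell partialling-out result).
controlled_slope, _ = ols_simple(residuals(income, price),
                                 residuals(income, quantity))

print(f"price effect ignoring income:    {naive_slope:.2f}")
print(f"price effect controlling income: {controlled_slope:.2f}")
```

With these invented numbers the income effect roughly offsets the price effect, so the naive estimate should come out close to zero, whereas the estimate that allows for income should be close to the assumed value of -2. The partialling-out step is one way of ‘holding income constant’ statistically when it cannot be held constant experimentally.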

(b) Stochastic relationships

Economic theory usually involves deterministic relationships between economic

variables. This can be explained with a simple example: the Keynesian consumption

function. In economic theory we assume that, if we know the level of aggregate real

income, consumption will be uniquely determined. That is, for each value of

aggregate real income there corresponds a given level of aggregate consumption.

In reality, however, we do not expect theoretical relationships to hold exactly. Even

when all the main factors that systematically affect the behaviour of an economic

variable are taken into account, there will still be some random variation due to non-

systematic, ‘one-off’ factors and human variability.

Hence, in econometric work we deal with relationships between variables that

contain a random or stochastic element, and that are therefore not deterministic in

nature. We investigate functions between variables which we believe to be reasonably

stable on average, but there is always a degree of uncertainty about them.



In econometrics we make explicit assumptions about these random components,

called disturbances. This is why econometrics draws heavily on probability theory

and statistical inference.
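The contrast between a deterministic and a stochastic relationship can be sketched in a few lines of code. This is an illustration only, written in Python rather than the module's R, and everything in it is an assumption: the parameter values `ALPHA` and `BETA`, the normally distributed disturbance and its standard deviation are all invented for the example.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

ALPHA = 20.0  # hypothetical autonomous consumption
BETA = 0.8    # hypothetical marginal propensity to consume


def consumption_deterministic(income):
    """Theoretical relationship: income determines consumption exactly."""
    return ALPHA + BETA * income


def consumption_stochastic(income, disturbance_sd=5.0):
    """Econometric relationship: the same systematic part plus a random disturbance."""
    return consumption_deterministic(income) + random.gauss(0.0, disturbance_sd)


income_level = 500.0

# The theoretical function gives one value for a given income level.
theory_value = consumption_deterministic(income_level)

# Repeated 'observations' at the same income level differ from one another.
observed = [consumption_stochastic(income_level) for _ in range(1000)]
mean_observed = sum(observed) / len(observed)

print(f"theoretical consumption: {theory_value:.1f}")
print(f"lowest / highest observed: {min(observed):.1f} / {max(observed):.1f}")
print(f"average of observed values: {mean_observed:.1f}")
```

Individual observations scatter around the theoretical value, but their average lies close to it. This is the sense in which the relationship is stable on average while any single observation remains uncertain.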

(c) Observed variables

In economic theory we work with theoretical variables. Econometrics, in contrast,

deals with observed data.

Obviously, there is a certain correspondence between them: data collection is

inspired by some theoretical framework. For example, the framework for measuring

national income account data derives from Keynesian economics, which is centred on

the analysis of theoretical aggregates such as output, demand, employment and the

price level.

However, observed variables do not fully correspond to their theoretical counterparts

because of differences in definition and coverage, and errors in measurement. For

example, the ‘price level’ is an abstract concept that is usually represented

empirically by some aggregate price index; however, the values it takes depend on

the goods whose prices are covered by the index and the method of calculating the

index.
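As a rough illustration of how the measured ‘price level’ depends on what the index covers, the sketch below computes a base-weighted (Laspeyres-type) price index for two different baskets. It is written in Python purely for illustration; the goods, prices and quantities are invented, and the function name `laspeyres_index` is hypothetical.

```python
def laspeyres_index(base_prices, current_prices, base_quantities, goods):
    """Base-weighted price index (base period = 100) over the listed goods."""
    base_cost = sum(base_prices[g] * base_quantities[g] for g in goods)
    current_cost = sum(current_prices[g] * base_quantities[g] for g in goods)
    return 100.0 * current_cost / base_cost


# Invented figures for three goods in a base period and a current period.
base_prices = {"bread": 1.0, "fuel": 2.0, "rent": 10.0}
current_prices = {"bread": 1.1, "fuel": 3.0, "rent": 10.0}
base_quantities = {"bread": 100, "fuel": 50, "rent": 10}

# Same economy, same period, two different choices of coverage.
index_without_fuel = laspeyres_index(base_prices, current_prices,
                                     base_quantities, ["bread", "rent"])
index_with_fuel = laspeyres_index(base_prices, current_prices,
                                  base_quantities, ["bread", "fuel", "rent"])

print(f"index excluding fuel: {index_without_fuel:.1f}")
print(f"index including fuel: {index_with_fuel:.1f}")
```

The two indices describe the same underlying prices yet give different values for the ‘price level’, simply because one basket includes the good whose price rose most.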

Another example concerns modelling technology. In agricultural supply functions, the

‘state of technology’ is an important variable: changes in supply over time are driven

both by price changes and by the pace of technological change. But how can

technological development be measured? Many researchers resort to a simple time

trend to represent this important variable.

Finally, ‘management’ is a key input in the theoretical specification of an agricultural

production function but one that is always difficult to measure empirically.

Econometricians sometimes resort to proxy variables for management like the

number of years of education the farmer has received.

In econometrics we need to be aware of the discrepancy between theoretical

concepts and observed data, and its implications when quantifying theoretical

propositions.

(d) The treatment of time

The econometrician must make explicit assumptions about the role of time in the

model. When economic theory postulates that consumption depends on disposable

income, ceteris paribus, it implies that when income takes different values so does

consumption. Econometrics can quantify this dependency by using information about

how consumption changes as income takes different values. However, this

dependency could be observed empirically in two alternative ways:

(i) by recording how consumption and income move together over time, or

(ii) by recording the consumption of households at different income levels during

the same time period.

In the first case, we have a time-series model, requiring time-series data

(measured at different intervals over time).

In the second case, we have a cross-section model, requiring cross-section data

(measured for different individuals or micro-units at the same point or during the



same period in time). The choice between a time-series and a cross-section model

often depends on data availability, although this choice is less straightforward than

it may seem.

First, we may need to modify our theory for explaining consumption changes over

time before it can be applied to cross-sectional consumption analysis.

Second, there may be data considerations; for example, a time-series approach is

hardly appropriate for studying how consumption varies with income during periods

in which there has been virtually zero income growth.
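The two kinds of data, and the data consideration just mentioned, can be sketched as follows. This is a minimal illustration in Python (the module itself uses R); all figures, field names and the helper `ols_slope` are invented for the example.

```python
def ols_slope(x, y):
    """OLS slope from regressing y on x; fails if x does not vary."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("no variation in the regressor: slope cannot be estimated")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Time-series data: ONE unit (an economy) observed in successive periods.
time_series = [
    {"year": 2001, "income": 100, "consumption": 90},
    {"year": 2002, "income": 110, "consumption": 99},
    {"year": 2003, "income": 120, "consumption": 105},
    {"year": 2004, "income": 130, "consumption": 114},
]

# Cross-section data: SEVERAL units (households) observed in the same period.
cross_section = [
    {"household": "A", "income": 50, "consumption": 52},
    {"household": "B", "income": 100, "consumption": 78},
    {"household": "C", "income": 150, "consumption": 112},
    {"household": "D", "income": 200, "consumption": 138},
]

ts_slope = ols_slope([r["income"] for r in time_series],
                     [r["consumption"] for r in time_series])
cs_slope = ols_slope([r["income"] for r in cross_section],
                     [r["consumption"] for r in cross_section])
print(f"time-series slope:   {ts_slope:.3f}")
print(f"cross-section slope: {cs_slope:.3f}")

# A period with zero income growth: income never changes, so the data carry
# no information on how consumption responds to income.
flat_income = [100, 100, 100, 100]
try:
    ols_slope(flat_income, [90, 91, 89, 90])
except ValueError as err:
    print("flat-income period:", err)
```

The two slopes need not coincide, since they summarise different kinds of variation: changes over time for one unit versus differences between units at one time. The final step shows why a period of zero income growth is unhelpful: without variation in income, there is nothing from which a slope could be computed.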

1.2 Econometrics in theory and in practice

The four elements above give econometric work its distinctive flavour:

• the fact that we cannot hold other things constant in empirical analysis

• the random (or stochastic) nature of relationships between variables

• the discrepancies between theoretical variables and observed data

• the need to make explicit assumptions about time.

We cannot move straight from an economic model as formulated by economic theory

to parameter estimation without dealing with these issues. In empirical analysis, our

data never behave exactly as our theoretical models would lead us to believe. Simple

theoretical models are useful abstractions.

But in empirical work the relationships we wish to disentangle from the data may

involve a number of variables, and may be subject to uncertainties that our theories

could not possibly aim to explain. ‘Econometric methodology’ therefore includes

approaches for dealing with these issues, as well as the statistical techniques of

parameter estimation.

Regression analysis provides us with an analytical framework for handling

relationships involving a number of causal factors, including random elements. It

seeks to establish statistical regularities among observed variables. To do this we

need to deal with the randomness inherent in the behaviour of our variables. This

requires the help of statistical theory, which allows us to model randomness as an

integral part of the relationship between variables. How this is done, and how we

should interpret the results, is the subject of this module.

The following are the main points to remember.

• In econometrics we confront theory with economic data so as to quantify economic relationships and to test hypotheses about them.

• In practice, we deal with stochastic relationships between variables which we can only observe in a non-experimental context.

• Econometric methodology has been developed in order to deal with this situation, and differs significantly from the way regression analysis is applied to experimental data. There are many outstanding issues and unresolved methodological problems in the practice of econometrics.

• Moreover, the conclusions we draw in a particular context will always involve a considerable degree of uncertainty, even if our model is correctly specified. For this reason, we rely on probability theory and statistical inference to deal with uncertainty in assessing the results of empirical analysis.

• Econometrics is concerned primarily with quantifying and testing relationships between variables, and regression analysis is its main tool of statistical analysis.



2.0 KEY CONCEPTS: THE CONCEPT OF REGRESSION

Section Overview

Linear regression is one important tool of econometrics. In this section, we discuss

the general nature of regression analysis and provide a background that will be used

to interpret real examples in the rest of the module.

Section Learning Outcomes

By the end of this section, students should be able to:

understand the concept of linear regression

identify the components of regression analysis.

Teaching and learning econometrics involves a preoccupation with technical details,

definitions of technical terms, mathematical derivations, step by step descriptions of

statistical procedures etc, all expressed in technical notation.

This is normal and, indeed, necessary. But this preoccupation with technical detail

often implies that students lose a perspective on ‘What is it all about?’ and ‘Why are

we doing this?’ That is, there is a need to keep a grip on the kinds of basic questions,

which give substance to the subsequent technical exercises, uncluttered by notation

and technical detail. We need to get an overview of a problem before we attack it

aided by our technical armoury. We need to know the simple questions and intuitive

insights which have prompted elaborate technical enquiries.

Let us first start with the concept of regression analysis.

2.1 What is regression?

An intuitive explanation

Regression is the main statistical tool of econometrics. But what is it exactly?

Regression can best be explained by an example. Consider a famous empirical law of

consumer behaviour, formulated by German economist and statistician Engel. This

was based on a household budget survey of Belgian working class families collected

in 1855. Engel observed that the share of expenditure on food in total household

expenditure (= the $Y$-variable) was a declining function of household income (= the $X$-variable).

This is indeed what one would expect: on average, poorer families spend a higher

proportion of their income on food in comparison with better-off families. Note that

we refer to the proportion of total household expenditures spent on food and not

total food consumption of the family (one would expect better-off families to spend

more money on food even though these expenditures are generally a smaller

proportion of their total expenditure).

Hence, we expect that, on average, the share of food in household expenditures is

inversely related to household income. But we do not expect this relationship to be

exact. That is, if we were to sample 10 families with identical income (ie equal $X$-values), we would not expect to get 10 identical shares of food consumption in total household expenditures (the $Y$-values).

Differences in the demographic composition of families, in consumption habits and in

tastes will account for differences in food expenditures. In fact, many budget studies,

in the past and in the present, reveal that there is considerable variation within

each income class with respect to the proportion of household expenditures spent on

food. But, nevertheless, it is still valid to say that, on average, the proportion of

household expenditures spent on food declines as the level of income increases.

This leads us to the concept of regression. Regression methods bring out this

average relationship between a dependent variable (the $Y$-variable) on the one
hand and one or more independent variables (the $X$-variables, also called the

explanatory variables) on the other.

In our example, the average relationship between the share of food in household

expenditure and the level of household income is the regression of the former

variable on the latter.

Hence, in regression analysis we seek to model the chance variation around the

average as well as the average itself.

In summary, we hope that our model captures the basic structure of

interaction between economic variables. We expect that the behavioural

relationships are reasonably stable but we know that they do not hold exactly

because of the random component (the disturbance term). At most, we expect these

relations to hold ‘on average’.

Trying to determine this average relationship amidst the random variation in the data

is like trying to separate sound from noise when listening to a badly tuned radio.

Thus, a regression model has two components.

(a) A regression line, which models the average relationship between the

dependent variable and its explanatory variable(s). This requires us to make an

explicit assumption about the shape of the regression line: the function that

expresses it may be linear, quadratic, exponential, etc.

(b) Disturbances; we acknowledge the existence of chance fluctuations due to a

multitude of factors not explicitly recognised in the model. We model this

element of uncertainty (the noise) in the form of a disturbance term which

constitutes an integral part of our model. This disturbance term is a ‘catchall

for all the variables considered as irrelevant for the purpose of this model as

well as all unforeseen events’ (Maddala, 2009: p. 5). It is a random variable

that we cannot observe or measure in practice.

We are not interested in the disturbance term as a variable per se, but we are keen

to remove its blurred messages that hamper our attempts to investigate the

behavioural relationship between the variables of our model. To do this, we need to

model the stochastic (probabilistic) nature of the disturbance term. This is no easy

task and we always need to think carefully about whether the assumptions we

make about the behaviour of the disturbance term are indeed appropriate for the

relationship under study. Not surprisingly, a great deal of econometric theory and

practice revolves around these assumptions.
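The two components described above can be illustrated with a small simulation. The sketch below is written in Python purely for illustration (the module's own exercises use R), and every number in it is hypothetical rather than taken from Engel's data: a downward-sloping regression line gives the average food share at each income level, and a random disturbance is added to produce each family's observed share.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical values, chosen only for illustration (not Engel's data).
BETA_1 = 0.70      # intercept of the regression line
BETA_2 = -0.0004   # slope: change in food share per unit of income
SIGMA = 0.03       # spread of the disturbance term


def average_food_share(income):
    """Regression line: the average food share at a given income."""
    return BETA_1 + BETA_2 * income


def observed_food_share(income):
    """Observed share = average relationship + random disturbance."""
    return average_food_share(income) + random.gauss(0.0, SIGMA)


# Ten families with identical income do not have identical food shares.
income = 500
shares = [observed_food_share(income) for _ in range(10)]

print("average share at this income:", round(average_food_share(income), 3))
print("observed shares:", [round(s, 3) for s in shares])
```

The ten observed shares scatter around the single average value, which is the sense in which the relationship holds only ‘on average’.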



A formal explanation

It is useful to express these important ideas more formally. We start with the

population regression function. This function is a theoretical construct

representing a hypothesis about how the data are generated. For the simple, two-

variable linear regression model we have

$$Y_i = \beta_1 + \beta_2 X_i + u_i \qquad (1.1)$$

where $Y_i$ is the dependent variable (sometimes called the ‘regressand’)

$X_i$ is the explanatory variable or independent variable (or ‘regressor’)

$u_i$ is the disturbance term

$\beta_1$ and $\beta_2$ are the regression parameters: $\beta_1$ is the intercept, or constant, and $\beta_2$ is the slope coefficient.

The subscript $i$ indicates the $i$-th observation, ie the $i$-th person or object sampled. Typically, the variables $Y_i$ and $X_i$ are observable for each observation $i$, the disturbance $u_i$ takes different values for each $i$ but is not observable, whereas the parameters $\beta_1$ and $\beta_2$ are unknown but constant for all observations.

The presence of the random disturbance in equation (1.1) means that $Y_i$ is stochastic.

The population regression function may be viewed as comprising two components:

a systematic element represented by a straight line showing the statistical dependence of $Y$ on $X$;

a random, or stochastic, element represented by the disturbance term $u$.

The systematic element can be expressed as

$$E(Y_i \mid X_i) = \beta_1 + \beta_2 X_i \qquad (1.2)$$

that is, the average, or expected, value of $Y$ conditional on a given value of $X$ is a linear function of $X$. Therefore, the population regression function joins the conditional means of $Y$.

The disturbance term, $u_i$, accounts for the variation in $Y$ around the population regression line. In later units, you will learn about the assumptions made concerning $u_i$.

Regression enables us to quantify the unknown parameters $\beta_1$ and $\beta_2$, and the unknown disturbances $\{u_i\}$, for $i = 1, \ldots, n$, in equation (1.1).

Using a sample of data on $Y$ and $X$, we obtain estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ of the unknown population parameters (the symbol ^ is read as ‘hat’, hence $\hat{\beta}_1$ is ‘beta 1 hat’). We have the sample regression function

$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i \qquad (1.3)$$

in which $\hat{\beta}_1$ and $\hat{\beta}_2$ are random variables (the particular estimates obtained depend on the particular sample of data on $Y$ and $X$ used) that differ from the population parameters $\beta_1$ and $\beta_2$.

Consequently, the sample residuals, $\hat{u}_i$, differ from the unknown population disturbances, $u_i$.

Whereas the disturbance term accounts for the variation in $Y$ around the population regression line, the residuals give us the deviations of the observed $Y$-values from the estimated regression line.

2.1.1 Difference between disturbances and residuals

Name            Notation       Refers to

Disturbances    $u_i$          Population regression line (computed on the entire population)

Residuals       $\hat{u}_i$    Estimated regression line (computed on the sample)

Source: unit author

The residuals, therefore, are not identical with the disturbances, but clearly they may

contain some information that can help us understand the behaviour of the

disturbances. How to analyse the information contained in the residuals is addressed

in later units.
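The distinction between disturbances and residuals can be seen in a simulation, where (unlike in real empirical work) the population parameters and the disturbances are known because we generate them ourselves. The Python sketch below is an illustration only: the parameter values are hypothetical, and least squares is used as one standard way of fitting the sample line, since the estimation method itself is covered in later units.

```python
import random

random.seed(42)

# Hypothetical population parameters (known only because we simulate).
beta_1, beta_2 = 2.0, 0.5
n = 50

x = [random.uniform(0, 10) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]              # disturbances
y = [beta_1 + beta_2 * xi + ui for xi, ui in zip(x, u)]


def ols(x, y):
    """Least-squares intercept and slope for a two-variable regression."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


b1_hat, b2_hat = ols(x, y)
fitted = [b1_hat + b2_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]      # deviations from the fitted line

print("estimates:", round(b1_hat, 3), round(b2_hat, 3))
print("first disturbance:", round(u[0], 3), "first residual:", round(residuals[0], 3))
```

Because the estimated line differs from the population line, each residual differs from the corresponding disturbance, although the two are usually close when the estimates are close to the true parameters.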

Please note that different textbooks use different definitions for disturbances

and residuals; for example, some use the term ‘error’ instead of residuals. In

this module, we try to be as flexible as possible with terminology.

The predicted (or fitted) value of the dependent variable, $\hat{Y}_i$, is given by the sample regression line

$$\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i \qquad (1.4)$$

in which $\hat{Y}_i$ is the fitted value of the dependent variable, the estimator of $E(Y_i \mid X_i)$, that is the estimator of the population conditional mean (cf. equation (1.2)).

The sample linear regression line is an estimator of the population

regression line.
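That the estimated intercept and slope are random variables can be checked by drawing several samples from the same population and fitting a line to each. The following Python sketch does this under stated assumptions: the population parameters are hypothetical, the disturbances are drawn from a normal distribution, and least squares is used as the fitting method.

```python
import random

random.seed(7)

BETA_1, BETA_2 = 2.0, 0.5   # hypothetical population parameters


def draw_sample(n=30):
    """Draw one sample of (x, y) pairs from the same population."""
    x = [random.uniform(0, 10) for _ in range(n)]
    y = [BETA_1 + BETA_2 * xi + random.gauss(0, 1) for xi in x]
    return x, y


def ols(x, y):
    """Least-squares intercept and slope for a two-variable regression."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Five samples from one population give five different sample regression lines.
estimates = [ols(*draw_sample()) for _ in range(5)]
for b1_hat, b2_hat in estimates:
    print("intercept:", round(b1_hat, 3), "slope:", round(b2_hat, 3))
```

Each sample yields its own line; none coincides exactly with the population line, but all of them estimate it.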



2.2 Linearity and log-linearity

Equation (1.1) is an example of a linear regression model. That is, $Y$ is linear in $X$ and in the parameters $\beta_1$ and $\beta_2$. With the linear regression line

$$Y = \beta_1 + \beta_2 X$$

the interpretation given to $\beta_2$ relies on the fact that

$$\frac{dY}{dX} = \beta_2 \qquad (1.5)$$

In case you are not familiar with derivatives, this means that an increase of one unit in $X$ (measured in units of $X$) results in an increase of $\beta_2$ units in $Y$ (measured in units of $Y$).

On the intercept: in theory, $\beta_1$ is the predicted value of $Y$ (in units of $Y$) if $X = 0$. In practice, this interpretation of $\beta_1$ is not recommended unless zero values of $X$ could reasonably occur.

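Now consider the model

$$Y_i = \beta_1 X_i^{\beta_2} e^{u_i} \qquad (1.6)$$

which, after taking natural logarithms of both sides of the equation, can be written as

$$\ln Y_i = \alpha + \beta_2 \ln X_i + u_i \qquad (1.7)$$

where $\alpha = \ln \beta_1$. If you are not familiar with this transformation, please make sure that you revise the algebra in a basic textbook; topics of this kind are usually covered in an appendix.

This model is also linear in the parameters $\alpha$ and $\beta_2$. We may view the model as

$$Y_i^* = \alpha + \beta_2 X_i^* + u_i \qquad (1.8)$$

where $Y_i^* = \ln Y_i$ and $X_i^* = \ln X_i$.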

This model is known by various names – logarithmic, double log, log-log, log-linear

and constant elasticity – and is frequently used in applied work to characterise the

form of the functional relationship between the variables. It has the property that the
slope coefficient $\beta_2$ measures the elasticity of $Y$ with respect to $X$ because

$$\beta_2 = \frac{d \ln Y}{d \ln X} = \frac{dY / Y}{dX / X} \qquad (1.9)$$

Again, if you do not feel comfortable with this formulation, please make sure that

you refresh the basic mathematical tools.
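The constant-elasticity property can also be checked numerically. The Python sketch below is a minimal illustration with hypothetical values: it takes a model in which $Y$ is proportional to $X$ raised to the power $\beta_2$ (with the disturbance switched off so that the arithmetic is exact), and confirms that the slope in logarithms equals $\beta_2$ everywhere and that a 1% rise in $X$ raises $Y$ by roughly $\beta_2$ per cent.

```python
import math

# Hypothetical values for illustration only.
BETA_1 = 3.0   # scale factor
BETA_2 = 0.8   # elasticity of Y with respect to X


def y_of(x):
    """Constant-elasticity relationship without a disturbance."""
    return BETA_1 * x ** BETA_2


xs = [1.0, 2.0, 5.0, 10.0, 50.0]
log_x = [math.log(x) for x in xs]
log_y = [math.log(y_of(x)) for x in xs]

# The slope between consecutive points is the same everywhere in log-log space.
log_slopes = [
    (log_y[i + 1] - log_y[i]) / (log_x[i + 1] - log_x[i])
    for i in range(len(xs) - 1)
]

# Percentage change in Y after a 1% increase in X, starting from X = 10.
x0 = 10.0
pct_change_y = (y_of(x0 * 1.01) / y_of(x0) - 1) * 100

print("slopes in logs:", [round(s, 6) for s in log_slopes])
print("% change in Y for a 1% change in X:", round(pct_change_y, 3))
```

In levels the slope of this curve changes with $X$; it is only after taking logarithms of both variables that the relationship becomes a straight line with slope $\beta_2$.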



2.3 Correlation versus regression analysis

Although regression analysis is related to correlation analysis, conceptually these two

types of analysis are very different.

The main aim of correlation analysis is to measure the degree of linear association

between two variables and this is summarised by a value, the correlation

coefficient.

The two variables are treated symmetrically:

both are considered random

there is no distinction between dependent and explanatory variables

there is no implication of causality in a particular direction from one variable to

the other.

Regression analysis, on the other hand, can deal with relationships between two or

more variables and the variables are not treated symmetrically:

the dependent and explanatory variables are carefully distinguished

the former is random whereas the latter are often assumed to take the same

values in different samples – often referred to as ‘fixed in repeated samples’

the underlying economic theory implies that $X$, an explanatory variable,
‘causes’ or ‘determines’ $Y$, the dependent variable

moreover, with more than one explanatory variable, regression analysis

quantifies the influence of each explanatory variable on the dependent

variable.

It is important to note that the regression of $Y$ on $X$ does not give the same
sample regression line as the regression of $X$ on $Y$.

The appropriate direction of causality is determined by the modeller according to a

priori reasoning, based on theory or common sense.
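The asymmetry of regression, in contrast to the symmetry of correlation, can be illustrated as follows. This is a Python sketch on simulated data with hypothetical parameter values, using least-squares slopes. If the two regressions traced out the same line, the slope of $X$ on $Y$ would be the reciprocal of the slope of $Y$ on $X$; instead, the product of the two slopes equals the squared correlation coefficient, which is below one whenever the relationship is not exact.

```python
import math
import random

random.seed(3)

# Simulated data with hypothetical parameters.
x = [random.uniform(0, 10) for _ in range(200)]
y = [1.0 + 2.0 * xi + random.gauss(0, 4) for xi in x]


def sums_of_squares(a, b):
    """Return the sums of squared and cross deviations from the means."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return s_aa, s_bb, s_ab


s_xx, s_yy, s_xy = sums_of_squares(x, y)

b_yx = s_xy / s_xx                    # slope from regressing Y on X
b_xy = s_xy / s_yy                    # slope from regressing X on Y
r = s_xy / math.sqrt(s_xx * s_yy)     # correlation: symmetric in X and Y

print("slope of Y on X:", round(b_yx, 3))
print("slope of X on Y:", round(b_xy, 3), "reciprocal of Y-on-X slope:", round(1 / b_yx, 3))
print("product of slopes:", round(b_yx * b_xy, 3), "r squared:", round(r ** 2, 3))
```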

2.4 Data and regression

Regression methods allow us to investigate associations between variables, but the

justification for these relationships comes from theory. Relationships have to be

meaningful and whether they are or not depends on theoretical arguments.

This does not mean, however, that data play only a passive role in economic

analysis. Empirical investigation is an active part of theoretical analysis inasmuch as

it involves testing theoretical hypotheses against the data as well as, in many

instances, providing clues and hints towards new avenues of theoretical enquiry.

Theoretical insights have to be translated into empirically testable hypotheses that

we can investigate with observed data. Hence, theory and data are interactive:

theoretical propositions should be continually tested empirically and theoretical

insights can be improved with the aid of signals from the data.



Most of the data we use in applied economic analysis are not obtained from

experiments but are the result of surveys and observational programmes.

For example: national income accounts, agricultural and industrial surveys,

financial accounts, employment surveys, population census data, household

budget surveys, and price and income data, that are collected by various

statistical offices.

They are records of unplanned events; they are not the outcome of experiments. The

nature of economic data makes an econometrician’s work quite different from that of

a psychologist or an agricultural scientist.

In the latter cases, experiments play a central role in empirical research, and much

emphasis is put on the careful design of experiments in order to single out the

‘stimulus-response’ relationship between two variables whilst controlling for the

influence of other variables (that is, by holding them constant).

In economics, the scope for experimentation is very limited. We cannot change the

price of a commodity, holding incomes and all other prices constant, just to see what

would happen to the demand for it. In economic theory, we assume that ‘other

things are equal’ (ceteris paribus) and focus on cause and effect between the

remaining variables. But in empirical analysis other things are never equal, and we

have to observe the behaviour of economic agents from survey data. Multiple

regression techniques allow us to ‘account’ for the influence of other variables whilst

investigating the interaction between two key variables, but this is not the same as

‘holding other variables constant’.
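What ‘accounting for’ another variable means statistically can be sketched with simulated data. In the Python illustration below, all names and numbers are hypothetical: the dependent variable depends on two correlated explanatory variables, `x1` and `x2`. A simple regression on `x1` alone attributes part of the influence of `x2` to `x1`. Removing from `x1` the part that is linearly related to `x2`, and regressing on what is left, gives the same slope as the multiple regression coefficient on `x1` (a standard least-squares result). Note that this is a statistical adjustment applied to observed data, not an experimental control.

```python
import random

random.seed(11)
n = 500

# Hypothetical data: x1 and x2 are correlated, and both influence y.
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.8 * v + random.gauss(0, 1) for v in x2]
y = [1.0 + 2.0 * a + 3.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]


def ols(x, y):
    """Least-squares intercept and slope for a two-variable regression."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Simple regression of y on x1 alone: picks up part of the effect of x2.
_, simple_slope = ols(x1, y)

# Account for x2: keep only the variation in x1 that is unrelated to x2.
a0, a1 = ols(x2, x1)
x1_net = [a - (a0 + a1 * b) for a, b in zip(x1, x2)]
_, adjusted_slope = ols(x1_net, y)

print("slope ignoring x2:     ", round(simple_slope, 3))
print("slope accounting for x2:", round(adjusted_slope, 3))
```

In this simulation the adjusted slope is close to the value 2.0 used to generate the data, whereas the simple slope is noticeably larger because `x1` is acting partly as a stand-in for `x2`.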

A careful observer uses data not just to confirm his or her theories, but also to get

clues from empirical analysis to advance his/her theoretical grasp of a problem. It is

primarily this aspect that enables data to contribute to the process of analysis.

2.5 A word of caution on notation

You will notice that terminology and notation may be different from the textbook in

some instances. Although these differences are inconvenient, it is an unfortunate fact

that terminology and notation are not wholly standardised amongst econometricians.

For example, we maintain the ‘hat’ to denote estimators, eg $\hat{\beta}_2$ instead of $b_2$.

You should not be alarmed by these discrepancies; we invite you to focus on the

concepts and ideas rather than on the technicalities. This material is specifically

designed to keep technical contents to a minimum, and our aim is for everyone to

enjoy the course independently of their mathematical knowledge.



Section 2 Self-Assessment Question

Question 1

What are the links between econometrics and both economic theory and mathematical economics?

3.0 EXAMPLE: THE KEYNESIAN CONSUMPTION FUNCTION

We shall now illustrate several phases in the methodology of econometrics with an

example, the Keynesian consumption function.

Statement of the theory

The Keynesian theory of consumption is the basis of our model of consumption

expenditure. This theory states that real consumption expenditure depends on real

disposable income, other things held constant. When income rises, consumption

expenditure rises, but changes in consumption expenditure are less than the change

in income. Also, as income rises, the average propensity to consume, that is,

consumption per unit of income, falls.

Mathematical model of the theory

Suppose we represent the Keynesian consumption function as a linear relationship

$$Y_t = \beta_1 + \beta_2 X_t \qquad (1.10)$$

where $Y_t$ is real consumption expenditure, $X_t$ is real disposable income, $\beta_1$ is a constant and $\beta_2$ is the slope of the consumption function, ie the marginal propensity to consume out of disposable income. Because of our a priori expectations concerning the average and marginal propensities to consume, we expect $\beta_1 > 0$ and $0 < \beta_2 < 1$. (Note that the average propensity to consume is $Y_t / X_t = (\beta_1 + \beta_2 X_t) / X_t = \beta_1 / X_t + \beta_2$. For this to fall as income rises, we need $\beta_1 > 0$.)

Econometric model of the theory

The econometric model is stochastic. It includes a random disturbance, $u_t$, which captures the influence of all the other variables that may influence consumption expenditure.

$$Y_t = \beta_1 + \beta_2 X_t + u_t \qquad (1.11)$$

Collection of data

The data to be used are annual time-series data for the UK covering the period

1955–1991. They are aggregate consumption expenditure and personal disposable

income both measured in £(1985) million. The source of the data is the Economic

Trends Annual Supplement 1991 (ONS, 1991). Thus, our model represents a theory

about the behaviour of aggregate consumption over time. A scatter plot of these

data is given in the figure in 3.1.1.

It is obvious from this scatter plot that the relationship is upward sloping and it

seems to be reasonably linear.

3.1.1 Scatter plot of aggregate consumption expenditure ($Y$) and personal disposable income ($X$)

Source: unit author using data from Economic Trends Annual Supplement (ONS, 1991)

Parameter estimation

Using these data the parameters $\beta_1$ and $\beta_2$ can be estimated to obtain the average relationship between $Y$ and $X$. Just how the coefficients of the population regression function are estimated will be explained in detail in later units. The consumption function estimated with our data is

$$\hat{Y}_t = 3952 + 0.889 X_t \qquad (1.12)$$

and this represents the average relationship between consumption expenditure and

personal disposable income.

The estimated value of $\beta_1$ is 3952 and of $\beta_2$ is 0.889.

Consequently if personal disposable income increases by £1 million, consumption

expenditure increases on average by £0.889 million.



In this case, the interpretation of the intercept is not so meaningful. Mechanical

interpretation of the estimate tells us that consumption expenditure is £3952 million

if aggregate personal disposable income is zero. However, this is not particularly

helpful because if aggregate personal disposable income is zero then the economy

would be in chaos and the Keynesian theory of consumption expenditure would not

be appropriate. The fact is that, in our sample, the $X$-values are a long way from

zero, and we really have no idea what the consumption function might look like at

low levels of income.

Alternatively, some explain the value of the intercept as the average effect on consumption of all the variables omitted from the model.

Tests of the hypothesis

Do the results conform to the theory of the consumption function?

With our theory we expect $\beta_1 > 0$ and $0 < \beta_2 < 1$.

Is each of these hypotheses supported by the results? Clearly, our estimates are

consistent with what we expected to obtain. For a discussion of formal hypothesis

tests, we must wait until the corresponding units.

Prediction

We can use the estimated model to predict what consumption expenditure would be

if personal disposable income were a particular amount. Suppose personal disposable

income was £250 000 million. The predicted amount of consumption expenditure is

$$\hat{Y} = 3952 + 0.889(250\,000)$$

$$\therefore \hat{Y} = 226\,202$$

That is, consumption expenditure is predicted to be £226 202 million if disposable

income is £250 000 million.
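The prediction, and the behaviour of the average propensity to consume, can be reproduced with a few lines of code. The Python sketch below uses the two estimates reported in the text (3952 and 0.889); the function names and the income levels other than £250 000 million are hypothetical choices made for illustration.

```python
# Estimates reported in the text, in £(1985) million.
B1_HAT = 3952.0   # estimated intercept
B2_HAT = 0.889    # estimated marginal propensity to consume


def predict_consumption(income):
    """Predicted consumption expenditure at a given disposable income."""
    return B1_HAT + B2_HAT * income


def average_propensity(income):
    """Predicted consumption per unit of income."""
    return predict_consumption(income) / income


print("predicted consumption:", round(predict_consumption(250_000)))

# Illustrative income levels: the average propensity falls as income rises
# but stays above the marginal propensity because the intercept is positive.
for income in (100_000, 250_000, 400_000):
    print(income, round(average_propensity(income), 4))
```

The same caution applies here as to the intercept: predictions at income levels far outside the range observed in the sample should not be relied on.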



UNIT SUMMARY

In this unit we have introduced some basic ideas on econometrics and regression

analysis. The most important points to remember are the following.

Econometrics is the application of statistical and mathematical methods to the

analysis of economic data, with the purpose of giving empirical content to

economic theories and testing them against ‘reality’.

The econometrician’s approach differs from that of the economic theorist

because:

- we cannot ‘hold other things constant’ in empirical analysis

- the random nature of relationships between variables means that the

results and conclusions of empirical analysis always contain an element of

uncertainty

- there is a discrepancy between theoretical variables and observed data in

terms of coverage and precision of measurement

- econometricians cannot avoid explicit assumptions about the time frame of

their model, since the data they use have been generated in a ‘real-time’

context.

Regression analysis is the statistical basis of econometric theory and practice.

Its aim is to quantify relationships between variables, especially between

variables whose relationship is subject to chance variation.

Regression involves finding an average line that summarises the relationship

whereby $Y$ depends on $X$ in the midst of random variation and uncertainty of

outcome.

The randomness inherent in conclusions and outcomes based on regression

analysis is formally modelled by introducing a disturbance term into our

behavioural equations. This is a stochastic variable which is not observable.

However, the residuals of a sample regression function may provide us with an

indication as to the behaviour of these unknown disturbances.

Regression allows us to investigate the association between variables, but it

cannot ‘discover’ causality between them. To establish causality we need to

resort to economic theory.

Empirical work in economics cannot rely on experimentation. Econometric

analysis is therefore based on careful observation of data drawn from within a

context which we do not control.

In terms of practical skills, this unit requires that:

you are familiar with the scatter plot as a practical tool of empirical analysis

you know how to load data into R software from a pre-existing data file

you know the R software commands to obtain a summary of descriptive

statistics of a variable, make a scatter plot and create logarithms of variables.



UNIT SELF-ASSESSMENT QUESTIONS

To answer the Unit Self-assessment Questions, you will need to use R software.

Please refer to the R Guide for Unit 1 (available on your e-study guide) and follow

the instructions given.

Question 1

The data file u1q2.txt contains annual time-series data for the United States over the period 1959–1991 on aggregate consumption expenditure, $Y$, and disposable income, $X$, both measured per head of population and in billions of constant 1987 dollars. (Source: Economic Report of the President, 1992: table B-5, p. 305).

(a) Use R software to produce a scatter plot of $Y$ on the vertical axis and $X$ on the horizontal axis. Comment on the scatter plot: would a linear regression seem appropriate?

(b) Use R software to obtain time-series plots of $Y$ and $X$. Describe the way consumption and income have moved over the period 1958–1991.

Question 2

The hypothesis that foreign direct investment is determined by demand suggests

that foreign direct investment and gross domestic product are positively related,

other variables remaining constant. The data file u1q3.txt contains annual time-

series data for the period 1958–1985 on foreign direct investment, FDI, and gross

domestic product, GDP, for Taiwan. (Source: Tsai 1991: Table A-1, p. 285).

Use R software to obtain scatter plots of FDI on GDP and of the logarithm of FDI on

the logarithm of GDP, both for the period 1958–1985. Comment on the two scatter
plots. Remember that a log transformation is used in empirical research to reduce
the ‘noise’ (stored in the disturbances) associated with higher values of $X$ relative to
that associated with lower $X$-values.

Which of the following would you expect to be the more appropriate linear regression

model:

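(a) $\text{FDI}_t = \beta_1 + \beta_2 \, \text{GDP}_t + u_t$

or

(b) $\ln \text{FDI}_t = \beta_1 + \beta_2 \ln \text{GDP}_t + u_t$ ?

Question 3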

The data file u1q4.txt contains cross-section data from a sample of 100 rural

households on the value of their consumption and income during a given month.

Income ($X$) includes cash income from all sources during the month concerned, plus
the (imputed) market value of own production consumed by the household.
Consumption ($Y$) includes the value of all purchased items, plus the value of own
production consumed by the household. The units are measured in the local currency
rounded to the nearest whole number.

(a) Obtain the scatter plot of $Y$ on $X$. What is the main difference between this
scatter plot and the one constructed in Question 1?

(b) Use R software to obtain the histograms of $Y$ and $X$. Income has the usual

positively skewed distribution that we would expect, whereas the distribution of

consumption is less skewed. Can you suggest a reason for this?

(c) Use R software

(i) to obtain the average propensity to consume at the sample means

(ii) to compare the degree of skewness of the two variables

(iii) to obtain their correlation coefficient.

You may remember that the correlation coefficient (or Pearson’s coefficient) is defined as:

$$r = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}}$$

It measures the degree of linear dependence between the two variables, it varies

between −1 and 1 and it is invariant to the units of measurement (eg $, thousands $

etc). A positive (negative) coefficient indicates that the variables are positively

(negatively) correlated, with a zero-value coefficient indicating that they are

uncorrelated random variables.
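These properties can be verified on simulated data. The Python sketch below is an illustration only (the exercise itself is to be done in R with the data file provided): it computes the correlation coefficient as the covariance divided by the product of the standard deviations, which is the standard Pearson definition, and checks that rescaling one variable, for example from units to thousands of units, leaves the coefficient unchanged. The data and parameter values are hypothetical.

```python
import math
import random

random.seed(5)

# Hypothetical household data, for illustration only.
income = [random.uniform(100, 1000) for _ in range(100)]
consumption = [50 + 0.7 * v + random.gauss(0, 60) for v in income]


def pearson_r(x, y):
    """Correlation coefficient: covariance over the product of standard deviations."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(income, consumption)

# Measuring income in thousands changes the units but not the correlation.
income_thousands = [v / 1000 for v in income]
r_rescaled = pearson_r(income_thousands, consumption)

print("r:", round(r, 4), "r after rescaling income:", round(r_rescaled, 4))
```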

Question 4

Weekly earnings can vary considerably in the case of casual dock labourers recruited

on a day-to-day basis. There are differences between workers as well as across

weeks. Weekly earnings will vary from week to week depending on the activity of the

harbour which determines the demand for labour. Daily recruitment will be high if

demand is high, and vice versa. Earnings also vary between workers in any given

week. These depend on the numbers of days a worker manages to get recruited for

in a particular week, on whether he or she is recruited for the day shift or the night

shift, and on the number of hours of overtime he or she works in that week.

In this exercise you will look at data on the weekly earnings of casual workers, ECAS,

and the recruitment of casual workers, CASREC. The data file u1q5.txt contains

paired observations on the two variables ECAS and CASREC. The data were taken

from a field study carried out in 1980/1981 by the Centre of African Studies in

Mozambique (Eduardo Mondlane University, Maputo) on casual labour on the docks

of Maputo harbour. The earnings data are in units of 100 MT, the local currency

being the Metical.

(a) Using R software, calculate the means, standard deviations, and minimum and

maximum values for both variables.

(b) A particular worker is randomly chosen from the labour force in a particular

week in 1980/1981 that is also randomly chosen. Using the information in your

answer to part (a), what is your best estimate of the weekly earnings of this

randomly selected worker?

(c) With R software, obtain the scatter plot of ECAS against CASREC. Write down

what you observe.


KEY TERMS AND CONCEPTS

cross-section data Type of data collected by observing a number of individuals,

countries, firms etc at the same point in time.

disturbances The unobserved random component that explains the

difference between observed and predicted values of $Y$.

econometrics The discipline that investigates economic data and

relationships using statistical techniques.

linear regression The statistical methodology that models the relationship

between a dependent variable and one or more explanatory

variables. In linear regression, this relationship (or function) is

assumed to be linear. Other methodologies can deal with

other functional forms, eg exponential, quadratic.

non-experimental data

Type of data that are not compiled as a result of experiments.

Other factors in the model are assumed to be fixed, although

the researcher cannot actually hold them fixed as he could do

in an experimental context.

time-series data Type of data collected by observing the same individual,

country, firm etc over a period of time.
