
PROBABILITY AND SAMPLING THEORY

MSc. in MTF, QF & FM Induction Programme Lecture Notes (Academic Year 2010-2011)

Giovanni Urga Professor of Finance and Econometrics

Faculty of Finance, Cass Business School, London (UK) & Director, Centre for Econometric Analysis (CEA@Cass)

and Professor of Econometrics

Dipartimento Scienze Economiche, Bergamo (Italy)


Contents

0.1 Random Variables and Probability Distributions: A Review
    0.1.1 Introduction
    0.1.2 Probability theory
    0.1.3 Random Variables and Probability Distribution
    0.1.4 Mathematical Expectation and Moments
    0.1.5 Some Special Distributions

0.2 Sampling Theory
    0.2.1 Introduction
    0.2.2 Methods for Finding Point Estimators
    0.2.3 Properties of Point Estimators
    0.2.4 Interval estimation and hypothesis testing

0.1 Random Variables and Probability Distributions: A Review

0.1.1 Introduction

We plan to revise the foundations of probability and statistical theory with a view to applying them to econometric methodology.

Because econometrics generally deals with the study of several unknown parameters, the discipline uses statistical techniques to develop and apply tools for

a.) estimating economic/financial relationships

b.) testing hypotheses involving economic behaviour

c.) forecasting the behaviour of economic variables.

We will stress the relationship between ECONOMIC/FINANCIAL THEORY/THEORIES and ECONOMETRIC METHODS later on in the programme. At this stage we focus on the relationship between ECONOMETRICS and STATISTICS.

An econometrician typically formulates a theoretical model that is a framework for analysing economic behaviour using some underlying logical structure. For the time being we do not want to know how this model arose.

What we want to stress is the fact that a well-specified model would be a reasonable approximation to the actual process that generates the observed data. This process (known as the Data Generating Process or DGP) would involve the interaction of behaviour amongst numerous economic agents.

Once the model has been formulated and the data gathered, the next step in an empirical study is

to estimate the unknown parameters of the model, and

to subject the model to a variety of diagnostic tests to make sure that one obtains robust conclusions, that is conclusions that are not sensitive to model specification.

To achieve this goal, an investigator may have to reformulate the models and perhaps use alternative techniques to estimate them. Methods of hypothesis testing are useful not only at this diagnostic testing stage but also to test the validity of a body of theories.

Because economics/finance deals with non-experimental data generated by a complicated process involving the interaction of the behaviour of numerous economic and political agents, there is a great deal of uncertainty in the models and methods used by econometricians.

In particular, estimated relations are not precise, hypothesis tests may lead to the error of rejecting a true hypothesis or that of accepting a false one, and forecasts of variables often turn out to be far from their actual values. This uncertainty makes statistical methodology very important in econometrics.

The mathematical and statistical model that an econometrician formulates is meant to be a description of the DGP. The actual data, however, are treated as one of several possible realisations of events.

The theory of probability is useful to construct a framework that portrays the likelihood of one type of realisation or another. Not surprisingly, the probability framework invariably depends on a number of parameters. In order to estimate the parameters, one typically obtains a sample of observations and uses them in conjunction with a probability model.


STATISTICAL INFERENCE deals with the methods of obtaining these estimates, measuring their precision, and testing hypotheses on the parameters in the equations.

STATISTICS is a body of knowledge that is useful for collecting, organising, presenting, analysing, and interpreting data and numerical facts.

• DESCRIPTIVE STATISTICS: deals with the presentation and organisation of data. Measures of central tendency, such as the mean and median, and measures of dispersion, such as the standard deviation and range, are descriptive statistics.

• INFERENTIAL STATISTICS: deals with the use of sample data to infer, or reach, general conclusions about a much larger population. In statistics, we define a population as the entire group of individuals we are interested in studying. A sample is any subset of such a population.


We also encounter another dichotomy in statistical analysis: DEDUCTION vs. INDUCTION.

• DEDUCTION: is the use of general information to draw conclusions about specific cases.

The use of probability to determine the chance of obtaining a particular kind of sample result is known as deductive statistical analysis.

PROBABILITY THEORY: We will learn how to apply deductive techniques when we know everything about the population in advance and are concerned with studying the characteristics of the possible samples that may arise from that known population.

• INDUCTION: involves drawing general conclusions from specific information.

This means that on the strength of a specific sample, we infer something about a general population. The sample is all that is known. We must determine the uncertainty characteristics of the population from the incomplete information available. This strand of statistical analysis is called inductive statistical analysis.


0.1.2 Probability theory

Basic Probability

1. Sample space, sample points and events

The term probability is loosely used by many persons to indicate the measure of one's belief in the occurrence of an uncertain future event. There is no single definition of the term probability that is universally accepted. Various definitions will be introduced.

EXPERIMENTS (as simple as rolling a pair of dice or as complicated as conducting a large-scale survey of households or firms).

RANDOM EXPERIMENT: an experiment is random if it satisfies the following conditions:

1. all possible distinct outcomes are known ahead of time;

2. the outcome of a particular trial is not known a priori;

3. the experiment can be duplicated in principle under ideal conditions.

The totality of all possible outcomes of the experiment is referred to as the SAMPLE SPACE (denoted by S) and its distinct individual elements are called the SAMPLE POINTS or ELEMENTARY EVENTS.

An EVENT (denoted by A) is a subset of a sample space, that is, a set of sample points that represents several possible outcomes of an experiment.

The IMPOSSIBLE EVENT or NULL EVENT is denoted by ∅. A sample space with a finite or COUNTABLY INFINITE number of sample points (with a one-to-one correspondence to the positive integers) is called a DISCRETE SPACE.

A CONTINUOUS SPACE is one with an UNCOUNTABLY INFINITE number of points, i.e. it has as many elements as there are real numbers.

2. Some results from set theory

• The SAMPLE SPACE is denoted by S. A = S implies that the events in A must always occur. The EMPTY SET is a set with no elements and is denoted by ∅. A = ∅ implies that the events in A do not occur.

• The set of all elements not in A is called the COMPLEMENT of A and is denoted by Ac or Ā. Thus Ac occurs if and only if A does not occur.


• The set of all points in either A or a set B or both is called the UNION of the two sets and is denoted by ∪. A ∪ B means that either the event A or the event B or both occur. Note that A ∪ Ac = S.

• The set of all elements in both A and B is called the INTERSECTION of the two sets and is represented by ∩. A ∩ B means that both the events A and B occur simultaneously.

• A ∩ B = ∅ implies that A and B cannot occur together. A and B are then said to be DISJOINT or MUTUALLY EXCLUSIVE. Note that A ∩ Ac = ∅.

• A ⊂ B means that A is contained in B or that A is a subset of B, that is, every element of A is an element of B. If the event A has occurred, then B must have occurred also.

3. Probability: definitions and concepts

The probability of an event is defined in a number of ways, all of which are useful in calculating probabilities.

AXIOMATIC DEFINITION: The probability of an event A ∈ F is a real number P(A) such that

1. P(A) ≥ 0, ∀A ∈ F

2. The probability of the entire sample space S is 1, P(S) = 1

3. If A1, A2, ..., An are mutually exclusive events (i.e. Ai ∩ Aj = ∅ for all i ≠ j), then P(A1 ∪ A2 ∪ ... ∪ An) = Σi P(Ai), and this holds for n = ∞ also.

Note that the triple (S, F, P) is referred to as the PROBABILITY SPACE and P is a PROBABILITY MEASURE.

Although the axiomatic definition is rigorous, it does not directly tell us how to assign probabilities to elementary events. That is accomplished by two other definitions given below. All three definitions are then used to calculate probabilities of various events.

CLASSICAL DEFINITION: If an experiment has n (n < ∞) mutually exclusive and equally likely outcomes, and if the event A occurs in nA of these ways, then the probability of A is nA/n, denoted as P(A) = nA/n.


FREQUENCY DEFINITION: Let nA be the number of times the event A occurs in n trials of an experiment. If there exists a real number p such that

p = lim_{n→∞} (nA/n)

then p is called the probability of A and is denoted by P(A).
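As a quick illustration of the frequency definition, the following sketch (Python with NumPy; the event and the numbers of trials are chosen arbitrarily) shows the relative frequency nA/n settling down near P(A) as n grows:

    import numpy as np

    rng = np.random.default_rng(0)

    # Event A: "a six shows" when rolling a fair die, so P(A) = 1/6.
    for n in (100, 10_000, 1_000_000):
        rolls = rng.integers(1, 7, size=n)      # n independent trials
        print(n, (rolls == 6).mean())           # relative frequency n_A / n

The printed relative frequencies approach 1/6 ≈ 0.1667 as n increases.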

SUBJECTIVE PROBABILITY: In many instances, individuals use personal judgements to assess the relative likelihoods of various outcomes. For instance, one often hears the expression "the odds are 2 to 1 that candidate X will be elected".

This is an example of subjective (or personal) probability, which is based on "educated guesses" or intuition. In statistical inference, the practicability of this approach stems from using prior beliefs or new information in updating a previous model specification.

This provides the foundation for the Bayesian approach.

In sum, the axiomatic approach to probability theory is rich enough to include each of the interpretations of probability, and using the axioms (1)-(3) the following properties of probabilities can be proved:

P(A) ≤ 1

P(Ac) = 1 − P(A)

P(∅) = 0

If A ⊂ B, then P(A) ≤ P(B)

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Conditional Probability

There are cases when we wish to calculate the probabilities of events when it is known that another event has occurred. It would be useful to know how the probability of an event alters as a result of the occurrence of another event.

Definition: Let A and B be two events in a probability space (S, F, P) such that P(B) > 0. The conditional probability of A given that B will occur or has occurred, denoted by P(A | B), is given by

P(A | B) = P(A ∩ B)/P(B)

Note that we can re-write the expression above as:


P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A)

More generally, if

A1, A2, ..., An

are events such that

P(A1) > 0, P(A1 ∩ A2) > 0, ..., P(A1 ∩ A2 ∩ A3 ∩ ... ∩ An−1) > 0

then

P(A1 ∩ A2 ∩ A3 ∩ ... ∩ An) = P(A1)P(A2 | A1)P(A3 | A1 ∩ A2) ... P(An | A1 ∩ A2 ∩ ... ∩ An−1)

This definition of conditional probability leads to the definition of INDEPENDENCE OF EVENTS.

Independence of Events

Suppose that P(A | B) = P(A). This would imply that knowing that event B occurred has no effect on the chance of event A occurring. It would be reasonable then to say that events A and B are independent events. So, two events A and B are independent when:

P(A ∩ B) = P(A | B)P(B) = P(A)P(B)

Otherwise A and B are said to be dependent. Clearly, if the expression above is true, then P(B | A) = P(B) as well. This definition extends to more than two events; for example, events A, B, and C are independent if

P (A ∩B ∩ C) = P (A)P (B)P (C)

Bayes’ Rule

Let S denote the sample space of some experiment. The disjoint events

A1, A2, ..., An

are a partition of S if

A1 ∪ ... ∪ An = S


and P(Ai) > 0 for each i. If B is any event in S such that P(B) > 0, then it will be true that

B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ ... ∪ (An ∩ B)

Draw a Venn diagram to check!!!

Since the events Ai ∩ B are disjoint,

P(B) = Σ_{i=1}^{n} P(Ai ∩ B)

Also, if P(Ai) > 0, then

P(Ai ∩ B) = P(B | Ai)P(Ai)

so

P(B) = Σ_{i=1}^{n} P(B | Ai)P(Ai)

which is called the law of total probability.

The formulation of a partition of S means that when the experiment is performed exactly one of the events Ai will occur. Situations arise in which we would like to determine the probability of the event Ai given that the event B occurs. The solution to this problem is given by Bayes' theorem:

P(Ai | B) = P(Ai ∩ B)/P(B) = P(B | Ai)P(Ai)/P(B)

P(Ai | B) = P(B | Ai)P(Ai) / Σ_{j=1}^{n} P(B | Aj)P(Aj)
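A small numerical sketch of the law of total probability and Bayes' theorem (Python with NumPy; the partition probabilities P(Ai) and the conditional probabilities P(B | Ai) below are made-up numbers, purely for illustration):

    import numpy as np

    # A hypothetical partition A1, A2, A3 of S and an event B.
    p_A = np.array([0.5, 0.3, 0.2])           # P(Ai); must sum to 1
    p_B_given_A = np.array([0.9, 0.5, 0.1])   # P(B | Ai)

    p_B = np.sum(p_B_given_A * p_A)           # law of total probability
    posterior = p_B_given_A * p_A / p_B       # Bayes' theorem: P(Ai | B)

    print(p_B)
    print(posterior, posterior.sum())         # the posteriors sum to 1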

0.1.3 Random Variables and Probability Distribution

RANDOM VARIABLES

RANDOM VARIABLE: a real-valued function that assigns a number to each sample point (outcome) in the sample space of an experiment.

It is important to distinguish between the rule or function that assigns the numbers to each sample point and the numerical values themselves.


In general, we distinguish between the two by denoting random variables by upper-case letters, X, and the values of the random variable by lower-case letters, x. So P(X = x) or P(X < 3) denote the probability that the random variable X takes the value x or a value less than 3, respectively.

Discrete: a random variable whose set of all possible values is finite or countably infinite.

Continuous: a random variable that may assume all values in at least one interval of the real number line.

We will assume that we have a probability measure P defined on the sample space S associated with some experiment. It should not be surprising that we can use the assigned probability measure to make probabilistic statements about the possible values of a random variable. That is, we can construct a probability distribution or probability density function (p.d.f.) for X that will literally tell us how the probability mass (total mass = 1) is distributed (i.e. allocated or spread out) across the values x that X can take. This p.d.f. can be used, for example, to compute the probability that X takes particular values or falls in a given range.

PROBABILITY DISTRIBUTION FOR DISCRETE RANDOM VARIABLES:

For a discrete random variable X, a probability density function (or Probability Function) is defined to be the function f(x) such that for any real number x, which is a value that X can take, f(x) = P(X = x). Thus 0 ≤ f(x) ≤ 1.

If x is not one of the values that X can take, then f(x) = 0. Also, if the sequence x1, x2, ... includes all values that X can take, then Σ_{i=1}^{∞} f(xi) = 1.

A function closely related to the probability density function of a random variable is the corresponding DISTRIBUTION FUNCTION or CUMULATIVE DISTRIBUTION FUNCTION (c.d.f.).

The c.d.f. of a random variable X is defined for each real number x as F(x) = P(X ≤ x) for −∞ < x < +∞. That is, F(x) is the probability that the random variable X takes a value less than or equal to x.


For a discrete random variable X,

F(x) = Σ_{t ≤ x} f(t)

where this summation notation means summing the values of the p.d.f. f(t) over all values t that the random variable can take that are less than or equal to the specified value x.
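For instance, for a fair die (the example used later in these notes), the p.d.f. and c.d.f. can be tabulated as in the following sketch (Python, NumPy assumed):

    import numpy as np

    x = np.arange(1, 7)        # values the random variable can take
    f = np.full(6, 1 / 6)      # p.d.f.: f(x) = P(X = x) = 1/6
    F = np.cumsum(f)           # c.d.f.: F(x) = sum of f(t) over t <= x

    print(dict(zip(x.tolist(), F.round(4).tolist())))   # e.g. F(3) = P(X <= 3) = 0.5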


PROBABILITY DISTRIBUTION FOR CONTINUOUS RANDOM VARIABLES:

A continuous random variable X can take any value in at least one interval on the real number line. Suppose X can take all possible values a ≤ x ≤ b. Since the possible values of X are uncountable, the probability that X takes any particular value is zero! So, unlike the situation for discrete random variables, the p.d.f. of a continuous random variable, say f(x), will not give the probability that X takes the value x.

Instead, the p.d.f. f(x) of a continuous random variable X will be such that the area under f(x) gives the probabilities associated with the corresponding intervals on the horizontal axis. More specifically, a function with values f(x), defined over all real x, is a p.d.f. for the continuous random variable X if

(i) f(x) ≥ 0

(ii) ∫_{−∞}^{∞} f(x) dx = 1

(iii) P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx for any a, b such that −∞ < a < b < ∞

Graphically, P(a ≤ X ≤ b) is the area under f(x) between a and b.


The cumulative distribution function for a continuous random variable X is given by

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt for −∞ < x < ∞

where f(t) is the value of the p.d.f. at t. Note that F(−∞) = 0, F(∞) = 1 and, when a < b, F(a) ≤ F(b).

Furthermore, P(a ≤ X ≤ b) = F(b) − F(a) and f(x) = dF(x)/dx = F′(x) where the derivative exists.
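These relations can be checked numerically for a particular distribution, say the standard normal (a sketch in Python, assuming SciPy is available; the interval (a, b) is arbitrary):

    from scipy.integrate import quad
    from scipy.stats import norm

    a, b = -1.0, 2.0
    # P(a <= X <= b) two ways: integrating the p.d.f., and F(b) - F(a).
    p_integral, _ = quad(norm.pdf, a, b)
    p_cdf = norm.cdf(b) - norm.cdf(a)
    print(p_integral, p_cdf)     # both approximately 0.8186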

MULTIVARIATE DISTRIBUTIONS

So far we have been concerned only with a single random variable X and functions of X. In most cases, however, the outcome of an experiment may be characterized by more than one random variable. For instance, X may be the Income and Y the total Expenditure of a household. Thus we may observe a pair of random variables (X, Y). If family size Z is also added, we observe (X, Y, Z). We would thus be interested in the study of Multivariate Distributions.

Let us consider the bivariate case. If X and Y are discrete random variables, f(x, y) = P(X = x, Y = y), defined for each pair of values (x, y) that X and Y can take, is called the joint probability distribution.

A joint probability function of X and Y satisfies f(x, y) ≥ 0 for (x, y) in its domain and Σx Σy f(x, y) = 1.

If A and B are sets of values of x and y respectively, then P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} f(x, y) for (x, y) ∈ R², where R² is the real two-dimensional plane.

If X and Y are continuous random variables, then the bivariate function f(x, y), with values defined for (x, y) ∈ R², is a joint p.d.f. for X and Y if P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy for any region A ⊂ R² that represents an event.

The function f(x, y) must have the properties that f(x, y) ≥ 0 and

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

The corresponding joint cumulative distribution function for the continuous random variables X and Y is F(x, y) = P(X ≤ x, Y ≤ y) for (x, y) ∈ R². Analogous to f(x) = dF(x)/dx is f(x, y) = ∂²F(x, y)/∂x∂y (if the partial derivatives exist).


MARGINAL DISTRIBUTIONS

Let us consider the previous example. Suppose we simply want to know the probability distribution of X, say g(x). How can we use the joint p.d.f. for X and Y to obtain g(x)? The discrete random variable X can take the values 1 and 0, and

g(1) = P(X = 1) = f(1, 1) + f(1, 0) = b/T + d/T = Σ_{y=0}^{1} f(1, y)

g(0) = P(X = 0) = f(0, 1) + f(0, 0) = a/T + c/T = Σ_{y=0}^{1} f(0, y)

Or, more compactly, g(x) = Σy f(x, y), meaning that the p.d.f. of X, g(x), is obtained by summing the joint p.d.f. of X and Y over all values of Y for each value x of X. The p.d.f. g(x) is called a marginal distribution because if we obtain the row and column sums of f(x, y), and display them in the margins of the table, we have

                 Y = 1        Y = 0        Row total
    X = 1        b/T          d/T          (b + d)/T
    X = 0        a/T          c/T          (a + c)/T
    Column total (a + b)/T    (c + d)/T    1

So g(x) is given by the row totals. Similarly, the p.d.f. of Y, h(y), is given by the column totals and, in general, h(y) = Σx f(x, y). These probability density functions, obtained from the joint p.d.f., are the MARGINAL PROBABILITY DENSITY FUNCTIONS of X and Y.
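A sketch of these row and column sums in Python (NumPy assumed; the counts a, b, c, d are arbitrary illustrative numbers):

    import numpy as np

    a, b, c, d = 10, 20, 30, 40
    T = a + b + c + d

    # Joint p.d.f. f(x, y): rows are x = 1, 0; columns are y = 1, 0.
    f = np.array([[b, d],
                  [a, c]]) / T

    g = f.sum(axis=1)   # marginal of X (row totals): [(b + d)/T, (a + c)/T]
    h = f.sum(axis=0)   # marginal of Y (column totals): [(a + b)/T, (c + d)/T]
    print(g, h)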

If X and Y are continuous, the summation signs are replaced by integrals and we get

g(x) = ∫_{−∞}^{∞} f(x, y) dy,   −∞ < x < ∞

h(y) = ∫_{−∞}^{∞} f(x, y) dx,   −∞ < y < ∞

as the marginal probability density functions for X and Y.

CONDITIONAL DISTRIBUTIONS AND INDEPENDENT RANDOM VARIABLES


Previously we defined the conditional probability of event A given event B as P(A|B) = P(A ∩ B)/P(B) if P(B) > 0. If X and Y are discrete random variables, then

P(X = x | Y = y) = P(X = x, Y = y)/P(Y = y) = f(x, y)/h(y) = f(x|y)

provided P(Y = y) = h(y) ≠ 0, where f(x, y) is the joint p.d.f. of X and Y, h(y) is the p.d.f. of Y at y, and f(x | y) is the probability distribution of X given that Y is fixed at the value y. The function f(x | y) is called the conditional probability density of X given Y = y.

If X and Y are continuous random variables, then

f(x | y) = f(x, y)/h(y),   h(y) ≠ 0,   −∞ < x < ∞

is the conditional p.d.f. of X given Y = y.

Having defined the idea of a conditional p.d.f., we can now define the INDEPENDENCE OF RANDOM VARIABLES in the same way as we defined independent events: namely, if f(x | y) = g(x).

Since knowing that Y = y then does not affect the probability distribution of X, it is reasonable to say that X is independent of Y, and vice versa. If f(x|y) = g(x) holds, then

f(x, y) = f(x | y)h(y) = g(x)h(y)

We can then formally define X and Y to be statistically independent if

f(x, y) = g(x)h(y)

for all values (x, y) of (X,Y ) in the sample space.

If X1, X2, ..., Xn have joint p.d.f. f(x1, x2, ..., xn), then the random variables X1, ..., Xn are MUTUALLY STOCHASTICALLY INDEPENDENT if

f(x1, ..., xn) = f(x1) × f(x2) × ··· × f(xn)

Using this definition of stochastic independence, we can define a random sample. Suppose we repeat an experiment n times, and let X1, ..., Xn be random variables representing a certain measurement or value for those n trials. Then X1, ..., Xn constitute a random sample if they are independent and identically distributed. That is,


f(x1, . . . , xn) = f(x1)× · · · × f(xn)

and

f(x1) = f(x2) = · · · = f(xn) = f(x)

0.1.4 Mathematical Expectation and Moments

In order to study random variables and their probability distributions, it is useful to define the concept of mathematical expectation of a random variable and of functions of a random variable; these definitions are given in what follows.

A) EXPECTED VALUE OF A RANDOM VARIABLE

Consider the experiment of rolling a single die. Let X be the value that shows on the die. The probability distribution of X is f(x) = 1/6 for x = 1, 2, ..., 6.

What would the average value of X be if the experiment were repeated an infinite number of times? Intuitively, you would expect X = 1 on 1/6 of the throws, X = 2 on 1/6 of the throws, and so on. So, on average, the value of X is

1 × 1/6 + 2 × 1/6 + 3 × 1/6 + 4 × 1/6 + 5 × 1/6 + 6 × 1/6 = 3.5

If X is a discrete random variable and f(x) is its p.d.f., then the expected value of X is E[X] = Σx x f(x). This is a weighted average of the values that X can take, with the weights being the probabilities of each value's occurrence.
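The die calculation above, written out as a sketch in Python (NumPy assumed); the long-run average of simulated rolls approximates the same value:

    import numpy as np

    x = np.arange(1, 7)
    f = np.full(6, 1 / 6)
    print(np.sum(x * f))          # weighted average: E[X] = 3.5

    rolls = np.random.default_rng(1).integers(1, 7, size=1_000_000)
    print(rolls.mean())           # long-run average of repeated rolls, close to 3.5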

If X is a continuous random variable and f(x) is its p.d.f., then the expected value of X is

E[X] = ∫_{−∞}^{∞} x f(x) dx


The expected value of X, E[X], is also called the mean of X or the mean of the distribution of X. It is a measure of the average value of X and a measure of the centre of the probability distribution. In fact, E[X] can be regarded as the centre of gravity of the probability distribution. If a p.d.f. is symmetric with respect to a value x0 on the x-axis, then E[X] = x0 if it exists.

Note: Two other common measures of the centre of the distribution are the MEDIAN, which is the value of the random variable below which 50% of the probability mass falls, and the MODE, which is the value of the random variable at which its p.d.f. attains its maximum value.

B) EXPECTATION OF A FUNCTION OF A SINGLE VARIABLE

The mathematical expectation of a function of X, say g(X), is

E[g(X)] = Σx g(x)f(x) for discrete random variables, and

E[g(X)] = ∫_{−∞}^{∞} g(x)f(x) dx for continuous random variables.

Some results: If a and b are constants, E[aX + b] = aE[X] + b, which implies E[aX] = aE[X] and E[b] = b.


C) EXPECTATIONS OF FUNCTION OF SEVERAL RANDOM VARIABLES

Suppose Y = g(X1, ..., Xn) and let f(x1, ..., xn) be the joint p.d.f. of the continuous random variables X1, ..., Xn. Then

E[Y] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(x1, ..., xn) f(x1, ..., xn) dx1 ··· dxn

In the discrete case ∫ is replaced by Σ.

Some useful results:

(i) Let a0, ..., an be constants; then

E[a0 + a1X1 + · · ·+ anXn] = a0 + a1E[X1] + · · ·+ anE[Xn]

(ii) Let X1, ..., Xn be independent random variables such that E[Xi] exists; then

E[X1X2 . . .Xn] = E[X1]E[X2] . . . E[Xn]

D) MOMENTS OF A RANDOM VARIABLE

The rth moment of the random variable X about the origin is denoted μr and is E[X^r]. Note that for r = 1, μ1 = E[X] is the mean (also denoted μ).

Note: the term "moments" comes from the physical interpretation of mathematical expectations, e.g. the fact that μ1 = E[X] is the centre of gravity of the distribution.

The rth moment about the mean of X is μr = E[(X − μ)^r]. If r = 2, μ2 = E[(X − μ)²] is called the variance of the distribution of X (or the variance of X), often denoted σ² or var(X). The positive square root of σ², σ, is called the standard deviation of X.

It is useful to note that

σ² = E[(X − E(X))²] = E[X²] − (E[X])²

where σ² is the average squared distance between the value of X (a random variable) and E[X] when the experiment is repeated infinitely many times.

The larger the value of σ², the greater the average squared distance between the values of the random variable and E[X], and thus the more likely a value of X far from its mean is to occur.


We can represent this graphically for two random variables X and Y; let us suppose var(X) > var(Y).

The probability mass of X is more "spread out" than that of Y. The variance of X is a measure of the dispersion of the probability mass of X about its mean. The greater the value of var(X), the wider the dispersion of the values of X about its mean.

Product Moments:

(a) the rth and sth product moment about the origin of the random variables X and Y is given by μrs = E[X^r Y^s]

(b) the rth and sth product moment about the means of the random variables X and Y is given by

μrs = E[(X − E(X))^r (Y − E(Y))^s]

The covariance between two random variables is a measure of the association between them. Let μ11 = the covariance between X and Y, also denoted σxy or cov(X, Y), i.e.

σxy = E[(X − E(X))(Y − E(Y))]

σxy = E[XY] − E[X]E[Y]

Examining σxy and interpreting E[·] as "the average in infinitely many trials", we can see that cov(X, Y) > 0 if X and Y jointly tend to be greater than their means, or less than their means, on average. Cov(X, Y) can also be 0 or negative, with the obvious meaning.

To see that cov(X, Y) measures linear association, let us define a related measure, the CORRELATION between X and Y, ρxy, defined as

ρxy = cov(X, Y) / (√var(X) √var(Y))

• ρxy is a pure number and falls between −1 and +1.

• If cov(X, Y) = 0, then ρxy = 0.

• If X and Y are independent, then ρxy = 0.

• If Y = a + bX for some constants a and b (b ≠ 0), then |ρxy| = 1 and X and Y are said to be perfectly correlated. The more closely the values of X and Y are linearly related, the larger |ρxy|.
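A quick numerical sketch of the last two properties (Python with NumPy; large simulated samples are used, so the sample correlation only approximates the population ρxy):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=1_000_000)

    # Exact linear relation Y = a + bX: correlation is +1 or -1 (the sign of b).
    Y = 3.0 - 2.0 * X
    print(np.corrcoef(X, Y)[0, 1])    # approximately -1

    # Independent variables: correlation approximately 0.
    Z = rng.normal(size=1_000_000)
    print(np.corrcoef(X, Z)[0, 1])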

Some results:

(a) If a0, ..., an are constants and X1, ..., Xn are random variables, then

var(a0 + a1X1 + ... + anXn) = Σ_{i=1}^{n} ai² var(Xi) + 2 Σ_{i<j} ai aj cov(Xi, Xj)

(b) var(a0 + a1X1) = a1² var(X1)

(c) var(X1 ± X2) = var(X1) + var(X2) ± 2 cov(X1, X2)

Note: if the Xi are statistically independent, then the covariance terms drop out of (a) and (c).

COEFFICIENT OF VARIATION = σ/μ

SKEWNESS AND KURTOSIS:

[A]

If a continuous density f(x) has the property that f(μ + a) = f(μ − a) for all a (μ being the mean of the distribution), then f(x) is said to be SYMMETRIC AROUND THE MEAN. If a distribution is not symmetric about the mean, it is called SKEWED.

A commonly used measure of skewness is

E[(X − μ)³/σ³]

For a symmetric distribution this is zero.

[B]

The peakedness of a distribution is called KURTOSIS. A narrow distribution is called leptokurtic and a flat distribution is called platykurtic.

One measure of kurtosis is


E[(X − μ)⁴/σ⁴]

The value

E[(X − μ)⁴/σ⁴] − 3

is often referred to as EXCESS KURTOSIS.

S and K will be used in econometrics to characterize forecast errors and non-normality of the distributions of such errors.

S and K are defined on standardized variables.

E) CHEBYSHEV’S THEOREM OR INEQUALITY

It illustrates how the variance and standard deviation are related to the dispersion of the probability mass, and it is very useful for many theoretical proofs in statistics.

If X is a random variable with mean μ and variance σ², then

P(|X − μ| < kσ) ≥ 1 − 1/k²

or, equivalently,

P(|X − μ| ≥ kσ) ≤ 1/k²
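The inequality can be checked empirically for any distribution with finite variance; a sketch in Python (NumPy assumed, with an exponential population chosen arbitrarily and the sample mean and standard deviation standing in for μ and σ):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=2.0, size=1_000_000)
    mu, sigma = x.mean(), x.std()

    for k in (1.5, 2.0, 3.0):
        tail = np.mean(np.abs(x - mu) >= k * sigma)   # empirical P(|X - mu| >= k sigma)
        print(k, tail, "<=", 1 / k**2)                # Chebyshev bound 1/k^2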

F) EXPECTATIONS INVOLVING MULTIVARIATE R.V.

Let X be a vector of random variables, X = (X1, ..., Xn)′, where

E[Xi] = μi,   var(Xi) = σi²,   cov(Xi, Xj) = σij

Then

E[X] = (E[X1], ..., E[Xn])′ = (μ1, ..., μn)′ = μ

The expected value of a random vector is the vector of expectations.


We can define the covariance matrix of the random vector X as

Cov(X) = E[(X − E[X])(X − E[X])′]

i.e. the n × n matrix whose (i, j) element is E[(Xi − μi)(Xj − μj)]:

Cov(X) = [ var(X1)       cov(X1, X2)   ···   cov(X1, Xn)
           cov(X2, X1)   var(X2)       ···   cov(X2, Xn)
           ...           ...           ...   ...
           cov(Xn, X1)   cov(Xn, X2)   ···   var(Xn)     ]

       = [ σ1²   σ12   ···   σ1n
           σ21   σ2²   ···   σ2n
           ...   ...   ...   ...
           σn1   σn2   ···   σn² ]

The covariance matrix of X, often denoted Σx, is a POSITIVE SEMIDEFINITE MATRIX with the variances on the main diagonal and the covariances off the diagonal. If none of the random variables X1, ..., Xn is degenerate (i.e. none of them has zero variance) and no exact linear relationship between the Xi exists, then Σx is positive definite.

Let a′ = (a1, ..., an) be a vector of constants; then

E[a′X] = a′μ = a1μ1 + ··· + anμn

and

Var(a′X) = a′Σx a = Σ_{i=1}^{n} ai² σi² + 2 Σ_{i<j} ai aj σij

This result can be expressed as:

Var(a′X) = E[a′X − E[a′X]]²
         = E[(a′X − E[a′X])(a′X − E[a′X])′]
         = a′E[(X − E[X])(X − E[X])′]a = a′Σx a

Note: a′X is a (1 × 1) matrix (a scalar).
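A numerical sketch of E[a′X] = a′μ and Var(a′X) = a′Σx a (Python with NumPy; the mean vector, covariance matrix, and weights a are made-up values, and the simulation simply cross-checks the matrix formulas):

    import numpy as np

    rng = np.random.default_rng(4)
    mu = np.array([1.0, 2.0, 3.0])
    Sigma = np.array([[2.0, 0.5, 0.2],
                      [0.5, 1.0, 0.3],
                      [0.2, 0.3, 1.5]])     # a positive definite covariance matrix
    a = np.array([1.0, -2.0, 0.5])

    print(a @ mu, a @ Sigma @ a)            # E[a'X] and Var(a'X) from the formulas

    X = rng.multivariate_normal(mu, Sigma, size=1_000_000)
    Z = X @ a                               # the scalar random variable a'X
    print(Z.mean(), Z.var())                # close to the theoretical values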


Further, let P be an (m × n) matrix of constants with m ≤ n. Then Z = PX is an (m × 1) random vector. From previous results,

E[Z] = E[PX] = PE[X] = Pμ

and

Cov(Z) = Cov(PX) = P Σx P′

0.1.5 Some Special Distributions

In what follows, I'll introduce some special distributions that are commonly used in econometrics.

DISCRETE DISTRIBUTIONS:

• Bernoulli

• Binomial

• Multinomial

• Poisson

CONTINUOUS DISTRIBUTIONS:

• Normal

• Bivariate and Multivariate normal

• Gamma

• Chi-square

• t-distribution

• F-distribution

1. Bernoulli Distribution

Consider an experiment in which only two outcomes are possible (H or T). Let us code these outcomes as 0 and 1. Then X is said to have a Bernoulli distribution if it can only take the two values 0 and 1 and the probabilities are P(X = 1) = p and P(X = 0) = 1 − p with 0 ≤ p ≤ 1.

The p.d.f. of this random variable is

f(x|p) = p^x (1 − p)^(1−x) for x = 0, 1, and 0 otherwise


where the notation f(x|p) indicates that the p.d.f. of X depends on the parameter p. It is easy to check that E[X] = p and var(X) = p(1 − p).

This random variable arises in econometrics in choice models, where a decision maker must choose between two alternatives, like buying a car and not buying a car.

2. Binomial Distribution

If X1, ..., Xn are independent random variables each having a Bernoulli distribution with parameter p, then X = Σ_{i=1}^{n} Xi is a discrete random variable that counts the number of successes (i.e. Bernoulli trials with outcome Xi = 1) in the n trials of the experiment, and X has a binomial distribution.

This random variable has p.d.f.

f(x|n, p) = C(n, x) p^x (1 − p)^(n−x) for x = 0, 1, ..., n, and 0 otherwise

where C(n, x) = n!/(x!(n − x)!) is the number of possible combinations of n things taken x at a time.

This distribution has two parameters, n and p, where n is a positive integer and 0 ≤ p ≤ 1. Its mean and variance are

E[X] = Σ_{i=1}^{n} E[Xi] = np

Var(X) = Σ_{i=1}^{n} var(Xi) = np(1 − p)

A related random variable is Y = X/n, which is the proportion of successes in the n trials of an experiment. Its mean and variance are

E[Y] = p

Var(Y) = p(1 − p)/n
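A sketch using SciPy's binomial distribution (n and p are arbitrary illustrative values):

    from scipy.stats import binom

    n, p = 10, 0.3
    X = binom(n, p)

    print(X.pmf(3))                   # P(X = 3) = C(10, 3) 0.3^3 0.7^7
    print(X.mean(), n * p)            # E[X] = np = 3.0
    print(X.var(), n * p * (1 - p))   # Var(X) = np(1 - p) = 2.1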

3. Normal Distribution


It is the most important distribution in statistics. An important reason for the predominance of the normal distribution is the CENTRAL LIMIT THEOREM, which says that many important functions of the observations, like the sample mean, tend to be normally distributed given a large enough sample, no matter what the distribution of the original population. The Normal Distribution also has many nice mathematical properties that make it convenient to work with.

A continuous random variable X has a normal distribution with mean μ and variance σ² (−∞ < μ < +∞, σ² > 0), often denoted as X ∼ N(μ, σ²), if X has a p.d.f. of the form

f(x|μ, σ²) = (2πσ²)^(−1/2) exp{ −(x − μ)²/(2σ²) },  −∞ < x < +∞

This p.d.f. is "bell-shaped".

The p.d.f. is symmetric about the parameter μ, which is the mean, median, and mode of the distribution.

E[X] = μ

Var(X) = σ²

Also, the p.d.f. has points of inflection at x = μ ± σ.

Note: the p.d.f. of a normal random variable does not have an antiderivative in closed form, so probabilities must be computed numerically. The probabilities can be economically presented in one table using a standard form. To see this, we use the fact that a linear function of a normal random variable


is itself normal. Example: suppose X ∼ N(μ, σ²) and Y = aX + b, where a and b are constants and a ≠ 0. Then Y ∼ N(aμ + b, a²σ²).

Consequently,

Z = (X − μ)/σ ∼ N(0, 1)

is the Standard Normal random variable. If Φ(z) denotes the c.d.f. of Z evaluated at z, then probability statements about X can be made in terms of values of Φ(z), which are tabulated.

EXAMPLE: If X ∼ N(μ, σ²), then

P(a < X < b) = P((a − μ)/σ < Z < (b − μ)/σ) = Φ((b − μ)/σ) − Φ((a − μ)/σ)
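In practice Φ is evaluated by software rather than from tables; a sketch in Python (SciPy assumed, with illustrative values of μ, σ, a, b):

    from scipy.stats import norm

    mu, sigma = 100.0, 15.0
    a, b = 85.0, 130.0

    # P(a < X < b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)
    p = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
    print(p)     # approximately 0.8186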

4. Bivariate Normal Distribution (as a link between univariate and multivariate)

Let X1 and X2 be random variables with joint p.d.f.

f(x1, x2 | μ1, μ2, σ1², σ2², ρ) = (2π)^(−1) [σ1²σ2²(1 − ρ²)]^(−1/2) exp{−Q/2}

where −∞ < x1, x2 < +∞ and

Q = [1/(1 − ρ²)] [ ((x1 − μ1)/σ1)² − 2ρ ((x1 − μ1)/σ1)((x2 − μ2)/σ2) + ((x2 − μ2)/σ2)² ]

f(x1, x2) is the joint p.d.f. of the bivariate normal X1 and X2. The distributions of X1 and X2 are N(μ1, σ1²) and N(μ2, σ2²) respectively. The parameter ρ = cov(X1, X2)/(σ1σ2).

Properties:

• The CONDITIONAL DISTRIBUTION of X2 given X1 is also normal; specifically

f(x2|x1) ∼ N(b, σ2²(1 − ρ²))

where

b = μ2 + ρ(σ2/σ1)(x1 − μ1)

• The conditional mean, E[X2|X1 = x1], is a function of x1.


• The bivariate normal has the property that if ρ = 0, then X1 and X2 are statistically independent (which is not true of other pairs of random variables in general). To see this, note that if ρ = 0, then

f(x2|x1) = f(x2) ∼ N(μ2, σ2²)

5. Multivariate Normal Distribution

Suppose we have n independent and identically distributed standard normal random variables

Zi ∼ N(0, 1) for i = 1, ..., n

Since the Zi are independent, we can obtain their joint p.d.f. as

f(z1, ..., zn) = f(z) = Π_{i=1}^{n} f(zi)
= Π_{i=1}^{n} (2π)^(−1/2) exp{−zi²/2}
= (2π)^(−n/2) exp{−(1/2) Σ_{i=1}^{n} zi²}
= (2π)^(−n/2) exp{−(1/2) z′z}

where Z′ = (Z1, ..., Zn) is a vector of independent random variables and z is the corresponding vector of values.

6. χ² (chi-squared), t (Student's t), and F (Fisher's F) distributions

A] Relationship between the chi-squared (χ²) and the normal distribution

The square of a standard normal variable has a χ² distribution with 1 degree of freedom; that is, if Z ∼ N(0, 1) then Z² ∼ χ²(1).

Furthermore, if χ²(r1), χ²(r2), ..., χ²(rn) are mutually stochastically independent χ² random variables, then it can be shown that


Y = χ²(r1) + χ²(r2) + ... + χ²(rn) ∼ χ²(r1 + r2 + ... + rn)

So, if Z1, ..., Zn are independent standard normal random variables, then

Y = Z1² + Z2² + ... + Zn² ∼ χ²(n)
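A quick simulation sketch of this result (Python with NumPy and SciPy; n and the sample size are arbitrary):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(5)
    n = 5
    Z = rng.standard_normal(size=(1_000_000, n))
    Y = (Z ** 2).sum(axis=1)            # sums of n squared standard normals

    print(Y.mean(), Y.var())            # chi-square(n) has mean n and variance 2n
    print(np.mean(Y <= 9.24), chi2(n).cdf(9.24))   # empirical vs theoretical c.d.f.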

B] Student - t statistics

Let Z be a N(0, 1) random variable and let χ²(r) denote a χ² random variable with r degrees of freedom. Furthermore, let Z and χ²(r) be stochastically independent. Then

T = Z / √(χ²(r)/r) ∼ t(r)

is a t-distribution with r degrees of freedom.

The p.d.f. of T is symmetric about 0, E(T) = 0, and is bell-shaped but flatter than the N(0, 1) distribution. The variance of the random variable T is

Var(T) = r/(r − 2) for r > 2.

As r approaches ∞, the distribution of the random variable T can be closely approximated by that of a N(0, 1) random variable.

C] Fisher’s F statistics

Let χ²(r1) and χ²(r2) be independent χ² random variables with r1 and r2 degrees of freedom. Then

F = [χ²(r1)/r1] / [χ²(r2)/r2] ∼ F(r1, r2)

is an F-distribution with parameters r1 and r2 (called the numerator and denominator degrees of freedom respectively). The F-distribution is skewed to the right and non-negative since it is generated by two χ² random variables.


0.2 Sampling Theory

OUTLINE:

• INTRODUCTION

(Statistical model / inference: point vs interval estimates, hypothesis testing)

• METHODS FOR FINDING POINT ESTIMATORS

(Method of moments (MM), maximum likelihood (ML), least squares (LS))

• PROPERTIES OF POINT ESTIMATORS

- SMALL SAMPLE PROPERTIES OF ESTIMATORS (unbiasedness, bias vs precision, efficiency)

- LARGE SAMPLE PROPERTIES OF ESTIMATORS (consistency, convergence in distribution, asymptotic efficiency)

• INTERVAL ESTIMATION

• HYPOTHESIS TESTING


0.2.1 Introduction

Statistical Inference is the subject that deals with the problems associated with the estimation of the unknown parameters underlying statistical distributions, measuring their precision, testing hypotheses on them, using them in generating forecasts of random variables, and so forth.

Because it is expensive to measure all the elements of a population (the totality of elements about which some information is desired) underlying an experiment, an investigator often resorts to measuring attributes for a small portion of the population (known as a sample) and draws conclusions or makes policy decisions based on the data obtained.

The probability distribution of the variables is unknown, and we wish to use the data to learn about the characteristics of that unknown distribution. If the result of an experiment can be described by a single number and the experiment is repeated T times, the sample consists of the T random variables (Y1, ..., YT) = Y′. The joint probability density function (p.d.f.) of the sample Y is assumed to have a known mathematical form f(y | ϑ) but depends on one or more parameters (ϑ1, ..., ϑk) = ϑ′ that are known to fall in a set of possible values, Ω, called the parameter space. The investigator is always assumed to know the mathematical form of f(y | ϑ) and Ω, the set of possible values of ϑ.

This constitutes the statistical model for how the values of the random variables Y are obtained and is the basis for classical statistical inference. Given the statistical model, statistical inference consists of using the sample values for Y to specify plausible values for ϑ1, ..., ϑk (this is the problem of point estimation) or at least to determine a subset of Ω for which we can assert whether or not it contains the true value of ϑ (interval estimation or hypothesis testing).

Example: We may assume that the amount of expenditure by persons with an annual income of $20,000, from a certain population, follows a normal distribution, N(ϑ, σ²), with unknown location parameter ϑ and scale parameter σ². Parameter estimation then involves using sample data on the expenditures of persons with $20,000 income to make an inference about ϑ and σ².

INFERENCE:
POINT ESTIMATES
INTERVAL ESTIMATES (a range of values)

HYPOTHESIS TESTING: the problem of using the data in a way that provides evidence about a conjecture concerning the population.

Example (... continued): We may entertain the notion that the mean expenditure level of households with $20,000 annual income is $15,000. Using the assumption that the probability distribution of expenditures is normal, we then examine the data and make a judgement as to whether or not the data appear to be consistent with the conjecture.

0.2.2 Methods for Finding Point Estimators

There are many methods that can be used in order to derive estimators. I will briefly recall:
1) Method of Moments
2) Method of Maximum Likelihood
3) Method of Least Squares
[ 4) There is a Bayesian approach to estimation as well ]

What essentially is an estimator?

Let Y be a random variable with p.d.f. f(y|ϑ), where ϑ is a parameter that we would like to estimate. Let Y1, ..., YT be a random sample from this population; then an estimator of ϑ is a function or rule of the form

ϑ̂ = ϑ̂(Y1, ..., YT)

i.e. the estimator ϑ̂ is a function of Y1, ..., YT.

ϑ̂ is a random variable because it is a function of random variables. The "hat" (ˆ) indicates that it is an estimator, or rule for estimating, the parameter ϑ. When the values of the random variables are inserted, an estimate is produced, which is simply the realised value of the estimator ϑ̂.

Example (... continued): Expenditures by persons with an annual income of $20,000 are approximately normally distributed with an unknown mean ϑ but known variance, say σ² = 5,000². That is, if Y is the expenditure out of a $20,000 income by a person in this population, then

Y ∼ N(ϑ, σ² = 5,000²)


A common estimator of ϑ is the arithmetic mean

ϑ̂ = ϑ̂(Y1, ..., YT) = Σ_{i=1}^{T} Yi / T = Ȳ

So, where did the idea come from that the arithmetic mean of a sample of values can or should be used to estimate a population mean?

This is the subject of examining methods that may be used to obtain point estimators.

1) METHOD OF MOMENTS

We defined the rth moment of a random variable Y about the origin as μr = E[Y^r].

If the p.d.f. of Y is f(y, ϑ), where ϑ′ = (ϑ1, ..., ϑk) is a vector of unknown parameters, then in general μr will be a known function of ϑ, say μr = μr(ϑ).

The idea of the method of moments is to use a random sample of data, Y1, ..., YT, to compute the sample moments

μ̂r = Σ_{i=1}^{T} Yi^r / T,   r = 1, ..., K

and then to equate the sample and the true moments, μ̂r = μr(ϑ), and solve the resulting system of K equations (if possible) for the unknown parameters. The resulting estimator, ϑ̂MM, is the method of moments (MM) estimator.

Example:

Let Y1, ..., YT be a random sample from a N(ϑ, σ²) population. Recall that E(Y) = ϑ = μ1 and σ² = E(Y²) − (E[Y])² = μ2 − (μ1)². Then, equating the sample to the population moments,

μ̂1 = Σ_{i=1}^{T} Yi / T = Ȳ = ϑ̂

μ̂2 = Σ_{i=1}^{T} Yi² / T = σ̂² + ϑ̂²

so


σ̂² = (Σ_{i=1}^{T} Yi² / T) − Ȳ² = (1/T) Σ_{i=1}^{T} (Yi − Ȳ)²
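A sketch of these MM estimates computed from simulated data (Python with NumPy; the true parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    y = rng.normal(loc=10.0, scale=3.0, size=5_000)   # sample from N(10, 3^2)

    m1 = y.mean()              # first sample moment
    m2 = (y ** 2).mean()       # second sample moment

    theta_mm = m1              # MM estimate of the mean
    sigma2_mm = m2 - m1 ** 2   # MM estimate of the variance
    print(theta_mm, sigma2_mm) # close to 10 and 9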

Note: The MM estimators need not be unique, and the method can be difficult to apply in complicated problems. It also depends on the random variable in question having moments, which is not always the case.

In fact, a more powerful and general estimation method is the Maximum Likelihood (ML) estimation method, which we introduce in the following section.

2) THE METHOD OF MAXIMUM LIKELIHOOD

Let Y be a Bernoulli random variable with parameter p, which we know, for some reason, can only take the values 1/4 or 3/4. Suppose we have a random sample of size T = 3 with values y1 = 1, y2 = 1, y3 = 0.

The question is how to use these data to estimate the unknown parameter p, which for this problem boils down to choosing either p = 1/4 or p = 3/4.

The method of maximum likelihood (ML) will choose the value of the unknown parameter (p) that maximizes the probability (likelihood) of randomly drawing the sample that was actually obtained.

Since f(y|p) = p^y (1 − p)^(1−y) for y = 0 or 1, we can calculate the probability of our random sample from the joint p.d.f. of y1, y2, y3. It is

f(y1 = 1, y2 = 1, y3 = 0) = f(1, 1, 0) = Π_{i=1}^{3} p^(yi) (1 − p)^(1−yi) = p · p · (1 − p) = p²(1 − p)

So we can interpret this function as a function of p given the sample observations. When we do so, it is called a likelihood function and written l(p|y).

Note: This function is mathematically identical to the joint p.d.f. of the random sample, but it is interpreted as a function of the unknown parameters instead of the values of the random variables, which are assumed known (observed).

For making inferences or decisions about p after the sample values are observed, all relevant sample information is contained in the likelihood function.


Then

l(1/4 | y) = (1/4)²(3/4) ≈ 0.047

l(3/4 | y) = (3/4)²(1/4) ≈ 0.141

thus the probability (or likelihood) of obtaining the sample values we actually have is maximized by choosing p = 3/4 rather than p = 1/4 (these are the only choices), and thus p̂ = 3/4 is the maximum likelihood estimate of p in this problem. It is, in the sense described, the value of p most likely to have generated our sample.

Alternatively, we can find the value of p that maximizes l(p|y) by using calculus techniques.

A local maximum of a continuous function occurs where its slope is zero and its second derivative is negative.

Here l(p|y) = p²(1 − p), so

dl(p|y)/dp = 2p − 3p²

d²l(p|y)/dp² = 2 − 6p

Setting the first derivative to zero and solving, we see there are two solutions, p = 0 and p = 2/3.

p = 0 is ruled out since it clearly does not maximize the likelihood function: if p = 0 we could not have drawn the sample, since P(Y = 1) = p and thus obtaining a value y = 1 would be impossible.


p = 2/3 satisfies both the first- and second-order conditions for a local maximum and is the maximum likelihood estimate of p.
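The same answer can be found numerically by evaluating the log-likelihood over a grid of p values; a sketch in Python (NumPy assumed, with p restricted to (0, 1)):

    import numpy as np

    y = np.array([1, 1, 0])                     # the observed sample

    def loglik(p):
        # log of l(p|y) = prod p^y_i (1 - p)^(1 - y_i)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    grid = np.linspace(0.01, 0.99, 99)
    p_hat = grid[np.argmax([loglik(p) for p in grid])]
    print(p_hat)                                 # close to 2/3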

To sum up maximum likelihood: the likelihood function l(ϑ|y) is algebraically identical to the joint p.d.f. f(y|ϑ) of the random sample Y1, ..., YT, where ϑ is a vector of unknown parameters with a true value known to fall in the parameter space Ω. The difference between l(ϑ|y) and f(y|ϑ) is that l(ϑ|y) is interpreted as a function of ϑ given the values of a random sample y, while the joint p.d.f. f(y|ϑ) is a function of y given particular values of ϑ.

The maximum likelihood estimate of ϑ is the value in Ω, say ϑ̂ML, that maximizes l(ϑ|y). The value ϑ̂ML is in general a function of y, say ϑ̂ = ϑ̂(y), so the random variable ϑ̂ = ϑ̂(Y), where Y = (Y1, ..., YT), is the maximum likelihood estimator of ϑ.

Practical matter: it is usual practice to maximize the natural logarithm of the likelihood function rather than the likelihood function itself. This is very convenient, since the log-likelihood L = ln l(ϑ|y) is composed of sums rather than products, and exponential functions simplify nicely.

Since the natural logarithm is a monotonic function, ln l(ϑ|y) and l(ϑ|y) attain their maximum at the same value of ϑ.


3) LEAST SQUARES ESTIMATION

Note: Both MM and ML require a specific assumption about the distribution of the random variable. LS does not require exact specification of the population distribution. It can be used to estimate the moments of a random variable, μ_r = E[Y^r].

Idea: since the mathematical expectation of a random variable is the mean of that random variable, given the values of a random sample Y1, . . . , YT it is reasonable to use the “center” of the data y_i^r, i = 1, . . . , T, to estimate μ_r.

One way to define the center of a set of data is to find the value μ_r that minimizes

S = \sum_{i=1}^{T} (y_i^r - \mu_r)^2

The value S is the sum of squared differences between y_i^r and the expectation μ_r = E[Y^r]. The value μ̂_r that minimizes S for a given set of data values of the random variable is called the LEAST SQUARES ESTIMATE of μ_r. If μ̂_r is considered a function of the random variables Y_i, then it is the least squares (LS) estimator.

Example: Let Y1, . . . , YT be a random sample from a population with mean β and finite variance σ². Then the LS estimator of β is obtained by minimizing

S = \sum_{i=1}^{T} (y_i - \beta)^2

The F.O.C. is

\frac{dS}{d\beta} = \sum_{i=1}^{T} -2(y_i - \beta) = 0

so

\hat{\beta} = \sum_{i=1}^{T} y_i / T

Note: We may verify that β̂ does in fact minimize S by checking the S.O.C.: d²S/dβ² = 2T > 0.

Graphically the sum of squares S may be viewed as a parabola.


Finding the LS estimator of β amounts to finding the value of β that corresponds to the minimum point on the parabola.
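A minimal numerical sketch of this idea (added for illustration; the data values are made up): evaluating S(β) over a grid of candidate values traces out the parabola and confirms that its minimum is at the sample mean.

```python
import numpy as np

y = np.array([2.0, 5.0, 3.0, 6.0, 4.0])   # hypothetical sample values

def S(beta):
    # Sum of squared deviations of the data from a candidate value beta
    return np.sum((y - beta) ** 2)

betas = np.linspace(0.0, 8.0, 801)
values = np.array([S(b) for b in betas])

print(betas[np.argmin(values)])  # approx 4.0
print(y.mean())                  # 4.0: the LS estimate equals the sample mean
```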

0.2.3 Properties of Point Estimators

Estimators are random variables, so evaluating their properties amounts to studying the properties of their probability distributions.

We will consider:

A) “SMALL SAMPLE” properties of estimators (properties that hold for samples of all sizes, even small ones);

B) “LARGE SAMPLE” or “ASYMPTOTIC” properties of estimators (properties based on samples whose size is assumed to grow to infinity).

Asymptotic properties are frequently needed in econometrics because, for many estimation rules, very little can be said without the assumption of a very large sample.

Small Sample Properties of Estimators

I. SINGLE PARAMETER CASE

Let Y be a random variable with p.d.f. f(y|ϑ), where ϑ is the parameter we would like to estimate, and let Y1, . . . , YT be a sample of observations on Y. Suppose ϑ̂ = ϑ̂(Y1, . . . , YT) is an estimator of ϑ. We need to evaluate the estimator ϑ̂ in a repeated sampling sense, since to do otherwise would require knowledge of the true parameter ϑ.


Unbiasedness

ϑ̂ is an unbiased estimator of ϑ if

E[\hat{\vartheta}] = \vartheta

If E[ϑ̂] ≠ ϑ, then ϑ̂ is a biased estimator and E[ϑ̂] − ϑ = δ is the amount of the bias.

If the p.d.f. of ϑ̂ is symmetric and ϑ̂ is unbiased, then ϑ is located at the center of the distribution of ϑ̂.

Note: Unbiasedness means that the estimator “on the average” will yield the true parameter value, i.e. if the underlying experiment is repeated infinitely many times by drawing samples of size T, the average value of the estimates ϑ̂ from all those samples will be ϑ.
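A small repeated-sampling sketch of this idea (added here for illustration, with arbitrary values β = 5, σ = 2, T = 20 that are not from the notes): averaging the sample-mean estimates over many simulated samples gives a value very close to the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, sigma, T, reps = 5.0, 2.0, 20, 100_000

# One estimate per simulated sample: the sample mean of T draws from N(beta, sigma^2)
estimates = rng.normal(beta, sigma, size=(reps, T)).mean(axis=1)

print(estimates.mean())  # close to 5.0: the estimator is unbiased "on average"
```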

Now, ϑ̂ is a random variable, so in addition to the mean of the estimator we are also concerned about its variance. Suppose we have two unbiased estimators, ϑ̂ and ϑ̃, with Var(ϑ̃) > Var(ϑ̂). Since both estimators are unbiased, we would prefer ϑ̂ to ϑ̃, since using ϑ̂ gives us a higher probability of obtaining an estimate that is close to the true value.

Bias vs Precision

We want an estimator that usually yields estimates of the unknown parameter that are “close” to the true parameter value. In order to do this, bias and variance must be evaluated. One way to do this is to consider an estimator's mean square error (MSE):

MSE(\hat{\vartheta}) = E[(\hat{\vartheta} - \vartheta)^2]

It is, within the repeated sampling context, the average squared distance between ϑ̂ and the true parameter value ϑ.

Furthermore, (ϑ̂ − ϑ)² measures the loss from using ϑ̂ as an estimator of ϑ, and taking the mathematical expectation of this loss yields the average loss, or risk, of using ϑ̂ to estimate ϑ.

This measure encompasses both precision and bias as follows:

MSE(\hat{\vartheta}) = E\big[\hat{\vartheta} - E(\hat{\vartheta}) + E(\hat{\vartheta}) - \vartheta\big]^2

= E\big\{[\hat{\vartheta} - E(\hat{\vartheta})] + [E(\hat{\vartheta}) - \vartheta]\big\}^2

= E[\hat{\vartheta} - E(\hat{\vartheta})]^2 + [E(\hat{\vartheta}) - \vartheta]^2 + 2E\big\{[\hat{\vartheta} - E(\hat{\vartheta})][E(\hat{\vartheta}) - \vartheta]\big\}

where

E\big\{[\hat{\vartheta} - E(\hat{\vartheta})][E(\hat{\vartheta}) - \vartheta]\big\} = [E(\hat{\vartheta}) - \vartheta]\, E[\hat{\vartheta} - E(\hat{\vartheta})] = 0

since E(ϑ̂) − ϑ is a constant and E[ϑ̂ − E(ϑ̂)] = 0.


So

MSE(\hat{\vartheta}) = E[\hat{\vartheta} - E(\hat{\vartheta})]^2 + [E(\hat{\vartheta}) - \vartheta]^2 = var(\hat{\vartheta}) + [bias(\hat{\vartheta})]^2

The use of MSE allows us to compare estimators such as ϑ̂ and ϑ̃ whose p.d.f.s differ in both bias and variance:

ϑ̂ is an unbiased estimator whose variance is larger than that of the biased estimator ϑ̃. For the purpose of obtaining estimates close to ϑ, ϑ̃ may be preferred even though it is biased, if it yields a smaller mean square error.
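To illustrate the bias-precision trade-off numerically (a minimal sketch; the values β = 0.5, σ = 4, T = 10 and the 0.8 shrinkage factor are arbitrary choices, not from the notes): here the deliberately biased, shrunken estimator has the smaller MSE, and for both estimators the simulated MSE matches variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, sigma, T, reps = 0.5, 4.0, 10, 200_000

samples = rng.normal(beta, sigma, size=(reps, T))
unbiased = samples.mean(axis=1)          # the sample mean
shrunken = 0.8 * samples.mean(axis=1)    # a biased "shrinkage" estimator

for est in (unbiased, shrunken):
    mse = np.mean((est - beta) ** 2)
    bias = est.mean() - beta
    # MSE and var + bias^2 agree up to simulation error; the shrunken estimator has the lower MSE here
    print(mse, est.var() + bias ** 2)
```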

Efficiency

If we compare estimators that are unbiased, then the estimator with the smaller variance would be preferred. [This is the basis of estimator efficiency.]

An estimator ϑ̂ is an efficient estimator of ϑ if E[ϑ̂] = ϑ and var(ϑ̂) ≤ var(ϑ̃), where ϑ̃ is any other unbiased estimator of ϑ.

The CRAMER-RAO INEQUALITY provides a sufficient but not necessary condition for an estimator to be efficient.

Let Y be a random variable with p.d.f. f(y|ϑ). If Y1, . . . , YT is a random sample, then the joint p.d.f. of the T random variables Y_i is

f(y \mid \vartheta) = f(y_1, \ldots, y_T \mid \vartheta) = \prod_{i=1}^{T} f(y_i \mid \vartheta)

Recall that if we interpret the joint p.d.f. f(y1, . . . , yT|ϑ) as a function of the unknown parameter ϑ given the values of the random sample, then the resulting function is known as the likelihood function, written as l(ϑ|y) = l(ϑ|y1, . . . , yT). It is mathematically identical to the joint p.d.f. but has a different interpretation.

Let L(ϑ) denote the natural logarithm of the likelihood function. The Cramer-Rao inequality states that if ϑ̂ is any unbiased estimator of ϑ, and certain regularity conditions hold, then

Var(\hat{\vartheta}) \ge \frac{1}{-E\left[\dfrac{d^{2}L(\vartheta)}{d\vartheta^{2}}\right]}

(see Theil (1971), pp. 384-386 for a proof).

It says that if an unbiased estimator ϑ̂ has a variance equal to -\{1/E[d^{2}L/d\vartheta^{2}]\}, which is known as the CRAMER-RAO LOWER BOUND (CRLB), then it is efficient, since no lower variance is possible for an unbiased estimator.
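As a worked illustration of the bound (a standard textbook case added here, not an example taken from these notes): for a random sample of size T from a N(β, σ²) population with σ² known, the log-likelihood and the resulting bound are

L(\beta) = -\frac{T}{2}\ln(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{T}(y_i - \beta)^{2},
\qquad
\frac{d^{2}L(\beta)}{d\beta^{2}} = -\frac{T}{\sigma^{2}},
\qquad
\text{CRLB} = \frac{1}{-E[d^{2}L/d\beta^{2}]} = \frac{\sigma^{2}}{T}

Since the sample mean is unbiased with variance σ²/T, it attains this bound and is therefore efficient.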

N.B.: This condition is sufficient but not necessary. It is possible that no unbiased estimator has variance as small as the CRLB. It is often much simpler to find the minimum variance estimator among those that are not only unbiased but also linear functions of the sample observations.

Let ϑ̂ be an estimator of an unknown parameter ϑ of the form

\hat{\vartheta} = a_1 y_1 + \cdots + a_T y_T

where the a_i are constants.

Then ϑ̂ is defined to be a linear estimator, since it is a linear function of the sample observations. If ϑ̂ is unbiased and var(ϑ̂) ≤ var(ϑ̃), where ϑ̃ is any other linear and unbiased estimator of ϑ, then ϑ̂ is the BEST LINEAR UNBIASED ESTIMATOR (BLUE) of ϑ.

Note: It is often possible to find such an estimator without knowledge of the underlying p.d.f. of Y.
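A standard illustration of this point (added here for completeness; it is not part of the original notes): suppose the Y_i are uncorrelated with common mean β and common variance σ². A linear estimator is unbiased if and only if the weights sum to one, and its variance is minimized, subject to that constraint, by equal weights, so under these assumptions the sample mean is the BLUE of β:

\hat{\beta} = \sum_{i=1}^{T} a_i y_i,
\qquad
E[\hat{\beta}] = \beta \sum_{i=1}^{T} a_i,
\qquad
var(\hat{\beta}) = \sigma^{2} \sum_{i=1}^{T} a_i^{2},
\qquad
\min_{\sum_i a_i = 1} \sum_{i=1}^{T} a_i^{2} \;\Rightarrow\; a_i = \frac{1}{T}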

II. THE CASE OF SEVERAL PARAMETERS


This is the case of a statistical model involving several unknown parameters ϑ1, . . . , ϑk. Let Y be a random variable with p.d.f. f(y|ϑ), where ϑ = (ϑ1, . . . , ϑk)′ is a (k × 1) vector of unknown parameters that we wish to estimate. Let ϑ̂ = (ϑ̂1, . . . , ϑ̂k)′ be a (k × 1) vector whose element ϑ̂_i is the estimator of the unknown ϑ_i.

The three properties are introduced below, following the same structure as in the single-parameter case.

Unbiasedness

Consider ϑ̂ (k × 1) such that E[ϑ̂] = ϑ, that is, the vector ϑ̂ is an unbiased estimator for ϑ:

E(\hat{\vartheta}) =
\begin{bmatrix} E(\hat{\vartheta}_1) \\ E(\hat{\vartheta}_2) \\ \vdots \\ E(\hat{\vartheta}_k) \end{bmatrix}
=
\begin{bmatrix} \vartheta_1 \\ \vartheta_2 \\ \vdots \\ \vartheta_k \end{bmatrix}
= \vartheta

If E[ϑ̂] ≠ ϑ, then ϑ̂ is biased and δ = E[ϑ̂] − ϑ is the bias vector. The precision of ϑ̂ is measured by its variance-covariance matrix (a k × k positive definite and symmetric matrix), which is defined as

Cov(\hat{\vartheta}) = E[(\hat{\vartheta} - E(\hat{\vartheta}))(\hat{\vartheta} - E(\hat{\vartheta}))'] =
\begin{bmatrix}
var(\hat{\vartheta}_1) & cov(\hat{\vartheta}_1,\hat{\vartheta}_2) & \cdots & cov(\hat{\vartheta}_1,\hat{\vartheta}_k) \\
cov(\hat{\vartheta}_2,\hat{\vartheta}_1) & var(\hat{\vartheta}_2) & \cdots & cov(\hat{\vartheta}_2,\hat{\vartheta}_k) \\
\vdots & \vdots & \ddots & \vdots \\
cov(\hat{\vartheta}_k,\hat{\vartheta}_1) & cov(\hat{\vartheta}_k,\hat{\vartheta}_2) & \cdots & var(\hat{\vartheta}_k)
\end{bmatrix}

Note: An obvious question arises as to how the precision of two competing estimators of ϑ can be compared. That is, if ϑ̂ and ϑ̃ are both estimators of ϑ, how do we compare cov(ϑ̂) and cov(ϑ̃), since they are matrices?

We will say that cov(ϑ̂) is “smaller than” cov(ϑ̃), in a matrix sense, if C = cov(ϑ̃) − cov(ϑ̂) is a positive semi-definite matrix. So, if C is a positive semi-definite matrix, then

(i) var(ϑ̃_i) ≥ var(ϑ̂_i), i = 1, . . . , k

(ii) |cov(ϑ̃)| ≥ |cov(ϑ̂)|

(iii) if c is any (k × 1) vector of constants, then var(c′ϑ̃) = c′cov(ϑ̃)c ≥ var(c′ϑ̂) = c′cov(ϑ̂)c

(see Goldberger, 1991)


Bias vs Precision

The mean squared error matrix of an estimator ϑ̂ for ϑ is

MSE(\hat{\vartheta}) = E[(\hat{\vartheta} - \vartheta)(\hat{\vartheta} - \vartheta)']
= E[(\hat{\vartheta} - E(\hat{\vartheta}))(\hat{\vartheta} - E(\hat{\vartheta}))'] + [E(\hat{\vartheta}) - \vartheta][E(\hat{\vartheta}) - \vartheta]'

thus

MSE(\hat{\vartheta}) = cov(\hat{\vartheta}) + bias(\hat{\vartheta})\, bias(\hat{\vartheta})'

Cov(ϑ̂) is the covariance matrix of the estimator ϑ̂, and the “squared bias” matrix has elements [bias(ϑ̂_i)]² on the diagonal and the products bias(ϑ̂_i)bias(ϑ̂_j) in the ij-th off-diagonal positions.

A common scalar measure of the mean squared error of an estimator ϑ̂ is

tr\, MSE(\hat{\vartheta}) = \sum_{i=1}^{k} \left\{ var(\hat{\vartheta}_i) + [bias(\hat{\vartheta}_i)]^2 \right\}

and it is often simply called the mean squared error of ϑ̂, or the average loss or risk incurred when using ϑ̂ to estimate ϑ.

Efficiency

An estimator ϑ̂ is efficient for ϑ if ϑ̂ is unbiased and cov(ϑ̃) − cov(ϑ̂) is positive semidefinite, where ϑ̃ is any other unbiased estimator for ϑ.

We can find a multivariate version of the Cramer-Rao inequality. Consider the likelihood function l(ϑ|y), which is mathematically equivalent to the joint p.d.f. of the random sample but is a function of ϑ instead of y.

Let L(ϑ) denote the logarithm of the likelihood function; then the matrix

I(\vartheta) = -E\left[\frac{\partial^{2} L}{\partial\vartheta\,\partial\vartheta'}\right]

is the INFORMATION MATRIX, and

[I(\vartheta)]^{-1}

is the Cramer-Rao lower bound matrix.


Large Sample Properties of Estimators

There are estimation problems in which the small sample properties of estimators cannot be obtained easily, so in order to compare estimators we must study their asymptotic properties, i.e. their approximate behaviour when the sample size T is large and approaches infinity. There are three important concepts:

a.) CONSISTENCY

b.) CONVERGENCE IN DISTRIBUTION

c.) ASYMPTOTIC EFFICIENCY

CONSISTENCY

This property ensures that the estimation rule will produce an estimate that is close to the true parameter value with high probability if the sample size is large enough. More precisely, let ϑ̂_T be an estimator of ϑ based on a sample of size T. Then ϑ̂_T is a consistent estimator of ϑ if

\lim_{T\to\infty} P\left(|\hat{\vartheta}_T - \vartheta| < \varepsilon\right) = 1

where ε is an arbitrarily small positive number.

This means that the probability that the value of ϑ̂_T falls in the interval [ϑ − ε, ϑ + ε] can be made arbitrarily close to 1 given a large enough sample size, no matter how small ε is.

If \lim_{T\to\infty} P(|\hat{\vartheta}_T - \vartheta| < \varepsilon) = 1, then the sequence of random variables ϑ̂_T is said to converge in probability to the constant ϑ, and ϑ is said to be the probability limit of the sequence ϑ̂_T. This is usually abbreviated as plim ϑ̂_T = ϑ.

Thus the estimator ϑ̂_T is consistent for ϑ if \lim_{T\to\infty} P(|\hat{\vartheta}_T - \vartheta| < \varepsilon) = 1 or, equivalently, if plim ϑ̂_T = ϑ.

Example:
Let Y1, . . . , YT be a random sample from a N(β, σ²) population, and consider the estimator of β

\hat{\beta}_T = \sum_{i=1}^{T} y_i / T

which is the sample mean.


We know that

\hat{\beta}_T \sim N(\beta, \sigma^2/T)

Suppose we have three sample sizes T3 > T2 > T1. Comparing the corresponding sampling distributions, we see that the probability mass in the interval [β − ε, β + ε] gets larger as the sample size increases from T1 to T2 to T3, and the sampling distribution of β̂_T is collapsing about the true parameter β. In fact, since

\lim_{T\to\infty} var(\hat{\beta}_T) = \lim_{T\to\infty} \sigma^2/T = 0

the sampling distribution of β̂_T becomes degenerate at the true parameter value in the limit, so that all its probability mass occurs at β. So we can conclude that sufficient, but not necessary, conditions for an estimator ϑ̂_T to be consistent for ϑ are the two conditions below (a small simulation sketch of this collapsing behaviour follows the list):

• \lim_{T\to\infty} E[\hat{\vartheta}_T] = \vartheta (asymptotic unbiasedness)

• \lim_{T\to\infty} var(\hat{\vartheta}_T) = 0
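A minimal simulation sketch (added for illustration; the values β = 1, σ² = 10 and ε = 0.5 are arbitrary choices, not from the notes): the fraction of samples whose mean falls within ε of β rises towards one as T grows.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, sigma, eps, reps = 1.0, np.sqrt(10.0), 0.5, 5_000

for T in (10, 100, 1000):
    # Sample mean of each of `reps` simulated samples of size T from N(beta, sigma^2)
    means = rng.normal(beta, sigma, size=(reps, T)).mean(axis=1)
    # Fraction of estimates within eps of the true value rises towards 1 with T
    print(T, np.mean(np.abs(means - beta) < eps))
```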

CONVERGENCE IN DISTRIBUTION


In this case we study the probability distribution of an estimator as the sample size becomes increasingly large. It is remarkable that estimators whose distributions are unknown in small samples can sometimes be shown to have a particular, known distribution in large samples. A well-known example of this is the CENTRAL LIMIT THEOREM:

Let Y1, . . . , YT be independently and identically distributed random variables with E[Y_i] = β and Var[Y_i] = σ². Also let \hat{\beta}_T = \sum_{i=1}^{T} y_i / T. Then

\sqrt{T}\left(\hat{\beta}_T - \beta\right) \xrightarrow{d} N(0, \sigma^2)

This remarkable result says that, given a random sample of observations from an infinite population with any distribution, as long as it has finite mean and finite variance, the simple function of the sample mean \sqrt{T}(\hat{\beta}_T - \beta) has a limiting distribution that is N(0, σ²).

Comments

1. There are situations when we have to make probability statements about an estimator when the exact small sample distribution of the estimator cannot be determined. If a limiting distribution exists, then we can consider that limiting distribution as an approximation to the true but unknown distribution, if the sample we have at our disposal is sufficiently large for us to believe the approximation will be a reasonable one.

Suppose that the conditions of the CLT hold but the exact p.d.f. of β̂_T is unknown. Then, if the sample size is large, we may take \sqrt{T}(\hat{\beta}_T - \beta) to be approximately normally distributed N(0, σ²). And if \sqrt{T}(\hat{\beta}_T - \beta) \overset{\cdot}{\sim} N(0, \sigma^2), then \hat{\beta}_T \overset{\cdot}{\sim} N(\beta, \sigma^2/T), where \overset{\cdot}{\sim} means “approximately distributed”. Many times this result is written as \hat{\beta}_T \overset{asy}{\sim} N(\beta, \sigma^2/T), and N(β, σ²/T) is called the ASYMPTOTIC DISTRIBUTION of β̂_T.

2. Why is the CLT stated in terms of \sqrt{T}(\hat{\beta}_T - \beta)?


We know that when the conditions of the CLT hold, β̂_T has mean β and variance σ²/T for any sample size T. So, as T → ∞, var(β̂_T) → 0, and the distribution of β̂_T becomes degenerate, which is definitely a non-normal distribution!

However, it can be shown that the sequence of random variables \sqrt{T}(\hat{\beta}_T - \beta), as T → ∞, has a sequence of c.d.f.'s F_T that converges, pointwise, to the c.d.f. F of a N(0, σ²) random variable; that is, \lim_{T\to\infty} F_T = F at all continuity points of F.

This is in fact the definition of convergence in distribution of the sequence of random variables Z_T = \sqrt{T}(\hat{\beta}_T - \beta), and Z_T is said to have the limiting distribution F.

⇒ All of this can be applied to the case of vectors of estimators: if ϑ̂_T is a consistent estimator for ϑ and \sqrt{T}(\hat{\vartheta}_T - \vartheta) converges in distribution to N(0, Σ), then ϑ̂_T is said to have the asymptotic distribution N(ϑ, Σ/T), or \hat{\vartheta}_T \overset{asy}{\sim} N(\vartheta, \Sigma/T).
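A short simulation sketch of the CLT (added here; the exponential population and the values used are arbitrary choices, not from the notes): even for a markedly non-normal population, the standardized sample mean \sqrt{T}(\hat{\beta}_T - \beta) behaves approximately like a N(0, σ²) draw when T is large.

```python
import numpy as np

rng = np.random.default_rng(3)
T, reps = 200, 50_000

# Exponential(1) population: mean beta = 1, variance sigma^2 = 1, clearly non-normal
samples = rng.exponential(scale=1.0, size=(reps, T))
z = np.sqrt(T) * (samples.mean(axis=1) - 1.0)

print(z.mean(), z.var())             # close to 0 and 1
print(np.mean(np.abs(z) <= 1.96))    # close to 0.95, as for a N(0, 1) variable
```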

ASYMPTOTIC EFFICIENCY

One-parameter case

Suppose we have two estimators ϑ̂ and ϑ̃ of a parameter ϑ such that

\sqrt{T}\left(\hat{\vartheta} - \vartheta\right) \xrightarrow{d} N(0, \sigma_1^2)

and

\sqrt{T}\left(\tilde{\vartheta} - \vartheta\right) \xrightarrow{d} N(0, \sigma_2^2)

If σ₂² ≥ σ₁², then ϑ̂ is asymptotically efficient relative to ϑ̃.

Vector case
If ϑ is a vector of parameters and ϑ̂ and ϑ̃ are consistent estimators such that

\sqrt{T}\left(\hat{\vartheta} - \vartheta\right) \xrightarrow{d} N(0, \Sigma)

and

\sqrt{T}\left(\tilde{\vartheta} - \vartheta\right) \xrightarrow{d} N(0, \Omega)

then ϑ̂ is asymptotically efficient relative to ϑ̃ if (Ω − Σ) is a positive semidefinite matrix.


As with finite sample efficiency, showing that an estimator is asymptotically efficient relative to any other consistent estimator requires knowledge of the parent distribution, so that the information matrix and the asymptotic Cramer-Rao lower bound can be established. Let ϑ̂ be a consistent estimator of ϑ such that

\sqrt{T}\left(\hat{\vartheta} - \vartheta\right) \xrightarrow{d} N(0, \Sigma)

Then the estimator ϑ̂ is asymptotically efficient if

\Sigma = \lim_{T\to\infty}\left[\frac{1}{T} I(\vartheta)\right]^{-1}

where I(ϑ) is the information matrix.

So both small sample and asymptotic efficiency are established using the CRLB as a reference point.

Note: a property of the method of maximum likelihood. Under some fairly general conditions, maximum likelihood estimators are consistent, asymptotically normal, asymptotically unbiased, and asymptotically efficient.

That is, if ϑ̂ is the maximum likelihood estimator of the vector of parameters ϑ, then

\sqrt{T}\left(\hat{\vartheta} - \vartheta\right) \xrightarrow{d} N\left(0,\ \lim_{T\to\infty}\left[\frac{1}{T} I(\vartheta)\right]^{-1}\right)

Consequently, in a problem like the one introduced earlier, where we obtain the MLE of the parameter p of a Bernoulli population, even though we do not know the small sample distribution of the estimator, we can rely on the fact that it is asymptotically normal to test hypotheses or make confidence interval statements.


0.2.4 Interval estimation and hypothesis testing

Interval estimation

Once we get point estimates, we are concerned about the “reliability” of the estimate. Knowledge of the variance of the estimator (or an estimate of it), plus knowledge of the finite or asymptotic sampling distribution, conveys this information.

It is sometimes useful, however, when reporting results to give a range of possible values that incorporates both a point estimate and a measure of precision. An INTERVAL ESTIMATE does just that. Interval estimates are usually based on the sampling distribution of an estimator, although Chebychev's inequality could also be used. We will illustrate using three common examples.

Example:
Let Y1, . . . , YT be a random sample from a N(β, σ²) population, where σ² is known. The MLE of β is

\hat{\beta} = \sum_{i=1}^{T} y_i / T \sim N(\beta, \sigma^2/T)

Then we can use the sampling distribution of β̂ to make probability statements. Since

z = \frac{\hat{\beta} - \beta}{\sigma/\sqrt{T}} \sim N(0, 1)

it follows that

P[-z_{(\alpha/2)} \le z \le z_{(\alpha/2)}] = 1 - \alpha

where z_{(α/2)} is the upper α/2 percentile of the N(0, 1) distribution; its values are tabulated in standard normal tables.


Substituting for z and rearranging:

P[-z_{(\alpha/2)} \le z \le z_{(\alpha/2)}] = 1 - \alpha

P\left[-z_{(\alpha/2)} \le \frac{\hat{\beta} - \beta}{\sigma/\sqrt{T}} \le z_{(\alpha/2)}\right] = 1 - \alpha

P\left[\hat{\beta} - z_{(\alpha/2)}\left(\sigma/\sqrt{T}\right) \le \beta \le \hat{\beta} + z_{(\alpha/2)}\left(\sigma/\sqrt{T}\right)\right] = 1 - \alpha

The end points of the interval containing β are random, since β̂ is random. The random interval

\left[\hat{\beta} - z_{(\alpha/2)}\left(\sigma/\sqrt{T}\right),\ \hat{\beta} + z_{(\alpha/2)}\left(\sigma/\sqrt{T}\right)\right]

is an interval estimator of β and contains the unknown parameter β with confidence level (1 − α); α is the significance level of the interval estimator. An interval estimate is obtained when the estimator β̂ is replaced by an estimate based on a particular sample of values.

Example: referring back to the previous example, an interval estimate of the mean consumption level with confidence level 1 − α = 0.95 is

£13,000 ± (1.96)(5,000/√3)

or

[£7,341.97, £18,658.03]

So we can state that [£7,341.97, £18,658.03] is a 95% confidence interval for β, that is, it is one realisation of the interval estimator

\left[\hat{\beta} - z_{(\alpha/2)}\left(\sigma/\sqrt{T}\right),\ \hat{\beta} + z_{(\alpha/2)}\left(\sigma/\sqrt{T}\right)\right]


based on a particular sample.
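A minimal sketch of the computation (added here for illustration; it assumes β̂ = 13,000, σ = 5,000 and T = 3, a reading of the example above that is consistent with the reported end points):

```python
import numpy as np
from scipy.stats import norm

# Values mirroring the consumption example above (sigma = 5,000 and T = 3 are assumptions)
beta_hat, sigma, T, alpha = 13_000.0, 5_000.0, 3, 0.05

z = norm.ppf(1 - alpha / 2)            # approx 1.96
half_width = z * sigma / np.sqrt(T)

print(beta_hat - half_width, beta_hat + half_width)
# approx (7342, 18658), matching the quoted interval up to rounding of the critical value
```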

Examples: 1) 3.13 t-distribution; 2) 3.14 χ-distribution

Hypothesis Testing

This is the second branch of statistical inference. We will look at various properties of statistical tests and then at procedures or principles by which test statistics can be obtained.

Elements of a Statistical Test

A statistical test is a decision problem involving an unknown parameter ϑ that must lie in a certain parameter space Ω. The space Ω can be divided into two disjoint subsets Ω0 and Ω1, and we must decide, using a sample of data, whether ϑ lies in Ω0 or in Ω1.

Let H0 denote the null hypothesis that ϑ ∈ Ω0.
Let H1 denote the alternative hypothesis that ϑ ∈ Ω1.

Since Ω0 and Ω1 are disjoint, only one of the hypotheses H0 and H1 is true, and by accepting one we are at the same time rejecting the other.

Testing a hypothesis then involves accepting either H0 or H1, in light of the costs of making an incorrect decision and using whatever data are available as efficiently as possible.

Suppose we observe a random sample Y = (Y1, . . . , YT)′ that has joint p.d.f. f(y|ϑ). The set of all possible values of Y is the sample space of the experiment.

A test procedure is defined by dividing the sample space into two parts, one containing the subset of values of Y that will lead to H0 being accepted, and the other being the subset of values of Y where H1 will be accepted (and H0 rejected). The latter subset, where H0 will be rejected, is called the critical region or the rejection region of the test.

In practice, a test is carried out using a test statistic. A test statistic is one that has a known distribution under the null hypothesis (i.e. assuming H0 is true) and has some other distribution when H1 is true. Thus the critical region of a hypothesis test is that set of values of the test statistic for which the null hypothesis will be rejected.

Thus the basic elements of every statistical test are:


• a null hypothesis (H0) that will be maintained until evidence to the contrary is known,

• an alternative hypothesis (H1) that will be adopted if the null hypothesis is rejected,

• a test statistic, and

• a region of rejection or critical region.

Example:
We wish to test a simple null hypothesis about the unknown mean β of a normal population whose variance is known to be σ² = 10. Specifically, let the hypotheses be

H0: β = 1
H1: β ≠ 1

Given a random sample of size T = 10, Y1, . . . , Y10, we know that

\hat{\beta} = \sum_{i=1}^{T} Y_i / T \sim N(\beta, \sigma^2/T) = N(\beta, 1)

Under the null hypothesis (i.e. assuming it is true), β̂ ∼ N(1, 1).

To define a critical region, we choose values of β̂ that will lead us to reject H0 and accept H1. An intuitive rule is to select as the rejection region those values of β̂ that are unlikely to occur if the null hypothesis is true. Then, if one of the unlikely values does occur, that provides evidence against the null hypothesis and leads us to reject it.

We choose a and b so that

P[\hat{\beta} \le a] = P[\hat{\beta} \ge b] = \alpha/2

We can then define an unbiased test (since equal amounts of probability are in the two tails) with size α (which represents the probability of rejecting H0 when it is true) as follows:

Reject H0 if β̂ ≤ a or β̂ ≥ b.
Accept H0 (or at least do not reject it) if a < β̂ < b.

For the problem at hand, the test is usually carried out using the test statistic

z = \frac{\hat{\beta} - 1}{\sigma/\sqrt{T}}

which is distributed as N(0, 1) if H0 is true. If H0 is not true, say β = β0 ≠ 1, then z ∼ N(β0 − 1, 1).

The rejection region for the test is z ≥ z_(α/2) or z ≤ −z_(α/2), i.e. |z| ≥ z_(α/2).


If α = 0.05, then z_(α/2) = 1.96 from tables of the standard normal. In terms of β̂, this means that a = 1 − 1.96 = −0.96 and b = 1 + 1.96 = 2.96.

In terms of the test statistic z, the reasoning of the statistical test is as follows. If a sample value of β̂ is obtained that leads to a calculated value of |z| ≥ z_(α/2), then we reject H0 since:

(i) If the hypothesis is true, the chances of obtaining a value of z in the rejection region are “only” α.

(ii) This is an unlikely event, so we conclude that the test statistic is unlikely to have a N(0, 1) distribution.

(iii) If the null hypothesis is true, the test statistic z does have a N(0, 1) distribution.

We conclude that the hypothesis is unlikely to be true, since the sample evidence does not support it.
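A compact sketch of this test (added for illustration; the sample is made up and generated under H0, while σ² = 10, T = 10 and the hypothesized mean of 1 follow the example above):

```python
import numpy as np
from scipy.stats import norm

sigma2, T, beta_0, alpha = 10.0, 10, 1.0, 0.05

rng = np.random.default_rng(4)
y = rng.normal(1.0, np.sqrt(sigma2), size=T)   # hypothetical data generated under H0

z = (y.mean() - beta_0) / np.sqrt(sigma2 / T)  # test statistic, N(0, 1) under H0
z_crit = norm.ppf(1 - alpha / 2)               # approx 1.96

print(z, z_crit)
print("reject H0" if abs(z) >= z_crit else "do not reject H0")
```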

Formulation of the null and alternative hypotheses

Example:
In a court of law an accused person is “innocent until proven guilty”. The null hypothesis is that the person is innocent, and the alternative is that the person is guilty.

Null hypotheses in economic applications usually take the form of asserting that some parameter, linear combination of parameters, or set of linear combinations of parameters takes specified values. The alternative is simply that they do not.

Then, if ϑ = (ϑ1, . . . , ϑk)′ is a set of parameters, the following pairs of null and alternative hypotheses are frequently seen.

(i) H0: ϑi = 0
    H1: ϑi ≠ 0

(ii) H0: ϑi + ϑj = 1
     H1: ϑi + ϑj ≠ 1

(iii) H0: ϑ1 = 0 and ϑ2 = 0 and . . . and ϑk = 0
      H1: at least one of the equalities in H0 is false


Case (i) and Case (ii): the null hypothesis represents one condition on the parameter vector ϑ and thus constitutes a single hypothesis. Also, the single hypothesis is said to be a simple hypothesis, since it specifies a single value for the parameter or linear combination of parameters. The alternative hypothesis in cases (i) and (ii) is simply the negation of H0, and a specific value is not given. Such a hypothesis is said to be composite, as it represents more than one specific value for the parameter or linear combination of parameters.

Case (iii): the null hypothesis is joint, since values are specified for several parameters, and the question is whether all these hypotheses hold simultaneously. The alternative in this case is that any one of the individual hypotheses is false.

In addition to these cases, inequality hypotheses are often stated, such as

(iv) H0: ϑi ≥ 0
     H1: ϑi < 0

In this case both the null and alternative hypotheses are composite.

Power of a Test

Since a hypothesis test involves making a choice, there is a chance that the decision made is an incorrect one. Suppose

H0: ϑ ∈ Ω0
H1: ϑ ∈ Ω1

Once a test statistic and critical region have been defined, then the probability of rejecting H0 can be determined for every ϑ ∈ Ω.

Let Π(ϑ) = P[rejecting H0 | ϑ], where H0 is rejected when the test statistic falls in the critical region. Π(ϑ) is called the power function of the test. Ideally, Π(ϑ) = 0 for every value of ϑ ∈ Ω0 and Π(ϑ) = 1 for every ϑ ∈ Ω1.

This would imply that the test procedure always leads to the correct decision.

Unfortunately, such a perfect test does not usually exist. Consequently, there is usually a nonzero probability of rejecting H0 even when it is true (called a Type I error).

When testing hypotheses, it is customary to only consider test procedures such that this probability is bounded by a constant α, called the size of the test or level of significance of the test. The size of the test α is the largest value of Π(ϑ) for any value of ϑ that makes H0 true, that is ϑ ∈ Ω0.


Therefore

P[Type I error] = P[rejecting H0 | H0 true] ≤ α

If H0 is “simple”, as in cases (i) and (ii) above, then the equality holds.

The second possible error (called a Type II error) when testing a hypothesis is to accept a false hypothesis. The probability of a Type II error (often denoted β, not to be confused with the parameter β) is

β = P[Type II error] = P[accepting H0 | H1 true] = 1 − Π(ϑ) for ϑ ∈ Ω1.

It is unfortunately true that no test procedure exists, for a given sample size, that allows both Type I and Type II error probabilities to be made arbitrarily small.

It is generally true that reducing the size of the test α increases the probability of a Type II error, and vice versa. The magnitude of the Type I error is conventionally fixed, usually at a small value, the reason being that it is the probability of rejecting the null hypothesis incorrectly, which is the error we want to avoid the most.

We do not want to reject the null hypothesis, as in the court room example, unless convincing evidence to the contrary has been provided.

Let us illustrate this point with the continuation of the previous example.

Example:
When H0 is true, the power of the test is less than or equal to the size of the test α. In our example, the power of the test equals 0.05 when β = 1 and the hypothesis is true. To investigate the power when β ≠ 1, we begin by assuming that, in fact, β = 2. Then the true distribution of β̂ is β̂ ∼ N(2, 1) and the true distribution of the test statistic z is N(1, 1).

We compare the true distribution of β̂, N(2, 1), with its distribution under H0, N(1, 1).

Having defined the critical region of the test, we can compute the power of the test when β = 2 as

\Pi(2) = P[\hat{\beta} \le -0.96 \text{ or } \hat{\beta} \ge 2.96 \mid \beta = 2]
= 1 - P\left[\frac{-0.96 - 2}{1} \le \frac{\hat{\beta} - 2}{1} \le \frac{2.96 - 2}{1}\right]
= 1 - P[-2.96 \le z \le 0.96]
= 0.17
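The 0.17 figure can be checked directly from the standard normal c.d.f. (a small sketch added here; the use of scipy's norm is an assumption of this illustration, not something used in the notes):

```python
from scipy.stats import norm

# Power at beta = 2: 1 - P(-2.96 <= Z <= 0.96) for Z ~ N(0, 1)
power = 1.0 - (norm.cdf(0.96) - norm.cdf(-2.96))
print(round(power, 2))   # 0.17
```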


Relationship between confidence intervals and hypothesis tests

Suppose we are considering a population that is known to be N(β, 10) and we draw a random sample of size T = 10, say Y1, . . . , Y10. Then the maximum likelihood estimator of β is β̂ = ΣYi/10 ∼ N(β, 1). A 95% confidence interval for β is [β̂ − 1.96, β̂ + 1.96].

Now consider testing the null hypothesis H0: β = β0 against H1: β ≠ β0.

The test statistic is z = √T(β̂ − β0)/σ = β̂ − β0. The hypothesis is rejected if |z| ≥ 1.96 and not rejected if |z| < 1.96.


Examining the latter statement, the hypothesis is not rejected if

-1.96 < \hat{\beta} - \beta_0 < 1.96

or

\hat{\beta} - 1.96 < \beta_0 < \hat{\beta} + 1.96

That is, it is clear in the hypothesis-testing problem that any value β0 that falls “inside” the 95% confidence interval will lead to acceptance of the hypothesis at the 0.05 level of significance. Any value of β0 outside the 95% confidence interval will lead to H0 being rejected at the level of significance 0.05.

Stated another way, if a (1 − α) × 100% confidence interval covers the hypothesized value, then the associated null hypothesis will not be rejected at significance level α.

The notes are based on the first four chapters of


Bibliography

[1] Judge, G. G., Hill, R. C., Griffiths, W. E., Lutkepohl, H. and T. C. Lee (1988), INTRODUCTION TO THE THEORY AND PRACTICE OF ECONOMETRICS, Second Edition, Wiley.

I wish to thank A. Leccadito and D. Philip for editorial help. The usual disclaimer applies.

Additional references are



[1] Amemiya, T. (1994), INTRODUCTION TO STATISTICS AND ECONOMETRICS, Harvard University Press, Cambridge, MA and London, UK.

[2] Goldberger, A. S. (1991), A COURSE IN ECONOMETRICS, Harvard University Press, Cambridge, MA and London, UK.

[3] Greene, W. H. (2000), ECONOMETRIC ANALYSIS, 4th Edition, Prentice Hall International.

[4] Grimmett, G. R. and Stirzaker, D. R. (1982), PROBABILITY AND RANDOM PROCESSES, Clarendon Press.

[5] Hamilton, J. D. (1994), TIME SERIES ANALYSIS, Princeton University Press.

[6] Hogg, R. and A. Craig (1978), INTRODUCTION TO MATHEMATICAL STATISTICS, 4th Edition, New York: Wiley.

[7] Mood, A. F., Graybill, F. and D. Boes (1974), INTRODUCTION TO THE THEORY OF STATISTICS, 3rd Edition, New York: McGraw Hill.

[8] Spanos, A. (1999), PROBABILITY THEORY AND STATISTICAL INFERENCE, Cambridge University Press.

[9] Theil, H. (1971), PRINCIPLES OF ECONOMETRICS, New York: Wiley.
