Econ 311 – Spring 2016, Weeks 2-3: Review of Probability and Statistics
İnsan TUNALI, 9 February 2016
Econ 311 – Econometrics I, Lectures 3-5
REVIEW OF PROBABILITY AND STATISTICS
Draws on:
Stock & Watson, Ch.2, Sections 2.1-2.3Goldberger, Chs. 3-4.
From the syllabus: “The prerequisites for ECON 311 include MATH 201
(Statistics), and ECON 201 (Intermediate Microeconomics). Students who got a grade of C− or below in MATH 201 are strongly advised to work independently to
make up for any deficiency they have during the first two weeks of the semester.”
The probability framework for statistical inference
(a) Population, random variable, and distribution
Random variables and distributions can be classified as:
• Univariate, bivariate, trivariate …
• Discrete, continuous, mixed.
(b) Moments of a distribution (mean, variance, standard deviation,
covariance, correlation)
(c) Conditional distributions and conditional means
(d) Distribution of a sample of data drawn randomly from a population: Y1, …, Yn (subject of another handout).
(a) Population, random variable, and distribution
Population
• The group or collection of all possible entities of interest
• We will think of populations as being “very big” (∞ is an
approximation to “very big”)
Outcomes... sample space... events... MATH 201
Random variable Y
• Numerical summary of a random outcome
Population distribution of Y
• Discrete case: The probabilities of different values of Y that
occur in the population
• Continuous case: Likelihood of particular ranges of Y
How to envision Discrete Probability Distributions
Urn model: Population = Balls in an urn
Each ball has a value (Y ) written on it;
Y has K distinct values: y1, y2, ..., yi, ..., yK .
Suppose we were to sample from this (univariate) population, with
replacement, infinitely many times…
(Population) Distribution of Y :
pi = Pr(Y = yi), i = 1, 2, …, K .
Gives the proportion of times we encounter a ball with value Y = yi, i = 1, 2, …, K.
Alternate notation: f ( y) = Pr(Y = y); “probability function (p.f.) of Y .”
Clearly pi ≥ 0 for all i, and Σi pi = 1.
Convention: Σi denotes summation over i = 1, 2, …, K.
Examples:
> Gender: M (=0)/F (=1). We prefer the numerical representation…
> Standing: freshman (=1), sophomore (=2), junior (=3), senior (=4).
> Ranges of wages (group them in intervals first).
Cumulative Distribution Function (c.d.f) of Y :
F(y) = Pr(Y ≤ y) = Σ_{yi ≤ y} f(yi) = Σ_{yi ≤ y} pi.

That is, to find F(y) we sum the pi's over all values yi that do not exceed y.
Use of the c.d.f.:
Pr(a < Y ≤ b) = F (b) – F (a).
Important features of the distribution of Y :
(Population) Mean of Y :
μY = E(Y) = Σi yi pi. (Here, again, Σi denotes summation over i = 1, 2, …, K.)
Also known as “the expected value of Y” or simply “expectation of Y.”
Remark: Expectation is a weighted average of the values of Y whereweights are the probabilities with which distinct values occur.
The idea of “weighted averaging” can be extended to functions of Y .
Suppose Z = h(Y ), any function of Y . Then the expected value of h(Y )
is: E ( Z ) = E [h(Y )]
= Σi h( yi) pi.
Thus knowledge of the probability distribution of Y is sufficient for
calculating the expectation of functions of Y as well.

Examples:
(i) Take Z = Y 2. Then
E(Z) = E(Y²) = Σi yi² pi.
(ii) Take Z = (Y– μY )2. Then
E ( Z ) = E [(Y– μY )2] = Σi ( yi – μY )2 pi.
With this choice of h(Y ), we get the:
(Population) Variance of Y :
σY² = V(Y) = E[(Y – μY)²] = Σi (yi – μY)² pi.
In words, variance equals the expected value of (or the expectation
of) “the squared deviation of Y from its mean.”
Example: Suppose random variable Y can take on one of two values, y0 = 0 and y1 = 1 with probabilities p0 and p1.
Since p0 + p1 =1, we may take
Pr (Y = 1) = p and Pr (Y = 0) = 1 – p, 0 < p < 1.
We say Y has a “Bernoulli distribution with parameter p” and
write: Y ~ Bernoulli ( p).
For Y ~ Bernoulli ( p):
μY = E (Y ) = Σi yi pi = (0)(1 – p) + (1)( p) = p;
E(Y²) = Σi yi² pi = (0)²(1 – p) + (1)²(p) = p;

σY² = V(Y) = Σi (yi – μY)² pi
= (0 – p)²(1 – p) + (1 – p)²(p) = … = p(1 – p). ///
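These Bernoulli moments are easy to check numerically with the generic weighted-average formulas for the mean and variance. A minimal sketch (the value p = 0.3 is an arbitrary choice of ours):

```python
# Verify E(Y) = p and V(Y) = p(1 - p) for Y ~ Bernoulli(p),
# using the weighted-average formulas from the text.
p = 0.3                      # arbitrary parameter, 0 < p < 1
values = [0, 1]
probs = [1 - p, p]           # Pr(Y = 0), Pr(Y = 1)

mu = sum(y * q for y, q in zip(values, probs))              # E(Y)
var = sum((y - mu) ** 2 * q for y, q in zip(values, probs)) # V(Y)

print(mu, var)
```

Running the check for other values of p in (0, 1) gives the same agreement with the algebra above.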
(iii) Linear functions: Take Z = a + bY, where a and b are constants.

E(Z) = E(a + bY)
= Σk (a + byk ) pk = a Σk pk + b Σk yk pk
= a + b E (Y ).
In words, expectation of a linear function (of Y) equals the linear function of the expectation (of Y).
Useful algebra: Let Y * = Y – μY , deviation of Y from its population
mean.
This function is linear in Y , as in (iii), where a = – μY , and b = 1.
E (Y *) = – μY + (1) E (Y ) = 0.
In words, expectation of a deviation around the mean is zero.
Next, examine Y*² = (Y – μY)² = Y² + μY² – 2YμY; this function is not linear in Y.

E(Y*²) = E(Y² + μY² – 2YμY)
= E(Y²) + μY² – 2μY E(Y)    (*)
= E(Y²) + μY² – 2μY²
= E(Y²) – [E(Y)]².
In line (*) we exploited the fact that E(.), which involves weighted averaging, is a “linear” operator; thus the expectation of a sum equals the sum of expectations.
From (ii), V (Y ) = E (Y *2); thus
V (Y ) = E (Y 2) – [ E (Y )]2.
In words, the variance of Y equals the “expectation of squared Y” minus the “square of expected Y.”
Finally, let Z = a + bY as in (iii), and consider the deviation of Z from its mean:

Z* = Z – E(Z) = a + bY – [a + bE(Y)] = bY*.
It follows that the variance of Z is related to the variance of Y via:
V(Z) = E(Z*²) = E[(bY*)²] = E(b²Y*²) = b² E(Y*²) = b² V(Y).

In words, the variance of a linear function equals the slope squared times the variance.
Exercise: Well-drilling project.

Based on previous experience, a contractor believes he will find water within 1-5 days, and attaches a probability to each possible outcome. Let T denote the (random amount of) time it takes to complete drilling. The probability distribution (p.f.) of T is:

t = time (days)        1     2     3     4     5
Pr(T = t) = fT(t)     0.1   0.2   0.3   0.3   0.1

(i) Find the cumulative distribution function (c.d.f.) of T and interpret it.

t = time (days)        1     2     3     4     5
FT(t) =
(ii) Find the expected duration of the project and interpret the number you find.

The contractor's total project cost is made up of two parts: a fixed cost of TL 2,000, plus TL 500 for each day taken to complete the drilling.

(iii) Find the expected total project cost.
(iv) Find the variance of the project cost.
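One way to work through parts (i)-(iv) is a short script that applies the weighted-average formulas and the linear-function rules above. This is our sketch, not part of the original exercise; the variable names are ours.

```python
# Well-drilling exercise: c.d.f., E(T), E(cost), V(cost).
t_vals = [1, 2, 3, 4, 5]
probs  = [0.1, 0.2, 0.3, 0.3, 0.1]

# (i) c.d.f.: cumulative sums of the probabilities
cdf = []
running = 0.0
for q in probs:
    running += q
    cdf.append(round(running, 10))

# (ii) expected duration E(T) as a weighted average
ET = sum(t * q for t, q in zip(t_vals, probs))

# (iii) cost C = 2000 + 500*T is linear in T, so E(C) = 2000 + 500*E(T)
EC = 2000 + 500 * ET

# (iv) variance of a linear function equals slope squared times
# the variance, so V(C) = 500**2 * V(T)
VT = sum((t - ET) ** 2 * q for t, q in zip(t_vals, probs))
VC = 500 ** 2 * VT

print(cdf, ET, EC, VC)
```

The c.d.f. answers questions such as Pr(T ≤ 3) directly, and Pr(a < T ≤ b) = FT(b) – FT(a) as noted earlier.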
Prediction: Consider the urn model, where population consists of
balls in an urn. A ball is picked at random. Your task is to guess the
value Y written on it. What would your guess be?

Example: Suppose you had to predict how long a particular well-drilling project would take. What would your guess be?
One of the possible values of T ?
Some other number?
We need more structure. Clearly, prediction is subject to error.
Errors can be costly, and large errors can be more costly.
What is the cost of a poor prediction?
Let “c” be your guess (a number).
Define prediction error as U = Y – c.
We would like to make U small. Since Y is a random variable, U is also a
random variable.
More definitions:
E (U ) = E (Y ) – c = bias of your guess (“c”).
E(U²) = E[(Y – c)²] = mean (expected) squared error of guess c.
Mean Squared Prediction Error criterion: Suppose the objective is to
minimize E (U 2). Then the best predictor (guess) is c = μY = E (Y ).
Proof: Can use calculus.
E(U²) = E[(Y – c)²] = Σi (yi – c)² pi.
Differentiation yields:
∂ E (U 2)/∂c = Σi ∂[( yi – c)2 pi]/∂c = Σi [–2( yi – c) pi].
Setting the derivative to zero yields the first order condition (F.O.C.)
for a minimum: Σi [–2( yi – c) pi] = 0.
That is,
Σi yi pi = c Σi pi.
We know Σi pi = 1 and Σi yi pi = E(Y), so the solution is c = μY. (Check the second-order condition to verify that we located a minimum.) ///
Non-Calculus proof: For brevity let μ = μY and reexamine the
prediction error:
U = Y – c = Y – μ – (c – μ) = Y * – (c – μ),
where Y * = Y – μ as usual. Square both sides and expand:
U 2 = [Y * – (c – μ)]2 = Y *2 + (c – μ)2 – 2Y *(c – μ).
Take expectations, and recall “useful algebra”:
E (U 2) = E [Y *2 + (c – μ)2 – 2Y *(c – μ)]
= E (Y *2) + (c – μ)2 – 2(c – μ) E (Y *)
= V(Y) + (c – μ)².

Since V(Y) > 0 and (c – μ)² ≥ 0, the minimum of E(U²) is obtained by setting c = μ. ///
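The claim can also be illustrated numerically: for a made-up distribution, no guess on a fine grid of candidates beats c = μY. A sketch of ours (the probabilities reuse the well-drilling example; nothing here is from the original proof):

```python
# Numerical illustration that c = E(Y) minimizes E[(Y - c)^2].
y_vals = [1, 2, 3, 4, 5]
probs  = [0.1, 0.2, 0.3, 0.3, 0.1]

mu = sum(y * q for y, q in zip(y_vals, probs))   # E(Y)

def mse(c):
    """E[(Y - c)^2] for guess c, as a weighted average."""
    return sum((y - c) ** 2 * q for y, q in zip(y_vals, probs))

# Scan a grid of candidate guesses: none should beat c = mu
grid = [i / 100 for i in range(0, 601)]
best = min(grid, key=mse)
print(mu, best, mse(mu))
```

Note that mse(mu) equals V(Y), in line with Remark (ii) below.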
Remarks:
(i) If we use the mean squared prediction error, then the population mean (that is, the expectation of the random variable) is the best guess (predictor) of a draw from that population (distribution).

(ii) Variance equals the value of the expected squared prediction error when the population mean is used as the predictor.

(iii) Other criteria may yield different choices of best predictor.
For example, if the criterion were minimization of the
expected absolute prediction error, namely E (|U |), then the
population median would be the best predictor.
How to envision Joint, Marginal, and Conditional Probability Distributions
Urn model: Population = Balls in an urn
Bivariate population: Each ball has a pair of values ( X , Y ) written on it.
X has J distinct values: x1, x2, ..., x j, ..., x J .
Y has K distinct values: y1, y2, ..., yk , ..., yK .
Joint (population) distribution of X and Y :
p jk = Pr( X = x j, Y = yk ), j = 1, 2, …, J ; k = 1, 2, …, K .
Gives the proportion of times we encounter a ball with paired values (xj, yk), j = 1, 2, …, J; k = 1, 2, …, K.
The joint distribution classifies the balls according to values of both
X and Y . To obtain a “marginal” distribution, we reclassify the balls
in the urn according to the distinct values of one “margin”. We
ignore the distinct values of the second margin.
Marginal (population) distribution of X :
p j = Pr( X = x j), j = 1, 2,…, J .
Here we ignore the values of Y, and examine the proportion of times we encounter a ball with value xj, j = 1, 2, …, J.

How to obtain the marginal distribution of X from the joint distribution of X and Y:
p j = Σk p jk , j = 1, 2,…, J .
(Population) Mean of X :
μX = E(X) = Σj xj pj.

(Population) Variance of X:

σX² = V(X) = Σj (xj – μX)² pj.
The marginal distribution of Y, its mean and variance may be
obtained in analogous fashion (write down the formula!).
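The recipe pj = Σk pjk can be sketched in a few lines. The joint probabilities below are made up for illustration (they are not the numbers in S&W Table 2.3):

```python
# Recover marginal distributions, means, and variances from a joint table.
x_vals = [0, 1]          # J = 2 distinct values of X
y_vals = [10, 20, 30]    # K = 3 distinct values of Y
joint = {                # p_jk = Pr(X = x_j, Y = y_k), made-up numbers
    (0, 10): 0.10, (0, 20): 0.20, (0, 30): 0.10,
    (1, 10): 0.15, (1, 20): 0.25, (1, 30): 0.20,
}

# Marginal of X: p_j = sum over k of p_jk (and analogously for Y)
p_x = {x: sum(joint[(x, y)] for y in y_vals) for x in x_vals}
p_y = {y: sum(joint[(x, y)] for x in x_vals) for y in y_vals}

# Mean and variance of X from its marginal distribution
mu_x = sum(x * q for x, q in p_x.items())
var_x = sum((x - mu_x) ** 2 * q for x, q in p_x.items())

print(p_x, p_y, mu_x, var_x)
```

Each marginal sums to 1, as it must for a probability distribution.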
Exercise: Consider S&W Table 2.3, Panel A (see next page).
(i) Verify the derivation of the marginal distributions of A and M .
(ii) Find the means E(A), E(M) and variances V(A), V(M).
[S&W Table 2.3 (Stock & Watson) — table not reproduced here]
To obtain a “conditional” distribution, we first sort the balls
according to one of the two values, and put them in different urns.
We then examine the contents of a specific urn.

To obtain the conditional distributions of Y given X we sort on
distinct values x j:
POPULATION → SUBPOPULATIONS: X = x1, X = x2, …, X = xj, …, X = xJ

Each urn has a distribution of values of Y!
These conditional distributions may be different (hence each
subpopulation may have a different mean and variance). We can
distinguish between them as long as we record the distinct value of X for that urn.
Conditional (population) distribution of Y given X = x j:
pk|j = Pr(Y = yk | X = xj) = pjk / pj,  k = 1, 2, …, K.

The derivation requires pj > 0.
Conditional (population) mean of Y given X = x j:
μY | j = E (Y | X = x j) = Σk yk pk | j.
Conditional (population) variance of Y given X = xj:

σ²Y|j = V(Y | X = xj) = Σk (yk – μY|j)² pk|j.
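The division pk|j = pjk / pj translates directly into code. The joint table below is made up for illustration; the function names are ours.

```python
# Conditional distribution and conditional mean of Y given X = x_j.
x_vals = [0, 1]
y_vals = [10, 20, 30]
joint = {                # made-up joint probabilities
    (0, 10): 0.10, (0, 20): 0.20, (0, 30): 0.10,
    (1, 10): 0.15, (1, 20): 0.25, (1, 30): 0.20,
}
p_x = {x: sum(joint[(x, y)] for y in y_vals) for x in x_vals}

def cond_dist(x):
    """Pr(Y = y | X = x) for each y; requires Pr(X = x) > 0."""
    return {y: joint[(x, y)] / p_x[x] for y in y_vals}

def cond_mean(x):
    """E(Y | X = x): weighted average under the conditional distribution."""
    return sum(y * q for y, q in cond_dist(x).items())

print(cond_dist(0), cond_mean(0), cond_mean(1))
```

Note that the conditional means differ across the two "urns" X = 0 and X = 1, exactly as the text describes.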
The conditional distributions of X given Y = yk , and their conditional
means and variances may be obtained in analogous fashion (write
down the formula you would use!).

Exercises: Consider S&W Table 2.3, Panel B (see page 22 above).
(i) Verify the derivation in Panel B.
(ii) Find E ( M | A) for A = 0, 1 and interpret them.
Practical uses of conditional expectations:
• Consider the conditional distributions given in S&W Table 2.3
(p.70). Suppose you have an old computer. How would you justify buying a new computer?
Hint: Calculate the benefit (reduction in expected crashes) of
switching from an old computer to a new one.
Practical uses of conditional expectations, cont’d:
• Consider the urn model. We obtain a random draw from the joint
distribution of ( X , Y ). We tell you the value of X . What is your best guess of the value of Y ?
Hint: Suppose X = x j. Then an equivalent way of stating the problem
is that Y has been drawn from the urn labeled X = x j.
We saw that “expectation” is a weighted average. In the urn model, if
we focus on urn labelled X = x j, we find the conditional mean using
μY | j = E (Y | X = x j) = Σk yk pk | j, j = 1,2,…, J .
In classifying the balls, the urn labelled X = x j is used with probability
p j = Pr( X = x j). As a consequence:
μY = E(Y) = Σj E(Y | X = xj) Pr(X = xj).

Thus, the expectation (mean) of Y is a weighted average of the
conditional expectations of Y given X = x j , weighted by Pr( X = x j).
We may write: E (Y ) = E X [ E (Y | X )].
This result is known as the Law of Iterated Expectations.
Law of Iterated Expectations:
E (Y ) = E X [ E (Y | X )].
Observe that:
• The “inner” expectation E (Y | X ) is a weighted average of the
different values of y, weighted by the conditional probabilities
Pr(Y = yk | X = x j) (here X is “given”, we know which urn the balls
come from).
• The “outer” expectation E X [.] is a weighted average of the
different values of E(Y | X = xj), weighted by the probabilities Pr(X = xj).
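The Law of Iterated Expectations is easy to verify on any small joint table. A sketch of ours (the joint probabilities are made up):

```python
# Check E(Y) = E_X[E(Y | X)] on a made-up joint distribution.
x_vals = [0, 1]
y_vals = [10, 20, 30]
joint = {
    (0, 10): 0.10, (0, 20): 0.20, (0, 30): 0.10,
    (1, 10): 0.15, (1, 20): 0.25, (1, 30): 0.20,
}
p_x = {x: sum(joint[(x, y)] for y in y_vals) for x in x_vals}

# Marginal route: E(Y) directly from the joint distribution
EY_marginal = sum(y * p for (x, y), p in joint.items())

# Iterated route: weight the conditional means E(Y | X = x) by Pr(X = x)
def cond_mean(x):
    return sum(y * joint[(x, y)] for y in y_vals) / p_x[x]

EY_iterated = sum(p_x[x] * cond_mean(x) for x in x_vals)

print(EY_marginal, EY_iterated)
```

The two routes agree, as the law requires.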
Exercise: Earlier we used the marginal distribution of M to calculate E ( M ). Can you think of another way to compute E ( M )? (see S&W:72)
Functions of jointly distributed random variables:
Let Z = h( X , Y ), a function of two random variables, X and Y .
Suppose the joint distribution of X and Y is known.
Then the expectation of Z can be computed in the usual manner,
as a weighted average:
E(Z) = E[h(X, Y)] = Σi Σj h(xi, yj) Pr(X = xi, Y = yj) = Σi Σj h(xi, yj) pij,
where the probability weights pij, i = 1, 2,…, I and j = 1, 2,…, J are
obtained from the joint distribution.
Exercise: Use S&W Table 2.3 to compute E ( MA).
(Population) covariance:
In a joint distribution, the degree to which two random variables
are related may be measured with the help of covariance:
Cov( X , Y ) = σ XY = E ( X *Y *) = E [( X – μ X )( Y – μY )]
= Σ j Σk ( x j – μ X )( yk – μY ) Pr( X = x j, Y = yk ).
Remark: We took Z = h( X , Y ) = ( X – μ X )(Y – μY ) and found E ( Z )…
Useful algebra:
E ( X *Y *) = E [( X – μ X )( Y – μY )] = …
= E(XY) – E(X)E(Y).

In words, covariance equals the expected value of the product, minus the product of the expectations.
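Both routes to the covariance, the definition and the shortcut just derived, can be checked on a small joint table. The numbers below are made up for illustration:

```python
# Check Cov(X, Y) = E[(X - muX)(Y - muY)] = E(XY) - E(X)E(Y)
# on a made-up joint distribution.
joint = {
    (0, 10): 0.10, (0, 20): 0.20, (0, 30): 0.10,
    (1, 10): 0.15, (1, 20): 0.25, (1, 30): 0.20,
}
EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())

# Definition: weighted average of the product of deviations
cov_definition = sum((x - EX) * (y - EY) * p for (x, y), p in joint.items())
# Shortcut: expected product minus product of expectations
cov_shortcut = EXY - EX * EY

print(cov_definition, cov_shortcut)
```

Here the covariance is positive: above-average X values tend to go with above-average Y values in this table, which anticipates the "sign" discussion below.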
(Population) covariance cont’d:
The “sign” of covariance is informative about the nature of the
relation:
If above average values of X go together with above average
values of Y (so that below average values of X go together with
below average values of Y ) covariance will be positive.
If above average values of one variable go together with below
average values of the other, covariance will be negative.
Exercise: Suppose X = weight, Y = height of individuals in a
population. Can you guess the sign of Cov( X , Y ) = ?
(Population) correlation:
The magnitude of covariance is affected by the units of measurement
of the variables. For a unit-free measure, we turn to correlation:
Corr(X, Y) = ρXY = Cov(X, Y) / √[V(X)V(Y)] = σXY / (σX σY).
It can be shown that – 1 ≤ ρ XY ≤ 1.
Random variables are said to be uncorrelated if ρXY = 0. Clearly, for this to happen, σXY = 0 must hold.
Recall that in general E (Y | X ) is a function of X ; it tells us how the
conditional mean of Y given X = x j changes with x j, j = 1, 2, …, J.
Remark: Think about the urn model. Think about prediction.
Suppose E (Y | X ) = E (Y ) = μY , a constant. To describe this case, we say
Y is mean-independent of X .
Claim 1: If Y is mean-independent of X , then σ XY = 0 ( ρ XY = 0).
Proof: E ( XY ) = E (YX ) = E X [ E (YX | X )] = E X [ E (Y | X ) X ];
*When we “condition” on X , we set it equal to a particular value.
If E (Y | X ) = E (Y ), the last expression simplifies:
= E X [ E (Y ) X ] = E (Y ) E ( X ).
We showed:
If Y is mean-independent of X , then E ( XY ) = E (Y ) E ( X ).
Return to the “useful algebra” result for covariance and note that σXY = 0 iff E(XY) = E(X)E(Y). Thus σXY = 0… ///
CAUTION: If σ XY = 0, it does not follow that E (Y | X ) = constant.
Covariance/correlation capture the linear relation between X and Y .
It could be that the relation is non-linear, so that E(Y | X) varies with X, and yet σXY = 0.
Example: Modify the joint distribution in Assignment 2 Part II as:

f(x, y)    x = –1   x = 0   x = 1
y = 1       0.20     0.10    0.20
y = 2       0.10     0.30    0.10

and (re)calculate Cov(X, Y).
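Working this example numerically (our script, using the table above) shows that Cov(X, Y) = 0 even though the conditional mean E(Y | X) clearly varies with X:

```python
# Zero covariance does NOT imply a constant conditional mean.
joint = {(-1, 1): 0.20, (0, 1): 0.10, (1, 1): 0.20,
         (-1, 2): 0.10, (0, 2): 0.30, (1, 2): 0.10}

EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
# Covariance via the shortcut E(XY) - E(X)E(Y)
cov = sum(x * y * p for (x, y), p in joint.items()) - EX * EY

def cond_mean_y(x0):
    """E(Y | X = x0) from the joint table."""
    px = sum(p for (x, y), p in joint.items() if x == x0)
    return sum(y * p for (x, y), p in joint.items() if x == x0) / px

print(cov, cond_mean_y(-1), cond_mean_y(0), cond_mean_y(1))
```

The symmetry of the table around x = 0 is what kills the covariance: the relation between X and Y is real but non-linear.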
Independence: Random variables X and Y are (statistically)
independent , if knowledge of the value of one of the variables
provides no information about the other. Formally:
From the definition of conditional probabilities,
Pr(Y = y, X = x) = Pr(Y = y | X = x)Pr( X = x).
Thus an equivalent condition for, and implication of independence is:
I.1. X and Y are independently distributed if, for all values of x and y,
Pr(Y = y | X = x) = Pr(Y = y).
I.2. Pr(Y = y, X = x) = Pr(Y = y)Pr( X = x), for all values of x and y.
Claim 2: If X and Y are independently distributed, then
E ( X |Y ) = E ( X ) and E (Y | X ) = E (Y ).
Proof: E(Y | X = xj) = Σk yk pk|j = Σk yk (pjk / pj) = Σk yk (pj pk / pj) = Σk yk pk = E(Y). ///
SUMMARY:
Independence ⇒ Mean-independence ⇒ Zero correlation.
However: We cannot go from right to left!
Stronger condition implies the weaker condition; not the other wayaround.
Additional Linear Function Rules: (S&W Appendix 2.1)
Suppose Z = X + Y. Then, using the weighted-average formula for E[h(X, Y)], it is easy to show
E ( Z ) = E ( X ) + E (Y ).
In words, expectation of a sum equals the sum of expectations.
Continuing, if Z = X + Y , then Z * = X * + Y *, and Z *2 = X *2 + Y *2 + 2 X *Y *,
where the asterisk denotes the deviation from the expectation. So
V ( Z ) = E ( Z *2) = E ( X *2) + E (Y *2) + 2E ( X *Y *)
= V ( X ) + V (Y ) + 2C ( X ,Y ).
In words, variance of a sum equals the sum of the variances plus twice
the covariance.

Exercise: Use the same logic to find the variance of a difference.
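The sum rule, and the analogous rule the exercise asks for, can be checked numerically on any joint table. A sketch of ours (made-up probabilities; the difference rule in the assertions, V(X – Y) = V(X) + V(Y) – 2C(X, Y), is the answer the same logic yields):

```python
# Numerical check of V(X + Y) = V(X) + V(Y) + 2C(X, Y), and of the
# analogous rule for a difference, on a made-up joint distribution.
joint = {
    (0, 10): 0.10, (0, 20): 0.20, (0, 30): 0.10,
    (1, 10): 0.15, (1, 20): 0.25, (1, 30): 0.20,
}

def E(h):
    """Expectation of h(X, Y) as a weighted average over the joint p.f."""
    return sum(h(x, y) * p for (x, y), p in joint.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
VX = E(lambda x, y: (x - EX) ** 2)
VY = E(lambda x, y: (y - EY) ** 2)
CXY = E(lambda x, y: (x - EX) * (y - EY))

# Variances of the sum and the difference, computed directly
V_sum = E(lambda x, y: ((x + y) - (EX + EY)) ** 2)
V_diff = E(lambda x, y: ((x - y) - (EX - EY)) ** 2)

print(V_sum, VX + VY + 2 * CXY)
print(V_diff, VX + VY - 2 * CXY)
```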
Generalizing to linear functions, if
Z = a + bX + cY
where a, b and c are constants, then
E ( Z ) = a + bE ( X ) + cE (Y ),
so the deviation from the expectation is Z * = bX * + cY *, and the
variance of Z is
V ( Z ) = E ( Z *2) = b2V ( X ) + c2V (Y ) + 2bcC ( X ,Y ).
Still more generally, for a pair of random variables
Z 1 = a1 + b1 X + c1Y , Z 2 = a2 + b2 X + c2Y ,
where the a’s, b’s, and c’s are constants, the covariance of Z1 and Z2 is
C ( Z 1, Z 2) = b1b2V ( X ) + c1c2V (Y ) + (b1c2 + b2c1)C ( X ,Y ).
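A final sanity check of this covariance rule, on a made-up joint table with arbitrary constants (all names and numbers are ours):

```python
# Check C(Z1, Z2) = b1*b2*V(X) + c1*c2*V(Y) + (b1*c2 + b2*c1)*C(X, Y).
joint = {
    (0, 10): 0.10, (0, 20): 0.20, (0, 30): 0.10,
    (1, 10): 0.15, (1, 20): 0.25, (1, 30): 0.20,
}

def E(h):
    """Expectation of h(X, Y) over the joint distribution."""
    return sum(h(x, y) * p for (x, y), p in joint.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
VX = E(lambda x, y: (x - EX) ** 2)
VY = E(lambda x, y: (y - EY) ** 2)
CXY = E(lambda x, y: (x - EX) * (y - EY))

# Arbitrary constants for the two linear functions
a1, b1, c1 = 1.0, 2.0, -1.0
a2, b2, c2 = -3.0, 0.5, 4.0

def z1(x, y): return a1 + b1 * x + c1 * y
def z2(x, y): return a2 + b2 * x + c2 * y

EZ1, EZ2 = E(z1), E(z2)
# Covariance computed directly from the definition...
cov_direct = E(lambda x, y: (z1(x, y) - EZ1) * (z2(x, y) - EZ2))
# ...and via the linear-function rule above
cov_formula = b1 * b2 * VX + c1 * c2 * VY + (b1 * c2 + b2 * c1) * CXY

print(cov_direct, cov_formula)
```

The intercepts a1 and a2 drop out, as they must: adding a constant shifts a variable but not its deviations from the mean.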