
Basic Statistics for SGPE Students

Part III: Descriptive Statistics¹

Mark Mitchell ([email protected])

University of Edinburgh

September 2017

¹ Thanks to Achim Ahrens, Anna Babloyan and Erkal Ersoy for creating these slides and allowing us to use them.

Outline

1. Probability theory
   - Conditional probabilities and independence
   - Bayes' theorem
2. Probability distributions
   - Discrete and continuous probability functions
   - Probability density function & cumulative distribution function
   - Binomial, Poisson and Normal distribution
   - E[X] and V[X]
3. Descriptive statistics
   - Sample statistics (mean, variance, percentiles)
   - Graphs (box plot, histogram)
   - Data transformations (log transformation, unit of measure)
   - Correlation vs. Causation
4. Statistical inference
   - Population vs. sample
   - Law of large numbers
   - Central limit theorem
   - Confidence intervals
   - Hypothesis testing and p-values

Descriptive statistics

- In recent years, more and better-quality data have been recorded than at any other time in history.

- The increasing size of the data sets that are readily available to us has enabled us to adopt new and more robust statistical tools.

- Rising data availability has (unfortunately) led empirical researchers to sometimes overlook preliminary steps, such as summarizing and visually examining their data sets.

- Ignoring these preliminary steps can lead to important issues and invalidate seemingly significant results.

As we will see in this and the following lectures, there are ways in which we can numerically summarize a data set. Before we discuss those approaches, let's take a quick look at what's available to us to visualize a data set graphically.


Descriptive statistics: Histograms

- Histograms are extremely useful for getting a good graphical representation of the distribution of data. These figures consist of adjacent rectangles over discrete intervals, whose areas represent the frequency of observations in the interval.

- Histograms are often normalized to show the proportion (or density) of observations that fall into non-overlapping categories. In such cases, the total area under the bins equals 1.

Remark: The height of each bin in a normalized histogram represents the density or proportion of observations that fall into that category. These can more easily be interpreted as percentages.
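Not part of the original slides, but a minimal Python sketch of how a normalized histogram works; the data are simulated, so every number here is an assumption:

```python
import numpy as np

# Simulated life expectancies -- illustrative only, not the slides' dataset
rng = np.random.default_rng(seed=1)
life_exp = rng.normal(loc=70, scale=8, size=200)

# density=True normalizes the histogram so the total area under the bins is 1
heights, edges = np.histogram(life_exp, bins=12, density=True)
widths = np.diff(edges)

print((heights * widths).sum())  # -> 1.0: heights are densities, not counts
```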


Descriptive statistics: Histograms

[Histogram: distribution of life expectancy (in years) across countries in 1960; x-axis from 25 to 85 years, y-axis showing density.]

- Approximately, what is the average life expectancy in 1960?
- Roughly what percentage of countries had life expectancy above 65?
- What proportion of countries had a life expectancy less than 55 years?

Descriptive statistics: Histograms

[Histograms: the same distribution of life expectancy (in years) across countries, shown for 1960, 1990 and 2011; x-axis from 25 to 85 years, y-axis showing density.]

Descriptive statistics: The Mean and Standard Deviation

- A histogram can help summarize large amounts of data, but we often want an even shorter (and sometimes easier to interpret) summary. This is usually provided by the mean and the standard deviation.

- The mean (and median) are frequently used to find the center of the data, whereas the standard deviation measures the spread.

Definition: The (arithmetic) mean of a list of numbers is their sum divided by how many there are.

For example, the mean of 9, 1, 2, 2, 0 is $\frac{9+1+2+2+0}{5} = 2.8$.

More generally, $\text{mean} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $i = 1, \dots, n$.

Descriptive statistics: The Mean and Standard Deviation

- The standard deviation (SD) tells us how far the numbers on a list deviate from their average. Usually, most numbers are within one SD of the mean.

- More specifically, for normally distributed variables, about 68% of entries are within one SD of the mean and about 95% of entries are within two SDs.

[Diagrams: a normal curve with the region from (mean − one SD) to (mean + one SD) covering about 68% of the area, and the region from (mean − two SDs) to (mean + two SDs) covering about 95%.]

Descriptive statistics: Computing the Standard Deviation

Definition: Standard Deviation = $\sqrt{\text{mean of (deviations from the mean)}^2}$, where deviation from mean = entry − mean.

In formal notation, $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$, where $\mu = \frac{1}{N}(x_1 + \dots + x_N)$.

Example: Find the SD of 20, 10, 15, 15.
Answer: mean $= \bar{x} = \frac{20+10+15+15}{4} = 15$. Then, the deviations are $5, -5, 0, 0$, respectively.
So, SD $= \sqrt{\frac{5^2+(-5)^2+0^2+0^2}{4}} = \sqrt{\frac{50}{4}} = \sqrt{12.5} \approx 3.5$.

Remark: The SD comes out in the same units as the data. For example, if the data are a set of individuals' heights in inches, the SD is in inches too.
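A quick numerical check of this worked example (a sketch in Python; only the four data points come from the slide):

```python
import numpy as np

data = np.array([20, 10, 15, 15])

mean = data.mean()                    # (20 + 10 + 15 + 15) / 4 = 15
deviations = data - mean              # 5, -5, 0, 0
sd = np.sqrt((deviations**2).mean())  # root of the mean squared deviation

print(mean, sd)  # -> 15.0 3.5355...  (np.std(data) gives the same, dividing by N)
```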


Descriptive statistics: The Root-Mean-Square

Consider the following list of numbers: 0, 5, −8, 7, −3.
Question: How big are these numbers? What is their mean?
The mean is 0.2, but this does not tell us much about the size of the numbers; it only implies that the positive numbers slightly outweigh the negative ones.
To get a better sense of their size, we could use the mean of their absolute values. Statisticians tend to use another measure, though: the root-mean-square.

Definition: Root-mean-square (rms) = $\sqrt{\text{average of (entries)}^2}$

Descriptive statistics: The Root-Mean-Square and Standard Deviation

There is an alternative way of calculating the SD using the root-mean-square:

Remark: SD = $\sqrt{\text{mean of (entries)}^2 - (\text{mean of entries})^2}$

Recall the four numbers we used earlier to calculate the SD: 20, 10, 15, 15.
mean of (entries)² $= \frac{20^2+10^2+15^2+15^2}{4} = \frac{950}{4} = 237.5$
(mean of entries)² $= \left(\frac{20+10+15+15}{4}\right)^2 = \left(\frac{60}{4}\right)^2 = 225$
Therefore, SD $= \sqrt{237.5 - 225} = \sqrt{12.5} \approx 3.5$, which agrees with what we found earlier.
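The same shortcut is easy to verify in code (a sketch, reusing the slide's four numbers):

```python
import numpy as np

data = np.array([20, 10, 15, 15])

sd_shortcut = np.sqrt((data**2).mean() - data.mean()**2)  # sqrt(237.5 - 225)
sd_direct = np.std(data)                                  # direct definition, dividing by N

print(sd_shortcut, sd_direct)  # both -> 3.5355...
```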


Descriptive statistics: Variance

- In probability theory and statistics, variance is mentioned nearly as often as the mean and standard deviation. It is very closely related to the SD and is a measure of how far a set of numbers lies from its mean.

- Variance is the second central moment of a distribution (the mean being the first moment), and therefore tells us about the properties of the distribution (more on these later).

Definition: Variance = (Std. Dev.)² = σ²

Descriptive statistics: Normal Approximation for Data and Percentiles

[Histogram with normal curve overlay: daily trading volume of the S&P 500 (in thousands), January 2001 to December 2001, with the mean and −2 to +4 SDs marked; the y-axis shows frequency. Source: Yahoo! Finance and Commodity Systems, Inc.]

Is the normal approximation satisfactory here?


Descriptive statistics: Normal Approximation for Data and Percentiles

[Histogram with normal curve overlay: life expectancy (in years) in 2011, with the mean and ±1 and ±2 SDs marked; x-axis from 45 to 95 years, y-axis showing density.]

How about here?


Descriptive statistics: Normal Approximation for Data and Percentiles

Remark: The mean and SD can be used to effectively summarize data that follow the normal curve, but these summary statistics can be much less satisfactory for data that do not follow the normal curve.

In such cases, statisticians often opt for using percentiles to summarize distributions.

Table. Selected percentiles for life expectancy in 2011

Percentile | Value
1          | 48
10         | 52.6
25         | 63
50         | 73.4
75         | 76.9
95         | 81.8
99         | 82.7

Descriptive statistics: Calculating percentiles

1. Order all the values in your data set in ascending order (i.e. smallest to largest).
2. Select the percentile, P, that you would like to calculate and multiply it (as a fraction) by the total number of entries in your data set, n. The value you obtain is called the index.
3. If the index is not a whole number, round it up to the next integer.
4. Count the entries in your list of numbers, starting from the smallest one, until you reach the position indicated by your index.
5. This entry is the Pth percentile of your data set.

Descriptive statistics: Calculating percentiles

Example: Consider the following list of 5 numbers: 10, 15, 20, 25, 30. Which entry corresponds to the 25th percentile? What is the median?

To obtain the 25th percentile, all we need to do is 0.25 × 5 = 1.25. Since 1.25 is not a whole number, we round it up to 2, so the 25th percentile in this case is the second entry, 15.

We were also asked to obtain the median. To do this, calculate 0.5 × 5 = 2.5. Rounding this up gives 3. So, the median in this case is the third entry, 20.
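A sketch of this round-up recipe in Python; the function name is mine, not from the slides:

```python
import math

def nearest_rank_percentile(values, p):
    """p-th percentile (0 < p <= 100) using the round-up rule from the slides."""
    ordered = sorted(values)                   # step 1: ascending order
    index = math.ceil(p / 100 * len(ordered))  # steps 2-3: index, rounded up
    return ordered[index - 1]                  # steps 4-5: count from the smallest

data = [10, 15, 20, 25, 30]
print(nearest_rank_percentile(data, 25))  # -> 15
print(nearest_rank_percentile(data, 50))  # -> 20 (the median)
```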


Descriptive statistics: Percentiles

The 1st percentile of the distribution is approximately 48, meaning that life expectancy in 1% of countries in 2011 was 48 or less, and 99% of countries had a life expectancy higher than that. Similarly, the fact that the 25th percentile is 63 implies that 25% of countries had a life expectancy of 63 or less, whereas 75% had a longer expected lifespan.

Definition: The interquartile range is defined as 75th percentile − 25th percentile and is sometimes used as a measure of spread, particularly when the SD would pay too much (or too little) attention to a small percentage of cases in the tails of the distribution.

From the table above, the interquartile range equals 76.9 − 63 = 13.9 (and the SD was 10.14).

Descriptive statistics: Box plots

The structure of a box plot:

- Box: spans from the 25th percentile/1st quartile (lower hinge) to the 75th percentile/3rd quartile (upper hinge), with a line at the 50th percentile (median).
- Whiskers: run from the hinges to the adjacent lines. The upper adjacent value is the largest value within 75th percentile + 1.5 × IQR; the lower adjacent value is the smallest value within 25th percentile − 1.5 × IQR.
- Entries beyond the adjacent values (e.g. less than the lower adjacent value) are plotted individually as outliers.
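As an illustration (mine, not the slides'), the hinges and adjacent values described above can be computed directly; the percentile convention here is NumPy's interpolating one, which is an assumption:

```python
import numpy as np

rng = np.random.default_rng(seed=6)
data = rng.normal(loc=70, scale=8, size=300)  # simulated life expectancies

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Adjacent values: the most extreme data points within 1.5 x IQR of the hinges
upper_adjacent = data[data <= q3 + 1.5 * iqr].max()
lower_adjacent = data[data >= q1 - 1.5 * iqr].min()
outliers = data[(data > upper_adjacent) | (data < lower_adjacent)]

print(q1, median, q3, lower_adjacent, upper_adjacent, len(outliers))
```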

Descriptive statistics: Box plots

[Box plots: life expectancy (in years) by region in 2011.]

Legend: EAS: East Asia & Pacific; ECS: Europe & Central Asia; LCN: Latin America & Caribbean; MEA: Middle East & North Africa; NAC: North America; SAS: South Asia; SSF: Sub-Saharan Africa.

Are there any clear patterns emerging from summarizing the data this way?

Descriptive statistics: Box plots

We might be able to spot some patterns that developed over time if we look at different years:

[Box plots: life expectancy (in years) by region (EAS, ECS, LCN, MEA, NAC, SAS, SSF), shown separately for 1960, 1990 and 2011.]

Data Transformations: The effects of changing the unit of measure

- Now that we know how to summarize a dataset, let us turn to the effects that changing the unit of measure of a variable has on the mean and standard deviation.

- Such changes in the unit of measure could be made for practical reasons or based on theory, but regardless of the reason, a statistician should know what to expect.

- To study this, let's consider a dataset on 200 individuals' weights and heights.

- Each entry is originally recorded in kg and cm, respectively, and below are some summary statistics:

Table. Summary statistics

Variable    | Mean   | Standard Deviation
Weight (kg) | 65.8   | 15.1
Height (cm) | 170.02 | 12.01

Data Transformations: The effects of changing the unit of measure

And here are some diagrams that summarize the distribution of the two variables.

[Histograms with normal curve overlays: measured weight in kg (x-axis 40 to 170) and measured height in cm (x-axis 140 to 200), each marking the mean and ±1 and ±2 SDs.]

Does the normal approximation look satisfactory?

Data Transformations: The effects of changing the unit of measure

And here are some diagrams that summarize the distribution of the two variables.

[Box plots: measured weight in kg by sex (F, M), and measured height in cm by sex (F, M).]

Data Transformations: The effects of changing the unit of measure

And here are some diagrams that summarize the distribution of the two variables.

[Histograms with normal curve overlays: weight measured in kg (x-axis 40 to 170) and the same weights measured in pounds (x-axis 80 to 360).]

Do you think the mean matches the original one (in the correct units)? How about the standard deviation?

Data Transformations: The effects of changing the unit of measure

And here are some diagrams that summarize the distribution of the two variables.

[Histograms with normal curve overlays: height measured in cm (x-axis 140 to 200) and the same heights measured in inches (x-axis 55 to 80).]

Do you think the mean matches the original one (in the correct units)? How about the standard deviation?

Data Transformations: The effects of changing the unit of measure

Here are the box plots with the transformed data:

[Box plots: measured weight in pounds by sex (F, M), and measured height in inches by sex (F, M).]

Data Transformations: The effects of changing the unit of measure

- Observations made using the figures are, of course, based on what statisticians and econometricians often call "eye-balling" the data. These observations are certainly not formal, but they are a crucial part of effectively analyzing any dataset.

- In fact, you should make plotting, investigating and eye-balling your data a habit before you dive into complicated models, so that you do not overlook important features of your dataset.

- Now that we have made our informal observations, let's look at the actual numbers.

Table. Summary statistics (1 kg ≈ 2.2 lb; 1 cm ≈ 0.3937 in; 1 in = 2.54 cm)

Variable    | Mean   | SD    | Mean (converted)           | SD (converted)
Weight (kg) | 65.8   | 15.1  | 65.8 × 2.2 ≈ 145.06 lb     | 15.1 × 2.2 ≈ 33.28 lb
Height (cm) | 170.02 | 12.01 | 170.02 × 0.3937 ≈ 66.94 in | 12.01 × 0.3937 ≈ 4.73 in
Weight (lb) | 145.06 | 33.28 | 145.06 / 2.2 ≈ 65.8 kg     | 33.28 / 2.2 ≈ 15.1 kg
Height (in) | 66.94  | 4.73  | 66.94 × 2.54 ≈ 170.02 cm   | 4.73 × 2.54 ≈ 12.01 cm

Both the mean and the SD are simply multiplied by the conversion factor.

Data Transformations: The effects of changing the unit of measure

We have seen that the mean and the standard deviation are simply rescaled by the conversion factor when we change the unit of measure, but how does variance behave?

Table. Summary statistics (applying the linear factor to the variance)

Variable    | Mean   | Variance | Mean (converted)         | Variance × linear factor
Weight (kg) | 65.8   | 228.01   | 65.8 × 2.2 ≈ 145.06      | 228.01 × 2.2 ≈ 502.68 (≠ 1107.56)
Height (cm) | 170.02 | 144.24   | 170.02 × 0.3937 ≈ 66.94  | 144.24 × 0.3937 ≈ 56.79 (≠ 22.37)
Weight (lb) | 145.06 | 1107.56  | 145.06 / 2.2 ≈ 65.8      | 1107.56 / 2.2 ≈ 502.38 (≠ 228.01)
Height (in) | 66.94  | 22.37    | 66.94 × 2.54 ≈ 170.02    | 22.37 × 2.54 ≈ 56.81 (≠ 144.24)

Scaling the variance by the linear conversion factor clearly does not reproduce the variance in the new units. Note that 1 inch = 2.54 cm and, similarly, 1 cm = 1/2.54 = 0.3937 in. Then, 22.37 ≈ (0.3937)² × 144.24. The opposite is true as well: 144.24 ≈ (2.54)² × 22.37. We can apply the same to the weights in kg and lb. In general ...
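A small simulation of this scaling behaviour (the data are simulated, not the slides' 200 individuals; 2.2046 is the standard kg-to-lb factor):

```python
import numpy as np

# Simulated weights (kg); mean/SD roughly match the slides' table by construction
rng = np.random.default_rng(seed=2)
weight_kg = rng.normal(loc=65.8, scale=15.1, size=200)

c = 2.2046                  # 1 kg is about 2.2046 lb
weight_lb = c * weight_kg   # a change of units is a linear transformation

print(weight_lb.mean(), c * weight_kg.mean())    # mean scales by c
print(weight_lb.std(),  c * weight_kg.std())     # SD scales by c
print(weight_lb.var(),  c**2 * weight_kg.var())  # variance scales by c squared
```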


Data Transformations: Properties of Variance

... variance is scaled by the square of the constant by which all the values are scaled. While we are at it, here are some basic properties of variance (a short simulation follows the list):

Basic properties of variance
- Variance is non-negative: Var(X) ≥ 0.
- The variance of a constant random variable is zero: P(X = a) = 1 ⟺ Var(X) = 0.
- Var(aX) = a² Var(X).
- However, Var(X + a) = Var(X).
- For two random variables X and Y, Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
- ... but Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y).


Data Transformations: Log Transformation

So far, we have only worked with transformations in which we multiply each value by a constant. However, more complicated transformations are quite common in statistics and econometrics. One of the most common and useful transformations uses the natural logarithm.

Definition: Data transformation refers to applying a specific operation to each point in a dataset, in which each data point is replaced with the transformed one. That is, the $x_i$ are replaced by $y_i = f(x_i)$.

In our previous example with heights, our function $f(x)$ was simply $f(x) = 2.54x$. Now, let us study a different function: the natural logarithm.

Data Transformations: Log Transformation

Log transformation in action:

[Line plots: UK output-side real GDP at current PPPs (in mil. 2005 US$), 1960–2010, shown in levels and in natural logs (the log series runs from about 13 to 14.5).]

Data Transformations: Log Transformation

[Scatter plots: life expectancy (in years) vs. output-side real GDP at current PPPs (in mil. 2005 US$), and log life expectancy vs. the natural log of real GDP.]

Important note: The log transformation can only be used for variables that have positive values (why?). If the variable has zeros, the transformation can be applied only after these figures are replaced (usually by one-half of the smallest positive value in the data set).
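A sketch of this zero-handling rule in Python (the rule is the one stated in the note; the data are made up):

```python
import numpy as np

x = np.array([0.0, 0.0, 2.0, 5.0, 40.0, 1300.0])  # made-up data with zeros

# Replace zeros by one-half of the smallest positive value, then take logs
half_min_positive = x[x > 0].min() / 2
log_x = np.log(np.where(x == 0, half_min_positive, x))

print(log_x)  # the zeros enter as log(1.0) = 0 here, since the smallest positive is 2.0
```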


Data Transformations: Log Transformation

[Bubble scatter plots: life expectancy (in years, linear scale) vs. real GDP per capita (at constant 2005 national prices, log scale from 156.25 to 80000), by region (EAS, ECS, LCN, MEA, NAC, SAS, SSF), with bubble size proportional to population (10 to 1000 million). Selected countries labelled: CHN, GBR, IDN, IND, JPN, RUS, USA, ZAF. Shown for 1960, 1990 and 2011.]

Data Transformations: Log Transformation and growth

A useful feature of the log transformation is the interpretation of its first difference as a percentage change (for small changes). This is because $\ln(1 + x) \approx x$ for small x.

Strictly speaking, the percentage change in Y from period t − 1 to period t is defined as $\frac{Y_t - Y_{t-1}}{Y_{t-1}}$, which is approximately equal to $\ln(Y_t) - \ln(Y_{t-1})$. The approximation is almost exact if the percentage change is small. To see this, consider the percentage change in US GDP from 2010 to 2011:

Table. US Real GDP (in mil. 2005 US$)

Year | GDP        | Percentage change | ln(Y_t)   | ln(Y_2011) − ln(Y_2010) (in %)
2010 | 12,993,576 | 1.803507          | 16.379966 | 1.787436
2011 | 13,227,916 | .                 | 16.39784  | .

And the difference between the two measures of the change is 0.01803507 − 0.01787436 = 0.00016071, a discrepancy that we might be willing to live with.
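A quick check of these numbers (a sketch; the two GDP figures are the ones in the table):

```python
import numpy as np

gdp_2010, gdp_2011 = 12_993_576, 13_227_916  # US real GDP, mil. 2005 US$

pct_change = (gdp_2011 - gdp_2010) / gdp_2010  # exact: 0.018035...
log_diff = np.log(gdp_2011 / gdp_2010)         # approximation: 0.017874...

print(pct_change, log_diff, pct_change - log_diff)  # difference -> 0.00016071...
```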


Examining Relationships: Covariance and Correlation

Our daily lives (and not just within economics) are filled with statements about the relationship between two variables. For example, we might read about a study that found that men spend more money online than women.

The relationship between gender and online spending may not be this simple, of course; income might play a role in this observed pattern. Ideally, we would like to set up an experiment in which we control the behavior of one variable (keeping everything else the same) and observe its effect on another. This is often not feasible in economics (a lot more on this later!).

For the time being, let's focus on simple correlation.

Examining Relationships: Covariance and Correlation

Scatter plots are very useful in identifying the sign and strength of the relationship between two variables. Therefore, it's always extremely useful to plot your data and investigate what the relationship between your two variables is:

[Scatter plot: life expectancy (in years) vs. internet users per 100 people.]

Examining Relationships: Covariance and Correlation

But these plots can also be misleading to the eye, simply by changing the scale of the axes:

[Scatter plots: the same data (life expectancy vs. internet users per 100 people) plotted twice with different axis ranges (y-axis 65 to 85 vs. 45 to 95), which changes how strong the relationship looks.]

Examining Relationships: Covariance and Correlation

Therefore, it's best to obtain a numerical measure of the relationship, and correlation is the measure statisticians and econometricians tend to use.

Definition: Correlation measures the strength and direction of a linear relationship between two variables and is usually denoted r:

$r_{x,y} = r_{y,x} = \frac{s_{x,y}}{s_x s_y}$

where $s_{x,y}$ is the sample covariance, and $s_x$ and $s_y$ are the sample standard deviations of x and y, respectively. The former (i.e. the sample covariance) is calculated as:

$s_{x,y} = s_{y,x} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$
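These formulas map directly onto a few lines of Python (a sketch with made-up data):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # made-up data
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Sample covariance with the 1/(N - 1) normalization from the definition
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

r = s_xy / (s_x * s_y)
print(r, np.corrcoef(x, y)[0, 1])  # -> 0.8 0.8
```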


Examining Relationships: Understanding covariance

To see how a scatter diagram can be read in terms of the covariance between the two variables, consider the USA:

[Scatter plot: log of real GDP per capita (at constant 2005 national prices) vs. average years of total schooling, 2010, with COD, KWT and USA labelled and the deviations $x_{USA} - \bar{x}$ and $y_{USA} - \bar{y}$ marked.]

Because $x_{USA} > \bar{x}$ and $y_{USA} > \bar{y}$, the term $(x_{USA} - \bar{x})(y_{USA} - \bar{y})$ is positive. Also, $(x_{COD} - \bar{x})(y_{COD} - \bar{y}) > 0$, but $(x_{KWT} - \bar{x})(y_{KWT} - \bar{y}) < 0$.

Thus, countries located in the top-right and bottom-left quadrants have a positive effect on $s_{x,y}$, whereas countries in the top-left and bottom-right quadrants have a negative effect on $s_{x,y}$.

Question: Should we use covariance or correlation as the more "robust" measure of the relationship? Why?

Examining Relationships: Understanding covariance

To answer this question, let's look more closely at how covariance behaves. A positive covariance indicates that x tends to be above its mean whenever y is above its mean; a negative covariance indicates that x tends to be above its mean whenever y is below its mean. A sample covariance of zero suggests that x and y are unrelated.

In our example, $s_{x,y} = 2.69$. This suggests that there is a positive relationship between x and y. But what does the value of 2.69 tell us about the strength of the relationship? Nothing.

Why not? Suppose we wanted to measure schooling in decades instead of years. That is, we generate a new variable which equals schooling measured in years divided by 10. The new covariance is $s_{x,y} = 0.269$, which is much closer to zero.

Technically speaking, covariance is not invariant to linear transformations of the variables.
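A small simulated illustration of this point (the covariance of 2.69 above comes from the slides' schooling/GDP data, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(seed=4)
schooling_years = rng.uniform(0, 15, size=150)
log_gdp = 6 + 0.35 * schooling_years + rng.normal(scale=1.0, size=150)

schooling_decades = schooling_years / 10  # measure schooling in decades instead

print(np.cov(schooling_years, log_gdp)[0, 1],
      np.cov(schooling_decades, log_gdp)[0, 1])       # covariance shrinks by 10
print(np.corrcoef(schooling_years, log_gdp)[0, 1],
      np.corrcoef(schooling_decades, log_gdp)[0, 1])  # correlation is unchanged
```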


Examining Relationships: Covariance versus Correlation

The sample correlation coefficient addresses this problem. While $s_{x,y}$ may take any value between −∞ and +∞, the correlation coefficient is standardised such that $r \in [-1, 1]$. Recall that

$r_{x,y} = r_{y,x} = \frac{s_{x,y}}{s_x s_y}$

where $s_{x,y}$ is the covariance of x and y, and $s_x$ and $s_y$ are the sample standard deviations of x and y, respectively.

Note that because $s_x > 0$ and $s_y > 0$, the sign of the sample covariance is the same as the sign of the correlation coefficient.

Correlation coefficient
- $r_{x,y} > 0$ indicates positive correlation.
- $r_{x,y} < 0$ indicates negative correlation.
- $r_{x,y} = 0$ indicates that x and y are unrelated.
- $r_{x,y} = \pm 1$ indicates perfect positive (negative) correlation. That is, there exists an exact linear relationship between x and y of the form y = a + bx.

Examining Relationships: Correlation

In our example, $r_{x,y} = 0.7763$, which indicates positive correlation (because $r_{x,y} > 0$) and that the relationship is reasonably strong (because $r_{x,y}$ is not too far from 1).

To get a better feeling for what is "strong" and "weak", we generate 100 observations of x and y with varying degrees of correlation and plot them on scatter diagrams.

[Scatter plots: simulated data with r(x,y) = 0.9, r(x,y) = −0.9 and r(x,y) = 0.7.]

Examining Relationships: Correlation

[Scatter plots: simulated data with r(x,y) = 0.3, r(x,y) = 0 and r(x,y) = 0.]

What's unusual about the right-most diagram here? The correlation coefficient indicates that x and y are unrelated, but the graph implies otherwise. In fact, there is a strong quadratic relationship between x and y in this case.
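The zero-correlation-but-dependent pattern is easy to reproduce (a sketch, not the slides' actual simulation):

```python
import numpy as np

rng = np.random.default_rng(seed=5)
x = rng.uniform(-3, 3, size=100)
y = x**2 + rng.normal(scale=0.5, size=100)  # strong quadratic dependence on x

# x is symmetric around zero, so the *linear* association washes out
print(np.corrcoef(x, y)[0, 1])  # -> approximately 0
```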


Examining Relationships: Summary

- Correlation, r, measures the strength and direction of a linear relationship between two variables.

- The sign of r indicates the direction of the relationship: r > 0 for a positive association and r < 0 for a negative one.

- r always lies within [−1, 1] and indicates the strength of a relationship by how close it is to 1 or −1.

Examining Relationships: Correlation vs Causation

You may have already encountered the statement that correlation does not imply causation. This is an important concept to grasp, because even a strong correlation between two variables is not enough to draw conclusions about causation. For instance, consider the following examples:

1. Do televisions increase life expectancy?
There is a high positive correlation between the number of television sets per person in a country and life expectancy in that country. That is, nations with more TV sets per person have higher life expectancies. Does this imply that we could extend people's lives in a country just by shipping TVs to them? No, of course not. The correlation between these two variables stems from the nation's income: richer nations have more TVs per person than poorer ones. These nations also have access to better nutrition and health care.

2. Are big hospitals bad for you?
A study has found positive correlation between the size of a hospital (measured by its number of beds) and the median number of days that patients remain in the hospital. Does this mean that you can shorten a hospital stay by choosing a small hospital?

3. Do firefighters make fires worse?
A magazine has observed that "there's a strong positive correlation between the number of firefighters at a fire and the damage the fire does. So sending lots of firefighters just causes more damage." Is this reasoning flawed?

Examining Relationships: Reverse Causality

In addition to correlation feeding through a third (sometimes unobserved) variable, in economics we often run into reverse causality problems. Earlier, we showed that real GDP per capita and education (measured by average years of schooling) are positively correlated. This could be because:

1. Rich countries can afford more (and better) education. That is, an increase in GDP per capita causes an increase in schooling.

2. More (and better) education promotes innovation and productivity. That is, an increase in schooling causes an increase in GDP per capita.

The relationship between GDP per capita and education suffers from reverse causality. To reiterate, although we can make the statement that x and y are correlated, we do not know whether y is caused by x or vice versa. This is one of the central problems in empirical research in economics. In the course of the MSc, you will learn methods that allow you to identify the causal mechanisms in the relationship between y and x.