Quantitative Methods – Week 2: Descriptive Statistics Roman Studer Nuffield College [email protected]

Frequency Distributions

• Frequency distributions provide a summary presentation of the data
• Very good method to get a first overview of the data

Discrete variables
• Measure the frequency of occurrence of each of the values
• Example: Poor Law dataset, number of workhouses per county

[Figure: Bar Chart of Number of Workhouses in Selected Counties, 1831. Vertical axis: number of workhouses (0 to 35); counties shown: Essex, Norfolk, Suffolk, Cambs, Beds, Sussex, Kent.]

Continuous variables
• Choose appropriate class intervals and note the frequency of occurrence for each class
• Number of class intervals normally between 5 and 20
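A minimal sketch of how values of a continuous variable can be sorted into class intervals of equal width. The payment values below are hypothetical and are not the Kent data; the function name `class_frequencies` and its parameters are illustrative only.

```python
def class_frequencies(values, lower, width, n_classes):
    """Count how many values fall in each class interval [lo, lo + width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - lower) // width)  # which class the value belongs to
        if 0 <= index < n_classes:         # ignore values outside all classes
            counts[index] += 1
    return counts


# Hypothetical per capita relief payments in shillings (NOT the Kent data).
payments = [6.5, 11.0, 12.3, 14.9, 15.0, 18.2, 21.7, 24.9, 25.0, 33.1]
counts = class_frequencies(payments, lower=5, width=5, n_classes=7)

for k, f in enumerate(counts):
    lo = 5 + 5 * k
    print(f">= {lo} but < {lo + 5}: {f}")
```

Each interval includes its lower bound and excludes its upper bound, so a value of exactly 15 is counted in the class "≥ 15 but < 20".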

Example: Per capita relief payments in the parishes of Kent, 1831

Frequency Distributions (II)

(1)               (2)         (3)                      (4)          (5)
Class intervals   Frequency   Relative frequency (%)   Cumulative   Cumulative relative
(shillings)       (f)         (f/n x 100)              frequency    frequency (%)

≥ 5 but < 10      1           4.2                      1            4.2
≥ 10 but < 15     7           29.2                     8            33.4
≥ 15 but < 20     4           16.6                     12           50.0
≥ 20 but < 25     6           25.0                     18           75.0
≥ 25 but < 30     4           16.6                     22           91.6
≥ 30 but < 35     1           4.2                      23           95.8
≥ 35 but < 40     1           4.2                      24           100.0

Total             24          100.0
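As a rough illustration, columns (3) to (5) can be derived from the class frequencies in column (2). This is a sketch only; the variable names are illustrative. Unrounded shares are computed here, so a few values may differ in the last decimal place from the printed table (for example 16.7 rather than 16.6).

```python
frequencies = [1, 7, 4, 6, 4, 1, 1]  # column (2) of the table
n = sum(frequencies)                 # total number of parishes

# Column (4): running total of the frequencies.
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(running)

# Columns (3) and (5): shares of the total, in per cent.
relative = [100 * f / n for f in frequencies]
cumulative_relative = [100 * c / n for c in cumulative]

for f, r, c, cr in zip(frequencies, relative, cumulative, cumulative_relative):
    print(f"{f:2d}  {r:5.1f}  {c:2d}  {cr:5.1f}")
```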

Frequency Distributions (III)

[Figure: Frequency of Per Capita Relief Payments in Kent, 1831. Histogram; horizontal axis: relief payments (shillings) in class intervals from "≥ 5 but < 10" to "≥ 35 but < 40"; vertical axis: number of parishes (0 to 7).]

Frequency Distributions (IV)

[Figure: Relative Frequency of Per Capita Relief Payments. Histogram with a frequency curve drawn over it; horizontal axis: relief payments (shillings) in class intervals from "≥ 5 but < 10" to "≥ 35 but < 40"; vertical axis: percentages (0 to 35).]

Frequency Distributions (V)

[Figure: Cumulative Frequency of Relief Payments. Horizontal axis: relief payments (shillings), class intervals numbered 1 to 7; vertical axis: number of parishes (0 to 25).]

Frequency Distributions (VI)

[Figure: Cumulative Relative Frequency of Per Capita Relief. Horizontal axis: relief payments (10 to 40); vertical axis: cumulative percentages (0 to 100).]

Descriptive Statistics

Making frequency tables and plotting them as histograms and frequency curves is very helpful for getting a first overview.

But it is not a precise way to summarise the information on the variables in a dataset. To do this, we normally determine three features of a variable:

1) Which are the most central (i.e. the most common or typical) values?

2) How are the values spread (dispersed) around those central values?

3) What is the shape of the distribution?

Each of these features can be described by one or more simple statistics. Together they form the basic elements of descriptive statistics, as they provide a precise and comprehensive summary of the variables in a dataset.

Measures of Central Tendency

• The arithmetic mean
  Add up all the values and divide this total by the number of observations:

  $$E(y) = \mu_y = \frac{y_1 + y_2 + y_3 + \dots + y_N}{N} = \frac{1}{N}\sum_{i=1}^{N} y_i$$

• The median
  The value that has one-half of the observations above it and one-half below it, when the series is set out in an ascending or descending array
  Odd number of observations: position = (number of observations + 1) / 2
  Even number of observations: average of the two middle observations

• The mode
  The value that occurs most frequently

• Percentiles, deciles, and quartiles
  Instead of dividing the observations into two equal halves (median), we can divide them into four equal quarters (quartiles), or into 10 portions (deciles) or 100 portions (percentiles) of equal size
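The position rule for the median and the idea of quartiles can be sketched as follows. The data are hypothetical, and `median_by_position` is an illustrative helper rather than anything from the course materials. `statistics.quantiles` is from the Python standard library; its default interpolation method is only one of several conventions for quartiles, so other packages may report slightly different cut points.

```python
import statistics


def median_by_position(values):
    """Median via the position rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2        # 1-based position of the middle value
        return ordered[position - 1]
    upper = n // 2                     # 0-based index of the upper middle value
    return (ordered[upper - 1] + ordered[upper]) / 2


data = [3, 1, 4, 1, 5, 9, 2, 6]        # hypothetical observations
median = median_by_position(data)

# Three cut points that split the ordered data into four quarters.
quartiles = statistics.quantiles(data, n=4)

print(median, quartiles)
```

The second quartile is the median, which is why the median is sometimes described as the 50th percentile.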

Measures of Central Tendency (II)

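• Numeric example

Case      1    2    3    4    5    6    7    8
x_i       x1   x2   x3   x4   x5   x6   x7   x8
Values    2    4    5    5    7    9    10   30

What is the mean, the median, the mode (cases 1 to 7)?
Mean = 6, median = 5, mode = 5

What is the effect of adding value x8 = 30?
Mean = 9, median = 6, mode = 5

What is the absolute and what is the relative frequency of 5?
Absolute frequency = 2, relative frequency = 25% (2 of 8 observations)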
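The answers to the numeric example can be checked with a few lines of Python, using only the standard library `statistics` module. This is a sketch for verification, not part of the original exercise.

```python
import statistics

values = [2, 4, 5, 5, 7, 9, 10]            # cases 1 to 7

mean_7 = statistics.mean(values)
median_7 = statistics.median(values)
mode_7 = statistics.mode(values)

with_x8 = values + [30]                    # add case 8

mean_8 = statistics.mean(with_x8)
median_8 = statistics.median(with_x8)
mode_8 = statistics.mode(with_x8)

# Absolute and relative frequency of the value 5 among all eight cases.
absolute_frequency = with_x8.count(5)
relative_frequency = 100 * absolute_frequency / len(with_x8)

print(mean_7, median_7, mode_7)
print(mean_8, median_8, mode_8)
print(absolute_frequency, relative_frequency)
```

The single large value x8 = 30 pulls the mean from 6 to 9, while the median moves only from 5 to 6 and the mode does not change. This is why the median is often preferred when a series contains extreme values.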

Measures of Dispersion

• Two variables with equal arithmetic mean, but different spread

• Variable x is more densely distributed around the mean μ than variable y

[Figure: frequency curves f(x) and f(y) plotted on a common x, y axis around the same mean.]

Measures of Dispersion (II)

• The variance
  The variance is equal to the arithmetic mean of the squared deviations from the mean:

  $$\mathrm{Var}(X) = \sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$$

  where $\bar{X}$ is the arithmetic mean of X
  The variance is widely used in statistical work; however, its disadvantage is that it is expressed in squared units

• The standard deviation
  The standard deviation is the square root of the variance:

  $$\sigma_x = \sqrt{\mathrm{var}(x)}$$

  Interpretation: average or typical deviation of variable x from the arithmetic mean
  The standard deviation is the most widely used measure of dispersion; however, as it is expressed in the same units as the series, absolute standard deviations are unsuitable for comparing series that have different underlying units

Measures of Dispersion (III)

• The coefficient of variation (CV)
  The coefficient of variation is a measure of relative rather than absolute variation
  It is obtained by dividing the standard deviation by the mean:

  $$CV = \frac{\sigma_x}{\mu_x}$$

  Interpretation: average deviation from the mean, expressed relative to the mean (a percentage when multiplied by 100)

• The range
  This is a very crude measure of dispersion, defined as the difference between the maximum and the minimum value in the series
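The point that the CV is unit-free can be illustrated with a short sketch. Reading the example values as shillings is an assumption made only for this illustration; the conversion uses 12 pence to the shilling. The helper names are illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation (dividing by N) divided by the arithmetic mean."""
    return statistics.pstdev(values) / statistics.mean(values)


def value_range(values):
    """Difference between the maximum and the minimum value."""
    return max(values) - min(values)


shillings = [2, 4, 5, 5, 7, 9, 10]       # example values, read here as shillings
pence = [12 * s for s in shillings]      # the same payments expressed in pence

sd_shillings = statistics.pstdev(shillings)
sd_pence = statistics.pstdev(pence)      # 12 times larger: depends on the unit

cv_shillings = coefficient_of_variation(shillings)
cv_pence = coefficient_of_variation(pence)   # unchanged: unit-free

print(sd_shillings, sd_pence)
print(cv_shillings, cv_pence)
print(value_range(shillings))
```

Changing the unit rescales the standard deviation and the range but leaves the CV untouched, which is what makes it suitable for comparing series measured in different units.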

The Shape of Distributions

• Normal distribution
  The normal distribution is a symmetrical, smooth, bell-shaped distribution that is fully described by the arithmetic mean and standard deviation
  Mode, median and mean are equal
  Measures of skewness and kurtosis of the normal distribution are equal to 0 and 3, respectively
  But again: mean and standard deviation depend on the units of the series and are thus difficult to compare

The Shape of Distributions (II)

• Standard normal distribution
  Every normal distribution can be transformed into a standard normal distribution using

  $$Z = \frac{Y - \mu_y}{\sigma_y}$$

  By definition, the standard normal distribution has two further basic features that a general normal distribution does not have:
  • mean = 0
  • standard deviation = 1

  These properties make the distribution ideal for comparison
  For this reason the standard normal distribution has a key role in inductive statistics, as it can be used to make inferences on probabilities
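The transformation can be sketched in a few lines. The data are the values from the numeric example and are used purely for illustration; they are not normally distributed, but the standardised series still has mean 0 and standard deviation 1, since that follows from the transformation itself. The function name is illustrative.

```python
import statistics


def standardise(values):
    """Subtract the mean and divide by the standard deviation (dividing by N)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(y - mean) / sd for y in values]


z = standardise([2, 4, 5, 5, 7, 9, 10])

print([round(v, 2) for v in z])
print(round(statistics.mean(z), 10), round(statistics.pstdev(z), 10))
```

Each standardised value states how many standard deviations an observation lies above or below the mean, which is what allows series in different units to be compared.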

The Shape of Distributions (III)

• Skewed distributions
  However, values need not be symmetrically distributed around the central point, i.e. distributions can be skewed
  In these cases, mean and standard deviation are insufficient to describe the distribution
  Socio-economic data in particular (wages, income, wealth and related variables) are frequently skewed

[Figure: frequency curve plotted against x, with mode, median and mean marked on the horizontal axis; vertical axis: frequency. This distribution is skewed to the right (positively skewed).]

The Shape of Distributions (IV)

• Consequences of skewed distributions

Skewed variables can lead to undesirable effects in regressions
• Non-normally distributed residuals (misspecification)
• Heteroscedasticity; test statistics and confidence intervals are biased

(Roughly) normally distributed variables help to avoid these problems. Take a look at the variable:
• If the variable is not significantly skewed, continue
• If the variable is skewed, transform the variable: “Ladder of Powers”. For this reason you often find the logarithm of income, the square root of the mortality rate, etc.
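A small sketch of why a transformation further down the ladder of powers helps with right-skewed data. The incomes are hypothetical, and skewness is computed with the moment-based measure (third central moment divided by the cubed standard deviation), which is an assumption about the exact formula intended.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical incomes with a long right tail (NOT from the Relief dataset).
incomes = [10, 12, 15, 18, 22, 30, 45, 80, 200]
log_incomes = [math.log(v) for v in incomes]

skew_raw = skewness(incomes)
skew_log = skewness(log_incomes)

print(round(skew_raw, 2), round(skew_log, 2))
```

Taking logarithms compresses the large values more than the small ones, so the long right tail is pulled in and the skewness falls. In this example the logged series is still somewhat skewed, so a transformation reduces the problem rather than guaranteeing normality.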

The Shape of Distributions (V)

• Kurtosis
  Furthermore, two symmetrically distributed variables with equal mean and standard deviation can still have a different distribution, i.e. they can have a different kurtosis

[Figure: frequency curves f(x) and f(y) for two symmetrical variables x and y. Here variable y has a bigger kurtosis than variable x.]

The Shape of Distributions (VI)

• Measures for skewness and kurtosis
  Measures for skewness and kurtosis therefore tell us more about a distribution

  $$\text{Skewness: } a_3 = \frac{E\left[(Y - \mu)^3\right]}{\sigma^3} \qquad \text{Kurtosis: } a_4 = \frac{E\left[(Y - \mu)^4\right]}{\sigma^4}$$

  Skewness and kurtosis of a normally distributed variable are zero and three, respectively

  Skewness:
  • a3 > 0: distribution skewed to the right / positively skewed
  • a3 < 0: distribution skewed to the left / negatively skewed

  Kurtosis:
  • a4 > 3: heavier tails and a higher peak than a normal distribution
  • a4 < 3: thinner tails and a lower, flatter peak than a normal distribution

  For a meaningful and comparable measure of a4, the distribution should be symmetrical (hence again the need to have a normal distribution)
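Both measures can be sketched directly from their definitions as standardised third and fourth moments. The expectation is replaced by an average over the N observations, which is an assumption; statistical packages may apply small-sample corrections and report slightly different numbers. The function names and data are illustrative.

```python
def central_moment(values, order):
    """Average of the deviations from the mean raised to the given power."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """a3: third central moment divided by the cubed standard deviation."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """a4: fourth central moment divided by the squared variance (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]              # hypothetical symmetric series
right_skewed = [2, 4, 5, 5, 7, 9, 10, 30]

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

The symmetric series has a3 = 0 and, being flat rather than bell-shaped, a4 well below 3. The series with the extreme value 30 has a3 > 0, matching the sign convention above.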

Computer Class:

• Getting started with STATA

• Descriptive statistics

STATA Basics

• Stata is a statistical package for managing, analysing, and graphing data

• It can be used in two different ways
  1) As a point-and-click application
     → Easy interface for those new to Stata, and for those who don’t use it very often
     → … for us (at least at the beginning)!
  2) As a command-driven package
     → Very fast once you are used to the commands
     → Good for communicating more complex ideas
     → One of the main advantages of Stata over SPSS

• A helpful guide
  Hamilton, Lawrence C., Statistics with STATA. Constantly updated versions.

Getting Started Together

Various data formats
  Data come in various formats and extensions, most often in
  .xls : Excel
  .sav : SPSS
  .dta : STATA
  .txt : Text files
  STATA can import all these formats: File/Import/....

1) Download data file
  Relief dataset from Feinstein & Thomas, available online:
  http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=9780521806633&ss=res
  Download the Stata file and save it to your folder on the O: drive

2) Open data
  Open/Relief dataset

3) Open data editor
  Open the Data Editor and try to understand the structure of the dataset
  What do the rows and columns mean?
  Change the names of some variables
  Sort the relief payments in ascending order: what was the minimum paid, what was the maximum?

Getting Started Together (II)

4) List some variables: Data/Describe data/List data
  Relief
  Income

5) Tabulate some variables:
  Income
  Relief

6) Frequencies
  Get an overview of the distribution with a histogram (Graphics/Histogram)
  The number of bins changes the number of bars (or the number of categories)
  Which variables look normally distributed, and which do not?

7) Descriptive statistics (central tendencies & dispersion)
  Mean, stdv, min, max (Data/Describe data/Summary statistics)
  Skewness, kurtosis, median, quartiles, percentiles, etc. (Data/Describe data/Summary statistics/Display additional statistics)

Getting Started Together (III)

8) Export some tables and graphs to Word
  Right-click and copy; insert in Word

9) If you’re stuck: Help/…

Appendix: STATA Commands

• edit
  Opens the Data Editor
• sort varname
  Arranges the observations into ascending order based on the values of the variable varname
• tabulate varname
  Produces one-way tables of frequency counts: absolute, relative and cumulative frequency
• summarize varname
  Calculates a variety of summary statistics (obs, mean, stdv, min, max)
• summarize varname, detail
  Gives more detailed statistics, for instance kurtosis, skewness, percentiles, etc.
• histogram varname, bin(x)
  Creates a histogram with x categories

Homework

Readings:
• Feinstein & Thomas, Ch. 3

Problem Set 1:
  Do exercises 1, 2 (Relief dataset), and 7 (The Old Poor Law in England) from chapter 2.7 (pp. 66-70)
  Submit your solutions, including graphs and tables, in a Word file by noon on Monday (29 January)
