Quantitative Methods – Week 2:
Descriptive Statistics
Roman Studer, Nuffield College
Frequency Distributions
• Frequency distributions provide a summary presentation of the data
• They are a very good way to get a first overview of the data

Discrete variables
Measure the frequency of occurrence of each of the values
Example: Poor Law dataset: number of workhouses per county
[Figure: Bar Chart of Number of Workhouses in Selected Counties, 1831. Vertical axis: number of workhouses (0 to 35). Counties shown: Essex, Norfolk, Suffolk, Cambs, Beds, Sussex, Kent.]
Continuous variables
• Choose appropriate class intervals and note the frequency of occurrence for each class
• The number of class intervals is normally between 5 and 20
Example: Per capita relief payments in the parishes of Kent, 1831
Frequency Distributions (II)
(1)                  (2)         (3)                      (4)          (5)
Class intervals      Frequency   Relative frequency (%)   Cumulative   Cumulative relative
(shillings)          (f)         (f/n × 100)              frequency    frequency (%)
≥ 5 but < 10          1            4.2                      1             4.2
≥ 10 but < 15         7           29.2                      8            33.4
≥ 15 but < 20         4           16.6                     12            50.0
≥ 20 but < 25         6           25.0                     18            75.0
≥ 25 but < 30         4           16.6                     22            91.6
≥ 30 but < 35         1            4.2                     23            95.8
≥ 35 but < 40         1            4.2                     24           100.0
Total                24          100.0
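Columns (3) to (5) of the table follow mechanically from the class frequencies in column (2). The sketch below recomputes them in Python, assuming only the seven frequencies printed in the table; the variable names are illustrative. Exact percentages (for example 16.67 for f = 4) can differ in the last digit from the rounded figures shown in the table.

```python
from itertools import accumulate

# Class intervals in shillings ("5-10" means >= 5 but < 10) and the
# frequencies from column (2) of the table.
intervals = ["5-10", "10-15", "15-20", "20-25", "25-30", "30-35", "35-40"]
frequencies = [1, 7, 4, 6, 4, 1, 1]

n = sum(frequencies)                                       # number of parishes
relative = [100 * f / n for f in frequencies]              # column (3)
cumulative = list(accumulate(frequencies))                 # column (4)
cumulative_relative = [100 * c / n for c in cumulative]    # column (5)

# Print one row per class interval.
for row in zip(intervals, frequencies, relative, cumulative, cumulative_relative):
    print("{:>6} {:>3} {:>6.1f} {:>3} {:>6.1f}".format(*row))
```

The last cumulative frequency always equals n, and the last cumulative relative frequency always equals 100, which is a quick check on any hand-built frequency table.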
Frequency Distributions (III)
[Figure: Frequency of Per Capita Relief Payments in Kent, 1831. Histogram; vertical axis: number of parishes (0 to 7); horizontal axis: relief payments (shillings), in class intervals from ≥ 5 but < 10 to ≥ 35 but < 40.]
Frequency Distributions (IV)
[Figure: Relative Frequency of Per Capita Relief Payments. Histogram with a frequency curve; vertical axis: percentages (0 to 35); horizontal axis: relief payments (shillings), in class intervals from ≥ 5 but < 10 to ≥ 35 but < 40.]
Frequency Distributions (V)
[Figure: Cumulative Frequency of Relief Payments. Vertical axis: number of parishes (0 to 25); horizontal axis: relief payments (shillings), class intervals numbered 1 to 7.]
Frequency Distributions (VI)
[Figure: Cumulative Relative Frequency of Per Capita Relief. Vertical axis: cumulative percentages (0 to 100); horizontal axis: relief payments (10 to 40).]
Descriptive Statistics
Making frequency tables and plotting them as histograms and frequency curves is very helpful for getting a first overview.
But it is not a precise way to summarize the information on the variables in a dataset. For that, we normally determine three features of a variable:
1) Which are the most central (i.e. the most common or typical) values?
2) How are the values spread (dispersed) around those central values?
3) What is the shape of the distribution?
Each of these features can be described by one or more simple statistics. Together they form the basic elements of descriptive statistics, as they provide a precise and comprehensive summary of the variables in a dataset.
Measures of Central Tendency
• The arithmetic mean
  Add up all the values and divide this total by the number of observations:
  $$E(y) = \bar{y} = \frac{y_1 + y_2 + y_3 + \dots + y_N}{N} = \frac{\sum_{i=1}^{N} y_i}{N}$$
• The median
  The value that has one-half of the observations above it and one-half below it when the series is set out in an ascending or descending array
  Odd number of observations: position = (number of observations + 1) / 2
  Even number of observations: average of the two middle observations
• The mode
  The value that occurs most frequently
• Percentiles, deciles, and quartiles
  Instead of dividing the observations into two equal halves (median), we can divide them into four equal quarters (quartiles), or into 10 portions (deciles) or 100 portions (percentiles) of equal size
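The position rule for the median and the idea of quartile cut points can be sketched in Python as follows. This is an illustration only: `median_by_position` is a hypothetical helper name, and the seven values are those of the numeric example in these slides. Quartile conventions differ between textbooks and software, so the cut points from `statistics.quantiles` need not match a hand calculation exactly.

```python
import statistics

def median_by_position(values):
    """Median via the position rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2            # 1-based position of the middle value
        return ordered[position - 1]
    upper = n // 2                         # 0-based index of the upper middle value
    return (ordered[upper - 1] + ordered[upper]) / 2

sample = [9, 2, 5, 10, 4, 7, 5]            # unsorted on purpose

print(median_by_position(sample))          # odd number of observations
print(median_by_position(sample + [30]))   # even number of observations

# Three cut points that split the ordered data into four quarters;
# the middle cut point is the median.
quartiles = statistics.quantiles(sample, n=4)
print(quartiles)
```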
Measures of Central Tendency (II)
• Numeric example

Case     1    2    3    4    5    6    7
x_i      x1   x2   x3   x4   x5   x6   x7
Values   2    4    5    5    7    9    10

What are the mean, the median, and the mode?
  Mean = 6, median = 5, mode = 5

What is the effect of adding the value x8 = 30 (case 8)?
  Mean = 9, median = 6, mode = 5

What are the absolute and the relative frequency of 5?
  Absolute frequency = 2, relative frequency = 25% (2 of the 8 values)
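The numeric example can be checked with Python's standard library. The sketch below uses the values from the slide; `describe` is a hypothetical helper name. It reproduces the figures given above: adding the single large value 30 pulls the mean from 6 to 9, moves the median only from 5 to 6, and leaves the mode at 5.

```python
import statistics
from collections import Counter

values = [2, 4, 5, 5, 7, 9, 10]            # cases 1 to 7 from the slide

def describe(data):
    """Return (mean, median, mode) of a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

print(describe(values))

extended = values + [30]                   # add case 8 (x8 = 30)
print(describe(extended))

# Absolute and relative frequency of the value 5 among the eight values
absolute = Counter(extended)[5]
relative = 100 * absolute / len(extended)
print(absolute, relative)
```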
Measures of Dispersion
• Variable x is more densely distributed around the mean m than variable y
[Figure: frequency curves f(x) and f(y) plotted against x, y]
• Two variables with equal arithmetic mean, but different spread
Measures of Dispersion (II)
• The variance
  The variance is equal to the arithmetic mean of the squared deviations from the mean:
  $$\mathrm{Var}(X) = \sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2$$
  The variance is widely used in statistical work; its disadvantage, however, is that it is expressed in squared units…
• The standard deviation
  The standard deviation is the square root of the variance:
  $$\sigma_x = \sqrt{\mathrm{var}(x)}$$
  Interpretation: average or typical deviation of variable x from the arithmetic mean
  The standard deviation is the most widely used measure of dispersion; however, because it is expressed in the same units as the series, absolute standard deviations are unsuitable for comparisons between series with different underlying units…
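As a worked check of the two definitions, the sketch below computes the variance and standard deviation by hand for the seven values of the numeric example in these slides and compares them with Python's `statistics` module. It assumes the divide-by-N definition given above; many statistical packages report the N − 1 (sample) version by default, which gives slightly larger values.

```python
import math
import statistics

values = [2, 4, 5, 5, 7, 9, 10]

mean = sum(values) / len(values)

# Variance as defined above: arithmetic mean of the squared deviations (divide by N)
variance = sum((x - mean) ** 2 for x in values) / len(values)
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)

# statistics.pvariance / pstdev use the same divide-by-N definition;
# statistics.variance / stdev divide by N - 1 instead.
print(statistics.pvariance(values), statistics.variance(values))
```

Here the squared deviations from the mean of 6 sum to 48, so the variance is 48/7 and the standard deviation is its square root.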
Measures of Dispersion (III)
• The coefficient of variation (CV)
  The coefficient of variation is a measure of relative rather than absolute variation
  It is obtained by dividing the standard deviation by the mean:
  $$CV = \frac{\sigma_x}{\bar{x}}$$
  Interpretation: average percentage deviation from the mean
• The range
  This is a very crude measure of dispersion, defined as the difference between the maximum and the minimum value in the series
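The point of a relative measure can be illustrated with a short Python sketch. The payment figures below are hypothetical, not taken from the Relief dataset, and the function names are illustrative. Expressing the same series in pence instead of shillings (1 shilling = 12 pence) multiplies the standard deviation and the range by 12 but leaves the coefficient of variation unchanged.

```python
import statistics

def coefficient_of_variation(data):
    """Standard deviation (divide-by-N version) divided by the mean."""
    return statistics.pstdev(data) / statistics.mean(data)

def value_range(data):
    """Difference between the maximum and the minimum value."""
    return max(data) - min(data)

shillings = [6.0, 12.5, 14.0, 18.0, 22.5, 27.0, 36.0]   # hypothetical payments
pence = [12 * s for s in shillings]                      # same series in pence

print(statistics.pstdev(shillings), statistics.pstdev(pence))      # differs by a factor of 12
print(value_range(shillings), value_range(pence))                  # differs by a factor of 12
print(coefficient_of_variation(shillings), coefficient_of_variation(pence))  # unit-free
```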
The Shape of Distributions
• Normal distribution
  The normal distribution is a symmetrical, smooth, bell-shaped distribution that is fully described by the arithmetic mean and standard deviation
  Mode, median and mean are equal
  The skewness and kurtosis of the normal distribution are equal to 0 and 3, respectively
  But again: mean and standard deviation depend on the units of the series and are thus difficult to compare…
The Shape of Distributions (II)
• Standard normal distribution
  Every normal distribution can be transformed into a standard normal distribution using
  $$Z = \frac{Y - \mu_y}{\sigma_y}$$
  where $\mu_y$ and $\sigma_y$ are the mean and standard deviation of Y
  By construction, the standard normal distribution has two further basic features that a general normal distribution does not have:
  • mean = 0
  • standard deviation = 1
  These properties make the distribution ideal for comparison
  For this reason the standard normal distribution plays a key role in inductive statistics, as it can be used to make inferences about probabilities
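The standardization step can be sketched in a few lines of Python. The helper name `standardize` is illustrative and the values are those of the numeric example in these slides. Subtracting the mean and dividing by the standard deviation gives a series with mean 0 and standard deviation 1 for any data; the result is standard normal only if the original series was normally distributed.

```python
import statistics

def standardize(data):
    """Return z-scores: (value - mean) / standard deviation (divide-by-N version)."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [(y - mu) / sigma for y in data]

values = [2, 4, 5, 5, 7, 9, 10]
z = standardize(values)

print([round(score, 2) for score in z])
print(statistics.mean(z), statistics.pstdev(z))   # approximately 0 and 1
```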
The Shape of Distributions (III)
• Skewed distributions
  However, values need not be symmetrically distributed around the central point, i.e. distributions can be skewed
  In these cases, mean and standard deviation are insufficient to describe the distribution
  Socio-economic data in particular (wages, income, wealth and related variables) are frequently skewed
[Figure: frequency curve of x with mode, median and mean marked; vertical axis: frequency. This distribution is skewed to the right (positively skewed).]
The Shape of Distributions (IV)
• Consequences of skewed distributions
  Skewed variables can lead to undesirable effects in regressions
  • Non-normally distributed residuals (misspecification)
  • Heteroscedasticity; test statistics and confidence intervals are biased
  (Roughly) normally distributed variables help to avoid these problems. Take a look at the variable:
  • If the variable is not significantly skewed, continue
  • If the variable is skewed, transform it: “Ladder of Powers”. This is why you often find the logarithm of income, the square root of the mortality rate, etc.
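The effect of moving down the ladder of powers can be sketched in Python. The income figures below are hypothetical and chosen to be strongly right-skewed; `skewness` is an illustrative helper that uses the third central moment divided by the cubed standard deviation. For this sample the square root reduces the skewness and the logarithm reduces it further, although neither removes it completely.

```python
import math

def skewness(data):
    """Third central moment divided by the cubed standard deviation (divide-by-N)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes with a long right tail (not course data)
incomes = [10, 12, 15, 18, 20, 25, 30, 45, 80, 200]

raw = skewness(incomes)
root = skewness([math.sqrt(x) for x in incomes])
logged = skewness([math.log(x) for x in incomes])

print(round(raw, 2), round(root, 2), round(logged, 2))
```

In practice one would inspect a histogram of the transformed variable as well, since a single skewness figure can hide other departures from normality.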
The Shape of Distributions (V)
• Kurtosis
  Furthermore, two symmetrically distributed variables with equal mean and standard deviation can still have different distributions, i.e. they can have a different kurtosis
[Figure: frequency curves f(x) and f(y) plotted against x, y. Here the variable y has a bigger kurtosis than variable x.]
The Shape of Distributions (VI)
• Measures for skewness and kurtosis
  Measures for skewness and kurtosis therefore tell us more about a distribution
  $$\text{Skewness: } a_3 = \frac{E\left[(Y-\mu)^3\right]}{\sigma^3} \qquad \text{Kurtosis: } a_4 = \frac{E\left[(Y-\mu)^4\right]}{\sigma^4}$$
  Skewness and kurtosis of a normally distributed variable are zero and three, respectively
  Skewness:
  • a3 > 0: distribution skewed to the right / positively skewed
  • a3 < 0: distribution skewed to the left / negatively skewed
  Kurtosis:
  • a4 > 3: thicker tails & higher peak than a normal distribution
  • a4 < 3: thinner tails & lower peak than a normal distribution
  For a meaningful and comparable measure of a4, the distribution should be symmetrical (hence again the need to have a normal distribution)
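Both measures are ratios of a central moment to a power of the standard deviation, so one small function covers them. The Python sketch below is an illustration under that reading of the formulas; `moment_ratio` is a hypothetical helper and the three samples are made up to show the three cases: a flat symmetric sample, a symmetric sample with a few extreme values, and a right-skewed sample.

```python
def moment_ratio(data, k):
    """k-th central moment divided by sigma**k (a3 for k = 3, a4 for k = 4)."""
    n = len(data)
    mean = sum(data) / n
    sigma = (sum((y - mean) ** 2 for y in data) / n) ** 0.5
    return sum((y - mean) ** k for y in data) / n / sigma ** k

flat = [1, 2, 3, 4, 5, 6, 7]                   # symmetric, evenly spread
peaked = [-10, -1, 0, 0, 0, 0, 0, 1, 10]       # symmetric, a few extreme values
right_skewed = [1, 1, 2, 2, 3, 4, 12]          # long right tail

for name, sample in [("flat", flat), ("peaked", peaked), ("right_skewed", right_skewed)]:
    print(name, round(moment_ratio(sample, 3), 2), round(moment_ratio(sample, 4), 2))
```

The two symmetric samples have a3 equal to zero; the flat one has a4 below 3 and the one with extreme values has a4 above 3, while the right-skewed sample has a positive a3.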
Computer Class:
• Getting started with STATA
• Descriptive statistics
STATA Basics
• Stata is a statistical package for managing, analysing, and graphing data
• It can be used in two different ways
  1) As a point-and-click application
     → Easy interface for those new to Stata, and for those who don’t use it very often
     → … for us (at least at the beginning)!
  2) As a command-driven package
     → Very fast once you are used to the commands
     → Good for communicating more complex ideas
     → One of the main advantages of Stata over SPSS
• A helpful guide
  Hamilton, Lawrence C., Statistics with STATA. Constantly updated versions.
Getting Started Together
Various data formats
  Data come in various formats and extensions, most often in
  .xls : Excel
  .sav : SPSS
  .dta : STATA
  .txt : Text files
  STATA can import all these formats: File/Import/....
1) Download data file
  Relief dataset from Feinstein & Thomas, available online:
  http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=9780521806633&ss=res
  Download the Stata file and save it to your folder on the O: drive
2) Open data
  Open/Relief dataset
3) Open data editor
  Open the Data Editor and try to understand the structure of the dataset
  What do the rows and columns mean?
  Change the names of some variables
  Sort the relief payments in ascending order: what was the minimum paid, what was the maximum?
Getting Started Together (II)
4) List some variables: Data/Describe data/List data
  Relief
  Income
5) Tabulate some variables:
  Income
  Relief
6) Frequencies
  Get an overview of the distribution with a histogram (Graphics/Histogram)
  The number of bins changes the number of bars (or the number of categories)
  Which variables look normally distributed, and which do not?
7) Descriptive statistics (central tendencies & dispersion)
  Mean, stdv, min, max (Data/Describe data/Summary statistics)
  Skewness, kurtosis, median, quartiles, percentiles, etc. (Data/Describe data/Summary statistics/Display additional statistics)
Getting Started Together (III)
8) Export some tables and graphs to Word
  Right-click and copy; insert in Word
9) If you’re stuck: Help/…
Appendix: STATA Commands
• edit
  Opens the Data Editor
• sort varname
  Arranges the observations into ascending order based on the values of the variable
• tabulate varname
  Produces one-way tables of frequency counts: absolute, relative and cumulative frequency
• summarize varname
  Calculates a variety of summary statistics (obs, mean, stdv, min, max)
• summarize varname, detail
  Gives more detailed statistics, for instance kurtosis, skewness, percentiles, etc.
• histogram varname, bin(x)
  Creates a histogram with x categories
Homework
Readings:
• Feinstein & Thomas, Ch. 3
Problem Set 1:
  Do exercises 1, 2 (Relief dataset), and 7 (The Old Poor Law in England) from chapter 2.7 (pp. 66-70)
  Submit your solutions, including graphs and tables, in a Word file by noon on Monday (29 January)