Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 3
Precise & Approximate Relationships Between Variables
Dr Gwilym Pryce
Plan:
1. Introduction
2. Precise Relationships
3. Approximate Relationships
4. Relationships between categorical variables
A token of transatlantic friendship… the relationship between variables:
1. Introduction to relationships between variables
Often of greatest interest in social science is the investigation of relationships between variables:
– is social class related to political perspective?
– is income related to education?
– is work alienation related to job monotony?
We are also interested in the direction of causation, but this is more difficult to prove empirically:
– our empirical models are usually structured assuming a particular theory of causation
Exercise:
Q/ Does the main research question that interests you involve a relationship between variables?
Think about:
– what the variables are
– the direction of causation
– the rationale for this causation
– whether it is a precise or approximate relationship
2. Precise relationships
No random or error component:
• Circumference = π × Diameter ≈ 3.14 × Diameter
  – (linear)
• Fahrenheit = 32 + 9/5 × Centigrade
  – (linear)
• F = ma
  – (non-linear)
  – where F = force; m = mass; a = acceleration
• E = mc²
  – (non-linear)
  – where E = energy; m = mass; c = speed of light
– linear relationships have straight line graphical representations
– non-linear relationships have curved graphical representations
Precise Linear Relationships
Exercise:
– Write a column of integers from 0 to 10 and call this variable ‘C’
– Then construct a new column called ‘F’ where F = 32 + 2C
– Then plot F and C on a graph with F on the vertical axis, and C on the horizontal axis.

 C    F
 0   32
 1   34
 2   36
 3   38
 4   40
 5   42
 6   44
 7   46
 8   48
 9   50
10   52
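The exercise table can also be generated programmatically. The following is a minimal Python sketch, not part of the original slides; the names `c_values`, `f_values` and `steps` are chosen here purely for illustration.

```python
# Build the exercise table: C runs from 0 to 10 and F = 32 + 2C.
c_values = list(range(11))
f_values = [32 + 2 * c for c in c_values]

# Print the two columns side by side.
print(f"{'C':>2} {'F':>4}")
for c, f in zip(c_values, f_values):
    print(f"{c:>2} {f:>4}")

# Differences between successive F values; for a precise linear
# relationship every one-unit step in C changes F by the same amount.
steps = [f_values[i + 1] - f_values[i] for i in range(len(f_values) - 1)]
print("step sizes:", set(steps))
```

Because every step in F is the same size, the points lie on a straight line when F is plotted against C.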
Equation of a straight line:
Traditional to:
– call the dependent variable “y”
  • i.e. the variable that’s being determined or explained
– call the explanatory variable “x”
  • i.e. the determinant of y; the factor that explains the variation in y
y = a + bx
where:
• a is the vertical intercept
  » measures how much y would be if x is zero
  » changes in a simply move the line up or down in parallel shifts
• b is the slope coefficient
  » measures how much y increases for every unit increase in x
  » the greater the value of b the steeper the slope and the more sensitive y is to x.
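The roles of a and b can be checked numerically. This is an illustrative Python sketch only; the helper `line` is a hypothetical name introduced here, and the values a = 32, b = 9/5 are borrowed from the Fahrenheit example earlier in the lecture.

```python
def line(x, a, b):
    """Return y = a + b*x for intercept a and slope b."""
    return a + b * x

a, b = 32, 9 / 5  # Fahrenheit = 32 + 9/5 Centigrade

# With x = 0 the line returns the intercept a.
intercept = line(0, a, b)

# A one-unit increase in x changes y by the slope b.
rise = line(11, a, b) - line(10, a, b)

# Raising a by 5 moves every point on the line up by 5 (a parallel shift).
shift = line(10, a + 5, b) - line(10, a, b)

print(intercept, round(rise, 6), round(shift, 6))
```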
Graphing exact relationships
Axes:
– put the dependent variable y on the vertical axis
– put the explanatory variable x on the horizontal axis
Equation is fully summarised with a line
[Figure: graph of y = ln(x), with x on the horizontal axis and y on the vertical axis]
[Figure: graph of y = exp(x) ≈ 2.7^x, with x on the horizontal axis and y on the vertical axis]
[Figure: graph of y = x², with x on the horizontal axis and y on the vertical axis]
3. Approximate relationships
In social science/epidemiology/history we don’t tend to get precise relationships
– e.g. Relationship between heart disease and smoking
– e.g. Educational achievement and social class of parents
– e.g. Rate of teenage pregnancy and area deprivation
Modelling approximate relationships:
Such relationships can sometimes be approximated/summarised by a precise relationship plus an error term:
– Linear:
  • Risk of heart disease = a + b × no. of cigarettes + e
  • y = a + b x + e
– Multivariate:
  • y = a + b x + c z + e
– Non-linear:
  • y = a + b x² + e
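To see what "a precise relationship plus an error term" looks like in data, the linear model y = a + b x + e can be simulated. The sketch below is illustrative only: the values of a and b, the x values and the normally distributed error are all assumptions made for this example, not figures from the lecture.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

a, b = 2.0, 0.5                    # hypothetical intercept and slope
x_values = list(range(0, 41, 5))   # hypothetical number of cigarettes per day

# Precise part of the relationship: a + b*x
exact = [a + b * x for x in x_values]

# Error term e: assumed here to be random noise with mean zero
errors = [random.gauss(0, 1.0) for _ in x_values]

# Approximate relationship: y = a + b*x + e
observed = [y + e for y, e in zip(exact, errors)]

for x, y_line, y_obs in zip(x_values, exact, observed):
    print(f"x={x:>2}  a+bx={y_line:5.1f}  y={y_obs:6.2f}")
```

The observed values scatter around the straight line rather than lying exactly on it, which is the pattern a scatter plot of real social science data typically shows.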
Graphing approximate relationships
The most straightforward way to investigate evidence for a relationship is to look at scatter plots:
– Again, traditional to:
  • put the dependent variable (i.e. the “effect”) on the vertical axis
    – or “y axis”
  • put the explanatory variable (i.e. the “cause”) on the horizontal axis
    – or “x axis”
Scatter plot of IQ and Income:
[Figure: scatter plot with IQ on the horizontal axis (60 to 160) and INCOME on the vertical axis (10000 to 40000)]
We would like to find the line of best fit:
[Figure: scatter plot of INCOME (10000 to 40000) against IQ (60 to 160) with a fitted straight line]
$\hat{y} = a + bx$
where,
a = y intercept
b = slope of line
INCOME = a + b × IQ
Sometimes the relationship appears non-linear:
[Figure: scatter plot with IQ2 on the horizontal axis (0 to 300) and INCOME on the vertical axis (10000 to 40000)]
… and so a straight line of best fit is not always very satisfactory:
[Figure: scatter plot of INCOME against IQ2 with a straight line of best fit]
Could try a quadratic line of best fit:
[Figure: scatter plot of INCOME against IQ2 with a quadratic line of best fit]
… or a cubic line of best fit: (overfitted?)
[Figure: scatter plot of INCOME against IQ2 with a cubic line of best fit]
Could try two linear lines: “structural break”
[Figure: scatter plot of INCOME against IQ2 with two separate straight lines]
Q/ How do we best fit a straight line?
A/ Regression analysis
– The most popular algorithm for drawing the line of best fit
– minimises the sum of squared deviations from the line to each observation
– also called ‘Ordinary Least Squares’ (OLS)
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Where:
$y_i$ = observed value of y
$\hat{y}_i$ = predicted value of $y_i$
 = the value on the line of best fit corresponding to $x_i$
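For a single explanatory variable the least squares problem has a well-known closed-form solution: b equals the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and a equals the mean of y minus b times the mean of x. These standard textbook formulas are not shown on the slides. The Python sketch below applies them to made-up IQ and income figures; the function names and the data are assumptions for illustration.

```python
def ols_line(xs, ys):
    """Return (a, b) minimising the sum of squared deviations of ys from a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

def sum_squared_deviations(xs, ys, a, b):
    """Sum over observations of (observed y - predicted y) squared."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for this example
iq = [80, 90, 100, 110, 120, 130]
income = [12000, 15000, 19000, 21000, 26000, 27000]

a, b = ols_line(iq, income)
best = sum_squared_deviations(iq, income, a, b)

# Any other line gives a larger sum of squared deviations.
worse = sum_squared_deviations(iq, income, a + 500, b)
print(f"a = {a:.1f}, b = {b:.2f}, best = {best:.0f}, shifted line = {worse:.0f}")
```

Moving the line away from the fitted values of a and b, in any direction, can only increase the sum of squared deviations, which is the sense in which it is the line of "best" fit.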
Regression estimates of a, b:
This algorithm yields estimates of the slope b and y-intercept a of the straight line
– b is usually the parameter of most interest since it tells us what happens to y if x increases by 1.
But sometimes the line of best fit doesn’t seem to explain the variation in y very well:
[Figure: scatter plot of Purchase Price (0 to 300000) against Floor Area (sq meters) (0 to 300) with a line of best fit]
Q/ Why do you think this might be?
Is floor area the only factor?
What other variables determine purchase price?
[Figure: scatter plot of Purchase Price (0 to 300000) against Floor Area (sq meters) (0 to 300)]
Omitted explanatory variables:
If the line of best fit doesn’t seem to explain much of the variation in y this might be because there are other factors determining y:
Scatter plot (with floor spikes)
[Figure: 3D scatter plot of Purchase Price (100000 to 300000) against Floor Area (sq meters) (100 to 300) and Number of Bathrooms (1.0 to 3.5)]
Fitting non-linear lines of best fit:
Regression analysis can be used to summarise non-linear relationships, both bi-variate and multivariate:
– e.g. y = a + b x² + c z²
  • multivariate and quadratic in x and z
– e.g. y = a + b x + c z²
  • multivariate: linear relationship between y and x but quadratic relationship between y and z
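One reason regression extends to these cases is that the equations remain linear in the coefficients a, b and c even when they are quadratic in x or z. As a simplified illustration, the sketch below takes the one-variable model y = a + b x² from the earlier slide, creates a new variable w = x², and fits a straight line of y on w. The data are constructed without an error term so that the known values a = 3 and b = 2 should be recovered; the function name and the data are assumptions for this example.

```python
def ols_line(xs, ys):
    """Return intercept a and slope b of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

# Constructed data: y = 3 + 2*x^2 exactly (no error term)
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [3 + 2 * x ** 2 for x in xs]

# Transform the explanatory variable, then fit a straight line of y on w.
ws = [x ** 2 for x in xs]
a, b = ols_line(ws, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The multivariate versions on the slide work on the same principle, but need a regression routine that handles several explanatory variables at once, which statistical packages such as SPSS provide.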
3D Surface Plots: Construction, Price & Unemployment
[Figure: 3D surface plot with axes labelled Q, P and Ut-1]
Construction Equation in a Slump
[Figure: 3D surface plot of the construction equation in a slump]
=> new construction has a linear relationship with Price, but a quadratic relationship with unemployment.
4. Relationships between categorical variables:
The easiest way to represent relationships between categorical variables is to use contingency tables
– also called cross-tabulations or cross tabs
– also called two way tables
They show the number of observations (or % of observations) in particular categories and naturally lead to a test of independence whose test statistic has a Chi-square (or “χ²”) distribution.
Contingency Tables in SPSS:
Most basic cross tab just lists the count in each category:

first time buyer y=2 n=1 * House County from Postcode Crosstabulation
Count
                                 House County from Postcode
first time buyer y=2 n=1         Cumber    Durham     Total
N                                   203       154       357
Y                                   104        95       199
Total                               307       249       556

You can add % in each category by returning to the cross-tabs window, selecting the Cells button, and choosing which percentages you want.
If you select all three (row, column and total), you will end up with:
first time buyer y=2 n=1 * House County from Postcode Crosstabulation

                                                    House County from Postcode
first time buyer y=2 n=1                            Cumber    Durham     Total
N      Count                                           203       154       357
       % within first time buyer y=2 n=1             56.9%     43.1%    100.0%
       % within House County from Postcode           66.1%     61.8%     64.2%
       % of Total                                    36.5%     27.7%     64.2%
Y      Count                                           104        95       199
       % within first time buyer y=2 n=1             52.3%     47.7%    100.0%
       % within House County from Postcode           33.9%     38.2%     35.8%
       % of Total                                    18.7%     17.1%     35.8%
Total  Count                                           307       249       556
       % within first time buyer y=2 n=1             55.2%     44.8%    100.0%
       % within House County from Postcode          100.0%    100.0%    100.0%
       % of Total                                    55.2%     44.8%    100.0%
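The percentages in the table, and the statistic behind the test of independence, can be reproduced from the four counts. The Python sketch below assumes the test meant is the standard Pearson chi-square test, in which the expected count for each cell is (row total × column total) / grand total. It computes only the statistic, not the p-value, and the variable names are chosen for illustration.

```python
# Observed counts from the crosstab: rows are first time buyer N, Y;
# columns are the two counties (Cumber, Durham).
observed = [[203, 154],
            [104, 95]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# "% within first time buyer": each count as a share of its row total
row_pct = [[100 * count / total for count in row]
           for row, total in zip(observed, row_totals)]

# Expected counts if the two variables were independent
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Degrees of freedom for the test: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print("row %:", [[round(p, 1) for p in row] for row in row_pct])
print(f"chi-square = {chi2:.3f} on {dof} degree(s) of freedom")
```

A large value of the statistic relative to the chi-square distribution with the stated degrees of freedom would count as evidence against independence of the two variables.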