Faculty of Social Sciences Induction Block: Maths & Statistics Lecture 3 Precise & Approximate Relationships Between Variables Dr Gwilym Pryce


Plan:

1. Introduction
2. Precise Relationships
3. Approximate Relationships
4. Relationships between categorical variables

A token of transatlantic friendship… the relationship between variables:

1. Introduction to relationships between variables

Often of greatest interest in social science is investigation into relationships between variables:
– is social class related to political perspective?
– is income related to education?
– is work alienation related to job monotony?

We are also interested in the direction of causation, but this is more difficult to prove empirically:
– our empirical models are usually structured assuming a particular theory of causation

Exercise:

Q/ Does the main research question that interests you involve a relationship between variables?

Think about:
– what the variables are
– the direction of causation
– the rationale for this causation
– whether it is a precise or approximate relationship

2. Precise relationships

No random or error component:
• Circumference = π × Diameter ≈ 3.14 × Diameter
  – (linear)
• Fahrenheit = 32 + 9/5 × Centigrade
  – (linear)
• F = ma
  – (non-linear in m and a taken together, because the two variables are multiplied)
  – where F = force; m = mass; a = acceleration
• E = mc²
  – (non-linear: E rises with the square of c)
  – where E = energy; m = mass; c = speed of light

– linear relationships have straight line graphical representations
– non-linear relationships have curved graphical representations
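One informal way to tell the two kinds of relationship apart is to step the explanatory variable up by equal amounts and look at the resulting changes in the dependent variable. The sketch below is only an illustration: the helper name `first_differences` is made up for this example, and the squared function stands in for any curved relationship.

```python
def first_differences(f, xs):
    """Return the change in f(x) between consecutive values of x."""
    ys = [f(x) for x in xs]
    # Rounding hides tiny floating-point noise in the differences.
    return [round(later - earlier, 6) for earlier, later in zip(ys, ys[1:])]


def fahrenheit(centigrade):
    # Precise linear relationship from the slide.
    return 32 + 9 / 5 * centigrade


def square(x):
    # A simple precise non-linear relationship.
    return x ** 2


xs = list(range(6))  # 0, 1, 2, 3, 4, 5
fahrenheit_steps = first_differences(fahrenheit, xs)
square_steps = first_differences(square, xs)

print("Fahrenheit steps:", fahrenheit_steps)  # the same step every time
print("Square steps:    ", square_steps)      # the step keeps growing
```

Equal steps give a straight line on a graph; steps that grow or shrink give a curve.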

Precise Linear Relationships

Exercise:
– Write a column of integers from 0 to 10 and call this variable ‘C’
– Then construct a new column called ‘F’ where F = 32 + 2C
– Then plot F and C on a graph with F on the vertical axis, and C on the horizontal axis.

 C    F
 0   32
 1   34
 2   36
 3   38
 4   40
 5   42
 6   44
 7   46
 8   48
 9   50
10   52
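The two columns in the exercise can also be generated rather than written out by hand. This is a minimal sketch of that step using the exercise's own formula F = 32 + 2C; the variable name `rows` is just a label chosen for this example.

```python
# Build the C column (0 to 10) and the matching F column, F = 32 + 2C.
rows = [(c, 32 + 2 * c) for c in range(11)]

print(f"{'C':>2} {'F':>3}")
for c, f in rows:
    print(f"{c:>2} {f:>3}")
```

Plotting these pairs with F on the vertical axis gives a straight line, because F rises by exactly 2 for every unit increase in C.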

Equation of a straight line:

Traditional to:
– call the dependent variable “y”
  • i.e. the variable that’s being determined or explained
– call the explanatory variable “x”
  • i.e. the determinant of y; the factor that explains the variation in y

y = a + bx
where:
• a is the vertical intercept
  » measures how much y would be if x is zero
  » changes in a simply move the line up or down in parallel shifts
• b is the slope coefficient
  » measures how much y increases for every unit increase in x
  » the greater the value of b the steeper the slope and the more sensitive y is to x.
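Because a precise linear relationship has no error component, any two points on the line are enough to recover a and b. The sketch below does this for two rows of the C and F exercise; the function name `line_through` is hypothetical and chosen only for this illustration.

```python
def line_through(point_1, point_2):
    """Return (a, b) for the straight line y = a + b*x through two points."""
    (x1, y1), (x2, y2) = point_1, point_2
    b = (y2 - y1) / (x2 - x1)  # slope: change in y per unit change in x
    a = y1 - b * x1            # intercept: value of y when x is zero
    return a, b


# Two rows from the exercise table: (C, F) = (0, 32) and (10, 52).
a, b = line_through((0, 32), (10, 52))
print(f"a = {a}, b = {b}")

# Raising a by 8 shifts every point up by 8 without changing the slope.
original = [a + b * x for x in range(4)]
shifted = [(a + 8) + b * x for x in range(4)]
gaps = [s - o for s, o in zip(shifted, original)]
print("Vertical gap at each x:", gaps)
```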

Graphing exact relationships

Axes:
– put the dependent variable y on the vertical axis
– put the explanatory variable x on the horizontal axis

The equation is fully summarised by a line.

[Figure: curve of y = ln(x), with x on the horizontal axis and y on the vertical axis]

[Figure: curve of y = exp(x) ≈ 2.7^x, with x on the horizontal axis and y on the vertical axis]

[Figure: curve of y = x², with x on the horizontal axis and y on the vertical axis]

3. Approximate relationships

In social science/epidemiology/history we don’t tend to get precise relationships
– e.g. Relationship between heart disease and smoking
– e.g. Educational achievement and social class of parents
– e.g. Rate of teenage pregnancy and area deprivation

Modelling approximate relationships:

Such relationships can sometimes be approximated/summarised by a precise relationship plus an error term:
– Linear:
  • Risk of heart disease = a + b × (no. of cigarettes) + e
  • y = a + b x + e
– Multivariate:
  • y = a + b x + c z + e
– Non-linear:
  • y = a + b x² + e
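To see what "a precise relationship plus an error term" looks like in numbers, the sketch below simulates y = a + b x + e. Everything numeric here is an assumption made for illustration: the values of a and b, the x values, and the choice of normally distributed errors with standard deviation 2 are not taken from the lecture.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable

a, b = 10.0, 0.5             # hypothetical intercept and slope
xs = list(range(0, 41, 5))   # hypothetical values of the explanatory variable

# Precise part of the relationship: every point lies exactly on the line.
precise = [a + b * x for x in xs]

# Approximate relationship: the same line plus a random error term e.
approx = [a + b * x + random.gauss(0, 2.0) for x in xs]

# The error term is the gap between what is observed and the line.
errors = [y - p for y, p in zip(approx, precise)]

for x, p, y, e in zip(xs, precise, approx, errors):
    print(f"x={x:>2}  line={p:6.2f}  observed={y:6.2f}  error={e:+.2f}")
```

A scatter plot of the observed values would show points spread around the line rather than sitting on it.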

Graphing approximate relationships

The most straightforward way to investigate evidence for a relationship is to look at scatter plots:
– Again, traditional to:
  • put the dependent variable (i.e. the “effect”) on the vertical axis, or “y axis”
  • put the explanatory variable (i.e. the “cause”) on the horizontal axis, or “x axis”

Scatter plot of IQ and Income:

[Figure: scatter plot with IQ (60 to 160) on the horizontal axis and INCOME (10,000 to 40,000) on the vertical axis]

We would like to find the line of best fit:

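[Figure: scatter plot of INCOME (10,000 to 40,000) against IQ (60 to 160) with a straight line of best fit]

$\hat{y} = a + bx$

where a = y-intercept and b = slope of the line, so here:

$\text{INCOME} = a + b \cdot \text{IQ}$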

Sometimes the relationship appears non-linear:

[Figure: scatter plot with IQ2 (0 to 300) on the horizontal axis and INCOME (10,000 to 40,000) on the vertical axis]

… and so a straight line of best fit is not always very satisfactory:

[Figure: scatter plot of INCOME (10,000 to 40,000) against IQ2 (0 to 300) with a straight line of best fit]

Could try a quadratic line of best fit:

[Figure: scatter plot of INCOME (10,000 to 40,000) against IQ2 (0 to 300) with a quadratic line of best fit]

… or a cubic line of best fit (overfitted?):

[Figure: scatter plot of INCOME (10,000 to 40,000) against IQ2 (0 to 300) with a cubic line of best fit]

Could try two linear lines (“structural break”):

[Figure: scatter plot of INCOME (10,000 to 40,000) against IQ2 (0 to 300) with two separate straight lines of best fit]

Q/ How do we best fit a straight line?
A/ Regression analysis
– The most popular algorithm for drawing the line of best fit
– minimises the sum of squared deviations from the line to each observation
– also called ‘Ordinary Least Squares’ (OLS)

$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Where:
$y_i$ = observed value of y
$\hat{y}_i$ = predicted value of $y_i$
  = the value on the line of best fit corresponding to $x_i$

Regression estimates of a, b:

This algorithm yields estimates of the slope b and y-intercept a of the straight line
– b is usually the parameter of most interest since it tells us what happens to y if x increases by 1.
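For a single explanatory variable, the least-squares estimates have a well-known closed form, which the slides do not spell out: b is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and a is the mean of y minus b times the mean of x. The sketch below applies those formulas to a small made-up data set; the function names and the numbers are illustrative assumptions, not the lecture's data.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b


def sum_squared_deviations(xs, ys, a, b):
    """Sum over observations of (y_i - y-hat_i)^2 for the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = ols_fit(xs, ys)
best = sum_squared_deviations(xs, ys, a, b)
print(f"a = {a:.3f}, b = {b:.3f}, sum of squared deviations = {best:.4f}")

# Any other line gives a larger sum of squared deviations.
worse_intercept = sum_squared_deviations(xs, ys, a + 0.1, b)
worse_slope = sum_squared_deviations(xs, ys, a, b + 0.1)
print(worse_intercept > best, worse_slope > best)
```

Nudging either estimate away from its fitted value increases the sum of squared deviations, which is what "best fit" means under this criterion.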

But sometimes the line of best fit doesn’t seem to explain the variation in y very well:

[Figure: scatter plot with Floor Area in sq meters (0 to 300) on the horizontal axis and Purchase Price (0 to 300,000) on the vertical axis]

Q/ Why do you think this might be?

Is floor area the only factor? What other variables determine purchase price?

[Figure: scatter plot with Floor Area in sq meters (0 to 300) on the horizontal axis and Purchase Price (0 to 300,000) on the vertical axis]

Omitted explanatory variables:

If the line of best fit doesn’t seem to explain much of the variation in y this might be because there are other factors determining y:

[Figure: 3D scatter plot (with floor spikes) of Purchase Price (100,000 to 300,000) against Floor Area in sq meters (100 to 300) and Number of Bathrooms (1.0 to 3.5)]

Fitting non-linear lines of best fit:

Regression analysis can be used to summarise non-linear relationships, both bivariate and multivariate:
– e.g. y = a + b x² + c z²
  • multivariate and quadratic in x and z
– e.g. y = a + b x + c z²
  • multivariate: linear relationship between y and x but quadratic relationship between y and z
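A quadratic relationship such as y = a + b x² can still be fitted with a straight-line regression, because it is linear in a and b once x² is treated as the explanatory variable. The sketch below shows this for the simpler one-variable case. It is an illustration under stated assumptions: the data are made up and error-free, generated with a = 3 and b = 0.5, and the function names are hypothetical.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


def sum_squared_deviations(xs, ys, a, b):
    """Sum of squared gaps between each y and the fitted value a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data that follow y = 3 + 0.5 * x^2 exactly.
xs = list(range(7))
ys = [3 + 0.5 * x ** 2 for x in xs]

# Straight line in x: the wrong shape for these data.
a_lin, b_lin = ols_fit(xs, ys)
ssd_linear = sum_squared_deviations(xs, ys, a_lin, b_lin)

# Straight line in x squared: the quadratic relationship.
x_squared = [x ** 2 for x in xs]
a_quad, b_quad = ols_fit(x_squared, ys)
ssd_quadratic = sum_squared_deviations(x_squared, ys, a_quad, b_quad)

print(f"linear in x:   a={a_lin:.2f}, b={b_lin:.2f}, SSD={ssd_linear:.2f}")
print(f"linear in x^2: a={a_quad:.2f}, b={b_quad:.2f}, SSD={ssd_quadratic:.2f}")
```

The multivariate forms on the slide add further terms such as c z², which needs a multiple regression routine rather than this two-parameter helper.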

3D Surface Plots: Construction, Price & Unemployment

[Figure: two 3D surface plots with axes labelled Q, P and Ut-1; the second panel is titled “Construction Equation in a Slump”]

=> new construction has a linear relationship with price, but a quadratic relationship with unemployment.

4. Relationships between categorical variables:

The easiest way to represent relationships between categorical variables is to use contingency tables
– also called cross-tabulations or cross tabs
– also called two-way tables

They show the number of observations (or % of observations) in particular categories and naturally lead to a test of independence whose test statistic has a Chi-square (or “χ²”) distribution.

Contingency Tables in SPSS:

Most basic cross tab just lists the count in each category:

first time buyer y=2 n=1 * House County from Postcode Crosstabulation

Count
                              House County from Postcode
first time buyer y=2 n=1      Cumber    Durham     Total
N                                203       154       357
Y                                104        95       199
Total                            307       249       556

You can add % in each category by returning to the cross-tabs window, selecting the cells button, and choosing which percentages you want.

If you select all three (row, column and total), you will end up with:

first time buyer y=2 n=1 * House County from Postcode Crosstabulation

first time                                             House County from Postcode
buyer y=2 n=1                                          Cumber    Durham     Total
N       Count                                             203       154       357
        % within first time buyer y=2 n=1               56.9%     43.1%    100.0%
        % within House County from Postcode             66.1%     61.8%     64.2%
        % of Total                                      36.5%     27.7%     64.2%
Y       Count                                             104        95       199
        % within first time buyer y=2 n=1               52.3%     47.7%    100.0%
        % within House County from Postcode             33.9%     38.2%     35.8%
        % of Total                                      18.7%     17.1%     35.8%
Total   Count                                             307       249       556
        % within first time buyer y=2 n=1               55.2%     44.8%    100.0%
        % within House County from Postcode            100.0%    100.0%    100.0%
        % of Total                                      55.2%     44.8%    100.0%
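The percentages in the cross tab, and the statistic used in the test of independence, can be reproduced from the four cell counts alone. The sketch below is not SPSS output. It recomputes the row, column and total percentages, then applies the standard Pearson formula, which sums (observed − expected)² / expected over the cells, with each expected count equal to row total × column total / grand total. The variable names are chosen for this example only.

```python
# Observed counts from the cross tab: rows are first time buyer (N, Y),
# columns are House County from Postcode (labels as printed in the table).
counts = {
    "N": {"Cumber": 203, "Durham": 154},
    "Y": {"Cumber": 104, "Durham": 95},
}
columns = ["Cumber", "Durham"]

row_totals = {r: sum(counts[r].values()) for r in counts}
col_totals = {c: sum(counts[r][c] for r in counts) for c in columns}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / whole, 1)


row_pct = {r: {c: pct(counts[r][c], row_totals[r]) for c in columns} for r in counts}
col_pct = {r: {c: pct(counts[r][c], col_totals[c]) for c in columns} for r in counts}
total_pct = {r: {c: pct(counts[r][c], grand_total) for c in columns} for r in counts}

# Pearson chi-square statistic for the test of independence.
chi_square = 0.0
for r in counts:
    for c in columns:
        expected = row_totals[r] * col_totals[c] / grand_total
        chi_square += (counts[r][c] - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
degrees_of_freedom = (len(counts) - 1) * (len(columns) - 1)

print("Row %:", row_pct)
print("Column %:", col_pct)
print("Total %:", total_pct)
print(f"Chi-square = {chi_square:.3f} on {degrees_of_freedom} degree(s) of freedom")
```

For these counts the statistic is about 1.09 on 1 degree of freedom, below the 5% critical value of 3.84, so this calculation would not reject independence between first-time-buyer status and county at the 5% level.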
