Stats Bootcamp - Day #3 Zoltan Fazekas zfazekas.github.io 3 February 2016 Zagreb


Stats Bootcamp - Day #3
Zoltan Fazekas
zfazekas.github.io

3 February 2016
Zagreb


Outline Day 1

- General considerations: what do we aim to do, and how?
- Data and variables
- Descriptive statistics: measures of central tendency and variability
- Associations and mean differences


General considerations


General considerations (1)

- A simplified version of the “usual” steps
- We have an idea > read (somebody else probably had it)
- Using our idea & prior work, we refine the research question/our theory
- From the theory > testable hypotheses
- If classic quantitative approach: collect (or find) data
- Operationalization and measurement - empirical model choice - results
- Test hypotheses


General considerations (2)

- Usually: our hypotheses regard relationships at a particular level
- Unit of analysis + relationships between concepts
- Concepts are measured > relationship between variables
- Today: all components of a relationship (between what, what direction, what strength)
- Examples


Data and variables


Data (1)

- Tabular is good
- Preferred: rows = observations, columns = features, variables
- Coding: labels, values, and missing values
- No matter what: codebook!


relig[1:10, 1:5]

##    year       country ccode attend_religious_services    hdi
## 1  1981     australia   900                  4.220493 0.8704
## 2  1981       belgium   211                  3.726638 0.8718
## 3  1981       denmark   390                  2.615059 0.8844
## 4  1981        france   220                  2.560833 0.8744
## 5  1981  germany west   260                  3.851341 0.8646
## 6  1981 great britain   200                  2.922879 0.8620
## 7  1981       ireland   205                  6.667214 0.8382
## 8  1981         italy   325                  4.693620 0.8626
## 9  1981         japan   740                  2.755258 0.8886
## 10 1981   netherlands   210                  3.755938 0.8878


Data (2)

- Strategies
  - Inspect & browse
  - Larger datasets: inspect subsets
  - (Even) larger datasets and in general: tables, summaries, visualization
- Recommended rules:
  - Use a codebook!
  - Never overwrite the raw data
  - Back-up and pipe
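
The inspection strategies and the "never overwrite the raw data" rule can be sketched in a few lines of R. This is only an illustration: `relig_raw`, `relig_work`, and `hdi_pct` are hypothetical names, and the five rows are typed in from the printed excerpt of `relig` rather than loaded from the real file.

```r
# Hypothetical mini data set, typed in from the first rows of the printed excerpt.
relig_raw <- data.frame(
  year = rep(1981, 5),
  country = c("australia", "belgium", "denmark", "france", "germany west"),
  attend_religious_services = c(4.220493, 3.726638, 2.615059, 2.560833, 3.851341),
  hdi = c(0.8704, 0.8718, 0.8844, 0.8744, 0.8646)
)

# Inspect a subset instead of browsing the whole table.
head(relig_raw, 3)

# Summaries scale better than browsing for larger data sets.
summary(relig_raw$hdi)

# Work on a copy so that the raw data object is never overwritten.
relig_work <- relig_raw
relig_work$hdi_pct <- relig_work$hdi * 100  # hypothetical derived variable
```

All changes live in `relig_work`, so the raw object can always be compared against the transformed one.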


Data: our running example

- Today and also tomorrow, data from published work
- Religious Participation, Social Conservatism, and Human Development by Gaskins, Golder, and Siegel, Journal of Politics, 2013
- Aggregate data:
  - Aggregate Religious Participation captures the average level of religious attendance in a country, 1-8 (WVS)
  - HDI is a composite index capturing three aspects of human development: education, health, and standard of living (UNDP)
  - Whenever we introduce a new variable, we will let you know what it is


Variables

- An attribute, condition, or characteristic that changes (varies) for/across different observations (individual, country, etc.)
- We are interested in variation: no reason to define a variable "human" for survey data
- Ideally, we need measurement(s) of each variable for each observation
- Almost anything can be thought of as a variable. Examples?


Type of values: continuous

- There is an infinite number of possible values that fall between two observed values - you can divide it into an infinite number of fractional parts
- Time, weight, etc.
- Age: you might go into very small measurement levels and always find something in between two values


Type of values: discrete

- Consists of separate, indivisible categories. No values can exist between two neighboring categories: counts of family members, biological sex, etc.
- But some continuous variables are measured as discrete, and we assume continuous nature
- Age again: you cannot really find anything between 32 and 33, or you can ask date of birth and you know the survey date, etc.
- But how much in detail?
- In many cases: 4-5 ordered categories will be treated as continuous
- We will see issues with this, i.e. non-linearity
- Some age categories are different - when you relate it to a cutpoint


Scales: nominal

- Set of categories with different names (don’t get confused if they are represented by numbers)
- Categories simply label observations, no quantitative distinction: Female = 1, Male = 0


Scales: ordinal

- Categories that are organized in an ordered sequence: ranks, or such qualifiers as small, medium, large
- Good for seeing whether individuals are different or not (or who is better), but not suitable for measuring the magnitude of the difference


Scales: interval & ratio

- Series of ordered categories, but the categories form a series of intervals that are exactly the same size
- We do most of this, though: the world is not even ordinal, it is categorical (Hagenaars 2011, presentation)


Descriptive statistics

- We only summarize, organize, understand, and simplify our data
- We do not infer or generalize to populations (from tomorrow)
- A key element in research-related work
- Differences between variable types will start to matter


Mean

- 5 student class with the following grades: 1,1,2,3,2 [1-5]
- Class mean grade: 9/5 = 1.8

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

- Measure of central tendency, or a reflection of a typical value
- This is the sample mean. If population mean, use µ for the mean and N for population size (number of observations in the population)
- Though, sensitive to some values: we will see, to outliers
- Is the mean always meaningful?
- What is the mean of a dichotomous variable?


In our data

mean(relig$gdp1000, na.rm = TRUE)

## [1] 12.85864

mean(relig$hdi) ## difference?

## [1] NA

table(relig$communist)

##
##   0   1
## 178  10

mean(relig$communist) ## interpretation?

## [1] 0.05319149
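
One way to read the last value: for a variable coded 0/1, the mean is the share of observations coded 1. A minimal sketch follows, using a hypothetical vector `communist` rebuilt from the counts in the table (178 zeros, 10 ones) rather than the real `relig` data.

```r
# Hypothetical 0/1 vector with the same counts as table(relig$communist).
communist <- c(rep(0, 178), rep(1, 10))

# The mean of a 0/1 variable ...
mean(communist)

# ... equals the proportion of observations coded 1 (10 out of 188).
sum(communist == 1) / length(communist)
```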


In our data

mean(relig$country)

## Warning in mean.default(relig$country): argument is not numeric or logical:
## returning NA

## [1] NA

mean(relig$ccode) ## ?

## [1] 379.8404


Median

- Different indicator of the central tendency: the value that splits the sample in two equal parts
- The ordered sample: 1,1,2,2,3 – the reordered grades – this step should already tell you a property
- When even numbers: average the two middle values
- It is not sensitive to outliers
- Range: max − min
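
The contrast between mean and median can be tried on the five grades from the Mean slide. The second vector, `grades_out`, is a purely hypothetical illustration in which one grade is replaced by an artificial extreme value to show the sensitivity of the mean.

```r
grades <- c(1, 1, 2, 3, 2)   # grades from the Mean slide

sort(grades)                 # ordered sample: 1 1 2 2 3
mean(grades)                 # 1.8
median(grades)               # 2

# Hypothetical extreme value: the mean moves a lot, the median does not.
grades_out <- c(1, 1, 2, 30, 2)
mean(grades_out)             # 7.2
median(grades_out)           # 2

# Range expressed as max - min.
diff(range(grades))          # 2
```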


In our data

min(relig$urbanization, na.rm = TRUE)

## [1] 12

max(relig$urbanization, na.rm = TRUE)

## [1] 100

median(relig$urbanization, na.rm = TRUE)

## [1] 68

mean(relig$urbanization, na.rm = TRUE)

## [1] 65.97312

range(relig$urbanization, na.rm = TRUE)

## [1] 12 100


In our data

- Percentiles: the percentage of the sample/population/numbers that are below a certain value

quantile(relig$urbanization, 0.9, na.rm = TRUE)

##  90%
## 87.5

quantile(relig$urbanization, 0.1, na.rm = TRUE)

## 10%
##  42

- What is the median, expressed as percentile?


In our data

median(relig$urbanization, na.rm = TRUE) == quantile(relig$urbanization, 0.5, na.rm = TRUE)

##  50%
## TRUE


Mode

- The value that appears most frequently, especially useful in the discrete case (but not only)
- You might find “modal category”
- There can be more than one such value: the probability of observing a value can even be the same for all values
- Or you could have multiple modes
- These three values - mean, median, mode - can be identical under certain conditions
- In practice, there will be differences: these differences will already signal the nature and extent of deviation from a normal distribution (see later)
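
Base R's `mode()` reports the storage mode of an object, not the statistical mode, so the most frequent value is usually read off a frequency table. The helper below, `stat_mode`, is a hypothetical name introduced only for this sketch; it returns all tied modes as character strings.

```r
# Hypothetical helper: return the most frequent value(s) of a vector.
stat_mode <- function(x) {
  counts <- table(x)
  # Keep every value whose count ties for the maximum (handles multiple modes).
  names(counts)[counts == max(counts)]
}

stat_mode(c(1, 1, 2, 3, 2))        # two modes: "1" "2"
stat_mode(c("a", "b", "b", "c"))   # one modal category: "b"
```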


In our data

table(relig$postcommunist)

##
##   0   1
## 138  50

table(relig$social_regulation)

##
##       0 0.66668  0.6668  1.3334  2.0002  2.6668   2.667  3.3334  4.0002
##      23       4      17      20      13      13       4      19       7
##  4.6668   4.667  5.3334       6  6.0002  6.6668  7.3334  8.0002  8.6668
##       7       7       9       7       1       4       3      12       6
##  9.3334      10
##       4       8


In our data

summary(relig[, c("gdp1000", "urbanization", "country")])

##     gdp1000         urbanization              country
##  Min.   : 0.7068   Min.   : 12.00   spain         :  5
##  1st Qu.: 6.6010   1st Qu.: 56.00   argentina     :  4
##  Median :11.2848   Median : 68.00   czech republic:  4
##  Mean   :12.8586   Mean   : 65.97   great britain :  4
##  3rd Qu.:19.4886   3rd Qu.: 76.00   hungary       :  4
##  Max.   :41.3540   Max.   :100.00   japan         :  4
##  NA's   :35        NA's   :2        (Other)       :163

str(relig[, c("gdp1000", "hdi", "urbanization", "country")])

## 'data.frame':    188 obs. of  4 variables:
##  $ gdp1000     : num  17.5 16.2 17.8 16.3 15.9 ...
##  $ hdi         : num  0.87 0.872 0.884 0.874 0.865 ...
##  $ urbanization: int  86 96 84 73 73 88 56 67 60 65 ...
##  $ country     : Factor w/ 82 levels "albania","algeria",..: 5 10 20 26 29 30 38 40 41 51 ...


Variation (1)

- Measures of variability or spread, not central tendency
- We strive to explain it
- Variance:

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

- Standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

- (n-1) instead of n (or N): you are losing a degree of freedom by estimating $\bar{x}$ (we will get back to this)


Variation (2)

x      $\bar{x}$   $(x_i - \bar{x})^2$   value
1      1.8         (−0.8)²               0.64
1      1.8         (−0.8)²               0.64
2      1.8         (0.2)²                0.04
3      1.8         (1.2)²                1.44
2      1.8         (0.2)²                0.04
∑                                        2.8
∑/4                                      0.7

σ²     0.7
σ      0.837
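
The hand calculation for the five grades can be checked against R's built-in functions, which also divide by n - 1. A minimal sketch:

```r
grades <- c(1, 1, 2, 3, 2)

# Squared deviations from the mean, one per observation.
dev_sq <- (grades - mean(grades))^2

sum(dev_sq)                          # sum of squared deviations: 2.8
sum(dev_sq) / (length(grades) - 1)   # variance by hand: 0.7

var(grades)                          # built-in variance, same n - 1 denominator
sd(grades)                           # square root of the variance
```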


In our data

sd(relig$urbanization, na.rm = TRUE)

## [1] 17.37859

var(relig$urbanization, na.rm = TRUE)

## [1] 302.0155

sqrt(var(relig$urbanization, na.rm = TRUE))

## [1] 17.37859


Quick words about univariate ‘distribution’


Associations


Associations

- We want to analyze associations between attributes of the observations
- Do we have better quality of democracy in richer countries?
- Do more educated people prefer more liberal policies?
- These are classic examples of asking a question about a relationship


Disclaimer


Use of causal language at this stage is not warranted

- Do we have a better economic situation in countries with better quality of democracy?
- Association is not causation


Steps

- Two attributes/features – variables in your data set
- Interest in the relationship between these two
  1. Is there a relationship between these two?
  2. What is the direction of this association: +/-
  3. Magnitude
- From univariate to bivariate


Correlation (1)

- We measure and describe the relationship between two variables
- Accordingly, we have X and Y, two observed variables
- Each observation has scores on these variables


Correlation (2)

X     Y
0     2
3     12
1     9
10    2
-2    3

Usually, more observations: better display needed


In our data

[Figure: scatterplot of GDP in 1,000 USD (y-axis, 0 to 40) against Urbanization (x-axis, 25 to 100)]


In our data

[Figure: scatterplot of GDP in 1,000 USD (y-axis, 0 to 40) against Urbanization (x-axis, 25 to 100), with each point labeled by country name]


In our data

[Figure: scatterplot of GDP in 1,000 USD (y-axis, 0 to 40) against Urbanization (x-axis, 25 to 100)]


Correlation (3)

- Correlation is a number that conveys information about three characteristics of the relationship between X and Y

1. Direction:
   - Positive (+): the variables tend to change in the same direction – as X increases, Y tends to increase as well; but also as X decreases, Y tends to decrease as well
   - Negative (-): the variables tend to go in the opposite direction – as X increases, Y tends to decrease; but also as X decreases, Y tends to increase
   - Reminder: we can change the order here, saying that as Y increases, X tends to . . .


Directions

[Figure: two panels, "Perfect negative" and "Perfect positive", each plotting Y against X (X from −3 to 3)]


Correlation (4)

2. Form: as displayed, a linear relationship, which is the most common use for correlations
3. Strength: how much change in Y is associated with a change in X
   - Not all points on the line, usually
   - Perfect correlation means that each change in X is accompanied by a perfectly predictable change in Y – otherwise, we always have some variation in the increase or decrease
   - Perfect positive correlation is 1.00, perfect negative correlation is -1.00 - Range of the correlation – [-1, 1]


Pearson correlation (1)

- The degree and direction of a linear relationship through the ratio of covariability between X and Y and their variability taken separately

1. Calculate the sum of products of deviations

$$SP = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

   Each observation has a deviation (from the mean), but this is true for both variables - Calculate the product of these for each observation - Sum

2. Alternative formula

$$SP = \sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}$$
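
Both versions of SP can be tried on the small X/Y table from Correlation (2). The sketch also forms the ratio described on this slide, assuming the usual Pearson definition r = SP / sqrt(SS_X * SS_Y), where SS_X and SS_Y are the sums of squared deviations of each variable; that formula and the object names are additions for illustration, not taken from the slides.

```r
x <- c(0, 3, 1, 10, -2)
y <- c(2, 12, 9, 2, 3)
n <- length(x)

# 1. Sum of products of deviations.
sp_dev <- sum((x - mean(x)) * (y - mean(y)))

# 2. Alternative (computational) formula.
sp_alt <- sum(x * y) - sum(x) * sum(y) / n

# Variability of each variable taken separately.
ss_x <- sum((x - mean(x))^2)
ss_y <- sum((y - mean(y))^2)

# Covariability relative to separate variability.
r_manual <- sp_dev / sqrt(ss_x * ss_y)

all.equal(sp_dev, sp_alt)        # both SP formulas agree
all.equal(r_manual, cor(x, y))   # matches the built-in Pearson correlation
```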


Different variables

- Pearson correlation works fine if you have numerical scores measured on an interval or ratio scale
- Spearman correlation for data on an ordinal scale (rank-order correlation)


Transformation

Observation   X    Y    Xrank   Yrank
A             2    2    1       1
B             3    8    2       2
C             4    9    3       3
D             10   10   4       4


Example

[Figure: two panels, "Ranks" and "Scales", each plotting Y against X for observations A, B, C, and D]


Use

- Ordinal variables or ranks - apply formula
- No expectation or assumption about the form of relationship: Spearman
- If manually: convert it to ranks and use the same formula
- Software takes care of this
- Other solutions for specific cases: point-biserial (one interval, one dichotomous), polychoric (latent continuous assumed), phi-coefficient (2 dichotomous), etc.
- Know the structure of your variables (from a conceptual and empirical point of view) and apply/look for the most suitable option
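
The "convert to ranks and use the same formula" step can be checked on observations A to D from the Transformation table. A minimal sketch:

```r
x <- c(2, 3, 4, 10)   # observations A, B, C, D
y <- c(2, 8, 9, 10)

# Spearman correlation as provided by the software.
cor(x, y, method = "spearman")

# The same value by hand: rank both variables, then apply Pearson.
cor(rank(x), rank(y))

# Pearson on the raw scores is lower, because the relationship is
# monotonic but not linear.
cor(x, y)
```

The ranks of X and Y are identical here, so the rank-order correlation is exactly 1.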


Categorical variables

- Chi-square: extremely useful
- It is a non-parametric test
- Previously: looked at parameters (mean) and formulated assumptions about the distributional form
- What if these assumptions do not hold?
- I.e. dichotomous variables in terms of normal distribution


Chi-square test of independence (1)

        Party A   Party B   Party C   Party D
Men     10        3         15        22
Women   90        17        25        18


Chi-square test of independence (2)

- Two categorical variables
- Use the χ² test of independence to evaluate whether there is a relationship between these two variables
- H0: for the general population there is no relationship between gender and party preference – H0 (is) will be crucial
- Or: knowing one’s gender does not improve our potential to predict his/her party choice
- Is the statement “knowing one’s party preference does not improve our potential to predict his/her gender” correct for H0?


Chi-square test of independence (3)

- Observed frequencies: $f_o$ is given, we observe it
- For $f_e$ we start from H0 - this is rather new, but not necessarily . . .
- The frequency distribution of party preference has the same shape for men and for women (same proportion)
- For $f_e$ start with the totals


Chi-square test of independence (4)

        Party A   Party B   Party C   Party D   Total
Men                                             50
Women                                           150
Total   100       20        40        40


Chi-square test of independence (5)

- In total: 50% prefer Party A (100/200); 10% Party B, 20% Party C, and 20% Party D
- The proportion of women and men (150 vs 50) is known
- Combine the two sources of information


Chi-square test of independence (6)

        Party A   Party B   Party C   Party D   Total
Men     25        5         10        10        50
Women   75        15        30        30        150
Total   100       20        40        40


Chi-square test of independence (6)

- Formula:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

- Losing though 2 pieces of informative values (for both categorical variables)
- df = (R − 1)(C − 1), where R is the number of rows, and C is the number of columns
- Use the χ² value and the df to evaluate whether (and with what certainty) the H0 can be rejected or not
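
The full calculation for the gender by party table can be sketched in R. The object names are illustrative; the observed counts are the ones from the first table, and the result is compared with R's built-in `chisq.test()`.

```r
# Observed frequencies (f_o) from the gender by party table.
observed <- matrix(
  c(10,  3, 15, 22,
    90, 17, 25, 18),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Men", "Women"), paste("Party", LETTERS[1:4]))
)

# Expected frequencies (f_e) under H0: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Chi-square statistic and degrees of freedom.
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Probability of a statistic at least this large if H0 were true.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# The built-in test gives the same statistic.
chisq.test(observed)$statistic
```

The `expected` matrix reproduces the table of expected frequencies (25, 5, 10, 10 for men; 75, 15, 30, 30 for women).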


Mean differences


Group means

- Most of these basic techniques or quantities of interest are used in experimental research a lot
- Experiments: to some extent, gold standard – remember, association ≠ causation - so what is so special?
- Control vs Treatment group: intervention vs no intervention
- After carrying it out: do we see differences in the outcome of interest?
- Compare means (for example), between groups


T-test (1)

- Very brief, as some components need input from tomorrow
- Possibilities:
  - One-sample
  - Two-sample: paired (i.e. within-subjects)/not-paired (i.e. between-subjects)
- Goals/statistic of interest:
  - Some quantities: M1, M2
  - H0: µ1 − µ2 = 0


T-test (2)

t.test(relig$urbanization~relig$postcommunist, paired = FALSE)

##
##  Welch Two Sample t-test
##
## data:  relig$urbanization by relig$postcommunist
## t = 3.7441, df = 158.51, p-value = 0.0002529
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.846861 12.436669
## sample estimates:
## mean in group 0 mean in group 1
##        68.16176        60.02000

- What is the t-statistic?


T-test (3)

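- $$t = \frac{\text{sample statistic} - \text{hypothesized population parameter}}{\text{estimated standard error}}$$
- $$t = \frac{(M_1 - M_2) - (\mu_1 - \mu_2)}{s_{(M_1 - M_2)}}$$
- ‘the standard error can be viewed as a measure of how much difference is reasonable to expect between two sample means if the null hypothesis is true’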
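
The t-statistic can be rebuilt by hand and compared with `t.test()`. The sketch below uses simulated data, not the `relig` data: `g0` and `g1` are hypothetical groups, and the standard error is assumed to be the unpooled (Welch) version, sqrt(s1²/n1 + s2²/n2), which is what `t.test()` uses by default.

```r
set.seed(1)

# Hypothetical data for two independent groups (not the relig data).
g0 <- rnorm(40, mean = 68, sd = 17)
g1 <- rnorm(30, mean = 60, sd = 17)

# Sample statistic: difference between the two sample means (M1 - M2).
diff_means <- mean(g0) - mean(g1)

# Estimated standard error of the difference (Welch, unequal variances).
se_diff <- sqrt(var(g0) / length(g0) + var(g1) / length(g1))

# Under H0 the hypothesized population difference (mu1 - mu2) is 0.
t_manual <- (diff_means - 0) / se_diff

# Compare with the built-in Welch two-sample t-test.
t_builtin <- unname(t.test(g0, g1)$statistic)
all.equal(t_manual, t_builtin)
```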


Tomorrow


What we omitted today & going forward

1. Uncertainty and hypothesis testing
2. Regression: bivariate & multivariate (interpretation)
3. Conditional hypotheses & different outcomes (Day 5, morning)


Stats Bootcamp - Day #4
Zoltan Fazekas
zfazekas.github.io

4 February 2016
Zagreb


Outline Day 2

- Samples and uncertainty
- Bivariate regression
- Multivariate regression, different types of predictors


Reminder (your friend and enemy)


Difference in notation

- Notation: why µ and not $\bar{x}$
- µ is the population mean, whereas $\bar{x}$ is the sample mean of x
- We rarely, if ever, have populations


Terminology

Population: total set of subjects of interest in a study

- Example: population of Europe, all the countries in the world, eligible voters in Denmark, etc.

Sample: subset of the population on which data is collected

Aim: we want to draw conclusions about the population, while working with samples


Inference

Inferential statistics: we provide predictions about characteristics of the population, based on information in a sample from that population [Agresti 1997]


Example (1)

- Assume that the full class is the population
- Select 3 people and grade their assignments
- The mean of their grades will be the mean of the full class’ grades
- The standard deviation of their grades will be the standard deviation of the class
- Under which conditions would you like this system?
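
A small simulation can make the grading thought experiment concrete. Everything in it is hypothetical: the class size of 30, the grades drawn at random on a 1-5 scale, and the 5000 repetitions are choices made only for this sketch.

```r
set.seed(42)

# Hypothetical population: grades (1-5) for a class of 30 students.
class_grades <- sample(1:5, size = 30, replace = TRUE)
mu <- mean(class_grades)   # population mean, known only because we made the class up

# One sample of 3 students: its mean can be far from mu.
one_sample <- sample(class_grades, size = 3)
mean(one_sample)

# Repeat the sampling many times to see how much the sample mean varies.
sample_means <- replicate(5000, mean(sample(class_grades, size = 3)))
range(sample_means)   # individual samples can miss by a lot
mean(sample_means)    # on average, the sample means are close to mu
```

Any single sample of three can land well away from the class mean, even though the sample means are centered on it across repetitions.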

Samples (1)

- Careful selection of samples: who, where, how many
- Many possibilities: simple random sample, systematic random sample, stratified samples, cluster sampling
- These are all probability samples
- Non-probability: volunteer sampling can work, but has serious limitations
- For the latter, increasing the sample size does not help with bias

Samples (2)

- Other practical difficulties: can you reach everybody you want, do these people answer, do they answer every question?
- Easy to extrapolate to a different unit of analysis; however, we usually talk about individuals in this sense

Terminology (2)

Parameter: for populations

Statistic: for samples

(Very) Basic probability overview (1)

- We have outcomes and we describe how many times a given outcome appears in a long sequence of repeated observations, or draws
- Flip coins, draw cards, roll dice, etc.
- We do frequentist probability theory here: the relative frequency of an outcome
- Example: coins. In the long run, around half of the time you will observe heads, so P(outcome = heads) = 0.5 for a single flip of the coin
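The relative-frequency idea can be sketched with a short simulation. This is only an illustration, assuming a fair coin simulated with `sample()`; the seed, the number of flips, and the object names are arbitrary choices that do not come from the slides.

```r
# Simulate many flips of a fair coin (illustrative assumption)
set.seed(2016)   # arbitrary seed, only for reproducibility
n_flips <- 10000
flips <- sample(c("heads", "tails"), size = n_flips, replace = TRUE)

# Relative frequency of heads after each additional flip
running_share <- cumsum(flips == "heads") / seq_len(n_flips)

# The share moves around early on and settles near 0.5 in the long run
running_share[c(10, 100, 1000, 10000)]
```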

(Very) Basic probability overview (2)

Probability distribution: the list of all possible outcomes with their associated probabilities

Continuous Y: if you have a truly continuous variable, think in terms of intervals

Example (2)

- From Agresti 1997 (87)

y    P(y)
0    0.91
1    0.06
2    0.02
3    0.01

- Y = number of people you have known who were victims of homicide within the past year (GSS)

Two simple rules

Probabilities are between 0 and 1: $0 \le P(y) \le 1$

The sum of the probabilities of all possible outcomes (sample points) is equal to 1: $\sum_{\text{all } y} P(y) = 1$

Expected value:

Mean of a probability distribution: the expected value of Y, denoted by E(Y)

$\mu = \sum y P(y)$

From the example:

$0 \cdot 0.91 + 1 \cdot 0.06 + 2 \cdot 0.02 + 3 \cdot 0.01 = 0.13$
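The same calculation can be written in R. The sketch below types in the distribution from the Agresti example by hand; the object names are illustrative.

```r
# Outcomes and their probabilities, as in the example table
y <- 0:3
p <- c(0.91, 0.06, 0.02, 0.01)

# Rule check: each probability lies in [0, 1] and they sum to 1
stopifnot(all(p >= 0 & p <= 1), isTRUE(all.equal(sum(p), 1)))

# Expected value: sum of each outcome weighted by its probability
expected_value <- sum(y * p)
expected_value
```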

Gauss

- You can think of probabilities in terms of percentages/proportions
- Preferably, you can think of percentages/proportions in terms of probabilities

Example

- From Gravetter (141)
- We assume a normal distribution of heights, with µ = 172.72 and σ = 15.24, all in cm
- What is the probability of randomly selecting an individual who is taller than 203.2 cm (2 meters and 3.2 centimeters)?
- Note: transformations from inches make problems easier to understand, but with messier numbers

Z-scores in detail

Z-score: $z = \frac{x_i - \mu}{\sigma}$

Express values in terms of standard deviations

From the example: $z = \frac{203.2 - 172.72}{15.24} = 2.00$

- The observation in question is 2 standard deviations from the mean
- The sign tells us where: above the mean

Heights (1)

- We subtract everything that is below the 2σ mark: 100% - 2 × 34.1% - 2 × 13.6% - 2.1% - 0.4%/2 ≈ 2.28%

Heights (2)

p(X > 203.2) = p(z > 2.00) = 2.28% = 0.0228

- While labor intensive, this has a didactic benefit
- For these types of questions you can use the unit normal table, where you find various proportions or probabilities for z-scores
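Instead of the unit normal table, the same probability can be obtained from R's `pnorm()`. A small sketch using the numbers of the height example; the object names are illustrative.

```r
mu    <- 172.72   # population mean, in cm
sigma <- 15.24    # population standard deviation, in cm
x     <- 203.2    # height of interest, in cm

# z-score: distance from the mean in standard deviation units
z <- (x - mu) / sigma

# Upper-tail probability of the standard normal distribution
p_taller <- pnorm(z, lower.tail = FALSE)

# Same probability without standardizing first
p_taller_direct <- pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)

c(z = z, p_taller = p_taller)
```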

The convention

"Crucial" conventional value:

- z = ±1.96, two-tailed: 0.05
- This means that 95% of the observations are between µ − 1.96σ and µ + 1.96σ

Reminder:

- The normal distribution is symmetrical
- These values are applicable only for (or assuming) a normal distribution
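The conventional value can be checked with the normal quantile and distribution functions in base R. A brief sketch; the object names are illustrative.

```r
# Quantile that leaves 2.5% in the upper tail (5% in both tails together)
z_crit <- qnorm(0.975)

# Share of a normal distribution between -1.96 and +1.96 standard deviations
share_within <- pnorm(1.96) - pnorm(-1.96)

c(z_crit = z_crit, share_within = share_within)
```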

Sample vs. population

- We had a population distribution that is usually unobserved and has unknown parameters: µ, σ
- We had a sample distribution that is observed, with statistics: $\bar{Y}$, s
- To evaluate our sample statistics with regard to the population parameters, we introduce the sampling distribution

Sampling distribution: a probability distribution that determines the probabilities of the possible values of a sample statistic

Sampling distribution (1)

- You want to know what would happen in a referendum about a new immigration law: you can be either in favor or against (for simplification)
- Take a representative sample of Croatian people eligible to vote in that referendum (1000-1200 people)
- Your survey results suggest that 58% of your sample are in favor of the new law and 42% of your sample are against it
- In terms of a dichotomous variable named pro_law, its mean ($\bar{Y}$) is 0.58
- Other polling organizations report favorability rates of 50%, 62%, 55% and so on
- What to do, and which is the "correct" (true) number?

Sampling distribution (2)

- Smaller scale example: 4 people with possible outcomes: 1 if pro, 0 if against the new law
- Assume (you "know") the population is split, so half of it favors the law and half of it is against it

Sampling distribution (3)

- All possible combinations of outcomes, n = 4

(1,1,1,1) (1,1,1,0) (1,1,0,1) (1,0,1,1)
(0,1,1,1) (1,1,0,0) (1,0,1,0) (1,0,0,1)
(0,1,1,0) (0,1,0,1) (0,0,1,1) (1,0,0,0)
(0,1,0,0) (0,0,1,0) (0,0,0,1) (0,0,0,0)

- When there are 4 people the result for pro can be: 0%, 25%, 50%, 75%, 100%

Sampling distribution (4)

- 1/16 probability for full support, only (1,1,1,1): 0.0625

Sampling distribution (5)

- 1/16 probability for no support, only (0,0,0,0): 0.0625

Sampling distribution (6)

- 4/16 probability for 25% support, from (1,0,0,0) (0,1,0,0) (0,0,1,0) (0,0,0,1): 0.25

Sampling distribution (7)

- 4/16 probability for 75% support, from (1,1,1,0) (1,1,0,1) (1,0,1,1) (0,1,1,1): 0.25

Sampling distribution (8)

- 6/16 probability for 50% support, from (1,1,0,0) (1,0,1,0) (1,0,0,1) (0,1,1,0) (0,1,0,1) (0,0,1,1): 0.375

Sampling distribution (9)

Sample (support) proportion    Probability
0.00                           0.0625
0.25                           0.2500
0.50                           0.3750
0.75                           0.2500
1.00                           0.0625

- Sampling distribution of the sample proportion, n = 4
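The table can be reproduced by enumerating all 16 equally likely outcomes. The sketch below uses base R only; the object names are illustrative, and the comparison with `dbinom()` relies on the slide's assumption of a population split in half (probability 0.5 of being pro).

```r
# All 2^4 = 16 combinations of 0 (against) and 1 (pro) for 4 people
outcomes <- expand.grid(rep(list(0:1), 4))

# Support proportion in each possible sample
support <- rowMeans(outcomes)

# Each combination is equally likely, so count and divide by 16
sampling_dist <- table(support) / nrow(outcomes)
sampling_dist

# Same probabilities from the binomial distribution with p = 0.5
binom_probs <- dbinom(0:4, size = 4, prob = 0.5)
binom_probs
```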

Sampling distribution (10)

- In our example, we technically looked at the mean ($\bar{Y}$)
- With a very small number of repeated samples (with few observations) we saw quite some variation in what mean we would get
- The key mental leap here is to think of samples as you previously thought of observations

Sampling distribution (11)

- For each sample (where we have observed values) we calculate the statistic of interest, in this case the sample mean
- We will have as many values of that mean ($\bar{Y}_i$) as samples we took
- They will gravitate around the true population mean (µ), sometimes below it, sometimes above it; but, like anything that has multiple values, they also have a distribution

Sampling distribution (12)

- We went this far to get some sort of precision measure for our statistic of interest, $\bar{Y}$
- We need it because we might doubt our results when using a sample rather than the population: the quality (not necessarily the content) of our inferences depends on this

Standard error (1)

- We define the standard error, a measure of the spread (variability) of the sampling distribution

First, the standard deviation of the sampling distribution: $SD_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}$, where σ is the standard deviation of the population and n is the sample size

Standard error: $SE_{\bar{Y}} = \frac{s}{\sqrt{n}}$, where s is the standard deviation of the sample and n is the sample size

Standard error (2)

- Precision increases with sample size: the standard error decreases as n increases
- Larger sample sizes push you closer (on average) to the true mean (µ)
- Important: under the assumption of probability samples
- Sampling error
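A short sketch of both formulas in R. The helper `se_mean()` and the `grades` vector are hypothetical and made up for illustration; the value 15.24 is borrowed from the height example only to show how σ/√n shrinks as n grows.

```r
# Standard error of the mean from a sample: s / sqrt(n)
# (se_mean is a hypothetical helper, not a base R function)
se_mean <- function(x) {
  x <- x[!is.na(x)]          # drop missing values before computing
  sd(x) / sqrt(length(x))
}

# Made-up values, for illustration only
grades <- c(2, 4, 4, 4, 5, 5, 7, 9)
se_mean(grades)

# With a fixed population sigma, sigma / sqrt(n) shrinks as n grows:
# quadrupling n halves the spread of the sampling distribution
sigma <- 15.24
n <- c(25, 100, 400)
sd_of_mean <- sigma / sqrt(n)
sd_of_mean
```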

Central Limit Theorem

- For random sampling, as the sample size n grows, the sampling distribution of $\bar{Y}$ approaches a normal distribution
- It does not matter that a discrete variable produced that mean
- What counts as a suitable n depends on the population distribution
- If the sampling distribution of $\bar{Y}$ follows a normal distribution with mean µ and standard deviation $\sigma/\sqrt{n}$: $\bar{Y}$ falls within $\mu \pm 2\sigma/\sqrt{n}$ with a probability of around 0.95
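A simulation can make this concrete. The sketch below assumes, purely for illustration, that the true share in favor of the law is 0.58 and that each poll is a simple random sample of 1,000 people; the seed, the number of repeated samples, and the object names are arbitrary.

```r
set.seed(2016)       # arbitrary seed, only for reproducibility
p <- 0.58            # assumed true population share (illustrative)
n <- 1000            # size of each sample
n_samples <- 5000    # number of repeated samples

# Mean of a 0/1 variable in each sample = share of 1s in that sample
sample_means <- rbinom(n_samples, size = n, prob = p) / n

# For a 0/1 variable sigma = sqrt(p * (1 - p)), so sigma / sqrt(n) is:
sd_of_mean <- sqrt(p * (1 - p) / n)

# Share of sample means within 2 standard deviations of the true mean
share_within_2sd <- mean(abs(sample_means - p) <= 2 * sd_of_mean)

c(mean = mean(sample_means), sd = sd(sample_means), theory_sd = sd_of_mean)
share_within_2sd
```

A histogram of `sample_means` (for example with `hist()`) should look approximately bell-shaped even though the underlying variable only takes the values 0 and 1.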

Possible RQ (1)

- Interested to see whether wealth is systematically related to more or less religious attendance (or whether . . . )
- Possible theory
  - ?
  - Implications?
- In terms of dependent and independent variables
- Preferably: response and explanatory variables (or predictors)

Possible RQ (2)

Caveat: theoretically, the link might be causal; statistically, these data and methods are not suitable to say so

Operationalization and data

- Response variable: aggregate religious attendance (1 to 8)
- Explanatory variable: GDP per capita expressed in 1,000 USD (but see paper for alternatives)
- Discuss case selection, sample, population: 188 country x year combinations

In our data

nrow(relig)

## [1] 188

summary(relig$attend_religious_services)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   1.131   3.221   4.194   4.272   5.079   7.398       8

summary(relig$gdp1000)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.7068  6.6010 11.2800 12.8600 19.4900 41.3500      35

In our data

[Figure: scatterplot of attend religious services (y-axis) against GDP (1,000 USD) (x-axis)]

Linear regression (1)

Linear regression: analysis of the relationship between two (or more) variables, assuming that one is a function of the other(s) and that this function is linear. It is depicted by a line, and this line should be the best fitting line

Function notation (1): $Y = a + bX + e$

Note: we do not yet discuss the question of samples and uncertainty

Function notation (2): straightforward, basic math

Linear regression (2)

$Y = a + bX + e$

Elements:

- Y: response or dependent variable (left)
- X: explanatory or independent variable (right)
- a: constant (or intercept)
- b: slope
- e: error term; relationships in social sciences are never exact
- Both a and b are coefficients

Linear regression (3)

- Why call a the constant or intercept?
- It is unrelated to changes in X
- It reflects the value of Y when X is 0
- Substitute in values of X and get the values of Y using a and b
- We do not know a or b, we only have the data

Linear regression (4)

Given Y, X, and $Y = a + bX + e$:

- Find the values of a and b such that the regression line given by them is the best fitting line
- Best fit in what terms? Many straight lines between Y and X can be plotted

Linear regression (5)

[Figure: scatterplot of attend religious services (y-axis) against GDP (1,000 USD) (x-axis), with multiple lines described by different values of a and b]

Choice: Least squares principle

- Minimize something
- Goal: accurate predictions of Y, given X
- Not all the observations fall on any of the lines
- Inspect the prediction error
- $\hat{Y} = a + bX$, where $\hat{Y}$ is the predicted value
- Given the function described by a and b, $\hat{Y}$ can be calculated for each observation in our data

a = 2 and b = 0.5

[Figure: scatterplot of attend religious services (y-axis) against GDP (1,000 USD) (x-axis), with the line for a = 2 and b = 0.5]

Results (1)

head(pred_vars)

##    gdp1000 attend_religious_services predicted    error
## 1 17.49698                  4.220493 10.748490 6.527997
## 2 16.17930                  3.726638 10.089650 6.363012
## 3 17.76259                  2.615059 10.881295 8.266236
## 4 16.29161                  2.560833 10.145805 7.584972
## 5 15.92710                  3.851341  9.963550 6.112209
## 6 14.04749                  2.922879  9.023745 6.100866
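The slides do not show how `pred_vars` was built. One way it could be reconstructed, sketched here with only the six printed rows typed in by hand (the data frame name `six_rows` is made up), is to apply a = 2 and b = 0.5 directly. In the printed output the error column equals predicted minus observed, so the sketch uses that sign.

```r
# The six rows printed above, typed in by hand (not the full data set)
six_rows <- data.frame(
  gdp1000 = c(17.49698, 16.17930, 17.76259, 16.29161, 15.92710, 14.04749),
  attend_religious_services = c(4.220493, 3.726638, 2.615059,
                                2.560833, 3.851341, 2.922879)
)

# Candidate line from the slide
a <- 2
b <- 0.5

# Predicted value for each observation
six_rows$predicted <- a + b * six_rows$gdp1000

# Prediction error, with the sign used in the printed output
six_rows$error <- six_rows$predicted - six_rows$attend_religious_services

six_rows
```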

Results (2)

- Rather bad
- What indications?

Check whether intuitive: those country x years that have the same GDP will have the same predicted values for religious attendance ($\hat{Y}$)

Minimization (1)

- Intuitively: minimize the prediction error
- It would be minimizing the sum of all $Y - \hat{Y}$
- Why is that a bad solution?

a = 4.5 and b = −0.1

[Figure: scatterplot of attend religious services (y-axis) against GDP (1,000 USD) (x-axis), with the line for a = 4.5 and b = −0.1]

- Better, but some $Y - \hat{Y}$ are positive, some negative

Results (4)

head(pred_vars_2)

##    gdp1000 attend_religious_services predicted      error
## 1 17.49698                  4.220493  2.750302 -1.4701908
## 2 16.17930                  3.726638  2.882070 -0.8445676
## 3 17.76259                  2.615059  2.723741  0.1086819
## 4 16.29161                  2.560833  2.870839  0.3100058
## 5 15.92710                  3.851341  2.907290 -0.9440510
## 6 14.04749                  2.922879  3.095251  0.1723718

Minimization (2)

- When the prediction errors are summed up, positive and negative errors cancel out
- Instead of the sum of prediction errors, $\sum (Y - \hat{Y})$, the OLS solution minimizes $\sum (Y - \hat{Y})^2$
- The sum of squared errors
- There is a simple numerical solution to find the a and b that minimize $\sum (Y - \hat{Y})^2$
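For one predictor, the minimizing values can be written down directly: b is the sum of cross-products of deviations divided by the sum of squared deviations of X, and a follows from the means. The sketch below applies this to the six rows printed earlier (typed in by hand), so its estimates will differ from those of the full data set; `sse()` is a hypothetical helper.

```r
# Six observations from the printed output (not the full data set)
x <- c(17.49698, 16.17930, 17.76259, 16.29161, 15.92710, 14.04749)
y <- c(4.220493, 3.726638, 2.615059, 2.560833, 3.851341, 2.922879)

# Least squares slope and intercept
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

# Sum of squared errors for any candidate line (hypothetical helper)
sse <- function(a, b) sum((y - (a + b * x))^2)

# lm() arrives at the same coefficients
fit <- lm(y ~ x)
rbind(by_hand = c(a_hat, b_hat), lm = unname(coef(fit)))

# The least squares line has a smaller SSE than the two lines tried earlier
c(ols = sse(a_hat, b_hat), first_try = sse(2, 0.5), second_try = sse(4.5, -0.1))
```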

Terminology

Intercept:

- a: the value Y takes when X is zero
- No broke countries in the sample
- Interpreting the intercept: it is crucial to have an X (or more) that is meaningful when it takes the value 0
- Solution: mean center our variable, $X_{new} = X - \bar{X}$

Slope:

- b: should be interpreted as the average change in Y associated with a one-unit change in X
- For the example: how much change in religious attendance (on its unit of coding) is associated with 1,000 USD higher GDP

Results (5)

[Figure: scatterplot of attend religious services (y-axis) against GDP (1,000 USD) (x-axis)]

a = 4.89 and b = −0.058

reg_gdp <- lm(attend_religious_services ~ gdp1000, data = relig)

summary(reg_gdp)

##
## Call:
## lm(formula = attend_religious_services ~ gdp1000, data = relig)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6558 -0.8633 -0.0666  0.7049  2.6258
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.89080    0.18022  27.139  < 2e-16 ***
## gdp1000     -0.05831    0.01188  -4.907 2.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.118 on 145 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.1424, Adjusted R-squared:  0.1365
## F-statistic: 24.07 on 1 and 145 DF,  p-value: 2.461e-06

- At which value of X should the 4.89 value for religious attendance appear?
- Is it meaningful?

Recoding

mean(relig$gdp1000, na.rm = TRUE)

## [1] 12.85864

reg_gdp <- lm(attend_religious_services ~ I(gdp1000 - 12.85864), data = relig)

summary(reg_gdp)

##
## Call:
## lm(formula = attend_religious_services ~ I(gdp1000 - 12.85864),
##     data = relig)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6558 -0.8633 -0.0666  0.7049  2.6258
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            4.14098    0.09225  44.890  < 2e-16 ***
## I(gdp1000 - 12.85864) -0.05831    0.01188  -4.907 2.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.118 on 145 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.1424, Adjusted R-squared:  0.1365
## F-statistic: 24.07 on 1 and 145 DF,  p-value: 2.461e-06
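Mean centering leaves the slope untouched and moves the intercept to the predicted value at the mean of GDP. A quick check using only the rounded coefficients printed in the two outputs (the object names are illustrative); small differences in the last digits come from rounding.

```r
# Rounded estimates from the uncentered model
a <- 4.89080
b <- -0.05831

# Mean of gdp1000 used for centering
mean_gdp <- 12.85864

# Predicted attendance at mean GDP = intercept of the centered model
centered_intercept <- a + b * mean_gdp
centered_intercept
```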

Results (7)

[Figure: predicted attend religious services (y-axis, 2.5 to 4.5) against GDP (1,000 USD) (x-axis, −10 to 30)]

- Are there 188 points? Are points overlapping?

Results (8)

[Figure: residuals (y-axis, −2 to 2) against fitted/predicted values (x-axis, 2.5 to 4.5), with each point labelled by country]

- Useful: what does it show?

General thoughts

Prediction:

- For any value of X you can get the predicted Y ($\hat{Y}$), because you know a and b
- Caution with the intercept (in case of recoding)

Linear relationship:

- It does not matter whether the change is from 10 to 11 thousand USD or from 40 to 41
- Assumption: the effect of GDP is linear

Task:

- Interpret results
- Best guess for religious attendance without the availability of a covariate
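Both points can be illustrated with the rounded estimates from the uncentered model. `predict_attendance()` is a hypothetical helper written for this sketch, not part of the original analysis.

```r
# Rounded estimates from the uncentered model
a <- 4.89080
b <- -0.05831

# Predicted religious attendance for given GDP values (in 1,000 USD)
predict_attendance <- function(gdp1000) a + b * gdp1000

predict_attendance(c(10, 11, 40, 41))

# Linearity: a one-unit step changes the prediction by b, wherever it starts
step_low  <- diff(predict_attendance(c(10, 11)))
step_high <- diff(predict_attendance(c(40, 41)))
c(step_low = step_low, step_high = step_high)
```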

How good (1)

- Previous example: a statistical model focused on the relationship between two variables
- A statistical model is always an abstraction
- Unknown whether this is the true model (it never is)
- Benchmark or assess how well this model worked
- Many model fit statistics, or goodness-of-fit measures
- These are associated with the statistical model itself

How good (2)

- Customary to report $R^2$, or the coefficient of determination
- Without much detail now:
  - The $R^2$ is not a goodness-of-fit measure
  - Trying to obsessively increase the $R^2$ is not a goal and should not be a practice
- Read: Achen (1982) and King (1986)

R2 (1)

- $R^2$: a measure of explained variance
- How much variation in Y is explained by X
- More precisely: how much variation in Y is explained by the variation in X
- Total variation in Y: $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
- Explained variation in Y: $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$

R2 (2)

- $R^2$: ratio between the explained variation and the total variation: $R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
- Range: [0, 1]
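The ratio can be computed by hand and compared with what `summary()` reports. The sketch again uses only the six rows printed earlier (typed in by hand), so the resulting value is not the 0.1424 of the full model.

```r
# Six observations from the printed output (not the full data set)
x <- c(17.49698, 16.17930, 17.76259, 16.29161, 15.92710, 14.04749)
y <- c(4.220493, 3.726638, 2.615059, 2.560833, 3.851341, 2.922879)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

# Explained and total variation in Y
explained <- sum((y_hat - mean(y))^2)
total     <- sum((y - mean(y))^2)

r2_by_hand <- explained / total
c(by_hand = r2_by_hand, from_summary = summary(fit)$r.squared)
```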

Uncertainty and inferences (1)

- Standard error and the hypothesis test: grounded in some assumptions
- Samples, not the population
- Different sample: (how) different relationship?
- Probabilistic judgment on how likely it is, given the sample size and the assumptions of the linear regression, that a particular relationship appears in other samples

Uncertainty and inferences (2)

- Rewrite the equation to reflect the focus on parameters: $Y = \alpha + \beta X + \varepsilon$ (alternatively, $\alpha$ can be written as $\beta_0$)
- Usually a non-directional null hypothesis ($H_0$): $H_0: \beta = 0$
- Alternative hypothesis, the coefficient is different from 0: $H_1: \beta \neq 0$
- $H_0$: in the population, wealth is unrelated to religious participation
- Take many samples and run the same model -> many values for a, b
- Some of these would be close to the true value of the population parameter, some of them would be way off
- How is our point estimate doing?

6 Samples

[Figure: attend religious services (y-axis) against GDP (1,000 USD) (x-axis), shown for 6 samples]

- Note: different samples

Uncertainty and inferences (3)

Standard error of the coefficient:

- The measure of the amount of variability that would be present among different b's estimated from samples drawn from the same population
- Sensitivity of the estimate to changes in the sample
- Needed: precision, or controlled uncertainty, or accepted uncertainty
- Needed: threshold, given the sample size and the standard error
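The variability among different b's can be made visible with a simulation. Everything about the "population" below is an assumption for illustration: the parameter values loosely echo the estimates from the example, GDP is drawn from a uniform distribution between 0 and 40, and the errors are normal. `one_slope()` is a hypothetical helper.

```r
set.seed(3)        # arbitrary seed, only for reproducibility

# Assumed population parameters (illustrative, not known quantities)
alpha   <- 4.89
beta    <- -0.058
sigma_e <- 1.118
n       <- 147     # observations per sample

# Draw one sample, fit the regression, return the estimated slope b
one_slope <- function() {
  x <- runif(n, min = 0, max = 40)
  y <- alpha + beta * x + rnorm(n, mean = 0, sd = sigma_e)
  unname(coef(lm(y ~ x))[2])
}

# Many samples give many b's
b_draws <- replicate(500, one_slope())

# The b's scatter around beta; their spread is what the standard error estimates
c(mean_b = mean(b_draws), sd_b = sd(b_draws))
```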

Uncertainty and inferences (4)

- One sample at hand
- Example: is -0.05831 (b) among those b's that are somewhere close to β?
- Needed: a probabilistic evaluation of this
- Use the t-distribution: a probability distribution derived for the ratio $\frac{b - \beta}{SE}$
- If this ratio exceeds (in absolute value) some critical value, the null hypothesis can be rejected
- Make probabilistic statements about a difference between the true parameter and the estimate, knowing the variation of the estimate
- The null hypothesis is that β = 0, so what we actually check is whether $\frac{b}{SE}$ is bigger (in absolute value) than a given threshold


Uncertainty and inferences (5)
- For larger sample sizes the t-distribution looks very much like the z-distribution
- Depending on how much certainty we want: choose a threshold
- It is a function of how much information is available and how many unknowns there are: the degrees of freedom
- Example: 147 observations and 2 coefficients were estimated, so df = 147 − 2 = 145
- In any stats book: table for the t-distribution
- Look up the critical value for your chosen level of (un)certainty, given the degrees of freedom
- Example: the 95% (two-tailed) value for df = 145 is about 1.98
- If the ratio of the estimate and its standard error is above this value in absolute terms -> reject H0: β = 0 at the 5% level (if β were 0, a ratio this large would show up in fewer than 5% of samples)

Uncertainty and inferences (6)
- Example: b = -0.05831 and the standard error was 0.01188
- t-ratio calculated as b/SE is -4.907
- “Larger than the threshold of 1.98” - why? The comparison uses the absolute value, 4.907
- Generally: for samples of around 100+, use a 1.96 threshold
- A good rule of thumb: check whether the ratio between the estimate and the standard error is larger than 2 in absolute value
- Worth doing: test your estimate against a particular value, not only 0 (depending on goal)
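The t-ratio and the threshold from the example can be checked in a few lines. A minimal sketch using only the numbers quoted above (b, SE, 147 observations, 2 estimated coefficients); because b and SE are rounded, the ratio may differ from the reported one in the last digit.

```r
# Values quoted in the example
b <- -0.05831
se <- 0.01188
df <- 147 - 2                      # observations minus estimated coefficients

t_ratio <- b / se                  # test statistic for H0: beta = 0
crit <- qt(0.975, df)              # two-tailed 95% critical value of the t-distribution
p_value <- 2 * pt(-abs(t_ratio), df)

t_ratio
crit
abs(t_ratio) > crit                # TRUE means H0 is rejected at the 5% level
p_value
```

To test against a value other than 0, replace `b / se` with `(b - value) / se`.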

Standard error:

\[ SE = \frac{\sqrt{\frac{1}{n-2}\sum_i (Y_i - \hat{Y}_i)^2}}{\sqrt{\sum_i (X_i - \bar{X})^2}} \]
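For a model with one predictor, the standard error of the slope can be computed by hand (residual standard deviation divided by the square root of the sum of squared deviations of X) and compared with what `lm()` reports. A sketch on simulated data; the variable names and the population values are hypothetical.

```r
set.seed(2)

# Simulated data with hypothetical population values
n <- 147
x <- runif(n, 0, 40)
y <- 4.1 - 0.05 * x + rnorm(n, 0, 1.1)

fit <- lm(y ~ x)

# Numerator: residual standard deviation with n - 2 degrees of freedom
resid_sd <- sqrt(sum(residuals(fit)^2) / (n - 2))

# Denominator: square root of the sum of squared deviations of X from its mean
se_by_hand <- resid_sd / sqrt(sum((x - mean(x))^2))

# Standard error as reported in the regression summary
se_from_lm <- summary(fit)$coefficients["x", "Std. Error"]

c(by_hand = se_by_hand, from_lm = se_from_lm)
```

The two values agree: less residual noise or more spread in X both make the estimate more precise.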


Confidence intervals (intro)
- Example: b = -0.05831, point estimate
- Knowing the point estimate and the SE: calculate confidence intervals around the point estimate
- Interval, 2 values define it: lower and upper bound
- For each coefficient: calculate b ± 1.96 × SE
- 1.96: 95% level value of the t-statistic (and z-score) for a decent sample size
- Lower bound: -0.05831 - 1.96 × 0.01188 = -0.0816
- Upper bound: -0.05831 + 1.96 × 0.01188 = -0.0350
- With the exact t value for df = 145 (about 1.98) the bounds are -0.0818 and -0.0348
- Interpretation: if 0 is not within the interval defined by the lower and upper bounds, the coefficient is significantly different from 0 (95% level)
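The interval arithmetic can be reproduced directly. A small sketch using the quoted b and SE, once with the 1.96 rule of thumb and once with the exact t quantile for df = 145 (the df is taken from the earlier example).

```r
b <- -0.05831
se <- 0.01188

# Rule-of-thumb interval: b +/- 1.96 * SE
ci_z <- b + c(-1, 1) * 1.96 * se

# Interval based on the exact t quantile for 145 degrees of freedom
ci_t <- b + c(-1, 1) * qt(0.975, df = 145) * se

round(ci_z, 4)
round(ci_t, 4)

# 0 lies outside both intervals, so b is significantly different from 0 (95% level)
zero_inside <- ci_z[1] <= 0 && 0 <= ci_z[2]
zero_inside
```

For a fitted model object, `confint()` returns the t-based version for every coefficient.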


In papers (1)

[Figure: attendance of religious services (y-axis) against GDP in 1,000 USD (x-axis)]


In papers (2)

                             Model 1
Intercept                    4.141***
                             (0.092)
GDP                          -0.054***
(1000 USD, Mean centered)    (0.012)

N                            147
adj. R2                      0.14


Usually omitted

[Figure: attendance of religious services (y-axis) against government regulation (x-axis)]


Multiple predictors
- Multivariate regression
- There is nothing special about this situation
- The formulation: everything else held constant or controlling for
- Y = α + β1X1 + β2X2 + ε
- X1 = GDP, mean centered
- X2 = urbanization, mean centered
- For some reason, we think that urbanization level should have a negative effect on religious attendance
- But we also need to control for wealth


Model results

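# Mean-center both predictors (missing values are ignored when computing the mean)
relig$gdp1000_cent <- relig$gdp1000 - mean(relig$gdp1000, na.rm = TRUE)

# Assumes the column is called urbanization, as in the mean() call;
# relig$urban would only work through partial matching of the column name
relig$urban_cent <- relig$urbanization - mean(relig$urbanization, na.rm = TRUE)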


Model results

reg_gdp <- lm(attend_religious_services ~ gdp1000_cent + urban_cent,
              data = relig)


Model results
summary(reg_gdp)

##
## Call:
## lm(formula = attend_religious_services ~ gdp1000_cent + urban_cent,
##     data = relig)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5475 -0.7775 -0.1252  0.6942  2.2510
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.224652   0.089975  46.953  < 2e-16 ***
## gdp1000_cent -0.030991   0.013150  -2.357   0.0198 *
## urban_cent   -0.028494   0.007051  -4.041 8.65e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.063 on 143 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.2306, Adjusted R-squared:  0.2198
## F-statistic: 21.43 on 2 and 143 DF,  p-value: 7.267e-09


In papers (3)

                     Model 1
Intercept            4.225***
                     (0.090)
GDP                  -0.031*
(Mean centered)      (0.013)
Urbanization         -0.028***
(Mean centered)      (0.007)

N                    146
R2                   0.23
adj. R2              0.22

Standard errors in parentheses
Note: *** p < 0.001; ** p < 0.01; * p < 0.05


Prediction (1)

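# Vary GDP over its observed range in steps of 1 (1,000 USD);
# hold urbanization at its mean, which is 0 after centering
pred_data <- data.frame(gdp1000_cent = seq(min(relig$gdp1000_cent, na.rm = TRUE),
                                           max(relig$gdp1000_cent, na.rm = TRUE),
                                           1),
                        urban_cent = 0)

# Predicted values and their standard errors for each row of pred_data
predictions <- predict(reg_gdp, newdata = pred_data, se.fit = TRUE)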


Prediction (2)
pred_data$predicted <- predictions$fit
pred_data$predicted_se <- predictions$se.fit
head(pred_data)

##   gdp1000_cent urban_cent predicted predicted_se
## 1   -12.151795          0  4.601254    0.1762373
## 2   -11.151795          0  4.570263    0.1650496
## 3   -10.151795          0  4.539271    0.1541720
## 4    -9.151795          0  4.508280    0.1436750
## 5    -8.151795          0  4.477288    0.1336482
## 6    -7.151795          0  4.446297    0.1242057
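Predicted values and their standard errors are usually turned into an uncertainty band before plotting. A minimal sketch that rebuilds the six rows printed above as a small data frame and adds approximate 95% bounds with the 1.96 rule; the column names `lower` and `upper` are introduced here for illustration only.

```r
# The six rows printed by head(pred_data), typed in by hand
pred_data <- data.frame(
  gdp1000_cent = c(-12.151795, -11.151795, -10.151795,
                   -9.151795, -8.151795, -7.151795),
  urban_cent   = 0,
  predicted    = c(4.601254, 4.570263, 4.539271,
                   4.508280, 4.477288, 4.446297),
  predicted_se = c(0.1762373, 0.1650496, 0.1541720,
                   0.1436750, 0.1336482, 0.1242057)
)

# Approximate 95% confidence bounds around each predicted value
pred_data$lower <- pred_data$predicted - 1.96 * pred_data$predicted_se
pred_data$upper <- pred_data$predicted + 1.96 * pred_data$predicted_se

pred_data

# Consecutive predictions differ by the GDP coefficient (about -0.031 per 1,000 USD)
diff(pred_data$predicted)
```

The band is widest at the edges of the observed GDP range, where predictions are least precise.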


Prediction (3)

[Figure: predicted attendance of relig. activities (y-axis) against GDP/capita (1000 USD, mean centered) (x-axis)]


R2

- When you have multiple variables, it is good to look at the adjusted R2, because it applies a penalty for the number of parameters estimated
- Otherwise, as you add variables to a model, the simple R2 can never decrease, which might lead to erroneous conclusions
- F-test
- Customary: p < 0.05 is what people are looking for, because that is the two-tailed 95% “significance” or “certainty”
- Of what?
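The penalty can be made concrete with the standard adjusted R2 formula, 1 − (1 − R2)(n − 1)/(n − k − 1), where k is the number of predictors. A short check against the two-predictor model reported earlier (R2 = 0.2306, N = 146); the inputs are rounded, so agreement is only up to rounding.

```r
r2 <- 0.2306   # multiple R-squared from the model output
n <- 146       # observations used in the model
k <- 2         # number of predictors (GDP and urbanization)

# Adjusted R-squared: penalizes R-squared for each estimated slope
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)

# Same R-squared with more predictors gives a lower adjusted value
adj_r2_more <- 1 - (1 - r2) * (n - 1) / (n - 10 - 1)
round(adj_r2_more, 4)
```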


Reporting
- One coefficient is statistically significant or significantly different from 0
- A . . . change in [variable] is associated with . . . in [dependent variable]
- Present meaningful and interesting comparisons
- In a table, report: DV, variable name, coefficient, standard errors, and put stars if you want, mentioning the associated p-value, or confidence intervals
- Better: report results in figures

Important: Statistical significance does not mean substantive significance


Categorical predictors (1)
- Assume: especially interested in differences between post-communist countries and other countries
- Make sure that the actual coding involves 0 and 1, because it helps

table(relig$postcommunist)

##
##   0   1
## 138  50


Categorical predictors (2)
reg_gdp <- lm(attend_religious_services ~ postcommunist,
              data = relig)
summary(reg_gdp)

##
## Call:
## lm(formula = attend_religious_services ~ postcommunist, data = relig)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2589 -1.0387  0.0118  0.8151  3.0083
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     4.3897     0.1120  39.188   <2e-16 ***
## postcommunist  -0.4318     0.2147  -2.011   0.0458 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.282 on 178 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.02222, Adjusted R-squared:  0.01672
## F-statistic: 4.045 on 1 and 178 DF,  p-value: 0.04582


Categorical predictors (3)

                          Model 1
Intercept                 4.39***
                          (0.112)
Post-communist country    -0.432*
                          (0.215)

N                         180
adj. R2                   0.017


Categorical predictors (4)

- Our results again: 4.39 (± 1.96 x 0.112); -0.432 (± 1.96 x 0.215)
- Predictor takes the value 1 for post-communist countries
- The model predicts that non-post-communist countries (0) have, on average, 0.432 higher religious participation than post-communist countries (remember the scale, 1-8)
- The reported SEs and the significance tests indicate that we can reject the null hypothesis for both the intercept and the slope


Categorical predictors (5)

- Thus, the coefficient for a dichotomous variable simply shows you the average difference in Y between the two categories
- Extending the interpretation issues . . . - including another dichotomous predictor
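That the coefficient of a 0/1 predictor equals the difference in group means can be verified on simulated data. A sketch with hypothetical values; the group sizes and effect size are assumptions chosen only to resemble the example loosely.

```r
set.seed(3)

# Simulated 0/1 predictor and outcome (hypothetical values)
postcommunist <- rep(c(0, 1), times = c(130, 50))
attend <- 4.4 - 0.4 * postcommunist + rnorm(180, 0, 1.3)

fit <- lm(attend ~ postcommunist)

# Group means computed directly from the data
mean_other <- mean(attend[postcommunist == 0])
mean_post <- mean(attend[postcommunist == 1])

coef(fit)                 # intercept and slope from the regression
mean_other                # equals the intercept (group coded 0)
mean_post - mean_other    # equals the slope (difference between the groups)
```

The intercept is the mean of the group coded 0, which is why a meaningful 0 category matters for interpretation.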


Categorical predictors (6)

table(relig$postcommunist, relig$west_europe)

##
##      0  1
##   0 83 55
##   1 50  0


Categorical predictors (6)
reg_gdp <- lm(attend_religious_services ~ postcommunist + west_europe,
              data = relig)
summary(reg_gdp)

##
## Call:
## lm(formula = attend_religious_services ~ postcommunist + west_europe,
##     data = relig)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6448 -0.7594 -0.0772  0.7175  3.2956
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     4.7756     0.1394  34.263  < 2e-16 ***
## postcommunist  -0.8177     0.2235  -3.658 0.000335 ***
## west_europe    -0.9361     0.2171  -4.312 2.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.223 on 177 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.1152, Adjusted R-squared:  0.1052
## F-statistic: 11.52 on 2 and 177 DF,  p-value: 1.981e-05


Categorical predictors (7)

                          Model 1
Intercept                 4.78***
                          (0.14)
Post-communist country    -0.82***
                          (0.22)
West European country     -0.94***
                          (0.22)

N                         180
adj. R2                   0.11


Categorical predictors: but. . . (8)

table(relig$postcommunist, relig$west_europe)

##
##      0  1
##   0 83 55
##   1 50  0


Categorical predictors: multiple categories, 1 variable (9)
- If k is the number of categories, then k − 1 dichotomous variables should be included in the model
- Baseline or reference group: a meaningful 0 for variables in order for the intercept to be meaningful
- What happens if all three are included?
- Substantive interpretation example . . . : create a three-category variable from GDP100, with 30% and less coded as 0, 30% to 60% as 1, and 61% and up coded as 2
- Include this as a predictor of average religious attendance and interpret the output
- Next: create three dichotomous variables - below.30, from30.60, above.60 - taking the value 1 if the condition is true, 0 otherwise; include these as predictors of average religious participation - what do you observe?
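The exercise above can be tried on simulated data before running it on the real variables. A sketch assuming a hypothetical predictor `gdp` on a 0-100 scale and a made-up outcome; the dummy names follow the ones suggested in the exercise.

```r
set.seed(4)

# Simulated data (hypothetical scale and effect)
n <- 150
gdp <- runif(n, 0, 100)
attend <- 5 - 0.02 * gdp + rnorm(n)

# Three-category variable: 30 and less, above 30 up to 60, above 60
gdp_cat <- cut(gdp, breaks = c(-Inf, 30, 60, Inf),
               labels = c("below.30", "from30.60", "above.60"))

# As a factor: R creates k - 1 = 2 dummies, the first level is the baseline
fit_factor <- lm(attend ~ gdp_cat)
coef(fit_factor)

# All k = 3 dummies plus the intercept: the dummies sum to 1 for every case,
# so one coefficient cannot be estimated and is reported as NA
below.30 <- as.numeric(gdp_cat == "below.30")
from30.60 <- as.numeric(gdp_cat == "from30.60")
above.60 <- as.numeric(gdp_cat == "above.60")
fit_all <- lm(attend ~ below.30 + from30.60 + above.60)
coef(fit_all)
```

In `fit_factor` each coefficient is the difference between a category and the baseline; in `fit_all` the NA shows that the third dummy carries no additional information.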


Tomorrow


What we omitted today & going forward

1. Regression assumptions (OLS)
2. Conditional hypotheses


Stats Bootcamp - Day #5
Zoltan Fazekas
zfazekas.github.io

5 February 2016
Zagreb


Before we start


Outliers

- Univariate ‘unusual observations’ are not necessarily regression outliers
- Regression outlier: an observation that is unusual ‘conditionally’ (based on the values of X)
- Leverage: far from the mean of X, i.e. unusual value on the predictor
- Influence: has leverage, but it is also unusual on Y
- Step 0: always check codebook and coding (the classic case of -99 age/income)

All checks are useful, but we usually want to check for influence


Tests
- Too many to discuss, but we will highlight one, which is both intuitive and useful

Cook's Distance (Cook's D):
- A distance metric for each observation in the data, given by the difference between b estimated on the full data and b−i, which is estimated without the i-th data point (more precisely, an F-test for the b = b−i hypothesis)
- Rules of thumb - many: over 1 is a problem, over 4/n is a problem, over 4/(n − k − 1) is a problem . . .
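Cook's D is available in base R through `cooks.distance()`. A sketch on simulated data with one planted influential observation (high leverage and far from the line); the data and the planted point are hypothetical.

```r
set.seed(5)

# Simulated data with a known linear relationship
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

# Plant one influential point: unusual on X (leverage) and unusual on Y given X
x[n] <- 6
y[n] <- -4

fit <- lm(y ~ x)
d <- cooks.distance(fit)

# Apply two of the rules of thumb
flagged_4n <- which(d > 4 / n)
flagged_1 <- which(d > 1)

which.max(d)     # observation with the largest Cook's D
flagged_4n

# Compare the slope with and without the most influential observation
coef(fit)[["x"]]
coef(lm(y[-n] ~ x[-n]))[[2]]
```

The 4/n rule typically flags more observations than the "over 1" rule, so flagged cases are candidates for inspection rather than automatic deletion.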


Assumptions behind OLS

- We already checked one: if there are influential observations that move your line, there might be issues
- But there are more . . .
- Biased estimates
- Biased hypothesis tests


Essential

Validity
- Test: no formal tests - measurement quality, sample quality

Functional form: linear (and additivity)
- Test: graphical (bivariate, scatter plots) + graphical (multivariate, errors vs. each predictor), local regression


Still important

Homoskedasticity: homogeneity (constant variance) of errors
- Test: graphical (errors vs. predicted values) + Breusch-Pagan test (H0 = homogeneity of errors)

Independence of errors: no clustering, for example
- Test: graphical (errors vs. potential grouping variable) - if time-series data, Durbin-Watson test

Multicollinearity: correlation of predictors
- Test: vif (variance inflation factor; 10 or above signals a problem)
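The variance inflation factor can be computed by hand in base R: regress each predictor on the other predictors and take 1/(1 − R2). A sketch on simulated, deliberately correlated predictors (all names and values are hypothetical); add-on packages such as car offer a ready-made `vif()` function for fitted models.

```r
set.seed(6)

# Two predictors that are correlated by construction
n <- 200
gdp <- rnorm(n)
urban <- 0.8 * gdp + rnorm(n, 0, 0.6)
y <- 4 - 0.3 * gdp - 0.3 * urban + rnorm(n)

fit <- lm(y ~ gdp + urban)

# VIF for a predictor: 1 / (1 - R2) from regressing it on the other predictor(s)
vif_gdp <- 1 / (1 - summary(lm(gdp ~ urban))$r.squared)
vif_urban <- 1 / (1 - summary(lm(urban ~ gdp))$r.squared)

c(gdp = vif_gdp, urban = vif_urban)

# Residuals against predicted values: a quick graphical check for
# non-constant variance (uncomment to draw the plot)
# plot(fitted(fit), residuals(fit))
```

With only two predictors both VIFs are identical and equal 1/(1 − r2), where r is their correlation.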


A word on ‘normality’

- Normal distribution: of errors (though people look at the outcome variable)
- Upon violation: estimates are not biased, but significance tests (p-values) will be
- Test: Shapiro-Wilk W test for normality (H0 = errors are normally distributed) or graphical (plot against normal)
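Both checks are applied to the residuals of the fitted model, not to the outcome variable. A minimal sketch on simulated data with normal errors (hypothetical values), using `shapiro.test()` from base R.

```r
set.seed(8)

# Simulated data with normally distributed errors
x <- rnorm(150)
y <- 1 + 0.5 * x + rnorm(150)
fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals; H0: errors are normally distributed
sw <- shapiro.test(residuals(fit))
sw$statistic   # W, close to 1 when residuals look normal
sw$p.value     # a small p-value would speak against normality

# Graphical check: residual quantiles against normal quantiles
# (uncomment to draw the plot)
# qqnorm(residuals(fit)); qqline(residuals(fit))
```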


Conditional hypotheses: through an example
- Model the relationship between religiosity (0 = not at all, 10 = very religious) and ideological self-placement on the left-right scale (0 = left, 5 = middle of the road, 10 = right)
- Start with bivariate, move to a model with additional controls
- Conditional hypotheses: model the heterogeneity of the relationship (if there is one)
- Data used: 2009 European Election Study (data from Denmark, and one last model using data from Poland)
- If you follow politics or you are a political scientist, what is the starting intuition?


Bivariate correlation

- Scatter plot - problematic: although the variables are treated as continuous, each has only 11 categories
- Recode the predictor and get a better understanding of the data (visually)

Note: unless explicitly mentioned, results are based on Danish data


Bivariate correlation

[Figure: average ideology (LR, with 95% CI) on the y-axis by religiosity category on the x-axis (Not religious, Religious, Somewhat religious). Spearman's r: 0.16 (original)]


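Modeling
- Multivariate regression estimated via OLS
- Additional variables: gender (1 = female), education (3 categories, 0 = not finished high school, 2 = completed highest education), age (mean centered in all models)

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \varepsilon_i \]

\[ y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \varepsilon_i \]

\[ \text{ideology}_i = \alpha + \beta_1 \text{religiosity}_i + \beta_2 \text{gender}_i + \beta_3 \text{education}_i + \beta_4 \text{age}_i + \varepsilon_i \]

- Where: \( i = 1, \ldots, n \) and \( \varepsilon \sim N(0, \sigma^2) \)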


Results (1)

                       Ideology (LR, 0-10)
(Intercept)            4.99***
                       (0.24)
Religiosity            0.15***
                       (0.03)
Gender (Female = 1)    -0.38*
                       (0.18)
Age (mean centered)    0.01
                       (0.01)
Education              -0.15
                       (0.12)

Adj. R2                0.03
Num. obs.              947

*** p < 0.001, ** p < 0.01, * p < 0.05


Results (2)

[Figure: Relationship between religiosity and ideology, Denmark 2009. Predicted left-right position (0 = left, 10 = right) on the y-axis against strength of religiosity (0 = none, 10 = very much) on the x-axis. Note: for an average aged man with mid-level education]


Interaction
Why was the Brambor, Clark, and Golder (2006) piece so influential?
- Conditional hypotheses are prevalent: our theories might not be general enough and we can actually build theories about heterogeneity
- Offered a general framework for how to model conditional effects
- Offered a general framework for how to correctly calculate uncertainty measures and best practices for interpretation

Overall: a solution to a prevalent problem


Specification (Brambor et al, 2006, 65)


Quantities of interest

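Marginal effect of X on Y

\[ \frac{\partial Y}{\partial X} = \beta_1 + \beta_3 Z \]

Standard error of the marginal effect

\[ se\left(\frac{\partial Y}{\partial X}\right) = \sqrt{var(\beta_1) + Z^2 \, var(\beta_3) + 2Z \, cov(\beta_1, \beta_3)} \]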
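Both quantities can be computed from a fitted model using `coef()` and `vcov()`. A sketch on simulated data, assuming the usual multiplicative specification Y = β0 + β1X + β2Z + β3XZ + ε with X = religiosity and Z = age; the population values and the helper `marginal_effect()` are hypothetical and only serve the illustration.

```r
set.seed(7)

# Simulated data with an interaction between religiosity and (centered) age
n <- 500
religiosity <- runif(n, 0, 10)
age <- rnorm(n, 0, 15)
ideology <- 5 + 0.14 * religiosity - 0.01 * age +
  0.005 * religiosity * age + rnorm(n, 0, 2)

# religiosity * age expands to both main effects plus the product term
fit <- lm(ideology ~ religiosity * age)
b <- coef(fit)
V <- vcov(fit)

# Marginal effect of religiosity at given values z of age, with its standard error
marginal_effect <- function(z) {
  est <- b[["religiosity"]] + b[["religiosity:age"]] * z
  se <- sqrt(V["religiosity", "religiosity"] +
               z^2 * V["religiosity:age", "religiosity:age"] +
               2 * z * V["religiosity", "religiosity:age"])
  data.frame(age = z, estimate = est, se = se,
             lower = est - 1.96 * se, upper = est + 1.96 * se)
}

me <- marginal_effect(c(-20, 0, 20, 40))
me
```

At age = 0 (the mean, after centering) the marginal effect and its standard error coincide with the religiosity row of the regression table; at other ages they do not, which is why the table alone can mislead.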


Important rules

1. If conditional hypothesis: multiplicative interaction term needs to be specified
2. Main effects should be modeled (X, Z)
3. Standard errors for the marginal effect should be calculated: ‘significance’ of coefficients can be misleading
4. Interpretation of main effects: not average effects
5. Visual evaluation is even more important


Additional readings

- On theory: Berry, Golder, and Milton (2012)
- For categorical outcomes:
  - Berry, DeMeritt, and Esarey (2010)
  - Rainey (2016)
- But again, check the Multiplicative Interaction Models on Matt Golder's webpage


Our model

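\[ \text{ideology}_i = \alpha + \beta_1 \text{religiosity}_i + \beta_2 \text{age}_i + \beta_3 (\text{religiosity}_i \times \text{age}_i) + \beta_4 \text{education}_i + \beta_5 \text{gender}_i + \varepsilon_i \]

- Where: \( i = 1, \ldots, n \) and \( \varepsilon \sim N(0, \sigma^2) \)
- Hence, we are interested in β1, β2, and β3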


Results (1)

                     Model 1     Model 2
Intercept            4.99***     4.92***
                     (0.24)      (0.24)
Religiosity          0.15***     0.14***
                     (0.03)      (0.03)
Age                  0.01        -0.01
                     (0.01)      (0.01)
Religiosity × Age                0.005*
                                 (0.002)
Gender               -0.38*      -0.38*
                     (0.18)      (0.18)
Education            -0.15       -0.11
                     (0.12)      (0.12)

Adj. R2              0.03        0.03
Num. obs.            947         947

*** p < 0.001, ** p < 0.01, * p < 0.05


Results (2)

Religiosity          0.14***
                     (0.03)
Age                  -0.01
                     (0.01)
Religiosity × Age    0.005*
                     (0.002)

A. The effect of religiosity is not conditional on age.
B. Religiosity has a stronger effect for older people.
C. Religiosity has a weaker effect for older people.
D. We would need more information to answer the question.


Results (3)

[Figure: Denmark. Estimated marginal effect of religiosity (y-axis) against age (centered) (x-axis)]


Results (4)

[Figure: Denmark. Estimated marginal effect of age (y-axis) against religiosity (x-axis)]