
Regression and Correlation

GTECH 201, Lecture 18

ANOVA

Analysis of Variance

Continuation from matched-pair difference-of-means tests, but now for 3+ cases
We still check whether samples come from one or more distinct populations
Variance is a descriptive parameter
ANOVA compares group means and checks whether they differ sufficiently to reject H0

ANOVA H0 and HA

H0: all group (population) means are equal
HA: at least one group mean differs from the others

ANOVA Test Statistic

$$F = \frac{MS_B}{MS_W}$$

MS_B = between-group mean squares
MS_W = within-group mean squares

Between-group variability is calculated in three steps:
1. Calculate the overall mean as a weighted average of the sample means
2. Calculate the between-group sum of squares
3. Calculate the between-group mean squares (MS_B)

Between-group Variability

1. Total or overall mean:

$$\bar{X}_T = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N}$$

2. Between-group sum of squares:

$$SS_B = \sum_{i=1}^{k} n_i \left(\bar{X}_i - \bar{X}_T\right)^2 = \sum_{i=1}^{k} n_i \bar{X}_i^2 - N\bar{X}_T^2$$

3. Between-group mean squares:

$$MS_B = \frac{SS_B}{df_B} = \frac{SS_B}{k - 1}$$

Within-group Variability

1. Within-group sum of squares:

$$SS_W = \sum_{i=1}^{k} (n_i - 1)\, s_i^2$$

2. Within-group mean squares:

$$MS_W = \frac{SS_W}{df_W} = \frac{SS_W}{N - k}$$
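As a quick illustration (not part of the original slides), here is a minimal Python/NumPy sketch that applies the between-group and within-group formulas to group summary statistics; the sizes, means, and standard deviations used are those of the house-price example later in this lecture:

```python
# Minimal sketch: F ratio from group summary statistics (n_i, mean_i, s_i),
# following the between-group and within-group formulas above.
import numpy as np

n = np.array([6, 7, 5, 4])                           # group sizes
means = np.array([158.00, 183.29, 144.60, 189.25])   # group means
s = np.array([17.83, 17.61, 22.49, 15.48])           # group standard deviations

N, k = n.sum(), n.size
grand_mean = (n * means).sum() / N                   # weighted overall mean

SS_B = (n * means**2).sum() - N * grand_mean**2      # between-group sum of squares
MS_B = SS_B / (k - 1)                                # between-group mean squares

SS_W = ((n - 1) * s**2).sum()                        # within-group sum of squares
MS_W = SS_W / (N - k)                                # within-group mean squares

F = MS_B / MS_W
print(grand_mean, MS_B, MS_W, F)
```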

Kruskal-Wallis Test

Nonparametric equivalent of ANOVA
Extension of the Wilcoxon rank sum (W) test to 3+ cases
The average rank in sample i is R_i / n_i

The Kruskal-Wallis H test statistic is

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$$

with N = n_1 + n_2 + … + n_k = total number of observations, and
R_i = sum of ranks in sample i
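A minimal sketch of the H statistic, assuming three small hypothetical samples; scipy.stats.kruskal (not part of the lecture) is used only as a cross-check:

```python
# Minimal sketch: Kruskal-Wallis H from pooled ranks, following the formula above.
import numpy as np
from scipy import stats

groups = [np.array([12.0, 15.0, 14.0]),            # hypothetical sample 1
          np.array([18.0, 21.0, 17.0, 20.0]),      # hypothetical sample 2
          np.array([10.0, 11.0, 13.0])]            # hypothetical sample 3

pooled = np.concatenate(groups)
ranks = stats.rankdata(pooled)                     # ranks over all N observations
N = pooled.size

# H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
H, start = 0.0, 0
for g in groups:
    R_i = ranks[start:start + g.size].sum()        # sum of ranks in sample i
    H += R_i**2 / g.size
    start += g.size
H = 12.0 / (N * (N + 1)) * H - 3 * (N + 1)

H_scipy, p = stats.kruskal(*groups)                # same H (plus tie correction) and a p-value
print(H, H_scipy, p)
```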

ANOVA Example

House prices by neighborhood, in $1,000s:

  A    B    C    D
175  151  127  174
147  183  142  182
138  174  124  210
156  181  150  191
184  193  180
148  205
     196

ANOVA Example, continued

Sample statistics

         n      X̄        s
A        6   158.00   17.83
B        7   183.29   17.61
C        5   144.60   22.49
D        4   189.25   15.48
Total   22   168.68   24.85

Now fill in the six steps of the ANOVA calculation

The Six Steps

1. Overall mean:

$$\bar{X}_T = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N} = \frac{6(158.00) + 7(183.29) + 5(144.60) + 4(189.25)}{22} = 168.68$$

2. Between-group sum of squares:

$$SS_B = \sum_{i=1}^{k} n_i \bar{X}_i^2 - N\bar{X}_T^2 = 6(158.00)^2 + 7(183.29)^2 + 5(144.60)^2 + 4(189.25)^2 - 22(168.68)^2 = 6769.394$$

3. Between-group mean squares:

$$MS_B = \frac{SS_B}{df_B} = \frac{SS_B}{k - 1} = \frac{6769.394}{3} = 2256.465$$

4. Within-group sum of squares:

$$SS_W = \sum_{i=1}^{k} (n_i - 1)\, s_i^2 = 5(17.83)^2 + 6(17.61)^2 + 4(22.49)^2 + 3(15.48)^2 = 6193.379$$

5. Within-group mean squares:

$$MS_W = \frac{SS_W}{df_W} = \frac{SS_W}{N - k} = \frac{6193.379}{22 - 4} = 344.077$$

6. Test statistic:

$$F = \frac{MS_B}{MS_W} = \frac{2256.465}{344.077} = 6.558, \qquad p = .003$$
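As a cross-check of the six steps (not part of the original slides), scipy.stats.f_oneway run on the raw house prices from the example table should reproduce F ≈ 6.56 and p ≈ .003:

```python
# Cross-check of the worked example with scipy's one-way ANOVA.
from scipy import stats

A = [175, 147, 138, 156, 184, 148]
B = [151, 183, 174, 181, 193, 205, 196]
C = [127, 142, 124, 150, 180]
D = [174, 182, 210, 191]

F, p = stats.f_oneway(A, B, C, D)   # expected: F ≈ 6.56, p ≈ .003
print(F, p)
```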

Correlation

Co-relatedness between 2+ variables
As the values of one variable go up, those of the other change proportionally

Two-step approach:
1. Graphically: scatterplot
2. Numerically: correlation coefficients

Is There a Correlation?

Scatterplots: exploratory analysis
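A minimal sketch of the graphical step, assuming two small hypothetical variables and matplotlib (neither is part of the lecture):

```python
# Minimal sketch: inspect the relationship with a scatterplot before
# computing any correlation coefficient.
import numpy as np
import matplotlib.pyplot as plt

X = np.array([2.0, 4.0, 5.0, 7.0, 9.0])   # hypothetical variable X
Y = np.array([1.0, 3.0, 4.0, 8.0, 9.0])   # hypothetical variable Y

plt.scatter(X, Y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Is there a correlation?")
plt.show()
```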

Pearson’s Correlation Index

Based on the concept of covariance:

$CV_{XY} = \sum (X - \bar{X})(Y - \bar{Y})$ = covariation between X and Y

$(X - \bar{X})$ = deviation of X from its mean

$(Y - \bar{Y})$ = deviation of Y from its mean

Pearson's correlation coefficient:

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y}) \,/\, N}{S_X S_Y}$$

Sample and Population

r is the sample correlation coefficient

Applying the t distribution, we can infer the correlation for the whole population

Test statistic for Pearson’s r

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
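A minimal sketch, assuming two small hypothetical variables, that computes Pearson's r by the deviation formula and the t statistic above; scipy.stats.pearsonr (not part of the lecture) is used only as a cross-check:

```python
# Minimal sketch: Pearson's r from deviations and the t statistic for testing it.
import numpy as np
from scipy import stats

X = np.array([2.0, 4.0, 5.0, 7.0, 9.0])    # hypothetical variable X
Y = np.array([1.0, 3.0, 4.0, 8.0, 9.0])    # hypothetical variable Y
n = X.size

# r = [ sum((X - Xbar)(Y - Ybar)) / N ] / (S_X * S_Y), with population-style S
cov = ((X - X.mean()) * (Y - Y.mean())).sum() / n
r = cov / (X.std(ddof=0) * Y.std(ddof=0))

# t = r * sqrt(n - 2) / sqrt(1 - r^2), compared against t with n - 2 df
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

r_scipy, p = stats.pearsonr(X, Y)           # same r, plus a two-tailed p-value
print(r, t, r_scipy, p)
```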

Correlation Example: Lake-effect snow

Spearman’s Rank Correlation

Non-parametric alternative to Pearson

Logic similar to the Kruskal-Wallis and Wilcoxon tests

Spearman’s rank correlation coefficient:

$$r_s = 1 - \frac{6 \sum d^2}{N^3 - N}$$
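A minimal sketch of r_s on hypothetical, untied data, cross-checked against scipy.stats.spearmanr (not part of the lecture):

```python
# Minimal sketch: Spearman's r_s from rank differences, following the formula above.
import numpy as np
from scipy import stats

X = np.array([3.0, 1.0, 4.0, 2.0, 5.0])    # hypothetical variable X (no ties)
Y = np.array([2.0, 1.0, 5.0, 3.0, 4.0])    # hypothetical variable Y (no ties)
N = X.size

d = stats.rankdata(X) - stats.rankdata(Y)  # differences between paired ranks
r_s = 1 - 6 * (d**2).sum() / (N**3 - N)    # r_s = 1 - 6*sum(d^2) / (N^3 - N)

r_scipy, p = stats.spearmanr(X, Y)         # should match r_s when there are no ties
print(r_s, r_scipy, p)
```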

Regression

In correlation we observe degrees of association but no causal or functional relationship

In regression analysis, we distinguish an independent from a dependent variable

Many forms of functional relationships:
bivariate / multivariate
linear / non-linear (curvilinear)

Graphical Representation

In correlation analysis either variable could be depicted on either axis

In regression analysis, the independent variable is always on the X axis

Bivariate relationship is described by a best-fitting line through the scatterplot

Least-Square Regression

Objective: minimize $\sum d_i^2$

$Y = a + bX$

Regression Equation

$$Y = a + bX$$

$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$

$$a = \frac{\sum Y - b\sum X}{n}$$
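A minimal sketch, assuming a small hypothetical data set, that computes b and a from the sums above and checks them against scipy.stats.linregress (not part of the lecture):

```python
# Minimal sketch: least-squares slope b and intercept a from the formulas above.
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical independent variable
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])    # hypothetical dependent variable
n = X.size

b = (n * (X * Y).sum() - X.sum() * Y.sum()) / (n * (X**2).sum() - X.sum()**2)
a = (Y.sum() - b * X.sum()) / n

res = stats.linregress(X, Y)               # slope and intercept should match b and a
print(a, b, res.intercept, res.slope)
```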

Strength of Relationship

How much is explained by the regression equation?

Coefficient of Determination

Total variation of Y (all the water in the bucket)

Large ‘Y’ = the dependent variable
Small ‘y’ = deviation of each value of Y from its mean

e = explained; u = unexplained

$$\sum y^2 = \sum \left(Y - \bar{Y}\right)^2$$

$$\sum y^2 = \sum y_e^2 + \sum y_u^2$$

Explained Variation

Ratio of the square of the covariation between X and Y to the variation in X:

$$\sum y_e^2 = \frac{\left(\sum xy\right)^2}{\sum x^2}$$

where Σxy = covariation between X and Y and Σx² = total variation of X

Coefficient of determination:

$$r^2 = \frac{\sum y_e^2}{\sum y^2}$$

Error Analysis

r² tells us what percentage of the variation is accounted for by the independent variable.

This then allows us to infer the standard error of our estimate, which tells us, on average, how far off our prediction would be in measurement units:

$$SE = \sqrt{\frac{\sum y_u^2}{n - 2}}$$
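A minimal sketch, assuming the same kind of small hypothetical data, that computes r² and the standard error of the estimate from the deviation quantities defined above:

```python
# Minimal sketch: coefficient of determination and standard error of the estimate.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical independent variable
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])    # hypothetical dependent variable
n = X.size

x = X - X.mean()                           # deviations of X from its mean
y = Y - Y.mean()                           # deviations of Y from its mean

sum_y2  = (y**2).sum()                     # total variation of Y
sum_ye2 = (x * y).sum()**2 / (x**2).sum()  # explained variation: (sum(xy))^2 / sum(x^2)
sum_yu2 = sum_y2 - sum_ye2                 # unexplained (residual) variation

r2 = sum_ye2 / sum_y2                      # coefficient of determination
SE = np.sqrt(sum_yu2 / (n - 2))            # standard error of the estimate
print(r2, SE)
```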
