
Analysis of Individual Variables

• Descriptive
  – Measures of Central Tendency

• Mean – Average score of the distribution (1st moment)
• Median – Middle score (50th percentile) of the distribution

– Measures of Variation (used to measure the spread of the distribution relative to the measures of central tendency)

• Range – Distance between the lowest and highest data points
• Mean Deviation – Average distance between the mean and the data points
• Variance – Average squared distance from the mean (2nd moment)
• Standard Deviation – Square root of the variance

Analysis of Individual Variables

Obs   Income        Obs   Income
  1    20.50         11    55.80
  2    31.50         12    25.20
  3    47.70         13    29.00
  4    26.20         14    85.50
  5    44.00         15    15.10
  6     8.28         16    28.50
  7    30.80         17    21.40
  8    17.20         18    17.70
  9    19.90         19     6.42
 10     9.96         20    84.90

Mean       31.28
Median     25.70
Variance  500.68
Stdev      22.38
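As a cross-check, the same statistics can be computed with Python's standard `statistics` module (a sketch; the slide's own numbers come from a spreadsheet):

```python
# Descriptive statistics for the 20 income observations above.
import statistics

income = [20.50, 31.50, 47.70, 26.20, 44.00, 8.28, 30.80, 17.20,
          19.90, 9.96, 55.80, 25.20, 29.00, 85.50, 15.10, 28.50,
          21.40, 17.70, 6.42, 84.90]

mean = statistics.mean(income)          # 1st moment (average)
median = statistics.median(income)      # 50th percentile
variance = statistics.variance(income)  # sample variance, divides by (n - 1)
stdev = statistics.stdev(income)        # square root of the variance

print(round(mean, 2), round(median, 2), round(variance, 2), round(stdev, 2))
```

Note that `statistics.variance` uses the sample (n − 1) denominator, which is what the slide's 500.68 reflects.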

Analysis of Relationship among Variables

• Correlation
• Regression
  – Two Variable Models
  – Multiple Variable Models
  – Discrete Dependent Variable Models

Scatter Plot of Money Supply Growth and Inflation

Correlation

• A scatter plot is a graph that shows the relationship between the observations for two data series in two dimensions

• Correlation analysis expresses this relationship numerically
  – In contrast to a scatter plot, which graphically depicts the relationship between two data series, correlation analysis expresses this same relationship using a single number

– The correlation coefficient is a measure of how closely related two data series are

– The correlation coefficient measures the linear association between two variables

Correlation

• Determines the association between 2 variables
• Measured on a scale from +1 to −1
  – Values close to +1.0 indicate a strong positive relationship
  – Values close to −1.0 indicate a strong negative relationship
  – Values close to 0 indicate little or no relationship


Variables with Perfect Positive Correlation

Variables with Perfect Negative Correlation

Variables with a Correlation of 0

Variables with a Non-Linear Association

Calculating correlations

• The sample correlation coefficient ‘r’ is,

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{(n-1)}$$

$$s_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar X)^2}{(n-1)}}, \qquad s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar Y)^2}{(n-1)}}$$

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X\, s_Y}$$

Calculating correlations

• E.g.: Is it true that higher education leads to higher compensation?
  – To answer this question, we need to look at the data and calculate the correlation

Years of Education   Compensation (000)
17.97                163.30
22.86                142.05
17.25                100.00
13.35                103.55
14.97                 90.00
15.87                 97.50
13.17                 90.00
11.10                 80.00
13.86                 90.25
 8.97                 49.50

Calculating correlations

• The sample correlation coefficient ‘r’ is,

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X\, s_Y}, \quad \text{where } \mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{(n-1)}$$

$$s_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar X)^2}{(n-1)}}, \qquad s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar Y)^2}{(n-1)}}$$

so we need to calculate $\bar X$, the average of $X$, and $\bar Y$, the average of $Y$.

Calculating correlations

Years of Ed.  Comp (000)  (X-XBar)^2  (Y-YBar)^2  (X-XBar)(Y-YBar)
17.97         163.30        9.20       3929.41       190.12
22.86         142.05       62.77       1716.86       328.29
17.25         100.00        5.35          0.38        -1.42
13.35         103.55        2.52          8.61        -4.66
14.97          90.00        0.00        112.68        -0.35
15.87          97.50        0.87          9.70        -2.91
13.17          90.00        3.12        112.68        18.76
11.10          80.00       14.72        424.98        79.10
13.86          90.25        1.16        107.43        11.16
 8.97          49.50       35.61       2612.74       305.00

Sums:                     135.32       9035.48       923.10

XBar         14.94
YBar        100.62
n - 1         9.00
Covariance  102.57
SX            3.88
SY           31.69
r             0.83
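The worked table can be reproduced programmatically. A minimal Python sketch using only the standard library (the variable names are mine, not the slide's):

```python
# Sample correlation of years of education vs. compensation,
# following the formulas r = Cov(X,Y) / (sX * sY) with (n - 1) denominators.
from math import sqrt

years_ed = [17.97, 22.86, 17.25, 13.35, 14.97, 15.87, 13.17, 11.10, 13.86, 8.97]
comp = [163.30, 142.05, 100.00, 103.55, 90.00, 97.50, 90.00, 80.00, 90.25, 49.50]

n = len(years_ed)
x_bar = sum(years_ed) / n  # XBar, ~14.94
y_bar = sum(comp) / n      # YBar, ~100.62

# Sample covariance and standard deviations (divide by n - 1).
cov = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(years_ed, comp)) / (n - 1)
s_x = sqrt(sum((x - x_bar) ** 2 for x in years_ed) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in comp) / (n - 1))

r = cov / (s_x * s_y)  # ~0.83, matching the slide
print(round(cov, 2), round(s_x, 2), round(s_y, 2), round(r, 2))
```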

Calculating correlations (EXCEL)

Years of Ed.  Comp (000)
17.97         163.30
22.86         142.05
17.25         100.00
13.35         103.55
14.97          90.00
15.87          97.50
13.17          90.00
11.10          80.00
13.86          90.25
 8.97          49.50

Correlation   =CORREL(array1, array2)
Correlation   0.83

Correlation Matrix

        US Eqt    UK   US FI  Japan  Korea  Mexico  China    HK   S'pore  India
US Eqt   1.00
UK       0.27   1.00
US FI   -0.13  -0.27   1.00
Japan    0.20  -0.15   0.08   1.00
Korea   -0.13  -0.17   0.28  -0.01   1.00
Mexico  -0.10   0.28  -0.35  -0.38  -0.01   1.00
China    0.17  -0.12   0.29   0.09   0.19   0.00   1.00
HK       0.22   0.24  -0.38  -0.23  -0.55   0.32  -0.08   1.00
S'pore   0.52   0.24   0.00   0.08  -0.02   0.30   0.35  -0.01   1.00
India    0.30   0.57   0.17  -0.12  -0.11  -0.17   0.24   0.01   0.35   1.00

Correlations Among Stock Return Series

Regression

• Most times it's not enough to just say whether 2 variables are correlated
  – We would like to define a relationship between the two variables
  – E.g., when the economy grows 1%, how much will the S&P 500 increase?

• To do this, we use the technique of regression

Regression

• How did the term "regression" come to be applied to statistical models?

• The 19th-century scientist Sir Francis Galton, studying human subjects, found in all things "regression toward mediocrity"
  – E.g., if your parents are very smart, you are likely to be significantly less smart, so it's really not your fault!

Regression

• In modern times, when we talk of Regression analysis, we make an implicit assumption of a ‘mean’ relationship between variables and we try to determine that relationship.

• Regression analysis is concerned with:
  – the study of the dependence of one variable (the dependent variable)
  – on one or more other variables (the explanatory variables)
  – with a view to estimating and/or predicting the mean or average value of the former
  – in terms of the fixed values of the latter.

Two Variable Regression Model

• Regression analysis is concerned with the relationship of 2 variables, say 'y' and 'x', which can be written as:

  – All this means is that the value of 'y' is a function of the value of 'x'
  – Another way of saying it is that 'y' doesn't independently get its value, but somehow depends on 'x' to get its value
  – Thus 'y' can somehow be derived from 'x'
  – Thus 'y' is a dependent variable and 'x' is an independent variable

• Regression is thus, the study of a relationship between the dependent and independent variables

$y_i = f(x_i)$

Regression

  x     y
 1.0   2.0
 1.5   3.0
 2.0   4.0
 2.5   5.0
 3.0   6.0
 3.5   7.0
 4.0   8.0
 4.5   9.0
 5.0  10.0
 5.5  11.0
 6.0  12.0
 6.5  13.0
 7.0  14.0
 7.5  15.0
25.0     ?

Q: What is y when x = 25?
A: $y = f(x) = \beta x$, where $\beta = 2$; so if $x = 25$, $y = 2 \times 25 = 50$

[Figure: plot of y against x, a straight line through the origin with slope 2]

Regression

Q: What are y1, y2, y3 when x = 10?
A: $y_1 = 1 \cdot x = 10, \quad y_2 = 2 \cdot x = 20, \quad y_3 = 3 \cdot x = 30$

  x    y1    y2    y3
 0.0   0.0   0.0   0.0
 0.5   0.5   1.0   1.5
 1.0   1.0   2.0   3.0
 1.5   1.5   3.0   4.5
 2.0   2.0   4.0   6.0
 2.5   2.5   5.0   7.5
 3.0   3.0   6.0   9.0
 3.5   3.5   7.0  10.5
 4.0   4.0   8.0  12.0
 4.5   4.5   9.0  13.5
 5.0   5.0  10.0  15.0
 5.5   5.5  11.0  16.5
 6.0   6.0  12.0  18.0
 6.5   6.5  13.0  19.5
 7.0   7.0  14.0  21.0
 7.5   7.5  15.0  22.5

[Figure: the three lines y1 = x, y2 = 2x, y3 = 3x plotted against x]

Regression

Q: What is y when x = 10?
A: $y = f(x) = \alpha + \beta x$, where $\alpha = 1,\ \beta = 2$; so if $x = 10$, $y = 1 + 2(10) = 21$

  x    y1
 0.0   1.0
 0.5   2.0
 1.0   3.0
 1.5   4.0
 2.0   5.0
 2.5   6.0
 3.0   7.0
 3.5   8.0
 4.0   9.0
 4.5  10.0
 5.0  11.0
 5.5  12.0
 6.0  13.0
 6.5  14.0
 7.0  15.0
 7.5  16.0
10.0     ?

[Figure: the line y = 1 + 2x plotted against x]
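The rule on this slide can be written as a one-line function. An illustrative Python sketch (the function name `f` simply mirrors the slide's notation):

```python
# The slide's rule y = f(x) = alpha + beta*x, with alpha = 1 and beta = 2.
def f(x, alpha=1.0, beta=2.0):
    """Linear rule: intercept alpha plus slope beta times x."""
    return alpha + beta * x

# Reproduce the table's first and last rows, and answer the question:
print(f(0.0))   # 1.0
print(f(7.5))   # 16.0
print(f(10.0))  # 21.0 -- y when x = 10
```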

Two Variable Regression Model

• Regression analysis is concerned with:
  – the study of a relationship between the dependent and independent variables

$y_i = f(x_i)$

  – In reality, we are estimating a relationship, so we can calculate the value of a random variable

$y_i = \alpha + \beta x_i$

Two Variable Regression Model

• Real data from which we estimate the relationship is never very good, because we deal with random variables
  – What we end up having is something like this:

$y_i = \alpha + \beta x_i + \text{error}$

  – What we try to do in regression is estimate the "Line of Best Fit", so that we can come up with this equation:

$y_i = \alpha + \beta x_i$

  – This is also the equation of a line, so this form of regression is called "linear regression"

Two Variable Regression Model

[Figure: scatter plot with fitted line, y = 0.841 + 0.3909x, R² = 0.7247]

Two Variable Regression Model

• Regression Model – Equation of a Line

$y_i = \alpha + \beta x_i + \varepsilon_i$

• Terminology – 'y'
  – Dependent Variable, or
  – Left-Hand Side Variable, or
  – Explained Variable

• Terminology – 'x'
  – Independent Variable, or
  – Right-Hand Side Variable, or
  – Explanatory Variable, or
  – Regressor, Covariate, Control Variable

• Terminology – 'ε'
  – Error
  – Disturbance


Two Variable Regression Model

$y_i = \alpha + \beta x_i + \varepsilon_i$

• Terminology –
  – 'α' – Intercept
  – 'β' – Slope
  – 'ε' – Error

Assumptions of the Linear Regression Model

• The relationship between the dependent variable, Y, and the independent variable, X, is linear
• The independent variable, X, is not random
• About the error term:
  – The expected value (remember: average) of the error term is 0
  – The error term is normally distributed
  – The variance of the error term is the same for all observations
  – The error term is uncorrelated across observations

Regression Relationship estimation

• The model is estimated by the “Least Squares Estimation” method

Two Variable Regression Model

$y_i = \alpha + \beta x_i$

$$\beta = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}, \qquad \alpha = \bar Y - \beta \bar X$$
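A minimal sketch of least-squares estimation using the formulas β = Cov(X, Y)/Var(X) and α = Ȳ − βX̄, run on made-up data that roughly follows y = 1 + 2x (the data and the helper name `ols` are my own illustration, not from the slides):

```python
# Two-variable least-squares estimates via sample covariance and variance.
def ols(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance of x and y, and sample variance of x (n - 1 denominators).
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    beta = cov / var           # slope
    alpha = y_bar - beta * x_bar  # intercept
    return alpha, beta

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]  # roughly 1 + 2x with small errors
alpha, beta = ols(xs, ys)
print(round(alpha, 2), round(beta, 2))  # 1.09 1.97, close to the true 1 and 2
```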

• Inferences from Regression can be made about:
  – Model – how well does the specified model perform, i.e., are the specified independent variables, taken together, a good predictor of the dependent variable? (R²)
  – Independent Variables – the contribution of each independent variable in predicting the dependent variable (hypothesis test)

Inferences from Regression

$y_i = \alpha + \beta_1 x_{1i} + \varepsilon_i$

Model power

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\text{Total variation} - \text{Unexplained variation}}{\text{Total variation}} = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}}$$

Inference about Model

• Coeff. of Determination (R2)

• So, the higher the R², the better the model (Yes? That would be too easy!)

[Figure: at a point $(x_1, y_1)$, the deviation $y_1 - \bar y$ is split into an explained part $y_p - \bar y$ (fitted value minus mean) and an unexplained part $y_1 - y_p$]

$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$$

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$$
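The decomposition SST = SSR + SSE, and R² = 1 − SSE/SST, can be verified numerically. A small Python sketch on toy data (the observed and fitted values are my own illustration, from a line fitted as y = 1.09 + 1.97x):

```python
# Sum-of-squares decomposition and R^2 for a fitted two-variable regression.
y = [3.1, 4.9, 7.2, 8.8, 11.0]           # observed values
y_hat = [3.06, 5.03, 7.00, 8.97, 10.94]  # fitted values from alpha=1.09, beta=1.97
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation

r2 = 1 - sse / sst
print(round(r2, 4))
```

Because the fitted values come from the least-squares line (with an intercept), SSR + SSE reproduces SST.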

Inference about Model

• If the model is correctly specified, R2 is an ideal measure

• Adding a variable to a regression can only increase R² (by construction)

• This fact can be exploited to get regressions with R² ≈ 100% by adding variables, but this doesn't mean that the model is any good

• The adjusted R² (Adj-R²), which penalizes additional variables, should be reported

Inference about Parameters

• Coefficients are estimated with a confidence interval
• To know if a specific independent variable (xᵢ) is influential in predicting the dependent variable (y), we test whether the corresponding coefficient is statistically different from 0 (i.e., βᵢ = 0)

• We do so by calculating the t-statistic for the coefficient

• If the t-stat is sufficiently large, it indicates that bᵢ is significantly different from 0, indicating that βᵢ·xᵢ plays a role in determining y

Inference about parameters

• We can test to see if the slope coefficient is significant by using a t-test.

$$t = \frac{\hat b_1 - 0}{s_{\hat b_1}}$$
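As a sketch, the t-statistic is just the coefficient estimate divided by its standard error; the numbers below are taken from the Excel SUMMARY OUTPUT at the end of this deck:

```python
# t-statistic for the slope coefficient, testing H0: beta1 = 0.
b1 = 0.939398568     # estimated slope (Excel "X Variable 1" coefficient)
se_b1 = 0.278341127  # standard error of the slope

t_stat = (b1 - 0) / se_b1
print(round(t_stat, 2))  # 3.37, matching Excel's t Stat of ~3.375
```

A t-stat this large (well beyond ~2 in absolute value) indicates the slope is significantly different from 0.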

In Excel


SUMMARY OUTPUT

Regression Statistics
Multiple R          0.405156042
R Square            0.164151419
Adjusted R Square   0.149740236
Standard Error      0.05350165
Observations        60

ANOVA
              df   SS           MS           F            Significance F
Regression     1   0.032604637  0.032604637  11.39055864  0.001321732
Residual      58   0.166020739  0.002862427
Total         59   0.198625377

              Coefficients   Standard Error  t Stat       P-value      Lower 95%     Upper 95%
Intercept     -9.72076E-05   0.007438982     -0.01306732  0.98961893   -0.014987948  0.014793533
X Variable 1   0.939398568   0.278341127      3.37499017  0.001321732   0.382238272  1.496558865
