Chapter 16: Data Analysis: Testing for Associations
Relationships
Direction: Positive or Negative
Strength: Weak, Moderate, or Strong
16-2
Relationship Types
Curvilinear relationship between two variables – the strength and/or direction of the relationship changes over the range of both variables.
Linear relationship between two variables – the strength and nature of the relationship remains the same over the range of both variables.
16-3
Relationships between Variables
Three Questions
1. Is there a relationship between the two variables we are interested in?
2. How strong is the relationship?
3. How can that relationship best be described?
16-4
No Relationship between X and Y
16-5
Positive Relationship between X and Y
16-6
Negative Relationship between X and Y
16-7
Curvilinear Relationship between X and Y
16-8
Pearson Correlation Coefficient
. . . statistical measure of the strength of a linear
relationship between two metric (interval or ratio level)
variables.
16-9
It varies between –1.00 and +1.00, with 0 representing absolutely no association between two variables, and –1.00 and +1.00 representing perfect association between two variables.
The higher the absolute value of the correlation coefficient, the stronger the level of association.
The size of the correlation coefficient can be used to quantitatively describe the strength of the association between two variables.
Pearson Correlation Coefficient
16-10
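As a quick illustration of the coefficient described above, here is a minimal Python sketch; the data are hypothetical, invented purely for demonstration:

```python
import numpy as np

# Hypothetical metric (ratio-level) data: advertising spend and sales
ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
sales = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

# Pearson r: the covariance of the two variables divided by the
# product of their standard deviations; np.corrcoef returns the
# full correlation matrix, so take the off-diagonal entry
r = np.corrcoef(ad_spend, sales)[0, 1]
# r is close to +1.00 here, indicating a strong positive linear association
```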
Pearson Correlation Coefficient
The null hypothesis states that there is no association between the two variables in the population and that the correlation coefficient is zero.
If the correlation coefficient is statistically significant, the null hypothesis is rejected and the conclusion is that the two variables share some association in the population.
16-11
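The significance test can be sketched with `scipy.stats.pearsonr`, which returns both the coefficient and the p-value for the null hypothesis of zero correlation; the sample data below are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: satisfaction ratings and repeat-purchase counts
satisfaction = np.array([3, 4, 2, 5, 4, 3, 5, 1, 4, 5])
repeats = np.array([2, 4, 1, 5, 3, 3, 4, 1, 4, 4])

# pearsonr returns the coefficient and the p-value for the null
# hypothesis that the population correlation is zero
r, p_value = stats.pearsonr(satisfaction, repeats)

# Reject the null at the conventional .05 level when p_value < .05
significant = p_value < 0.05
```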
Spearman Rank Order Correlation
. . . a statistical measure of the linear association between two variables where both have been
measured using ordinal (rank order) scales.
16-12
Spearman Rank Order Correlation
If either one of the variables is represented by rank order (ordinal) data, use the Spearman rank order correlation coefficient.
The Spearman rank order correlation coefficient tends to produce a lower coefficient and is considered a more conservative measure. We should choose the Pearson correlation when we can.
16-13
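A minimal sketch of the Spearman coefficient, using hypothetical rank-order data (two judges ranking the same six brands):

```python
from scipy import stats

# Hypothetical ordinal data: two judges rank the same six brands
judge_a = [1, 2, 3, 4, 5, 6]
judge_b = [2, 1, 3, 5, 4, 6]

# spearmanr computes the Pearson correlation on the ranks themselves
rho, p_value = stats.spearmanr(judge_a, judge_b)
# rho is about .886: the two sets of rankings agree closely
```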
Bivariate Regression Analysis
. . . a statistical technique that analyzes the linear relationship between two variables by estimating coefficients for an equation for a straight line. One variable is designated as the dependent variable and the other is called an independent or predictor variable.
16-14
Regression Assumptions
Relationship is linear.
Variables of interest are measured on interval or ratio scales (except in the case of dummy variables).
Variables come from a bivariate normal population (distribution).
The error terms associated with making predictions are normally and independently distributed.
16-15
What is Regression Analysis?
Regression – formula for a straight line:
y = a + bX + ei
where
y = the dependent variable
a = the intercept (the point where the straight line intersects the y-axis when X = 0)
b = the slope (the change in y for every 1-unit change in X)
X = the independent variable used to predict y
ei = the error for the prediction
16-16
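Estimating a and b for the straight line can be sketched in a few lines of Python; the sales data here are hypothetical, made up for illustration:

```python
import numpy as np

# Hypothetical data: predict sales (y) from number of salespeople (X)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares estimates of the slope (b) and intercept (a) in y = a + bX
b, a = np.polyfit(X, y, 1)

predictions = a + b * X
errors = y - predictions  # the e_i terms for each observation
```

With an intercept in the model, the least-squares errors always sum to zero, which is a useful sanity check on the fit.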
Regression Analysis Terms
Adjusted r-square – an adjustment that reduces r2 by taking into account the sample size and the number of independent variables in the regression equation. It tells you when a multiple regression equation has too many independent variables.
Explained variance – the amount of variation in the dependent variable that can be accounted for by the combination of independent variables (represented by r2 in a bivariate regression or adjusted r2 in a multivariate regression).
Unexplained variance – the amount of variation in the dependent variable that cannot be accounted for by the combination of independent variables.
Regression coefficient – an indicator of the importance of an independent variable in predicting a dependent variable. Large coefficients are good predictors and small coefficients are weak predictors.
16-17
Statistical Significance in Regression
Significant model? – answers the first question about the relationship: "Is there a relationship between the dependent and independent variable?"
How strong is the relationship? – the size of the coefficient of determination (r2) tells what percentage of the total variation in the dependent variable is explained.
The r2 measure varies between .00 and 1.00 – the size of r2 indicates the strength of the relationship: the closer to 1.00, the stronger the relationship.
16-18
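Both questions can be answered together with `scipy.stats.linregress`, which reports the fit and the slope's p-value in one call; the sample below is hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical bivariate sample
X = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([5.0, 9.2, 12.8, 17.1, 20.9, 25.2])

# linregress reports the slope, intercept, correlation, and the
# p-value for the null hypothesis of zero slope
result = stats.linregress(X, y)

r_squared = result.rvalue ** 2        # share of variation in y explained
model_significant = result.pvalue < 0.05
```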
Multiple Regression Analysis
. . . a statistical technique that analyzes the linear relationship between a dependent variable and MULTIPLE independent
variables by estimating coefficients for the equation for a
straight line.
16-19
Multiple Regression Concern
If the independent variables are measured using different scales (1–5 for one, 1–10 for another), the different scales do not permit relative comparisons between regression coefficients to see which independent variable has the most influence on the dependent variable.
16-20
Standardized regression coefficients (beta coefficients) correct this problem.
A beta coefficient is an estimated regression coefficient that has been recalculated (standardized) to have a mean of 0 and a standard deviation of 1.
Standardization removes the effects of different scales and enables independent variables with different units of measurement to be directly compared for their predictive ability.
16-21
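The standardization step can be sketched directly in numpy: rescale every variable to mean 0 and standard deviation 1, then regress. The predictors and their scales below are hypothetical:

```python
import numpy as np

# Hypothetical predictors on different scales, predicting intention (y)
x1 = np.array([1.0, 2, 3, 4, 5, 3, 2, 4])    # measured on a 1-5 scale
x2 = np.array([2.0, 6, 4, 9, 10, 5, 3, 8])   # measured on a 1-10 scale
y = np.array([10.0, 22, 18, 34, 40, 22, 15, 31])

def standardize(v):
    # rescale a variable to mean 0 and standard deviation 1
    return (v - v.mean()) / v.std()

# Regressing standardized y on the standardized predictors yields the
# beta coefficients; no intercept is needed because every variable
# now has mean zero
Z = np.column_stack([standardize(x1), standardize(x2)])
betas, *_ = np.linalg.lstsq(Z, standardize(y), rcond=None)
# betas can now be compared directly despite the different original scales
```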
When Evaluating Regression Analysis Results
Assess the statistical significance of the overall regression model using the F statistic and its associated probability.
Examine the r2 to see how large it is.
FOR MULTIVARIATE: Evaluate the individual regression coefficients and their t-test statistics to see which are statistically significant.
FOR MULTIVARIATE: Look at the variables' beta coefficients to assess relative influence (standardized if the variable scales are different).
16-22
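The first two checks above can be computed by hand for a multiple regression; this sketch uses hypothetical data and builds the F statistic from r2 and the degrees of freedom:

```python
import numpy as np
from scipy import stats

# Hypothetical multiple regression: two predictors, eight observations
x1 = np.array([1.0, 3, 2, 5, 4, 6, 7, 8])
x2 = np.array([2.0, 1, 4, 3, 6, 5, 8, 7])
y = np.array([4.1, 6.0, 8.2, 11.1, 13.8, 15.2, 20.1, 21.0])

n, k = len(y), 2
A = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

ss_total = ((y - y.mean()) ** 2).sum()         # total variation in y
ss_error = (residuals ** 2).sum()              # unexplained variation
r_squared = 1 - ss_error / ss_total

# Overall F statistic for the model and its p-value
F = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
p_model = stats.f.sf(F, k, n - k - 1)
```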
Used when the independent variables you want to use to predict a dependent variable are not measured using interval or ratio scales.
Dummy variables – artificial variables introduced into a regression equation to represent the categories of a nominally scaled variable.
A nominal variable with k categories requires k – 1 dummy variables (the omitted category serves as the reference), and the values will typically be 0 or 1.
16-23
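Dummy coding can be sketched in a few lines; the three-category region variable here is hypothetical, with "north" left out as the reference category:

```python
import numpy as np

# Hypothetical nominal variable: sales region with three categories
region = np.array(["north", "south", "west", "north", "west", "south"])

# Dummy-code the k = 3 categories as k - 1 = 2 indicator (0/1) variables;
# "north" is the omitted reference category
d_south = (region == "south").astype(float)
d_west = (region == "west").astype(float)

# These columns can now enter a regression equation as predictors
X = np.column_stack([d_south, d_west])
```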
Multicollinearity – independent variables are highly correlated with each other.
Can result in difficulty in estimating independent regression coefficients for the correlated variables.
• It inflates the standard error of the coefficient and lowers the t statistic associated with it (makes the variables in question less likely to be significant).
• Impacts the individual regression coefficients (the independent variables).
• Does not impact the size of the r2 or the ability to predict values of the dependent variable.
16-24
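One common way to detect multicollinearity (not mentioned on the slide, but standard practice) is the variance inflation factor: regress each predictor on the others and compute 1/(1 - r2). This sketch uses hypothetical data in which one predictor is nearly a copy of another:

```python
import numpy as np

# Hypothetical predictors where x2 is almost an exact copy of x1
x1 = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
x2 = x1 * 2 + np.array([0.1, -0.1, 0.2, 0.0, -0.2, 0.1, 0.0, -0.1])
x3 = np.array([5.0, 1, 4, 2, 8, 3, 7, 6])

def vif(target, others):
    # Variance inflation factor: regress one predictor on the rest;
    # VIF = 1 / (1 - r^2). Values above roughly 10 are a common
    # warning sign of multicollinearity.
    A = np.column_stack([np.ones(len(target))] + others)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residuals = target - A @ coef
    r2 = 1 - residuals.var() / target.var()
    return 1.0 / (1.0 - r2)

vif_x1 = vif(x1, [x2, x3])   # very large: x1 and x2 are nearly collinear
vif_x3 = vif(x3, [x1, x2])   # small: x3 is not collinear with the others
```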