mitchell-chandler
4 basic analytical tasks in statistics:
1) Comparing scores across groups: look for differences in means
2) Cross-tabulating categoric variables: look for contingencies
3) Computing correlations among variables: look for covariances
4) Predicting scores on an outcome variable from numerical predictor variables: look for causal effects (or predicted outcomes)
-- Focus this week on the 4th task
“Correlation” (revisited)
• Correlation = strength of the linear association between 2 numeric variables
• It reflects the degree to which the association is described by a “straight-line” relationship
– The degree to which two variables covary or share common variance
– [“covariance” = a key term]
• It reflects the “commonality” (“predictability”) between the two variables
• Note: r² (r-squared) = the proportion of variance that is “shared” or common to both variables
“Regression” = closely related topic
• The relationship/difference between correlation and regression?
– Correlation = compute the degree to which values of variables cluster around a straight line
   It is a symmetric description (r_xy = r_yx)
   It is a standardized measure
– Regression = compute the equation for the “best fitting” straight line (Y = a + bX)
   It is an asymmetric description (b_xy ≠ b_yx)
   It is an unstandardized measure (usually)
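The symmetry point can be checked numerically. The sketch below is illustrative only: it uses the prior-charges/sentence data from the worked example in these notes, and the helper names (`sum_cross`, `corr`, `slope`) are made up for this illustration rather than taken from any package.

```python
from math import sqrt

# Data from the worked example: prior charges (X), sentence in months (Y)
x = [0, 3, 1, 0, 6, 5, 3, 4, 10, 8]
y = [12, 13, 15, 19, 26, 27, 29, 31, 40, 48]

def mean(v):
    return sum(v) / len(v)

def sum_cross(u, v):
    """Sum of cross-products of deviation scores: sum((u - u_bar)(v - v_bar))."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

def corr(u, v):
    """Pearson r: the same value whichever variable is listed first."""
    return sum_cross(u, v) / sqrt(sum_cross(u, u) * sum_cross(v, v))

def slope(dep, indep):
    """Slope from regressing `dep` on `indep`."""
    return sum_cross(indep, dep) / sum_cross(indep, indep)

r_xy, r_yx = corr(x, y), corr(y, x)   # symmetric
b_yx = slope(y, x)                    # regress Y on X
b_xy = slope(x, y)                    # regress X on Y: a different number

print(r_xy, r_yx, b_yx, b_xy)
```

In the two-variable case the product of the two slopes equals r², which is one way to see how the standardized and unstandardized descriptions are tied together.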
Linear Regression
So, what’s the deal with “Regression”?
• Why is “regression” called that?
a) Term introduced by Francis Galton in the late 19th century to describe prediction of genetic traits across generations, reflecting imperfect correlations between parents and children
b) It referred to the tendency of extreme values of traits to “regress toward the mean” across successive generations, reflecting Galton’s interest in the heritability of genius & other unusual traits
c) Correct word use: we “regress the dependent variable on the independent variable”: Y = a + b_yx X
What’s the deal with “Regression”? (cont.)
• Why is regression used in data analysis?
– To describe the functional pattern that links 2 variables together in a correlation, i.e., what are the optimal values of a and b for X & Y?
• Two basic uses of regression:
a) Prediction:
-- Predict values of one variable (Y) from values of another variable (X) (using a linear equation)
b) Explanation:
-- Estimate the causal influence of one variable (X) on another (Y) (based on measurable correlation)
-- Test a causal hypothesis about how Y and X are related
How is regression analysis done?
• By fitting a straight line to a set of bivariate points (values on 2 variables for the same data units)
– y = a + b_yx x (basic formula for linear relation)
– y = the dependent variable
– x = the independent variable
– a = the “intercept”
– b_yx = the “slope” of the line
• Concern is with fitting a straight line that minimizes the errors of prediction (of y from x)
– $y_i = \hat{y}_i + e_i$ (observed = predicted + error)
• 2 ways of expressing the prediction equation:
$$y_i = a + b_{yx} x_i + e_i$$
or
$$\hat{y}_i = a + b_{yx} x_i$$
Regression example (continued)
“Regression”
• How to obtain the straight line that “best fits” the data?
– Rely on a method called “least squares”, which minimizes the sum of the squared errors (deviations between the line and the data points)
– Yields best-fitting line to the points
– Yields formulas for a and b provided in the book
• How to compute regression coefficients?
• By hand calculations:
– Definitional formula (the familiar one)
– Computational formula (no deviation scores)
• By SPSS: Analyze → Regression → Linear
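To make the “least squares” idea concrete, here is a minimal sketch. The four data points are hypothetical (invented for this illustration, not taken from the text), and `sse` is an illustrative helper name. It computes the sum of squared errors for the fitted line and for a handful of nearby lines.

```python
# Hypothetical toy data, invented for this illustration
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

def sse(a, b):
    """Sum of squared errors of prediction for the line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Least-squares coefficients from the deviation-score formulas
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

best = sse(a, b)

# Every nudged line (shifted intercept and/or slope) should fit worse
nudged = [sse(a + da, b + db)
          for da in (-0.5, 0, 0.5)
          for db in (-0.2, 0, 0.2)
          if (da, db) != (0, 0)]

print(best, min(nudged))
```

Any change to a or b away from the least-squares values increases the sum of squared errors, which is what “best fitting” means here.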
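Regression Coefficient: Definitional Formula
$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$
Regression Coefficient: Computational Formula
$$b_{yx} = \frac{\sum XY - N\bar{X}\bar{Y}}{\sum X^2 - N\bar{X}^2}$$
Intercept (Constant): Computational Formula
$$a = \bar{Y} - b_{yx}\bar{X}$$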
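For readers wondering why the definitional and computational formulas for the slope agree, the following short derivation (added here as a sketch; it uses only the facts that $\sum X = N\bar{X}$ and $\sum Y = N\bar{Y}$) expands the deviation scores:

```latex
\begin{align*}
\sum (X - \bar{X})(Y - \bar{Y})
  &= \sum XY - \bar{X}\sum Y - \bar{Y}\sum X + N\bar{X}\bar{Y} \\
  &= \sum XY - N\bar{X}\bar{Y} - N\bar{X}\bar{Y} + N\bar{X}\bar{Y} \\
  &= \sum XY - N\bar{X}\bar{Y} \\[6pt]
\sum (X - \bar{X})^2
  &= \sum X^2 - 2\bar{X}\sum X + N\bar{X}^2 \\
  &= \sum X^2 - N\bar{X}^2
\end{align*}
```

Dividing the first result by the second gives the computational formula, which needs only the raw sums and the two means.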
“Regression”
• Use Example from Fox/Levin/Forde text (p. 277) (handout)
Prior Charges Sentence (mos)
0 12
3 13
1 15
0 19
6 26
5 27
3 29
4 31
10 40
8 48
# Priors (X)   Sentence (Y)     X²      Y²      XY
     0             12            0      144       0
     3             13            9      169      39
     1             15            1      225      15
     0             19            0      361       0
     6             26           36      676     156
     5             27           25      729     135
     3             29            9      841      87
     4             31           16      961     124
    10             40          100     1600     400
     8             48           64     2304     384
  Σ = 40        Σ = 260      Σ = 260  Σ = 8010  Σ = 1340
Regression Example (cont.)
$$b_{yx} = \frac{\sum XY - N\bar{X}\bar{Y}}{\sum X^2 - N\bar{X}^2} = \frac{1340 - (10)(4.0)(26.0)}{260 - (10)(4.0)(4.0)} = \frac{1340 - 1040}{260 - 160} = \frac{300}{100} = 3.0$$
$$a = \bar{Y} - b_{yx}\bar{X} = 26 - (3.0)(4.0) = 26 - 12 = 14.0$$
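The hand calculation can be double-checked with a few lines of code. This is a sketch rather than a substitute for the SPSS output: it applies both the definitional and the computational formula to the ten cases from the example, and the `predict` helper is an illustrative name introduced here.

```python
# Prior charges (X) and sentence length in months (Y) from the example
x = [0, 3, 1, 0, 6, 5, 3, 4, 10, 8]
y = [12, 13, 15, 19, 26, 27, 29, 31, 40, 48]

n = len(x)
x_bar = sum(x) / n   # mean of X
y_bar = sum(y) / n   # mean of Y

# Definitional formula: deviation scores
b_def = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))

# Computational formula: raw sums only
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
b_comp = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)

# Intercept
a = y_bar - b_comp * x_bar

def predict(priors):
    """Predicted sentence (months) for a given number of prior charges."""
    return a + b_comp * priors

print(b_def, b_comp, a, predict(5))
```

Both formulas return the same slope, and the fitted line predicts, for instance, the sentence for a defendant with 5 prior charges as a + 5b.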
Regression example (continued)
Regression (continued)
- How to interpret the results?
• Slope (b) = predicted change in Y for a 1-unit change in X
– Unstandardized b (b) = in original units/metric
– Standardized b (β) [beta] = in standard (Z) units
• Intercept (a) = predicted value of Y when X = 0
– Interpretable only when zero is a meaningful value of X
– Also called the “constant” term since it is the same for all values of X
• R (multiple r) = correlation between Y and the predictor(s) (predictability of Y from Xs)
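The link between the unstandardized and standardized slopes can be sketched as follows, again with the prior-charges data. The conversion β = b · (s_x / s_y) is the standard one for a single predictor. The code is an illustration, not output from SPSS.

```python
from statistics import mean, stdev

x = [0, 3, 1, 0, 6, 5, 3, 4, 10, 8]            # prior charges
y = [12, 13, 15, 19, 26, 27, 29, 31, 40, 48]   # sentence (months)

b = 3.0  # unstandardized slope from the worked example (months per prior charge)

# Standardized slope: rescale b by the ratio of standard deviations
beta = b * stdev(x) / stdev(y)

# Pearson r computed from z-scores, for comparison
zx = [(xi - mean(x)) / stdev(x) for xi in x]
zy = [(yi - mean(y)) / stdev(y) for yi in y]
r = sum(p * q for p, q in zip(zx, zy)) / (len(x) - 1)

print(beta, r)
```

With one predictor, β equals r: a one-standard-deviation change in X goes with a change of r standard deviations in predicted Y. With several predictors this equality no longer holds in general.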
Regression (continued)
• What are assumptions/requirements of regression?
1. Numeric variables (interval or ratio level)
2. Linear relationship between variables
3. Random sampling
4. Normal distribution of data
5. Homoscedasticity (equal conditional variances)
• What if the assumptions do not hold?
1. Don’t worry about small deviations
2. May be able to transform variables
3. May use alternative procedures
Regression (continued)
• How to test for significance of results?
– F-test for overall regression
– t-test for individual b coefficients
• What is R? (or R²?)
• Can we use more than one independent variable?
– Yes, it’s called “multiple regression”
– Regress a single dependent variable (Y) on multiple independent variables (a linear combination that best predicts Y)
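As a rough sketch of what the F-test and t-test are built from, the following code computes R², the overall F statistic, and the t statistic for the slope in the prior-charges example. It stops at the test statistics; significance would still be judged against tabled critical values or the p-values SPSS reports. Variable names are illustrative.

```python
from math import sqrt

x = [0, 3, 1, 0, 6, 5, 3, 4, 10, 8]            # prior charges
y = [12, 13, 15, 19, 26, 27, 29, 31, 40, 48]   # sentence (months)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_x = sum((xi - x_bar) ** 2 for xi in x)                       # variation in X
ss_y = sum((yi - y_bar) ** 2 for yi in y)                       # total variation in Y
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = ss_xy / ss_x
ss_reg = b * ss_xy        # variation in Y explained by the regression
ss_err = ss_y - ss_reg    # unexplained (residual) variation

r2 = ss_reg / ss_y        # proportion of variance explained

df_reg, df_err = 1, n - 2                       # one predictor; n - 2 residual df
f_stat = (ss_reg / df_reg) / (ss_err / df_err)  # overall F-test

se_b = sqrt((ss_err / df_err) / ss_x)           # standard error of the slope
t_stat = b / se_b                               # t-test for b

print(r2, f_stat, t_stat)
```

With a single predictor the two tests are equivalent: t² equals F. They separate only in multiple regression, where F tests the whole set of predictors and each t tests one coefficient.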
Multiple Regression - addenda
• Simultaneous analysis of the regression of a dependent variable on 2 or more independent variables
$$Y_i = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + e_i$$
• All coefficients are computed at once
– In this case, the b coefficients are partial regression coefficients
– They reflect the unique predictive ability of each variable (with the covariance of other independent variables “partialled out”)
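The idea of “partialled out” coefficients can be illustrated with a small sketch. Everything here is hypothetical: the data are invented so that Y = 1 + 2·X1 + 3·X2 exactly (no error term), and `solve` and `fit` are illustrative helpers that solve the least-squares normal equations in plain Python, not functions from SPSS or any library.

```python
def solve(A, rhs):
    """Solve the linear system A * coef = rhs by Gauss-Jordan elimination."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into the pivot row
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit(columns, y):
    """Least-squares coefficients [a, b1, b2, ...] from the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical predictors that are correlated with each other
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]   # built with no error term

a, b1, b2 = fit([x1, x2], y)    # all coefficients computed at once
b_simple = fit([x1], y)[1]      # slope when X2 is left out of the model

print(a, b1, b2, b_simple)
```

The multiple regression recovers the unique effect of X1 (2), while the simple regression of Y on X1 alone gives a larger slope because X1 also picks up the effect of the correlated X2.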
Multiple Regression
• What is Multiple Regression good for?
• It allows us to estimate:
– The combined effects of multiple variables
– The unique effects of individual variables
• It allows us to test causal theories about:
– The combined effects of multiple variables
– The unique effects of individual variables
• In this case, R² measures how well the entire model does in predicting Y
– The overall F-test refers to the whole set of variables
– The t-tests apply to the coefficients of each variable