81
Cal State Northridge 427 Ainsworth Correlation and Regression

Cal State Northridge 427 Ainsworth

  • Upload
    elkan

  • View
    54

  • Download
    0

Embed Size (px)

DESCRIPTION

Correlation and Regression. Cal State Northridge 427 Ainsworth. Major Points - Correlation. Questions answered by correlation Scatterplots An example The correlation coefficient Other kinds of correlations Factors affecting correlations Testing for significance. The Question. - PowerPoint PPT Presentation

Citation preview

  • Cal State Northridge427AinsworthCorrelation and Regression

    Chapter 9 Correlation

  • Major Points - CorrelationQuestions answered by correlationScatterplotsAn exampleThe correlation coefficientOther kinds of correlations Factors affecting correlationsTesting for significance

  • The QuestionAre two variables related?Does one increase as the other increases?e. g. skills and incomeDoes one decrease as the other increases?e. g. health problems and nutritionHow can we get a numerical measure of the degree of relationship?

  • ScatterplotsAKA scatter diagram or scattergram.Graphically depicts the relationship between two variables in two dimensional space.

  • Direct Relationship

    Chart1

    12

    15

    10

    14

    13

    6

    18

    15

    15

    10

    12

    12

    13

    9

    16

    15

    13

    11

    7

    14

    11

    6

    8

    13

    12

    12

    13

    10

    13

    12

    Average Hours of Video Games Per Week

    Average Number of Alcoholic Drinks Per Week

    Scatterplot:Video Games and Alcohol Consumption

    correlations

    1367

    486

    1264

    1159

    760

    876

    1061

    495

    780

    377

    1568

    1475

    591

    1439

    790

    1066

    1071

    086

    1579

    1647

    1749

    1479

    560

    1248

    1266

    1365

    1155

    1455

    477

    1544

    -0.6415181796

    1912

    1815

    1310

    1514

    1213

    96

    1818

    1915

    2015

    1310

    1612

    2012

    2213

    129

    2316

    1515

    1313

    1611

    127

    1914

    1611

    76

    38

    1713

    1112

    1712

    2013

    1010

    1813

    1512

    0.7382131408

    correlations

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    Average Hours of Video Games Per Week

    Average Number of Alcoholic Drinks Per Week

    Correlation Between Video Games and Alcohol Consumption

  • Inverse Relationship

    Chart2

    67

    86

    64

    59

    60

    76

    61

    95

    80

    77

    68

    75

    91

    39

    90

    66

    71

    86

    79

    47

    49

    79

    60

    48

    66

    65

    55

    55

    77

    44

    Average Hours of Video Games Per Week

    Exam Score

    Scatterplot: Video Games and Test Score

    correlations

    1367

    486

    1264

    1159

    760

    876

    1061

    495

    780

    377

    1568

    1475

    591

    1439

    790

    1066

    1071

    086

    1579

    1647

    1749

    1479

    560

    1248

    1266

    1365

    1155

    1455

    477

    1544

    -0.6415181796

    1912

    1815

    1310

    1514

    1213

    96

    1818

    1915

    2015

    1310

    1612

    2012

    2213

    129

    2316

    1515

    1313

    1611

    127

    1914

    1611

    76

    38

    1713

    1112

    1712

    2013

    1010

    1813

    1512

    0.7382131408

    correlations

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    Average Hours of Video Games Per Week

    Exam Score

    Correlation Between Video Games and Test Score

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    0

    Average Hours of Video Games Per Week

    Average Number of Alcoholic Drinks Per Week

    Correlation Between Video Games and Alcohol Consumption

  • An ExampleDoes smoking cigarettes increase systolic blood pressure?Plotting number of cigarettes smoked per day against systolic blood pressureFairly moderate relationshipRelationship is positive

  • Trend?

  • Smoking and BPNote relationship is moderate, but real.Why do we care about relationship?What would conclude if there were no relationship?What if the relationship were near perfect?What if the relationship were negative?

  • Heart Disease and CigarettesData on heart disease and cigarette smoking in 21 developed countries (Landwehr and Watkins, 1987) Data have been rounded for computational convenience.The results were not affected.

  • The DataSurprisingly, the U.S. is the first country on the list--the country with the highest consumption and highest mortality.

    Sheet1

    CountryCigarettesCHD(X - Xbar)(Y - Ybar)(X - Xbar)(Y - Ybar)

    111265.0511.4857.97

    29213.056.4819.76

    39243.059.4828.91

    49213.056.4819.76

    58192.054.489.18

    68132.05-1.52-3.12

    78192.054.489.18

    86110.05-3.52-0.18

    96230.058.480.42

    10515-0.950.48-0.46

    11513-0.95-1.521.44

    1254-0.95-10.529.99

    13518-0.953.48-3.31

    14512-0.95-2.522.39

    1553-0.95-11.5210.94

    16411-1.95-3.526.86

    17415-1.950.48-0.94

    1846-1.95-8.5216.61

    19313-2.95-1.524.48

    2034-2.95-10.5231.03

    21314-2.95-0.521.53

    Mean5.9514.52

    SD2.336.69

    Sum222.44

    Sheet2

    Sheet3

  • Scatterplot of Heart DiseaseCHD Mortality goes on ordinate (Y axis)Why?Cigarette consumption on abscissa (X axis)Why?What does each dot represent?Best fitting line included for clarity

  • {X = 6, Y = 11}

    Chapter 9 Correlation

  • What Does the Scatterplot Show?As smoking increases, so does coronary heart disease mortality.Relationship looks strongNot all data points on line.This gives us residuals or errors of predictionTo be discussed later

  • CorrelationCo-relationThe relationship between two variablesMeasured with a correlation coefficientMost popularly seen correlation coefficient: Pearson Product-Moment Correlation

  • Types of CorrelationPositive correlationHigh values of X tend to be associated with high values of Y.As X increases, Y increasesNegative correlationHigh values of X tend to be associated with low values of Y.As X increases, Y decreasesNo correlationNo consistent tendency for values on Y to increase or decrease as X increases

  • Correlation CoefficientA measure of degree of relationship.Between 1 and -1Sign refers to direction.Based on covarianceMeasure of degree to which large scores on X go with large scores on Y, and small scores on X go with small scores on YThink of it as variance, but with 2 variables instead of 1 (What does that mean??)

  • *

  • CovarianceRemember that variance is:

    The formula for co-variance is:

    How this works, and why?When would covXY be large and positive? Large and negative?

    Chapter 9 Correlation

  • Example

    Country

    X (Cig.)

    Y (CHD)

    *

    1

    11

    26

    5.05

    11.48

    57.97

    2

    9

    21

    3.05

    6.48

    19.76

    3

    9

    24

    3.05

    9.48

    28.91

    4

    9

    21

    3.05

    6.48

    19.76

    5

    8

    19

    2.05

    4.48

    9.18

    6

    8

    13

    2.05

    -1.52

    -3.12

    7

    8

    19

    2.05

    4.48

    9.18

    8

    6

    11

    0.05

    -3.52

    -0.18

    9

    6

    23

    0.05

    8.48

    0.42

    10

    5

    15

    -0.95

    0.48

    -0.46

    11

    5

    13

    -0.95

    -1.52

    1.44

    12

    5

    4

    -0.95

    -10.52

    9.99

    13

    5

    18

    -0.95

    3.48

    -3.31

    14

    5

    12

    -0.95

    -2.52

    2.39

    15

    5

    3

    -0.95

    -11.52

    10.94

    16

    4

    11

    -1.95

    -3.52

    6.86

    17

    4

    15

    -1.95

    0.48

    -0.94

    18

    4

    6

    -1.95

    -8.52

    16.61

    19

    3

    13

    -2.95

    -1.52

    4.48

    20

    3

    4

    -2.95

    -10.52

    31.03

    21

    3

    14

    -2.95

    -0.52

    1.53

    Mean

    5.95

    14.52

    SD

    2.33

    6.69

    Sum

    222.44

    _1222636204.unknown

    _1222636234.unknown

    _1222636241.unknown

    _1222636165.unknown

  • Example*What the heck is a covariance? I thought we were talking about correlation?

  • Correlation CoefficientPearsons Product Moment CorrelationSymbolized by rCovariance (product of the 2 SDs)

    Correlation is a standardized covariance

  • Calculation for ExampleCovXY = 11.12sX = 2.33sY = 6.69

  • ExampleCorrelation = .713Sign is positiveWhy?If sign were negativeWhat would it mean?Would not alter the degree of relationship.

  • Other calculations*Z-score method

    Computational (Raw Score) Method

  • Other Kinds of CorrelationSpearman Rank-Order Correlation Coefficient (rsp)used with 2 ranked/ordinal variablesuses the same Pearson formula*

    Chapter 9 Correlation

    Sheet1

    AttractivenessSymmetry

    32

    46

    11

    23

    54

    65

    0.77

    Sheet2

    Sheet3

  • Other Kinds of CorrelationPoint biserial correlation coefficient (rpb)used with one continuous scale and one nominal or ordinal or dichotomous scale.uses the same Pearson formula*

    Chapter 9 Correlation

    Sheet1

    AttractivenessDate?

    30

    40

    11

    21

    51

    60

    -0.49

    Sheet2

    Sheet3

  • Other Kinds of CorrelationPhi coefficient ()used with two dichotomous scales.uses the same Pearson formula*

    Chapter 9 Correlation

    Sheet1

    AttractivenessDate?

    00

    10

    11

    11

    00

    11

    0.71

    Sheet2

    Sheet3

  • Factors Affecting rRange restrictionsLooking at only a small portion of the total scatter plot (looking at a smaller portion of the scores variability) decreases r.Reducing variability reduces rNonlinearityThe Pearson r (and its relatives) measure the degree of linear relationship between two variablesIf a strong non-linear relationship exists, r will provide a low, or at least inaccurate measure of the true relationship.

  • Factors Affecting rHeterogeneous subsamplesEveryday examples (e.g. height and weight using both men and women)OutliersOverestimate CorrelationUnderestimate Correlation

  • Countries With Low Consumptions

  • Truncation*

  • Non-linearity*

  • Heterogenous samples*

  • Outliers*

  • Testing Correlations*So you have a correlation. Now what?In terms of magnitude, how big is big?Small correlations in large samples are big.Large correlations in small samples arent always big.Depends upon the magnitude of the correlation coefficientANDThe size of your sample.

  • Testing rPopulation parameter = Null hypothesis H0: = 0Test of linear independenceWhat would a true null mean here?What would a false null mean here?Alternative hypothesis (H1) 0Two-tailed

  • Tables of SignificanceWe can convert r to t and test for significance:

    Where DF = N-2

  • Tables of SignificanceIn our example r was .71N-2 = 21 2 = 19

    T-crit (19) = 2.09Since 6.90 is larger than 2.09 reject r = 0.

  • Computer PrintoutPrintout gives test of significance.

  • Regression

    Chapter 9 Correlation

  • What is regression?*How do we predict one variable from another?How does one variable change as the other changes?Influence

  • Linear Regression*A technique we use to predict the most likely score on one variable from those on another variableUses the nature of the relationship (i.e. correlation) between two variables to enhance your prediction

  • Linear Regression: Parts*Y - the variables you are predictingi.e. dependent variableX - the variables you are using to predicti.e. independent variable - your predictions (also known as Y)

  • Why Do We Care?*We may want to make a prediction.More likely, we want to understand the relationship.How fast does CHD mortality rise with a one unit increase in smoking?Note: we speak about predicting, but often dont actually predict.

  • An Example*Cigarettes and CHD Mortality againData repeated on next slideWe want to predict level of CHD mortality in a country averaging 10 cigarettes per day.

  • The Data*Based on the data we have what would we predict the rate of CHD be in a country that smoked 10 cigarettes on average?First, we need to establish a prediction of CHD from smoking

    Sheet1

    CountryCigarettesCHD(X - Xbar)(Y - Ybar)(X - Xbar)(Y - Ybar)

    111265.0511.4857.97

    29213.056.4819.76

    39243.059.4828.91

    49213.056.4819.76

    58192.054.489.18

    68132.05-1.52-3.12

    78192.054.489.18

    86110.05-3.52-0.18

    96230.058.480.42

    10515-0.950.48-0.46

    11513-0.95-1.521.44

    1254-0.95-10.529.99

    13518-0.953.48-3.31

    14512-0.95-2.522.39

    1553-0.95-11.5210.94

    16411-1.95-3.526.86

    17415-1.950.48-0.94

    1846-1.95-8.5216.61

    19313-2.95-1.524.48

    2034-2.95-10.5231.03

    21314-2.95-0.521.53

    Mean5.9514.52

    SD2.336.69

    Sum222.44

    Sheet2

    Sheet3

  • *For a country that smokes 6 C/A/DWe predict a CHD rate of about 14Regression Line

    Chapter 9 Correlation

  • Regression Line*Formula

    = the predicted value of Y (e.g. CHD mortality) X = the predictor variable (e.g. average cig./adult/country)

  • Regression Coefficients*Coefficients are a and bb = slope Change in predicted Y for one unit change in Xa = intercept value of when X = 0

  • Calculation*Slope

    Intercept

  • For Our Data*CovXY = 11.12s2X = 2.332 = 5.447b = 11.12/5.447 = 2.042a = 14.524 - 2.042*5.952 = 2.32See SPSS printout on next slideAnswers are not exact due to rounding error and desire to match SPSS.

  • SPSS Printout*

  • Note:*The values we obtained are shown on printout.The intercept is the value in the B column labeled constant The slope is the value in the B column labeled by name of predictor variable.

  • Making a Prediction*Second, once we know the relationship we can predict

    We predict 22.77 people/10,000 in a country with an average of 10 C/A/D will die of CHD

  • Accuracy of PredictionFinnish smokers smoke 6 C/A/DWe predict:

    They actually have 23 deaths/10,000Our error (residual) = 23 - 14.619 = 8.38a large error*

    Chapter 9 Correlation

  • *Cigarette Consumption per Adult per Day12108642CHD Mortality per 10,0003020100ResidualPrediction

    Chapter 9 Correlation

  • Residuals*When we predict for a given X, we will sometimes be in error. Y for any X is a an error of estimateAlso known as: a residualWe want to (Y- ) as small as possible.BUT, there are infinitely many lines that can do this.Just draw ANY line that goes through the mean of the X and Y values.Minimize Errors of Estimate How?

  • Minimizing Residuals*Again, the problem lies with this definition of the mean:

    So, how do we get rid of the 0s?Square them.

  • Regression Line: A Mathematical DefinitionThe regression line is the line which when drawn through your data set produces the smallest value of:

    Called the Sum of Squared Residual or SSresidualRegression line is also called a least squares line.*

    Chapter 9 Correlation

  • Summarizing Errors of Prediction*Residual varianceThe variability of predicted values

  • Standard Error of Estimate*Standard error of estimateThe standard deviation of predicted values

    A common measure of the accuracy of our predictionsWe want it to be as small as possible.

  • Example*

    Chapter 9 Correlation

    Sheet1

    CountryX (Cig.)Y (CHD)Y'(Y - Y')(Y' - Ybar)(Y - Ybar)

    1112624.8291.1711.371106.193131.699

    292120.7450.2550.06538.70141.939

    392420.7453.25510.59538.70189.795

    492120.7450.2550.06538.70141.939

    581918.7030.2970.08817.46420.035

    681318.703-5.70332.52417.4642.323

    781918.7030.2970.08817.46420.035

    861114.619-3.61913.0970.00912.419

    962314.6198.38170.2410.00971.843

    1051512.5772.4235.8713.7910.227

    1151312.5770.4230.1793.7912.323

    125412.577-8.57773.5653.791110.755

    1351812.5775.42329.4093.79112.083

    1451212.577-0.5770.3333.7916.371

    155312.577-9.57791.7193.791132.803

    1641110.5350.4650.21615.91212.419

    1741510.5354.46519.93615.9120.227

    184610.535-4.53520.56615.91272.659

    193138.4934.50720.31336.3732.323

    20348.493-4.49320.18736.373110.755

    213148.4935.50730.32736.3730.275

    Mean5.95214.524

    SD2.3346.690

    Sum0.04440.757454.307895.247895.06

    Y' = (2.04*X) + 2.37

    Sheet2

    Sheet3

  • Regression and Z Scores*When your data are standardized (linearly transformed to z-scores), the slope of the regression line is called DO NOT confuse this with the associated with type II errors. Theyre different.When we have one predictor, r = Zy = Zx, since A now equals 0

  • Sums of square deviationsTotal

    Regression

    Residual we already covered

    SStotal = SSregression + SSresidualPartitioning Variability*

  • Partitioning Variability*Degrees of freedomTotaldftotal = N - 1Regressiondfregression = number of predictorsResidualdfresidual = dftotal dfregressiondftotal = dfregression + dfresidual

  • Partitioning Variability*Variance (or Mean Square)Total Variances2total = SStotal/ dftotalRegression Variances2regression = SSregression/ dfregressionResidual Variances2residual = SSresidual/ dfresidual

  • Example*

    Sheet1

    CountryX (Cig.)Y (CHD)Y'(Y - Y')(Y' - Ybar)(Y - Ybar)

    1112624.8291.1711.371106.193131.699

    292120.7450.2550.06538.70141.939

    392420.7453.25510.59538.70189.795

    492120.7450.2550.06538.70141.939

    581918.7030.2970.08817.46420.035

    681318.703-5.70332.52417.4642.323

    781918.7030.2970.08817.46420.035

    861114.619-3.61913.0970.00912.419

    962314.6198.38170.2410.00971.843

    1051512.5772.4235.8713.7910.227

    1151312.5770.4230.1793.7912.323

    125412.577-8.57773.5653.791110.755

    1351812.5775.42329.4093.79112.083

    1451212.577-0.5770.3333.7916.371

    155312.577-9.57791.7193.791132.803

    1641110.5350.4650.21615.91212.419

    1741510.5354.46519.93615.9120.227

    184610.535-4.53520.56615.91272.659

    193138.4934.50720.31336.3732.323

    20348.493-4.49320.18736.373110.755

    213148.4935.50730.32736.3730.275

    Mean5.95214.524

    SD2.3346.690

    Sum0.04440.757454.307895.247895.06

    Y' = (2.04*X) + 2.37

    Sheet2

    Sheet3

  • Example*

  • Coefficient of Determination*It is a measure of the percent of predictable variability

    The percentage of the total variability in Y explained by X

  • r = .713r 2 = .7132 =.508

    or

    Approximately 50% in variability of incidence of CHD mortality is associated with variability in smoking.r 2 for our example*

  • Coefficient of Alienation*It is defined as 1 - r 2 or

    Example1 - .508 = .492

  • r2, SS and sY-Y*r2 * SStotal = SSregression(1 - r2) * SStotal = SSresidualWe can also use r2 to calculate the standard error of estimate as:

  • Testing Overall Model*We can test for the overall prediction of the model by forming the ratio:

    If the calculated F value is larger than a tabled value (F-Table) we have a significant prediction

  • Testing Overall Model*Example

    F-Table F critical is found using 2 things dfregression (numerator) and dfresidual.(demoninator)F-Table our Fcrit (1,19) = 4.3819.594 > 4.38, significant overallShould all sound familiar

  • SPSS output*

  • Testing Slope and Intercept*The regression coefficients can be tested for significanceEach coefficient divided by its standard error equals a t value that can also be looked up in a t-table Each coefficient is tested against 0

  • Testing the Slope*With only 1 predictor, the standard error for the slope is:

    For our Example:

  • Testing Slope and Intercept*These are given in computer printout as a t test.

  • Testing*The t values in the second from right column are tests on slope and intercept.The associated p values are next to them.The slope is significantly different from zero, but not the intercept.Why do we care?

  • Testing*What does it mean if slope is not significant?How does that relate to test on r?What if the intercept is not significant?Does significant slope mean we predict quite well?

    *********Landwehr, J.M. & Watkins, A.E. (1987) Exploring Data: Teachers Edition. Palo Alto, CA: Dale Seymour Publications.

    *****************With updates on slides 23, 25, 27, 29, 35, 36*Landwehr, J.M. & Watkins, A.E. (1987) Exploring Data: Teachers Edition. Palo Alto, CA: Dale Seymour Publications.

    **