
Chapter 10 Linear Regression and Correlation - FCAMPENA · 1 Chapter 10 Chapter 10 Linear Regression and Correlation "It is the mark of a truly intelligent person to be moved by statistics."


Chapter 10

Linear Regression and Correlation

"It is the mark of a truly intelligent person to be moved by statistics."

—George Bernard Shaw

Source:

https://www.google.com.ph/search?q=house+and+car+pictures&biw=1366&bih=667&tbm

=isch&imgil=u6a6khDohjuW5M%253A%253B423OVK06_el86M%253Bhttp%25253A%

25252F%25252Furideidrive.com%25252Flonger-drives-and-larger-vehicles%25252Fauto-

delivery%25252F&source=iu&pf=m&fir=u6a6khDohjuW5M%253A%252C423OVK06_el

86M%252C_&usg=___pDF5qFgYsPZk2xt4HfiozoTcuU%3D&ved=0ahUKEwjl3pmE773L

AhUFKKYKHerSDGkQyjcILg&ei=aHjlVuWKFoXQmAXqpbPIBg#imgdii=XPnPD17XP

H3wXM%3A%3BXPnPD17XPH3wXM%3A%3BLXrQgW_TdN7n9M%3A&imgrc=XPn

PD17XPH3wXM%3A

Many things in real life are related. For instance, the price of a house is directly related to

its floor area. The price of a car depends on its engine and model. Moreover, many measurements

of body parts are related to one another. Furthermore, class rankings may be associated with


rankings in standardized tests. In Chapter 3, you learned how to measure the degree of association

between two quantitative variables. In this chapter, you will recall this procedure and you will also

learn to measure the degree of association between ranked variables. Moreover, you will learn

how to use linear equations in modelling relationships among variables.

Specifically, you will learn to:

Illustrate the nature of bivariate data

Construct a scatter plot

Describe shape (form), trend (direction), and variation (strength) based on a scatter plot

Calculate and interpret the Pearson product-moment correlation coefficient

Draw the best-fit line on a scatter plot

Calculate and interpret the slope and y-intercept of the regression line

Predict the value of the dependent variable given the value of the independent variable

Solve problems involving correlation and regression analysis

Use regression analysis in modelling real-life data

Calculate and interpret the Spearman rank correlation coefficient


Concept Map


Big Ideas

Relationships in real-life situations can be measured and modelled.

Essential Questions

How do you measure the degree of relationship between two variables?

How can you model relationships among variables?

LINEAR CORRELATION

Consider the following final grades in algebra and statistics obtained by a sample of 12 students.

Student     A   B   C   D   E   F   G   H   I   J   K   L
Algebra     82  87  78  93  95  87  80  85  85  86  90  83
Statistics  84  85  75  92  96  90  80  86  83  84  92  85

A comparison of the grades of the students in these two subjects would lead you to ask the

question: "Is there a relationship between these algebra and statistics grades?" Specifically, can

you say that students who have high grades in algebra also have high grades in statistics?

It was Sir Francis Galton, a cousin of Charles Darwin, who introduced the idea of

correlation analysis, a statistical method to determine if there is an association between two


variables. Galton undertook detailed studies on human characteristics and he found out that there

is a very strong relationship between the heights of fathers and the heights of their sons.

A very useful visual tool in the process of determining if there is any relationship between

two variables, say X and Y, is the scatter plot. For the given data set above, we obtain the n = 12

data points (xi,yi) on the Cartesian plane by using xi = algebra grade and yi = statistics grade of the

ith student. Hence, each student would be represented by a point as shown in the following scatter

plot.

Figure 1. Scatter plot of statistics versus algebra grades

It could be seen from the scatter plot that the data points may not fall exactly on a straight

line but they tend to follow very closely a straight line with a positive slope. This is an indication

that there is a strong direct linear relationship between algebra and statistics grades such that

students with high grades in algebra are expected to have high grades also in statistics. Hence, for


this data set, you could say that there is a strong positive correlation between algebra grades and

statistics grades.

After you have drawn the scatter plot and observed that there is a linear relationship between

the two variables X and Y, you can then compute the appropriate correlation coefficient,

which measures the strength of the linear relationship between the two variables.

Pop-Up!

Linear correlation is a statistical method of determining the nature and strength of

the linear relationship between two variables X and Y using a single numerical

value known as the correlation coefficient.

Pearson's r

Karl Pearson developed a coefficient of linear correlation that could be used to

determine the nature and strength of linear relationship between two quantitative variables

X and Y. This correlation coefficient, called Pearson's sample product-moment

correlation coefficient and popularly known as Pearson's r, is given by the

following formula.

Pop-Up!

$$ r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}, $$

where: xi = ith value of the variable X

yi = ith value of the variable Y

n = number of observations or data points

Note that this Pearson’s r formula is a simplified form of the sample correlation coefficient r in

Chapter 3 shown below:

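$$ r = \frac{\sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sqrt{\left[\sum x^2 - \frac{\left(\sum x\right)^2}{n}\right]\left[\sum y^2 - \frac{\left(\sum y\right)^2}{n}\right]}} $$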

The resulting value of this correlation coefficient ranges from -1 to +1. Specifically, there are two

pieces of information that can be obtained from it, namely:

1. The positive (+) or negative (-) sign indicates the nature of the linear relationship between X

and Y, wherein

r > 0 (positive correlation) indicates a direct linear relationship between X and Y (i.e.,

Y is expected to increase as X increases); and

r < 0 (negative correlation) indicates an indirect linear relationship between X and Y

(i.e., Y is expected to decrease as X increases).

2. The magnitude of r, disregarding the + or - sign, indicates the strength of the linear relationship

between X and Y so that

|r| close to 1 indicates a strong correlation between X and Y;

|r| close to ½ indicates a moderate correlation between X and Y;

|r| close to 0 indicates a weak correlation between X and Y;

r = +1 indicates a perfect positive correlation between X and Y;

r = -1 indicates a perfect negative correlation between X and Y; and

r = 0 indicates that there is no linear relationship (zero correlation) between X and Y.


The following are some prototype scatter plots.

Figure 2. Prototype scatter plots (panels: perfect positive correlation; strong positive correlation; strong negative correlation; low positive correlation; zero correlation; zero correlation)

It can be seen from these illustrations that a perfect correlation between X and Y means

that all the data points in the scatter plot lie exactly on a straight line. In this case, it would be

possible to give an accurate prediction of Y based on the known value of X. The closer the data

points are to a linear pattern, the stronger the correlation between X and Y. And the farther the

data points are from a linear pattern, the weaker the correlation between X and Y.

It should be noted that the correlation coefficient r is a measure of linear relationship

between X and Y so that a zero correlation simply means that there is no linear relationship

between X and Y. But it does not eliminate the possibility that there may be some other kind of

association between them. An example of this situation is given by the last diagram in the

prototype scatter plots wherein there is zero correlation between X and Y even though there is a

strong quadratic (parabolic) relationship between them.
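The point about zero correlation can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (not from the text) in which Y is determined exactly by Y = X²; the helper name `pearson_r` is only illustrative. It applies the sums formula for Pearson's r given earlier.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from the sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: a perfect parabolic relationship, symmetric about x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_quadratic = pearson_r(xs, ys)
print(r_quadratic)
```

Here Y is completely determined by X, yet r comes out as zero because the relationship is not linear.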

Example 1

For the given data on the algebra grades and statistics grades of the sample of n = 12

students, compute Pearson's r and interpret it.

Solution

Let X denote the algebra grade and let Y denote the statistics grade. The required

computations to determine Pearson's r are shown in the table below.

Student   xi      yi      xi²      yi²      xiyi
A         82      84      6,724    7,056    6,888
B         87      85      7,569    7,225    7,395
C         78      75      6,084    5,625    5,850
D         93      92      8,649    8,464    8,556
E         95      96      9,025    9,216    9,120
F         87      90      7,569    8,100    7,830
G         80      80      6,400    6,400    6,400
H         85      86      7,225    7,396    7,310
I         85      83      7,225    6,889    7,055
J         86      84      7,396    7,056    7,224
K         90      92      8,100    8,464    8,280
L         83      85      6,889    7,225    7,055
Sum       1,031   1,032   88,855   89,116   88,963

Using the formula for Pearson's r, you get

$$ r = \frac{12(88{,}963) - (1{,}031)(1{,}032)}{\sqrt{\left[12(88{,}855) - (1{,}031)^2\right]\left[12(89{,}116) - (1{,}032)^2\right]}} = 0.94, $$

which indicates that there is a strong positive correlation between the algebra grades and statistics

grades of the students. This means that to a high extent, students with high grades in algebra also

tend to have high grades in statistics.
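The computation in Example 1 can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the text. It rebuilds the column totals of the table and then applies the sums formula for Pearson's r.

```python
import math

# Grades of the 12 students (A to L) from the table in this section.
algebra = [82, 87, 78, 93, 95, 87, 80, 85, 85, 86, 90, 83]
stats_grades = [84, 85, 75, 92, 96, 90, 80, 86, 83, 84, 92, 85]

n = len(algebra)
sum_x = sum(algebra)
sum_y = sum(stats_grades)
sum_x2 = sum(x * x for x in algebra)
sum_y2 = sum(y * y for y in stats_grades)
sum_xy = sum(x * y for x, y in zip(algebra, stats_grades))

# Sums formula for Pearson's r.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 2))
```

The printed totals should agree with the last row of the computation table, and the rounded coefficient with the value reported in the solution.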

A statistic that is closely associated with the correlation coefficient is the sample coefficient of determination, r² × 100%, which gives the proportion of the total variability in Y that can be explained or accounted for by the linear relationship with X. This coefficient can be used to compare the strengths of the linear relationships between two pairs of variables: X1 and Y1 versus X2 and Y2. Suppose the correlation coefficient between X1 and Y1 is r1 = 0.8 and the correlation coefficient between X2 and Y2 is r2 = 0.4, which correspond to r1² = 64% and r2² = 16%, respectively. Then we could say that the linear relationship between X1 and Y1 is four times as strong as the linear relationship between X2 and Y2.

Example 2

Using the given data on Algebra grades and Statistics grades, find the sample coefficient

of determination and interpret.

Solution:

From Example 1, you obtained r = 0.94, which yields r² = 0.8836, or about 88.4%. This means that 88.4% of the total variability in the statistics grades could be accounted for by the linear relationship with the algebra grades. The remaining 11.6% of the variability in the statistics grades could be explained by factors other than the algebra grades.
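As a rough check on Example 2, the sketch below computes r² in two ways from the column totals of Example 1: directly from the sums without rounding, and by squaring the rounded value r = 0.94 as the solution does. The variable names are illustrative, and the unrounded figure is not stated in the text; it is only what the arithmetic gives when r is not rounded first.

```python
# Column totals taken from the computation table in Example 1 (n = 12 students).
n = 12
sum_x, sum_y = 1031, 1032
sum_x2, sum_y2, sum_xy = 88855, 89116, 88963

numerator = n * sum_xy - sum_x * sum_y
x_term = n * sum_x2 - sum_x ** 2
y_term = n * sum_y2 - sum_y ** 2

# r squared can be formed directly, with no square root and no intermediate rounding.
r_squared_unrounded = numerator ** 2 / (x_term * y_term)

# The solution in the text squares the rounded value r = 0.94 instead.
r_squared_from_rounded_r = 0.94 ** 2

print(f"{r_squared_unrounded:.4f}")
print(f"{r_squared_from_rounded_r:.4f}")
```

The two results differ slightly in the third decimal place, which is a reminder that rounding r before squaring affects the reported percentage; the interpretation is the same either way.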

Testing the significance of Pearson's population correlation coefficient ρ:

In addition to the estimate of the linear relationship between two numerical variables X

and Y using the correlation coefficient Pearson’s r, you can also draw an inference about the true

linear relationship between X and Y. To test for the significance of the linear relationship between

two numerical variables X and Y, you test the null hypothesis H0: ρ = 0 against an appropriate

alternative hypothesis Ha using the test statistic

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, $$

which has the t distribution with v = n - 2 degrees of freedom. The null hypothesis H0: ρ = 0

means that there is no significant linear relationship between X and Y, where the parameter ρ

denotes the unknown true value of the correlation coefficient.

At a level of significance α, you reject H0 according to the decision rules given below.

Ha       Decision rule: reject H0 if                      Interpretation
ρ ≠ 0    t < -t_{α/2, n-2} or t > t_{α/2, n-2}            There is a significant linear relationship between X and Y.
ρ > 0    t > t_{α, n-2}                                   There is a significant positive correlation between X and Y.
ρ < 0    t < -t_{α, n-2}                                  There is a significant negative correlation between X and Y.

Otherwise, fail to reject H0.

Recall from Example 1, the obtained Pearson’s r between X = algebra grade and Y =

statistics grade from a sample of 12 students is r = 0.94. Recall also that this sample correlation

coefficient indicates a strong positive correlation between the two variables. It also corresponds

to a sample coefficient of determination r² = 0.8836, which indicates that approximately 88.4% of

the variation in statistics grades (Y) can be accounted for by a linear relationship with algebra

grades (X).

To test the significance of the linear relationship between the algebra grades and statistics grades using a significance level of 5%, proceed as follows.

Step 1. H0: There is no significant linear relationship between algebra grades and statistics grades, that is, ρ = 0.

Ha: There is a significant linear relationship between algebra grades and statistics grades, that is, ρ ≠ 0.

Step 2. α = 0.05.

Step 3. The test statistic to use is

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, $$

with v = n - 2 degrees of freedom.

Step 4. Reject H0 if t < -t_{0.025, 10} = -2.2281 or t > t_{0.025, 10} = 2.2281. Otherwise, fail to reject H0.

Step 5. Substituting the available information in the test statistic, you get

$$ t = \frac{(0.94)\sqrt{12-2}}{\sqrt{1-(0.94)^2}} = 8.7127. $$

Step 6. Since the computed value of the test statistic t is greater than 2.2281 and hence falls into the critical region, reject H0.

Step 7. At α = 5%, you have sufficient evidence to indicate a significant linear relationship between algebra grades and statistics grades.

Note that the p-value associated with the computed test statistic, which is 8.7127, is 0.000003. Since p-value < α, that is, 0.000003 < 0.05, you reject H0 and arrive at the same conclusion.
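Steps 3 to 6 above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function name `correlation_t_statistic` is made up, the critical value 2.2281 is copied from Step 4 rather than computed, and the p-value is not reproduced because the standard library has no t distribution.

```python
import math

def correlation_t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.94                  # rounded Pearson's r from Example 1
n = 12                    # number of students
critical_value = 2.2281   # t_{0.025, 10}, as quoted in Step 4

t = correlation_t_statistic(r, n)

# Two-sided decision rule: reject H0 when t falls in either tail.
reject_h0 = t < -critical_value or t > critical_value

print(round(t, 4), reject_h0)
```

The same function can be reused for any sample by changing `r`, `n`, and the tabulated critical value.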


Example 4: The data from a study of the effectiveness of a one-month physical exercise program in weight reduction, using a sample of eight persons, are shown in the table below.

Weight (in pounds)
Person    1    2    3    4    5    6    7    8
Before    209  178  169  212  180  192  158  180
After     196  171  170  207  177  190  159  180

Compute Pearson's r and test its significance to determine if there is a significant linear relationship between the weight before and the weight after the one-month physical exercise program.

Solution:

Let X represent the weight in pounds before the physical exercise program and let Y represent the

weight after the physical exercise program.

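Pearson's r:

$$
\begin{aligned}
r &= \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}} \\
  &= \frac{8(269{,}878) - (1{,}478)(1{,}450)}{\sqrt{\left[8(275{,}498) - (1{,}478)^2\right]\left[8(264{,}516) - (1{,}450)^2\right]}} = 0.9768
\end{aligned}
$$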

Test of significance of Pearson's population correlation coefficient ρ:

Step 1. H0: There is no significant linear relationship between the weights before and after the one-month exercise program, that is, ρ = 0.

Ha: There is a significant linear relationship between the weights before and after the one-month exercise program, that is, ρ ≠ 0.

Step 2. α = 0.05.

Step 3. The test statistic to use is

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, $$

with v = n - 2 degrees of freedom.

Step 4. Reject H0 if t < -t_{0.025, 6} = -2.4469 or t > t_{0.025, 6} = 2.4469. Otherwise, fail to reject H0.

Step 5. Substituting the available information in the test statistic, you get

$$ t = \frac{(0.9768)\sqrt{8-2}}{\sqrt{1-(0.9768)^2}} = 11.1800. $$

Step 6. Since the computed value of the test statistic t is greater than 2.4469 and hence falls into the critical region, reject H0.

Step 7. At α = 5%, you have sufficient evidence to indicate a significant linear relationship between the weights before and after the one-month exercise program.

Note that the p-value associated with the computed test statistic, which is 11.1800, is 0.00003. Since p-value < α, that is, 0.00003 < 0.05, you reject H0 and arrive at the same conclusion.


Example 5: A nutritionist claims that individuals tend to report decreasing dietary intake the more they are interviewed. Data from a sample of eight female university students are shown below:

Student   1     2     3     4     5     6     7     8
Day 1     1905  2237  1863  2291  2211  1428  1062  1705
Day 2     1658  1479  1100  2116  1999  1097  1283  2424

Compute Pearson's r and test its significance to determine if there is a significant linear relationship between the recorded dietary intake on day 1 and the recorded dietary intake on day 2.

Solution:

Let X represent the recorded dietary intake on day 1 and let Y represent the recorded dietary intake

on day 2.

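Pearson's r:

$$
\begin{aligned}
r &= \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}} \\
  &= \frac{8(24{,}845{,}840) - (14{,}702)(13{,}156)}{\sqrt{\left[8(28{,}315{,}218) - (14{,}702)^2\right]\left[8(23{,}345{,}136) - (13{,}156)^2\right]}} = 0.4489
\end{aligned}
$$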

Test of significance of Pearson's ρ:

Step 1. H0: There is no significant linear relationship between the recorded dietary intake on day 1 and the recorded dietary intake on day 2, that is, ρ = 0.

Ha: There is a significant linear relationship between the recorded dietary intake on day 1 and the recorded dietary intake on day 2, that is, ρ ≠ 0.

Step 2. α = 0.05.

Step 3. The test statistic to use is

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, $$

with v = n - 2 degrees of freedom.

Step 4. Reject H0 if t < -t_{0.025, 6} = -2.4469 or t > t_{0.025, 6} = 2.4469. Otherwise, fail to reject H0.

Step 5. Substituting the available information in the test statistic, you get

$$ t = \frac{(0.4489)\sqrt{8-2}}{\sqrt{1-(0.4489)^2}} = 1.2304. $$

Step 6. Since the computed value of the test statistic t does not fall in the critical region, that is, -2.4469 < 1.2304 < 2.4469, you fail to reject H0.

Step 7. At α = 5%, you do not have sufficient evidence to indicate a significant linear relationship between the recorded dietary intake on day 1 and the recorded dietary intake on day 2.

Note that the p-value associated with the computed test statistic, which is 1.2304, is 0.2646. Since p-value > α, that is, 0.2646 > 0.05, you fail to reject H0 and arrive at the same conclusion.
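Examples 4 and 5 follow the same recipe, so the whole procedure can be wrapped in one helper. The sketch below is illustrative only: the function names are made up, the critical value 2.4469 is copied from Step 4 of both examples, and r is carried at full precision rather than rounded to four decimals before computing t.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def two_sided_correlation_test(xs, ys, critical_value):
    """Return (r, t, reject_h0) for H0: rho = 0 against Ha: rho != 0."""
    n = len(xs)
    r = pearson_r(xs, ys)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return r, t, abs(t) > critical_value

T_CRIT = 2.4469  # t_{0.025, 6}, as quoted in Step 4 of Examples 4 and 5

# Example 4: weights (in pounds) before and after the exercise program.
before = [209, 178, 169, 212, 180, 192, 158, 180]
after = [196, 171, 170, 207, 177, 190, 159, 180]
r_w, t_w, reject_w = two_sided_correlation_test(before, after, T_CRIT)

# Example 5: recorded dietary intake on day 1 and day 2.
day1 = [1905, 2237, 1863, 2291, 2211, 1428, 1062, 1705]
day2 = [1658, 1479, 1100, 2116, 1999, 1097, 1283, 2424]
r_d, t_d, reject_d = two_sided_correlation_test(day1, day2, T_CRIT)

print(round(r_w, 4), round(t_w, 2), reject_w)
print(round(r_d, 4), round(t_d, 4), reject_d)
```

The first data set leads to rejecting H0 and the second does not, matching the conclusions of the two examples.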


Spearman's rho

A corresponding correlation coefficient that can be used to measure the strength of the

association between two variables on the ordinal scale, especially when there are only few data

points, is the Spearman's Rank-Order Correlation Coefficient, rs, or simply, Spearman's rho.

Under Spearman's rho, the data consist of two sets of rankings corresponding to the values of

the variables X and Y. Just like Pearson's r, the resulting values of Spearman's rho range

from -1 to +1. You can also interpret Spearman's rho in a manner similar to Pearson's r.

The procedure for calculating the Spearman's rho is to compare the rankings on the

variables X and Y for the subjects under study. Average the ranks of tied observations, if any.

The difference between each pair of ranks is denoted by di. These differences are squared and

added and then used to calculate the following correlation coefficient:

Pop-Up!

$$ r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}, $$

where: di = difference between the ranks assigned to the ith data point (xi,yi)

n = number of pairs of data.

Example 6

The following table gives the preliminary scores and the final rankings obtained by a group of 9 female students for a campus beauty search. Find the degree of association between preliminary score and final ranking.

Candidate   Preliminary Score    Final Ranking
A           82                   2
B           87                   6
C           78                   1 (worst)
D           95                   8
E           85 (tied with H)     5
F           81                   4
G           84                   3
H           85 (tied with E)     7
I           90                   9 (best)

Solution:

Using the variables X = rank based on the preliminary score and Y = final ranking, the following table gives the rankings on X and Y and the differences in ranks for the n = 9 pairs of observations. The computation for Σdi² is shown in the last column.

Candidate   xi     yi   di     di²
A           3      2    1      1
B           7      6    1      1
C           1      1    0      0
D           9      8    1      1
E           5.5*   5    0.5    0.25
F           2      4    -2     4
G           4      3    1      1
H           5.5*   7    -1.5   2.25
I           8      9    -1     1
                               Σdi² = 11.5

*mean of ranks 5 and 6

Substituting into the formula for rs, we find that

$$ r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6(11.5)}{9(81-1)} = 0.9042, $$

which suggests a strong positive correlation between the preliminary scores and the final rankings

of the beauty contestants. This means that the rankings obtained by the beauty contestants based

on their preliminary scores generally agree with their final rankings.
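The ranking procedure in Example 6, including the averaging of tied ranks, can be sketched as follows. The helper names `average_ranks` and `spearman_rho` are illustrative, not from the text; ranks run from 1 (lowest score) upward so that they match the direction of the final ranking, where 1 is the worst.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman_rho(rank_x, rank_y):
    """Return (r_s, sum of squared rank differences) using 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

# Candidates A to I from Example 6.
prelim_scores = [82, 87, 78, 95, 85, 81, 84, 85, 90]
final_ranking = [2, 6, 1, 8, 5, 4, 3, 7, 9]

rank_x = average_ranks(prelim_scores)
rho, sum_d2 = spearman_rho(rank_x, final_ranking)

print(rank_x)
print(sum_d2, round(rho, 4))
```

The ranks printed for the preliminary scores should match the xi column of the solution table, with candidates E and H sharing rank 5.5.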

Correlation versus causation

We end this section by taking note of the possible misuse in the interpretation of the

correlation coefficient. It should be emphasized that the correlation between two variables X and

Y, no matter how strong it is, does not necessarily imply causation between the two variables. A high correlation simply indicates that there is a strong linear association between the two variables, even when no cause-and-effect relationship exists between them. It could be that a third factor is related to both variables and is the actual cause of the changes in each. For example, it is a known fact that, compared to other months of the year, sales increase and at the same time the temperature gets colder during the Christmas season in December. Hence, in this case, there is an indirect (negative) correlation between sales and temperature. But it would be illogical to say that higher sales cause the temperature to go down or that the lower temperature causes sales to go up. Instead, the real reason for the higher sales and lower temperature during this month is the fact that it is the Christmas season.

SIMPLE LINEAR REGRESSION ANALYSIS

In the previous section, we learned that correlation analysis is used to determine if there is a

relationship between two quantitative variables X and Y. In this section, you will learn another technique

of establishing such relationship between X and Y, and that is through regression analysis.

Although the assignment of X and Y is arbitrary between the two quantitative variables

in correlation analysis, this is not the case in regression analysis. Here, the independent or predictor

variable is denoted as X, while the dependent or response variable is denoted as Y. Hence, in regression

analysis, you want to see the effect of X on Y.

Example 8

For the following studies, identify the independent and dependent variables of interest.

(a) The president of a homeowners association wants to predict the monthly association dues

based on the number of cars owned by his homeowners.


(b) An educator wants to investigate the effect of the number of hours of sleep the day before the

exam on the exam score.

(c) A nutritionist wants to determine if weight (in kg) of adolescents depends on their usual food

intake (in calories).

Solution

(a) The independent variable is the number of cars owned, while the dependent variable is the

monthly association dues.

(b) The independent variable is the number of hours of sleep, while the dependent variable is the

exam score.

(c) The independent variable is the usual food intake, while the dependent variable is weight.

The scatter plot is very helpful in visualizing such relationships between X and Y. If a linear

pattern is evident from the scatter plot, we may be interested in obtaining an estimate of that line, and

this is done using regression analysis.

Regression Line

From Figure 1, we observed that the data points tend to follow very closely a straight line with a positive slope. This regression line is drawn in the scatter plot presented below.

Figure 1. Scatter plot of statistics versus algebra grades, with the fitted line Statistics = -6.8181 + 1.0803x

The regression line is used to predict or estimate the expected value of Y (called the dependent

or response variable) corresponding to given values of X (called the independent or explanatory variable).

The functional form of the regression line is given by the simple linear regression model

$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, $$

where Yi is the ith observed value of the dependent variable, Xi is the ith observed value of the independent variable, β0 is the y-intercept or regression constant, β1 is the slope or regression coefficient, and εi is the ith random error associated with Yi for all i = 1, 2, …, n.

The regression parameters or coefficients β0 and β1 are unknown, but we can estimate them using the method of least squares. In this method, the regression line that best fits the data is obtained. With the aid of Calculus, this is done by obtaining the sum of the squared deviations between the actual Yi and its expected value, given by

$$ \sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2, $$

and minimizing it. The least squares estimators of the parameters β1 and β0 are, respectively, given by the following formulas.

Pop-Up!

$$ \hat{\beta}_1 = b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}, \qquad \hat{\beta}_0 = b_0 = \bar{y} - b_1\bar{x} $$

Thus, the best fit or regression line is expressed as

$$ \hat{y} = b_0 + b_1 x. $$

Example 8

For the data on the algebra and statistics grades of a random sample of 12 students given in

Example 1, find the regression line.

Solution

Using the estimator of the slope of the regression line, you have

$$ b_1 = \frac{88{,}963 - \frac{(1{,}031)(1{,}032)}{12}}{88{,}855 - \frac{(1{,}031)^2}{12}} = 1.0803. $$

The estimator of the y-intercept, in turn, is given by

$$ b_0 = \frac{1{,}032}{12} - 1.0803\left(\frac{1{,}031}{12}\right) = -6.8181. $$

Thus, the regression line given in Figure _?_ is

$$ \hat{y} = -6.8181 + 1.0803x. $$

With this equation, we can predict the expected statistics grade of a student who obtained an algebra grade of 90. It is given by

$$ \hat{y} = -6.8181 + 1.0803(90) = 90.4089. $$

The slope and y-intercept of the best fit line are interpreted in a manner similar to the interpretation of the slope and y-intercept of a linear equation. That is, the slope represents the expected amount of change in Y for every one-unit change in X. On the other hand, the y-intercept is the expected value of Y when X = 0, provided the scope of the model includes X = 0.
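The least squares computation for the algebra and statistics grades can be sketched as below. The variable names and the `predict` helper are illustrative. Note that the code keeps b0 and b1 at full precision, so its prediction at x = 90 may differ slightly in the later decimals from a value obtained with coefficients rounded to four places.

```python
# Grades of the 12 students (A to L) from Example 1.
algebra = [82, 87, 78, 93, 95, 87, 80, 85, 85, 86, 90, 83]
stats_grades = [84, 85, 75, 92, 96, 90, 80, 86, 83, 84, 92, 85]

n = len(algebra)
sum_x = sum(algebra)
sum_y = sum(stats_grades)
sum_x2 = sum(x * x for x in algebra)
sum_xy = sum(x * y for x, y in zip(algebra, stats_grades))

# Least squares estimates of the slope and the y-intercept.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

def predict(x):
    """Expected statistics grade for a given algebra grade, using the fitted line."""
    return b0 + b1 * x

print(round(b1, 4), round(b0, 4))
print(round(predict(90), 2))
```

As the text cautions, the fitted line should only be used for algebra grades within the range of the observed data.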


Example 9

Interpret the regression coefficients obtained in Example 8.

Solution

Since the slope of the regression line is 1.0803, the statistics grade is estimated to increase by 1.0803 units, on average, for every one-unit increase in the algebra grade. This estimate applies to algebra grades ranging from 78 to 95, the lowest and highest algebra grades reported by the 12 students, respectively. Since 0 is not within this range, it is meaningless to interpret the y-intercept of −6.8181.

Testing the Significance of β1

In addition to the best fit line that describes the linear relationship between X and Y, you can also make inferences regarding the regression parameters. Inferences concerning β1 are particularly important, since they can determine whether a linear relationship between X and Y indeed exists.

For testing H0: β1 = 0 (i.e., there is no linear relationship between X and Y) against H1: β1 ≠ 0 (i.e., there is a significant linear relationship between X and Y), you can follow the steps in testing for the significance of Pearson’s r discussed earlier. This is because the two hypotheses H0: ρ = 0 and H0: β1 = 0 are equivalent.
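The equivalence of the two tests can be checked numerically. The sketch below uses hypothetical data and assumes that the earlier test for Pearson's r is based on the usual statistic t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom; it compares that value with the slope statistic b1 divided by its standard error.

```python
import math

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.8, 7.2, 8.9, 11.4, 12.7]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Test statistic based on Pearson's r.
r = sp_xy / math.sqrt(ss_x * ss_y)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Test statistic based on the slope: b1 divided by its standard error.
b1 = sp_xy / ss_x
sse = ss_y - b1 * sp_xy                  # residual (error) sum of squares
se_b1 = math.sqrt(sse / (n - 2) / ss_x)  # standard error of b1
t_from_b1 = b1 / se_b1

print(f"t from r  = {t_from_r:.4f}")
print(f"t from b1 = {t_from_b1:.4f}")
```

Both statistics agree up to rounding error, so comparing either one with the same critical value leads to the same decision about H0.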

Diagnostic Checking

Such inference is valid provided that assumptions underlying the simple linear regression model

are satisfied. These assumptions include the following:

(i) εi must be normally distributed.

(ii) εi must have constant variance for all levels of the independent variable.

(iii) εi must be uncorrelated.

(iv) The relationship between X and Y is linear.

Residual analyses, which use graphical tools and statistical tests, are then performed to determine whether these assumptions are satisfied. However, these are beyond the scope of this book.

Measure of Model Adequacy

The coefficient of simple determination $R^2$, also known as the measure of goodness-of-fit, discussed in correlation analysis is likewise computed to further assess the usefulness of the simple linear regression model for prediction purposes. The formula for $R^2$ is given by

$$R^2 = \frac{b_1 \, SP_{xy}}{SS_y}.$$

$R^2$ measures the proportion of the total variation in Y that is explained by the simple linear regression model that utilizes X. It has values between 0 and 1. The larger the $R^2$, the more of the total variation of Y is explained by X.
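As a rough illustration, the formula can be evaluated in Python. The data below are hypothetical, and the sketch assumes that SPxy and SSy denote the corrected sum of cross products and the corrected sum of squares of Y. It also checks that the result equals the square of Pearson's r.

```python
import math

# Hypothetical data, for illustration only.
x = [2, 4, 5, 7, 8, 10]
y = [12, 15, 19, 22, 24, 31]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)    # corrected sum of squares of x
ss_y = sum((yi - y_bar) ** 2 for yi in y)    # corrected sum of squares of y
sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sp_xy / ss_x                  # least squares slope
r_squared = b1 * sp_xy / ss_y      # coefficient of determination
r = sp_xy / math.sqrt(ss_x * ss_y) # Pearson's r

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

Because b1 = SPxy/SSx, the formula reduces to SPxy²/(SSx · SSy), which is why it coincides with r².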

Example 10

Using the given data on algebra grades and statistics grades, interpret the $R^2$ of 88.4% computed in Example 2.

Solution

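Since 88.4% is near 100%, the regression line is a good fit for the data on algebra and statistics grades: about 88.4% of the total variation in the statistics grades is explained by the linear relationship with the algebra grades.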

Chapter Performance Tasks

Modelling Real Life Relationships

Collect real life data involving at least two quantitative variables and two ranked variables.

Examples of real-life data that you may gather are as follows:

1. measurements of body parts such as height, arm span, weight, and kneeling height

2. first grading period and second grading period grades, English and math grade, science and

math grade, etc.

3. daily allowance and math grade

4. daily allowance and daily expenses

Draw a scatter plot, compute Pearson’s r, set up the simple linear regression model (SLRM) from your data, and draw inferences.

Possible data for Spearman’s rho:

1. the students’ rankings in the first grading period and the second grading period

2. a common list of items, such as hobbies, ranked by two groups (say, boys versus girls) based on their preference

3. a common list of collegiate degree programs ranked by two groups based on their preference

4. ranks of the top 10 students in the last grading period and their daily allowance

Rank the values of the variables, compute Spearman’s rho and interpret.
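The ranking step and the computation of Spearman's rho can be sketched in Python as follows. The data are hypothetical, the helper names `ranks` and `spearman_rho` are illustrative, and the sketch assumes the usual formula rs = 1 − 6Σd²/(n(n² − 1)), which applies only when there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rho from the squared differences of paired ranks (no ties)."""
    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical daily allowances (in pesos) and grades of five students.
allowance = [50, 80, 65, 100, 70]
grade = [85, 88, 90, 95, 86]

rho = spearman_rho(allowance, grade)
print(f"ranks of allowance: {ranks(allowance)}")
print(f"ranks of grade:     {ranks(grade)}")
print(f"Spearman's rho = {rho:.2f}")
```

If the data contain ties, the tied values would need averaged ranks, which this sketch does not handle.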

Statistics Links

Pearson’s r was named after Karl Pearson (27 March 1857 – 27

April 1936), an English mathematician and biostatistician. His

other contributions to classical statistical methods include

method of moments, Pearson’s chi-squared test, and principal

components analysis.

Key Concepts / Terms

Linear correlation

Perfect correlation

Positive correlation

Negative correlation

Zero correlation

Pearson's r

Spearman's rho

Coefficient of determination

Chapter Assessment

1. The value of the correlation coefficient r, as well as rs, is always between

A. −2 and 2 B. −1 and 1 C. 0 and 1 D. 0 and 100

2. The coefficient of determination $r^2$ could assume values ranging from

A. −2 and 2 B. −1 and 1 C. 0 and 1 D. 0 and 100

3. Which of the following statements is true?

A. A perfect correlation between the variables X and Y implies a cause and effect

relationship between these two variables.

B. A positive correlation indicates a strong linear relationship.

C. A negative correlation indicates that there is no linear relationship between the two

variables.

D. A near-zero correlation between X and Y suggests a very weak linear relationship.

4. Which of the following indicates a strong, but not perfect, linear relationship between X and

Y?

A. 1 B. 0.92 C. 0.03 D. 0.57

5. Which of the following indicates a perfect linear relationship between X and Y?

A. 1 B. 0.92 C. 0.03 D. 0.57

6. Which of the following indicates a moderate and direct linear relationship between X and Y?

A. 1 B. 0.92 C. 0.03 D. 0.57

7. Which of the following indicates that there is no linear relationship between X and Y?

A. 1 B. 0.92 C. 0.03 D. 0.57

8. A negative correlation between two variables X and Y suggests that

A. there is no correlation between X and Y.

B. small values of X are associated with small values of Y.

C. large values of X are associated with small values of Y.

D. the predicted value of the dependent variable Y is always negative.

9. Given the regression equation $\hat{y} = -3(1 + x)$, the corresponding Pearson’s r is

A. 3 B. 1

C. 3 D. cannot be determined

10. If the Pearson’s r has a value of 1, then the slope of the corresponding linear regression

equation is

A. 1 B. 1 C. negative D. positive

11. In the estimated regression model $\hat{y} = b_0 + b_1 x$, which quantity gives the slope of the regression line?

a) $b_0$ c) $y$

b) $b_1$ d) $x$

12. What does $b_1$ represent in the regression model $\hat{y} = b_0 + b_1 x$?

a) Value of y when x=0.

b) Value of x when y=0.

c) Increase in the value of x for a unit increase in y.

d) Increase in the value of y for a unit increase in x.

13. What does $b_0$ represent in the regression model $\hat{y} = b_0 + b_1 x$?

a) Value of y when x=0.

b) Value of x when y=0.

c) Increase in the value of x for a unit increase in y.

d) Increase in the value of y for a unit increase in x.

14. Which of the following quantities always has the same sign as r?

a) $b_0$ c) $y$

b) $b_1$ d) $x$

For numbers 15-17. Given the following data:

X 1 2 3 4 5

Y 10 13 14 15 17

15. What is the estimated regression line?

A. $\hat{y} = 9 + 1.6x$

B. $\hat{y} = 1.6 + 9x$

C. $\hat{y} = 5 + 2.5x$

D. $\hat{y} = 2.5 + 5x$

16. What is the predicted value of y when x=2.5?

A. 13

B. 15

C. 20

D. 12

17. Which of the following statements about x and y is true?

A. As x increases, y increases.

B. As x decreases, y increases.

C. As x increases, y decreases.

D. The relationship of x and y cannot be determined.

For nos. 18-20. Given the following data.

X 1 2 3 4 5

Y 30 28 25 22 17

18. What is the estimated regression line?

A. $\hat{y} = 34 + 3.2x$

B. $\hat{y} = 34 - 3.2x$

C. $\hat{y} = 3.2 + 34x$

D. $\hat{y} = 3.2 - 34x$

19. Which of the following statements is true?

A. For every unit increase in x, there is a 3.2 increase in y.

B. For every unit increase in x, there is a 3.2 decrease in y.

C. For every unit increase in x, there is a 34 increase in y.

D. For every unit increase in x, there is a 34 decrease in y.

20. What is the predicted value of y when x=5?

A. 18

B. 20

C. 38

D. 50

Chapter Workout

1. Interpret each of the following correlation coefficients between two variables X and Y.

(a) 0.24 (b) 0.92 (c) 1 (d) 0.85 (e) 0.15

2. Identify the type of correlation that exists between each of the following pairs of variables.

(a) quiz scores and final grade

(b) market value and age of a piece of equipment

(c) incidence of lung cancer and smoking level

(d) IQ and height

(e) height and weight

3. The following table shows the number of hours spent studying and the grade obtained by a student in each of his 6 subjects during the last grading period.

No. of hours 1.5 2.0 4.0 2.0 3.0 2.5

Grade 85 85 95 87 90 92

a) Compute and interpret Pearson's r.

b) Find the estimated regression line.

4. The following table shows the rankings given by two judges to the eight entries in a poster-

making contest.

Entry # 1 2 3 4 5 6 7 8

Judge A 2 1 6 8 3 5 7 4

Judge B 1 3 4 6 2 7 5 8

Use Spearman's rho, rs to determine if the two judges agree on the rankings that they gave to

the entries.

5. Identify the nature of the correlation between the two numerical variables X and Y in each of

the following scatter plots.

[Figure: scatter plots (a), (b), and (c)]

6. Given the following data:

x 4 5 2 2 6 3 9 8

y 1.5 1.7 2.0 1.9 1.3 1.8 1.2 1.0

A. Compute and interpret Pearson's r.

B. What proportion of the total variability in Y can be explained by the linear relationship with

X?

7. The following data give the IQ and shoe size of a sample of 10 students.

Student A B C D E F G H I J

IQ 105 100 100 90 95 95 110 95 100 105

Shoe size 7 9 7.5 8 8 8.5 10 11 8.5 9

Compute Pearson's r and interpret.

8. To determine if there is an association between the price and the quality rating of a certain

household appliance, the following data on the prices (in pesos) and the quality ratings (1-worst

to 7-best) of seven brands of the household appliance were recorded.

Brand A B C D E F G

Price 1,780 1,500 1,500 2,000 1,200 1,800 2,100

Quality 7 4 3 6 1 2 5

Compute Spearman's rho, rs and interpret.

9. The following are the data for a random sample of 8 households on the number of members (X) and the daily expenditure on food (Y).

Household 1 2 3 4 5 6 7 8

X 5 3 10 4 4 5 7 6

Y 150 120 300 180 200 200 240 230

A. Find the estimated regression line.

B. Find the estimated value of daily expenditure on food when the number of members is 5.

10. Suppose a construction company keeps a record of the number of workers (X) and the number of working days needed to finish a 100 sq m two-storey house (Y).

X 15 13 10 14 12 10

Y 94 108 128 100 110 120

A. Find the estimated regression line.

B. Interpret $b_1$.