3.1 - 1
3.1 - 2
Overview
4.1 Tables and Graphs for the Relationship Between Two Variables
4.2 Introduction to Correlation
4.3 Introduction to Regression
3.1 - 3
4.1 Tables and Graphs for the Relationship Between Two Variables
Objectives:
By the end of this section, I will be able to…
1) Construct and interpret crosstabulations for two categorical variables.
2) Construct and interpret clustered bar graphs for two categorical variables.
3) Construct and interpret scatterplots for two quantitative variables.
3.1 - 4
Crosstabulations
Tabular method for simultaneously summarizing the data for two categorical (qualitative) variables
Constructing a Crosstabulation
Step 1
Put the categories of one variable at the top of each column, and the categories of the other variable at the beginning of each row.
3.1 - 5
Crosstabulations
Steps for Constructing a Crosstabulation
Step 2
For each row and column combination, enter the number of observations that fall in the two categories.
Step 3
The bottom of the table gives the column totals, and the right-hand column gives the row totals.
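The three construction steps can be sketched in Python. This is only an illustration: the responses below are hypothetical (they are not the data in Table 4.1), and the helper name `crosstab` is an assumption of this sketch rather than anything from the course materials.

```python
from collections import Counter

def crosstab(pairs):
    """Build a crosstabulation from (row_category, column_category) pairs."""
    counts = Counter(pairs)                  # Step 2: count each row/column combination
    rows = sorted({r for r, _ in pairs})     # Step 1: categories listed down the rows
    cols = sorted({c for _, c in pairs})     # Step 1: categories listed across the columns
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
    # Step 3: the right-hand "Total" column holds the row totals.
    for r in rows:
        table[r]["Total"] = sum(table[r][c] for c in cols)
    # Step 3: the bottom "Total" row holds the column totals.
    table["Total"] = {c: sum(table[r][c] for r in rows) for c in cols + ["Total"]}
    return table

# Hypothetical (career, gender) responses, for illustration only.
responses = [
    ("Doctor", "Female"), ("Doctor", "Female"), ("Doctor", "Male"),
    ("Teacher", "Female"), ("Teacher", "Male"),
    ("Military Officer", "Male"), ("Military Officer", "Male"),
]

table = crosstab(responses)
for row_name, row in table.items():
    print(row_name, row)
```

The "Total" column and "Total" row produced here are the frequency distributions of the row variable and the column variable, respectively.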
3.1 - 6
Table 4.1 Prestigious career survey data set
3.1 - 7
Example 4.1 - Crosstabulation of the prestigious career survey
Construct a crosstabulation of career
and gender.
3.1 - 8
Example 4.1 continued
Solution
Step 1
The crosstabulation is given in Table 4.2.
Categories for the variable gender are at the top.
Categories for the variable career are on the left.
Each student in the sample is associated with a certain cell.
For example, a male student who reported Military Officer appears as one of the four students in the “Male” column and the “Military Officer” row.
3.1 - 9
Example 4.1 continued
Step 2
For each row and column combination, enter the number of observations that fall in the
two categories.
This is shown in Table 4.2.
Step 3
The “Total” column contains the sum of the counts of the cells in each row (category) of the career variable.
This sum represents the frequency distribution for this variable.
3.1 - 10
Example 4.1 continued
Step 3 - continued
The “Total” row contains the sum of the counts of the cells in each column (category) of the gender variable.
This sum represents the frequency distribution for this variable.
Thus, we see that crosstabulations contain the frequency distributions of each of the two variables.
3.1 - 11
Example 4.1 continued
Table 4.2 Crosstabulation of prestigious career survey data set
3.1 - 12
Example 4.1 continued
Use the crosstabulation to look for patterns
For example, does there appear to be a difference between males' and females' responses?
Most of the students who responded “Doctor” were females, and most of the students who responded “Military Officer” were males.
3.1 - 13
Clustered Bar Graphs
Useful for comparing two categorical variables
Used in conjunction with crosstabulations
Each set of bars in a clustered bar graph represents a single category of one variable
One can construct clustered bar graphs using either frequencies or relative frequencies
3.1 - 14
Scatterplots
Used to summarize the relationship between two quantitative variables that have been
measured on the same element
Graph of points (x, y) each of which represents one observation from the data set
One of the variables is measured along the horizontal axis and is called the x variable
The other variable is measured along the vertical axis and is called the y variable
3.1 - 15
Predictor Variable and Response Variable
The value of the x variable can be used to predict or estimate the value of the y variable.
The x variable is referred to as the predictor variable.
The y variable is called the response variable.
3.1 - 16
Scatterplot Terminology
Note the terminology in the caption to Figure 4.2.
When describing a scatterplot, always indicate the y variable first and use the term versus (vs.) or against the x variable.
This terminology reinforces the notion that the y variable depends on the x variable.
3.1 - 17
FIGURE 4.2
Scatterplot of sales price versus square footage.
3.1 - 18
Positive relationship
As the x variable increases in value, the
y variable also tends to increase.
FIGURE 4.3 (a) Scatterplot of a positive relationship
3.1 - 19
Negative relationship
As the x variable increases in value, the y variable tends to decrease
FIGURE 4.3 (b) scatterplot of a negative relationship
3.1 - 20
No apparent relationship
As the x variable increases in value, the y variable tends to remain unchanged
FIGURE 4.3 (c) scatterplot of no apparent relationship.
3.1 - 21
Example 4.4 - Relationship between lot size and price in Glen Ellyn, Illinois
Using Figure 4.2, investigate the relationship
between lot square footage and lot price.
3.1 - 22
Example 4.4 continued
Solution
The scatterplot in Figure 4.2 most resembles Figure 4.3a, where a positive relationship
exists between the variables. Thus, smaller lot sizes tend to be associated
with lower prices, and larger lot sizes tend to be associated with higher prices.
Put another way, as the lot size increases, the lot price tends to increase as well.
3.1 - 23
Summary
Crosstabulation summarizes the relationship between two categorical variables.
A crosstabulation is a table that gives the counts for each row-column combination, with totals for the rows and columns.
Clustered bar graphs are useful for comparing two categorical variables and are often used in conjunction with crosstabulations.
For two numerical variables, scatterplots summarize the relationship by plotting all the (x, y) points.
3.1 - 24
4.2 Introduction to Correlation
Objective:
By the end of this section, I will be
able to…
1) Calculate and interpret the value of the correlation coefficient.
3.1 - 25
Correlation Coefficient r
Measures the strength and direction of the linear relationship between two variables.
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$
where $s_x$ is the sample standard deviation of the x data values and $s_y$ is the sample standard deviation of the y data values.
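The definition of r can be sketched in a few lines of Python. The data below are hypothetical and chosen so that the answer is obvious (one perfectly increasing and one perfectly decreasing linear pattern). The function names are assumptions of this sketch.

```python
import math

def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def correlation(xs, ys):
    """Correlation coefficient r: sum of cross-deviations over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * sample_std(xs) * sample_std(ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]     # rises exactly linearly with x
ys_down = [10, 8, 6, 4, 2]   # falls exactly linearly with x

r_up = correlation(xs, ys_up)
r_down = correlation(xs, ys_down)
print(round(r_up, 4), round(r_down, 4))
```

A perfectly increasing linear pattern gives r = 1 and a perfectly decreasing one gives r = -1 (up to floating-point rounding). Real data fall somewhere in between.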
3.1 - 26
Example
Page 183-184
3.1 - 27
Example
3.1 - 28
Example
Positively correlated, negatively correlated, not correlated?
3.1 - 29
Example
ANSWER: positively correlated
3.1 - 30
Example
3.1 - 31
Example
do 20(b) first:
20(a)
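$$\sum x = 2553 \qquad \sum y = 2583$$
$$\bar{x} = \frac{\sum x}{n} = \frac{2553}{5} = 510.6 \qquad \bar{y} = \frac{\sum y}{n} = \frac{2583}{5} = 516.6$$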
3.1 - 32
Example
20(c)
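$$s_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{481.2}{4}} = 10.97$$
$$s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} = \sqrt{\frac{113.2}{4}} = 5.32$$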
3.1 - 33
Example
20(d)
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y} = \frac{205.2}{(5 - 1)(10.968)(5.320)} = 0.8792$$
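The arithmetic in 20(c) and 20(d) can be re-checked with a short script. It uses only the summary quantities that appear in the worked solution (n = 5, sums of squared deviations 481.2 and 113.2, and cross-product sum 205.2). The variable names are this sketch's own.

```python
import math

n = 5               # number of (x, y) pairs in the example
sum_sq_x = 481.2    # sum of (x - x_bar)^2, from 20(c)
sum_sq_y = 113.2    # sum of (y - y_bar)^2, from 20(c)
sum_cross = 205.2   # sum of (x - x_bar)(y - y_bar), from 20(d)

# Sample standard deviations divide by n - 1.
s_x = math.sqrt(sum_sq_x / (n - 1))
s_y = math.sqrt(sum_sq_y / (n - 1))

# Correlation coefficient from the definitional formula.
r = sum_cross / ((n - 1) * s_x * s_y)

print(f"s_x = {s_x:.3f}, s_y = {s_y:.3f}, r = {r:.4f}")
```

The printed values agree with the slide's 10.968, 5.320, and 0.8792.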
3.1 - 34
Equivalent Computational Formula for Calculating the Correlation Coefficient r
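$$r = \frac{\sum xy - \left(\sum x \sum y\right)/n}{\sqrt{\sum x^2 - \left(\sum x\right)^2/n}\; \sqrt{\sum y^2 - \left(\sum y\right)^2/n}}$$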
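As a sketch of how the computational formula is used, the function below needs only the five running sums (of x, y, xy, x squared, and y squared) and never computes a deviation from the mean. The three data points are hypothetical, and the function name is an assumption of this example.

```python
import math

def correlation_computational(xs, ys):
    """Correlation coefficient r from the computational (sums-only) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = (math.sqrt(sum_x2 - sum_x ** 2 / n)
                   * math.sqrt(sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Hypothetical data, for illustration only.
xs = [1, 2, 3]
ys = [1, 3, 2]

r = correlation_computational(xs, ys)
print(round(r, 4))
```

For these points the sums are 6, 6, 13, 14, and 14, so the numerator is 13 - 36/3 = 1 and the denominator is 2, giving r = 0.5. The definitional formula gives the same value.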
3.1 - 35
Calculate r Directly Using Calculator
Option 1
• Enter the data in two lists.
• Press STAT and select TESTS
• LinRegTTest is option F (scroll arrow up 3 places)
• Enter the names of the lists from step 1.
• Arrow down to Calculate and then press Enter
• The r value is the last value displayed; round this value to three decimal places
3.1 - 36
Calculate r Directly Using Calculator
Option 2 (page 193)
3.1 - 37
Example
Page 184 20(e)
Use the calculator to verify the answer in 20(d)
3.1 - 38
Interpreting the Correlation Coefficient r
1) Values of r close to 1 indicate a positive
relationship between the two variables.
The variables are said to be positively
correlated.
As x increases, y tends to increase as well.
3.1 - 39
Interpreting the Correlation Coefficient r
2) Values of r close to -1 indicate a negative relationship between the two variables.
The variables are said to be negatively correlated.
As x increases, y tends to decrease.
3.1 - 40
Interpreting the Correlation Coefficient r
3) Other values of r indicate the lack of either a positive or negative linear relationship between the two variables.
The variables are said to be uncorrelated
As x increases, y tends to neither increase nor decrease linearly.
3.1 - 41
Guidelines for Interpreting the Correlation Coefficient r
If the correlation coefficient between two variables is
greater than 0.7, the variables are positively correlated.
between 0.33 and 0.7, the variables are mildly positively correlated.
between –0.33 and 0.33, the variables are not correlated.
between –0.7 and –0.33, the variables are mildly negatively correlated.
less than –0.7, the variables are negatively correlated.
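These guidelines amount to a simple lookup, sketched below. The guidelines do not say which label applies exactly at 0.33 or 0.7, so the boundary handling here is an assumption of this sketch (exactly ±0.33 is treated as "not correlated", exactly ±0.7 as "mildly" correlated). The function name is hypothetical.

```python
def interpret_r(r):
    """Label a correlation coefficient using the rule-of-thumb cutoffs 0.33 and 0.7."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r > 0.7:
        return "positively correlated"
    if r > 0.33:
        return "mildly positively correlated"
    if r >= -0.33:   # assumption: the endpoints +/-0.33 count as "not correlated"
        return "not correlated"
    if r >= -0.7:    # assumption: exactly -0.7 counts as "mildly negatively correlated"
        return "mildly negatively correlated"
    return "negatively correlated"

for value in (0.8792, 0.5, 0.0, -0.5, -0.9):
    print(value, "->", interpret_r(value))
```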
3.1 - 42
Example
Page 184
3.1 - 43
continued
Solution
We found the correlation coefficient for the relationship between SAT I verbal and math scores to be r = 0.8792.
r = 0.8792 is close to 1.
We would therefore say that SAT I verbal and math scores are strongly positively correlated.
As verbal score increases, math score also tends to increase.
3.1 - 44
Example
ANSWER:
3.1 - 45
Example
Page 184
3.1 - 46
Example
Page 184
3.1 - 47
Example
ANSWER: positive
3.1 - 48
Example
Page 184
3.1 - 49
Example
ANSWER: somewhere in the middle
3.1 - 50
Example
Page 184
3.1 - 51
Example
ANSWER:
3.1 - 52
Common Error Interpreting Correlation
Correlation does not imply causality.
EXAMPLE: Umbrella sales are negatively correlated with attendance at baseball games in outdoor stadiums (that is, as umbrella sales increase, attendance at baseball games in outdoor stadiums tends to decrease). It is not correct to conclude that increased umbrella sales cause a decrease in attendance. Both are probably driven by a hidden variable: rainfall.
3.1 - 53
Summary
Section 4.2 introduces the correlation
coefficient r, a measure of the strength of linear association between two numeric variables.
Values of r close to 1 indicate that the variables are positively correlated.
Values of r close to –1 indicate that the variables are negatively correlated.
Values of r close to 0 indicate that the variables are not correlated.
3.1 - 54
4.3 Introduction to Regression
Objectives:
By the end of this section, I will be
able to…
1) Calculate the value and understand the meaning of the slope and the y intercept of the regression line.
2) Predict values of y for given values of x.
3.1 - 55
Interpreting the Slope of a Line
For a line with equation
$$y = mx + b$$
we interpret a nonzero slope m as follows: y increases (if m is positive) or decreases (if m is negative) by |m| units for every one-unit increase in x.
3.1 - 56
Interpreting the y-intercept of a Line
For a line with equation
$$y = mx + b$$
we interpret the y-intercept b as follows: the y value is b when the x value is 0.
3.1 - 57
Equation of the Regression Line
Approximates the relationship between two random variables x and y
The equation is
$$\hat{y} = b_0 + b_1 x$$
where the regression coefficients are the slope, b1, and the y-intercept, b0.
The “hat” over the y (pronounced “y-hat”) indicates that this is an estimate of y and not necessarily an actual value of y.
3.1 - 58
Relationship Between Slope and Correlation Coefficient
The slope b1 of the regression line and the correlation coefficient r always have the same sign.
b1 is positive if and only if r is positive.
b1 is negative if and only if r is negative.
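One way to see why the signs must agree is the short derivation below. It assumes the usual least-squares expression for the slope and uses the definition of r together with the fact that the sum of squared x deviations equals (n - 1) times the squared sample standard deviation of x.

$$
b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}
    = \frac{r\,(n - 1)\, s_x s_y}{(n - 1)\, s_x^2}
    = r\,\frac{s_y}{s_x}
$$

Because the sample standard deviations $s_x$ and $s_y$ are both positive, b1 is r multiplied by a positive number, so the two always share the same sign.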
3.1 - 59
Regression coefficients b0 and b1
All of the quantities needed to calculate b0 and b1 have already been computed in the formula for r.
Numerators for b1 and r are exactly the same.
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \qquad b_0 = \bar{y} - b_1 \bar{x}$$
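A minimal Python sketch of these two formulas follows. The data are hypothetical and lie exactly on the line y = 1 + 2x, so the expected coefficients are known in advance. The function name is an assumption of this example.

```python
def fit_regression_line(xs, ys):
    """Return (b0, b1): the y-intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Same numerator as in the formula for r.
    sum_cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum_cross / sum_sq_x    # slope
    b0 = y_bar - b1 * x_bar      # y-intercept
    return b0, b1

# Hypothetical data lying exactly on y = 1 + 2x, for illustration only.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

b0, b1 = fit_regression_line(xs, ys)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```

The slope is computed first because the intercept formula depends on it.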
3.1 - 60
Example
Page 194, problems 10-12 use this table
3.1 - 61
Example
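$$\sum x = 2553 \qquad \sum y = 2583$$
$$\bar{x} = \frac{\sum x}{n} = \frac{2553}{5} = 510.6 \qquad \bar{y} = \frac{\sum y}{n} = \frac{2583}{5} = 516.6$$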
3.1 - 62
Example
Page 194, problem 10(a)
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{205.2}{481.2} = 0.426$$
3.1 - 63
Example
Page 194, problem 10(b)
$$b_0 = \bar{y} - b_1 \bar{x} = 516.6 - 0.426(510.6) = 299.084$$
3.1 - 64
Example
Page 194, problem 10(c)
$$\hat{y} = b_0 + b_1 x$$
$$\hat{y} = 299.084 + 0.426x$$
3.1 - 65
Calculator
1. Enter the data in two lists.
2. Make a scatter plot of the data (use 2nd Y= to get STAT PLOT, choose Plot1 On, first scatterplot icon, then zoom 9)
(your plot will look different)
3.1 - 66
Calculator
3. Plot the regression line. Choose:
STAT → CALC #4 LinReg(ax+b)
4. Include the parameters L1, L2, Y1.
NOTE: Y1 comes from VARS → YVARS, #Function, Y1
3.1 - 67
Calculator
5. Choose Y= and the equation for the regression line will be stored in Y1. Then choose GRAPH and the regression line will be plotted.
3.1 - 68
Calculator
6. Choose TRACE and you can see X and Y values on scatterplot or regression line
3.1 - 69
Example
Page 194, problem 11(a)
A slope of 0.426 means that the estimated SAT I Math score increases by 0.426 points for every increase of 1 point in the SAT I verbal score.
3.1 - 70
Example
Page 194, problem 11(b)
The y-intercept of 299.084 means that the estimated SAT I Math score is 299.084 when the SAT I Verbal score is 0.
3.1 - 71
Example
Page 194, problem 12(a)
Find ŷ when x = 500:
$$\hat{y} = 299.084 + 0.426(500) = 512.084 \approx 512.1$$
3.1 - 72
Example
Page 194, problem 12(b)
Find ŷ when x = 510:
$$\hat{y} = 299.084 + 0.426(510) = 516.344 \approx 516.3$$
3.1 - 73
Example
Page 194, problem 12(c)
Find ŷ when x = 490:
The x values in the data set range from 497 to 522. Since 490 is not in the range of the x values in the data set, it is not appropriate to use the regression equation in this case.
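The three parts of problem 12 can be sketched as a prediction function that refuses to extrapolate. The coefficients (299.084 and 0.426) and the x range (497 to 522) are taken from the worked example. The function name and the decision to raise an error outside the range are assumptions of this sketch.

```python
B0 = 299.084              # y-intercept from problem 10(b)
B1 = 0.426                # slope from problem 10(a)
X_MIN, X_MAX = 497, 522   # range of x values stated in the example

def predict_math_score(verbal):
    """Estimate the SAT I Math score from the Verbal score, within the data range only."""
    if not X_MIN <= verbal <= X_MAX:
        raise ValueError(
            f"x = {verbal} is outside the observed range {X_MIN} to {X_MAX}"
        )
    return B0 + B1 * verbal

for x in (500, 510, 490):
    try:
        print(x, round(predict_math_score(x), 3))
    except ValueError as err:
        print(x, "not predicted:", err)
```

For x = 500 and x = 510 the function returns 512.084 and 516.344. For x = 490 it raises an error, because the regression line is only supported by data between 497 and 522.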
3.1 - 74
Summary
Section 4.3 introduces regression, where the linear relationship between two numerical variables is approximated using a straight line, called the regression line.
The equation of the regression line is written as $\hat{y} = b_0 + b_1 x$, where the regression coefficients are the y intercept, b0, and the slope, b1.
3.1 - 75
Summary
The regression equation can be used
to make predictions about values of y
for particular values of x.