Midterm 1 Review
Dr. Joseph Brennan
Math 148, BU
Dr. Joseph Brennan (Math 148, BU) Midterm 1 Review 1 / 30
Chapter 8 & 9
Two quantitative variables measured on the same individual are often studied simultaneously.
Correlation Analysis
The establishment of an association (or correlation) between two variables and the assessment of its strength.
Independent Variable
The variable that is expected to influence the other variable.
Also known as the explanatory variable.
Dependent Variable
The variable that is expected to be influenced by the other variable.
Also known as the response variable.
Chapter 8 & 9
Variables with a linear relationship have two classifications:
Positive Association: both variables increase or decrease together.
Negative Association: when one of the variables increases, the other decreases.
NOTE: The direction of association follows from the slope of an imaginary line plotted through the points.
A positive association represents a positive slope.
A negative association represents a negative slope.
The Correlation Coefficient
The correlation coefficient, r, is a descriptive statistic which measures the direction and strength of the linear relationship between two quantitative variables.
Suppose that we have data on variables x and y for n individuals.
Let the mean of the x values be $\bar{x}$ and let the mean of the y values be $\bar{y}$.
Let the standard deviation of the x values be $s_x$ and let the standard deviation of the y values be $s_y$.
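The sample correlation coefficient r between x and y is computed as
$$ r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\cdot z_{y,i} \qquad (1) $$
where $z_{x,i} = \frac{x_i-\bar{x}}{s_x}$ and $z_{y,i} = \frac{y_i-\bar{y}}{s_y}$ are the z-scores for $x_i$ and $y_i$, respectively.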
Properties of the Correlation Coefficient
(1) The sign of the correlation coefficient r indicates the direction of the relationship between the variables.
(2) The correlation coefficient is just a number; it has no units of measurement.
(3) The correlation r is always a number between −1 and 1. The closer r is to 1 or −1, the stronger the linear association between x and y.
(4) Correlation only measures the strength of a LINEAR relationship between two variables. Correlation DOES NOT describe curved relationships between variables, no matter how strong they are!
(5) The correlation coefficient is NOT resistant to outliers.
(6) The correlation coefficient r is symmetric.
Interpreting the Correlation Coefficient
Conventional interpretations of the computed (absolute) values of r are given in the following table.
Value of |r|    Interpretation
0.0 - 0.2       Very weak to negligible correlation
0.2 - 0.4       Weak, low correlation (not very significant)
0.4 - 0.7       Moderate correlation
0.7 - 0.9       Strong, high correlation
0.9 - 1.0       Very strong correlation
Correlation Coefficient: CAUTION
The correlation coefficient only measures the strength of LINEAR relationships.
CORRELATION ≠ CAUSATION
To guess r accurately by eye, construct a scatter diagram in which one vertical standard deviation covers the same distance on the page as one horizontal standard deviation.
A coefficient r = 0.80 does NOT mean that 80% of the points are tightly clustered around a line, NOR does it indicate twice as much linearity as r = 0.40.
iClicker Question
Age 43 21 25 42 57 59
Glucose Level 99 65 79 75 87 81
If Age has a standard deviation of 14.4 and a mean of 41.2, and Glucose Level has a standard deviation of 10.5 and a mean of 81, what is the sample correlation coefficient between the variables Age and Glucose Level?
A) −1
B) −0.5
C) 0
D) 0.5
E) 1
iClicker Solution
X Y Zx Zy Zx · Zy
43 99 0.13 1.7 0.22
21 65 -1.4 -1.52 2.13
25 79 -1.13 -0.19 0.22
42 75 0.06 -0.57 -0.03
57 87 1.1 0.57 0.63
59 81 1.24 0 0
The average of the final column is r = 0.53
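The table above can be checked numerically. The following is a minimal Python sketch, assuming the standard deviations use the divisor n (which is what reproduces the quoted values 14.4 and 10.5); the helper names `mean`, `sd`, and `correlation` are illustrative and not part of the course material.

```python
import math

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sd(values):
    """Standard deviation with divisor n (assumption: matches equation (1))."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    """Average of the products of the paired z-scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sd(xs), sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / len(products)

r = correlation(age, glucose)
print(round(r, 2))  # 0.53
```

Swapping the arguments, `correlation(glucose, age)`, gives the same value, which illustrates the symmetry property of r.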
Chapter 10
Regression Analysis: The creation of a mathematical model or formula that relates the values of one variable to the values of the other.
Assume that y and x are the dependent and independent variables of a study. Denote:
$\hat{y}$ to be the predicted (by regression) value of y for a given x,
r to be the correlation coefficient between x and y,
$\bar{y}$ and $s_y$ the average and standard deviation for the dependent (response) variable y,
$\bar{x}$ and $s_x$ the average and standard deviation for the independent (explanatory) variable x.
Regression Line: Regressing y on x
$$ \hat{y} = \bar{y} + r \cdot s_y \cdot \frac{x-\bar{x}}{s_x} $$
$$ \hat{y} = \bar{y} + r \cdot s_y \cdot z_x $$
$$ z_{\hat{y}} = r \cdot z_x $$
Chapter 10
The best use of the regression line is to estimate the AVERAGE value of y for a given value of x.
The correlation coefficient r measures the amount of scattering of points about the regression line.
The closer r is to −1 or 1, the closer the points are to the regression line, and the greater the success of regression in explaining individual responses y for given values of x.
The expression for the slope, $m = r \frac{s_y}{s_x}$, implies that when x changes by one standard deviation ($s_x$), the prediction $\hat{y}$ changes by r standard deviations ($r \cdot s_y$).
Because −1 ≤ r ≤ 1, the change in $\hat{y}$, measured in standard deviations, is less than or equal in size to the change in x. As the correlation between x and y weakens, the prediction $\hat{y}$ changes more slowly in response to changes in x. This effect is called attenuation.
Chapter 10
Regression Effect describes the tendency of individuals with extreme values to retest closer to the mean.
On average, the top group will score lower on a second experiment and the bottom group will score higher.
The Regression Fallacy is the error of conjecturing a cause for extreme values becoming more average, when the change is due to the regression effect alone.
Notes on Linear Regression
The least squares regression line always passes through the center of the data: the point $(\bar{x}, \bar{y})$ consisting of the averages.
By swapping dependent and independent status for the variables, a second regression line can be found.
We have seen that the correlation coefficient r is symmetric: if we switch the axes, we will get the same correlation.
This is not true for regression. When switching the roles of x and y, you get a different regression equation: the slope becomes $r \cdot s_x / s_y$ rather than $r \cdot s_y / s_x$, and $\bar{x}$ and $\bar{y}$ swap roles, and there is not necessarily an equality between $\bar{x}$ and $\bar{y}$ or between $s_x$ and $s_y$.
The equations for y regressed on x and for x regressed on y are generally different!
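As an illustration of the two regression lines, here is a small Python sketch using the summary statistics from the Age and Glucose Level example (r = 0.53, means 41.2 and 81, standard deviations 14.4 and 10.5). The function name `regression_line` is a hypothetical helper introduced only for this sketch.

```python
def regression_line(r, mean_x, sd_x, mean_y, sd_y):
    """Return (slope, intercept) of the line for y regressed on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

r = 0.53
age_mean, age_sd = 41.2, 14.4
glu_mean, glu_sd = 81.0, 10.5

# Glucose (y) regressed on age (x).
slope_g, intercept_g = regression_line(r, age_mean, age_sd, glu_mean, glu_sd)

# Age (y) regressed on glucose (x): same r, roles of the variables swapped.
slope_a, intercept_a = regression_line(r, glu_mean, glu_sd, age_mean, age_sd)

print(round(slope_g, 2), round(slope_a, 2))
```

If both equations described the same line, one slope would be the reciprocal of the other and their product would be 1. Here the product of the two slopes equals r², which is 1 only when r = ±1.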
Football-Shaped Clustering
The data in a scatter plot for variables with a linear correlation clusters ina football (elliptical) shape estimated by three lines:
The solid line is the regression line for y on x.
The dashed line is the SD (standard deviation) line.
The dotted line is the regression line for x on y.
iClicker Question
Age 43 21 25 42 57 59
Glucose Level 99 65 79 75 87 81
Age has a standard deviation of 14.4 and a mean of 41.2, and Glucose Level has a standard deviation of 10.5 and a mean of 81. The correlation coefficient was found to be r = 0.53. Predict the age of person A, who has a glucose level of 92.
A) 20
B) 30
C) 40
D) 50
E) 60
iClicker Solution
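As we are predicting age, we let AGE be y and GLUCOSE be x.
To find our regression line we first find the slope
$$ m = r \cdot \frac{s_y}{s_x} = (0.53)\frac{14.4}{10.5} = 0.73 $$
Then the y-intercept
$$ b = \bar{y} - m \cdot \bar{x} = 41.2 - (0.73) \cdot 81 = -17.9 $$
Finally we obtain our regression line
$$ \hat{y} = 0.73x - 17.9 $$
We are given an x value of 92, and the resulting prediction is
$$ \hat{y} = 0.73(92) - 17.9 \approx 49.3 $$
(or 49.2 if the slope and intercept are not rounded first). The closest answer is D) 50.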
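The prediction can be reproduced without rounding the intermediate values. The sketch below, in Python, computes it two ways: through the slope and intercept, and through the z-score form of the regression line (the predicted z-score is r times the z-score of x). The function names are illustrative assumptions, not part of the course material.

```python
r = 0.53
mean_age, sd_age = 41.2, 14.4          # y: the variable being predicted
mean_glucose, sd_glucose = 81.0, 10.5  # x: the variable that is given

def predict_age_slope_form(glucose):
    """Prediction from the line y = m*x + b."""
    slope = r * sd_age / sd_glucose
    intercept = mean_age - slope * mean_glucose
    return slope * glucose + intercept

def predict_age_z_form(glucose):
    """Prediction from the z-score form of the regression line."""
    z_x = (glucose - mean_glucose) / sd_glucose
    z_y_hat = r * z_x
    return mean_age + z_y_hat * sd_age

print(round(predict_age_slope_form(92), 1))  # 49.2
print(round(predict_age_z_form(92), 1))      # 49.2
```

Both forms agree, and either way the prediction is closest to 50.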
Chapter 11
The R.M.S. Error measures how far a typical point will be from the regression line.
Taking the spread of points in a small vertical strip of the scatter plot, the R.M.S. Error is similar to the standard deviation and the regression line is similar to the mean.
$$ \text{R.M.S. Error} = \sqrt{1 - r^2} \cdot s_y $$
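To make the formula concrete, here is a short Python sketch applied to the Age and Glucose Level example, assuming age is the predicted variable y (so the standard deviation used is 14.4) and r = 0.53. The function name `rms_error` is a hypothetical helper.

```python
import math

def rms_error(r, sd_y):
    """R.M.S. error of the regression line for predicting y."""
    return math.sqrt(1 - r ** 2) * sd_y

error_age = rms_error(0.53, 14.4)
print(round(error_age, 1))  # 12.2
```

Under these assumptions a typical age prediction is off by roughly 12 years. The two extreme cases behave as expected: with r = ±1 the error is 0, and with r = 0 the error equals the standard deviation of y.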
Homoscedastic and Heteroscedastic Relationships
There are two broad classes of data with a linear relationship:
Homoscedastic: Scatter diagrams which form a true football shape. The standard deviations of the y observations in vertical strips are approximately equal.
We may assume that the standard deviation in each strip is the R.M.S. Error.
Heteroscedastic: Scatter diagrams with unequal vertical strip standard deviations. The R.M.S. Error in this case gives an average error across all the vertical strips.
For a given x, the R.M.S. Error should not be used as an estimate of the standard deviation of the corresponding y-values.
CAUTION! CAUTION! CAUTION!
Here are a few things to note about the least squares linear regression:
CAUTION 1: Linear regression should not be used if the relationship between x and y is nonlinear.
CAUTION 2: The regression method is not robust to outliers. An extreme outlier in the x or y direction may substantially change the regression equation.
Points that are outliers in the x direction can have a strong influence on the position of the regression line.
Usually the outliers in the x direction pull the regression line towards themselves and do not have large residuals.
Points that are outliers in the y direction have large residuals.
Chapter 13 and 14
Probability theory deals with studies where the outcomes are not known for sure in advance. Usually there are many possible outcomes for a study; we just do not know which particular outcome we will observe.
Sample Space: The set of all possible outcomes of a study. The sample space of a study is denoted by S.
Every repetition of a study, or trial, produces a single outcome. Usually an outcome is computed from the values of the response variables.
RULE: In problems involving drawing at random, the probability of getting an object of a particular type in a single draw is equal to the proportion of objects of this type in the population.
Events
Definition of an event in words: An event is some statement about the random phenomenon.
Mathematical definition of an event: An event is a set of outcomes from the sample space S.
We say that an event has occurred if ANY of the outcomes that constitute it occur.
In the special case that the outcomes in the sample space are equally likely, the probability of an event E is computed as
$$ P(E) = \frac{\text{\# of outcomes supporting } E}{\text{Total \# of outcomes in } S} \qquad (2) $$
Special Events
Certain Event: An event which is guaranteed to happen at every repetition of the experiment. A certain event is equal to the sample space.
Impossible Event: An event which can never occur. Mathematically, an impossible event is written as ∅, the empty set.
Opposite Event: An event $\bar{A}$ is opposite to A if it happens whenever A does not happen.
Mutually Exclusive Events
The set of outcomes common to events A and B is itself an event, called the intersection of A and B. We will denote the intersection of events A and B by
A and B
Mutually Exclusive Events: Two events A and B are mutually exclusive if they cannot both happen at the same time. The intersection A and B is the empty set and P(A and B) = 0.
Independent Events:
Two events A and B are independent of each other if knowing that one event occurs does not change the probability that the other event occurs.
For independent events A and B, P(A and B) = P(A) · P(B).
It can be established that events A and B are independent if P(A) = P(A|B).
Union: The union of events A and B is the event which happens when either event A or event B or both happen. The union of two events is expressed as
A or B
Rules of Probability
The following rules simplify many probability computations:
Rule 1: The probability P(A) of any event A satisfies
0 ≤ P(A) ≤ 1.
In other words, the probability of any event is between 0 and 1.
Rule 2: If S is the sample space for an experiment, then P(S) = 1. The probability of a certain event is 1.
Rule 3: The probability of an impossible event is 0. Hence, P(∅) = 0.
Rule 4: If events A and B are independent, then the probability that they both happen is the product of their probabilities:
P(A and B) = P(A) · P(B)
Rules of Probability
Rule 5: For any events A and B, the probability of their union is equal to the sum of their individual probabilities minus the probability of their intersection:
P(A or B) = P(A) + P(B) − P(A and B)
Subtracting the probability of the intersection is needed to avoid double counting.
A special case: if A and B are disjoint events, then the probability of their union is the sum of the individual probabilities:
P(A or B) = P(A) + P(B)
Rule 6: For any event A, with opposite event $\bar{A}$,
$P(\bar{A}) = 1 - P(A)$
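The rules can be checked by direct counting on a small sample space. The following Python sketch assumes a single roll of a fair die, with the illustrative events A = "the roll is even" and B = "the roll is greater than 4"; these events and the helper `prob` are not from the slides and are chosen only to exercise the rules.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die, equally likely outcomes

def prob(event):
    """Probability of an event (a set of outcomes) by counting."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # the roll is even
B = {5, 6}     # the roll is greater than 4

# Rule 5: P(A or B) = P(A) + P(B) - P(A and B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)

# Rule 6: the opposite event has probability 1 - P(A)
print(prob(sample_space - A), 1 - prob(A))

# Rule 4 in reverse: these two events happen to be independent
print(prob(A & B) == prob(A) * prob(B))
```

Set union, intersection, and difference play the roles of "or", "and", and the opposite event, which mirrors the mathematical definition of an event as a set of outcomes.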
Conditional Probability:
Let A and B be events. If the events are not independent, then the occurrence of B alters the probability that A will occur. The conditional probability of event A given that event B has happened is denoted P(A|B).
$$ P(A|B) = \frac{P(A \text{ and } B)}{P(B)} $$
We will consider two important sampling schemes:
Sampling with Replacement: Independent Events
Sampling without Replacement: Dependent Events
The events $A_1, A_2, \ldots, A_n$ are not necessarily independent.
$$ P(A_1 \text{ and } A_2 \text{ and } A_3 \text{ and } A_4 \text{ and } \ldots) = P(A_1) \cdot P(A_2|A_1) \cdot P(A_3|A_1 \text{ and } A_2) \cdot P(A_4|A_1 \text{ and } A_2 \text{ and } A_3) \cdots $$
iClicker Question
What is the probability of obtaining at least one number greater than 4 when you roll six fair dice?
A) 0%
B) 9%
C) 91%
D) 99%
E) 100%
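The slides give no worked solution for this question, so the following Python sketch is only one way to approach it. It assumes the six rolls are independent, applies the opposite-event rule (the opposite of "at least one roll greater than 4" is "every roll is 4 or less"), and then cross-checks by enumerating all equally likely outcomes.

```python
from fractions import Fraction
from itertools import product

# Each die shows 4 or less with probability 4/6; independence lets us multiply.
p_none = Fraction(4, 6) ** 6
p_at_least_one = 1 - p_none

# Cross-check: count outcomes among all 6**6 equally likely rolls.
rolls = list(product(range(1, 7), repeat=6))
hits = sum(1 for roll in rolls if max(roll) > 4)
p_enumerated = Fraction(hits, len(rolls))

print(p_at_least_one, round(float(p_at_least_one) * 100))
print(p_enumerated == p_at_least_one)
```

Under these assumptions both computations give 665/729, about 91%, which corresponds to choice C.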
iClicker Question
Three cards are dealt to you without replacement from a standard 52-card deck. What is the probability of obtaining exactly one king, which happens to be the first card dealt?
A) 0%
B) 7%
C) 14%
D) 21%
E) 100%
iClicker Solution
Let K be the event: obtaining a king when dealt one card.
Let N be the event: not obtaining a king when dealt one card.
The event we are interested in is:
K and N and N
The probability is found by
$$ P(KNN) = P(K) \cdot P(N|K) \cdot P(N|(K \text{ and } N)) = \frac{4}{52} \cdot \frac{48}{51} \cdot \frac{47}{50} \approx 7\% $$
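The chain of conditional probabilities can be sketched in Python with exact fractions. The function `deal_probability` and its pattern-string argument are hypothetical helpers introduced for illustration; the sketch assumes a standard deck with 4 kings and 48 other cards, dealt without replacement.

```python
from fractions import Fraction

def deal_probability(pattern, kings=4, others=48):
    """Probability of dealing the given sequence, e.g. 'KNN', without replacement.

    'K' means a king is dealt, 'N' means a non-king is dealt.
    """
    p = Fraction(1)
    for card in pattern:
        total = kings + others
        if card == "K":
            p *= Fraction(kings, total)   # conditional on the cards already dealt
            kings -= 1
        else:
            p *= Fraction(others, total)
            others -= 1
    return p

p_knn = deal_probability("KNN")
print(p_knn, round(float(p_knn) * 100))  # 376/5525 7
```

Each factor is a conditional probability given the cards already removed, which is why the counts shrink after every draw.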