Midterm 1 Review - Binghamton University (people.math.binghamton.edu/jbrennan/home/S13MAT148/Midterm2Review.pdf)

Midterm 1 Review

Dr. Joseph Brennan

Math 148, BU

Chapter 8 & 9

Two quantitative variables measured on the same individual are often studied simultaneously.

Correlation Analysis

The establishment of an association (or correlation) between two variables and the assessment of its strength.

Independent Variable

The variable that is expected to influence the other variable.

Also known as the explanatory variable.

Dependent Variable

The variable that is expected to be influenced by the other variable.

Also known as the response variable.


Chapter 8 & 9

Variables with a linear relationship have two classifications:

Positive Association: both variables increase or decrease simultaneously.

Negative Association: when one of the variables increases, the other decreases.

NOTE: The direction of association follows from the slope of an imaginary line plotted through the points.

A positive association corresponds to a positive slope.

A negative association corresponds to a negative slope.


The Correlation Coefficient

The correlation coefficient, r, is a descriptive statistic which measures the direction and strength of the linear relationship between two quantitative variables.

Suppose that we have data on variables x and y for n individuals.

Let the mean of the x values be $\bar{x}$ and let the mean of the y values be $\bar{y}$.

Let the standard deviation of the x values be $s_x$ and let the standard deviation of the y values be $s_y$.

The sample correlation coefficient r between x and y is computed as

$$r = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{1}{n} \sum_{i=1}^{n} z_{x,i} \cdot z_{y,i} \qquad (1)$$

where $z_{x,i} = \frac{x_i - \bar{x}}{s_x}$ and $z_{y,i} = \frac{y_i - \bar{y}}{s_y}$ are the z-scores for $x_i$ and $y_i$, respectively.


Properties of the Correlation Coefficient

(1) The sign of the correlation coefficient r indicates the direction of the relationship between the variables.

(2) The correlation coefficient is just a number; it has no units of measurement.

(3) The correlation r is always a number between −1 and 1. The closer r is to 1 or −1, the stronger the linear association between x and y.

(4) Correlation only measures the strength of a LINEAR relationship between two variables. Correlation DOES NOT describe curved relationships between variables, no matter how strong they are!

(5) The correlation coefficient is NOT resistant to outliers.

(6) The correlation coefficient r is symmetric.
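
Several of these properties can be checked numerically. The sketch below is only an illustration: the height and weight numbers are made up, and the helper name `correlation` is an assumption, not something from the course. It computes r as the average product of z-scores, dividing by n in the standard deviation.

```python
import math

def correlation(xs, ys):
    """Average product of z-scores, using the SD that divides by n."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum((x - mean_x) / sd_x * (y - mean_y) / sd_y
               for x, y in zip(xs, ys)) / n

# Hypothetical data, chosen only for illustration.
heights_in = [60, 62, 65, 68, 72]
weights_lb = [110, 125, 140, 160, 190]

r = correlation(heights_in, weights_lb)

# Property (2): changing units (inches to centimeters) leaves r unchanged.
heights_cm = [2.54 * h for h in heights_in]
r_cm = correlation(heights_cm, weights_lb)

# Property (6): swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(weights_lb, heights_in)

# Property (4): a perfect curved relationship y = x^2 can still have r = 0.
xs = [-2, -1, 0, 1, 2]
r_curved = correlation(xs, [x ** 2 for x in xs])

print(r, r_cm, r_swapped, r_curved)
```

The first three printed values agree up to floating-point rounding, while the last is zero even though y is completely determined by x.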


Interpreting the Correlation Coefficient

Conventional interpretations of the computed (absolute) values of r are given in the following table.

Value of |r|    Interpretation
0.0 - 0.2       Very weak to negligible correlation
0.2 - 0.4       Weak, low correlation (not very significant)
0.4 - 0.7       Moderate correlation
0.7 - 0.9       Strong, high correlation
0.9 - 1.0       Very strong correlation


Correlation Coefficient: CAUTION

The correlation coefficient only measures the strength of LINEAR relationships.

CORRELATION ≠ CAUSATION

To visually guess r accurately, construct a scatter diagram such that the vertical standard deviations cover the same distance on the page as the horizontal standard deviations.

A coefficient r = 0.80 does NOT mean that 80% of the points are tightly clustered around a line, NOR does it indicate twice as much linearity as r = 0.40.


iClicker Question

Age 43 21 25 42 57 59

Glucose Level 99 65 79 75 87 81

If the Age has a standard deviation of 14.4 and mean 41.2 and the Glucose Level has a standard deviation of 10.5 and mean 81, what is the sample correlation coefficient between the variables Age and Glucose Level?

A) −1

B) −0.5

C) 0

D) 0.5

E) 1


iClicker Solution

X     Y     Zx       Zy       Zx · Zy
43    99    0.13     1.7      0.22
21    65    -1.4     -1.52    2.13
25    79    -1.13    -0.19    0.22
42    75    0.06     -0.57    -0.03
57    87    1.1      0.57     0.63
59    81    1.24     0        0

The average of the final column is r = 0.53, so the closest answer is D) 0.5.
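
The table can be reproduced with a short script. This is a minimal sketch, assuming the standard deviation divides by n (that choice reproduces the stated 14.4 and 10.5); the helper names `mean`, `sd`, and `z_scores` are illustrative.

```python
import math

ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

def mean(values):
    return sum(values) / len(values)

def sd(values):
    # SD that divides by n; this matches the 14.4 and 10.5 quoted in the question.
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def z_scores(values):
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]

# Multiply the z-scores row by row, then average the products.
products = [zx * zy for zx, zy in zip(z_scores(ages), z_scores(glucose))]
r = mean(products)

print(round(r, 2))  # 0.53
```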


Chapter 10

Regression Analysis: The creation of a mathematical model or formula that relates the values of one variable to the values of the other.

Assume that y and x are the dependent and independent variables of a study. Denote:

$\hat{y}$ to be the predicted (by regression) value of y for a given x,

r to be the correlation coefficient between x and y,

$\bar{y}$ and $s_y$ the average and standard deviation for the dependent (response) variable y,

$\bar{x}$ and $s_x$ the average and standard deviation for the independent (explanatory) variable x.

Regression Line: Regressing y on x

$$\hat{y} = \bar{y} + r \cdot s_y \cdot \frac{x - \bar{x}}{s_x}$$

$$\hat{y} = \bar{y} + r \cdot s_y \cdot z_x$$

$$z_{\hat{y}} = r \cdot z_x$$
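
Expanding the first form gives the familiar slope-intercept equation. The labels m (slope) and b (intercept) are the ones used in the worked regression solution later in this review:

```latex
\hat{y} = \bar{y} + r \, s_y \cdot \frac{x - \bar{x}}{s_x}
        = \underbrace{\frac{r \, s_y}{s_x}}_{m} \, x
          + \underbrace{\left( \bar{y} - \frac{r \, s_y}{s_x} \, \bar{x} \right)}_{b}
```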


Chapter 10

The best use of the regression line is to estimate the AVERAGE value of y for a given value of x.

The correlation coefficient r measures the amount of scattering of points about the regression line.

The closer r is to −1 or 1, the closer the points are to the regression line, and the greater the success of regression in explaining individual responses y for given values of x.

The expression for the slope, $m = r \frac{s_y}{s_x}$, implies that when x changes by one standard deviation, the predicted value $\hat{y}$ changes by r standard deviations.

Because −1 ≤ r ≤ 1, the change in $\hat{y}$, measured in standard deviations, is less than or equal in size to the change in x. As the correlation between x and y decreases, the prediction $\hat{y}$ changes more slowly in response to changes in x. This effect is called attenuation.


Chapter 10

Regression Effect describes the tendency of individuals with extreme values to retest closer to the mean.

On average, the top group will score lower on a second experiment and, on average, the bottom group will score higher on a second experiment.

The Regression Fallacy is the fallacy of conjecturing a cause for extreme values becoming more average, when the change is simply the regression effect.


Notes on Linear Regression

The least squares regression line always passes through the center of the data: the point $(\bar{x}, \bar{y})$ consisting of averages.

By swapping dependent and independent status for variables, a second regression line can be found.

We have seen that the correlation coefficient r is symmetric: if we switch the axes, we will get the same correlation.

This is not true for regression. When switching the roles of x and y, you get a different regression equation, as there is not necessarily an equality between $\bar{x}$ and $\bar{y}$ or between $s_x$ and $s_y$.

The equations for y regressed on x and for x regressed on y are generally different!
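
A quick numerical check makes the difference concrete. The sketch below uses the summary statistics from the Age and Glucose Level iClicker example; the helper name `regression_line` is an assumption made for illustration.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) for the regression of y on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Summary statistics from the Age / Glucose Level example.
age_mean, age_sd = 41.2, 14.4
glu_mean, glu_sd = 81.0, 10.5
r = 0.53

# Glucose regressed on age (x = age, y = glucose).
m_glu_on_age, b_glu_on_age = regression_line(age_mean, age_sd, glu_mean, glu_sd, r)

# Age regressed on glucose (x = glucose, y = age).
m_age_on_glu, b_age_on_glu = regression_line(glu_mean, glu_sd, age_mean, age_sd, r)

# If both regressions described the same line, the second slope would equal
# 1 / (first slope). Instead, the product of the two slopes is r squared.
print(m_glu_on_age, m_age_on_glu, 1 / m_glu_on_age)
```

The two slopes multiply to r², so the lines coincide only when r is 1 or −1.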


Football-Shaped Clustering

The data in a scatter plot for variables with a linear correlation cluster in a football (elliptical) shape described by three lines:

The solid line is the regression line for y on x.

The dashed line is the SD (standard deviation) line.

The dotted line is the regression line for x on y.


iClicker Question

Age 43 21 25 42 57 59

Glucose Level 99 65 79 75 87 81

The Age has a standard deviation of 14.4 and mean 41.2 and the Glucose Level has a standard deviation of 10.5 and mean 81. The correlation coefficient was found to be r = 0.53. Predict the age of person A who has a glucose level of 92.

A) 20

B) 30

C) 40

D) 50

E) 60


iClicker Solution

As we are predicting age, we let AGE be y and GLUCOSE be x.

To find our regression line we first find the slope

$$m = r \cdot \frac{s_y}{s_x} = (0.53) \cdot \frac{14.4}{10.5} = 0.73$$

Then the y-intercept

$$b = \bar{y} - m \cdot \bar{x} = 41.2 - (0.73) \cdot 81 = -17.9$$

Finally we obtain our regression line

$$\hat{y} = 0.73x - 17.9$$

We are given an x value of 92, and the resulting prediction is

$$\hat{y} = 0.73(92) - 17.9 \approx 49.2$$

The closest answer is D) 50.
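
The same prediction can be cross-checked in standard units, where the regression line says the predicted z-score of y is r times the z-score of x. This is a minimal sketch using the summary statistics given in the question; the variable names are illustrative.

```python
r = 0.53
age_mean, age_sd = 41.2, 14.4   # y: the variable being predicted
glu_mean, glu_sd = 81.0, 10.5   # x: the variable we are given

glucose = 92
z_x = (glucose - glu_mean) / glu_sd          # glucose level in standard units
z_y_hat = r * z_x                            # regression line in standard units
predicted_age = age_mean + z_y_hat * age_sd  # convert back to years

print(round(predicted_age, 1))  # 49.2
```

Because no intermediate value is rounded here, the result may differ in the last digit from a hand calculation that rounds the slope to 0.73.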


Chapter 11

The R.M.S. Error measures how far a typical point will be from the regression line.

Taking the spread of points in a small vertical strip of the scatter plot, the R.M.S. Error is similar to the standard deviation and the regression line is similar to the mean.

$$\text{R.M.S. Error} = \sqrt{1 - r^2} \cdot s_y$$
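
As an illustration, the formula can be applied to the Age and Glucose Level example, where age is predicted from glucose level (so y is age, with standard deviation 14.4, and r = 0.53). The function name `rms_error` is an assumption for this sketch.

```python
import math

def rms_error(r, sd_y):
    """R.M.S. error of the regression line for y on x."""
    return math.sqrt(1 - r ** 2) * sd_y

# Predicting age from glucose level: y is age, so sd_y = 14.4.
age_rms = rms_error(0.53, 14.4)
print(round(age_rms, 1))  # 12.2
```

An age predicted from glucose level is therefore typically off by about 12 years, only a modest improvement over the 14.4 years obtained by always guessing the mean age.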


Homoscedastic and Heteroscedastic Relationships

There are two broad generalizations of data with a linear relationship:

Homoscedastic: Scatter diagrams which form a true football shape. The standard deviations of the y observations in vertical strips are approximately equal.

We may assume that the standard deviation in each strip is the R.M.S. Error.

Heteroscedastic: Scatter diagrams with unequal vertical strip standard deviations. The R.M.S. Error in this case gives an average error across all the vertical strips.

For a given x, the R.M.S. Error should not be used as an estimate of the standard deviation of the corresponding y-values.


CAUTION! CAUTION! CAUTION!

Here are a few things to note about the least squares linear regression:

CAUTION 1: Linear regression should not be used if the relationship between x and y is nonlinear.

CAUTION 2: The regression method is not robust to outliers. An extreme outlier in the x or y direction may substantially change the regression equation.

Points that are outliers in the x direction can have a strong influence on the position of the regression line.

Usually the outliers in the x direction pull the regression line towards themselves and do not have large residuals.

Points that are outliers in the y direction have large residuals.


Chapter 13 and 14

Probability theory deals with studies where the outcomes are not known for sure in advance. Usually, there are many possible outcomes for a study; we just do not know which particular outcome we will observe.

Sample Space: The set of all possible outcomes of a study. The sample space of a study is denoted by S.

Every repetition of a study, or a trial, produces a single outcome. Usually an outcome is computed from the values of the response variables.

RULE: In problems involving drawing at random, the probability of getting an object of a particular type in a single draw is equal to the proportion of objects of this type in the population.


Events

Definition of an event in words: An event is some statement about the random phenomenon.

Mathematical definition of an event: An event is a set of outcomes from the sample space S.

We say that an event has occurred if ANY of the outcomes that constitute it occur.

In the special case that the outcomes in the sample space are equally likely, the probability of an event E is computed as

$$P(E) = \frac{\text{\# of outcomes supporting } E}{\text{Total \# of outcomes in } S} \qquad (2)$$
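
Formula (2) amounts to counting. The sketch below applies it to a hypothetical example that is not from the slides: rolling two fair dice and asking for a sum of 7.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered results of rolling two fair dice (hypothetical example).
sample_space = list(product(range(1, 7), repeat=2))

# Event E: the two dice sum to 7.
event = [outcome for outcome in sample_space if sum(outcome) == 7]

# Formula (2): outcomes supporting E divided by total outcomes in S.
p_event = Fraction(len(event), len(sample_space))
print(p_event)  # 1/6
```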


Special Events

Certain Event: An event which is guaranteed to happen at every repetition of the experiment. A certain event is equal to the sample space.

Impossible Event: An event which can never occur. Mathematically an impossible event is written as ∅, the empty set.

Opposite Event: An event $\bar{A}$ is opposite to A if it happens whenever A does not happen.


Mutually Exclusive Events

The set of outcomes common to events A and B is itself an event, called the intersection of events A and B. We will denote the intersection of events A and B by

A and B

Mutually Exclusive Events: Two events A and B are mutually exclusive if they cannot both happen at the same time. The intersection A and B is the empty set and P(A and B) = 0.


Independent Events:

Two events A and B are independent of each other if knowing that one event occurs does not change the probability that the other event occurs.

For independent events A and B, P(A and B) = P(A) · P(B).

It can be established that events A and B are independent if P(A) = P(A|B).

Union: The union of events A and B is the event which happens when either event A or event B or both happen. The union of two events is expressed as

A or B
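
These definitions can be tried out on a small sample space. The following sketch uses a hypothetical example, one roll of a fair die, with events represented as Python sets; the events A, B, and C and the helper `prob` are assumptions chosen for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (hypothetical example)

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

A = {2, 4, 6}      # the roll is even
B = {1, 2, 3, 4}   # the roll is at most 4
C = {1, 3, 5}      # the roll is odd

# Independence: P(A and B) equals P(A) * P(B), equivalently P(A|B) equals P(A).
independent = prob(A & B) == prob(A) * prob(B)
p_A_given_B = prob(A & B) / prob(B)

# Mutually exclusive: A and C share no outcomes, so P(A and C) = 0.
mutually_exclusive = prob(A & C) == 0

# Union: the event "A or B" contains every outcome in A, in B, or in both.
p_union = prob(A | B)

print(independent, p_A_given_B, mutually_exclusive, p_union)  # True 1/2 True 5/6
```

Note that A and C are mutually exclusive but not independent: knowing that C occurred makes A impossible.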


Rules of Probability

The following rules simplify many probability computations:

Rule 1: The probability P(A) of any event A satisfies

0 ≤ P(A) ≤ 1.

In other words, the probability of any event is between 0 and 1.

Rule 2: If S is the sample space for an experiment, then P(S) = 1. The probability of a certain event is 1.

Rule 3: The probability of an impossible event is 0. Hence, P(∅) = 0.

Rule 4: If events A and B are independent, then the probability that they both happen is the product of their probabilities:

P(A and B) = P(A) · P(B)


Rules of Probability

Rule 5: For any events A and B, the probability of their union is equal to the sum of their individual probabilities minus the probability of their intersection:

P(A or B) = P(A) + P(B) − P(A and B)

Subtracting the probability of the intersection is needed to avoid double counting.

A special case: if A and B are disjoint events, then the probability of their union is the sum of individual probabilities:

P(A or B) = P(A) + P(B)

Rule 6: For any event A, the opposite event $\bar{A}$ satisfies

$$P(\bar{A}) = 1 - P(A)$$
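
Rules 5 and 6 can be verified by direct counting. This sketch uses a hypothetical example, one roll of a fair die, and an illustrative helper `prob`; none of these names come from the slides.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (hypothetical example)

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

A = {2, 4, 6}   # the roll is even
B = {5, 6}      # the roll is greater than 4

# Rule 5: P(A or B) = P(A) + P(B) - P(A and B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)

# Rule 6: the opposite event has probability 1 - P(A)
not_A = sample_space - A
complement_ok = prob(not_A) == 1 - prob(A)

print(lhs, rhs, complement_ok)  # 2/3 2/3 True
```

Without subtracting P(A and B), the outcome 6 would be counted twice and the sum would be 5/6 instead of 2/3.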


Conditional Probability:

Let A and B be events. If the events are not independent, then the occurrence of B alters the probability that A will occur. The conditional probability of event A given that event B has happened is denoted P(A|B).

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

We will consider two important sampling schemes:

Sampling with Replacement: Independent Events

Sampling without Replacement: Dependent Events

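The events $A_1, A_2, \ldots, A_n$ are not necessarily independent.

$$P(A_1 \text{ and } A_2 \text{ and } A_3 \text{ and } A_4 \text{ and } \ldots) = P(A_1) \cdot P(A_2|A_1) \cdot P(A_3|A_1 \text{ and } A_2) \cdot P(A_4|A_1 \text{ and } A_2 \text{ and } A_3) \cdot \ldots$$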


iClicker Question

What is the probability of obtaining at least one number greater than 4 when you roll six fair dice?

A) 0%

B) 9%

C) 91%

D) 99%

E) 100%
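
No worked solution accompanies this question here. One way to check it, sketched below, is to combine the complement rule (Rule 6) with the product rule for independent events (Rule 4): "at least one number greater than 4" is the opposite of "no die shows a number greater than 4". The variable names are illustrative.

```python
from fractions import Fraction

p_not_above_4 = Fraction(4, 6)         # one die shows 1, 2, 3, or 4
p_none_above_4 = p_not_above_4 ** 6    # six independent dice (Rule 4)
p_at_least_one = 1 - p_none_above_4    # opposite event (Rule 6)

print(p_at_least_one, round(float(p_at_least_one) * 100))  # 665/729 91
```

Under these assumptions the probability is about 91%, which matches choice C.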


iClicker Question

Three cards are dealt to you without replacement from a standard 52 card deck. What is the probability of obtaining exactly one king, which happens to be the first card dealt?

A) 0%

B) 7%

C) 14%

D) 21%

E) 100%


iClicker Solution

Let K be the event: obtaining a king when dealt one card.

Let N be the event: not obtaining a king when dealt one card.

The event we are interested in is:

K and N and N

The probability is found by

$$P(KNN) = P(K) \cdot P(N|K) \cdot P(N|(K \text{ and } N)) = \frac{4}{52} \cdot \frac{48}{51} \cdot \frac{47}{50} \approx 7\%$$
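
The multiplication rule can be cross-checked by brute force. The sketch below counts every ordered three-card deal from a deck whose cards are labeled 0 to 51, with cards 0 to 3 standing in for the kings; that labeling and the helper `is_king` are assumptions made purely for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule: P(K) * P(N | K) * P(N | K and N)
p_rule = Fraction(4, 52) * Fraction(48, 51) * Fraction(47, 50)

def is_king(card):
    # Cards are labeled 0..51; labels 0-3 play the role of the four kings.
    return card < 4

# Count ordered deals of three distinct cards: king first, then two non-kings.
deals = 0
favorable = 0
for first, second, third in permutations(range(52), 3):
    deals += 1
    if is_king(first) and not is_king(second) and not is_king(third):
        favorable += 1

p_count = Fraction(favorable, deals)
print(p_rule, p_count, round(float(p_rule) * 100, 1))  # 376/5525 376/5525 6.8
```

Both approaches give the same fraction, about 6.8%, which rounds to the 7% of choice B.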

