
S6 w2 linear regression


Page 1: S6 w2 linear regression

Linear Regression Purpose – Determine if one or more IVs can predict a DV. Examples:

• Does your height (IV) predict how much money you will spend (DV)?

• Does the number of store managers (IV) predict how often the machine will break down (DV)?

• Does the number of clicks (IV1) and the number of comments (IV2) on the blog predict the size of revenue (DV)?

Page 2: S6 w2 linear regression

Choosing the right test for your research

Research Question                        Inferential Statistics
Compare means of 2 numeric variables     T test
Relate 2 categorical variables           Pearson Chi Square
Relate 2 numeric variables               Pearson Correlation r
Use 1+ IVs to explain 1 numeric DV       Regression

Page 3: S6 w2 linear regression

Where’s the crystal ball? I want to see the future!

Correlation tells us how X relates to Y (in the past).

Simple Regression tells us how X predicts Y (in the future).
• E.g., Does AvgDailyClicks predict DirectSalesRevenue?

Multiple Regression tells us how X1, X2, X3, … predict Y.
• E.g., Do NumberBlogAuthors & AvgDailyClicks predict SponsorRevenue?

Page 4: S6 w2 linear regression

Linear Regression Assumptions
• The relationship between the Xs and Y is linear.
• If you have 2 or more Xs, they are not perfectly correlated with each other.
• The Xs are not correlated with external variables.
• Independence – any two observations should be independent of each other.
• Errors are normally distributed.
• And a few others.
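A minimal sketch of how two of these checks might look in practice, using made-up data (the variable names and values are hypothetical, not from the course dataset):

```python
# Rough sketch (hypothetical data): checking two of the assumptions above.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

B, constant = np.polyfit(x, y, 1)          # fit Y = constant + B*X
residuals = y - (constant + B * x)         # the errors

# "Errors are normally distributed": Shapiro-Wilk test on the residuals
# (a large p-value gives no evidence against normality)
print(stats.shapiro(residuals))

# "2+ Xs are not perfectly correlated": check the correlation between predictors
x2 = np.array([1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 8.2])  # a second, hypothetical X
print(np.corrcoef(x, x2)[0, 1])            # values near ±1 signal trouble
```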

Page 5: S6 w2 linear regression

Simple Regression Example: Does Number of Stupid Customers predict Self Checkout Error Rate?

When we use X to predict Y:
• X = the predictor = the independent variable (IV)
• Y = the predicted value = the dependent variable (DV) – the value of Y depends on the predictor X
• You're basically building a linear model between X and Y:

Y = Constant + B*X + error
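As a concrete illustration of Y = Constant + B*X + error, here is a minimal sketch of fitting such a line by ordinary least squares; the customer counts and error rates are invented placeholders:

```python
# Minimal sketch (invented data): estimating Constant and B for Y = Constant + B*X + error
import numpy as np

stupid_customers = np.array([3, 7, 2, 9, 5, 6, 4, 8], dtype=float)        # X (IV)
checkout_error_rate = np.array([2.1, 4.8, 1.5, 6.2, 3.3, 4.1, 2.7, 5.5])  # Y (DV)

B, constant = np.polyfit(stupid_customers, checkout_error_rate, 1)  # degree-1 (straight line) fit
predicted = constant + B * stupid_customers
errors = checkout_error_rate - predicted       # the "error" term (residuals)

print(f"Y = {constant:.2f} + {B:.2f}*X")
```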

Page 6: S6 w2 linear regression

Basic Geometry: Linear Function

Y = Constant + B*X + error

Example: Y = 1 + 2*X, where the Constant (intercept) = 1 and the Slope B = 2. (Figure source: Wikipedia)
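A quick numeric check of that example line, just to show the role of the constant and the slope:

```python
# Evaluate the example line Y = 1 + 2*X for a few values of X
constant, B = 1, 2
for X in [0, 1, 2, 3]:
    print(X, constant + B * X)   # prints 1, 3, 5, 7: each extra unit of X adds B to Y
```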

Page 7: S6 w2 linear regression

What do Armani and regression have in common? Model audition: fitting the best straight line between X & Y.

Who is the best fitting model? (Hint: Not Kate Moss)

The best fit is the line that's closest to all the dots.

Page 8: S6 w2 linear regression

Kate Moss expressed mathematically:

DirectSalesRevenue = (constant) + B*AvgDailyClicks + error

Goodness of Fit (R2): How well does the line fit the data? (How well does Kate fit the average woman?)

On the fitted line, (constant) is the intercept and B is the slope; the distances from the data points to the regression line are the errors. Good fit = small errors.
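To make the fit/error idea concrete, here is a sketch that computes the errors and R2 by hand for a hypothetical AvgDailyClicks / DirectSalesRevenue dataset (the numbers are made up):

```python
# Sketch (made-up data): distances to the regression line (errors) and R^2
import numpy as np

avg_daily_clicks = np.array([120, 340, 560, 210, 430, 650, 300, 500], dtype=float)
direct_sales_revenue = np.array([19.2, 18.5, 17.9, 19.0, 18.2, 17.5, 18.7, 18.0])

B, constant = np.polyfit(avg_daily_clicks, direct_sales_revenue, 1)
fitted = constant + B * avg_daily_clicks
errors = direct_sales_revenue - fitted            # distances to the line

ss_err = np.sum(errors ** 2)                      # small errors -> small ss_err
ss_tot = np.sum((direct_sales_revenue - direct_sales_revenue.mean()) ** 2)
r_squared = 1 - ss_err / ss_tot                   # good fit -> R^2 close to 1
print(f"R^2 = {r_squared:.2f}")
```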

Page 9: S6 w2 linear regression

Kate Moss as a lousy regression model:

Large errors, poor goodness of fit, small R2

Page 10: S6 w2 linear regression

Reading the SPSS Regression Output

Y = Constant + B*X + error
DirectSalesRevenue = 19.466 - .003*AvgDailyClicks + error

• The Constant is significantly greater than zero.
• The Slope (-.003) is significantly less than zero.
• Goodness of Fit (R2): the model explains 59% of the variation in DirectSalesRevenue.
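The same quantities SPSS reports (the constant, the slope, their significance tests, and R2) can also be read off other tools; a sketch using statsmodels with made-up data, not the course dataset:

```python
# Sketch (made-up data): the regression quantities SPSS reports, via statsmodels
import numpy as np
import statsmodels.api as sm

avg_daily_clicks = np.array([120, 340, 560, 210, 430, 650, 300, 500], dtype=float)
direct_sales_revenue = np.array([19.2, 18.5, 17.9, 19.0, 18.2, 17.5, 18.7, 18.0])

X = sm.add_constant(avg_daily_clicks)             # include the Constant term
model = sm.OLS(direct_sales_revenue, X).fit()

print(model.params)      # [Constant, Slope B]
print(model.pvalues)     # is each coefficient significantly different from zero?
print(model.rsquared)    # proportion of variation in Y explained by the model
```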

Page 11: S6 w2 linear regression

Reporting Regression in plain English

The number of average daily clicks significantly predicted direct sales revenue, b = -.003, t(39) = 14.72, p < .001. The number of average daily clicks also explained a significant proportion of variance in direct sales revenue, R2 = .59, F(1, 38) = 42.64, p < .001. These findings suggest that websites with more average daily clicks tend to have lower direct sales revenue.

Page 12: S6 w2 linear regression

Why is regression useful for predicting the future?

Y = 200X (R2 = 45%)

Given any X, we can plug it into the model to get a predicted value of Y; the R2 of 45% tells us the model accounts for 45% of the variation in Y, which is a rough gauge of how accurate those predictions will be.
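Once the model is fitted, prediction is just plugging a new X into the equation; a tiny sketch using the slide's Y = 200X as the (hypothetical) fitted model:

```python
# Tiny sketch: predicting Y from new X values with the fitted line Y = 200*X
constant, B = 0.0, 200.0
for new_x in [1.5, 2.0, 3.2]:
    print(new_x, constant + B * new_x)   # the model's predicted Y for each new X
```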

Page 13: S6 w2 linear regression

Additional Notes

Assumptions: Xs are somewhat independent; Y values are independent; Y values are normally distributed; errors are normally distributed; X-Y relations are linear; no outliers.
• Example: Time series data are NOT independent – stock price today depends on stock price yesterday, which depends on stock price the day before, etc.

Multiple regression is just an extension of simple regression:
• Use multiple Xs (e.g., both AvgDailyClicks and NumberAuthors) to predict Y.
• When you have a condition (e.g., customer choice depends on gender; brand awareness depends on comm. channel; number of applications depends on program of study), you need to create an interaction term (next class).

When an X is categorical (e.g., whether the blog host is Google or WordPress): code X in numbers – e.g., 0 is Google, 1 is WordPress.

When Y is categorical (e.g., whether the blog won the Outstanding Blog Award): code Y in numbers – e.g., 0 is No, 1 is Yes – and use Logistic Regression.
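A brief sketch of those two coding ideas, assuming an invented blog dataset: it dummy-codes the host and fits a logistic regression for the 0/1 award outcome with statsmodels.

```python
# Sketch (invented data): dummy-coding a categorical X and using Logistic Regression
# when Y is categorical (0/1).
import numpy as np
import statsmodels.api as sm

host = np.array(["Google", "WordPress", "Google", "WordPress",
                 "Google", "WordPress", "Google", "WordPress"])
host_coded = (host == "WordPress").astype(float)   # 0 = Google, 1 = WordPress

won_award = np.array([0, 1, 1, 0, 0, 1, 1, 1])     # 0 = No, 1 = Yes (categorical Y)

X = sm.add_constant(host_coded)
logit_model = sm.Logit(won_award, X).fit(disp=0)   # logistic, not linear, regression
print(logit_model.params)    # [constant, coefficient for WordPress vs. Google]
```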

Page 14: S6 w2 linear regression

Y = Constant + B1*X1 + B2*X2 + error for Your Project

• What is your Y (the value you want to predict)? Is your Y categorical? Do you need Logistic Regression? See the instructor for help.
• What is your X (your predictor variable)? How many Xs do you have?
• Is any of your Xs categorical? Do you have a coding scheme?
• Do you have a condition? (e.g., customer choice depends on gender; brand awareness depends on comm. channel; number of applications depends on program of study) See the instructor for help.
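For projects with two numeric predictors, a minimal sketch of Y = Constant + B1*X1 + B2*X2 + error, using hypothetical author counts, clicks, and sponsor revenue as placeholders:

```python
# Sketch (hypothetical data): multiple regression with two predictors
import numpy as np
import statsmodels.api as sm

number_blog_authors = np.array([2, 5, 3, 8, 4, 6, 7, 3], dtype=float)               # X1
avg_daily_clicks = np.array([120, 340, 210, 560, 300, 430, 500, 250], dtype=float)  # X2
sponsor_revenue = np.array([1.1, 2.4, 1.6, 3.9, 2.0, 2.9, 3.4, 1.8])                # Y

X = sm.add_constant(np.column_stack([number_blog_authors, avg_daily_clicks]))
model = sm.OLS(sponsor_revenue, X).fit()
print(model.params)      # [Constant, B1, B2]
print(model.rsquared)    # goodness of fit
```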

Page 15: S6 w2 linear regression

Choosing the right test for your research

Research Question                        Inferential Statistics
Compare means of 2 numeric variables     T test
Relate 2 numeric variables               Pearson Correlation r
Relate 2 categorical variables           Pearson Chi Square
Use 1+ IVs to explain 1 numeric DV       Regression