linear regression
quality of the fit and automating the analysis in Excel

living with the lab

© 2011 David Hall and the LWTL faculty team

The Living with the Lab label, the Louisiana Tech Logo, and this copyright notice should not be removed when any part of this work is used by others. This work may not be used for commercial purposes. Inquiries should be addressed to [email protected]. This presentation on linear regression is based partially on class notes created by Dr. Mark Barker at Louisiana Tech University.

good, better and best aren’t very quantitative words to describe the “quality of the fit”

[Figure: three plots titled "heart rate versus exercise time", labeled "good fit", "better fit" and "best fit"; x-axis: cumulative exercise time (s), 0 to 45; y-axis: heart rate (bpm), 50 to 120]


DISCLAIMER

The content of this presentation is for informational purposes only and is intended only for students attending Louisiana Tech University.

The author of this information does not make any claims as to the validity or accuracy of the information or methods presented.

The procedures demonstrated here are potentially dangerous and could result in injury or damage.

Louisiana Tech University and the State of Louisiana, their officers, employees, agents or volunteers, are not liable or responsible for any injuries, illness, damage or losses which may result from your using the materials or ideas, or from your performing the experiments or procedures depicted in this presentation.

If you do not agree, then do not view this content.


Class Problem: Determine the best fit line of “recovery for recycling” versus “year” for 1960, 1970, 1980, 1990, 2000 and 2009.

a. Use Excel to set up a table to manually determine the slope m and the y-intercept b.
b. Plot the six raw data points versus the fit. Use markers only (with no lines) for the raw data and lines only (no markers) for the fit.


www.epa.gov

Table ES-3. Generation, materials recovery, composting, combustion with energy recovery, and discards of municipal solid waste, 1960-2009, in pounds per person per day

http://www.wastexchange.org/upload_publications/MSWintheU.S.2010.pdf

$$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}$$

$$b = \frac{\sum y_i - m \sum x_i}{n}$$
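The table of sums that the class problem builds in Excel can also be sketched in a few lines of Python. This is only an illustration of the slope and intercept formulas: the function name `fit_line` and the four sample points are assumptions made for the example, and the points are not the EPA recycling values.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)                               # sum of x_i
    sum_y = sum(ys)                               # sum of y_i
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of x_i * y_i
    sum_x2 = sum(x * x for x in xs)               # sum of x_i squared

    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Illustrative points only (NOT the EPA data); they lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
m, b = fit_line(xs, ys)
print(m, b)
```

Each intermediate sum corresponds to one column total in the Excel table, so the spreadsheet and the sketch can be checked against each other.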


solution

$$m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}$$

$$b = \frac{\sum y_i - m \sum x_i}{n}$$

the “coefficient of determination,” more commonly referred to as r², will be used to determine the “goodness of the fit”


coefficient of determination

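[Figure: x-y plot showing a data point $(x_i, y_i)$ and the best fit line, with the fitted value $y_i^{fit}$ on the line at $x_i$ and the vertical distance $y_i - y_i^{fit}$ marked]

best fit line: $y^{fit} = m \cdot x + b$

fitted value at $x_i$: $y_i^{fit} = m \cdot x_i + b$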

• the error at point $x_i$ is $y_i^{fit} - y_i$
• since some errors are negative (fit lies below data point) and some are positive (fit lies above data point), we square the errors: $(y_i^{fit} - y_i)^2$
• if we simply reported the sum of these squared errors, the number would vary in size depending on the problem being solved
• we would like a number that varies between 0 (poor fit) and 1 (perfect fit), so we normalize the error:

$$r^2 = 1 - \frac{\sum \left( y_i^{fit} - y_i \right)^2}{\sum \left( \bar{y} - y_i \right)^2}$$

where $\bar{y}$ is the average value of $y_i$

$$0 \le r^2 \le 1$$
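As a rough check on this definition, the normalized error can be computed directly from the fitted values. The sketch below is an illustration only: the function name `r_squared`, the three sample points, and the values m = 0.5 and b = 1.0 (the least-squares line for those points) are assumptions, not data from the presentation.

```python
def r_squared(xs, ys, m, b):
    """Coefficient of determination for the line y = m*x + b."""
    y_bar = sum(ys) / len(ys)                                    # average of y_i
    ss_err = sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))   # sum of (y_i_fit - y_i)^2
    ss_tot = sum((y_bar - y) ** 2 for y in ys)                   # sum of (y_bar - y_i)^2
    return 1 - ss_err / ss_tot


# Illustrative points only; m and b are the least-squares fit for these points.
xs = [1, 2, 3]
ys = [1, 3, 2]
r2 = r_squared(xs, ys, m=0.5, b=1.0)
print(r2)
```

Here the squared errors sum to 1.5 and the squared deviations from the average sum to 2, so the result is 1 - 1.5/2 = 0.25, a fairly poor fit.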


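alternate equation for r²

instead of using the form for r² presented on the previous slide, we use the form below; this form does not rely on $y_i^{fit}$ and $\bar{y}$:

$$r^2 = \left[ \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \left( \sum x_i^2 \right) - \left( \sum x_i \right)^2} \cdot \sqrt{n \left( \sum y_i^2 \right) - \left( \sum y_i \right)^2}} \right]^2$$

$$0 \le r^2 \le 1$$

Class Problem: Use Excel to compute r² for the recycling problem completed earlier.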
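The alternate form needs only column sums, which is why it suits a spreadsheet table. A minimal Python sketch of the same calculation follows; the function name `r_squared_from_sums` and the three sample points are assumptions for illustration and are not the recycling data.

```python
import math


def r_squared_from_sums(xs, ys):
    """r^2 computed from sums only, with no fitted values or average needed."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return (numerator / denominator) ** 2


# Illustrative points only (NOT the EPA data).
xs = [1, 2, 3]
ys = [1, 3, 2]
r2_sums = r_squared_from_sums(xs, ys)
print(r2_sums)
```

For these points the numerator is 3 and the denominator is 6, so r² is 0.25. The only extra column compared with the slope and intercept table is the sum of y_i squared.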


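solution: adding r² to the earlier spreadsheet

$$r^2 = \left[ \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \left( \sum x_i^2 \right) - \left( \sum x_i \right)^2} \cdot \sqrt{n \left( \sum y_i^2 \right) - \left( \sum y_i \right)^2}} \right]^2$$

$$0 \le r^2 \le 1$$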

• if r² is 0, then there is no apparent relationship between x and y

• if r² is 1, then
  o x perfectly determines y
  o the variation in y is wholly due to x
  o y depends on x and there are no other variables that affect y


repeat using built-in Excel tools

STEPS:

1. enter x and y data
2. create a scatter plot
3. right click on the markers and select “Add Trendline”
4. select “Linear”, “Display Equation on chart” and “Display R-squared value on chart”

[Figure: scatter plot titled "recovery for recycling versus year"; x-axis: year, 1950 to 2020; y-axis: recovery for recycling (lbs/(person*day)), 0 to 1.2; linear trendline f(x) = 0.0212209701126899 x − 41.5367555120039, R² = 0.937803707371666]
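Once Excel reports the trendline equation, the fitted value for any year follows by substitution. The sketch below evaluates the equation shown on the chart at the six years named in the class problem; the function name `trendline` is an assumption for illustration, and the coefficients are copied from the chart rather than recomputed from the raw data.

```python
# Coefficients copied from the trendline equation displayed on the chart.
SLOPE = 0.0212209701126899
INTERCEPT = -41.5367555120039


def trendline(year):
    """Fitted recovery for recycling, in lbs/(person*day), for a given year."""
    return SLOPE * year + INTERCEPT


years = [1960, 1970, 1980, 1990, 2000, 2009]
fitted = [trendline(year) for year in years]

for year, value in zip(years, fitted):
    print(f"{year}: {value:.3f} lbs/(person*day)")
```

These fitted values are the points to draw as a line (no markers) against the six raw data markers.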

NOTE: when studying for the next exam, be sure you can solve problems like the one today by hand and using Excel