+Tutorial: Regression and correlation
Presented by Jessica Raterman and Shannon Hodges
+Setting and checking your data
■ Install the package (only needed if MASS is not already installed)
> install.packages("MASS")
■ Load the data
> library(MASS)
or
> data(birthwt, package="MASS")
■ Look over the raw data
> print(birthwt)
or
> birthwt
+Setting and checking your data
• Raw Data
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
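The setup steps above can be combined into one short script. This is a minimal sketch rather than part of the original slides; `requireNamespace()` and `head()` are base R helpers added here for illustration.

```r
# Install MASS only when it is missing (it ships with most R distributions).
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}

# Load just the birthwt data set into the workspace.
data(birthwt, package = "MASS")

# Show the first three rows instead of printing the whole data frame.
head(birthwt, 3)
```

Printing only the first rows keeps the console readable when the data frame is long.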
+Setting and checking your data
■ Check data form and structure
• Find variable names
> names(birthwt)
• Look at data structure
> str(birthwt)
+Setting and checking your data
• Variable names
[1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"
[9] "ftv"   "bwt"
• Look at data structure
'data.frame': 189 obs. of 10 variables:
 $ low : int 0 0 0 0 0 0 0 0 0 0 ...
 $ age : int 19 33 20 21 18 21 22 17 29 26 ...
 $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
+Setting and checking your data
• Explore the variables
> ?birthwt
or
> help(birthwt)
• Check data summary
> summary(birthwt)
• Rename the data if desired, e.g.
> bw <- birthwt
+Setting and checking your data
• Explore the variables
lwt: mother's weight in pounds at last menstrual period.
bwt: birth weight in grams.
• Check data summary
      low              age
 Min.   :0.0000   Min.   :14.00
 1st Qu.:0.0000   1st Qu.:19.00
 Median :0.0000   Median :23.00
 Mean   :0.3122   Mean   :23.24
 3rd Qu.:1.0000   3rd Qu.:26.00
 Max.   :1.0000   Max.   :45.00
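A few extra structural checks can sit alongside `str()` and `summary()`. The sketch below is an illustration only; the choice to look at `lwt` and `bwt` is an assumption based on the variables used later in the tutorial.

```r
data(birthwt, package = "MASS")
bw <- birthwt  # working copy under a shorter name, as on the slide

dim(bw)                          # number of rows and columns
sapply(bw, class)                # storage class of each variable
colSums(is.na(bw))               # count of missing values per variable
summary(bw[, c("lwt", "bwt")])   # summary restricted to two variables
```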
+Setting and checking your data
■ Examine all scatterplots
> pairs(birthwt)
• Choose two variables to scatterplot
> plot(birthwt$bwt, birthwt$lwt)
• Examine correlation results
> cor(birthwt$bwt, birthwt$lwt)
+Setting and checking your data
• Correlation of lwt and bwt
[1] 0.1857333
[Figure: scatterplot of birthwt$bwt (x-axis) against birthwt$lwt (y-axis)]
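To compare several pairs at once, `cor()` also accepts a data frame and returns a correlation matrix. This is a minimal sketch; the three columns picked here are an illustrative assumption, not a recommendation from the slides.

```r
data(birthwt, package = "MASS")

# Correlation matrix for a few numeric variables (column choice is illustrative).
vars <- c("bwt", "lwt", "age")
cm <- cor(birthwt[, vars])
round(cm, 3)

# The bwt/lwt entry matches the value cor() returns for that pair alone.
r_pair <- cor(birthwt$bwt, birthwt$lwt)
```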
+Setting and checking your data
■ Check normality and distribution
> hist(birthwt$lwt)
and/or
> stem(birthwt$lwt)
> hist(birthwt$bwt)
+Setting and checking your data
■ Check normality and distribution
[Figure: Histogram of birthwt$lwt (x-axis: birthwt$lwt, y-axis: Frequency)]
[Figure: Histogram of birthwt$bwt (x-axis: birthwt$bwt, y-axis: Frequency)]
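Histograms can be backed up with a Q-Q plot and a formal test. The sketch below uses `qqnorm()`, `qqline()` and `shapiro.test()` from base R; these are additions for illustration and are not used in the original slides.

```r
data(birthwt, package = "MASS")

# Q-Q plot: points close to the line suggest approximate normality.
qqnorm(birthwt$lwt, main = "Normal Q-Q: lwt")
qqline(birthwt$lwt)

# Shapiro-Wilk test: a small p-value suggests a departure from normality.
sw_lwt <- shapiro.test(birthwt$lwt)
sw_bwt <- shapiro.test(birthwt$bwt)
sw_lwt$p.value
sw_bwt$p.value
```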
+Setting and checking your data
■ Transform your data if needed
• Create a new vector (column) for this
> sqrtlwt <- sqrt(birthwt$lwt)
> loglwt <- log(birthwt$lwt)
• Recheck your data
> hist(sqrtlwt)
> hist(loglwt)
+Setting and checking your data
■ Transform your data if needed
[Figure: Histogram of loglwt (x-axis: loglwt, y-axis: Frequency)]
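One possible variation is to store the transformed values as columns of a working copy, so they travel with the data. This is a sketch under that assumption; the names `bw`, `sqrtlwt` and `loglwt` as columns are illustrative.

```r
data(birthwt, package = "MASS")
bw <- birthwt

# Keep the transformed values as new columns of the working copy.
bw$sqrtlwt <- sqrt(bw$lwt)
bw$loglwt  <- log(bw$lwt)   # log() is the natural logarithm

# Side-by-side histograms of the raw and log-transformed weights.
op <- par(mfrow = c(1, 2))
hist(bw$lwt, main = "lwt")
hist(bw$loglwt, main = "log(lwt)")
par(op)   # restore the previous plotting layout
```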
+Parametric: correlation
■ Recheck your results
> cor(loglwt, birthwt$bwt)
The default method is Pearson's r.
> plot(loglwt, birthwt$bwt)
+Parametric: correlation
[1] 0.2036035
[Figure: scatterplot of loglwt (x-axis) against birthwt$bwt (y-axis)]
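`cor()` returns only the coefficient. If a p-value and confidence interval are also wanted, `cor.test()` is one option; the sketch below is an addition to the slides, not the presenters' own step.

```r
data(birthwt, package = "MASS")
loglwt <- log(birthwt$lwt)

# Pearson correlation with a significance test and confidence interval.
ct <- cor.test(loglwt, birthwt$bwt, method = "pearson")
ct$estimate   # Pearson's r
ct$p.value    # p-value for the null hypothesis r = 0
ct$conf.int   # 95% confidence interval for r
```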
+Parametric: Linear Regression
■ Specify model for simple regression
> m1 <- lm(birthwt$bwt ~ loglwt)
■ Check your results with summary
> summary(m1)
You will want to check the p-value, R², slope, and F-statistic
+Parametric: Linear Regression
■ Summary
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -390.8     1174.0  -0.333  0.73958
loglwt         688.9      242.2   2.844  0.00495 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 715.8 on 187 degrees of freedom
Multiple R-squared: 0.04145, Adjusted R-squared: 0.03633
F-statistic: 8.087 on 1 and 187 DF, p-value: 0.004954
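The quantities worth checking can also be pulled out of the summary object instead of read off the printout. This is a minimal sketch; it refits the same model with a `data` argument, and the names `bw`, `s`, `slope`, `r2` and `p_model` are illustrative.

```r
data(birthwt, package = "MASS")
bw <- birthwt
bw$loglwt <- log(bw$lwt)

# Same model as m1 on the slide, written with a data argument.
m1 <- lm(bwt ~ loglwt, data = bw)
s  <- summary(m1)

slope   <- coef(s)["loglwt", "Estimate"]   # estimated slope
p_slope <- coef(s)["loglwt", "Pr(>|t|)"]   # p-value for the slope
r2      <- s$r.squared                     # multiple R-squared
f       <- s$fstatistic                    # named vector: value, numdf, dendf
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

c(slope = slope, p_slope = p_slope, r2 = r2, F = f[["value"]], p_model = p_model)
```

With a single predictor, the slope's p-value and the overall F-test p-value coincide, which matches the summary shown above.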
+Parametric: Linear Regression
■ Plot your model, check normality
> plot(m1)
Plot shows:
• Residuals vs fitted: numbered points are potential problem points that may skew the model.
• Q-Q plot
+
[Figure: Residuals vs Fitted for lm(birthwt$bwt ~ loglwt) (x-axis: Fitted values, y-axis: Residuals; labeled points 130, 131, 133)]
[Figure: Normal Q-Q for lm(birthwt$bwt ~ loglwt) (x-axis: Theoretical Quantiles, y-axis: Standardized residuals; labeled points 130, 131, 133)]
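The diagnostic plots can be shown on a single page and followed by a numeric check of the residuals. This is a sketch only; `par(mfrow = ...)`, `rstandard()` and `shapiro.test()` are base R additions not used in the slides.

```r
data(birthwt, package = "MASS")
bw <- birthwt
bw$loglwt <- log(bw$lwt)
m1 <- lm(bwt ~ loglwt, data = bw)

# Show the four default diagnostic plots on one page.
op <- par(mfrow = c(2, 2))
plot(m1)
par(op)

# Numeric follow-up: test the residuals for normality.
res <- residuals(m1)
shapiro.test(res)

# Observations with the largest absolute standardized residuals.
std_res <- rstandard(m1)
head(sort(abs(std_res), decreasing = TRUE), 3)
```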
+Parametric: Linear Regression
■ Confidence and Prediction
• Confidence intervals for all parameters
> confint(m1)
> confint(m1, level = 0.95)
• Confidence interval for the mean response
> predict.lm(m1, interval="confidence")
• Prediction interval for a single new response
> predict.lm(m1, interval="prediction")
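To get intervals for a specific new case rather than for every row, `predict()` takes a `newdata` argument. The sketch below assumes a hypothetical mother weighing 130 pounds; that value and the names `new_mother`, `conf_int` and `pred_int` are illustrative.

```r
data(birthwt, package = "MASS")
bw <- birthwt
bw$loglwt <- log(bw$lwt)
m1 <- lm(bwt ~ loglwt, data = bw)   # fitted with data= so newdata is simple to supply

# Hypothetical new case: a mother weighing 130 pounds.
new_mother <- data.frame(loglwt = log(130))

# Interval for the mean birth weight at this weight.
conf_int <- predict(m1, newdata = new_mother, interval = "confidence")
# Interval for one individual birth at this weight.
pred_int <- predict(m1, newdata = new_mother, interval = "prediction")

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

The prediction interval is wider because it also accounts for the scatter of individual births around the mean.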
+Parametric: Linear Regression
■ Add the line of best fit
Rerun the plot first if needed:
> plot(loglwt, birthwt$bwt)
> abline(m1)
■ Find the regression equation
• Infer from the summary coefficients: y = B0 + B1x, where B0 is the intercept and B1 is the slope (which may be negative)
+Parametric: Linear Regression
(Intercept) -390.8
loglwt 688.9
[Figure: scatterplot of loglwt (x-axis) against birthwt$bwt (y-axis) with the fitted regression line]
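The regression equation can be assembled from `coef()` and checked against `predict()`. This is a minimal sketch; the 130-pound example value and the names `b0`, `b1`, `by_hand` and `by_predict` are assumptions for illustration.

```r
data(birthwt, package = "MASS")
bw <- birthwt
bw$loglwt <- log(bw$lwt)
m1 <- lm(bwt ~ loglwt, data = bw)

# Pull the intercept (B0) and slope (B1) from the fitted model.
b  <- coef(m1)
b0 <- b[["(Intercept)"]]
b1 <- b[["loglwt"]]
cat(sprintf("bwt = %.1f + %.1f * log(lwt)\n", b0, b1))

# Apply the equation by hand and compare with predict().
x          <- log(130)   # hypothetical mother weighing 130 pounds
by_hand    <- b0 + b1 * x
by_predict <- unname(predict(m1, newdata = data.frame(loglwt = x)))
c(by_hand = by_hand, by_predict = by_predict)
```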
+Nonparametric
Use when the residuals are not normally distributed, or when a linear relationship between x and y cannot be assumed.
■ Correlation
• Change the correlation coefficient to a nonparametric option
> ?cor
> cor(birthwt$bwt, birthwt$lwt, method="spearman")
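For a p-value alongside the rank correlation, `cor.test()` accepts the same `method` argument. This sketch is an addition to the slides; `exact = FALSE` is chosen here because the data contain tied values.

```r
data(birthwt, package = "MASS")

# Spearman's rho with a test; exact = FALSE avoids the warning caused by ties.
sp <- cor.test(birthwt$bwt, birthwt$lwt, method = "spearman", exact = FALSE)
sp$estimate   # Spearman's rho
sp$p.value

# Spearman's rho is Pearson's r computed on the ranks of the data.
rho_ranks <- cor(rank(birthwt$bwt), rank(birthwt$lwt))
```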
+Nonparametric
■ Smooth with loess, then use linear regression
> m1.lo <- loess(birthwt$bwt~loglwt, span = 100, degree = 1)
> j <- order(loglwt)
> plot(m1.lo)
> lines(loglwt[j],m1.lo$fitted[j],col="red",lwd=3)
• Check residuals again
> summary(m1.lo)
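A smaller span makes the smooth follow local structure more closely. The sketch below is a variation, not the slide's setting: it assumes `span = 0.75` (the `loess()` default) and draws the curve from predictions on an evenly spaced grid; the names `lo` and `grid` are illustrative.

```r
data(birthwt, package = "MASS")
bw <- birthwt
bw$loglwt <- log(bw$lwt)

# Local linear fit; span = 0.75 is loess()'s default, used here as an assumption.
lo <- loess(bwt ~ loglwt, data = bw, span = 0.75, degree = 1)

# Evaluate the smooth on an evenly spaced grid inside the observed range.
grid <- data.frame(loglwt = seq(min(bw$loglwt), max(bw$loglwt), length.out = 50))
grid$fit <- predict(lo, newdata = grid)

# Scatterplot of the data with the smooth overlaid.
plot(bw$loglwt, bw$bwt, xlab = "loglwt", ylab = "bwt")
lines(grid$loglwt, grid$fit, col = "red", lwd = 3)
```

Because the grid is already sorted, no `order()` step is needed before calling `lines()`.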
+Further practice
■ Run through the tutorial once with a new data set that meets the parametric requirements, and once with a data set that calls for nonparametric methods.
• For new data:
> data()
• Or browse online: https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
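The list returned by `data()` can also be inspected programmatically. This is a sketch for getting started; picking `cars` is an illustrative choice, not a suggestion from the slides, and whether it meets parametric requirements still has to be checked with the steps above.

```r
# List the data sets shipped with the base 'datasets' package.
ds <- data(package = "datasets")$results
head(ds[, c("Item", "Title")])

# Example candidate (illustrative choice): the 'cars' data set.
data(cars)
str(cars)
cor(cars$speed, cars$dist)
```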
+Sources
■ Hartlaub, BA. 2011. “Introduction to R.” [internet]. Downloaded on January 26, 2015. Available at http://www2.kenyon.edu/Depts/Math/hartlaub/Math305%20Fall2011/R.htm
■ Hosmer DW, Lemeshow S, and Sturdivant RX, editors. 1989. Applied Logistic Regression, 3rd edition. New York: John Wiley & Sons Inc.
■ Stack Exchange. [internet]. “Fit a Line with LOESS in R.” Downloaded on January 30, 2015. Available at http://stackoverflow.com/questions/15337777/fit-a-line-with-loess-in-r