Introduction to R: Part III
Statistics and Linear Modelling

Alexandre Perera i Lluna (1,2)

(1) Centre de Recerca en Enginyeria Biomèdica (CREB), Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial (ESAII), Universitat Politècnica de Catalunya, mailto:[email protected]

(2) Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN)

Jan 2011 / Introduction to R, Universitat Rovira i Virgili
Contents

1 Statistics: Univariate Data, Bivariate Data, Multivariate Data
2 Tests: Hypothesis tests, Two-population tests, t-Tests, χ²-Tests, Durbin-Watson, Graphical tests
3 Linear regression: Linear models, Regression analysis, Multivariate regression, Variance analysis
mean(), sd()

Mean, standard deviation, variance

Let's define a random variable with a normal distribution:

> x <- rnorm(100, mean = 2, sd = 0.5)

mean(), sd():

> mean(x)
[1] 2.016474
> median(x)
[1] 1.996165
> sd(x)
[1] 0.4814775
> var(x)
[1] 0.2318206
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.068 1.638 1.996 2.016 2.347 3.446
Quantiles
quantile() gives any quantile between 0 and 1:

> quantile(x, 0.25)
25%
1.637719
> quantile(x, c(0.1, 0.9))
10% 90%
1.429676 2.663940
Difference between the 1st and 3rd quartiles:

> IQR(x)
[1] 0.7096768
cut() categorizes continuous variables:

> summary(cut(x, c(min(x), mean(x), quantile(x, 0.75), max(x))))

(1.07,2.02] (2.02,2.35] (2.35,3.45]        NA's
         50          24          25           1
Histograms
> cuts <- quantile(x, seq(0, 1, 0.1))
> hist(x, breaks = cuts)
> rug(x)
[Figure: Histogram of x (density scale), breaks at the deciles, with rug marks]
boxplots
> boxplot(x, horizontal = TRUE,
+ col = "pink", xlab = "cm",
+ main = "Oscillation")
[Figure: horizontal boxplot of x, titled "Oscillation", x-axis in cm]
Density
> cortes <- quantile(x, seq(0,
+ 1, 0.1))
> hist(x, breaks = cortes)
> rug(x)
> lines(density(x, bw = "SJ"),
+ col = "red")
[Figure: Histogram of x with the kernel density estimate (bw = "SJ") overlaid in red and rug marks]
Factors
> language <- as.factor(c("french",
+ "french", "german", "german",
+ "english", "german", "french",
+ "english", "french", "german"))
> gender <- as.factor(c("man",
+ "woman", "woman", "woman",
+ "woman", "woman", "man",
+ "woman", "man", "man"))
> table(gender, language)
language
gender english french german
man 0 3 1
woman 2 1 3
> plot(table(language, gender),
+ col = c("pink", "blue"))
[Figure: mosaic plot of table(language, gender)]
barplot
> barplot(table(language, gender),
+ col = c("pink", "blue",
+ "green"), legend.text = levels(language))
[Figure: stacked barplot of language counts by gender, legend: english, french, german]
stripchart
1D plots, alternative to boxplot() for certain cases:
> attach(iris)
> stripchart(Sepal.Length ~ Species)
[Figure: stripchart of Sepal.Length by Species (setosa, versicolor, virginica)]
> boxplot(Sepal.Length ~ Species)
> detach(iris)
[Figure: boxplots of Sepal.Length by Species]
Formula notation in R
It is possible to use formula notation in R when variables are named (names()).

In the previous example: Sepal.Length ~ Species

Formulas in R

variable ~ group

(the variable, split per group)

This notation is used consistently throughout most R code.
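A minimal sketch of the variable ~ group idea, reusing the built-in iris data from the previous slide (the choice of summary function here is only illustrative): the same formula can drive both a grouped plot and a grouped summary.

# One formula, several functions: grouped boxplot and per-group means
data(iris)
boxplot(Sepal.Length ~ Species, data = iris)
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)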
Formula Notation in R, II
Formulas

response ~ model

See help(formula)

Transformed variables are allowed:

log(Sepal.Length) ~ Species

Arithmetic expressions are allowed when wrapped in I():

I(Sepal.Length + Petal.Length) ~ Species

Heavily used in linear regression, but also in visualization functions.
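A small sketch of how such formulas feed a model fit; the use of lm() here anticipates the linear regression section and is only illustrative.

# log() on the response, I() to protect arithmetic on the left-hand side
data(iris)
fit.log <- lm(log(Sepal.Length) ~ Species, data = iris)
fit.sum <- lm(I(Sepal.Length + Petal.Length) ~ Species, data = iris)
summary(fit.sum)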
Contingency tables
> data(UCBAdmissions)
> UCBAdmissions[, , 1:2]

, , Dept = A

          Gender
Admit      Male Female
  Admitted  512     89
  Rejected  313     19

, , Dept = B

          Gender
Admit      Male Female
  Admitted  353     17
  Rejected  207      8

> DF <- as.data.frame(UCBAdmissions)
> head(DF)

     Admit Gender Dept Freq
1 Admitted   Male    A  512
2 Rejected   Male    A  313
3 Admitted Female    A   89
4 Rejected Female    A   19
5 Admitted   Male    B  353
6 Rejected   Male    B  207
xtabs(): contingency tables for multiple factors

Commonly used on data frames (data.frame)

> xtabs(Freq ~ Gender + Admit, DF)

        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278
> summary(xtabs(Freq ~ ., DF))
Call: xtabs(formula = Freq ~ ., data = DF)
Number of cases in table: 4526
Number of factors: 3
Test for independence of all factors:
        Chisq = 2000.3, df = 16, p-value = 0
score plots I
> def.par <- par(no.readonly = TRUE)
> data(iris)
> xhist <- hist(iris$Petal.Length, plot = FALSE)
> yhist <- hist(iris$Sepal.Length, plot = FALSE)
> top <- max(c(xhist$counts, yhist$counts))
> xrange <- range(iris$Petal.Length)
> yrange <- range(iris$Sepal.Length)
> nf <- layout(matrix(c(2, 0, 1, 3), 2, 2, byrow = TRUE), c(3, 1), c(1, 3), TRUE)
> layout.show(nf)
> par(mar = c(3, 3, 1, 1))
> plot(iris$Petal.Length, iris$Sepal.Length, xlim = xrange, ylim = yrange, xlab = "", ylab = "")
> par(mar = c(0, 3, 1, 1))
> barplot(xhist$counts, axes = FALSE, ylim = c(0, top), space = 0)
> par(mar = c(3, 0, 1, 1))
> barplot(yhist$counts, axes = FALSE, xlim = c(0, top), space = 0, horiz = TRUE)
> par(def.par)
score plots II

[Figure: scatterplot of Petal.Length vs Sepal.Length with marginal histograms, produced by the layout code on the previous slide]
lattice
library(lattice)
xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris)
[Figure: lattice xyplot of Sepal.Length against Sepal.Width, conditioned on Species (panels: setosa, versicolor, virginica)]
pairs()
> panel.hist <- function(x, ...) {
+     usr <- par("usr")
+     on.exit(par(usr))
+     par(usr = c(usr[1:2], 0, 1.5))
+     h <- hist(x, plot = FALSE, breaks = 10)
+     breaks <- h$breaks
+     nB <- length(breaks)
+     y <- h$counts
+     y <- y/max(y)
+     rect(breaks[-nB], 0, breaks[-1], y, col = "blue", ...)
+ }
> pairs(iris[, c(1:4)], panel = panel.smooth,
+     cex = 1.5, pch = 21, bg = as.numeric(iris$Species),
+     diag.panel = panel.hist)
[Figure: pairs() scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, with smoothed panels, histograms on the diagonal, and points colored by Species]
dpq-functions
pnorm(q) returns the probability that a random variable takes a value lower than q (larger than q with lower.tail = FALSE)
> pnorm(c(0, 1))
[1] 0.5000000 0.8413447
> pnorm(1, lower.tail = F)
[1] 0.1586553
qnorm(p) answers the inverse question: which value corresponds to a given probability? (e.g. 0.75 for Q3)
> qnorm(c(0.75, 0.841345))
[1] 0.6744898 1.0000010
dnorm(x) theoretical density function
> curve(pnorm(x), -5, 5, col = "red", ylab = "", frame.plot = FALSE)
> curve(dnorm(x), -5, 5, col = "blue", add = TRUE)
> legend("topleft", legend = c("pnorm(x)", "dnorm(x)"), col = c("red", "blue"))
[Figure: pnorm(x) in red and dnorm(x) in blue over the range −5 to 5]
Standardization
With our random variable x (rnorm(), runif(), ...):

> x <- rnorm(100, mean = 2, sd = 0.5)
> z <- (x - 2)/0.5
> mean(z)
[1] 0.08849595
> sd(z)
[1] 0.986837
z-score:

> pnorm(z)[1:5]

[1] 0.1282827 0.7556697 0.2045399
[4] 0.4998171 0.1222717

> pnorm(x, mean = 2, sd = 0.5)[1:5]

[1] 0.1282827 0.7556697 0.2045399
[4] 0.4998171 0.1222717
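Base R's scale() performs the same standardization; a small sketch assuming the x defined above.

# scale() centers and rescales; by default it uses mean(x) and sd(x)
z.hat  <- scale(x)                           # sample-based z-scores
z.true <- scale(x, center = 2, scale = 0.5)  # using the known parameters
head(cbind(z.hat, z.true))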
t-test
t-statistic:
t = (x̄ − µ) / (s/√n)

> x <- rnorm(100, mean = 2, sd = 0.5)
> t.test(x)

        One Sample t-test

data:  x
t = 38.0565, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.885099 2.092484
sample estimates:
mean of x
 1.988791

> x <- rnorm(100, mean = 0, sd = 0.5)
> t.test(x)

        One Sample t-test

data:  x
t = 0.5835, df = 99, p-value = 0.5609
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.06811229  0.12486603
sample estimates:
 mean of x
0.02837687
Proportion test
In a poll ("yes"/"no") of 100 people, 43 say "yes". Is the population proportion 50%? (two-sided alternative)

H0: null hypothesis p = 0.5
H1: alternative hypothesis p ≠ 0.5
> prop.test(43, 100, p = 0.5)
        1-sample proportions test with continuity correction

data:  43 out of 100, null probability 0.5
X-squared = 1.69, df = 1, p-value = 0.1936
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3326536 0.5327873
sample estimates:
   p
0.43
> prop.test(430, 1000, p = 0.5)
        1-sample proportions test with continuity correction

data:  430 out of 1000, null probability 0.5
X-squared = 19.321, df = 1, p-value = 1.105e-05
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.3991472 0.4613973
sample estimates:
   p
0.43
Wilcox-test
Rain distribution in Albacete (Spain). Asymmetric distribution (a t-test is not appropriate).

H0: µ = 5
H1: µ > 5

> x = c(12.8, 3.5, 2.9, 9.4, 8.7, 0.7, 0.2, 2.8, 1.9, 2.8, 3.1, 15.8)
> stem(x)
The decimal point is 1 digit(s) to the right of the |
  0 | 01233334
  0 | 99
  1 | 3
  1 | 6
> wilcox.test(x, mu = 5, alt = "greater")
        Wilcoxon signed rank test with continuity correction

data:  x
V = 39, p-value = 0.5156
alternative hypothesis: true location is greater than 5

The null hypothesis is not rejected.
t-test for two populations
t-statistic:
t = ((x̄₁ − x̄₂) − (µ₁ − µ₂)) / √(s₁²/n₁ + s₂²/n₂)
Assuming X1 and X2 normally distributed.

Equal variances:
> x = c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)
> y = c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)
> t.test(x, y, alt = "less", var.equal = TRUE)

        Two Sample t-test

data:  x and y
t = -0.5331, df = 18, p-value = 0.3002
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 2.027436
sample estimates:
mean of x mean of y
     11.4      12.3
Unequal variances:
> t.test(x, y, alt = "less")

        Welch Two Sample t-test

data:  x and y
t = -0.5331, df = 16.245, p-value = 0.3006
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 2.044664
sample estimates:
mean of x mean of y
     11.4      12.3
χ2-test
Allows statistical testing of categorical data
χ2-test:
χ² = Σᵢ₌₁ⁿ (fᵢ − eᵢ)² / eᵢ
Assumption: all expected counts are greater than 1 and at least 80% of them are greater than 5.
> freqs <- c(22, 21, 22, 27, 22, 36)
> probs <- rep(1/6, 6)
> chisq.test(freqs, p = probs)

        Chi-squared test for given probabilities

data:  freqs
X-squared = 6.72, df = 5, p-value = 0.2423

> freqs <- c(22, 31, 12, 37, 12, 36)
> probs <- rep(1/6, 6)
> chisq.test(freqs, p = probs)

        Chi-squared test for given probabilities

data:  freqs
X-squared = 25.92, df = 5, p-value = 9.248e-05
χ2-test II
Does a certain process follow a given distribution? (e.g. a die)
> freqs <- c(22, 21, 22, 27, 22, 36)
> probs <- rep(1/6, 6)
> chisq.test(freqs, p = probs)

        Chi-squared test for given probabilities

data:  freqs
X-squared = 6.72, df = 5, p-value = 0.2423

> freqs <- c(22, 31, 12, 37, 12, 36)
> probs <- rep(1/6, 6)
> chisq.test(freqs, p = probs)

        Chi-squared test for given probabilities

data:  freqs
X-squared = 25.92, df = 5, p-value = 9.248e-05
χ2-test III: homogeneity
Are two processes generated by the same distribution? (e.g. two dice, one fair and one loaded (ok/ko))
> dado.ok <- sample(1:6, 200, p = c(1, 1, 1, 1, 1, 1)/6, replace = T)
> dado.ko <- sample(1:6, 100, p = c(0.5, 0.5, 0.5, 0.5, 2, 2)/6, replace = T)
> freqs.ok <- table(dado.ok)
> freqs.ko = table(dado.ko)
> rbind(freqs.ok, freqs.ko)

          1  2  3  4  5  6
freqs.ok 29 25 42 39 40 25
freqs.ko  6 11  5 12 35 31
> chisq.test(rbind(freqs.ok, freqs.ko))
Pearson's Chi-squared test
data:  rbind(freqs.ok, freqs.ko)
X-squared = 35.5763, df = 5, p-value = 1.154e-06

> dado.ok <- sample(1:6, 200, p = c(1, 1, 1, 1, 1, 1)/6, replace = T)
> dado.ko <- sample(1:6, 100, p = c(1.1, 1, 1, 1.1, 1, 1)/6, replace = T)
> freqs.ok <- table(dado.ok)
> freqs.ko = table(dado.ko)
> rbind(freqs.ok, freqs.ko)

          1  2  3  4  5  6
freqs.ok 35 33 38 32 37 25
freqs.ko 12 19 13 18 14 24
> chisq.test(rbind(freqs.ok, freqs.ko))
Pearson's Chi-squared test
data:  rbind(freqs.ok, freqs.ko)
X-squared = 9.2915, df = 5, p-value = 0.09799
Durbin-Watson
Evaluates the Durbin-Watson statistic for error autocorrelation.

durbin.watson() in library(car): Durbin-Watson Test for Autocorrelated Errors
dwtest() in library(lmtest): Durbin-Watson Test

> library(lmtest)
> err1 <- rnorm(100)
> x <- rep(c(-1, 1), 50)
> y1 <- 1 + x + err1
> dwtest(y1 ~ x)
Durbin-Watson test
data:  y1 ~ x
DW = 1.8898, p-value = 0.3244
alternative hypothesis: true autocorrelation is greater than 0

> err2 <- filter(err1, 0.9, method = "recursive")
> y2 <- 1 + x + err2
> dwtest(y2 ~ x)
Durbin-Watson test
data:  y2 ~ x
DW = 0.2426, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
Random numbers: quantile-quantile plots

> x = rnorm(100, 0, 1)
> qqnorm(x, main = "normal(0,1)")
> qqline(x)
[Figure: normal Q-Q plot of the normal(0,1) sample, with qqline]
> x = rnorm(100, 10, 15)
> qqnorm(x, main = "normal(10,15)")
> qqline(x)
[Figure: normal Q-Q plot of the normal(10,15) sample, with qqline]
Random numbers: quantile-quantile plots

> x = rexp(100, 1/10)
> qqnorm(x, main = "exponential mu=10")
> qqline(x)
[Figure: normal Q-Q plot of the exponential (mu = 10) sample, with qqline]
> x = runif(100, 0, 1)
> qqnorm(x, main = "unif(0,1)")
> qqline(x)
[Figure: normal Q-Q plot of the unif(0,1) sample, with qqline]
Linear Models
Assume a response variable Y, dependent on three predictors:

Y = f(X1, X2, X3) + ε

The simplest linear form:

Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε

The predictors need not enter linearly, but the model must be linear in the parameters:

Y = β0 + β1 X1 + β2 log(X2) + β3 X3 X1 + ε

On the other hand:

Y = β0 + β1 X1^β2 + ε

is not linear.
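A minimal sketch of how such models are written with lm(); the simulated data frame d and its column names are assumptions for illustration only.

# Hypothetical simulated data: linear in the coefficients,
# even though x2 enters through log() and x1 interacts with x3
set.seed(1)
d <- data.frame(x1 = runif(50), x2 = runif(50, 1, 2), x3 = runif(50))
d$y <- 1 + 2 * d$x1 + 0.5 * log(d$x2) + rnorm(50, sd = 0.1)
fit <- lm(y ~ x1 + log(x2) + I(x3 * x1), data = d)
coef(fit)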
Linear modelling: matrix representation
Y = Xβ + ε
In matrix representation:

    [ y1 ]   [ 1  x11  x12  ...  x1P ]   [ β0 ]   [ ε1 ]
    [ y2 ] = [ 1  x21  x22  ...  x2P ] · [ β1 ] + [ ε2 ]     (1)
    [ ...]   [ ..................... ]   [ ...]   [ ...]
    [ yn ]   [ 1  xn1  xn2  ...  xnP ]   [ βP ]   [ εn ]
The most simple model is the null model:

    [ y1 ]   [ 1 ]       [ ε1 ]
    [ y2 ] = [ 1 ] · µ + [ ε2 ]     (2)
    [ ...]   [...]       [ ...]
    [ yn ]   [ 1 ]       [ εn ]
Find β so that X β is as close to Y as possible.
ŷ = X β̂

The residuals ε̂ live in a subspace of dimension n − p.
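In R, the design matrix X implied by a formula can be inspected with model.matrix(); a small sketch using the iris data (the choice of dataset is an assumption, not from the slide).

# The design matrix behind a formula: the first column is the intercept (all 1s)
data(iris)
X <- model.matrix(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
head(X)
dim(X)   # n rows, p columns (including the intercept)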
Geometrical representation
Galapagos
Galapagos Islands: 30 islands, 7 variables

Species: the number of species of tortoise found on the island
Area: the area of the island (km²)
Nearest: the distance from the nearest island (km)
Elevation: the highest elevation of the island (m)
Endemics: the number of endemic species
Scruz: the distance from Santa Cruz island (km)
Adjacent: the area of the adjacent island (km²)

> library(faraway)
> data(gala)
> head(gala)

             Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra            58       23 25.09       346     0.6   0.6     1.84
Bartolome         31       21  1.24       109     0.6  26.3   572.33
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
lm()
> plot(Species ~ Elevation, data = gala)
[Figure: scatterplot of Species vs Elevation in the gala data]
lm(): model construction
> mdl <- lm(Species ~ Elevation, data = gala)
> coef(mdl)

(Intercept)   Elevation
 11.3351132   0.2007922

> plot(Species ~ Elevation, data = gala)
> abline(mdl, col = "blue")
[Figure: Species vs Elevation with the fitted regression line in blue]
lm(): model information
> summary(mdl)
Call:
lm(formula = Species ~ Elevation, data = gala)

Residuals:
     Min       1Q   Median       3Q      Max
-218.319  -30.721  -14.690    4.634  259.180

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.33511   19.20529   0.590     0.56
Elevation    0.20079    0.03465   5.795 3.18e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 78.66 on 28 degrees of freedom
Multiple R-squared: 0.5454,  Adjusted R-squared: 0.5291
F-statistic: 33.59 on 1 and 28 DF,  p-value: 3.177e-06
lm(): plot(lm)
> par(mfrow = c(2, 2))
> plot(mdl)
[Figure: lm() diagnostic plots for mdl: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance; SantaCruz, Fernandina and SantaMaria are labelled as extreme points]
lm(): Residuals
> resid(mdl)
      Baltra    Bartolome     Caldwell     Champion      Coamano Daphne.Major
  -22.809212    -2.221462   -31.225423     4.428446   -24.796112   -17.229384
Daphne.Minor       Darwin         Eden      Enderby     Espanola   Fernandina
   -6.008787   -35.068202   -17.591359   -31.823839    45.908033  -218.318650
    Gardner1     Gardner2     Genovesa      Isabela     Marchena       Onslow
   36.826069   -51.914941    13.404680    -7.087387   -29.206835   -14.354918
       Pinta       Pinzon   Las.Plazas       Rabida SanCristobal  SanSalvador
  -63.350647     4.702062   -18.209579   -15.025848   124.897677    43.747160
   SantaCruz      SantaFe   SantaMaria      Seymour      Tortuga         Wolf
  259.180432    -1.340291   145.157883     3.148434   -32.682461   -41.135538
lm(): Predictions
> newdata <- gala[15:nrow(gala), ]
> predict(mdl, newdata)

    Genovesa      Isabela     Marchena       Onslow        Pinta       Pinzon
    26.59532    354.08739     80.20684     16.35492    167.35065    103.29794
  Las.Plazas       Rabida SanCristobal  SanSalvador    SantaCruz      SantaFe
    30.20958     85.02585    155.10232    193.25284    184.81957     63.34029
  SantaMaria      Seymour      Tortuga         Wolf
   139.84212     40.85157     48.68246     62.13554
lm(): Predictions, confidence intervals
> predict(mdl, newdata, level = 0.9, interval = "confidence")
                   fit        lwr       upr
Genovesa      26.59532  -3.2897506  56.48039
Isabela      354.08739 271.4761884 436.69858
Marchena      80.20684  55.7314194 104.68225
Onslow        16.35492 -15.3566679  48.06650
Pinta        167.35065 133.0307302 201.67056
Pinzon       103.29794  78.2982338 128.29764
Las.Plazas    30.20958   0.9226656  59.49649
Rabida        85.02585  60.5948667 109.45683
SanCristobal 155.10232 123.2045765 187.00007
SanSalvador  193.25284 153.2255598 233.28012
SantaCruz    184.81957 146.7231442 222.91599
SantaFe       63.34029  38.0783575  88.60222
SantaMaria   139.84212 110.6221988 169.06203
Seymour       40.85157  13.1644062  68.53873
Tortuga       48.68246  21.9996242  75.36530
Wolf          62.13554  36.7813407  87.48974
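For intervals on individual new observations rather than on the mean response, interval = "prediction" can be requested; a short sketch assuming the mdl and newdata defined above.

# Prediction intervals are wider than confidence intervals:
# they also include the residual variability of single observations
predict(mdl, newdata, level = 0.9, interval = "prediction")[1:3, ]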
lm(): Model construction
> mdl <- lm(Species ~ Elevation + Endemics, data = gala)
> summary(mdl)

Call:
lm(formula = Species ~ Elevation + Endemics, data = gala)

Residuals:
   Min     1Q Median     3Q    Max
-74.85 -12.49   2.59  12.67  70.25

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.92862    7.14320  -2.790  0.00955 **
Elevation    -0.02294    0.02009  -1.142  0.26366
Endemics      4.35265    0.30997  14.042 6.29e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.8 on 27 degrees of freedom
Multiple R-squared: 0.9452,  Adjusted R-squared: 0.9412
F-statistic: 233 on 2 and 27 DF,  p-value: < 2.2e-16
lm(): Model construction
> mdl <- lm(Species ~ ., data = gala)
> summary(mdl)

Call:
lm(formula = Species ~ ., data = gala)

Residuals:
    Min      1Q  Median      3Q     Max
-68.219 -10.225   1.830   9.557  71.090

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.337942   9.423550  -1.628    0.117
Endemics      4.393654   0.481203   9.131 4.13e-09 ***
Area          0.013258   0.011403   1.163    0.257
Elevation    -0.047537   0.047596  -0.999    0.328
Nearest      -0.101460   0.500871  -0.203    0.841
Scruz         0.008256   0.105884   0.078    0.939
Adjacent      0.001811   0.011879   0.152    0.880
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.96 on 23 degrees of freedom
Multiple R-squared: 0.9494,  Adjusted R-squared: 0.9362
F-statistic: 71.88 on 6 and 23 DF,  p-value: 9.674e-14
Other regressors
Partial Least Squares: plsr() in library(pls)

> library(pls)
> data(yarn)
> mod <- plsr(density ~ NIR, ncomp = 10, data = yarn[yarn$train, ], validation = "CV")
> predplot(mod, ncomp = 1:6)
[Figure: predplot panels of measured vs predicted density for 1 to 6 PLS components (cross-validation)]
Other regressors
Principal Component Regression: pcr() in library(pls)

> data(yarn)
> mod <- pcr(density ~ NIR, ncomp = 10, data = yarn[yarn$train, ], validation = "CV")
> predplot(mod, ncomp = 1:6)
[Figure: predplot panels of measured vs predicted density for 1 to 6 principal components (cross-validation)]
Other regressors
Generally it is assumed:
ε is i.i.d. (independent and identically distributed), with var(ε) = σ²I

Residuals are normally distributed

When errors are not i.i.d.:

glm() from library(stats)

glm(model, family = "binomial") (the logistic version)

Independent errors, but not identically distributed:

WLS (weighted least squares), through glm()
Errors not normally distributed:
robust regression, through rlm() from library(MASS)
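A minimal sketch of the two alternatives mentioned above (logistic glm() and robust rlm()); the simulated data are an assumption for illustration only.

library(MASS)                      # provides rlm()
set.seed(2)
x <- rnorm(100)

# Logistic regression: binary response, binomial family
y.bin <- rbinom(100, 1, plogis(0.5 + 1.2 * x))
fit.logit <- glm(y.bin ~ x, family = "binomial")

# Robust regression: continuous response contaminated with a few outliers
y.out <- 1 + 2 * x + rnorm(100)
y.out[1:5] <- y.out[1:5] + 20
fit.rob <- rlm(y.out ~ x)

coef(fit.logit)
coef(fit.rob)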
One way anova
Generalization of t-test
H0: µ0 = µ1 = · · · = µp
> oneway.test(Sepal.Length ~ Species, data = iris)

        One-way analysis of means (not assuming equal variances)

data:  Sepal.Length and Species
F = 138.9083, num df = 2.000, denom df = 92.211, p-value < 2.2e-16

The p-value is small: we reject the null hypothesis of equal means.
[Figure: boxplots of Sepal.Length by Species]
Anova
> mdl <- lm(Sepal.Length ~ Species - 1, data = iris)
> mdl.null <- lm(Sepal.Length ~ 1, data = iris)
> anova(mdl, mdl.null)

Analysis of Variance Table

Model 1: Sepal.Length ~ Species - 1
Model 2: Sepal.Length ~ 1

  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1    147  38.956
2    149 102.168 -2   -63.212 119.26 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Principal Component Analysis: model
> mdl <- prcomp(iris[, -5], center = TRUE, scale = TRUE)
> summary(mdl)

Importance of components:
                        PC1   PC2    PC3     PC4
Standard deviation     1.71 0.956 0.3831 0.14393
Proportion of Variance 0.73 0.229 0.0367 0.00518
Cumulative Proportion  0.73 0.958 0.9948 1.00000

> mdl$rotation

                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
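The variance explained per component can also be visualized directly; a small sketch assuming the mdl fitted above.

# Scree plot and proportion of variance from the component standard deviations
screeplot(mdl, type = "lines")
round(mdl$sdev^2 / sum(mdl$sdev^2), 3)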
Principal Component Analysis: prediction
> proj <- predict(mdl, iris[, -5])
> plot(proj)
[Figure: scatterplot of the projected observations, PC1 vs PC2]
Principal Component Analysis: prediction
> biplot(mdl)
[Figure: biplot of the PCA (PC1 vs PC2) showing observation indices and the loadings of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]
> library(pls)
> scoreplot(mdl, col = as.numeric(iris$Species), pch = 16)
[Figure: score plot, PC1 (73%) vs PC2 (23%), points colored by Species]
End Part III