1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 5b, February 21, 2014 Interpreting regression, kNN and K-means results, evaluating models

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 5b, February 21, 2014

Interpreting regression, kNN and K-means results, evaluating models

Linear and least-squares

> EPI_data <- read.csv("EPI_data.csv")

> attach(EPI_data)

> boxplot(ENVHEALTH,DALY,AIR_H,WATER_H)

> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)

> lmENVH

Let’s recall what this taught you!

> summary(lmENVH)

> cENVH<-coef(lmENVH)

Linear and least-squares

> lmENVH<-lm(ENVHEALTH~DALY+AIR_H+WATER_H)

> lmENVH

Call:

lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)

Coefficients:

(Intercept)        DALY       AIR_H     WATER_H
 -2.673e-05   5.000e-01   2.500e-01   2.500e-01

> summary(lmENVH)

> cENVH<-coef(lmENVH)

Linear and least-squares

> summary(lmENVH)

Call:

lm(formula = ENVHEALTH ~ DALY + AIR_H + WATER_H)

Residuals:

       Min         1Q     Median         3Q        Max
-0.0072734 -0.0027299  0.0001145  0.0021423  0.0055205

Coefficients:

              Estimate Std. Error   t value Pr(>|t|)
(Intercept) -2.673e-05  6.377e-04    -0.042    0.967
DALY         5.000e-01  1.922e-05 26020.669   <2e-16 ***
AIR_H        2.500e-01  1.273e-05 19645.297   <2e-16 ***
WATER_H      2.500e-01  1.751e-05 14279.903   <2e-16 ***

---

Reading the p-values (null hypothesis vs. this fit):

• p < 0.01: very strong presumption against the null hypothesis

• 0.01 < p < 0.05: strong presumption against the null hypothesis

• 0.05 < p < 0.1: low presumption against the null hypothesis

• p > 0.1: no presumption against the null hypothesis
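
The thresholds above can be applied to the coefficient table programmatically. The following is a minimal sketch on synthetic data (the values are made up to resemble the EPI columns, not read from the real file); the labels in `strength` simply restate the rule of thumb.

```r
# Synthetic stand-in for the EPI columns (assumed values, not the real data)
set.seed(1)
n <- 200
DALY <- runif(n, 0, 100)
AIR_H <- runif(n, 0, 100)
WATER_H <- runif(n, 0, 100)
ENVHEALTH <- 0.5 * DALY + 0.25 * AIR_H + 0.25 * WATER_H + rnorm(n, sd = 0.003)

fit <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)

# summary()$coefficients is a matrix with columns
# Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
pvals <- coefs[, "Pr(>|t|)"]

# Map each p-value onto the presumption against the null hypothesis
strength <- cut(pvals,
                breaks = c(0, 0.01, 0.05, 0.1, 1),
                labels = c("very strong", "strong", "low", "none"),
                include.lowest = TRUE)
result <- data.frame(p = pvals, presumption = strength)
print(result)
```

As in the slide, a large p-value on the intercept only says the intercept is not distinguishable from zero; it says nothing about the slopes.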

Linear and least-squares

Continued:

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.003097 on 178 degrees of freedom

(49 observations deleted due to missingness)

Multiple R-squared: 1, Adjusted R-squared: 1

F-statistic: 3.983e+09 on 3 and 178 DF, p-value: < 2.2e-16

> names(lmENVH)

[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"

[7] "qr" "df.residual" "na.action" "xlevels" "call" "terms"

[13] "model"

Object of class lm:

An object of class "lm" is a list containing at least the following components:

coefficients a named vector of coefficients

residuals the residuals, that is response minus fitted values.

fitted.values the fitted mean values.

rank the numeric rank of the fitted linear model.

weights (only for weighted fits) the specified weights.

df.residual the residual degrees of freedom.

call the matched call.

terms the terms object used.

contrasts (only where relevant) the contrasts used.

xlevels (only where relevant) a record of the levels of the factors used in fitting.

offset the offset used (missing if none were used).

y if requested, the response used.

x if requested, the model matrix used.

model if requested (the default), the model frame used.
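
A short sketch of pulling these components out of a fitted object, using a small synthetic regression (the data are an assumption for illustration). It also checks the stated definition of residuals as response minus fitted values.

```r
set.seed(2)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
fit <- lm(y ~ x)

names(fit)              # list of available components
fit$coefficients        # same values as coef(fit)
fit$rank                # number of estimated coefficients (2 here)
fit$df.residual         # observations minus rank (18 here)

# residuals are response minus fitted values
resid_check <- all.equal(unname(fit$residuals),
                         unname(y - fit$fitted.values))
resid_check
```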

Plot original versus fitted

> plot(ENVHEALTH,col="red")

> points(lmENVH$fitted.values,col="blue")

> Huh?

Try again!

> plot(ENVHEALTH[!is.na(ENVHEALTH)], col="red")

> points(lmENVH$fitted.values,col="blue")
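
The mismatch arises because lm() drops every row with a missing value in the response or in any predictor, so the fitted values can be shorter than even the NA-filtered response. One alternative, sketched here on synthetic data (an assumption, not the EPI file), is `na.action = na.exclude`, which makes `fitted()` pad the dropped rows with NA so the indices line up with the original vector.

```r
set.seed(3)
x <- as.numeric(1:10)
y <- 2 * x + rnorm(10)
y[3] <- NA   # missing response
x[7] <- NA   # missing predictor

fit_default <- lm(y ~ x)                          # default: rows 3 and 7 dropped
fit_padded  <- lm(y ~ x, na.action = na.exclude)  # dropped rows remembered

n_fitted_default <- length(fit_default$fitted.values)  # 8 rows used in the fit
n_response_ok    <- sum(!is.na(y))                      # 9 non-missing responses
n_fitted_padded  <- length(fitted(fit_padded))          # 10, NA where a row was dropped
c(n_fitted_default, n_response_ok, n_fitted_padded)
```

With the padded version, `plot(y, col = "red")` followed by `points(fitted(fit_padded), col = "blue")` overlays the two series index for index. Note that the padding comes from `fitted()`, not from the `$fitted.values` component.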

Predict

> cENVH<-coef(lmENVH)

> DALYNEW<-c(seq(5,95,5)) #2

> AIR_HNEW<-c(seq(5,95,5)) #3

> WATER_HNEW<-c(seq(5,95,5)) #4

Predict

> NEW<-data.frame(DALYNEW,AIR_HNEW,WATER_HNEW)

> pENV<- predict(lmENVH,NEW,interval="prediction")

> cENV<- predict(lmENVH,NEW,interval="confidence") # look up what this does

Predict object returns

predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. Access via [,1] etc.

If se.fit is TRUE, a list with the following components is returned:

fit vector or matrix as above

se.fit standard error of predicted means

residual.scale residual standard deviations

df degrees of freedom for residual
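
A small sketch of both return shapes, on a synthetic one-predictor fit (the data and the name `new` are assumptions for illustration): a matrix when only `interval` is set, and a list when `se.fit = TRUE` is added.

```r
set.seed(4)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

new <- data.frame(x = c(2, 5, 8))   # column name matches the predictor

# Matrix with columns fit, lwr, upr
p <- predict(fit, new, interval = "prediction")
colnames(p)
p[, 1]        # point predictions, same as p[, "fit"]

# List with fit, se.fit, df, residual.scale
ps <- predict(fit, new, interval = "confidence", se.fit = TRUE)
names(ps)
ps$df         # residual degrees of freedom (50 - 2)
```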

Output from predict

> head(pENV)

       fit      lwr      upr
1       NA       NA       NA
2 11.55213 11.54591 11.55834
3 18.29168 18.28546 18.29791
4       NA       NA       NA
5 69.92533 69.91915 69.93151
6 90.20589 90.19974 90.21204

…

> tail(pENV)

         fit      lwr      upr
226       NA       NA       NA
227       NA       NA       NA
228 34.95256 34.94641 34.95871
229 59.00213 58.99593 59.00834
230 24.20951 24.20334 24.21569
231 38.03701 38.03084 38.04319
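
The output above runs to row 231 and contains NAs, which suggests predict() was working on the original attached columns rather than on the 19-row grid. That is the usual result when the columns of the new data frame (DALYNEW, ...) do not carry the predictor names used in the formula (DALY, ...). A minimal sketch of the renamed version, on synthetic data standing in for the EPI columns (an assumption), together with the difference between the two interval types:

```r
set.seed(5)
n <- 30
d <- data.frame(DALY = runif(n, 0, 100),
                AIR_H = runif(n, 0, 100),
                WATER_H = runif(n, 0, 100))
d$ENVHEALTH <- 0.5 * d$DALY + 0.25 * d$AIR_H + 0.25 * d$WATER_H +
  rnorm(n, sd = 0.003)
fit <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H, data = d)

grid <- seq(5, 95, 5)
# Column names must match the predictor names in the formula
NEW <- data.frame(DALY = grid, AIR_H = grid, WATER_H = grid)

pENV <- predict(fit, NEW, interval = "prediction")  # range for a new observation
cENV <- predict(fit, NEW, interval = "confidence")  # range for the mean response

nrow(pENV)   # one row per row of NEW
pred_width <- pENV[, "upr"] - pENV[, "lwr"]
conf_width <- cENV[, "upr"] - cENV[, "lwr"]
```

The prediction interval is always the wider of the two, because it adds the residual variance of a single new observation to the uncertainty in the fitted mean.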

Did you repeat this for: AIR_E, CLIMATE?

K Nearest Neighbors (classification)

Scripts – Lab4b_0_2014.R

> nyt1<-read.csv("nyt1.csv")

… from week 4b slides or script

> classif<-knn(train,test,cg,k=5)

#

> head(true.labels)

[1] 1 0 0 1 1 0

> head(classif)

[1] 1 1 1 1 0 0

Levels: 0 1

> ncorrect<-true.labels==classif

> table(ncorrect)["TRUE"] # or > length(which(ncorrect))

> What do you conclude?

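
To draw a conclusion, the count of correct predictions is usually turned into an accuracy and a confusion table. The sketch below uses `knn()` from the `class` package on synthetic two-class data (the data and the split are assumptions, not the nyt1 set).

```r
library(class)

set.seed(6)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)
labels <- factor(ifelse(x[, 1] + x[, 2] > 0, 1, 0))

train_id <- sample(n, 150)
classif <- knn(x[train_id, ], x[-train_id, ], cl = labels[train_id], k = 5)
true.labels <- labels[-train_id]

# Fraction of test points classified correctly
accuracy <- mean(classif == true.labels)

# Rows: predicted class, columns: actual class
confusion <- table(predicted = classif, actual = true.labels)
accuracy
confusion
```

The off-diagonal cells show which kind of error dominates, something a single count of TRUE values cannot reveal.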
Contingency tables

> table(nyt1$Impressions,nyt1$Gender)

        0    1
 1     69   85
 2    389  395
 3    975  937
 4   1496 1572
 5   1897 2012
 6   1822 1927
 7   1525 1696
 8   1142 1203
 9    722  711
10    366  400
11    214  200
12     86  101
13     41   43
14     10    9
15      5    7
16      0    4
17      0    1

Contingency table: displays the (multivariate) frequency distribution of the variables.

Tests for significance (not now)

> table(nyt1$Clicks,nyt1$Gender)

        0     1
1   10335 10846
2     415   440
3       9    17
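
Raw counts are easier to compare once converted to proportions. This sketch rebuilds the Clicks by Gender table from the counts shown on the slide (row labels kept as they appear there) and computes margins and column-wise proportions with base R.

```r
# Counts copied from the slide; row labels as printed there
tab <- as.table(matrix(c(10335, 415, 9,
                         10846, 440, 17),
                       nrow = 3,
                       dimnames = list(Clicks = c("1", "2", "3"),
                                       Gender = c("0", "1"))))

with_totals <- addmargins(tab)            # adds row and column sums
col_props <- prop.table(tab, margin = 2)  # share of each click level within a gender
with_totals
round(col_props, 4)
```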

Regression

> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))

> m1<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET),data=bronx)

You were reminded that log(0) is … not fun

THINK through what you are doing…

Filtering is somewhat inevitable:

> bronx<-bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]

> m1<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET),data=bronx)
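
A toy illustration of why the filter is needed, using a small made-up data frame (`bronx_demo` and its values are hypothetical): log(0) is -Inf, and lm() stops with an error if the response or a predictor contains non-finite values.

```r
# Hypothetical rows, some with zero area or zero price
bronx_demo <- data.frame(
  GROSS.SQUARE.FEET = c(0, 1200, 2500, 1800, 0, 3000, 2200, 1500),
  LAND.SQUARE.FEET  = c(2000, 1500, 0, 2500, 1800, 4000, 3000, 2000),
  SALE.PRICE        = c(350000, 0, 610000, 450000, 200000, 720000, 500000, 390000)
)

log(0)   # -Inf, which lm() cannot use

# Keep only rows where every logged quantity is strictly positive
keep <- with(bronx_demo,
             GROSS.SQUARE.FEET > 0 & LAND.SQUARE.FEET > 0 & SALE.PRICE > 0)
clean <- bronx_demo[keep, ]

m_demo <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = clean)
coef(m_demo)
```

Writing the formula with bare column names and `data = clean` keeps the model tied to the filtered frame, so later calls to predict() with new data behave as expected.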

Interpreting this!

Call:

lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)

Residuals:

     Min       1Q   Median       3Q      Max
-14.4529   0.0377   0.4160   0.6572   3.8159

Coefficients:

                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.0271     0.3088   22.75   <2e-16 ***
log(GROSS.SQUARE.FEET)   0.7013     0.0379   18.50   <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.95 on 2435 degrees of freedom

Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229

F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16
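
One way to read this output: in a log-log model the slope is an elasticity, so a 1% increase in gross square feet goes with roughly a 0.70% increase in sale price. The arithmetic below uses only the numbers printed in the summary; the variable names are illustrative.

```r
slope <- 0.7013       # coefficient on log(GROSS.SQUARE.FEET)
se <- 0.0379          # its standard error
r2 <- 0.1233          # multiple R-squared

# Multiplicative change in price when the area doubles
price_ratio_if_area_doubles <- 2^slope

# With a single predictor the F-statistic is the squared t value
t_value <- slope / se
f_from_t <- t_value^2

# Share of the variance in log(price) left unexplained
unexplained_share <- 1 - r2

c(price_ratio_if_area_doubles, t_value, f_from_t, unexplained_share)
```

So the slope is estimated precisely, yet almost 88% of the variation in log price remains unexplained, and the minimum residual of -14.45 points to a handful of extreme sales that deserve a closer look.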

Plots – tell me what they tell you!

Solution model 2

> m2<-lm(log(SALE.PRICE)~log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD),data=bronx)

> summary(m2)

> plot(resid(m2))

#

> m2a<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD),data=bronx)

> summary(m2a)

> plot(resid(m2a))
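
A caution when comparing a model with and without the intercept: when the intercept is removed with `0 +`, summary() computes R-squared relative to zero rather than relative to the mean of the response, so the two values are not directly comparable. A synthetic sketch (made-up data) where the no-intercept fit is clearly worse but reports the higher R-squared:

```r
set.seed(8)
x <- runif(100, 1, 10)
y <- 50 + 0.5 * x + rnorm(100)

with_int <- lm(y ~ x)
no_int   <- lm(y ~ 0 + x)

r2_with <- summary(with_int)$r.squared
r2_no   <- summary(no_int)$r.squared

# Residual standard error is comparable across the two fits
sigma_with <- summary(with_int)$sigma
sigma_no   <- summary(no_int)$sigma

c(r2_with = r2_with, r2_no = r2_no,
  sigma_with = sigma_with, sigma_no = sigma_no)
```

The residual standard error and the residual plots are the safer basis for comparison in this situation.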

How do you interpret this residual plot?

Solution model 3 and 4

> m3<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD)+factor(BUILDING.CLASS.CATEGORY),data=bronx)

> summary(m3)

> plot(resid(m3))

#

> m4<-lm(log(SALE.PRICE)~0+log(GROSS.SQUARE.FEET)+log(LAND.SQUARE.FEET)+factor(NEIGHBORHOOD)*factor(BUILDING.CLASS.CATEGORY),data=bronx)

> summary(m4)

> plot(resid(m4))
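
Models 3 and 4 differ only in the interaction term (`+` versus `*`), which makes them nested, so they can be compared formally. A sketch on synthetic data (the frame `d` and its columns are hypothetical) using anova(), AIC() and adjusted R-squared:

```r
set.seed(7)
n <- 300
d <- data.frame(
  area = runif(n, 500, 5000),
  hood = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  type = factor(sample(c("one", "two"), n, replace = TRUE))
)
hood_effect <- unname(c(A = 0, B = 0.3, C = -0.2)[as.character(d$hood)])
d$price <- exp(7 + 0.7 * log(d$area) + hood_effect + rnorm(n, sd = 0.3))

fit_add <- lm(log(price) ~ log(area) + hood + type, data = d)   # additive
fit_int <- lm(log(price) ~ log(area) + hood * type, data = d)   # with interaction

cmp_anova <- anova(fit_add, fit_int)   # F-test for the extra interaction terms
cmp_aic <- AIC(fit_add, fit_int)       # lower is better
adj_r2 <- c(additive = summary(fit_add)$adj.r.squared,
            interaction = summary(fit_int)$adj.r.squared)
cmp_anova
cmp_aic
adj_r2
```

The interaction adds (3 - 1) x (2 - 1) = 2 coefficients here; with many neighborhoods and building classes the count grows quickly, which is worth weighing against any gain in fit.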

And this one?

Did you get to create the sales map?

table(mapcoord$NEIGHBORHOOD) # contingency table

mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD) # and this?

geoPlot(mapcoord,zoom=12,color=mapcoord$NEIGHBORHOOD) # this one is easier

Did you forget the KNN?

#almost there.

mapcoord$class<-as.numeric(mapcoord$NEIGHBORHOOD)

nclass<-dim(mapcoord)[1]

split<-0.8

trainid<-sample.int(nclass,floor(split*nclass))

testid<-(1:nclass)[-trainid]

##mappred<-mapcoord[testid,]

##mappred$class<-as.numeric(mappred$NEIGHBORHOOD)

KNN!

Did you loop over k?

knnpred<-knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=5)

knntesterr<-sum(knnpred!=mapcoord[testid,2])/length(testid)
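
A sketch of looping over k with the same 80/20 split logic, on synthetic coordinates with four made-up "neighborhoods" (the data frame `coords`, the labels, and the candidate values in `ks` are assumptions).

```r
library(class)

set.seed(9)
n <- 300
coords <- data.frame(lat = runif(n), lon = runif(n))
hood <- factor(ifelse(coords$lat > 0.5,
                      ifelse(coords$lon > 0.5, "NE", "NW"),
                      ifelse(coords$lon > 0.5, "SE", "SW")))

split <- 0.8
trainid <- sample.int(n, floor(split * n))
testid <- (1:n)[-trainid]

ks <- c(1, 3, 5, 7, 9, 11)
testerr <- sapply(ks, function(k) {
  pred <- knn(coords[trainid, ], coords[testid, ], cl = hood[trainid], k = k)
  mean(pred != hood[testid])   # test error rate for this k
})

err_by_k <- data.frame(k = ks, testerr = testerr)
best_k <- ks[which.min(testerr)]
err_by_k
best_k
```

A single random split gives a noisy estimate, so repeating the split several times (or cross-validating) before settling on k is a reasonable next step.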

K-Means!

> mapmeans<-data.frame(adduse$ZIP.CODE, as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')

> mapobj<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

> fitted(mapobj,method=c("centers","classes"))

Return object

cluster A vector of integers (from 1:k) indicating the cluster to which each point is allocated.

centers A matrix of cluster centres.

totss The total sum of squares.

withinss Vector of within-cluster sum of squares, one component per cluster.

tot.withinss Total within-cluster sum of squares, i.e., sum(withinss).

betweenss The between-cluster sum of squares, i.e. totss-tot.withinss.

size The number of points in each cluster.
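
A sketch that inspects these components on three synthetic, well-separated blobs (the matrix `pts` is made up) and checks the two identities stated above.

```r
set.seed(10)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.5), ncol = 2)
)

km <- kmeans(pts, centers = 3, iter.max = 10, nstart = 5)

km$size      # points per cluster
km$centers   # one row per cluster centre

# tot.withinss is the sum of the per-cluster within sums of squares
check_within <- all.equal(km$tot.withinss, sum(km$withinss))
# betweenss is what remains of the total sum of squares
check_between <- all.equal(km$betweenss, km$totss - km$tot.withinss)

# fitted() with method = "classes" returns the cluster assignments
classes <- fitted(km, method = "classes")
c(check_within, check_between)
```

The ratio `km$betweenss / km$totss` is a quick summary of how much of the spread the clustering accounts for. Because k-means works on Euclidean distance, columns on very different scales (sale price next to latitude, say) are often standardized with `scale()` first.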

Huh? What is this?

plot(mapmeans,mapobj$cluster)

Plotting clusters (preview)

library(cluster)

clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

# Centroid Plot against 1st 2 discriminant functions

library(fpc)

plotcluster(mapmeans, mapobj$cluster)

Comparing cluster fits (e.g. different k)

library(fpc)

cluster.stats(d, fit1$cluster, fit2$cluster)

Use help.

> help(plotcluster)

> help(cluster.stats)
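
As a lighter-weight alternative to cluster.stats (this sketch swaps in base R and does not use fpc), fits with different k can be compared through the total within-cluster sum of squares that kmeans() already returns. The data below are three synthetic blobs, an assumption for illustration.

```r
set.seed(11)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.5), ncol = 2)
)

ks <- 2:6
wss <- sapply(ks, function(k) {
  kmeans(pts, centers = k, nstart = 10)$tot.withinss
})

elbow <- data.frame(k = ks, tot.withinss = wss)
elbow
```

The within-cluster sum of squares tends to keep falling as k grows, so the value of k to look for is the one after which the improvement flattens out (the "elbow"), which `plot(ks, wss, type = "b")` makes visible.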

Assignment 3?

• Preliminary and Statistical Analysis. Due next Friday. 15% (written)

– Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.

• How is it going?

Assignments to come

• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ early March. 15% (10% written and 5% oral; individual);

• Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual);

• Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 9. 15% (15% written; individual);

• Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

Admin info (keep/ print this slide)

• Class: ITWS-4963/ITWS 6965

• Hours: 12:00pm-1:50pm Tuesday/ Friday

• Location: SAGE 3101

• Instructor: Peter Fox

• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)

• Contact hours: Monday** 3:00-4:00pm (or by email appt)

• Contact location: Winslow 2120 (sometimes Lally 207A announced by email)

• TA: Lakshmi Chenicheri [email protected]

• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014

– Schedule, lectures, syllabus, reading, assignments, etc.