Download pdf - “Validación y Predicción de los Modelos Lineales en la ...umh1480.edu.umh.es/wp-content/uploads/sites/44/2013/02/teim_ses… · Validación del Modelo Transformaciones Box -Cox

Sesión004b.“ValidaciónyPrediccióndelosModelosLinealesenlaInvestigaciónComercial”1480-TécnicasEstadísticasenInvestigacióndeMercadosGradoenEstadísticaempresariaCurso2012-13– SegundosemestreProfesor:XavierBarberiVallésDepartamento:Estadística,MatemáticaseInformática

Índice

• AjusteModelo9 BondaddelAjustedelmodelo

9 ¿EsBondadoso?

9 ValidacióndelModelo9 ¿EsVálido?

9 PrediccióndelModelo

LosDatos

©XaviBarber2012/2013

Modeloajustado

• Trasrealizarlastéccnicas deselecciónautomáticadelmodelo,orealizandoestaseleccióndeformamanualelmodeloalquesellegaes(enlenguajeR):

modelo1 <-

lm(Cases~Month + Easter + Eggs.pr, data=Eggs)

BondaddeAjusteCuando hemos realizado el ajuste de un modelo de regresión lineal,hemos de verificar que efectivamente dicho modelo proporciona unbuen ajuste a la hora de explicar (predecir) la variable respuesta.Básicamente la bondad del ajuste la cuantificamos con el tanto porciento de variabilidad de la respuesta, que consigue ser explicada por elmodelo ajustado. Para ello contamos con varios tipos de medidas quecuantifican esta variabilidad dediversosmodos.

Comomedidasfundamentalesdebondaddeajustecontamoscon:• elerrorresidualestimados= ;• eltestFdebondaddeajustequeseobtienedelaTabladeAnova;• elcoeficientededeterminaciónR2.

σ̂

BondaddeAjuste:R2

Sin embargo,esa medida no esta exenta de problemas:• R^2 puede resultar grande a pesar de que la relación entre x e y no

sea lineal (de hecho tiene la misma interpretación que uncoeficiente de correlación, válido para cuantificar la relación linealsólo cuando esta existe).

• La magnitud de R^2 depende del rango de variabilidad de lavariable explicativa. Cuando el modelo de regresión es adecuado, lamagnitud de R^2 aumenta (o disminuye) cuando lo hace ladispersión de x.

• Podemos obtener un valor muy pequeño de R2 debido a que elrango de variación de x es demasiado pequeño, y entoncesimpedirá que se detecte su relación con y

BondaddeAjusteResidual standard error: 6187 on 88 degrees of freedomMultiple R-squared: 0.8421, Adjusted R-squared: 0.8134 F-statistic: 29.34 on 16 and 88 DF, p-value: < 2.2e-16

APARENTEMENTE NUERSTRO MODELO ES BONDADOSO

El coeficiente del R2 supera el 80%El p-valor asociado a la tabla de ANOVA del Modelo es muy inferior al 0.05.Aunque los residuos son un poco grandes.

ValidacióndelModeloUnavezajustadounmodelohemosdeprocederconeldiagnósticodelmodelo,queconsisteenverificarsi´estesatisfacelashipótesisbásicasdelmodeloderegresión,queson:• linealidadentrelasvariablesxey;(deberíahaberseestudiadoya!!)• paraloserroresdelmodelo,εi :(siexisterelaciónlinealentrex ey)

– mediacero– varianzaconstante– incorrelación– normalidad.

Elanálisisdelosresiduosnospermitirádetectardeficienciasenlaverificacióndeestashipótesis,asícomodescubrirobservacionesanómalasoespecialmenteinfluyentesenelajuste.Unavezencontradaslasdeficiencias,siexisten,cabráconsiderarelreplanteamientodelmodelo,bienempleandotransformacionesdelasvariables,bienproponiendomodelosalternativosal.

ValidacióndelModelo• En primer lugar yo siempre recomiendo una visualización de las gráficas

de diagnóstico.• Con esta visualización intuimos el comportamiento de los residuos, que

son los verdaderamente importantes en esta fase.

# diagnostic plots

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page

plot(fit)

80000 120000 160000

-100

000

1000

0

Fitted values

Res

idua

ls

Residuals vs Fitted

27

35 39

-2 -1 0 1 2

-3-1

01

23

4

Theoretical Quantiles

Sta

ndar

dize

d re

sidu

als

Normal Q-Q

2739

90

80000 120000 160000

0.0

0.5

1.0

1.5

Fitted values

Sta

ndar

dize

d re

sidu

als

Scale-Location27 3990

0.0 0.1 0.2 0.3 0.4 0.5 0.6

-4-2

02

4

Leverage

Sta

ndar

dize

d re

sidu

als

Cook's distance 1

0.5

0.5

1

Residuals vs Leverage

90

39

40

lm(Cases ~ Cereal.Pr + Egg.Pr + Month + Easter)

Validacióndelmodelo

• Delgráficoseextraenlassiguientesconclusiones:– Exceptuandoalgúnpuntosíparecequehayanormalidaddelosresiduos.

– PeroparecequenoexisteHomocedasticidad.

Sepuedecontrastardeformanuméricaparatenermayorseguridad.

LasHipótesis delosResiduos1. GuardandolosResiduos

2. Hipótesis demediadelso residuos=0.

3. Hipótesis deNormalidad

4. HipótesisdeHomomocedasticidad

5. Hipótesisdeautocorrelación

ValidacióndelModeloshapiro.test(Eggs$rstudent.LinearModel.1)

Shapiro-Wilk normality test

data: Eggs$rstudent.LinearModel.1 W = 0.9712, p-value = 0.02173##################################################bptest(Cases ~ Cereal.Pr + Egg.Pr + Month + Easter, varformula = ~ fitted.values(LinearModel.1), studentize=TRUE, data=Eggs)

studentized Breusch-Pagan test

data: Cases ~ Cereal.Pr + Egg.Pr + Month + EasterBP = 6.8796, df = 1, p-value = 0.008718

ValidacióndelModelo

• TenemosunosresiduosqueincumplenlahipótesisdeNormalidadylahipótesisdeHomocedasticidad.

• PortantomimodeloNOesVálido.¿existealgunaposiblevíadeescape?


TransformacionesBox-CoxLa familia de transformaciones más utilizada para resolver los problemas de falta de normalidad y de heterocedasticidad es la familia de Box-Cox, cuya definición es la siguiente.

Se desea transformar la variable Y, cuyos valores muestrales se suponen positivos, en caso contrario se suma una cantidad fija M tal que Y + M > 0. La transformación de Box-Cox depende de un parámetro por determinar y viene dada por

Valores muy utilizados del parámetro son los siguientes:

Transformación

-1 Z = 1/Y

-1/2 Z = 1/

0 Z = lg

1/2 Z =

1 Z = Y

ValidacióndelModeloboxcox(LinearModel.1)

-2 -1 0 1 2

5055

60

l

log-

Like

lihoo

d 95%


• Elijosegúnelgráficoanteriorelvalordentrodelintervalode“lambda”mas“redondo”,evitandovaloresdelestilode0.36,osimilares.Estolohagoporquetraselposteriorajuste“denuevo”lasprediccionesdeberán“re”-transformarseparaobtenerlasenlaescalademedicióninicial.

• Paraelejemploactualelijoλ=0à log(y)

Volvemosapaso1:AjustedelModelo

• Hemosdevolveraempezarperoahoraconellavariable“Y”transformadasegúnelparámetro“lambda”obtenido.

Ajustedelmodelo(2)Call:

lm(formula = lgY ~ Cereal.Pr + Easter + Egg.Pr + Month, data = Eggs)

Residuals:

Min 1Q Median 3Q Max

-0.113660 -0.039252 0.003659 0.034938 0.165506

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 12.4301579 0.1892718 65.674 < 2e-16 ***

Cereal.Pr -0.0035922 0.0014209 -2.528 0.013252 * Easter[T.Pre Easter] 0.3008591 0.0458257 6.565 3.49e-09 ***

Easter[T.Easter] 0.5403188 0.0542879 9.953 4.49e-16 ***

Easter[T.Post Easter] -0.1991539 0.0512024 -3.890 0.000195 ***

Egg.Pr -0.0047450 0.0009444 -5.024 2.63e-06 ***

Month[T.February] -0.0126416 0.0285716 -0.442 0.659246

Month[T.March] -0.0152822 0.0295750 -0.517 0.606643

Month[T.April] -0.0931902 0.0347661 -2.680 0.008776 **

Month[T.May] -0.0970813 0.0283182 -3.428 0.000926 ***

Month[T.June] -0.0728529 0.0291486 -2.499 0.014297 *

Month[T.July] -0.1027564 0.0269424 -3.814 0.000253 ***

Month[T.August] -0.1399601 0.0285595 -4.901 4.32e-06 ***

Month[T.September] -0.0754367 0.0284316 -2.653 0.009458 **

Month[T.October] -0.0884024 0.0272269 -3.247 0.001652 **

Month[T.November] -0.0274792 0.0283612 -0.969 0.335250

Month[T.December] 0.0380737 0.0287682 1.323 0.189108

---

BondaddelAjusteResidual standard error: 0.05926 on 88 degrees of freedom

Multiple R-squared: 0.7983, Adjusted R-squared: 0.7617 F-statistic: 21.77 on 16 and 88 DF, p-value: < 2.2e-16

ValidacióndelModelobptest(lgY ~ Cereal.Pr + Easter + Egg.Pr + Month, varformula= ~ fitted.values(modelo2), studentize=TRUE, data=Eggs)

studentized Breusch-Pagan test

data: lgY ~ Cereal.Pr + Easter + Egg.Pr + MonthBP = 1.0556, df = 1, p-value = 0.3042

shapiro.test(Eggs$rstudent.modelo2)

Shapiro-Wilk normality test

data: Eggs$rstudent.modelo2 W = 0.9858, p-value = 0.3267

Apredecir

Predicción

• Tenemosquegenerarun“data.frame”conlosdatosquedeseamospredecir,esdecirdebemosdedarvaloresatodaslas“X”:

# modelostep2<-lm(formula = Cases ~ First.Week + Month + Egg.Pr*Easter + Week + Beef.Pr, data = Eggs)

Newdata<-data.frame( First.Week="No",Month="July",Egg.Pr=90, Easter="Non Easter",Week=2,Beef.Pr=142)

predict(modelostep3, Newdata, interval="predict") fit lwr upr1 96337.57 85233.57 107441.6

Predicción

Predicción

• Efecto“PascuaenAbril”.NewData<-data.frame(Cereal.Pr=107, Easter="Pre Easter", Egg.Pr=90, Month="April")pred2<-(predict(modelo2, newdata=.NewData, interval="prediction", level=.95 ))

.NewData<-data.frame(Cereal.Pr=107, Easter="Easter", Egg.Pr=90, Month="April")pred1<-(predict(modelo2, newdata=.NewData, interval="prediction", level=.95 ))

.NewData<-data.frame(Cereal.Pr=107, Easter="Post Easter", Egg.Pr=90, Month="April")pred3<-(predict(modelo2, newdata=.NewData, interval="prediction", level=.95 ))

> exp(pred1)fit lwr upr

1 173836 150035.2 201412.5> exp(pred2)

fit lwr upr1 136818.1 117302.5 159580.7> exp(pred3)

fit lwr upr1 82983.33 71540.98 96255.78

Predicción

• Efecto“Pascua”,conpreenMarzo<-data.frame(Cereal.Pr=107, Easter="Pre Easter", Egg.Pr=90, Month="March")

> pred2<-(predict(modelo2, newdata=.NewData, interval="prediction", level=.95 ))>

> .NewData<-data.frame(Cereal.Pr=107, Easter="Easter", Egg.Pr=90, Month="April")

> pred1<-(predict(modelo2, newdata=.NewData, interval="prediction", level=.95 ))

>

> .NewData<-data.frame(Cereal.Pr=107, Easter="Post Easter", Egg.Pr=90, Month="April")

> pred3<-(predict(modelo2, newdata=.NewData, interval="prediction", level=.95 ))>

> exp(pred1)

fit lwr upr1 173836 150035.2 201412.5> exp(pred2)

fit lwr upr

1 147903.6 126972.7 172284.9

> exp(pred3)

fit lwr upr1 82983.33 71540.98 96255.78

>

¿Estátodobien?

• Nuncasobrahacerlaspruebasnecesariashastaencontrar“elmejormodelo”.

• ParaellosesueleutilizarlatécnicadeValidaciónCruzada:– EliminoundatoyAjustodenuevoelmodelo,pidolaPrediccióndeesedatoyobservoelerrorCometido.ContodoslosErrorescometidosobtengoelECM,ylocomparoconotrasestrategiasdemodelización.

Validación Cruzada(Quick-R)• CrossValidation• You cando K-Foldcross-validation using the cv.lm() function inthe DAAG package.# K-fold cross-validationlibrary(DAAG)cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation• Sumthe MSEfor each fold,divideby the number ofobservations,andtake the square root toget the cross-

validated standarderrorofestimate.• You canassess R2shrinkage via K-fold cross-validation.Using the crossval() function from the bootstrappackage,

dothe following:# Assessing R2 shrinkage using 10-Fold Cross-Validation

fit <- lm(y~x1+x2+x3,data=mydata)

library(bootstrap)# define functionstheta.fit <- function(x,y){lsfit(x,y)}theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}

# matrix of predictorsX <- as.matrix(mydata[c("x1","x2","x3")])# vector of predicted valuesy <- as.matrix(mydata[c("y")])

results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)cor(y, fit$fitted.values)**2 # raw R2cor(y,results$cv.fit)**2 # cross-validated R2