    Partial Exam - UPC Universitat Politècnica de Catalunya
    Lidia Montero

    2017, November 2nd

    Problem 1: All questions account for 1 point

    Specifications are given for 428 new vehicles for the 2004 year. The variables recorded include price, measurements relating to the size of the vehicle, and fuel efficiency (from http://ww2.amstat.org/publications/jse/jse_data_archive.htm). The dataset contains 19 variables:

    • Vehicle Name
    • Sports Car? (1=yes, 0=no)
    • Sport Utility Vehicle? (1=yes, 0=no)
    • Wagon? (1=yes, 0=no)
    • Minivan? (1=yes, 0=no)
    • Pickup? (1=yes, 0=no)
    • All-Wheel Drive? Factor (no,yes)
    • Rear-Wheel Drive? Factor (no,yes)
    • Suggested Retail Price, what the manufacturer thinks the vehicle is worth, including adequate profit for the automaker and the dealer (U.S. Dollars)
    • Dealer Cost (or “invoice price”), what the dealership pays the manufacturer (U.S. Dollars)
    • Engine Size (liters)
    • Number of Cylinders (=-1 if rotary engine)
    • Horsepower
    • City Miles Per Gallon
    • Highway Miles Per Gallon
    • Weight (Pounds)
    • Wheel Base (inches)
    • Length (inches)
    • Width (inches)

    Missing values are denoted with *. SOURCE: Kiplinger’s Personal Finance December 2003, vol. 57, no. 12, pp. 104-123, http://www.kiplinger.com.

    Load the 04Cars.RData file in your current R or RStudio session. City Miles per Gallon (cmpg) consumption is going to be our numeric target and car type our target factor (f.cartype).

    1. Create a new factor variable consisting of an indicator for car type, classified as Sports Car, SUV, Wagon, Minivan or Pickup. Use the binary indicators originally included in the dataset (name it f.cartype). Summarize the resulting factor.

    library(car)
    ## Warning: package 'car' was built under R version 3.3.3
    library(FactoMineR)
    ## Warning: package 'FactoMineR' was built under R version 3.3.3
    setwd("C:/Users/lmontero/Dropbox/DOCENCIA/MSCTM-ADTL/EXAMS/2017-18")
    load("04cars.RData")
    #load("F:/DOCENCIA/MSCTM-ADTL/EXAMS/2017-18/04cars.RData")
    df<-Car04
    summary(df)

    ##                           carmodel       sports            suv
    ##  Infiniti G35 4dr             :  2   Min.   :0.0000   Min.   :0.0000
    ##  Mercedes-Benz C240 4dr       :  2   1st Qu.:0.0000   1st Qu.:0.0000
    ##  Mercedes-Benz C320 4dr       :  2   Median :0.0000   Median :0.0000
    ##  Acura 3.5 RL 4dr             :  1   Mean   :0.1145   Mean   :0.1402
    ##  Acura 3.5 RL w/Navigation 4dr:  1   3rd Qu.:0.0000   3rd Qu.:0.0000
    ##  Acura MDX                    :  1   Max.   :1.0000   Max.   :1.0000
    ##  (Other)                      :419
    ##      wagon            minivan            pickup             wheeld4
    ##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   4whd-No :336
    ##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   4whd-Yes: 92
    ##  Median :0.00000   Median :0.00000   Median :0.00000
    ##  Mean   :0.07009   Mean   :0.04673   Mean   :0.05607
    ##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000
    ##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000
    ##
    ##      wheeld2        price             cost          enginesize
    ##  2whd-No :318   Min.   : 10280   Min.   :  9875   Min.   :1.300
    ##  2whd-Yes:110   1st Qu.: 20334   1st Qu.: 18866   1st Qu.:2.375
    ##                 Median : 27635   Median : 25295   Median :3.000
    ##                 Mean   : 32775   Mean   : 30015   Mean   :3.197
    ##                 3rd Qu.: 39205   3rd Qu.: 35710   3rd Qu.:3.900
    ##                 Max.   :192465   Max.   :173560   Max.   :8.300
    ##
    ##    ncilinder        horsepower         cmpg            hmpg
    ##  Min.   :-1.000   Min.   : 73.0   Min.   :10.00   Min.   :12.00
    ##  1st Qu.: 4.000   1st Qu.:165.0   1st Qu.:17.00   1st Qu.:24.00
    ##  Median : 6.000   Median :210.0   Median :19.00   Median :26.00
    ##  Mean   : 5.776   Mean   :215.9   Mean   :20.09   Mean   :26.91
    ##  3rd Qu.: 6.000   3rd Qu.:255.0   3rd Qu.:21.00   3rd Qu.:29.00
    ##  Max.   :12.000   Max.   :500.0   Max.   :60.00   Max.   :66.00
    ##                                   NA's   :14      NA's   :14
    ##      weight       wheelbase         length          width
    ##  Min.   :1850   Min.   : 89.0   Min.   :143.0   Min.   :64.00
    ##  1st Qu.:3102   1st Qu.:103.0   1st Qu.:177.0   1st Qu.:69.00
    ##  Median :3474   Median :107.0   Median :186.0   Median :71.00
    ##  Mean   :3577   Mean   :108.2   Mean   :185.1   Mean   :71.29
    ##  3rd Qu.:3974   3rd Qu.:112.0   3rd Qu.:193.0   3rd Qu.:73.00
    ##  Max.   :7190   Max.   :144.0   Max.   :227.0   Max.   :81.00
    ##  NA's   :2      NA's   :2       NA's   :26      NA's   :28
    # Point 1
    names(df)
    ##  [1] "carmodel"   "sports"     "suv"        "wagon"      "minivan"
    ##  [6] "pickup"     "wheeld4"    "wheeld2"    "price"      "cost"
    ## [11] "enginesize" "ncilinder"  "horsepower" "cmpg"       "hmpg"
    ## [16] "weight"     "wheelbase"  "length"     "width"
    df$f.cartype
    df$f.cartype
    ## ctype.minivan: 20
    ## ctype.pickup : 24
    ### Numeric summary
    table(df$f.cartype)
    ##
    ##    ctype.None  ctype.Sports     ctype.suv   ctype.wagon ctype.minivan
    ##           245            49            60            30            20
    ##  ctype.pickup
    ##            24
    100*round(prop.table(table(df$f.cartype)),3)
    ##
    ##    ctype.None  ctype.Sports     ctype.suv   ctype.wagon ctype.minivan
    ##          57.2          11.4          14.0           7.0           4.7
    ##  ctype.pickup
    ##           5.6
    #Graphical summary
    barplot(table(df$f.cartype),main="Car type polythomic factor",col=heat.colors(6))

    [Figure: bar plot "Car type polythomic factor"; x-axis: f.cartype levels; y-axis: counts from 0 to 200]

    The new factor is polytomous and the most common level is 'None' (245 out of 428, 57.2%): not a sports car, SUV, wagon, minivan or pickup.
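One way to build such a factor from the five binary indicators can be sketched as follows. This is a minimal sketch, not the exam's own code: the toy data frame and the helper name `code` are hypothetical, and it assumes the indicators are mutually exclusive, which is consistent with the counts above (49 + 60 + 30 + 20 + 24 = 183, and 428 - 183 = 245 cars in the None level).

```r
# Toy stand-in for the five 0/1 indicator columns (hypothetical rows, not the exam data).
toy <- data.frame(
  sports  = c(1, 0, 0, 0, 0, 0, 0),
  suv     = c(0, 1, 0, 0, 0, 0, 1),
  wagon   = c(0, 0, 1, 0, 0, 0, 0),
  minivan = c(0, 0, 0, 1, 0, 0, 0),
  pickup  = c(0, 0, 0, 0, 1, 0, 0)
)

# Assumption: at most one indicator equals 1 per row, so the weighted sum
# is 0 (none), 1 (sports), 2 (suv), 3 (wagon), 4 (minivan) or 5 (pickup).
stopifnot(all(rowSums(toy) <= 1))
code <- with(toy, 1 * sports + 2 * suv + 3 * wagon + 4 * minivan + 5 * pickup)

toy$f.cartype <- factor(
  code,
  levels = 0:5,
  labels = paste0("ctype.", c("None", "Sports", "suv", "wagon", "minivan", "pickup"))
)

# Numeric summary: counts and percentages per level.
counts <- table(toy$f.cartype)
print(counts)
print(100 * round(prop.table(counts), 3))
```

Fixing `levels = 0:5` keeps every level in the factor even when a level has no observations, so the summary tables always show the six categories in the same order.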

    2. Determine the presence of missing values in rows and columns. Select the 2 most critical rows and columns according to the missing-value criterion. You can use the countNA function provided in the 04Cars workspace: after loading the function, the mis_col list in the output object contains the total number of missing values per variable and mis_ind counts the total number of missing values per observation.

    missout
    [Figure: plot of missout$mis_ind; axis from 0 to 5; labelled observations: 27, 74, 419, 29, 30, 78, 79, 183, 241, 242]
    ## [1] 27 74 419 29 30 78 79 183 241 242
    ll4);ll
    ## [1] 27 74
    df$carmodel[ll]
    ## [1] Mazda3 i 4dr Mazda3 s 4dr
    ## 425 Levels: Acura 3.5 RL 4dr Acura 3.5 RL w/Navigation 4dr ... Volvo XC90 T6

    The provided function immediately returns, in the output list element $mis_col, the number of missing values for each variable: width and length, with 28 and 26 missing values, have the lowest quality. The second element, $mis_ind, contains the total number of missing values per observation: obs. 27 and 74 have 5 missing values each across all variables, and obs. 419 has 4. The Mazda3 i 4dr and Mazda3 s 4dr show the highest incidence of missing values.
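The per-variable and per-observation counts described above can be sketched in base R. The function `count_na_sketch` below is a hypothetical stand-in for the workspace's countNA(), whose actual definition is not shown in the exam; only the list names mis_col and mis_ind are taken from the text, and the toy data are invented.

```r
# Hypothetical stand-in for the workspace's countNA(); only the two list
# names mis_col and mis_ind are taken from the exam text.
count_na_sketch <- function(x) {
  list(
    mis_col = colSums(is.na(x)),  # missing values per variable
    mis_ind = rowSums(is.na(x))   # missing values per observation
  )
}

# Invented toy data with a few missing values.
toy <- data.frame(
  cmpg   = c(20, NA, 18, 19, 21),
  length = c(NA, NA, NA, 175, 180),
  width  = c(70, NA, NA, 68, 71)
)
out <- count_na_sketch(toy)

# Two most affected variables and observations.
worst_cols <- names(sort(out$mis_col, decreasing = TRUE))[1:2]
worst_rows <- order(out$mis_ind, decreasing = TRUE)[1:2]
print(worst_cols)
print(worst_rows)
```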

    3. The numeric target is defined as cmpg. Summarize the response variable numerically and graphically. Interpret the results. Do you think that cmpg may be considered normally distributed?

    summary(df$cmpg)

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    ##   10.00   17.00   19.00   20.09   21.00   60.00      14
    mm
    ## [1] 20.08937
    ## [1] 5.213062
    ## [1] 0.2594935
    Boxplot(df$cmpg)

    [Figure: boxplot of df$cmpg; y-axis from 10 to 60; labelled observations: 305, 70, 94, 69, 97, 13, 47, 49, 48, 12, 14]

    ## [1] 305 70 94 69 97 13 47 49 48 12 14
    df[c(70,94),]
    ##                            carmodel sports suv wagon minivan pickup
    ## 70 Honda Insight 2dr (gas/electric)      0   0     0       0      0
    ## 94  Toyota Prius 4dr (gas/electric)      0   0     0       0      0
    ##    wheeld4 wheeld2 price  cost enginesize ncilinder horsepower cmpg hmpg
    ## 70 4whd-No 2whd-No 19110 17911        2.0         3         73   60   66
    ## 94 4whd-No 2whd-No 20510 18926        1.5         4        110   59   51
    ##    weight wheelbase length width  f.cartype
    ## 70   1850        95    155    67 ctype.None
    ## 94   2890       106    175    68 ctype.None
    hist(df$cmpg, freq=F)
    curve(dnorm(x,mm,ssdd), col="red", lwd=2,add=T)

    [Figure: "Histogram of df$cmpg" with the normal density curve added in red; x-axis: df$cmpg, from 10 to 60; y-axis: Density, from 0.00 to 0.08]

    shapiro.test((df$cmpg))

    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  (df$cmpg)
    ## W = 0.7907, p-value < 2.2e-16

    The central tendency is about 19 city miles per gallon (median 19, mean 20.09); the central 50% of the sample lies between 17 and 21 city miles per gallon, and the standard deviation is 5.21. The coefficient of variation is 0.26, so variability around the mean is not large. Nevertheless, two Japanese cars (obs. 70 and 94) have the highest city miles per gallon; both are hybrid (gas/electric) vehicles. A normal profile is ruled out by visual inspection because of the outliers, and the Shapiro-Wilk test confirms, with a very low p-value, that the null hypothesis of normally distributed data has to be rejected.
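The numeric summary discussed above (mean, standard deviation, coefficient of variation and the normality test) can be sketched as follows. The sample `x` is invented for illustration; the names mm and ssdd follow the curve(dnorm(x,mm,ssdd)) call in the exam, and the last lines only re-check the printed values 20.08937 and 5.213062.

```r
# Illustrative right-skewed sample (hypothetical values, with NAs as in cmpg).
x <- c(10, 14, 16, 17, 17, 18, 19, 19, 20, 21, 21, 23, 26, 35, 46, 59, 60, NA, NA)

mm   <- mean(x, na.rm = TRUE)   # same names as in curve(dnorm(x, mm, ssdd))
ssdd <- sd(x, na.rm = TRUE)
cv   <- ssdd / mm               # coefficient of variation
print(c(mean = mm, sd = ssdd, cv = cv))

# Shapiro-Wilk normality test on the observed (non-missing) values.
sw <- shapiro.test(x[!is.na(x)])
print(sw$p.value)

# Values reported in the exam output: sd / mean reproduces the printed CV.
cv_exam <- 5.213062 / 20.08937
print(round(cv_exam, 4))
```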

    4. Calculate the upper threshold to identify severe outliers for cmpg. Are there any cars satisfying this criterion? How many? Are global outliers retained once the car type factor is considered?

    ss
    llutcmpg);ll
    ## [1] 13 47 49 69 70 94 97
    Boxplot(df$cmpg,labels=row.names(df))
    ## [1] "305" "70" "94" "69" "97" "13" "47" "49" "48" "12" "14"
    abline(h=utcmpg,col="red",lwd=3)

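    [Figure: boxplot of df$cmpg with a red horizontal line at utcmpg; y-axis from 10 to 60; labelled observations: 305, 70, 94, 69, 97, 13, 47, 49, 48, 12, 14]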

    df[ll,1]

    ## [1] Honda Civic HX 2dr
    ## [2] Toyota Echo 2dr manual
    ## [3] Toyota Echo 4dr
    ## [4] Honda Civic Hybrid 4dr manual (gas/electric)
    ## [5] Honda Insight 2dr (gas/electric)
    ## [6] Toyota Prius 4dr (gas/electric)
    ## [7] Volkswagen Jetta GLS TDI 4dr
    ## 425 Levels: Acura 3.5 RL 4dr Acura 3.5 RL w/Navigation 4dr ... Volvo XC90 T6
    llc

    [Figure: boxplots of cmpg by f.cartype; y-axis from 10 to 60; labelled observations: 13, 47, 49, 69, 70, 94, 97, 294, 385, 416, 421]

    df[llc,]

    ##                                         carmodel sports suv wagon minivan
    ## 13                            Honda Civic HX 2dr      0   0     0       0
    ## 47                        Toyota Echo 2dr manual      0   0     0       0
    ## 49                               Toyota Echo 4dr      0   0     0       0
    ## 69  Honda Civic Hybrid 4dr manual (gas/electric)      0   0     0       0
    ## 70              Honda Insight 2dr (gas/electric)      0   0     0       0
    ## 94               Toyota Prius 4dr (gas/electric)      0   0     0       0
    ## 97                  Volkswagen Jetta GLS TDI 4dr      0   0     0       0
    ## 294            Toyota MR2 Spyder convertible 2dr      1   0     0       0
    ## 385                              Chevrolet Astro      0   0     0       1
    ## 416               Ford Ranger 2.3 XL Regular Cab      0   0     0       0
    ## 421                   Mazda B2300 SX Regular Cab      0   0     0       0
    ##     pickup  wheeld4  wheeld2 price  cost enginesize ncilinder horsepower
    ## 13       0  4whd-No  2whd-No 14170 12996        1.7         4        117
    ## 47       0  4whd-No  2whd-No 10760 10144        1.5         4        108
    ## 49       0  4whd-No  2whd-No 11290 10642        1.5         4        108
    ## 69       0  4whd-No  2whd-No 20140 18451        1.4         4         93
    ## 70       0  4whd-No  2whd-No 19110 17911        2.0         3         73
    ## 94       0  4whd-No  2whd-No 20510 18926        1.5         4        110
    ## 97       0  4whd-No  2whd-No 21055 19638        1.9         4        100
    ## 294      0  4whd-No 2whd-Yes 25130 22787        1.8         4        138
    ## 385      0 4whd-Yes  2whd-No 26395 23954        4.3         6        190
    ## 416      1  4whd-No 2whd-Yes 14385 13717        2.3         4        143
    ## 421      1  4whd-No 2whd-Yes 14840 14070        2.3         4        143
    ##     cmpg hmpg weight wheelbase length width     f.cartype
    ## 13    36   44   2500       103    175    67    ctype.None
    ## 47    35   43   2035        93    163    65    ctype.None
    ## 49    35   43   2055        93    163    65    ctype.None
    ## 69    46   51   2732       103    175    68    ctype.None
    ## 70    60   66   1850        95    155    67    ctype.None
    ## 94    59   51   2890       106    175    68    ctype.None
    ## 97    38   46   3003        99    172    68    ctype.None
    ## 294   26   32   2195        97    153    67  ctype.Sports
    ## 385   14   17   4605       111    190    78 ctype.minivan
    ## 416   24   29   3028       111     NA    NA  ctype.pickup
    ## 421   24   29   2960       112     NA    NA  ctype.pickup

    The upper threshold for severe outliers is 33, and 7 vehicles exceed it (obs. 13, 47, 49, 69, 70, 94 and 97); all but one (the Volkswagen Jetta GLS TDI) are Japanese cars. All these upper severe outliers belong to the category None of the new car type factor. When the car type factor is considered, all those severe outliers are retained and new mild outliers appear in the sports car group (294), the minivan group (385, a lower mild outlier with poor performance, a Chevrolet Astro) and the pickup group (416 and 421).
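The threshold of 33 can be reproduced from the quartiles printed by summary(df$cmpg). The sketch below assumes the usual boxplot convention (third quartile plus 3 times the interquartile range for severe outliers, 1.5 times for mild ones); the variable names are illustrative, and the cmpg values are the ones listed in the df[llc,] output.

```r
# Quartiles of cmpg as printed by summary(df$cmpg) in the exam output.
q1 <- 17
q3 <- 21
iqr <- q3 - q1

# Assumption: "severe" uses the usual boxplot convention of 3 * IQR
# beyond the quartile (1.5 * IQR would give the mild threshold).
utcmpg_mild   <- q3 + 1.5 * iqr
utcmpg_severe <- q3 + 3 * iqr
print(c(mild = utcmpg_mild, severe = utcmpg_severe))

# cmpg of the 11 labelled observations, copied from the df[llc,] listing.
cmpg <- c(`13` = 36, `47` = 35, `49` = 35, `69` = 46, `70` = 60, `94` = 59,
          `97` = 38, `294` = 26, `385` = 14, `416` = 24, `421` = 24)
severe <- names(cmpg)[cmpg > utcmpg_severe]
print(severe)
```

Observations 294, 385, 416 and 421 stay below the global severe threshold, which is why they only show up as outliers within their own car type group.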

    5. Which numerical variables are statistically associated with the response (cmpg)? Indicate the suitable measure of association and/or tests that support your answer. Assess the linearity of the association with cmpg for the available variables.

    names(df)

    ##  [1] "carmodel"   "sports"     "suv"        "wagon"      "minivan"
    ##  [6] "pickup"     "wheeld4"    "wheeld2"    "price"      "cost"
    ## [11] "enginesize" "ncilinder"  "horsepower" "cmpg"       "hmpg"
    ## [16] "weight"     "wheelbase"  "length"     "width"      "f.cartype"
    names(df)[c(14,9:13,15:19)]
    ##  [1] "cmpg"       "price"      "cost"       "enginesize" "ncilinder"
    ##  [6] "horsepower" "hmpg"       "weight"     "wheelbase"  "length"
    ## [11] "width"
    llen
    ## enginesize  0.67  0.65  0.76
    ## ncilinder   0.61  0.57  0.68
    ## horsepower  0.50  0.45  0.61
    ## hmpg       -0.53 -0.36 -0.60
    ## weight      0.78  0.70  0.80
    ## wheelbase   1.00  0.88  0.77
    ## length      0.88  1.00  0.77
    ## width       0.77  0.77  1.00
    cor.test(df$cmpg,df$hmpg,method="spearman")
    ## Warning in cor.test.default(df$cmpg, df$hmpg, method = "spearman"): Cannot
    ## compute exact p-value with ties
    ##
    ##  Spearman's rank correlation rho
    ##
    ## data:  df$cmpg and df$hmpg
    ## S = 781480, p-value < 2.2e-16
    ## alternative hypothesis: true rho is not equal to 0
    ## sample estimates:
    ##       rho
    ## 0.9339202
    cor.test(df$cmpg,df$weight,method="spearman")
    ## Warning in cor.test.default(df$cmpg, df$weight, method = "spearman"):
    ## Cannot compute exact p-value with ties
    ##
    ##  Spearman's rank correlation rho
    ##
    ## data:  df$cmpg and df$weight
    ## S = 21702000, p-value < 2.2e-16
    ## alternative hypothesis: true rho is not equal to 0
    ## sample estimates:
    ##        rho
    ## -0.8619549
    cor.test(df$cmpg,df$enginesize,method="spearman")
    ## Warning in cor.test.default(df$cmpg, df$enginesize, method = "spearman"):
    ## Cannot compute exact p-value with ties
    ##
    ##  Spearman's rank correlation rho
    ##
    ## data:  df$cmpg and df$enginesize
    ## S = 21926000, p-value < 2.2e-16
    ## alternative hypothesis: true rho is not equal to 0
    ## sample estimates:
    ##        rho
    ## -0.8539988

    Correlation makes sense between pairs of numeric variables, so we are interested in the correlation (Spearman, since normality does not hold) between cmpg (city miles per gallon) and "price", "cost", "enginesize", "ncilinder", "horsepower", "hmpg", "weight", "wheelbase", "length" and "width". Highway miles per gallon is strongly and positively correlated (0.93) with cmpg, indicating that vehicles with high performance in the city also have high performance on highways. Weight (-0.86) and enginesize (-0.85) are strongly and negatively associated with cmpg, indicating that performance decreases as car weight or engine size increases.
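A Spearman correlation matrix and a single rank-correlation test can be sketched as below. The 11 rows are copied from the df[llc,] listing of question 4; they are the labelled boxplot points, so the resulting coefficients are illustrative only and are not the values reported for the full dataset.

```r
# Eleven rows copied from the df[llc,] listing earlier in the exam; they are
# the labelled boxplot points, so this is NOT a representative sample.
cmpg   <- c(36, 35, 35, 46, 60, 59, 38, 26, 14, 24, 24)
hmpg   <- c(44, 43, 43, 51, 66, 51, 46, 32, 17, 29, 29)
weight <- c(2500, 2035, 2055, 2732, 1850, 2890, 3003, 2195, 4605, 3028, 2960)
d <- data.frame(cmpg, hmpg, weight)

# Rank-based correlation matrix; pairwise deletion copes with NAs.
rho <- cor(d, method = "spearman", use = "pairwise.complete.obs")
print(round(rho, 2))

# With ties, exact = FALSE requests the asymptotic p-value directly and
# avoids the "Cannot compute exact p-value with ties" warning.
ct <- cor.test(d$cmpg, d$hmpg, method = "spearman", exact = FALSE)
print(ct$estimate)
```

Spearman's rho only measures monotone association; to assess linearity, a scatterplot of each variable against cmpg is still needed.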

    6. Describe the profile for the cmpg numeric target using the tools available in the FactoMineR package. Hint: do not include the raw variables sports, suv, wagon, minivan and pickup, but consider f.cartype.

    names(df)[7:20]

    ##  [1] "wheeld4"    "wheeld2"    "price"      "cost"       "enginesize"
    ##  [6] "ncilinder"  "horsepower" "cmpg"       "hmpg"       "weight"
    ## [11] "wheelbase"  "length"     "width"      "f.cartype"
    condes(df[,7:20],which(names(df)[7:20]=="cmpg"))
    ## $quanti
    ##            correlation       p.value
    ## hmpg         0.9401801 8.840927e-195
    ## cost        -0.4569619  9.458660e-23
    ## price       -0.4611296  3.453441e-23
    ## length      -0.4648256  2.361362e-22
    ## wheelbase   -0.4962908  3.959916e-27
    ## width       -0.5858334  3.244415e-37
    ## ncilinder   -0.6320217  1.488874e-47
    ## horsepower  -0.6662558  1.867635e-54
    ## enginesize  -0.7033902  4.644084e-63
    ## weight      -0.7371921  8.346569e-72
    ##
    ## $quali
    ##                   R2      p.value
    ## f.cartype 0.18202542 2.806118e-16
    ## wheeld4   0.09156341 3.252429e-10
    ## wheeld2   0.04583760 1.112241e-05
    ##
    ## $category
    ##                 Estimate      p.value
    ## ctype.None    3.07907364 5.903869e-15
    ## 4whd-No       1.90987654 3.252429e-10
    ## 2whd-No       1.26933551 1.112241e-05
    ## ctype.Sports -0.09213083 3.680022e-02
    ## ctype.pickup -1.99222334 1.250551e-03
    ## 2whd-Yes     -1.26933551 1.112241e-05
    ## 4whd-Yes     -1.90987654 3.252429e-10
    ## ctype.suv    -2.48448568 2.572408e-10
    table(df$wheeld4,df$wheeld2)
    ##
    ##            2whd-No 2whd-Yes
    ##   4whd-No      226      110
    ##   4whd-Yes      92        0

    A significant global association is found between cmpg (city performance) and hmpg (very significant, positive). Significant inverse associations are found with cost, price, length, wheelbase, width, ncilinder, horsepower, enginesize and weight.

    A significant global association is also found with the factor f.cartype and with the binary 4/2 wheel drive indicators.

    City performance over the mean is found for f.cartype None, No 4whd and No 2whd. Performance below the mean is found for SUV, pickup and Sports cars and for cars with 2 wheel drive or 4 wheel drive (there are 226 cars with neither 2 wheel drive nor 4 wheel drive).
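To make the R2 column of the $quali table concrete, the sketch below computes the share of variance of a numeric response explained by a factor in two equivalent ways. It is an illustration on invented data, under the assumption that the reported R2 is the one-way ANOVA R2; that reading is consistent with the value 0.18202542 for f.cartype, which reappears as Eta2 for cmpg in the catdes output of question 9.

```r
# Hypothetical toy data: a numeric response and a three-level factor.
y <- c(21, 23, 22, 25, 16, 15, 17, 14, 18, 19, 20, 18)
g <- factor(rep(c("a", "b", "c"), each = 4))

# R2 of the one-way linear model y ~ g ...
r2_lm <- summary(lm(y ~ g))$r.squared

# ... equals the between-group share of the total sum of squares (eta squared).
n_k         <- as.vector(table(g))
group_means <- as.vector(tapply(y, g, mean))
ss_between  <- sum(n_k * (group_means - mean(y))^2)
ss_total    <- sum((y - mean(y))^2)
eta2 <- ss_between / ss_total

print(c(r2_lm = r2_lm, eta2 = eta2))

# For a numeric explanatory variable the analogous measure is Pearson's r.
x <- c(30, 31, 29, 33, 22, 21, 24, 20, 25, 26, 27, 25)
print(cor(y, x))
```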

    7. Can the average cmpg be argued to be the same for all car type levels (f.cartype)? Which group pairs show non-significant differences in average city miles per gallon (cmpg)?

    Boxplot(cmpg~f.cartype,data=df)

    [Figure: boxplots of cmpg by f.cartype; y-axis from 10 to 60; labelled observations: 13, 47, 49, 69, 70, 94, 97, 294, 385, 416, 421]

    ## [1] "13" "47" "49" "69" "70" "94" "97" "294" "385" "416" "421"
    kruskal.test(cmpg~f.cartype,data=df)
    ##
    ##  Kruskal-Wallis rank sum test
    ##
    ## data:  cmpg by f.cartype
    ## Kruskal-Wallis chi-squared = 115.3, df = 5, p-value < 2.2e-16
    #oneway.test(cmpg~f.cartype,data=df) # Not suitable

    with(df,pairwise.wilcox.test(cmpg,f.cartype))
    ## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot
    ## compute exact p-value with ties
    ## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot
    ## compute exact p-value with ties
    ## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot
    ## compute exact p-value with ties
    ## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot
    ## compute exact p-value with ties
    ## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot
    ## compute exact p-value with ties
    ## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot
    ## compute exact p-value with ties
    ##
    ##  Pairwise comparisons using Wilcoxon rank sum test
    ##
    ## data:  cmpg and f.cartype
    ##
    ##               ctype.None ctype.Sports ctype.suv ctype.wagon ctype.minivan
    ## ctype.Sports  6.6e-05    -            -         -           -
    ## ctype.suv     < 2e-16    0.00018      -         -           -
    ## ctype.wagon   1.00000    0.06508      4.3e-06   -           -
    ## ctype.minivan 0.00022    1.00000      0.02447   0.02550     -
    ## ctype.pickup  7.7e-07    0.01513      1.00000   0.00124     0.06508
    ##
    ## P value adjustment method: holm
    #with(df,pairwise.t.test(cmpg,f.cartype))

    Judging by the boxplot, city performance clearly depends on the f.cartype level. The non-parametric Kruskal-Wallis test of the null hypothesis of equal city performance (cmpg) across f.cartype levels gives evidence to reject H0, so at least one group differs from the others. Comparing groups with pairwise.wilcox.test() (Holm-adjusted p-values), performance in the wagon and None levels seems to be equal (p = 1), as it does for minivan and Sports cars (p = 1) and for pickup and SUV cars (p = 1); wagon vs Sports and pickup vs minivan (p = 0.065 in both cases) are also not significantly different at the 5% level. SUV performance seems very different from that of the None and Sports levels. The warning messages in the output of the pairwise comparison arise because exact p-values cannot be computed when there are ties.
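The global test followed by pairwise comparisons can be sketched on a small invented sample. The group labels and values below are hypothetical and chosen without ties, so no warnings are raised; the last step shows one way to list the pairs that are not significantly different after the Holm adjustment.

```r
# Hypothetical toy data: three groups, no ties, so no warnings are raised.
y <- c(21.1, 23.4, 22.8, 25.0, 24.2,
       16.3, 15.1, 17.6, 14.9, 16.8,
       16.0, 15.5, 17.1, 14.2, 18.3)
g <- factor(rep(c("none", "suv", "pickup"), each = 5),
            levels = c("none", "suv", "pickup"))

# Global test: do all groups share the same location?
kw <- kruskal.test(y ~ g)
print(kw$p.value)

# Pairwise rank-sum tests with Holm adjustment (the method shown in the exam output).
pw <- pairwise.wilcox.test(y, g, p.adjust.method = "holm")
print(pw$p.value)

# Pairs whose adjusted p-value exceeds 0.05 show no significant difference.
idx <- which(pw$p.value > 0.05, arr.ind = TRUE)
pairs_ns <- paste(rownames(pw$p.value)[idx[, "row"]],
                  colnames(pw$p.value)[idx[, "col"]], sep = " vs ")
print(pairs_ns)
```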

    8. Can the variance of urban consumption, cmpg, be argued to be the same for all car type levels (f.cartype)? Which groups are likely to have a greater dispersion of cmpg than the others?

    Boxplot(cmpg~f.cartype,data=df,col=heat.colors(6))

    [Figure: boxplots of cmpg by f.cartype, coloured with heat.colors(6); y-axis from 10 to 60; labelled observations: 13, 47, 49, 69, 70, 94, 97, 294, 385, 416, 421]

    ## [1] "13" "47" "49" "69" "70" "94" "97" "294" "385" "416" "421"
    fligner.test(cmpg~f.cartype,data=df)
    ##
    ##  Fligner-Killeen test of homogeneity of variances
    ##
    ## data:  cmpg by f.cartype
    ## Fligner-Killeen:med chi-squared = 19.389, df = 5, p-value =
    ## 0.001626
    #bartlett.test(cmpg~f.cartype,data=df)

    Judging by the box lengths in the boxplot, the variability of city performance clearly depends on the f.cartype level. Using the non-parametric Fligner-Killeen test of the null hypothesis of equal variance across groups, the p-value is 0.0016, below the common 0.05 threshold; this is evidence to reject the null hypothesis, so at least one group has a significantly different variance. The greatest dispersion is found in the None level of f.cartype.
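A per-group look at dispersion alongside the homogeneity test can be sketched as follows. The data are invented, with one group made deliberately more spread out; the group names are illustrative only.

```r
# Hypothetical toy data: group "none" is deliberately more spread out.
y <- c(12, 18, 25, 33, 41, 52, 60, 8,
       16, 17, 17.5, 18, 18.5, 19, 19.5, 20,
       15, 15.5, 16, 16.5, 17, 17.5, 18, 18.5)
g <- factor(rep(c("none", "suv", "pickup"), each = 8))

# Per-group spread: standard deviation and interquartile range.
sd_by_group <- tapply(y, g, sd)
print(sd_by_group)
print(tapply(y, g, IQR))

# Fligner-Killeen is rank-based, so it does not rely on normality
# (which was rejected for cmpg); bartlett.test() would.
fk <- fligner.test(y ~ g)
print(fk$p.value)
```

Reporting the group standard deviations next to the test makes it explicit which group drives the rejection, which the test alone does not say.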

    9. Describe the profile for the f.cartype target factor using the tools available in the FactoMineR package. Indicate the most relevant numeric variables available in the dataset, globally and by f.cartype level.

    names(df)

    ##  [1] "carmodel"   "sports"     "suv"        "wagon"      "minivan"
    ##  [6] "pickup"     "wheeld4"    "wheeld2"    "price"      "cost"
    ## [11] "enginesize" "ncilinder"  "horsepower" "cmpg"       "hmpg"
    ## [16] "weight"     "wheelbase"  "length"     "width"      "f.cartype"
    catdes(df[,7:20],which(names(df)[7:20]=="f.cartype"))

    ## $test.chi2
    ##              p.value df
    ## wheeld4 1.542859e-19  5
    ## wheeld2 1.555502e-18  5
    ##
    ## $category
    ## $category$ctype.None
    ##                   Cla/Mod  Mod/Cla   Global      p.value    v.test
    ## wheeld4=4whd-No  65.47619 89.79592 78.50467 5.434706e-11  6.558513
    ## wheeld2=2whd-No  60.06289 77.95918 74.29907 4.685921e-02  1.987570
    ## wheeld2=2whd-Yes 49.09091 22.04082 25.70093 4.685921e-02 -1.987570
    ## wheeld4=4whd-Yes 27.17391 10.20408 21.49533 5.434706e-11 -6.558513
    ##
    ## $category$ctype.Sports
    ##                    Cla/Mod  Mod/Cla   Global      p.value    v.test
    ## wheeld2=2whd-Yes 32.727273 73.46939 25.70093 7.903623e-14  7.471915
    ## wheeld4=4whd-No  13.095238 89.79592 78.50467 3.400560e-02  2.120005
    ## wheeld4=4whd-Yes  5.434783 10.20408 21.49533 3.400560e-02 -2.120005
    ## wheeld2=2whd-No   4.088050 26.53061 74.29907 7.903623e-14 -7.471915
    ##
    ## $category$ctype.suv
    ##                    Cla/Mod   Mod/Cla   Global      p.value    v.test
    ## wheeld4=4whd-Yes 41.304348  63.33333 21.49533 1.734038e-14  7.668955
    ## wheeld2=2whd-No  18.867925 100.00000 74.29907 3.635777e-09  5.899965
    ## wheeld2=2whd-Yes  0.000000   0.00000 25.70093 3.635777e-09 -5.899965
    ## wheeld4=4whd-No   6.547619  36.66667 78.50467 1.734038e-14 -7.668955
    ##
    ## $category$ctype.wagon
    ## NULL
    ##
    ## $category$ctype.minivan
    ##                    Cla/Mod Mod/Cla   Global    p.value    v.test
    ## wheeld2=2whd-No  5.9748428      95 74.29907 0.02097382  2.308455
    ## wheeld2=2whd-Yes 0.9090909       5 25.70093 0.02097382 -2.308455
    ##
    ## $category$ctype.pickup
    ##                    Cla/Mod Mod/Cla   Global     p.value    v.test
    ## wheeld4=4whd-Yes 13.043478      50 21.49533 0.001677810  3.142030
    ## wheeld2=2whd-Yes 10.909091      50 25.70093 0.009450601  2.595309
    ## wheeld2=2whd-No   3.773585      50 74.29907 0.009450601 -2.595309
    ## wheeld4=4whd-No   3.571429      50 78.50467 0.001677810 -3.142030
    ##
    ##
    ## $quanti.var
    ##                  Eta2      P-value
    ## wheelbase  0.37965506 1.551164e-41
    ## hmpg       0.35231436 1.546401e-36
    ## weight     0.34254312 2.652594e-36
    ## width      0.24938137 1.248764e-23
    ## cmpg       0.18202542 2.806118e-16
    ## price      0.15903877 2.054772e-14
    ## cost       0.15466556 5.894341e-14
    ## enginesize 0.15461726 5.963116e-14
    ## horsepower 0.15082034 1.480536e-13
    ## length     0.15139203 2.189983e-13
    ## ncilinder  0.06064666 7.152038e-05
    ##
    ## $quanti
    ## $quanti$ctype.None
    ##               v.test Mean in category Overall mean sd in category
    ## hmpg       10.349104        29.368644    26.905797   5.347631e+00
    ## cmpg        7.703467        21.766949    20.089372   5.706888e+00
    ## wheelbase  -2.865563       107.176955   108.173709   5.825088e+00
    ## cost       -3.485123     27446.142857 30014.700935   1.474657e+04
    ## ncilinder  -3.615290         5.530612     5.775701   1.529333e+00
    ## price      -3.646974     29814.359184 32774.855140   1.606665e+04
    ## horsepower -5.264882       200.085714   215.885514   6.555886e+01
    ## width      -6.127804        70.423868    71.292500   2.926635e+00
    ## enginesize -6.222094         2.908571     3.196729   9.433592e-01
    ## weight     -8.106621      3319.687243  3577.213615   5.538066e+02
    ##              Overall sd      p.value
    ## hmpg           5.689919 4.224216e-25
    ## cmpg           5.206762 1.324235e-14
    ## wheelbase      8.316671 4.162680e-03
    ## cost       17621.495747 4.919117e-04
    ## ncilinder      1.620882 3.000112e-04
    ## price      19409.002795 2.653466e-04
    ## horsepower    71.752062 1.402795e-07
    ## width          3.389239 8.910033e-10
    ## enginesize     1.107299 4.905629e-10
    ## weight       759.544606 5.204695e-16
    ##
    ## $quanti$ctype.Sports
    ##               v.test Mean in category Overall mean sd in category
    ## price       7.890665      53387.06122  32774.85514   33433.166300
    ## cost        7.782966      48473.16327  30014.70093   30295.556677
    ## horsepower  7.070289        284.16327    215.88551      91.838000
    ## cmpg       -2.131409         18.59574     20.08937       2.498166
    ## weight     -2.753899       3295.69388   3577.21362     473.227011
    ## length     -6.617086        173.28571    185.12687       9.893308
    ## wheelbase  -7.320572         99.97959    108.17371       5.064844
    ##              Overall sd      p.value
    ## price      19409.002795 3.005796e-15
    ## cost       17621.495747 7.084378e-15
    ## horsepower    71.752062 1.546108e-12
    ## cmpg           5.206762 3.305542e-02
    ## weight       759.544606 5.889002e-03
    ## length        13.295955 3.663484e-11
    ## wheelbase      8.316671 2.469154e-13
    ##
    ## $quanti$ctype.suv
    ##               v.test Mean in category Overall mean sd in category
    ## weight      9.526679      4444.433333  3577.213615     881.811041
    ## width       6.747549        74.033333    71.292500       3.459126
    ## enginesize  5.450067         3.920000     3.196729       1.081943
    ## ncilinder   4.071663         6.566667     5.775701       1.370726
    ## wheelbase   2.919128       111.083333   108.173709       8.660815
    ## horsepower  2.317734       235.816667   215.885514      55.763337
    ## cmpg       -6.227285        16.203390    20.089372       2.704545
    ## hmpg       -9.207204        20.627119    26.905797       3.188423
    ##            Overall sd      p.value
    ## weight     759.544606 1.623959e-21
    ## width        3.389239 1.503637e-11
    ## enginesize   1.107299 5.035077e-08
    ## ncilinder    1.620882 4.667874e-05
    ## wheelbase    8.316671 3.510124e-03
    ## horsepower  71.752062 2.046376e-02
    ## cmpg         5.206762 4.745878e-10
    ## hmpg         5.689919 3.347295e-20
    ##
    ## $quanti$ctype.wagon
    ##               v.test Mean in category Overall mean sd in category
    ## enginesize -2.186353             2.77     3.196729      0.8760327
    ##            Overall sd    p.value
    ## enginesize   1.107299 0.02878978
    ##
    ## $quanti$ctype.minivan
    ##              v.test Mean in category Overall mean sd in category
    ## width      6.557064            76.15      71.2925       2.612949
    ## wheelbase  5.075473           117.40     108.1737       4.127953
    ## length     4.137110           197.15     185.1269       6.002291
    ## weight     3.712775          4193.60    3577.2136     276.412988
    ## hmpg      -1.974628            24.45      26.9058       2.397394
    ##           Overall sd      p.value
    ## width       3.389239 5.487750e-11
    ## wheelbase   8.316671 3.865335e-07
    ## length     13.295955 3.517082e-05
    ## weight    759.544606 2.049993e-04
    ## hmpg        5.689919 4.831042e-02
    ##
    ## $quanti$ctype.pickup
    ##               v.test Mean in category Overall mean sd in category
    ## wheelbase   8.978668       123.000000   108.173709      11.463711
    ## weight      4.466184      4250.750000  3577.213615     868.155336
    ## enginesize  4.013729         4.079167     3.196729       1.222695
    ## price      -2.032733     24941.375000 32774.855140    9664.115116
    ## cost       -2.114451     22616.750000 30014.700935    8665.750392
    ## cmpg       -3.282745        16.695652    20.089372       3.168250
    ## hmpg       -5.073646        21.173913    26.905797       3.737613
    ##              Overall sd      p.value
    ## wheelbase      8.316671 2.740650e-19
    ## weight       759.544606 7.962697e-06
    ## enginesize     1.107299 5.976691e-05
    ## price      19409.002795 4.207948e-02
    ## cost       17621.495747 3.447680e-02
    ## cmpg           5.206762 1.028016e-03
    ## hmpg           5.689919 3.902650e-07
    ##
    ##
    ## attr(,"class")
    ## [1] "catdes" "list "

    Globally, the factors related to f.cartype are wheeld4 and wheeld2. By f.cartype level: regular cars (None) show a share without 4 wheel drive above the global share (89.8% vs 78.5%); Sports cars are significantly 2 wheel drive (73.5% vs 25.7%); SUV cars are significantly 4 wheel drive (63.3% vs 21.5%) and never 2 wheel drive. Few observations are available for the pickup car type, but half of them are 4 wheel drive and half 2 wheel drive, never both.

    The numeric variables globally and significantly related to f.cartype are, ordered by increasing p-value: wheelbase, hmpg, weight, width, cmpg, price, cost, enginesize, horsepower, length and ncilinder.

    According to f.cartype levels (the v.test column is a standardized test statistic, not a difference in the units of the variable):

    • None: regular cars show performance significantly over the global mean (hmpg 29.4 vs 26.9, v.test 10.3; cmpg 21.8 vs 20.1, v.test 7.7), and weight (3320 vs 3577 pounds, v.test -8.1) and enginesize (2.91 vs 3.20 liters, v.test -6.2) significantly under it.

    • Sports cars show price and cost well over the mean (v.test about 7.9 and 7.8; mean price 53387 vs 32775 U.S. dollars), while weight, length and wheelbase are clearly under the mean (v.test -2.8, -6.6 and -7.3).

    • SUV cars are over the global means of weight, width, enginesize, ncilinder, wheelbase and horsepower, and have low performance (city and highway).

    • Wagon cars show an enginesize significantly under the mean (2.77 vs 3.20 liters, v.test -2.2).

    • Minivan cars are larger in width, wheelbase and length, and heavier, than the mean; highway performance is under the global mean (24.45 vs 26.91, v.test -2.0).

    • Pickup cars show wheelbase, weight and enginesize over the mean, and price, cost and performance (cmpg, hmpg) under it.
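The v.test column can be read as a standardized gap between the category mean and the overall mean. The sketch below uses the usual formula for comparing a subgroup mean with the overall mean under sampling without replacement; it is an assumption about how the statistic is computed, not FactoMineR's code, but with the level counts from table(df$f.cartype) as n_k and N = 428 it reproduces the printed values for cmpg to rounding. The function name `v_test` is hypothetical.

```r
# v.test for a numeric variable within one category: the gap between the
# category mean and the overall mean, divided by the standard error of a
# mean of n_k values drawn without replacement from N.
v_test <- function(mean_k, mean_all, sd_all, n_k, N) {
  se <- sqrt((sd_all^2 / n_k) * ((N - n_k) / (N - 1)))
  (mean_k - mean_all) / se
}

# Inputs read off the catdes output; n_k and N are the level counts from
# table(df$f.cartype) (an assumption that happens to match the output).
v_sports_cmpg <- v_test(18.59574, 20.08937, 5.206762, n_k = 49, N = 428)
v_none_cmpg   <- v_test(21.766949, 20.089372, 5.206762, n_k = 245, N = 428)
print(round(c(sports = v_sports_cmpg, none = v_none_cmpg), 3))
```

Because the statistic is scaled by the standard error, a v.test of 7.7 for cmpg in the None level corresponds to a difference of only about 1.7 miles per gallon between the category mean and the overall mean.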

    10. Is the car type independent of four wheel drive availability? Identify those car types that lack four wheel drive.

    table(df$f.cartype,df$wheeld4)

    ##
    ##                 4whd-No 4whd-Yes
    ##   ctype.None        220       25
    ##   ctype.Sports       44        5
    ##   ctype.suv          22       38
    ##   ctype.wagon        21        9
    ##   ctype.minivan      17        3
    ##   ctype.pickup       12       12
    prop.table(table(df$f.cartype,df$wheeld4),1)
    ##
    ##                   4whd-No  4whd-Yes
    ##   ctype.None    0.8979592 0.1020408
    ##   ctype.Sports  0.8979592 0.1020408
    ##   ctype.suv     0.3666667 0.6333333
    ##   ctype.wagon   0.7000000 0.3000000
    ##   ctype.minivan 0.8500000 0.1500000
    ##   ctype.pickup  0.5000000 0.5000000
    prop.table(table(df$f.cartype,df$wheeld4),2)
    ##
    ##                    4whd-No   4whd-Yes
    ##   ctype.None    0.65476190 0.27173913
    ##   ctype.Sports  0.13095238 0.05434783
    ##   ctype.suv     0.06547619 0.41304348
    ##   ctype.wagon   0.06250000 0.09782609
    ##   ctype.minivan 0.05059524 0.03260870
    ##   ctype.pickup  0.03571429 0.13043478
    chisq.test(table(df$f.cartype,df$wheeld4))
    ## Warning in chisq.test(table(df$f.cartype, df$wheeld4)): Chi-squared
    ## approximation may be incorrect
    ##
    ##  Pearson's Chi-squared test
    ##
    ## data:  table(df$f.cartype, df$wheeld4)
    ## X-squared = 97.792, df = 5, p-value < 2.2e-16

    A contingency table allows us to answer the question. The row profiles show that about 90% of regular cars (None level) and of Sports cars lack 4 wheel drive, as do 85% of minivans and 70% of wagons, while most SUV cars (63%) have it and pickups split 50-50. So the availability of 4 wheel drive clearly seems to depend on car type (f.cartype). To test the null hypothesis of independence between the two factors, a Chi-squared test can be used; it returns a p-value of almost 0, leading us to reject independence. Thus, a global association between car type and 4 wheel drive availability is found. The warning message in the Chi-squared test is due to a small expected count (about 4.3, below 5) in one cell of the contingency table: minivans with 4 wheel drive, which also has only 3 observations.
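The test can be re-run directly from the printed contingency table, which also lets us inspect the expected counts behind the warning. The counts below are copied from table(df$f.cartype, df$wheeld4); the Monte Carlo variant at the end is an added suggestion, not part of the exam solution.

```r
# Contingency table copied from table(df$f.cartype, df$wheeld4).
tab <- matrix(
  c(220, 25,
     44,  5,
     22, 38,
     21,  9,
     17,  3,
     12, 12),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    f.cartype = paste0("ctype.", c("None", "Sports", "suv", "wagon", "minivan", "pickup")),
    wheeld4   = c("4whd-No", "4whd-Yes")
  )
)

# suppressWarnings() only hides the approximation warning in this sketch.
res <- suppressWarnings(chisq.test(tab))
print(res$statistic)
print(round(res$expected, 2))   # smallest expected count is below 5

# Standardized residuals show which cells drive the association.
print(round(res$stdres, 2))

# Assumption: a Monte Carlo p-value is an acceptable way around the warning.
set.seed(1)
res_mc <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(res_mc$p.value)
```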
