Multivariate Laboratory Exercise III

  • 7/31/2019 Multivariate Laboratory Exercise III

    Student:

    Asaad, Al-Ahmadgaid B.

    website:www.alstat.weebly.com

    email:[email protected]

    Instructor:Prof. Baguio, Carolina B.

    email: [email protected]

    A. Obtain data sets with two and three response variables.

    1. Data with three response variables

    Source: Rencher, A. C. (2002), Methods of Multivariate

    Analysis, 2nd Edition. pg. 56

    Table A.1 Calcium in Soil and Turnip Greens

    Location Number    y1      y2      y3
           1           35      3.5     2.8
           2           35      4.9     2.7
           3           40     30       4.38
           4           10      2.8     3.21
           5            6      2.7     2.73
           6           20      2.8     2.81
           7           35      4.6     2.88
           8           35     10.9     2.9
           9           35      8       3.28
          10           30      1.6     3.2

    Table A.1 gives partial data from Kramer and Jensen (1969). Three variables were measured (in milliequivalents per 100 g) at 10 different locations in the South. The variables are y1 = available soil calcium, y2 = exchangeable soil calcium, and y3 = turnip green calcium.

    Test the normality of the data using R software.

    Solution:

    # Enter Table A.1 as a data frame with columns y1, y2, y3
    Res3Data <- read.table(header = TRUE, text = "
    y1 y2 y3
    35 3.5 2.80
    35 4.9 2.70
    40 30.0 4.38
    10 2.8 3.21
    6 2.7 2.73
    20 2.8 2.81
    35 4.6 2.88
    35 10.9 2.90
    35 8.0 3.28
    30 1.6 3.20")

    Res3Data.Mat

    ... great effect on the other observations of the variables y2 and y3, and thus contributes to the outliers seen on the plot. To avoid this, it is better to test the normality of each variable separately and to draw a quantile-quantile plot for each one, so that the units of the observations are homogeneous.

    Figure 1: Normal Quantile-Quantile Plot of Table A.1.

    Testing the Normality of each variable

    a. y1 = available soil calcium

    library(mvnormtest)

    attach(Res3Data)

    shapiro.test(y1)

    Shapiro-Wilk normality test

    data: y1

    W = 0.7874, p-value = 0.0102

    Figure 2: Normal Probability Plot of variable y1. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

    The p-value of variable y1 is 0.0102, which is less than the 0.05 level of significance, so the hypothesis of normality is rejected and y1 is not normally distributed. The quantile-quantile plot in Figure 2 agrees: in the normal probability plot of y1 there is a single point that lies outside the 95% confidence interval.
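As a rough illustration of what the quantile-quantile plot for y1 is built from, the following sketch pairs the observations with standard normal quantiles using base R only. It does not reproduce the confidence band of Figure 2, and the object names (`qq`, `theo`) are chosen here for illustration.

```r
# Observations of y1 (available soil calcium) from Table A.1
y1 <- c(35, 35, 40, 10, 6, 20, 35, 35, 35, 30)

# qqnorm() returns the plotting coordinates:
#   x = theoretical standard normal quantiles, y = the data
qq <- qqnorm(y1, plot.it = FALSE)

# The theoretical quantiles are qnorm() evaluated at the plotting positions
theo <- qnorm(ppoints(length(y1)))

# Draw the plot with a reference line through the first and third quartiles
qqnorm(y1, main = "Normal Q-Q plot of y1")
qqline(y1, col = "purple")
```

Points that fall far from the reference line are the ones that would be flagged as departures from normality.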

    b. y2 = exchangeable soil calcium

    library(mvnormtest)

    attach(Res3Data)

    shapiro.test(y2)

    Shapiro-Wilk normality test

    data: y2

    W = 0.6405, p-value = 0.0001687

    Figure 3: The normal probability plot of variable y2. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

    The observed p-value of variable y2 is also less than 0.05, so y2 is not normally distributed. The quantile-quantile plot of y2 in Figure 3 agrees: there are outliers that lie outside the 95% confidence interval, which coincides with the result of the test on y2.

    c. y3 = turnip green calcium

    library(mvnormtest)

    attach(Res3Data)

    shapiro.test(y3)

    Shapiro-Wilk normality test

    data: y3

    W = 0.7294, p-value = 0.002001

    Figure 4: The normal probability plot of variable y3. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

    The third variable follows the same pattern: its observations are not normally distributed, since the p-value 0.002001 is less than the 0.05 level of significance. The quantile-quantile plot of y3 also departs from normality, since it again shows outliers that lie outside the 95% confidence interval.

    Summing up the decisions for the three variables, the conclusion of the Shapiro-Wilk test that was first applied to the data with the three variables combined is confirmed: the observations in the data are not normally distributed.

    Since the data are not normally distributed, it is difficult to estimate an appropriate probability density function for them, particularly given the small sample size n.
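The three univariate tests above can also be run in one pass without `attach()`. The following is a minimal sketch, assuming the data of Table A.1 as entered in the solution; the names `alpha`, `tests`, and `summary_tab` are introduced here for illustration.

```r
# Table A.1 as a data frame (values as entered in the solution above)
Res3Data <- data.frame(
  y1 = c(35, 35, 40, 10, 6, 20, 35, 35, 35, 30),
  y2 = c(3.5, 4.9, 30.0, 2.8, 2.7, 2.8, 4.6, 10.9, 8.0, 1.6),
  y3 = c(2.80, 2.70, 4.38, 3.21, 2.73, 2.81, 2.88, 2.90, 3.28, 3.20)
)

alpha <- 0.05  # level of significance used in the text

# Apply the Shapiro-Wilk test to every column
tests <- lapply(Res3Data, shapiro.test)

# Collect the statistic, the p-value, and the decision for each variable
summary_tab <- data.frame(
  W       = sapply(tests, function(t) unname(t$statistic)),
  p.value = sapply(tests, function(t) t$p.value)
)
summary_tab$reject_normality <- summary_tab$p.value < alpha

print(summary_tab)
```

Referring to columns through the data frame avoids the masking warnings that repeated `attach()` calls can produce.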

    2. Data with two response variables

    Source: Hardle, W., et al. (2007), Multivariate Statistics:

    Exercises and Solutions. pg. 336

    Table A.2 Sales Data

          Sales   Price   Advert   Ass. Hours
     1     230     125     200        109
     2     181      99      55        107
     3     165      97     105         98
     4     150     115      85         71
     5      97     120       0         82
     6     192     100     150        103
     7     181      80      85        111
     8     189      90     120         93
     9     172      95     110         86
    10     170     125     130         78

    This is a data set consisting of 10 measurements of 4 variables. The story: A textile shop manager is studying the sales of "classic blue" pullovers over 10 periods. He uses three different marketing methods and hopes to understand his sales as a fit of these variables using statistics. The variables measured are X1: Numbers of sold pullovers, X2: Price (in EUR), X3: Advertisement costs in local newspapers (in EUR), X4: Presence of a sales assistant (in hours per period).

    Test the normality of the data using R software.

    Solution:

    Res2Data

    Figure 5: Normal Quantile-Quantile Plot of Table A.2.

    ... used. As before, it is better to test the normality of each variable separately, to ensure the homogeneity of the measurements.

    Testing the Normality of each variable

    a. Sales - products sold

    library(mvnormtest)

    attach(Res2Data)

    shapiro.test(Sales)

    Shapiro-Wilk normality test

    data: Sales

    W = 0.9067, p-value = 0.2591

    Figure 6: The normal probability plot of variable Sales. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

    The p-value is greater than the 0.05 level of significance, so the hypothesis of normality is not rejected and the observations on variable Sales are treated as normally distributed. The quantile-quantile plot in Figure 6 supports this, since all of the points fall within the 95% confidence interval.

    b. Price - Price of the products sold

    library(mvnormtest)

    attach(Res2Data)

    shapiro.test(Price)

    Shapiro-Wilk normality test

    data: Price

    W = 0.9187, p-value = 0.346

    Figure 7: The normal probability plot of variable Price. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

    For this variable the p-value is also greater than the 0.05 level of significance, which means that the null hypothesis H0, that the data are normally distributed, is not rejected. As seen in Figure 7, the points are within the 95% confidence interval, which supports treating the observations on variable Price of the Sales data as normally distributed.

    Thus, summing up the conclusions for the two variables above (Sales and Price), the Sales data are normally distributed, which coincides with the result of the Shapiro-Wilk test for multivariate data, in which the two variables were combined and tested.
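The call used for the combined test is not shown in the text. The following is a hedged sketch of how a multivariate Shapiro-Wilk test on Sales and Price could be run with `mshapiro.test()` from the `mvnormtest` package already loaded in this exercise; the matrix name `U` and the choice to pass only these two columns are assumptions.

```r
library(mvnormtest)  # provides mshapiro.test()

# Sales and Price columns of Table A.2
Sales <- c(230, 181, 165, 150, 97, 192, 181, 189, 172, 170)
Price <- c(125, 99, 97, 115, 120, 100, 80, 90, 95, 125)

# mshapiro.test() expects variables in rows and observations in columns,
# so the 10 x 2 data matrix is transposed first
U <- t(cbind(Sales, Price))

res <- mshapiro.test(U)
print(res)
```

The p-value in `res$p.value` is then compared with the 0.05 level of significance in the same way as for the univariate tests.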

    If Normal, what is the probability density function of

    the data?

    Let the variable Sales be S, and Price be P.

    If the data set Res2Data is X, then

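    $$
    f(X) = \frac{1}{2\pi\,(|\Sigma|)^{1/2}}
    \exp\left\{ -\frac{1}{2}
    \begin{bmatrix} (S-\mu_S) & (P-\mu_P) \end{bmatrix}
    \begin{bmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{bmatrix}^{-1}
    \begin{bmatrix} (S-\mu_S) \\ (P-\mu_P) \end{bmatrix}
    \right\}
    \tag{1}
    $$

    Note that $\Sigma$ is just equal to $\begin{bmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{bmatrix}$.

    And these are the values of each matrix above,

    a. $\begin{bmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{bmatrix} = \begin{bmatrix} 1152.46 & -88.91 \\ -88.91 & 244.27 \end{bmatrix}$

    b. $\begin{bmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{bmatrix}^{-1} = \begin{bmatrix} 0.00089 & 0.00032 \\ 0.00032 & 0.0042 \end{bmatrix}$

    c. $(|\Sigma|)^{1/2} = \begin{vmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{vmatrix}^{1/2} = 523.0691$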

    d. $\begin{bmatrix} (S-\mu_S) & (P-\mu_P) \end{bmatrix}$ = The output for the data is a large matrix that is not easy to reproduce here. However, the following code will generate it.

    attach(Res2Data)

    FVar
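The quantities in items a to d can be recomputed directly from the Sales and Price columns of Table A.2. The sketch below uses base R only; it assumes the sample covariance matrix (divisor n - 1) is the estimate of $\Sigma$, and the names `X`, `mu`, `Sigma_inv`, `root_det`, `centered`, `quad`, and `dens` are introduced here for illustration.

```r
# Sales (S) and Price (P) columns of Table A.2
Sales <- c(230, 181, 165, 150, 97, 192, 181, 189, 172, 170)
Price <- c(125, 99, 97, 115, 120, 100, 80, 90, 95, 125)
X <- cbind(S = Sales, P = Price)

mu        <- colMeans(X)       # estimates of mu_S and mu_P
Sigma     <- cov(X)            # item a: covariance matrix
Sigma_inv <- solve(Sigma)      # item b: its inverse
root_det  <- sqrt(det(Sigma))  # item c: |Sigma|^(1/2)
centered  <- sweep(X, 2, mu)   # item d: rows are [(S - mu_S) (P - mu_P)]

# Quadratic form in the exponent of equation (1), one value per observation
quad <- rowSums((centered %*% Sigma_inv) * centered)

# Bivariate normal density of equation (1) at each observation
dens <- exp(-0.5 * quad) / (2 * pi * root_det)

print(round(Sigma, 2))
print(signif(Sigma_inv, 2))
print(root_det)
```

The quadratic form computed here is the squared Mahalanobis distance, so `mahalanobis(X, mu, Sigma)` can be used as an independent check.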

    Figure 8: 2D Binned Kernel Density Estimate, with bandwidth of (5,5).

    contour(est$x1,est$x2,est$fhat,col = "blue")

    Figure 9: Contour plot of 2D Binned Kernel Density Estimate, with bandwidth of (5,5).

    b. Bandwidth = (10,10)

    library(KernSmooth)

    est

    theta=45,phi=90,col="red",shade = 0.1)

    Figure 10: 2D Binned Kernel Density Estimate, with bandwidth of (10,10).

    contour(est$x1,est$x2,est$fhat,col = "red")

    Figure 11: Contour plot of 2D Binned Kernel Density Estimate, with bandwidth of (10,10). This is a contour plot of Figure 10.

    c. Bandwidth = (15,15)

    library(KernSmooth)

    est

    It is observed that the three-dimensional plot of the data is not very smooth in the first plot, with bandwidth (5,5), but becomes somewhat smoother in the following plot with bandwidth (10,10). The third bandwidth makes it smoother than the other two, but there are still two circles visible on the plot, so it is not perfectly smooth. With a further increase in the bandwidth, however, the plot forms a smooth normal-looking surface. For example, using the bandwidth (45,45) below gives a smooth surface over the mesh induced by the grid points.

    est
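The effect of the bandwidth described above can be checked numerically. The following is a minimal sketch using `bkde2D()` from `KernSmooth`; it assumes the estimate is computed on the Sales and Price columns of Table A.2, and the viewing angle passed to `persp()` as well as the names `bandwidths`, `fits`, and `peaks` are choices made here for illustration.

```r
library(KernSmooth)  # bkde2D(): 2D binned kernel density estimate

# Sales and Price columns of Table A.2
Sales <- c(230, 181, 165, 150, 97, 192, 181, 189, 172, 170)
Price <- c(125, 99, 97, 115, 120, 100, 80, 90, 95, 125)
X <- cbind(Sales, Price)

# One estimate per bandwidth discussed in the text
bandwidths <- c(5, 10, 15, 45)
fits <- lapply(bandwidths, function(h) bkde2D(X, bandwidth = c(h, h)))
names(fits) <- paste0("h", bandwidths)

# Peak height of each estimate: a larger bandwidth spreads the mass of
# every observation more widely, which lowers and flattens the surface
peaks <- sapply(fits, function(est) max(est$fhat))
print(peaks)

# Surface and contour plot for the largest bandwidth
est <- fits[["h45"]]
persp(est$x1, est$x2, est$fhat, theta = 45, phi = 30,
      xlab = "Sales", ylab = "Price", zlab = "density")
contour(est$x1, est$x2, est$fhat)
```

A falling peak height is only a crude summary of smoothness, but it moves in the same direction as the visual impression from the perspective plots.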

    Appendix

    A. R code for Figures 2, 3, 4, 6, and 7. "Variable" is used as a placeholder so that the code can be adapted to the different variables of the data sets. Since there are two data sets (three and two response variables), "DataSet" is used as the placeholder for the data set. For example, to plot the Sales variable of the Sales data, replace "Variable" with Sales and "DataSet" with "Res2Data".

    library(ggplot2)

    attach(DataSet)

    df

    + 2*qprobs*fd$vcov[1,2]

    #lower bound

    xpl
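The fragments of the listing that remain (`fd$vcov`, `qprobs`, `#lower bound`, `xpl`) are consistent with a pointwise band computed from a maximum-likelihood normal fit. The sketch below is one way to compute such a band with `MASS::fitdistr()`. It is an assumption about the approach, not a recovery of the original listing; it omits the ggplot2 drawing step, and names such as `xp_hat`, `v_xp_hat`, `xpu`, and `band` are introduced here for illustration.

```r
library(MASS)  # fitdistr(): maximum-likelihood fitting of distributions

# Example variable: y1 from Table A.1
y1 <- c(35, 35, 40, 10, 6, 20, 35, 35, 35, 30)
n  <- length(y1)

# Fit a normal distribution; estimate = (mean, sd), vcov = their covariance
fd    <- fitdistr(y1, "normal")
mu    <- unname(fd$estimate["mean"])
sigma <- unname(fd$estimate["sd"])

# Plotting positions and the matching standard normal quantiles
probs  <- ppoints(n)
qprobs <- qnorm(probs)

# Fitted quantile x_p = mu + z_p * sigma and its approximate variance
xp_hat   <- mu + qprobs * sigma
v_xp_hat <- fd$vcov[1, 1] + qprobs^2 * fd$vcov[2, 2] +
  2 * qprobs * fd$vcov[1, 2]

# Pointwise 95% band around the fitted line
xpl <- xp_hat + qnorm(0.025) * sqrt(v_xp_hat)  # lower bound
xpu <- xp_hat + qnorm(0.975) * sqrt(v_xp_hat)  # upper bound

band <- data.frame(z = qprobs, obs = sort(y1), fit = xp_hat,
                   lower = xpl, upper = xpu)
print(band)
```

Observations in `band$obs` that fall below `lower` or above `upper` are the points that would appear outside the band in the normal probability plots.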