
Multivariate Statistical Analysis

Webinar for the GLJUG

Introduction

Multivariate analysis (in JMP 11.0.0: the Multivariate and Fit Model platforms) has been described as the area of statistics that unravels the effects of each variable in a set of several to many variables. Problems in this setting include correlation and interaction as each variable affects a test or descriptive statistic; analyzing a single variable therefore yields little or no true information about the system. Multivariate analysis consists of a set of tools that are useful when multiple observations are taken on each object or individual in one or many samples. This type of data is encountered frequently in real-life situations. The variables are usually measured simultaneously on a single unit and are correlated. The techniques are usually exploratory, generating hypotheses rather than testing them, and can be used on continuous, nominal, and ordinal data, just the types that JMP loves to see. A very useful application of these tools is to curb a researcher's tendency to read too much into their data! The purpose of this presentation is to acquaint you with the basic tools and operations, along with their attendant accessories in JMP, using data sets readily available within the program.

Theory

I) The Multivariate Normal Distribution

We will briefly discuss a bit of theory before proceeding into our multivariate toolkit. Univariate tests and confidence intervals (CIs) are usually based on the normal (Gaussian) distribution; in like manner, multivariate tests and CIs are based on the multivariate normal distribution (1). The multivariate normal distribution has the following important properties:

1) The distribution is completely described by means, variances, and covariances.

2) Bivariate plots of multivariate data show linear trends.

3) If the variables are uncorrelated, they are independent.

Page 2: Multivariate Statistical Analysis Webinar for the GLJUG ... · PDF fileMultivariate Statistical Analysis . Webinar for the GLJUG . ... (MANOVA) 3 This is done in JMP by choosing the

2

4) Linear functions of multivariate normal variables are also normal.

5) The convenient form of the multivariate normal density function lends itself to derivation of many useful properties and test statistics.

6) Even if the data is not multivariate normal, the multivariate normal may serve as a useful approximation, especially in inferences involving sample mean vectors, which are approximately multivariate normal by the central limit theorem.
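To make these properties concrete outside of JMP, here is a minimal sketch (assuming Python with NumPy, which is not part of the original webinar) that draws from a bivariate normal and checks properties 1) and 4) empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Property 1): a multivariate normal is completely described by its
# mean vector and covariance matrix.
mu = np.array([0.0, 2.0])
sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
x = rng.multivariate_normal(mu, sigma, size=100_000)
print(x.mean(axis=0))            # close to mu
print(np.cov(x, rowvar=False))   # close to sigma

# Property 4): a linear function a'x of multivariate normal variables
# is itself normal, with mean a'mu and variance a' sigma a.
a = np.array([1.0, -0.5])
y = x @ a
print(y.mean(), a @ mu)              # means agree
print(y.var(ddof=1), a @ sigma @ a)  # variances agree
```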

Now let’s take a look at the straightforward extension of ANOVA to MANOVA.

II) Multivariate Analysis of Variance (MANOVA)

In the Univariate case (balanced, one-way ANOVA) we have a random sample of n observations from each of k normal populations with equal variances:

Y11  Y21  …  Yk1
Y12  Y22  …  Yk2
 ⋮    ⋮       ⋮
Y1n  Y2n  …  Ykn

(The k samples are assumed to be independent.)

The model for each observation is then Yij = µi + εij, where i = 1, 2, …, k and j = 1, 2, …, n, and µi is the mean of the ith population.

In the multivariate case we assume that k independent random samples of size n are obtained from p-variate normal populations with equal covariance matrices. Since each sample is composed of several to many variables, the observations are represented by vectors, and so is the model for each observation:

yij = µi + εij

but here each symbol is a vector of p measurements, so the model describes multiple variables in multiple samples. Now let's delve into the tests themselves.

Methodologies (including tips)

I) Multivariate Analysis of Variance (MANOVA)


This is done in JMP by choosing the Manova personality in the Fit Model platform, which fits more than one Y (outcome or dependent variable) on one or more independent variables. As the JMP manual (2) states, the Manova fitting personality "Fits models that involve multiple continuous variables. Techniques include multivariate analysis of variance, repeated measures, discriminant analysis and canonical correlations. These are useful as follows:

• Repeated measures analysis when repeated measurements are taken on each subject and you want to analyze effects both between subjects and within subjects across the measurements. This multivariate approach is especially important when the correlation structure across the measurements is arbitrary.

• Canonical correlation to find the linear combination of the X and Y variables that has the highest correlation.

The multivariate fit begins with a rudimentary preliminary analysis that shows parameter estimates and least squares means. You can then specify a response design across the Y variables and multivariate tests are performed."

Model Specification: For the data set drug.jmp, let's analyze the effect of three drugs on two unnamed clinical variables. For this MANOVA we will use the Fit Model platform, listing X and Y as the Y variables and Drug as the fitting effect. The fitting personality is Manova:

Upon hitting run for the initial fit, we obtain:


This shows us that there seems to be a drug effect on the x, y variables but there are no hypothesis test results (as we haven’t asked for them yet). When we specify the response design, the multivariate estimates and tests will be displayed.


We then select Response Specification and, at first, choose the response type Identity (which uses each separate response, via the identity matrix):


Please note the following very important points regarding the results:

1) Although the Univariate box was checked, there are no univariate test results. For mathematical reasons (the official verbiage is that "the M-matrix is not ortho-normalizable to full rank"), the variables could not be made independent and thus could not be analyzed separately.

2) Also note that there are 4 tests for the model and for the drug effect (the intercept is usually of no interest), and that within each, one result disagrees with the other 3. This does occur sometimes and leaves the novice wondering which is correct. Remember that each test examines the model and error matrices differently, and that no one test is best in all applications. The tip here is: a) if the deviation from the null hypothesis is small, the order of test preference (based solely on statistical power) is:

1) Pillai's Trace, 2) Wilks' Lambda, 3) Hotelling-Lawley Trace, 4) Roy's Maximum Root

b) if the deviation from the null hypothesis is much larger, the order of test preference is the reverse of the above. In the Response Specification I chose Identity because this and the Repeated Measures (RM) responses are the two most commonly used, and the Identity response generates the 4 tests above. However, in clinical studies we often see measures taken on one or many patients multiple times, so the RM choice becomes more appropriate. In this case the hypothesis tests change to between-subjects and within-subjects tests. Please note that when RM is requested, another dialog will appear asking you to label the measure studied; the default is Time.


We see that between subjects there appears to be no significant difference, but within subjects there is a significant difference with time, although no significant time*drug interaction. Remembering that data on any given subject are correlated, we must not use the univariate tests, which do not correct for these correlations.
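For readers who want to reproduce this kind of analysis outside of JMP, the following is a minimal sketch, assuming Python with pandas and statsmodels and using simulated data in place of drug.jmp (the simulated values are not the real table); statsmodels' MANOVA reports the same four multivariate test statistics discussed above:

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(1)

# Simulated stand-in for drug.jmp: two correlated clinical responses
# (x, y) under three drugs, n = 10 per group (values are not the
# actual JMP sample table).
groups = np.repeat(["a", "d", "f"], 10)
shift = {"a": 0.0, "d": 1.0, "f": 3.0}
xy = np.vstack([
    rng.multivariate_normal([10 + shift[g], 5 + shift[g]],
                            [[9, 4], [4, 9]])
    for g in groups
])
df = pd.DataFrame({"drug": groups, "x": xy[:, 0], "y": xy[:, 1]})

# mv_test() reports Wilks' lambda, Pillai's trace, the Hotelling-Lawley
# trace, and Roy's greatest root for each model term.
fit = MANOVA.from_formula("x + y ~ drug", data=df)
print(fit.mv_test())
```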


II) Cluster Analysis

In many research studies a basic component of the analysis is to classify subgroups of the area studied, e.g., tumor types, drug categories, sub-species, chemical groups, etc. Ideally we would like the subgroups to be completely separated by the classification scheme, but this is not always possible, so we settle for the best separation of the data. At the least, the class labels provide a simple way of describing the patterns of similarities and differences in the data, and clearly different definitions of the sub-groups will collect different subjects into each (3). Basically, we are classifying objects based on a set of rules. There is no generally optimal scheme across all subject matter, but experience points us in useful directions. Some schemes classify groups whose separations are not known, as the subjects are being analyzed for the first time; in others the class membership is already known, and we are trying to discover rules by which to classify new subjects.

In JMP, there are two methods of clustering, hierarchical and k-means. They differ in methodology and application. Hierarchical clustering is agglomerative: it combines nearest neighbors until all data are assigned to a group. It is useful for "small" data sets (up to several thousand rows), and you do not need to specify the number of groups beforehand. K-means clustering is useful for very large data sets (up to hundreds of thousands of rows); it starts from guessed cluster seed points and iteratively assigns points to clusters and recalculates the centers of the new clusters. Here you must specify the number of clusters beforehand (2). There are two other clustering techniques, available in JMP, that we will not cover. These are Normal Mixtures clustering (like k-means, but instead of placing points into groups it gives a probability of membership in each group) and Self-Organizing Maps (SOMs), another variant of k-means in which points are laid out on a grid to reflect their positions in the multivariate space.

Specifying the Analysis

Clustering is done in the Multivariate platform. With the appropriate data table selected, the steps are Analyze/Multivariate Methods/Cluster. The following dialog box appears:


We are using Drug.jmp for this exercise, attempting to assign the x and y effects to a particular drug (there are three in the data set). We will use the default settings for our first attempt. Hierarchical clustering is appropriate for this very small set, and Ward's is an acceptable method for many studies. There is no one "best" methodology for every study, and the differences among the methods are outlined in the JMP Multivariate manual (2). To clarify the results we will ask that the clusters be distinctly colored (red triangle in the Hierarchical Clustering gray bar). We place X and Y in the Y, Columns box and Drug in the Label box. Hit OK and get:


It is apparent that the program had a hard time correctly assigning the drugs to a distinct group based on our two clinical markers. It may be that the markers are highly variable from person to person, so a more precise separation methodology may be needed, as well as a different (or larger) set of clinical markers. Even though this is a small data set, let's see how k-means treats the separation of the drugs. Again, go to Analyze/Multivariate Methods/Cluster and change Hierarchical to k-means. X, Y, and Drug are placed in the same boxes as above. Upon hitting OK, another dialog comes up asking us to provide the number of clusters expected. The default is 3, and when we then hit Go we obtain:


Notice on the biplot that we have a very nice separation for drug 3, but drugs 1 and 2 overlap substantially. Note that we have done this separation with principal components (PCs). If we have 3 variables, we may further clarify the view with a 3rd PC in a 3D plot; as these 3D plots are rotatable, we would have a better view of the best separation. (The biplots appear above because they were selected in the File/Preferences section for the Cluster platform prior to the analysis.)
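As a point of comparison outside of JMP, both clustering approaches can be sketched as follows; this assumes Python with SciPy and scikit-learn and uses simulated two-marker data standing in for Drug.jmp:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Simulated two-marker data for three drug groups whose centers
# overlap, roughly as they appear to in Drug.jmp.
centers = np.array([[10.0, 5.0], [11.0, 6.0], [14.0, 9.0]])
xy = np.vstack([rng.normal(c, 2.0, size=(10, 2)) for c in centers])

# Hierarchical (agglomerative) clustering with Ward's method: no
# cluster count is needed up front; the tree is cut afterwards.
tree = linkage(xy, method="ward")
hier_labels = fcluster(tree, t=3, criterion="maxclust")

# k-means: the number of clusters must be specified beforehand; seeds
# are placed and points are iteratively reassigned as centers move.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(xy)

print(hier_labels)   # cluster 1-3 per row from the hierarchical tree
print(km.labels_)    # cluster 0-2 per row from k-means
```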


III) Discriminant Analysis (DA)

DA is a multivariate method used for group separation by means of discriminant functions: linear combinations of variables that best separate the groups. The underlying mathematics is straightforward, and the technique has been demonstrated to be very useful in a variety of situations. We are actually looking for a way to predict group membership (an X or classification variable that is nominal or ordinal) based on the continuous Y responses that form the data set. JMP implements this analysis by several techniques. The manual (2) states: "There are several varieties of discriminant analysis. JMP implements linear and quadratic discriminant analysis, along with a method that blends both types. In linear discriminant analysis, it is assumed that the Y variables are normally distributed with the same variances and covariances, but that there are different means for each group defined by X. In quadratic discriminant analysis, the covariances can be different across groups. Both methods measure the distance from each point in the data set to each group's multivariate mean (often called a centroid) and classify the point to the closest group. The distance measure used is the Mahalanobis distance, which takes into account the variances and covariances between the variables." For this analysis we will follow the JMP manual, as Fisher's Iris data is the classical example and yields a very illustrative canonical plot. Within JMP, open Iris.jmp, choose Analyze/Multivariate Methods/Discriminant, and fill in the dialog box as below:

Upon hitting the OK button we obtain:


Before we proceed with the analysis, notice that the default (and straightforward) method, linear discriminant analysis, was used. The central idea is that we are assigning all groups a common covariance matrix: based on the biology, it seems reasonable that three varieties of the same species would have similar variances and covariances (but NOT means) for the four factors that were measured. The Canonical Plot is the core of what we wish to do, i.e., effect the best separation of the three groups and save the results for classification of future data. The plot displays the data rows as points, along with the group multivariate means, in the manner that best separates the groups. Each group is summarized by a crosshair with two concentric circles: the crosshair is the multivariate mean for that group, the inner circle is the 95% confidence limit for the multivariate mean, and the outer circle is the normal 50% contour, which encloses approximately half of the data points for that group. Note that if there were four or more groups (subspecies of iris), we could produce a 3D plot also. The other important feature is the Score Summaries, which show that 3/150 were misclassified, an error of 2%.
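The same linear discriminant analysis can be sketched outside of JMP; a minimal version assuming Python with scikit-learn is shown below. On the iris data, resubstitution typically misclassifies 3 of the 150 rows, consistent with the Score Summaries above:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()

# Linear DA: one pooled covariance matrix shared by the three species,
# with a separate mean vector (centroid) per group.
lda = LinearDiscriminantAnalysis().fit(iris.data, iris.target)

# Score summary: count the misclassified rows on the training data.
pred = lda.predict(iris.data)
print((pred != iris.target).sum(), "of", len(iris.target), "misclassified")

# Canonical (discriminant) scores behind the canonical plot: at most
# (number of groups - 1) = 2 dimensions for three species.
scores = lda.transform(iris.data)
print(scores.shape)
```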

IV) Multivariate (including correlation)

This platform is somewhat analogous to the Distribution platform for univariate statistics. It is the multivariate personality for routine data exploration and tells us something about how the variables interact with one another through the following outputs:

- Scatter plots with density ellipses
- Correlations
- Covariance matrix
- Simple statistics
- Outlier analysis

There are a few other techniques, but these are the most important, except for PCA, which will be covered in section VI). The platform is accessed by Analyze/Multivariate Methods/Multivariate. We will use solubility.jmp as our test set. The following dialog appears:


The 6 chemicals will go to the Y, Columns box. Under Estimation Methods we have several choices, most of which are useful for particular situations involving the size of the data set, bias, missing data, or outliers. These specialty methods are detailed in the JMP manual. We will use the Default, which chooses between REML (restricted maximum likelihood) and Pairwise, the two used most often. We get the following output upon hitting OK:


This is quite a lot of information, and you will notice that many of the panels are closed (denoted by right-pointing arrows on their gray bars). Most of the more important information is disclosed in the scatterplot matrix (with correlations requested), the covariance and correlation matrices, and the outlier analysis (Mahalanobis distance). A quick rundown of their usefulness follows.

Correlations – The correlations table is the matrix of correlation coefficients that summarize the strength of the relationship between each pair of response variables (the Y's). (2)


Inverse Correlation – The diagonal elements of this matrix are a function of how closely each variable is a linear function of the other variables (2).

Partial Correlations – This table shows the partial correlations of each pair of variables after adjusting for all the other variables. (2)

Covariance Matrix – Displays the covariance matrix used in the analysis.

Pairwise Correlations – Lists the Pearson product-moment correlations for each pair of Y variables. The report includes a table of significance probabilities for each pairwise correlation and illustrates them with bar graphs.

Nonparametric Correlations – Displays nonparametric measures of association in tabular form. These include Spearman's rho, Kendall's tau, and Hoeffding's D. See the JMP manual (2) for further explanation of their use.

Outlier Analysis – Measured by the following choices: Mahalanobis distance (the most common choice), jackknife distance, and the T2 statistic. All measure the distance from the group multivariate center to a data point. Points outside the confidence ellipsoids are outliers NOT because they are extreme in the coordinate sense, but because they lie outside the correlation structure.

Ellipsoid 3D Plot – Toggles a 95% confidence ellipsoid around three chosen variables (defining the axes of the graph).

Impute Missing Data – Replaces missing data with estimated values, conditioned on the extant values and calculated using the mean and covariance matrix.
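A minimal sketch of the platform's core numerical outputs, assuming Python with pandas, NumPy, and SciPy and simulated data standing in for solubility.jmp (the column names are hypothetical), might look like this:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(3)

# Simulated stand-in for solubility.jmp: six correlated measurements
# (hypothetical column names, not the real table).
cols = [f"solvent_{i}" for i in range(1, 7)]
common = rng.normal(size=(72, 1))                      # shared factor
df = pd.DataFrame(common + 0.5 * rng.normal(size=(72, 6)), columns=cols)

print(df.corr())                 # correlation matrix (pairwise deletion
                                 # is used automatically if NaNs appear)
print(df.cov())                  # covariance matrix
print(np.linalg.inv(df.corr()))  # inverse correlation matrix

# Outlier analysis: Mahalanobis distance from each row to the
# multivariate mean, respecting the correlation structure.
center = df.mean().to_numpy()
vi = np.linalg.inv(df.cov().to_numpy())
d = df.apply(lambda row: mahalanobis(row, center, vi), axis=1)
print(d.sort_values(ascending=False).head())
```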

V) Partial Least Squares

“The Partial Least Squares (PLS) platform fits linear models based on factors, namely, linear combinations of the explanatory variables (X’s). These factors are obtained in a way that attempts to maximize the covariance between the X’s and the response or responses (Y’s). PLS exploits the correlations between the Xs and the Ys to reveal underlying latent structures. Partial least squares performs well in situations such as the following, where the use of ordinary least squares does not produce satisfactory results: More X variables than observations; highly correlated X variables; a large number of X variables; several Y variables and many X variables.


The platform uses the van der Voet T2 test and cross validation to help you choose the optimal number of factors to extract. In JMP, the platform uses the leave-one-out method of cross validation. PLS is used widely in modeling high-dimensional data in areas such as spectroscopy, chemometrics, genomics, psychology, education, economics, political science, and environmental science. The PLS approach to model fitting is particularly useful when there are more explanatory variables than observations or when the explanatory variables are highly correlated. You can use PLS to fit a single model to several responses simultaneously. Two model fitting algorithms are available: nonlinear iterative partial least squares (NIPALS) and a "statistically inspired modification of PLS" (SIMPLS). The SIMPLS algorithm was developed with the goal of solving a specific optimality problem. For a single response, both methods give the same model. For multiple responses, there are slight differences." (2)

Again, following JMP's suggestion, the file Baltic.jmp will be used. The problem involves using the spectra of sea water samples to determine the amounts of three industrial pollutants in the water. Please note that the rows represent observations and NOT factors. The three Y's represent outputs, or amounts of pollutant detected, while v1-v27 are spectral wavelengths; the underlying factors are called latent variables because we are not directly observing the variables that affect the outputs. Use Analyze/Multivariate Methods/PLS to access the platform. Assign ls, ha, and dt to Y and v1-v27 to X.

Notice that both scaling (the data are divided by the standard deviation) and centering (the mean is subtracted from the data) are checked; JMP does this by default. When we hit OK, we get a dialog box that requests the model specification and the factor search range (JMP has a suggestion for both).


Upon accepting these suggestions and hitting OK, we obtain the summary PLS report:


These reports are very useful for assessing the adequacy and accuracy of the model, as well as the importance of the individual factors to the model. The report contains many useful features, and many more are available. There are only two red triangles in the report: the one at the top (on the Partial Least Squares gray bar) deals only with scripts; the one on the model fit bar has many other choices, the more important of which are covered below. Starting with the summary report, use the following:

Model Comparison Summary – For each model run, the summary lists how well the model explains the variation in both the X and Y factors. In this case, at 99.9+%, the model was very accurate. There is also a high number of factors that exceed the significance cutoff in the Variable Importance for Projection (VIP) score; anything less than 0.8 is considered unimportant by the original author of the statistic.

Crossvalidation with Method= – Gives the root mean PRESS (predicted residual sum of squares; lower is better), the van der Voet T2 statistic (which determines whether a model with a different number of factors differs significantly from the model with the minimum PRESS value, i.e., whether we can get away with fewer factors), and the p-value for each factor's T2 (significant values are shown in color). From the minimal PRESS score, it appears that 7 factors are optimal for the model. The bar graph illustrates this, as we see PRESS values rising on either side of 7.

X and Y Score Plots – Produce X vs. Y plots for the 7 factors making up the model. When the factors do a good job, we see tight correlation about the straight line. Other tables put a number on these correlations.

Percent Variation Explained – This is perhaps the most important diagnostic: as in all research work, we would like to select the model that explains the most variance in the output (dependent) variables, and this gives us a valuable diagnostic to that effect.

In the model fit red triangle (NIPALS Fit with 7 Factors) we find a variety of useful diagnostics, of which the more useful include the following:

Percent Variation Plots – Show, for each X and Y factor, which latent variables contribute the most to that factor's variability. Here we need to know more about just what is being measured and how, and what the latent variable may represent.


VIP Plots – The Variable Importance Plot graphs the VIP values for each X variable. The Variable Importance Table shows the VIP scores. A VIP score is a measure of a variable’s importance in modeling both X and Y. If a variable has a small coefficient and a small VIP, then it is a candidate for deletion from the model. A value of 0.8 is generally considered to be a small VIP and a blue line is drawn on the plot at 0.8.


Diagnostic Plots – Show diagnostic plots for assessing the model fit. The Actual by Predicted plot for each Y value gives us a good idea of the fit; higher correlation is better. Four plot types are available: Actual by Predicted Plot, Residual by Predicted Plot, Residual by Row Plot, and a Residual Normal Quantile Plot. Plots are provided for each response. When a validation set, or a validation and test set, is in use, separate reports are provided for these sets, as well as for the training set. (2)
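To make the factor-selection and VIP ideas concrete outside of JMP, here is a minimal sketch assuming Python with scikit-learn; the data are simulated stand-ins for Baltic.jmp, leave-one-out PRESS is used to choose the number of factors, and the VIP computation is one common formulation rather than necessarily JMP's exact recipe:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(4)

# Simulated stand-in for Baltic.jmp: 27 collinear "spectral" predictors
# driven by 3 latent factors, with 3 pollutant-like responses.
n, p, m = 40, 27, 3
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.05 * rng.normal(size=(n, p))
Y = latent @ rng.normal(size=(3, m)) + 0.05 * rng.normal(size=(n, m))

# Choose the number of factors by leave-one-out PRESS (lower is
# better), mirroring the platform's crossvalidation report.
press = {}
for a in range(1, 9):
    pls = PLSRegression(n_components=a, scale=True)  # centers and scales
    pred = cross_val_predict(pls, X, Y, cv=LeaveOneOut())
    press[a] = float(((Y - pred) ** 2).sum())
best = min(press, key=press.get)
print("PRESS by factor count:", press, "-> best:", best)

# VIP scores from the fitted model (a common formulation; JMP's exact
# computation may differ in detail). VIP < 0.8 flags a weak variable.
pls = PLSRegression(n_components=best, scale=True).fit(X, Y)
t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
s = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)  # Y variation per factor
wnorm = w / np.linalg.norm(w, axis=0)
vip = np.sqrt(p * (wnorm ** 2) @ s / s.sum())
print("variables with VIP < 0.8:", np.where(vip < 0.8)[0])
```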


VI) Principal Components Analysis (PCA)

This was left for last as it is based on more complex mathematics that carries an inherent drawback. It is the most sophisticated technique in terms of power to separate groups, but it will automatically move to as many dimensions as needed for maximal separation. When more than three dimensions are used, interpretation becomes very difficult, as we cannot easily connect dimensions with real, physical factors past three (four, if time is counted). The good news is that we rarely need more than three, and this technique is the best for group separation (or at least the most sophisticated); its scores may be saved and used to make a predictive model for future data. It is essentially a dimension-reducing technique, so that many dimensions may be reduced to a workable number to describe variation. The JMP manual for multivariate statistics tells us:

"Principal component analysis accounts for the total variance of the observed variables (that is, the variance common to all variables and the variance unique to each variable). If you want to see the arrangement of points across many correlated variables, you can use principal component analysis to show the most prominent directions of the high-dimensional data. Using principal component analysis reduces the dimensionality of a set of data. Principal components representation is important in visualizing multivariate data by reducing it to graphable dimensions. Principal components is a way to picture the structure of the data as completely as possible by using as few variables as possible. For n original variables, n principal components are formed as follows:

• The first principal component is the linear combination of the standardized original variables that has the greatest possible variance.

• Each subsequent principal component is the linear combination of the variables that has the greatest possible variance and is uncorrelated with all previously defined components." (2)
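These two defining properties are easy to verify outside of JMP; the sketch below assumes Python with scikit-learn and runs PCA on the standardized iris measurements:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA on the standardized iris measurements (i.e., on correlations).
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Share of total variance captured by each component: the numbers
# behind a Pareto plot of the eigenvalues.
print(pca.explained_variance_ratio_)

# The component scores are mutually uncorrelated: off-diagonal
# correlations are ~0 up to rounding.
scores = pca.transform(X)
print(np.round(np.corrcoef(scores, rowvar=False), 3))
```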

For our analyses, we will again use Fisher's Iris data and compare the results with the discriminant analysis. Start with Analyze/Multivariate Methods/Principal Components and put the four plant measurements in the Y, Columns box:

Upon hitting OK, we get the Summary Plots:


Under the red triangle we select the following important diagnostics: Scree Plot and 3D Scatter Plot.


Now as to interpretation and usefulness:

Correlation Table – Gives the association strength of the four data measurements.


Pareto Plot of Eigenvalues – Illustrates the relative contribution of each PC to explaining the total system variability.

2D Component Plot – Shows the optimal separation in 2D by PC1 and PC2.

Loading Plot – Very useful for illustrating the relationships between the variables (a sketch of computing these loadings follows the quoted guidance below). Specifics include:

"X-variables correlation structure

Variables close to each other in the loadings plot will have a high positive correlation if the two components explain a large portion of the variance of X. The same is true for variables in the same quadrant lying close to a straight line through the origin. Variables in diagonally opposed quadrants will have a tendency to be negatively correlated.

For example, in the figure below, variables “redness” and “colour” have a high positive correlation, and they are negatively correlated to variable “thickness”. Variables “redness” and “off-flavour” have independent variations. Variables “raspberry flavour” and “off-flavour” are negatively correlated. Variable “sweetness” and “chew” resistance cannot be interpreted in this plot, because they are very close to the center.

[Figure: Loadings of 12 sensory variables along (PC1, PC2)]

Note: Variables lying close to the center are poorly explained by the plotted PCs. Do not interpret them in that plot!


When working with spectroscopic or time series data, line loadings plots will aid better interpretation. This is because the loadings will have a profile similar to the original data and may highlight regions of high importance. The plot below shows how a number of PC’s can be overlayed in a line loadings plot to determine which components capture the important sources of information.” (4)
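As flagged in the Loading Plot item above, here is a minimal sketch of computing such loadings outside of JMP, assuming Python with scikit-learn; for standardized data, each loading is the correlation between a variable and a component:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X)

# For standardized variables, loading(j, a) = eigenvector(j, a) *
# sqrt(eigenvalue(a)) = correlation of variable j with component a;
# these pairs are the coordinates of a (PC1, PC2) loadings plot.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
for name, (l1, l2) in zip(iris.feature_names, loadings):
    print(f"{name:25s} PC1 = {l1:+.2f}   PC2 = {l2:+.2f}")
```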

Scree Plot – Shows a graph of the eigenvalue for each component. This helps in visualizing the dimensionality of the data space, i.e., how many PCs are necessary to best separate the groups. We look for the break, or knee, in the curve. In the Iris example both 2 and 3 work reasonably well.

3D Scatterplot – Displays a 3D graphic of the PC scores. The default is the first three PCs. This is useful in refining the visual separation, as the 3D graphics are rotatable.

Summary

Multivariate analysis offers a variety of tools to address the numerous situations in research and business regarding complex cases of data structure. JMP implements these in an easy-to-use platform with many helpful diagnostics.

Bibliography

1) Rencher, A.C. Methods of Multivariate Analysis (2nd Ed.). Wiley-Interscience, New York. 2002.

2) JMP 9 and 11 manuals: Modeling and Multivariate Methods. SAS, Cary, N.C. 2010, 2013.

3) Everitt, B.S. and G. Dunn. Applied Multivariate Data Analysis (2nd Ed.). Oxford University Press, New York. 2001.

4) Umetrics. Unscrambler X, On-line Help. Camo Software, Oslo, Norway. 2009-2013.