The Unscrambler X Tutorials
Produced by CAMO
Compiled by Zheng Guanghui
Tutorial A: A simple example of calibration
• Description
• Opening the project file
• Define ranges
• Univariate regression
• Calibration
• Interpretation of the results
• Prediction
• Evaluation of the predicted results

Tutorial B: Quality analysis with PCA and PLS
• Description
• Preparing the data
• Objective 1: Find the main sensory qualities
• Objective 2: Explore the relationships between instrumental/chemical data (X) and sensory data (Y)
• Objective 3: Predict user preference from sensory measurements

Tutorial C: Spectroscopy and interference problems
• Description
• Get to know the data
• Univariate regression
• Calibration
• Multiplicative Scatter Correction (MSC)
• Check the error in original units: RMSE
• Predict new MSC-corrected samples
• Guidelines for calibration of spectroscopic data

Tutorial D: Screening and optimization designs
• Description
• Build a screening design
• Estimate the effects
• Draw a conclusion from the screening design
• Build an optimization design
• Compute the response surface
• Draw a conclusion from the optimization design

Tutorial E: SIMCA classification
• Description
• Reformat the data table
• Graphical clustering
• Make class models
• Classify unknown samples
• Interpretation of classification results
• Diagnosing the classification model

Tutorial F: Interacting with other programs
• Description
• Import spectra from an ASCII file
• Import responses from Excel
• Create a categorical variable
• Append a variable to the data set
• Organizing the data
• Study the data before modeling
• Make a PLS Model
• Save PLS model file
• Export ASCII-MOD file
• Export data to ASCII file

Tutorial G: Mixture design
• Description
• Design variables and responses
• Build a simplex centroid design
• Import response values from Excel
• Check response variations with statistics
• Model the mixture response surface
• Conclusions

Tutorial H: PLS Discriminant Analysis (PLS-DA)
• Description
• Build PLS regression model
• Classify unknown samples
• Some general comments on classification

Tutorial I: Multivariate curve resolution (MCR) of dye mixtures
• Description
• Data plotting
• Run MCR with default options
• Plot MCR results
• Interpret MCR results
• Run MCR with initial guess
• Validate the estimated results with reference information
• View an MCR result matrix

Tutorial J: MCR constraint settings
• Description
• Data plotting
• Estimate the number of pure components and detect outliers with PCA
• Run MCR with default settings
• Tune the model’s sensitivity to pure components
• Run MCR with a constraint of closure
• Remove outliers and noisy wavelengths with recalculate

Tutorial K: Clustering
• Description
• Transform the raw spectra
• Application of K-Means clustering
• Application of Hierarchical Cluster Analysis (HCA)
• Repeat the HCA using a correlation-based measure
• Using the results of HCA to confirm the results of PCA

Tutorial L: L-PLS
• Description
• Open and study the data
• Build a L-PLS model
• Interpret the results
• Verify the results
• Bibliography

Tutorial M: Variable selection and model stability
• Description
• Create a PLS model
• Interpret a PLS model
• Conclusions
Tutorial A: A simple example of calibration
• Description
  o Expected outcomes of this tutorial
  o Data table
• Opening the project file
• Define ranges
• Univariate regression
• Calibration
• Interpretation of the results
• Prediction
• Evaluation of the predicted results
Description
This tutorial aims to provide an example of measuring the concentration (Y) of a chemical constituent “a” by conventional transmission spectroscopy. The situation is complicated by the presence of an interferent “b”, present in varying unknown quantities, whose instrument response strongly overlaps that of “a”.
Expected outcomes of this tutorial
This tutorial contains the following tasks and procedures:
• Open a project file.
• Define row and column sets.
• Compare the results of univariate vs. multivariate regression.
• Develop calibration models.
• Predict new samples.
• Validate the model for future use.
• Analyze and interpret regression coefficients.
• Explore the plotting options available for these methods.
References:
• Basic principles in using The Unscrambler®
• Descriptive Statistics
• About Regression methods
• Prediction
• Validation
Data table
The data for this tutorial can be found in the project file “Tutorial A” in the “Data” directory installed with The Unscrambler®.
Seven solutions (samples) of known concentration (Y) of the constituent “a” will be used as the calibration set. Three other (test) samples of unknown concentration are also available; these will be predicted using the developed regression model.
Light absorbance was measured at two different wavelengths, namely Red and Blue. Red is variable 1, Blue is variable 2. Variable 3 has been designated as the concentration of a.
Opening the project file
Task
Open the project “Tutorial A” into The Unscrambler® project navigator and study the data in the Editor. Use the Descriptive Statistics functionality to view some basic characteristics of the data table.
How to do it
Use File - Open to select the project file “Tutorial_A.unsb” in The Unscrambler® data samples directory. This directory is typically located in C:\Program Files\The Unscrambler X\Data.
For the purposes of this tutorial, click the following link to import the data. Tutorial A data set
The project should now be visible in the project navigator and the data should be displayed in the editor.
Note that the values for variable Comp “a” are missing (blank) for the 3 Unknown samples.
Use the Tasks-Analyze-Descriptive Statistics… option to view some basic statistics of the data, including the Mean, Standard Deviation, Skewness etc.
Tasks-Analyze-Descriptive Statistics…
The following dialog will open. Select the data matrix to be analyzed and ensure that no rows or columns have been excluded from the analysis.
Descriptive Statistics Dialog
After clicking OK, the statistics will be computed. A new analysis node will appear in the project navigator providing some simple plots and analysis of the data.
Descriptive Statistics Results Matrix
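The Mean, Standard Deviation and Skewness shown in the results matrix can also be reproduced outside the software. A minimal sketch with NumPy and SciPy, using made-up absorbance readings rather than the actual Tutorial A values:

```python
import numpy as np
from scipy import stats

# Hypothetical absorbance readings for illustration (not the Tutorial A values)
red = np.array([1.2, 2.1, 2.9, 4.2, 5.0, 6.1, 7.2])

mean = np.mean(red)
sdev = np.std(red, ddof=1)      # sample standard deviation (n - 1 denominator)
skew = stats.skew(red)          # third-moment skewness

print(f"Mean: {mean:.3f}  SDev: {sdev:.3f}  Skewness: {skew:.3f}")
```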
Define ranges
In most practical applications of multivariate data analysis, it is necessary to work on subsets of the data table. To do this, one must define ranges for variables and samples. One Sample Set (Row range) and one Variable Set (column range) make up a virtual matrix which is used in the analysis.
Task
Define two Column ranges (variable sets), one for “Light Absorbance” and the other for “Constituent a”. Also define two Row ranges (sample sets): “Calibration Samples” and “Prediction Samples”.
How to do it
There are two options for defining data ranges in The Unscrambler®:
Create Row/Column ranges using the right mouse click option
Highlight a range of variables to be defined and right click in the column header. This will display the Create Column Range option. Sample sets can also be defined as row ranges using a similar method and selecting Create Row Range.
Create a column range
Rename the column set by highlighting it in the project navigator, and right clicking. Choose the Rename option, and change the name to “Constituent a”.
Repeat this process for the “Light Absorbance” set containing the first two columns and the row sets: “Calibration” containing samples 1 to 7 and “Prediction” containing samples 8 to 10.
Use Edit - Define Range… to create row and column sets.
Open the Define Range dialog from the Edit menu. Define the data as follows,
Name: Light Absorbance
Interval: columns 1-2
Define Range Dialog
Enter the Column numbers directly into the Set Interval field under rows and columns.
Deselect variables marked by mistake by pressing Ctrl while clicking on the variable to be removed from the set.
Click OK.
Similarly define the second variable Set using the Edit -Define Range option and specifying:
• Name: Constituent A
• Set Interval: Column 3
Click OK.
Choose Edit - Create Row Range to create sample sets.
Four sample and variable sets should now be displayed in the project navigator.
Data set with ranges
By organizing the data into sets from the beginning, one adds value to the analysis and can use this information to communicate results. All analyses and plotting will be much easier to set up, and the sets can be used in the visualization of results.
Remember to save the project before proceeding: select File - Save or press the Save button on the toolbar.
Univariate regression
The simplest regression method (univariate regression) can be simply visualized in a 2-dimensional scatter plot.
Task
Make a regression model of component “a” and the absorbance of red light.
How to do it
Perform the regression by plotting the red light variable against Constituent a. Select Plot - Scatter from the Plot menu. The following plot should appear.
Scatter plot
The univariate regression should be performed on the calibration samples only, as the Y-values are missing in the prediction set.
The plot is displayed without the trend lines visible. Toggle the regression and/or target line on and off using the toolbar shortcut. Also view the statistics for the plot, toggling the statistics display on and off using its toolbar shortcut.
Statistics for the plot are shown in a special frame in the upper left corner.
Scatter plot with trend lines and statistics
The displayed correlation value of 0.91 indicates that the two variables are highly correlated. The univariate model for this data can be generated using the Offset value and Slope value. The equation is as follows:
Comp"a" = -0.9285 + 0.59524 * Red
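The Offset and Slope used above come from an ordinary least-squares fit of concentration on the Red absorbance, and the displayed correlation is the Pearson coefficient. A sketch of the same calculation with NumPy, on invented calibration values (not the actual Tutorial A numbers):

```python
import numpy as np

# Invented calibration data: Red absorbance (x) and known concentration of "a" (y)
red  = np.array([2.0, 3.0, 4.0, 5.5, 6.0, 7.5, 9.0])
conc = np.array([0.4, 0.9, 1.5, 2.2, 2.6, 3.5, 4.3])

slope, offset = np.polyfit(red, conc, 1)   # fit conc = offset + slope * red
r = np.corrcoef(red, conc)[0, 1]           # Pearson correlation

print(f"conc = {offset:.4f} + {slope:.4f} * Red   (r = {r:.3f})")
```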
Calibration
This section describes how to develop the simplest multivariate model containing two predictor (X) variables.
Task
Make a PLS regression model between the absorbance measurements and the concentration of “a”.
How to do it
Select Tasks - Analyze - Partial Least Squares Regression… to display the PLS regression dialog. Use the following parameters to define the model:
Model inputs
• Rows (indicating which samples to use): Calibration Samples (7)
• Predictors, X: Light Absorbance (2)
• Responses, Y: Constituent a (1)
• Maximum components: 2
Check the Mean center Data and Identify Outliers boxes.
Partial Least Squares Regression Dialog: Model Inputs
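Mean centering, selected above, subtracts each variable's average so the model describes variation about the mean rather than absolute level. In code terms (illustrative numbers only):

```python
import numpy as np

# Illustrative two-variable data matrix (rows are samples)
X = np.array([[2.0, 1.1],
              [3.0, 1.9],
              [4.0, 2.4]])

X_centered = X - X.mean(axis=0)   # subtract each column's mean
print(X_centered.mean(axis=0))    # each column now averages (close to) zero
```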
Weights
Click the tabs for both X and Y weights to see which options apply for each sheet. Since the data are of spectral origin, ensure the weights are set to All 1.0.
Validation
Under the validation tab select the cross validation option. Click on Setup to choose Full from the drop-down list.
It is important to properly validate models. Leverage correction is not recommended, as it gives an overly optimistic estimate of the model error. The estimate of the prediction error (validation variance) is more conservative with cross validation than with leverage correction.
Cross Validation Dialog
Click OK to start the calibration.
Interpretation of the results
Task
• Display the results of the modeling steps.
• Interpret the Y-Residual Validation Variance Curve.
• Study the Regression Coefficients plot and provide an interpretation.
Display the model results
From the project navigator, display the Regression Overview plots. Four predefined plots make up the Regression Overview:
• Scores
• Loadings
• Variance
• Predicted vs measured
PLS Regression Overview
When OK has been selected in the PLS dialog box and Yes has been selected to view the plots, a PLS node will be added to the project navigator. This node contains the following:
• Raw data
• Results
• Validation
• Plots
The raw data used for building the model is stored in the results folder. Validation results matrices generated from the model can be viewed along with predefined plots for the analysis.
Toggle between different plots from those available in the project navigator. Alternatively use the Plot… menu option, or right click in a plot to select a desired plot.
Information about the model is available in the Information field, located at the bottom of the project navigator view. Information such as how many samples were used to develop the model and the optimal number of factors is contained here.
Model info box
A number of important calculated results matrices may be obtained from the PLS node.
Returning to the PLS overview, activate the Scores plot, which is in the upper left quadrant of the overview, by clicking in it.
Right click on this plot and select the Properties option.
Properties option
Select Point label from the available options, and in the dialog change the label to sample number instead of sample name.
Properties: Point label
In the properties dialog it is possible to make other customizations to the plot.
Click OK.
Activate the Predicted vs. Measured plot (lower right quadrant of the PLS overview). In this plot, colors are used to differentiate between Calibration results (in blue) and Validation results (in red).
Use the Next Horizontal PC and Previous Horizontal PC buttons to display the Predicted vs. Measured for one and two PLS Factors.
Use the Cal/Val buttons to toggle between the calibration and validation samples. It is also possible to toggle the regression and trend lines on and off from the toolbar.
Interpret the Y-Residual Validation Variance Curve
Activate the Y residuals plot in the lower left quadrant of the PLS overview and choose Cal/Val for Y from the toolbar shortcuts.
Notice that the residual variance increases going from factor 0 to factor 1. This usually indicates the presence of outliers in the data, which should be removed (with justification) before final validation of the model.
Residual Y variance plot
However, for the purposes of this tutorial, the main goal is to become familiar with the use of The Unscrambler®.
Study the Regression Coefficients Plot
From the main menu, choose the Plot - Regression Coefficients - Raw - Line option. Change the plot layout to a bar chart using the toolbar shortcut.
Regression coefficients
This illustrates how to view the Raw regression coefficients (B), which define the model equation. View the regression coefficients for the next factor using the arrows on the toolbar.
In the present case, the values of the regression coefficients remain unchanged when shifting from Weighted coefficients (Bw) to Raw coefficients (B). The reason is that the weights were chosen as All 1.0 (no weighting) for the purposes of calibration.
Regression coefficients can be viewed in different ways, such as lines, bars and accumulated bars from the respective shortcut buttons found in the toolbar.
Hovering the mouse cursor over one of the bars displays numerical information associated with that variable; clicking a bar opens the object information window. For the two-factor model developed in this tutorial, the b-coefficient for the Red absorbance is 1.042, the b-coefficient for the Blue absorbance is -0.2083, and the offset (B0) is 1E-15, i.e. approximately zero.
The b-coefficients can also be shown as a table by selecting the matrix Beta coefficients (raw) in the Result folder of the PLS node in the project navigator.
Regression coefficients matrix
The b-coefficients define the model equation relating the concentration of “a” to the Red and Blue light absorbances:
Concentration of “a”: a = 0 + 1.042 * Red – 0.2083 * Blue
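With the raw coefficients and offset in hand, prediction for new samples reduces to a dot product. A sketch applying the equation above; the new absorbance values here are hypothetical:

```python
import numpy as np

b0 = 0.0                          # offset (B0), approximately zero here
b  = np.array([1.042, -0.2083])   # raw coefficients for [Red, Blue]

# Hypothetical absorbances for three new samples, columns [Red, Blue]
X_new = np.array([[2.5, 1.2],
                  [4.0, 3.6],
                  [6.1, 2.0]])

conc = b0 + X_new @ b             # predicted concentration of "a" for each sample
print(conc)
```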
Recall the value of the coefficient for Red in the univariate model (0.59524); the multivariate model gives a different result.
The results should be saved in the project with the data.
Select File - Save or use the save tool and give the project file the name “Tutorial A”.
Prediction
The main purpose of developing a regression model is for future prediction of the properties of new samples measured in a similar way.
Task
Use the PLS calibration model to predict the concentration of “a” for the three unknown samples in the data table.
How to do it
Use the Tasks - Predict- Regression… option to predict the values of the new samples. Enter the parameters below in the Prediction dialog:
Prediction dialog
• Select model: PLS
• Components: 2
• Full Prediction
• Inlier statistics
• Mahalanobis distance
• Data Matrix: Tutor_a
• Rows: Prediction (3)
• Columns (X-variables): Light Absorbance (2)
• Y-reference: no selection (do not include Y-reference values)
It is possible to find all models in the current project using the drop-down list next to Select model. Select the PLS model developed and click OK to start the prediction.
Evaluation of the predicted results
During the development stage of a regression model, the quality of the predictions must be checked by evaluating the quality of the Predicted vs Measured plot.
The predictions can be checked when some reference measurements are available. This is not possible for the unknown samples in this tutorial as there are no reference measurements available
for these samples. However, a method exists for determining the quality of the predictions, based on the properties of projection modeling.
Task
Perform a prediction and evaluate the quality of the predicted results.
How to do it
First, evaluate the predicted results of the unknown samples and determine if these values are in the same range as the calibration range of samples. Select the Prediction plot under the new Predict/Plots node in the project navigator to visually assess the results.
Prediction with deviation
The predicted values are displayed as horizontal bars. The size of the bars represents the deviation (uncertainty) in the estimates. The numerical values for the Y Predicted values and Y deviations can be found in the output matrices, and are displayed under the plot. A comparison of these predictions to actual values cannot be made; however, if the new samples have predicted values similar to those in the calibration set and the deviation bars are small, the predictions can be considered reliable.
Predicted values
Another method for determining the reliability of the predicted values is to study the Inlier vs Hotelling T² plot available as a right click option in any plot. Select the Prediction - Inlier/Hotelling T² - Inliers vs Hotelling T² option to display this plot.
For a prediction to be trusted, its value must not be too far from the calibration samples; this is checked using the Inlier distance. The predicted sample's projection onto the model should also not be too far from the center; this is checked using the Hotelling T² distance.
Inliers vs Hotelling T²
In this case all the samples were found in the lower left corner of the plot, indicating that the predicted results can be trusted.
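The Hotelling T² value behind this plot can be computed from the model scores: each score is squared, divided by that factor's score variance, and the terms summed. A sketch with invented score values (not the scores of the actual model):

```python
import numpy as np

# Invented calibration-sample scores on two factors (columns are mean-centered)
T = np.array([[-1.2,  0.3], [-0.8, -0.4], [-0.3,  0.5], [ 0.1, -0.2],
              [ 0.4,  0.6], [ 0.7, -0.5], [ 1.1, -0.3]])

score_var = np.var(T, axis=0, ddof=1)   # variance of each factor's scores

def hotelling_t2(t):
    """T-squared for one sample's score vector: sum of t_i^2 / var_i."""
    return float(np.sum(t ** 2 / score_var))

# A projected sample with scores near the model center gives a small T-squared
print(f"{hotelling_t2(np.array([0.2, 0.1])):.3f}")
```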
Save the project before proceeding.
Returning to the PLS model results, the estimated prediction quality of the model may be determined. Under the PLS node in the project navigator, expand the Plots folder and select Predicted vs Measured to display this plot in the viewer.
The Predicted vs Measured plot appears.
Use the toolbar icons to toggle the regression and/or target lines on and off.
High quality predictions were obtained from this PLS model. Comparing the multivariate regression model with the univariate one shows the marked improvement gained by using the multivariate model.
Tutorial B: Quality analysis with PCA and PLS
• Description
  o Main learning outcomes
  o Data table
• Preparing the data
  o Insert categorical variables
  o Check column (variable) sets
  o Define sample sets
• Objective 1: Find the main sensory qualities
  o Make a PCA model
  o Interpret the variance plot in the PCA overview
  o Interpretation of the score plot for the PCA
  o Interpretation of the correlation loadings plot
  o Interpretation of scores and loadings
  o Interpretation of the influence plot
• Objective 2: Explore the relationships between instrumental/chemical data (X) and sensory data (Y)
  o Make a PLS regression model
  o Interpretation of the variance plot
  o Interpretation of the score plot
  o Interpretation of the loadings and loading weights plot
  o Interpretation of the predicted vs measured plot
• Objective 3: Predict user preference from sensory measurements
  o Make a PLS regression model for preference
  o Interpretation of the regression overview
  o Interpretation of the regression coefficients
  o Open result matrices in the Editor
  o Predict preference for new samples
  o Interpretation of Predicted with Deviation
  o Check the error in original units – RMSE
  o Export models from The Unscrambler®
Description
This tutorial aims to use multivariate techniques to analyze the quality of raspberry jam in order to determine which sensory attributes are relevant to “perceived quality”. The analysis will cover three aspects as follows.
1. A trained tasting panel has provided scores for a number of different variables using descriptive sensory analysis. In this tutorial the first objective is to find the main sensory quality properties relevant for raspberry jam.
2. The second objective is to find a way of rationalizing quality control, since the use of taste panels is very costly. In this application a number of laboratory instrumental measurements were investigated to potentially replace the sensory testing panel.
3. The third and final objective of this application is to predict consumer preference for raspberry jam from descriptive sensory analysis. The use of PLS regression modeling techniques was investigated in order to find a possible relationship between sensory data and preference.
Main learning outcomes
This tutorial contains the following parts and learning objectives:
• Explore methods for inserting categorical variables.
• Define ranges in data sets.
• Investigate the relationships existing in a single data table by the use of PCA.
• Interpret scores and loadings of the PCA and draw relevant conclusions.
• Run a PLS regression for understanding the relationships between two data tables.
• Export models from The Unscrambler®, potentially to other applications.
• Predict response values for new samples.
• Estimate regression coefficients and interpret them.
• Find the optimal number of components or factors in multivariate models.
References:
• Basic principles in using The Unscrambler®
• PCA Analysis
• About Regression methods
• Exporting data from The Unscrambler®
• Prediction
Data table
Click the following link to import the Tutorial B data set used in this tutorial.
The analysis is based on 12 samples of jam (objects), selected to span the expected, normal quality variations inherent in such products. Several observations and measurements were made on the samples.
Agronomic production variables
The samples were taken from four different cultivars, at three different harvesting times. The table below describes the sampling plan for this analysis.
Sample description
No Name Cultivar Harvest time
1 C1-H1 1 1
2 C1-H2 1 2
3 C1-H3 1 3
4 C2-H1 2 1
5 C2-H2 2 2
6 C2-H3 2 3
7 C3-H1 3 1
8 C3-H2 3 2
9 C3-H3 3 3
10 C4-H1 4 1
11 C4-H2 4 2
12 C4-H3 4 3
Note that the agronomic production variables are not used as input variables in any of the matrices. These represent known information which may be extremely valuable for the interpretation of the results of the data analysis. They will be utilized as categorical variables in the analyses performed in this tutorial.
Column (variable) set Instrumental
Three chemical and three instrumental (APHA colorimetry) variables were also measured on the samples tested by the sensory panel. These are described in the table below.
Instrumental variables
No Name Method
1 L Lightness
2 a Green-red axis
3 b Blue-yellow axis
4 Absorbance Absorbance
5 Soluble Soluble solids (%)
6 Acidity Titratable acidity (%)
Column (variable) set “Sensory”
A trained sensory panel evaluated 12 different attributes of raspberries, using a 1-9 point intensity scale. The entries in the data matrix are the average ratings over all judges. The observed variables are listed in the table below.
Sensory variables
No Name Type
1 Redness Redness
2 Colour Color intensity
3 Shininess Shininess
4 R.Smell Raspberry smell
5 R.Flav Raspberry flavor
6 Sweetness Sweetness
7 Sourness Sourness
8 Bitterness Bitterness
9 Off-flav Off-flavor
10 Juiciness Juiciness
11 Thickness Viscosity/thickness
12 Chew.res Chewing resistance
Column (variable) set Preference
114 representative consumers were invited to taste the 12 jam samples used in this application. Each provided an individual preference score on a scale from 1 to 9. The average over all consumers for each sample is given in the data table.
Row (sample) sets
The data table, “JAMdemo”, consists of 20 samples. The first twelve samples will be used to develop the models in this application and are hereafter referred to as training samples.
Eight new jam samples were assessed by the trained panel and given a sensory rating. These are the last eight samples in the table, and are referred to as Prediction samples. The preference and instrumental values are missing for these samples, as those measurements were not performed on them. The calibration model will be used to predict the preference for these eight samples.
Preparing the data
Insert categorical variables
Categorical variables are useful for interpreting patterns in data sets. Here, the raspberries used to make the jam samples originated from different cultivars and were harvested at different times. These parameters represent excellent candidates for using categorical variables in an analysis.
Task
Insert two categorical variables, Cultivar and Harvest Time.
How to do it
Open the data table by following the above link; it is already organized into two row sets for training and prediction. The different types of variables have been defined in the column sets as Instrumental, Sensory and Preference, based on the definitions in the data tables above. These defined sets can be seen by expanding the data table in the project navigator.
Jam data organization
Some additional information about the cultivar and harvest time now needs to be added to this data as two new columns.
Activate a cell in the first column of the table, right mouse click and select Insert… or use the menu options and select Edit - Insert…. In the dialog box, choose to add two new columns. Two empty columns will be added to the data table.
Insert New Columns
Select the newly inserted columns and convert each of them to the Categorical data type by selecting Edit - Change Data Type… or right clicking and selecting Change Data Type…. The category converter dialog will appear; here select to input new levels based upon individual values.
Category converter dialog
Enter the Categorical Variable Name “Cultivar” manually in the column 1 header cell. Manually enter the values of the new categorical variable. Use C1, C2, C3, and C4 as the values for Cultivar, as given in the sample names. Type these values in the Cultivar column.
Note: Categorical variable cells are orange in the editor to distinguish them from ordinary variables.
Insert the categorical variable “Harvest Time”; change the name of column 2 to Harvest time, and fill in the correct Harvest Time levels based on the information contained in the sample names.
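The manual fill-in above follows a simple rule: the levels are encoded in the sample names themselves. As a small illustration (plain Python, entirely outside The Unscrambler®), the two categorical variables can be extracted from names such as "C2-H3" like this:

```python
# Sample names encode cultivar and harvest time, e.g. "C2-H3" -> C2, H3
names = ["C1-H1", "C1-H2", "C1-H3", "C2-H1", "C2-H2", "C2-H3",
         "C3-H1", "C3-H2", "C3-H3", "C4-H1", "C4-H2", "C4-H3"]

cultivar = [n.split("-")[0] for n in names]   # "C1" ... "C4"
harvest = [n.split("-")[1] for n in names]    # "H1" ... "H3"

# Levels, as the category converter would build them from individual values
cultivar_levels = sorted(set(cultivar))       # ['C1', 'C2', 'C3', 'C4']
harvest_levels = sorted(set(harvest))         # ['H1', 'H2', 'H3']
```

This mirrors what the category converter does when it builds levels from the individual values typed into the two new columns.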
The Tutorial_b data table displayed in the Editor (after insertion of Cultivar and Harvest Time)
Check column (variable) sets
In The Unscrambler® matrices are defined by Row and Column (Sample and Variable) Sets. A recommended good practice is to define all sets before any analyses are performed. The information entered to organize the data can later be used to color-code graphics according to these sample groups.
Task
Check that the three column (Variable) Sets: “Instrumental”, “Sensory” and “Preference” have been defined.
These sets can be visualized in the project navigator.
How to do it
To create column and row ranges, select Edit - Define Range to open the Define Range dialog.
Three sets have been predefined in the project Tutorial_B data set.
Column name: Instrumental Interval: 3-8
Column name: Preference Interval: 14
Column name: Sensory Interval: 9-13, 15-21
To verify these definitions, use Edit - Define Range and inspect the information in this dialog.
The Define range dialog with three column sets
After defining column intervals, click OK to perform the task.
Define sample sets
Task
Verify the existence of two sample sets “Calibration Samples” and “Prediction Samples”.
How to do it
Select Edit – Define Range to open the Define Range dialog. The available row sets can be inspected here.
The Define range dialog with two Row Sets
1. Row Name: Calibration Samples, Interval: 1-12 2. Row Name: Prediction Samples, Interval: 13-20
Additional row sets will be added for the various levels of the categorical variables harvest time and cultivar.
Exit the Define Range dialog box by clicking Cancel.
Begin by selecting row 1 in the data editor, then select Edit - Group rows…, which opens the Create row ranges from column dialog.
Edit- Group rows…
The selected column, “Cultivar”, is already in the Cols field.
There is no need to specify the Number of Groups, as the grouping is based on a categorical variable.
Create row ranges from column
Click OK.
Four row ranges have been added automatically. Look in the Row folder to see them:
New row ranges
Do the same for the variable “Harvest time”.
Objective 1: Find the main sensory qualities
The main variations in the sensory measurements may be found by decomposing them by Principal Component Analysis (PCA). This data decomposition results in valuable graphical diagnostic tools including scores, loadings and residuals. The results will be interpreted in order to establish whether sensory measurements made on the jam samples have any practical meaning.
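As a rough sketch of what this decomposition does, the following numpy snippet (synthetic data standing in for the 12×12 sensory matrix; not The Unscrambler®'s implementation) computes scores T and loadings P so that the centered data are approximated by T P' plus a residual E:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 12))        # stand-in for the 12x12 sensory matrix
Xc = X - X.mean(axis=0)              # mean-centering, as in the tutorial

# PCA via SVD: Xc = U S Vt; scores T = U S, loadings P = Vt.T
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 3                                # components kept
T, P = (U * S)[:, :A], Vt[:A].T

# Model plus residual: Xc ~ T P' + E
E = Xc - T @ P.T
explained = 1 - (E ** 2).sum() / (Xc ** 2).sum()   # fraction modeled
```

The scores (rows of T) are what the score plot maps, the loadings (rows of P) are what the loading plot maps, and E feeds the residual diagnostics discussed below.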
Make a PCA model
Task
Make a PCA model using the Set “Sensory” as the variable set.
How to do it
Select Tasks – Analyze - Principal Component Analysis… Specify the following parameters in the dialog box:
Model inputs
• Data matrix: “JAMdemo” (20x21)
• Rows: Calibration Sam (12)
• Cols: Sensory (12)
• Maximum components: 6
Check the Identify Outliers and Mean Center boxes, if they are not already selected.
Principal Component Analysis dialog: Model inputs
Weights
From the Weights tab verify that the weights are all 1.0 (constant).
No weighting is used in this model as the sensory panel is known to be well trained.
However, sensory variables are often weighted when there is evidence that the panel is not well trained, or when investigating relationships with other variables. The most common weighting to use is 1/SDev.
Weight tab dialog
Validation
From the Validation tab, select Cross Validation and press Setup, which opens the Cross Validation Setup dialog. Here select Full from the cross validation method drop-down list.
Validation Dialog
This validation method is more time consuming than leverage correction, but the estimate of the residual variance is more reliable.
Click OK to start the PCA. After the analysis is completed, the program will ask, “Do you want to view plots now?”. Click Yes to see the PCA Overview plots. A new node has been added to the project navigator containing all the PCA result matrices and plots.
Interpret the variance plot in the PCA overview
Task
Determine the optimal number of PCs.
How to do it
The PCA Overview contains the most commonly used plots for interpreting PCA models, including
• Scores plot
• Loadings plot
• Influence plot
• Explained/Residual Variance plot
PCA Overview plots
The Scores plot is a map of the samples, and shows how they are distributed. It can be used to identify samples that are similar or dissimilar to one another. In this analysis, the plot labels show that PC-1 explains 58% and PC-2 28% of the total variance in the data. The Explained variance curve (in the lower right corner) is an excellent tool for selecting the optimal number of components in the model.
The explained variance increases until PC 5 is reached. The software suggests an optimal number of PCs for a model, but it is up to the analyst to examine the data and confirm the optimal number of PCs, usually based on this plot.
The highest explained variance is found with 5 PCs, but a model using 3 PCs explains almost as much of the variation. A simple (parsimonious) model is usually more robust than a complex one, and easier to interpret. It is always advisable to work with a model with as few PCs as possible. The info box in the lower left corner of the main workspace indicates that 3 PCs are considered optimal for this model.
Info Box
Task
Change the explained variance plot to a residual variance plot.
How to do it
Activate the lower right plot by clicking in it. Toggle between the explained and residual variance views using the toolbar shortcuts. Another way of doing this is to recreate the plot using Plot - Variances and RMSEP, but the toggling shortcut is preferred.
The explained variance is now converted to residual variance. The information is the same, but presented in another way. The residual variance is well suited to finding the optimal number of PCs to use in a model, while the explained variance is a better measure of how much of the variation is described by the model. The plot layout can be changed to a bar chart by using the plot layout shortcut.
The PCA Explained Variance Bar plot
The model with 3 PCs describes 92% of the total validation variance in the data; for calibration it is 96%. These values may be obtained by clicking on the specific data point in the plot.
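The relationship between the explained and residual variance views can be illustrated in a few lines of numpy (synthetic data; the two curves are exact complements of each other):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 12))
Xc = X - X.mean(axis=0)

_, S, _ = np.linalg.svd(Xc, full_matrices=False)
total = (S ** 2).sum()

# Cumulative explained variance after each PC; residual is the complement
explained = np.cumsum(S ** 2) / total    # increases toward 1.0
residual = 1 - explained                 # decreases toward 0.0
```

Toggling the plot in the software changes nothing but which of these two complementary curves is drawn.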
Use the toolbar buttons to switch between plotting only the calibration variance curve, only the validation curve, or both.
Interpretation of the score plot for the PCA
The score plot, which is a map of samples, displays information about the sample relationships for a particular data set.
Task
Interpret Scores plot. Use different plot options for ease of interpretation.
How to do it
The score plot shows the projected locations of the samples onto the calculated PCs. By studying patterns in the samples a meaningful interpretation of the PCs may be possible.
PCA Scores plot
The score plot for this analysis indicates that the 12 samples are not arranged in a random way. Moving from left to right along this plot, a pattern can be observed: samples harvested at time H1 are mainly found on the left, changing to H2 and finally H3. Moreover, moving from top to bottom, C4 samples occupy the top region, followed by C3, then C2, and finally C1.
The row sets based on the categorical variables that were inserted into the data table can be used to better visualize these trends. In the scores plot, right mouse click and select Sample Grouping to open the dialog where different row sets can be used for grouping and color-coding the plot. Select all the cultivar row sets (C1, C2, C3, C4) individually and add them for grouping purposes. The marker color, shape and size can be customized here for optimized viewing of the data.
Sample Grouping Dialog
When the desired settings have been defined, click OK to complete the operation.
In the Scores plot, right mouse click to select Properties, where customization of the plot appearance is possible. Select header and change the plot heading to Scores plot with Cultivar Grouping. Choose a different font size or color if so desired.
Properties Dialog
PCA Scores with Sample Grouping
Repeat the above sample grouping process, this time using the categorical variable Harvest Time.
Interpretation of the correlation loadings plot
The loading plot, which is a map of the variables, displays information about the variables analyzed in the PCA model. Correlation Loadings provide a scale independent assessment of the variables and may, in some cases, provide a clearer indication of variable correlations.
Task
Interpret variable relationships in the correlation loadings plot.
How to do it
Activate the X-Loadings plot by clicking in it, then use the corresponding shortcut button.
The Correlation Loadings plot may be used to study the variable correlations that exist in a particular data set.
Correlation Loadings plot
The plot shows that two variables (redness and colour) have an extreme position to the right of the plot along PC1. They are close to each other (i.e. they are highly positively correlated), and far from the center and are very close to the edge of the 100% explained variance ellipse. This also means that samples lying to the right of the score plot have higher values for those two variables.
Along the vertical axis (PC2), two variables can be observed, with high positive values for this PC. These are R.SMELL and R.FLAV. These two variables are opposite to the variable OFF FLAV which has lower values for this PC. This indicates that raspberry smell and flavor correlate positively with each other, and negatively with off-flavor.
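Correlation loadings are simply the correlations between each original variable and each score vector, which is why they are bounded by ±1 and scale independent. A hedged numpy sketch on synthetic data (not the software's internal code):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 12))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2
T = (U * S)[:, :A]                   # scores for the first two PCs

# Correlation loading: correlation of each variable with each score vector
corr = np.empty((Xc.shape[1], A))
for j in range(Xc.shape[1]):
    for a in range(A):
        corr[j, a] = np.corrcoef(Xc[:, j], T[:, a])[0, 1]
# Values lie in [-1, 1]; points near the outer (100%) ellipse are almost
# fully explained by the plotted components
```

A variable near the outer ellipse, like redness or colour in the tutorial, has nearly all of its variance captured by the two plotted PCs.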
Interpretation of scores and loadings
Task
Relate Scores (samples) information to Loadings (variables) information.
How to do it
The Scores plot and Correlation Loadings plot show that samples C2H3 and C1H3 have high color and redness intensities, while sample C1H2 is more likely to have an off-flavor character. Samples located in a specific part of a 2-vector score plot have, in general, much of the properties of the variables in the same location in the 2-vector loading plot, provided that the plotted PCs describe a large proportion of the variance.
PC 3 describes the variation in sweetness, bitterness and chewing resistance. Confirm this by activating the loading plot (upper right quadrant) and selecting Plot - Loadings. Display PC 1 vs. PC 3 by changing Vector 2 using the arrows in the toolbar.
PCA Loadings 1 vs 3
In this new plot, the horizontal axis is unchanged (PC1) and the vertical axis now shows PC3.
Interpretation of the influence plot
Task
Interpret the influence plot, which is used for the detection of outliers.
How to do it
The influence plot is displayed in the lower left quadrant of the PCA Overview. The strongest outliers are placed in the upper right corner of the plot, i.e. they have a large leverage and a high residual variance. In the current analysis, there is no evidence of outliers.
PCA Influence plot
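The two axes of the influence plot can be approximated as follows (a numpy sketch on synthetic data: leverage from the score space, residual variance from the part of each sample the model does not describe):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 12))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 3
T = (U * S)[:, :A]
E = Xc - T @ Vt[:A]                  # residuals after A components

# Leverage: distance in score space, h_i = sum_a t_ia^2 / (t_a' t_a)
leverage = np.sum(T ** 2 / (S[:A] ** 2), axis=1)

# Per-sample residual variance across the variables
res_var = (E ** 2).mean(axis=1)
# Outlier candidates sit in the upper right: high leverage AND high residual
```

A sample with high leverage but a low residual merely pulls the model; it is the combination of both that marks a true outlier.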
All of the results for the PCA are now part of the project Tutorial_B. Save the project to capture the PCA results. The next steps in this tutorial will make use of the sensory, instrumental and preference data.
Close the PCA overview by selecting its name in the navigation bar at the bottom of the viewer and right clicking to select Close.
Objective 2: Explore the relationships between instrumental/chemical data (X) and sensory data (Y)
Is it possible to predict the quality variations observed in the jam data using instrumental measurements only? Training and employing a sensory panel is costly and time consuming. Producers of jam would find it most convenient if they could predict quality variations by measuring some properties by instrumental means. The next task in this tutorial is to make a regression model between the sensory and instrumental data and analyze the results for a possible solution.
Make a PLS regression model
In The Unscrambler® the regression between two matrices can be performed using a number of common multivariate methods. Partial Least Squares (PLS) is used in this case in order to maximize the information obtained from both X and Y.
Task
Make a PLS regression model that predicts the variations in sensory variables from instrumental and chemical variables.
How to do it
Select Tasks - Analyze - Partial Least Squares Regression…. Specify the following parameters in the Regression dialog:
Partial Least Squares Model Inputs
Model inputs tab
Predictors
• Rows/Samples: Calibration Sam (12)
• X-variables: Instrumental (6)
Responses
• Cols/Y-variables: Sensory (12)
• Maximum components: 6
X and Y Weights tabs
Select the X and Y Weights tabs to access their dialogs. Weighting will be applied to all the X and Y variables for regression purposes.
X Weights Dialog
Press All to change the weighting of all variables at the same time. Variables can also be selected by clicking on them in the list. Remember to hold the Ctrl key down while selecting several variables. Choose the A / (SDev +B) radio button. Use constants A = 1 and B = 0. Press Update and ensure that the weights change in the list, then click OK.
All variables are weighted by dividing them with their own standard deviations. This allows all variables to contribute to the model, regardless of whether they have a small or large standard deviation from the outset; only the systematic variation is of interest here.
Remember to do the same in the Y Weights tab.
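The A / (SDev + B) weighting with A = 1 and B = 0 amounts to dividing each centered variable by its own standard deviation. A short numpy illustration with deliberately unequal variable scales:

```python
import numpy as np

rng = np.random.default_rng(5)
# Six variables on very different scales, as raw instrumental data often are
X = rng.normal(size=(12, 6)) * np.array([1, 10, 100, 0.1, 5, 50.0])

# A/(SDev+B) with A=1, B=0 is ordinary standard-deviation scaling
weights = 1.0 / X.std(axis=0, ddof=1)
Xw = (X - X.mean(axis=0)) * weights   # centered and weighted

# Every variable now has unit variance and contributes equally to the model
```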
Validation tab
Select Cross validation from the Validation tab.
Press the Setup button to access the Cross Validation Setup dialog and choose Full Cross Validation from the drop-down list. It is always recommended to use a test set or cross validation to develop final models.
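Full cross validation leaves out one sample at a time and predicts it from a model built on the rest. The sketch below illustrates the looping logic on a plain least-squares model (numpy, synthetic data) rather than PLS, but the idea is the same:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(12, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=12)

# Full (leave-one-out) cross validation: each sample is predicted by a
# model built on the other 11
press = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    Xi = np.column_stack([np.ones(keep.sum()), X[keep]])  # with intercept
    b, *_ = np.linalg.lstsq(Xi, y[keep], rcond=None)
    y_hat = np.concatenate([[1.0], X[i]]) @ b
    press += (y[i] - y_hat) ** 2

rmsecv = np.sqrt(press / len(y))      # validation error in y units
```

This is why full cross validation is slower than leverage correction: it refits the model once per sample, but the resulting error estimate is more reliable.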
Click OK in the regression dialog when all parameters have been set up. The computation of the model will begin. After the PLS analysis is completed, the system will ask “Do you want to view plots now?”. Click Yes to study the Regression Overview. A new node, PLS, has been added to the project navigator.
PLS Regression Overview
This Viewer provides the most useful and common predefined result plots for PLS, including loading weights, residuals, etc. The model can be reviewed at any stage of the analysis by selecting any of the result plots under the PLS - Plots node in the project navigator. In this exercise several Y responses were used for model development, so the overview results for each response are available by choosing the Y value of interest in the toolbar. When performing this type of analysis with multiple responses, the non-significant variables may be determined for each of the responses. It also provides information on which sensory responses can best be predicted from the instrumental measurements, without making a separate PLS model for each response. When the Predicted vs Measured plot (lower right quadrant) is active, the name of the Y value being analyzed appears in the toolbar. Another Y response can be chosen from the drop-down menu, or one can scroll through the values using the arrow tool on the right.
Interpretation of the variance plot
Task
Interpret the explained variance curve, which can be shown as residual variance, or as explained variance. The two different views are useful for different tasks.
How to do it
The explained variance plot is in the lower left quadrant. This plot can be changed to the residual variance plot by using the toolbar shortcut. A local minimum is reached at only two PLS factors. The next task is to determine how much of each Y-variable is described by the model. This can be done by looking at the explained variance.
Validation Variance plot
From the plot menu select Variances and RMSEP - X- and Y-Variance…. Make sure the bottom plot shows the Explained Variance for the 12 individual Y-variables; if not, change it using the toolbar shortcut. Also select Cal rather than Total from the toolbar shortcuts. Add a legend to the plot by right clicking and selecting Properties; select Legend and check the Visible box.
PLS, Explained Validation Variance Plot displayed for the 12 individual Y-variables
The conclusion reached from the residual variance curve was that two PLS factors were optimal. The variables that are well described are reflected in the information conveyed by these factors. About 85% of the color variation (variables 1 and 2), and 80% of the variation in sweetness (variable 6) can be explained by a combination of the chemical and instrumental variables.
Note that only 23% of the total Y-variance is explained by the model using two factors.
Interpretation of the score plot
The score plot shows how the samples are related to each other.
Task
Interpret the score plot.
How to do it
Return to the Regression Overview plot by selecting it from the Plots node in the project navigator. The Scores plot is always found in the upper left quadrant of the overview. The score plot shows patterns in the samples, which are often difficult to see without additional visual tools. Use the categorical variables as markers in the same way as in “Interpretation of the Score Plot” for the PCA model: highlight the score plot, right click, and select Sample Grouping. The categorical variable Harvest Time will be used for the sample grouping.
PLS factor 1 describes the harvesting time. Harvest time 1 is found on the right in the plot and harvest time 3 to the left. The score plot does not reveal information about the cultivars.
A comparison with the loading plot provides more information. Interpret the two plots (Scores and Loadings) by analyzing them together.
Interpretation of the loadings and loading weights plot
Study the loading weights plot to find correlating variables.
Task
Interpret the loadings and the loadings weight plots.
How to do it
The loadings plot is located in the upper right quadrant of the Regression Overview. Activate it (if it is present), or choose it from the project navigator under the PLS - Plots node. Make sure both X and Y loadings are plotted.
To interpret variable relationships, visualize straight lines between the variables through the origin. Variables along the same line, far from the origin, may be correlated. (Negatively correlated when situated on opposite sides of the origin.)
PLS, X-Loading Weights and Y-Loadings Plot
The spectrophotometric color measurements (L, a, and b) appear to be strongly negatively correlated with color intensity and redness. Sweetness is, as expected, strongly negatively correlated with measured Acidity. But the R. Flavor shows weak correlation to the PLS-factors (near origin = low PLS loadings).
The regression coefficients may also be analyzed to understand which X variables are important in describing each of the Y responses. These can be selected from the project navigator, or from the menu Plot- Regression coefficients - Raw - Line. The coefficients for each of the Y responses can be displayed by selecting them from the drop-down list in the toolbar.
From Objective 1 it was concluded that the jam quality varied with respect to color, flavor, and sweetness. But the results so far in Objective 2 show that the chemical and instrumental variables mainly predict variations in color and sweetness (as indicated by the low explained Y-variance for flavor). This indicates that the Y-variable Flavor cannot be replaced with the present set of X-variables, i.e. there is no information in the chemical and instrumental measurements related to the flavor of the jam samples.
Use of other instrumental X-variables, e.g. gas chromatographic data, may have increased the flavor prediction ability of the raspberry jam data.
Interpretation of the predicted vs measured plot
The predicted vs. measured plot displays the predictive ability of the developed model.
Task
Interpret the predicted vs. measured plot.
How to do it
The predicted vs. measured plot in the regression overview currently displays the results for the first Y-variable, in this case, Redness.
PLS, Predicted vs Measured Plot for variable Redness, model with two factors
Use the drop-down list in the toolbar to observe the prediction quality for the other variables measured in this analysis. Make sure these plots are displayed for two PLS factors, as this is the optimal number for this model. Note that for several of the properties, including raspberry flavor, raspberry smell, and off-flavor, the instrumental values do not provide any real information.
Objective 3: Predict user preference from sensory measurements
Is it possible to develop a model for predicting consumer preference data from new sensory data? If so, expensive consumer tests can be replaced by cheaper sensory tests. The PLS model previously developed was used for interpretation purposes. The focus is now on prediction. A new model will be built relating the sensory data to consumer preference data, and this model will be applied to unknown samples to predict their preference.
Make a PLS regression model for preference
First, develop a model relating sensory data to preference, and interpret it. PLS regression will be used as the regression method.
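For readers curious about what PLS1 does under the hood, here is a compact, hedged NIPALS sketch in numpy (synthetic data; not The Unscrambler®'s implementation) that extracts components and returns regression coefficients for centered data:

```python
import numpy as np

def pls1(X, y, n_comp):
    """Minimal PLS1 (single response) via the NIPALS algorithm."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    W = np.zeros((p, n_comp))   # loading weights
    P = np.zeros((p, n_comp))   # X-loadings
    T = np.zeros((n, n_comp))   # scores
    q = np.zeros(n_comp)        # Y-loadings
    for a in range(n_comp):
        w = X.T @ y
        w /= np.linalg.norm(w)          # direction of max covariance with y
        t = X @ w                        # scores
        p_a = X.T @ t / (t @ t)          # X-loadings
        q[a] = y @ t / (t @ t)           # Y-loading
        X = X - np.outer(t, p_a)         # deflate X
        y = y - q[a] * t                 # deflate y
        W[:, a], P[:, a], T[:, a] = w, p_a, t
    # regression coefficients for centered data
    return W @ np.linalg.inv(P.T @ W) @ q

rng = np.random.default_rng(7)
X = rng.normal(size=(12, 12))                         # sensory block (toy)
y = X @ rng.normal(size=12) * 0.3 + rng.normal(scale=0.05, size=12)
b = pls1(X, y, n_comp=2)
y_fit = (X - X.mean(axis=0)) @ b + y.mean()           # fitted preference
```

Each extracted component removes the part of X and y it explains, which is why the fitted residual can only shrink as components are added.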
Task
Make a PLS regression model for describing the relationships between sensory data and preference.
How to do it
From the Main Menu, select Tasks - Analyze - Partial Least Squares Regression…, and specify the following parameters in the PLS Regression dialog:
Model Inputs
Predictors
• X data set: “JAMdemo”
• Rows/Samples: Calibration Samples (12)
• Cols/X-variables: Sensory (12)
Responses
• Y data set: “JAMdemo”
• Rows/Samples: Calibration Samples (12)
• Cols/Y-variables: Preference (1)
Maximum components: 6
PLS Regression Dialog
Weights in X and Y
All 1/SDev
Select the X Weights tab and weight all the X variables with 1/SDev so that each variable will contribute equally in the modeling step. Also weight the Preference values (Y) by 1/SDev in the Y Weights tab.
Validation
Full Cross Validation
Press Setup to access the Cross Validation Setup dialog and choose Full Cross Validation as the cross validation method.
Press OK.
Interpretation of the regression overview
Task
A new PLS node has been added to the project navigator. Rename this to PLS Sensory by highlighting it, then right clicking and selecting the Rename option. Interpret the model using the regression overview plots and other diagnostic tools available.
How to do it
It is of primary interest to determine how well the model can predict new values. Therefore the residual variance and the Predicted vs Measured plots are of most interest here.
The residual variance
Activate the explained variance plot in the lower left quadrant, and change it to the residual Y variance plot using the toolbar shortcuts. The prediction error tapers off after two PLS factors, which represents the optimal model.
Residual Y Validation Variance Plot
Predicted vs measured
Activate the predicted vs. measured plot and display it for 2 PLS factors, using the arrows in the toolbar.
Turn on the regression line and the target line with the toolbar shortcuts.
Predicted vs Measured Plot with Trend Lines
It can be observed that the predictions are of good quality. Some samples are not so well predicted, but the overall correlation is satisfactory.
Interpretation of the regression coefficients
The regression coefficients are used to calculate the response value from the X-measurements. The size of the coefficients provides an indication of which variables have an important impact on the response variables.
There are two kinds of regression coefficients, Bw and B. The Bw coefficients are calculated from the weighted data table and are used for interpretation. The B coefficients (raw) are calculated from the raw data table and are used for predictions.
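When both X and Y are weighted by 1/SDev, the raw coefficients B can be recovered from the weighted coefficients Bw by rescaling with the standard deviations and adding back the means. A numpy sketch (ordinary least squares stands in for PLS for brevity; the back-transformation is the same):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(12, 4)) * np.array([1, 5, 20, 0.5])
y = X @ np.array([2.0, -0.3, 0.05, 4.0]) + 1.5

sx, sy = X.std(axis=0, ddof=1), y.std(ddof=1)
Xw = (X - X.mean(axis=0)) / sx        # 1/SDev-weighted X
yw = (y - y.mean()) / sy              # 1/SDev-weighted Y

# Weighted coefficients Bw, here fitted by ordinary least squares
bw, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

# Back-transform to raw coefficients B and intercept b0
b = bw * sy / sx
b0 = y.mean() - X.mean(axis=0) @ b
```

Bw values are comparable across variables (useful for interpretation), while B and b0 apply directly to measurements in original units (useful for prediction).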
Task
Find which variables are important for predicting the Y-variable Preference.
How to do it
The estimated regression coefficients indicate the cumulative importance of each of the sensory variables to the consumer preference.
Select Plot - Regression Coefficients. Choose the Weighted coefficients (Bw) option. Using the arrows in the toolbar, change the plot to show regression coefficients for 2 PLS factors, and change the plot layout to a bar chart.
Regression Coefficients Plot
Redness, Color and Sweetness (B1, B2 and B6) are statistically significant in predicting Preference. Raspberry Smell (B4) is also significant, but contributes negatively to Preference. Thickness (B11) also seems to be of importance, as it has a large (negative) coefficient; however, it is not significant in this model.
Save the project file with the name “Tutorial_B”. The model may also be saved on its own, providing a smaller file with just the model information that can be used for predicting new samples with The Unscrambler® Online Predictor and The Unscrambler® Online products. To save the model only, right click on the model node in the project navigator and select the option Save result.
Save result
Rename the model if desired and click on Save.
Open result matrices in the Editor
The result matrices may also be observed numerically. Comparison of results may be easier in tables and the Editor is a good starting point for exporting data into other programs.
The Raw regression Coefficients (B) are available as a predefined plot from the Plot menu in the Regression results Viewer. However, for this exercise the B coefficients will be viewed from the list of numerous available matrices.
Task
View the regression coefficients in the Editor.
How to do it
Open the Results folder under the PLS node in the project navigator and select the Beta Coefficients (raw) matrix. Any of the other validation matrices may be selected from the validation folder of the PLS model. The beta coefficients can then be treated like any other data in an Editor; for example, they may be plotted from the Plot menu.
Predict preference for new samples
Regression models are mainly used to predict the response value for new samples. Models are developed so that these values can be predicted instead of performing reference measurements, which are often time-consuming and expensive.
The purpose of the model previously developed was to predict the jam preference for some consumers based on sensory values that were measured for the samples.
Task
Predict the Preference for the jam samples.
Interpret the prediction results to see whether the predictions can be trusted.
How to do it
Activate the “JAMdemo” data matrix. Select Tasks - Predict - Regression… and specify the following parameters in the Prediction dialog:
• Select model: PLS Sensory
• Data matrix: “JAMdemo”
• Rows/Samples: Prediction Samples (8)
• Cols/X-variables: Sensory (12)
• Prediction type: Full Prediction
• Y-reference: Not included
• Number of Components: 2
Check the boxes for Inlier statistics and Mahalanobis distance to provide valuable statistical measures of the similarity of the prediction samples to the calibration samples.
Click OK to perform the prediction.
The Prediction dialog
Interpretation of Predicted with Deviation
There were no reference measurements available for the new samples in the “Prediction Sam” Set. This makes it impossible to check predicted vs. measured values. Since a model has been developed based on projection, the only option available is to check the reliability of the predictions from the deviations. There are also some statistical measurements of the similarity of predicted samples to those used in developing the calibration model that can be used: inlier statistics and Mahalanobis distance.
Task
Interpret the Predicted with Deviation plot, and other plots related to prediction results.
How to do it
Click OK in the Prediction dialog to display the predicted with deviation plot, and the tabulated prediction results.
Prediction results
The predicted preference values for the “unknown” new jams have some uncertainty limits, i.e. the accuracy of new predictions is limited. However, the model can still be used to predict the preference of new jam samples, giving an indication of which ones will or will not be accepted by consumers.
View the Inlier vs Hotelling T² plot by selecting Plot – Inlier vs Hotelling T². This plot shows how similar the new samples are to those used in developing the calibration model. For a prediction to be trusted, the predicted sample must not be too far from a calibration sample, which is checked by the inlier distance; its projection onto the model should also not be too far from the center, which is checked using the Hotelling T² distance.
Save the project file under the name “Tutorial B_complete”. This now includes all the data, three models, and the predicted results for preference.
Check the error in original units – RMSE
Finally, observe how large the expected error is in predicted preference results, i.e. determine what an approximate RMSEP is for such an analysis.
Task
Plot the RMSE.
How to do it
Return to the PLS Sensory node in the project navigator. In the plots folder select Regression Overview, then select Plot - Variances and RMSEP - RMSE.
Two curves are plotted, one for the calibration: RMSEC and one for validation. In this particular case it is the cross-validation error: RMSECV.
PLS, Root Mean Square Error Plot
To gain a better approximation of what to expect in future predictions, the RMSECV should be analyzed.
The RMSECV may be studied for Preference for all PLS factors. RMSECV (using two factors) is 0.83. This means that any predicted new sample on the scale from 1 to 9 will have a prediction error around 0.8. This is an acceptable error level in sensory analysis, which has much uncertainty in all measurements.
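The RMSE values reported by the software follow the usual definition: the square root of the mean squared difference between predicted and reference values, expressed in the units of the response. A minimal sketch with hypothetical preference scores (not the tutorial's data):

```python
import numpy as np

# Hypothetical reference and cross-validated predicted preference scores (1-9 scale).
y_ref = np.array([5.0, 7.0, 3.0, 8.0, 6.0])
y_pred = np.array([5.6, 6.2, 3.9, 7.5, 6.4])

# RMSE: square root of the mean squared prediction error,
# in the same units as the response itself.
rmse = np.sqrt(np.mean((y_pred - y_ref) ** 2))
print(round(rmse, 2))
```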
Export models from The Unscrambler®
Models from The Unscrambler® are often used in instruments to make predictions in real time. A model format has been developed to facilitate the easy reading of results in instruments or other software that do not read The Unscrambler® models directly.
Task
Export the regression model used to predict Preference from Sensory Data.
How to do it
Select a PLS Model from the project navigator and select File – Export - ASCII-MOD…
This displays the Export ASCII-MOD dialog box.
Export ASCII-MOD Dialog
Verify that the correct number of factors has been chosen for the selected model. The optimal number of components should be used for the export. Therefore, change the number of factors to 2 before clicking OK.
Two types of model export are available:
• Full
• Regr.Coef. only: exports only the regression coefficients
Observe the ASCII file that is generated; it has the file name extension .AMO. The format of the file is described in the ASCII-MOD Technical Reference.
Similarly, any of the result or validation matrices can be selected for export into other formats. Supported export formats are:
• ASCII
• JCAMP-DX
• Matlab
• NetCDF
• ASCII-MOD
Full ASCII-MOD export includes all results that are necessary to perform outlier detection, etc. This format can be used for applying models outside The Unscrambler® environment, for example in a custom written program script. The ASCII-MOD file is readable by any text editor, such as Notepad.
Tutorial C: Spectroscopy and interference problems
• Description
o What you will learn
o Data table
• Get to know the data
o Read data file and define sets
o Plot raw data
• Univariate regression
• Calibration
o Interpretation of the calibration model
o Study the predicted vs measured plot
• Multiplicative Scatter Correction (MSC)
• Check the error in original units: RMSE
• Predict new MSCorrected samples
• Guidelines for calibration of spectroscopic data
Description
There is a need for an easy way to determine the concentration of a dye (a bright red heme protein, Cytochrome-C) in water solutions. The dye absorbs light in the visible range, and the concentration determination will be based on this light absorbance.
In the solutions to be analyzed there are varying, unknown amounts of milk, which absorbs some light in the same wavelength range as dye and therefore causes chemical interference in the measurements. In addition, milk contains particles that give serious light scattering.
Another effect that will influence the absorbance spectra is the varying sample path length.
The light absorbance spectrum figure shows the light absorbance spectrum of one sample of the dye/milk/water solution.
Absorbance Spectrum
The vertical lines represent the 16 different wavelength channels selected as predicting variables for this sample set.
This example is constructed so that it can be duplicated in a lab, and it illustrates the interference effects and other effects that make spectroscopy challenging. Similar problems occur in many industrial applications, e.g. measuring the concentration of different chemical species in sewer water, which contains many other chemical agents as well as physical interferences such as slurries and particles, or measuring moisture and solvents in a granulation process.
The two major peaks (variables Xvar4 and Xvar6) represent the absorbance of dye, while the first peak (Xvar2) represents absorbance due to an absorbing component in the milk. The broad peak to the right (Xvar12, Xvar13, Xvar14) is due to light absorption by water itself.
What you will learn
Tutorial C contains the following parts:
• PLS regression
• Handling of interference problems, Multiplicative Scatter Correction (MSC)
• Check list for calibration of spectroscopic data
A problem similar to this tutorial is described extensively in chapter 8 in the book “Multivariate Calibration”, by Martens & Næs.
References
• Transformations: Principles of Data Preprocessing
• Multivariate regression methods
• Prediction with regression models
Data table
Import the Tutorial C data set used in this tutorial. This is best done into a new project (File - New).
The data matrix Tutorial_C is imported into the project. It consists of 28 samples (solutions) that span the two most important types of variation: the dye and milk concentrations. The composition of dye/milk/water in each calibration sample is shown below. The values are given in ml, making a total of 20 ml in each solution (sample).
Sample Dye Milk Water Sample Dye Milk Water
1 0.0 0.5 19.5 15 4.0 0.5 15.5
2 0.0 1.0 19.0 16 4.0 1.0 15.0
3 0.0 2.0 18.0 17 4.0 1.5 14.5
4 0.0 6.0 14.0 18 4.0 6.0 10.0
5 0.0 8.0 12.0 19 4.0 10.0 6.0
6 0.0 10.0 10.0 20 6.0 1.0 13.0
7 2.0 0.5 17.5 21 6.0 2.0 12.0
8 2.0 1.0 17.0 22 6.0 6.0 8.0
9 2.0 1.5 16.5 23 6.0 10.0 4.0
10 2.0 2.0 16.0 24 8.0 0.5 11.5
11 2.0 4.0 14.0 25 8.0 1.0 11.0
12 2.0 6.0 12.0 26 8.0 1.5 10.5
13 2.0 8.0 10.0 27 8.0 2.0 10.0
14 2.0 10.0 8.0 28 8.0 6.0 6.0
Note that the known milk and water quantities will not be used to make the model, only as descriptors in result plots. The sample names are coded with these quantities as well.
Get to know the data
Read data file and define sets
The first step in all modeling is to get the data into The Unscrambler® and organize it into appropriate sets. The data for the different analyses are organized as sets, defining which
samples (rows) or variables (columns) are used in the modeling. Cleverly defined sets make modeling and plotting much easier.
Task
Open the data matrix Tutorial_C, and take a look at the properties of the data. Some of the data have already been organized into row and column sets. The data will be further organized by defining some additional sets to be used in the analysis.
How to do it
In the project navigator, expand the tree under the data matrix Tutorial_C to see the file content. An Editor with the data table is launched in the viewer.
Project navigator view of data
One can see that some sets have already been defined, but one additional column set named Statistical will be defined.
The data table already has the following: Column (Variable) Ranges:
• Cols/Name: Absorbance; Interval, Columns: 4-19
• Cols/Name: Dye Level; Interval, Columns: 3
• Cols/Name: Description; Interval, Columns: 1-2
Row (Sample) Ranges:
• Rows/Name: Calibration; Interval, Rows: 1-28
• Rows/Name: Prediction; Interval, Rows: 29-42
Put the cursor in the data viewer. Now one can define a new column set (variable range) by going to Edit - Define Range… which will open the Define Range Editor. Define the Columns Sets by putting the name Statistical in the column range space, and for interval, enter 3-19 for columns as shown below.
Define Range Dialog
Click OK when finished defining the column and row sets. Use File - Save As… to save the project with the updated name “Tutorial_C_updated” in a convenient location before continuing. The organized data will now have numerous nodes for column and sample sets in the project navigator, and gives a color-coded data matrix.
Plot raw data
It is good practice to start by plotting the raw data to get an impression of what the data look like. It will be of tremendous help when you want to assess which pretreatments are necessary and what kind of model (e.g. how many factors) to expect, as well as generally understanding the structure of the data.
Task
Plot some calibration samples in order to see how the spectra vary with varying amounts of dye and milk.
How to do it
Make a line plot of samples that have the same amount of milk, 10 ml. The line plot is just of the X-variables for these samples, so in the data table editor, select the four samples having 10 ml of milk by marking the samples in the Editor (samples 6, 14, 19, and 23) by clicking the sample numbers while holding down the Ctrl key. Then right click and select Plot - Line.
Line plot dialog
In the Line Plot dialog that appears, select the column set Absorbance from the drop-down list. Click OK and note that the four samples are highlighted in the Editor.
The same could be done by selecting the menu option Plot - Line… after having selected the samples in the viewer, and specifying the column set Absorbance in the Line Plot dialog.
Line Plot of sample with 10 ml milk
Use shortcuts keys to change the layout of the plot to a bar chart.
These four samples have the same milk level and the line plot shows that the dye level has influence on the absorbance of variables number 2 - 8 only.
Plot samples 20, 21, 22, and 23 the same way, using the Ctrl key to select just these specific rows. These samples have the same dye level: 6 ml.
The plot shows that increasing milk level will increase the absorbance of light of all wavelengths from number 1 to number 16. There seems to be a great deal of interference or scattering to deal with, over the whole spectrum. This indicates that some transformations of the data may be useful to get an optimal model.
Univariate regression
Is it possible to predict the dye level from the absorbance of one single wavelength? Before we enter the multivariate world we want to see what can be done by univariate regression.
Task
Find the best wavelength on which to make a univariate regression model.
How to do it
You find the best wavelength by looking at the correlation between each absorbance variable and the dye level variable. Select the data set Statistical from the project navigator. Select Tasks - Analyze - Descriptive Statistics… and specify the following parameters in the Descriptive Statistics dialog.
• Rows: Calibration (28)
• Cols: Statistical (17)
• Compute Correlation matrix: checked
When the computation is done, there will be a prompt asking if you want to view the plots. Click Yes, and the two plots summarizing the statistics will be displayed. You will find a new node, Descriptive statistics in the project navigator which consists of the three folders raw data, results and plots.
In the project navigator, expand the folder results. Select the Variable Correlation matrix from this folder to view this in the viewer. We will use these data to find the highest correlation between Dye Level and some X-variable. You may select the first row, dye level, and plot it (Plot - Bar) to see the highest correlation (after the correlation between Dye level and Dye level, which of course is 1).
Bar chart of variable correlation
The variable with the highest correlation coefficient to Dye Level is Xvar6, with a correlation coefficient of 0.49. You can close the bar plot of the correlation matrix by selecting its tab in the navigation bar at the bottom of the viewer, right clicking, and selecting Close.
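The screening step above, picking the wavelength most correlated with the response, can be sketched outside the software. The data here are small synthetic stand-ins, not the tutorial values:

```python
import numpy as np

# Hypothetical data: 6 samples, response (dye level) and 3 absorbance channels.
dye = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 4.0])
X = np.column_stack([
    np.array([1.0, 0.5, 1.2, 0.8, 1.1, 0.9]),                 # channel 1: unrelated
    0.3 * dye + np.array([0.1, -0.1, 0.0, 0.1, -0.1, 0.0]),   # channel 2: tracks dye
    np.array([2.0, 2.0, 2.1, 1.9, 2.0, 2.1]),                 # channel 3: nearly constant
])

# Correlation of each channel with the response; pick the strongest.
r = np.array([np.corrcoef(X[:, j], dye)[0, 1] for j in range(X.shape[1])])
best = int(np.argmax(np.abs(r)))
print(best + 1)  # -> 2  (1-based index of the most correlated channel)
```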
Now we should illustrate the regression in a plot. To get the right plot, go back to the original data set, Tutorial_C, select the columns Xvar6 and Dye level using the Ctrl key, and select Plot - Scatter. In the Scatter Plot dialog, remember to select only the calibration samples from the row drop-down list.
Scatter plot dialog
Scatter plot of Xvar6 vs Dye level
Another way to do this is go to Plot - Scatter and in the Scatter plot dialog click on the define button next to Cols., which will open the Define Range dialog. Here you can select the columns Dye level and Xvar6, or type in columns 3, 9 in the Interval box. Select the calibration samples for the rows.
Scatter plot dialog showing define option
Turn on the Regression Line and Target Line with the shortcut buttons. The plot statistics can also be added from the toolbar shortcut. From the plot we see that the results are not very good using just one variable to model the dye level; hopefully multivariate regression models will do better.
Scatter plot of Xvar6 vs Dye level with target and regression lines
Calibration
We choose to make a PLS regression model because PLS takes the variation in Y into consideration when the model is calibrated.
Task
Make a PLS regression model between the variable set Absorbance (X) and the response Dye Level (Y).
How to do it
Activate the Tutorial_C data Editor from project navigator and select Tasks - Analyze - Partial Least Squares Regression…. In the PLS dialog, specify the following parameters:
• Data Set: Tutorial_C
• Predictors:
Rows: Calibration (28)
Cols (X-variables): Absorbance (16)
• Responses:
Cols (Y-variables): Dye Level (1)
• Maximum components: 8
• Mean center data: selected
• Identify outliers: selected
PLS Regression dialog
• Weights: All 1.0 in X and Y
• Validation method: Cross validation
Go to the Validation tab to select cross validation. You can further define the settings for this by clicking Setup…, taking you to the Cross validation setup dialog. Select Random as the cross validation method and set the number of segments to “7”.
Cross validation setup dialog
Start the calibration by clicking OK on the Model inputs tab. When the computation is complete you will be asked if you want to view the PLS plots now. Click Yes, and the regression overview plots will be displayed.
A new node, PLS, has been added to the project navigator. This has four folders with the raw data, results, validation, and plots for the PLS model. Rename the PLS node in the project navigator for this analysis to “PLS Tutorial C” before you continue. You can do this by right clicking the latest PLS model in the project navigator and selecting Rename.
Interpretation of the calibration model
The interpretation of a calibration model involves several steps. First, we check whether the model has detected any systematic variation. This is done by looking at the residual variance plot. If the model has successfully described systematic variation, we start to interpret different additional modeling results. The most important model results to study are the Scores, Loadings, and the Predicted vs Measured, all of which are part of the Regression Overview Plots.
Task
Interpret the plots in the regression overview.
How to do it
The regression overview was displayed when you clicked View plots. It consists of four plots of the most important modeling results from the regression model. We will now view the PLS results. The plot in the lower left quadrant is the residual variance. This plot gives information about how many factors are required to explain model variation and optimal number of factors for the model. A summary of the model information is given in the Info box in the lower left of the screen, below the project navigator.
PLS Regression Overview Plots
Score plot
The plot in the upper left quadrant is the Scores plot. From it we can interpret that the combination of the two main factors, factor 1 and factor 2, reflects the variations in the milk and water levels. The first two factors explain 99% of the X-variance (factor 1: 84%, factor 2: 15%) and 75% of the variance in the response dye level (factor 1: 19%, factor 2: 56%). By studying the samples in the plot we can see that the milk level increases from upper left to lower right, while the water level increases from right to left.
Regression coefficients
The regression coefficients plot summarizes the relationship between all predictors and a given response. It is easiest to access this plot by selecting it from the plots folder in the project navigator.
Plots folder in project navigator
You can see this plot when any PLS plot is active in the viewer by going to Plot - Regression Coefficients - Raw coefficients (B) - …, or by right clicking and selecting PLS - Regression Coefficients - Raw coefficients (B) - …. Select the line plot of the raw regression coefficients. Since no weighting was applied to the data, the plots of weighted and raw regression coefficients will be identical.
The regression coefficients plot indicates that the wavelength numbers (X-variables) 4 and 6 are the most important for the prediction of Y (concentration) in the first factor. The pattern is clearer here than in the loading plot.
Regression coefficients plot
Compare the regression coefficients plot to the raw absorbance data. You see that high coefficient values, indicating important variables, occur in the region where we know that milk and dye absorb light.
Study the predicted vs measured plot
This plot, in the lower right of the Regression Overview shows how the model is able to predict the response value for the calibration samples. This gives an indication of how well the model will perform in the future when new samples are collected and we want to calculate the dye level for these samples, from the spectral data.
Task
Take a closer look at the residual variances in the error measures plots.
How to do it
Activate the Predicted vs Measured plot and select Plot - Variances and RMSEP… and select the X- and Y-variance, which will bring up two plots summarizing the X and Y variance.
The upper plot shows that the model describes much of the variance in the X-variables in the first factors, while it takes more factors in the lower plot to describe the variance in Y (dye level). We are interested in describing Y, therefore we have to include enough factors in our model to get a high explained variance for the Y-variable.
The X-variance and Y-variance plots
Multiplicative Scatter Correction (MSC)
Since we suspect that the light scattering and sample thickness have multiplicative effects on the data, and that the chemical absorptions have additive effects, we decide to try MSCorrection on the X-variables in order to separate these effects from each other.
Perform a Multiplicative Scatter Correction
Task
Correct the data for multiplicative scatter effects. Omit variables 1 to 8 in the Set Absorbance as important variables.
How to do it
Select the data matrix Tutorial_C.
First, we verify the need for MSC by looking at the Scatter Effects plot. This plot is available from a Statistics model. Select Tasks - Analyze - Descriptive Statistics and specify the following parameters in the Descriptive Statistics dialog:
• Rows: Calibration (28)
• Cols: Absorbance (16)
Click OK to calculate the statistics, and select Yes to view the plots. Since descriptive statistics were already run before, using 17 variables rather than just the absorbance set, the current results appear as a new node, Descriptive Statistics(1), in the project navigator. We are not interested in the default plots that are shown, but want a plot that helps us understand the scatter in the data. Make the plot window active and select the menu option Plot - Scatter effects. In this plot of the mean value of each X-variable we see that the scatter is not the same for all variables. The first 8 variables lie approximately on a straight line; for the other variables one can observe a spread in the scatter effects.
Scatter effects plot
Select the data matrix Tutorial_C. Select Tasks - Transform - MSC/E… Specify the following parameters in the Multiplicative Scatter Correction dialog:
• Rows: Calibration (28)
• Columns: Absorbance (16)
• Enable Omit Variables: 1-8
Multiplicative Scatter Correction dialog
Go to the Options tab and under Function select Common Amplification.
Multiplicative Scatter Correction options
The prediction samples are not used when finding the correction factors that will now be determined and used in the MSC.
Variables 1-8 are omitted as important because the light absorption at these variables varies with the dye level, while wavelengths 9 to 16 (the water absorption peak) are independent of the concentration of dye. The differences at these wavelengths are instead caused by the general light scatter due to the milk addition. It is important that only wavelengths with no chemical information are used to find the correction factors.
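The MSC transformation itself can be sketched in a few lines: each spectrum is regressed on the mean spectrum using only the channels carrying no chemical information (here channels 9-16, mirroring the omission of variables 1-8), and the fitted offset and amplification are then removed from the whole spectrum. The spectra below are synthetic:

```python
import numpy as np

def msc(spectra, fit_idx):
    """Multiplicative scatter correction.

    For each spectrum x, an offset a and amplification b are estimated by
    regressing x on the mean spectrum, using only the channels in fit_idx
    (those with no chemical information). The whole spectrum is then
    corrected as (x - a) / b.
    """
    mean = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(mean[fit_idx], x[fit_idx], 1)  # slope, intercept
        corrected[i] = (x - a) / b
    return corrected

rng = np.random.default_rng(2)
base = np.linspace(1.0, 2.0, 16)                  # hypothetical common spectrum
scale = rng.uniform(0.8, 1.2, size=(10, 1))       # multiplicative scatter
offset = rng.uniform(-0.1, 0.1, size=(10, 1))     # additive baseline shift
spectra = scale * base + offset

fit_idx = np.arange(8, 16)   # channels 9-16 only (variables 1-8 omitted)
out = msc(spectra, fit_idx)

# After correction the scatter between samples should be essentially gone.
print(float(out.std(axis=0).max()))
```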
The transformed data are now displayed in the project navigator with the name “Tutorial_C_MSC”. There is also a node with the MSC model for transformation, which can be applied to future samples. This is called “MSC_Tutorial_C”, and has a folder with the model under it.
Look at the corrected data by selecting the data from the new project navigator node, and going to Plot - Line. Select the new sample matrix with the corrected data in the Line Plot dialog, row set calibration, and column set Absorbance.
Line plot of MSC transformed data.
We want to compare the corrected data with the original data. Select the raw data matrix in the project navigator (Tutorial_C) and make a line plot of the calibration samples for the absorbance values. You see that the MSCorrected data are different from the original: the interference and light scatter effects have been successfully corrected for. You can display the plots on the same screen by going to the navigation bar at the bottom of the screen and right clicking to select Pop out, which gives an undocked plot of the MSC corrected data that can be moved around as you wish.
Pop out menu
You can then choose the line plot of the uncorrected data from the navigation bar, making it active in the Viewer, and move the other window to the same view for easier comparison.
Line plots of the MSCCorrected and the original data
Another way to view both plots together is to go to Insert - Custom Layout - Two Horizontal… and select the two sample matrices, selecting the calibration samples for rows and Absorbance for columns, and setting the plots to be line plots in the Custom Layout dialog. You can also give a title for each plot, as shown below.
Custom layout dialog
Calibrate with MSC transformed data
So far we have only corrected the data, now we have to make a new PLS model using MSCorrected data.
Task
Make a PLS model with the same model parameters as the model “PLS Tutorial C”.
How to do it
Activate the matrix with the corrected data. Select Tasks - Analyze - Partial Least Squares Regression… and specify the following parameters in the Partial Least Squares dialog:
• Data Set: “Tutorial_C_MSC”
• Predictors:
Rows: Calibration (28)
Cols (X-variables): Absorbance (16)
• Responses:
Rows: Calibration (28)
Cols (Y-variables): Dye Level (1)
• Maximum components: 8
• Mean center data: selected
• Identify outliers: selected
• Weights: All 1.0 in X and Y
• Validation method: Cross Validation
Go to the Validation tab to select the cross validation method, again using Random with 7 segments.
Click Yes to view the plots for this model, and the regression overview plots will be displayed in the viewer.
The new regression model will create a new PLS node in the project navigator. Rename this to PLS MSCorrected by selecting the node and right clicking to select Rename.
Comparison of models
We are now interested in seeing how the model performs with regard to prediction ability. The residual variance is therefore the yardstick by which we compare the different models.
Task
Look at the residual variance for all models in Tutorial C.
How to do it
Study the residual variance for each model. In the project navigator, select the PLS results for the first PLS model, and from the plots folder select Regression overview. The plot in the lower left quadrant shows the variance. Use the toolbar shortcuts to display the residual Y-variance. We see that for the optimal number of factors (2) the variance value is 4.4.
There is a minimum in this plot at 5 factors, but beyond 2 factors the residual variance has not really decreased.
Y Residual validation variance: original data
View the same plot for the model PLS MSCorrected by going to the PLS Overview plot of the MSC corrected data (which should still be an open tab in the navigation bar at the bottom of the viewer). Highlight the lower left quadrant, the explained variance plot, and change the view to the residual Y-variance plot by using the toolbar shortcuts, selecting Y and Res, for just the validation samples.
Y Residual validation variance: MSC Corrected data
The plot shows the validated residual Y-variance for the MSC corrected model. Comparing the two models, the minimum squared error is lower for the MSC corrected model with two factors (1.87). So although the recommended optimal number of factors is four, even with two factors the system can be modeled well: more of the Y-variance is explained by two factors than when using the raw data (see the score plot). With the raw data a much higher error is obtained, and less of the Y-variance is explained with two factors. This shows that MSC has removed the interfering amplification effect in these data.
Tutorial C MSCorrected with four factors gives the lowest estimate for the residual Y-variance, so predictions made with this model using four factors will have the lowest prediction error. The system could also be modeled well enough with two factors; however, as we have no information here on the error of the reference method for measuring the dye level, we will follow the model's suggestion of four factors.
Check the error in original units: RMSE
The numerical residual variance values used to find the best model and decide the optimal number of factors are not related directly to the predictions. The residual variance cannot tell us how large the deviations in future predictions are expected to be; the RMSEP must be used for that purpose.
Task
Let us see how large an error in ml dye we can expect in future predictions: RMSEP.
How to do it
Activate the regression overview plot for the model PLS MSCorrected. Select Plot - Variances and RMSEP - RMSE.
Deselect the calibration samples box and select the validation samples (RMSEP) instead from the shortcut keys.
You see that the shape of the curve is exactly that of the residual variance, but the values have changed. The plot says that predictions done with this model and using four factors will have an average prediction error of 0.9.
RMSE: MSC Corrected data
Predict new MSCorrected samples
The model with MSC is the one we will use for the prediction of new samples.
Run a prediction with automatic pretreatment
The prediction samples will be transformed automatically with the same MSC model as the calibration samples. This requires that the variable selection for the data matrix includes the same number of variables as are associated with the MSC, which must be selected correctly in the Prediction dialog.
Task
Predict the dye level of the unknown samples.
How to do it
Select Tasks - Predict- Regression…. Specify the following parameters in the Prediction dialog:
• Model name: “PLS MSCorrected”
• Number of Components: 4
• Full Prediction with inlier options also selected
• Data Matrix: “Tutorial_C_MSC”
• Rows: Prediction (14)
• Columns: All
As you can see, there is the option to make the prediction with a different number of components than the one deemed optimal for the model. In the predictions we can also compare results with a model of fewer components, which helps to guard against possible overfitting.
Prediction dialog
Click View after the prediction is done. The prediction overview plot appears, where the predicted values are shown together with their deviations. A new node, Predict, has been added to the project navigator, with folders for raw data, validation, and plots. The prediction overview shows a plot of the values with their estimated uncertainties, and also a table of the values with these deviations.
Predicted values with deviation
Large deviations indicate that the predictions cannot be trusted. For a prediction to be trusted, the predicted sample must not be too far from the calibration samples; this is checked by the Inlier distance. Its projection onto the model should also not be too far from the center; this is checked with the Hotelling T² distance.
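The Hotelling T² statistic behind this check is just the sum of squared scores, each weighted by the inverse variance of that score in the calibration set. The 95% limit in the software comes from an F-distribution; this sketch only computes the statistic itself, on synthetic scores:

```python
import numpy as np

# Synthetic scores of 30 calibration samples on a 4-factor model
rng = np.random.default_rng(1)
T = rng.normal(scale=[2.0, 1.5, 1.0, 0.5], size=(30, 4))
T = T - T.mean(axis=0)                    # scores are centered in practice
score_var = T.var(axis=0, ddof=1)

# Hotelling T^2: squared scores weighted by the inverse score variance
t2_cal = np.sum(T ** 2 / score_var, axis=1)

def hotelling_t2(t, var):
    return float(np.sum(t ** 2 / var))

# T^2 for one new (projected) sample
t2_new = hotelling_t2(np.array([1.0, -0.5, 0.2, 0.1]), score_var)
```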
Study the Inlier vs Hotelling T² plot available from a right click on the plot and then Prediction - Inlier/Hotelling T² - Inliers vs Hotelling T²
Inliers vs Hotelling T²
In this case all the samples are found below the Inlier distance limit, showing that these samples are similar to those used in making the model. One sample is outside the Hotelling T² limit line (at 95% confidence), so it is an outlier. The prediction for that sample therefore cannot be trusted.
Guidelines for calibration of spectroscopic data
Now that you have learned the basics of calibration, let us suggest steps and useful functions for the development of calibration models.
See the guidelines for spectroscopic calibrations
Tutorial D: Screening and optimization designs
• Description
  o What you will learn
  o Data table
• Build a screening design
• Estimate the effects
  o Run an analysis of effects
  o Interpret the results
• Draw a conclusion from the screening design
• Build an optimization design
• Compute the response surface
  o Run a response surface analysis
  o Interpret analysis of variance results
  o Check the residuals
  o Interpret the response surface plots
• Draw a conclusion from the optimization design
Description
This tutorial is built from the enamine synthesis example published by R. Carlson in his book “Design and Optimization in Organic Synthesis”, Elsevier, 1992.
A standard method for the synthesis of enamine from a ketone gave some problems, and a modified procedure was investigated. A first series of experiments gave two important results:
1. Reaction time can be shortened considerably.
2. The optimal operational conditions were highly dependent on the structure of the original ketone.
Thus, a new investigation had to be conducted to study the specific case of the formation of morpholine enamine from methyl isobutyl ketone. It was decided to adopt a 2-step strategy:
1. At a screening stage, study the main effects of 4 factors (relative amounts of the reagents, stirring rate and reaction temperature).
2. Conduct an optimization investigation with a reduced number of factors.
What you will learn
Tutorial D contains the following parts:
• Build suitable designs for screening and optimization purposes
• Analysis of Effects
• Response Surface Modeling
References:
• Principles of Data Collection and Experimental Design
• Descriptive statistics
• Principles of experimental design
• Analysis of designed data
Data table
From the previous experiments, reasonable ranges of variation were selected for the 4 design variables:
Variable Low High
A: amount of TiCl4 / Ketone (mol/mol) 0.57 0.93
B: amount of Morpholine / Ketone (mol/mol) 3.7 7.3
C: reaction temperature (°C) 25 40
D: stirring rate (rpm) 0 50
Build a screening design
Screening designs are used to identify which design variables influence the responses significantly.
Task
Select a screening design which requires a maximum of 11 experiments that will make it possible to estimate all main effects.
Note: With 4 design variables, a Plackett-Burman design offers no advantage, because it requires 8 experiments, the same number as a fractional factorial design. A fractional factorial gives 8 (2⁴⁻¹) experiments, while a full factorial design gives 16 (2⁴) experiments.
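The suggested 2⁴⁻¹ fractional factorial can be written down directly: take a full 2³ design in A, B, C and set D = ABC (so that I = ABCD, giving resolution IV). A sketch:

```python
import itertools
import numpy as np

# Full 2^3 design in A, B, C (coded -1/+1), then generator D = A*B*C
abc = np.array(list(itertools.product([-1, 1], repeat=3)))
d = abc.prod(axis=1, keepdims=True)
design = np.hstack([abc, d])          # 8 runs, 4 factors

# With I = ABCD every main effect is aliased with a 3-factor interaction,
# and two-factor interactions are aliased in pairs (AB=CD, AC=BD, AD=BC).
```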
How to do it
Choose Insert – Create Design… to launch the Design Experiment Wizard.
In the Design Experiment Wizard, on the first tab Start, type a name for the table, for example “Enamine”. Select the Goal, which for now is Screening. It is possible to type information in the Information section.
Start tab filled
Go to the next section: Define Variables.
Specify the variables as shown in the table hereafter:
ID Name Analysis type Constraints Type of levels Levels
A TiCl4 Design None Continuous 0.6 - 0.9
B Morpholine Design None Continuous 3.7 - 7.3
C Temperature Design None Continuous 25.0 - 40.0
D Stirring Design None Continuous 0.0 - 50.0
ID Name Analysis type Constraints Type of levels Levels
1 Yield Response None – –
Do this by clicking the Add button and editing the Variable editor. Validate by clicking OK and enter the next variable by clicking Add again.
Define Variables tab filled
After all design variables have been defined, go to the next tab Choose the Design, to select the appropriate design.
By default, in the Beginner mode, the selected design is “Screening of many design variables” which refers to a Fractional factorial design as can be seen in the box below the Design section.
This design corresponds to the goal of the experimentation so no change is needed.
The Design Wizard - Choose the design tab
Go to the next tab: Design Details.
This tab gives information about the resolution of the design, the confounding pattern and the number of experiments to perform including the center samples.
By default the selected option is a Fractional factorial design with resolution IV, and the confounding pattern shows which interactions are confounded. It is possible to upgrade to a Full factorial, but this increases the number of experiments to perform to 19, which is more than we would like to do.
Study the confounding pattern of the suggested design. All main effects are confounded with 3-variable interactions, which is acceptable if those interactions are unlikely to be significant. The 2-variable interactions are confounded two by two. This is going to limit the study and the conclusions, but in a screening stage this is acceptable.
The Design Wizard - Design Details tab
Go to the next tab: Additional Experiments.
There is no need to replicate the design samples so the Number of replications is kept at its default value: “1”.
By default there are “3” center samples. This is enough.
There is no need to add reference samples.
The Design Wizard - Additional experiments tab
Proceed to the next tab, Randomization. There is no need to make any further specification in this tab. Try different options just to get familiar with the possibilities.
The Design Wizard - Randomization tab
Go to the Summary tab.
In this tab some information about the design is presented. It is also possible to calculate the power of the design. To do so two values are needed:
• Delta: the difference to detect. In this example a 3% yield improvement would be great.
• Std. dev.: the estimated standard deviation. In this example the yield for the same parameters varies with a standard deviation of 1.2.
Enter the following values:
• Std. dev.: 1.2
• Delta: 3
and click on the Recalculate power button
Note this value. As there is only one response variable, the power of the design is the same as the one calculated. A power greater than 0.80 is considered good enough.
The Design Wizard - Summary tab
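The power calculation can be approximated by hand. For a two-level design with N runs, the standard error of an effect is roughly 2σ/√N; below is a rough normal-approximation sketch of the power to detect a difference Delta (the software's exact computation may differ):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(delta, sd, n_runs):
    """Normal-approximation power to detect an effect of size delta (two-sided alpha = 0.05)."""
    se = 2.0 * sd / sqrt(n_runs)       # approx. standard error of an effect estimate
    z_crit = 1.959964                  # two-sided 5% critical value
    ncp = delta / se
    return norm_cdf(ncp - z_crit) + norm_cdf(-ncp - z_crit)

# Tutorial values: delta = 3, std. dev. = 1.2, 11 runs
power = approx_power(delta=3.0, sd=1.2, n_runs=11)
```

With these values the approximate power is well above the 0.80 threshold, consistent with the design being judged adequate.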
Go to the final tab: Design Table. Here the data table is presented with several view options. Check them out to familiarize with the options.
The Design Wizard - Design table tab
The design creation is now complete. Click the Finish button.
Now the data tables appear in the Navigator. There is a separate table for the responses. The design table has been organized with row sets for the design and center samples, and column sets for the effects.
The design tables in the Navigator
It is possible to view the data in different ways.
• To change the order from the standard sample sequence to the experiment sample sequence click on the column Randomized and go to Edit – Sort – Ascending.
• To change from the actual values to the level values click on the table and then View – Level indices.
Estimate the effects
After the experiments have been performed and the responses have been measured, the results have to be analyzed using a suitable method. Study the main effects of the four design variables. The simplest way to do this is to run an Analysis of Effects, and then, interpret the results.
Run an analysis of effects
Task
1. Fill in the responses in the matrix Enamine_Response. 2. Run an Analysis of Effects.
How to do it
First, enter the 11 response values manually. Make sure the rows are sorted in experimental order.
Sample Yield
(1) 74.3
ad 70.1
bd 87.9
ab 96.7
cd 72.8
ac 69.7
bc 88.7
abcd 97.1
cp01 96.4
cp02 96.8
cp03 96.9
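For a two-level design, the effect of a factor is simply the mean response at its high level minus the mean at its low level. A sketch using the eight factorial runs just entered (the coded levels are read off the run labels, with D = ABC):

```python
import numpy as np

# Coded levels (A, B, C, D) for runs (1), ad, bd, ab, cd, ac, bc, abcd
runs = {
    "(1)":  (-1, -1, -1, -1), "ad":   ( 1, -1, -1,  1),
    "bd":   (-1,  1, -1,  1), "ab":   ( 1,  1, -1, -1),
    "cd":   (-1, -1,  1,  1), "ac":   ( 1, -1,  1, -1),
    "bc":   (-1,  1,  1, -1), "abcd": ( 1,  1,  1,  1),
}
yields = np.array([74.3, 70.1, 87.9, 96.7, 72.8, 69.7, 88.7, 97.1])
X = np.array(list(runs.values()), dtype=float)

# Effect of each factor = mean(y at +1) - mean(y at -1), i.e. X_col . y / 4
effects = X.T @ yields / 4.0          # order: A, B, C, D
```

This reproduces the picture the analysis will give: B (Morpholine) is about 20.9 and A (TiCl4) about 2.5, while C and D are close to zero.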
To start the analysis, choose Tasks - Analyze - Analyze Design Matrix….
In the Method dialog select the Classical DOE analysis method and go to the second tab Model Inputs.
Method dialog
Predictors
In the Predictors part set the X matrix to be “Enamine_Design”, Rows “All” and the Cols “All”.

Model
The Model should include the “Main effects + Interactions (2-var)”. The list of estimated effects should be “A, B, C, D, AB, AC, BC”.

Note: Not all the interactions are presented. Remember that AB=CD, AC=BD and BC=AD because of the confounding pattern.

Responses
For the Responses set the Matrix to be “Enamine_Response”, Rows “All” and the Cols “All”.
Validate the final choices by clicking OK.
Model inputs
When the computations are done, click Yes to study the results. A new node called DOE Analysis is added into the navigator. Before doing anything else, use File - Save As to save the project with a name such as “Enamine Project”.
Interpret the results
Task
Interpret the results of the Analysis of Effects that was just run.
How to do it
The ANOVA Overview plot shows four informative plots:
• the ANOVA table
• the Diagnostics table
• the Effect viewer
• the Effect Summary table
ANOVA table
Look at the ANOVA table and check the validity of the model. The p-value of the model should be less than 0.05. If this is the case, look at the values for the different sources of variation, i.e. the main effects. The significant effects are the ones with a p-value less than 0.05; they are shown in shades of green. Here A (TiCl4), B (Morpholine), and AB=CD are found significant.
Check the R-square values; the closer to 1 the better.
ANOVA table
Note: The interaction effect BC=AD is a possible significant effect. Checking the effect value or the b-coefficient should help to determine whether it is significant or not.
The Effect viewer
Look at the effects and check for curvature. See whether the center sample average is placed such that the averages at the low and high levels are linked by a linear relation. If this is the case there is no curvature effect. Use the arrows on the toolbar to scroll through the effects for the different variables.

Here a curvature effect can be found for all effects: A (TiCl4), B (Morpholine), C (Temperature), D (Stirring). However, it can be noticed that the low and high values for C and D are quite similar. In addition, the center value is the same for all 4 effects. Most probably the curvature is associated with A and B, the significant effects.
Effect Morpholine on the Yield
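The curvature check can also be done numerically: compare the average of the center samples with the average of the factorial runs. A sketch with the tutorial's response values:

```python
import numpy as np

# Factorial runs and center samples from the screening experiment
factorial = np.array([74.3, 70.1, 87.9, 96.7, 72.8, 69.7, 88.7, 97.1])
center = np.array([96.4, 96.8, 96.9])

# A large gap between the two means indicates curvature
curvature = center.mean() - factorial.mean()
```

Here the gap is about 14.5 yield units, a strong sign of curvature, which is exactly why the tutorial moves on to an optimization design with quadratic terms.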
The Diagnostics
Look at the residuals to see if the model fits the samples well. The table is presented in the experimental (randomized) order, which makes it possible to check for any deviation with time.
The Summary table
See which effect is the most important (size) and the most significant (smallest p-value).
Look at the value of the coefficient for “Morpholine*Temperature”. This effect is much smaller than the significant ones. It can be neglected.
Summary table
Go through the other plots and check the plot interpretation in the DOE section
Draw a conclusion from the screening design
The final conclusions of the screening experiments are the following:
1. Three effects were found likely to be significant. One of them is a confounded interaction. Since the main effects of A and B are the only significant ones, we can make an educated guess and assume that the significant interaction is AB (and not CD, with which it is confounded).
2. There seems to be a strong nonlinearity in the relationship between Yield and (TiCl4, Morpholine). Furthermore, since the center samples have a higher yield than the majority of the design samples, the optimum is likely to be somewhere inside the investigated region.
Thus, the next sensible step would be to perform an optimization, using only variables TiCl4 and Morpholine.
Build an optimization design
After finding the important variables from a screening design, it is natural to proceed to the next step: find the optimal levels of those variables. This is achieved by an optimization design.
Task
Build a Central Composite Design to study the effects of the two important variables (TiCl4 and Morpholine) in more detail.
Note: The other two variables investigated in the screening design, found to not be significant, have been set to their most convenient values: No stirring, and Temperature=40°C.
How to do it
Go to Tools - Extend/Modify a design
A dialog box opens, in which one selects the design to be extended or modified. Select the design “Enamine_Design”.
Modify/Extend Design dialog
The Design Experiment Wizard opens.
On the first tab Start, type a name for the table, for example “Enamine_Opt”. Select the Goal, which is now Optimization. It is possible to type information in the Information section.
Go to the next section: Define Variables.
Delete the variables “Temperature” and “Stirring”. To do so, click on the variable to be deleted and press Delete.
The design variables “TiCl4” and “Morpholine” as well as the response variable “Yield” are kept.
ID Name Analysis type Constraints Type of levels Levels
A TiCl4 Design None Continuous 0.6 - 0.9
B Morpholine Design None Continuous 3.7 - 7.3
1 Yield Response None – –
Define variables tab
Go to the next tab Choose the Design. The selected option, Optimization of response(s) with 3 or 5 levels, corresponds to either a central composite design or a Box-Behnken design. This is a good option for an optimization on variables without constraints. Do nothing and go to the next tab.
Choose the Design tab
In the next section Design Details, four options are proposed. Look at the bottom table to see the differences between the designs and their performance. As it is possible to do experiments outside the selected range, the option Circumscribed Central Composite (CCC) design is chosen. Check the value of the star point distance to the center: it should be 1.412 for two design variables.
Design Details tab
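For a rotatable circumscribed CCD, the star-point distance is α = (number of factorial points)^(1/4), i.e. about 1.414 for two design variables; the 1.412 shown by the software is presumably a rounded or slightly adjusted value. A sketch generating the CCC run pattern:

```python
import itertools
import numpy as np

k = 2                                   # number of design variables
alpha = (2 ** k) ** 0.25                # rotatable star distance, ~1.414 for k = 2

cube = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))   # 4 factorial points
star = np.vstack([alpha * np.eye(k), -alpha * np.eye(k)])         # 4 axial (star) points
center = np.zeros((5, k))                                         # 5 center samples
ccd = np.vstack([cube, star, center])                             # 13 runs total
```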
Go to the next section: Additional Experiments.
In this section it is possible to add some samples: either replicate the design points or the center samples. Leave the Number of replications at “1”. Set the Number of center samples to “5”. There are no Reference samples.
Additional Experiments tab
Go to the Randomization tab.
It is possible to change the order of the experimentation by modifying the settings of this tab. To not randomize a design variable use the Detailed randomization button. To just have another go at the randomization click on Re-randomize.
Randomization tab
In the Summary tab check that the design includes a total of 13 experiments. Otherwise, go back to the appropriate tab and make the necessary corrections.
Summary tab
Go to the Design Table tab, and display the experiment in different views.
Design Table tab
Finally click the Finish button.
The generated design table is displayed in the viewer and all associated tables are automatically added to the project navigator. Their names start with “Enamine_Opt”.
Save the project, which now includes the information for both the screening and the optimization experiments.
Generated designed tables
Compute the response surface
After the new experiments have been performed and their results collected, it is possible to analyze the results so as to find the optimum. This is done by finding the levels of TiCl4 and Morpholine that give the best possible yield. A response surface analysis can give this information.
Run a response surface analysis
Task
Run a Response Surface Analysis.
How to do it
Enter the response values in the “Enamine_Opt_Response” matrix. Before doing so, check that the order of experiments is the standard one and not the experimental one. Use Edit-Sort-Ascending to change the order if necessary.
Sample Yield
Axial_A(high) 84.9
Axial_A(low) 76.8
Axial_B(high) 81.3
Axial_B(low) 56.6
Cube1 73.4
Cube2 69.7
Cube3 88.7
Cube4 98.7
cp01 96.4
cp02 96.8
cp03 87.5
cp04 96.1
cp05 90.5
Response matrix
Choose Tasks – Analyze – Analyze Design Matrix….
In the first tab, Method, select the first option: Classical DOE.
In the dialog box, make the following selections:
• Predictor Matrix: “Enamine_Opt_Design”, Rows: “All”, Cols: “All”
• Model: “Main effects + Interactions (2-var) + Quadratic”
• Responses Matrix: “Enamine_Opt_Response”, Rows: “All”, Cols: “All”
Model inputs
Click OK to start the analysis.
When the computations are done, click Yes to study the results. A new node called DOE Analysis(1) is added into the navigator.
Interpret analysis of variance results
Task
Interpret the results from the analysis.
How to do it
The ANOVA Overview plot shows four informative plots:
• the ANOVA table
• the Diagnostics table
• the Effect viewer
• the Effect Summary table
First, study the ANOVA results.
Note: It is possible to see a table better by expanding any quadrant of the overview: drag the resize cross.
Study in turn: Summary, Variables, and Quality in the ANOVA table.
ANOVA Table for the Response Surface model
The Summary shows that the model is globally significant, so it is possible to go on with the interpretation.
The ANOVA table for the variables displays the p-values for each effect. The most significant coefficients are the linear and quadratic effects of Morpholine. The TiCl4 effects look less important but are still significant, the square term in particular being very significant. However, the interaction is more doubtful.
The Quality section tells about the quality of the fit of the response surface model: the R-square values for calibration and prediction are very good.
In the Results node in the project navigator, check the tables Model check and Lack of fit.
The Model Check indicates that the quadratic part of the model is significant, which shows that the interaction and square effects included in the model are useful.
The Lack of Fit section shows that, with a p-value greater than 0.05, there is no significant lack of fit in the model. Thus the model can be trusted to describe the response surface adequately.
Check the residuals
Task
Check the residuals from the Response Surface Analysis.
How to do it
Go to the predefined plot Residuals overview, found in the Plots folder in the project navigator.
Start with the Normal Probability plot of the residuals. This plot can be used to detect outliers. Here the residuals form two groups (positive residuals and negative ones). Apart from that they lie roughly along a straight line, and there is one extreme residual, “cp03”. This may be an outlier.
Normal Probability plot of the residuals
Look at the second plot Y-Residuals vs Predicted Y.
Y-Residuals vs Predicted Y
In the residuals plot, all values are within the (-4;+4) range, except “cp03” which has a high residual. For the other samples, there is no clear pattern in the residuals, so nothing seems to be wrong with the model.
Look at the bottom right plot Y-residuals vs Experimental order. Check if there is a bias with time. Look at the 5 center samples residuals.
The center samples show quite some variation. This is why so few effects in the model are very significant: there is quite a large amount of experimental variability.
Interpret the response surface plots
Now that the model has been thoroughly checked, use it for final interpretation. This is most easily done by studying the response surface.
Task
Interpret the response surface plots.
How to do it
The contour plot is available from the project navigator in the folder Plots - Response surface and shows the shape of the response surface as a contour plot. Click on it and select the menu Properties to change it into a 3-D response surface. Change the scaling to zoom around the optimum, so as to locate its coordinates more accurately.
Click at various points in the neighborhood of the optimum to see how fast the predicted values decrease. Notice that the top of the surface is rather flat, but that the further away you go, the steeper the decrease in Yield.
Finally, notice that the Predicted Max Point Value, found in the table below the plot, is smaller than several of the actually observed Yield values (sample Cube4, for instance, has a Yield of 98.7). This is not paradoxical, since the model smooths the observed values. Those high observed values might not be reproduced if the same experiments were performed again.
Draw a conclusion from the optimization design
The analysis gave a significant model, in which the quadratic part in particular was significant, thus justifying the optimization experiments.
Since there was no apparent lack of fit, no outliers, and the residuals showed no clear pattern, the model could be considered valid and its results interpreted more thoroughly.
The response surface showed an optimum predicted Yield of 96.747 for TiCl4=0.835 and Morpholine=6.504. The predicted Yield is larger than 95 in the neighboring area, so that even small deviations from the optimal settings of the two variables will give quite acceptable results.
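The numbers in this conclusion can be reproduced approximately by fitting the quadratic model to the 13 coded runs by least squares and solving for the stationary point. A sketch; the assignment of Cube1 to Cube4 to the (±1, ±1) corners is an assumption based on standard run order:

```python
import numpy as np

a = 2 ** 0.5  # rotatable star distance for 2 factors (software shows 1.412)
pts = np.array([
    [ a, 0.0], [-a, 0.0], [0.0,  a], [0.0, -a],   # axial: A high/low, B high/low
    [-1, -1], [ 1, -1], [-1,  1], [ 1,  1],       # cube runs (assumed order)
    [0, 0], [0, 0], [0, 0], [0, 0], [0, 0],       # 5 center samples
])
y = np.array([84.9, 76.8, 81.3, 56.6,
              73.4, 69.7, 88.7, 98.7,
              96.4, 96.8, 87.5, 96.1, 90.5])

A, B = pts[:, 0], pts[:, 1]
X = np.column_stack([np.ones(len(y)), A, B, A * B, A ** 2, B ** 2])
b0, b1, b2, b12, b11, b22 = np.linalg.lstsq(X, y, rcond=None)[0]

# Stationary point of the fitted quadratic: solve grad(y) = 0
H = np.array([[2 * b11, b12], [b12, 2 * b22]])
opt = np.linalg.solve(H, -np.array([b1, b2]))
y_max = float(b0 + b1 * opt[0] + b2 * opt[1] + b12 * opt[0] * opt[1]
              + b11 * opt[0] ** 2 + b22 * opt[1] ** 2)
```

Back-transforming the coded optimum with the centers (0.75, 5.5) and half-ranges (0.15, 1.8) gives roughly TiCl4 ≈ 0.82 and Morpholine ≈ 6.50, close to the software's reported optimum; small differences come from the exact star distance used and the assumed cube-run order.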
Tutorial E: SIMCA classification
• Description
  o What you will learn
  o Data table
• Reformat the data table
• Graphical clustering
  o Graphical clustering based on hierarchical clustering
  o Graphical clustering based on score plots
• Make class models
• Classify unknown samples
• Interpretation of classification results
• Diagnosing the classification model
Description
The data to be classified in this tutorial is taken from the classical paper by Fisher (Fisher RA, The use of multiple measurements in taxonomic problems, Ann. Eugenics, 7, 179–188 (1936)). The task is to see whether three different types of iris flowers can be classified by four measurements made on them: the length and width of the sepal and petal.
What you will learn
Tutorial E contains the following parts:
• Make models of different classes
• Classify new data
• Diagnose the classification model
References:
• Principal Component Analysis (PCA) overview
• Classification
• SIMCA Classification
Data table
Import the Tutorial E data set used in this tutorial (the file Tutorial_E in the Examples folder).
The data contains 75 training (calibration) samples and 75 testing (validation) samples.
The training samples are divided into three Row (Sample) ranges, each containing 25 samples. The three Sets are: Setosa, Versicolor, and Virginica. The row set Testing will later be used to test the classification.
Four variables are measured: Sepal length, Sepal width, Petal length, and Petal width. The measurements are given in centimeters. These four variables are collectively defined as the column set Iris properties.
Reformat the data table
Whenever working with classification, it is very useful to identify samples belonging to the same class under all circumstances – in the raw data table and on PCA or classification plots.
In order to do this, we need to create a categorical variable stating class membership for all samples.
Task
Insert a categorical variable into the Tutorial_E data table.
How to do it
Open the file Tutorial_E from the Examples folder.
Select the first column in the editor and select Edit - Insert - Category Variable…. This opens a dialog that asks how to define the levels.
First enter a name for the variable: “Iris type”.
Then select the second option: Specify levels to be based on a collection of row sets.
Then select, one by one, the three row ranges “Setosa”, “Versicolor” and “Virginica”, and add them to the list of levels using the Add button.
Category variable dialog
Now a new column has been created “Iris type” containing the appropriate value for each sample in each cell of the column.
Data table with category variable “Iris”
Graphical clustering
It is always a good idea to start a classification with some exploratory data analysis. You can run a PCA model and/or a hierarchical clustering of all samples. If you do not know the classes in advance, this is a way of visualizing the clustering. The calibration samples must then be assigned to the different classes, to give a sense of whether a classification model can be developed.
Graphical clustering based on hierarchical clustering
Task
Perform hierarchical clustering of all calibration samples.
How to do it
Use Tasks - Analyze - Cluster Analysis… and select the following parameters:
Model inputs
• Matrix: Tutorial_E
• Rows: Training
• Columns: Iris properties
• Number of clusters: 3
• Clustering method: Hierarchical Complete-linkage
• Distance measure: Squared Euclidean
In the options tab, you can assign samples to the initial clusters, but for this exercise, we will make a completely unsupervised cluster analysis.
Click OK for the Cluster analysis to run.
When the clustering is complete a dialogue asking if you want to view the plots will appear. Click Yes.
The Dendrogram showing the clustering of samples will be displayed. Notice that three clusters are identified, but they are not all of equal size. All the results are in a new Cluster analysis node in the project.
Dendrogram: Complete-linkage squared Euclidean distance
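Complete-linkage clustering with squared Euclidean distances can be sketched without any special library: repeatedly merge the two clusters whose farthest-apart members are closest. A naive sketch on synthetic 2-D data standing in for the iris measurements:

```python
import numpy as np

def complete_linkage(X, n_clusters):
    """Naive agglomerative clustering: complete linkage, squared Euclidean distance."""
    clusters = [[i] for i in range(len(X))]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise sq. distances
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance = max pairwise member distance
                dist = max(d2[p, q] for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

# Three well-separated synthetic groups of 10 samples each
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc, 0.3, size=(10, 2)) for loc in ([0, 0], [4, 0], [0, 4])])
labels = complete_linkage(X, 3)
```

This is only an illustration of the algorithm; the software additionally produces the dendrogram and the per-cluster row sets described above.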
Open the Results folder for the cluster analysis, and expand the levels so that you see the different row sets; one has been defined for each cluster.
Cluster analysis results in project navigator view
By looking at the row sets, one can see that the Setosa samples are all assigned to one cluster, and that there is a small cluster that contains only Virginica samples, but a larger group has a mix of both Virginica and Versicolor samples. These results suggest that, based on the four variables provided for these irises, an unambiguous classification may be difficult.
Graphical clustering based on score plots
Task
Make a PCA model of all calibration samples.
How to do it
Use Tasks - Analyze - Principal Component Analysis… and select the following parameters:
Model inputs
• Matrix: Tutorial_E
• Rows: Training
• Columns: Iris properties
• Maximum components: 4
• Keep the default ticks in the boxes Mean center data and Identify Outliers
Weights
On the weights tab, select all the variables by highlighting them, and setting the weight by selecting the correct radio button.
• Weights: 1/SDev
Click Update.
Validation
Proceed to the Validation tab to set the validation.
• Validation Method: Cross validation
You can now click OK for the PCA to run.
We assume that you are familiar with making models by now. Refer to one of the previous tutorials if you have trouble finding your way in the PCA dialog.
When the model is built a dialogue asking if you want to view the plots will appear. Click Yes.
The PCA Overview, consisting of the plots of the scores, loadings, influence and explained variance, will be displayed. All the results are in a new PCA node in the project.
Activate the explained variance plot in the lower right quadrant and click on the Cal button on the toolbar so that only Validation variance remains on the plot.
Explained validation variance
We see that the Explained Validation Variance is 91% with 2 PCs.
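The 1/SDev weighting plus mean centering amounts to autoscaling, and the explained variance per component comes straight out of a singular value decomposition. A sketch on synthetic data (not the iris measurements):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4)) * [3.0, 2.0, 1.0, 0.5]   # 25 samples, 4 variables

# Autoscale: mean center, then weight each variable by 1/SDev
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()        # fraction of variance per component
cum2 = explained[:2].sum()                 # cumulative explained variance with 2 PCs
```

Note this computes calibration variance only; the 91% quoted above is a cross-validated figure, which requires refitting the model with samples left out.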
Activate the score plot and right click to select sample grouping. Select the row sets for the Setosa, Versicolor and Virginica. Click OK.
Score plot with sample grouping
You can see the three groups in different colors; one very distinct (Setosa) and two that are not so well separated (Versicolor and Virginica). This indicates that it may be difficult to differentiate Versicolor from Virginica in an overall classification model.
Make class models
Before we classify new samples, each class must be described by a PCA model. These models should be made independently of each other. This means that the number of components must be determined for each model, outliers found and removed separately, etc.
Task
Make PCA models for the three classes Setosa, Versicolor, and Virginica.
How to do it
Go back to the Editor window containing your reformatted data table.
Select the first 25 samples, corresponding to the Setosa samples, and create a new range by right clicking on the selected data and selecting the menu Create Row Range. Do the same with the next 25 samples, corresponding to the Versicolor samples, and with samples 51 to 75, corresponding to the Virginica samples.
Create Range menu
Rename each range with a name reflecting the samples it contains using a right click on the row set and select Rename.
Rename row set menu
Select Tasks - Analyze-Principal Component Analysis… and make the first PCA model for Setosa with the following parameters:
Model Inputs
• Matrix: Tutorial_E
• Rows: Setosa
• Cols: Iris properties
• Maximum components: 4

Weights: 1/SDev

Validation: Leverage correction
When the model is computed, view the plots. In the project navigator rename the PCA class model to “PCA Setosa” by highlighting the new PCA node, right clicking and selecting Rename.
Rename menu
Repeat the procedure successively on Row Sets Versicolor and Virginica, also renaming each new PCA model.
Classify unknown samples
When the different class models have been made and new samples are collected, it is time to assign them to the known classes. In our case the test samples are already in the data table, ready to use.
Task
Assign the Sample Set Testing to the classes Setosa, Versicolor, and Virginica.
How to do it
Select Tasks - Predict- Classification - SIMCA….
Menu Tasks - Predict- Classification - SIMCA…
Use the following parameters:
• Matrix: Tutorial_E
• Rows: Testing
• Columns: Iris properties
Make sure that Centered Models is checked. Add the three PCA class models Setosa, Versicolor, and Virginica.
SIMCA classification dialog
The suggested number of PCs to use is 3 for all models; keep that default (it is based on the variance curve for each model).
Click OK to start the classification.
Interpretation of classification results
The classification results are displayed directly in a table, but you may also investigate the classification model closer in some plots.
Interpret the classification table
Task
Interpret the classification results displayed in the SIMCA results.
How to do it
Click View when the classification is finished.
A table plot is displayed, called Classification membership. There are three columns: one for each class model.
Samples “recognized” as members of a class (they are within the limits on sample-to-model distance and leverage) have a star in the corresponding column.
SIMCA classification table
The significance level can be changed with the Significance option, available on the menu bar.
At the 5% significance level, we can see that all but three samples (false negatives: virg1, virg36, virg42) are recognized by their rightful class model.
However, some samples are classified as belonging to two classes (false positives): 12 Versicolor samples are also classified as Virginica, while 6 Virginica samples are also classified as Versicolor. Only the Setosa samples are 100% correctly classified (no false positives, no false negatives).
If you raise the significance level to 25%, this reduces the number of false positives but also increases the number of false negatives (vers41 and virg35 come in addition).
Interpret the Cooman’s plot
If a sample is doubly classified, you should study both Si (sample-to-model distance) and Hi (leverage) to find the best fit; at similar Si levels, the sample is probably closest to the model to which it has the smallest Hi. The classification results are well displayed in the Cooman’s plot.
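The Si distance underlying these decisions is the residual standard deviation of a sample after projection onto a class PCA model; a sample is accepted if Si is not significantly larger than the class's own residual level S0 (the software uses an F-test for the limit). A simplified sketch on synthetic data, with the degrees-of-freedom corrections omitted:

```python
import numpy as np

def fit_class_model(X, n_comp):
    """Mean-centered PCA class model via SVD; returns mean, loadings, and pooled residual S0."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    P = Vt[:n_comp].T                        # loadings (variables x components)
    E = (X - mu) - (X - mu) @ P @ P.T        # calibration residuals
    s0 = np.sqrt((E ** 2).sum() / E.size)    # simplified pooled residual level
    return mu, P, s0

def si_distance(x, mu, P):
    """Residual (sample-to-model) distance of one sample."""
    r = (x - mu) - (x - mu) @ P @ P.T
    return float(np.sqrt((r ** 2).mean()))

# Synthetic one-component class: samples along one direction plus small noise
rng = np.random.default_rng(5)
t = rng.normal(size=(20, 1))
cls = t @ np.array([[1.0, 0.8, 0.5, 0.2]]) + rng.normal(0, 0.05, size=(20, 4))
mu, P, s0 = fit_class_model(cls, n_comp=1)

inside = si_distance(cls[0], mu, P)                             # a calibration sample
outside = si_distance(np.array([5.0, -5.0, 5.0, -5.0]), mu, P)  # clearly foreign sample
```

The Hi (leverage) part of the decision is the complementary within-model distance, analogous to the Hotelling T² check used for predictions earlier.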
Task
Look at the Cooman’s plot.
How to do it
Under the SIMCA/Plots node choose the Cooman’s plot. You can change which classes it displays from the toolbar; set it to display the models Virginica and Versicolor.
This plot displays the sample-to-model distance for each sample to two models. The newly classified samples (from sample set Testing) are displayed in green color, while the calibration samples for the two models are displayed in blue and red.
Cooman’s plot for Versicolor vs. Virginica
The Cooman’s plot for the classes Virginica and Versicolor shows that all Setosa samples are far away from the Virginica model (they appear far to the right). However, we can see that many Virginica and Versicolor samples are within the distance limits for both models. This suggests some classification problems.
Interpret the Si vs Hi plot
We also have to look at the distance from the model center to the projected location of the sample, i.e. the leverage. This is done in the Si vs. Hi plot.
Task
Look at the Si vs. Hi plots.
How to do it
Under the SIMCA/Plots node choose the Si vs. Hi plot, and set it for the model Versicolor using the arrows on the toolbar. Before interpreting the plot, turn on sample grouping by right clicking in the plot window and selecting the Sample Grouping option. In the Sample grouping & marking dialog, select the row sets Setosa, Versicolor and Virginica. The point labels can be shortened to the first two characters of each sample name: right click and select Properties, then select Point Label in the left list to open the Point Label dialog. Select the radio button Name, and under Label layout use the Show drop-down list to select First and enter 2 in the Number of characters box, as shown in the dialog.
Point layout dialog
This then provides a plot which is much easier to interpret: the iris type appears clearly, with the initials Se, Ve, Vi in three different colors.
Si vs Hi plot for the model Versicolor
Some Virginica samples are classified as belonging to the class Versicolor, but most samples that are not Versicolor are outside the lower left quadrant. The reason for the difficult classification between Versicolor and Virginica is that the samples are overlapping in the score plot. They are very similar with respect to the sepal and petal width.
Diagnosing the classification model
In addition to the Cooman’s and Si vs Hi plots, there are three more plots that give us information regarding the classification.
Interpret model-to-model distance
Task
Look at the Model Distance plots.
How to do it
Under the SIMCA/Plots node choose the Model Distance plot, and set it for the model Versicolor using the arrows on the toolbar. Change it to a bar chart using the toolbar shortcut.
Model distance for Versicolor model
This plot allows you to compare the class models. A model-to-model distance larger than three indicates good class separation. It is clear from this plot that the Setosa model is well separated from Versicolor, with a distance close to 10, while the distance to Virginica is smaller.
Interpret discrimination power
Task
Look at the Discrimination Power plots.
How to do it
Under the SIMCA/Plots node choose the Discrimination Power plot. Using the arrows on the toolbar, choose the discrimination power for Versicolor projected onto the Setosa model.
This plot tells which of the variables are most useful in describing the difference between the two types of iris.
Discrimination power: Versicolor onto Setosa
We can see that variables sepal length and sepal width have high discrimination powers between these classes, while it is lower for the petal length and width.
Do the same for Versicolor onto Virginica: all variables have discrimination powers around 3. This is obviously not enough to completely discriminate these classes.
Interpret modeling power
Task
Look at the Modeling Power plots.
How to do it
Under the SIMCA/Plots node choose the Modeling Power plot for Versicolor.
Variables with a modeling power near one are important for the model. A rule of thumb says that variables with modeling power less than 0.3 are of little importance for the model.
Modeling power for Versicolor
The plot tells us that all variables have a modeling power larger than 0.3, which means that all variables are important for describing the model. None of the variables should be deleted from the modeling. The only chance to improve on the classification between Versicolor and Virginica is to measure some additional variables.
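The modeling power statistic itself can be sketched from the class model's residuals. This is a hedged illustration on made-up data; conventions for the degrees of freedom in the two standard deviations vary between implementations.

```python
import numpy as np

def modeling_power(X_class, n_comp):
    """Per-variable modeling power: 1 - s_residual / s_total (simplified)."""
    center = X_class.mean(axis=0)
    Xc = X_class - center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp].T
    E = Xc - Xc @ P @ P.T                  # residual matrix after n_comp PCs
    s_resid = E.std(axis=0, ddof=1)        # residual std dev per variable
    s_total = Xc.std(axis=0, ddof=1)       # total std dev per variable
    return 1.0 - s_resid / s_total

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) * [3.0, 2.0, 1.0, 0.5]   # hypothetical class data
mp = modeling_power(X, n_comp=2)
print(mp)
```

Variables with values near 1 are well described by the class model; values below about 0.3 indicate little importance, matching the rule of thumb above.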
Tutorial F: Interacting with other programs
• Description
 o What you will learn
 o Data table
• Import spectra from an ASCII file
• Import responses from Excel
• Create a categorical variable
• Append a variable to the data set
• Organizing the data
• Study the data before modeling
 o Plot spectral data
 o Basic statistics on data
• Make a PLS Model
 o Interpretation of the Regression Overview
 o Customizing plots and copying them into other programs
• Save PLS model file
• Export ASCII-MOD file
• Export data to ASCII file
Description
It is not uncommon to use The Unscrambler® together with other programs in one’s daily work. This could be a word processor used to document the latest work, or instrument software.
This tutorial shows some of the capabilities The Unscrambler® has to interact with other programs under the Windows operating system. The main focus here is how The Unscrambler® is used in conjunction with other software.
What you will learn
Tutorial F contains the following parts:
• Import data file;
• Drag and drop from other programs;
• Insert categorical variable;
• Edit plots and insert into another program;
• Save models for use in The Unscrambler® Online Predictor and The Unscrambler® Online;
• Write an ASCII-MOD file.
References:
• Basic principles in using The Unscrambler®
• Importing data into The Unscrambler®
• About Regression methods
• Customizing Plots
• Exporting data from The Unscrambler®
Data table
The data are NIR spectra of wheat samples collected at a mill. Fifty-five samples were collected, and their NIR spectra were measured on an instrument using 20 channels.
The water content of wheat samples was measured by a reference method and is the response variable in the data. These values are stored in a separate file.
Click the following links to save the data files to be used in this tutorial:
• Tutorial F data set: Spectra
• Tutorial F data set: Responses
Import spectra from an ASCII file
Data are stored in many different ways. The simplest and most flexible way is to store data in ASCII files.
Task
Import the “Tutorial_F_spectra.csv” ASCII data file.
How to do it
Start The Unscrambler® and go to File – Import data – ASCII…. Locate the file “Tutorial_F_spectra.csv” in the browser and click Open.
This launches the Import ASCII dialog, where you specify what the ASCII file looks like. Use the options displayed in the dialog. Note that the first row in the data file contains variable names and the first column contains sample names. The separator for the data is a comma. Check the boxes Process double quotes and Treat consecutive separators as one.
ASCII Import Dialog
Click OK to import the file and the data are read into The Unscrambler®, creating a data table called “Tutorial_F” in the project.
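Outside The Unscrambler®, a file with this layout (comma-separated, variable names in the first row, sample names in the first column, names in double quotes) can be read with pandas. The snippet below builds a small stand-in for the tutorial file so that it runs on its own; the sample and channel names are made up.

```python
import io
import pandas as pd

# Stand-in for Tutorial_F_spectra.csv: first row = variable names,
# first column = sample names, comma as separator, names double-quoted.
csv_text = (
    '"Sample","1680","1940","2100"\n'
    '"wheat1",0.41,0.92,0.55\n'
    '"wheat2",0.39,0.88,0.53\n'
)

spectra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(spectra.shape)                  # 2 samples x 3 wavelength channels
print(spectra.loc["wheat1", "1940"])
```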
Import responses from Excel
Spreadsheet applications are commonly used for storing data. It is easy to transfer data between such a program and The Unscrambler®. The water content of the wheat samples is stored in an Excel file together with the sample names.
Task
Import the water values from the Excel data file “Tutorial_F_responses.xls” into the existing data table.
How to do it
There are two procedures. Use procedure 1 if you have Microsoft Excel or another spreadsheet application installed on your computer or procedure 2 if you do not have a spreadsheet program that can read the file “Tutorial_F_responses.xls”. You only need to follow one of the procedures.
We will begin by appending a column to the existing data table. Put the cursor in the data viewer and select Edit – Append, and in the dialog, enter 1 to add a single column.
1. Copy and paste from Excel
Launch Microsoft Excel and open the file “Tutorial_F_responses.xls”. Copy the values from the column water, and paste them into the empty column that you appended in data matrix “Tutorial F”.
2. Import data from the Excel file
From File – Import data – Excel…, select “Tutorial_F_responses.xls” from the location and click Import.
In the project navigator you will find the two data matrices which you imported from the ASCII and Excel files, respectively. Rename the matrices by selecting them, right clicking and choosing Rename; rename them as Wheat NIR Spectra and water content.
Data matrices in the Navigator
We could leave the response Y values (water content) in a separate matrix and do the analysis from these two matrices. But for consistency of data organization in this exercise, we will copy the values from the Water content matrix into the empty column (21) that we appended to the data matrix “Wheat NIR Spectra”.
Create a categorical variable
Categorical variables are useful to calculate statistics and to use in plot interpretation.
Task
Insert a variable to group the samples into three categories, depending on the water content level.
How to do it
Place the cursor in the first column and select Edit – Insert… and insert one empty column. Then use copy (Ctrl+C) - paste (Ctrl+V) to copy the water content data into the new column.
Rename the column as “Water levels”.
Then select the “Water levels” column and go to the menu Edit – Change Data Type and select Categorical.
Edit – Change Data Type - Categorical menu
The category converter dialog appears. Select the option New levels based upon ranges of values. Add three levels by entering 3 for the Desired number of levels, and specify the following ranges manually:
• Low (Water < 13.0),
• Medium (13.0 ≤ Water ≤ 15.0), and
• High (Water > 15.0).
Category Converter menu
The column of the categorical values is orange to distinguish this kind of variable from the ordinary ones.
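The same three-level split can be sketched with pandas; the cut points come from the tutorial, while the water values and the bin edge conventions here are made up for illustration.

```python
import pandas as pd

water = pd.Series([12.1, 13.4, 14.9, 15.2, 16.0], name="Water")  # made-up values

# Bin into Low (< 13.0), Medium (13.0 - 15.0), High (> 15.0)
levels = pd.cut(
    water,
    bins=[-float("inf"), 13.0, 15.0, float("inf")],
    labels=["Low", "Medium", "High"],
)
print(list(levels))   # ['Low', 'Medium', 'Medium', 'High', 'High']
```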
Data after insertion of a category variable
Append a variable to the data set
Sometimes it is interesting to have all the information in only one data table.
Task
Append a variable to have the NIR spectra and the water content in the same table.
How to do it
Place the cursor in the last column and select Edit – Append… and append one empty column. Then use copy (Ctrl+C) - paste (Ctrl+V) to copy the water content data into the new column.
Rename the column as “Water”.
Organizing the data
Most of the time, you will want to work on subsets of your data table. To do this, you must define ranges for variables and samples. One Sample Set (Row range) and one Variable Set (column range) make up a virtual matrix which is used in the analysis.
Task
Define the Column ranges (variable sets) “Level”, “Water content” and “NIR Spectra”.
How to do it
Choose Edit - Define Range… to create sample sets and variable sets by defining Rows and Columns, or right click after selecting Rows (samples) or Columns (variables) and choose Create Row Range or Create Column Range, respectively.
We begin by defining the column range for the water content by highlighting column 22, and going to Edit - Define Range. This opens the Define range dialog, where we determine the column range Water, entering this name for Column.
Define Range Dialog
Do the same then to define the column range for “level” in column 1, and “NIR Spectra” in columns 2-21.
The list of defined data ranges are found in the project navigator as nodes under the data matrix.
Project navigator with data sets defined
Go to File-Save As… to save the project as Tutorial F.
Study the data before modeling
In any analysis, it is advisable to begin by familiarizing yourself with the data. We should plot data to see if there are any obvious patterns or problems with the data. Does it look as we expect? Are there outliers? From looking at the raw data, we may also be able to see if we should apply a transform to the data. We can also look at the statistics on the data, to get an understanding of the distributions in the data.
Plot spectral data
The NIR data used here are collected at 20 wavelengths using a filter instrument, so they do not give a complete spectrum. Regardless, it is still advisable to plot the data to get an understanding of it. Select the column set NIR Spectra in the project navigator. Right click and select Plot - Line to get the plot shown below. In the plot, we can see that the strongest absorbance peak is at 1940 nm, where the OH vibration for water is found in the NIR spectrum. There is now a new entry in the project navigator for the Line plot. You can rename this by right clicking and choosing Rename.
Line Plot of Spectral Data
Basic statistics on data
We can check the statistics of our data as well. This can be done for all the spectral data, and for the response variable. Here we will compute the statistics for the water content values. We begin by plotting a histogram, which shows the distribution of values. When we are developing a calibration, we would like to have an even distribution of the response values over the calibration range where we will be operating. Highlight the column “Water” and go to Plot-Histogram to get the following plot. The line for a normal distribution is superimposed on the plot, and the statistics for this sample set are displayed.
Histogram plot of water content
We can also compute the statistics without the plot by going to Tasks-Analyze-Descriptive Statistics…. In the dialog, select all the rows and the column “Water”, and click OK. When the computation is complete, say Yes to see the plots now. A quantiles plot and a mean and standard deviation plot are displayed. If you had more than one variable, the plots would show results for all the variables. A new node, “Descriptive Statistics”, has been added to the project navigator. It has subfolders containing the raw data, results, and plots of the statistical analysis. Expand the folder “Results” and select the matrix “Statistics” to see the numerical results.
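The same kind of summary numbers (count, mean, standard deviation, min, quartiles, max) can be reproduced with pandas, using made-up water values:

```python
import pandas as pd

water = pd.Series([12.8, 13.5, 14.1, 14.7, 15.3, 15.9], name="Water")  # made-up

stats = water.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(stats["mean"], stats["std"])
print(stats["25%"], stats["50%"], stats["75%"])
```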
Statistics on water content
Make a PLS Model
The NIR spectra should contain information which makes it possible to predict the water content from them. Let us make a model and find out.
Task
Make a PLS model from NIR spectra to measure the Water Content.
How to Do It
Select Task - Analyze - Partial Least Squares Regression and specify the following parameters in the Regression dialog:
Model inputs:
• X: NIR Spectra (55x22)
• X Rows: All
• X Cols: Spectra
• Y: Water content (55x1)
• Y Rows: All
• Y Cols: All
• Maximum number of components: 5
If not already done, check the boxes Mean center data and Identify outliers.
Go to the X weights and Y weights tabs to verify that these are all set to 1.0 (the default setting). On the Validation tab, select Cross validation.
PLS Dialog
Click OK to launch the calculations.
Click Yes when the calculations are finished, and the prompt appears to view plots now. The PLS Overview plots are displayed. A new node is also added to the project navigator with all the PLS results. This has four folders with the raw data, results, validation, and plots for the PLS model.
Interpretation of the Regression Overview
The most important PLS analysis results are given in the regression overview plot. This has the plots Scores, X and Y loadings, Explained variance, and Predicted vs Measured displayed as the default.
Task
Look at the model results.
How to do it
Study the PLS regression overview plots in the viewer.
PLS Overview Plots
The Scores plot shows that the samples are scattered in the model space with no evidence of groupings, and that the first two factors explain 92% and 8% of the variance in the data, respectively. The explained X-variance increases nicely and is close to 100% after two factors (PCs). The Predicted vs Measured plot shows a good fit. The info box in the lower left panel of the display indicates that two factors are optimal for this model.
Another very useful plot is of the regression coefficients. Activate the upper-right quadrant and right click to go to PLS-Regression coefficients - Raw coefficients (B) - Line. From the regression coefficients one can see that there is a distinct peak around 1940 nm, as expected, since this is where the water absorbance peak is located in the NIR spectrum.
Raw Regression Coefficients
Save the project. All the results and plots that have been generated will be part of the saved project.
Customizing plots and copying them into other programs
In data analysis and research work, it is critical to provide documentation of the results. Sometimes it may be necessary to transfer plots from The Unscrambler® into a word processor.
Task
Customize plots within The Unscrambler®, and transfer plots from The Unscrambler®, using Copy and Paste.
How to do it
Select the score plot in the regression overview, and right click to choose Properties, which gives options for customizing a plot.
Change the plot heading name, as well as the font used for it.
Annotations can be added to a plot by right clicking and selecting Insert Draw Item…, or from the shortcut keys on the toolbar.
When the plot has been customized, it can readily be saved or copied into another application. Right click and select Copy to copy just the highlighted plot, or Copy All to copy all four overview plots. Go to another program and place the cursor where the plot is to appear in the document. Select Edit - Paste. The plot is now inserted as a graphical object in the other document.
The plot can also be saved as a picture file, which will usually give better quality but also larger files. Highlight a plot, then right click and select Save As… to save the plot in a choice of graphics image file formats, such as EMF or PNG.
Save as options
Save PLS model file
Task
Save just the PLS model file, giving a smaller file with just the model information that can be used for predicting new samples using The Unscrambler® Online Predictor and The Unscrambler® Online.
How to do it
To do so right click on the model in the Navigator and select the option Save Result.
Save result
Rename the model as needed and click on Save.
Export ASCII-MOD file
Task
Export an ASCII-MOD file.
How to do it
Go to File - Export menu.
File - Export menu
Select ASCII-MOD to open the dialog:
ASCII-MOD Dialog
Verify that the correct model is selected, and the correct number of factors. It is possible to select two types of model:
• Full
• Regr.Coef. only: corresponding to only the regression coefficients
Take a look at the ASCII file that is generated, which has the file name extension .AMO. The format of the file is described in the ASCII-MOD Technical Reference.
Export data to ASCII file
A common file format that most programs read is the simple ASCII file. There are different ways of writing the ASCII file. Determine the format needed based on the requirements of other programs that will be used to read the ASCII files.
Task
Write the Wheat NIR Spectra data table to an ASCII file.
How to do it
Select the Wheat NIR Spectra table and select File - Export - ASCII. Use only the columns of the NIR Spectra, by choosing this column set from the drop-down list. Make sure that the item delimiter is comma, as suggested in the Export ASCII dialog.
Export ASCII Dialog
Provide a file name and location when prompted. Open the file in an ASCII editor and look at it. All names are enclosed in double quotes.
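Writing a comparable comma-separated file, with all names enclosed in double quotes, can be sketched with pandas (the table contents here are made up):

```python
import csv
import io
import pandas as pd

spectra = pd.DataFrame([[0.41, 0.92], [0.39, 0.88]],
                       index=["wheat1", "wheat2"],
                       columns=["1680", "1940"])   # made-up values

buf = io.StringIO()
# QUOTE_NONNUMERIC puts double quotes around the sample and variable names
spectra.to_csv(buf, quoting=csv.QUOTE_NONNUMERIC)
print(buf.getvalue())
```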
Tutorial G: Mixture design
• Description
 o What you will learn
 o Data table
• Design variables and responses
• Build a simplex centroid design
• Import response values from Excel
• Check response variations with statistics
• Model the mixture response surface
• Conclusions
Description
This application, inspired by an example in John A. Cornell’s reference book “Experiments With Mixtures”, illustrates the basic principles and specific features of mixture designs.
A fruit punch is to be prepared by blending three types of fruit juice:
• watermelon,
• pineapple, and
• orange.
The purpose of the manufacturer is to use their large supplies of watermelons by introducing watermelon juice, of little value by itself, into a blend of fruit juices. Therefore, the fruit punch has to contain a substantial amount of watermelon - at least 30% of the total. Pineapple and orange have been selected as the other components of the mixture, since juices from these fruits are easy to get and relatively inexpensive.
The manufacturer decides to use experimental design to find out which combination of those three ingredients maximizes consumer acceptance of the taste of the punch.
What you will learn
This tutorial contains the following parts:
• Build a suitable design for a mixture optimization;
• Import response values from Excel;
• Check response variations with Statistics;
• Analyze the results with PLS and Martens’ Uncertainty Test.
References:
• Mixture designs
• Data import from a spreadsheet
• Descriptive statistics
• Analysis of mixture design results
• Martens’ Uncertainty Test
Data table
The data in this exercise consist of two parts:
1. The design table, which will be created in the tutorial.
2. Measured responses: sensory data (acceptance, sweetness, bitterness, fruitiness of the juice) as well as an economic factor, the cost of production.

We begin by setting up the design in The Unscrambler®, and then will import the response variables from a separate table.
Design variables and responses
The ranges of variation selected for the experiment are as follows:
Ranges of variation for the fruit punch design
Ingredient Low High
Watermelon 30% 100%
Pineapple 0% 70%
Orange 0% 70%
This defines a simplex.
The responses of interest for the manufacturer are detailed in the table below.
Responses for the fruit punch design
Variable Type of Measurement Target
Consumer acceptance Average of 63 individual ratings on a 0-5 scale Maximum
Production cost Computed from mixture composition and raw material cost Minimum
Sweetness Average ratings by sensory panel on a 0-9 scale Descriptive only
Bitterness Average ratings by sensory panel on a 0-9 scale Descriptive only
Fruitiness Average ratings by sensory panel on a 0-9 scale Descriptive only
Consumer acceptance is the most important response, but if the analysis of the results should reveal two areas with equally high consumer acceptance, the mixture with lower production cost will be preferred. The sensory descriptors provide an explanation of the consumer acceptance based on some properties, and provide directions for further improvement (for instance by adding sugar or sweetener if the consumers seem to prefer sweeter mixtures).
Build a simplex centroid design
Since there are only three design variables, it is possible to build an optimization design right away. For a mixture, the most suitable optimization design is a simplex centroid design.
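The point pattern of a simplex centroid design can be sketched directly: for q mixture components it consists of the pure vertices, the equal binary blends, and the overall centroid (an augmented design adds interior axial points as well). The fragment below illustrates the pattern in 0-1 pseudocomponent coordinates; it is an illustration, not The Unscrambler®'s generator.

```python
from itertools import combinations

def simplex_centroid(q):
    """Simplex centroid points for q mixture components (proportions sum to 1)."""
    points = []
    for k in range(1, q + 1):                 # blends of k components
        for subset in combinations(range(q), k):
            p = [0.0] * q
            for i in subset:
                p[i] = 1.0 / k                # equal shares among the subset
            points.append(tuple(p))
    return points

design = simplex_centroid(3)
for p in design:
    print(p)
# 3 vertices + 3 binary midpoints + 1 overall centroid = 7 points
```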
Task
Build a simplex centroid design with the help of the design experiment wizard, Insert – Create design….
How to do it
Use Insert – Create design… to start the Design Experiment Wizard. The first tab is the Start tab, where you enter the name of the design and the goal of the experimentation. It is also possible to add additional information in the description field. Enter “Punch” as a name for the design and select Optimization as the goal.
Start tab for the Punch experiment
Go to the next tab: Define variables. Specify the variables as shown in the table hereafter:
Variables to define
ID Name Analysis Type Constraints Type of levels Level range
A Watermelon Design Mixture Continuous 30-100
B Pineapple Design Mixture Continuous 0-70
C Orange Design Mixture Continuous 0-70
1 Acceptance Response - - -
2 Cost Response - - -
3 Sweet Response - - -
4 Bitter Response - - -
5 Fruity Response - - -
Do this by clicking the Add button and editing the Variable editor including the level range for the design variables. Validate by clicking OK and enter the next variable by clicking Add again.
Variables involved in the design
Go to the next tab: Choose the Design.
There is already a type of design that has been selected: Mixture design. Validate this choice by going to the next tab.
Choose the design for the Punch experiments
Go to the next section: Design Details
Look at the Description table. The design needed is the Simplex centroid, as it is the only design suitable for optimization. To better cover the design space, add some more experiments by ticking the option Augmented design.
Design details: Simplex centroid
Go to the next tab: Additional Experiments. There is no need to replicate the design samples so the Number of replications is kept at its default value: “1”.
By default there are “3” center samples. This is enough for the purpose of this experiment.
Additional experiments tab
There is no need to add reference samples, so just proceed to the next tab, Randomization. No further adjustments are needed in this tab, but try the different settings to get familiar with the options.
Randomization tab
In the Summary tab, the table on the right presents a summary of the information on the design.
On the left part of the tab you can calculate the power of the design if you know two types of information for the responses:
• the standard deviation of the response variables: Std.dev. • the minimum difference to be detected: Delta
Summary tab
Go to the final tab, Design Table. Here the data table is presented with several view options; check them out to familiarize yourself with the options.
Design table tab for the punch experiments
Once all necessary checks and corrections have been made, click the Finish button.
Now the data tables appear in the Navigator. There are two tables: one for the design variables, and another for the responses. The response table is empty until you fill in the values, which you will do later. The design matrix is already organized into row and column sets according to the types of samples (design, center, etc.) and effects.
The design tables in the Navigator
It is possible to view the data in different ways:
• To change the order from the standard sample sequence to the experiment sample sequence click on column randomized, and select Edit - Sort - Descending.
• To change from the actual values to the level values click on the table and then View - Level indices.
Save the new project with File - Save and specify a name such as “Punch Optimization”.
Import response values from Excel
The responses for all samples are stored in an Excel spreadsheet. These can be imported directly as a separate matrix which can then be copied into the Punch_response matrix.
Task
Open the Excel table that has the response values and copy them into the response data table.
How to do it
Go to File - Import Data - Excel…, select the Excel file “Tutorial_G.xls” and click Open.
Then in the Excel Preview window, select the sheet “Responses”, and select the 5 responses:
• Accept
• Cost
• Sweet
• Bitter
• Fruity
Excel Preview
Click on OK.
Imported response data
Look at the order: it is very important that the tables “Punch_Responses” and “Tutorial_G” match in their sample order. “Punch_Responses” should be in the standard order.
Select all the data in “Tutorial_G” and copy them using right click and the option Copy, or the shortcut Ctrl+C. Then paste them into the “Punch_Responses” table: place the cursor in the first cell and use right click and the option Paste, or the shortcut Ctrl+V.
Check response variations with statistics
Run a first analysis – Statistics, and interpret the results with the following questions in mind:
• How much does each response vary?
• Is there more variation over the whole design than over the replicated Center samples?
• Is there any response value outside the expected range?
Task
Run Statistics, display the results as plots, check response variations and look for abnormal values.
How to do it
With the Punch_Response data table displayed in the Editor, select Task - Analyze - Descriptive Statistics.
Choose the following settings in the Statistics dialog:
• Data Matrix: Punch_Response (12x5)
• Data Row: All
• Data Cols: All
• Compute correlation matrix: ticked
then click OK to start the computations.
Descriptive statistics dialog box
Click Yes to view the results. The Statistics results are displayed as two plots. The upper plot is Quantiles, the lower Mean and SDev.
Let us have a look at the upper plot: Quantiles.
Right click on the plot and select View - Numerical View to display the min, max, median, Q1 and Q3 for each response. Check that the ranges of variation are within the expected range for that response (0-5 for Acceptance, 0-3 for Cost and 0-9 for the sensory responses on flavor).
Now display the same two plots for design samples and center samples, in order to compare variation over the whole design to variation over the replicated Center samples. If the experiments have been performed correctly, there should be much more variation among design points than among the three replicates of the Center sample.
Return to the graphical view (View - Graphical view).
Right click on the plot and select Sample Grouping. A dialog box opens. Select the sets Center samples and All design samples from the matrix Punch_Design.
Sample grouping and marking for the statistics
Note: It is possible to edit the color of the bars.
Click OK.
To display the legend, right click on the plot and select Properties. Go to legend and tick Visible.
Properties - Legend
Click OK.
Quantiles plot with sample grouping
The quantiles plot is now displayed for three groups. The bars or boxes for all samples appear in blue, for design samples in red, and for center samples in green (unless a different color scheme has been designated under Properties). On the quantiles plot, one can see that there is much more variation among design points than among the center samples. This also appears clearly on the Mean and SDev plot when the sample grouping is added. For instance, if you click successively on the blue and red bars for the variable Acceptance, you will see that the SDev is 0.75 for design samples and only 0.25 for center samples.
Conclusions
The ranges of variation of the 5 responses are as expected.
There is no abnormal value for any response.
There is much more variation over the whole design than among the center samples, which suggests that the experiments were performed correctly.
Model the mixture response surface
The next step after checking the data is to model the responses. Here we want to study the quantitative relationships between fruit punch composition and consumer acceptance, production cost and the sensory properties of the mixtures.
There are two ways of analyzing the data: either each response variable individually with the Scheffé formula or the responses as a whole with PLS regression.
In both cases, the results will be interpreted by plotting a Response Surface for each response variable.
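For three components the Scheffé special cubic model has the form y = b1·x1 + b2·x2 + b3·x3 + b12·x1x2 + b13·x1x3 + b23·x2x3 + b123·x1x2x3, with no intercept because the proportions sum to 1. A least-squares fit of this model can be sketched as follows, using made-up design points and response values:

```python
import numpy as np

def scheffe_special_cubic(X):
    """Expand 3-component mixture proportions into special cubic model terms."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([x1, x2, x3,
                            x1 * x2, x1 * x3, x2 * x3,
                            x1 * x2 * x3])

# Hypothetical simplex centroid points (proportions sum to 1) and a response
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [.5, .5, 0], [.5, 0, .5], [0, .5, .5],
              [1/3, 1/3, 1/3]])
y = np.array([3.0, 4.2, 3.8, 4.5, 3.6, 4.4, 4.6])

M = scheffe_special_cubic(X)
b, *_ = np.linalg.lstsq(M, y, rcond=None)   # no intercept term
print(b)                                     # b1..b3, b12, b13, b23, b123
```

With seven design points and seven coefficients the fit is exact, which is why the simplex centroid design is saturated for the special cubic model; replicated center samples are what provide the error estimate in practice.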
Task
Analyze the design with a response analysis using a Scheffé model, then view and interpret the results.
How to do it
Highlight the data table Punch_Design and run Tasks - Analyze - Analyze design matrix…. Make the following choices in the Design Analysis dialog:
Method Classical
Model inputs • Predictors • Matrix: “Punch_Design (12x13)” • Rows: All • Cols: All • Model: Special cubic • Responses • Matrix: “Punch_Responses (12x5)” • Rows: All • Cols: All
Design Analysis
Click OK, then Yes to have a look at the plots when the computation is complete.
Diagnosing the model
ANOVA results
The first result diagnosis is the ANOVA table, in the upper left quadrant of the overview.
The first ANOVA table is for the response variable “Accept”.
ANOVA Punch
Locate the R-square and notice that the value is rather good: 0.93.
Look then at the p-value for the model: 0.0080. This is very good: it indicates that the model explains significantly more variation than can be attributed to noise in the data.
Look at the individual variables to conclude on the dimensionality of the model.
All the variables have a significant effect with p-values below 5%.
The model is therefore special cubic.
View the results for the other responses by using the drop-down menu or the arrows in the menu bar.
For the next result, “Cost”, one can see from the p-values that the model may not be cubic.
“Sweet” is also very well predicted. The only response that is not well modeled is bitterness.
Diagnostics
Look at the diagnostics table and examine the residuals. Notice that the center samples show high residuals.
Diagnostics for response “Accept”
Effect summary
Look at the effect summary.
Notice that for the first response the most important effects are of second or third order, whereas for “Cost” it is mostly the linear effects.
Effect summary
Response surface
Go to the predefined response surface plot in the navigator.
Response surface for acceptance
Try to locate the optimal values for the acceptance.
Do the same for the cost.
To do so, change the response variable to be plotted: untick the “Accept” variable and tick the “Cost” variable.
Response surface for cost
Conclusions
The response surface plots show maximum consumer acceptance for a fruit punch with about 39% Watermelon, 16% Pineapple and 45% Orange.
Tutorial H: PLS Discriminant Analysis (PLS-DA)
PLS-DA is the use of PLS regression for discrimination or classification purposes. In The Unscrambler®, PLS-DA is not listed as a separate method. This tutorial explains how to do it.
• Description o Running a PLS Discriminant Analysis o What you will learn o Data table
• Build PLS regression model • Classify unknown samples • Some general comments on classification
Description
PLS Discriminant Analysis (PLS-DA) is a classification method based on modeling the differences between several classes with PLS. If there are only two classes to separate, the PLS model uses one response variable, which codes for class membership as follows: -1 for members of one class, +1 for members of the other.
If there are three classes or more, the model uses one response variable (-1/+1 or 0/1, which is equivalent) coding for each class. There are then several Y-variables in the model.
In this tutorial we will analyze the chemical composition of spear heads excavated in the African desert. 19 samples known to belong to two tribes (classes A and B) are used for building a discriminant model, while seven new samples of unknown origin make up a test set to be classified.
The X-variables are 10 chemical elements characterizing the composition of the spear heads. The 19 training samples are divided into 10 from class A and 9 from class B. The normal way to make dummy variables for classes is to assign 1 if the sample belongs to the class and 0 if not. A small trick to obtain a decision line at 0, rather than 0.5, in the predicted versus measured plot is to use the values -1 and +1 instead, which makes the visualization easier.
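As a minimal sketch of this coding trick (the labels below are hypothetical stand-ins for the training set):

```python
import numpy as np

# Hypothetical class labels for the 19 training spear heads
labels = np.array(['A'] * 10 + ['B'] * 9)

# -1/+1 coding puts the natural decision line at 0 in the
# predicted versus measured plot (0/1 coding would put it at 0.5)
y = np.where(labels == 'A', 1.0, -1.0)
print(y.sum())  # 1.0: one more A sample than B samples
```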
Running a PLS Discriminant Analysis
When a data table is displayed in the viewer, one may access the Tasks menu to run a Regression (and later on a Prediction).
In order to run a PLS Discriminant Analysis (PLS-DA), one should first prepare the data table in the following way:
Insert or append a categorical variable in the data table. This categorical variable should have as many levels as there are classes in the data set. The easiest way to do this is to define one row set for each class, then build the sample sets based on the categorical variable (this is an option in the Define range dialog). The categorical variable will allow one to use sample grouping on plots, so that each class appears with a different color.
Use the function Edit - Split category variable to convert the categorical variable into indicator variables. These will be the Y-variables in the PLS model and are created as new columns in the data table. Then create a column set containing only the indicator variables, as these are the responses that will be used in the regression.
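A sketch of what splitting a category variable into indicators produces, on a hypothetical category column:

```python
import numpy as np

# Hypothetical categorical variable with three classes
cat = np.array(['A', 'B', 'C', 'A', 'B'])

# One indicator (dummy) column per class; together these
# form the Y block used in the PLS-DA regression
classes = sorted(set(cat.tolist()))
Y = np.column_stack([(cat == c).astype(float) for c in classes])
print(Y.shape)  # (5, 3); each row has exactly one 1
```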
What you will learn
This tutorial contains the following parts:
• Run a PLS regression • Interpret the model • Save the model • Classify new samples
References:
• Basic principles in using The Unscrambler® • Principles of Regression • Classification • Prediction
Data table
Click the following link to import the Tutorial H data set used in this tutorial. The data have already been organized for you into row sets, and with the class variable, as well as the indicators for the classes.
Tutorial H data
Build PLS regression model
Task
Run a PLS regression on the data.
How to do it
Click Tasks - Analyze - Partial Least Squares Regression to run a PLS regression and choose the following settings:
PLS Regression Dialog
Model inputs • Predictors: X: Tutorial H, Rows: Training, Cols: X • Responses: Y: Tutorial H, Rows: Training, Cols: Class num • Maximum components: 5 • Mean center data: Enable tick box
X Weights 1/SDev
Y Weights 1/SDev
Validation Full cross-validation
Set the weights on the X-weights and Y-weights tabs. Select all the variables, select the radio button A/(SDev+B), and click update. Do this for both the X and Y weights.
X weights dialog
To set the validation method, go to the Validation tab in the PLS Regression dialog. Select Cross validation, then click Setup… and select Full from the cross validation method drop-down list.
Cross Validation Dialog
After the computations are finished the default PLS regression plots will be shown. The score plot shows the separation of the two classes.
Score plot
For better visualization of the classes you may use the sample grouping option. Right click in the score plot and select Sample Grouping from the menu.
In the Sample grouping dialog, select the row sets “A” and “B” for visualization. You can double-click in the small boxes showing the colors to change to your preference. The same goes for the symbols, and their size.
Sample Grouping Dialog
The score plot shows that the two classes are well separated in the two first factors.
Score plot with grouping
Thus, a discrimination line may be inserted in the plot with the line drawing tool in The Unscrambler®.
Study the explained variance plot for Y shown in the lower-left quadrant. If need be, switch it to the view for Y by using the X-Y button. The explained variance plot for Y shows around 98% explained calibration variance and 94% explained validation variance for 2 factors. The red validation curve indicates that two factors is the optimal number, as there is only a small increase in explained variance after factor two.
Note: Explained variance or RMSE is not the main figure of merit for PLS-DA, however.
Variance plot
To interpret which variables are important for the classification, the loading weights plot is the one to look into. It is given in the upper-right quadrant.
In this case the loadings express the same information as the loading weights, and since correlation loadings show the explained variance directly, this is the preferred view. Make the loadings plot active, and change it to the Correlation loadings view by selecting the correlation loadings shortcut.
In the correlation loadings plot for factors one and two we see that Ba, Zr and Sr are the variables that separate the two classes, as is Ti, although with a slightly lower discrimination ability. These are the variables closest to the response variable class, lying between the 50% and 100% explained variance circles. The remaining elements mostly model the variance within the classes.
Correlation Loadings Plot
The regression vector is a summary of the important variables, in this case representing the loading weights plot after 2 factors. In the project navigator, select the plot Regression Coefficients, and change it to a bar chart by using the toolbar shortcut.
Weighted Regression Coefficients
The magnitude of the regression coefficients is an indication of how important those variables are for modeling the response, here class.
The predicted versus measured plot, in the lower-right quadrant, shows how close to the ideal values -1 and 1 the predicted values are.
Predicted versus Measured Plot
Note that the blue points are from calibration, where the samples are merely put back into the same model they were part of. The red points are from cross validation, which is more conservative, as each sample was not part of the model when it was predicted. You can toggle on/off the regression line, trend line, and statistics for the plot using the shortcut.
Recall that “prediction” in this context does not mean that the model has been tested by predicting a real test set. In this case all samples are correctly classified for the cross validation.
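The full-cross-validated classification can be mimicked numerically. The sketch below uses a minimal NIPALS PLS1 on made-up, well-separated data in place of the real 19 x 10 composition table; it illustrates the leave-one-out logic, not The Unscrambler's implementation.

```python
import numpy as np

def pls1_fit(X, y, ncomp):
    """Minimal NIPALS PLS1; returns the regression vector and the means."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    W, P, Q = [], [], []
    for _ in range(ncomp):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)        # loading weight
        t = Xc @ w                       # score
        p = Xc.T @ t / (t @ t)           # X loading
        q = (yc @ t) / (t @ t)           # Y loading
        Xc, yc = Xc - np.outer(t, p), yc - q * t   # deflate
        W.append(w); P.append(p); Q.append(q)
    W, P = np.array(W).T, np.array(P).T
    b = W @ np.linalg.solve(P.T @ W, np.array(Q))
    return b, xm, ym

# Made-up, well-separated stand-in for the 19 x 10 composition table
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, size=(10, 10)),    # class A
               rng.normal(-2, 1, size=(9, 10))])   # class B
y = np.array([1.0] * 10 + [-1.0] * 9)

# "Full" (leave-one-out) cross validation by hand
correct = 0
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    b, xm, ym = pls1_fit(X[keep], y[keep], ncomp=2)
    pred = (X[i] - xm) @ b + ym
    correct += (pred > 0) == (y[i] > 0)
print(correct / len(y))
```

Each sample is predicted by a model it did not help build, and a predicted value above 0 assigns it to class A.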
To investigate how the model will behave on unknown samples, the next section will show how to predict unknown sample class.
It is a good idea to save your work so far. The project will include all the data, as well as all the results generated thus far. Use File – Save… to save the project.
Classify unknown samples
Assign the unknown samples to the known classes by predicting (classifying) with the PLS regression model.
Task
Assign the Sample Set Test to the classes A or B.
How to do it
Select Tasks - Predict - Regression….
Tasks - Predict - Regression…
Use the following parameters:
Components The number of factors (components) to use is two.
Data • Matrix: Tutorial H • Rows: Test • Cols: X
Prediction • Full Prediction • Inlier limit • Sample Inlier dist • Identify Outliers
Prediction Dialog
Click OK.
The predicted values are shown in the main plot of predicted values with estimated uncertainties.
All F samples have predicted values close to -1, classifying them as belonging to class “B”. E sample 2 has a predicted value around 1, which assigns it to class “A”. As for E samples 1, 3 and 4, their predictions are close to 0 and have high uncertainties. These samples cannot be said to belong to either class, because the estimated deviation (uncertainty) around the predicted value includes 0 in the plot.
Predicted values and deviation
A small trick to present the results more visually is to do Tasks - Predict - Projection and select the PLS model from above. In the score plot you see that all F samples lie in the “B” class and E samples 2 and 3 probably belong to class “A”, as discussed above. The position of test samples 1 and 4 shows that they are in fact closer to class “A”, as the predicted values also indicate.
Note: Try to analyze the same data by doing PCA on the two groups and then select Tasks - Predict - Classification - SIMCA and compare results with the PLS-DA.
To check whether the prediction can be trusted, study the Inlier vs Hotelling T² plot, available from a right click on the plot and then Prediction - Inlier/Hotelling T² - Inliers vs Hotelling T².
Prediction - Inlier/Hotelling T² - Inliers vs Hotelling T² menu
For a prediction to be trusted, the predicted sample must not be too far from the calibration samples; this is checked by the Inlier distance. The projection of the sample onto the model should also not be too far from the center; this is checked with the Hotelling T² distance.
Inliers vs Hotelling T²
In this case the samples are found to be widely spread in the plot. If samples fall outside the limit lines, the prediction cannot be trusted.
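The Hotelling T² part of this check can be sketched directly from the scores; the values below are made up, and the per-factor form shown is a simplified version of the statistic.

```python
import numpy as np

# Hypothetical calibration scores (19 samples, 2 factors) and the
# projected scores of one new sample
rng = np.random.default_rng(5)
T = rng.normal(size=(19, 2))
t_new = np.array([0.5, -0.3])

# Hotelling T2: squared distance of the projection from the model
# centre, scaled by the variance of each factor's calibration scores
t2 = np.sum((t_new - T.mean(axis=0)) ** 2 / T.var(axis=0, ddof=1))
print(t2 > 0)  # True; in practice, compare t2 against a critical limit
```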
Some general comments on classification
LDA is the basic method that is typically taught in introductory classification courses and is available as a reference method for comparison with other classification methods such as SIMCA. Remember that LDA has the same issue with collinearity as MLR, and that more samples than variables are required in each class. Using PLS regression for classification, as shown here with PLS-DA, can give very good results in discriminating between classes. In this context it may also be useful to apply the uncertainty test after deciding on the model dimensionality and remove the non-relevant variables. This can in some cases give both simpler visualization and better model performance. However, PLS-DA does not take into account the within-class variability, and predicted values around 0 (assuming -1 and 1 are used as levels for the classes) are difficult to assign. One alternative procedure is to use the scores from the PLS-DA in an LDA to obtain a more “statistical” result. As the score vectors are orthogonal, there is no problem with collinearity in this case.
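The suggested two-step procedure, PLS-DA scores followed by LDA, can be sketched with a hand-rolled two-class Fisher discriminant; the scores below are invented and already well separated.

```python
import numpy as np

# Hypothetical PLS-DA scores for two classes; the factors are orthogonal,
# so there is no collinearity problem for the LDA step
rng = np.random.default_rng(7)
TA = rng.normal([2.0, 0.0], 1.0, size=(10, 2))   # class A scores
TB = rng.normal([-2.0, 0.0], 1.0, size=(9, 2))   # class B scores

# Two-class Fisher LDA: pooled within-class covariance, then the
# discriminant direction w = Sw^-1 (mean_A - mean_B)
Sw = (np.cov(TA.T) * (len(TA) - 1) + np.cov(TB.T) * (len(TB) - 1)) \
     / (len(TA) + len(TB) - 2)
w = np.linalg.solve(Sw, TA.mean(axis=0) - TB.mean(axis=0))
threshold = w @ (TA.mean(axis=0) + TB.mean(axis=0)) / 2

# Projections above the midpoint threshold are assigned to class A
pred_A = np.concatenate([TA, TB]) @ w > threshold
print(int(pred_A[:10].sum()), int(pred_A[10:].sum()))
```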
Using local PCA models, an approach which for historical reasons has been given the name “SIMCA”, is a good choice because it also gives the possibility to assign new samples to none of the existing classes. However, as the individual PCA models have no objective of discriminating between the classes, one does not know whether the variance modeled is optimal for this purpose. The Modeling and Discrimination Power diagnostics are helpful in this context. One useful procedure is to first do PLS-DA and select the “best” set of variables for discrimination, then use these together with the most important variables in the individual PCA models to obtain a variable set that models both the within-class and between-class variability.
SVM is a powerful method which can handle nonlinearities, and very good results have been reported in the literature. However, it is not as transparent as PCA and PLS, and the values of its input parameters must be chosen by cross validation to ensure a robust model.
As for all methods, the proof of the method lies in the classification of a large independent test set with known reference.
Tutorial I: Multivariate curve resolution (MCR) of dye mixtures
• Description o What you will learn o Data table
• Data plotting • Run MCR with default options • Plot MCR results • Interpret MCR results • Run MCR with initial guess • Validate the estimated results with reference information • View an MCR result matrix
Description
Multivariate Curve Resolution (MCR) attempts to recover the response profiles (spectra, pH profiles, time profiles, elution profiles, etc.) of the components in an unresolved mixture of two or more components. This is especially useful for mixtures obtained in evolutionary processes and when no prior information is available about the nature and composition of these mixtures.
The Unscrambler® MCR algorithm is based on pure variable selection from PCA loadings to find the initial estimate of the spectral profiles, and then Alternating Least Squares (ALS) to optimize the resolved spectral and concentration profiles.
The algorithm can apply a constraint of Non-negativity in either spectral or concentration profiles or both.
It can also apply a constraint of Unimodality in concentration profiles that have only one maximum, and/or a constraint of Closure in concentration profiles where the sum of the mixture constituents is constant.
The Unscrambler® MCR functionality does not require any initial guess input. A mixture data set suitable for MCR analysis should have at least four samples and four variables. If no initial guess is used, the maximum number of variables is 5000.
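The core alternating least squares idea can be sketched in a few lines. The example below factors made-up non-negative data D = C Sᵀ, refining a noisy initial guess of the spectra; non-negativity is imposed by simple clipping, a crude stand-in for the proper non-negative least squares step used in real MCR-ALS implementations.

```python
import numpy as np

# Made-up mixture data D = C @ S.T: 20 mixtures of 2 pure components
# measured at 50 wavelengths; all profiles are non-negative
rng = np.random.default_rng(3)
S_true = np.abs(rng.normal(size=(50, 2)))      # pure spectra (columns)
C_true = np.abs(rng.normal(size=(20, 2)))      # concentration profiles
D = C_true @ S_true.T

# Stand-in for the pure-variable initial spectral estimate:
# the true spectra plus a little noise, kept non-negative
S = np.clip(S_true + rng.normal(0, 0.1, size=(50, 2)), 0, None)

# Alternating least squares; each step solves a linear least squares
# problem for one factor while holding the other fixed, then clips
for _ in range(200):
    C = np.clip(D @ S @ np.linalg.pinv(S.T @ S), 0, None)
    S = np.clip(D.T @ C @ np.linalg.pinv(C.T @ C), 0, None)

rel_resid = np.linalg.norm(D - C @ S.T) / np.linalg.norm(D)
print(rel_resid < 0.01)  # True: the refined profiles reproduce D almost exactly
```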
In this tutorial we will utilize UV-Vis spectra of dye mixtures to extract pure dye spectra and their relative concentrations. The data are from the Institute of Applied Research (Prof. W. Kessler), Reutlingen University, Germany.
What you will learn
This tutorial contains the following parts:
• Run a basic MCR analysis • Plot MCR results • Interpret MCR results • Run an MCR analysis with initial guess • Validate MCR results with reference information • View the MCR result matrix and convert estimated concentrations into real scale.
References:
• Basic principles in using The Unscrambler® • What is MCR? • Interpreting MCR Plots
Data table
Click the following link to import the Tutorial I data set used in this tutorial.
Organizing the data table
The samples consist of 39 spectra of dye mixture samples. Samples 1 to 3 are pure dyes of blue, green and orange, respectively. Samples 4 to 39 are 36 mixture samples of those 3 dyes at known concentrations. The X-variables are the UV-Vis spectra measured over the range 250-800 nm with a step of 10 nm. We will begin by organizing the data for the analysis into row (sample) and column (variable) sets. The column sets have already been defined for you, and are found in the folder Column in the project navigator. There are 5 column sets for the different variables of interest in the analysis, including the concentrations of the three dyes and two overlapping spectral ranges.
We begin by defining the row sets for these data. Select the entire first row in the data table, Blue_50, and go to Edit – Define Range… to open the Define Range dialog box. In the dialog, enter the name “Blue” in the Range row box and click OK.
Define Range Dialog
From the data table, select the sample Green_50, and go to Edit - Define Range to make this row set Green. Do the same for the sample Orange_50, and then for samples 4 to 39, giving that row set the name Mixture. Additionally, create the row set Original by selecting samples 1 to 3 (the three pure dyes) and following the same procedure, Edit - Define Range.
The first three columns are concentration measurements of the blue, green and orange dyes. Columns 4 to 59 are the UV-Vis spectra measured over the range 250-800 nm with a step of 10 nm. In the project navigator, expand the node Column to see the list of existing column sets. The organized data will look like this in the navigator and viewer, with color-coding for the defined sets.
Navigator view of organized data
Data plotting
Before starting any analysis, it is a good idea to have a look at the data. We want to make a line plot of the spectra of all mixture samples together. Go to the original data table and highlight it in the navigator.
Use Plot - Line, which will open the Line plot dialog where the row set Mixture can be selected from the drop-down list, and for Cols, the set 250-800nm. This will give an overlay plot of the spectra.
Line plot of mixture spectra
We will now plot the reference spectra of the three pure components. Go to Plot - Line… and in the dialog select the row set “Original” and, for Cols, the set 250-800nm.
Line plot dialog
This will result in the following plot, where we can see that the maximum absorbance of each dye is at a different wavelength. It is these component spectra that we expect to be able to extract through the MCR analysis of the data in this tutorial.
Line plot of pure dyes
To plot the reference concentrations of the three dyes, select columns 1-3 and make a Line plot of Sample set “Mixture” by right clicking and selecting Plot – Line.
Line plot of sample concentrations
Note: Reference measurements of spectra and concentrations of pure components are not necessary to make your data set suitable for MCR!
Run MCR with default options
Task
Set up the options for an MCR analysis, launch the calculations and plot results.
How to do it
When data set “Tutorial_I” is active on screen, click Tasks - Analyze - Multivariate Curve Resolution…. The MCR dialog box with default settings will open up. Select Mixture (36) under the Rows tab, and 250-800nm (56) under the Columns tab. We will not use an initial guess.
Keep all other settings as default on the Options tab, then click OK. After the calculation is done, click Yes to view the plots.
MCR Dialog
When the MCR calculation is completed, a new node, named MCR, is added to the project navigator and the MCR overview plots are displayed in the viewer. The MCR results overview includes four plots, from upper-left to lower-right: Component Concentrations, Component Spectra, Sample Residuals and Total Residuals. The overview plots are displayed at the optimum number of pure components, which the system estimates to be 3 in this case. The optimal number of components (3) is displayed on the toolbar. A summary of the analysis results is given in the Info tab in the lower-left corner of the display, which also states the optimal number of pure components.
MCR Info Box
MCR Overview plots
The MCR model results are all together in the new node in the project navigator named MCR. Rename the MCR model in the project navigator by highlighting the MCR node, right clicking and choosing Rename. Rename your first MCR model as MCR Original.
Plot MCR results
Task
Plot MCR results for various numbers of pure components.
How to do it
The Unscrambler® MCR procedure actually generates several sets of results, covering numbers of estimated pure components from 2 to the optimum + 1. By default, the results are plotted for the optimal number of components.
You may view the results for varying numbers of pure components. Let us plot the spectral profiles for a 2-component solution. Click the shortcut to select Component Number 2.
The plot of (estimated) component spectra for a resolution with two pure components is displayed.
In a similar manner, click on the right arrow shortcut to plot the 4-component solution.
MCR fitting and PCA fitting results are also available for numbers of pure components from 2 to the optimum + 1. Each fitting includes Variable Residuals, Sample Residuals and Total Residuals plots, stored in result matrices in the MCR node of the project navigator. The user can plot these results by selecting the respective matrices, or by selecting the plot from the plots node of the project navigator. The plot of Total Residuals for MCR fitting is shown by default in the lower-right subframe. Like any other plot, it can also be accessed from the Plot menu. Change the plot in the lower-left subframe to variable residuals by clicking in it to activate it, then clicking MCR - Variable Residuals to display this plot in place of the sample residuals plot.
Variable residuals plot
Interpret MCR results
Task
Determine the optimum number of pure components.
How to do it
In the Total Residuals plot, residuals are high for 2 components and close to zero for 3 and 4 components. Change the appearance of the lower-right plot of the Total Residuals from a curve to bars, using the toolbar icon.
Total residuals bar plot
This suggests that 3 components is the optimum solution.
Click and activate the Component Spectra plot with 3 components in the upper-right quadrant. The toolbar contains a set of arrows, which are used to navigate between results at different numbers of components. Use the arrows to increase and decrease the number of components, and watch the impact on the spectral profiles.
Run MCR with initial guess
Task
Run the MCR calculation again, this time using an Initial Guess.
How to do it
If prior knowledge such as spectra of pure components or concentrations of mixture samples exists, this information may be included in the MCR calculation to help the algorithm converge towards the right solution of curve resolution.
Go back to the data table Tutorial_I by selecting the tab at the bottom of the viewer. Go to Tasks - Analyze - Multivariate Curve Resolution…. The MCR dialog box with default settings will open up. Select the same data as before, then check the box Use initial guess and select the option Pure spectra.
MCR dialog with initial guess
Select Row Set Original as the initial guess for the spectra, making sure to use the same column set for the analysis data and the initial guess. Then click OK to launch the calculations. When asked if you want to view the plots now, select Yes.
Rename the new MCR results node in the project navigator as MCR Initial Guess.
Notes:
1. When using the initial guess option, The Unscrambler® requires all pure components to be included as initial guess inputs. Partial reference will generate erroneous results. It is recommended to run MCR without initial guess if only partial reference is available.
2. The Unscrambler® can be run with either spectra or concentration of pure components as an initial guess input.
Validate the estimated results with reference information
Task
We are going to compare the model’s Estimated Concentrations for a 3-component solution to the existing reference concentrations found in the data table and plotted earlier. In a first step we are going to compare the concentration profiles visually.
How to do it
Select the Component Concentrations plot, shown in the upper-left quadrant of the MCR Overview. Compare this with the three concentrations in the original data table that were previously plotted as a line plot of the concentrations in the mixture data. Look at both profiles. To make them both visible in the viewer, select the line plot you have made, and on the navigator tab right click and choose Pop out, giving an undocked plot that can be docked wherever you wish for ease of viewing.
You can observe that the first estimated concentration profile is similar to the reference profile of the blue dye (blue curves on the plots), the second estimated concentration profile is similar to the reference profile of the green dye, and the third estimated concentration profile is very close to the reference concentration of the orange dye (green curves on the plots).
Caution: Estimated concentrations are relative values within each individual component. The estimated concentrations of a sample are not its real composition.
The estimated spectral profiles can be compared to the reference spectral profiles in the same way as for the concentrations. Because we used the spectra as initial guess inputs in this example, the comparison shows a perfect match. However, estimated spectra are unit-vector normalized; they are not the “real” spectral profile of the samples.
Plots of the Pure and Estimated Spectra
View an MCR result matrix
Tasks
Plot the MCR result matrix of estimated concentrations, compare the estimated concentrations to the reference concentrations in 2-D scatter plots by combining them into a single matrix, and convert the estimated concentrations into real scale.
How to do it
Open the project Tutorial_I and expand the Results folder in the project navigator for the model file MCR Initial Guess. The plot of the component concentrations is given in the upper-left quadrant of the MCR Overview plot. Select the Component concentrations matrix and make a duplicate of it by selecting it and going to Insert - Duplicate Matrix.
Insert Duplicate Matrix
Rename the duplicate matrix, named Component concentrations, that has been added to the bottom of the project navigator as Concentrations comparison.
With the cursor in the data matrix, go to Edit - Append and choose to add 3 columns to this matrix. Go to table Tutorial_I, select the first three columns (blue, green and orange) from rows 4-39. Copy them and paste them into the empty columns of the Concentrations comparison matrix, and name columns 4-6 blue, green, and orange, respectively. We now have a table of six columns, containing the three estimated concentrations of the pure dyes followed by the three measured concentrations.
New Data Matrix with Estimated and Real Concentrations
Select columns “Blue” and “1” (press the Ctrl key on your keyboard to select several columns at a time). Click Plot - Scatter to display a 2-D scatter plot of these columns. The correlation between estimated and reference concentrations for the blue dye is 0.994. If the box containing plot statistics (among which is the correlation) is not displayed in the upper-left corner of your plot, use the toolbar to display it. The toolbar can also be used to add a regression line and target line to the plot.
Continue by making the scatter plot for the green dye (columns “Green” and “2” in the table), for which the correlation between estimated and reference concentrations is 0.997.
For the orange dye (columns “Orange” and “3”), the correlation is 0.998. These very high correlations indicate that the MCR calculations have determined concentration profiles accurately in this case.
Scatter plot of orange dye concentration
These plots can be customized by right clicking and choosing Properties to make changes to the plot appearance.
Now let us convert the estimated Orange concentrations to real scale. In order to do this, at least one reference measurement is needed. The estimated concentrations (in relative scale) of all samples can then be converted into the real concentration scale by multiplying them by a factor (real concentration / estimated concentration).
In the present case, we can use for example sample PROBE_11, which has a reference concentration of Orange dye of 7 and an estimated concentration of 0.4443.
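The rescaling amounts to one multiplication. In the sketch below, the reference pair (7 and 0.4443) is taken from the text, while the other estimated values are invented for illustration:

```python
# Convert relative MCR concentrations to real scale using one reference
# sample: PROBE_11 has reference 7 and estimate 0.4443 for the orange dye
est_orange = [0.4443, 0.2101, 0.8886]   # estimated values (last two invented)
factor = 7 / 0.4443                     # real / estimated for PROBE_11
real_orange = [factor * e for e in est_orange]
print(round(real_orange[0], 4))  # 7.0: the reference sample maps back to itself
```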
Use menu Edit - Append - … to append a new column at the end of the table, and name it “MCR Orange real scale”. Go to Tasks - Transform - Compute General…, and type the expression:
V7=V3*(7/0.4443)
in the Expression space.
Compute General Dialog
Click OK to perform the calculation. A new matrix is created where the new column has been filled with the values of estimated Orange dye concentrations converted to real scale.
Data matrix with new values
Tutorial J: MCR constraint settings
Constraint settings in multivariate curve resolution
• Description o What you will learn o Data table
• Data plotting • Estimate the number of pure components and detect outliers with PCA • Run MCR with default settings • Tune the model’s sensitivity to pure components • Run MCR with a constraint of closure • Remove outliers and noisy wavelengths with recalculate
Description
In this tutorial we will utilize FTIR spectra of an esterification reaction to extract pure spectra and their relative concentrations. The original data are from the University of Rhode Island (Prof. Chris Brown), USA.
In situ FTIR spectroscopy was used to monitor the esterification reaction of isopropyl alcohol and acetic anhydride using pyridine as a catalyst in carbon tetrachloride solution. The initial concentrations of these three chemicals were 15%, 10% and 5% in volume, respectively. Isopropyl acetate was one of the products in this typical esterification reaction. The reaction was carried out in a ZnSe cell, and mixture spectra were measured at 4 cm-1 resolution. The data set consisted of 25 spectra, covering approximately 75 minutes of the reaction. To shift the equilibrium of the esterification, one-tenth of the volume was removed from the cell at 24, 45 and 60 minutes. An equal amount of a single reactant was added to the cell in the sequence of acetic anhydride, pyridine and isopropyl alcohol.
What you will learn
This tutorial contains the following parts:
• Estimate the number of pure components and detect outliers with PCA • Run MCR with default settings • Tune the sensitivity to pure components setting • Run MCR with a constraint of closure • Use the Recalculate functionality in MCR
References:
• Basic principles in using The Unscrambler® • Principles of PCA • What is MCR? • Interpreting MCR Plots
Data table
Click the following link to import the Tutorial J data set used in this tutorial.
The data consist of 25 FTIR spectra of 262 variables covering the spectral region from 1860 to 852 cm-1. There are two row sets already defined: mixture and closure. Mixture contains all the data, while the row set closure contains the samples that will be used when applying the constraint of closure during the MCR.
Data plotting
Before starting the analysis, it is always important to have a look at the data. Make a line plot of all of the spectra together.
Select all the samples by selecting the data set Tutorial_J in the project navigator. The data table for the FTIR spectra of the samples will then be displayed in the data editor. Highlight the samples, and use Plot - Line to display an overlay of the spectra in the viewer.
Line plot dialog
From this plot, one can see that there is a region around 1240 cm-1 that is changing over the course of the reaction being monitored.
Line plot of FTIR spectra
Estimate the number of pure components and detect
outliers with PCA
Principal Component Analysis (PCA) is recommended before running an MCR calculation. It provides some information on the number of pure components and on sample outliers.
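The idea can be sketched in a few lines of NumPy (an illustrative sketch on synthetic data, not The Unscrambler's own computation): the explained variance per component levels off once the chemical rank of the mixture is exhausted, so the plateau suggests the number of pure components.

```python
import numpy as np

def explained_variance(X, n_components=8):
    """Explained variance per component from the SVD of the data matrix.
    As in this tutorial, the data are NOT mean centered, so the raw matrix
    is decomposed directly. Values are relative to the retained components."""
    s = np.linalg.svd(X, compute_uv=False)[:n_components]
    var = s ** 2                      # singular value^2 = variance captured
    return var / var.sum()

# Synthetic 3-component mixture data shaped like the tutorial's:
# 25 "spectra" x 262 "wavelengths", plus a little noise
rng = np.random.default_rng(0)
C = rng.random((25, 3))                  # concentration profiles
S = rng.random((3, 262))                 # pure "spectra"
X = C @ S + 1e-4 * rng.standard_normal((25, 262))

ev = explained_variance(X)
print(ev[:4])  # the first three components dominate; the rest are noise
```

On real data the plateau is rarely this sharp, which is why the tutorial cross-checks the PCA suggestion against the MCR results.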
Task
Run a PCA on the raw data.
How to do it
Click Tasks - Analyze - Principal Component Analysis to run a PCA and choose the following settings:
• Matrix: Tutorial_J
• Rows: All
• Columns: All
• Maximum components: 8
• Mean center data: Not selected
• Identify outliers: Selected
PCA Dialog
On the Validations tab, select Cross validation, then click Setup… and choose full cross validation from the drop-down list of cross validation methods. Click OK, then OK again on the model inputs page.
Cross Validation Setup
Once the PCA calculations are done, click Yes to view the plots of the PCA model immediately. The four plot PCA Overview will be displayed in the viewer.
The upper right quadrant is a 2-D plot of the PCA loadings. For spectral data, it is more informative to have a line plot of the loadings, as it then resembles a spectrum. Select the existing loading plot, and go to Plot - Loadings - Line, which will give the plot of the first PC loading, replacing the default plot in this quadrant. This plot, one can see, closely resembles the FTIR spectra of the raw data.
Scroll through the loadings plots for the other PCs using the arrows on the toolbar.
You can see that the loadings begin to get noisy at about the sixth principal component. The program recommends three components as the optimal number of PCs in this model. This is seen in
the Info box in the lower left corner of the display, and by clicking on the star on the menu toolbar. Select the Explained Variance plot in the lower-right quadrant by clicking on it with the mouse, then right mouse click to select View - Numerical View.
As you can see, the explained variance globally reaches a plateau from the third principal component. The fourth and fifth PCs still show some slight increase; at that stage, it is difficult to know whether they represent noise or real information. Now, click on the Influence plot at the bottom-left corner of
the Viewer, and use the PC navigation tool to display the influence plot at PC4. You may observe that sample 1 sticks out to the right with a high leverage, and that sample 8 sticks out upwards with a high residual variance.
PCA Influence Plot for PC4
Go to menu Plot - Sample Outliers to display a combination of four useful plots for outlier detection. Highlight the Residual Sample Variance at the bottom-left quadrant, and use the PC navigation arrows to change that to show results for PC4. This plot indicates a high validation residual for sample 8.
Residual Sample Variance Plot for PC4
As there is no validation check in MCR, we may use the outlier information issued from PCA in our MCR modeling later on.
Rename the PCA model file in the project navigator by highlighting the PCA node, right clicking and choosing Rename. Rename the model to “PCA Tutorial J”.
Run MCR with default settings
Task
Build a first MCR model with default settings.
How to do it
Go back to the data table Tutorial_J in the project navigator. Run an MCR by going to the menu and selecting Tasks - Analyze- Multivariate Curve Resolution… and keep the default settings:
• Matrix: Tutorial_J
• Rows: All
• Columns: All
Go to the Options tab and verify that the default settings are selected. Make changes as needed.
• Non-negative concentrations: selected
• Non-negative spectra: selected
• Closure: not selected
• Unimodality: not selected
• Sensitivity to pure components: 100
• Maximum ALS iterations: 50
MCR Options Dialog
Click OK to launch the calculations.
Note: MCR computations are demanding. Building the model can easily take several minutes depending on the size of the data set, the selected options and the capacity of your computer processor.
Click Yes when the calculations are finished, and you are asked if you want to view plots now. The MCR Overview plots are displayed. Notice that the program suggests 4 as the optimal number of
pure components, by indicating 4 components in the toolbar. This information, as well as the parameters for the MCR analysis, can be seen in the Info box in the lower left of the display.
Information Box
Rename the MCR model file to “MCR_Defaults”.
Tune the model’s sensitivity to pure components
Task
Read the MCR Warnings, which are found under the MCR model node. Open the warnings and follow the system’s recommendation for the Sensitivity to pure components setting.
How to do it
Expand the MCR_Defaults node in the project navigator and click on Warnings. A table of information will be displayed in the viewer and here you can check the recommendations given by the system. There are four types of recommendations:
Type 1 Increase sensitivity to pure components
Type 2 Decrease sensitivity to pure components
Type 3 Change sensitivity to pure components (increase or decrease)
Type 4 Baseline offset or normalization is recommended.
In the present case, the system recommends changing the setting for sensitivity to pure components.
The default setting (100) that was used for Sensitivity to pure components is usually a good starting point. After interpreting the results and reading the system recommendations, you can tune it up or down between 10 and 190. The higher the Sensitivity, the more pure components will be extracted. Therefore, if too many components are extracted, it is recommended to reduce the setting. Likewise, if you would like to see more components at an almost undetectable level, or even some noise profiles, it is recommended to increase the sensitivity setting.
Let us build a model with an increased setting.
Go back to the data table and redo the MCR calculation with a Sensitivity to pure components setting of 150.
The plot of Component Spectra is now shown by default for 5 components instead of 4 in the previous model.
Component Spectra for 5 Components
One can compare these profiles with FTIR spectra of known constituents, and identify the 5 estimated spectra as pyridine, isopropyl alcohol, a possible intermediate, isopropyl acetate and acetic anhydride, from curves 1-5 respectively.
Rename the new MCR model file created in the project navigator as MCR_Sensitivity150.
Run MCR with a constraint of closure
Task
Run MCR with a closure constraint. Compare two MCR models on the same data, with and without closure.
How to do it
Among the MCR settings we have used so far, two types of constraints were not selected.
A constraint of Unimodality can be applied to restrict the resolution to concentration profiles that have only one maximum.
With a constraint of Closure, the resolution will yield concentration profiles whose sum is constant.
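Closure can be pictured as rescaling each sample's resolved concentrations to a constant sum. A minimal sketch (illustrative values, not the tutorial's actual concentrations):

```python
import numpy as np

# Hypothetical resolved concentrations: 3 samples x 3 components
C = np.array([[0.2, 0.5, 0.1],
              [0.3, 0.4, 0.2],
              [0.1, 0.6, 0.3]])

# Closure: rescale each row so its concentrations sum to 1
C_closed = C / C.sum(axis=1, keepdims=True)
print(C_closed.sum(axis=1))  # -> [1. 1. 1.]
```

This is why, later in this tutorial, the concentration profiles of the closure model always add up to 1.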
In the present case, acetic anhydride was added at 24 minutes (between the eighth and the ninth samples), which means that the first 8 samples can be treated in closure conditions.
Go back to the data table and run a new MCR model with the following settings:
• Rows: Closure [8] (contains the first 8 samples of the data table)
• Cols: All
• Non-negative concentrations: selected
• Non-negative spectra: selected
• Closure: selected
• Unimodality: not selected
• Sensitivity to pure components: 100
Once the computations are finished, choose to view the plots when prompted. Rename the new MCR model file as “MCR_Closure”.
You may compare the resolved concentration and spectral profiles of pure components with and without the closure setting. To do that, compute a new MCR model on sample set “Closure” without checking the Closure constraint option. Save the new MCR model file as “MCR_No_Closure” and compare the results to “MCR_Closure”.
The spectral profiles with and without the constraint of closure are very similar.
MCR Component Spectra
You can also observe that under constraint of closure, the concentrations of the pure components always add up to 1.
MCR Component Concentrations
Notes on MCR result interpretation
1. The spectral profiles obtained may be compared to a library of FTIR spectra in order to identify the nature of the pure components that were resolved. Likewise, if you have the spectra of your pure components and solvents, you can compare these to the computed components.
2. Estimated concentrations are relative values within an individual component itself. Estimated concentrations of a sample are not its real composition.
Remove outliers and noisy wavelengths with recalculate
Task
Use the Recalculate functionality to remove samples or variables with high residuals.
How to do it
Select the MCR_Defaults tab from the navigation bar to display your first MCR model on screen. If the plots were already closed, you may open them again from the project navigator; click on the MCR Overview plot from the node MCR_Defaults to display the results.
The Validation calculations of the PCA model that we built earlier indicated that sample 8 was a potential outlier. We can check this again in the MCR model by looking at the fitting residuals.
Click on the bottom-left subframe where the Sample residuals are plotted to highlight it. If needed, use the PC navigation arrow tool to change the view to show the sample residual for the 4-component model.
Here you may notice a high residual showing for Sample 8, compared to the other samples. Let us build a model without this sample. You will notice in the sample residuals plot that the shape is similar to what is observed in the residual sample variance plot from the PCA model on this same data set.
MCR Sample Residuals
Use the marking tools to highlight sample 8 in the Sample Residuals plot.
Marked sample in sample residuals plot
Select the MCR_Defaults model in the project navigator, and right click to select Recalculate - Without Marked… to specify a new MCR calculation without sample 8.
Menu to recalculate without marked
This brings you back to the MCR dialog, where sample 8 is now included in the Keep Out Of Calculation field. You may launch the calculations to get the new MCR results.
MCR menu with sample 8 kept out
Similarly, you may want to keep out of the model non-targeted wavelength regions, or highly overlapped wavelength regions.
From the MCR_Defaults overview plots, click Plot - Variable Residuals.
MCR Variable Residuals
Mark any unwanted variables on the plot using the marking tools, for example the variables around 1100-1140 cm-1, which present very high residuals, then select the model “MCR_Defaults” and right click to choose Recalculate - Without Marked… to specify a new MCR calculation.
General notes on MCR settings and interpretation:
1. To have reliable results on the number of pure components, one should cross-check with a PCA result, change the sensitivity to pure components setting, and use the navigation bar to study the MCR results for various numbers of pure components.
2. Weak components (either low concentration or noise) are usually listed first.
3. One can utilize estimated concentration profiles and other experimental information to analyze a chemical/biochemical reaction mechanism.
4. One can utilize estimated spectral profiles to study the mixture composition or even intermediates during a chemical/biochemical process.
Tutorial K: Clustering
• Description
  o What you will learn
  o Data table
• Transform the raw spectra
• Application of K-Means clustering
• Application of Hierarchical Cluster Analysis (HCA)
• Repeat the HCA using a correlation-based measure
• Using the results of HCA to confirm the results of PCA
Description
This tutorial investigates the use of two well-known clustering methods, K-Means and Hierarchical Cluster Analysis (HCA), for the classification of raw materials used in the pharmaceutical industry, by means of reflectance Near Infrared (NIR) spectroscopy. This is an example of unsupervised pattern recognition and is an alternative methodology to Principal Component Analysis (PCA). Unsupervised pattern recognition is the first step performed to establish whether a discriminant classification method can be developed.
What you will learn
Tutorial K contains the following parts:
• Apply a pretreatment method to the spectral data
• Use K-Means to identify clusters in the data set
• Perform HCA and analyze the resulting dendrogram output.
References
• Basic principles in using The Unscrambler®
• Principles of PCA
• Data preprocessing and transformations
• Classification
• Cluster Analysis
Data table
Click the following link to import the Tutorial K data set used in this tutorial.
The data table contains 35 NIR spectra of seven classes of raw materials often used in pharmaceutical manufacturing. Typically when developing classification models it is recommended that more samples be used, being sure to cover the natural variability of each class, but for this exercise, we use just five spectra for each class.
The diffuse reflectance spectra have been truncated to the wavelength region 1200 - 2200 nm for this particular example.
The type of raw material is defined in the name of each sample, and includes:
• Citric acid
• Dextrose anhydrous
• Dextrose monohydrate
• Ibuprofen
• Lactose
• Magnesium stearate
• Starch
Transform the raw spectra
Task
Transform the raw spectral data by applying a Standard Normal Variate (SNV) transform to the Tutorial_K data table.
How to do it
Open the file Tutorial_K.unsb from the tutorial data folder.
First plot the raw data by selecting the entire table and selecting Plot - Line and select all rows and columns to plot.
Line plot
Click on OK and view the plot. Notice that there are distinct groups of spectra with similar profiles. The main source of variation within each group comes from differences in the absorbance (Y) axis. This baseline shifting is due to differences in sampling when preparing and scanning, resulting in differences in light scattering by the samples measured in reflectance by NIR spectroscopy.
Line plot of NIR spectral data
A convenient way to remove this variation is by use of the SNV transform. This transform reduces the scattering effects in such data by removing the mean value from each point in the spectrum and dividing each point by the standard deviation of all points in the spectrum, i.e. the SNV transform normalizes the spectrum to itself. The effect of the SNV transform is to remove the variation in the absorbance scale (baseline shifting), while retaining the original profile of the spectral data.
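The SNV computation itself is just a per-spectrum standardization, which can be sketched in NumPy (a minimal illustration of the transform, not The Unscrambler's implementation):

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row)
    by its own mean and standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

rng = np.random.default_rng(0)
base = rng.random(100)
# Two "spectra" with the same profile but an additive baseline shift
spectra = np.vstack([base, base + 0.5])

corrected = snv(spectra)
# After SNV the baseline-shifted copies coincide
print(np.allclose(corrected[0], corrected[1]))  # -> True
```

After the transform every spectrum has mean 0 and standard deviation 1, which is exactly why the baseline offsets visible in the raw line plot disappear.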
This is a commonly used practice in many NIR applications, especially for reflectance spectra of solids. To perform the SNV transformation, right click in the matrix Tutor K Data and select Transform - SNV. In the Rows dialog box, select All, and in the Columns dialog box, select All. You can preview the effect of the transformation by clicking in the Preview result box, or just click OK to perform the transformation.
SNV dialog
The transformed data are displayed as a new node in the project navigator and the matrix is called Tutor K Data_SNV. Plot the data to see how they now look by selecting all samples in the new matrix and going to Plot-Line.
The resulting SNV-transformed spectra can be seen below.
Line plot of SNV-transformed NIR Spectra
The spectra are now ready for application of the clustering algorithms described below.
It is a good idea to save your work as you go. Save your project by going to File-Save As….
Application of K-Means clustering
K-Means clustering is an unsupervised classification method which attempts to group a set of samples being analyzed into “K” distinct groups, where K is specified by the analyst. The classification is performed based on a predefined distance measure. For more details on the distance measures available, refer to the section on Cluster Analysis.
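The mechanics of K-Means with a Euclidean distance can be sketched as plain Lloyd's algorithm in NumPy. This is an illustration of the idea on synthetic data, not The Unscrambler's actual implementation (which has its own initialization and options):

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm with Euclidean distance. Centers are
    initialized farthest-first so the result is deterministic."""
    centers = [X[0]]
    for _ in range(1, k):
        # next center: the sample farthest from all current centers
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign every sample to its nearest center ...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... then move each center to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two tight, well-separated synthetic "material classes" of 5 samples each,
# mimicking the 5-spectra-per-class structure of this data set
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (5, 3)), rng.normal(5, 0.1, (5, 3))])
labels = kmeans(X, k=2)
print(labels)  # -> [0 0 0 0 0 1 1 1 1 1]
```

With well-separated classes, as here, each cluster ends up containing exactly the samples of one class, which is what the tutorial's seven-cluster run will show for the raw materials.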
Task
Perform a K-Means clustering of all samples.
How to do it
Use Tasks - Analyze- Cluster Analysis… and select the following parameters under the Inputs tab:
• Matrix: Tutor K Data_SNV
• Rows: All
• Columns: All
• Number of Clusters: 7
• Clustering Method: K-Means
• Distance Measure: Euclidean
Cluster analysis dialog
With K-Means one can also make initial class assignments on the Options tab, and set the maximum number of iterations for the algorithm. Here we will allow the algorithm to make assignments with no further input, and use the default of 50 iterations.
Cluster analysis dialog options tab
Click OK to start the analysis and a new node will appear in the project navigator called Cluster analysis. Right click on the node and select Rename and call this analysis K-Means.
You will notice that there is no graphical output for K-Means clustering. The output of the cluster analysis is found in the Results folder. Expand this folder to display a node called Tutor K Data_SNV_Classified, where the results reside. The classified data matrix is color-coded according to the clusters (row sets) that have been identified. Expand this matrix. Expand the rows and the columns folders and you will see that the rows contain seven assigned clusters from Cluster-0 to Cluster-6. The columns folder contains the class, a single column of classification results.
The K-Means data table is now classified by different colors, corresponding to the various assigned classes. Study this table. You will notice that the K-Means algorithm has successfully classified the data into seven distinct classes, each containing a single raw material type. Click on the various cluster nodes in the project navigator and confirm that each cluster contains 5 samples of the same material type. Using the Rename function, assign cluster names according to the table above. The results of this operation are shown below.
View of Assigned Classes in Navigator
Now that the separate classes have been defined, you can use this information to group samples in plots. Go back to the matrix Tutor K Data_SNV and right click to select Plot - Line. In the plot, you can now right click to select Sample grouping. In the sample grouping dialog, first select the clustered data from the drop-down list for Select. The row sets you have just renamed are now available as row sets. Select all of these, and click OK. The line plot will now have all samples of each set displayed in a single color.
Sample grouping option
Application of Hierarchical Cluster Analysis (HCA)
Hierarchical Cluster Analysis (HCA) is another clustering method. Like K-Means, it is based on distance measures; however, the main output of the HCA is the dendrogram. The dendrogram provides information pertaining to sample relationships within a particular data set. The structure of the dendrogram is dependent on the distance measure used and great care must be taken when interpreting the structures.
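The agglomeration behind a single-linkage dendrogram can be sketched naively in NumPy: start with every sample as its own cluster and repeatedly merge the two clusters whose closest members are nearest. This toy sketch (synthetic data, no dendrogram drawing) only illustrates the merging rule, not The Unscrambler's implementation:

```python
import numpy as np

def single_linkage_clusters(X, n_clusters):
    """Naive agglomerative clustering: single linkage (nearest neighbor),
    Euclidean distance. Merges the two closest clusters until
    n_clusters remain."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    while len(clusters) > n_clusters:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the CLOSEST members
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] += clusters.pop(b)   # merge the winning pair
    return clusters

# Two tight synthetic classes of 5 samples each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(4, 0.1, (5, 2))])
groups = [sorted(c) for c in single_linkage_clusters(X, 2)]
print(groups)  # -> [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

The order and heights at which merges happen is what the dendrogram visualizes; changing the linkage rule or the distance measure, as done later in this tutorial, changes that merge sequence and hence the tree.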
Task
Make a HCA model using the method of single linkage and Euclidean distance.
How to do it
Select Tasks - Analyze - Cluster Analysis… and make a model with the following parameters:
• Matrix: Tutor K Data_SNV
• Rows: All
• Cols: All
• Number of Clusters: 7
• Clustering Method: Hierarchical Single-linkage
• Distance Measure: Euclidean
Use the drop-down lists to change the clustering method and distance measure. Click OK to start the analysis. When the analysis is completed, the dendrogram is displayed in the editor window, and a new Cluster analysis node is added to the project navigator.
HCA Euclidean Dendrogram
Before reviewing the analysis results, rename the new cluster analysis node in the project navigator as HCA Euclidean.
Analyze the dendrogram and look at the order of the clusters from top to bottom. It can be seen that each raw material type is uniquely defined, and the carbohydrate materials Starch, Lactose, Dextrose Monohydrate and Dextrose Anhydrous all group together in the dendrogram. Towards the bottom, the clustering is not as distinct. This indicates that the sample classification is based on some similarity in the chemistry of the samples, but it is not as well defined as it could be. This is one aspect of HCA that must be kept in mind when using this method.
In the project navigator, expand the results folder for the HCA and under the rows folder, you will see that seven clusters have been assigned to this analysis. These can be renamed as was done above, so that the names coincide with the class name.
Repeat the HCA using a correlation-based measure
When dealing with spectroscopic data, the spectrum of a material is analogous to its fingerprint. Using a straight distance measure such as the Euclidean measure may not be the most sensitive way of assessing the similarities present within the data. The Absolute correlation measure provides a better way of capturing the similarities in the spectral variables of the materials. We will also change to complete linkage, which uses the farthest neighbor, as opposed to the nearest neighbor used in single-linkage HCA.
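The difference between the two measures can be shown with a toy example (synthetic "spectra", purely illustrative): two profiles that are proportional but offset look far apart to a Euclidean distance, yet identical to an absolute-correlation distance.

```python
import numpy as np

x = np.linspace(0, 1, 50)
a = np.sin(2 * np.pi * x)
b = 3 * a + 10               # same "fingerprint", but shifted and scaled
c = np.cos(2 * np.pi * x)    # genuinely different profile

def abs_corr_distance(u, v):
    """1 - |correlation|: ~0 for proportional profiles regardless of
    baseline offset or scale; ~1 for unrelated profiles."""
    return max(0.0, 1 - abs(np.corrcoef(u, v)[0, 1]))  # clip rounding error

print(round(np.linalg.norm(a - b), 1))    # Euclidean: a and b look far apart
print(round(abs_corr_distance(a, b), 3))  # -> 0.0  (identical fingerprints)
print(round(abs_corr_distance(a, c), 3))  # close to 1: different shape
```

This is why the correlation-based dendrogram below groups the materials by spectral shape, i.e. by chemistry, rather than by overall intensity.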
Task
Make a HCA model using the method of complete linkage and absolute correlation.
How to do it
Select Tasks - Analyze - Cluster Analysis. Use the following parameters:
• Matrix: Tutor K Data_SNV
• Rows: All
• Columns: All
• Number of Clusters: 7
• Clustering Method: Hierarchical Complete-linkage
• Distance Measure: Absolute Correlation
Click OK to start the analysis and then click Yes to view the plots. The dendrogram for this analysis is displayed in the editor window, and from the results node it is seen that 7 clusters are identified.
Before reviewing the analysis results, rename the new cluster analysis node in the project navigator as “HCA Correlation”.
Notice that all samples are uniquely classified into classes based on the raw material type. This time there are three distinct clusters in the dendrogram. At the top of the dendrogram is Starch. The next cluster of samples contains mostly carbohydrates: Lactose, Dextrose Monohydrate, Dextrose Anhydrous and Citric acid. The last cluster includes the materials Ibuprofen and Magnesium stearate, whose NIR spectra have features in the 1400 and 1700 nm regions.
HCA Absolute correlation distance dendrogram
The method of absolute correlation not only uniquely classified the individual raw materials, but it was also able to use the information in the spectral variables far better, by grouping the materials by their chemical properties.
In the results folder, select the data table Tutor K Data_SNV_Classified. Go to Insert - Duplicate Matrix…. The following dialog box opens.
Duplicate Matrix
Rename the clusters of the duplicated matrix based on the materials’ name.
Renamed row ranges
We will use these results, in conjunction with PCA, to show how the two methods of unsupervised pattern recognition can be used together.
Using the results of HCA to confirm the results of PCA
Task
Perform a PCA on the SNV transformed data and group the samples based on the results of HCA.
How to do it
Select Tasks - Analyze- Principal Component Analysis…. Use the following parameters:
• Matrix: Tutor K Data_SNV_Classified
• Rows: All
• Columns: All
• Maximum Components: 6
• Mean Center Data: Yes
• Identify Outliers: Yes
PCA dialog
Click OK to start the analysis and then click Yes to view the plots. The PCA Overview for this analysis is displayed in the workspace.
In the Scores Plot right click and select Sample Grouping and from the Select drop-down list, use the results from your clustering to give you the available row sets of the different clusters. Click on the » button to select all clusters in the analysis and then click OK.
Sample grouping dialog
Drag the updated scores plot so that it fills most of the screen and analyze the clustering.
The scores plot shows that PC1 explains 66% of the data variance, and PC2 describes 19%. The main difference along PC1 is between carbohydrate materials and fatty acid based materials (i.e. Magnesium Stearate and Citric Acid) and PC2 is differentiating between the starch and ibuprofen samples.
It can be seen that the clustering of the materials as established by HCA is consistent with that of PCA. PCA provides more information on the groupings, as the spectral loadings can be related to the spectral features which describe the materials. To have a more informative view of the PCA loadings it is better to look at them as a line plot, which then resembles a spectrum. Activate the loadings plot in the upper-right quadrant, and right click to select PCA - Loadings - Line. The loadings plot now shows which spectral features are related to the first PC, which explains most of the variance in this
data set. Use the next arrow to scroll to the next PC loadings plot.
PCA Overview Plot
Now that the work has been done it is a good idea to save the results so you can refer to them in the future.
This exercise has shown that, when more data (more samples per class) are available, one can proceed to make a classification model to identify these seven raw materials from their NIR spectra. Classification methods such as PLS-DA and SIMCA can be used to develop models for the classification of future samples.
Tutorial L: L-PLS
• Description
  o What you will learn
  o Data table
• Open and study the data
• Build a L-PLS model
• Interpret the results
  o Variances
  o Products: Scores
  o Product descriptors X: X Correlation Loadings
  o Consumer descriptors Z: Z Correlation Loadings
  o Consumer liking of the products Y: Y Correlation Loadings
  o Overview of the L-PLS Regression solution
• Verify the results
  o Products liking
  o Liking Y vs. consumer background Z
  o Product descriptor rows in X
  o Product descriptor columns in X
• Bibliography
Description
Consumer studies represent an application field where such “L-shaped” data matrix structures X, Y, Z are common: a set of I products has been assessed by a set of J consumers, e.g. with respect to liking, with results collected in the “liking” data table Y(I × J). In addition, each of the I products has been “measured” by K product descriptors (“X-variables”), reflecting chemical or physical measurements, sensory descriptions, production facts etc., in data table X(I × K). Moreover, each of the J consumers has been characterized by L consumer descriptors (“Z-variables”), comprising sociological background variables like gender, age, income, etc., as well as the individual’s general attitudes and consumption patterns; these are collected in data table Z(J × L). Relevant questions could then be: Is it possible to find reliable patterns of variation in the liking data Y which can be explained both from the product descriptors X and from the consumer descriptors Z? Is it possible to predict how a new product will be liked by these consumers, by measuring its X-variables? Is it possible to predict how a new consumer group will like these products, from their background Z-variables?
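The “L” shape can be pictured with three arrays whose dimensions interlock: X and Y share the product rows, while Y and Z share the consumers. A minimal sketch with random placeholder values (only the shapes matter here):

```python
import numpy as np

# products, consumers, product descriptors, consumer descriptors
I, J, K, L = 6, 125, 10, 15

rng = np.random.default_rng(0)
X = rng.random((I, K))   # product descriptors (sensory + chemical)
Y = rng.random((I, J))   # liking scores: products x consumers
Z = rng.random((J, L))   # consumer background descriptors

# The "L" shape: X and Y share the product rows; Y and Z share the consumers
assert X.shape[0] == Y.shape[0]
assert Y.shape[1] == Z.shape[0]
print(X.shape, Y.shape, Z.shape)  # -> (6, 10) (6, 125) (125, 15)
```

These are exactly the dimensions of the apple data used in this tutorial, as described in the following sections.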
The data consists of information gathered on Danish children’s liking of apples. Their response to various apple types is termed Y. Chemical, physical and sensory descriptors of these apple types are called X, and sociological and attitude descriptors on these children is in matrix Z. The purpose of
the analysis is to find patterns in these X-Y-Z data that are causally interpretable and have predictive reliability.
We are now going to build an L-PLS regression model linking the panelists’ sensory, chemical and physical evaluations to the consumers and their sociological and attitude descriptors. The model will summarize all the information about consumers, consumers’ preference, the products and their characteristics.
What you will learn
This tutorial contains the following parts:
• Open and study the data.
• Build an L-PLS model which explains consumer likings for the different consumer segments from the descriptive sensory attributes and chemical measurements.
• Study the results.
• Verify the results.
References:
• L-shape Partial Least Square
• Partial Least Square regression
• Scatter plots
Data table
We are going to study three data tables that do not have all the same size. The structure of the data set is as follows:
• X - ApplesSensoryChem
• Y - ApplesLiking
• Z - AppleChildBackground
L-PLS Structure
In the following, matrices will be written in upper-case letters (e.g. X), vectors in lower-case, and scalar elements in italics; all vectors are column vectors unless otherwise specified.
The six products
The data are taken from Thybo et al. (2004). I = 6 products were the apple cultivars “Jonagold”, “Mutsu”, “Gala”, “Gloster”, “Elstar” and “GrannySmith”. All cultivars were selected due to commercial relevance for the Danish market and due to the fact that the cultivars were known to span a large variation in sensory quality (Kuhn and Thybo, 2001). Gloster was chosen as a wine-red cultivar with particularly high glossiness, Gala and Jonagold as red cultivars with 80–90% red blushed surface, Mutsu as a yellow-green cultivar and GrannySmith as a green and particularly round-shaped cultivar. GrannySmith was known to be a rather popular cultivar for some children, due to its texture and moistness characteristics. Only apples with shape and color deemed representative for their cultivar were used.
X data
The X data matrix (X - ApplesSensoryChem) contains the chemical, physical and sensory data of these apple types. Sensory profile descriptors: A panel of ten assessors was trained in quantitative descriptive analysis of apple types as described in Kuhn and Thybo (2001). Conventional statistical design with respect to replication and serving order was applied. The panel average of a subset of the appearance, texture, taste and flavour descriptors determined will be used here:
• Red
• Sweet
• Sour
• Glossy
• Hard
• Round
Chemical and instrumental product descriptors:
• Texture firmness was evaluated instrumentally by penetration (FIRM Instrument).
• Content of acid (ACIDS) and sugar (SUGARS) were determined as malic acid and soluble solids, respectively.
• Based on prior theory on human sensation of sourness, the ratio ACIDS/SUGARS was included as a separate variable (Kuhn and Thybo, 2001).
Together, the sensory, chemical and instrumental variables constituted K=10 product descriptors, which will here be referred to as X(I × K) for the I = 6 products.
Y data
The Y data (Y - ApplesLiking) consists of information gathered on Danish children’s liking of apples. Their response to various apple types is termed Y. Each child was asked to express the liking of the appearance of the six apple cultivars, using a five-point facial hedonic scale:
1. “not at all like to eat it”
2. “not like to eat it”
3. “it is okay”
4. “like to eat it”
5. “very much like to eat it”.
One apple at a time was shown to the child, so that the child would not concentrate on comparing the appearances. All samples were presented in randomized order. The resulting liking data for the I = 6 products × J = 125 consumers will here be termed Y(I × J).
Z data
The Z data table (Z - AppleChildBackground) contains the information collected about the consumers: sociological and attitude descriptors on these children.
The consumers were children aged 6–10 years (51% boys, 49% girls), recruited from a local elementary school. A total of 146 children were tested and included in the original publication of Thybo et al. (2004). For simplicity, only the J = 125 children that had no missing values in their liking and background data are included in the present study.
First, each child was asked to look at a table with five different fruits (a red and a green apple, a banana, a pear and an orange (mandarin)), and answer the questions: “If you were asked to eat a fruit,
which fruit would you then choose, and which fruit would be your last choice?” The resulting responses will here be named “fruitFirst” and “fruitLast”, where “fruit” is one of Red Apple, Green Apple, Pear, Banana, Orange or Apple. (Summaries were later computed for apple liking: AppleFirst = RedAppleFirst + GreenAppleFirst and AppleLast = RedAppleLast + GreenAppleLast.)
The child was also asked how often he or she ate apples, with the following response options: “every day” (here coded as value 4), “a couple of times weekly” (3), “a couple of times monthly” (2), “very seldom” (1); this descriptor is here named “EatAOften”. (A few of the children responded “do not know”; to reduce the number of missing values, this was for simplicity taken as indicating very low apple consumption, and coded as 0.) In addition, the child’s gender and age were noted. These two sociological descriptors were used, together with the attitude variables fruitFirst and fruitLast and the eating-habit variable EatAOften, as L = 15 consumer background descriptors Z(J × L) for the J = 125 children.
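The coding described above can be sketched as follows. The labels and numeric codes follow the tutorial; the example responses are invented for illustration:

```python
# Coding of the "EatAOften" descriptor, as described above.
CODES = {
    "every day": 4,
    "a couple of times weekly": 3,
    "a couple of times monthly": 2,
    "very seldom": 1,
    "do not know": 0,  # taken as indicating very low apple consumption
}

responses = ["every day", "do not know", "very seldom"]  # invented answers
eat_a_often = [CODES[r] for r in responses]
print(eat_a_often)  # [4, 0, 1]
```

Coding “do not know” as 0 rather than treating it as missing is a pragmatic choice that keeps all 125 children in the data set.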
Open and study the data
Click the following link to import the Tutorial L data set used in this tutorial.
There are three matrices:
• X - ApplesSensoryChem
• Y - ApplesLiking
• Z - AppleChildBackground
Build an L-PLS model

The model will explain the consumer likings from the descriptive sensory attributes, while also using the consumer background information.
Go to the menu Tasks - Analyze - L-PLS Regression….
Tasks - Analyze - L-PLS Regression…
• In X select the variable set “X - ApplesSensoryChem”, in Rows and Columns select All.
• In Y select the variable set “Y - ApplesLiking”, in Rows and Columns select All.
• In Z select the variable set “Z - AppleChildBackground”, in Rows and Columns select All.
• Set the maximum components to 10 PCs.
L-PLS regression settings
Then set the weights individually as follows:
• Click on the X Weights option. Select all the variables by clicking on the All button. Then select the option “A / (SDev + B)” with the radio button. Finally click on the Update button.
• Click on the Y Weights option and use the weighting option “A / (SDev + B)” for all the variables.
• Click on the Z Weights option and use the weighting option “A / (SDev + B)” for all the variables.
L-PLS regression settings: Weights
Once all necessary options have been selected, click OK to start the computations.
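The “A / (SDev + B)” weighting selected above can be sketched numerically. This is an illustrative sketch, assuming the default A = 1 and B = 0, so each variable is simply divided by its standard deviation; the data are invented:

```python
import numpy as np

# Sketch of the "A / (SDev + B)" weighting (assumed defaults A=1, B=0).
def weight_columns(X, A=1.0, B=0.0):
    sdev = X.std(axis=0, ddof=1)      # per-variable standard deviation
    return X * (A / (sdev + B))       # apply the weight column-wise

X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])
Xw = weight_columns(X)
print(Xw.std(axis=0, ddof=1))         # each column now has unit SD
```

This puts variables measured on very different scales (sensory scores, chemical contents, liking ratings) on an equal footing before the regression.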
Interpret the results
View the results and study the different plots:
• L-PLS Overview
• Correlation Loadings
• Correlation
L-PLS Analysis node
Variances
Study the bottom right plot in the L-PLS overview. It presents the explained variances of the three data tables: X, Y and Z.
Four components are needed to explain the X table, which is the table that is explained best.

The Y table needs 5 factors to reach 72% explained variance.

The Z table is explained at 69% with 10 PCs. It is always more difficult to explain all the variance in this table, as it relates to the background of the consumers.
Products: Scores
Look at the products in the Score plot in the Correlation Loadings.
Score plot
It shows the main patterns of the six products. Products 6 (Granny Smith) and 4 (Mutsu) are grouped together, and products 1 (Jonagold), 2 (Gloster), 3 (Elstar) and 5 (Gala) are grouped together. Product 3 is close to the center, which means it is close to the average sample.
The horizontal dimension (Factor 1) spans the contrast between Granny Smith and the other products, mainly Gala, Gloster and Jonagold. The vertical dimension (Factor 2) spans the contrast between Elstar and the other products. The correlations are rather weak, indicating that the variations in the second dimension are weaker than in the first dimension.
Product descriptors X: X Correlation Loadings
Look at the Product descriptors in the plot X Correlation Loadings in the L-PLS overview.
X Correlation Loadings
It shows the main patterns of the sensory, instrumental and chemical product descriptors. The horizontal dimension is seen to span the sensory contrast between Sour and Sweet, and the chemical contrast between the Acids/Sugars ratio and the Sugar content. Sensory Red color is correlated with Sweet apples. The vertical dimension (Factor 2) mainly contrasts properties like sensory Hard and instrumentally Firm against sensory Round shape and high content of Acids and Sugars.
Consumer descriptors Z: Z Correlation Loadings
Look at the Consumer descriptors in the plot Z Correlation Loadings in the L-PLS overview.
Z Correlation Loadings
It shows the main patterns of the consumer background descriptors. The horizontal dimension spans a tendency to choose the green apple first and the red apple last (GreenAFirst, RedALast), against the tendency to choose the red apple first and the green apple last. The vertical dimension exhibits a contrast between choosing pear first and banana last against choosing banana first and pear last. The purely sociological variables (gender, age, how often apples are eaten) are not particularly evident in the result, although gender (coded as being a girl) is slightly associated with choosing green apple first, pear first and banana last.
Consumer liking of the products Y: Y Correlation Loadings
Look at the Consumer liking of the products in the plot Y Correlation Loadings in the L-PLS overview.
Y Correlation Loadings
It shows the main, product-related patterns of the consumers with respect to liking. Most of the 125 children gather towards either end of the horizontal dimension. The second, vertical dimension (Factor 2) is much less extensive and spans fewer children.
Overview of the L-PLS Regression solution
Look at the plot Correlation as a general picture.
Correlation
In the horizontal dimension, the product GrannySmith is seen to be particularly Sour and not Sweet; it has a high Acids/Sugars ratio and a low level of Sugars. It is also Hard and not Red. The products Gala, Gloster and Jonagold display the opposite tendency.
GrannySmith is seen primarily to be liked by children who were observed to choose green apple first and red apple last, not by children who were observed to choose red apple first and green apple last. Again, products Gala, Gloster and Jonagold seem to display the opposite of this tendency.
In the vertical dimension, the product Elstar is seen to be particularly Round, with high levels of both Acids and Sugars, but neither instrumentally Firm nor sensory Hard; nor was it Glossy. In contrast, the products Mutsu, Gloster and Gala appear a little more Firm and Glossy, with less Sugars and Acids than the others.
Product Elstar seems primarily to be liked by children who chose banana first and pear last, and less liked by children who chose pear first and banana last. In contrast, e.g. Mutsu seemed to be associated with the liking of children who chose pear first.
Verify the results
With a relatively complex modeling tool like the L-PLS regression, it is important to verify the main aspects of the interpretation by plotting the raw data.
Products liking
Plot a scatter plot of the most extreme products (liking GrannySmith vs. liking Jonagold) and look at the correlation.
With only five response levels possible, many data points are superimposed and the pattern is difficult to see. But the raw liking data are clearly negatively correlated (r = −0.29 over the 125 subjects), as expected.
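The correlation reported here is a plain Pearson r over the children. A minimal sketch of the computation, using invented 5-point hedonic scores rather than the tutorial's actual data:

```python
import numpy as np

# Pearson correlation between two liking columns (invented scores).
def pearson_r(a, b):
    a = np.asarray(a, float) - np.mean(a)   # center both vectors
    b = np.asarray(b, float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

granny = [5, 4, 1, 2, 5, 1]   # hypothetical liking of GrannySmith
jona = [1, 2, 5, 4, 2, 5]     # hypothetical liking of Jonagold
print(round(pearson_r(granny, jona), 2))  # strongly negative here
```

The superimposed points mentioned above do not affect the computation; r is defined on the raw paired values, however many coincide.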
Liking Y vs. consumer background Z
Plot a scatter plot of the liking of the green apple GrannySmith against the background response “GreenAFirst”.

To do so, copy the row “GreenAFirst” in the Z table, insert a new row in the Y table, and paste the “GreenAFirst” row there. It is now possible to generate a scatter plot.
There is a clear tendency (r = 0.52 over 125 subjects) that if children chose green apple first, they reported that they liked GrannySmith.
Product descriptor rows in X
Plot a scatter plot of the standardized sensory and chemical variables for the two most extreme products, GrannySmith and Jonagold.
To do so, select the X matrix and go to Tasks - Transform - Center and Scale.
Tasks - Transform - Center and Scale
Select All for Rows and Cols. For the Transformation field select Mean for Center and Standard deviation for Scale.
Center and Scale window
From the new matrix generated called “X - ApplesSensoryChem_CenterAndScale” select the “JonaSC” and “GrannySmithSC” rows and select a scatter plot under the menu Plot.
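The Center and Scale transform applied above is ordinary autoscaling: subtract each column's mean and divide by its standard deviation. A minimal sketch with invented data:

```python
import numpy as np

# Autoscaling: center each column to mean 0 and scale to unit SD.
def autoscale(X):
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

X = np.array([[2.0, 100.0], [4.0, 200.0], [6.0, 300.0]])
Xs = autoscale(X)
print(Xs.mean(axis=0))         # ~[0, 0]
print(Xs.std(axis=0, ddof=1))  # [1, 1]
```

After this transform, rows of X can be compared directly across variables that originally had very different units, which is what makes the product-vs-product scatter plot below meaningful.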
Again, these two products are seen to be described as near opposites: Jonagold is Sweet, Red and high in Sugars, while GrannySmith has a high Acids/Sugars ratio and is Sour, Hard and Round. The correlation is r = −0.72 between these two rows of 10 standardized X variables.
Product descriptor columns in X
Plot a scatter plot of the sensory descriptor Sour and the instrumental descriptor FIRM Instrument.
As expected from the L-PLS regression model, these two variables are almost orthogonal, with r = 0.07 over the six products.
Bibliography
B.F. Kühn, A.K. Thybo, The influence of sensory and physiochemical quality on Danish children’s preferences for apples, Food Qual. Pref. 12, 543–550 (2001).

H. Martens, E. Anderssen, A. Flatberg, L.H. Gidskehaug, M. Hoy, F. Westad, A. Thybo, M. Martens, Regression of a data matrix on descriptors of both its rows and of its columns via latent variables: L-PLSR, Computational Statistics & Data Analysis 48, 103–123 (2005).

A.K. Thybo, B.F. Kühn, H. Martens, Explaining Danish children’s preferences for apples using instrumental, sensory and demographic/behavioral data, Food Qual. Pref. 15, 53–63 (2004).
Tutorial M: Variable selection and model stability
Learn how to use the Uncertainty Test results in practice.
• Description
  o What you will learn
  o Data table
• Create a PLS model
• Interpret a PLS model
  o Variance plot
  o Score plot
  o Loading plot
  o Weighted regression coefficients
  o Stability plots
    § Stability in loading weights plots
    § Stability in score plots
• Conclusions
Description
In this work environment study, PLS regression was used to model 34 samples corresponding to 34 departments in a company. The data were collected from a questionnaire about overall job satisfaction (Y), modeled from 26 questions (X1, X2, …, X26) about repetitive tasks, inspiration from the boss, helpful colleagues, positive feedback from the boss, etc. The unit for these questions was the percentage of people in each department who ticked “yes”, e.g. “I can decide the pace of my work”. The response variable was the overall job satisfaction, on a scale from 1 to 9.
What you will learn
This tutorial contains the following parts:
• PLS regression
• Validation methods
• Uncertainty estimates
• Interpretation of plots
This tutorial is also presented differently than the other tutorials, with less detailed instructions for each task, making it slightly more demanding.
Data table
Click the following link to import the Tutorial M data set used in this tutorial. The data already have several row and column sets defined, but you must define the column set for the response variable, job satisfaction.
Create a PLS model
Click Tasks - Analyze - Partial Least Squares Regression to run a PLS regression and choose the following settings:
Model inputs
• Predictors: X: Tutorial M, Rows: all, Cols: XData
• Responses: Y: Tutorial M, Rows: all, Cols: Job satisfaction
• Maximum components: 7
• Mean center data: Enable tick box

X Weights (1/SDev)
Select all the variables, select the radio button A/(SDev+B), and click Update.

Y Weights (1/SDev)
Select “Job satisfaction”, select the radio button A/(SDev+B), and click Update.

Validation
Full cross-validation. Click on the Setup… button to select this option. Select the Uncertainty test for the optimal number of factors.
Select Uncertainty test
Click on OK when everything is set.
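The full cross-validation selected above refits the model once per sample, each time leaving that sample out and predicting it. The resampling loop can be sketched as follows; ordinary least squares stands in for PLS here, and the 34 "departments" are invented data:

```python
import numpy as np

# Leave-one-out (full) cross-validation. Each sample is left out once,
# the model is refit on the rest, and the held-out sample is predicted.
def loo_cv_predictions(X, y):
    n = len(y)
    y_pred = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                          # drop sample i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        y_pred[i] = X[i] @ b                              # predict sample i
    return y_pred

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(34), rng.normal(size=(34, 3))])  # 34 samples
y = X @ np.array([5.0, 1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=34)
print(np.corrcoef(loo_cv_predictions(X, y), y)[0, 1])     # close to 1
```

The 34 refitted submodels are exactly what the Uncertainty test reuses in the jack-knifing step described below the variance plot.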
Interpret a PLS model
The Unscrambler® regression overview gives by default the Score plot (Factor 1 vs. Factor 2), the X- and Y-loadings plot (Factor 1 vs. Factor 2), the explained variance, and the Predicted vs. Measured plot for 2 factors for this PLS regression model.
Variance plot
The initial model indicated 2 factors as the optimal model dimension by full cross-validation. The cross-validation has thus created 34 submodels, each with one sample left out. As a second step, the uncertainties of the various model parameters for all X-variables were then estimated by jack-knifing, based on a two-factor model.
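The jack-knifing reuses the 34 leave-one-out submodels: the spread of a parameter across the submodels gives its uncertainty estimate. A simplified sketch for a single regression slope, with invented data; the actual Uncertainty test applies this idea to the PLS model parameters, and the exact variance formula used by the software may differ:

```python
import numpy as np

# Jack-knife standard error of a regression slope from leave-one-out refits.
def jackknife_se(x, y):
    n = len(y)
    slopes = np.array([np.polyfit(np.delete(x, i), np.delete(y, i), 1)[0]
                       for i in range(n)])    # one slope per submodel
    full = np.polyfit(x, y, 1)[0]             # slope of the full model
    # combine squared deviations from the full-model slope
    se = np.sqrt((n - 1) / n * np.sum((slopes - full) ** 2))
    return full, se

rng = np.random.default_rng(1)
x = rng.normal(size=34)
y = 2.0 * x + rng.normal(scale=0.5, size=34)
b, se = jackknife_se(x, y)
print(b, "+/-", 1.96 * se)    # approximate 95% confidence interval
```

If the interval b ± 1.96·se crosses zero, the parameter would not be flagged as significant at the 5% level, which is the criterion used for the regression coefficients below.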
In the variance plot the validation curve (red) shows 62% explained variance for 2 factors, which is rather good for data of this kind.
Plot of explained y-variance
Score plot
The score plot shows that the samples are well distributed with no apparent outliers.
Plot of scores
Loading plot
The relations between all the variables are more easily interpreted in the correlation loadings plot than in the loadings plot, as the explained variance can be read directly from the plot: the inner circle depicts 50% explained variance and the outer circle 100%.
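Correlation loadings are the correlations between each variable and each score vector, which is why the circle radii translate directly into explained variance (radius² = fraction of the variable's variance explained). A sketch under that definition, with invented data and PCA scores standing in for the PLS factors:

```python
import numpy as np

# Correlation loadings: correlation of each variable with each score vector.
def correlation_loadings(X, scores):
    return np.array([[np.corrcoef(X[:, k], scores[:, a])[0, 1]
                      for a in range(scores.shape[1])]
                     for k in range(X.shape[1])])

rng = np.random.default_rng(2)
X = rng.normal(size=(34, 5))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]             # first two principal-component scores
R = correlation_loadings(X, scores)
# squared radius of each variable = variance explained by the two factors
print((R ** 2).sum(axis=1))
```

A variable plotted between the two circles is therefore at least 50% explained by the two factors shown.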
Activate the X-Loadings plot by clicking in it, then use the corresponding shortcut button on the toolbar; it will display the two circles.
The most important variables for job satisfaction (Y) seem to be related to how the employees evaluate their leader. Questions related to the work span the direction from upper left to lower right in the plot.
Plot of correlation loadings
The variables found significant are marked with circles in the loading plot. If they are not shown by default, activate the marking of the significant variables using the corresponding toolbar button.
Although the variable pattern can be interpreted in the correlation loadings, the importance of the variables is better summarized in terms of the regression coefficients in this case. Recall that the loadings describe the structure in X and Y whereas the loading weights are more relevant to interpret for the importance in modeling Y. Alternatively, the predefined plots under the weighted regression coefficients may be investigated.
Weighted regression coefficients
Click on the regression coefficient plot in the navigator.
Regression coefficient plot in the navigator
The automatic function Mark significant variables clearly shows which variables have a significant effect on Y.
When plotting the regression coefficients one can also plot the estimated uncertainty limits as an approximate 95% confidence interval as shown below.
Plot of the weighted regression coefficients
For example, the variable “disrespect” has uncertainty limits crossing the zero line: it is not significant at the 5% level. Zoom in with Ctrl + right click to see the details.
13 out of the 26 X-variables are found to be significant at the 5% level. However, nothing prevents setting the cut-off at another level, depending on the application. A variable with a large regression coefficient may still not be significant, if the uncertainty estimate indicates that the relation between the variable and Y is due to only a few samples spanning the range. One effective way to visualize this is the stability plot.
The corresponding p-values are given in the output node, in the validation folder.
p-values for the regression coefficients
Stability plots
Stability in loading weights plots
Go back to the loading plot. By clicking the toolbar button Stability plot the model stability is clearly visualized.
Stability in loading weights plots
Variable 11, “Help”, is not very stable: the two departments 15 and 26 have much lower values than the others and are thus influential for this variable. This indicates that the variable is probably not reliable for predicting “Job satisfaction”.
This can be studied in a scatter plot of “Help” versus “Job satisfaction”.

To plot it, go back to the data table “Work environment case” and select column 11 “Help” as well as column 27 “Job satisfaction” (hold Ctrl to select both).

Then go to Plot - Scatter or click on the scatter plot icon.
“Help” versus “job satisfaction”
This plot shows that the variable X11 “Help” (Do you find your colleagues helpful?) is only weakly correlated with “Job satisfaction”. The two suspicious departments are influential in this relation.
Stability in score plots
Go back to the score plot. By clicking the toolbar button Stability plot the model stability is clearly visualized.
Stability plot of scores
For each sample one can see a swarm of its scores from each submodel. There are 34 sample swarms. In the middle of each swarm is the score for the sample in the total model.
Clicking on any point gives information about the segment. Thus, in the case of full cross-validation one can directly see how the model changes when a particular sample is kept out. In other words, a sample that makes the model change when it is left out has influenced all the other submodels due to its uniqueness.
The score and loading stability plots are also very useful for higher factors in models as they indicate when noise is becoming the main source for a specific component.
Conclusions
In the work environment example, the global picture from the score stability plot shows that all samples seem good and the model seems robust. Also, the uncertainty test indicates 13 significant variables at the 5% level, as visualized with the 95% confidence intervals.