The Unscrambler X Tutorials
Produced by CAMO
Compiled by Zheng Guanghui
Tutorial A: A simple example of calibration
• Description
• Opening the project file
• Define ranges
• Univariate regression
• Calibration
• Interpretation of the results
• Prediction
• Evaluation of the predicted results

Tutorial B: Quality analysis with PCA and PLS
• Description
• Preparing the data
• Objective 1: Find the main sensory qualities
• Objective 2: Explore the relationships between instrumental/chemical data (X) and sensory data (Y)
• Objective 3: Predict user preference from sensory measurements

Tutorial C: Spectroscopy and interference problems
• Description
• Get to know the data
• Univariate regression
• Calibration
• Multiplicative Scatter Correction (MSC)
• Check the error in original units: RMSE
• Predict new MSC-corrected samples
• Guidelines for calibration of spectroscopic data

Tutorial D: Screening and optimization designs
• Description
• Build a screening design
• Estimate the effects
• Draw a conclusion from the screening design
• Build an optimization design
• Compute the response surface
• Draw a conclusion from the optimization design

Tutorial E: SIMCA classification
• Description
• Reformat the data table
• Graphical clustering
• Make class models
• Classify unknown samples
• Interpretation of classification results
• Diagnosing the classification model

Tutorial F: Interacting with other programs
• Description
• Import spectra from an ASCII file
• Import responses from Excel
• Create a categorical variable
• Append a variable to the data set
• Organizing the data
• Study the data before modeling
• Make a PLS Model
• Save PLS model file
• Export ASCII-MOD file
• Export data to ASCII file

Tutorial G: Mixture design
• Description
• Design variables and responses
• Build a simplex centroid design
• Import response values from Excel
• Check response variations with statistics
• Model the mixture response surface
• Conclusions

Tutorial H: PLS Discriminant Analysis (PLS-DA)
• Description
• Build PLS regression model
• Classify unknown samples
• Some general comments on classification

Tutorial I: Multivariate curve resolution (MCR) of dye mixtures
• Description
• Data plotting
• Run MCR with default options
• Plot MCR results
• Interpret MCR results
• Run MCR with initial guess
• Validate the estimated results with reference information
• View an MCR result matrix

Tutorial J: MCR constraint settings
• Description
• Data plotting
• Estimate the number of pure components and detect outliers with PCA
• Run MCR with default settings
• Tune the model’s sensitivity to pure components
• Run MCR with a constraint of closure
• Remove outliers and noisy wavelengths with recalculate

Tutorial K: Clustering
• Description
• Transform the raw spectra
• Application of K-Means clustering
• Application of Hierarchical Cluster Analysis (HCA)
• Repeat the HCA using a correlation-based measure
• Using the results of HCA to confirm the results of PCA

Tutorial L: L-PLS
• Description
• Open and study the data
• Build a L-PLS model
• Interpret the results
• Verify the results
• Bibliography

Tutorial M: Variable selection and model stability
• Description
• Create a PLS model
• Interpret a PLS model
• Conclusions
Tutorial A: A simple example of calibration
• Description
  o Expected outcomes of this tutorial
  o Data table
• Opening the project file
• Define ranges
• Univariate regression
• Calibration
• Interpretation of the results
• Prediction
• Evaluation of the predicted results
Description
This tutorial aims to provide an example of measuring the concentration (Y) of a chemical constituent “a” by conventional transmission spectroscopy. The situation is complicated by the presence of an interferent “b”, present in varying unknown quantities, whose instrument response strongly overlaps that of “a”.
Expected outcomes of this tutorial
This tutorial contains the following tasks and procedures:
• Open a project file.
• Define row and column sets.
• Compare the results of univariate vs. multivariate regression.
• Develop calibration models.
• Predict new samples.
• Validate the model for future use.
• Analyze and interpret regression coefficients.
• Explore the plotting options available for these methods.
References:
• Basic principles in using The Unscrambler®
• Descriptive Statistics
• About Regression methods
• Prediction
• Validation
Data table
The data for this tutorial can be found in the project file “Tutorial A” in the “Data” directory installed with The Unscrambler®.
Seven solutions (samples) of known concentration (Y) of the constituent “a” will be used as the calibration set. Three other (test) samples of unknown concentration are also available; these will be predicted using the developed regression model.
Light absorbance was measured at two different wavelengths, namely Red and Blue. Red is variable 1, Blue is variable 2. Variable 3 has been designated as the concentration of a.
Opening the project file
Task
Open the project “Tutorial A” into The Unscrambler® project navigator and study the data in the Editor. Use the Descriptive Statistics functionality to view some basic characteristics of the data table.
How to do it
Use File - Open to select the project file “Tutorial_A.unsb” in The Unscrambler® data samples directory. This directory is typically located in C:\Program Files\The Unscrambler X\Data.
For the purposes of this tutorial, click the following link to import the data. Tutorial A data set
The project should now be visible in the project navigator and the data should be displayed in the editor.
Note that the values for variable Comp “a” are missing (blank) for the 3 Unknown samples.
Use the Tasks-Analyze-Descriptive Statistics… option to view some basic statistics of the data, including the Mean, Standard Deviation, Skewness etc.
Tasks-Analyze-Descriptive Statistics…
The following dialog will open. Select the data matrix to be analyzed and ensure that no rows or columns have been excluded from the analysis.
Descriptive Statistics Dialog
After clicking OK, the statistics will be computed. A new analysis node will appear in the project navigator providing some simple plots and analysis of the data.
Descriptive Statistics Results Matrix
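The Mean, Standard Deviation and Skewness shown in the results matrix can also be reproduced outside the software. A minimal sketch with NumPy and SciPy, using made-up absorbance readings rather than the actual Tutorial A values:

```python
import numpy as np
from scipy import stats

# Hypothetical absorbance readings for illustration (not the Tutorial A values)
red = np.array([1.2, 2.1, 2.9, 4.2, 5.0, 6.1, 7.2])

mean = np.mean(red)
sdev = np.std(red, ddof=1)      # sample standard deviation (n - 1 denominator)
skew = stats.skew(red)          # third-moment skewness

print(f"Mean: {mean:.3f}  SDev: {sdev:.3f}  Skewness: {skew:.3f}")
```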
Define ranges
In most practical applications of multivariate data analysis, it is necessary to work on subsets of the data table. To do this, one must define ranges for variables and samples. One Sample Set (Row range) and one Variable Set (column range) make up a virtual matrix which is used in the analysis.
Task
Define two Column ranges (variable sets), one for “Light Absorbance” and the other for “Constituent a”. Also define two Row ranges (sample sets): “Calibration Samples” and “Prediction Samples”.
How to do it
There are two options for defining data ranges in The Unscrambler®:
Create Row/Column ranges using the right mouse click option
Highlight a range of variables to be defined and right click in the column header. This will display the Create Column Range option. Sample sets can also be defined as row ranges using a similar method and selecting Create Row Range.
Create a column range
Rename the column set by highlighting it in the project navigator, and right clicking. Choose the Rename option, and change the name to “Constituent a”.
Repeat this process for the “Light Absorbance” set containing the first two columns and the row sets: “Calibration” containing samples 1 to 7 and “Prediction” containing samples 8 to 10.
Use Edit - Define Range… to create row and column sets.
Open the Define Range dialog from the Edit menu. Define the data as follows,
Name: Light Absorbance
Interval: columns 1-2
Define Range Dialog
Enter the Column numbers directly into the Set Interval field under rows and columns.
Deselect variables marked by mistake by pressing Ctrl while clicking on the variable to be removed from the set.
Click OK.
Similarly define the second variable Set using the Edit -Define Range option and specifying:
• Name: Constituent A
• Set Interval: Column 3
Click OK.
Choose Edit - Create Row Range to create sample sets.
Four sample and variable sets should now be displayed in the project navigator.
Data set with ranges
By organizing the data into sets from the beginning, one adds value to the analysis and can use this information to communicate results. All analyses and plotting will be much easier to set up, and the sets can be used in the visualization of results.
Remember to save the project before proceeding: select File - Save or press the Save button on the toolbar.
Univariate regression
The simplest regression method (univariate regression) can be simply visualized in a 2-dimensional scatter plot.
Task
Make a regression model of component “a” and the absorbance of red light.
How to do it
Perform the regression by plotting the red light variable against Constituent a. Select Plot - Scatter from the Plot menu. The following plot should appear.
Scatter plot
The univariate regression should be performed on the calibration samples only, as the Y-values are missing in the prediction set.
The plot is displayed without the trend lines visible. Toggle the regression and/or target line on and off using the toolbar shortcut. Also view the statistics for the plot, toggling the statistics display on and off using its toolbar shortcut.
Statistics for the plot are shown in a special frame in the upper left corner.
Scatter plot with trend lines and statistics
The displayed correlation value of 0.91 indicates that the two variables are highly correlated. The univariate model for this data can be generated using the Offset value and Slope value. The equation is as follows:
Comp"a" = -0.9285 + 0.59524 * Red
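The Offset and Slope used above come from an ordinary least-squares fit of concentration on the Red absorbance, and the displayed correlation is the Pearson coefficient. A sketch of the same calculation with NumPy, on invented calibration values (not the actual Tutorial A numbers):

```python
import numpy as np

# Invented calibration data: Red absorbance (x) and known concentration of "a" (y)
red  = np.array([2.0, 3.0, 4.0, 5.5, 6.0, 7.5, 9.0])
conc = np.array([0.4, 0.9, 1.5, 2.2, 2.6, 3.5, 4.3])

slope, offset = np.polyfit(red, conc, 1)   # fit conc = offset + slope * red
r = np.corrcoef(red, conc)[0, 1]           # Pearson correlation

print(f"conc = {offset:.4f} + {slope:.4f} * Red   (r = {r:.3f})")
```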
Calibration
This section describes how to develop the simplest multivariate model containing two predictor (X) variables.
Task
Make a PLS regression model between the absorbance measurements and the concentration of “a”.
How to do it
Select Tasks - Analyze - Partial Least Squares Regression… to display the PLS regression dialog. Use the following parameters to define the model:
Model inputs
• Rows (indicating which samples to use): Calibration Samples (7)
• Predictors, X: Light Absorbance (2)
• Responses, Y: Constituent a (1)
• Maximum components: 2
Check the Mean center Data and Identify Outliers boxes.
Partial Least Squares Regression Dialog: Model Inputs
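Mean centering, selected above, subtracts each variable's average so the model describes variation about the mean rather than absolute level. In code terms (illustrative numbers only):

```python
import numpy as np

# Illustrative two-variable data matrix (rows are samples)
X = np.array([[2.0, 1.1],
              [3.0, 1.9],
              [4.0, 2.4]])

X_centered = X - X.mean(axis=0)   # subtract each column's mean
print(X_centered.mean(axis=0))    # each column now averages (close to) zero
```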
Weights
Click the tabs for both X and Y weights to see which options apply for each sheet. Since the data are of spectral origin, ensure the weights are set to All 1.0.
Validation
Under the validation tab select the cross validation option. Click on Setup to choose Full from the drop-down list.
It is important to properly validate models. Leverage correction is not recommended, as it gives an overly optimistic estimate of the model error. The estimate of the prediction error (validation variance) is more conservative with cross validation than with leverage correction.
Cross Validation Dialog
Click OK to start the calibration.
Interpretation of the results
Task
• Display the results of the modeling steps.
• Interpret the Y-Residual Validation Variance Curve.
• Study the Regression Coefficients plot and provide an interpretation.
Display the model results
From the project navigator, display the Regression Overview plots. Four predefined plots make up the Regression Overview:
• Scores
• Loadings
• Variance
• Predicted vs measured
PLS Regression Overview
When OK has been selected in the PLS dialog box and Yes has been selected to view the plots, a PLS node will be added to the project navigator. This node contains the following:
• Raw data
• Results
• Validation
• Plots
The raw data used for building the model is stored in the results folder. Validation results matrices generated from the model can be viewed along with predefined plots for the analysis.
Toggle between different plots from those available in the project navigator. Alternatively use the Plot… menu option, or right click in a plot to select a desired plot.
Information about the model is available in the Information field, located at the bottom of the project navigator view. Information such as how many samples were used to develop the model and the optimal number of factors is contained here.
Model info box
A number of important calculated results matrices may be obtained from the PLS node.
Returning to the PLS overview, activate the Scores plot, which is in the upper left quadrant of the overview, by clicking in it.
Right click on this plot and select the Properties option.
Properties option
Select Point label from the available options, and in the dialog change the label to sample number instead of sample name.
Properties: Point label
In the properties dialog it is possible to make other customizations to the plot.
Click OK.
Activate the Predicted vs. Measured plot (lower right quadrant of the PLS overview). In this plot, colors are used to differentiate between Calibration results (in blue) and Validation results (in red).
Use the Next Horizontal PC and Previous Horizontal PC buttons to display the Predicted vs. Measured for one and two PLS Factors.
Use the Cal/Val buttons to toggle between the calibration and validation samples. It is also possible to toggle the regression and trend lines on and off from the toolbar.
Interpret the Y-Residual Validation Variance Curve
Activate the Y residuals plot in the lower left quadrant of the PLS overview and choose Cal/Val for Y from the toolbar shortcuts.
Notice that the residual variance increases going from factor 0 to factor 1. This usually indicates the presence of outliers in the data, which should be removed (with justification) before final validation of the model.
Residual Y variance plot
However, for the purposes of this tutorial, the main goal is to become familiar with the use of The Unscrambler®.
Study the Regression Coefficients Plot
From the main menu, choose the Plot - Regression Coefficients - Raw - Line option. Change the plot layout to a bar chart using the toolbar shortcut.
Regression coefficients
This illustrates how to view the Raw regression coefficients (B), which define the model equation. View the regression coefficients for the next factor using the arrows on the toolbar.
In the present case, the values of the regression coefficients remain unchanged when shifting from Weighted coefficients (Bw) to Raw coefficients (B). The reason is that the weights were chosen as All 1.0 (no weighting) for the purposes of calibration.
Regression coefficients can be viewed in different ways, such as lines, bars and accumulated bars from the respective shortcut buttons found in the toolbar.
Hovering the mouse cursor over one of the bars displays numerical information associated with that variable; clicking a bar opens the object information window. For the two-factor model developed in this tutorial, the b-coefficient for the Red absorbance is 1.042, the b-coefficient for the Blue absorbance is -0.2083, and the offset (B0) is 1E-15, i.e. approximately zero.
The b-coefficients can also be shown as a table by selecting the matrix Beta coefficients (raw) in the Result folder of the PLS node in the project navigator.
Regression coefficients matrix
The b-coefficients define the model equation relating the concentration of “a” to the Red and Blue light absorbances:
Concentration of “a”: a = 0 + 1.042 * Red – 0.2083 * Blue
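With the raw coefficients and offset in hand, prediction for new samples reduces to a dot product. A sketch applying the equation above; the new absorbance values here are hypothetical:

```python
import numpy as np

b0 = 0.0                          # offset (B0), approximately zero here
b  = np.array([1.042, -0.2083])   # raw coefficients for [Red, Blue]

# Hypothetical absorbances for three new samples, columns [Red, Blue]
X_new = np.array([[2.5, 1.2],
                  [4.0, 3.6],
                  [6.1, 2.0]])

conc = b0 + X_new @ b             # predicted concentration of "a" for each sample
print(conc)
```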
Recall the value of the coefficient for Red in the univariate model (0.59524); the multivariate model gives a different result.
The results should be saved in the project with the data.
Select File - Save or use the save tool and give the project file the name “Tutorial A”.
Prediction
The main purpose of developing a regression model is for future prediction of the properties of new samples measured in a similar way.
Task
Use the PLS calibration model to predict the concentration of “a” for the three unknown samples in the data table.
How to do it
Use the Tasks - Predict- Regression… option to predict the values of the new samples. Enter the parameters below in the Prediction dialog:
Prediction dialog
• Select model: PLS
• Components: 2
• Full Prediction
• Inlier statistics
• Mahalanobis distance
• Data Matrix: Tutor_a
• Rows: Prediction (3)
• Columns (X-variables): Light Absorbance (2)
• Y-reference: no selection (do not include Y-reference values)
It is possible to find all models in the current project using the drop-down list next to Select model. Select the PLS model developed and click OK to start the prediction.
Evaluation of the predicted results
During the development stage of a regression model, the quality of the predictions must be checked by evaluating the quality of the Predicted vs Measured plot.
The predictions can be checked when some reference measurements are available. This is not possible for the unknown samples in this tutorial as there are no reference measurements available
for these samples. However, a method exists for determining the quality of the predictions, based on the properties of projection modeling.
Task
Perform a prediction and evaluate the quality of the predicted results.
How to do it
First, evaluate the predicted results of the unknown samples and determine if these values are in the same range as the calibration range of samples. Select the Prediction plot under the new Predict/Plots node in the project navigator to visually assess the results.
Prediction with deviation
The predicted values are displayed as horizontal bars. The size of the bars represents the deviation (uncertainty) in the estimates. The numerical values for the Y Predicted values and Y deviations can be found in the output matrices, and are displayed under the plot. A comparison of these predictions to actual values cannot be made; however, if the new samples have predicted values similar to those in the calibration set and the deviation bars are small, the predictions can be considered reliable.
Predicted values
Another method for determining the reliability of the predicted values is to study the Inlier vs Hotelling T² plot available as a right click option in any plot. Select the Prediction - Inlier/Hotelling T² - Inliers vs Hotelling T² option to display this plot.
For a prediction to be trusted, its value must not be too far from the calibration samples; this is checked using the Inlier distance. The predicted sample's projection onto the model should also not be too far from the center; this is checked using the Hotelling T² distance.
Inliers vs Hotelling T²
In this case all the samples were found in the lower left corner of the plot, indicating that the predicted results can be trusted.
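The Hotelling T² value behind this plot can be computed from the model scores: each score is squared, divided by that factor's score variance, and the terms summed. A sketch with invented score values (not the scores of the actual model):

```python
import numpy as np

# Invented calibration-sample scores on two factors (columns are mean-centered)
T = np.array([[-1.2,  0.3], [-0.8, -0.4], [-0.3,  0.5], [ 0.1, -0.2],
              [ 0.4,  0.6], [ 0.7, -0.5], [ 1.1, -0.3]])

score_var = np.var(T, axis=0, ddof=1)   # variance of each factor's scores

def hotelling_t2(t):
    """T-squared for one sample's score vector: sum of t_i^2 / var_i."""
    return float(np.sum(t ** 2 / score_var))

# A projected sample with scores near the model center gives a small T-squared
print(f"{hotelling_t2(np.array([0.2, 0.1])):.3f}")
```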
Save the project before proceeding.
Returning to the PLS model results, the estimated prediction quality of the model may be determined. Under the PLS node in the project navigator, expand the Plots folder and select Predicted vs Measured to display this plot in the viewer.
The Predicted vs Measured plot appears.
Use the toolbar icons to toggle the regression and/or target lines on and off.
High quality predictions were obtained from this PLS model. Comparing the multivariate regression model with the univariate one shows the marked improvement gained by using the multivariate model.
Tutorial B: Quality analysis with PCA and PLS
• Description
  o Main learning outcomes
  o Data table
• Preparing the data
  o Insert categorical variables
  o Check column (variable) sets
  o Define sample sets
• Objective 1: Find the main sensory qualities
  o Make a PCA model
  o Interpret the variance plot in the PCA overview
  o Interpretation of the score plot for the PCA
  o Interpretation of the correlation loadings plot
  o Interpretation of scores and loadings
  o Interpretation of the influence plot
• Objective 2: Explore the relationships between instrumental/chemical data (X) and sensory data (Y)
  o Make a PLS regression model
  o Interpretation of the variance plot
  o Interpretation of the score plot
  o Interpretation of the loadings and loading weights plot
  o Interpretation of the predicted vs measured plot
• Objective 3: Predict user preference from sensory measurements
  o Make a PLS regression model for preference
  o Interpretation of the regression overview
  o Interpretation of the regression coefficients
  o Open result matrices in the Editor
  o Predict preference for new samples
  o Interpretation of Predicted with Deviation
  o Check the error in original units – RMSE
  o Export models from The Unscrambler®
Description
This tutorial aims to use multivariate techniques to analyze the quality of raspberry jam in order to determine which sensory attributes are relevant to “perceived quality”. The analysis will cover three aspects as follows.
1. A trained tasting panel has provided scores for a number of different variables using descriptive sensory analysis. In this tutorial the first objective is to find the main sensory quality properties relevant for raspberry jam.
2. The second objective is to find a way of rationalizing quality control, since the use of taste panels is very costly. In this application a number of laboratory instrumental measurements were investigated to potentially replace the sensory testing panel.
3. The third and final objective of this application is to predict consumer preference for raspberry jam from descriptive sensory analysis. The use of PLS regression modeling techniques was investigated in order to find a possible relationship between sensory data and preference.
Main learning outcomes
This tutorial contains the following parts and learning objectives:
• Explore methods for inserting categorical variables.
• Define ranges in data sets.
• Investigate the relationships existing in a single data table by the use of PCA.
• Interpret scores and loadings of the PCA and draw relevant conclusions.
• Run a PLS regression for understanding the relationships between two data tables.
• Export models from The Unscrambler®, potentially to other applications.
• Predict response values for new samples.
• Estimate regression coefficients and interpret them.
• Find the optimal number of components or factors in multivariate models.
References:
• Basic principles in using The Unscrambler®
• PCA Analysis
• About Regression methods
• Exporting data from The Unscrambler®
• Prediction
Data table
Click the following link to import the Tutorial B data set used in this tutorial.
The analysis is based on 12 samples of jam (objects), selected to span the expected, normal quality variations inherent in such products. Several observations and measurements were made on the samples.
Agronomic production variables
The samples were taken from four different cultivars, at three different harvesting times. The table below describes the sampling plan for this analysis.
Sample description
No Name Cultivar Harvest time
1 C1-H1 1 1
2 C1-H2 1 2
3 C1-H3 1 3
4 C2-H1 2 1
5 C2-H2 2 2
6 C2-H3 2 3
7 C3-H1 3 1
8 C3-H2 3 2
9 C3-H3 3 3
10 C4-H1 4 1
11 C4-H2 4 2
12 C4-H3 4 3
Note that the agronomic production variables are not used as input variables in any of the matrices. These represent known information which may be extremely valuable for the interpretation of the results of the data analysis. They will be utilized as categorical variables in the analyses performed in this tutorial.
Column (variable) set Instrumental
Three chemical and three instrumental (APHA colorimetry) variables were also measured on the samples tested by the sensory panel. These are described in the table below.
Instrumental variables
No Name Method
1 L Lightness
2 a Green-red axis
3 b Blue-yellow axis
4 Absorbance Absorbance
5 Soluble Soluble solids (%)
6 Acidity Titratable acidity (%)
Column (variable) set “Sensory”
A trained sensory panel evaluated 12 different attributes of raspberries, using a 1-9 point intensity scale. The entries in the data matrix are the average ratings over all judges. The observed variables are listed in the table below.
Sensory variables
No Name Type
1 Redness Redness
2 Colour Color intensity
3 Shininess Shininess
4 R.Smell Raspberry smell
5 R.Flav Raspberry flavor
6 Sweetness Sweetness
7 Sourness Sourness
8 Bitterness Bitterness
9 Off-flav Off-flavor
10 Juiciness Juiciness
11 Thickness Viscosity/thickness
12 Chew.res Chewing resistance
Column (variable) set Preference
114 representative consumers were invited to taste the 12 jam samples used in this application. Each provided an individual preference score on a scale from 1 to 9. The average over all consumers for each sample is given in the data table.
Row (sample) sets
The data table, “JAMdemo”, consists of 20 samples. The first twelve samples will be used to develop the models in this application and are hereafter referred to as training samples.
Eight new jam samples were assessed by the trained panel and given a sensory rating. These are the last eight samples in the table, and are referred to as Prediction samples. The preference and instrumental values are missing for these samples, as those measurements were not performed on them. The calibration model will be used to predict the preference for these eight samples.
Preparing the data
Insert categorical variables
Categorical variables are useful for interpreting patterns in data sets. Here, the raspberries used to make the jam samples originated from different cultivars and were harvested at different times. These parameters represent excellent candidates for using categorical variables in an analysis.
Task
Insert two categorical variables, Cultivar and Harvest Time.
How to do it
Open the data table by following the above link; it is already organized into two row sets for training and prediction. The different types of variables have been defined in the column sets as Instrumental, Sensory and Preference, based on the definitions in the data tables above. These defined sets can be seen by expanding the data table in the project navigator.
Jam data organization
Some additional information about the cultivar and harvest time now needs to be added to this data as two new columns.
Activate a cell in the first column of the table, right mouse click and select Insert… or use the menu options and select Edit - Insert…. In the dialog box, choose to add two new columns. Two empty columns will be added to the data table.
Insert New Columns
Select the newly inserted columns and convert each of them to the Categorical data type by selecting Edit - Change Data Type… or right clicking and selecting Change Data Type…. The category converter dialog will appear; here select to input new levels based upon individual values.
Category converter dialog
Enter the Categorical Variable Name “Cultivar” manually in the column 1 header cell. Manually enter the values of the new categorical variable. Use C1, C2, C3, and C4 as the values for Cultivar, as given in the sample names. Type these values in the Cultivar column.
Note: Categorical variable cells are orange in the editor to distinguish them from ordinary variables.
Insert the categorical variable “Harvest Time”; change the name of column 2 to Harvest time, and fill in the correct Harvest Time levels based on the information contained in the sample names.
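The manual fill-in above follows a simple rule: the levels are encoded in the sample names themselves. As a small illustration (plain Python, entirely outside The Unscrambler®), the two categorical variables can be extracted from names such as "C2-H3" like this:

```python
# Sample names encode cultivar and harvest time, e.g. "C2-H3" -> C2, H3
names = ["C1-H1", "C1-H2", "C1-H3", "C2-H1", "C2-H2", "C2-H3",
         "C3-H1", "C3-H2", "C3-H3", "C4-H1", "C4-H2", "C4-H3"]

cultivar = [n.split("-")[0] for n in names]   # "C1" ... "C4"
harvest = [n.split("-")[1] for n in names]    # "H1" ... "H3"

# Levels, as the category converter would build them from individual values
cultivar_levels = sorted(set(cultivar))       # ['C1', 'C2', 'C3', 'C4']
harvest_levels = sorted(set(harvest))         # ['H1', 'H2', 'H3']
```

This mirrors what the category converter does when it builds levels from the individual values typed into the two new columns.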
The Tutorial_b data table displayed in the Editor (after insertion of Cultivar and Harvest Time)
Check column (variable) sets
In The Unscrambler® matrices are defined by Row and Column (Sample and Variable) Sets. A recommended good practice is to define all sets before any analyses are performed. The information entered to organize the data can later be used to color-code graphics according to these sample groups.
Task
Check that the three column (Variable) Sets: “Instrumental”, “Sensory” and “Preference” have been defined.
These sets can be visualized in the project navigator.
How to do it
To create column and row ranges, select Edit - Define Range to open the Define Range dialog.
Three sets have been predefined in the project Tutorial_B data set.
Column name: Instrumental Interval: 3-8
Column name: Preference Interval: 14
Column name: Sensory Interval: 9-13, 15-21
To verify these definitions, use Edit - Define Range and inspect the information in this dialog.
The Define range dialog with three column sets
After defining column intervals, click OK to perform the task.
Define sample sets
Task
Verify the existence of two sample sets “Calibration Samples” and “Prediction Samples”.
How to do it
Select Edit – Define Range to open the Define Range dialog. The available row sets can be inspected here.
The Define range dialog with two Row Sets
1. Row Name: Calibration Samples, Interval: 1-12 2. Row Name: Prediction Samples, Interval: 13-20
Additional row sets will be added for the various levels of the categorical variables harvest time and cultivar.
Exit the Define Range dialog box by clicking Cancel.
Begin by selecting row 1 in the data editor, then select Edit - Group rows…, which opens the Create row ranges from column dialog.
Edit- Group rows…
The selected column, “Cultivar”, is already in the Cols field.
There is no need to specify the Number of Groups, as the grouping is based on a categorical variable.
Create row ranges from column
Click OK.
Four row ranges have been added automatically. Look in the Row folder to see them:
New row ranges
Do the same for the variable “Harvest time”.
Objective 1: Find the main sensory qualities
The main variations in the sensory measurements may be found by decomposing them by Principal Component Analysis (PCA). This data decomposition results in valuable graphical diagnostic tools including scores, loadings and residuals. The results will be interpreted in order to establish whether sensory measurements made on the jam samples have any practical meaning.
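As a rough sketch of what this decomposition does, the following numpy snippet (synthetic data standing in for the 12×12 sensory matrix; not The Unscrambler®'s implementation) computes scores T and loadings P so that the centered data are approximated by T P' plus a residual E:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 12))        # stand-in for the 12x12 sensory matrix
Xc = X - X.mean(axis=0)              # mean-centering, as in the tutorial

# PCA via SVD: Xc = U S Vt; scores T = U S, loadings P = Vt.T
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 3                                # components kept
T, P = (U * S)[:, :A], Vt[:A].T

# Model plus residual: Xc ~ T P' + E
E = Xc - T @ P.T
explained = 1 - (E ** 2).sum() / (Xc ** 2).sum()   # fraction modeled
```

The scores (rows of T) are what the score plot maps, the loadings (rows of P) are what the loading plot maps, and E feeds the residual diagnostics discussed below.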
Make a PCA model
Task
Make a PCA model using the Set “Sensory” as the variable set.
How to do it
Select Tasks – Analyze - Principal Component Analysis… Specify the following parameters in the dialog box:
Model inputs
• Data matrix: “JAMdemo” (20x21)
• Rows: Calibration Sam (12)
• Cols: Sensory (12)
• Maximum components: 6
Check the Identify Outliers and Mean Center boxes, if they are not already selected.
Principal Component Analysis dialog: Model inputs
Weights
From the Weights tab verify that the weights are all 1.0 (constant).
No weighting is used in this model as the sensory panel is known to be well trained.
However, sensory variables are often weighted when there is evidence that the panel is not well trained, or when investigating relationships with other variables. The most common weighting to use is 1/SDev.
Weight tab dialog
Validation
From the Validation tab, select Cross Validation and press Setup, which opens the Cross Validation Setup dialog. Here select Full from the cross validation method drop-down list.
Validation Dialog
This validation method is more time consuming than leverage correction, but the estimate of the residual variance is more reliable.
Click OK to start the PCA. After the analysis is completed, the program will ask, “Do you want to view plots now?”. Click Yes to see the PCA Overview plots. A new node has been added to the project navigator containing all the PCA result matrices and plots.
Interpret the variance plot in the PCA overview
Task
Determine the optimal number of PCs.
How to do it
The PCA Overview contains the most commonly used plots for interpreting PCA models, including
• Scores plot
• Loadings plot
• Influence plot
• Explained/Residual Variance plot
PCA Overview plots
The Scores plot is a map of the samples, and shows how they are distributed. It can be used to identify samples that are similar or dissimilar to one another. In this analysis, the plot labels show that PC-1 explains 58% and PC-2 28% of the total variance in the data. The Explained variance curve (in the lower right corner) is an excellent tool for selecting the optimal number of components in the model.
The explained variance increases until PC 5 is reached. The software suggests an optimal number of PCs for a model, but it is up to the analyst to examine the data and confirm the optimal number of PCs, usually based on this plot.
The highest explained variance is found with 5 PCs, but a model using 3 PCs explains almost as much of the variation. A simple (parsimonious) model is usually more robust than a complex one, and easier to interpret. It is always advisable to work with a model with as few PCs as possible. The info box in the lower left corner of the main workspace indicates that 3 PCs are considered optimal for this model.
Info Box
Task
Change the explained variance plot to a residual variance plot.
How to do it
Activate the lower right plot by clicking in it. Toggle between the explained and residual variance views using the toolbar shortcuts. Another way of doing this is to recreate the plot using Plot - Variances and RMSEP, but the toggling shortcut is preferred.
The explained variance is now converted to residual variance. The information is the same, but presented in another way. The residual variance is well suited to finding the optimal number of PCs to use in a model, while the explained variance is a better measure of how much of the variation is described by the model. The plot layout can be changed to a bar chart by using the plot layout shortcut.
The PCA Explained Variance Bar plot
The model with 3 PCs describes 92% of the total validation variance in the data; for calibration it is 96%. These values may be obtained by clicking on the specific data point in the plot.
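The relationship between the explained and residual variance views can be illustrated in a few lines of numpy (synthetic data; the two curves are exact complements of each other):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 12))
Xc = X - X.mean(axis=0)

_, S, _ = np.linalg.svd(Xc, full_matrices=False)
total = (S ** 2).sum()

# Cumulative explained variance after each PC; residual is the complement
explained = np.cumsum(S ** 2) / total    # increases toward 1.0
residual = 1 - explained                 # decreases toward 0.0
```

Toggling the plot in the software changes nothing but which of these two complementary curves is drawn.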
Use the toolbar buttons to switch between plotting only the calibration variance curve, only the validation curve, or both.
Interpretation of the score plot for the PCA
The score plot, which is a map of samples, displays information about the sample relationships for a particular data set.
Task
Interpret Scores plot. Use different plot options for ease of interpretation.
How to do it
The score plot shows the projected locations of the samples onto the calculated PCs. By studying patterns in the samples a meaningful interpretation of the PCs may be possible.
PCA Scores plot
The score plot for this analysis indicates that the 12 samples are not arranged in a random way. Moving from left to right along this plot, a pattern can be observed: samples harvested at time H1 are mainly found on the left, changing to H2 and finally H3. Moreover, moving from top to bottom, C4 samples occupy the top region, followed by C3, then C2, and finally C1.
The row sets based on the categorical variables that were inserted into the data table can be used to better visualize these trends. In the scores plot, right mouse click and select Sample Grouping to open the dialog where different row sets can be used for grouping and color-coding the plot. Select all the cultivar row sets (C1, C2, C3, C4) individually and add them for grouping purposes. The marker color, shape and size can be customized here for optimized viewing of the data.
Sample Grouping Dialog
When the desired settings have been defined, click OK to complete the operation.
In the Scores plot, right mouse click to select Properties, where customization of the plot appearance is possible. Select header and change the plot heading to Scores plot with Cultivar Grouping. Choose a different font size or color if so desired.
Properties Dialog
PCA Scores with Sample Grouping
Repeat the above sample grouping process, this time using the categorical variable Harvest Time.
Interpretation of the correlation loadings plot
The loading plot, which is a map of the variables, displays information about the variables analyzed in the PCA model. Correlation Loadings provide a scale independent assessment of the variables and may, in some cases, provide a clearer indication of variable correlations.
Task
Interpret variable relationships in the correlation loadings plot.
How to do it
Activate the X-Loadings plot by clicking in it, then use the corresponding shortcut button.
The Correlation Loadings plot may be used to study the variable correlations that exist in a particular data set.
Correlation Loadings plot
The plot shows that two variables (redness and colour) have an extreme position to the right of the plot along PC1. They are close to each other (i.e. they are highly positively correlated), and far from the center and are very close to the edge of the 100% explained variance ellipse. This also means that samples lying to the right of the score plot have higher values for those two variables.
Along the vertical axis (PC2), two variables can be observed, with high positive values for this PC. These are R.SMELL and R.FLAV. These two variables are opposite to the variable OFF FLAV which has lower values for this PC. This indicates that raspberry smell and flavor correlate positively with each other, and negatively with off-flavor.
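Correlation loadings are simply the correlations between each original variable and each score vector, which is why they are bounded by ±1 and scale independent. A hedged numpy sketch on synthetic data (not the software's internal code):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 12))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2
T = (U * S)[:, :A]                   # scores for the first two PCs

# Correlation loading: correlation of each variable with each score vector
corr = np.empty((Xc.shape[1], A))
for j in range(Xc.shape[1]):
    for a in range(A):
        corr[j, a] = np.corrcoef(Xc[:, j], T[:, a])[0, 1]
# Values lie in [-1, 1]; points near the outer (100%) ellipse are almost
# fully explained by the plotted components
```

A variable near the outer ellipse, like redness or colour in the tutorial, has nearly all of its variance captured by the two plotted PCs.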
Interpretation of scores and loadings
Task
Relate Scores (samples) information to Loadings (variables) information.
How to do it
The Scores plot and Correlation Loadings plot show that samples C2H3 and C1H3 have high color and redness intensities, while sample C1H2 is more likely to have an off-flavor character. Samples located in a specific part of a 2-vector score plot have, in general, much of the properties of the variables in the same location in the 2-vector loading plot, provided that the plotted PCs describe a large proportion of the variance.
PC 3 describes the variation in sweetness, bitterness and chewing resistance. Confirm this by activating the loading plot (upper right quadrant) and selecting Plot - Loadings. Display PC 1 vs. PC 3 by changing Vector 2 using the arrows in the toolbar.
PCA Loadings 1 vs 3
In this new plot, the horizontal axis is unchanged (PC1) and the vertical axis now shows PC3.
Interpretation of the influence plot
Task
Interpret the influence plot, which is used for the detection of outliers.
How to do it
The influence plot is displayed in the lower left quadrant of the PCA Overview. The strongest outliers are placed in the upper right corner of the plot, i.e. they have a large leverage and a high residual variance. In the current analysis, there is no evidence of outliers.
PCA Influence plot
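The two axes of the influence plot can be approximated as follows (a numpy sketch on synthetic data: leverage from the score space, residual variance from the part of each sample the model does not describe):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 12))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 3
T = (U * S)[:, :A]
E = Xc - T @ Vt[:A]                  # residuals after A components

# Leverage: distance in score space, h_i = sum_a t_ia^2 / (t_a' t_a)
leverage = np.sum(T ** 2 / (S[:A] ** 2), axis=1)

# Per-sample residual variance across the variables
res_var = (E ** 2).mean(axis=1)
# Outlier candidates sit in the upper right: high leverage AND high residual
```

A sample with high leverage but a low residual merely pulls the model; it is the combination of both that marks a true outlier.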
All of the results for the PCA are now part of the project Tutorial_B. Save the project to capture the PCA results. The next steps in this tutorial will make use of the sensory, instrumental and preference data.
Close the PCA overview by selecting its name in the navigation bar at the bottom of the viewer and right clicking to select Close.
Objective 2: Explore the relationships between instrumental/chemical data (X) and sensory data (Y)
Is it possible to predict the quality variations observed in the jam data using instrumental measurements only? Training and employing a sensory panel is costly and time consuming. Producers of jam would find it most convenient if they could predict quality variations by measuring some properties by instrumental means. The next task in this tutorial is to make a regression model between the sensory and instrumental data and analyze the results for a possible solution.
Make a PLS regression model
In The Unscrambler® the regression between two matrices can be performed using a number of common multivariate methods. Partial Least Squares (PLS) is used in this case in order to maximize the information obtained from both X and Y.
Task
Make a PLS regression model that predicts the variations in sensory variables from instrumental and chemical variables.
How to do it
Select Tasks - Analyze - Partial Least Squares Regression…. Specify the following parameters in the Regression dialog:
Partial Least Squares Model Inputs
Model inputs tab
Predictors
• Rows/Samples: Calibration Sam (12)
• X-variables: Instrumental (6)
Responses
• Cols/Y-variables: Sensory (12)
• Maximum components: 6
X and Y Weights tabs
Select the X and Y Weights tabs to access their dialogs. Weighting will be applied to all the X and Y variables for regression purposes.
X Weights Dialog
Press All to change the weighting of all variables at the same time. Variables can also be selected by clicking on them in the list. Remember to hold the Ctrl key down while selecting several variables. Choose the A / (SDev +B) radio button. Use constants A = 1 and B = 0. Press Update and ensure that the weights change in the list, then click OK.
All variables are weighted by dividing them with their own standard deviations. This allows all variables to contribute to the model, regardless of whether they have a small or large standard deviation from the outset; only the systematic variation is of interest here.
Remember to do the same in the Y Weights tab.
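The A / (SDev + B) weighting with A = 1 and B = 0 amounts to dividing each centered variable by its own standard deviation. A short numpy illustration with deliberately unequal variable scales:

```python
import numpy as np

rng = np.random.default_rng(5)
# Six variables on very different scales, as raw instrumental data often are
X = rng.normal(size=(12, 6)) * np.array([1, 10, 100, 0.1, 5, 50.0])

# A/(SDev+B) with A=1, B=0 is ordinary standard-deviation scaling
weights = 1.0 / X.std(axis=0, ddof=1)
Xw = (X - X.mean(axis=0)) * weights   # centered and weighted

# Every variable now has unit variance and contributes equally to the model
```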
Validation tab
Select Cross validation from the Validation tab.
Press the Setup button to access the Cross Validation Setup dialog and choose Full Cross Validation from the drop-down list. It is always recommended to use a test set or cross validation to develop final models.
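Full cross validation leaves out one sample at a time and predicts it from a model built on the rest. The sketch below illustrates the looping logic on a plain least-squares model (numpy, synthetic data) rather than PLS, but the idea is the same:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(12, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=12)

# Full (leave-one-out) cross validation: each sample is predicted by a
# model built on the other 11
press = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    Xi = np.column_stack([np.ones(keep.sum()), X[keep]])  # with intercept
    b, *_ = np.linalg.lstsq(Xi, y[keep], rcond=None)
    y_hat = np.concatenate([[1.0], X[i]]) @ b
    press += (y[i] - y_hat) ** 2

rmsecv = np.sqrt(press / len(y))      # validation error in y units
```

This is why full cross validation is slower than leverage correction: it refits the model once per sample, but the resulting error estimate is more reliable.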
Click OK in the regression dialog when all parameters have been set up. The computation of the model will begin. After the PLS analysis is completed, the system will ask “Do you want to view plots now?”. Click Yes to study the Regression Overview. A new node, PLS, has been added to the project navigator.
PLS Regression Overview
This Viewer provides the most useful and common predefined result plots for PLS, including loading weights, residuals, etc. The model can be reviewed at any stage of the analysis by selecting any of the result plots under the PLS - Plots node in the project navigator. In this exercise several Y responses were used for model development, so the overview results for each response are available by choosing the Y value of interest in the toolbar. When performing this type of analysis with multiple responses, the non-significant variables may be determined for each of the responses. It also provides information on which sensory responses can best be predicted from the instrumental measurements, without making a separate PLS model for each response. When the Predicted vs Measured plot (lower right quadrant) is active, the name of the Y value being analyzed appears in the toolbar. Another Y response can be chosen from the drop-down menu, or one can scroll through the values using the arrow tool on the right.
Interpretation of the variance plot
Task
Interpret the explained variance curve, which can be shown as residual variance, or as explained variance. The two different views are useful for different tasks.
How to do it
The explained variance plot is in the lower left quadrant. This plot can be changed to the residual variance plot by using the toolbar shortcut. A local minimum is reached at only two PLS factors. The next task is to determine how much of each Y-variable is described by the model. This can be done by looking at the explained variance.
Validation Variance plot
From the plot menu select Variances and RMSEP - X- and Y-Variance…. Make sure the bottom plot shows the Explained Variance for the 12 individual Y-variables; if not, change it using the toolbar shortcut. Also select Cal rather than Total from the toolbar shortcuts. Add a legend to the plot by right clicking and selecting Properties; select Legend and check the Visible box.
PLS, Explained Validation Variance Plot displayed for the 12 individual Y-variables
The conclusion reached from the residual variance curve was that two PLS factors were optimal. The variables that are well described are reflected in the information conveyed by these factors. About 85% of the color variation (variables 1 and 2), and 80% of the variation in sweetness (variable 6) can be explained by a combination of the chemical and instrumental variables.
Note that only 23% of the total Y-variance is explained by the model using two factors.
Interpretation of the score plot
The score plot shows how the samples are related to each other.
Task
Interpret the score plot.
How to do it
Return to the Regression Overview plot by selecting it from the Plots node in the project navigator. The Scores plot is always found in the upper left quadrant of the overview. The score plot shows patterns in the samples, which are often difficult to see without additional visual tools. Use the categorical variables as markers in the same way as in “Interpretation of the Score Plot” for the PCA model: highlight the score plot, right click, and select Sample Grouping. The categorical variable Harvest Time will be used for the sample grouping.
PLS factor 1 describes the harvesting time. Harvest time 1 is found on the right in the plot and harvest time 3 to the left. The score plot does not reveal information about the cultivars.
A comparison with the loading plot provides more information. Interpret the two plots (Scores and Loadings) by analyzing them together.
Interpretation of the loadings and loading weights plot
Study the loading weights plot to find correlating variables.
Task
Interpret the loadings and the loadings weight plots.
How to do it
The loadings plot is located in the upper right quadrant of the Regression Overview. Activate it (if it is present), or choose it from the project navigator under the PLS - Plots node. Make sure both X and Y loadings are plotted.
To interpret variable relationships, visualize straight lines between the variables through the origin. Variables along the same line, far from the origin, may be correlated. (Negatively correlated when situated on opposite sides of the origin.)
PLS, X-Loading Weights and Y-Loadings Plot
The spectrophotometric color measurements (L, a, and b) appear to be strongly negatively correlated with color intensity and redness. Sweetness is, as expected, strongly negatively correlated with measured Acidity. But the R. Flavor shows weak correlation to the PLS-factors (near origin = low PLS loadings).
The regression coefficients may also be analyzed to understand which X variables are important in describing each of the Y responses. These can be selected from the project navigator, or from the menu Plot- Regression coefficients - Raw - Line. The coefficients for each of the Y responses can be displayed by selecting them from the drop-down list in the toolbar.
From Objective 1 it was concluded that the jam quality varied with respect to color, flavor, and sweetness. But the results so far in Objective 2 show that the chemical and instrumental variables mainly predict variations in color and sweetness (as indicated by the low explained Y-variance for flavor). This indicates that the Y-variable Flavor cannot be replaced with the present set of X-variables, i.e. there is no information in the chemical and instrumental measurements related to the flavor of the jam samples.
Use of other instrumental X-variables, e.g. gas chromatographic data, may have increased the flavor prediction ability of the raspberry jam data.
Interpretation of the predicted vs measured plot
The predicted vs. measured plot displays the predictive ability of the developed model.
Task
Interpret the predicted vs. measured plot.
How to do it
The predicted vs. measured plot in the regression overview currently displays the results for the first Y-variable, in this case, Redness.
PLS, Predicted vs Measured Plot for variable Redness, model with two factors
Use the drop-down list in the toolbar to observe the prediction quality for the other variables measured in this analysis. Make sure these plots are displayed for two PLS factors, as this is the optimal number for this model. Note that for several of the properties, including raspberry flavor, raspberry smell, and off-flavor, the instrumental values do not provide any real information.
Objective 3: Predict user preference from sensory measurements
Is it possible to develop a model for predicting consumer preference data from new sensory data? If so, expensive consumer tests can be replaced by cheaper sensory tests. The PLS model previously developed was used for interpretation purposes. The focus is now on prediction. A new model will be built relating the sensory data to consumer preference data, and this model will be applied to unknown samples to predict their preference.
Make a PLS regression model for preference
First, develop a model relating sensory data to preference, and interpret it. PLS regression will be used as the regression method.
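For readers curious about what PLS1 does under the hood, here is a compact, hedged NIPALS sketch in numpy (synthetic data; not The Unscrambler®'s implementation) that extracts components and returns regression coefficients for centered data:

```python
import numpy as np

def pls1(X, y, n_comp):
    """Minimal PLS1 (single response) via the NIPALS algorithm."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    W = np.zeros((p, n_comp))   # loading weights
    P = np.zeros((p, n_comp))   # X-loadings
    T = np.zeros((n, n_comp))   # scores
    q = np.zeros(n_comp)        # Y-loadings
    for a in range(n_comp):
        w = X.T @ y
        w /= np.linalg.norm(w)          # direction of max covariance with y
        t = X @ w                        # scores
        p_a = X.T @ t / (t @ t)          # X-loadings
        q[a] = y @ t / (t @ t)           # Y-loading
        X = X - np.outer(t, p_a)         # deflate X
        y = y - q[a] * t                 # deflate y
        W[:, a], P[:, a], T[:, a] = w, p_a, t
    # regression coefficients for centered data
    return W @ np.linalg.inv(P.T @ W) @ q

rng = np.random.default_rng(7)
X = rng.normal(size=(12, 12))                         # sensory block (toy)
y = X @ rng.normal(size=12) * 0.3 + rng.normal(scale=0.05, size=12)
b = pls1(X, y, n_comp=2)
y_fit = (X - X.mean(axis=0)) @ b + y.mean()           # fitted preference
```

Each extracted component removes the part of X and y it explains, which is why the fitted residual can only shrink as components are added.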
Task
Make a PLS regression model for describing the relationships between sensory data and preference.
How to do it
From the Main Menu, select Tasks - Analyze - Partial Least Squares Regression…, and specify the following parameters in the PLS Regression dialog:
Model Inputs
Predictors
• X data set: “JAMdemo”
• Rows/Samples: Calibration Samples (12)
• Cols/X-variables: Sensory (12)
Responses
• Y data set: “JAMdemo”
• Rows/Samples: Calibration Samples (12)
• Cols/Y-variables: Preference (1)
Maximum components: 6
PLS Regression Dialog
Weights in X and Y
All 1/SDev
Select the X Weights tab and weight all the X variables with 1/SDev so that each variable will contribute equally in the modeling step. Also weight the Preference values (Y) by 1/SDev in the Y Weights tab.
Validation
Full Cross Validation
Press Setup to access the Cross Validation Setup dialog and choose Full Cross Validation as the cross validation method.
Press OK.
Interpretation of the regression overview
Task
A new PLS node has been added to the project navigator. Rename this to PLS Sensory by highlighting it, then right clicking and selecting the Rename option. Interpret the model using the regression overview plots and other diagnostic tools available.
How to do it
It is of primary interest to determine how well the model can predict new values. Therefore the residual variance and the Predicted vs Measured plots are of most interest here.
The residual variance
Activate the explained variance plot in the lower left quadrant, and change it to the residual Y variance plot using the toolbar shortcuts. The prediction error tapers off after two PLS factors, which represents the optimal model.
Residual Y Validation Variance Plot
Predicted vs measured
Activate the predicted vs. measured plot and display it for 2 PLS factors, using the arrows in the toolbar.
Turn on the regression line and the target line with the toolbar shortcuts.
Predicted vs Measured Plot with Trend Lines
It can be observed that the predictions are of good quality. Some samples are not so well predicted, but the overall correlation is satisfactory.
Interpretation of the regression coefficients
The regression coefficients are used to calculate the response value from the X-measurements. The size of the coefficients provides an indication of which variables have an important impact on the response variables.
There are two kinds of regression coefficients, Bw and B. The Bw coefficients are calculated from the weighted data table and are used for interpretation. The B coefficients (raw) are calculated from the raw data table and are used for predictions.
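When both X and Y are weighted by 1/SDev, the raw coefficients B can be recovered from the weighted coefficients Bw by rescaling with the standard deviations and adding back the means. A numpy sketch (ordinary least squares stands in for PLS for brevity; the back-transformation is the same):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(12, 4)) * np.array([1, 5, 20, 0.5])
y = X @ np.array([2.0, -0.3, 0.05, 4.0]) + 1.5

sx, sy = X.std(axis=0, ddof=1), y.std(ddof=1)
Xw = (X - X.mean(axis=0)) / sx        # 1/SDev-weighted X
yw = (y - y.mean()) / sy              # 1/SDev-weighted Y

# Weighted coefficients Bw, here fitted by ordinary least squares
bw, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

# Back-transform to raw coefficients B and intercept b0
b = bw * sy / sx
b0 = y.mean() - X.mean(axis=0) @ b
```

Bw values are comparable across variables (useful for interpretation), while B and b0 apply directly to measurements in original units (useful for prediction).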
Task
Find which variables are important for predicting the Y-variable Preference.
How to do it
The estimated regression coefficients indicate the cumulative importance of each of the sensory variables to the consumer preference.
Select Plot - Regression Coefficients. Choose the Weighted coefficients (Bw) option. Using the arrows in the toolbar, change the plot to show regression coefficients for 2 PLS factors, and change the plot layout to a bar chart.
Regression Coefficients Plot
Redness, Color and Sweetness (B1, B2 and B6) are statistically significant in predicting Preference. Raspberry Smell (B4) is also significant, but contributes negatively to Preference. Thickness (B11) also seems to be of importance, as it has a large (negative) coefficient; however, it is not significant in this model.
Save the project file with the name “Tutorial_B”. The model may also be saved on its own, providing a smaller file with just the model information that can be used for predicting new samples with The Unscrambler® Online Predictor and The Unscrambler® Online products. To save the model only, right click on the model node in the project navigator and select the option Save result.
Save result
Rename the model if desired and click on Save.
Open result matrices in the Editor
The result matrices may also be observed numerically. Comparison of results may be easier in tables and the Editor is a good starting point for exporting data into other programs.
The Raw regression Coefficients (B) are available as a predefined plot from the Plot menu in the Regression results Viewer. However, for this exercise the B coefficients will be viewed from the list of numerous available matrices.
Task
View the regression coefficients in the Editor.
How to do it
Open the Results folder under the PLS node in the project navigator and select the Beta Coefficients (raw) matrix. Any of the other validation matrices may be selected from the validation folder of the PLS model. The beta coefficients can then be treated like any other data in an Editor; for example, they may be plotted from the Plot menu.
Predict preference for new samples
Regression models are mainly used to predict the response value for new samples. Models are developed so that these values can be predicted instead of performing reference measurements, which are often time-consuming and expensive.
The purpose of the model previously developed was to predict the jam preference for some consumers based on sensory values that were measured for the samples.
Task
Predict the Preference for the jam samples.
Interpret the prediction results to see whether the predictions can be trusted.
How to do it
Activate the “JAMdemo” data matrix. Select Tasks - Predict - Regression… and specify the following parameters in the Prediction dialog:
• Select model: PLS Sensory
• Data matrix: “JAMdemo”
• Rows/Samples: Prediction Samples (8)
• Cols/X-variables: Sensory (12)
• Prediction type: Full Prediction
• Y-reference: Not included
• Number of Components: 2
Check the boxes for Inlier statistics and Mahalanobis distance to provide valuable statistical measures of the similarity of the prediction samples to the calibration samples.
Click OK to perform the prediction.
The Prediction dialog
Interpretation of Predicted with Deviation
There were no reference measurements available for the new samples in the “Prediction Sam” Set. This makes it impossible to check predicted vs. measured values. Since a model has been developed based on projection, the only option available is to check the reliability of the predictions from the deviations. There are also some statistical measurements of the similarity of predicted samples to those used in developing the calibration model that can be used: inlier statistics and Mahalanobis distance.
Task
Interpret the Predicted with Deviation plot, and other plots related to prediction results.
How to do it
Click OK in the Prediction dialog to display the predicted with deviation plot, and the tabulated prediction results.
Prediction results
The predicted preference values for the “unknown” new jams have some uncertainty limits, i.e. the accuracy of new predictions is limited. However, the model can still be used to predict the preference of new jam samples, giving an indication of which ones will or will not be accepted by consumers.
View the Inlier vs Hotelling T² plot by selecting Plot – Inlier vs Hotelling T². This plot shows how similar the new samples are to those used in developing the calibration model. For a prediction to be trusted, the predicted sample must not be too far from a calibration sample, which is checked by the inlier distance; its projection onto the model should also not be too far from the center, which is checked using the Hotelling T² distance.
Save the project file under the name “Tutorial B_complete”. This now includes all the data, three models, and the predicted results for preference.
Check the error in original units – RMSE
Finally, observe how large the expected error is in predicted preference results, i.e. determine what an approximate RMSEP is for such an analysis.
Task
Plot the RMSE.
How to do it
Return to the PLS Sensory node in the project navigator. In the plots folder select Regression Overview, then select Plot - Variances and RMSEP - RMSE.
Two curves are plotted, one for the calibration: RMSEC and one for validation. In this particular case it is the cross-validation error: RMSECV.
PLS, Root Mean Square Error Plot
To gain a better approximation of what to expect in future predictions, the RMSECV should be analyzed.
The RMSECV may be studied for Preference for all PLS factors. RMSECV (using two factors) is 0.83. This means that any predicted new sample on the scale from 1 to 9 will have a prediction error around 0.8. This is an acceptable error level in sensory analysis, which has much uncertainty in all measurements.
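The RMSE values reported by the software follow the usual definition: the square root of the mean squared difference between predicted and reference values, expressed in the units of the response. A minimal sketch with hypothetical preference scores (not the tutorial's data):

```python
import numpy as np

# Hypothetical reference and cross-validated predicted preference scores (1-9 scale).
y_ref = np.array([5.0, 7.0, 3.0, 8.0, 6.0])
y_pred = np.array([5.6, 6.2, 3.9, 7.5, 6.4])

# RMSE: square root of the mean squared prediction error,
# in the same units as the response itself.
rmse = np.sqrt(np.mean((y_pred - y_ref) ** 2))
print(round(rmse, 2))
```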
Export models from The Unscrambler®
Models from The Unscrambler® are often used in instruments to make predictions in real time. A model format has been developed to facilitate the easy reading of results in instruments or other software that do not read The Unscrambler® models directly.
Task
Export the regression model used to predict Preference from Sensory Data.
How to do it
Select a PLS Model from the project navigator and select File – Export - ASCII-MOD…
This displays the Export ASCII-MOD dialog box.
Export ASCII-MOD Dialog
Verify that the correct number of factors has been chosen for the selected model. The optimal number of components should be used for the export. Therefore, change the number of factors to 2 before clicking OK.
Two types of model export are available:
• Full
• Regr.Coef. only: exports only the regression coefficients
Observe the ASCII file that is generated; it has the file name extension .AMO. The format of the file is described in the ASCII-MOD Technical Reference.
Similarly, any of the result or validation matrices can be selected for export into other formats. Supported export formats are:
• ASCII
• JCAMP-DX
• Matlab
• NetCDF
• ASCII-MOD
Full ASCII-MOD export includes all results that are necessary to perform outlier detection, etc. This format can be used for applying models outside The Unscrambler® environment, for example in a custom written program script. The ASCII-MOD file is readable by any text editor, such as Notepad.
Tutorial C: Spectroscopy and interference problems
• Description
o What you will learn
o Data table
• Get to know the data
o Read data file and define sets
o Plot raw data
• Univariate regression
• Calibration
o Interpretation of the calibration model
o Study the predicted vs measured plot
• Multiplicative Scatter Correction (MSC)
• Check the error in original units: RMSE
• Predict new MSCorrected samples
• Guidelines for calibration of spectroscopic data
Description
There is a need for an easy way to determine the concentration of a dye (a bright red heme protein, Cytochrome-C) in water solutions. The dye absorbs light in the visible range, and the concentration determination will be based on this light absorbance.
In the solutions to be analyzed there are varying, unknown amounts of milk, which absorbs some light in the same wavelength range as dye and therefore causes chemical interference in the measurements. In addition, milk contains particles that give serious light scattering.
Another effect that will influence the absorbance spectra is the varying sample path length.
The light absorbance spectrum figure shows the light absorbance spectrum of one sample of the dye/milk/water solution.
Absorbance Spectrum
The vertical lines represent the 16 different wavelength channels selected as predicting variables for this sample set.
This example is constructed so that it can be duplicated in a lab, and it illustrates the interference effects and other effects that make spectroscopy challenging. Similar problems occur in many industrial applications, e.g. measuring the concentration of different chemical species in sewer water, which contains many other chemical agents as well as physical interferences such as slurries and particles, or measuring moisture and solvents in a granulation process.
The two major peaks (variables Xvar4 and Xvar6) represent the absorbance of dye, while the first peak (Xvar2) represents absorbance due to an absorbing component in the milk. The broad peak to the right (Xvar12, Xvar13, Xvar14) is due to light absorption by water itself.
What you will learn
Tutorial C contains the following parts:
• PLS regression
• Handling of interference problems, Multiplicative Scatter Correction (MSC)
• Check list for calibration of spectroscopic data
A problem similar to this tutorial is described extensively in chapter 8 in the book “Multivariate Calibration”, by Martens & Næs.
References
• Transformations: Principles of Data Preprocessing
• Multivariate regression methods
• Prediction with regression models
Data table
Import the Tutorial C data set used in this tutorial. This is best done into a new project (File - New).
The data matrix Tutorial_C is imported into the project. It consists of 28 samples (solutions) that span the two most important types of variation: the dye and milk concentrations. The composition of dye/milk/water in each calibration sample is shown below. The values are given in ml, making a total of 20 ml in each solution (sample).
Sample Dye Milk Water Sample Dye Milk Water
1 0.0 0.5 19.5 15 4.0 0.5 15.5
2 0.0 1.0 19.0 16 4.0 1.0 15.0
3 0.0 2.0 18.0 17 4.0 1.5 14.5
4 0.0 6.0 14.0 18 4.0 6.0 10.0
5 0.0 8.0 12.0 19 4.0 10.0 6.0
6 0.0 10.0 10.0 20 6.0 1.0 13.0
7 2.0 0.5 17.5 21 6.0 2.0 12.0
8 2.0 1.0 17.0 22 6.0 6.0 8.0
9 2.0 1.5 16.5 23 6.0 10.0 4.0
10 2.0 2.0 16.0 24 8.0 0.5 11.5
11 2.0 4.0 14.0 25 8.0 1.0 11.0
12 2.0 6.0 12.0 26 8.0 1.5 10.5
13 2.0 8.0 10.0 27 8.0 2.0 10.0
14 2.0 10.0 8.0 28 8.0 6.0 6.0
Note that the known milk and water quantities will not be used to make the model, only as descriptors in result plots. The sample names are coded with these quantities as well.
Get to know the data
Read data file and define sets
The first step in all modeling is to get the data into The Unscrambler® and organize it into appropriate sets. The data for the different analyses are organized as sets, defining which
samples (rows) or variables (columns) are used in the modeling. Cleverly defined sets make modeling and plotting much easier.
Task
Open the data matrix Tutorial_C, and take a look at the properties of the data. Some of the data have already been organized into row and column sets. The data will be further organized by defining some additional sets to be used in the analysis.
How to do it
In the project navigator, expand the tree under the data matrix Tutorial_C to see the file content. An Editor with the data table is launched in the viewer.
Project navigator view of data
One can see that some sets have already been defined, but one additional column set named Statistical will be defined.
The data table already has the following: Column (Variable) Ranges:
• Cols/Name: Absorbance; Interval, Columns: 4-19
• Cols/Name: Dye Level; Interval, Columns: 3
• Cols/Name: Description; Interval, Columns: 1-2
Row (Sample) Ranges:
• Rows/Name: Calibration; Interval, Rows: 1-28
• Rows/Name: Prediction; Interval, Rows: 29-42
Put the cursor in the data viewer. Now one can define a new column set (variable range) by going to Edit - Define Range… which will open the Define Range Editor. Define the Columns Sets by putting the name Statistical in the column range space, and for interval, enter 3-19 for columns as shown below.
Define Range Dialog
Click OK when finished defining the column and row sets. Use File - Save As… to save the project with the updated name “Tutorial_C_updated” in a convenient location before continuing. The organized data will now have numerous nodes for column and sample sets in the project navigator, and gives a color-coded data matrix.
Plot raw data
It is good practice to start by plotting the raw data to get an impression of what the data look like. It will be of tremendous help when you want to assess which pretreatments are necessary and what kind of model (e.g. how many factors) to expect, as well as generally understanding the structure of the data.
Task
Plot some calibration samples in order to see how the spectra vary with varying amounts of dye and milk.
How to do it
Make a line plot of samples that have the same amount of milk, 10 ml. The line plot is just of the X-variables for these samples, so in the data table editor, select the four samples having 10 ml of milk by marking the samples in the Editor (samples 6, 14, 19, and 23) by clicking the sample numbers while holding down the Ctrl key. Then right click and select Plot - Line.
Line plot dialog
In the Line Plot dialog that appears, select the column set Absorbance from the drop-down list. Click OK and note that the four samples are highlighted in the Editor.
The same could be done by selecting the menu option Plot - Line… after having selected the samples in the viewer, and specifying the column set Absorbance in the Line Plot dialog.
Line Plot of sample with 10 ml milk
Use shortcuts keys to change the layout of the plot to a bar chart.
These four samples have the same milk level and the line plot shows that the dye level has influence on the absorbance of variables number 2 - 8 only.
Plot samples 20, 21, 22, and 23 the same way, using the Ctrl key to select just these specific rows. These samples have the same dye level: 6 ml.
The plot shows that increasing milk level will increase the absorbance of light of all wavelengths from number 1 to number 16. There seems to be a great deal of interference or scattering to deal with, over the whole spectrum. This indicates that some transformations of the data may be useful to get an optimal model.
Univariate regression
Is it possible to predict the dye level from the absorbance of one single wavelength? Before we enter the multivariate world we want to see what can be done by univariate regression.
Task
Find the best wavelength on which to make a univariate regression model.
How to do it
You find the best wavelength by looking at the correlation between each absorbance variable and the dye level variable. Select the data set Statistical from the project navigator. Select Tasks - Analyze - Descriptive Statistics… and specify the following parameters in the Descriptive Statistics dialog.
• Rows: Calibration (28)
• Cols: Statistical (17)
• Compute Correlation matrix: checked
When the computation is done, there will be a prompt asking if you want to view the plots. Click Yes, and the two plots summarizing the statistics will be displayed. You will find a new node, Descriptive statistics in the project navigator which consists of the three folders raw data, results and plots.
In the project navigator, expand the folder results. Select the Variable Correlation matrix from this folder to view this in the viewer. We will use these data to find the highest correlation between Dye Level and some X-variable. You may select the first row, dye level, and plot it (Plot - Bar) to see the highest correlation (after the correlation between Dye level and Dye level, which of course is 1).
Bar chart of variable correlation
The variable with the highest correlation coefficient to Dye Level is Xvar6, with a correlation coefficient of 0.49. You can close the bar plot of the correlation matrix by selecting its tab in the navigation bar at the bottom of the viewer, right clicking, and selecting Close.
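The screening step above, picking the wavelength most correlated with the response, can be sketched outside the software. The data here are small synthetic stand-ins, not the tutorial values:

```python
import numpy as np

# Hypothetical data: 6 samples, response (dye level) and 3 absorbance channels.
dye = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 4.0])
X = np.column_stack([
    np.array([1.0, 0.5, 1.2, 0.8, 1.1, 0.9]),                 # channel 1: unrelated
    0.3 * dye + np.array([0.1, -0.1, 0.0, 0.1, -0.1, 0.0]),   # channel 2: tracks dye
    np.array([2.0, 2.0, 2.1, 1.9, 2.0, 2.1]),                 # channel 3: nearly constant
])

# Correlation of each channel with the response; pick the strongest.
r = np.array([np.corrcoef(X[:, j], dye)[0, 1] for j in range(X.shape[1])])
best = int(np.argmax(np.abs(r)))
print(best + 1)  # -> 2  (1-based index of the most correlated channel)
```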
Now we should illustrate the regression in a plot. To get the right plot, go back to the original data set, Tutorial_C, select the columns Xvar6 and Dye level using the Ctrl key, and select Plot - Scatter. In the Scatter Plot dialog, remember to select only the calibration samples from the row drop-down list.
Scatter plot dialog
Scatter plot of Xvar6 vs Dye level
Another way to do this is go to Plot - Scatter and in the Scatter plot dialog click on the define button next to Cols., which will open the Define Range dialog. Here you can select the columns Dye level and Xvar6, or type in columns 3, 9 in the Interval box. Select the calibration samples for the rows.
Scatter plot dialog showing define option
Turn on the Regression Line and Target Line with the shortcut buttons. The plot statistics can also be added from the toolbar shortcut. From the plot we see that the results are not very good using just one variable to model the dye level; hopefully multivariate regression models will do better.
Scatter plot of Xvar6 vs Dye level with target and regression lines
Calibration
We choose to make a PLS regression model because PLS takes the variation in Y into consideration when the model is calibrated.
Task
Make a PLS regression model between the variable set Absorbance (X) and the response Dye Level (Y).
How to do it
Activate the Tutorial_C data Editor from project navigator and select Tasks - Analyze - Partial Least Squares Regression…. In the PLS dialog, specify the following parameters:
• Data Set: Tutorial_C
• Predictors:
Rows: Calibration (28)
Cols (X-variables): Absorbance (16)
• Responses:
Cols (Y-variables): Dye Level (1)
• Maximum components: 8
• Mean center data: selected
• Identify outliers: selected
PLS Regression dialog
• Weights: All 1.0 in X and Y
• Validation method: Cross validation
Go to the Validation tab to select cross validation. You can further define the settings for this by clicking Setup…, taking you to the Cross validation setup dialog. Select Random as the cross validation method and set the number of segments to “7”.
Cross validation setup dialog
Start the calibration by clicking OK on the Model inputs tab. When the computation is complete you will be asked if you want to view the PLS plots now. Click Yes, and the regression overview plots will be displayed.
A new node, PLS, has been added to the project navigator. This has four folders with the raw data, results, validation, and plots for the PLS model. Rename the PLS node in the project navigator for this analysis to “PLS Tutorial C” before you continue. You can do this by right clicking the latest PLS model in the project navigator and selecting Rename.
Interpretation of the calibration model
The interpretation of a calibration model involves several steps. First, we check whether the model has detected any systematic variation. This is done by looking at the residual variance plot. If the model has successfully described systematic variation, we start to interpret different additional modeling results. The most important model results to study are the Scores, Loadings, and the Predicted vs Measured, all of which are part of the Regression Overview Plots.
Task
Interpret the plots in the regression overview.
How to do it
The regression overview was displayed when you clicked View plots. It consists of four plots of the most important modeling results from the regression model. We will now view the PLS results. The plot in the lower left quadrant is the residual variance. This plot gives information about how many factors are required to explain model variation and optimal number of factors for the model. A summary of the model information is given in the Info box in the lower left of the screen, below the project navigator.
PLS Regression Overview Plots
Score plot
The plot in the upper left quadrant is the Scores plot. From it we can interpret that the combination of the two main factors, factor 1 and factor 2, reflects the variations in the milk and water levels. The first two factors explain 99% of the X-variance (factor 1: 84%, factor 2: 15%) and 75% of the variance in the response dye level (factor 1: 19%, factor 2: 56%). By studying the samples in the plot we can see that the milk level increases from upper left to lower right, while the water level increases from right to left.
Regression coefficients
The regression coefficients plot summarizes the relationship between all predictors and a given response. It is easiest to access this plot by selecting it from the plots folder in the project navigator.
Plots folder in project navigator
You can see this plot when any PLS plot is active in the viewer by going to Plot - Regression Coefficients - Raw coefficients (B) - …, or by right clicking and selecting PLS - Regression Coefficients - Raw coefficients (B) - …. Select the line plot of the raw regression coefficients. Since no weighting was applied to the data, the plots of weighted and raw regression coefficients will be identical.
The regression coefficients plot indicates that the wavelength numbers (X-variables) 4 and 6 are the most important for the prediction of Y (concentration) in the first factor. The pattern is clearer here than in the loading plot.
Regression coefficients plot
Compare the regression coefficients plot to the raw absorbance data. You see that high coefficient values, indicating important variables, occur in the region where we know that milk and dye absorb light.
Study the predicted vs measured plot
This plot, in the lower right of the Regression Overview shows how the model is able to predict the response value for the calibration samples. This gives an indication of how well the model will perform in the future when new samples are collected and we want to calculate the dye level for these samples, from the spectral data.
Task
Take a closer look at the residual variances in the error measures plots.
How to do it
Activate the Predicted vs Measured plot and select Plot - Variances and RMSEP… and select the X- and Y-variance, which will bring up two plots summarizing the X and Y variance.
The upper plot shows that the model describes much of the variance in the X-variables in the first factors, while it takes more factors in the lower plot to describe the variance in Y (dye level). We are interested in describing Y, therefore we have to include enough factors in our model to get a high explained variance for the Y-variable.
The X-variance and Y-variance plots
Multiplicative Scatter Correction (MSC)
Since we suspect that the light scattering and sample thickness have multiplicative effects on the data, and that the chemical absorptions have additive effects, we decide to try MSCorrection on the X-variables in order to separate these effects from each other.
Perform a Multiplicative Scatter Correction
Task
Correct the data for multiplicative scatter effects. Omit variables 1 to 8 in the Set Absorbance as important variables.
How to do it
Select the data matrix Tutorial_C.
First, we verify the need for MSC by looking at the Scatter Effects plot. This plot is available from a Statistics model. Select Tasks - Analyze - Descriptive Statistics and specify the following parameters in the Descriptive Statistics dialog:
• Rows: Calibration (28)
• Cols: Absorbance (16)
Click OK to calculate the statistics, and select Yes to view the plots. Since descriptive statistics were already run before, using 17 variables rather than just the absorbance set, the current results appear as a new node, Descriptive Statistics(1), in the project navigator. We are not interested in the default plots that are shown, but want a plot that helps us understand the scatter in the data. Make the plot window active and select the menu option Plot - Scatter effects. In this plot of the mean value of each X-variable we see that the scatter is not the same for all variables. The first 8 variables lie approximately on a straight line; for the other variables one can observe a spread in the scatter effects.
Scatter effects plot
Select the data matrix Tutorial_C. Select Tasks - Transform - MSC/E… Specify the following parameters in the Multiplicative Scatter Correction dialog:
• Rows: Calibration (28)
• Columns: Absorbance (16)
• Enable Omit Variables: 1-8
Multiplicative Scatter Correction dialog
Go to the Options tab and under Function select Common Amplification.
Multiplicative Scatter Correction options
The prediction samples are not used when finding the correction factors that will now be determined and used in the MSC.
Variables 1-8 are omitted as important because the light absorption at these variables varies with the dye level, while wavelengths 9 to 16 (the water absorption peak) are independent of the concentration of dye. The differences at these wavelengths are instead caused by the general light scatter due to the milk addition. It is important that only wavelengths with no chemical information are used to find the correction factors.
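The MSC transformation itself can be sketched in a few lines: each spectrum is regressed on the mean spectrum using only the channels carrying no chemical information (here channels 9-16, mirroring the omission of variables 1-8), and the fitted offset and amplification are then removed from the whole spectrum. The spectra below are synthetic:

```python
import numpy as np

def msc(spectra, fit_idx):
    """Multiplicative scatter correction.

    For each spectrum x, an offset a and amplification b are estimated by
    regressing x on the mean spectrum, using only the channels in fit_idx
    (those with no chemical information). The whole spectrum is then
    corrected as (x - a) / b.
    """
    mean = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(mean[fit_idx], x[fit_idx], 1)  # slope, intercept
        corrected[i] = (x - a) / b
    return corrected

rng = np.random.default_rng(2)
base = np.linspace(1.0, 2.0, 16)                  # hypothetical common spectrum
scale = rng.uniform(0.8, 1.2, size=(10, 1))       # multiplicative scatter
offset = rng.uniform(-0.1, 0.1, size=(10, 1))     # additive baseline shift
spectra = scale * base + offset

fit_idx = np.arange(8, 16)   # channels 9-16 only (variables 1-8 omitted)
out = msc(spectra, fit_idx)

# After correction the scatter between samples should be essentially gone.
print(float(out.std(axis=0).max()))
```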
The transformed data are now displayed in the project navigator with the name “Tutorial_C_MSC”. There is also a node with the MSC model for transformation, which can be applied to future samples. This is called “MSC_Tutorial_C”, and has a folder with the model under it.
Look at the corrected data by selecting the data from the new project navigator node, and going to Plot - Line. Select the new sample matrix with the corrected data in the Line Plot dialog, row set calibration, and column set Absorbance.
Line plot of MSC transformed data.
We want to compare the corrected data with the original data. Select the raw data matrix in the project navigator (Tutorial_C) and make a line plot of the calibration samples for the absorbance values. You see that the MSCorrected data are different from the original: the interference and light scatter effects have been successfully corrected for. You can display the plots on the same screen by going to the navigation bar at the bottom of the screen and right clicking to select Pop out, which gives an undocked plot of the MSC corrected data that can be moved around as you wish.
Pop out menu
You can then choose the line plot of the uncorrected data from the navigation bar, making it active in the Viewer, and move the other window to the same view for easier comparison.
Line plots of the MSCCorrected and the original data
Another way to view both plots together is to go to Insert - Custom Layout - Two Horizontal… and select the two sample matrices, selecting the calibration samples for rows and Absorbance for columns, and setting the plots to be line plots in the Custom Layout dialog. You can also give a title for each plot, as shown below.
Custom layout dialog
Calibrate with MSC transformed data
So far we have only corrected the data, now we have to make a new PLS model using MSCorrected data.
Task
Make a PLS model with the same model parameters as the model “PLS Tutorial C”.
How to do it
Activate the matrix with the corrected data. Select Tasks - Analyze - Partial Least Squares Regression… and specify the following parameters in the Partial Least Squares dialog:
• Data Set: “Tutorial_C_MSC”
• Predictors:
Rows: Calibration (28)
Cols (X-variables): Absorbance (16)
• Responses:
Rows: Calibration (28)
Cols (Y-variables): Dye Level (1)
• Maximum components: 8
• Mean center data: selected
• Identify outliers: selected
• Weights: All 1.0 in X and Y
• Validation method: Cross Validation
Go to the Validation tab to select the cross validation method, again using Random with 7 segments.
Click Yes to view the plots for this model, and the regression overview plots will be displayed in the viewer.
The new regression model will create a new PLS node in the project navigator. Rename this to PLS MSCorrected by selecting the node and right clicking to select Rename.
Comparison of models
We are now interested in seeing how the model performs with regard to prediction ability. The residual variance is therefore the yardstick by which we compare the different models.
Task
Look at the residual variance for all models in Tutorial C.
How to do it
Study the residual variance for each model. In the project navigator, select the PLS results for the first PLS model, and from the plots folder select Regression overview. The plot in the lower left quadrant shows the variance. Use the toolbar shortcuts to display the residual Y-variance. We see that for the optimal number of factors (2) the variance value is 4.4.
There is a minimum in this plot at 5 factors, but beyond 2 factors the residual variance has not really decreased.
Y Residual validation variance: original data
View the same plot for the model PLS MSCorrected by going to the PLS Overview plot of the MSC corrected data (which should still be an open tab in the navigation bar at the bottom of the viewer). Highlight the lower left quadrant, the explained variance plot, and change the view to the residual Y-variance plot by using the toolbar shortcuts, selecting Y and Res, for just the validation samples.
Y Residual validation variance: MSC Corrected data
The plot shows the validated residual Y-variance for the MSC corrected model. Comparing the two models, the minimum squared error is lower for the MSC corrected model with two factors (1.87). So although the recommended optimal number of factors is four, even with two factors the system can be modeled well: more of the Y-variance is explained by two factors than when using the raw data (see the score plot). With the raw data a much higher error is obtained, and less of the Y-variance is explained with two factors. This shows that MSC has removed the interfering amplification effect in these data.
Tutorial C MSCorrected with four factors gives the lowest estimate for the residual Y-variance, so predictions made with this model using four factors will have the lowest prediction error. The system could also be modeled well enough with two factors; however, as we have no information here on the error of the reference method for measuring the dye level, we will follow the model's suggestion of four factors.
Check the error in original units: RMSE
The numerical residual variance values used to find the best model and decide the optimal number of factors are not related directly to the predictions. The residual variance cannot tell us how large the deviations in future predictions are expected to be; the RMSEP must be used for that purpose.
Task
Let us see how large an error in ml dye we can expect in future predictions: RMSEP.
How to do it
Activate the regression overview plot for the model PLS MSCorrected. Select Plot - Variances and RMSEP - RMSE.
Deselect the calibration samples box and select the validation samples (RMSEP) instead from the shortcut keys.
You see that the shape of the curve is exactly that of the residual variance, but the values have changed. The plot says that predictions done with this model and using four factors will have an average prediction error of 0.9.
RMSE: MSC Corrected data
Predict new MSCorrected samples
The model with MSC is the one we will use for the prediction of new samples.
Run a prediction with automatic pretreatment
The prediction samples will be transformed automatically with the same MSC model as the calibration samples. This requires that the variable selection for the data matrix includes the same number of variables as are associated with the MSC, which must be selected correctly in the Prediction dialog.
Task
Predict the dye level of the unknown samples.
How to do it
Select Tasks - Predict- Regression…. Specify the following parameters in the Prediction dialog:
• Model name: “PLS MSCorrected”
• Number of Components: 4
• Full Prediction with inlier options also selected
• Data Matrix: “Tutorial_C_MSC”
• Rows: Prediction (14)
• Columns: All
As you can see, there is the option to make the prediction with a different number of components than the one deemed optimal for the model. In the predictions we can also compare results with a model of fewer components, which helps to guard against possible overfitting.
Prediction dialog
Click View after the prediction is done. The prediction overview plot appears, where the predicted values are shown together with their deviations. A new node, Predict, has been added to the project navigator, with folders for raw data, validation, and plots. The prediction overview shows a plot of the values with their estimated uncertainties, and also a table of the values with these deviations.
Predicted values with deviation
Large deviations indicate that the predictions cannot be trusted. For a prediction to be trusted, the predicted sample must not be too far from the calibration samples; this is checked by the Inlier distance. Its projection onto the model should also not be too far from the center; this is checked with the Hotelling T² distance.
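The Hotelling T² statistic behind this check is just the sum of squared scores, each weighted by the inverse variance of that score in the calibration set. The 95% limit in the software comes from an F-distribution; this sketch only computes the statistic itself, on synthetic scores:

```python
import numpy as np

# Synthetic scores of 30 calibration samples on a 4-factor model
rng = np.random.default_rng(1)
T = rng.normal(scale=[2.0, 1.5, 1.0, 0.5], size=(30, 4))
T = T - T.mean(axis=0)                    # scores are centered in practice
score_var = T.var(axis=0, ddof=1)

# Hotelling T^2: squared scores weighted by the inverse score variance
t2_cal = np.sum(T ** 2 / score_var, axis=1)

def hotelling_t2(t, var):
    return float(np.sum(t ** 2 / var))

# T^2 for one new (projected) sample
t2_new = hotelling_t2(np.array([1.0, -0.5, 0.2, 0.1]), score_var)
```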
Study the Inlier vs Hotelling T² plot available from a right click on the plot and then Prediction - Inlier/Hotelling T² - Inliers vs Hotelling T²
Inliers vs Hotelling T²
In this case all the samples are found below the Inlier distance limit, showing that these samples are similar to those used in making the model. One sample is outside the Hotelling T² limit line (at 95% confidence), so it is an outlier. The prediction for that sample therefore cannot be trusted.
Guidelines for calibration of spectroscopic data
Now that you have learned the basics of calibration, let us suggest steps and useful functions for the development of calibration models.
See the guidelines for spectroscopic calibrations
Tutorial D: Screening and optimization designs
• Description
  o What you will learn
  o Data table
• Build a screening design
• Estimate the effects
  o Run an analysis of effects
  o Interpret the results
• Draw a conclusion from the screening design
• Build an optimization design
• Compute the response surface
  o Run a response surface analysis
  o Interpret analysis of variance results
  o Check the residuals
  o Interpret the response surface plots
• Draw a conclusion from the optimization design
Description
This tutorial is built from the enamine synthesis example published by R. Carlson in his book “Design and Optimization in Organic Synthesis”, Elsevier, 1992.
A standard method for the synthesis of enamine from a ketone gave some problems, and a modified procedure was investigated. A first series of experiments gave two important results:
1. Reaction time can be shortened considerably.
2. The optimal operational conditions were highly dependent on the structure of the original ketone.
Thus, a new investigation had to be conducted to study the specific case of the formation of morpholine enamine from methyl isobutyl ketone. It was decided to adopt a 2-step strategy:
1. At a screening stage, study the main effects of 4 factors (relative amounts of the reagents, stirring rate and reaction temperature).
2. Conduct an optimization investigation with a reduced number of factors.
What you will learn
Tutorial D contains the following parts:
• Build suitable designs for screening and optimization purposes
• Analysis of Effects
• Response Surface Modeling
References:
• Principles of Data Collection and Experimental Design
• Descriptive statistics
• Principles of experimental design
• Analysis of designed data
Data table
From the previous experiments, reasonable ranges of variation were selected for the 4 design variables:
Variable Low High
A: amount of TiCl4 / Ketone (mol/mol) 0.57 0.93
B: amount of Morpholine / Ketone (mol/mol) 3.7 7.3
C: reaction temperature (°C) 25 40
D: stirring rate (rpm) 0 50
Build a screening design
Screening designs are used to identify which design variables influence the responses significantly.
Task
Select a screening design which requires a maximum of 11 experiments that will make it possible to estimate all main effects.
Note: With 4 design variables, a Plackett-Burman design offers no advantage, because it requires 8 experiments, the same number as a fractional factorial design. A fractional factorial gives 8 (2⁴⁻¹) experiments, while a full factorial design gives 16 (2⁴) experiments.
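The suggested 2⁴⁻¹ fractional factorial can be written down directly: take a full 2³ design in A, B, C and set D = ABC (so that I = ABCD, giving resolution IV). A sketch:

```python
import itertools
import numpy as np

# Full 2^3 design in A, B, C (coded -1/+1), then generator D = A*B*C
abc = np.array(list(itertools.product([-1, 1], repeat=3)))
d = abc.prod(axis=1, keepdims=True)
design = np.hstack([abc, d])          # 8 runs, 4 factors

# With I = ABCD every main effect is aliased with a 3-factor interaction,
# and two-factor interactions are aliased in pairs (AB=CD, AC=BD, AD=BC).
```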
How to do it
Choose Insert – Create Design… to launch the Design Experiment Wizard.
In the Design Experiment Wizard, on the first tab Start, type a name for the table, for example “Enamine”. Select the Goal, which for now is Screening. It is possible to type information in the Information section.
Start tab filled
Go to the next section: Define Variables.
Specify the variables as shown in the table hereafter:
ID Name Analysis type Constraints Type of levels Levels
A TiCl4 Design None Continuous 0.6 - 0.9
B Morpholine Design None Continuous 3.7 - 7.3
C Temperature Design None Continuous 25.0 - 40.0
D Stirring Design None Continuous 0.0 - 50.0
ID Name Analysis type Constraints Type of levels Levels
1 Yield Response None – –
Do this by clicking the Add button and editing the Variable editor. Validate by clicking OK and enter the next variable by clicking Add again.
Define Variables tab filled
After all design variables have been defined, go to the next tab Choose the Design, to select the appropriate design.
By default, in the Beginner mode, the selected design is “Screening of many design variables” which refers to a Fractional factorial design as can be seen in the box below the Design section.
This design corresponds to the goal of the experimentation so no change is needed.
The Design Wizard - Choose the design tab
Go to the next tab: Design Details.
This tab gives information about the resolution of the design, the confounding pattern and the number of experiments to perform including the center samples.
By default the selected option is a Fractional factorial design with resolution IV, and the confounding pattern shows which interactions are confounded. It is possible to upgrade to a Full factorial, but this increases the number of experiments to perform to 19, which is more than we would like to do.
Study the confounding pattern of the suggested design. All main effects are confounded with 3-variable interactions, which is acceptable if those interactions are unlikely to be significant. The 2-variable interactions are confounded two by two. This is going to limit the study and the conclusions, but in a screening stage this is acceptable.
The Design Wizard - Design Details tab
Go to the next tab: Additional Experiments.
There is no need to replicate the design samples so the Number of replications is kept at its default value: “1”.
By default there are “3” center samples. This is enough.
There is no need to add reference samples.
The Design Wizard - Additional experiments tab
Proceed to the next tab, Randomization. There is no need to make any further specification in this tab. Try different options just to get familiar with the possibilities.
The Design Wizard - Randomization tab
Go to the Summary tab.
In this tab some information about the design is presented. It is also possible to calculate the power of the design. To do so two values are needed:
• Delta: the difference to detect. In this example a 3% yield improvement would be great.
• Std. dev.: the estimated standard deviation. In this example the yield for the same parameters varies with a standard deviation of 1.2.
Enter the following values:
• Std. dev.: 1.2
• Delta: 3
and click on the Recalculate power button
Note this value. As there is only one response variable, the power of the design is the same as the one calculated. A power greater than 0.80 is considered good enough.
The Design Wizard - Summary tab
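The power calculation can be approximated by hand. For a two-level design with N runs, the standard error of an effect is roughly 2σ/√N; below is a rough normal-approximation sketch of the power to detect a difference Delta (the software's exact computation may differ):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(delta, sd, n_runs):
    """Normal-approximation power to detect an effect of size delta (two-sided alpha = 0.05)."""
    se = 2.0 * sd / sqrt(n_runs)       # approx. standard error of an effect estimate
    z_crit = 1.959964                  # two-sided 5% critical value
    ncp = delta / se
    return norm_cdf(ncp - z_crit) + norm_cdf(-ncp - z_crit)

# Tutorial values: delta = 3, std. dev. = 1.2, 11 runs
power = approx_power(delta=3.0, sd=1.2, n_runs=11)
```

With these values the approximate power is well above the 0.80 threshold, consistent with the design being judged adequate.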
Go to the final tab: Design Table. Here the data table is presented with several view options. Check them out to familiarize with the options.
The Design Wizard - Design table tab
The design creation is now complete. Click the Finish button.
Now the data tables appear in the Navigator. There is a separate table for the responses. The design table has been organized with row sets for the design and center samples, and column sets for the effects.
The design tables in the Navigator
It is possible to view the data in different ways.
• To change the order from the standard sample sequence to the experiment sample sequence click on the column Randomized and go to Edit – Sort – Ascending.
• To change from the actual values to the level values click on the table and then View – Level indices.
Estimate the effects
After the experiments have been performed and the responses have been measured, the results have to be analyzed using a suitable method. Study the main effects of the four design variables. The simplest way to do this is to run an Analysis of Effects, and then, interpret the results.
Run an analysis of effects
Task
1. Fill in the responses in the matrix Enamine_Response. 2. Run an Analysis of Effects.
How to do it
First, enter the 11 response values manually. Make sure the rows are sorted in experimental order.
Sample Yield
(1) 74.3
ad 70.1
bd 87.9
ab 96.7
cd 72.8
ac 69.7
bc 88.7
abcd 97.1
cp01 96.4
cp02 96.8
cp03 96.9
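For a two-level design, the effect of a factor is simply the mean response at its high level minus the mean at its low level. A sketch using the eight factorial runs just entered (the coded levels are read off the run labels, with D = ABC):

```python
import numpy as np

# Coded levels (A, B, C, D) for runs (1), ad, bd, ab, cd, ac, bc, abcd
runs = {
    "(1)":  (-1, -1, -1, -1), "ad":   ( 1, -1, -1,  1),
    "bd":   (-1,  1, -1,  1), "ab":   ( 1,  1, -1, -1),
    "cd":   (-1, -1,  1,  1), "ac":   ( 1, -1,  1, -1),
    "bc":   (-1,  1,  1, -1), "abcd": ( 1,  1,  1,  1),
}
yields = np.array([74.3, 70.1, 87.9, 96.7, 72.8, 69.7, 88.7, 97.1])
X = np.array(list(runs.values()), dtype=float)

# Effect of each factor = mean(y at +1) - mean(y at -1), i.e. X_col . y / 4
effects = X.T @ yields / 4.0          # order: A, B, C, D
```

This reproduces the picture the analysis will give: B (Morpholine) is about 20.9 and A (TiCl4) about 2.5, while C and D are close to zero.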
To start the analysis, choose Tasks - Analyze - Analyze Design Matrix….
In the Method dialog select the Classical DOE analysis method and go to the second tab Model Inputs.
Method dialog
Predictors
In the Predictors part set the X matrix to be “Enamine_Design”, Rows “All” and the Cols “All”.

Model
The Model should include the “Main effects + Interactions (2-var)”. The list of estimated effects should be “A, B, C, D, AB, AC, BC”.

Note: Not all the interactions are presented. Remember that AB=CD, AC=BD and BC=AD because of the confounding pattern.

Responses
For the Responses set the Matrix to be “Enamine_Response”, Rows “All” and the Cols “All”.
Validate the final choices by clicking OK.
Model inputs
When the computations are done, click Yes to study the results. A new node called DOE Analysis is added into the navigator. Before doing anything else, use File - Save As to save the project with a name such as “Enamine Project”.
Interpret the results
Task
Interpret the results of the Analysis of Effects that was just run.
How to do it
The ANOVA Overview plot shows four informative plots:
• the ANOVA table
• the Diagnostics table
• the Effect viewer
• the Effect Summary table
ANOVA table
Look at the ANOVA table and check the validity of the model. The p-value of the model should be less than 0.05. If this is the case, look at the values for the different sources of variation, i.e. the main effects. The significant effects are the ones with a p-value less than 0.05; they are shown in shades of green. Here A (TiCl4), B (Morpholine), and AB=CD are found significant.
Check the R-square values; the closer to 1 the better.
ANOVA table
Note: The interaction effect BC=AD is a possible significant effect. Checking the effect value or the b-coefficient should help to determine whether it is significant or not.
The Effect viewer
Look at the effects and check for curvature. See whether the center sample average is placed such that the averages at the low and high levels are linked by a linear relation. If this is the case there is no curvature effect. Use the arrows on the toolbar to scroll through the effects for the different variables.

Here a curvature effect can be found for all effects: A (TiCl4), B (Morpholine), C (Temperature), D (Stirring). However, it can be noticed that the low and high values for C and D are quite similar. In addition, the center value is the same for all 4 effects. Most probably the curvature is associated with A and B, the significant effects.
Effect Morpholine on the Yield
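The curvature check can also be done numerically: compare the average of the center samples with the average of the factorial runs. A sketch with the tutorial's response values:

```python
import numpy as np

# Factorial runs and center samples from the screening experiment
factorial = np.array([74.3, 70.1, 87.9, 96.7, 72.8, 69.7, 88.7, 97.1])
center = np.array([96.4, 96.8, 96.9])

# A large gap between the two means indicates curvature
curvature = center.mean() - factorial.mean()
```

Here the gap is about 14.5 yield units, a strong sign of curvature, which is exactly why the tutorial moves on to an optimization design with quadratic terms.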
The Diagnostics
Look at the residuals to see if the model fits the samples well. The table is presented in the experimental (randomized) order, which makes it possible to check for any deviation with time.
The Summary table
See which effect is the most important (size) and the most significant (smallest p-value).
Look at the value of the coefficient for “Morpholine*Temperature”. This effect is much smaller than the significant ones. It can be neglected.
Summary table
Go through the other plots and check the plot interpretation in the DOE section
Draw a conclusion from the screening design
The final conclusions of the screening experiments are the following:
1. Three effects were found likely to be significant. One of them is a confounded interaction. Since the main effects of A and B are the only significant ones, we can make an educated guess and assume that the significant interaction is AB (and not CD, with which it is confounded).
2. There seems to be a strong nonlinearity in the relationship between Yield and (TiCl4, Morpholine). Furthermore, since the center samples have a higher yield than the majority of the design samples, the optimum is likely to be somewhere inside the investigated region.
Thus, the next sensible step would be to perform an optimization, using only variables TiCl4 and Morpholine.
Build an optimization design
After finding the important variables from a screening design, it is natural to proceed to the next step: find the optimal levels of those variables. This is achieved by an optimization design.
Task
Build a Central Composite Design to study the effects of the two important variables (TiCl4 and Morpholine) in more detail.
Note: The other two variables investigated in the screening design, found to not be significant, have been set to their most convenient values: No stirring, and Temperature=40°C.
How to do it
Go to Tools - Extend/Modify a design
A dialog box opens, in which one selects the design to be extended or modified. Select the design “Enamine_Design”.
Modify/Extend Design dialog
The Design Experiment Wizard opens.
On the first tab Start, type a name for the table, for example “Enamine_Opt”. Select the Goal, which is now Optimization. It is possible to type information in the Information section.
Go to the next section: Define Variables.
Delete the variables “Temperature” and “Stirring”. To do so, click on the variable to be deleted and press Delete.
The design variables “TiCl4” and “Morpholine” as well as the response variable “Yield” are kept.
ID Name Analysis type Constraints Type of levels Levels
A TiCl4 Design None Continuous 0.6 - 0.9
B Morpholine Design None Continuous 3.7 - 7.3
1 Yield Response None – –
Define variables tab
Go to the next tab Choose the Design. The selected option, Optimization of response(s) with 3 or 5 levels, corresponds to either a central composite design or a Box-Behnken design. This is a good option for an optimization on variables without constraints. Do nothing and go to the next tab.
Choose the Design tab
In the next section Design Details, four options are proposed. Look at the bottom table to see the differences between the designs and their performance. As it is possible to do experiments outside the selected range, the option Circumscribed Central Composite (CCC) design is chosen. Check the value of the star point distance to the center: it should be 1.412 for two design variables.
Design Details tab
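For a rotatable circumscribed CCD, the star-point distance is α = (number of factorial points)^(1/4), i.e. about 1.414 for two design variables; the 1.412 shown by the software is presumably a rounded or slightly adjusted value. A sketch generating the CCC run pattern:

```python
import itertools
import numpy as np

k = 2                                   # number of design variables
alpha = (2 ** k) ** 0.25                # rotatable star distance, ~1.414 for k = 2

cube = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))   # 4 factorial points
star = np.vstack([alpha * np.eye(k), -alpha * np.eye(k)])         # 4 axial (star) points
center = np.zeros((5, k))                                         # 5 center samples
ccd = np.vstack([cube, star, center])                             # 13 runs total
```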
Go to the next section: Additional Experiments.
In this section it is possible to add some samples: either replicate the design points or the center samples. Leave the Number of replications at “1”. Set the Number of center samples to “5”. There are no Reference samples.
Additional Experiments tab
Go to the Randomization tab.
It is possible to change the order of the experimentation by modifying the settings of this tab. To not randomize a design variable use the Detailed randomization button. To just have another go at the randomization click on Re-randomize.
Randomization tab
In the Summary tab check that the design includes a total of 13 experiments. Otherwise, go back to the appropriate tab and make the necessary corrections.
Summary tab
Go to the Design Table tab, and display the experiment in different views.
Design Table tab
Finally click the Finish button.
The generated design table is displayed in the viewer and all associated tables are automatically added to the project navigator. Their names start with “Enamine_Opt”.
Save the project, which now includes the information for both the screening and the optimization experiments.
Generated designed tables
Compute the response surface
After the new experiments have been performed and their results collected, it is possible to analyze the results so as to find the optimum. This is done by finding the levels of TiCl4 and Morpholine that give the best possible yield. A response surface analysis can give this information.
Run a response surface analysis
Task
Run a Response Surface Analysis.
How to do it
Enter the response values in the “Enamine_Opt_Response” matrix. Before doing so, check that the order of experiments is the standard one and not the experimental one. Use Edit-Sort-Ascending to change the order if necessary.
Sample Yield
Axial_A(high) 84.9
Axial_A(low) 76.8
Axial_B(high) 81.3
Axial_B(low) 56.6
Cube1 73.4
Cube2 69.7
Cube3 88.7
Cube4 98.7
cp01 96.4
cp02 96.8
cp03 87.5
cp04 96.1
cp05 90.5
Response matrix
Choose Tasks – Analyze – Analyze Design Matrix….
In the first tab, Method, select the first option: Classical DOE.
In the dialog box, make the following selections:
• Predictor Matrix: “Enamine_Opt_Design”, Rows: “All”, Cols: “All”
• Model: “Main effects + Interactions (2-var) + Quadratic”
• Responses Matrix: “Enamine_Opt_Response”, Rows: “All”, Cols: “All”
Model inputs
Click OK to start the analysis.
When the computations are done, click Yes to study the results. A new node called DOE Analysis(1) is added into the navigator.
Interpret analysis of variance results
Task
Interpret the results from the analysis.
How to do it
The ANOVA Overview plot shows four informative plots:
• the ANOVA table
• the Diagnostics table
• the Effect viewer
• the Effect Summary table
First, study the ANOVA results.
Note: It is possible to see a table better by expanding any quadrant of the overview: drag the resize cross.
Study in turn: Summary, Variables, and Quality in the ANOVA table.
ANOVA Table for the Response Surface model
The Summary shows that the model is globally significant, so it is possible to go on with the interpretation.
The ANOVA table for the variables displays the p-values for each effect. The most significant coefficients are the linear and quadratic effects of Morpholine. The TiCl4 effects look less important but are still significant, the square term in particular being very significant. However, the interaction is more doubtful.
The Quality section tells about the quality of the fit of the response surface model: the R-square values for calibration and prediction are very good.
In the Results node in the project navigator, check the tables Model check and Lack of fit.
The Model Check indicates that the quadratic part of the model is significant, which shows that the interaction and square effects included in the model are useful.
The Lack of Fit section shows that, with a p-value greater than 0.05, there is no significant lack of fit in the model. Thus the model can be trusted to describe the response surface adequately.
Check the residuals
Task
Check the residuals from the Response Surface Analysis.
How to do it
Go to the predefined plot Residuals overview, found in the Plots folder in the project navigator.
Start with the Normal Probability plot of the residuals. This plot can be used to detect outliers. Here the residuals form two groups (positive residuals and negative ones). Apart from that they lie roughly along a straight line, and there is one extreme residual, “cp03”. This may be an outlier.
Normal Probability plot of the residuals
Look at the second plot Y-Residuals vs Predicted Y.
Y-Residuals vs Predicted Y
In the residuals plot, all values are within the (-4;+4) range, except “cp03” which has a high residual. For the other samples, there is no clear pattern in the residuals, so nothing seems to be wrong with the model.
Look at the bottom right plot Y-residuals vs Experimental order. Check if there is a bias with time. Look at the 5 center samples residuals.
The center samples show quite some variation. This is why so few effects in the model are very significant: there is quite a large amount of experimental variability.
Interpret the response surface plots
Now that the model has been thoroughly checked, use it for final interpretation. This is most easily done by studying the response surface.
Task
Interpret the response surface plots.
How to do it
The contour plot is available from the project navigator in the folder Plots - Response surface and shows the shape of the response surface as a contour plot. Click on it and select the menu Properties to change it into a 3-D response surface. Change the scaling to zoom around the optimum, so as to locate its coordinates more accurately.
Click at various points in the neighborhood of the optimum to see how fast the predicted values decrease. Notice that the top of the surface is rather flat, but that the further away you go, the steeper the decrease in Yield.
Finally, notice that the Predicted Max Point Value, found in the table below the plot, is smaller than several of the actually observed Yield values (sample Cube4, for instance, has a Yield of 98.7). This is not paradoxical, since the model smooths the observed values. Those high observed values might not be reproduced if the same experiments were performed again.
Draw a conclusion from the optimization design
The analysis gave a significant model, in which the quadratic part in particular was significant, thus justifying the optimization experiments.
Since there was no apparent lack of fit, no outliers, and the residuals showed no clear pattern, the model could be considered valid and its results interpreted more thoroughly.
The response surface showed an optimum predicted Yield of 96.747 for TiCl4=0.835 and Morpholine=6.504. The predicted Yield is larger than 95 in the neighboring area, so that even small deviations from the optimal settings of the two variables will give quite acceptable results.
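The numbers in this conclusion can be reproduced approximately by fitting the quadratic model to the 13 coded runs by least squares and solving for the stationary point. A sketch; the assignment of Cube1 to Cube4 to the (±1, ±1) corners is an assumption based on standard run order:

```python
import numpy as np

a = 2 ** 0.5  # rotatable star distance for 2 factors (software shows 1.412)
pts = np.array([
    [ a, 0.0], [-a, 0.0], [0.0,  a], [0.0, -a],   # axial: A high/low, B high/low
    [-1, -1], [ 1, -1], [-1,  1], [ 1,  1],       # cube runs (assumed order)
    [0, 0], [0, 0], [0, 0], [0, 0], [0, 0],       # 5 center samples
])
y = np.array([84.9, 76.8, 81.3, 56.6,
              73.4, 69.7, 88.7, 98.7,
              96.4, 96.8, 87.5, 96.1, 90.5])

A, B = pts[:, 0], pts[:, 1]
X = np.column_stack([np.ones(len(y)), A, B, A * B, A ** 2, B ** 2])
b0, b1, b2, b12, b11, b22 = np.linalg.lstsq(X, y, rcond=None)[0]

# Stationary point of the fitted quadratic: solve grad(y) = 0
H = np.array([[2 * b11, b12], [b12, 2 * b22]])
opt = np.linalg.solve(H, -np.array([b1, b2]))
y_max = float(b0 + b1 * opt[0] + b2 * opt[1] + b12 * opt[0] * opt[1]
              + b11 * opt[0] ** 2 + b22 * opt[1] ** 2)
```

Back-transforming the coded optimum with the centers (0.75, 5.5) and half-ranges (0.15, 1.8) gives roughly TiCl4 ≈ 0.82 and Morpholine ≈ 6.50, close to the software's reported optimum; small differences come from the exact star distance used and the assumed cube-run order.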
Tutorial E: SIMCA classification
• Description
  o What you will learn
  o Data table
• Reformat the data table
• Graphical clustering
  o Graphical clustering based on hierarchical clustering
  o Graphical clustering based on score plots
• Make class models
• Classify unknown samples
• Interpretation of classification results
• Diagnosing the classification model
Description
The data to be classified in this tutorial is taken from the classical paper by Fisher (Fisher RA, The use of multiple measurements in taxonomic problems, Ann. Eugenics, 7, 179–188 (1936)). The task is to see whether three different types of iris flowers can be classified by four measurements made on them: the length and width of the sepal and petal.
What you will learn
Tutorial E contains the following parts:
• Make models of different classes
• Classify new data
• Diagnose the classification model
References:
• Principal Component Analysis (PCA) overview
• Classification
• SIMCA Classification
Data table
Import the Tutorial E data set used in this tutorial (the file Tutorial_E in the Examples folder).
The data contains 75 training (calibration) samples and 75 testing (validation) samples.
The training samples are divided into three Row (Sample) ranges, each containing 25 samples. The three Sets are: Setosa, Versicolor, and Virginica. The row set Testing will later be used to test the classification.
Four variables are measured: Sepal length, Sepal width, Petal length, and Petal width. The measurements are given in centimeters. These four variables are collectively defined as the column set Iris properties.
Reformat the data table
Whenever working with classification, it is very useful to identify samples belonging to the same class under all circumstances – in the raw data table and on PCA or classification plots.
In order to do this, we need to create a categorical variable stating class membership for all samples.
Task
Insert a categorical variable into the Tutorial_E data table.
How to do it
Open the file Tutorial_E from the Examples folder.
Select the first column in the editor and select Edit - Insert - Category Variable…. This opens a dialog that asks how to define the levels.
First enter a name for the variable: “Iris type”.
Then select the second option: Specify levels to be based on a collection of row sets.
Then select, one by one, the three row ranges “Setosa”, “Versicolor” and “Virginica”, and add them to the list of levels using the Add button.
Category variable dialog
Now a new column has been created “Iris type” containing the appropriate value for each sample in each cell of the column.
Data table with category variable “Iris”
Graphical clustering
It is always a good idea to start a classification with some exploratory data analysis. You can run a PCA model and/or a hierarchical clustering of all samples. If you do not know the classes in advance, this is a way of visualizing the clustering. The calibration samples must then be assigned to the different classes, to give a sense of whether a classification model can be developed.
Graphical clustering based on hierarchical clustering
Task
Perform hierarchical clustering of all calibration samples.
How to do it
Use Tasks - Analyze - Cluster Analysis… and select the following parameters:
Model inputs
• Matrix: Tutorial_E
• Rows: Training
• Columns: Iris properties
• Number of clusters: 3
• Clustering method: Hierarchical Complete-linkage
• Distance measure: Squared Euclidean
In the options tab, you can assign samples to the initial clusters, but for this exercise, we will make a completely unsupervised cluster analysis.
Click OK for the Cluster analysis to run.
When the clustering is complete a dialogue asking if you want to view the plots will appear. Click Yes.
The Dendrogram showing the clustering of samples will be displayed. Notice that three clusters are identified, but they are not all of equal size. All the results are in a new Cluster analysis node in the project.
Dendrogram: Complete-linkage squared Euclidean distance
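Complete-linkage clustering with squared Euclidean distances can be sketched without any special library: repeatedly merge the two clusters whose farthest-apart members are closest. A naive sketch on synthetic 2-D data standing in for the iris measurements:

```python
import numpy as np

def complete_linkage(X, n_clusters):
    """Naive agglomerative clustering: complete linkage, squared Euclidean distance."""
    clusters = [[i] for i in range(len(X))]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise sq. distances
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance = max pairwise member distance
                dist = max(d2[p, q] for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

# Three well-separated synthetic groups of 10 samples each
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc, 0.3, size=(10, 2)) for loc in ([0, 0], [4, 0], [0, 4])])
labels = complete_linkage(X, 3)
```

This is only an illustration of the algorithm; the software additionally produces the dendrogram and the per-cluster row sets described above.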
Open the Results folder for the cluster analysis, and expand the levels so that you see the different row sets; one has been defined for each cluster.
Cluster analysis results in project navigator view
By looking at the row sets, one can see that the Setosa samples are all assigned to one cluster, and that there is a small cluster that contains only Virginica samples, but a larger group has a mix of both Virginica and Versicolor samples. These results suggest that, based on the four variables provided for these irises, an unambiguous classification may be difficult.
Graphical clustering based on score plots
Task
Make a PCA model of all calibration samples.
How to do it
Use Tasks - Analyze - Principal Component Analysis… and select the following parameters:
Model inputs
• Matrix: Tutorial_E
• Rows: Training
• Columns: Iris properties
• Maximum components: 4
• Keep the default ticks in the boxes Mean center data and Identify Outliers
Weights
On the weights tab, select all the variables by highlighting them, and setting the weight by selecting the correct radio button.
• Weights: 1/SDev
Click Update.
Validation
Proceed to the Validation tab to set the validation.
• Validation Method: Cross validation
You can now click OK for the PCA to run.
We assume that you are familiar with making models by now. Refer to one of the previous tutorials if you have trouble finding your way in the PCA dialog.
When the model is built a dialogue asking if you want to view the plots will appear. Click Yes.
The PCA Overview, consisting of the plots of the scores, loadings, influence and explained variance, will be displayed. All the results are in a new PCA node in the project.
Activate the explained variance plot in the lower right quadrant and click on the Cal button on the toolbar so that only Validation variance remains on the plot.
Explained validation variance
We see that the Explained Validation Variance is 91% with 2 PCs.
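The 1/SDev weighting plus mean centering amounts to autoscaling, and the explained variance per component comes straight out of a singular value decomposition. A sketch on synthetic data (not the iris measurements):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4)) * [3.0, 2.0, 1.0, 0.5]   # 25 samples, 4 variables

# Autoscale: mean center, then weight each variable by 1/SDev
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()        # fraction of variance per component
cum2 = explained[:2].sum()                 # cumulative explained variance with 2 PCs
```

Note this computes calibration variance only; the 91% quoted above is a cross-validated figure, which requires refitting the model with samples left out.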
Activate the score plot and right click to select sample grouping. Select the row sets for the Setosa, Versicolor and Virginica. Click OK.
Score plot with sample grouping
You can see the three groups in different colors; one very distinct (Setosa) and two that are not so well separated (Versicolor and Virginica). This indicates that it may be difficult to differentiate Versicolor from Virginica in an overall classification model.
Make class models
Before we classify new samples, each class must be described by a PCA model. These models should be made independently of each other. This means that the number of components must be determined for each model, outliers found and removed separately, etc.
Task
Make PCA models for the three classes Setosa, Versicolor, and Virginica.
How to do it
Go back to the Editor window containing your reformatted data table.
Select the first 25 samples, corresponding to the Setosa samples, and create a new range by right clicking on the selected data and selecting the menu Create Row Range. Do the same with the next 25 samples, corresponding to the Versicolor samples, and with samples 51 to 75, corresponding to the Virginica samples.
Create Range menu
Rename each range with a name reflecting the samples it contains using a right click on the row set and select Rename.
Rename row set menu
Select Tasks - Analyze-Principal Component Analysis… and make the first PCA model for Setosa with the following parameters:
Model Inputs
• Matrix: Tutorial_E
• Rows: Setosa
• Cols: Iris properties
• Maximum components: 4

Weights: 1/SDev

Validation: Leverage correction
When the model is computed, view the plots. In the project navigator rename the PCA class model to “PCA Setosa” by highlighting the new PCA node, right clicking and selecting Rename.
Rename menu
Repeat the procedure successively on Row Sets Versicolor and Virginica, also renaming each new PCA model.
Classify unknown samples
When the different class models have been made and new samples are collected, it is time to assign them to the known classes. In our case the test samples are already in the data table, ready to use.
Task
Assign the Sample Set Testing to the classes Setosa, Versicolor, and Virginica.
How to do it
Select Tasks - Predict- Classification - SIMCA….
Menu Tasks - Predict- Classification - SIMCA…
Use the following parameters:
• Matrix: Tutorial_E
• Rows: Testing
• Columns: Iris properties
Make sure that Centered Models is checked. Add the three PCA class models Setosa, Versicolor, and Virginica.
SIMCA classification dialog
The suggested number of PCs to use is 3 for all models; keep that default (it is based on the variance curve for each model).
Click OK to start the classification.
Interpretation of classification results
The classification results are displayed directly in a table, but you may also investigate the classification model closer in some plots.
Interpret the classification table
Task
Interpret the classification results displayed in the SIMCA results.
How to do it
Click View when the classification is finished.
A table plot is displayed, called Classification membership. There are three columns: one for each class model.
Samples “recognized” as members of a class (they are within the limits on sample-to-model distance and leverage) have a star in the corresponding column.
SIMCA classification table
The significance level can be changed with the Significance option, available on the menu bar.
At the 5% significance level, we can see that all but three samples (false negatives: virg1, virg36, virg42) are recognized by their rightful class model.
However, some samples are classified as belonging to two classes (false positives): 12 Versicolor samples are also classified as Virginica, while 6 Virginica samples are also classified as Versicolor. Only the Setosa samples are 100% correctly classified (no false positives, no false negatives).
If you raise the significance level to 25%, this reduces the number of false positives but also increases the number of false negatives (vers41 and virg35 come in addition).
Interpret the Cooman’s plot
If a sample is doubly classified, you should study both Si (sample-to-model distance) and Hi (leverage) to find the best fit; at similar Si levels, the sample is probably closest to the model to which it has the smallest Hi. The classification results are well displayed in the Cooman’s plot.
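The Si distance underlying these decisions is the residual standard deviation of a sample after projection onto a class PCA model; a sample is accepted if Si is not significantly larger than the class's own residual level S0 (the software uses an F-test for the limit). A simplified sketch on synthetic data, with the degrees-of-freedom corrections omitted:

```python
import numpy as np

def fit_class_model(X, n_comp):
    """Mean-centered PCA class model via SVD; returns mean, loadings, and pooled residual S0."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    P = Vt[:n_comp].T                        # loadings (variables x components)
    E = (X - mu) - (X - mu) @ P @ P.T        # calibration residuals
    s0 = np.sqrt((E ** 2).sum() / E.size)    # simplified pooled residual level
    return mu, P, s0

def si_distance(x, mu, P):
    """Residual (sample-to-model) distance of one sample."""
    r = (x - mu) - (x - mu) @ P @ P.T
    return float(np.sqrt((r ** 2).mean()))

# Synthetic one-component class: samples along one direction plus small noise
rng = np.random.default_rng(5)
t = rng.normal(size=(20, 1))
cls = t @ np.array([[1.0, 0.8, 0.5, 0.2]]) + rng.normal(0, 0.05, size=(20, 4))
mu, P, s0 = fit_class_model(cls, n_comp=1)

inside = si_distance(cls[0], mu, P)                             # a calibration sample
outside = si_distance(np.array([5.0, -5.0, 5.0, -5.0]), mu, P)  # clearly foreign sample
```

The Hi (leverage) part of the decision is the complementary within-model distance, analogous to the Hotelling T² check used for predictions earlier.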
Task
Look at the Cooman’s plot.
How to do it
Under the SIMCA/Plots node choose the Cooman’s plot. You can change which classes it displays from the toolbar; set it to display the models Virginica and Versicolor.
This plot displays the sample-to-model distance for each sample to two models. The newly classified samples (from sample set Testing) are displayed in green color, while the calibration samples for the two models are displayed in blue and red.
Cooman’s plot for Versicolor vs. Virginica
The Cooman’s plot for the classes Virginica and Versicolor shows that all Setosa samples are far away from the Virginica model (they appear far to the right). However, we can see that many Virginica and Versicolor samples are within the distance limits for both models. This suggests some classification problems.
Interpret the Si vs Hi plot
We also have to look at the distance from the model center to the projected location of the sample, i.e. the leverage. This is done in the Si vs. Hi plot.
Task
Look at the Si vs. Hi plots.
How to do it
Under the SIMCA/Plots node choose the Si vs. Hi plot, and set it for the model Versicolor using the arrows on the toolbar. Before interpreting the plot, turn on sample grouping by right clicking in the plot window and selecting the Sample Grouping option. In the Sample grouping & marking dialog, select the row sets Setosa, Versicolor and Virginica. The point labels can be shortened to the first two characters of each sample name: right click and select Properties, then select Point Label in the left list to open the Point Label dialog. Select the radio button Name, and under Label layout use the Show drop-down list to select First and enter 2 in the Number of characters box, as shown in the dialog.
Point layout dialog
This then provides a plot which is much easier to interpret: the iris type appears clearly, with the initials Se, Ve, Vi in three different colors.
Si vs Hi plot for the model Versicolor
Some Virginica samples are classified as belonging to the class Versicolor, but most samples that are not Versicolor are outside the lower left quadrant. The reason for the difficult classification between Versicolor and Virginica is that the samples are overlapping in the score plot. They are very similar with respect to the sepal and petal width.
Diagnosing the classification model
In addition to the Cooman’s and Si vs Hi plots, there are three more plots that give us information regarding the classification.
Interpret model-to-model distance
Task
Look at the Model Distance plots.
How to do it
Under the SIMCA/Plots node choose the Model Distance plot, and set it for the model Versicolor using the arrows on the toolbar. Change it to a bar chart using the toolbar shortcut.
Model distance for Versicolor model
This plot allows you to compare the class models. A model-to-model distance larger than three indicates good class separation. It is clear from this plot that the Setosa model is well separated from Versicolor, with a distance close to 10, while the distance to Virginica is smaller.
Interpret discrimination power
Task
Look at the Discrimination Power plots.
How to do it
Under the SIMCA/Plots node choose the Discrimination Power plot. Using the arrows on the toolbar, choose the discrimination power for Versicolor projected onto the Setosa model.
This plot tells which of the variables are most useful in describing the difference between the two types of iris.
Discrimination power: Versicolor onto Setosa
We can see that variables sepal length and sepal width have high discrimination powers between these classes, while it is lower for the petal length and width.
Do the same for Versicolor onto Virginica: all variables have discrimination powers around 3. This is obviously not enough to completely discriminate these classes.
Interpret modeling power
Task
Look at the Modeling Power plots.
How to do it
Under the SIMCA/Plots node choose the Modeling Power plot for Versicolor.
Variables with a modeling power near one are important for the model. A rule of thumb says that variables with modeling power less than 0.3 are of little importance for the model.
Modeling power for Versicolor
The plot tells us that all variables have a modeling power larger than 0.3, which means that all variables are important for describing the model. None of the variables should be deleted from the modeling. The only chance to improve on the classification between Versicolor and Virginica is to measure some additional variables.
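The modeling power statistic itself can be sketched from the class model's residuals. This is a hedged illustration on made-up data; conventions for the degrees of freedom in the two standard deviations vary between implementations.

```python
import numpy as np

def modeling_power(X_class, n_comp):
    """Per-variable modeling power: 1 - s_residual / s_total (simplified)."""
    center = X_class.mean(axis=0)
    Xc = X_class - center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp].T
    E = Xc - Xc @ P @ P.T                  # residual matrix after n_comp PCs
    s_resid = E.std(axis=0, ddof=1)        # residual std dev per variable
    s_total = Xc.std(axis=0, ddof=1)       # total std dev per variable
    return 1.0 - s_resid / s_total

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) * [3.0, 2.0, 1.0, 0.5]   # hypothetical class data
mp = modeling_power(X, n_comp=2)
print(mp)
```

Variables with values near 1 are well described by the class model; values below about 0.3 indicate little importance, matching the rule of thumb above.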
Tutorial F: Interacting with other programs
• Description
 o What you will learn
 o Data table
• Import spectra from an ASCII file
• Import responses from Excel
• Create a categorical variable
• Append a variable to the data set
• Organizing the data
• Study the data before modeling
 o Plot spectral data
 o Basic statistics on data
• Make a PLS Model
 o Interpretation of the Regression Overview
 o Customizing plots and copying them into other programs
• Save PLS model file
• Export ASCII-MOD file
• Export data to ASCII file
Description
It is not uncommon to use The Unscrambler® together with other programs in one’s daily work. This could be a word processor used to document the latest work, or instrument software.
This tutorial shows some of the capabilities The Unscrambler® has to interact with other programs under the Windows operating system. The main focus here is how The Unscrambler® is used in conjunction with other software.
What you will learn
Tutorial F contains the following parts:
• Import data file;
• Drag and drop from other programs;
• Insert categorical variable;
• Edit plots and insert into another program;
• Save models for use in The Unscrambler® Online Predictor and The Unscrambler® Online;
• Write an ASCII-MOD file.
References:
• Basic principles in using The Unscrambler®
• Importing data into The Unscrambler®
• About Regression methods
• Customizing Plots
• Exporting data from The Unscrambler®
Data table
The data are NIR spectra of wheat samples collected at a mill. Fifty-five samples were collected, and their NIR spectra were measured on an instrument using 20 channels.
The water content of wheat samples was measured by a reference method and is the response variable in the data. These values are stored in a separate file.
Click the following links to save the data files to be used in this tutorial:
• Tutorial F data set: Spectra
• Tutorial F data set: Responses
Import spectra from an ASCII file
Data are stored in many different ways. The simplest and most flexible way is to store data in ASCII files.
Task
Import the “Tutorial_F_spectra.csv” ASCII data file.
How to do it
Start The Unscrambler® and go to File – Import data – ASCII…. Locate the file “Tutorial_F_spectra.csv” in the browser and click Open.
This launches the Import ASCII dialog, where you specify what the ASCII file looks like. Use the options displayed in the dialog. Note that the first row in the data file contains variable names and the first column contains sample names. The separator for the data is a comma. Check the boxes Process double quotes and Treat consecutive separators as one.
ASCII Import Dialog
Click OK to import the file and the data are read into The Unscrambler®, creating a data table called “Tutorial_F” in the project.
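Outside The Unscrambler®, a file with this layout (comma-separated, variable names in the first row, sample names in the first column, names in double quotes) can be read with pandas. The snippet below builds a small stand-in for the tutorial file so that it runs on its own; the sample and channel names are made up.

```python
import io
import pandas as pd

# Stand-in for Tutorial_F_spectra.csv: first row = variable names,
# first column = sample names, comma as separator, names double-quoted.
csv_text = (
    '"Sample","1680","1940","2100"\n'
    '"wheat1",0.41,0.92,0.55\n'
    '"wheat2",0.39,0.88,0.53\n'
)

spectra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(spectra.shape)                  # 2 samples x 3 wavelength channels
print(spectra.loc["wheat1", "1940"])
```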
Import responses from Excel
Spreadsheet applications are commonly used for storing data. It is easy to transfer data between such a program and The Unscrambler®. The water content of the wheat samples is stored in an Excel file together with the sample names.
Task
Import the water values from the Excel data file “Tutorial_F_responses.xls” into the existing data table.
How to do it
There are two procedures. Use procedure 1 if you have Microsoft Excel or another spreadsheet application installed on your computer or procedure 2 if you do not have a spreadsheet program that can read the file “Tutorial_F_responses.xls”. You only need to follow one of the procedures.
We will begin by appending a column to the existing data table. Put the cursor in the data viewer and select Edit – Append, and in the dialog, enter 1 to add a single column.
1. Copy and paste from Excel
Launch Microsoft Excel and open the file “Tutorial_F_responses.xls”. Copy the values from the column water, and paste them into the empty column that you appended in data matrix “Tutorial F”.
2. Import data from the Excel file
From File – Import data – Excel…, select “Tutorial_F_responses.xls” from the location and click Import.
In the project navigator you will find the two data matrices which you imported from the ASCII and Excel files, respectively. Rename the matrices by selecting them, right clicking and choosing Rename; rename them as Wheat NIR Spectra and water content.
Data matrices in the Navigator
We could leave the response Y values (water content) in a separate matrix and do the analysis from these two matrices. But for consistency of data organization in this exercise, we will copy the values from the Water content matrix into the empty column (21) that we appended to the data matrix “Wheat NIR Spectra”.
Create a categorical variable
Categorical variables are useful to calculate statistics and to use in plot interpretation.
Task
Insert a variable to group the samples into three categories, depending on the water content level.
How to do it
Place the cursor in the first column and select Edit – Insert… and insert one empty column. Then use copy (Ctrl+C) - paste (Ctrl+V) to copy the water content data into the new column.
Rename the column as “Water levels”.
Then select the “Water levels” column and go to the menu Edit – Change Data Type and select Categorical.
Edit – Change Data Type - Categorical menu
The category converter dialog appears. Select the option New levels based upon ranges of values. Add three levels by entering 3 for the Desired number of levels, and specify the following ranges manually:
• Low (Water < 13.0),
• Medium (13.0 ≤ Water ≤ 15.0), and
• High (Water > 15.0).
Category Converter menu
The column of the categorical values is orange to distinguish this kind of variable from the ordinary ones.
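The same three-level split can be sketched with pandas; the cut points come from the tutorial, while the water values and the bin edge conventions here are made up for illustration.

```python
import pandas as pd

water = pd.Series([12.1, 13.4, 14.9, 15.2, 16.0], name="Water")  # made-up values

# Bin into Low (< 13.0), Medium (13.0 - 15.0), High (> 15.0)
levels = pd.cut(
    water,
    bins=[-float("inf"), 13.0, 15.0, float("inf")],
    labels=["Low", "Medium", "High"],
)
print(list(levels))   # ['Low', 'Medium', 'Medium', 'High', 'High']
```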
Data after insertion of a category variable
Append a variable to the data set
Sometimes it is interesting to have all the information in only one data table.
Task
Append a variable to have the NIR spectra and the water content in the same table.
How to do it
Place the cursor in the last column and select Edit – Append… and append one empty column. Then use copy (Ctrl+C) - paste (Ctrl+V) to copy the water content data into the new column.
Rename the column as “Water”.
Organizing the data
Most of the time, you will want to work on subsets of your data table. To do this, you must define ranges for variables and samples. One Sample Set (Row range) and one Variable Set (column range) make up a virtual matrix which is used in the analysis.
Task
Define the Column ranges (variable sets) “Level”, “Water content” and “NIR Spectra”.
How to do it
Choose Edit - Define Range… to create sample sets and variable sets by defining Rows and Columns, or right click after selecting Rows (samples) or Columns (variables) and choose Create Row Range or Create Column Range, respectively.
We begin by defining the column range for the water content by highlighting column 22, and going to Edit - Define Range. This opens the Define range dialog, where we determine the column range Water, entering this name for Column.
Define Range Dialog
Do the same then to define the column range for “level” in column 1, and “NIR Spectra” in columns 2-21.
The list of defined data ranges are found in the project navigator as nodes under the data matrix.
Project navigator with data sets defined
Go to File-Save As… to save the project as Tutorial F.
Study the data before modeling
In any analysis, it is advisable to begin by familiarizing yourself with the data. We should plot data to see if there are any obvious patterns or problems with the data. Does it look as we expect? Are there outliers? From looking at the raw data, we may also be able to see if we should apply a transform to the data. We can also look at the statistics on the data, to get an understanding of the distributions in the data.
Plot spectral data
The NIR data used here are collected at 20 wavelengths using a filter instrument, so they do not give a complete spectrum. Regardless, it is still advisable to plot the data to get an understanding of it. Select the column set NIR Spectra in the project navigator. Right click and select Plot - Line to get the plot shown below. In the plot, we can see that the strongest absorbance peak is at 1940 nm, where the OH vibration for water is found in the NIR spectrum. There is now a new entry in the project navigator for the Line plot. You can rename this by right clicking and choosing Rename.
Line Plot of Spectral Data
Basic statistics on data
We can check the statistics of our data as well. This can be done for all the spectral data, and for the response variable. Here we will compute the statistics for the water content values. We begin by plotting a histogram, which shows the distribution of values. When we are developing a calibration, we would like to have an even distribution of the response values over the calibration range where we will be operating. Highlight the column “Water” and go to Plot-Histogram to get the following plot. The line for a normal distribution is superimposed on the plot, and the statistics for this sample set are displayed.
Histogram plot of water content
We can also compute the statistics without the plot by going to Tasks-Analyze-Descriptive Statistics…. In the dialog, select all the rows and the column “Water”, and click OK. When the computation is complete, say Yes to see the plots now. A quantiles plot and a mean and standard deviation plot are displayed. If you had more than one variable, the plots would show results for all the variables. A new node, “Descriptive Statistics”, has been added to the project navigator. It has subfolders containing the raw data, results, and plots of the statistical analysis. Expand the folder “Results” and select the matrix “Statistics” to see the numerical results.
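The same kind of summary numbers (count, mean, standard deviation, min, quartiles, max) can be reproduced with pandas, using made-up water values:

```python
import pandas as pd

water = pd.Series([12.8, 13.5, 14.1, 14.7, 15.3, 15.9], name="Water")  # made-up

stats = water.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(stats["mean"], stats["std"])
print(stats["25%"], stats["50%"], stats["75%"])
```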
Statistics on water content
Make a PLS Model
The NIR spectra should contain information which makes it possible to predict the water content from them. Let us make a model and find out.
Task
Make a PLS model from NIR spectra to measure the Water Content.
How to Do It
Select Task - Analyze - Partial Least Squares Regression and specify the following parameters in the Regression dialog:
Model inputs:
• X: NIR Spectra (55x22)
• X Rows: All
• X Cols: Spectra
• Y: Water content (55x1)
• Y Rows: All
• Y Cols: All
• Maximum number of components: 5
If not already done, check the boxes Mean center data and Identify outliers.
Go to the X weights and Y weights tabs to verify that these are all set to 1.0 (the default setting). On the Validation tab, select Cross validation.
PLS Dialog
Click OK to launch the calculations.
Click Yes when the calculations are finished, and the prompt appears to view plots now. The PLS Overview plots are displayed. A new node is also added to the project navigator with all the PLS results. This has four folders with the raw data, results, validation, and plots for the PLS model.
Interpretation of the Regression Overview
The most important PLS analysis results are given in the regression overview plot. This has the plots Scores, X and Y loadings, Explained variance, and Predicted vs Measured displayed as the default.
Task
Look at the model results.
How to do it
Study the PLS regression overview plots in the viewer.
PLS Overview Plots
The Scores plot shows that the samples are scattered in the model space with no evidence of groupings, and that the first two factors explain 92% and 8% of the variance in the data, respectively. The explained X-variance increases nicely and is close to 100% after two factors (PCs). The Predicted vs Measured plot shows a good fit. The info box in the lower left panel of the display indicates that two factors are optimal for this model.
Another very useful plot is of the regression coefficients. Activate the upper-right quadrant and right click to go to PLS-Regression coefficients - Raw coefficients (B) - Line. From the regression coefficients one can see that there is a distinct peak around 1940 nm, as expected, since this is where the water absorbance peak is located in the NIR spectrum.
Raw Regression Coefficients
Save the project. All the results and plots that have been generated will be part of the saved project.
Customizing plots and copying them into other programs
In data analysis and research work, it is critical to provide documentation of the results. Sometimes it may be necessary to transfer plots from The Unscrambler® into a word processor.
Task
Customize plots within The Unscrambler®, and transfer plots from The Unscrambler®, using Copy and Paste.
How to do it
Select the score plot in the regression overview, and right click to choose Properties, which gives options for customizing a plot.
Change the plot heading name, as well as the font used for it.
Annotations can be added to a plot by right clicking and selecting Insert Draw Item…, or from the shortcut keys on the toolbar.
When the plot has been customized, it can readily be saved or copied into another application. Right click and select Copy to copy just the highlighted plot, or Copy All to copy all four overview plots. Go to another program and place the cursor where the plot is to appear in the document. Select Edit - Paste. The plot is now inserted as a graphical object in the other document.
The plot can also be saved as a picture file, which will usually give better quality but also larger files. Highlight a plot, then right click and select Save As… to save the plot in a choice of graphics image file formats, such as EMF or PNG.
Save as options
Save PLS model file
Task
Save just the PLS model file, giving a smaller file with just the model information that can be used for predicting new samples using The Unscrambler® Online Predictor and The Unscrambler® Online.
How to do it
To do so right click on the model in the Navigator and select the option Save Result.
Save result
Rename the model as needed and click on Save.
Export ASCII-MOD file
Task
Export an ASCII-MOD file.
How to do it
Go to File - Export menu.
File - Export menu
Select ASCII-MOD to open the dialog:
ASCII-MOD Dialog
Verify that the correct model is selected, and the correct number of factors. It is possible to select two types of model:
• Full
• Regr.Coef. only: corresponding to only the regression coefficients
Take a look at the ASCII file that is generated, which has the file name extension .AMO. The format of the file is described in the ASCII-MOD Technical Reference.
Export data to ASCII file
A common file format that most programs read is the simple ASCII file. There are different ways of writing the ASCII file. Determine the format needed based on the requirements of other programs that will be used to read the ASCII files.
Task
Write the Wheat NIR Spectra data table to an ASCII file.
How to do it
Select the Wheat NIR Spectra table and select File - Export - ASCII. Use only the columns of the NIR Spectra, by choosing this column set from the drop-down list. Make sure that the item delimiter is comma, as suggested in the Export ASCII dialog.
Export ASCII Dialog
Provide a file name and location when prompted. Open the file in an ASCII editor and look at it. All names are enclosed in double quotes.
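Writing a comparable comma-separated file, with all names enclosed in double quotes, can be sketched with pandas (the table contents here are made up):

```python
import csv
import io
import pandas as pd

spectra = pd.DataFrame([[0.41, 0.92], [0.39, 0.88]],
                       index=["wheat1", "wheat2"],
                       columns=["1680", "1940"])   # made-up values

buf = io.StringIO()
# QUOTE_NONNUMERIC puts double quotes around the sample and variable names
spectra.to_csv(buf, quoting=csv.QUOTE_NONNUMERIC)
print(buf.getvalue())
```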
Tutorial G: Mixture design
• Description
 o What you will learn
 o Data table
• Design variables and responses
• Build a simplex centroid design
• Import response values from Excel
• Check response variations with statistics
• Model the mixture response surface
• Conclusions
Description
This application, inspired by an example in John A. Cornell’s reference book “Experiments With Mixtures”, illustrates the basic principles and specific features of mixture designs.
A fruit punch is to be prepared by blending three types of fruit juice:
• watermelon,
• pineapple, and
• orange.
The purpose of the manufacturer is to use their large supplies of watermelons by introducing watermelon juice, of little value by itself, into a blend of fruit juices. Therefore, the fruit punch has to contain a substantial amount of watermelon - at least 30% of the total. Pineapple and orange have been selected as the other components of the mixture, since juices from these fruits are easy to get and relatively inexpensive.
The manufacturer decides to use experimental design to find out which combination of those three ingredients maximizes consumer acceptance of the taste of the punch.
What you will learn
This tutorial contains the following parts:
• Build a suitable design for a mixture optimization;
• Import response values from Excel;
• Check response variations with Statistics;
• Analyze the results with PLS and Martens’ Uncertainty Test.
References:
• Mixture designs
• Data import from a spreadsheet
• Descriptive statistics
• Analysis of mixture design results
• Martens’ Uncertainty Test
Data table
The data in this exercise consist of two parts:
1. The design table, which will be created in the tutorial.
2. Measured responses: sensory data (acceptance, sweetness, bitterness, fruitiness of the juice) as well as an economic factor, the cost of production.

We begin by setting up the design in The Unscrambler®, and then will import the response variables from a separate table.
Design variables and responses
The ranges of variation selected for the experiment are as follows:
Ranges of variation for the fruit punch design
Ingredient Low High
Watermelon 30% 100%
Pineapple 0% 70%
Orange 0% 70%
This defines a simplex.
The responses of interest for the manufacturer are detailed in the table below.
Responses for the fruit punch design
Variable Type of Measurement Target
Consumer acceptance Average of 63 individual ratings on a 0-5 scale Maximum
Production cost Computed from mixture composition and raw material cost Minimum
Sweetness Average ratings by sensory panel on a 0-9 scale Descriptive only
Bitterness Average ratings by sensory panel on a 0-9 scale Descriptive only
Fruitiness Average ratings by sensory panel on a 0-9 scale Descriptive only
Consumer acceptance is the most important response, but if the analysis of the results should reveal two areas with equally high consumer acceptance, the mixture with lower production cost will be preferred. The sensory descriptors provide an explanation of the consumer acceptance based on some properties, and provide directions for further improvement (for instance by adding sugar or sweetener if the consumers seem to prefer sweeter mixtures).
Build a simplex centroid design
Since there are only three design variables, it is possible to build an optimization design right away. For a mixture, the most suitable optimization design is a simplex centroid design.
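The point pattern of a simplex centroid design can be sketched directly: for q mixture components it consists of the pure vertices, the equal binary blends, and the overall centroid (an augmented design adds interior axial points as well). The fragment below illustrates the pattern in 0-1 pseudocomponent coordinates; it is an illustration, not The Unscrambler®'s generator.

```python
from itertools import combinations

def simplex_centroid(q):
    """Simplex centroid points for q mixture components (proportions sum to 1)."""
    points = []
    for k in range(1, q + 1):                 # blends of k components
        for subset in combinations(range(q), k):
            p = [0.0] * q
            for i in subset:
                p[i] = 1.0 / k                # equal shares among the subset
            points.append(tuple(p))
    return points

design = simplex_centroid(3)
for p in design:
    print(p)
# 3 vertices + 3 binary midpoints + 1 overall centroid = 7 points
```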
Task
Build a simplex centroid design with the help of the design experiment wizard, Insert – Create design….
How to do it
Use Insert – Create design… to start the Design Experiment Wizard. The first tab is the Start tab, where you enter the name of the design and the goal of the experimentation. It is also possible to add additional information in the description field. Enter “Punch” as a name for the design and select Optimization as the goal.
Start tab for the Punch experiment
Go to the next tab: Define variables. Specify the variables as shown in the table hereafter:
Variables to define
ID Name Analysis Type Constraints Type of levels Level range
A Watermelon Design Mixture Continuous 30-100
B Pineapple Design Mixture Continuous 0-70
C Orange Design Mixture Continuous 0-70
1 Acceptance Response - - -
2 Cost Response - - -
3 Sweet Response - - -
4 Bitter Response - - -
5 Fruity Response - - -
Do this by clicking the Add button and editing the Variable editor including the level range for the design variables. Validate by clicking OK and enter the next variable by clicking Add again.
Variables involved in the design
Go to the next tab: Choose the Design.
There is already a type of design that has been selected: Mixture design. Validate this choice by going to the next tab.
Choose the design for the Punch experiments
Go to the next section: Design Details
Look at the Description table. The design needed is the Simplex centroid, as it is the only design suitable for optimization. To better cover the design space, add some more experiments by ticking the option Augmented design.
Design details: Simplex centroid
Go to the next tab: Additional Experiments. There is no need to replicate the design samples so the Number of replications is kept at its default value: “1”.
By default there are “3” center samples. This is enough for the purpose of this experiment.
Additional experiments tab
There is no need to add reference samples, so just proceed to the next tab, Randomization. No further adjustments are needed in this tab, but try the different settings to get familiar with the options.
Randomization tab
In the Summary tab, the table on the right presents a summary of the information on the design.
On the left part of the tab you can calculate the power of the design if you know two types of information for the responses:
• the standard deviation of the response variables: Std.dev. • the minimum difference to be detected: Delta
Summary tab
Go to the final tab, Design Table. Here the data table is presented with several view options; check them out to familiarize yourself with the options.
Design table tab for the punch experiments
Once all necessary checks and corrections have been made, click the Finish button.
Now the data tables appear in the Navigator. There are two tables: one for the design variables, and another for the responses. The response table is empty until you fill in the values, which you will do later. The design matrix is already organized into row and column sets according to the types of samples (design, center, etc.) and effects.
The design tables in the Navigator
It is possible to view the data in different ways:
• To change the order from the standard sample sequence to the experiment sample sequence click on column randomized, and select Edit - Sort - Descending.
• To change from the actual values to the level values click on the table and then View - Level indices.
Save the new project with File - Save and specify a name such as “Punch Optimization”.
Import response values from Excel
The responses for all samples are stored in an Excel spreadsheet. These can be imported directly as a separate matrix which can then be copied into the Punch_response matrix.
Task
Open the Excel table that has the response values and copy them into the response data table.
How to do it
Go to File - Import Data - Excel…, select the Excel file “Tutorial_G.xls” and click Open.
Then in the Excel Preview window, select the sheet “Responses”, and select the 5 responses:
• Accept
• Cost
• Sweet
• Bitter
• Fruity
Excel Preview
Click on OK.
Imported response data
Look at the order: it is very important that the tables “Punch_Responses” and “Tutorial_G” match in their sample order. “Punch_Responses” should be in the standard order.
Select all the data in “Tutorial_G” and copy them using right click and the option Copy, or the shortcut Ctrl+C. Then paste them into the “Punch_Responses” table: place the cursor in the first cell and use right click and the option Paste, or the shortcut Ctrl+V.
Check response variations with statistics
Run a first analysis – Statistics, and interpret the results with the following questions in mind:
• How much does each response vary?
• Is there more variation over the whole design than over the replicated Center samples?
• Is there any response value outside the expected range?
Task
Run Statistics, display the results as plots, check response variations and look for abnormal values.
How to do it
With the Punch_Response data table displayed in the Editor, select Task - Analyze - Descriptive Statistics.
Choose the following settings in the Statistics dialog:
• Data Matrix: Punch_Response (12x5)
• Data Row: All
• Data Cols: All
• Compute correlation matrix: ticked
then click OK to start the computations.
Descriptive statistics dialog box
Click Yes to view the results. The Statistics results are displayed as two plots. The upper plot is Quantiles, the lower Mean and SDev.
Let us have a look at the upper plot: Quantiles.
Right click on the plot and select View - Numerical View to display the min, max, median, Q1 and Q3 for each response. Check that the ranges of variation are within the expected range for that response (0-5 for Acceptance, 0-3 for Cost and 0-9 for the sensory responses on flavor).
Now display the same two plots for design samples and center samples, in order to compare variation over the whole design to variation over the replicated Center samples. If the experiments have been performed correctly, there should be much more variation among design points than among the three replicates of the Center sample.
Return to the graphical view (View - Graphical view).
Right click on the plot and select Sample Grouping. A dialog box opens. Select the sets Center samples and All design samples from the matrix Punch_Design.
Sample grouping and marking for the statistics
Note: It is possible to edit the color of the bars.
Click OK.
To display the legend, right click on the plot and select Properties. Go to legend and tick Visible.
Properties - Legend
Click OK.
Quantiles plot with sample grouping
The quantiles plot is now displayed for three groups. The bars or boxes for all samples appear in blue, for design samples in red, and for center samples in green (unless a different color scheme has been designated under Properties). On the quantiles plot, one can see that there is much more variation among design points than among the center samples. This also appears clearly on the Mean and SDev plot when the sample grouping is added. For instance, if you click successively on the blue and red bars for the variable Acceptance, you will see that the SDev is 0.75 for design samples and only 0.25 for center samples.
Conclusions
The ranges of variation of the 5 responses are as expected.
There is no abnormal value for any response.
There is much more variation over the whole design than among the center samples, which suggests that the experiments were performed correctly.
Model the mixture response surface
The next step after checking the data is to model the responses. Here we want to study the quantitative relationships between fruit punch composition and consumer acceptance, production cost and the sensory properties of the mixtures.
There are two ways of analyzing the data: either each response variable individually with the Scheffé formula or the responses as a whole with PLS regression.
In both cases, the results will be interpreted by plotting a Response Surface for each response variable.
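For three components the Scheffé special cubic model has the form y = b1·x1 + b2·x2 + b3·x3 + b12·x1x2 + b13·x1x3 + b23·x2x3 + b123·x1x2x3, with no intercept because the proportions sum to 1. A least-squares fit of this model can be sketched as follows, using made-up design points and response values:

```python
import numpy as np

def scheffe_special_cubic(X):
    """Expand 3-component mixture proportions into special cubic model terms."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([x1, x2, x3,
                            x1 * x2, x1 * x3, x2 * x3,
                            x1 * x2 * x3])

# Hypothetical simplex centroid points (proportions sum to 1) and a response
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [.5, .5, 0], [.5, 0, .5], [0, .5, .5],
              [1/3, 1/3, 1/3]])
y = np.array([3.0, 4.2, 3.8, 4.5, 3.6, 4.4, 4.6])

M = scheffe_special_cubic(X)
b, *_ = np.linalg.lstsq(M, y, rcond=None)   # no intercept term
print(b)                                     # b1..b3, b12, b13, b23, b123
```

With seven design points and seven coefficients the fit is exact, which is why the simplex centroid design is saturated for the special cubic model; replicated center samples are what provide the error estimate in practice.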
Task
Analyze the design with a response analysis using a Scheffé model, then view and interpret the results.
How to do it
Highlight the data table Punch_Design and run Tasks - Analyze - Analyze design matrix…. Make the following choices in the Design Analysis dialog:
Method Classical
Model inputs • Predictors • Matrix: “Punch_Design (12x13)” • Rows: All • Cols: All • Model: Special cubic • Responses • Matrix: “Punch_Responses (12x5)” • Rows: All • Cols: All
Design Analysis
Click OK, then Yes to have a look at the plots when the computation is complete.
Diagnosing the model
ANOVA results
The first result diagnosis is the ANOVA table, in the upper left quadrant of the overview.
The first ANOVA table is for the response variable “Accept”.
ANOVA Punch
Locate the R-square and notice that the value is rather good: 0.93.
Look then at the p-value for the model: 0.0080. This is very good: it indicates that the model explains significantly more variation than can be attributed to noise in the data.
Look at the individual variables to conclude on the dimensionality of the model.
All the variables have a significant effect with p-values below 5%.
The model is therefore special cubic.
View the results for the other responses by using the drop-down menu or the arrows in the menu bar.
For the next result, “Cost”, one can see from the p-values that the model may not be cubic.
“Sweet” is also very well predicted. The only response that is not well modeled is bitterness.
Diagnostics
Look at the diagnostics table and examine the residuals. Notice that the center samples show high residuals.
Diagnostics for response “Accept”
Effect summary
Look at the effect summary.
Notice that for the first response the most important effects are of second or third order, whereas for “Cost” it is mostly the linear effects.
Effect summary
Response surface
Go to the predefined response surface plot in the navigator.
Response surface for acceptance
Try to locate the optimal values for the acceptance.
Do the same for the cost.
To do so, change the response variable to be plotted: untick the “Accept” variable and tick the “Cost” variable.
Response surface for cost
Conclusions
The response surface plots show maximum consumer acceptance for a fruit punch with about 39% Watermelon, 16% Pineapple and 45% Orange.
Tutorial H: PLS Discriminant Analysis (PLS-DA)
PLS-DA is the use of PLS regression for discrimination or classification purposes. In The Unscrambler®, PLS-DA is not listed as a separate method. This tutorial explains how to do it.
• Description o Running a PLS Discriminant Analysis o What you will learn o Data table
• Build PLS regression model • Classify unknown samples • Some general comments on classification
Description
PLS Discriminant Analysis (PLS-DA) is a classification method based on modeling the differences between several classes with PLS. If there are only two classes to separate, the PLS model uses one response variable, which codes for class membership as follows: -1 for members of one class, +1 for members of the other.
If there are three classes or more, the model uses one response variable (-1/+1 or 0/1, which is equivalent) coding for each class. There are then several Y-variables in the model.
In this tutorial we will analyze the chemical composition of spear heads excavated in the African desert. 19 samples known to belong to two tribes (classes A and B) are used for building a discriminant model, while seven new samples of unknown origin make up a test set to be classified.
The X-variables are 10 chemical elements characterizing the composition of the spear heads. The 19 training samples are divided into 10 from class A and 9 from class B. The normal way to make dummy variables for classes is to assign 1 if the sample belongs to the class and 0 if not. A small trick to obtain a decision line at 0, rather than 0.5, in the predicted versus measured plot is to use the values -1 and +1 instead, which makes the visualization easier.
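As a minimal sketch of this coding trick (the labels below are hypothetical stand-ins for the training set):

```python
import numpy as np

# Hypothetical class labels for the 19 training spear heads
labels = np.array(['A'] * 10 + ['B'] * 9)

# -1/+1 coding puts the natural decision line at 0 in the
# predicted versus measured plot (0/1 coding would put it at 0.5)
y = np.where(labels == 'A', 1.0, -1.0)
print(y.sum())  # 1.0: one more A sample than B samples
```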
Running a PLS Discriminant Analysis
When a data table is displayed in the viewer, one may access the Tasks menu to run a Regression (and later on a Prediction).
In order to run a PLS Discriminant Analysis (PLS-DA), one should first prepare the data table in the following way:
Insert or append a categorical variable in the data table. This categorical variable should have as many levels as there are classes in the data set. The easiest way to do this is to define one row set for each class, then build the sample sets based on the categorical variable (this is an option in the Define range dialog). The categorical variable will allow one to use sample grouping on plots, so that each class appears with a different color.
Use the function Edit - Split category variable to convert the categorical variable into indicator variables. These will be the Y-variables in the PLS model and are created as new columns in the data table. Then create a column set containing only the indicator variables, as these are the responses that will be used in the regression.
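A sketch of what splitting a category variable into indicators produces, on a hypothetical category column:

```python
import numpy as np

# Hypothetical categorical variable with three classes
cat = np.array(['A', 'B', 'C', 'A', 'B'])

# One indicator (dummy) column per class; together these
# form the Y block used in the PLS-DA regression
classes = sorted(set(cat.tolist()))
Y = np.column_stack([(cat == c).astype(float) for c in classes])
print(Y.shape)  # (5, 3); each row has exactly one 1
```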
What you will learn
This tutorial contains the following parts:
• Run a PLS regression • Interpret the model • Save the model • Classify new samples
References:
• Basic principles in using The Unscrambler® • Principles of Regression • Classification • Prediction
Data table
Click the following link to import the Tutorial H data set used in this tutorial. The data have already been organized for you into row sets, and with the class variable, as well as the indicators for the classes.
Tutorial H data
Build PLS regression model
Task
Run a PLS regression on the data.
How to do it
Click Tasks - Analyze - Partial Least Squares Regression to run a PLS regression and choose the following settings:
PLS Regression Dialog
Model inputs • Predictors: X: Tutorial H, Rows: Training, Cols: X • Responses: Y: Tutorial H, Rows: Training, Cols: Class num • Maximum components: 5 • Mean center data: Enable tick box
X Weights 1/SDev
Y Weights 1/SDev
Validation Full cross-validation
Set the weights on the X-weights and Y-weights tabs. Select all the variables, select the radio button A/(SDev+B), and click update. Do this for both the X and Y weights.
X weights dialog
To set the validation method, go to the Validation tab in the PLS Regression dialog. Select Cross validation, then click Setup… and select Full from the cross validation method drop-down list.
Cross Validation Dialog
After the computations are finished the default PLS regression plots will be shown. The score plot shows the separation of the two classes.
Score plot
For better visualization of the classes you may use the sample grouping option. Right click in the score plot and select Sample Grouping from the menu.
In the Sample grouping dialog, select the row sets “A” and “B” for visualization. You can double-click in the small boxes showing the colors to change to your preference. The same goes for the symbols, and their size.
Sample Grouping Dialog
The score plot shows that the two classes are well separated in the two first factors.
Score plot with grouping
Thus, a discrimination line may be inserted in the plot with the line drawing tool in The Unscrambler®.
Study the explained variance plot for Y shown in the lower-left quadrant. If need be, switch it to the view for Y by using the X-Y button. The explained variance plot for Y shows around 98% explained calibration variance and 94% explained validation variance for 2 factors. The red validation curve indicates that two factors is the optimal number, as there is only a small increase in explained variance after factor two.
Note: Explained variance or RMSE is not the main figure of merit for PLS-DA, however.
Variance plot
To interpret which variables are important for the classification, the loading weights plot is the one to look into. It is given in the upper-right quadrant.
In this case the loadings express the same information as the loading weights, and since correlation loadings show the explained variance directly, this is the preferred view. Make the loadings plot active, and change it to the Correlation loadings view by selecting the correlation loadings shortcut.
In the correlation loadings plot for factors one and two we see that Ba, Zr and Sr are the variables that separate the two classes, as is Ti, although with a slightly lower discrimination ability. These are the variables closest to the response variable class, lying between the 50% and 100% explained variance circles. The remaining elements mostly model the variance within the classes.
Correlation Loadings Plot
The regression vector is a summary of the important variables, in this case representing the loading weights plot after 2 factors. In the project navigator, select the plot Regression Coefficients, and change it to a bar chart by using the toolbar shortcut.
Weighted Regression Coefficients
The magnitude of the regression coefficients is an indication of how important those variables are for modeling the response, here class.
The predicted versus measured plot, in the lower-right quadrant, shows how close to the ideal values -1 and 1 the predicted values are.
Predicted versus Measured Plot
Note that the blue points are from calibration, where the samples are merely put back into the same model they were part of. The red points are from cross validation, which is more conservative, as each sample was not part of the model when it was predicted. You can toggle on/off the regression line, trend line, and statistics for the plot using the shortcut.
Recall that “prediction” in this context does not mean that the model has been tested by predicting a real test set. In this case all samples are correctly classified for the cross validation.
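The full-cross-validated classification can be mimicked numerically. The sketch below uses a minimal NIPALS PLS1 on made-up, well-separated data in place of the real 19 x 10 composition table; it illustrates the leave-one-out logic, not The Unscrambler's implementation.

```python
import numpy as np

def pls1_fit(X, y, ncomp):
    """Minimal NIPALS PLS1; returns the regression vector and the means."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    W, P, Q = [], [], []
    for _ in range(ncomp):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)        # loading weight
        t = Xc @ w                       # score
        p = Xc.T @ t / (t @ t)           # X loading
        q = (yc @ t) / (t @ t)           # Y loading
        Xc, yc = Xc - np.outer(t, p), yc - q * t   # deflate
        W.append(w); P.append(p); Q.append(q)
    W, P = np.array(W).T, np.array(P).T
    b = W @ np.linalg.solve(P.T @ W, np.array(Q))
    return b, xm, ym

# Made-up, well-separated stand-in for the 19 x 10 composition table
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, size=(10, 10)),    # class A
               rng.normal(-2, 1, size=(9, 10))])   # class B
y = np.array([1.0] * 10 + [-1.0] * 9)

# "Full" (leave-one-out) cross validation by hand
correct = 0
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    b, xm, ym = pls1_fit(X[keep], y[keep], ncomp=2)
    pred = (X[i] - xm) @ b + ym
    correct += (pred > 0) == (y[i] > 0)
print(correct / len(y))
```

Each sample is predicted by a model it did not help build, and a predicted value above 0 assigns it to class A.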
To investigate how the model will behave on unknown samples, the next section will show how to predict unknown sample class.
It is a good idea to save your work so far. The project will include all the data, as well as all the results generated thus far. Use File – Save… to save the project.
Classify unknown samples
Assign the unknown samples to the known classes by predicting (classifying) with the PLS regression model.
Task
Assign the Sample Set Test to the classes A or B.
How to do it
Select Tasks - Predict - Regression….
Tasks - Predict - Regression…
Use the following parameters:
Components The number of factors (components) to use is two.
Data • Matrix: Tutorial H • Rows: Test • Cols: X
Prediction • Full Prediction • Inlier limit • Sample Inlier dist • Identify Outliers
Prediction Dialog
Click OK.
The predicted values are shown in the main plot of predicted values with estimated uncertainties.
All F samples have predicted values close to -1, classifying them as belonging to class “B”. E sample 2 has a predicted value around 1, which assigns it to class “A”. As for E samples 1, 3 and 4, their predictions are close to 0 and have high uncertainties. These samples cannot be said to belong to either class, because the estimated deviation (uncertainty) around the predicted value includes 0 in the plot.
Predicted values and deviation
A small trick to present the results more visually is to do Tasks - Predict - Projection and select the PLS model from above. In the score plot you see that all F samples lie in the “B” class and E samples 2 and 3 probably belong to class “A”, as discussed above. The position of test samples 1 and 4 shows that they are in fact closer to class “A”, as the predicted values also indicate.
Note: Try to analyze the same data by doing PCA on the two groups and then select Tasks - Predict - Classification - SIMCA and compare results with the PLS-DA.
To check whether the prediction can be trusted, study the Inlier vs Hotelling T² plot, available from a right click on the plot and then Prediction - Inlier/Hotelling T² - Inliers vs Hotelling T².
Prediction - Inlier/Hotelling T² - Inliers vs Hotelling T² menu
For a prediction to be trusted, the predicted sample must not be too far from the calibration samples; this is checked by the Inlier distance. The projection of the sample onto the model should also not be too far from the center; this is checked with the Hotelling T² distance.
Inliers vs Hotelling T²
In this case the samples are found to be widely spread in the plot. If samples fall outside the limit lines, the prediction cannot be trusted.
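The Hotelling T² part of this check can be sketched directly from the scores; the values below are made up, and the per-factor form shown is a simplified version of the statistic.

```python
import numpy as np

# Hypothetical calibration scores (19 samples, 2 factors) and the
# projected scores of one new sample
rng = np.random.default_rng(5)
T = rng.normal(size=(19, 2))
t_new = np.array([0.5, -0.3])

# Hotelling T2: squared distance of the projection from the model
# centre, scaled by the variance of each factor's calibration scores
t2 = np.sum((t_new - T.mean(axis=0)) ** 2 / T.var(axis=0, ddof=1))
print(t2 > 0)  # True; in practice, compare t2 against a critical limit
```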
Some general comments on classification
LDA is the basic method that is typically taught in introductory classification courses and is available as a reference method for comparison with other classification methods such as SIMCA. Remember that LDA has the same issue with collinearity as MLR, and that more samples than variables are required in each class. Using PLS regression for classification, as shown here with PLS-DA, can give very good results in discriminating between classes. In this context it may also be useful to apply the uncertainty test after deciding on the model dimensionality and remove the non-relevant variables. This can in some cases give both simpler visualization and better model performance. However, PLS-DA does not take into account the within-class variability, and predicted values around 0 (assuming -1 and 1 are used as levels for the classes) are difficult to assign. One alternative procedure is to use the scores from the PLS-DA in an LDA to obtain a more “statistical” result. As the score vectors are orthogonal, there is no problem with collinearity in this case.
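The suggested two-step procedure, PLS-DA scores followed by LDA, can be sketched with a hand-rolled two-class Fisher discriminant; the scores below are invented and already well separated.

```python
import numpy as np

# Hypothetical PLS-DA scores for two classes; the factors are orthogonal,
# so there is no collinearity problem for the LDA step
rng = np.random.default_rng(7)
TA = rng.normal([2.0, 0.0], 1.0, size=(10, 2))   # class A scores
TB = rng.normal([-2.0, 0.0], 1.0, size=(9, 2))   # class B scores

# Two-class Fisher LDA: pooled within-class covariance, then the
# discriminant direction w = Sw^-1 (mean_A - mean_B)
Sw = (np.cov(TA.T) * (len(TA) - 1) + np.cov(TB.T) * (len(TB) - 1)) \
     / (len(TA) + len(TB) - 2)
w = np.linalg.solve(Sw, TA.mean(axis=0) - TB.mean(axis=0))
threshold = w @ (TA.mean(axis=0) + TB.mean(axis=0)) / 2

# Projections above the midpoint threshold are assigned to class A
pred_A = np.concatenate([TA, TB]) @ w > threshold
print(int(pred_A[:10].sum()), int(pred_A[10:].sum()))
```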
Using local PCA models, an approach which for historical reasons has been given the name “SIMCA”, is a good choice because it also gives the possibility to assign new samples to none of the existing classes. However, as the individual PCA models have no objective of discriminating between the classes, one does not know whether the variance modeled is optimal for this purpose. The Modeling and Discrimination Power diagnostics are helpful in this context. One useful procedure is to first do PLS-DA and select the “best” set of variables for discrimination, then use these together with the most important variables in the individual PCA models to obtain a variable set that models both the within-class and between-class variability.
SVM is a powerful method which can handle nonlinearities, and very good results have been reported in the literature. However, it is not as transparent as PCA and PLS, and the values of its input parameters must be chosen by cross validation to ensure a robust model.
As for all methods, the proof of the method lies in the classification of a large independent test set with known reference.
Tutorial I: Multivariate curve resolution (MCR) of dye mixtures
• Description o What you will learn o Data table
• Data plotting • Run MCR with default options • Plot MCR results • Interpret MCR results • Run MCR with initial guess • Validate the estimated results with reference information • View an MCR result matrix
Description
Multivariate Curve Resolution (MCR) attempts to recover the response profiles (spectra, pH profiles, time profiles, elution profiles, etc.) of the components in an unresolved mixture of two or more components. This is especially useful for mixtures obtained in evolutionary processes and when no prior information is available about the nature and composition of these mixtures.
The Unscrambler® MCR algorithm is based on pure variable selection from PCA loadings to find the initial estimate of the spectral profiles, and then Alternating Least Squares (ALS) to optimize the resolved spectral and concentration profiles.
The algorithm can apply a constraint of Non-negativity in either spectral or concentration profiles or both.
It can also apply a constraint of Unimodality in concentration profiles that have only one maximum, and/or a constraint of Closure in concentration profiles where the sum of the mixture constituents is constant.
The Unscrambler® MCR functionality does not require any initial guess input. A mixture data set suitable for MCR analysis should have at least four samples and four variables. If no initial guess is used, the maximum number of variables is 5000.
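The core alternating least squares idea can be sketched in a few lines. The example below factors made-up non-negative data D = C Sᵀ, refining a noisy initial guess of the spectra; non-negativity is imposed by simple clipping, a crude stand-in for the proper non-negative least squares step used in real MCR-ALS implementations.

```python
import numpy as np

# Made-up mixture data D = C @ S.T: 20 mixtures of 2 pure components
# measured at 50 wavelengths; all profiles are non-negative
rng = np.random.default_rng(3)
S_true = np.abs(rng.normal(size=(50, 2)))      # pure spectra (columns)
C_true = np.abs(rng.normal(size=(20, 2)))      # concentration profiles
D = C_true @ S_true.T

# Stand-in for the pure-variable initial spectral estimate:
# the true spectra plus a little noise, kept non-negative
S = np.clip(S_true + rng.normal(0, 0.1, size=(50, 2)), 0, None)

# Alternating least squares; each step solves a linear least squares
# problem for one factor while holding the other fixed, then clips
for _ in range(200):
    C = np.clip(D @ S @ np.linalg.pinv(S.T @ S), 0, None)
    S = np.clip(D.T @ C @ np.linalg.pinv(C.T @ C), 0, None)

rel_resid = np.linalg.norm(D - C @ S.T) / np.linalg.norm(D)
print(rel_resid < 0.01)  # True: the refined profiles reproduce D almost exactly
```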
In this tutorial we will utilize UV-Vis spectra of dye mixtures to extract pure dye spectra and their relative concentrations. The data are from the Institute of Applied Research (Prof. W. Kessler), Reutlingen University, Germany.
What you will learn
This tutorial contains the following parts:
• Run a basic MCR analysis • Plot MCR results • Interpret MCR results • Run an MCR analysis with initial guess • Validate MCR results with reference information • View the MCR result matrix and convert estimated concentrations into real scale.
References:
• Basic principles in using The Unscrambler® • What is MCR? • Interpreting MCR Plots
Data table
Click the following link to import the Tutorial I data set used in this tutorial.
Organizing the data table
The samples consist of 39 spectra of dye mixture samples. Samples 1 to 3 are pure dyes of blue, green and orange, respectively. Samples 4 to 39 are 36 mixture samples of those 3 dyes at known concentrations. The X-variables are the UV-Vis spectra measured over the range 250-800 nm with a step of 10 nm. We will begin by organizing the data for the analysis into row (sample) and column (variable) sets. The column sets have already been defined for you, and are found in the folder Column in the project navigator. There are 5 column sets for the different variables of interest in the analysis, including the concentrations of the three dyes and two overlapping spectral ranges.
We begin by defining the row sets for these data. Select the entire first row in the data table, Blue_50, and go to Edit – Define Range… to open the Define Range dialog box. In the dialog, enter the name “Blue” in the Range row box and click OK.
Define Range Dialog
From the data table, select the sample Green_50, and go to Edit - Define Range to make this row set Green. Do the same for the sample Orange_50, and then for samples 4 to 39, giving that row set the name Mixture. Additionally, create the row set Original by selecting samples 1 to 3 (the three pure dyes) and following the same procedure, Edit - Define Range.
The first three columns are concentration measurements of the blue, green and orange dyes. Columns 4 to 59 are the UV-Vis spectra measured over the range 250-800 nm with a step of 10 nm. In the project navigator, expand the node Column to see the list of existing column sets. The organized data will look like this in the navigator and viewer, with color-coding for the defined sets.
Navigator view of organized data
Data plotting
Before starting any analysis, it is a good idea to have a look at the data. We want to make a line plot of the spectra of all mixture samples together. Go to the original data table and highlight it in the navigator.
Use Plot - Line, which will open the Line plot dialog where the row set Mixture can be selected from the drop-down list, and for Cols, the set 250-800nm. This will give an overlay plot of the spectra.
Line plot of mixture spectra
We will now plot the reference spectra of the three pure components. Go to Plot - Line… and in the dialog select the row set “Original” and, for Cols, the set 250-800nm.
Line plot dialog
This will result in the following plot, where we can see that the maximum absorbance of each dye is at a different wavelength. It is these component spectra that we expect to be able to extract through the MCR analysis of the data in this tutorial.
Line plot of pure dyes
To plot the reference concentrations of the three dyes, select columns 1-3 and make a Line plot of Sample set “Mixture” by right clicking and selecting Plot – Line.
Line plot of sample concentrations
Note: Reference measurements of spectra and concentrations of pure components are not necessary to make your data set suitable for MCR!
Run MCR with default options
Task
Set up the options for an MCR analysis, launch the calculations and plot results.
How to do it
When data set “Tutorial_I” is active on screen, click Tasks - Analyze - Multivariate Curve Resolution…. The MCR dialog box with default settings will open up. Select Mixture (36) under the Rows tab, and 250-800nm (56) under the Columns tab. We will not use an initial guess.
Keep all other settings as default on the Options tab, then click OK. After the calculation is done, click Yes to view the plots.
MCR Dialog
When the MCR calculation is completed, a new node, named MCR, is added to the project navigator and the MCR overview plots are displayed in the viewer. The MCR results overview includes four plots, from upper-left to lower-right: Component Concentrations, Component Spectra, Sample Residuals and Total Residuals. The overview plots are displayed at the optimum number of pure components, which the system estimates to be 3 in this case. The optimal number of components (3) is displayed on the toolbar. A summary of the analysis results is given in the Info tab in the lower-left corner of the display, which also states the optimal number of pure components.
MCR Info Box
MCR Overview plots
The MCR model results are all together in the new node in the project navigator named MCR. Rename the MCR model in the project navigator by highlighting the MCR node, right clicking and choosing Rename. Rename your first MCR model as MCR Original.
Plot MCR results
Task
Plot MCR results for various numbers of pure components.
How to do it
The Unscrambler® MCR procedure actually generates several sets of results, covering numbers of estimated pure components from 2 to the optimum + 1. By default, the results are plotted for the optimal number of components.
You may view the results for varying numbers of pure components. Let us plot the spectral profiles for a 2-component solution. Click the shortcut to select Component Number 2.
The plot of (estimated) component spectra for a resolution with two pure components is displayed.
In a similar manner, click on the right arrow shortcut to plot the 4-component solution.
MCR fitting and PCA fitting results are also available for numbers of pure components from 2 to the optimum + 1. Each fitting includes Variable Residuals, Sample Residuals and Total Residuals plots, stored in result matrices in the MCR node of the project navigator. The user can plot these results by selecting the respective matrices, or by selecting the plot from the plots node of the project navigator. The plot of Total Residuals for MCR fitting is shown by default in the lower-right subframe. Like any other plot, it can also be accessed from the Plot menu. Change the plot in the lower-left subframe to variable residuals by clicking in it to activate it, then clicking MCR - Variable Residuals to display this plot in place of the sample residuals plot.
Variable residuals plot
Interpret MCR results
Task
Determine the optimum number of pure components.
How to do it
In the Total Residuals plot, residuals are high for 2 components and close to zero for 3 and 4 components. Change the appearance of the lower-right plot of the Total Residuals from a curve to bars, using the toolbar icon.
Total residuals bar plot
This suggests that 3 components is the optimum solution.
Click and activate the Component Spectra plot with 3 components in the upper-right quadrant. The toolbar contains a set of arrows, which are used to navigate between results at different numbers of components. Use the arrows to increase and decrease the number of components, and watch the impact on the spectral profiles.
Run MCR with initial guess
Task
Run the MCR calculation again, this time using an Initial Guess.
How to do it
If prior knowledge such as spectra of pure components or concentrations of mixture samples exists, this information may be included in the MCR calculation to help the algorithm converge towards the right solution of curve resolution.
Go back to the data table Tutorial_I by selecting the tab at the bottom of the viewer. Go to Tasks - Analyze - Multivariate Curve Resolution…. The MCR dialog box with default settings will open up. Select the same data as before, then check the box Use initial guess and select the option Pure spectra.
MCR dialog with initial guess
Select Row Set Original as the initial guess for the spectra, making sure to use the same column set for the analysis data and the initial guess. Then click OK to launch the calculations. When asked if you want to view the plots now, select Yes.
Rename the new MCR results node in the project navigator as MCR Initial Guess.
Notes:
1. When using the initial guess option, The Unscrambler® requires all pure components to be included as initial guess inputs. Partial reference will generate erroneous results. It is recommended to run MCR without initial guess if only partial reference is available.
2. The Unscrambler® can be run with either spectra or concentration of pure components as an initial guess input.
Validate the estimated results with reference information
Task
We are going to compare the model’s Estimated Concentrations for a 3-component solution to the existing reference concentrations found in the data table and plotted earlier. In a first step we are going to compare the concentration profiles visually.
How to do it
Select the Component Concentrations plot, shown in the upper-left quadrant of the MCR Overview. Compare this with the three concentrations in the original data table that were previously plotted as a line plot of the concentrations in the mixture data. Look at both profiles. To make them both visible in the viewer, select the line plot you have made, and on the navigator tab right click and choose Pop out, giving an undocked plot that can be docked wherever you wish for ease of viewing.
You can observe that the first estimated concentration profile is similar to the reference profile of the blue dye (blue curves on the plots), the second estimated concentration profile is similar to the reference profile of the green dye, and the third estimated concentration profile is very close to the reference concentration of the orange dye (green curves on the plots).
Caution: Estimated concentrations are relative values within each individual component. The estimated concentrations of a sample are not its real composition.
The estimated spectral profiles can be compared to the reference spectral profiles in the same way as for the concentrations. Because we used the spectra as initial guess inputs in this example, the comparison shows a perfect match. However, estimated spectra are unit-vector normalized; they are not the “real” spectral profile of the samples.
Plots of the Pure and Estimated Spectra
View an MCR result matrix
Tasks
Plot the MCR result matrix of estimated concentrations, compare the estimated concentrations to the reference concentrations in 2-D scatter plots by combining them into a single matrix, and convert the estimated concentrations into real scale.
How to do it
Open the project Tutorial_I and expand the Results folder in the project navigator for the model file MCR Initial Guess. The plot of the component concentrations is given in the upper-left quadrant of the MCR Overview plot. Select the Component concentrations matrix and make a duplicate of it by selecting it and going to Insert - Duplicate Matrix.
Insert Duplicate Matrix
Rename the duplicate matrix, named Component concentrations, that has been added to the bottom of the project navigator as Concentrations comparison.
With the cursor in the data matrix, go to Edit - Append and choose to add 3 columns to this matrix. Go to table Tutorial_I, select the first three columns (blue, green and orange) from rows 4-39. Copy them and paste them into the empty columns of the Concentrations comparison matrix, and name columns 4-6 blue, green, and orange, respectively. We now have a table of six columns, containing the three estimated concentrations of the pure dyes followed by the three measured concentrations.
New Data Matrix with Estimated and Real Concentrations
Select columns “Blue” and “1” (press the Ctrl key on your keyboard to select several columns at a time). Click Plot - Scatter to display a 2-D scatter plot of these columns. The correlation between estimated and reference concentrations for the blue dye is 0.994. If the box containing plot statistics (among which is the correlation) is not displayed in the upper-left corner of your plot, use the toolbar to display it. The toolbar can also be used to add a regression line and target line to the plot.
Continue by making the scatter plot for the green dye (columns “Green” and “2” in the table), for which the correlation between estimated and reference concentrations is 0.997.
For the orange dye (columns “Orange” and “3”), the correlation is 0.998. These very high correlations indicate that the MCR calculations have determined concentration profiles accurately in this case.
Scatter plot of orange dye concentration
These plots can be customized by right clicking and choosing Properties to make changes to the plot appearance.
Now let us convert the estimated Orange concentrations to real scale. In order to do this, at least one reference measurement is needed. The estimated concentrations (in relative scale) of all samples can then be converted into the real concentration scale by multiplying them by a factor (real concentration / estimated concentration).
In the present case, we can use for example sample PROBE_11, which has a reference concentration of Orange dye of 7 and an estimated concentration of 0.4443.
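The rescaling amounts to one multiplication. In the sketch below, the reference pair (7 and 0.4443) is taken from the text, while the other estimated values are invented for illustration:

```python
# Convert relative MCR concentrations to real scale using one reference
# sample: PROBE_11 has reference 7 and estimate 0.4443 for the orange dye
est_orange = [0.4443, 0.2101, 0.8886]   # estimated values (last two invented)
factor = 7 / 0.4443                     # real / estimated for PROBE_11
real_orange = [factor * e for e in est_orange]
print(round(real_orange[0], 4))  # 7.0: the reference sample maps back to itself
```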
Use menu Edit - Append - … to append a new column at the end of the table, and name it “MCR Orange real scale”. Go to Tasks - Transform - Compute General…, and type the expression:
V7=V3*(7/0.4443)
in the Expression space.
Compute General Dialog
Click OK to perform the calculation. A new matrix is created where the new column has been filled with the values of estimated Orange dye concentrations converted to real scale.
Data matrix with new values
Tutorial J: MCR constraint settings
Constraint settings in multivariate curve resolution
• Description o What you will learn o Data table
• Data plotting • Estimate the number of pure components and detect outliers with PCA • Run MCR with default settings • Tune the model’s sensitivity to pure components • Run MCR with a constraint of closure • Remove outliers and noisy wavelengths with recalculate
Description
In this tutorial we will utilize FTIR spectra of an esterification reaction to extract pure spectra and their relative concentrations. The original data are from the University of Rhode Island (Prof. Chris Brown), USA.
In situ FTIR spectroscopy was used to monitor the esterification reaction of isopropyl alcohol and acetic anhydride using pyridine as a catalyst in carbon tetrachloride solution. The initial concentrations of these three chemicals were 15%, 10% and 5% in volume, respectively. Isopropyl acetate was one of the products in this typical esterification reaction. The reaction was carried out in a ZnSe cell, and mixture spectra were measured at 4 cm-1 resolution. The data set consisted of 25 spectra, covering approximately 75 minutes of the reaction. To shift the equilibrium of the esterification, one-tenth of the volume was removed from the cell at 24, 45 and 60 minutes. An equal amount of a single reactant was added to the cell in the sequence of acetic anhydride, pyridine and isopropyl alcohol.
What you will learn
This tutorial contains the following parts:
• Estimate the number of pure components and detect outliers with PCA • Run MCR with default settings • Tune the sensitivity to pure components setting • Run MCR with a constraint of closure • Use the Recalculate functionality in MCR
References:
• Basic principles in using The Unscrambler® • Principles of PCA • What is MCR? • Interpreting MCR Plots
Data table
Click the following link to import the Tutorial J data set used in this tutorial.
The data consist of 25 FTIR spectra of 262 variables covering the spectral region from 1860 to 852 cm-1. There are two row sets already defined: mixture and closure. Mixture contains all the data, while the row set closure contains the samples that will be used when applying the constraint of closure during the MCR.
Data plotting
Before starting the analysis, it is always important to have a look at the data. Make a line plot of all of the spectra together.
Select all the samples by selecting the data set Tutorial_J in the project navigator. The data table for the FTIR spectra of the samples will then be displayed in the data editor. Highlight the samples, and use Plot - Line to display an overlay of the spectra in the viewer.
Line plot dialog
From this plot, one can see that there is a region around 1240 cm-1 that is changing over the course of the reaction being monitored.
Line plot of FTIR spectra
Estimate the number of pure components and detect
outliers with PCA
Principal Component Analysis (PCA) is recommended before running an MCR calculation. It provides some information on the number of pure components and on sample outliers.
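The idea can be sketched in a few lines of NumPy (an illustrative sketch on synthetic data, not The Unscrambler's own computation): the explained variance per component levels off once the chemical rank of the mixture is exhausted, so the plateau suggests the number of pure components.

```python
import numpy as np

def explained_variance(X, n_components=8):
    """Explained variance per component from the SVD of the data matrix.
    As in this tutorial, the data are NOT mean centered, so the raw matrix
    is decomposed directly. Values are relative to the retained components."""
    s = np.linalg.svd(X, compute_uv=False)[:n_components]
    var = s ** 2                      # singular value^2 = variance captured
    return var / var.sum()

# Synthetic 3-component mixture data shaped like the tutorial's:
# 25 "spectra" x 262 "wavelengths", plus a little noise
rng = np.random.default_rng(0)
C = rng.random((25, 3))                  # concentration profiles
S = rng.random((3, 262))                 # pure "spectra"
X = C @ S + 1e-4 * rng.standard_normal((25, 262))

ev = explained_variance(X)
print(ev[:4])  # the first three components dominate; the rest are noise
```

On real data the plateau is rarely this sharp, which is why the tutorial cross-checks the PCA suggestion against the MCR results.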
Task
Run a PCA on the raw data.
How to do it
Click Tasks - Analyze - Principal Component Analysis to run a PCA and choose the following settings:
• Matrix: Tutorial_J
• Rows: All
• Columns: All
• Maximum components: 8
• Mean center data: Not selected
• Identify outliers: Selected
PCA Dialog
On the Validations tab, select Cross validation, then click Setup… and choose full cross validation from the drop-down list of cross validation methods. Click OK, then OK again on the model inputs page.
Cross Validation Setup
Once the PCA calculations are done, click Yes to view the plots of the PCA model immediately. The four plot PCA Overview will be displayed in the viewer.
The upper right quadrant is a 2-D plot of the PCA loadings. For spectral data, it is more informative to have a line plot of the loadings, as it then resembles a spectrum. Select the existing loading plot, and go to Plot - Loadings - Line, which will give the plot of the first PC loading, replacing the default plot in this quadrant. This plot, one can see, closely resembles the FTIR spectra of the raw data.
Scroll through the loadings plots for the other PCs using the arrows on the toolbar.
You can see that the loadings begin to get noisy at about the sixth principal component. The program recommends three components as the optimal number of PCs in this model. This is seen in
the Info box in the lower left corner of the display, and by clicking on the star on the menu toolbar. Select the Explained Variance plot in the lower-right quadrant by clicking on it with the mouse, then right mouse click to select View - Numerical View.
As you can see, the explained variance globally reaches a plateau from the third principal component. The fourth and fifth PCs still show some slight increase; at that stage, it is difficult to know whether they represent noise or real information. Now, click on the Influence plot at the bottom-left corner of
the Viewer, and use the PC navigation tool to display the influence plot at PC4. You may observe that sample 1 sticks out to the right with a high leverage, and that sample 8 sticks out upwards with a high residual variance.
PCA Influence Plot for PC4
Go to menu Plot - Sample Outliers to display a combination of four useful plots for outlier detection. Highlight the Residual Sample Variance at the bottom-left quadrant, and use the PC navigation arrows to change that to show results for PC4. This plot indicates a high validation residual for sample 8.
Residual Sample Variance Plot for PC4
As there is no validation check in MCR, we may use the outlier information issued from PCA in our MCR modeling later on.
Rename the PCA model file in the project navigator by highlighting the PCA node, right clicking and choosing Rename. Rename the model to “PCA Tutorial J”.
Run MCR with default settings
Task
Build a first MCR model with default settings.
How to do it
Go back to the data table Tutorial_J in the project navigator. Run an MCR by going to the menu and selecting Tasks - Analyze- Multivariate Curve Resolution… and keep the default settings:
• Matrix: Tutorial_J
• Rows: All
• Columns: All
Go to the Options tab and verify that the default settings are selected. Make changes as needed.
• Non-negative concentrations: selected
• Non-negative spectra: selected
• Closure: not selected
• Unimodality: not selected
• Sensitivity to pure components: 100
• Maximum ALS iterations: 50
MCR Options Dialog
Click OK to launch the calculations.
Note: MCR computations are demanding. Building the model can easily take several minutes depending on the size of the data set, the selected options and the capacity of your computer processor.
Click Yes when the calculations are finished, and you are asked if you want to view plots now. The MCR Overview plots are displayed. Notice that the program suggests 4 as the optimal number of
pure components, by indicating 4 components in the toolbar. This information, as well as the parameters for the MCR analysis, can be seen in the Info box in the lower left of the display.
Information Box
Rename the MCR model file to “MCR_Defaults”.
Tune the model’s sensitivity to pure components
Task
Read the MCR Warnings, which are found under the MCR model node. Open the warnings and follow the system’s recommendation for the Sensitivity to pure components setting.
How to do it
Expand the MCR_Defaults node in the project navigator and click on Warnings. A table of information will be displayed in the viewer and here you can check the recommendations given by the system. There are four types of recommendations:
Type 1 Increase sensitivity to pure components
Type 2 Decrease sensitivity to pure components
Type 3 Change sensitivity to pure components (increase or decrease)
Type 4 Baseline offset or normalization is recommended.
In the present case, the system recommends changing the setting for sensitivity to pure components.
The default setting (100) that was used for Sensitivity to pure components is usually a good starting point. After interpreting the results and reading the system recommendations, you can tune it up or down between 10 and 190. The higher the Sensitivity, the more pure components will be extracted. Therefore, if too many components are extracted, it is recommended to reduce the setting. Likewise, if you would like to see more components at an almost undetectable level, or even some noise profiles, it is recommended to increase the sensitivity setting.
Let us build a model with an increased setting.
Go back to the data table and redo the MCR calculation with a Sensitivity to pure components setting of 150.
The plot of Component Spectra is now shown by default for 5 components instead of 4 in the previous model.
Component Spectra for 5 Components
One can compare these profiles with FTIR spectra of known constituents, and identify the 5 estimated spectra as pyridine, isopropyl alcohol, a possible intermediate, isopropyl acetate and acetic anhydride, from curves 1-5 respectively.
Rename the new MCR model file created in the project navigator as MCR_Sensitivity150.
Run MCR with a constraint of closure
Task
Run MCR with a closure constraint. Compare two MCR models on the same data, with and without closure.
How to do it
Among the MCR settings we have used so far, two types of constraints were not selected.
A constraint of Unimodality can be applied to restrict the resolution to concentration profiles that have only one maximum.
With a constraint of Closure, the resolution will yield concentration profiles whose sum is constant.
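Closure can be pictured as rescaling each sample's resolved concentrations to a constant sum. A minimal sketch (illustrative values, not the tutorial's actual concentrations):

```python
import numpy as np

# Hypothetical resolved concentrations: 3 samples x 3 components
C = np.array([[0.2, 0.5, 0.1],
              [0.3, 0.4, 0.2],
              [0.1, 0.6, 0.3]])

# Closure: rescale each row so its concentrations sum to 1
C_closed = C / C.sum(axis=1, keepdims=True)
print(C_closed.sum(axis=1))  # -> [1. 1. 1.]
```

This is why, later in this tutorial, the concentration profiles of the closure model always add up to 1.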
In the present case, acetic anhydride was added at 24 minutes (between the eighth and the ninth samples), which means that the first 8 samples can be treated in closure conditions.
Go back to the data table and run a new MCR model with the following settings:
• Rows: Closure [8] (contains the first 8 samples of the data table)
• Cols: All
• Non-negative concentrations: selected
• Non-negative spectra: selected
• Closure: selected
• Unimodality: not selected
• Sensitivity to pure components: 100
Once the computations are finished, choose to view the plots when prompted. Rename the new MCR model file as “MCR_Closure”.
You may compare the resolved concentration and spectral profiles of pure components with and without the closure setting. To do that, compute a new MCR model on sample set “Closure” without checking the Closure constraint option. Save the new MCR model file as “MCR_No_Closure” and compare the results to “MCR_Closure”.
The spectral profiles with and without the constraint of closure are very similar.
MCR Component Spectra
You can also observe that under constraint of closure, the concentrations of the pure components always add up to 1.
MCR Component Concentrations
Notes on MCR result interpretation
1. The spectral profiles obtained may be compared to a library of FTIR spectra in order to identify the nature of the pure components that were resolved. Likewise, if you have the spectra of your pure components and solvents, you can compare these to the computed components.
2. Estimated concentrations are relative values within an individual component itself. Estimated concentrations of a sample are not its real composition.
Remove outliers and noisy wavelengths with recalculate
Task
Use the Recalculate functionality to remove samples or variables with high residuals.
How to do it
Select the MCR_Defaults tab from the navigation bar to display your first MCR model on screen. If the plots were already closed, you may open them again from the project navigator; click on the MCR Overview plot from the node MCR_Defaults to display the results.
The Validation calculations of the PCA model that we built earlier indicated that sample 8 was a potential outlier. We can check this again in the MCR model by looking at the fitting residuals.
Click on the bottom-left subframe where the Sample residuals are plotted to highlight it. If needed, use the PC navigation arrow tool to change the view to show the sample residual for the 4-component model.
Here you may notice a high residual showing for Sample 8, compared to the other samples. Let us build a model without this sample. You will notice in the sample residuals plot that the shape is similar to what is observed in the residual sample variance plot from the PCA model on this same data set.
MCR Sample Residuals
Use the marking tools to highlight sample 8 in the Sample Residuals plot.
Marked sample in sample residuals plot
Select the MCR_Defaults model in the project navigator, and right click to select Recalculate - Without Marked… to specify a new MCR calculation without sample 8.
Menu to recalculate without marked
This brings you back to the MCR dialog, where sample 8 is now included in the Keep Out Of Calculation field. You may launch the calculations to get the new MCR results.
MCR menu with sample 8 kept out
Similarly, you may want to keep out of the model non-targeted wavelength regions, or highly overlapped wavelength regions.
From the MCR_Defaults overview plots, click Plot - Variable Residuals.
MCR Variable Residuals
Mark any unwanted variables on the plot using the marking tools, for example the variables around 1100-1140 cm-1, which present very high residuals, then select the model “MCR_Defaults” and right click to choose Recalculate - Without Marked… to specify a new MCR calculation.
General notes on MCR settings and interpretation:
1. To have reliable results on the number of pure components, one should cross-check with a PCA result, change the sensitivity to pure components setting, and use the navigation bar to study the MCR results for various numbers of pure components.
2. Weak components (either low concentration or noise) are usually listed first.
3. One can utilize estimated concentration profiles and other experimental information to analyze a chemical/biochemical reaction mechanism.
4. One can utilize estimated spectral profiles to study the mixture composition or even intermediates during a chemical/biochemical process.
Tutorial K: Clustering
• Description
  o What you will learn
  o Data table
• Transform the raw spectra
• Application of K-Means clustering
• Application of Hierarchical Cluster Analysis (HCA)
• Repeat the HCA using a correlation-based measure
• Using the results of HCA to confirm the results of PCA
Description
This tutorial investigates the use of two well-known clustering methods, K-Means and Hierarchical Cluster Analysis (HCA), for the classification of raw materials used in the pharmaceutical industry, by means of reflectance Near Infrared (NIR) spectroscopy. This is an example of unsupervised pattern recognition and is an alternative methodology to Principal Component Analysis (PCA). Unsupervised pattern recognition is the first step performed to establish whether a discriminant classification method can be developed.
What you will learn
Tutorial K contains the following parts:
• Apply a pretreatment method to the spectral data
• Use K-Means to identify clusters in the data set
• Perform HCA and analyze the resulting dendrogram output.
References
• Basic principles in using The Unscrambler®
• Principles of PCA
• Data preprocessing and transformations
• Classification
• Cluster Analysis
Data table
Click the following link to import the Tutorial K data set used in this tutorial.
The data table contains 35 NIR spectra of seven classes of raw materials often used in pharmaceutical manufacturing. Typically when developing classification models it is recommended that more samples be used, being sure to cover the natural variability of each class, but for this exercise, we use just five spectra for each class.
The diffuse reflectance spectra have been truncated to the wavelength region 1200 - 2200 nm for this particular example.
The type of raw material is defined in the name of each sample, and includes:
• Citric acid
• Dextrose anhydrous
• Dextrose monohydrate
• Ibuprofen
• Lactose
• Magnesium stearate
• Starch
Transform the raw spectra
Task
Transform the raw spectral data by applying a Standard Normal Variate (SNV) transform to the Tutorial_K data table.
How to do it
Open the file Tutorial_K.unsb from the tutorial data folder.
First plot the raw data by selecting the entire table and selecting Plot - Line and select all rows and columns to plot.
Line plot
Click on OK and view the plot. Notice that there are distinct groups of spectra with similar profiles. The main source of variation within each group comes from differences in the absorbance (Y) axis. This baseline shifting is due to differences in sampling when preparing and scanning, resulting in differences in light scattering by the samples measured in reflectance by NIR spectroscopy.
Line plot of NIR spectral data
A convenient way to remove this variation is by use of the SNV transform. This transform reduces the scattering effects in such data by removing the mean value from each point in the spectrum and dividing each point by the standard deviation of all points in the spectrum, i.e. the SNV transform normalizes the spectrum to itself. The effect of the SNV transform is to remove the variation in the absorbance scale (baseline shifting), while retaining the original profile of the spectral data.
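The SNV computation itself is just a per-spectrum standardization, which can be sketched in NumPy (a minimal illustration of the transform, not The Unscrambler's implementation):

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row)
    by its own mean and standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

rng = np.random.default_rng(0)
base = rng.random(100)
# Two "spectra" with the same profile but an additive baseline shift
spectra = np.vstack([base, base + 0.5])

corrected = snv(spectra)
# After SNV the baseline-shifted copies coincide
print(np.allclose(corrected[0], corrected[1]))  # -> True
```

After the transform every spectrum has mean 0 and standard deviation 1, which is exactly why the baseline offsets visible in the raw line plot disappear.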
This is a commonly used practice in many NIR applications, especially for reflectance spectra of solids. To perform the SNV transformation, right click in the matrix Tutor K Data and select Transform - SNV. In the Rows dialog box, select All, and in the Columns dialog box, select All. You can preview the effect of the transformation by clicking in the Preview result box, or just click OK to perform the transformation.
SNV dialog
The transformed data are displayed as a new node in the project navigator and the matrix is called Tutor K Data_SNV. Plot the data to see how they now look by selecting all samples in the new matrix and going to Plot-Line.
The resulting SNV-transformed spectra can be seen below.
Line plot of SNV-transformed NIR Spectra
The spectra are now ready for application of the clustering algorithms described below.
It is a good idea to save your work as you go. Save your project by going to File-Save As….
Application of K-Means clustering
K-Means clustering is an unsupervised classification method which attempts to group a set of samples being analyzed into “K” distinct groups, where K is specified by the analyst. The classification is performed based on a predefined distance measure. For more details on the distance measures available, refer to the section on Cluster Analysis.
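The mechanics of K-Means with a Euclidean distance can be sketched as plain Lloyd's algorithm in NumPy. This is an illustration of the idea on synthetic data, not The Unscrambler's actual implementation (which has its own initialization and options):

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm with Euclidean distance. Centers are
    initialized farthest-first so the result is deterministic."""
    centers = [X[0]]
    for _ in range(1, k):
        # next center: the sample farthest from all current centers
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign every sample to its nearest center ...
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ... then move each center to the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two tight, well-separated synthetic "material classes" of 5 samples each,
# mimicking the 5-spectra-per-class structure of this data set
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (5, 3)), rng.normal(5, 0.1, (5, 3))])
labels = kmeans(X, k=2)
print(labels)  # -> [0 0 0 0 0 1 1 1 1 1]
```

With well-separated classes, as here, each cluster ends up containing exactly the samples of one class, which is what the tutorial's seven-cluster run will show for the raw materials.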
Task
Perform a K-Means clustering of all samples.
How to do it
Use Tasks - Analyze- Cluster Analysis… and select the following parameters under the Inputs tab:
• Matrix: Tutor K Data_SNV
• Rows: All
• Columns: All
• Number of Clusters: 7
• Clustering Method: K-Means
• Distance Measure: Euclidean
Cluster analysis dialog
With K-Means one can also make initial class assignments on the Options tab, and set the maximum number of iterations for the algorithm. Here we will allow the algorithm to make assignments with no further input, and use the default of 50 iterations.
Cluster analysis dialog options tab
Click OK to start the analysis and a new node will appear in the project navigator called Cluster analysis. Right click on the node and select Rename and call this analysis K-Means.
You will notice that there is no graphical output for K-Means clustering. The output of the cluster analysis is found in the Results folder. Expand this folder to display a node called Tutor K Data_SNV_Classified, where the results reside. The classified data matrix is color-coded according to the clusters (row sets) that have been identified. Expand this matrix. Expand the rows and the columns folders and you will see that the rows contain seven assigned clusters from Cluster-0 to Cluster-6. The columns folder contains the class, a single column of classification results.
The K-Means data table is now classified by different colors, corresponding to the various assigned classes. Study this table. You will notice that the K-Means algorithm has successfully classified the data into seven distinct classes, each containing a single raw material type. Click on the various cluster nodes in the project navigator and confirm that each cluster contains 5 samples of the same material type. Using the Rename function, assign cluster names according to the table above. The results of this operation are shown below.
View of Assigned Classes in Navigator
Now that the separate classes have been defined, you can use this information to group samples in plots. Go back to the matrix Tutor K Data_SNV and right click to select Plot - Line. In the plot, you can now right click to select Sample grouping. In the sample grouping dialog, first select the clustered data from the drop-down list for Select. The row sets you have just renamed are now available as row sets. Select all of these, and click OK. The line plot will now have all samples of each set displayed in a single color.
Sample grouping option
Application of Hierarchical Cluster Analysis (HCA)
Hierarchical Cluster Analysis (HCA) is another clustering method. Like K-Means, it is based on distance measures; however, the main output of the HCA is the dendrogram. The dendrogram provides information pertaining to sample relationships within a particular data set. The structure of the dendrogram is dependent on the distance measure used and great care must be taken when interpreting the structures.
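The agglomeration behind a single-linkage dendrogram can be sketched naively in NumPy: start with every sample as its own cluster and repeatedly merge the two clusters whose closest members are nearest. This toy sketch (synthetic data, no dendrogram drawing) only illustrates the merging rule, not The Unscrambler's implementation:

```python
import numpy as np

def single_linkage_clusters(X, n_clusters):
    """Naive agglomerative clustering: single linkage (nearest neighbor),
    Euclidean distance. Merges the two closest clusters until
    n_clusters remain."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    while len(clusters) > n_clusters:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the CLOSEST members
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] += clusters.pop(b)   # merge the winning pair
    return clusters

# Two tight synthetic classes of 5 samples each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(4, 0.1, (5, 2))])
groups = [sorted(c) for c in single_linkage_clusters(X, 2)]
print(groups)  # -> [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

The order and heights at which merges happen is what the dendrogram visualizes; changing the linkage rule or the distance measure, as done later in this tutorial, changes that merge sequence and hence the tree.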
Task
Make a HCA model using the method of single linkage and Euclidean distance.
How to do it
Select Tasks - Analyze - Cluster Analysis… and make a model with the following parameters:
• Matrix: Tutor K Data_SNV
• Rows: All
• Cols: All
• Number of Clusters: 7
• Clustering Method: Hierarchical Single-linkage
• Distance Measure: Euclidean
Use the drop-down lists to change the clustering method and distance measure. Click OK to start the analysis. When the analysis is completed, the dendrogram is displayed in the editor window, and a new Cluster analysis node is added to the project navigator.
HCA Euclidean Dendrogram
Before reviewing the analysis results, rename the new cluster analysis node in the project navigator as HCA Euclidean.
Analyze the dendrogram and look at the order of the clusters from top to bottom. It can be seen that each raw material type is uniquely defined, and the carbohydrate materials Starch, Lactose, Dextrose Monohydrate and Dextrose Anhydrous all group together in the dendrogram. Towards the bottom, the clustering is not as distinct. This indicates that the sample classification is based on some similarity in the chemistry of the samples, but it is not as well defined as it could be. This is one aspect of HCA that must be kept in mind when using this method.
In the project navigator, expand the results folder for the HCA and under the rows folder, you will see that seven clusters have been assigned to this analysis. These can be renamed as was done above, so that the names coincide with the class name.
Repeat the HCA using a correlation-based measure
When dealing with spectroscopic data, the spectrum of a material is analogous to its fingerprint. Using a straight distance measure such as the Euclidean measure may not be the most sensitive way of assessing the similarities present within the data. The Absolute correlation measure provides a better way of capturing the similarities in the spectral variables of the materials. We will also change to complete linkage, which uses the farthest neighbor, as opposed to the nearest neighbor used in single-linkage HCA.
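The difference between the two measures can be shown with a toy example (synthetic "spectra", purely illustrative): two profiles that are proportional but offset look far apart to a Euclidean distance, yet identical to an absolute-correlation distance.

```python
import numpy as np

x = np.linspace(0, 1, 50)
a = np.sin(2 * np.pi * x)
b = 3 * a + 10               # same "fingerprint", but shifted and scaled
c = np.cos(2 * np.pi * x)    # genuinely different profile

def abs_corr_distance(u, v):
    """1 - |correlation|: ~0 for proportional profiles regardless of
    baseline offset or scale; ~1 for unrelated profiles."""
    return max(0.0, 1 - abs(np.corrcoef(u, v)[0, 1]))  # clip rounding error

print(round(np.linalg.norm(a - b), 1))    # Euclidean: a and b look far apart
print(round(abs_corr_distance(a, b), 3))  # -> 0.0  (identical fingerprints)
print(round(abs_corr_distance(a, c), 3))  # close to 1: different shape
```

This is why the correlation-based dendrogram below groups the materials by spectral shape, i.e. by chemistry, rather than by overall intensity.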
Task
Make a HCA model using the method of complete linkage and absolute correlation.
How to do it
Select Tasks - Analyze - Cluster Analysis. Use the following parameters:
• Matrix: Tutor K Data_SNV
• Rows: All
• Columns: All
• Number of Clusters: 7
• Clustering Method: Hierarchical Complete-linkage
• Distance Measure: Absolute Correlation
Click OK to start the analysis and then click Yes to view the plots. The dendrogram for this analysis is displayed in the editor window, and from the results node it is seen that 7 clusters are identified.
Before reviewing the analysis results, rename the new cluster analysis node in the project navigator as “HCA Correlation”.
Notice that all samples are uniquely classified into classes based on the raw material type. This time there are three distinct clusters in the dendrogram. At the top of the dendrogram is Starch. The next cluster of samples contains mostly carbohydrates: Lactose, Dextrose Monohydrate, Dextrose Anhydrous and Citric acid. The last cluster includes the materials Ibuprofen and Magnesium stearate, whose NIR spectra have features in the 1400 and 1700 nm regions.
HCA Absolute correlation distance dendrogram
The method of absolute correlation not only uniquely classified the individual raw materials, but it was also able to use the information in the spectral variables far better, by grouping the materials by their chemical properties.
In the results folder, select the data table Tutor K Data_SNV_Classified. Go to Insert - Duplicate Matrix…. The following dialog box opens.
Duplicate Matrix
Rename the clusters of the duplicated matrix based on the materials’ name.
Renamed row ranges
We will use these results, in conjunction with PCA, to show how the two methods of unsupervised pattern recognition can be used together.
Using the results of HCA to confirm the results of PCA
Task
Perform a PCA on the SNV transformed data and group the samples based on the results of HCA.
How to do it
Select Tasks - Analyze- Principal Component Analysis…. Use the following parameters:
• Matrix: Tutor K Data_SNV_Classified
• Rows: All
• Columns: All
• Maximum Components: 6
• Mean Center Data: Yes
• Identify Outliers: Yes
PCA dialog
Click OK to start the analysis and then click Yes to view the plots. The PCA Overview for this analysis is displayed in the workspace.
In the Scores Plot right click and select Sample Grouping and from the Select drop-down list, use the results from your clustering to give you the available row sets of the different clusters. Click on the » button to select all clusters in the analysis and then click OK.
Sample grouping dialog
Drag the updated scores plot so that it fills most of the screen and analyze the clustering.
The scores plot shows that PC1 explains 66% of the data variance, and PC2 describes 19%. The main difference along PC1 is between carbohydrate materials and fatty acid based materials (i.e. Magnesium Stearate and Citric Acid) and PC2 is differentiating between the starch and ibuprofen samples.
It can be seen that the clustering of the materials as established by HCA is consistent with that of PCA. PCA provides more information on the groupings, as the spectral loadings can be related to the spectral features which describe the materials. To have a more informative view of the PCA loadings it is better to look at them as a line plot, which then resembles a spectrum. Activate the loadings plot in the upper-right quadrant, and right click to select PCA - Loadings - Line. The loadings plot now shows which spectral features are related to the first PC, which explains most of the variance in this
data set. Use the next arrow to scroll to the next PC loadings plot.
PCA Overview Plot
Now that the work has been done it is a good idea to save the results so you can refer to them in the future.
This exercise has shown that, when more data (more samples per class) are available, one can proceed to make a classification model to identify these seven raw materials from their NIR spectra. Classification methods such as PLS-DA and SIMCA can be used to develop models for the classification of future samples.
Tutorial L: L-PLS
• Description
  o What you will learn
  o Data table
• Open and study the data
• Build a L-PLS model
• Interpret the results
  o Variances
  o Products: Scores
  o Product descriptors X: X Correlation Loadings
  o Consumer descriptors Z: Z Correlation Loadings
  o Consumer liking of the products Y: Y Correlation Loadings
  o Overview of the L-PLS Regression solution
• Verify the results
  o Products liking
  o Liking Y vs. consumer background Z
  o Product descriptor rows in X
  o Product descriptor columns in X
• Bibliography
Description
Consumer studies represent an application field where such “L-shaped” data matrix structures X, Y, Z are common: a set of I products has been assessed by a set of J consumers, e.g. with respect to liking, with results collected in the “liking” data table Y(I × J). In addition, each of the I products has been “measured” by K product descriptors (“X-variables”), reflecting chemical or physical measurements, sensory descriptions, production facts etc., in data table X(I × K). Moreover, each of the J consumers has been characterized by L consumer descriptors (“Z-variables”), comprising sociological background variables like gender, age, income, etc., as well as the individual’s general attitudes and consumption patterns; these are collected in data table Z(J × L). Relevant questions could then be: Is it possible to find reliable patterns of variation in the liking data Y which can be explained both from the product descriptors X and from the consumer descriptors Z? Is it possible to predict how a new product will be liked by these consumers, by measuring its X-variables? Is it possible to predict how a new consumer group will like these products, from their background Z-variables?
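The “L” shape can be pictured with three arrays whose dimensions interlock: X and Y share the product rows, while Y and Z share the consumers. A minimal sketch with random placeholder values (only the shapes matter here):

```python
import numpy as np

# products, consumers, product descriptors, consumer descriptors
I, J, K, L = 6, 125, 10, 15

rng = np.random.default_rng(0)
X = rng.random((I, K))   # product descriptors (sensory + chemical)
Y = rng.random((I, J))   # liking scores: products x consumers
Z = rng.random((J, L))   # consumer background descriptors

# The "L" shape: X and Y share the product rows; Y and Z share the consumers
assert X.shape[0] == Y.shape[0]
assert Y.shape[1] == Z.shape[0]
print(X.shape, Y.shape, Z.shape)  # -> (6, 10) (6, 125) (125, 15)
```

These are exactly the dimensions of the apple data used in this tutorial, as described in the following sections.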
The data consists of information gathered on Danish children’s liking of apples. Their response to various apple types is termed Y. Chemical, physical and sensory descriptors of these apple types are called X, and sociological and attitude descriptors on these children is in matrix Z. The purpose of
the analysis is to find patterns in these X-Y-Z data that are causally interpretable and have predictive reliability.
We are now going to build an L-PLS regression model linking the panelists’ sensory, chemical and physical evaluations to the consumers and their sociological and attitude descriptors. The model will summarize all the information about consumers, consumers’ preference, the products and their characteristics.
What you will learn
This tutorial contains the following parts:
• Open and study the data.
• Build an L-PLS model which explains consumer likings for the different consumer segments from the descriptive sensory attributes and chemical measurements.
• Study the results.
• Verify the results.
References:
• L-shape Partial Least Square
• Partial Least Square regression
• Scatter plots
Data table
We are going to study three data tables that do not have all the same size. The structure of the data set is as follows:
• X - ApplesSensoryChem
• Y - ApplesLiking
• Z - AppleChildBackground
L-PLS Structure
In the following, matrices will be written in upper-case letters (e.g. X), vectors in lower-case, and scalar elements in italics; all vectors are column vectors unless otherwise specified.
The six products
The data are taken from Thybo et al. (2004). I = 6 products were the apple cultivars “Jonagold”, “Mutsu”, “Gala”, “Gloster”, “Elstar” and “GrannySmith”. All cultivars were selected due to commercial relevance for the Danish market and due to the fact that the cultivars were known to span a large variation in sensory quality (Kuhn and Thybo, 2001). Gloster was chosen as a wine-red cultivar with particularly high glossiness, Gala and Jonagold as red cultivars with 80–90% red blushed surface, Mutsu as a yellow-green cultivar and GrannySmith as a green and particularly round-shaped cultivar. GrannySmith was known to be a rather popular cultivar for some children, due to its texture and moistness characteristics. Only apples with shape and color deemed representative for their cultivar were used.
X data
The X data matrix (X - ApplesSensoryChem) contains the chemical, physical and sensory data of these apple types. Sensory profile descriptors: A panel of ten assessors was trained in quantitative descriptive analysis of apple types as described in Kuhn and Thybo (2001). Conventional statistical design with respect to replication and serving order was applied. The panel average of a subset of the appearance, texture, taste and flavour descriptors determined will be used here:
• Red
• Sweet
• Sour
• Glossy
• Hard
• Round
Chemical and instrumental product descriptors:
• Texture firmness was evaluated instrumentally by penetration (FIRM Instrument).
• Content of acid (ACIDS) and sugar (SUGARS) were determined as malic acid and soluble solids, respectively.
• Based on prior theory on human sensation of sourness, the ratio ACIDS/SUGARS was included as a separate variable (Kuhn and Thybo, 2001).
Together, the sensory, chemical and instrumental variables constituted K=10 product descriptors, which will here be referred to as X(I × K) for the I = 6 products.
Y data
The Y data (Y - ApplesLiking) consists of information gathered on Danish children’s liking of apples. Their response to various apple types is termed Y. Each child was asked to express the liking of the appearance of the six apple cultivars, using a five-point facial hedonic scale:
1. “not at all like to eat it”
2. “not like to eat it”
3. “it is okay”
4. “like to eat it”
5. “very much like to eat it”.
One apple at a time was shown to the child, so that the child would not concentrate on comparing the appearances. All samples were presented in randomized order. The resulting liking data for the I = 6 products × J = 125 consumers will here be termed Y(I × J).
Z data
The Z data table (Z - AppleChildBackground) contains the information collected about the consumers: sociological and attitude descriptors on these children.
The consumers were children aged 6–10 years (51% boys, 49% girls), recruited from a local elementary school. A total of 146 children were tested and included in the original publication of Thybo et al. (2004). For simplicity, only the J = 125 children that had no missing values in their liking and background data are included in the present study.
First, each child was asked to look at a table with five different fruits (a red and a green apple, a banana, a pear and an orange (mandarin)), and answer the questions: “If you were asked to eat a fruit,
which fruit would you then choose, and which fruit would be your last choice?” The resulting responses will here be named “fruitFirst” and “fruitLast”, where “fruit” is one of Red Apple, Green Apple, Pear, Banana, Orange or Apple. (Summaries were later computed for apple liking: AppleFirst = RedAppleFirst + GreenAppleFirst and AppleLast = RedAppleLast + GreenAppleLast.)
The child was also asked how often he or she ate apples, with the following response options: “every day” (here coded as value 4), “a couple of times weekly” (3), “a couple of times monthly” (2), “very seldom” (1); this descriptor is here named “EatAOften”. (A few of the children responded “do not know”; to reduce the number of missing values, this was for simplicity taken as indicating very low apple consumption, and coded as 0.) In addition, the child’s gender and age were noted. These two sociological descriptors were used, together with the attitude variables fruitFirst and fruitLast and the eating-habit variable EatAOften, as L = 15 consumer background descriptors Z(J × L) for the J = 125 children.
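The coding described above can be sketched as follows. The labels and numeric codes follow the tutorial; the example responses are invented for illustration:

```python
# Coding of the "EatAOften" descriptor, as described above.
CODES = {
    "every day": 4,
    "a couple of times weekly": 3,
    "a couple of times monthly": 2,
    "very seldom": 1,
    "do not know": 0,  # taken as indicating very low apple consumption
}

responses = ["every day", "do not know", "very seldom"]  # invented answers
eat_a_often = [CODES[r] for r in responses]
print(eat_a_often)  # [4, 0, 1]
```

Coding “do not know” as 0 rather than treating it as missing is a pragmatic choice that keeps all 125 children in the data set.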
Open and study the data
Click the following link to import the Tutorial L data set used in this tutorial.
There are three matrices:
• X - ApplesSensoryChem
• Y - ApplesLiking
• Z - AppleChildBackground
Build an L-PLS model

The model will explain the consumer likings from the descriptive sensory attributes, while also using the consumer background information.
Go to the menu Tasks - Analyze - L-PLS Regression….
Tasks - Analyze - L-PLS Regression…
• In X select the variable set “X - ApplesSensoryChem”, in Rows and Columns select All.
• In Y select the variable set “Y - ApplesLiking”, in Rows and Columns select All.
• In Z select the variable set “Z - AppleChildBackground”, in Rows and Columns select All.
• Set the maximum components to 10 PCs.
L-PLS regression settings
Then set the weights individually as follows:
• Click on the X Weights option. Select all the variables by clicking on the All button. Then select the option “A / (SDev + B)” with the radio button. Finally click on the Update button.
• Click on the Y Weights option and use the weighting option “A / (SDev + B)” for all the variables.
• Click on the Z Weights option and use the weighting option “A / (SDev + B)” for all the variables.
L-PLS regression settings: Weights
Once all necessary options have been selected, click OK to start the computations.
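The “A / (SDev + B)” weighting selected above can be sketched numerically. This is an illustrative sketch, assuming the default A = 1 and B = 0, so each variable is simply divided by its standard deviation; the data are invented:

```python
import numpy as np

# Sketch of the "A / (SDev + B)" weighting (assumed defaults A=1, B=0).
def weight_columns(X, A=1.0, B=0.0):
    sdev = X.std(axis=0, ddof=1)      # per-variable standard deviation
    return X * (A / (sdev + B))       # apply the weight column-wise

X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])
Xw = weight_columns(X)
print(Xw.std(axis=0, ddof=1))         # each column now has unit SD
```

This puts variables measured on very different scales (sensory scores, chemical contents, liking ratings) on an equal footing before the regression.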
Interpret the results
View the results and study the different plots:
• L-PLS Overview
• Correlation Loadings
• Correlation
L-PLS Analysis node
Variances
Study the bottom right plot in the L-PLS overview. It presents the explained variances of the three data tables: X, Y and Z.
Four components are needed to explain the X table, which is the table that is explained best.

The Y table needs 5 factors to reach 72% explained variance.

The Z table is explained at 69% with 10 PCs. It is always more difficult to explain all the variance in this table, as it relates to the background of the consumers.
Products: Scores
Look at the products in the Score plot in the Correlation Loadings.
Score plot
It shows the main patterns of the six products. Products 6 (Granny Smith) and 4 (Mutsu) are grouped together, and products 1 (Jonagold), 2 (Gloster), 3 (Elstar) and 5 (Gala) are grouped together. Product 3 is close to the center, which means it is close to the average sample.
The horizontal dimension (Factor 1) spans the contrast between Granny Smith and the other products, mainly Gala, Gloster and Jonagold. The vertical dimension (Factor 2) spans the contrast between Elstar and the other products. The correlations are rather weak, indicating that the variations in the second dimension are weaker than in the first dimension.
Product descriptors X: X Correlation Loadings
Look at the Product descriptors in the plot X Correlation Loadings in the L-PLS overview.
X Correlation Loadings
It shows the main patterns of the sensory, instrumental and chemical product descriptors. The horizontal dimension is seen to span the sensory contrast between Sour and Sweet, and the chemical contrast between the Acids/Sugars ratio and the Sugar content. Sensory Red color is correlated with Sweet apples. The vertical dimension (Factor 2) mainly contrasts properties like sensory Hard and instrumentally Firm against sensory Round shape and high content of Acids and Sugars.
Consumer descriptors Z: Z Correlation Loadings
Look at the Consumer descriptors in the plot Z Correlation Loadings in the L-PLS overview.
Z Correlation Loadings
It shows the main patterns of the consumer background descriptors. The horizontal dimension spans a tendency to choose the green apple first and the red apple last (GreenAFirst, RedALast), against the tendency to choose the red apple first and the green apple last. The vertical dimension exhibits a contrast between choosing pear first and banana last against choosing banana first and pear last. The purely sociological variables (gender, age, how often apples are eaten) are not particularly evident in the result, although gender (coded as being a girl) is slightly associated with choosing green apple first, pear first and banana last.
Consumer liking of the products Y: Y Correlation Loadings
Look at the Consumer liking of the products in the plot Y Correlation Loadings in the L-PLS overview.
Y Correlation Loadings
It shows the main, product-related patterns of the consumers with respect to liking. Most of the 125 children gather towards either end of the horizontal dimension. The second, vertical dimension (Factor 2) is much less extensive and spans fewer children.
Overview of the L-PLS Regression solution
Look at the plot Correlation as a general picture.
Correlation
In the horizontal dimension, the product GrannySmith is seen to be particularly Sour and not Sweet; it has a high Acids/Sugars ratio and a low level of Sugars. It is also Hard and not Red. The products Gala, Gloster and Jonagold display the opposite tendency.
GrannySmith is seen primarily to be liked by children who were observed to choose green apple first and red apple last, not by children who were observed to choose red apple first and green apple last. Again, products Gala, Gloster and Jonagold seem to display the opposite of this tendency.
In the vertical dimension, the product Elstar is seen to be particularly Round, with high levels of both Acids and Sugars, but neither instrumentally Firm nor sensory Hard; nor was it Glossy. In contrast, the products Mutsu, Gloster and Gala appear a little more Firm and Glossy, with less Sugars and Acids than the others.
Product Elstar seems primarily to be liked by children who chose banana first and pear last, and less liked by children who chose pear first and banana last. In contrast, e.g. Mutsu seemed to be associated with the liking of children who chose pear first.
Verify the results
With a relatively complex modeling tool like the L-PLS regression, it is important to verify the main aspects of the interpretation by plotting the raw data.
Products liking
Plot a scatter plot of the most extreme products (liking GrannySmith vs. liking Jonagold) and look at the correlation.
With only five response levels possible, many data points are superimposed and the pattern is difficult to see. But the raw liking data are clearly negatively correlated (r = −0.29 over the 125 subjects), as expected.
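The correlation reported here is a plain Pearson r over the children. A minimal sketch of the computation, using invented 5-point hedonic scores rather than the tutorial's actual data:

```python
import numpy as np

# Pearson correlation between two liking columns (invented scores).
def pearson_r(a, b):
    a = np.asarray(a, float) - np.mean(a)   # center both vectors
    b = np.asarray(b, float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

granny = [5, 4, 1, 2, 5, 1]   # hypothetical liking of GrannySmith
jona = [1, 2, 5, 4, 2, 5]     # hypothetical liking of Jonagold
print(round(pearson_r(granny, jona), 2))  # strongly negative here
```

The superimposed points mentioned above do not affect the computation; r is defined on the raw paired values, however many coincide.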
Liking Y vs. consumer background Z
Plot a scatter plot of the liking of the green apple GrannySmith against the background response “GreenAFirst”.

To do so, copy the row “GreenAFirst” in the Z table, insert a new row in the Y table, and paste the “GreenAFirst” row there. It is now possible to generate a scatter plot.
There is a clear tendency (r = 0.52 over 125 subjects) that if children chose green apple first, they reported that they liked GrannySmith.
Product descriptor rows in X
Plot a scatter plot of the standardized sensory and chemical variables for the two most extreme products, GrannySmith and Jonagold.
To do so, select the X matrix and go to Tasks - Transform - Center and Scale.
Tasks - Transform - Center and Scale
Select All for Rows and Cols. For the Transformation field select Mean for Center and Standard deviation for Scale.
Center and Scale window
From the new matrix generated called “X - ApplesSensoryChem_CenterAndScale” select the “JonaSC” and “GrannySmithSC” rows and select a scatter plot under the menu Plot.
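The Center and Scale transform applied above is ordinary autoscaling: subtract each column's mean and divide by its standard deviation. A minimal sketch with invented data:

```python
import numpy as np

# Autoscaling: center each column to mean 0 and scale to unit SD.
def autoscale(X):
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

X = np.array([[2.0, 100.0], [4.0, 200.0], [6.0, 300.0]])
Xs = autoscale(X)
print(Xs.mean(axis=0))         # ~[0, 0]
print(Xs.std(axis=0, ddof=1))  # [1, 1]
```

After this transform, rows of X can be compared directly across variables that originally had very different units, which is what makes the product-vs-product scatter plot below meaningful.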
Again, these two products are seen to be described as near opposites: Jonagold is Sweet, Red and high in Sugars, while GrannySmith has a high Acids/Sugars ratio and is Sour, Hard and Round. The correlation is r = −0.72 between these two rows of 10 standardized X variables.
Product descriptor columns in X
Plot a scatter plot of the sensory descriptor Sour and the instrumental descriptor FIRM Instrument.
As expected from the L-PLS regression model, these two variables are almost orthogonal, with r = 0.07 over the six products.
Bibliography
B.F. Kühn, A.K. Thybo, The influence of sensory and physiochemical quality on Danish children’s preferences for apples, Food Qual. Pref. 12, 543–550 (2001).

H. Martens, E. Anderssen, A. Flatberg, L.H. Gidskehaug, M. Hoy, F. Westad, A. Thybo, M. Martens, Regression of a data matrix on descriptors of both its rows and of its columns via latent variables: L-PLSR, Computational Statistics & Data Analysis 48, 103–123 (2005).

A.K. Thybo, B.F. Kühn, H. Martens, Explaining Danish children’s preferences for apples using instrumental, sensory and demographic/behavioral data, Food Qual. Pref. 15, 53–63 (2004).
Tutorial M: Variable selection and model stability
Learn how to use the Uncertainty Test results in practice.
• Description
  o What you will learn
  o Data table
• Create a PLS model
• Interpret a PLS model
  o Variance plot
  o Score plot
  o Loading plot
  o Weighted regression coefficients
  o Stability plots
    § Stability in loading weights plots
    § Stability in score plots
• Conclusions
Description
In this work environment study, PLS regression was used to model 34 samples corresponding to 34 departments in a company. The data were collected from a questionnaire about overall job satisfaction (Y), modeled from 26 questions (X1, X2, …, X26) about repetitive tasks, inspiration from the boss, helpful colleagues, positive feedback from the boss, etc. The unit for these questions was the percentage of people in each department who ticked “yes”, e.g. “I can decide the pace of my work”. The response variable was the overall job satisfaction, on a scale from 1 to 9.
What you will learn
This tutorial contains the following parts:
• PLS regression
• Validation methods
• Uncertainty estimates
• Interpretation of plots
This tutorial is also presented differently than the other tutorials, with less detailed instructions for each task, making it slightly more demanding.
Data table
Click the following link to import the Tutorial M data set used in this tutorial. The data already have several row and column sets defined, but you must define the column set for the response variable, job satisfaction.
Create a PLS model
Click Tasks - Analyze - Partial Least Squares Regression to run a PLS regression and choose the following settings:
Model inputs
• Predictors: X: Tutorial M, Rows: all, Cols: XData
• Responses: Y: Tutorial M, Rows: all, Cols: Job satisfaction
• Maximum components: 7
• Mean center data: Enable tick box

X Weights (1/SDev)
Select all the variables, select the radio button A/(SDev+B), and click Update.

Y Weights (1/SDev)
Select “Job satisfaction”, select the radio button A/(SDev+B), and click Update.

Validation
Full cross-validation. Click on the Setup… button to select this option. Select the Uncertainty test for the optimal number of factors.
Select Uncertainty test
Click on OK when everything is set.
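The full cross-validation selected above refits the model once per sample, each time leaving that sample out and predicting it. The resampling loop can be sketched as follows; ordinary least squares stands in for PLS here, and the 34 "departments" are invented data:

```python
import numpy as np

# Leave-one-out (full) cross-validation. Each sample is left out once,
# the model is refit on the rest, and the held-out sample is predicted.
def loo_cv_predictions(X, y):
    n = len(y)
    y_pred = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                          # drop sample i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        y_pred[i] = X[i] @ b                              # predict sample i
    return y_pred

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(34), rng.normal(size=(34, 3))])  # 34 samples
y = X @ np.array([5.0, 1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=34)
print(np.corrcoef(loo_cv_predictions(X, y), y)[0, 1])     # close to 1
```

The 34 refitted submodels are exactly what the Uncertainty test reuses in the jack-knifing step described below the variance plot.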
Interpret a PLS model
The Unscrambler® regression overview gives by default the Score plot (Factor 1 vs. Factor 2), the X- and Y-loadings plot (Factor 1 vs. Factor 2), the explained variance, and the Predicted vs. Measured plot for 2 factors for this PLS regression model.
Variance plot
The initial model indicated 2 factors as the optimal model dimension by full cross-validation. The cross-validation has thus created 34 submodels, each with one sample left out. As a second step, the uncertainties of the various model parameters for all X-variables were then estimated by jack-knifing, based on a two-factor model.
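The jack-knifing reuses the 34 leave-one-out submodels: the spread of a parameter across the submodels gives its uncertainty estimate. A simplified sketch for a single regression slope, with invented data; the actual Uncertainty test applies this idea to the PLS model parameters, and the exact variance formula used by the software may differ:

```python
import numpy as np

# Jack-knife standard error of a regression slope from leave-one-out refits.
def jackknife_se(x, y):
    n = len(y)
    slopes = np.array([np.polyfit(np.delete(x, i), np.delete(y, i), 1)[0]
                       for i in range(n)])    # one slope per submodel
    full = np.polyfit(x, y, 1)[0]             # slope of the full model
    # combine squared deviations from the full-model slope
    se = np.sqrt((n - 1) / n * np.sum((slopes - full) ** 2))
    return full, se

rng = np.random.default_rng(1)
x = rng.normal(size=34)
y = 2.0 * x + rng.normal(scale=0.5, size=34)
b, se = jackknife_se(x, y)
print(b, "+/-", 1.96 * se)    # approximate 95% confidence interval
```

If the interval b ± 1.96·se crosses zero, the parameter would not be flagged as significant at the 5% level, which is the criterion used for the regression coefficients below.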
In the variance plot the validation curve (red) shows 62% explained variance for 2 factors, which is rather good for data of this kind.
Plot of explained y-variance
Score plot
The score plot shows that the samples are well distributed with no apparent outliers.
Plot of scores
Loading plot
The relations between all the variables are more easily interpreted in the correlation loadings plot than in the loadings plot, as the explained variance can be read directly from the plot: the inner circle depicts 50% explained variance and the outer circle 100%.
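Correlation loadings are the correlations between each variable and each score vector, which is why the circle radii translate directly into explained variance (radius² = fraction of the variable's variance explained). A sketch under that definition, with invented data and PCA scores standing in for the PLS factors:

```python
import numpy as np

# Correlation loadings: correlation of each variable with each score vector.
def correlation_loadings(X, scores):
    return np.array([[np.corrcoef(X[:, k], scores[:, a])[0, 1]
                      for a in range(scores.shape[1])]
                     for k in range(X.shape[1])])

rng = np.random.default_rng(2)
X = rng.normal(size=(34, 5))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]             # first two principal-component scores
R = correlation_loadings(X, scores)
# squared radius of each variable = variance explained by the two factors
print((R ** 2).sum(axis=1))
```

A variable plotted between the two circles is therefore at least 50% explained by the two factors shown.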
Activate the X-Loadings plot by clicking in it, then use the corresponding shortcut button on the toolbar; it will display the two circles.
The most important variables for job satisfaction (Y) seem to be related to how the employees evaluate their leader. Questions related to the work span the direction from upper left to lower right in the plot.
Plot of correlation loadings
The variables found significant are marked with circles in the loading plot. If they are not shown by default, activate the marking of the significant variables using the corresponding toolbar button.
Although the variable pattern can be interpreted in the correlation loadings, the importance of the variables is better summarized in terms of the regression coefficients in this case. Recall that the loadings describe the structure in X and Y whereas the loading weights are more relevant to interpret for the importance in modeling Y. Alternatively, the predefined plots under the weighted regression coefficients may be investigated.
Weighted regression coefficients
Click on the regression coefficient plot in the navigator.
Regression coefficient plot in the navigator
The automatic function Mark significant variables clearly shows which variables have a significant effect on Y.
When plotting the regression coefficients one can also plot the estimated uncertainty limits as an approximate 95% confidence interval as shown below.
Plot of the weighted regression coefficients
For example, the variable “disrespect” has uncertainty limits crossing the zero line: it is not significant at the 5% level. Zoom in with Ctrl + right click to see the details.
13 out of the 26 X-variables are found to be significant at the 5% level. However, nothing prevents setting the cut-off at another level, depending on the application. A variable with a large regression coefficient may still not be significant, if the uncertainty estimate indicates that the relation between the variable and Y is due to only a few samples spanning the range. One effective way to visualize this is the stability plot.
The corresponding p-values are given in the output node, in the validation folder.
p-values for the regression coefficients
Stability plots
Stability in loading weights plots
Go back to the loading plot. By clicking the toolbar button Stability plot the model stability is clearly visualized.
Stability in loading weights plots
Variable 11, “Help”, is not very stable: the two departments 15 and 26 have much lower values than the others and are thus influential for this variable. This indicates that the variable is probably not reliable for predicting “Job satisfaction”.
This can be studied in a scatter plot of “Help” versus “Job satisfaction”.

To plot it, go back to the data table “Work environment case” and select column 11 “Help” as well as column 27 “Job satisfaction” (hold Ctrl to select both).

Then go to Plot - Scatter or click on the scatter plot icon.
“Help” versus “job satisfaction”
This plot shows that the variable X11 “Help” (Do you find your colleagues helpful?) is only weakly correlated with “Job satisfaction”. The two suspicious departments are influential in this relation.
Stability in score plots
Go back to the score plot. By clicking the toolbar button Stability plot the model stability is clearly visualized.
Stability plot of scores
For each sample one can see a swarm of its scores from each submodel. There are 34 sample swarms. In the middle of each swarm is the score for the sample in the total model.
Clicking on any point gives information about the segment. Thus, in the case of full cross-validation one can directly see how the model changes when a particular sample is kept out. In other words, a sample that makes the model change when it is left out has influenced all the other submodels due to its uniqueness.
The score and loading stability plots are also very useful for higher factors in models as they indicate when noise is becoming the main source for a specific component.
Conclusions
In the work environment example, the global picture from the score stability plot shows that all samples seem good and the model seems robust. Also, the uncertainty test indicates 13 significant variables at the 5% level, as visualized with the 95% confidence intervals.