See Inside Multivariate Data Analysis


  • 8/12/2019 See Inside Multivariate Data Analysis


    Multivariate Data Analysis in Practice, 5th Edition

    An Introduction to

    Multivariate Data Analysis and Experimental Design

    Kim H. Esbensen

    Ålborg University, Esbjerg

    with contributions from

    Dominique Guyot

    Frank Westad

    Lars P. Houmøller

    www.camo.com

    CAMO Software AS.

    Nedre Vollgate 8, N-0158,

    Oslo,

    NORWAY

    Tel: (47) 223 963 00

    Fax: (47) 223 963 22

    CAMO Software Inc.

    One Woodbridge Center, Suite 319,

    Woodbridge, NJ 07095,

    USA

    Tel: (732) 726 9200

    Fax: (973) 556 1229

    CAMO Software India Pvt. Ltd.

    14 & 15, Krishna Reddy Colony, Domlur Layout,

    Bangalore - 560 071,

    INDIA

    Tel: (91) 80 4125 4242

    Fax: (91) 80 4125 4181


    This book was produced using Doc-To-Help together with Microsoft Word. Visio and Excel were used to make some of the illustrations. The screen captures were taken with Paint Shop Pro.

    Trademark Acknowledgments

    Doc-To-Help is a trademark of WexTech Systems, Inc.

    Microsoft is a registered trademark and Windows 95, Windows NT, Excel and Word are trademarks of the Microsoft Corporation.

    Paint Shop Pro is a trademark of JASC, Inc.

    Visio is a trademark of the Shapeware Corporation.

    Information in this book is subject to change without notice. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of CAMO Process AS.

    ISBN 82-993330-3-2

    © 1994-2002 CAMO Process AS. All rights reserved.

    5th edition. Re-print December 2004


    Preface

    October 2001

    Learning to do multivariate data analysis is in many ways like learning to drive a car: you are not let loose on the road without mandatory training, theoretical and practical, as required by current concern for traffic safety. As a minimum you need to know how a car functions and you need to know the traffic code. On the other hand, everybody would agree that it is only after having obtained your driver's license that the real practical learning begins. This is when your personal experience really starts to accumulate. There is a strong interaction between the theory absorbed and the practice gained in this secondary, personal training period.

    Please substitute "multivariate data analysis" for "driving a car" in all of the above. In this context, too, you are not let out on the data analytical road without mandatory training, theoretical and practical. The analogy is actually very apt!

    This book presents a basic theoretical foundation for bilinear (projection-based) multivariate data modeling and gives a conceptual framework for starting to do your own data modeling on the data sets provided. There are some 25 data sets included in this training package. By doing all the exercises included, you're off to a flying start!

    Driving your newly acquired multivariate data analysis car is very much an evolutionary process: this introductory textbook is filled with illustrative examples, many practical exercises and a full set of self-examination real-world data analysis problems (with corresponding data sets). If, after all of this, you are able to work confidently on your own applications, you'll have reached the goal set for this book.


    This is the 5th revised edition of this book. The first three editions were mainly reprints, the only major change being the inclusion of a completely revised chapter on "Introduction to experimental design", which first appeared in the 3rd edition (CAMO). The 4th revised edition, however (published March 2000), saw many major extensions and improvements:

    Text completely rewritten by the senior author, based on five years of extensive use in teaching at both university and dedicated course levels. More than 5,500 copies in use.

    30% new theory and text material added, reflecting extensive student response; full integration of PCA, PLS1 and PLS2 NIPALS algorithms and explanations.

    Text revised with an augmented self-learning objective throughout.

    Four new master data sets added (with extended self-exercise potential):

    1. Master violin data (PCA/PLS)
    2. Norwegian car dealerships (PCA/PLS)
    3. Vintages (PCA/PLS)
    4. Acoustic chemometric calibration (PCR/PLS)

    Additional chapter on experimental design: new features include mixture designs and D-optimal designs.

    New chapter on the powerful, novel Martens' Uncertainty Test.

    Comprehensive glossary of terms.

    This 5th edition also includes essential additional revisions and improvements:

    Lars P. Houmøller, Ålborg University Esbjerg, has carried out a complete work-through of all demonstrations and exercises. Many of these had not been updated with respect to several of the intervening UNSCRAMBLER software versions. We are happy to have finally eliminated this most frustrating nuisance.


    About the authors

    Kim H. Esbensen, Ph.D., has more than 20 years of experience in multivariate data analysis and applied chemometrics. He was professor in chemometrics at the Norwegian Telemark Institute of Technology (HIT/TF), Institute of Process Technology (PT) 1995-2001, where he was also head of the Chemometrics Department, Tel-Tek, Telemark Industrial R&D Center, Porsgrunn. Between these institutions he founded ACRG: the Applied Chemometrics Research Group, HIT/TF-Tel-Tek, which among other things hosted SSC6, the 6th Scandinavian Symposium on Chemometrics, in August 1999, as well as numerous other international courses, workshops and meetings.

    On July 1st, 2001 he moved to a position as research professor in Applied Chemometrics at Ålborg University, Esbjerg, Denmark (AUE), where he is currently leading ACACSRG: the Applied Chemometrics, Analytical Chemistry and Sampling Research Group. As the name implies, applied chemometrics activities continue in Esbjerg while new activities are added, most notably through close collaboration with assoc. prof. Lars P. Houmøller, who independently built up the area of analytical chemistry/chemometrics at AUE before Prof. Esbensen's arrival. Most recently the discipline of sampling (proper sampling) has been added, in recognition of the immense importance of sampling in any data analytical discipline, including chemometrics.

    Kim H. Esbensen has published more than 60 papers and technical reports on a wide range of chemical, geochemical, industrial, technological, remote sensing, image analytic and acoustic chemometric applications. Together with Paul Geladi he has been instrumental in co-developing the concept of Multivariate Image Analysis (MIA); with ACRG he pioneered the development of the novel area of acoustic chemometrics.

    His M.Sc. is from the University of Aarhus, Denmark in 1978 (geology, geochemistry), while a Ph.D. was conferred on him by the Technical University of Denmark (DTH) in 1981 within the areas of metallurgy, meteoritics and multivariate data analysis. He then did post-doctoral work for two years with the Research Group for Chemometrics at the University of Umeå 1980-1981, after which he worked in a Swedish geochemical exploration company, Terra Swede, for two more years. Moving to Norway, this was followed by eight years as data analytical research scientist at the Norwegian Computing Center (NCC), Oslo, after which he became a senior research scientist at SINTEF, the Norwegian Foundation for Industrial and Technological Research, for four additional years. In between these two assignments he was a visiting guest professor at Norsk Hydro's Research Center in Bergen, Norway. He also holds a position as Chercheur associé (now Chercheur affilié) du Centre de Recherche en Géomatique, Université Laval, Quebec. He is a member of the editorial board of Journal of Chemometrics, Wiley Publishers, and is a member of ICS, AGU and several other geological, data analytical and statistical associations.

    Dominique Guyot, educated in Statistics, Economics and Biomathematics (ENSAE and Université de Paris 7, France), has 15 years of experience in the field of chemometrics. She gained industrial experience from her work in the pharmaceutical and cosmetic industries, before joining CAMO from 1995 until 2000. With CAMO, Dominique worked as a Senior Consultant, and was particularly involved in food applications. She put together a practical strategy for efficient product development, based on experimental design and multivariate data analysis. This strategy was implemented in the Guideline+ software package, complemented by an integrated training course focusing on multivariate methods for food product developers. Dominique is now studying music and singing at the Conservatoire of Trondheim, Norway.

    Frank Westad has an M.Sc. in physical chemistry from the University of Trondheim, Norway. He has 13 years of experience in applied multivariate data analysis, and he completed a Ph.D. in multivariate regression in 2000. Frank has given numerous courses in experimental design and multivariate analysis for companies in Europe and in the U.S.A. His main research fields include variable selection, shift modelling and image analysis.

    Lars P. Houmøller has an M.Sc. in chemistry and physics from the University of Aarhus, Denmark. He has 12 years of experience in analytical chemistry and has worked 5-7 years with chemometrics. His teaching experience includes chemometrics, analytical chemistry, spectroscopy, physical chemistry, general and technical chemistry, organic and inorganic chemistry, unit operations and fluid dynamics. His research field covers NIR spectroscopic applications over a very broad industrial spectrum. He also has experience from working in the Danish food production industry.


    E-mail interaction with the authors:

    Kim Esbensen [email protected]

    Dominique Guyot [email protected]

    Frank Westad [email protected]

    Lars P. Houmøller [email protected]

    About this book

    Since 1986, when CAMO ASA first commercialized and started marketing THE UNSCRAMBLER, many customers have asked for basic, easy-to-understand literature on chemometrics. In 1993 a group of data analysts at different competence levels was invited to a one-day seminar at CAMO, Trondheim, to discuss their experience from both learning and teaching chemometrics. The result was a blueprint outline for what came to be this introductory book: the specifications called for a comprehensive training package, involving basic, practical, easy-to-read, largely non-mathematical theory, with plenty of hands-on examples and exercises on real-world data sets. CAMO contracted SINTEF to write this book (first three editions), and the parties agreed to cooperate on the completion of the full training package.

    In the intervening years, some 4,500 copies of this book were published, and it was used for introductory basic training in some 15 universities and in several hundred industrial companies; reactions were many and largely constructive. We learned a lot from these criticisms; we thank all who contributed!

    By 1999, the time was ripe for a complete revision of the entire package. This was undertaken by the senior author in the summer of 1999 with significant assistance from his then Ph.D. student Jun Huang (now with CAMO, Norway); Frank Westad (Matforsk), who wrote chapter 14 (Martens' Uncertainty Test); Dominique Guyot (CAMO), who wrote the original, entirely new chapter 17 (Complex Experimental Design Problems); and with further invaluable editorial and managerial contributions from Michael Byström (CAMO) and Valérie Lengard (CAMO). A most sincere thank you goes to Peter Hindmarch (CAMO, UK) for very effective linguistic streamlining of the 4th edition! The authors and CAMO also take this opportunity to acknowledge Suzanne Schönkopf's (CAMO) contribution to editions previous to the 4th one. The present edition of this book still bears the fruit of her very important past efforts.

    The publication of the 4th edition, in March 2000, was unfortunately somewhat marred by a less than complete revision of the exercises and illustrative UNSCRAMBLER runs in the book, which was not considered fatal at the time. This soon proved to be a serious mistake; disappointment and frustration from several generations of students, who wanted to follow all the exercises closely, followed rapidly. A Danish university teacher who had himself experienced this frustration close up when using the book for his own teaching, assoc. prof. Lars P. Houmøller at Ålborg University, Esbjerg, voluntarily took it upon himself to carry out a complete work-through of this essential didactic aspect of the book. His very valuable demo and exercise revisions, as well as a very thorough text consistency check, have now been included in toto in the 5th edition.

    Today, this book is a collaborative effort between the senior author and CAMO Process AS; the tie with SINTEF is now defunct.

    There is little academic glamour in writing an introductory level textbook, as the senior author has well experienced - which was never the goal anyway. But on the other hand, the introductory level is definitely where the largest audience and potential market exist, as CAMO has well experienced. The senior author has used the book for six consecutive years teaching introductory chemometrics, largely to engineering (M.Sc.) students, as well as for extensive course work in industrial and foreign university environments. The response from an accumulated 500 or so students has made this author happy, while some 5,500 sales have made CAMO equally satisfied.

    Thus all is well with the training package! We hope that this revised 5th edition will continue to meet the challenging demands of the market, hopefully now in an improved form. Writing for precisely this introductory audience/market constitutes the highest scientific and didactic challenge, and is thus (still) irresistible!


    Acknowledgements

    The authors wish to thank the following persons, institutions and companies for their very valuable help in the preparation of this training package:

    Hans Blom, Østlandskonsult AS, Fredrikstad, Norway
    Frode Brakstad, Norsk Hydro F-Center, Porsgrunn, Norway
    Rolf Carlson, Department of Chemistry, University of Tromsø, Norway
    Chevron Research & Technology Co, Richmond, CA, USA
    Lennart Eriksson, Dept. of Organic Chemistry, University of Umeå, Sweden (now with Umetrics, Inc.)
    Professor Magni Martens, The Royal Veterinary & Agricultural University, Denmark
    Geological Survey of Greenland, Denmark
    IKU, Institute for Petroleum Research, Trondheim, Norway
    Norwegian Food Research Institute (MATFORSK), Ås, Norway
    Norwegian Society of Process Control
    Norwegian Chemometrics Society
    International Chemometrics Society
    UOP Guided Wave, CA, USA
    Pierre Gy, Cannes, France (for a gentleman's introduction to the finest French wines)
    Zander & Ingerström, Oslo, Norway
    Tomas Öberg Konsult AB, Karlskoga, Sweden
    KAPITAL (weekly Norwegian economic magazine), no 14/1994, p 50-55
    Hlif Sigurjonsdottir, Reykjavik, Iceland (owner of G. Sgarabotto violin no 9)
    Birgitta Spur, LSO, Reykjavik, Iceland (permission to use the Sgarabotto oeuvre data)
    Sensorteknikk A/S, Bærum, Oslo (Bjørn Hope: sensor technology entrepreneur extraordinaire; Evy: for innumerable occasions: warm company, coffee and waffles, waffles, waffles)
    Thorbjørn T. Lied, Maths Halstensen, Tore Gravermoen, Rune Mathisen and others (for enormous help in developing acoustic chemometrics)
    Anonymous wine importer, Odense, Denmark
    Helpful wine assessors (partly anonymous), Manson, WA, USA

    Finally, the author(s) and CAMO wish to thank all THE UNSCRAMBLER users during the last seven years for their close relationships with us, which have given us so much added experience in teaching multivariate data analysis. And thanks for all the constructive criticism of the earlier editions of this book. Last, but certainly not least, a warm thank you to all the students at HIT/TF, at Ålborg University, Esbjerg and many, many others, who have been associated with the teachings of the authors, nearly all of whom have been very constructive in their ongoing criticism of the entire teaching system embedded in this training package. We even learned from the occasional not-so-friendly criticisms.

    Communication

    After a formative period of seven years, the training package has come of age. By now we are actually beginning to be rather satisfied with it!

    And yet: the author(s) and CAMO always welcome all critical responses to the present text. They are seriously needed in order for this work to keep improving.


    Contents

    1. Introduction to Multivariate Data Analysis - Overview
      1.1 Indirect Observations and Correlation
      1.2 Hidden Data Structures
      1.3 Multivariate Data Analysis vs. Multivariate Statistics
      1.4 Main Objectives of Multivariate Data Analytical Techniques
      1.5 Multivariate Techniques as Projections

    2. Getting Started - with Descriptive Statistics
      2.1 Purpose
      2.2 Data Set 1: Quality of Green Peas
      2.3 Data Set 2: Economic Characteristics of Car Dealerships in Norway

    3. Principal Component Analysis (PCA) - Introduction
      3.1 Representing the Data as a Matrix
      3.2 The Variable Space - Plotting Objects in p Dimensions
      3.3 Plotting Objects in Variable Space
        3.3.1 Exercise - Plotting Raw Data (People)
      3.4 The First Principal Component
      3.5 Extension to Higher-Order Principal Components
      3.6 Principal Component Models - Scores and Loadings
        3.6.1 Model Center
        3.6.2 Loadings - Relations Between X and PCs
        3.6.3 Scores - Coordinates in PC Space
        3.6.4 Object Residuals
      3.7 Objectives of PCA
      3.8 Score Plot - Map of Samples
      3.9 Loading Plot - Map of Variables


      3.10 Exercise: Plotting and Interpreting a PCA-Model (People)
      3.11 PC-Models
        3.11.1 The PC Model: X = TP^T + E = Structure + Noise
        3.11.2 Residuals - The E-Matrix
        3.11.3 How Many PCs to Use?
        3.11.4 Variable Residuals
        3.11.5 More about Variances - Modeling Error Variance
      3.12 Exercise - Interpreting a PCA Model (Peas)
      3.13 Exercise - PCA Modeling (Car Dealerships)
      3.14 PCA Modeling - The NIPALS Algorithm

    4. Principal Component Analysis (PCA) - In Practice
      4.1 Scaling or Weighting
      4.2 Outliers
        4.2.1 Scaling, Transformation and Normalization are Highly Problem Dependent Issues
      4.3 PCA Step by Step
        4.3.1 The Unscrambler and PCA
      4.4 Summary of PCA
        4.4.1 Interpretation of PCA-Models
        4.4.2 Interpretation of Score Plots - Look for Patterns
        4.4.3 Summary - Interpretation of Score Plots
        4.4.4 Summary - Interpretation of Loading Plots
      4.5 PCA - What Can Go Wrong?
      4.6 Exercise - Detecting Outliers (Troodos)

    5. PCA Exercises - Real-World Application Examples
      5.1 Exercise - Find Clusters (Iris Species Discrimination)
      5.2 Exercise - PCA for Experimental Design (Lewis Acids)
      5.3 Exercise - Mud Samples
      5.4 Exercise - Scaling (Troodos)

    6. Multivariate Calibration (PCR/PLS)
      6.1 Multivariate Modeling (X, Y): The Calibration Stage
      6.2 Multivariate Modeling (X, Y): The Prediction Stage
      6.3 Calibration Set Requirements (Training Data Set)
      6.4 Introduction to Validation
      6.5 Number of Components (Model Dimensionality)
      6.6 Univariate Regression (y|x) and MLR


        6.6.1 Univariate Regression (y|x)
        6.6.2 Multiple Linear Regression, MLR
      6.7 Collinearity
      6.8 PCR - Principal Component Regression
        6.8.1 Exercise - Interpretation of Jam (PCR)
        6.8.2 Weaknesses of PCR
      6.9 PLS-Regression (PLS-R)
        6.9.1 PLS - A Powerful Alternative to PCR
        6.9.2 PLS (X,Y): Initial Comparison with PCA(X), PCA(Y)
        6.9.3 PLS2 NIPALS Algorithm
        6.9.4 Interpretation of PLS Models
        6.9.5 The PLS1 NIPALS Algorithm
        6.9.6 Exercise - Interpretation of PLS1 (Jam)
        6.9.7 Exercise - Interpretation PLS2 (Jam)
      6.10 When to Use which Method?
        6.10.1 Exercise - Compare PCR and PLS1 (Jam)
      6.11 Summary

    7. Validation: Mandatory Performance Testing
      7.1 The Concept of Test Set Validation
        7.1.1 Calculating the Calibration Variance (Modeling Error)
        7.1.2 Calculating the Validation Variance (Prediction Error)
        7.1.3 Studying the Calibration and Validation Variances
      7.2 Requirements for the Test Set
      7.3 Cross Validation
      7.4 Leverage Corrected Validation

    8. How to Perform PCR and PLS-R
      8.1 PLS and PCR - Step by Step
      8.2 Optimal Number of Components in Modeling
      8.3 Information in Later PCs
      8.4 Exercises on PLS and PCR: the Heart-of-the-Matter!
        8.4.1 Exercise - PLS2 (Peas)
        8.4.2 Exercise - PLS1 or PLS2? (Peas)
        8.4.3 Exercise - Is PCR better than PLS? (Peas)

    9. Multivariate Data Analysis in Practice: Miscellaneous Issues
      9.1 Data Constraints


        9.1.1 Data Matrix Dimensions
        9.1.2 Missing Data
      9.2 Data Collection
        9.2.1 Use Historical Data
        9.2.2 Monitoring Data from an On-Going Process
        9.2.3 Data Generated by Planned Experiments
        9.2.4 Perform Experiments or Collect Data - Always by Careful Reflection
        9.2.5 The Random Design - A Powerful Alternative
      9.3 Selecting from Abundant Data
        9.3.1 Selecting a Calibration Data Set from Abundant Training Data
        9.3.2 Selecting a Validation Data Set
      9.4 Error Sources
      9.5 Replicates - A Means to Quantify Errors
      9.6 Estimates of Experimental - and Measurement Errors
        9.6.1 Error in Y (Reference Method): Reproducibility
        9.6.2 Stability over Consecutive Measurements: Repeatability
      9.7 Handling Replicates in Multivariate Modeling
      9.8 Validation in Practice
        9.8.1 Test Set
        9.8.2 Cross Validation
        9.8.3 Leverage Correction
        9.8.4 The Multivariate Model - Validation Alternatives
      9.9 How Good is the Model: RMSEP and Other Measures
        9.9.1 Residuals
        9.9.2 Residual Variances (Calibration, Prediction)
        9.9.3 Correction for Degrees of Freedom
        9.9.4 RMSEP and RMSEC - Average, Representative Errors in Original Units
        9.9.5 RMSEP, SEP and Bias
        9.9.6 Comparison Between Prediction Error and Measurement Error
        9.9.7 Compare RMSEP for Different Models
        9.9.8 Compare Results with Other Methods
        9.9.9 Other Measures of Errors
      9.10 Prediction of New Data
        9.10.1 Getting Reliable Prediction Results
        9.10.2 How Does Prediction Work?
        9.10.3 Prediction Used as Validation


        9.10.4 Uncertainty at Prediction
        9.10.5 Study Prediction Objects and Training Objects in the Same Plot
      9.11 Coding Category Variables: PLS-DISCRIM
      9.12 Scaling or Weighting Variables
      9.13 Using the B- and the Bw-Coefficients
      9.14 Calibration of Spectroscopic Data
        9.14.1 Spectroscopic Data: Calibration Options
        9.14.2 Interpretation of Spectroscopic Calibration Models
        9.14.3 Choosing Wavelengths

    10. PLS (PCR) Exercises: Real-World Application Examples - I
      10.1 Exercise - Prediction of Gasoline Octane Number
      10.2 Exercise - Water Quality
      10.3 Exercise - Freezing Point of Jet Fuel
      10.4 Exercise - Paper

    11. PLS (PCR) Multivariate Calibration - In Practice
      11.1 Outliers and Subgroups
        11.1.1 Scores
        11.1.2 X-Y Relation Outlier Plots (T vs. U Scores)
        11.1.3 Residuals
        11.1.4 Dangerous Outliers or Interesting Extremes?
      11.2 Systematic Errors
        11.2.1 Y-Residuals Plotted Against Objects
        11.2.2 Residuals Plotted Against Predicted Values
        11.2.3 Normal Probability Plot of Residuals
      11.3 Transformations
        11.3.1 Logarithmic Transformations
        11.3.2 Spectroscopic Transformations
        11.3.3 Multiplicative Scatter Correction
        11.3.4 Differentiation
        11.3.5 Averaging
        11.3.6 Normalization
      11.4 Non-Linearities
        11.4.1 How to Handle Non-Linearities?
        11.4.2 Deleting Variables
      11.5 Procedure for Refining Models


      11.6 Precise Measurements vs. Noisy Measurements
      11.7 How to Interpret the Residual Variance Plot
      11.8 Summary: The Unscrambler Plots Revealing Problems

    12. PLS (PCR) Exercises: Real-World Applications - II
      12.1 Exercise - Log-Transformation (Dioxin)
      12.2 Exercise - Multiplicative Scatter Correction (Alcohol)
      12.3 Exercise - Dirty Data (Geologic Data with Severe Uncertainties)
      12.4 Exercise - Spectroscopy Calibration (Wheat)
      12.5 Exercise - QSAR (Cytotoxicity)

    13. Master Data Sets: Interim Examination
      13.1 Sgarabotto Master Violin Data Set
      13.2 Norwegian Car Dealerships - Revisited
      13.3 Vintages
      13.4 Acoustic Chemometrics (a. c.)

    14. Uncertainty Estimates, Significance and Stability (Martens' Uncertainty Test)
      14.1 Uncertainty Estimates in Regression Coefficients, b
      14.2 Rotation of Perturbed Models
      14.3 Variable Selection
      14.4 Model Stability
        14.4.1 Introduction
        14.4.2 An Example Using the Paper Data
      14.5 Exercise - Paper - Uncertainty Test and Model Stability

    15. SIMCA: An Introduction to Classification
      15.1 SIMCA - Fields of Use
      15.2 How to Make SIMCA Class-Models?
        15.2.1 Basic SIMCA Steps: A Standard Flow-Sheet
      15.3 How Do We Classify New Samples?
      15.4 Classification Results
        15.4.1 Statistical Significance Level and its Use: An Introduction
      15.5 Graphical Interpretation of Classification Results
        15.5.1 The Coomans Plot
        15.5.2 The Si vs. Hi Plot (Distance vs. Leverage)


        15.5.3 Si/S0 vs. Hi
        15.5.4 Model Distance
        15.5.5 Variable Discrimination Power
        15.5.6 Modeling Power
      15.6 SIMCA-Exercise - IRIS Classification

    16. Introduction to Experimental Design
      16.1 Experimental Design
      16.2 Screening Designs
        16.2.1 Full Factorial Designs
        16.2.2 Fractional Factorial Designs
        16.2.3 Plackett-Burman Designs
      16.3 Analyzing a Screening Design
        16.3.1 Significant Effects
        16.3.2 Using F-Test and P-Values to Determine Significant Effects
        16.3.3 Exercise - Willgerodt-Kindler Reaction
      16.4 Optimization Designs
        16.4.1 Central Composite Designs
        16.4.2 Box-Behnken Designs
      16.5 Analyzing an Optimization Design
        16.5.1 Exercise - Optimization of Enamine Synthesis
      16.6 Practical Aspects of Making an Experimental Design
      16.7 Extending a Design
      16.8 Validation of Designed Data Sets
      16.9 Problems in Designed Data Sets
        16.9.1 Detect and Interpret Effects
        16.9.2 How to Separate Confounded Effects?
        16.9.3 Blocking and Repeated Response Measurements
        16.9.4 Fold-Over Designs
        16.9.5 What Do We Do if We Cannot Keep to the Planned Variable Settings?
        16.9.6 A Random Design
        16.9.7 Modeling Uncoded Data
      16.10 Exercise - Designed Data with Non-Stipulated Values (Lacotid)
      16.11 Experimental Design Procedure in The Unscrambler

    17. Complex Experimental Design Problems


      17.1 Introduction to Complex Experimental Design Problems
        17.1.1 Constraints Between the Levels of Several Design Variables
        17.1.2 A Special Case: Mixture Situations
        17.1.3 Alternative Solutions
      17.2 The Mixture Situation
        17.2.1 An Example of Mixture Design
        17.2.2 Screening Designs for Mixtures
        17.2.3 Optimization Designs for Mixtures
        17.2.4 Designs that Cover a Mixture Region Evenly
      17.3 How To Deal With Constraints
        17.3.1 Introduction to the D-Optimal Principle
        17.3.2 Non-Mixture D-Optimal Designs
        17.3.3 Mixture D-Optimal Designs
        17.3.4 Advanced Topics
      17.4 How To Analyze Results From Constrained Experiments
        17.4.1 Use of PLS Regression For Constrained Designs
        17.4.2 Relevant Regression Models
        17.4.3 The Mixture Response Surface Plot
      17.5 Exercise - Build a Mixture Design - Wines

    18. Comparison of Methods for Multivariate Data Analysis - And their Validation
      18.1 Comparison of Selected Multivariate Methods
        18.1.1 Principal Component Analysis (PCA)
        18.1.2 Factor Analysis (FA)
        18.1.3 Cluster Analysis (CA)
        18.1.4 Linear Discriminant Analysis (LDA)
        18.1.5 Comparison: Projection Dimensionality in Multivariate Data Analysis
        18.1.6 Multiple Linear Regression (MLR)
        18.1.7 Principal Component Regression (PCR)
        18.1.8 Partial Least Squares Regression (PLS-R)
        18.1.9 Increasing Projection Dimensionality in Regression Modeling
      18.2 Choosing Multivariate Methods Is Not Optional!
        18.2.1 Problem Formulation
      18.3 Unsupervised Methods
      18.4 Supervised Methods


      18.5 A Final Discussion about Validation
        18.5.1 Test Set Validation
        18.5.2 Cross Validation
        18.5.3 Leverage Corrected Validation
        18.5.4 Selecting a Validation Approach in Practice
      18.6 Summary of Basic Rules for Success
      18.7 From Here - You Are on Your Own. Good Luck!

    19. Literature

    20. Appendix: Algorithms
      20.1 PCA
      20.2 PCR
      20.3 PLS1
      20.4 PLS2

    21. Appendix: Software Installation and User Interface
      21.1 Welcome to The Unscrambler
      21.2 How to Install and Configure The Unscrambler
      21.3 Problems You Can Solve with The Unscrambler
      21.4 The Unscrambler Workplace
        21.4.2 The Editor
        21.4.3 The Viewer
        21.4.4 Dockable Views
        21.4.5 Dialogs
        21.4.6 The Help System
        21.4.7 Tooltips
      21.5 Using The Unscrambler Efficiently
        21.5.1 Analyses
        21.5.2 Some Tips to Make Your Work Easier

    Glossary of Terms

    Index