

1214 Volume 54, Number 8, 2000 APPLIED SPECTROSCOPY
0003-7028/00/5408-1214$2.00/0 © 2000 Society for Applied Spectroscopy

Self-Modeling Mixture Analysis by Interactive Principal Component Analysis

DONGSHENG BU and CHRIS W. BROWN*
Department of Chemistry, University of Rhode Island, Kingston, Rhode Island 02881

A key procedure for mixture analysis in self-modeling methods is to identify a pure wavelength (or pure variable) for each component in the mixture. A pure wavelength has intensity contributions from only one of the components in a mixture. In this paper, an interactive approach based on principal component analysis (IPCA) is presented for the pure wavelength selection. The approach is developed from a combination of key set factor analysis (KSFA) and SIMPLISMA (simple-to-use interactive self-modeling mixture analysis). Since all significant principal components are included and user interaction is available during the procedure of selecting pure wavelengths, this new approach effectively resolves complicated mixture data containing highly overlapping and nonlinear absorptivities. Moreover, the noise level of the original spectra is determined from secondary principal components and used in the scaling so that pure wavelength selection reflects the signal-to-noise ratio in the data. Simulated three-component mixture spectra are used to demonstrate the IPCA method; this is followed by a general approach for analyzing an esterification reaction using mid-infrared data. The KSFA, SIMPLISMA, and IPCA methods are compared by analyzing a set of near-infrared spectra of methane, ethane, and propane mixtures. Results from the three pure wavelength methods are used as inputs to the method of alternating least-squares to produce predicted spectra very similar to the spectra of the pure components.

Index Headings: Self-modeling mixture analysis; Principal component analysis; Pure wavelengths; Experimental noise.

INTRODUCTION

The fundamental assumption for mixture analysis is that the spectrum of a mixture behaves as a linear combination of pure-component spectra. The spectral data matrix, A, for a set of m mixtures that all contain the same n components but have concentration coefficients from zero to one can be expressed as

$$A = CK \tag{1}$$

where A is an $m \times r$ matrix (m rows of mixture spectra with r columns of wavelengths), and C and K are unknown matrices which represent the $m \times n$ concentration coefficients and the $n \times r$ pure-component spectra, respectively.

The goal of mixture analysis is to solve for the matrices C and K from the known data in A. The mechanisms for performing this operation are generally referred to as self-modeling mixture analysis; these procedures have been successfully applied in several areas such as chemical reaction monitoring¹ and image processing.²,³ The first self-modeling algorithm for two-component systems was presented in 1971,⁴ and several algorithms were developed for general cases in the mid-1980s.⁵⁻⁸ Many of these methods are based on factor analysis or principal component analysis (PCA) of the original mixture spectra.

Received 14 February 2000; accepted 27 April 2000.
* Author to whom correspondence should be sent.

The most straightforward mechanism for determining the C and K matrices from the mixture spectra in A is the alternating least-squares method.⁹ This method starts with guesses for either the C or K matrices; the guesses are usually obtained from PCA. The method applies the constraints that concentration and absorptivity values must be greater than or equal to zero. For example, we might start with guesses for the absorptivity spectra in the K matrix and solve for the C matrix by least-squares regression as

$$C = A K^{t} (K K^{t})^{-1} \quad \text{(regress A onto K)} \tag{2}$$

This estimate of the concentrations is constrained to be greater than or equal to zero and is then used to determine a new estimate for K:

$$K = (C^{t} C)^{-1} C^{t} A \quad \text{(regress A onto C)} \tag{3}$$

These absorptivity spectra are then constrained to have values greater than or equal to zero and used to determine a new estimate for the C matrix. This alternating least-squares processing is continued until convergence is reached. The major difficulty with the alternating least-squares method is in finding suitable starting guesses for the C and K matrices. Several different solutions to Eqs. 2 and 3 are possible depending upon the starting guesses; thus, the better the starting guesses, the better the final results.
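The alternation between Eqs. 2 and 3 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' program: the function name `als_nonneg`, the clipping of negative values to zero as the way of enforcing the constraints, and the stopping rule are all assumptions.

```python
import numpy as np

def als_nonneg(A, K0, n_iter=200, tol=1e-10):
    """Alternate Eqs. 2 and 3, clipping negative values to zero (assumed constraint step)."""
    K = np.clip(np.asarray(K0, dtype=float), 0.0, None)
    prev = np.inf
    for _ in range(n_iter):
        # Eq. 2: C = A K^t (K K^t)^-1, solved as the least-squares problem K^t C^t = A^t
        C = np.linalg.lstsq(K.T, A.T, rcond=None)[0].T
        C = np.clip(C, 0.0, None)
        # Eq. 3: K = (C^t C)^-1 C^t A, solved as the least-squares problem C K = A
        K = np.linalg.lstsq(C, A, rcond=None)[0]
        K = np.clip(K, 0.0, None)
        resid = np.sum((A - C @ K) ** 2)
        # Hypothetical stopping rule: stop when the residual no longer changes
        if abs(prev - resid) <= tol * max(resid, 1.0):
            break
        prev = resid
    return C, K
```

The result depends on the starting guess `K0`, which is exactly the sensitivity described above; the sketch does nothing to remove it.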

Improved guesses for the alternating least-squares method can be obtained by finding pure wavelengths for each of the components in the mixture spectra. A "pure" wavelength is defined as a wavelength that has contributions from only one component or almost only one component. Two methods for obtaining pure wavelengths are key set factor analysis (KSFA) and simple-to-use interactive self-modeling mixture analysis (SIMPLISMA).

The KSFA method was developed by Malinowski¹⁰ during the 1980s. Its strategy is to properly select pure wavelength columns from the principal component (PC) spectra as a rotation matrix and then convert these orthogonal PCs into spectra of pure components. It starts by performing PCA on the set of mixture spectra. The first PC is very nearly the average of the original spectra and represents the most variance in the data set. The second PC, which is orthogonal to the first, represents the next most variance in the data set and, as a consequence, represents some of the major differences among the spectra of the pure components. Each successive PC is orthogonal to the previous PCs and represents the next most variance in the mixture data.

In the KSFA method, the individual PC values at each wavelength are divided by the "length" of the original PCs at that wavelength to form a normalized column of values. The length of a column in the original PCs is the square root of the sum of the squares of values in the column. This normalization emphasizes the regions (or wavelengths) which are most pure. The first pure wavelength corresponds to the wavelength having the largest length (most information) and the smallest average (original PC1 value); this combination produces the lowest value in the first modified PC, and it is selected as the first pure wavelength. The second pure wavelength is selected as the wavelength vector most orthogonal to the first pure wavelength vector in a two-dimensional PC space. This processing continues with successively modified PCs by finding the wavelength vectors most orthogonal to the subspace defined by the previously determined pure wavelength vectors. After all pure wavelength vectors have been sequentially extracted from the lower dimensional space, the pure wavelength assignments are refined by an iteration procedure, involving the replacement of each vector using the full PC space. This iteration refinement, labeled IKSFA, is necessary to ensure that the most orthogonal set of vectors has been found.¹¹ The major difficulty with the KSFA method is dealing with spectral noise, since regions of low baseline absorbances in the original spectra can produce very large random signals in the modified PCs. This problem will be addressed later in the paper.
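The sequential part of this selection can be sketched roughly as follows. This is a loose reading of the description above, not Malinowski's published algorithm: the function name `ksfa_key_set` is hypothetical, "most orthogonal" is interpreted here as "largest absolute determinant in the first j PC coordinates" (an assumption), and the IKSFA refinement step is omitted.

```python
import numpy as np

def ksfa_key_set(V):
    """Pick one key (pure) wavelength index per PC from an n x r matrix of PCs."""
    V = np.asarray(V, dtype=float)
    n, r = V.shape
    # Normalize every wavelength column to unit length
    lengths = np.sqrt((V ** 2).sum(axis=0))
    Vn = V / np.where(lengths > 0, lengths, 1.0)
    # First key: lowest value in the first normalized PC
    keys = [int(np.argmin(np.abs(Vn[0])))]
    # Later keys: assumed criterion, the column that maximizes |det| of the
    # j x j matrix built from the first j PC coordinates of the chosen columns
    for j in range(2, n + 1):
        vols = np.array([abs(np.linalg.det(Vn[:j, keys + [i]])) for i in range(r)])
        keys.append(int(np.argmax(vols)))
    return keys
```

Because no low-absorbance wavelengths are excluded here, noisy baseline columns can dominate the result on real data, which is the weakness noted above.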

In the early 1990s, Windig and Guilment¹² proposed an interactive algorithm referred to as SIMPLISMA. In this method, all the processing is performed on the original spectra without the assistance of PCA. User interaction is available during pure wavelength selection, and this step is necessary for complex data sets. In the SIMPLISMA method, an average spectrum is calculated from the mixture spectra. The difference between the average spectrum and each of the mixture spectra is calculated and used to determine a standard deviation spectrum. The pure wavelengths are determined by dividing the standard deviation spectrum by the sum of the average spectrum and a constant value to represent noise; the limiting noise value is added to the denominator to reduce the effects of noise in the baseline of the standard deviation spectrum. Wavelengths belonging to pure components will have the largest relative standard deviation values and are selected as the pure wavelengths. The first pure wavelength is selected as the one exhibiting the highest relative standard deviation spectrum value. A correlation matrix is calculated from the original mixture spectra and used to remove all the contributions correlating with the first pure wavelength from the relative standard deviation spectrum; this procedure produces the second purity spectrum. The highest value in the second purity spectrum is selected as the next pure wavelength. This process of removing contributions correlating with the pure wavelength is repeated for each pure wavelength until no more useful information is left. The SIMPLISMA method is very effective in selecting the pure wavelengths. The only difficulty with this method is selecting the potential noise level, which is performed in a rather arbitrary manner.
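As a rough illustration of the first SIMPLISMA step only (the relative standard deviation spectrum), a minimal sketch could look like this. The function name `simplisma_first_purity` is hypothetical, and expressing the noise constant as a percentage of the maximum of the average spectrum follows the 1 to 5% convention mentioned later in this paper; the correlation-based removal of later components is not shown.

```python
import numpy as np

def simplisma_first_purity(A, alpha_percent=1.0):
    """Standard deviation spectrum divided by (average spectrum + noise constant)."""
    A = np.asarray(A, dtype=float)
    mean = A.mean(axis=0)   # average spectrum over the m mixtures
    std = A.std(axis=0)     # standard deviation spectrum (population form, an assumption)
    alpha = alpha_percent / 100.0 * mean.max()   # noise offset added to the denominator
    return std / (mean + alpha)
```

The first pure wavelength would then be `int(np.argmax(simplisma_first_purity(A)))`.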

Although current analytical methods produce much better quality spectra, self-modeling analysis methods still encounter difficulties in some real data cases. The success of mixture analysis depends upon a number of factors including the existence of nonlinear absorbances, signal-to-noise ratio in the original spectra, and interactions of components in the mixture.¹³ Windig et al. discussed a method based on interactive principal component analysis;¹⁴ however, its strategy was based on the geometrical plot of the principal component scores rather than visually oriented pure wavelength selection. Herein, a new approach called interactive principal component analysis (IPCA) is proposed for improving self-modeling mixture analysis.

THEORY

Factor analysis or principal component analysis is a method for reducing the dimensionality of spectral data by representing the data efficiently with orthogonal principal components and scores. Details of the PCA process can be found elsewhere.¹⁵ The data matrix, A, can be decomposed as a product of two matrices

$$A = UV \tag{4}$$

where U is an $m \times f$ matrix of principal component scores, V is an $f \times r$ matrix of principal components, and f is the number of PCs. The rows of V are orthonormal; the columns of U are orthogonal, and the sum of squares of the elements in each column is the eigenvalue for that principal component. The first PC accounts for the largest amount of total variance in the mixture set, and it is very similar to the average spectrum. The second PC, which is orthogonal to the first principal component, accounts for the maximum amount of the remaining total variance not already accounted for in the first principal component. The remaining PCs are orthogonal to each of the former principal components, and each accounts for a maximum variation in the remaining residual data. The number of PCs, f, is less than or equal to the smaller of m or r.
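One way to obtain a decomposition with these properties is a singular value decomposition of the data matrix. The sketch below assumes that route and assumes no mean-centering (consistent with the first PC resembling the average spectrum); the function name `pca_scores_loadings` is hypothetical.

```python
import numpy as np

def pca_scores_loadings(A, f=None):
    """Return scores U (m x f), principal components V (f x r), and eigenvalues."""
    A = np.asarray(A, dtype=float)
    U_svd, s, Vt = np.linalg.svd(A, full_matrices=False)
    if f is None:
        f = len(s)
    U = U_svd[:, :f] * s[:f]   # scores: orthogonal columns scaled by singular values
    V = Vt[:f]                 # principal components: orthonormal rows
    eigenvalues = s[:f] ** 2   # sum of squares of each score column
    return U, V, eigenvalues
```

With all PCs retained, `U @ V` reproduces A to rounding error; truncating to the first n PCs gives the noise-filtered approximation used in the rest of the method.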

As outlined later, only the first n PCs are required to describe the variance associated with the data to within experimental error, if the spectra behave according to Beer's law. The secondary set of factors is associated with noise contained in the data. After the number of primary (or significant) principal components is chosen, U has the same dimensions as C, and V has the same dimensions as K, assuming that the number of significant principal components is the same as the number of chemical components.

An artificial mixture data set containing 10 spectra is used to demonstrate the IPCA method. Each mixture spectrum is a combination of three pure spectra with 80 wavelength points. Pure spectrum k1 (Fig. 1) consists of two Gaussian bands centered at wavelengths 14 and 49 with an intensity ratio of 1:10. Pure spectra k2 and k3 each consist of a single Gaussian band, at wavelengths 47 and 51, respectively. Random noise with a peak-to-peak level of 0.2%T was added to each wavelength point in each of the mixture spectra: the simulated absorbance spectra were converted to % transmittance, the 0.2%T random noise was added, and the spectra were converted back to the absorbance domain.
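A data set of this kind can be simulated along the following lines. The band positions, the 1:10 intensity ratio, the 10 x 80 size, and the noise route through % transmittance come from the description above; the band width, the peak heights, the random concentrations (rows summing to one), and the uniform noise distribution are assumptions, and `make_mixtures` is a hypothetical name.

```python
import numpy as np

def gaussian(x, center, width, height=1.0):
    return height * np.exp(-0.5 * ((x - center) / width) ** 2)

def make_mixtures(m=10, r=80, noise_pp_T=0.2, width=3.0, seed=0):
    """Simulate m three-component mixture spectra with noise added in %T."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, r + 1)
    k1 = gaussian(x, 14, width, 0.1) + gaussian(x, 49, width, 1.0)  # bands at 14 and 49, 1:10
    k2 = gaussian(x, 47, width)
    k3 = gaussian(x, 51, width)
    K = np.vstack([k1, k2, k3])
    C = rng.dirichlet(np.ones(3), size=m)     # assumed: random fractions summing to 1
    A_clean = C @ K
    T = 100.0 * 10.0 ** (-A_clean)            # absorbance -> percent transmittance
    T_noisy = T + rng.uniform(-noise_pp_T / 2, noise_pp_T / 2, size=T.shape)
    A = -np.log10(T_noisy / 100.0)            # back to the absorbance domain
    return A, C, K
```

Because the noise is added in transmittance, its size in absorbance grows where the bands are strongest, which is one reason the noise level matters for pure wavelength selection.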

As shown in Fig. 1, $h_1$, $h_2$, and $h_3$ are the pure wavelengths for the three pure components. Ideally, only one element at each pure wavelength column in K is nonzero, from the definition of the so-called "pure wavelength". In the present case shown in Fig. 1, each component has absorbances at the analytical wavelengths for components 2 and 3, but the major contributions come from the analyte of interest. Therefore, the absorbances at the corresponding column in A would be proportional to the concentration of that pure component in the mixtures. On the basis of the same idea, if all n pure wavelengths can be found from PCA, an $n \times n$ matrix R, which is shown in Fig. 2, can be constructed from the n columns in V. A relative concentration matrix $C_{rel}$ can be calculated as

FIG. 1. Three-component synthetic spectra showing the relative concentration matrix $C_{rel}$ obtained from the absorbances at the pure wavelengths in the spectra of mixtures.

FIG. 2. Relative concentration matrix $C_{rel}$ obtained from the score matrix U and the rotation matrix R.

$$C_{rel} = UR \tag{5}$$

After $C_{rel}$ is obtained, the pure-component absorptivity matrix K can be calculated easily from Eq. 1 by least-squares regression analysis:

$$K = (C_{rel}^{t} C_{rel})^{-1} C_{rel}^{t} A \tag{6}$$

Thus, the problem stated in Eq. 1 is reduced to finding the proper number of components, n, and a pure wavelength for each component.
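Given the scores, the PCs, and a list of pure wavelength indices, Eqs. 5 and 6 amount to two lines of linear algebra. The sketch below assumes U and V are already truncated to the n significant PCs; the function name `resolve_from_pure_wavelengths` and the argument `pure_idx` are hypothetical.

```python
import numpy as np

def resolve_from_pure_wavelengths(A, U, V, pure_idx):
    """Apply Eq. 5 (C_rel = U R) and Eq. 6 (least-squares estimate of K)."""
    R = V[:, pure_idx]                              # n x n: columns of V at the pure wavelengths
    C_rel = U @ R                                   # Eq. 5
    K = np.linalg.lstsq(C_rel, A, rcond=None)[0]    # Eq. 6, solved without forming the inverse
    return C_rel, K
```

Note that `U @ V[:, pure_idx]` is simply the PCA-reconstructed absorbance at the pure wavelengths, so each row of the estimated K comes out scaled relative to its own pure-wavelength absorptivity.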

Predicting the Number of Components. The exact number of components predicted by PCA is not always clear, which leads to a "gray" area in the predictions. Herein, an empirical IND function developed by Malinowski is used to find the number of significant principal components from secondary eigenvalues without any prior chemical information.¹⁶ Assuming that the number of wavelengths, r, is equal to or greater than the number of mixtures, m, the IND function is written as

$$\mathrm{IND}_i = \frac{1}{(m - i)^{2}} \left[ \frac{\sum_{j=i+1}^{m} \lambda_j}{r\,(m - i)} \right]^{1/2} \tag{7}$$

In Eq. 7, the sum is taken over all the eigenvalues from $i + 1$ to m. The jth eigenvalue, $\lambda_j$, is the squared length of the jth score vector $u_j$ (the sum of squares of its elements, as defined for Eq. 4). The number of significant principal components is determined by selecting the value of i that gives a minimum IND value. For an artificial data set, there is only one minimum IND value, when i is equal to the number of pure components. In real data sets, it is possible that the IND function has multiple minima $i_1, i_2, \ldots$. The eigenvalues after the second minimum $i_2$ usually represent minor sources of error.¹⁷ Therefore, the number of chemical components can be estimated more reliably by repeating the calculation in Eq. 7 with the m value replaced by $i_2$, where $i_2$ is the second minimum, or by $i_1$ when $i_2$ leads to an unreasonably low error.
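Equation 7 is straightforward to evaluate once the eigenvalues are available. The following is a minimal sketch; the function name `ind_function` is hypothetical, and it handles only the simple case of a single minimum, not the repeated calculation with m replaced by $i_2$.

```python
import numpy as np

def ind_function(eigenvalues, r):
    """Evaluate IND_i (Eq. 7) for i = 1 .. m-1 from the m eigenvalues."""
    lam = np.asarray(eigenvalues, dtype=float)
    m = lam.size
    ind = np.empty(m - 1)
    for i in range(1, m):
        # Sum over the secondary eigenvalues j = i+1 .. m (0-based slice lam[i:])
        re_i = np.sqrt(lam[i:].sum() / (r * (m - i)))
        ind[i - 1] = re_i / (m - i) ** 2
    return ind
```

The estimated number of components is then `int(np.argmin(ind_function(lam, r))) + 1`.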

Finding the First Pure Wavelength. Since a pure wavelength has intensity contributions from only one of the pure components, the intensity changes at each wavelength in the mixtures can be connected to its length $\ell_i$. As shown in Fig. 3, $\ell_i$ is calculated from the ith column of the significant principal component set as follows:

$$\ell_i = \left[ \frac{1}{n} \sum_{j=1}^{n} (v_{j,i})^{2} \right]^{1/2} \tag{8}$$

Obviously, large $\ell_i$ values are expected at peak centers in the spectra. A common peak which has contributions from several pure components has both a larger length $\ell_i$ and a larger first principal component (average) value, $v_{1,i}$. Therefore, to estimate the purity of the ith wavelength, the length $\ell_i$ needs to be scaled by considering the first PC, which is similar to the average spectrum. The first purity spectrum is then constructed from the equation

$$p_{1,i} = \ell_i / (v_{1,i} + \alpha) \tag{9}$$

where $v_{1,i}$ is the ith element of the first PC, and $\alpha$ is a small, positive, constant value added to avoid unexpectedly high $p_{1,i}$ values due to noise at low-intensity background wavelengths. The value of $\alpha$ is assigned as a value equivalent to the real error¹⁶,¹⁷ in the data

$$\alpha = \mathrm{RE} = \left[ \frac{\sum_{j=n+1}^{m} \lambda_j}{r\,(m - n)} \right]^{1/2} \tag{10}$$

where n is selected as the number of pure components, and the m and r values are the number of mixtures and wavelengths, respectively. Since all points of $v_1$ and all elements in Eq. 9 are non-negative, the first purity profile $p_1$ will give a meaningful non-negative purity distribution in the form of a spectrum. The first pure wavelength, $h_1$, is selected as the point with the highest $p_{1,i}$. The first pure wavelength, $h_1$, chosen in this way should give the same result as the KSFA method since $1/p_1$ is similar to the first PC normalized in the KSFA case. The operator can also direct the selection, using chemical knowledge of the mixture sample either to accept the first pure wavelength or to override it with a different choice.

FIG. 3. The purity spectra and weight functions during the processing of three-component synthetic spectra.
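Equations 8 to 10 can be combined into a short sketch. The function names `real_error` and `first_purity` are hypothetical, and the sign flip of the first PC is an added assumption to handle the arbitrary sign returned by numerical PCA routines.

```python
import numpy as np

def real_error(eigenvalues, n, r):
    """Eq. 10: noise level alpha = RE from the secondary eigenvalues j = n+1 .. m."""
    lam = np.asarray(eigenvalues, dtype=float)
    m = lam.size
    return np.sqrt(lam[n:].sum() / (r * (m - n)))

def first_purity(V, alpha):
    """Eqs. 8 and 9: length spectrum divided by (first PC + alpha); V is n x r."""
    V = np.array(V, dtype=float)
    if V[0].sum() < 0:          # assumed fix for the sign ambiguity of PC 1
        V[0] = -V[0]
    length = np.sqrt(np.mean(V ** 2, axis=0))   # Eq. 8
    return length / (V[0] + alpha)              # Eq. 9
```

The first pure wavelength suggested to the operator would be `int(np.argmax(first_purity(V, alpha)))`, which the operator remains free to override.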

Finding the Remaining Pure Wavelengths. For the identification of the remaining pure wavelengths, an $r \times r$ correlation matrix, O, is calculated as in the SIMPLISMA method. However, in this case, the O matrix is calculated from the V matrix instead of the original data matrix A:

$$O = (1/n)\, V^{t} V \tag{11}$$

The matrix O contains the correlation between any wavelength pairs in the first n principal components. Again, in order to avoid the effect due to the noise at low intensity, $v_j$ in Eq. 11 needs to be scaled as follows:

$$v_{j,i} = v_{j,i} / \left[ (v_{1,i} + \alpha)^{2} + \sigma_i^{2} \right]^{1/2} \tag{12}$$

where $v_{j,i}$ is the ith point of the jth principal component, and the standard deviation, $\sigma_i$, in the equation is calculated as

$$\sigma_i = \left[ \frac{1}{n} \sum_{j=1}^{n} (v_{j,i} - \mu_i)^{2} \right]^{1/2} \tag{13}$$

and

$$\mu_i = \frac{1}{n} \sum_{j=1}^{n} v_{j,i} \tag{14}$$

The denominator or scaling factor in Eq. 12 is similar to that used in the SIMPLISMA method.¹² However, the average spectrum has been replaced with the first principal component.

The weight function $w_{2,i}$ for the second pure wavelength point $h_2$ is then calculated from the determinant

$$w_{2,i} = \begin{vmatrix} O_{i,i} & O_{i,h_1} \\ O_{h_1,i} & O_{h_1,h_1} \end{vmatrix} \tag{15}$$

One can see that $w_{2,i} = 0$ when $i = h_1$. The values of $w_{2,i}$ at other wavelengths depend on their degree of correlation with $h_1$; at high correlations $w_{2,i}$ approaches 0. As shown in Fig. 3, the information related to the first pure wavelength can be removed by multiplying the first purity spectrum by the weight function, $w_{2,i}$, and the result becomes the second purity spectrum:

$$p_{2,i} = w_{2,i}\, p_{1,i} \tag{16}$$

The second pure wavelength, $h_2$, is then chosen as the wavelength with the highest $p_{2,i}$ value. For a mixture set containing more than two pure components, this procedure is continued until all n pure wavelengths are selected. The general equation for the jth determinant is as follows:

$$w_{j,i} = \begin{vmatrix} O_{i,i} & O_{i,h_1} & \cdots & O_{i,h_{j-1}} \\ O_{h_1,i} & O_{h_1,h_1} & \cdots & O_{h_1,h_{j-1}} \\ \vdots & \vdots & \ddots & \vdots \\ O_{h_{j-1},i} & O_{h_{j-1},h_1} & \cdots & O_{h_{j-1},h_{j-1}} \end{vmatrix} \tag{17}$$

and the equation for the jth purity spectrum is

$$p_{j,i} = w_{j,i}\, p_{1,i} \tag{18}$$

Thus, the pure wavelengths are all contained in the first purity spectrum, and the influence of each pure wavelength is sequentially removed by Eqs. 17 and 18. During each sequential step, the jth pure wavelength is that wavelength having the greatest contribution in the jth purity spectrum, $p_{j,i}$.

Again, operators can select the pure wavelengths by a visual check of the purity spectra and from information in the original spectra. Each successive purity spectrum contains less information than the previous one. Once all pure wavelengths have been determined, an $n \times n$ rotation matrix R is constructed from the columns of the PCs in the V matrix at the "pure wavelength" points $h_1, h_2, \ldots, h_n$. The relative concentration matrix $C_{rel}$ is then deduced from Eq. 5. The matrix $C_{rel}$ can be treated as the concentration coefficient matrix C to calculate the spectra of pure components K by Eq. 6.
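The automatic (non-interactive) path through Eqs. 8, 9, and 11 to 18 can be sketched as a single routine. This is a hedged illustration rather than the authors' Array Basic or Matlab code: the function name `select_pure_wavelengths` is hypothetical, the value of `alpha` is assumed to come from Eq. 10, the sign flip of the first PC is an added assumption, and the operator override is left out.

```python
import numpy as np

def select_pure_wavelengths(V, alpha):
    """Sequentially pick n pure wavelength indices from the n x r matrix of PCs."""
    V = np.array(V, dtype=float)
    n, r = V.shape
    if V[0].sum() < 0:                                   # assumed sign fix for PC 1
        V[0] = -V[0]
    p1 = np.sqrt(np.mean(V ** 2, axis=0)) / (V[0] + alpha)   # Eqs. 8 and 9
    sigma = V.std(axis=0)                                # Eqs. 13 and 14
    Vs = V / np.sqrt((V[0] + alpha) ** 2 + sigma ** 2)   # Eq. 12
    O = Vs.T @ Vs / n                                    # Eq. 11
    pure = [int(np.argmax(p1))]
    for _ in range(1, n):
        # Eq. 17: determinant of the submatrix of O for wavelength i plus chosen ones
        w = np.array([np.linalg.det(O[np.ix_([i] + pure, [i] + pure)])
                      for i in range(r)])
        pure.append(int(np.argmax(w * p1)))              # Eq. 18, then take the maximum
    return pure
```

In an interactive setting each `np.argmax` would be replaced by showing the purity spectrum `w * p1` to the operator and accepting either the maximum or a manually chosen wavelength.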

EXPERIMENTAL

Mid-infrared Spectra. The esterification reaction of 2-propanol and acetic anhydride with the use of pyridine as a catalyst in carbon tetrachloride solution was monitored by Fourier transform infrared (FT-IR). The initial concentrations of these three chemicals were 15%, 10%, and 5% in volume, respectively. Iso-propyl acetate was one of the products in this typical esterification reaction. The reaction was carried out in a ZnSe CIRCLE cell (Spectra Tech Inc., Stamford, CT). The mixture spectra were measured on a Bio-Rad FTS-40 FT-IR (Bio-Rad, Digilab Division, Cambridge, MA) with 16 scans taken at 4 cm⁻¹ resolution. The data set consisted of 30 spectra, covering approximately 90 min of the reaction. To shift the equilibrium of the esterification, we removed one-tenth of the volume from the cell at 24, 45, 60, and 70 min. An equal amount of a single reactant or product was added to the cell in the sequence of acetic anhydride, pyridine, 2-propanol, and iso-propyl acetate. The following solvents were used to prepare the mixtures: 2-propanol (J.T. Baker Catalog No. 9084-01); acetic anhydride (EM Science Catalog No. AX 0080-1); pyridine (J.T. Baker Catalog No. 9391-1); iso-propyl acetate (J.T. Baker Catalog No. 6-U385); carbon tetrachloride (Aldrich Catalog No. 27065-2). Spectra of pure components were measured by using the same cell in carbon tetrachloride solution.

TABLE I. Composition of the standard gas mixtures (mole percent).

Mixture   Methane   Ethane    Propane   Nitrogen
1         79.7260   20.2740   0.0000    0.0000
2         79.1602   15.8719   4.9679    0.0000
3         78.9524    9.8902   3.1225    8.0349
4         79.0278    5.2803   2.0004   13.6915
5         79.2146    3.4628   0.9852   16.3374
6         84.7659   15.2341   0.0000    0.0000
7         84.1861   10.9180   4.8959    0.0000
8         83.1181    8.1035   4.8335    3.9449
9         84.7500    4.9683   3.9988    6.2829
10        85.3873    1.9744   0.9875   11.6508
11        90.0861    9.9129   0.0000    0.0000
12        90.0307    8.0780   1.8913    0.0000
13        90.2372    5.9174   1.8454    0.0000
14        89.5279    4.4330   2.9857    3.0534
15        89.7946    2.9350   1.9921    5.2783
16        95.0057    4.9943   0.0000    0.0000
17        94.9878    4.0293   0.9829    0.0000
18        95.4307    2.8518   1.7175    0.0000
19        94.8622    2.0788   3.0600    0.0000
20        94.8624    1.1225   0.9912    3.0239

FIG. 4. First purity spectra: (a) from KSFA; (b) from SIMPLISMA with α = 1%, pure wavelengths = 44, 54, and 14; (c) from SIMPLISMA with α = 5%, pure wavelengths = 45, 53, and 14.

Near-Infrared Spectra. Mixtures of methane, ethane, propane, and nitrogen were prepared gravimetrically by the Department of Chemical Engineering at the University of Oklahoma.¹⁸ All gases were Matheson (East Rutherford, NJ) research purity. The mixture compositions are given in Table I. The stated mole fraction of each component is considered accurate to within 0.025%. The details of gas handling and spectral measurement can be found elsewhere.¹⁹ Spectra were measured on a Bio-Rad Model FTS-40N near-infrared spectrometer between 3500 and 10 000 cm⁻¹. For each spectrum 512 scans were collected at a resolution of 4 cm⁻¹. All mixtures were measured at 500 psi.

Data Processing and Software. Pretreatment of the mixture spectra, such as baseline correction and setting the minimum to zero, was performed with Grams/32 software (Galactic Industries, Salem, NH). Program development and processing were performed on a DELL Personal System (Dimension XPS R400). All programs were written in Array Basic within Grams/32 and in Matlab (The MathWorks Inc., Natick, MA).

RESULTS AND DISCUSSION

Analysis of Three-Component Artificial Data by the IPCA Method. The three-component, artificial spectra shown in Fig. 1 were analyzed by KSFA, SIMPLISMA, and IPCA. Obviously, one of the pure wavelengths should be at λ = 14 since it belongs to pure component k1 without any interference. The problem is to find the pure wavelengths for the other two components since the bands around λ = 50 are highly overlapped.

The mixture data have three significant principal components ($v_1$–$v_3$) shown in Fig. 2, and the related $\alpha$ value (noise level) was calculated as $1.4 \times 10^{-3}$ from Eq. 10. The first purity spectrum $p_1$ shown in Fig. 3 has low intensities at the wavelengths 46–52, and high intensities at the wavelengths 14, 44, and 54. The first pure wavelength was selected as the highest intensity in the first purity spectrum (λ = 14, in this case). It should be noted that the purity spectrum has peaks at the center positions of the nonoverlapped original spectral peaks, but at the shoulder positions of the overlapped original spectral peaks.

The dashed spectra in Fig. 3 show the weight functions, which are calculated from Eq. 15 or 17. It can be expected that the weight function $w_2$ has near-zero values around λ = 14 and 49 since those wavelengths are highly correlated to the pure component k1. The second purity spectrum was then calculated by applying the weight function, $w_{2,i}$, to the first purity spectrum. Therefore, the information correlated to the first pure wavelength is removed, and the second pure wavelength is obviously chosen at 44, i.e., the highest point in the second purity spectrum $p_2$. The weight function, $w_3$, and the third purity spectrum, $p_3$, are also shown in Fig. 3. After three pure wavelengths are selected at 14, 44, and 54, three pure-component spectra can be deduced easily from Eqs. 5 and 6.

FIG. 5. Effect of α value selection on the quality of extracted components from artificial data by the SIMPLISMA method. The sum of squares of the residuals (SS) value by the IPCA method is 4.7 × 10⁻³.

FIG. 6. Mid-IR spectra of the esterification of 2-propanol and acetic anhydride: (a) spectra of the components involved in the reaction; (b) predicted spectra from the IPCA method.

Noise Consideration in Pure Wavelength Selection. As mentioned above, pure wavelength selection is highly sensitive to noise. This effect can be seen in Fig. 4a, which shows the first PC spectrum normalized in the KSFA approach. Abnormally high or low intensities located in the shaded areas are due to noise at the wavelengths of low absorptivities. To overcome the interference of the noise, the KSFA method simply deletes those wavelengths with low absorbances before pure wavelength selection. The SIMPLISMA approach sets α equivalent to 1–5% of the highest intensity in the average spectrum to avoid noise interference. The first purity spectra calculated at 1% and 5%, respectively, are shown in Fig. 4b and 4c. Clearly, the pure wavelength selections vary with the α value setting (14, 44, and 54 for α = 1% and 14, 45, and 53 for α = 5%). In fact, there is a trade-off between keeping low signal values and blurring the purity spectra. It is important to relate the α value to the noise level when mixture spectra contain low-concentration components and highly overlapping absorbances. Instead of either deleting low-intensity values or arbitrarily setting the value of α as a percentage of the peak intensity, the IPCA method determines the α value from the noise level on the basis of PCA processing in Eq. 10. Determining the noise in this manner greatly helps in the selection process and in the extraction of pure spectra.
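The influence of the α setting can be illustrated with a short Python sketch. It uses the commonly quoted SIMPLISMA form of the first purity spectrum, the standard deviation divided by the mean plus α, with α taken as a fraction of the largest mean intensity. The synthetic two-band data, the uniform non-negative noise, and all names are assumptions made for illustration; this is not the IPCA scaling of Eq. 10.

```python
import numpy as np

def first_purity_spectrum(A, alpha_fraction):
    """SIMPLISMA-style first purity spectrum: sigma / (mu + alpha)."""
    A = np.asarray(A, dtype=float)
    mu = A.mean(axis=0)                # mean intensity per wavelength
    sigma = A.std(axis=0)              # standard deviation per wavelength
    alpha = alpha_fraction * mu.max()  # offset as a fraction of the largest mean
    return sigma / (mu + alpha)

# Illustrative data: two bands plus small non-negative noise.
x = np.arange(60)
centers = (20, 40)
K_true = np.vstack([np.exp(-0.5 * ((x - c) / 3.0) ** 2) for c in centers])
rng = np.random.default_rng(0)
C_true = rng.uniform(0.4, 0.6, size=(20, 2))
A = C_true @ K_true + rng.uniform(0.0, 2e-3, size=(20, 60))

def distance_to_nearest_band(index):
    """Distance (in channels) from a wavelength index to the closest band center."""
    return min(abs(index - c) for c in centers)

pick_small = int(np.argmax(first_purity_spectrum(A, 1e-4)))  # alpha = 0.01%
pick_large = int(np.argmax(first_purity_spectrum(A, 0.05)))  # alpha = 5%
```

In this sketch a very small α lets a noise-dominated baseline channel win the selection, whereas α = 5% moves the selection onto one of the real bands. The same offset also flattens the purity spectrum near the bands, which is the trade-off described above.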

The sum of squares of the residuals (SS) was used to quantify the difference between an original pure spectrum and its extracted pure-component spectrum. For the IPCA method this value was 4.7 × 10⁻³. The SS values as a function of α in the SIMPLISMA method are shown in Fig. 5. When the α value is around 0.15%, the SS value is close to the IPCA value; otherwise the SS values are much higher.
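The SS figure of merit can be written in a few lines of Python. The text does not state how the intensity scales of the original and extracted spectra were matched, so the optional unit-length normalization below is an assumption, and the function name and sample vectors are purely illustrative.

```python
import numpy as np

def sum_of_squares_residual(original, extracted, normalize=False):
    """Sum of squared differences between two spectra on the same axis."""
    original = np.asarray(original, dtype=float)
    extracted = np.asarray(extracted, dtype=float)
    if normalize:
        # Assumption: remove the arbitrary intensity scale before comparing.
        original = original / np.linalg.norm(original)
        extracted = extracted / np.linalg.norm(extracted)
    return float(np.sum((original - extracted) ** 2))

reference = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
estimate = np.array([0.0, 1.0, 2.5, 1.0, 0.0])
ss = sum_of_squares_residual(reference, estimate)
```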

Mid-Infrared Mixture Analysis. The IPCA approach was evaluated by using the FT-IR data from an esterification reaction. The target product was iso-propyl acetate in the reaction of 2-propanol and acetic anhydride using pyridine as a catalyst. Figure 6a shows the standard spectra of five components involved in the reaction. The function of the pyridine was to react with the acetic acid and shift the equilibrium to the target product. Therefore, pure acetic acid may not be detectable in this reaction. Instead, a combination of pyridine and acetic acid would be expected; thus, we measured a spectrum of a pyridine/acetic acid mixture.

Thirty mixture spectra were measured during 90 min of the esterification reaction. After principal component analysis was performed, the number of pure components in the mixture data was determined to be five, and the noise-level-related α value was 0.0068. With five components, the IPCA program selected pure wavelengths in the sequence 1581, 1706, 3336, 1241, and 1828 cm⁻¹. The predicted spectra for 2-propanol, acetic anhydride, and iso-propyl acetate were very close to those of the pure components. The predicted spectrum for pyridine shows some slight deviations. The last predicted spectrum in Fig. 6b appears similar to that of an actual acetic acid-pyridine mixture, with the exception of the stronger intensity of the band at 1256 cm⁻¹.

Near-Infrared Mixture Analysis. The IPCA approach was evaluated and compared with the KSFA, SIMPLISMA, and alternating least-squares methods by using near-infrared spectra of natural gas samples measured at 500 psi. Twenty spectra of mixtures of methane, ethane, and propane having the compositions listed in Table I were used for the analysis. This spectral set was chosen for the comparison since the components exhibit little, if any, interaction, but the spectra of all three pure components are strongly overlapped. The hydrocarbons exhibit two distinct spectral regions suitable for quantitative work at 6450–7715 and 8085–9000 cm⁻¹. The relative spectral contribution for each pure component is shown in Fig. 7. It is obvious that the mixture spectra consist of highly overlapped bands and that the contributions from ethane and propane are small. There is also a nonlinear, very sharp methane band at 7510 cm⁻¹. It will be seen later that all these characteristics cause problems during pure wavelength selection.

FIG. 7. Near-infrared spectra of methane (1000 psi), ethane (200 psi), and propane (50 psi).

FIG. 8. Near-infrared spectra of synthetic natural gas mixtures. Spectra of the pure components are in the first row. The extracted pure components by the IPCA, KSFA, and SIMPLISMA methods are shown in the second, third, and fourth rows, respectively. Each number corresponds to the pure wavelength selection for that pure component; it is followed by the correlation value to the spectrum of the pure component (dot product of the normalized spectra). The last row shows the results obtained using the alternating least-squares method with the three principal components as input.

After PCA was performed, the number of pure components in the mixture data was determined to be three, and the predicted noise level, α, was 5.8 × 10⁻³ from Eq. 10. Therefore, the first three principal components were used to construct the V matrix for the IPCA and KSFA approaches, and as starting guesses for the alternating least-squares method. The user-interactive IPCA method can avoid picking the nonlinear band at 7510 cm⁻¹ as one of the pure wavelengths. However, this strong band was always selected in the KSFA. On the other hand, although the SIMPLISMA approach can avoid picking abnormal bands as pure wavelengths by user interaction, the correct extraction of a weak component is dependent upon the value selected for α. In other words, the SIMPLISMA method may not resolve a weak component if its highest concentration in the mixtures is comparable to the α value (3% was examined in this case).
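A noise estimate drawn from the secondary principal components can be sketched as follows. Eq. 10 is not reproduced in this section, so the example uses Malinowski's real error (RE) as a plausible stand-in; whether this matches Eq. 10 exactly is an assumption. The function name and the synthetic data (three Gaussian bands with added Gaussian noise of standard deviation 0.01) are illustrative only.

```python
import numpy as np

def real_error(A, n_components):
    """Malinowski-style real error from the secondary singular values."""
    A = np.asarray(A, dtype=float)
    large, small = max(A.shape), min(A.shape)
    s = np.linalg.svd(A, compute_uv=False)      # singular values, descending
    secondary = np.sum(s[n_components:] ** 2)   # variance left after n PCs
    return float(np.sqrt(secondary / (large * (small - n_components))))

# Illustrative data: 20 mixtures, 200 wavelengths, 3 components, noise 0.01.
x = np.arange(200)
K_true = np.vstack([np.exp(-0.5 * ((x - c) / 15.0) ** 2) for c in (50, 100, 150)])
rng = np.random.default_rng(0)
C_true = rng.uniform(0.1, 1.0, size=(20, 3))
noise_sd = 0.01
A = C_true @ K_true + rng.normal(0.0, noise_sd, size=(20, 200))

alpha_estimate = real_error(A, n_components=3)
```

With the correct number of components the estimate should land close to the noise level that was added, which is the property that makes such a value usable as a data-driven α.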

The IPCA, KSFA, and SIMPLISMA methods are all based on finding the pure wavelengths for the three components. The results shown in Fig. 8 were obtained without iteration, so they are based on finding the best set of pure wavelengths. The pure wavelength for each component obtained by each of the methods is shown in the figure adjacent to the predicted spectrum. The second numerical value adjacent to each spectrum is the correlation of the predicted spectrum with the spectrum of the pure component. The correlation was obtained by taking the dot product of the normalized predicted spectrum and the normalized pure-component spectrum. All three methods predicted methane spectra with correlation coefficients between 0.98 and 0.99. The ethane spectrum extracted by the KSFA method (correlation 0.919) and the propane spectrum extracted by the SIMPLISMA method (correlation 0.713) had much lower correlation coefficients than those of the other two methods. In the IPCA approach, the sensitivity to noise is not as large as in the SIMPLISMA method, because most of the noise is moved to the secondary principal components after the PCA processing. Moreover, the α value predicted by the PCA can be used to account for the real error (noise). The KSFA method does not necessarily yield the most orthogonal set of column vectors, because the selection process is sequential; therefore, highly overlapped mixtures may not be resolved by the KSFA method.
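The correlation measure described here, the dot product of the two spectra after each is normalized to unit length, can be sketched in Python as follows. The function name and the sample vectors are illustrative.

```python
import numpy as np

def spectral_correlation(predicted, pure):
    """Dot product of two spectra after scaling each to unit length."""
    predicted = np.asarray(predicted, dtype=float)
    pure = np.asarray(pure, dtype=float)
    return float(np.dot(predicted, pure)
                 / (np.linalg.norm(predicted) * np.linalg.norm(pure)))

pure = np.array([0.0, 1.0, 3.0, 1.0, 0.0])
same_shape = spectral_correlation(2.5 * pure, pure)          # scale is ignored
no_overlap = spectral_correlation([1.0, 0.0, 0.0, 0.0, 0.0], pure)
```

Because both spectra are normalized, the measure depends only on band shape and not on overall intensity, so it can be applied to extracted spectra whose scale is arbitrary.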

Results for the alternating least-squares method are shown at the bottom of Fig. 8. This method started with the three PCs constrained to have zero or positive values. The first PC contains positive absorbance information on all three components; however, the second and third PCs contain values (information) below zero, which are lost when the constraint is applied. After convergence is reached with the alternating least-squares, two of the resulting spectra are similar to methane and one is somewhat similar to ethane, but the profile for propane is completely lost in the processing. The results for alternating least-squares change with the (mathematical) signs of the second and third PCs; e.g., the second PC and the scores corresponding to the second PC can both be multiplied by −1. The PCA results after this multiplication are identical to those before, but the alternating least-squares results will be different. Moreover, a similar change to the third PC will produce yet another result by the alternating least-squares method.
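The sign ambiguity can be demonstrated with a small Python sketch: flipping the sign of the second PC together with its scores leaves the PCA reconstruction unchanged, but the non-negative starting spectra obtained by zeroing negative values are different. The synthetic three-band data and the variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative noise-free mixtures of three overlapping Gaussian bands.
x = np.arange(50)
K_true = np.vstack([np.exp(-0.5 * ((x - c) / 4.0) ** 2) for c in (15, 25, 35)])
rng = np.random.default_rng(1)
C_true = rng.uniform(0.1, 1.0, size=(12, 3))
A = C_true @ K_true

# PCA via the singular value decomposition, keeping three components.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
scores = U[:, :3] * s[:3]
loadings = Vt[:3]

# Multiply the second PC and its scores by -1.
flip = np.diag([1.0, -1.0, 1.0])
scores_flipped = scores @ flip
loadings_flipped = flip @ loadings

# The reconstruction of A is identical before and after the flip.
same_fit = bool(np.allclose(scores @ loadings, scores_flipped @ loadings_flipped))

# The non-negativity constraint keeps different parts of the second PC.
start_original = np.clip(loadings, 0.0, None)
start_flipped = np.clip(loadings_flipped, 0.0, None)
different_start = not np.allclose(start_original, start_flipped)
```

Since the two constrained starting points differ, an iterative method launched from them has no reason to arrive at the same final spectra.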

The results of the pure wavelength selection methods can be greatly improved by using the initial resulting spectra in Fig. 8 as input to the alternating least-squares method. The final results of the three pure wavelength methods after iteration in the alternating least-squares are shown in Fig. 9. The number of iterations necessary for convergence of each method is listed in the first column. The numerical value adjacent to each spectrum is the correlation of the predicted spectrum with the spectrum of the pure component. The predicted spectra for methane were very good for all three methods (correlations = 0.986). The predicted spectra for ethane and propane were greatly improved after applying the alternating least-squares method, except for the appearance of the weak sharp band due to the Q-branch of methane at 7510 cm⁻¹. All three predicted spectra for propane were similar, but there is some difference between the predicted and the actual spectrum at 8500 and in the 7500–7000 cm⁻¹ region, and their correlation coefficients (0.95 to 0.96) were not as good as those of methane (0.986) and ethane (0.99). In all of the mixture samples, the concentrations of propane were below 5%; thus, considering the strong band overlap and the low concentrations, the results are very reasonable.
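A minimal sketch of alternating least-squares with non-negativity constraints, started from initial spectral estimates, is given below in Python. It follows the regression of Eq. 2 and its counterpart for K, and sets negative values to zero after each step. The convergence test, the iteration limit, the function name, and the synthetic data are assumptions made for illustration; the stopping rule used in the paper is not stated in this section.

```python
import numpy as np

def alternating_least_squares(A, K0, max_iter=200, tol=1e-12):
    """Sketch of ALS: alternate regressions for C and K, clipping negatives."""
    A = np.asarray(A, dtype=float)
    K = np.clip(np.asarray(K0, dtype=float), 0.0, None)
    previous = np.inf
    for iteration in range(1, max_iter + 1):
        # Regress A onto K (Eq. 2), then force concentrations >= 0.
        C = np.clip(A @ np.linalg.pinv(K), 0.0, None)
        # Regress A onto C, then force absorptivities >= 0.
        K = np.clip(np.linalg.pinv(C) @ A, 0.0, None)
        residual = float(np.sum((A - C @ K) ** 2))
        if abs(previous - residual) <= tol:   # assumed stopping rule
            break
        previous = residual
    return C, K, iteration, residual

# Illustrative data and a slightly perturbed starting guess for the spectra.
x = np.arange(50)
K_true = np.vstack([np.exp(-0.5 * ((x - c) / 4.0) ** 2) for c in (15, 25, 35)])
rng = np.random.default_rng(2)
C_true = rng.uniform(0.1, 1.0, size=(12, 3))
A = C_true @ K_true
K_start = K_true + 0.02 * rng.random(K_true.shape)

C_fit, K_fit, n_iter, residual = alternating_least_squares(A, K_start)
relative_residual = residual / float(np.sum(A ** 2))
```

A small residual shows only that the product of the fitted matrices reproduces the data; because other non-negative factorizations can fit equally well, the quality of the starting spectra still matters, which is consistent with the improvement reported when the pure wavelength results are used as input.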

FIG. 9. Near-infrared spectra of synthetic natural gas mixtures. Spectra of the pure components are in the first row. The next three rows show the results from applying alternating least-squares to the spectra from the IPCA, KSFA, and SIMPLISMA methods as shown in Fig. 8. The numbers of iterations are given in the first column. The correlation values to the spectra of the pure components (dot product of the normalized spectra) are given for each of the extracted spectra.

CONCLUSION

The final results on the near-infrared gas mixtures from the three pure wavelength methods of IPCA, KSFA, and SIMPLISMA before iteration are quite different. Moreover, the straight iterative method of alternating least-squares produced very poor spectral predictions for the two minor components in the three-component gas mixtures. However, the pure wavelength selection methods all performed very well when the initial results were used as input for the alternating least-squares. Combining KSFA and SIMPLISMA into the IPCA method reduces the need to blank out regions as in the KSFA method. It also provides an improvement in the overall signal-to-noise ratio by performing PCA on the entire data set, and it provides a good estimate of the background noise by calculating the real error from the higher principal components. Basically, the IPCA method uses PCA to reduce the dimensionality of the data set and to reduce the effective noise level. Finally, it provides a reasonable indication of the number of pure components present in the mixture data.

1. V. Vacque, N. Dupuy, B. Sombret, J. P. Huvenne, and P. Legrand, Appl. Spectrosc. 51, 407 (1997).
2. J. Guilment, S. Markel, and W. Windig, Appl. Spectrosc. 48, 320 (1994).
3. J. J. Andrew and T. M. Hancewicz, Appl. Spectrosc. 52, 797 (1998).
4. W. H. Lawton and E. A. Sylvestre, Technometrics 13, 617 (1971).
5. F. J. Knorr and J. H. Futrell, Anal. Chem. 51, 1236 (1979).
6. D. W. Osten and B. R. Kowalski, Anal. Chem. 56, 991 (1984).
7. B. Vandeginste, R. Essers, T. Bosman, I. Reijnen, and G. Kateman, Anal. Chem. 57, 971 (1985).
8. P. J. Gemperline, J. Chemom. 3, 549 (1989).
9. R. Tauler and B. Kowalski, Anal. Chem. 65, 2040 (1993).
10. E. R. Malinowski, Anal. Chim. Acta 134, 129 (1982).
11. K. J. Schostack and E. R. Malinowski, Chemom. Intell. Lab. Syst. 6, 21 (1989).
12. W. Windig and J. Guilment, Anal. Chem. 62, 1425 (1991).
13. C. W. Brown, D. Bu, N. P. Camacho, and R. Mendelsohn, Proc. SPIE-Int. Soc. Opt. Eng., in press.
14. W. Windig, J. L. Lippert, M. J. Robbins, K. R. Kresinske, J. P. Twist, and A. P. Snyder, Chemom. Intell. Lab. Syst. 9, 7 (1990).
15. Su-Chin Lo and C. W. Brown, Appl. Spectrosc. 46, 790 (1992).
16. E. R. Malinowski, Anal. Chem. 49, 612 (1977).
17. E. R. Malinowski, J. Chemom. 13, 69 (1999).
18. J. L. Savidge, Department of Chemical Engineering, University of Oklahoma, private communication.
19. S. M. Donahue and C. W. Brown, Anal. Chem. 60, 1873 (1988).