Infrared Spectral Search for Mixtures in Large-Size Libraries

SU-CHIN LO and CHRIS W. BROWN* University of Rhode Island, Kingston, Rhode Island 02881

A routine for searching large spectral libraries with spectra of mixtures is presented. The dimensionality of a 3169-compound library is reduced to 12% of its original size by using Fourier transform compression and principal component analysis. A principal component regression is performed and used as a prefilter in selecting spectra having features (and chemical groups) similar to those of the unknown mixture. A dot-product metric is then used to identify a target component from the subgroup formed by the prefilter. This is followed by the application of an adaptive filter to remove the similarity of the target component from the subgroup and from the unknown mixture; the search is repeated on the modified data. Successive applications of the adaptive filter will produce minimum residuals if the correct identifications are made. Once the residuals are minimized, a similarity index is calculated to determine the closeness of the unknown mixture spectrum to a spectrum reconstructed from the library spectra. Four out of five two- and three-component spectra were correctly identified. One of the two components in the fifth mixture was correctly identified, and the residual values flagged the improper identification of the second component. After the adaptive filter was applied to the entire library, the second component was correctly identified. Results for this new algorithm are compared to those from four more traditional search routines, which were only completely successful on one of the unknown mixtures. Index Headings: Chemometrics; Multivariate analysis; Library search; Pattern recognition; Multicomponent analysis; Infrared.

INTRODUCTION

In the accompanying paper, 1 we show that a spectral library can be processed as standard spectra in a multicomponent analysis to provide regression coefficients needed to convert an unknown mixture spectrum into pseudo-concentrations or composition indices, CI. When all of the library spectra are linearly independent, they can all be used to provide an orthogonal basis set of vectors. 2,3 The projections of the original spectra onto this basis set are used to regress the hypothetical concentrations of 1.0 for each compound and to determine the regression coefficients. The projections of an unknown mixture spectrum onto this same basis set are multiplied by the regression coefficients to estimate the CIs of the components in the mixture. As the size of the library increases, a point is reached at which the spectra are no longer linearly independent. We showed that principal component regression (PCR) could then be used to solve the problem as long as the total number of spectra in the library was less than the number of data points in each spectrum.

In large libraries, where the number of spectra is greater than the number of data points, PCR can no longer provide reasonable estimates of the CIs of all possible compounds in the library. However, as we will show herein, PCR can be used to prefilter the library and improve the chances of identifying all of the components in a mixture.

Received 17 April 1991; revision received 28 June 1991. * Author to whom correspondence should be sent.

THEORY

The mixture analysis algorithm for large libraries incorporates Fourier compression, principal component regression analysis for prefiltering, an adaptive filter, and a similarity index to provide a high probability of identifying the correct components.

Fourier Compression. Search algorithms based on fitting entire spectral regions with a Fourier series equation have been used successfully for UV-visible 4 and infrared spectral libraries. 3-7 The Fourier-domain representation of a spectrum has the majority of the signal information concentrated in a relatively small region near the beginning of the series. Thus, the peak position, peak width, and shape information in the spectral domain are all compressed into a small Fourier window. In addition, higher signal-to-noise ratios and reduced background effects provided by Fourier processing offer improvements over the spectral and derivative domains. This approach is especially practical in identifying mixture unknowns from large spectral libraries in order to retain the most significant spectral information and to reduce computer storage space and processing time.
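The compression idea can be illustrated with a minimal sketch, assuming a plain discrete Fourier transform with a simple low-order cutoff. The function names, the Gaussian test band, and the choice of 12 retained terms are hypothetical and are not taken from the original routine; the sketch only shows that a smooth band is well described by the first few Fourier terms.

```python
import cmath
import math


def fourier_compress(spectrum, n_keep):
    """Return the first n_keep complex DFT coefficients of a real spectrum."""
    n = len(spectrum)
    coeffs = []
    for k in range(n_keep):
        total = 0j
        for j, value in enumerate(spectrum):
            total += value * cmath.exp(-2j * math.pi * k * j / n)
        coeffs.append(total / n)
    return coeffs


def fourier_reconstruct(coeffs, n):
    """Rebuild an n-point approximation from the retained low-order terms.

    Assumes len(coeffs) <= n // 2, so every retained term k > 0 stands for
    the conjugate pair (k, n - k) of a real-valued signal.
    """
    out = []
    for j in range(n):
        value = coeffs[0].real
        for k in range(1, len(coeffs)):
            value += 2.0 * (coeffs[k] * cmath.exp(2j * math.pi * k * j / n)).real
        out.append(value)
    return out


if __name__ == "__main__":
    n = 64
    # Hypothetical broad absorption band (Gaussian, width of 6 data points).
    band = [math.exp(-((j - 32) ** 2) / (2 * 6.0 ** 2)) for j in range(n)]
    kept = fourier_compress(band, 12)
    rebuilt = fourier_reconstruct(kept, n)
    worst = max(abs(a - b) for a, b in zip(band, rebuilt))
    print(f"kept {len(kept)} of {n} terms, worst-case error {worst:.2e}")
```

Sharper bands need more retained terms, so the cutoff is a trade-off between storage and fidelity that would have to be tuned for a real library.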

Prefilter by Principal Component Regression. The incorporation of principal component regression into a search algorithm was developed in the previous paper. 1 In mixture analysis for a large library, PCR procedures can be applied to select the subgroup of target compounds from the EPA library of 3169 spectra. The size of the subgroup can be determined by the user; however, for the EPA library we have found that 100 is an optimum number. In this processing, regressing the reference score matrix with CIs provides greater discriminatory capability, increases the searching performance, and reduces the processing time. The regression step serves as a prefilter for selecting compounds with similar structures from the library.

The principal component regression equations (6-10) in the previous paper 1 can all be applied to the data in the 3000-compound library. The problem of going to a large library can best be understood by considering the last expression in Eq. 8, which is

$$S = (\Lambda)^{-1} U' \tag{1}$$

where $S$ is an ($f \times m$) matrix of regression coefficients for $f$ principal components and $m$ library spectra, $U$ is the ($m \times f$) matrix of projections (scores) of the library spectra onto the principal components, and $(\Lambda)^{-1}$ is the inverse of the square of the diagonal matrix of eigenvalues. The scores, $U_{\mathrm{unk}}$, of projecting the unknown mixture spectrum onto the principal components are multiplied by the regression matrix to obtain the CIs in the mixture

$$C_{\mathrm{unk}} = U_{\mathrm{unk}} S. \tag{2}$$

If we replace $S$ with Eq. 1, this becomes

$$C_{\mathrm{unk}} = U_{\mathrm{unk}} (\Lambda)^{-1} U'. \tag{3}$$

1628 Volume 45, Number 10, 1991 0003-7028/91/4510-1628$2.00/0 APPLIED SPECTROSCOPY © 1991 Society for Applied Spectroscopy

For small-size libraries in which the library spectra are completely independent, the CIs predicted by Eq. 3 are very close to those of the mixture. For example, if one of the library spectra is used as the unknown, its estimated CI would be very close to 1.0 and the estimated CIs of the other library spectra close to 0.0. For medium-size libraries in which all of the spectra are not linearly independent, Eq. 3 still predicts CIs close to those in the mixture. If one of the library spectra is used as the unknown, its estimated CI is high (0.7 to 0.95) and the estimated CIs of the other library spectra are small; the prediction is not perfect, but the estimates are close. For large libraries, in which the number of spectra exceeds the number of spectral data points, it is not possible to provide even semi-quantitative results. For example, if one of the library spectra is used as the unknown, the estimated value in $C_{\mathrm{unk}}$ from Eq. 3 is not even close to 1.0 for that compound; however, it will have the highest CI value.

A closer consideration of Eq. 3 helps in understanding the difficulty. The key feature is the $(\Lambda)^{-1}$ matrix. Basically, this is just a weight matrix with the reciprocals of the squares of the eigenvalues along the diagonal and zeros otherwise. The eigenvalues express the amount of variance in the original data set accounted for by each principal component; thus, the eigenvalues are highest for the first principal components and decrease correspondingly. However, the reciprocals weight the first principal components less and the higher principal components more. Thus, the reciprocal places more emphasis on the small differences between spectra, which are found in the higher principal components.
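Written out element by element, Eq. 3 makes this weighting explicit. The index notation below is not in the original and is introduced only for illustration: $u_{\mathrm{unk},k}$ is the score of the unknown on principal component $k$, $u_{j,k}$ is the score of library spectrum $j$ on the same component, and $\lambda_k$ is the $k$-th eigenvalue.

$$C_{\mathrm{unk},j} = \sum_{k} \frac{u_{\mathrm{unk},k}\, u_{j,k}}{\lambda_k^{2}}$$

Because $\lambda_k$ decreases with $k$, the factor $1/\lambda_k^{2}$ grows with $k$, so a given product of scores on a higher principal component contributes more to the composition index than the same product on the first few components.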

If the diagonal elements in $(\Lambda)^{-1}$ are made 1.0, Eq. 3 becomes simply a dot-product metric of the scores of the unknown with scores of the library spectra. Therefore, we can think of a principal component regression as a weighted dot-product metric. The weight matrix is very important in providing correct answers in the case of small- and medium-size libraries. For large libraries, we found that its application helps to eliminate certain interfering compounds that appear as high hits with a straight dot-product metric. It helps to emphasize the differences between spectra rather than the similarities by placing a larger weight on the higher principal components. Basically, it selects library spectra having scores similar to those of the mixture by eliminating those with small differences.
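The relation between Eq. 3 and the plain dot-product can be sketched in a few lines. This is a toy illustration under stated assumptions, not the original program: the function name and the two-spectrum library are hypothetical, and the eigenvalues are taken to be the lengths of the score columns so that the weighted form returns 1.0 for a library spectrum matched against itself.

```python
def composition_indices(u_unk, library_scores, eigenvalues, weighted=True):
    """Evaluate Eq. 3 for one unknown.

    u_unk          -- scores of the unknown on the principal components
    library_scores -- one list of scores per library spectrum (rows of U)
    eigenvalues    -- one eigenvalue per principal component; following the
                      text, each weight is the reciprocal of its square
    weighted=False -- replaces every weight by 1.0, i.e. a plain dot-product
    """
    if weighted:
        weights = [1.0 / (ev * ev) for ev in eigenvalues]
    else:
        weights = [1.0] * len(eigenvalues)
    return [
        sum(w * a * b for w, a, b in zip(weights, u_unk, row))
        for row in library_scores
    ]


if __name__ == "__main__":
    # Hypothetical two-spectrum library on two principal components.
    U = [[3.0, 1.0], [3.0, -1.0]]
    ev = [18 ** 0.5, 2 ** 0.5]  # assumed: lengths of the two score columns

    print(composition_indices(U[0], U, ev))                  # close to [1.0, 0.0]
    print(composition_indices(U[0], U, ev, weighted=False))  # [10.0, 8.0]
    print(composition_indices([3.0, 0.0], U, ev))            # 50:50 mixture, close to [0.5, 0.5]
```

In this toy case the unweighted dot-product barely separates the two library entries (10 versus 8), whereas the weighted form separates them cleanly, which is the behavior the text attributes to the weight matrix.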

For large libraries, PCR is used to filter the library to collect a subgroup of spectral scores. The dot-product metric is applied to the subgroup to select the target compounds.
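The two-stage selection described above might be sketched as follows. The function names and the four-entry library are hypothetical, and normalizing the scores to unit length before the dot-product is an assumption made so that the hit values fall between -1 and 1.

```python
import math


def normalize(vec):
    """Scale a score vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def prefilter_then_rank(u_unk, library_scores, eigenvalues, subgroup_size=100):
    """Stage 1: keep the subgroup_size entries with the largest weighted index (Eq. 3).
    Stage 2: rank that subgroup by the dot-product of normalized scores.
    Returns (dot_product, library_index) pairs, best hit first."""
    weights = [1.0 / (ev * ev) for ev in eigenvalues]
    ci = [
        sum(w * a * b for w, a, b in zip(weights, u_unk, row))
        for row in library_scores
    ]
    order = sorted(range(len(library_scores)), key=lambda i: ci[i], reverse=True)
    subgroup = order[:subgroup_size]
    unk = normalize(u_unk)
    hits = [
        (sum(a * b for a, b in zip(unk, normalize(library_scores[i]))), i)
        for i in subgroup
    ]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Hypothetical four-entry library on two principal components.
    library = [[3.0, 1.0], [3.0, -1.0], [1.0, 1.0], [-3.0, 1.0]]
    eigenvalues = [18 ** 0.5, 2 ** 0.5]
    for value, index in prefilter_then_rank([3.0, 0.8], library, eigenvalues, subgroup_size=2):
        print(index, round(value, 3))
```

The default subgroup size of 100 follows the value the text reports as optimal for the EPA library; the example uses 2 only because the toy library has four entries.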

Adaptive Filter. In the case of a large library, the adaptive filter, as explained in the previous paper, 1 is applied to the subgroup selected by the PCR prefilter. The score of the mixture spectrum and those of all spectra in the subgroup are made orthogonal to the score of the compound having the highest hit index for the dot-product metric after the first pass. For the medium library, the adaptive filter is applied to the original spectra and the PCR processing repeated. In this procedure for large libraries, it would be too time consuming to perform PCR on the 100-compound subgroup after modifying their spectra.
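The orthogonalization step can be sketched as a single projection removal. This is a minimal illustration, assuming the filter amounts to subtracting from every score vector its projection onto the score vector of the best hit; the details of the filter in the previous paper may differ, and the function names are hypothetical.

```python
def remove_component(vec, target):
    """Subtract from vec its projection onto target, leaving vec orthogonal to target."""
    tt = sum(t * t for t in target)
    if tt == 0.0:
        return list(vec)  # nothing to remove for an all-zero target
    scale = sum(v * t for v, t in zip(vec, target)) / tt
    return [v - scale * t for v, t in zip(vec, target)]


def adaptive_filter(u_unk, subgroup_scores, hit_index):
    """Make the unknown and every subgroup entry orthogonal to the best hit.

    The hit itself becomes a zero vector, so it cannot be selected again."""
    target = subgroup_scores[hit_index]
    new_unk = remove_component(u_unk, target)
    new_subgroup = [remove_component(row, target) for row in subgroup_scores]
    return new_unk, new_subgroup


if __name__ == "__main__":
    unknown = [2.0, 1.0, 0.0]
    subgroup = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    filtered_unknown, filtered_subgroup = adaptive_filter(unknown, subgroup, 0)
    print(filtered_unknown)   # [0.0, 1.0, 0.0]
    print(filtered_subgroup)  # entry 0 is zeroed; entry 1 loses its overlap with entry 0
```

Working on scores rather than on the original spectra keeps each pass cheap, which is consistent with the short adaptive-filter times the authors report.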

Similarity Index. After determination of all possible candidates in the mixture by the adaptive filter method, a similarity index (SI) provides the final evidence for the analysis. The similarity index is based on the dot-product between the original mixture score ($U_{\mathrm{unk}}$) and the normalized scores of all selected components ($U_{(A)}$, $U_{(B)}$, and so on). In the two-component system, for example, the calculated score mixture $U_{\mathrm{cal}}$ is obtained by

$$U_{\mathrm{cal}} = \frac{dt_{(A)} U_{(A)} + dt_{(B)} U_{(B)}}{dt_{(A)} + dt_{(B)}} \tag{5}$$

where the $dt$ represents the dot-product of scores obtained for each spectrum during the initial pass through the subgroup. The SI calculation is defined:

$$\mathrm{SI} = U_{\mathrm{unk}} U_{\mathrm{cal}}^{t}. \tag{7}$$

An SI value of 1.00 indicates a perfect match of selected components with the unknown mixture.
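The SI calculation can be sketched for any number of selected components. The function names are hypothetical, and rescaling both the unknown and the calculated score vector to unit length is an assumption made here so that a perfect reconstruction gives exactly 1.0.

```python
import math


def normalize(vec):
    """Scale a score vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def similarity_index(u_unk, component_scores, dot_values):
    """Dot-product of the unknown with a score vector rebuilt from the selected components.

    component_scores -- score vectors of the selected library entries
    dot_values       -- dot-product of each entry with the unknown on the initial pass
    """
    total = sum(dot_values)
    if total == 0.0:
        raise ValueError("dot-product weights sum to zero")
    u_cal = [0.0] * len(u_unk)
    for weight, scores in zip(dot_values, component_scores):
        unit_scores = normalize(scores)
        for k in range(len(u_cal)):
            u_cal[k] += weight * unit_scores[k] / total
    # Assumed normalization of both vectors, so that SI cannot exceed 1.0.
    return sum(a * b for a, b in zip(normalize(u_unk), normalize(u_cal)))


if __name__ == "__main__":
    # Correct pair of components: the rebuilt vector matches the unknown.
    print(similarity_index([0.6, 0.8], [[1.0, 0.0], [0.0, 1.0]], [0.6, 0.8]))
    # One wrong component: part of the unknown is left unexplained.
    print(similarity_index([0.6, 0.8, 0.0], [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.6, 0.0]))
```

The first call returns a value of essentially 1.0 and the second about 0.6, which mirrors how the SI is used in the text as a final check on the selected combination.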

EXPERIMENTAL

The same "unknown" mixture spectra were used in this study as in the accompanying report. 1 In the present study, the entire EPA library was used. Processing using the more traditional search routines was performed on the Bio-Rad Model 3200 Data Station using the Search-32 software.

The Mix-Match search routines were performed on an IBM PS/2 50z. The EPA spectral library was converted to Fourier vectors as described previously. 1 Orthogonal vectors were obtained by Successive Average Orthogonalization. 8 Selection of the number of principal components was based on the cumulative percent variance. Under this criterion, the number of required components ranged from 195 to 205. Projections of original Fourier vectors onto the principal components were used to form the score matrix for PCR. CPU time for orthogonalization of the EPA spectral library was about 10 h, and determining the scores of the entire library took about 3 h on the IBM PS/2 50z computer. These calculations had to be performed only once. The processing of each unknown through the whole library took about 2 min, and the adaptive filter took just 5 s.
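A cumulative-percent-variance criterion of the kind mentioned above can be sketched as follows. The threshold used by the authors is not stated, so the target percentage, the function name, and the eigenvalues in the example are all hypothetical.

```python
def components_for_variance(eigenvalues, target_percent):
    """Smallest number of leading components whose cumulative variance reaches target_percent."""
    total = sum(eigenvalues)
    running = 0.0
    for count, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if 100.0 * running / total >= target_percent:
            return count
    return len(eigenvalues)


if __name__ == "__main__":
    variances = [50.0, 30.0, 15.0, 4.0, 1.0]  # hypothetical variance per component
    print(components_for_variance(variances, 95.0))  # 3
    print(components_for_variance(variances, 99.0))  # 4
```

A small change in the target can shift the count by a few components, which may be why a range (195 to 205) rather than a single number is reported.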

RESULTS AND DISCUSSION

In order to evaluate the processing performance, advantages, limitations, and problems of the described method, the five examples of "unknown" mixtures described in the previous paper were treated by traditional library searching routines and by the Mix-Match programs. Four of the five mixtures contain aromatic derivatives with different functional groups. The IR spectral patterns of these compounds all contain complicated sharp bands which may mask aliphatic compounds in unknown mixtures. Therefore, one of the "unknown" mixtures was a combination of an aromatic compound and an aliphatic ketone.

TABLE I. Library searching results on real mixtures using standard searching algorithms: Least-Squares (LS), Absolute Differences (AB), Derivative Least-Squares (DLS), and Derivative Absolute Differences (DAB).

| Mixture | Algorithm | Hit #1 (HQI) | Hit #2 (HQI) |
|---|---|---|---|
| MIX21: o-Chlorotoluene and m-Toluidine | LS | o-Toluidine (0.53) | 2-Biphenylamine (0.56) |
| | AB | o-Toluidine (0.50) | o-Chlorotoluene (0.50) |
| | DLS | Benzylamine, o-chloro-N-methyl (0.89) | o-Cyanoaniline (0.91) |
| | DAB | o-Toluidine (0.76) | o-Toluidine (0.85) |
| MIX22: Anisole and methyl propyl ketone | LS | 2-Propanone, o-methoxyphenyl (0.43) | Anisole (0.46) |
| | AB | Anisole (0.45) | 2-Butanone, 4-/p-methoxyphenyl/-, (0.49) |
| | DLS | Anisole (0.41) | Acetonitrile, /o-methoxyphenyl/-, (0.61) |
| | DAB | Anisole (0.37) | Acetonitrile, /o-methoxyphenyl/-, (0.60) |
| MIX23: 1,2,4-Trimethylbenzene and 1,3,5-trimethylbenzene | LS | 1,3,5-Trimethylbenzene (0.21) | 1,2,3,5-Tetramethylbenzene (0.33) |
| | AB | 1,3,5-Trimethylbenzene (0.20) | 1,2,4-Trimethylbenzene (0.29) |
| | DLS | 1,3,5-Trimethylbenzene (0.39) | S-Collidine (0.83) |
| | DAB | 1,3,5-Trimethylbenzene (0.37) | 1,2,4-Trimethylbenzene (0.66) |
| MIX31: m-Nitrotoluene, m-dichlorobenzene, and p-chlorotoluene | LS | Toluene, 2,6-dichloro-3-nitro-, (0.67) | Benzene, 1,4-dichloro-2-nitro-, (0.68) |
| | AB | Benzene, 1,4-dichloro-2-nitro-, (0.58) | Toluene, 2,6-dichloro-3-nitro-, (0.58) |
| | DLS | m-Dichlorobenzene (0.89) | V-Triazolo/4,5-D/Pyridine, 3H-, 3-Cyclopentyl-, (1.00) |
| | DAB | m-Dichlorobenzene (0.77) | m-Nitrotoluene (0.89) |
| MIX32: o-Dimethoxybenzene, bromobenzene, and p-bromoanisole | LS | o-Dimethoxybenzene (0.28) | Acetonitrile, /o-methoxyphenyl/-, (0.39) |
| | AB | o-Dimethoxybenzene (0.24) | o-Chloroanisole (0.33) |
| | DLS | o-Dimethoxybenzene (0.62) | 2-Propanone, 1-/o-methoxyphenoxy/-, (0.87) |
| | DAB | o-Dimethoxybenzene (0.24) | o-Chloroanisole (0.33) |

Traditional Library Search Routines. Four standard algorithms were used to search the 3000-compound library for the five mixture spectra. A popular searching algorithm, least-squares (LS), is based on the least-squares calculation of the Euclidean distance between the unknown and library entry. A second search method is based on the absolute differences (AB) between the unknown and library spectra; this may reduce computing time and weight small differences in the spectral comparison. Two other modified search algorithms, derivative least-squares (DLS) and derivative absolute value (DAB), calculate the derivative differences between two spectra and may reduce the contributions of baseline and peak shift effects. Generally, a hit quality index (HQI) above 0.60 indicates an unsuccessful search, whereas HQI below 0.30 indicates a good match between unknown and the library entry.

TABLE II. Results of Mix-Match search for MIX22 in the 3169-compound library. a

(a) Primary

| Rank | Library # | Compound | Dot-Product |
|---|---|---|---|
| #1 | 881 | Anisole | 0.875 |
| #2 | 1706 | Ether, butyl phenyl-, | 0.850 |
| #3 | 2771 | Acetonitrile, /o-methoxyphenyl/ | 0.843 |
| #4 | 2524 | Ether, phenyl propyl-, | 0.837 |
| #5 | 2760 | 2-Propanone, 1-/p-methoxyphenyl/ | 0.800 |
| #6 | 1034 | 2-Butanone, 4-/p-methoxyphenyl/-, | 0.796 |
| #7 | 11 | Benzene, ethoxy-, | 0.791 |
| #8 | 540 | Ethanol, 2-phenoxy-, | 0.789 |
| #9 | 1825 | Butyraldehyde, phenylhydrazone | 0.761 |
| #10 | 2709 | Cyclopentanecarbonitrile, 1-/p-methoxyphenyl/-, | 0.755 |

(b) Removed #881 from original library

| Rank | Library # | Compound | Dot-Product |
|---|---|---|---|
| #1 | 583 | 2-Pentanone | 0.461 |
| #2 | 496 | 2-Pentanone, 4-methyl-, | 0.453 |
| #3 | 896 | 2-Butanone, 3-methyl-, | 0.450 |
| #4 | 904 | 2-Hexanone | 0.449 |
| #5 | 2240 | 2-Butanone | 0.444 |
| #6 | 634 | 2-Pentanone, 3-methyl-, | 0.443 |
| #7 | 2540 | Valeric acid, 2-acetyl-5-chloro, ethyl ester | 0.439 |
| #8 | 864 | 5-Heptene-2-one, 6-methyl-, | 0.435 |
| #9 | 143 | 2-Hexanone, 5-methyl-, | 0.434 |
| #10 | 552 | 2-Heptanone | 0.426 |

(c) Removed #881 and #583 from original library

| Rank | Library # | Compound | Dot-Product |
|---|---|---|---|
| #1 | 3159 | Phthalimide, N-o-tolyl-, | 0.042 |
| #2 | 3169 | Phthalimide, N-/1-formylethyl/-,dl | 0.041 |

All residues are less than 0.05.

a Similarity index = 0.991.

The first two hits obtained by each of the above four searching algorithms for the five mixtures are listed in Table I. The absolute value and/or derivative absolute value methods gave the best identification of four "unknown" mixtures. Both the absolute difference and the derivative of the absolute difference methods produced the correct composition for MIX23, and the DAB method correctly identified two of the three components in MIX31 (target components are shown in bold face). Two of the mixtures (MIX23 and MIX32) have very low HQI values (0.20-0.28), suggesting that these are not mixtures but single components. Although combining techniques such as searching small spectral regions, reverse searching, and applying a spectral subtraction routine may improve the capability of identifying the remaining component(s), they require tedious processing steps and are limited by the operator's experience and background.
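The four traditional metrics described in this section can be sketched generically. The exact formulas and HQI scaling of the Search-32 software are not given in the text, so everything below is an assumption: the function names are hypothetical, spectra are scaled to unit length before comparison, the derivative is approximated by first differences, and the returned distances are not on the HQI scale of Table I.

```python
import math


def unit_vector(spectrum):
    """Scale a spectrum to unit length so overall intensity does not affect the match."""
    norm = math.sqrt(sum(x * x for x in spectrum))
    return [x / norm for x in spectrum] if norm > 0.0 else list(spectrum)


def first_difference(spectrum):
    """Crude derivative: differences between adjacent data points."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def ls_distance(unknown, reference):
    """Least-squares style metric: Euclidean distance (0 means identical shape)."""
    u, r = unit_vector(unknown), unit_vector(reference)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, r)))


def ab_distance(unknown, reference):
    """Absolute-difference metric: sum of absolute deviations."""
    u, r = unit_vector(unknown), unit_vector(reference)
    return sum(abs(a - b) for a, b in zip(u, r))


def dls_distance(unknown, reference):
    """Least-squares metric applied to first differences."""
    return ls_distance(first_difference(unknown), first_difference(reference))


def dab_distance(unknown, reference):
    """Absolute-difference metric applied to first differences."""
    return ab_distance(first_difference(unknown), first_difference(reference))


if __name__ == "__main__":
    reference = [0.0, 1.0, 4.0, 1.0, 0.0]
    offset = [x + 0.5 for x in reference]  # same band on a raised baseline
    print(ls_distance(reference, offset))   # nonzero: the baseline degrades the match
    print(dls_distance(reference, offset))  # 0.0: differencing removes a constant baseline
```

The example shows only why the derivative variants are less sensitive to baseline offsets; none of these single-spectrum distances addresses the mixture problem itself, which is the limitation discussed above.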

Mix-Match Search Algorithm. The PCR prefilter was used to cluster similarly structured components into a small population (the first 100 library entries were collected), and this step was followed by the dot-product calculation between scores of the unknown and those of the individual reference spectra. Principal component analysis of the entire 3169-compound library suggested that 200 factors are required in order to reproduce the original spectra. The combination of the Fourier compression and principal component analysis reduces the storage size to 12% of the original spectral size. The results are reported as the top 15 best-hits based on the dot-product between the scores of the unknown and those of the reference library.

FIG. 1. Results of Mix-Match search for MIX21 in the 3169-compound library. [Dot-product plot; panel title: Mix-Match Search MIX21, EPA508/EPA863, o-Chlorotoluene/m-Toluidine; Similarity Index = 0.976.]

FIG. 3. Results of Mix-Match search for MIX31 in the 3169-compound library. [Dot-product plot; panel title: Mix-Match Search MIX31, EPA9/EPA35/EPA824; Similarity Index = 0.991.]

As shown in Fig. 1 for the analysis of MIX21, the best-hit is the target component #508 (o-chlorotoluene); the other high values were for similarly structured isomers or derivatives. After removal of #508 from the mixture scores and reference library scores using the adaptive filter, the best-hit is #863 (m-toluidine), which is the second target component in MIX21. After the adaptive filter is used to remove these two components, the residues of all 15 components dropped to 0.04. The trace residues may be attributed to chemical impurities, spectral noise, or nonlinear chemical interaction in the mixture. In addition, the very small trace residues indicate that MIX21 contained only two components. In order to evaluate the reliability of this solution, the SI value is calculated to be 0.976, which indicates high correlation between the selected combination of components and the mixture unknown.
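The successive search-and-filter passes described for MIX21 can be sketched as a short loop. This is a simplified illustration rather than the Mix-Match program itself: the function names are hypothetical, scores are scaled to unit length once at the start and not rescaled after filtering (an assumption, so that residual dot-products shrink as components are removed), and the 0.05 stopping value is borrowed from the residue level quoted in Table II.

```python
import math


def unit(vec):
    """Scale a score vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def remove_component(vec, target):
    """Subtract from vec its projection onto target."""
    tt = dot(target, target)
    if tt == 0.0:
        return list(vec)
    scale = dot(vec, target) / tt
    return [v - scale * t for v, t in zip(vec, target)]


def mix_match(u_unk, subgroup_scores, threshold=0.05, max_components=5):
    """Repeat: take the best dot-product hit, then orthogonalize everything to it.

    Stops when the best remaining dot-product falls below threshold.
    Returns (subgroup_index, dot_product) pairs in the order they were found."""
    unk = unit(u_unk)
    lib = [unit(row) for row in subgroup_scores]
    found = []
    for _ in range(max_components):
        values = [dot(unk, row) for row in lib]
        best = max(range(len(lib)), key=lambda i: values[i])
        if values[best] < threshold:
            break  # only small residuals remain
        found.append((best, values[best]))
        target = lib[best]
        unk = remove_component(unk, target)
        lib = [remove_component(row, target) for row in lib]
    return found


if __name__ == "__main__":
    # Hypothetical three-entry subgroup; the unknown is 0.7 * entry 0 + 0.3 * entry 1.
    subgroup = [[1.0, 0.0, 0.0], [0.2, 1.0, 0.0], [0.0, 0.1, 1.0]]
    unknown = [0.76, 0.3, 0.0]
    for index, value in mix_match(unknown, subgroup):
        print(index, round(value, 3))
```

In this toy case the loop reports entries 0 and 1 and then stops because nothing above the threshold remains, which parallels the conclusion in the text that small residues indicate no further components.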

One challenging example of a two-component system is that containing aromatic and aliphatic compounds with similar functional groups. From the analysis results on MIX22 shown in Table II, the PCR prefilter followed by the dot-product of the scores correctly identified anisole (#881), but the prefilter excluded 2-pentanone from the top 100. However, the prefilter made it possible to filter out other interfering component(s) which inhibited any correct identification of anisole. For example, if the entire library is searched with a dot-product metric of the normalized spectra without the PCR prefilter, the spectrum due to 2-propanone, o-methoxyphenyl-, which contains a combination of aromatic and aliphatic patterns, is selected. The prefilter eliminated this improper selection and made it possible to properly identify anisole. Successive applications of the adaptive filter as described above did not produce small residuals, which indicated that at least some components were improperly identified.

FIG. 2. Results of Mix-Match search for MIX23 in the 3169-compound library. [Dot-product versus library number; panel title: Mix-Match Search MIX23, EPA799/EPA75, 1,3,5- and 1,2,4-Trimethylbenzene.]

FIG. 4. Results of Mix-Match search for MIX32 in the 3169-compound library. [Dot-product versus library number; panel title: Mix-Match Search MIX32, EPA572/EPA818/EPA1697; traces: PRIMARY, RM 572, RM 572/818, RM 572/818/1697.]

After unsuccessful attempts to find selections that produced minimum residuals, we used the adaptive filter to make the scores of all compounds in the library and the score of the mixture orthogonal to anisole. When the dot-product metric was used on these modified scores in the entire library, 2-pentanone (#583) was correctly identified, as shown in Table II(b). Moreover, when the modified scores of the entire library were made orthogonal to #583, another dot-product search produced values <0.05. The SI for the combination of #881 and #583 is 0.991, which strongly supports the selections.

The results on MIX23, which contained two similarly structured isomers, 1,2,4- and 1,3,5-trimethylbenzene, are shown in Fig. 2; this two-component mixture was easily identified. However, the dot-product between mixture residues and #75 (1,2,4-TMB) is about 0.11 after #799 (1,3,5-TMB) is removed by the adaptive filter. This may be explained by the two isomers having spectral similarity. Filtering #799 from the unknown also eliminated major patterns of #75, causing lower dot-product values. Nevertheless, the high SI value (0.994) of these two target components (#75 and #799) clearly supported the correct identification of this mixture.

In the analysis of three-component systems, MIX31 and MIX32, three similar and/or related aromatic derivatives can be easily distinguished with the PCR prefilter and the adaptive filter, as shown in Figs. 3 and 4. The second application of the adaptive filter to MIX32 produced small values; however, the correct target compound, p-bromoanisole (#1697), was identified. It should be mentioned that the volume of p-bromoanisole was 20% of those of the other two components in the mixture. After filtering of all three selected target components in the mixtures, the residues were very small and the SI values for the two mixtures were 0.991 and 0.970, respectively, both results supporting the identification for these two mixtures.

CONCLUSIONS

Searching any type of library for unknown mixtures is not straightforward and can often lead to incorrect identifications. This is especially true when the mixture spectrum is a combination of components having chemical groups that can all belong to the same molecule whose spectrum appears in the library. As we showed in the accompanying paper, 1 the problems can be solved as long as there is enough spectral information to distinguish the components. For large libraries, there is a lack of linear independence between spectra and a lack of spectral data points to make positive identifications of all components in the mixtures. As shown here, the difficulties can be at least partially overcome by using a prefilter to select a subgroup of spectra which can be further processed with a dot-product metric. Actually, the selection is made by the application of two successive metrics: the composition index (CI) and the dot-product. The analysis is supported by the application of the adaptive filter to find additional components and to confirm the findings by producing minimal residues. Finally, the similarity index (SI) is used as a final confirmation of the results.

In all cases, the target components, except for the alkyl ketone, appeared in the subgroup selected by the prefilter. In that example, successive applications of the adaptive filter did not produce small residual values, which suggested an improper identification; however, after filtering of the entire library, the correct target component was selected. In all of the other cases, successive applications of the adaptive filter produced the correct identification and eventually led to minimum residuals.

ACKNOWLEDGMENT

The authors wish to express their appreciation to Steven M. Donahue for many helpful discussions and suggestions for the development of the algorithm.

1. S.-C. Lo and C. W. Brown, Appl. Spectrosc. 45, 1621 (1991).
2. M. R. Nyden, Appl. Spectrosc. 40, 868 (1986).
3. M. R. Nyden, J. E. Pallister, D. T. Sparks, and A. Salari, Appl. Spectrosc. 41, 63 (1987).
4. C. W. Brown and S. M. Donahue, Appl. Spectrosc. 42, 347 (1988).
5. L. V. Azarraga, R. R. Williams, and J. A. de Haseth, Appl. Spectrosc. 35, 466 (1981).
6. J. A. de Haseth and L. V. Azarraga, Anal. Chem. 53, 2292 (1981).
7. J. W. Sherman, J. A. de Haseth, and D. G. Cameron, Appl. Spectrosc. 43, 1311 (1989).
8. S. M. Donahue and C. W. Brown, Anal. Chem. 63, 980 (1991).
