
Feature Selection for Anomaly Detection Using Optical Emission Spectroscopy

Puggini, L., & McLoone, S. (2016). Feature Selection for Anomaly Detection Using Optical Emission Spectroscopy. IFAC-PapersOnLine, 49(5), 132-137. https://doi.org/10.1016/j.ifacol.2016.07.102

Published in: IFAC-PapersOnLine

Document Version: Peer reviewed version

Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal

Publisher rights: © 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/, which permits distribution and reproduction for non-commercial purposes, provided the author and source are cited.


Feature Selection for Anomaly Detection

Using Optical Emission Spectroscopy

Luca Puggini ∗ Sean McLoone ∗∗

∗ National University of Ireland Maynooth, Maynooth, Ireland (e-mail: [email protected])

∗∗ Queen's University Belfast, Belfast, Northern Ireland, UK (e-mail: [email protected])

Abstract: To maintain the pace of development set by Moore's law, production processes in semiconductor manufacturing are becoming more and more complex. The development of efficient and interpretable anomaly detection systems is fundamental to keeping production costs low. As the dimension of process monitoring data can become extremely high, anomaly detection systems are impacted by the curse of dimensionality, hence dimensionality reduction plays an important role. Classical dimensionality reduction approaches, such as Principal Component Analysis, generally involve transformations that seek to maximize the explained variance. In datasets with several clusters of correlated variables the contributions of isolated variables to explained variance may be insignificant, with the result that they may not be included in the reduced data representation. It is then not possible to detect an anomaly if it is only reflected in such isolated variables. In this paper we present a new dimensionality reduction technique that takes account of such isolated variables and demonstrate how it can be used to build an interpretable and robust anomaly detection system for Optical Emission Spectroscopy data.

Keywords: Semiconductors, Fault Detection, Dimensionality Reduction, OC-SVM, OES Spectrum

1. INTRODUCTION

Semiconductor manufacturing is one of the most rapidly evolving industries. To remain competitive producers must continually deliver new devices that are smaller, faster and/or more energy efficient than previous generations, while at the same time keeping production costs low. In this context the ability to detect faults during the production process reduces the number of incorrectly processed wafers and directly translates into improved overall process yield and throughput (He and Wang (2007)).

As a result, fault or anomaly detection is an active area of research within the semiconductor manufacturing environment. Some recent examples are Puggini et al. (2015) and Mahadevan and Shah (2009), where anomaly detection in OES time series is performed with unsupervised random forest and one-class support vector machines (OC-SVM), or Ren and Lv (2014), He and Wang (2007) and Verdier and Ferreira (2011), where clustering is used to separate normal and anomaly samples.

Data driven anomaly detection systems can roughly be divided into three subgroups according to the information available about the data during the training phase:

• Supervised anomaly detection, where samples from normal and abnormal behaving wafers are available to train classifiers such as Linear Discriminant Analysis (LDA), Support Vector Machines (SVM) and k-nearest neighbours to distinguish between normal and anomalous samples (Chandola et al. (2009)).

• Semi-supervised anomaly detection, where only data for normal samples is available. Systems can then be trained to assign an anomaly score to new samples according to how distant they are from the normal behaving ones. Several algorithms have been developed with this aim, including Multivariate Control Charts (Lowry and Montgomery (1995)), one-class SVMs (Scholkopf et al. (2001)) and Unsupervised Random Forests (Shi and Horvath (2006)).

• Unsupervised anomaly detection, where no information is available about the data (i.e. the data is unlabeled) but assumptions are made regarding the frequency and distinctiveness of the anomalies within the overall dataset. This structure is then revealed and potential anomalies identified through the application of unsupervised clustering techniques such as DBSCAN (Ester et al. (1996)) and Max Separation Clustering (Flynn and McLoone (2011)).

Optical emission spectroscopy (OES) is increasingly being used by semiconductor manufacturers for plasma etch process monitoring due to its ability to track variations in the chemical composition of a plasma over time. The OES data is composed of measurements of the light emitted from the plasma as a function of wavelength and time. Figure 1 shows a sample spectrum from the plasma etch process case study which will be introduced in Section 3. OES has been shown to be an effective wafer processing monitoring signal (Chen et al. (1996), Puggini et al. (2014)) and has been employed for applications such as anomaly detection (Puggini et al. (2015), Yue et al. (2000)) and etch rate prediction (Puggini and McLoone (2015), Zeng and Spanos (2009)). OES data is generally characterized by high dimension (Prakash et al. (2012)), which poses a problem for anomaly detection algorithms. Most anomaly detection algorithms are based on a distance measure and it is known that distance measures become meaningless in high dimensional spaces due to the so-called curse of dimensionality (Kriegel et al. (2008)).

[Figure: 3D plot of OES intensity as a function of wavelength and time t.]

Fig. 1. A typical OES spectrum from the case study presented in Section 3.

In this paper the focus is on developing an appropriate data representation and dimensionality reduction technique for anomaly detection using OES in semiconductor manufacturing. In particular, a Forward Selection Independent Variables (FSIV) algorithm is proposed as an enhancement to Forward Selection Component Analysis (FSCA) (Prakash et al. (2012)) that yields better features for anomaly detection than Principal Component Analysis (PCA) (Jolliffe (2002)) or FSCA when the anomaly occurs in an isolated variable in high dimensional correlated datasets. The efficacy of FSIV is demonstrated using both simulated and industrial case studies. In the industrial case study, a semi-supervised anomaly detection system is developed using a one-class SVM as the classification engine.

The remainder of the paper is organised as follows. Section 2 introduces the FSIV algorithm and demonstrates its performance with respect to PCA and FSCA for a simulated example. Similar results are then presented in Section 3 for an industrial plasma etch case study. The anomaly detection classifier is developed in Section 4 and the results of its application to the industrial case study presented in Section 5. Finally, conclusions are provided in Section 6.

2. DIMENSIONALITY REDUCTION IN ANOMALY DETECTION

Dimensionality reduction techniques such as PCA and FSCA seek to obtain lower dimensional approximations of datasets from which it is possible to reconstruct the majority of the information in the original high dimensional datasets, usually defined in terms of the percentage of explained variance. While they are generally very useful for generating compact representations of highly correlated datasets, the reduced representations are not guaranteed to retain sufficient information to detect isolated anomalies. In particular, in datasets with several large clusters of correlated variables, the contributions of isolated uncorrelated variables to explained variance may be insignificant, with the result that such variables may not be included in the reduced data representation. It is then not possible to detect an anomaly if it is only reflected in such isolated variables.

Mitra et al. (2002) and Flynn and McLoone (2011) have developed algorithms that perform unsupervised feature selection while at the same time attempting to retain isolated variables in the data. In these algorithms the variables are recursively clustered. In the former, for each variable the set of its k-nearest variables is computed according to a similarity function; the variable which is closest to its kth neighbour is retained while its k neighbours are discarded, and the process ends when all the k-neighbours of all the variables are closer than a certain threshold to their centroid. In the latter, centroids for new clusters are chosen based on how different they are from the data in existing clusters, and individual clusters are formed on the basis of exceeding a similarity threshold. When clustering is complete the reduced dataset representation is defined as the centroids of the clusters.

2.1 FSIV Algorithm

Both Mitra et al. (2002) and Flynn and McLoone (2011) select features based on a function s(x, y) that measures the similarity between two variables. In general, instead of discarding variables that are similar to those already selected, it is more useful to know which variables are not adequately represented by the selected variables. With this in mind, Forward Selection Independent Variable (FSIV) analysis is proposed as a tool for efficient unsupervised feature selection in anomaly detection.

Here the steps required to select K variables with FSCA (Prakash et al. (2012)) are recalled:

1. Start with the full data $X = (x_1, \dots, x_p)$ and $K$ the number of variables to select. Initialize $Z_0 = \emptyset$ and $k = 0$.
2. Scale the data to zero mean.
3. Define $Z^v_{k+1}$ as the matrix $Z_k$ with the addition of the variable $x_v$, i.e. $Z^v_{k+1} = (Z_k, x_v)$.
4. Define $Z_{k+1}$ as:
$$Z_{k+1} = \underset{v}{\operatorname{argmin}} \, \| X - Z^v_{k+1} \left( (Z^v_{k+1})^T Z^v_{k+1} \right)^{-1} (Z^v_{k+1})^T X \|^2 \quad (1)$$
5. Update $k = k + 1$.
6. If $k < K$ return to step 3. Otherwise output $Z_K$, the set of selected variables.
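For concreteness, the following is a minimal NumPy sketch of the FSCA loop above. It is not the authors' implementation: the function name fsca, the brute-force candidate search and the least-squares solve in place of an explicit matrix inverse are all implementation choices made here.

```python
import numpy as np

def fsca(X, K):
    """Greedy FSCA sketch: at each step add the column of X whose inclusion
    minimizes the reconstruction residual of equation (1).
    X is an (n, p) array whose columns are variables; returns column indices."""
    Xc = X - X.mean(axis=0)                       # step 2: zero-mean the data
    selected = []
    for _ in range(K):
        best_v, best_err = None, np.inf
        for v in range(Xc.shape[1]):              # steps 3-4: score each candidate
            if v in selected:
                continue
            Z = Xc[:, selected + [v]]
            # lstsq solves Z B = Xc in the least-squares sense, i.e.
            # B = (Z^T Z)^{-1} Z^T Xc, so Z @ B is the projection of Xc onto Z
            residual = Xc - Z @ np.linalg.lstsq(Z, Xc, rcond=None)[0]
            err = (residual ** 2).sum()
            if err < best_err:
                best_v, best_err = v, err
        selected.append(best_v)                   # steps 5-6: grow Z_k and repeat
    return selected
```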

The FSIV algorithm begins by selecting its first $k$ variables $(z_1, \dots, z_k)$ using the FSCA algorithm. This step is required to ensure the presence of the variables that represent the largest variation in the data. Then, additional variables are added in order to model significant isolated variations that are not captured by the first $k$ variables. The process ends when $K$ variables are selected or when the error $\varepsilon_j$, defined according to equations (4) and (5), is smaller than a given threshold. The FSIV algorithm is thus defined as follows:


1. Start with the full data $X = (x_1, \dots, x_p) \in \mathbb{R}^{n \times p}$ and set $k$ and a stop criterion.
2. Scale the data such that each variable has zero mean.
3. Select $k$ variables $z_1, \dots, z_k$ using the FSCA algorithm.
4. Define the matrix $Z = (z_1, \dots, z_k)$.
5. Compute the linear approximation of $X$:
$$\hat{X} = Z (Z^T Z)^{-1} Z^T X \in \mathbb{R}^{n \times p} \quad (2)$$
where
$$\hat{X} = (\hat{x}_1, \dots, \hat{x}_p) \quad (3)$$
6. For each variable $x_i$ in $X$ compute its approximation error
$$\varepsilon_i = \| x_i - \hat{x}_i \|_2^2 \quad (4)$$
where $\hat{x}_i$ is the $i$th column of $\hat{X}$.
7. Select $x_j$, the variable with the highest approximation error, where
$$j = \underset{i}{\operatorname{argmax}} \, \varepsilon_i \quad (5)$$
8. Add $x_j$ to the $Z$ matrix.
9. Stop if the termination criterion is reached; otherwise set $k = k + 1$ and repeat from step 5.

The distinguishing feature of FSIV is that a variable is added to the model if it cannot be adequately reconstructed by a linear combination of those already selected. This makes the algorithm more effective than methods based on pairwise similarity between variables: for example, variables with low pairwise correlation may nonetheless be linearly dependent (Rodgers et al. (1984)), a relationship that a similarity measure misses but a linear reconstruction captures. The following example illustrates the difference in performance between PCA, FSCA and FSIV.
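The FSIV loop itself is then a short extension of the same machinery. The sketch below reuses the fsca sketch above; the tol argument is an assumed form for the unspecified termination criterion.

```python
def fsiv(X, k, K, tol=1e-12):
    """FSIV sketch: seed with k FSCA variables, then repeatedly add the
    variable with the largest approximation error (equations (2)-(5))."""
    Xc = X - X.mean(axis=0)                      # step 2: zero-mean the data
    selected = fsca(Xc, k)                       # step 3: FSCA initialization
    while len(selected) < K:
        Z = Xc[:, selected]                      # step 4
        X_hat = Z @ np.linalg.lstsq(Z, Xc, rcond=None)[0]    # step 5, eq. (2)
        eps = ((Xc - X_hat) ** 2).sum(axis=0)    # step 6, eq. (4)
        j = int(np.argmax(eps))                  # step 7, eq. (5)
        if eps[j] <= tol:                        # assumed stop criterion
            break
        selected.append(j)                       # step 8
    return selected
```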

2.2 Simulated Example

Consider the simulated data $X = (x_1, \dots, x_7) \in \mathbb{R}^{n \times 7}$ defined by three groups of variables $X_1 = \{x_1, x_2, x_3\}$, $X_2 = \{x_4, x_5, x_6\}$ and $X_3 = \{x_7\}$. Each variable has correlation 0.9 with the others in the same group and between the variables in $X_1$ and $X_2$ there is a correlation of 0.4. The variable in $X_3$ is instead isolated and has a correlation of only 0.1 with all other variables. Specifically, $X = (x_1, \dots, x_7) \sim N(0, \Sigma)$ where $\Sigma = \{\Sigma_{i,j}\}$ is defined as:

$$\Sigma_{i,j} = \begin{cases} 1 & \text{if } i = j \\ 0.9 & \text{if } i, j \in \{1,2,3\} \text{ or } i, j \in \{4,5,6\} \\ 0.4 & \text{if } i \in \{1,2,3\} \text{ and } j \in \{4,5,6\} \\ 0.4 & \text{if } j \in \{1,2,3\} \text{ and } i \in \{4,5,6\} \\ 0.1 & \text{if } i = 7 \text{ or } j = 7 \end{cases}$$

An anomaly is then introduced by replacing one of the samples in $x_7$ with the value 10. Dimensionality reduction is performed with PCA, FSCA and FSIV. In each case only two variables are selected. In FSIV the parameter $k$ is chosen as $k = 1$. The two dimensional representations of the data obtained with the various methods are reported in Figure 2. From the figure it can be observed that only FSIV is able to isolate the anomaly. In particular, FSCA tends to select one variable from $X_1$ and one from $X_2$ while FSIV selects a variable from $X_1$ and $x_7$. The PCA components instead are obtained as a weighted linear combination of all the variables. However, the weighting associated with $x_7$ is insufficient to materially affect the behaviour of the components, with the result that the anomaly is not distinguishable from the normal samples.
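A hedged reconstruction of this experiment follows, reusing the fsca and fsiv sketches above; the random seed, the sample size n = 500 and the position of the corrupted sample are assumptions, as the paper does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.full((7, 7), 0.1)                  # isolated x7: correlation 0.1
for g in ([0, 1, 2], [3, 4, 5]):
    Sigma[np.ix_(g, g)] = 0.9                 # within-group correlation 0.9
Sigma[np.ix_([0, 1, 2], [3, 4, 5])] = 0.4     # between-group correlation 0.4
Sigma[np.ix_([3, 4, 5], [0, 1, 2])] = 0.4
np.fill_diagonal(Sigma, 1.0)

X = rng.multivariate_normal(np.zeros(7), Sigma, size=500)
X[0, 6] = 10.0                                # anomaly in the isolated x7

print(fsca(X, 2))           # typically one variable from X1 and one from X2
print(fsiv(X, k=1, K=2))    # a group variable plus x7 (column index 6)
```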

[Figure: three scatter panels (FSIV, FSCA, PCA) showing the two dimensional projections of the simulated data.]

Fig. 2. The projection of the data for the simulated example on the first two variables selected by FSCA and FSIV, and the first two principal components obtained with PCA.

3. INDUSTRIAL CASE STUDY

To demonstrate the effectiveness of the proposed dimensionality reduction method, the technique is applied to a sample dataset from an industrial plasma etch chamber. The dataset consists of plasma etch Optical Emission Spectroscopy (OES) samples.

[Figure: schematic of the ECR plasma etching chamber, showing the gas inlet, quartz plate, solenoid coils, 2.45 GHz magnetron with wave-guide and microwave transmission, RF match box and PIM sensor, the wafer with backside helium cooling, and the OES sensor at the chamber exhaust.]

Fig. 3. A plasma etching chamber.

OES recordings were taken from the etching chamber exhaust, as depicted in Fig. 3. By using only OES data for process monitoring, the proposed dimensionality reduction and related fault-detection methodology is able to operate in real-time, thereby reducing the risk of costly faults propagating during production.

3.1 OES data

Noting that the OES data is naturally parameterized in terms of the wafer number, the processing time instant and the measured wavelengths (Yue et al. (2000)), the intensity of the $i$th wavelength of the $k$th wafer at time $t$ is denoted as $x_i^{w_k}(t)$.

OES spectra for $K = 500$ wafers are available, each one consisting of $\tau = 165$ samples of $p = 200$ wavelengths. The OES spectrum for a single wafer $w_k$ can be mathematically represented as a matrix $X_k \in \mathbb{R}^{\tau \times p}$:

$$X_k = \left\{ x_i^{w_k}(t_{(k-1)\tau + j}) \right\}_{j=1,\dots,\tau,\; i=1,\dots,p} \in \mathbb{R}^{\tau \times p} \quad (6)$$

and the full data is represented by a set $S$ containing the measurements for each wafer:

$$S = \left\{ X_j \in \mathbb{R}^{\tau \times p} : j = 1, \dots, K \right\}. \quad (7)$$

For practical purposes it is better to store the data in a two dimensional matrix. Two possible aggregations are considered, denoted as $\Lambda \in \mathbb{R}^{\tau K \times p}$ and $W \in \mathbb{R}^{K \times p\tau}$. These will be discussed in Sections 3.2 and 3.4, respectively.

Artificial Fault. In order to better show the difference between FSCA and FSIV an artificial wavelength $x_l^{w_k}(t)$ is added to the OES spectrum. $x_l^{w_k}(t)$ is defined for each wafer as

$$x_l^{w_k}(t) = 285(\sin(t) + \epsilon), \quad t \in [-\pi, \pi] \quad (8)$$

where $\epsilon \sim N(0, 0.05)$ and the amplitude 285 is selected to give a signal power that is similar to the other wavelengths. A fault is then introduced in the final wafer in the dataset by clamping the $l$th wavelength to lie between $-100$ and $100$, that is:

$$x_l^{w_K}(t) = \begin{cases} x_l^{w_K}(t) & \text{if } |x_l^{w_K}(t)| < 100 \\ 100 + \epsilon & \text{if } x_l^{w_K}(t) > 100 \\ -100 + \epsilon & \text{if } x_l^{w_K}(t) < -100 \end{cases} \quad (9)$$

where $\epsilon \sim N(0, 10)$. In Figure 4 the artificial wavelength $x_l^{w_k}(t)$ and a normal wavelength are illustrated for a group of five wafers. It follows that the faulty wafer can be classified as an anomaly only if the wavelength $l$ is among the selected ones.
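A sketch of equations (8) and (9) for a single wafer follows; treating the N(0, 0.05) and N(0, 10) parameters as standard deviations is an assumption, since the paper does not say whether they denote variances.

```python
import numpy as np

def artificial_wavelength(tau=165, faulty=False, rng=None):
    """Generate the synthetic wavelength of equation (8); when faulty=True,
    clamp it to [-100, 100] with added noise as in equation (9)."""
    rng = rng or np.random.default_rng()
    t = np.linspace(-np.pi, np.pi, tau)
    x = 285.0 * (np.sin(t) + rng.normal(0.0, 0.05, size=tau))
    if faulty:
        eps = rng.normal(0.0, 10.0, size=tau)
        x = np.where(x > 100.0, 100.0 + eps,
                     np.where(x < -100.0, -100.0 + eps, x))
    return x
```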

[Figure: two panels plotting the intensity $x_i^{w}(t)$ against $t$ for the artificial wavelength and for a normal wavelength.]

Fig. 4. The artificial wavelength and a normal wavelength over 800 time points and 5 wafers.

3.2 The Λ matrix

The data can be aggregated in a $\Lambda \in \mathbb{R}^{\tau K \times p}$ matrix. In $\Lambda$ each row corresponds to a time scan and each column to a wavelength. $\Lambda$ can be obtained by vertically stacking the matrices in $S$:

$$\Lambda = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{pmatrix} \in \mathbb{R}^{K\tau \times p} \quad (10)$$

The full OES data is then represented as a set of matrices where each one contains the spectrum for a given wafer. This format of the data is used to select a subset of relevant wavelengths. The wavelengths are chosen with the FSIV and FSCA algorithms.

3.3 Data approximation with FSCA and FSIV

FSCA and FSIV select variables using different criteria. Figure 5 shows the maximal error $\varepsilon_j$, defined according to equations (4) and (5), as a function of the number of components selected by FSCA and FSIV, while Figure 6 shows the corresponding explained variance (EV). In total 14 components are selected by both the FSCA and FSIV algorithms. The first 7 components of the FSIV algorithm are selected with FSCA, hence the performances of both methods in terms of both EV and $\varepsilon_j$ are identical for these components. In contrast, for the remaining 7 components we can observe that, as expected, the variables selected with FSIV lead to a lower error $\varepsilon_j$, while those selected by FSCA yield a larger percentage of EV. Notably, the $l$th wavelength is not selected by FSCA but is selected by FSIV as the 8th component. It can be observed in Figure 5 that the 8th component is where the performance of FSCA and FSIV begins to deviate in terms of the error $\varepsilon_j$.
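Both curves can be computed from a selected subset with a few lines. The helper below is a sketch following equations (2)-(5); the name ev_and_max_error is hypothetical.

```python
import numpy as np

def ev_and_max_error(X, selected):
    """Return the percentage of explained variance and the largest
    per-variable residual eps_j for the given subset of columns."""
    Xc = X - X.mean(axis=0)
    Z = Xc[:, selected]
    resid = Xc - Z @ np.linalg.lstsq(Z, Xc, rcond=None)[0]
    ev = 100.0 * (1.0 - (resid ** 2).sum() / (Xc ** 2).sum())
    return ev, (resid ** 2).sum(axis=0).max()
```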

[Figure: the maximal error $\varepsilon_j$ plotted against the number of components for FSIV and FSCA, with a zoomed view of components 6 to 14.]

Fig. 5. The error $\varepsilon_j$ as a function of the number of components selected with FSCA or FSIV.

[Figure: the percentage of explained variance plotted against the number of components for FSIV and FSCA, with a zoomed view of components 6 to 14.]

Fig. 6. The percentage of explained variance as a function of the number of components selected with FSCA or FSIV. The vertical line represents the value of $k$ for the FSIV algorithm.

3.4 W Matrix

Alternatively the data can be aggregated in order to have each wafer as an observation. For each wafer all the time scans of all the wavelengths are stored in a single row. This is equivalent to transforming all the matrices in $S$ into vectors and stacking them horizontally. Matrix $X_k$, the $k$th element of $S$, is reshaped as

$$X_k = (x_1^{w_k}(t_1), \dots, x_p^{w_k}(t_\tau)) \in \mathbb{R}^{1 \times \tau p} \quad (11)$$

and the full dataset is represented by combining all the reshaped matrices in $S$ as:

$$W = \begin{pmatrix} X_1 \\ \vdots \\ X_K \end{pmatrix} \in \mathbb{R}^{K \times \tau p} \quad (12)$$

This data format is particularly useful for comparing wafers and performing anomaly detection as each row corresponds to all the data observed for a given wafer. However, the dimension of the matrix is very large as it has $p\tau = 33000$ columns. The dimension of the data can be drastically reduced by selecting only a subset of the wavelengths from the $\Lambda$ matrix. If 14 wavelengths are selected from $\Lambda$ the number of columns in the $W$ matrix is reduced to $14\tau = 2310$. The $W$ matrix is still high dimensional but now it is small enough for high dimensional anomaly detection algorithms such as Unsupervised Random Forest and OC-SVM to be efficiently applied.
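As a sketch of the two aggregations and the wavelength-based reduction, assuming $S$ is held as a list of per-wafer (τ, p) arrays and reusing the fsiv sketch from Section 2.1; the row ordering used when flattening each wafer is one reasonable convention for equation (11), not necessarily the paper's:

```python
import numpy as np

K, tau, p = 500, 165, 200
S = [np.random.rand(tau, p) for _ in range(K)]   # stand-in for the real OES data

Lambda = np.vstack(S)                            # eq. (10): (K*tau, p), rows are time scans
W = np.stack([Xk.reshape(-1) for Xk in S])       # eq. (12): (K, tau*p), rows are wafers

# Selecting 14 wavelengths from Lambda (Section 3.3) shrinks W from
# p*tau = 33000 to 14*tau = 2310 columns.
selected = fsiv(Lambda, k=7, K=14)
W_red = np.stack([Xk[:, selected].reshape(-1) for Xk in S])   # (K, 14*tau)
```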

4. OC-SVM ANOMALY DETECTION

Once a lower dimensional approximation of the $W$ matrix is obtained it is possible to use it to develop an anomaly detection system. In plasma etching wafers are normally processed in batches of 25, called a lot, with the chamber undergoing a cleaning cycle between each lot. As a consequence of the cleaning step there is a seasoning effect during the processing of the first few wafers in each lot as chemicals absorb into the chamber walls, with the result that the processing of wafers 1, 2, 3 and 4 differs slightly from that of the remaining wafers 5 to 25. For the purposes of evaluating the performance of the different dimensionality reduction techniques as a pre-processing step for anomaly detection, we will consider wafers 1, 2, 3 and 4 in each lot as abnormal wafers. In addition, wafer 500 is also defined as abnormal due to the artificial anomaly introduced in the $l$th wavelength.

In order to train and test the anomaly detector the wafers in $W$ are split into a training set of 300 wafers containing measurements of only normal behaving wafers and a test set of 200 wafers containing normal and abnormally behaving wafers. The OC-SVM algorithm is used to assign an anomaly score to each wafer.

Given a training dataset $X \in \mathbb{R}^{n \times p}$, $X \subset \mathcal{X}$ where $\mathcal{X}$ is a compact subset of $\mathbb{R}^p$, and $\Phi$ a map into the dot product space $F$:

$$\Phi : \mathcal{X} \to F \quad \text{and} \quad \Phi(x) \cdot \Phi(y) = k(x, y) \quad \forall x, y \in \mathcal{X} \quad (13)$$

the OC-SVM optimization problem is defined as:

$$\min_{w \in F,\; \xi \in \mathbb{R}^n,\; \rho \in \mathbb{R}} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n} \xi_i - \rho \quad (14)$$

subject to the constraints

$$(w \cdot \Phi(x_i)) \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, n \quad (15)$$

Since the $\xi_i$ are penalized it is expected that the decision function

$$f(x) = \operatorname{sign}((w \cdot \Phi(x)) - \rho) \quad (16)$$

will be positive for most samples $x_i$ when $w$ and $\rho$ are optimized. At the same time $\|w\|$ is small, forcing $f(x) > 0$ only on a small region.

The OC-SVM is trained using only normal behaving wafers. Then an anomaly score, defined as

$$s(x) = -(w \cdot \Phi(x)) + \rho \quad (17)$$

is assigned to each sample $x$ in the test set. In other words, the anomaly score is the distance between $x$ and the estimated support of the normal behaving data.

5. RESULTS

The anomaly score assigned by the OC-SVM to each wafer in the test dataset when using each of the dimensionality reduction techniques is given in Figure 7. For completeness, the results obtained without dimensionality reduction are also reported. The results show that in general a larger anomaly score is assigned to the abnormal wafers, allowing them to be identified. The one exception is the artificially created abnormal wafer, denoted by the star, which is only correctly identified as an anomaly when FSIV is used. Even when all the wavelengths are used the artificial anomaly has a low anomaly score. This may be due to overfitting caused by the excessive number of variables. The performance of each method is also summarized in terms of the Area Under the Curve (AUC) classifier performance metric in Table 5.1, which again underscores the superiority of FSIV for this application.

        A.W.     FSCA     FSIV     PCA
AUC     0.9564   0.9507   0.9650   0.9527

Table 5.1: The AUC score obtained using OC-SVM when all the wavelengths are used (A.W.), when a subset of 14 wavelengths is selected with FSCA and FSIV, and when 14 PCA components are employed as inputs.

[Figure: four panels of OC-SVM anomaly scores per test wafer, one each for All Wavelengths, FSIV, FSCA and PCA.]

Fig. 7. Anomaly score assigned to each wafer in the test set: black circles denote the anomaly wafers (1-4 in each lot), white circles denote the normal wafers, and the artificial anomaly wafer is represented by a star.

6. CONCLUSION

This paper considers the problem of feature selection for anomaly detection with application to OES based semiconductor manufacturing process monitoring. FSIV is proposed as a new feature selection method that takes account of isolated variables in highly correlated high dimension datasets. FSIV combined with OC-SVM is evaluated using an industrial case study and shown to outperform FSCA and PCA for anomaly detection. While a good separation is achieved between normal and abnormal wafers some false positives are still present. Further research is required to understand the nature of these false positives.

ACKNOWLEDGEMENTS

The authors would like to thank Intel Ireland for providing the industrial case study for this research and Maynooth University for the financial support provided.

REFERENCES

Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 15.

Chen, R., Huang, H., Spanos, C., and Gatto, M. (1996). Plasma etch modeling using optical emission spectroscopy. Journal of Vacuum Science & Technology A, 14(3), 1901–1906.

Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, 226–231.

Flynn, B. and McLoone, S. (2011). Max separation clustering for feature extraction from optical emission spectroscopy data. IEEE Transactions on Semiconductor Manufacturing, 24(4), 480–488.

He, Q.P. and Wang, J. (2007). Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes. IEEE Transactions on Semiconductor Manufacturing, 20(4), 345–354. doi:10.1109/TSM.2007.907607.

Jolliffe, I. (2002). Principal Component Analysis. Wiley Online Library.

Kriegel, H.P., Zimek, A., et al. (2008). Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 444–452. ACM.

Lowry, C.A. and Montgomery, D.C. (1995). A review of multivariate control charts. IIE Transactions, 27(6), 800–810.

Mahadevan, S. and Shah, S.L. (2009). Fault detection and diagnosis in process data using one-class support vector machines. Journal of Process Control, 19(10), 1627–1639. doi:10.1016/j.jprocont.2009.07.011.

Mitra, P., Murthy, C., and Pal, S.K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301–312.

Prakash, P., Johnston, A., Honari, B., and McLoone, S. (2012). Optimal wafer site selection using forward selection component analysis. In Advanced Semiconductor Manufacturing Conference (ASMC), 2012 23rd Annual SEMI, 91–96. IEEE.

Puggini, L., Doyle, J., and McLoone, S. (2014). Towards multi-sensor spectral alignment through post measurement calibration correction.

Puggini, L., Doyle, J., and McLoone, S. (2015). Fault detection using random forest similarity distance. IFAC-PapersOnLine, 48(21), 583–588.

Puggini, L. and McLoone, S. (2015). Extreme learning machines for virtual metrology and etch rate prediction. In Signals and Systems Conference (ISSC), 2015 26th Irish, 1–6. IEEE.

Ren, L. and Lv, W. (2014). Fault detection via sparse representation for semiconductor manufacturing processes. IEEE Transactions on Semiconductor Manufacturing, 27(2), 252–259. doi:10.1109/TSM.2014.2302011.

Rodgers, J.L., Nicewander, W.A., and Toothaker, L. (1984). Linearly independent, orthogonal, and uncorrelated variables. The American Statistician, 38(2), 133–134.

Scholkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., and Williamson, R.C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.

Shi, T. and Horvath, S. (2006). Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15(1).

Verdier, G. and Ferreira, A. (2011). Adaptive Mahalanobis distance and k-nearest neighbor rule for fault detection in semiconductor manufacturing. IEEE Transactions on Semiconductor Manufacturing, 24(1), 59–68. doi:10.1109/TSM.2010.2065531.

Yue, H.H., Qin, S.J., Markle, R.J., Nauert, C., and Gatto, M. (2000). Fault detection of plasma etchers using optical emission spectra. IEEE Transactions on Semiconductor Manufacturing, 13(3), 374–385.

Zeng, D. and Spanos, C.J. (2009). Virtual metrology modeling for plasma etch operations. IEEE Transactions on Semiconductor Manufacturing, 22(4), 419–431.