A NEW USE OF TARGET FACTOR ANALYSIS (TFA)
John H. Kalivas, Kevin Higgins
Department of Chemistry
Idaho State University
Pocatello, Idaho 83209 USA
Erik Andries
Department of Mathematics
Central New Mexico Community College
Albuquerque, New Mexico 87106 USA
Classification Situation
• Numerous classification approaches
  – KNN, LDA, MD, ANN, SVM, …
• As the number of classes in a problem increases, classification can become more difficult
• Target factor analysis (TFA) and net analyte signal (NAS)
  – TFA and NAS both compute analogous angles between a test sample vector and the respective spaces spanned by the library classes
  – Useful for binary or multiclass situations
Requirements
• Xi = m × n library information matrix for the ith class
  – m = number of samples
  – n = number of measurements
    • Wavelengths for spectra, other physical or chemical variables
  – Samples making up a library class must span the variances making up the class
    • Instrument profile, temperature effects, measurement process, others
• y = n × 1 test sample measurement vector
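Orthogonal Projection Spatial Angle (OPSA)
• Identical to TFA and NAS
  – Use same orthogonal projection
$X_i = U S V^t$
$\hat{y} = P y, \quad P = V V^t$
$\theta = \sin^{-1}\left( \frac{\|(I - P)\, y\|_2}{\|y\|_2} \right) = \sin^{-1}\left( \frac{\|y - \hat{y}\|_2}{\|y\|_2} \right)$
where $\sin\theta$ = NAS selectivity
[Figure: test sample vector y, its projection ŷ onto the library (Lib) space spanned by Xi, and the angle θ between y and ŷ]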
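The projection and angle on this slide can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function name `opsa_angle` and the toy library matrix are assumptions made for the example, and the library is taken to be samples × measurements as stated under Requirements.

```python
import numpy as np

def opsa_angle(X, y, d):
    """Angle (radians) between test vector y and the space spanned by the
    first d right singular vectors of library matrix X (m samples x n measurements)."""
    # Economy SVD: X = U S V^t; the rows of Vt are eigenvectors in measurement space.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:d].T                          # n x d basis for the class space
    residual = y - V @ (V.T @ y)          # (I - P) y with P = V V^t
    sin_theta = np.linalg.norm(residual) / np.linalg.norm(y)
    # Clamp to 1 so rounding error cannot push arcsin out of its domain.
    return float(np.arcsin(min(sin_theta, 1.0)))

# Hypothetical toy library: two samples lying in the x-y plane of a
# three-measurement space (illustration only, not data from the slides).
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
in_plane = np.array([2.0, -3.0, 0.0])     # lies inside the class space
off_plane = np.array([0.0, 0.0, 5.0])     # orthogonal to the class space

angle_in = opsa_angle(X, in_plane, d=2)
angle_off = opsa_angle(X, off_plane, d=2)
```

A vector inside the class space should give an angle near 0, and a vector orthogonal to it an angle near π/2.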
Process
• No data preprocessing
• Perform SVD of each library class
• Retain d eigenvectors (class-wise), where 1 ≤ d ≤ k and k = rank(X) ≤ min(m, n)
• Compute OPSA, MD, and KNN for the test sample relative to each library class
  – Use leave-one-out cross-validation (LOOCV)
• The library class with the smallest angle or MD is the test sample classification
• KNN classification trends evaluated
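The OPSA part of this process might look roughly like the following sketch. It assumes that the left-out sample is removed only from its own class library before the SVD, which the slides do not spell out; the function names and the synthetic data are hypothetical, and MD and KNN are not included.

```python
import numpy as np

def sin_angle(X, y, d):
    """Sine of the angle between y and the span of the first d right singular vectors of X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:d].T
    return np.linalg.norm(y - V @ (V.T @ y)) / np.linalg.norm(y)

def loocv_opsa(libraries, d_per_class):
    """Leave each sample out of its own class library, then assign it to the class
    with the smallest angle. Returns (true_labels, predicted_labels)."""
    truth, predicted = [], []
    for c, Xc in enumerate(libraries):
        for i in range(Xc.shape[0]):
            y = Xc[i]
            scores = []
            for k, Xk in enumerate(libraries):
                # Assumption: the test sample is removed only from its own class.
                Xlib = np.delete(Xk, i, axis=0) if k == c else Xk
                scores.append(sin_angle(Xlib, y, d_per_class[k]))
            truth.append(c)
            # On [0, pi/2] the sine is monotonic, so the smallest sine is the smallest angle.
            predicted.append(int(np.argmin(scores)))
    return truth, predicted

# Hypothetical synthetic data: each class is one fixed profile with a random
# scale plus small noise (8 samples x 6 measurements per class).
rng = np.random.default_rng(0)
profiles = np.eye(3, 6)                   # three orthogonal class profiles
libraries = [np.outer(rng.uniform(1, 2, 8), p) + 0.01 * rng.standard_normal((8, 6))
             for p in profiles]

truth, predicted = loocv_opsa(libraries, d_per_class=[1, 1, 1])
accuracy = float(np.mean(np.array(truth) == np.array(predicted)))
```

Recomputing the SVD for every left-out sample is the simplest way to keep the test sample out of its own library; for large libraries a rank-one downdate would be cheaper, but that is beyond this sketch.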
Assessment
• Accuracy = (TP + TN)/(TP + TN + FP + FN)
  – TP = true positives
  – TN = true negatives
  – FP = false positives
  – FN = false negatives
• Receiver operating characteristic (ROC)
  – True positive rate = sensitivity = TP/(TP + FN)
  – False positive rate = 1 – specificity = 1 – TN/(TN + FP)
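As a small worked example of these formulas, the sketch below counts TP, TN, FP, and FN for one library class treated as the positive class against all others. The function name and the labels are hypothetical, chosen only to illustrate the arithmetic.

```python
def one_vs_rest_metrics(truth, predicted, target):
    """Accuracy, sensitivity, and specificity with `target` as the positive class."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == target and p == target for t, p in pairs)
    tn = sum(t != target and p != target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # 1 - false positive rate
    }

# Hypothetical labels for six test samples drawn from classes "A" and "B".
truth     = ["A", "A", "A", "B", "B", "B"]
predicted = ["A", "A", "B", "B", "B", "A"]

# For class "A": TP = 2, FN = 1, FP = 1, TN = 2.
metrics_a = one_vs_rest_metrics(truth, predicted, "A")
```

With these counts, accuracy is 4/6 and both sensitivity and specificity are 2/3.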
Determining Eigenvectors
• Numerous approaches exist to determine the minimum number of eigenvectors to span X
• Determination of rank by augmentation (DRAUG)
  – Malinowski ER. J. Chemom. 2011; 25: 323-328
• Distinguishes primary eigenvectors (chemical, instrumental, etc.) from secondary eigenvectors (experimental error) independent of the distribution of the experimental uncertainties
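The sketch below is not DRAUG. It swaps in a much simpler heuristic, the cumulative fraction of variance that the scree plots in this deck display, to show how a candidate number of eigenvectors could be read off the singular values. The cutoff value and function names are assumptions for illustration, and the variance is uncentered because the process uses no data preprocessing.

```python
import numpy as np

def cumulative_variance_fraction(X):
    """Cumulative fraction of (uncentered) variance captured by the first 1..k eigenvectors."""
    s = np.linalg.svd(X, compute_uv=False)    # singular values, largest first
    return np.cumsum(s**2) / np.sum(s**2)

def smallest_d_reaching(X, threshold):
    """Smallest d whose cumulative fraction is at least `threshold` (a hypothetical cutoff)."""
    frac = cumulative_variance_fraction(X)
    d = int(np.searchsorted(frac, threshold)) + 1
    return min(d, len(frac))                  # guard against rounding just below 1

# Hypothetical matrix with singular values 4, 3, 0, so the squared values are 16, 9, 0.
X = np.diag([4.0, 3.0, 0.0])
frac = cumulative_variance_fraction(X)        # approximately [0.64, 1.0, 1.0]
d90 = smallest_d_reaching(X, 0.90)
d50 = smallest_d_reaching(X, 0.50)
```

A fixed variance cutoff depends on the noise level of the data, which is the kind of dependence the DRAUG bullet above says that method avoids.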
Plastic Data
• Six classes (types 1-6 of the seven commercial plastic types)
  – Allen V, Kalivas JH, Rodriguez RG. Applied Spec. 1999; 53: 672-681
• Raman spectroscopy (850 – 1800 cm⁻¹, 1093 wavenumbers per spectrum)
  – Type 1 = polyethylene terephthalate (PET); 30 samples
  – Type 2 = high-density polyethylene (HDPE); 29 samples
  – Type 3 = polyvinyl chloride (PVC); 13 samples
  – Type 4 = low-density polyethylene (LDPE); 22 samples
  – Type 5 = polypropylene (PP); 23 samples
  – Type 6 = polystyrene (PS); 29 samples
Plastic Score and Scree Plots
[Figure: Score Plot (PC1 vs. PC2; Types 1-6) and Scree Plot (Fraction of Variance vs. Number of Eigenvectors)]
• Unique clusters are not formed
• Most of the spectral variance is captured with the first eigenvector
Plastic Classification Results

Library plastic(a)   Accuracy (%)     Sensitivity (%)   Specificity (%)
                     OPSA    MD       OPSA    MD        OPSA    MD
Type 1 (9)           100     94       100     83        100     97
Type 2 (9)           100     97       100     93        100     99
Type 3 (4)           100     85       100     54        100     91
Type 4 (6)           100     86       100     59        100     92
Type 5 (9)           100     78       100     35        100     87
Type 6 (11)          100     98       100     93        100     99
(a) Values in parentheses are the DRAUG eigenvector number rounded to the nearest whole number

[Figure: Total Accuracy Across All Classes (Accuracy vs. Number of Eigenvectors; OPSA and MD)]
[Figure: ROC Plot (True Positive Rate (Sensitivity) vs. False Positive Rate (1-Specificity)); numbers indicate number of eigenvectors]
[Figure: KNN (Specificity, Sensitivity, and Accuracy vs. Number of Nearest Neighbors)]
Archeological Data
• Four classes (four archeological sources of obsidian)
  – Kowalski BR, Schatzki TF, Stross FH. Anal. Chem. 1972; 44: 2176-2180
• 10 trace metal concentrations from X-ray fluorescence spectroscopy (Fe, Ti, Ba, Ca, K, Mn, Rb, Sr, Y, and Zr)
  – Source 1 = 10 samples
  – Source 2 = 9 samples
  – Source 3 = 23 samples
  – Source 4 = 21 samples
Archeological Classification Results

Library source(a)    Accuracy (%)     Sensitivity (%)   Specificity (%)
                     OPSA    MD       OPSA    MD        OPSA    MD
Source 1 (2)         100     80       100     60        100     87
Source 2 (4)         100     100      100     100       100     100
Source 3 (4)         100     98       100     96        100     99
Source 4 (3)         100     100      100     100       100     100
(a) Values in parentheses are the DRAUG eigenvector number rounded to the nearest whole number

[Figure: Total Accuracy Across All Classes (Accuracy vs. Number of Eigenvectors; OPSA and MD)]
[Figure: KNN (Specificity, Sensitivity, and Accuracy vs. Number of Nearest Neighbors)]
[Figure: Score Plot (PC1 vs. PC2; Sources 1-4) and Scree Plot (Fraction of Variance vs. Number of Eigenvectors)]
Gasoil Data
• Three classes (three commercial sources of gasoil)
  – Wentzell P, Andrews D, Walsh J, Cooley J, Spencer P. Can. J. Chem. 1999; 77: 391-400
• Ultraviolet spectroscopy (200 – 400 nm, 572 wavelengths per spectrum)
  – Source 1 = 59 samples
  – Source 2 = 25 samples
  – Source 3 = 30 samples
Gasoil Classification Results

Library source(a)    Accuracy (%)     Sensitivity (%)   Specificity (%)
                     OPSA    MD       OPSA    MD        OPSA    MD
Source 1 (11)        100     100      100     100       100     100
Source 2 (8)         95      89       92      84        96      92
Source 3 (11)        98      82       97      73        98      87
(a) Values in parentheses are the DRAUG eigenvector number rounded to the nearest whole number

[Figure: Total Accuracy Across All Classes (Accuracy vs. Number of Eigenvectors; OPSA and MD)]
[Figure: KNN (Specificity, Sensitivity, and Accuracy vs. Number of Nearest Neighbors)]
[Figure: Score Plot (PC1 vs. PC2; Sources 1-3) and Scree Plot (Fraction of Variance vs. Number of Eigenvectors)]
Extra Virgin Olive Oil (EVOO) Data
• Six classes (six adulterant oils)
  – Poulli KI, Mousdis GA, Georgiou CA. Food Chem. 2007; 105: 369-375
• Synchronous fluorescence spectroscopy (250 – 400 nm at Δλ = 20 nm, 151 wavelengths per spectrum)
  – Adulterant 1 = corn
  – Adulterant 2 = olive-pomace
  – Adulterant 3 = soybean
  – Adulterant 4 = sunflower
  – Adulterant 5 = rapeseed
  – Adulterant 6 = walnut
• 31 samples each at 0.5 to 95 % adulterant
EVOO Classification Results

Library adulterant(a)   Accuracy (%)     Sensitivity (%)   Specificity (%)
                        OPSA    MD       OPSA    MD        OPSA    MD
Corn (8)                98      89       93      68        99      94
Olive-pomace (4)        100     92       100     77        100     95
Rapeseed (6)            93      88       81      65        96      93
Soybean (6)             100     93       100     80        100     96
Sunflower (6)           97      87       90      61        98      92
Walnut (4)              99      84       97      51        99      90
(a) Values in parentheses are the DRAUG eigenvector number rounded to the nearest whole number

[Figure: Total Accuracy Across All Classes (Accuracy vs. Number of Eigenvectors; OPSA and MD)]
[Figure: KNN (Specificity, Sensitivity, and Accuracy vs. Number of Nearest Neighbors)]
[Figure: Score Plot (PC1 vs. PC2; Corn, Olive-pomace, Rapeseed, Soybean, Sunflower, Walnut) and Scree Plot (Fraction of Variance vs. Number of Eigenvectors)]
EVOO Concentrations

Library adulterant(a)   Minimum adulterant concentration (%)
                        OPSA     MD
Corn (8)                1.73     14.91
Olive-pomace (4)        0.85     14.53
Rapeseed (6)            15.50    20.64
Soybean (6)             1.05     17.06
Sunflower (6)           4.02     18.62
Walnut (4)              0.82     21.24
(a) Values in parentheses are the DRAUG eigenvector number rounded to the nearest whole number

[Figure: Score Plot (PC1 vs. PC2; Corn, Olive-pomace, Rapeseed, Soybean, Sunflower, Walnut)]
[Figure: Concentration Coded Score Plot (PC1 vs. PC2; color scale: % Sunflower)]
Summary
• The TFA or NAS angular measure, OPSA, outperforms MD and KNN over a variety of data sets
  – If y is normalized to unit length, the same results are obtained with the TFA residual $\|y - \hat{y}\|_2^2$, since $\sin^2\theta = \|y - \hat{y}\|_2^2 / \|y\|_2^2$
• Score plots need not show obvious class separation
• Need to determine the number of eigenvectors (basis vectors) to characterize each library class
• Samples making up a library class need to span the variances making up that library class
  – Instrument profile
  – Temperature effects
  – Others