Tuning Parameter Selection in KLDA

Jaime Seith¹, Supervisor: Dr. Yishi Wang²

¹ Department of Mathematics and Statistics, North Carolina State University
² Department of Mathematics and Statistics, University of North Carolina at Wilmington

INTRODUCTION

Dimension reduction techniques are widely used on large datasets in statistics and data mining. Implementing these methods can:

- Reduce the dimensionality of the data while still capturing its most important features
- Improve the accuracy of data classification
- Reduce computational runtime
- Allow visualization of the data in lower dimensions.

This research focuses specifically on Kernel Linear Discriminant Analysis (KLDA). Understanding the theoretical background behind KLDA and its statistical classification power helps in optimal tuning parameter selection, which is still an ongoing challenge. Selecting an optimal tuning parameter can yield well-separated classes without over-fitting the model.

BACKGROUND THEORY

Let $\phi$ be a non-linear mapping of the input to a higher-dimensional feature space. Define a $p \times \ell$ dimensional variable $X = [x_1, x_2, \dots, x_\ell]$, where $\ell$ is the total number of observations across the $C$ classes and $\ell_i$ is the number of observations in the $i$-th class. We try to find $w$, a linear combination of the mapped observations $\phi(x_i)$, such that among the projections of $X$,

$$Y = w^T \phi(X),$$

the between-class variance is maximized and the within-class variance is minimized. In the case of a two-class problem, we define an objective function that we aim to maximize:

$$J(w) = \frac{w^T S_B^{\phi} w}{w^T S_w^{\phi} w},$$

where the mean response vectors, between-class scatter matrix, and within-class scatter matrix are defined as:

$$m_i^{\phi} = \frac{1}{\ell_i} \sum_{j=1}^{\ell_i} \phi\!\left(x_j^{i}\right),$$

$$S_B^{\phi} = \left(m_1^{\phi} - m_2^{\phi}\right)\left(m_1^{\phi} - m_2^{\phi}\right)^T,$$

$$S_w^{\phi} = S_{w_1} + S_{w_2} = \sum_{x \in C_1} \left(\phi(x) - m_1^{\phi}\right)\left(\phi(x) - m_1^{\phi}\right)^T + \sum_{x \in C_2} \left(\phi(x) - m_2^{\phi}\right)\left(\phi(x) - m_2^{\phi}\right)^T.$$

The expression for $J(w)$ can be rewritten in terms of kernel functions, which makes the computation much faster. The Gaussian (RBF) kernel is used in this research; it has the form:

$$k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right).$$

The bandwidth $\sigma$ is the tuning parameter that we wish to optimize.
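To make the construction above concrete, the following is a minimal R sketch of the two-class kernel Fisher discriminant with an RBF kernel. The function names (rbf_kernel, klda_fit, klda_project) and the ridge-style regularization of the within-class kernel matrix are illustrative assumptions, not the exact implementation used in this work.

# Minimal sketch of two-class KLDA (kernel Fisher discriminant) with an RBF kernel.
# Assumes X is an n x p matrix of inputs and y is a two-level factor of class labels.

rbf_kernel <- function(A, B, sigma) {
  # Pairwise squared distances: ||a||^2 + ||b||^2 - 2 a.b
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-d2 / (2 * sigma^2))
}

klda_fit <- function(X, y, sigma, reg = 1e-3) {
  y    <- as.factor(y)
  K    <- rbf_kernel(X, X, sigma)                 # n x n kernel matrix
  n    <- nrow(K)
  idx1 <- which(y == levels(y)[1])
  idx2 <- which(y == levels(y)[2])
  M1   <- rowMeans(K[, idx1, drop = FALSE])       # kernelized class means
  M2   <- rowMeans(K[, idx2, drop = FALSE])
  Nmat <- matrix(0, n, n)                         # within-class (kernel) scatter
  for (idx in list(idx1, idx2)) {
    Ki   <- K[, idx, drop = FALSE]
    ni   <- length(idx)
    H    <- diag(ni) - matrix(1 / ni, ni, ni)     # centering within the class
    Nmat <- Nmat + Ki %*% H %*% t(Ki)
  }
  # Coefficients of w in the span of the mapped observations (regularized solve)
  alpha <- solve(Nmat + reg * diag(n), M1 - M2)
  list(alpha = alpha, X = X, sigma = sigma)
}

klda_project <- function(fit, Xnew) {
  # 1-D projection y(x) = sum_j alpha_j k(x_j, x)
  Knew <- rbf_kernel(Xnew, fit$X, fit$sigma)
  as.vector(Knew %*% fit$alpha)
}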

APPLICATION OF KLDA

Figure 1: Original Toy Datasets

(a) Checkerboard Data (b) Cookie Data (c) Swiss Roll Data (d) Wine Chocolate

Figure 2: KLDA Projections using σ = 0.01

(a) Checkerboard Data (b) Cookie Data (c) Swiss Roll Data (d) Wine Chocolate

KLDA was implemented on four toy datasets to assess how well the classes from each dataset were being separated.

SUPPORT VECTOR MACHINES

The Support Vector Machine (SVM) is a data mining tool that can be used in both classification and regression problems. In this research, it is applied after KLDA has been run on the data and is used for classification. SVM aims to separate the data by finding a hyperplane with maximal margin between the hyperplane and the support vectors of either class.

Figure 3: SVM two-class example, James et al.
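As a small self-contained illustration of the maximal-margin idea (not taken from the poster), the sketch below fits a linear SVM with the e1071 package to a separable two-class toy set and recovers the weight vector and the margin width 2/||w||; the data and variable names are made up for illustration.

library(e1071)

# Two separable classes in 2-D (toy data, for illustration only)
set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2) + 2,
           matrix(rnorm(40), ncol = 2) - 2)
y <- factor(rep(c("A", "B"), each = 20))

fit <- svm(x, y, kernel = "linear", scale = FALSE, cost = 10)

w      <- t(fit$coefs) %*% fit$SV   # weight vector of the separating hyperplane
margin <- 2 / sqrt(sum(w^2))        # width of the maximal margin
mean(predict(fit, x) != y)          # training error rate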

ACKNOWLEDGEMENTS

Thank you to the National Science Foundation for the grant that allowed the REU students to participate in the Statistical Data Mining and Machine Learning REU this summer. Thank you to my collaborators in the Dimension Reduction Group: Lynn Huang, Jackson Maris, and Ryan Wood. A special thanks to Dr. Yishi Wang and Dr. Cuixian Chen at the University of North Carolina at Wilmington for their time and support.

REFERENCES

Cheng Li and Bingyu Wang. Fisher linear discriminant analysis, Aug 2014.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 112. Springer.

Ananda Das. Understanding linear SVM with R, Mar 2017.

APPLICATION OF SVM

From the SVM, two error rates are defined to help investigate how well KLDA is performing for different σ values:

$L_1$ = classification error rate of the SVM fit
$L_2$ = relative distance of the support vectors, against all data points, to the linear SVM hyperplane

We now create a loss function that we aim to minimize: $L = L_1 + L_2$.

A function was created in R to take in σ values and output the error $L$. To understand how the loss function behaves, the KLDA projections were plotted and compared against their loss function values.
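A minimal sketch of such a function is given below, using e1071 for the linear SVM on the 1-D KLDA projection. Since $L_2$ is only loosely described here, it is interpreted as the mean distance of the support vectors to the hyperplane relative to the mean distance of all points; this interpretation, and the helper name klda_svm_loss, are assumptions for illustration rather than the authors' exact definition.

library(e1071)

# Hedged sketch: proj is the 1-D KLDA projection, labels is the two-class factor.
klda_svm_loss <- function(proj, labels) {
  labels <- as.factor(labels)
  z      <- matrix(proj, ncol = 1)
  fit    <- svm(z, labels, kernel = "linear", scale = FALSE)

  # L1: classification error rate of the linear SVM fit
  L1 <- mean(predict(fit, z) != labels)

  # Distance of every point to the separating hyperplane
  w    <- t(fit$coefs) %*% fit$SV                      # weight vector (linear kernel)
  dec  <- attr(predict(fit, z, decision.values = TRUE), "decision.values")
  dist <- as.vector(abs(dec)) / sqrt(sum(w^2))

  # L2: support-vector distances relative to all points (assumed definition)
  L2 <- mean(dist[fit$index]) / mean(dist)

  L1 + L2
}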

Figure 4: KLDA Projections Using Various Sigma Values on Cookie Dataset

Sigma (σ)            0.001   0.01    0.1     1       10      100
Loss function error  0.121   0.091   0.107   0.300   0.998   0.999

When applied to the cookie data in Figure 1(b), with 1000 randomized sigma values between 0.001 and 10, the loss output $L$ is shown in Figure 5. The minimum of this curve is located at approximately σ = 0.65. The KLDA projections with this sigma value are shown in Figure 6.
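A sketch of this random search, assuming the helper functions from the earlier sketches (klda_fit, klda_project, klda_svm_loss) and that X and y hold the cookie data, might look like the following; the search actually used for Figure 5 may differ.

set.seed(1)
sigmas <- runif(1000, min = 0.001, max = 10)   # 1000 randomized sigma values
loss   <- sapply(sigmas, function(s) {
  fit <- klda_fit(X, y, sigma = s)
  klda_svm_loss(klda_project(fit, X), y)
})
sigmas[which.min(loss)]                        # near 0.65 for the cookie data
plot(sigmas, loss, pch = 16, cex = 0.4,
     xlab = expression(sigma), ylab = "Loss L")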

Figure 5: Loss Function
Figure 6: Sigma = 0.65, Loss = 0.026

FUTURE RESEARCH

Defining $L_2$ in an optimal way is important for creating a loss function that is continuous and achieves a global minimum that can be extracted. The functions should be tested on other two-class toy datasets as well as larger datasets like MORPH-II. Lastly, these functions can be adjusted to be used on multi-class KLDA problems.