CS 6890 Offered by Charles Yan Presented by: Jyothi Sankuri

Application of support vector machines for T-cell epitopes prediction

By Yingdong Zhao, Clemencia Pinilla, Danila Valmori, Roland Martin and Richard Simon.

Overview

• Introduction

– Problem

– Why?

– T-cell epitopes

– SVMs

– Results

• Support Vector Machines (SVMs)

– SVM principle

– Kernel function

• Systems and Methods

• Results

• Discussions & Conclusions

• Remarks & Future Work

• References

Introduction

• Problem: training SVMs for the prediction of T-cell epitopes.

• T-cell epitopes: antigenic determinants recognized and bound by the T-cell receptor. An epitope is the part of an antigen by which the immune system recognizes it as an “antigen”.

• SVMs: Support Vector Machines.

• Why predict T-cell epitopes?

Prediction of T-cell Epitopes

– The T-cell receptor and a major histocompatibility complex (MHC) molecule play major roles in the process of antigen-specific T-cell activation.

• One receptor may recognize thousands of different peptides.

– Deciphering the patterns of peptides that elicit an MHC-restricted T-cell response is critical for vaccine development.

• A crucial step in the design of subunit vaccines is the identification of T-cell epitopes in sets of disease-specific gene products.

– Identifying characteristic patterns of immunogenic peptide epitopes can provide fundamental information for understanding disease pathogenesis and etiology, and for therapeutics such as vaccine development.

Why?

Why SVMs?

• SVMs can build effective predictive models when the dimensionality of the data is high and the number of observations is limited.

• SVMs are based on a strong theoretical foundation for avoiding over-fitting training data.

• SVMs are one of the most powerful new techniques and have been effective in DNA sequence analysis, protein structure prediction, and gene expression pattern discovery.

Methods used for T-cell epitope prediction:

• ANNs

• Decision Tree Classifiers

• Score Matrix Based Approach

• SVMs

• SYFPEITHI

• SVMHC

SVM

• Definition

• Training and test data sets

• Training the SVM

– Structural Risk Minimization induction principle

– Kernel functions

– Leave-one-out cross-validation

• Reason for using SVM with T-cell Epitope.

SVMs

• Support Vector Machines are based on the concept of decision planes that define decision boundaries. A decision plane separates sets of objects that have different class memberships.

• An SVM performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories.

• A set of features that describes one case is called a vector.

• The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors so that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side. The vectors nearest the hyperplane are the support vectors.

General SVM function

$$u = \sum_i \alpha_i y_i K(x_i, x) - b \qquad (1)$$

• $u$ : SVM output

• $\alpha_i$ : weights to blend different kernels

• $y_i \in \{-1, +1\}$ : desired output

• $b$ : threshold

• $x_i$ : stored training example (support vector)

• $x$ : input (vector)

• $K$ : kernel function measuring the similarity of $x_i$ to $x$
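Equation (1) can be read as a weighted vote of the stored support vectors. The following is a minimal Python sketch, assuming a linear (inner product) kernel; the support vectors, weights, and threshold are made-up values chosen purely for illustration and do not come from the paper.

```python
def linear_kernel(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def svm_output(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Evaluate u = sum_i alpha_i * y_i * K(x_i, x) - b, as in equation (1)."""
    return sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) - b


# Hypothetical toy model: two support vectors, one per class.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [+1, -1]
alphas = [0.5, 0.5]
b = 0.0

u = svm_output([2.0, 0.0], support_vectors, labels, alphas, b)
predicted_class = +1 if u > 0 else -1  # the sign of u gives the class
```

Passing a different function as `kernel` changes the similarity measure without touching the rest of the computation.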

Training and Test Data sets

• Divide the peptides (data) into positive and negative groups.

• Randomly sample each group.

• 80% is the training set and 20% is the test set.

• Combine the two groups, separately for the training and test sets.

• Use pairwise Pearson coefficients to ensure the peptides are dissimilar.
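The partitioning steps above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names, the seed, and the toy vectors are assumptions, and the peptides are assumed to be already encoded as numeric vectors. The slides do not say which Pearson threshold counted as "dissimilar", so the sketch only computes the coefficient.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for constant vectors; treated as 0 here
    return cov / sqrt(var_a * var_b)


def stratified_split(positives, negatives, train_fraction=0.8, seed=0):
    """Split each class separately, then combine, so both sets contain both classes."""
    rng = random.Random(seed)
    train, test = [], []
    for group, label in ((positives, +1), (negatives, -1)):
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train += [(x, label) for x in shuffled[:cut]]
        test += [(x, label) for x in shuffled[cut:]]
    return train, test


# Hypothetical encoded peptides: 10 positives and 20 negatives.
positives = [[float(i), float(i % 3)] for i in range(10)]
negatives = [[-float(i), float(i % 2)] for i in range(20)]
train, test = stratified_split(positives, negatives)
```

Splitting each class on its own keeps the positive-to-negative ratio the same in the training and test sets.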

continue…

• Training the SVM

• Training is done using SVMlight.

• 100 input values, with class values of +1 and -1.

• The classes are separated by maximizing the margin.

• An SVM analysis finds the line (or, in general, hyperplane) that is oriented so that the margin between the support vectors is maximized. In the figure above, the line in the right panel is superior to the line in the left panel.
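Maximizing the margin can be written as an optimization problem. The following is the standard hard-margin formulation from the SVM literature, not something taken from the slides. The symbol $w$ (the normal vector of the hyperplane) is introduced here for illustration, while $x_i$, $y_i$, and $b$ follow equation (1). SVMlight solves a soft-margin variant that tolerates some misclassified points.

```latex
\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w \cdot x_i - b) \ge 1 \quad \text{for all } i
```

Under these constraints the two classes are separated by a margin of width $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.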

continue…

• Implementing Structural Risk Minimization Principle.

• SVM classification function used here is:

• For a linear SVM, the inner product is used as the kernel function.
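For the linear case, substituting the inner product kernel into equation (1) collapses the sum over support vectors into a single weight vector. The symbol $w$ is introduced here for illustration and does not appear in the slides.

```latex
K(x_i, x) = x_i \cdot x
\;\;\Longrightarrow\;\;
u = \sum_i \alpha_i y_i \,(x_i \cdot x) - b = w \cdot x - b,
\qquad w = \sum_i \alpha_i y_i x_i
```

One practical consequence is that a trained linear SVM can be evaluated with a single inner product per input, regardless of how many support vectors it has.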

continue…

Kernel Function

• The kernel function may transform the data into a higher-dimensional space to make it possible to perform the separation.

• The concept of a kernel mapping function is very powerful. It allows SVM models to perform separations even with very complex boundaries.
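As an illustration, the three kernel families compared later in the talk (linear, polynomial, and radial basis) can be sketched in Python. These are the textbook forms; the parameter names and default values (`degree`, `coef0`, `gamma`) are assumptions and are not the settings used in the paper.

```python
from math import exp


def linear_kernel(a, b):
    """Plain inner product: no change of feature space."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """(a . b + coef0) ** degree: implicitly includes feature products up to `degree`."""
    return (linear_kernel(a, b) + coef0) ** degree


def rbf_kernel(a, b, gamma=1.0):
    """exp(-gamma * ||a - b||^2): similarity decays with squared distance."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return exp(-gamma * sq_dist)


a, b = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(a, b)
k_poly = polynomial_kernel(a, b)
k_rbf_self = rbf_kernel(a, a)
```

Each function returns a similarity score for a pair of vectors, so any of them can stand in for K in equation (1).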

continue…..

• If f(x) is positive, the sample is predicted to be in class +1; otherwise it is predicted to be in class -1.

• The $\alpha_i$ and $b$ coefficients are determined by ‘learning’ from the data.

• During learning on the 80% training set, leave-one-out cross-validation was used to optimize the tuning parameters.

• Training and testing were repeated ten times for randomly determined training/test set partitions.
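Leave-one-out cross-validation holds out each training sample once, fits on the rest, and checks the prediction for the held-out sample. The sketch below shows the loop only. To stay short and dependency-free it uses a nearest-centroid classifier as a stand-in, which is NOT an SVM; the function names and toy data are hypothetical.

```python
def leave_one_out_accuracy(samples, fit):
    """samples: list of (vector, label); fit: callable returning a predict function."""
    correct = 0
    for i, (x, y) in enumerate(samples):
        held_in = samples[:i] + samples[i + 1:]  # every sample except the i-th
        predict = fit(held_in)
        if predict(x) == y:
            correct += 1
    return correct / len(samples)


def fit_nearest_centroid(samples):
    """Stand-in classifier (not an SVM): assign the class of the closer centroid."""
    def centroid(label):
        rows = [x for x, y in samples if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos, neg = centroid(+1), centroid(-1)

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return +1 if d_pos <= d_neg else -1

    return predict


# Hypothetical, well-separated toy data.
samples = [
    ([2.0, 2.0], +1), ([3.0, 2.5], +1), ([2.5, 3.0], +1),
    ([-2.0, -2.0], -1), ([-3.0, -2.5], -1), ([-2.5, -3.0], -1),
]
loo_accuracy = leave_one_out_accuracy(samples, fit_nearest_centroid)
```

For parameter tuning, the same loop would be run once per candidate parameter value, keeping the value with the best leave-one-out score. Repeating the whole procedure over ten random training/test partitions then gives the averaged results.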

Results

• Comparison with other methods is based on PPV, sensitivity, and the area under the ROC curve.

– Sensitivity is the proportion of all positive peptides that are correctly identified; it indicates the ability of the classifier to detect real epitopes.

– PPV (positive predictive value) is the probability that a peptide predicted to be positive actually does stimulate the TCC (T-cell clone).

• PPV reflects the efficiency of the method. A classifier with low PPV will result in the generation of numerous non-stimulatory peptides for the next rounds of testing.

– The Receiver Operating Characteristic (ROC) curve is a plot of sensitivity against 1 − specificity.
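The three measures can be computed from true labels and classifier outputs as sketched below. The labels and scores are made-up illustration data, not results from the paper. The area under the ROC curve is computed here through its rank (Mann-Whitney) interpretation, the fraction of positive/negative pairs that the scores order correctly, which is an implementation choice of this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN) for labels in {-1, +1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    return tp, fp, fn


def sensitivity(y_true, y_pred):
    """TP / (TP + FN): share of real positives that were detected."""
    tp, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def ppv(y_true, y_pred):
    """TP / (TP + FP): share of predicted positives that are real."""
    tp, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and SVM outputs; a positive output predicts class +1.
y_true = [1, 1, 1, -1, -1, -1]
scores = [1.2, -0.3, 0.8, 0.4, -1.0, -0.6]
y_pred = [1 if s > 0 else -1 for s in scores]

sens = sensitivity(y_true, y_pred)
precision = ppv(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Sensitivity and PPV depend on the chosen decision threshold, whereas the ROC area summarizes performance over all thresholds, which is why it is useful for comparing methods.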

Results(2)

b: The SVM model was trained on A*0201-restricted MHC binding data from the SYFPEITHI database.

c: The SVM model was trained on A*0201-restricted MHC binding data from the MHCPEP database.

The area under the averaged ROC curve was 0.919 for the SVM model.

The area under the averaged ROC curve was 0.833 for the score matrix based approach.

Discussions

• The ANN model had many more parameters than the SVM and requires a larger number of training peptides for equivalent performance.

• Decision tree classifiers overfit easily and require large training sets. The optimal decision tree classifier had a sensitivity considerably lower than that of the SVM.

• In developing an SVM, the choice of input variables, kernel function, and learning parameters plays a vital role. Here the simple linear kernel performed best on our data set, compared with the polynomial and radial basis kernel functions.

• Leave-one-out cross-validation was used to optimize the tuning parameters. In general, tuning parameters are chosen arbitrarily.

Discussions(2)

• The comparisons clearly show that our SVM approach to predicting T-cell epitopes is superior to publicly available methods such as SVMHC and SYFPEITHI.

• The SVM model greatly improved prediction accuracy, as judged by the ROC curve.

• SVMs can be effectively used for predicting T-cell epitopes. Using physical property factors to encode the candidate peptides enables SVM classifiers to achieve good performance with modest amounts of synthesized peptide training data.

• This makes for an efficient process of prediction and synthesis of additional peptides because positive peptides are most informative.

Contributions

• Developing, for the first time, a support vector machine (SVM) for T-cell epitope prediction with an MHC class I restricted T-cell clone.

• SVMs can be trained on relatively small data sets to provide predictions more accurate than those based on MHC binding.

• Further investigations of the use of SVMs for T-cell epitope prediction are warranted as a potentially efficient and powerful method for defining candidate autoantigens, finding the antigenic targets and molecular mimics in complex infectious organisms, and developing vaccines for infectious diseases and cancers.

Conclusions

• Support Vector Machine (SVM) is one of the best statistical learning methods.

• Its performance is significantly better than that of competing methods.

• The goal is to provide biologists a friendly tool to test their hypothesis.

Remarks & Future Work

• Extending the development of a support vector machine (SVM) for T-cell epitope prediction to an MHC class II restricted T-cell clone.

• Here a simple linear kernel function was used; SVMs using polynomial and radial basis kernel functions could be investigated further.

• Using different techniques to optimize the tuning parameters

About Authors

• Dr. Yingdong Zhao: National Cancer Institute, Biometric Research Branch, Division of Cancer Treatment and Diagnosis.

• Clemencia Pinilla, Ph.D.: Torrey Pines Institute for Molecular Studies, San Diego, CA 92121, USA.

• Danila Valmori: Ludwig Institute Clinical Trial Center, New York, NY, and Ludwig Institute for Cancer Research, Lausanne, Switzerland.

• Roland Martin: Division of Clinical Onco-Immunology, Ludwig Institute for Cancer Research, University Hospital (CHUV), Lausanne, Switzerland.

• Richard Simon: National Cancer Institute, Biometric Research Branch, Division of Cancer Treatment and Diagnosis.

Questions????
