CSCE555 Bioinformatics, Lecture 15: Classification for Microarray Data

Page 1:

CSCE555 Bioinformatics
Lecture 15: Classification for Microarray Data

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina

Department of Computer Science and Engineering, 2008. www.cse.sc.edu

Page 2:

Outline

Classification problem in microarray data

Classification concepts and algorithms

Evaluation of classification algorithms

Summary


Page 3:

[Slide figure: the van ’t Veer breast-cancer example. Objects: arrays. Feature vectors: gene expression. Predefined classes: clinical outcome, either bad prognosis (recurrence < 5 yrs) or good prognosis (recurrence > 5 yrs). A learning set of labeled arrays yields a classification rule, which assigns a new array to a class (here: good prognosis, metastasis-free > 5 yrs).]

Reference: L. van ’t Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

Page 4:

[Slide figure: the Golub leukemia example. Objects: arrays. Feature vectors: gene expression. Predefined classes: tumor type (B-ALL, T-ALL, AML). A learning set of labeled arrays yields a classification rule, which assigns a new array to a class (here: T-ALL).]

Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

Page 5:

Classification/Discrimination

Each object (e.g., an array or column) is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG).

Aim: predict Y_new from X_new.

Gene   sample1  sample2  sample3  sample4  sample5  ...  New sample
1       0.46     0.30     0.80     1.51     0.90    ...    0.34
2      -0.10     0.49     0.24     0.06     0.46    ...    0.43
3       0.15     0.74     0.04     0.10     0.20    ...   -0.23
4      -0.45    -1.03    -0.79    -0.56    -0.32    ...   -0.91
5      -0.06     1.06     1.35     1.09    -1.09    ...    1.23

Y      Normal   Normal   Normal   Cancer   Cancer        unknown = Y_new

(The sample columns are the feature vectors X; the new sample's column is X_new.)
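A minimal sketch of this setup in Python (numpy assumed; the values are copied from the table above):

import numpy as np

# Expression matrix: rows = genes 1..5, columns = samples 1..5
X = np.array([
    [ 0.46,  0.30,  0.80,  1.51,  0.90],
    [-0.10,  0.49,  0.24,  0.06,  0.46],
    [ 0.15,  0.74,  0.04,  0.10,  0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06,  1.06,  1.35,  1.09, -1.09],
])
y = np.array(["Normal", "Normal", "Normal", "Cancer", "Cancer"])

# Each sample's feature vector is a column of X; most classifiers expect
# one row per object, so transpose before training.
samples = X.T          # shape (5 samples, 5 genes)
x_new = np.array([0.34, 0.43, -0.23, -0.91, 1.23])  # the "New sample" column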

Page 6:

Discrimination/Classification


Page 7:

Basic principles of discrimination:
Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG).

Aim: predict Y from X.

[Slide figure: objects grouped into predefined classes 1, 2, …, K.]

Example: X = feature vector {colour, shape}. Given X = {red, square}, what is Y? Here Y = class label = 2; the mapping from X to Y is the classification rule.

Page 8:

KNN: Nearest Neighbor Classifier
Based on a measure of distance between observations (e.g., Euclidean distance or one minus correlation).
The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
◦ find the k observations in the learning set closest to X
◦ predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
The number of neighbors k can be chosen by cross-validation (more on this later); a sketch of the rule follows below.
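A minimal sketch of the k-NN rule in Python (plain numpy with Euclidean distance; an illustration, not the course's own code):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training observations."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each row
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]             # most common class label

# Toy usage: two well-separated classes in 2-D
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0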

Page 9:


3-Nearest Neighbors
[Slide figure: a query point q with its 3 nearest neighbors highlighted, 2 of class x and 1 of class o, so q is labeled x by majority vote.]

Page 10:

Limitation of KNN: What is K?

Page 11:

SVM: Support Vector Machines
SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
To discriminate between two classes, given a training dataset:
◦ Map the data to a higher-dimensional space (feature space)
◦ Separate the two classes using an optimal linear separator


Page 12:


Key Ideas of SVM: Margins of Linear Separators

[Slide figure: a maximum-margin linear classifier separating two classes.]

Page 13:


Optimal Hyperplane
[Slide figure: the optimal hyperplane with margin ρ; the training points lying on the margin are the support vectors.]
Support vectors uniquely characterize the optimal hyperplane.

Page 14:

Finding the Support Vectors
Lagrange multiplier method for constrained optimization.
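The slide does not reproduce the optimization problem itself; the standard statement (a supplement, not taken from the lecture) is, in LaTeX:

\text{Primal:}\quad \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1 \ \ \forall i

\text{Dual:}\quad \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0

The support vectors are exactly the training points whose multipliers satisfy \alpha_i > 0.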

Page 15:


Key Ideas of SVM: Feature Space Mapping
Map the original data to some higher-dimensional feature space where the training set is linearly separable:
Φ: x → φ(x), e.g., (x1, x2) → (x1, x2, x1^2, x2^2, x1·x2, …)

Page 16:

The “Kernel Trick”
The linear classifier relies on the inner product between vectors:
K(x_i, x_j) = x_i^T x_j
If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:
K(x_i, x_j) = φ(x_i)^T φ(x_j)
A kernel function is a function that corresponds to an inner product in some expanded feature space.

Example: for 2-dimensional vectors x = [x1, x2], let K(x_i, x_j) = (1 + x_i^T x_j)^2. To show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

K(x_i, x_j) = (1 + x_i^T x_j)^2
            = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
            = [1, x_{i1}^2, √2 x_{i1} x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}]^T [1, x_{j1}^2, √2 x_{j1} x_{j2}, x_{j2}^2, √2 x_{j1}, √2 x_{j2}]
            = φ(x_i)^T φ(x_j), where φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2]
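A quick numerical check of this identity (numpy; the two test vectors are arbitrary):

import numpy as np

def phi(x):
    # Expanded feature map for the polynomial kernel (1 + x.y)^2 in 2-D
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
print((1 + xi @ xj)**2)   # kernel evaluated directly       -> 4.0
print(phi(xi) @ phi(xj))  # inner product in feature space  -> 4.0 (same)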


Page 17:

Examples of Kernel Functions
◦ Linear: K(x_i, x_j) = x_i^T x_j
◦ Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
◦ Gaussian (radial-basis function network): K(x_i, x_j) = exp(−‖x_i − x_j‖^2 / (2σ^2))
◦ Sigmoid: K(x_i, x_j) = tanh(β0 x_i^T x_j + β1)
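The same four kernels written out in Python (numpy; sigma, b0, b1 are free parameters):

import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, p=3):
    return (1 + xi @ xj) ** p

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj)**2 / (2 * sigma**2))

def sigmoid(xi, xj, b0=1.0, b1=0.0):
    return np.tanh(b0 * (xi @ xj) + b1)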

Page 18:


SVM
Advantages:
◦ maximizes the margin between two classes in the feature space characterized by a kernel function
◦ robust with respect to high input dimension
Disadvantages:
◦ difficult to incorporate background knowledge
◦ sensitive to outliers

Page 19:


Variable/Feature Selection with SVMs
Recursive Feature Elimination (a sketch in Python follows below):
◦ Train a linear SVM
◦ Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables
◦ Retrain the SVM with the remaining variables and repeat until classification performance degrades (or the desired number of variables remains)
Very successful. Other formulations exist where minimizing the number of variables is folded into the optimization problem. Similar algorithms exist for non-linear SVMs. These are among the best and most efficient variable selection methods.
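A sketch of recursive feature elimination using scikit-learn's RFE class (an illustration; the lecture does not prescribe a library, and the data here are synthetic):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix: 60 samples x 500 "genes"
X, y = make_classification(n_samples=60, n_features=500,
                           n_informative=10, random_state=0)

# Linear SVM as the base estimator; step=0.5 drops the lowest-weighted
# 50% of the remaining features at each iteration, as described above.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=50, step=0.5)
rfe.fit(X, y)
print(rfe.support_.sum())  # rfe.support_ is a boolean mask of survivors -> 50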

Page 20:


Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification.
SVMlight and LibSVM are among the earliest implementations of SVM.
Several MATLAB toolboxes for SVM are also available.

Page 21:

How to Use SVM to Classify Microarray Data
Prepare the data format for LibSVM. Each line holds a class label followed by index:value pairs for the non-zero features:

<label> <index1>:<value1> <index2>:<value2> ...

Usage: svm-train [options] training_set_file [model_file]
Examples of options: -s 0 -c 10 -t 1 -g 1 -r 1 -d 3
Usage: svm-predict [options] test_file model_file output_file
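For instance, one microarray sample with class label +1 and expression values on its first three genes would be encoded as the line (values and file names hypothetical):

+1 1:0.46 2:-0.10 3:0.15

Training and prediction then look like:

svm-train -s 0 -c 10 -t 1 -g 1 -r 1 -d 3 train.txt model.txt
svm-predict test.txt model.txt predictions.txt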

Page 22:


Decision Tree Classifiers
[Slide figure: a decision tree that first tests gene 1 (is M_i1 < -0.67?) and then gene 2 (is M_i2 > 0.18?), with yes/no branches routing each sample to class 0, 1, or 2 at the leaves.]

Training data (flattened on the slide):
G1:    0.1   -0.2   0.3
G2:    0.3    0.4   0.4
G3:    …      …     …
Class: 0      1     0

Advantage: transparent rules, easy to interpret.
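A minimal sketch of fitting and printing such a tree (scikit-learn as an illustration; the toy data below are invented):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows = samples, columns = two genes; three classes 0/1/2
X = np.array([[-1.0, 0.0], [-0.9, 0.5], [0.2, 0.5],
              [0.3, 0.6], [0.1, -0.2], [0.4, 0.0]])
y = np.array([0, 0, 2, 2, 1, 1])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Gene 1", "Gene 2"]))  # readable split rules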

Page 23:


Ensemble Classifiers
[Slide diagram: the training set X1, X2, …, X100 is resampled 500 times; a classifier is built on each resample (classifier 1 through classifier 500), and the individual classifiers are combined into an aggregate classifier.]
Examples: bagging, boosting, random forest.

Page 24:


Aggregating Classifiers: Bagging
[Slide diagram: from the training set of arrays X1, X2, …, X100, draw 500 bootstrap resamples X*1, X*2, …, X*100 and grow one tree per resample (tree 1 through tree 500). For a test sample, each tree votes for a class (class 1, class 2, class 1, class 1, …); the votes are aggregated, e.g., 90% class 1, 10% class 2.]
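A sketch of bagging 500 trees (scikit-learn illustration on synthetic data; note the constructor argument is named base_estimator in scikit-learn versions before 1.2):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=500, bootstrap=True, random_state=0)
bag.fit(X, y)
# Per-class vote fractions across the 500 trees, e.g. [0.90, 0.10]
print(bag.predict_proba(X[:1]))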

Page 25:

Weka Data Mining Toolbox
The Weka package (Java) includes:
◦ All of the previous classifiers
◦ Neural networks
◦ Projection pursuit
◦ Bayesian belief networks
◦ And more


Page 26:


Feature Selection in Classification
What: select a subset of features.
Why:
◦ Leads to better classification performance by removing variables that are noise with respect to the outcome
◦ May provide useful insights into the biology
◦ Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”)

Page 27:

Classifier Performance Assessment
Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the initial classifier-building phase.
One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
Performance of a classifier can be assessed by:
◦ Cross-validation
◦ A test set
◦ Independent testing on a future dataset


Page 28:

Diagram of Performance Assessment
[Slide diagram: two estimation schemes. Resubstitution estimation: the classifier is built on the training set and its performance is assessed on that same training set. Test set estimation: the classifier is built on the training set and assessed on an independent test set.]

Page 29:

Diagram of Performance Assessment (continued)
[Slide diagram: the same two schemes plus cross-validation. Cross-validation: the training set is repeatedly split into a (CV) learning set, on which a classifier is built, and a (CV) test set, on which it is assessed; resubstitution and test set estimation are as before.]

Page 30:

Performance Assessment
V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build each classifier leaving one subset out; compute the test set error rate on the left-out subset, and average the V error rates.
◦ Bias-variance tradeoff: smaller V can give larger bias but smaller variance
◦ Computationally intensive
Leave-one-out cross-validation (LOOCV) is the special case V = n. It works well for stable classifiers (k-NN, LDA, SVM).
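A sketch of using V-fold CV to choose k for a k-NN classifier, as mentioned earlier (scikit-learn illustration on synthetic data):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=50, random_state=0)
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())  # pick the k with the best average accuracy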


Supplementary slide

Page 31:

Which Method to Use Depends Mostly on Sample Size
If the sample is large enough, split it into test and training groups.
If the sample is barely adequate for either testing or training, use leave-one-out.
In between, consider V-fold CV. This method can give more accurate estimates than leave-one-out, but it reduces the size of the training set.

Page 32:

Summary
◦ The microarray classification task
◦ Classifiers: KNN, SVM, decision trees; tools: Weka, LibSVM
◦ Classifier evaluation and cross-validation

Page 33:

Acknowledgement
Terry Speed
Jean Yee Hwa Yang
Jane Fridlyand