CSCE555 Bioinformatics, Lecture 15: Classification for Microarray Data

Page 1:

CSCE555 Bioinformatics
Lecture 15: Classification for Microarray Data

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina

Department of Computer Science and Engineering, 2008. www.cse.sc.edu

Page 2:

Outline

Classification problem in microarray data

Classification concepts and algorithms

Evaluation of classification algorithms

Summary


Page 3:

[Slide figure: the van ’t Veer breast-cancer example. Objects: arrays. Feature vectors: gene expression. Predefined classes: clinical outcome, either bad prognosis (recurrence < 5 yrs) or good prognosis (recurrence > 5 yrs). A learning set of labeled arrays yields a classification rule, which assigns a new array to a class (here: good prognosis, metastasis-free > 5 yrs).]

Reference: L. van ’t Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

Page 4:

[Slide figure: the Golub leukemia example. Objects: arrays. Feature vectors: gene expression. Predefined classes: tumor type (B-ALL, T-ALL, AML). A learning set of labeled arrays yields a classification rule, which assigns a new array to a class (here: T-ALL).]

Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

Page 5:

Classification/Discrimination

Each object (e.g., an array or column) is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG).

Aim: predict Y_new from X_new.

Gene   sample1  sample2  sample3  sample4  sample5  ...  New sample
1       0.46     0.30     0.80     1.51     0.90    ...    0.34
2      -0.10     0.49     0.24     0.06     0.46    ...    0.43
3       0.15     0.74     0.04     0.10     0.20    ...   -0.23
4      -0.45    -1.03    -0.79    -0.56    -0.32    ...   -0.91
5      -0.06     1.06     1.35     1.09    -1.09    ...    1.23

Y      Normal   Normal   Normal   Cancer   Cancer        unknown = Y_new

(The sample columns are the feature vectors X; the new sample's column is X_new.)
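A minimal sketch of this setup in Python (numpy assumed; the values are copied from the table above):

import numpy as np

# Expression matrix: rows = genes 1..5, columns = samples 1..5
X = np.array([
    [ 0.46,  0.30,  0.80,  1.51,  0.90],
    [-0.10,  0.49,  0.24,  0.06,  0.46],
    [ 0.15,  0.74,  0.04,  0.10,  0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06,  1.06,  1.35,  1.09, -1.09],
])
y = np.array(["Normal", "Normal", "Normal", "Cancer", "Cancer"])

# Each sample's feature vector is a column of X; most classifiers expect
# one row per object, so transpose before training.
samples = X.T          # shape (5 samples, 5 genes)
x_new = np.array([0.34, 0.43, -0.23, -0.91, 1.23])  # the "New sample" column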

Page 6:

Discrimination/Classification


Page 7:

Basic principles of discrimination:
Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG).

Aim: predict Y from X.

[Slide figure: objects grouped into predefined classes 1, 2, …, K.]

Example: X = feature vector {colour, shape}. Given X = {red, square}, what is Y? Here Y = class label = 2; the mapping from X to Y is the classification rule.

Page 8:

KNN: Nearest Neighbor Classifier
Based on a measure of distance between observations (e.g., Euclidean distance or one minus correlation).
The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
◦ find the k observations in the learning set closest to X
◦ predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
The number of neighbors k can be chosen by cross-validation (more on this later); a sketch of the rule follows below.
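A minimal sketch of the k-NN rule in Python (plain numpy with Euclidean distance; an illustration, not the course's own code):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training observations."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each row
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]             # most common class label

# Toy usage: two well-separated classes in 2-D
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0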

Page 9:


3-Nearest Neighbors
[Slide figure: a query point q with its 3 nearest neighbors highlighted, 2 of class x and 1 of class o, so q is labeled x by majority vote.]

Page 10:

Limitation of KNN: What is K?

Page 11:

SVM: Support Vector Machines
SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
To discriminate between two classes, given a training dataset:
◦ Map the data to a higher-dimensional space (feature space)
◦ Separate the two classes using an optimal linear separator


Page 12:


Key Ideas of SVM: Margins of Linear Separators

[Slide figure: a maximum-margin linear classifier separating two classes.]

Page 13:


Optimal Hyperplane
[Slide figure: the optimal hyperplane with margin ρ; the training points lying on the margin are the support vectors.]
Support vectors uniquely characterize the optimal hyperplane.

Page 14:

Finding the Support Vectors
Lagrange multiplier method for constrained optimization.
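The slide does not reproduce the optimization problem itself; the standard statement (a supplement, not taken from the lecture) is, in LaTeX:

\text{Primal:}\quad \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1 \ \ \forall i

\text{Dual:}\quad \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0

The support vectors are exactly the training points whose multipliers satisfy \alpha_i > 0.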

Page 15:


Key Ideas of SVM: Feature Space Mapping
Map the original data to some higher-dimensional feature space where the training set is linearly separable:
Φ: x → φ(x), e.g., (x1, x2) → (x1, x2, x1^2, x2^2, x1·x2, …)

Page 16:

The “Kernel Trick”
The linear classifier relies on the inner product between vectors:
K(x_i, x_j) = x_i^T x_j
If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:
K(x_i, x_j) = φ(x_i)^T φ(x_j)
A kernel function is a function that corresponds to an inner product in some expanded feature space.

Example: for 2-dimensional vectors x = [x1, x2], let K(x_i, x_j) = (1 + x_i^T x_j)^2. To show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

K(x_i, x_j) = (1 + x_i^T x_j)^2
            = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
            = [1, x_{i1}^2, √2 x_{i1} x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}]^T [1, x_{j1}^2, √2 x_{j1} x_{j2}, x_{j2}^2, √2 x_{j1}, √2 x_{j2}]
            = φ(x_i)^T φ(x_j), where φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2]
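A quick numerical check of this identity (numpy; the two test vectors are arbitrary):

import numpy as np

def phi(x):
    # Expanded feature map for the polynomial kernel (1 + x.y)^2 in 2-D
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
print((1 + xi @ xj)**2)   # kernel evaluated directly       -> 4.0
print(phi(xi) @ phi(xj))  # inner product in feature space  -> 4.0 (same)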


Page 17:

Examples of Kernel Functions
◦ Linear: K(x_i, x_j) = x_i^T x_j
◦ Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
◦ Gaussian (radial-basis function network): K(x_i, x_j) = exp(−‖x_i − x_j‖^2 / (2σ^2))
◦ Sigmoid: K(x_i, x_j) = tanh(β0 x_i^T x_j + β1)
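The same four kernels written out in Python (numpy; sigma, b0, b1 are free parameters):

import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, p=3):
    return (1 + xi @ xj) ** p

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj)**2 / (2 * sigma**2))

def sigmoid(xi, xj, b0=1.0, b1=0.0):
    return np.tanh(b0 * (xi @ xj) + b1)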

Page 18:


SVM
Advantages:
◦ maximizes the margin between two classes in the feature space characterized by a kernel function
◦ robust with respect to high input dimension
Disadvantages:
◦ difficult to incorporate background knowledge
◦ sensitive to outliers

Page 19:


Variable/Feature Selection with SVMs
Recursive Feature Elimination (a sketch in Python follows below):
◦ Train a linear SVM
◦ Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables
◦ Retrain the SVM with the remaining variables and repeat until classification performance degrades (or the desired number of variables remains)
Very successful. Other formulations exist where minimizing the number of variables is folded into the optimization problem. Similar algorithms exist for non-linear SVMs. These are among the best and most efficient variable selection methods.
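A sketch of recursive feature elimination using scikit-learn's RFE class (an illustration; the lecture does not prescribe a library, and the data here are synthetic):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix: 60 samples x 500 "genes"
X, y = make_classification(n_samples=60, n_features=500,
                           n_informative=10, random_state=0)

# Linear SVM as the base estimator; step=0.5 drops the lowest-weighted
# 50% of the remaining features at each iteration, as described above.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=50, step=0.5)
rfe.fit(X, y)
print(rfe.support_.sum())  # rfe.support_ is a boolean mask of survivors -> 50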

Page 20:


Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification.
SVMlight and LibSVM are among the earliest implementations of SVM.
Several MATLAB toolboxes for SVM are also available.

Page 21:

How to Use SVM to Classify Microarray Data
Prepare the data format for LibSVM. Each line holds a class label followed by index:value pairs for the non-zero features:

<label> <index1>:<value1> <index2>:<value2> ...

Usage: svm-train [options] training_set_file [model_file]
Examples of options: -s 0 -c 10 -t 1 -g 1 -r 1 -d 3
Usage: svm-predict [options] test_file model_file output_file
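For instance, one microarray sample with class label +1 and expression values on its first three genes would be encoded as the line (values and file names hypothetical):

+1 1:0.46 2:-0.10 3:0.15

Training and prediction then look like:

svm-train -s 0 -c 10 -t 1 -g 1 -r 1 -d 3 train.txt model.txt
svm-predict test.txt model.txt predictions.txt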

Page 22:


Decision Tree Classifiers
[Slide figure: a decision tree that first tests gene 1 (is M_i1 < -0.67?) and then gene 2 (is M_i2 > 0.18?), with yes/no branches routing each sample to class 0, 1, or 2 at the leaves.]

Training data (flattened on the slide):
G1:    0.1   -0.2   0.3
G2:    0.3    0.4   0.4
G3:    …      …     …
Class: 0      1     0

Advantage: transparent rules, easy to interpret.
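A minimal sketch of fitting and printing such a tree (scikit-learn as an illustration; the toy data below are invented):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows = samples, columns = two genes; three classes 0/1/2
X = np.array([[-1.0, 0.0], [-0.9, 0.5], [0.2, 0.5],
              [0.3, 0.6], [0.1, -0.2], [0.4, 0.0]])
y = np.array([0, 0, 2, 2, 1, 1])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Gene 1", "Gene 2"]))  # readable split rules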

Page 23:


Ensemble Classifiers
[Slide diagram: the training set X1, X2, …, X100 is resampled 500 times; a classifier is built on each resample (classifier 1 through classifier 500), and the individual classifiers are combined into an aggregate classifier.]
Examples: bagging, boosting, random forest.

Page 24:


Aggregating Classifiers: Bagging
[Slide diagram: from the training set of arrays X1, X2, …, X100, draw 500 bootstrap resamples X*1, X*2, …, X*100 and grow one tree per resample (tree 1 through tree 500). For a test sample, each tree votes for a class (class 1, class 2, class 1, class 1, …); the votes are aggregated, e.g., 90% class 1, 10% class 2.]
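A sketch of bagging 500 trees (scikit-learn illustration on synthetic data; note the constructor argument is named base_estimator in scikit-learn versions before 1.2):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=500, bootstrap=True, random_state=0)
bag.fit(X, y)
# Per-class vote fractions across the 500 trees, e.g. [0.90, 0.10]
print(bag.predict_proba(X[:1]))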

Page 25:

Weka Data Mining Toolbox
The Weka package (Java) includes:
◦ All of the previous classifiers
◦ Neural networks
◦ Projection pursuit
◦ Bayesian belief networks
◦ And more


Page 26:


Feature Selection in Classification
What: select a subset of features.
Why:
◦ Leads to better classification performance by removing variables that are noise with respect to the outcome
◦ May provide useful insights into the biology
◦ Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”)

Page 27:

Classifier Performance Assessment
Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the initial classifier-building phase.
One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
Performance of a classifier can be assessed by:
◦ Cross-validation
◦ A test set
◦ Independent testing on a future dataset


Page 28:

Diagram of Performance Assessment
[Slide diagram: two estimation schemes. Resubstitution estimation: the classifier is built on the training set and its performance is assessed on that same training set. Test set estimation: the classifier is built on the training set and assessed on an independent test set.]

Page 29:

Diagram of Performance Assessment (continued)
[Slide diagram: the same two schemes plus cross-validation. Cross-validation: the training set is repeatedly split into a (CV) learning set, on which a classifier is built, and a (CV) test set, on which it is assessed; resubstitution and test set estimation are as before.]

Page 30:

Performance Assessment
V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build each classifier leaving one subset out; compute the test set error rate on the left-out subset, and average the V error rates.
◦ Bias-variance tradeoff: smaller V can give larger bias but smaller variance
◦ Computationally intensive
Leave-one-out cross-validation (LOOCV) is the special case V = n. It works well for stable classifiers (k-NN, LDA, SVM).
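A sketch of using V-fold CV to choose k for a k-NN classifier, as mentioned earlier (scikit-learn illustration on synthetic data):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=50, random_state=0)
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())  # pick the k with the best average accuracy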


Supplementary slide

Page 31:

Which Method to Use Depends Mostly on Sample Size
If the sample is large enough, split it into test and training groups.
If the sample is barely adequate for either testing or training, use leave-one-out.
In between, consider V-fold CV. This method can give more accurate estimates than leave-one-out, but it reduces the size of the training set.

Page 32:

Summary
◦ The microarray classification task
◦ Classifiers: KNN, SVM, decision trees; tools: Weka, LibSVM
◦ Classifier evaluation and cross-validation

Page 33:

Acknowledgement
Terry Speed
Jean Yee Hwa Yang
Jane Fridlyand