Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University


Page 1: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection

Joe Verducci
Ohio State University

Page 2: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Outline

• Motivation: gene expression
• Variable selection for LDA
  • Large p
  • Moderate n
• Advantages in gene selection
• Method
• Model Justification
• Measures of Performance
• Modifications

Page 3: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

LDA Motivation

• Non-greedy selection: preserve (augmented) discriminant information
  • Variables with between-group differences
  • Variables highly correlated with these

Page 4: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

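Fisher’s Linear Discriminant Function and A Stupid Generalization

Fisher’s linear discriminant function:

$$L(x) = (x - \mu)^T \Sigma^{-1} (\mu_1 - \mu_2)$$

The generalization, with a generalized inverse $\Sigma^{-}$ in place of $\Sigma^{-1}$:

$$L_S(x) = (x - \mu)^T \Sigma^{-} (\mu_1 - \mu_2)$$

where $\mu = (\mu_1 + \mu_2)/2$.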

Page 5: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Why It’s Stupid

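$$\Sigma = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \mu_1 = (0, 0, 1)^T, \quad \mu_2 = (0, 0, -1)^T$$

With the Moore-Penrose inverse, $\Sigma^{-}(\mu_1 - \mu_2) = 0$, so the generalized discriminant is identically zero even though the two groups are separated along the third coordinate.

Results from Bickel and Levina (2004) imply that the eigenvectors of within and between group covariance matrices approach orthogonality under n fixed, p → ∞ asymptotics.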
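The degenerate case on this slide can be checked numerically. The sketch below reads the 3 x 3 matrix as the covariance and the two vectors as the group means (an interpretation of the slide), and uses NumPy's Moore-Penrose pseudoinverse as the generalized inverse, which is an assumption; the names `Sigma`, `w`, and `L_S` are illustrative only.

```python
import numpy as np

# Values read off the slide: the covariance has no variance in the third
# coordinate, and the group means differ only in that coordinate.
Sigma = np.diag([1.0, 1.0, 0.0])
mu1 = np.array([0.0, 0.0, 1.0])
mu2 = np.array([0.0, 0.0, -1.0])

# Assumption: the generalized inverse is the Moore-Penrose pseudoinverse.
Sigma_ginv = np.linalg.pinv(Sigma)

# Discriminant direction Sigma^- (mu1 - mu2).
w = Sigma_ginv @ (mu1 - mu2)

def L_S(x):
    """Generalized discriminant score (x - mu)^T Sigma^- (mu1 - mu2)."""
    mu_bar = (mu1 + mu2) / 2
    return (x - mu_bar) @ w

print("direction:", w)
print("scores at the two group means:", L_S(mu1), L_S(mu2))
```

Because the pseudoinverse zeroes out the direction with no within-group variance, the direction vector vanishes and both group means receive the same score.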

Page 6: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Genetic Motivation

• Wound Healing
  • 80 National Wound Healing Clinics
  • 1000 patients
  • Initial + 1-week samples
  • Clinical records of patients
• ~10K genes of potential interest in myocytes
  • Subsets of genes act in concert
  • A single gene may be active in several subsystems

Page 7: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

P53

When the DNA in a cell becomes damaged by agents such as toxic chemicals or ultraviolet (UV) rays from sunlight, this protein plays a critical role in determining whether the DNA will be repaired or the cell will undergo programmed cell death (apoptosis).

If the DNA can be repaired, tumor protein p53 activates other genes to fix the damage.

If the DNA cannot be repaired, tumor protein p53 prevents the cell from dividing and signals it to undergo apoptosis. This process prevents cells with mutated or damaged DNA from dividing, which helps prevent the development of tumors.

Page 8: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Pathway construction based on GeneChip™ expression data. Genes shown in red ellipses are candidates identified using the GeneChip™ assay that were up-regulated in 20% O2 compared with 3% O2. Green ellipses are genes that were down-regulated under the conditions mentioned above. The expressions of candidates shown in red ellipses with blue outline have been independently verified using either real-time PCR or ribonuclease protection assay (6). BAX, Bcl2-associated X protein; Catn, catenin; CASP, caspase; ccng, cyclin G; Cdc61, cell division cycle; CDK, cyclin-dependent kinase; CDKN1A, cyclin-dependent kinase inhibitor 1A (p21); Cx43, gap junction membrane channel protein; GADD, growth arrest and DNA damage-inducible; MAPK, mitogen-activated protein kinase; Mdm2, transformed mouse 3T3 cell double minute 2; N-Cdh, cadherin 2; PXN, paxillin; Tob, transducer of ErbB-2.1; TP53, transformation-related protein 53; Vcl, vinculin; Wig, wild-type p53-induced gene 1.

Page 9: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Motivating Simple Example

• Two groups
  • 50 samples in each
• P = 4000 normal variables
  • All have variance 1
  • First 10 variables
    • correlation = .75 between all pairs
    • Difference of 2 between group means
  • Second 10 variables
    • correlation = .75 between all pairs
    • Difference of 1 between group means
  • Last 3980 variables
    • independent
    • same mean in both groups
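The setup above can be simulated in a few lines. This is a minimal sketch, not the code behind the talk: it assumes each equicorrelated block is generated from one shared normal factor (which gives unit variance and pairwise correlation .75), and it ranks variables by the absolute pooled two-sample t-statistic, which gives the same ordering as ranking by two-sided p-values. The function names and the seed are hypothetical.

```python
import numpy as np

def simulate_example(n_per_group=50, p=4000, rho=0.75, seed=0):
    """Simulate the two-group example; returns data X (n x p) and labels y."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_group
    X = rng.standard_normal((n, p))
    # Assumption: equicorrelation comes from a shared factor,
    # x = sqrt(rho) * f + sqrt(1 - rho) * e, so variance stays 1.
    for block in (slice(0, 10), slice(10, 20)):
        f = rng.standard_normal((n, 1))
        X[:, block] = np.sqrt(rho) * f + np.sqrt(1 - rho) * X[:, block]
    y = np.repeat([0, 1], n_per_group)
    X[y == 1, 0:10] += 2.0   # first 10 variables: mean difference 2
    X[y == 1, 10:20] += 1.0  # second 10 variables: mean difference 1
    return X, y

def t_statistics(X, y):
    """Pooled-variance two-sample t-statistic for every column of X."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (b.mean(axis=0) - a.mean(axis=0)) / np.sqrt(pooled_var * (1 / na + 1 / nb))

X, y = simulate_example()
t = t_statistics(X, y)
top20 = np.argsort(-np.abs(t))[:20]
# Variables 0..19 are the true discriminating variables.
print("fraction of top 20 that are correct:", np.mean(top20 < 20))
```

A single run gives one draw of the "fraction correct" measure; the percentages reported in the talk are averages over 100 simulations.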

Page 10: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Results from 100 Simulations

• Individual t-test ranking by p-values
  • 73% of top 20 selected are correct
  • On average need to select 400 variables to ensure inclusion of all 20
• SCOOP
  • 91% of top 20 selected are correct
  • On average need to select 200 variables to ensure inclusion of all 20

Page 11: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Shrunken Centroid Method for K groups

Tibshirani, Hastie, Narasimhan & Chu

• For each gene i,
  • $\bar{x}_{ik}$ = sample mean in group k, $\bar{x}_i$ = overall sample mean
  • $s_{ik}$ = estimated std. error of $\bar{x}_{ik}$, based on pooled std deviation
  • $d_{ik} = (\bar{x}_{ik} - \bar{x}_i)/s_{ik}$ is a t-statistic
• Shrinking by an amount $\Delta$ gives
  • Shrunken difference: $d'_{ik} = \mathrm{sign}(d_{ik})\,(|d_{ik}| - \Delta)_+$
  • Shrunken centroid: $\bar{x}'_{ik} = \bar{x}_i + s_{ik}\, d'_{ik}$
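The shrinkage step on this slide is a soft-thresholding of the standardized differences. The sketch below follows the slide's definitions, taking the standard error of a group mean as the pooled standard deviation divided by the square root of the group size; it omits the regularizing offset used in the published method, and the function name and toy data are hypothetical.

```python
import numpy as np

def shrunken_centroids(X, y, delta):
    """Soft-threshold standardized centroid differences by delta.

    X is (n, p), y holds n group labels. Returns the (p, K) shrunken
    centroids and the (p, K) shrunken differences.
    """
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    overall = X.mean(axis=0)  # overall mean of each variable
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes], axis=1)
    n_k = np.array([np.sum(y == c) for c in classes])

    # Pooled within-group standard deviation of each variable.
    within_ss = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                    for c in classes)
    s_pooled = np.sqrt(within_ss / (n - K))

    # Assumption: std. error of a group mean = pooled sd / sqrt(n_k).
    s_ik = s_pooled[:, None] / np.sqrt(n_k)[None, :]

    d = (centroids - overall[:, None]) / s_ik
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    return overall[:, None] + s_ik * d_shrunk, d_shrunk

# Toy data: 3 groups of 10, only variable 0 differs (in group 0).
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 6))
y = np.repeat([0, 1, 2], 10)
X[y == 0, 0] += 3.0

cent, d = shrunken_centroids(X, y, delta=2.0)
surviving = np.any(d != 0, axis=1)  # variables not yet shrunk to the overall mean
print("surviving variables:", np.flatnonzero(surviving))
```

A variable whose K shrunken differences are all zero has the same centroid in every group and therefore drops out of the classifier.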

Page 12: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Properties of Shrunken Centroid

• When K = 2, ordering of variables/genes is same as t-test
• Keeps “redundant” predictors
• Can be modified to regularize the estimated std errors
• Shrunken centroids used directly for classification
• Shrinkage by amount $\Delta$ is simultaneous in all coordinates on standardized scale
• Shrinkage parameter chosen by cross-validation

Page 13: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Reformulating the Goals

• Genetic studies
  • Find biomarkers: classification/prediction
    • Use small number of classifiers/predictors
  • Understand genetic pathways
    • Discover which genes work together to make a difference
    • Possible intervention
• Other studies
  • Improve efficiency in difficult discrimination problems

Page 14: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

SCOOP Method (version 1)

• Define the Augmented Discriminant Space: ADS = span of eigenvectors of Within and Between Covariance Matrices
• Modify shrinkage so as not to distort configuration of data in the ADS
  • shrink variables differentially along directions orthogonal to the ADS
  • Note: Unlike the reference, we do not standardize, but scale only at the shrinkage stage.
• Keep track of the amount of shrinkage $\Delta_i$ needed to eliminate the ith variable

Page 15: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

SCOOP Algorithm for K groups

1. Between Group eigenvectors
   $D_B = [(\bar{x}_{ik} - \bar{x}_i)]$, a p x K matrix
   Use Singular Value Decomposition (SVD) on $D_B$. The left singular vectors of $D_B$ are the eigenvectors of $D_B D_B^T$.

2. Within Group eigenvectors
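Step 1 can be checked on a small example. The sketch below builds the p x K matrix of centroid differences from toy data (the sizes, seed, and shifted variables are hypothetical) and confirms that the left singular vectors from a thin SVD are eigenvectors of the p x p between-group matrix, with eigenvalues equal to the squared singular values. The p x p matrix is formed here only for the check; the SVD itself never needs it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_k, p, K = 20, 50, 3                      # toy sizes, chosen for illustration
X = rng.standard_normal((n_k * K, p))
y = np.repeat(np.arange(K), n_k)
X[y == 1, :5] += 1.0                       # hypothetical group difference

overall = X.mean(axis=0)
# D_B: one column of centroid differences per group (p x K).
D_B = np.stack([X[y == k].mean(axis=0) - overall for k in range(K)], axis=1)

# Thin SVD: U is p x K, so storage stays O(p * K).
U, sing, Vt = np.linalg.svd(D_B, full_matrices=False)

# Verification only: columns of U are eigenvectors of D_B D_B^T.
B = D_B @ D_B.T
print(np.allclose(B @ U, U * sing**2))
```

With equal group sizes the columns of the difference matrix sum to zero, so at most K - 1 of the singular values are non-zero.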

Page 16: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Algorithm (part 2)

• Orthogonalize the Between group (BG) eigenvectors to the Within group (WG) eigenvectors
  • Note: residuals from orthogonalization will no longer be orthogonal to each other
• Renormalize
• Compute projection operator onto complement of the ADS
  • Note: do not need to use p x p storage
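One way to carry out these steps without p x p storage is sketched below. Several choices here are assumptions rather than details from the slides: the within-group eigenvectors are taken as the right singular vectors of the stacked within-group residuals, all numerically non-zero directions are kept, and the orthogonalized between-group residuals are re-orthonormalized with a second SVD. The function names and tolerance are hypothetical.

```python
import numpy as np

def ads_basis(X, y, tol=1e-10):
    """Orthonormal basis Q (p x r) for the span of WG and BG eigenvectors."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    D_B = np.stack([X[y == c].mean(axis=0) - overall for c in classes], axis=1)
    # Stacked within-group residuals (n x p).
    R = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])

    # WG eigenvectors: right singular vectors of R (eigenvectors of R^T R).
    _, s_w, Vt_w = np.linalg.svd(R, full_matrices=False)
    W = Vt_w[s_w > tol * s_w[0]].T

    # BG eigenvectors: left singular vectors of D_B.
    U_b, s_b, _ = np.linalg.svd(D_B, full_matrices=False)
    B = U_b[:, s_b > tol * s_b[0]]

    # Orthogonalize BG against WG; the residuals are no longer mutually
    # orthogonal, so re-orthonormalize them with another SVD.
    resid = B - W @ (W.T @ B)
    U_r, s_r, _ = np.linalg.svd(resid, full_matrices=False)
    E = U_r[:, s_r > tol]

    return np.hstack([W, E])

def project_onto_complement(Q, v):
    """Apply (I - Q Q^T) to v using only p x r products."""
    return v - Q @ (Q.T @ v)

# Toy data with p > n (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))
y = np.repeat([0, 1, 2], 20)
X[y == 0, :5] += 1.5

Q = ads_basis(X, y)
v = rng.standard_normal(200)
r = project_onto_complement(Q, v)
print("ADS dimension:", Q.shape[1])
print("residual is orthogonal to ADS:", np.allclose(Q.T @ r, 0.0, atol=1e-8))
```

The projection is applied as two thin matrix products, so the p x p projector is never formed.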

Page 17: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Algorithm (part 3)

• Order variables by scaled shrinkage distances $\{\Delta_i\}$
  • For each variable i, compute a scale value = (squared) length of its projection onto the orthogonal complement of the ADS
  • Then calculate how many [$\Delta_i$] such units are needed to shrink each of the K mean differences to 0
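The ordering step might be sketched as follows. This is one reading of the slide, not a confirmed implementation: the "projection of variable i" is taken to be the projection of the ith coordinate axis onto the complement of the ADS, whose squared length is one minus the squared norm of the ith row of an orthonormal ADS basis, and the shrinkage needed to eliminate a variable is taken as its largest absolute mean difference divided by that scale. The function name, the clipping constant, and the toy inputs are hypothetical.

```python
import numpy as np

def scoop_order(D_B, Q, eps=1e-12):
    """Rank variables by the scaled shrinkage needed to eliminate them.

    D_B: (p, K) centroid differences. Q: (p, r) orthonormal ADS basis.
    """
    # Squared length of (I - Q Q^T) e_i equals 1 - ||Q[i, :]||^2.
    scale = 1.0 - np.sum(Q**2, axis=1)
    scale = np.clip(scale, eps, None)      # guard against division by zero

    # Assumption: a variable is eliminated once all K differences reach 0,
    # so the largest absolute difference determines the units needed.
    delta = np.max(np.abs(D_B), axis=1) / scale

    order = np.argsort(-delta)             # most shrinkage needed comes first
    return order, delta, scale

# Toy inputs (illustrative only).
rng = np.random.default_rng(0)
p, r, K = 8, 3, 2
Q, _ = np.linalg.qr(rng.standard_normal((p, r)))   # p x r, orthonormal columns
D_B = rng.standard_normal((p, K))

order, delta, scale = scoop_order(D_B, Q)
print("variables from last to first eliminated:", order)
```

Variables lying mostly inside the ADS have a small scale value, so they need many units of shrinkage and are retained longest.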

Page 18: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Notes

• Shrinking is non-linear
  • it truncates at 0
  • shrinks each group only as much as it needs to
• What to use as a stopping rule?
  • Some measure of preserved information
  • Elbow in the distribution of $\{\Delta_i\}$
  • Reference to extreme value distribution

Page 19: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Theoretical Concern

• Inconsistency of sample eigenvectors if p(n)/n → c > 0
  • Johnstone and Lu (2004)
  • Unless sparse representation
• (Offset) factor model
  • Latent factors account for both
    • Correlation among variables
    • Group mean differences

Page 20: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Modeling considerations

• Common offset factor model for gene expression
  • latent factors represent biological variation
  • random measurement errors are “uniqueness” components of individual genes
• Normally distributed data
  • two populations share the same factor structure
  • differ only by the means of the underlying factors
  • the restricted maximum likelihood procedure is the (stupid) generalization of Fisher’s Linear Discriminant Analysis (SLDA) that incorporates a generalized inverse of the pooled sample covariance matrix
• SLDA seldom works well for real data
  • amend overly restrictive assumptions on both means and covariances

Page 21: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

More model considerations

• Factors underlying biological variation
  • Common factors in 2 groups
    • Some with different means in 2 groups
    • Some with same mean
  • Group specific factors
    • Some may have non-zero means
    • Some have 0 means
• Unique variation among genes
  • Most is noise
  • A few of the genes that do not load on any factor may have different means in the two groups

Page 22: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Model

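$$X_i^{(g)} = \sum_{k=1}^{K} \lambda_k F_{ik} + \sum_{j=1}^{J(g)} \lambda_j^{(g)} F_{ij}^{(g)} + \varepsilon_i, \quad i = 1,\dots,n;\ g = 1,\dots,G$$

$$\dim(\lambda_k) = \dim(\lambda_j^{(g)}) = p \times 1$$

$$F_{ik} \overset{iid}{\sim} N(0, \sigma_k^F), \quad i = 1,\dots,n; \text{ for each } k = 1,\dots,K$$

$$F_{ij}^{(g)} \overset{iid}{\sim} N(\mu_j^{(g)}, \sigma_j^{(g)}), \quad i = 1,\dots,n; \text{ for each } j = 1,\dots,J(g)$$

$$\varepsilon_i \overset{iid}{\sim} N(0, \sigma), \quad i = 1,\dots,n$$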

Page 23: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Simulation

• n = 100, p = 4000, G = 2, K = 3, J(g) = 1
• $\sigma = 1$, $\sigma_k^F = 1$, $\sigma_j^{(g)} = 1$
• Loadings on common factors
  • $\lambda_1$ indicates 1st 10 variables [1]
  • $\lambda_2$ indicates 2nd 10 variables [.55]
  • $\lambda_3$ indicates 3rd 10 variables [0]
• Loadings on group-specific factors
  • $\lambda_1^{(1)}$ indicates 4th 10 variables [.55]
  • $\lambda_1^{(2)}$ indicates 5th 10 variables [0]
• Here [] is the difference in means

Page 24: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Shrinkage Needed to Select Top Predictors

Page 25: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Measures of Performance

• Individual t-test ranking by p-values
  • 49% of top 30 selected are correct
  • On average need to select 400 variables to ensure inclusion of all 30
• SCOOP
  • 61% of top 30 selected are correct
  • On average need to select 200 variables to ensure inclusion of all 30

Page 26: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Modifications

Preserve common and group-distinct within group sample eigenvectors

Regularize sample eigenvectors using Linear Perturbation Theory

[Equation not recoverable from the extracted slide text]

This is piecewise linear until adjacent eigenvalues become equal

Page 27: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

Conclusions

To the extent that something like an offset factor model holds, incorporating correlations may substantially improve selection of discriminating variables (DVs)

Clustering of non-DVs does not seem to have any serious ill effect

SCOOP is one way to use covariance structure efficiently

Page 28: Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection Joe Verducci Ohio State University

References

Bickel PJ and Levina E (2004). Some theory for Fisher's linear discriminant function, `naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli 10, no. 6, 989–1010.

Tibshirani R, Hastie T, Narasimhan B and Chu G (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 99, no. 10, 6567–6572.

Sen CK, Verducci JS, Melfi VF, Khanna S, Barbacioru C and Roy S (2005). Post-reperfusion healing of the heart: Focus on oxygen-sensitive genes and DNA microarray as a tool. Mathematical Biosciences Institute Technical Report No. 31 (available at http://mbi.osu.edu/publications/pub2005.html)