Classification of Microarray Gene Expression Data Geoff McLachlan Department of Mathematics & Institute for Molecular Bioscience University of Queensland


Page 1: Classification of Microarray  Gene Expression Data

Classification of Microarray Gene Expression Data

Geoff McLachlan

Department of Mathematics & Institute for Molecular Bioscience, University of Queensland

Page 2: Classification of Microarray  Gene Expression Data

Institute for Molecular Bioscience, University of Queensland

Page 3: Classification of Microarray  Gene Expression Data

“A wide range of supervised and unsupervised learning methods have been considered to better organize data, be it to infer coordinated patterns of gene expression, to discover molecular signatures of disease subtypes, or to derive various predictions. ”

Statistical Methods for Gene Expression: Microarrays and Proteomics

Page 4: Classification of Microarray  Gene Expression Data

Outline of Talk

• Introduction

• Supervised classification of tissue samples – selection bias

• Unsupervised classification (clustering) of tissues – mixture model-based approach

Page 5: Classification of Microarray  Gene Expression Data

Vital Statistics by C. Tilstone

Nature 424, 610-612, 2003.

    “DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these researchers have the statistical know-how to cope?”

Branching out: cluster analysis can group samples that show similar patterns of gene expression.

Page 6: Classification of Microarray  Gene Expression Data

MICROARRAY DATA

REPRESENTED by a $p \times n$ matrix $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$

$\mathbf{x}_j$ contains the gene expressions for the $p$ genes of the $j$th tissue sample ($j = 1, \ldots, n$).

$p$ = No. of genes ($10^3$ - $10^4$)

$n$ = No. of tissue samples ($10$ - $10^2$)

STANDARD STATISTICAL METHODOLOGY APPROPRIATE FOR $n \gg p$

HERE $p \gg n$
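
One concrete consequence of p >> n is that the sample covariance matrix of the genes is singular, so methods that need its inverse cannot be applied directly. The sketch below illustrates this with simulated data; the sizes and the random matrix are assumptions for illustration, not the microarray data discussed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 20                      # hypothetical sizes with p >> n
X = rng.normal(size=(p, n))         # p x n matrix: genes in rows, tissues in columns

# Sample covariance matrix of the p genes, estimated from only n tissue samples.
S = np.cov(X)                       # np.cov treats each row as a variable, so S is p x p
rank_S = np.linalg.matrix_rank(S)

# The rank can be at most n - 1, far below p, so S has no inverse.
print(S.shape, rank_S)
```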

Page 8: Classification of Microarray  Gene Expression Data

Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.

Page 9: Classification of Microarray  Gene Expression Data

Oncologists would like to use arrays to predict whether or not a cancer is going to spread in the body, how likely it will respond to a certain type of treatment, and how long the patient will probably survive.

It would be useful if the gene expression signatures could distinguish between subtypes of tumours that standard methods, such as histological pathology from a biopsy, fail to discriminate, and that require different treatments.

bioArray News (2, no. 35, 2002)

Arrays Hold Promise for Cancer Diagnostics

Page 10: Classification of Microarray  Gene Expression Data

van’t Veer & De Jong (2002, Nature Medicine 8)

The microarray way to tailored cancer treatment

In principle, gene activities that determine the biological behaviour of a tumour are more likely to reflect its aggressiveness than general parameters such as tumour size and age of the patient.

(indistinguishable disease states in diffuse large B-cell lymphoma unravelled by microarray expression profiles – Shipp et al., 2002, Nature Med. 8)

Page 11: Classification of Microarray  Gene Expression Data

Microarray to be used as routine clinical screen by C. M. Schubert

Nature Medicine 9, 9, 2003.

The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.

Page 12: Classification of Microarray  Gene Expression Data

Microarrays also to be used in the prediction of breast cancer by Mike West (Duke University) and the Koo Foundation Sun Yat-Sen Cancer Centre, Taipei

Huang et al. (2003, The Lancet, Gene expression predictors of breast cancer).

Page 13: Classification of Microarray  Gene Expression Data

CLASSIFICATION OF TISSUES

SUPERVISED CLASSIFICATION (DISCRIMINANT ANALYSIS)

AIM: TO CONSTRUCT A CLASSIFIER C(x) FOR PREDICTING THE UNKNOWN CLASS LABEL y OF A TISSUE SAMPLE x.

e.g. g = 2 classes: G1 - DISEASE-FREE; G2 - METASTASES

We OBSERVE the CLASS LABELS y1, …, yn where yj = i if jth tissue sample comes from the ith class (i=1,…,g).

Page 15: Classification of Microarray  Gene Expression Data

LINEAR CLASSIFIER

FORM

$$C(\mathbf{x}) = \beta_0 + \boldsymbol{\beta}^T \mathbf{x} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

for the prediction of the group label y of a future entity with feature vector x.

Page 16: Classification of Microarray  Gene Expression Data

FISHER’S LINEAR DISCRIMINANT FUNCTION

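$$y = \operatorname{sign} C(\mathbf{x})$$

where

$$\boldsymbol{\beta} = \mathbf{S}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$$

$$\beta_0 = -\tfrac{1}{2}(\bar{\mathbf{x}}_1 + \bar{\mathbf{x}}_2)^T \mathbf{S}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$$

and $\bar{\mathbf{x}}_1$, $\bar{\mathbf{x}}_2$, and $\mathbf{S}$ are the sample means and pooled sample covariance matrix found from the training data.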
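
A minimal sketch of Fisher's rule follows, assuming the standard form in which beta is the inverse pooled covariance matrix applied to the difference of the sample means and beta0 is minus half the product of the sum of the means with beta. The function names and the simulated two-dimensional data are assumptions for illustration; with p >> n the pooled covariance matrix is singular, so in practice the rule is formed on a selected subset of genes.

```python
import numpy as np

def fisher_ldf(X1, X2):
    """Return (beta0, beta) for Fisher's linear discriminant function.

    X1, X2: arrays of shape (n1, p) and (n2, p), one row per tissue sample.
    """
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled sample covariance matrix S.
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, m1 - m2)      # S^{-1} (xbar1 - xbar2)
    beta0 = -0.5 * (m1 + m2) @ beta
    return beta0, beta

def classify(x, beta0, beta):
    """sign C(x): +1 allocates x to the first group, -1 to the second."""
    return int(np.sign(beta0 + x @ beta))

# Simulated training data (hypothetical): two groups in two dimensions.
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[2.0, 0.0], size=(30, 2))
X2 = rng.normal(loc=[-2.0, 0.0], size=(30, 2))
beta0, beta = fisher_ldf(X1, X2)

label_a = classify(np.array([3.0, 0.0]), beta0, beta)
label_b = classify(np.array([-3.0, 0.0]), beta0, beta)
print(label_a, label_b)
```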

Page 17: Classification of Microarray  Gene Expression Data

SUPPORT VECTOR CLASSIFIER

Vapnik (1995)

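$$C(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

where $\beta_0$ and $\boldsymbol{\beta}$ are obtained as follows:

$$\min_{\boldsymbol{\beta},\,\beta_0} \; \tfrac{1}{2}\|\boldsymbol{\beta}\|^2 + \gamma \sum_{j=1}^{n} \xi_j$$

subject to

$$\xi_j \ge 0, \qquad y_j\,C(\mathbf{x}_j) \ge 1 - \xi_j \qquad (j = 1, \ldots, n).$$

$\xi_1, \ldots, \xi_n$ relate to the slack variables

$\gamma = \infty$: separable case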

Page 18: Classification of Microarray  Gene Expression Data

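$$\hat{\boldsymbol{\beta}} = \sum_{j=1}^{n} \hat{\alpha}_j y_j \mathbf{x}_j$$

with non-zero $\hat{\alpha}_j$ only for those observations j for which the constraints are exactly met (the support vectors).

$$C(\mathbf{x}) = \sum_{j=1}^{n} \hat{\alpha}_j y_j \mathbf{x}_j^T \mathbf{x} + \hat{\beta}_0 = \sum_{j=1}^{n} \hat{\alpha}_j y_j \langle \mathbf{x}_j, \mathbf{x} \rangle + \hat{\beta}_0$$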

Page 19: Classification of Microarray  Gene Expression Data

Support Vector Machine (SVM)

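REPLACE $\mathbf{x}$ by $h(\mathbf{x})$

$$C(\mathbf{x}) = \sum_{j=1}^{n} \hat{\alpha}_j y_j \langle h(\mathbf{x}_j), h(\mathbf{x}) \rangle + \hat{\beta}_0 = \sum_{j=1}^{n} \hat{\alpha}_j y_j K(\mathbf{x}_j, \mathbf{x}) + \hat{\beta}_0$$

where the kernel function $K(\mathbf{x}_j, \mathbf{x}) = \langle h(\mathbf{x}_j), h(\mathbf{x}) \rangle$ is the inner product in the transformed feature space.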
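
The kernel form of the classifier can be sketched as follows: the decision function is a weighted sum of kernel evaluations between the new point and the training points, plus an intercept. The support vectors, multipliers, and intercept below are hypothetical values chosen for illustration, and the radial basis kernel with its `scale` parameter is one assumed choice of kernel. With the linear kernel the sum reduces to the inner product of the point with the weight vector, plus the intercept.

```python
import numpy as np

def linear_kernel(u, v):
    """Inner product in the original feature space."""
    return float(u @ v)

def rbf_kernel(u, v, scale=0.5):
    """Radial basis kernel; `scale` is an illustrative parameter."""
    return float(np.exp(-scale * np.sum((u - v) ** 2)))

def svm_decision(x, support_x, support_y, alpha, beta0, kernel):
    """C(x) = sum_j alpha_j * y_j * K(x_j, x) + beta0."""
    total = sum(a * y * kernel(xj, x)
                for a, y, xj in zip(alpha, support_y, support_x))
    return total + beta0

# Hypothetical fitted quantities, for illustration only.
support_x = np.array([[1.0, 1.0], [-1.0, -1.0]])
support_y = np.array([1, -1])
alpha = np.array([0.25, 0.25])
beta0 = 0.0

x = np.array([2.0, 0.5])
c_lin = svm_decision(x, support_x, support_y, alpha, beta0, linear_kernel)
c_rbf = svm_decision(x, support_x, support_y, alpha, beta0, rbf_kernel)

# With the linear kernel, C(x) equals beta^T x + beta0, beta = sum_j alpha_j y_j x_j.
beta = (alpha * support_y) @ support_x
print(c_lin, float(beta @ x + beta0), c_rbf)
```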

Page 20: Classification of Microarray  Gene Expression Data

HASTIE et al. (2001, Chapter 12)

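The Lagrange (primal) function is

$$L_P = \tfrac{1}{2}\|\boldsymbol{\beta}\|^2 + \gamma \sum_{j=1}^{n} \xi_j - \sum_{j=1}^{n} \alpha_j \{ y_j\,C(\mathbf{x}_j) - (1 - \xi_j) \} - \sum_{j=1}^{n} \mu_j \xi_j \qquad (1)$$

which we minimize w.r.t. $\boldsymbol{\beta}$, $\beta_0$, and $\xi_j$. Here $\gamma$ is the cost attached to the slack variables $\xi_j$, and $\alpha_j$ and $\mu_j$ are the Lagrange multipliers.

Setting the respective derivatives to zero, we get

$$\boldsymbol{\beta} = \sum_{j=1}^{n} \alpha_j y_j \mathbf{x}_j \qquad (2)$$

$$0 = \sum_{j=1}^{n} \alpha_j y_j \qquad (3)$$

$$\alpha_j = \gamma - \mu_j \qquad (j = 1, \ldots, n) \qquad (4)$$

with $\alpha_j \ge 0$, $\mu_j \ge 0$, and $\xi_j \ge 0$ $(j = 1, \ldots, n)$.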

Page 21: Classification of Microarray  Gene Expression Data

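By substituting (2) to (4) into (1), we obtain the Lagrangian dual function

$$L_D = \sum_{j=1}^{n} \alpha_j - \tfrac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k y_j y_k \mathbf{x}_j^T \mathbf{x}_k \qquad (5)$$

We maximize (5) subject to

$$0 \le \alpha_j \le \gamma \quad \text{and} \quad \sum_{j=1}^{n} \alpha_j y_j = 0.$$

In addition to (2) to (4), the constraints include

$$\alpha_j \{ y_j\,C(\mathbf{x}_j) - (1 - \xi_j) \} = 0 \qquad (6)$$

$$\mu_j \xi_j = 0 \qquad (7)$$

$$y_j\,C(\mathbf{x}_j) - (1 - \xi_j) \ge 0 \qquad (8)$$

for $j = 1, \ldots, n$.

Together these equations (2) to (8) uniquely characterize the solution to the primal and dual problem.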

Page 22: Classification of Microarray  Gene Expression Data

Leo Breiman (2001)

Statistical modeling: the two cultures (with discussion).

Statistical Science 16, 199-231.

Discussants include Brad Efron and David Cox

Page 23: Classification of Microarray  Gene Expression Data

Selection bias in gene extraction on the basis of microarray gene-expression data

Ambroise and McLachlan

Proceedings of the National Academy of Sciences, Vol. 99, Issue 10, 6562-6566, May 14, 2002

http://www.pnas.org/cgi/content/full/99/10/6562

Page 24: Classification of Microarray  Gene Expression Data

GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)

• COLON Data (Alon et al., 1999)

• LEUKAEMIA Data (Golub et al., 1999)

Page 25: Classification of Microarray  Gene Expression Data

Since p>>n, consideration given to selection of suitable genes

SVM: FORWARD or BACKWARD (in terms of magnitude of weight βi)

RECURSIVE FEATURE ELIMINATION (RFE)

FISHER: FORWARD ONLY (in terms of CVE)

Page 26: Classification of Microarray  Gene Expression Data

GUYON et al. (2002)

LEUKAEMIA DATA:

Only 2 genes are needed to obtain a zero CVE (cross-validated error rate)

COLON DATA:

Using only 4 genes, CVE is 2%

Page 27: Classification of Microarray  Gene Expression Data

GUYON et al. (2002)

“The success of the RFE indicates that RFE has a built in regularization mechanism that we do not understand yet that prevents overfitting the training data in its selection of gene subsets.”

Page 28: Classification of Microarray  Gene Expression Data

Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples

Page 29: Classification of Microarray  Gene Expression Data

Figure 2: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of leukemia tissue samples

Page 30: Classification of Microarray  Gene Expression Data

Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data

Page 31: Classification of Microarray  Gene Expression Data

Figure 4: Error rates of Fisher’s rule with stepwise forward selection procedure using all the leukemia data

Page 32: Classification of Microarray  Gene Expression Data

Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues

Page 33: Classification of Microarray  Gene Expression Data

Error Rate Estimation

Suppose there are two groups G1 and G2

C(x) is a classifier formed from the data set

(x1, x2, x3, …, xn)

The apparent error is the proportion of the data set misallocated by C(x).

Page 34: Classification of Microarray  Gene Expression Data

Cross-Validation

From the original data set, remove x1 to give the reduced set

(x2, x3, …, xn)

Then form the classifier C(1)(x) from this reduced set.

Use C(1)(x1) to allocate x1 to either G1 or G2.

Page 35: Classification of Microarray  Gene Expression Data

Repeat this process for the second data point, x2.

So that this point is assigned to either G1 or G2 on the basis of the classifier C(2)(x2).

And so on up to xn.
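
The selection bias discussed in this part of the talk can be illustrated with a small simulation: if genes are selected using all the tissues and only the classifier is cross-validated, the leave-one-out error is optimistic; redoing the gene selection inside each leave-one-out step removes that bias. The sketch below is an assumption-laden illustration, not the procedure of the papers cited: it uses simulated noninformative data, a simple mean-difference score for gene selection, and a nearest-centroid rule swapped in for the SVM and Fisher rules. Note that tissues are stored in rows here.

```python
import numpy as np

def top_genes(X, y, k):
    """Indices of the k genes with the largest scaled difference in class means."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    score = np.abs(diff) / (X.std(axis=0) + 1e-12)
    return np.argsort(score)[-k:]

def nearest_centroid_predict(X_train, y_train, x):
    """Allocate x to the class whose training mean is closer."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    return int(np.sum((x - c1) ** 2) < np.sum((x - c0) ** 2))

def loo_error(X, y, k, select_inside):
    """Leave-one-out error; gene selection is redone per fold if select_inside."""
    n = len(y)
    genes_all = top_genes(X, y, k)           # selection using ALL tissues (biased)
    errors = 0
    for j in range(n):
        keep = np.arange(n) != j             # remove tissue j
        genes = top_genes(X[keep], y[keep], k) if select_inside else genes_all
        pred = nearest_centroid_predict(X[keep][:, genes], y[keep], X[j, genes])
        errors += int(pred != y[j])
    return errors / n

rng = np.random.default_rng(0)
n, p, k = 40, 2000, 10                       # hypothetical sizes
X = rng.normal(size=(n, p))                  # n tissues x p genes, pure noise
y = np.array([0, 1] * (n // 2))              # labels unrelated to the expressions

biased = loo_error(X, y, k, select_inside=False)
honest = loo_error(X, y, k, select_inside=True)
print(biased, honest)
```

Because the labels carry no information, an honest error estimate should be close to one half, while the estimate that ignores the selection step is typically far smaller.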

Page 36: Classification of Microarray  Gene Expression Data

Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples

Page 37: Classification of Microarray  Gene Expression Data

ADDITIONAL REFERENCES

Selection bias ignored:

XIONG et al. (2001, Molecular Genetics and Metabolism)

XIONG et al. (2001, Genome Research)

ZHANG et al. (2001, PNAS)

Aware of selection bias:

SPANG et al. (2001, In Silico Biology)

WEST et al. (2001, PNAS)

NGUYEN and ROCKE (2002)

Page 38: Classification of Microarray  Gene Expression Data

BOOTSTRAP APPROACH

Efron’s (1983, JASA) .632 estimator

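$$B.632 = .368\,AE + .632\,B1$$

where B1 is the bootstrap error rate when the rule $R_k^*$ is applied to a point not in the training sample.

A Monte Carlo estimate of B1 is

$$B1 = \frac{1}{n} \sum_{j=1}^{n} E_j$$

where

$$E_j = \frac{\sum_{k=1}^{K} I_{jk} Q_{jk}}{\sum_{k=1}^{K} I_{jk}}$$

with

$$I_{jk} = \begin{cases} 1 & \text{if } \mathbf{x}_j \notin k\text{th bootstrap sample} \\ 0 & \text{otherwise} \end{cases}$$

and

$$Q_{jk} = \begin{cases} 1 & \text{if } R_k^* \text{ misallocates } \mathbf{x}_j \\ 0 & \text{otherwise} \end{cases}$$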
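
The Monte Carlo estimate of B1 can be sketched as below: for each of K bootstrap samples a rule is formed, and each original point is scored only by the rules whose bootstrap sample did not contain it. The nearest-mean rule, the simulated data, and the handling of degenerate cases (skipping a bootstrap sample that misses a class, and averaging only over points left out at least once) are assumptions made for this illustration.

```python
import numpy as np

def nearest_mean_rule(X, y):
    """Swapped-in simple rule: allocate to the group with the closer sample mean."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda x: int(np.sum((x - m1) ** 2) < np.sum((x - m0) ** 2))

def error_rate(rule, X, y):
    """Proportion of the points in (X, y) misallocated by the rule."""
    return float(np.mean([rule(x) != t for x, t in zip(X, y)]))

def b1_estimate(X, y, K, rng):
    """Monte Carlo estimate of B1 from K bootstrap samples."""
    n = len(y)
    num = np.zeros(n)                        # sum over k of I_jk * Q_jk
    den = np.zeros(n)                        # sum over k of I_jk
    for _ in range(K):
        idx = rng.integers(0, n, size=n)     # kth bootstrap sample (with replacement)
        if len(np.unique(y[idx])) < 2:
            continue                         # assumption: skip samples missing a class
        rule = nearest_mean_rule(X[idx], y[idx])
        for j in np.setdiff1d(np.arange(n), idx):   # points NOT in the sample
            den[j] += 1
            num[j] += int(rule(X[j]) != y[j])
    used = den > 0                           # assumption: average over points left out at least once
    return float(np.mean(num[used] / den[used]))

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-1.0, size=(20, 2)),
               rng.normal(loc=1.0, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

ae = error_rate(nearest_mean_rule(X, y), X, y)   # apparent error AE
b1 = b1_estimate(X, y, K=50, rng=rng)
b632 = 0.368 * ae + 0.632 * b1
print(ae, b1, b632)
```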

Page 39: Classification of Microarray  Gene Expression Data

Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR

$$A(w) = (1 - w)\,AE + w\,CV2E$$

where $w = 0.5$.

McLachlan (1977) proposed $w = w_0$, where $w_0$ is chosen to minimize the asymptotic bias of $A(w)$ in the case of two homoscedastic normal groups.

The value of $w_0$ was found to range between 0.6 and 0.7, depending on the values of $p$, $\Delta$, and $n_1/n_2$.

Page 40: Classification of Microarray  Gene Expression Data

.632+ estimate of Efron & Tibshirani (1997, JASA)

$$B.632^{+} = (1 - w)\,AE + w\,B1$$

where

$$w = \frac{.632}{1 - .368\,r}$$

$$r = \frac{B1 - AE}{\gamma - AE} \qquad \text{(relative overfitting rate)}$$

$$\gamma = \sum_{i=1}^{g} p_i (1 - q_i) \qquad \text{(estimate of no information error rate)}$$

If r = 0, w = .632, and so B.632+ = B.632

If r = 1, w = 1, and so B.632+ = B1
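
The .632+ weighting can be sketched in a few lines, taking the apparent error AE, the bootstrap error B1, the class proportions p_i, and the proportions q_i of points assigned to each class as given inputs. The function name and the clamping of r to the interval [0, 1] are assumptions made here so that the example is well defined when B1 is no larger than AE.

```python
def b632_plus(ae, b1, class_props, assigned_props):
    """Return (estimate, w, r) for the .632+ error rate estimate.

    ae: apparent error rate AE
    b1: bootstrap error rate B1
    class_props: observed proportion p_i of points in each class
    assigned_props: proportion q_i of points the rule assigns to each class
    """
    # No-information error rate: gamma = sum_i p_i * (1 - q_i).
    gamma = sum(p * (1.0 - q) for p, q in zip(class_props, assigned_props))
    # Relative overfitting rate r, clamped to [0, 1] (an assumption of this sketch).
    if b1 > ae and gamma > ae:
        r = min((b1 - ae) / (gamma - ae), 1.0)
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * ae + w * b1, w, r

# r = 0 gives w = .632 (the .632 estimate); r = 1 gives w = 1 (B1 itself).
est_a, w_a, r_a = b632_plus(0.1, 0.1, [0.5, 0.5], [0.5, 0.5])
est_b, w_b, r_b = b632_plus(0.0, 0.5, [0.5, 0.5], [0.5, 0.5])
est_c, w_c, r_c = b632_plus(0.0, 0.2, [0.5, 0.5], [0.5, 0.5])
print(est_a, est_b, est_c)
```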

Page 41: Classification of Microarray  Gene Expression Data

“What we really need are expression profiles from hundreds or thousands of tumours linked to relevant, and appropriate, clinical data.”

One concern is the heterogeneity of the tumours themselves, which consist of a mixture of normal and malignant cells, with blood vessels in between.

Even if one pulled out some cancer cells from a tumour, there is no guarantee that those are the cells that are going to metastasize, just because tumours are heterogeneous.

John Quackenbush

Page 42: Classification of Microarray  Gene Expression Data

UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS)

INFER CLASS LABELS y1, …, yn of x1, …, xn

Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes

Eisen, Spellman, Brown, & Botstein (1998, PNAS)

Page 43: Classification of Microarray  Gene Expression Data

Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters.

(Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)

“in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a ‘good’ clustering algorithm or the ‘right’ number of clusters.”

Page 44: Classification of Microarray  Gene Expression Data

Attention is now turning towards a model-based approach to the analysis of microarray data

For example:

• Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9

• Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18

• Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7

• Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3

• Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17

Page 45: Classification of Microarray  Gene Expression Data

The notion of a cluster is not easy to define.

There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis.

That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.

Page 46: Classification of Microarray  Gene Expression Data

In this case, one attractive feature of adopting mixture models with elliptically symmetric components such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data).

Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.

Page 47: Classification of Microarray  Gene Expression Data

[Figure: plots of x with axes BP, Weight, and Height, and with the transformed axes BP, W-H, and WH]

Page 48: Classification of Microarray  Gene Expression Data

MIXTURE OF g NORMAL COMPONENTS

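$$f(\mathbf{x}) = \pi_1 \phi(\mathbf{x}; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \cdots + \pi_g \phi(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$$

where

$$-2 \log \phi(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) + \text{constant}$$

MAHALANOBIS DISTANCE

$$(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

EUCLIDEAN DISTANCE

$$(\mathbf{x} - \boldsymbol{\mu})^T (\mathbf{x} - \boldsymbol{\mu})$$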

Page 49: Classification of Microarray  Gene Expression Data

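MIXTURE OF g NORMAL COMPONENTS

$$f(\mathbf{x}) = \pi_1 \phi(\mathbf{x}; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \cdots + \pi_g \phi(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$$

SPHERICAL CLUSTERS

$$\boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_g = \sigma^2 \mathbf{I}$$

k-means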

Page 50: Classification of Microarray  Gene Expression Data

Equal spherical covariance matrices

Page 51: Classification of Microarray  Gene Expression Data

Crab Data

Figure 6: Plot of Crab Data

Page 52: Classification of Microarray  Gene Expression Data

Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.

Page 53: Classification of Microarray  Gene Expression Data

With a mixture model-based approach to clustering, an observation is assigned outright to the ith cluster if its density in the ith component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other (g-1) components.

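$$f(\mathbf{x}) = \pi_1 \phi(\mathbf{x}; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \cdots + \pi_i \phi(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) + \cdots + \pi_g \phi(\mathbf{x}; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$$

The weighted density of the $i$th component is $\pi_i \phi(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$.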
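
The outright assignment rule can be sketched as follows: evaluate each normal component density at the observation, weight it by the component's mixing proportion, and assign the observation to the component with the largest weighted density. The mixing proportions, means, and covariance matrices below are hypothetical values chosen for illustration, not estimates from any data set in the talk.

```python
import numpy as np

def normal_density(x, mu, sigma):
    """Multivariate normal density phi(x; mu, sigma)."""
    p = len(mu)
    d = x - mu
    maha = d @ np.linalg.solve(sigma, d)     # squared Mahalanobis distance
    norm = np.sqrt((2.0 * np.pi) ** p * np.linalg.det(sigma))
    return float(np.exp(-0.5 * maha) / norm)

def assign(x, pis, mus, sigmas):
    """Return (index of the assigned component, posterior probabilities)."""
    weighted = np.array([pi * normal_density(x, mu, s)
                         for pi, mu, s in zip(pis, mus, sigmas)])
    posterior = weighted / weighted.sum()    # weighted densities divided by f(x)
    return int(np.argmax(weighted)), posterior

# Hypothetical two-component mixture in two dimensions.
pis = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

label, tau = assign(np.array([2.5, 2.0]), pis, mus, sigmas)
print(label, tau)
```

The posterior probabilities are returned alongside the hard assignment because they indicate how clear-cut the allocation is.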

Page 54: Classification of Microarray  Gene Expression Data

http://www.maths.uq.edu.au/~gjm

McLachlan and Peel (2000), Finite Mixture Models. Wiley.

Page 55: Classification of Microarray  Gene Expression Data

Estimation of Mixture Distributions

It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data.

McLachlan and Krishnan (1997, Wiley)

Page 56: Classification of Microarray  Gene Expression Data

• If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities.

• With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.

Page 57: Classification of Microarray  Gene Expression Data

The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.

Page 59: Classification of Microarray  Gene Expression Data

Two Clustering Problems:

• Clustering of genes on basis of tissues: genes not independent

• Clustering of tissues on basis of genes: latter is a nonstandard problem in cluster analysis (n << p)

Page 60: Classification of Microarray  Gene Expression Data

Mixture Software

McLachlan, Peel, Adams, and Basford (1999)

http://www.maths.uq.edu.au/~gjm/emmix/emmix.html

Page 61: Classification of Microarray  Gene Expression Data

http://www.maths.uq.edu.au/~gjm/EMMIX_Demo/emmix.html

EMMIX for Windows

Page 62: Classification of Microarray  Gene Expression Data

PROVIDES A MODEL-BASED APPROACH TO CLUSTERING

McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422

http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf

Page 64: Classification of Microarray  Gene Expression Data

Example: Microarray Data

Colon Data of Alon et al. (1999)

n = 62 (40 tumours; 22 normals) tissue samples of p = 2,000 genes in a 2,000 × 62 matrix.

Page 66: Classification of Microarray  Gene Expression Data

Mixture of 2 normal components

Page 67: Classification of Microarray  Gene Expression Data

Mixture of 2 t components

Page 68: Classification of Microarray  Gene Expression Data

Mixture of 2 t components

Page 69: Classification of Microarray  Gene Expression Data

Mixture of 3 t components

Page 72: Classification of Microarray  Gene Expression Data

In this process, the genes are being treated anonymously.

May wish to incorporate existing biological information on the function of genes into the selection procedure.

Lottaz and Spang (2003, Proceedings of 54th Meeting of the ISI)

They structure the feature space by using a functional grid provided by the Gene Ontology annotations.

Page 74: Classification of Microarray  Gene Expression Data

Clustering of COLON Data

Genes using EMMIX-GENE

Page 78: Classification of Microarray  Gene Expression Data

Clustering of COLON Data

Tissues using EMMIX-GENE

Page 79: Classification of Microarray  Gene Expression Data

[Figure: Grouping for Colon Data, panels 1 to 20]

Page 80: Classification of Microarray  Gene Expression Data

Mixtures of Factor Analyzers

A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular, with high dimensional data.

One approach for reducing the number of parameters is to work in a lower dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997).

Page 81: Classification of Microarray  Gene Expression Data

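$$f(\mathbf{x}_j) = \sum_{i=1}^{g} \pi_i \phi(\mathbf{x}_j; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$$

where

$$\boldsymbol{\Sigma}_i = \mathbf{B}_i \mathbf{B}_i^T + \mathbf{D}_i \qquad (i = 1, \ldots, g),$$

$\mathbf{B}_i$ is a $p \times q$ matrix and $\mathbf{D}_i$ is a diagonal matrix.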
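
To see why the factor-analytic form reduces the number of parameters, compare the number of free parameters in an unrestricted p x p component-covariance matrix with the number in the form built from a p x q loading matrix plus a diagonal matrix. The sketch below uses the usual count for the factor model, pq + p - q(q-1)/2, where the last term allows for the rotational indeterminacy of the loadings; the function names are illustrative, and p = 40, q = 4 are chosen only to echo the sizes mentioned later in the talk.

```python
def cov_params_unrestricted(p):
    """Free parameters in an unrestricted symmetric p x p covariance matrix."""
    return p * (p + 1) // 2

def cov_params_factor(p, q):
    """Free parameters in B B^T + D with B of size p x q and D diagonal."""
    # p*q loadings + p diagonal entries - q(q-1)/2 for rotation of the factors.
    return p * q + p - q * (q - 1) // 2

p, q = 40, 4
full = cov_params_unrestricted(p)
reduced = cov_params_factor(p, q)
print(full, reduced)
```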

Page 82: Classification of Microarray  Gene Expression Data

Number of Components in a Mixture Model

Testing for the number of components, g, in a mixture is an important but very difficult problem which has not been completely resolved.

Page 83: Classification of Microarray  Gene Expression Data

Order of a Mixture Model

A mixture density with g components might be empirically indistinguishable from one with either fewer than g components or more than g components. It is therefore sensible in practice to approach the question of the number of components in a mixture model in terms of an assessment of the smallest number of components in the mixture compatible with the data.

Page 84: Classification of Microarray  Gene Expression Data

Likelihood Ratio Test Statistic

An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the LRTS, $-2\log\lambda$. Suppose we wish to test the null hypothesis,

$$H_0: g = g_0 \quad \text{versus} \quad H_1: g = g_1$$

for some $g_1 > g_0$.

Page 85: Classification of Microarray  Gene Expression Data

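We let $\hat{\boldsymbol{\Psi}}_i$ denote the MLE of $\boldsymbol{\Psi}$ calculated under $H_i$ $(i = 0, 1)$. Then the evidence against $H_0$ will be strong if $\lambda$ is sufficiently small, or equivalently, if $-2\log\lambda$ is sufficiently large, where

$$-2\log\lambda = 2\{\log L(\hat{\boldsymbol{\Psi}}_1) - \log L(\hat{\boldsymbol{\Psi}}_0)\}$$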

Page 86: Classification of Microarray  Gene Expression Data

Bootstrapping the LRTS

McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

$$H_0: g = g_0 \quad \text{v} \quad H_1: g = g_1$$

for a specified value of $g_0$.

Page 87: Classification of Microarray  Gene Expression Data

Bayesian Information Criterion

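The Bayesian information criterion (BIC) of Schwarz (1978) is given by

$$-2\log L(\hat{\boldsymbol{\Psi}}) + d \log n$$

as the criterion to be minimized in model selection (equivalently, the penalized log likelihood $\log L(\hat{\boldsymbol{\Psi}}) - \tfrac{1}{2} d \log n$ is to be maximized), including the present situation for the number of components g in a mixture model. Here d is the number of free parameters and n is the sample size.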
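
As a small worked example of model selection with BIC in the form -2 log L + d log n, for which smaller values are preferred, the sketch below compares mixtures with g = 1, 2, 3 components. The maximized log likelihoods are hypothetical numbers, and the parameter count d = 3g - 1 assumes a univariate normal mixture with unrestricted means, variances, and mixing proportions.

```python
import math

def bic(log_lik, d, n):
    """BIC in the form -2 log L + d log n (smaller is better)."""
    return -2.0 * log_lik + d * math.log(n)

n = 100                                           # hypothetical sample size
log_liks = {1: -250.0, 2: -228.0, 3: -226.5}      # hypothetical maximized log likelihoods

# Univariate normal mixture: g means + g variances + (g - 1) mixing proportions.
scores = {g: bic(ll, 3 * g - 1, n) for g, ll in log_liks.items()}
best_g = min(scores, key=scores.get)
print(scores, best_g)
```

The third component raises the log likelihood only slightly, so the penalty d log n outweighs the gain and the two-component model is chosen.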

Page 88: Classification of Microarray  Gene Expression Data

Gap statistic (Tibshirani et al., 2001)

Clest (Dudoit and Fridlyand, 2002)

Page 89: Classification of Microarray  Gene Expression Data

Analysis of LEUKAEMIA Data using EMMIX-GENE

Page 95: Classification of Microarray  Gene Expression Data

Breast cancer data set of van’t Veer et al. (2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415)

These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours.

The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.

Page 96: Classification of Microarray  Gene Expression Data

The Economist (US), February 2, 2002

The chips are down; Diagnosing breast cancer (Gene chips have shown that there are two sorts of breast cancer)

Page 97: Classification of Microarray  Gene Expression Data

Nature (2002, 4 July Issue, 418)

News feature (Ball)

Data visualization: Picture this

Page 98: Classification of Microarray  Gene Expression Data

Colour-coded: this plot of gene-expression data shows breast tumours falling into two groups

Page 99: Classification of Microarray  Gene Expression Data

Microarray data from 98 patients with primary breast cancers with p = 24,881 genes

• 44 from good prognosis group (remained metastasis free after a period of more than 5 years)

• 34 from poor prognosis group (developed distant metastases within 5 years)

• 20 with hereditary form of cancer (18 with BRCA1; 2 with BRCA2)

Page 100: Classification of Microarray  Gene Expression Data

Pre-processing filter of van’t Veer et al.

Only genes with both:

• P-value less than 0.01; and

• at least a two-fold difference in more than 5 out of the 98 tissues

were retained.

This reduces the data set to 4869 genes.
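
A filter of this kind can be sketched as a boolean mask over the genes. The sketch assumes that, for every gene and tissue, a log2 expression ratio and a P-value are available, and that a tissue counts towards the threshold only when both conditions hold for it; the function name, argument names, and toy arrays are hypothetical and are not taken from van’t Veer et al.

```python
import numpy as np

def filter_genes(log2_ratio, p_value, fold=2.0, alpha=0.01, min_tissues=5):
    """Boolean mask of genes to retain.

    log2_ratio, p_value: arrays of shape (genes, tissues).
    A gene is kept if MORE than min_tissues tissues show at least a
    `fold`-fold difference with a P-value below alpha.
    """
    hits = (np.abs(log2_ratio) >= np.log2(fold)) & (p_value < alpha)
    return hits.sum(axis=1) > min_tissues

# Toy data: 3 genes x 8 tissues.
log2_ratio = np.zeros((3, 8))
log2_ratio[0, :] = 1.5            # gene 0: large change in all 8 tissues
log2_ratio[1, :5] = -1.5          # gene 1: large change in exactly 5 tissues
log2_ratio[2, :] = 1.5            # gene 2: large change, but not significant
p_value = np.full((3, 8), 0.001)
p_value[2, :] = 0.2

keep = filter_genes(log2_ratio, p_value)
print(keep)
```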

Page 101: Classification of Microarray  Gene Expression Data

Heat Map Displaying the Reduced Set of 4,869 Genes on the 98 Breast Cancer Tumours

Page 102: Classification of Microarray  Gene Expression Data

Unsupervised Classification Analysis Using EMMIX-GENE

Steps used in the application of EMMIX-GENE:

1. Select the most relevant genes from this filtered set of 4,869 genes. The set of retained genes is thus reduced to 1,867.

2. Cluster these 1,867 genes into forty groups. The majority of gene groups produced were reasonably cohesive and distinct.

3. Using these forty group means, cluster the tissue samples into two and three components using a mixture of factor analyzers model with q = 4 factors.

Page 103: Classification of Microarray  Gene Expression Data


Heat Map of Top 1867 Genes

Page 105: Classification of Microarray  Gene Expression Data

[Figure: panels for gene groups 1 to 20]

Page 106: Classification of Microarray  Gene Expression Data

[Figure: panels for gene groups 21 to 40]

Page 107: Classification of Microarray  Gene Expression Data

 i   m_i   U_i      i   m_i   U_i      i   m_i   U_i      i   m_i   U_i
 1   146   112.98   11   66   25.72    21   44   13.77    31   53   9.84
 2    93    74.95   12   38   25.45    22   30   13.28    32   36   8.95
 3    61    46.08   13   28   25.00    23   25   13.10    33   36   8.89
 4    55    35.20   14   53   21.33    24   67   13.01    34   38   8.86
 5    43    30.40   15   47   18.14    25   12   12.04    35   44   8.02
 6    92    29.29   16   23   18.00    26   58   12.03    36   56   7.43
 7    71    28.77   17   27   17.62    27   27   11.74    37   46   7.21
 8    20    28.76   18   45   17.51    28   64   11.61    38   19   6.14
 9    23    28.44   19   80   17.28    29   38   11.38    39   29   4.64
10    23    27.73   20   55   13.79    30   21   10.72    40   35   2.44

where i = group number

m_i = number in group i

U_i = -2 log λ_i

Page 108: Classification of Microarray  Gene Expression Data

Heat Map of Genes in Group G1

Page 109: Classification of Microarray  Gene Expression Data

Heat Map of Genes in Group G2

Page 110: Classification of Microarray  Gene Expression Data

Heat Map of Genes in Group G3

Page 111: Classification of Microarray  Gene Expression Data

1. A change in gene expression is apparent between the sporadic (first 78 tissue samples) and hereditary (last 20 tissue samples) tumours.

2. The final two tissue samples (the two BRCA2 tumours) show consistent patterns of expression. This expression is different from that exhibited by the set of BRCA1 tumours.

3. The problem of trying to distinguish between the two classes, patients who were disease-free after 5 years (G1) and those with metastases within 5 years (G2), is not straightforward on the basis of the gene expressions.

Page 112: Classification of Microarray  Gene Expression Data

Selection of Relevant Genes

We compared the genes selected by EMMIX-GENE with those genes retained in the original study by van’t Veer et al. (2002).

van’t Veer et al. used an agglomerative hierarchical algorithm to organise the genes into dominant genes groups. Two of these groups were highlighted in their paper, with their genes corresponding to biologically significant features.

Page 113: Classification of Microarray  Gene Expression Data

Identification of van’t Veer et al. | Number of genes | Number of matches with genes retained by select-genes
Cluster A, containing genes co-regulated with the ER-α gene (ESR1) | 40 | 24
Cluster B, containing “co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells” | 40 | 23

We can see that of the 80 genes identified by van’t Veer et al., only 47 are retained by the select-genes step of the EMMIX-GENE algorithm.

Page 114: Classification of Microarray  Gene Expression Data

Subsets of these 47 genes appeared inside several of the 40 groups produced by the cluster-genes step of EMMIX-GENE.

Comparing Clusters from Hierarchical Algorithm with those from EMMIX-GENE Algorithm

            Cluster Index    Number of Genes    Percentage
            (EMMIX-GENE)     Matched            Matched (%)
Cluster A    2               21                 87.5
             3                2                  8.33
            14                1                  4.17
Cluster B   17               18                 78.3
            19                1                  4.35
            21                4                 17.4

Page 115: Classification of Microarray  Gene Expression Data

Genes Retained by EMMIX-GENE Appearing in Cluster A (vertical blue lines indicate the three groups of tumours)

Page 116: Classification of Microarray  Gene Expression Data

Genes Rejected by EMMIX-GENE Appearing in Cluster A

Page 117: Classification of Microarray  Gene Expression Data

Genes Retained by EMMIX-GENE Appearing in Cluster B

Page 118: Classification of Microarray  Gene Expression Data

Genes Rejected by EMMIX-GENE Appearing in Cluster B

Page 119: Classification of Microarray  Gene Expression Data

Assessing the Number of Tissue Groups

To assess the number of components g to be used in the normal mixture, the likelihood ratio statistic was adopted, and the resampling approach was used to assess the P-value.

By proceeding sequentially, testing the null hypothesis H0: g = g0 versus the alternative hypothesis H1: g = g0 + 1, starting with g0 = 1 and continuing until a non-significant result was obtained, it was concluded that g = 3 components were adequate for this data set.

Page 120: Classification of Microarray  Gene Expression Data

Clustering Tissue Samples on the Basis of Gene Groups using EMMIX-GENE

Tissue samples can be subdivided into two groups corresponding to 78 sporadic tumours and 20 hereditary tumours.

When the two cluster assignment of EMMIX-GENE is compared to this genuine grouping, only 1 of the 20 hereditary tumour patients is misallocated, although 37 of the sporadic tumour patients are incorrectly assigned to the hereditary tumour cluster.

Page 121: Classification of Microarray  Gene Expression Data

Using a mixture of factor analyzers model with q = 8 factors, we would misallocate:

7 out of the 44 members of G1;

24 out of the 34 members of G2; and

1 of the 18 BRCA1 samples.

The misallocation rate of 24/34 for the second class, G2, is not surprising given both the gene expressions as summarized in the groups of genes and that we are classifying the tissues in an unsupervised manner without using the knowledge of their true classification.

Page 122: Classification of Microarray  Gene Expression Data

Supervised Classification

When knowledge of the groups’ true classification is used (van’t Veer et al.), the reported error rate was approximately 50% for members of G2 when allowance was made for the selection bias in forming a classifier on the basis of an optimal subset of the genes.

Further analysis of this data set in a supervised context confirms the difficulty in trying to discriminate between the disease-free class G1 and the metastases class G2. (Tibshirani and Efron, 2002, “Pre-Validation and Inference in Microarrays”, Statistical Applications in Genetics and Molecular Biology 1)

Page 124: Classification of Microarray  Gene Expression Data

Investigating Underlying Signatures With Other Clinical Indicators

The three clusters constructed by EMMIX-GENE were investigated in order to determine whether they followed a pattern contingent upon the clinical predictors of histological grade, angioinvasion, oestrogen receptor, and lymphocytic infiltrate.

Page 126: Classification of Microarray  Gene Expression Data

Microarrays have become promising diagnostic tools for clinical applications.

However, large-scale screening approaches in general, and microarray technology in particular, inescapably lead to the challenging problem of learning from high-dimensional data.

Page 127: Classification of Microarray  Gene Expression Data

Hope to see you in Cairns in 2004!