4
A Novel Hybrid Approach to Selecting Marker Genes for Cancer Classification Using Gene Expression Data Li Jiangeng, Duan Yanhua, Ruan Xiaogang School of Electronic Information & Control Engineering, Beijing University of Technology Beijing, China [email protected] , [email protected] , [email protected] Abstract—Selecting a subset of marker genes from thousands of genes is an important topic in microarray experiments for diseases classification and prediction. In this paper, we proposed a novel hybrid approach that combines gene ranking, heuristic clustering analysis and wrapper method to select marker genes for tumor classification. In our method, we firstly employed gene filtering to select the informative genes; secondly, we extracted a set of prototype genes as the representative of the informative genes by heuristic K-means clustering; finally, employed SVM- RFE to find marker genes from the representative genes based on recursive feature elimination. The performance of our method was evaluated by AML/ALL microarray dataset. The experimental results revealed that our method could find very small subset of marker genes with minimum redundancy but got better classification accuracy. Keywords-gene expression profiles, feature selection, heuristic clustering, prototype gene, SVM-RFE, cancer classification. I. INTRODUCTION In the last few years, DNA microarray technology has become a fundamental tool in genomic research, and has been introduced a paradigmatic change in biology by shifting experimental approaches from single gene studies to genome- level analyses [1]. Recent studies have shown that microarray gene expression data is useful for phenotype classification of cancer diseases [2, 4-5]. However, classification using gene expression data has a major challenge because of the characteristics in microarray data set, which has the very high dimensionality (large number of genes) with a small number of samples in the data set. And it is very important but difficult to identify which genes contribute most to classification. Facing these challenges, feature selection techniques has been introduced to select a small subset of genes as features for classification. Feature selection (gene selection) is crucial for several of reasons in the task of tumor classification using the gene expression data, such as improving classification accuracy, reducing the cost in a clinical setting and gaining significant insight into the mechanism of disease [3-6]. Recent gene selection (feature selection) methods fall into two categories: filter methods and wrapper methods [3]. The wrapper methods which are very popular in machine learning applications; but are not widely used in DNA microarray tasks. Owing to high dimensions of microarray data, it is NP hard to find the most optimal gene subset among all combinations of genes. Although many heuristic search strategies can be employed, these are still too computationally expensive [4]. Filter methods which were said gene ranking methods in gene expression data area, attempt to find predictive subsets of the genes by using a simple criterion computed from the empirical distribution, and the top-genes were selected as a feature subset. The most commonly used gene selection methods are based on statistical tests or information theory to rank the genes [4, 5]. Each gene is evaluated individually and assigned a score reflecting its correlation with the class according to certain criterion in these gene ranking methods. The gene is independent of any learning methods in ranking gene methods. Therefore they have better generalization property and computational efficiency. However, there is a problem that these selected genes are often highly correlated [6]. Because these selected genes may belong to the same signaling pathways or function related to the disease. Therefore, if a gene has a high ranked; other genes which are highly correlated with it, that may be having highly ranked in the gene ranking method as well. This redundancy is an additional computational border, also can lead to misclassifications. For the reasons discussed above, we proposed a new hybrid method for selecting marker genes from gene expression data. Our approach combines three machine learning methods: gene ranking, clustering and wrapper method. In this approach, we first applied feature filter algorithm to select a set of top-ranked informative genes; secondly, in order to reduce the redundancy of the informative genes, we extracted a set of prototype genes as the representative of the informative genes by heuristic K- means clustering; finally, we employed SVM-FRE to select a set of marker genes. When applied our method to ALL/AML leukemia expression dataset, our approach was capable of selecting a very small subset of marker genes with the same or better classification accuracy. The rest of this paper is organized as follows: Section two will introduce gene ranking method to filter the noise genes; the method of extracting prototype gene is presented in section three. The result of selecting marker genes is given in section four; experimental results will be reached in section five and conclusion is in the final section. 1-4244-1120-3/07/$25.00 ©2007 IEEE 264

[IEEE 2007 1st International Conference on Bioinformatics and Biomedical Engineering - Wuhan, China (2007.07.6-2007.07.8)] 2007 1st International Conference on Bioinformatics and Biomedical

  • Upload
    ruan

  • View
    216

  • Download
    0

Embed Size (px)

Citation preview

A Novel Hybrid Approach to Selecting Marker Genes for Cancer Classification Using Gene Expression

Data

Li Jiangeng, Duan Yanhua, Ruan Xiaogang School of Electronic Information & Control Engineering, Beijing University of Technology

Beijing, China [email protected], [email protected], [email protected]

Abstract—Selecting a subset of marker genes from thousands of genes is an important topic in microarray experiments for diseases classification and prediction. In this paper, we proposed a novel hybrid approach that combines gene ranking, heuristic clustering analysis and wrapper method to select marker genes for tumor classification. In our method, we firstly employed gene filtering to select the informative genes; secondly, we extracted a set of prototype genes as the representative of the informative genes by heuristic K-means clustering; finally, employed SVM-RFE to find marker genes from the representative genes based on recursive feature elimination. The performance of our method was evaluated by AML/ALL microarray dataset. The experimental results revealed that our method could find very small subset of marker genes with minimum redundancy but got better classification accuracy.

Keywords-gene expression profiles, feature selection, heuristic clustering, prototype gene, SVM-RFE, cancer classification.

I. INTRODUCTION In the last few years, DNA microarray technology has

become a fundamental tool in genomic research, and has been introduced a paradigmatic change in biology by shifting experimental approaches from single gene studies to genome-level analyses [1]. Recent studies have shown that microarray gene expression data is useful for phenotype classification of cancer diseases [2, 4-5]. However, classification using gene expression data has a major challenge because of the characteristics in microarray data set, which has the very high dimensionality (large number of genes) with a small number of samples in the data set. And it is very important but difficult to identify which genes contribute most to classification. Facing these challenges, feature selection techniques has been introduced to select a small subset of genes as features for classification. Feature selection (gene selection) is crucial for several of reasons in the task of tumor classification using the gene expression data, such as improving classification accuracy, reducing the cost in a clinical setting and gaining significant insight into the mechanism of disease [3-6].

Recent gene selection (feature selection) methods fall into two categories: filter methods and wrapper methods [3]. The wrapper methods which are very popular in machine learning applications; but are not widely used in DNA microarray tasks.

Owing to high dimensions of microarray data, it is NP hard to find the most optimal gene subset among all combinations of genes. Although many heuristic search strategies can be employed, these are still too computationally expensive [4]. Filter methods which were said gene ranking methods in gene expression data area, attempt to find predictive subsets of the genes by using a simple criterion computed from the empirical distribution, and the top-genes were selected as a feature subset. The most commonly used gene selection methods are based on statistical tests or information theory to rank the genes [4, 5]. Each gene is evaluated individually and assigned a score reflecting its correlation with the class according to certain criterion in these gene ranking methods. The gene is independent of any learning methods in ranking gene methods. Therefore they have better generalization property and computational efficiency. However, there is a problem that these selected genes are often highly correlated [6]. Because these selected genes may belong to the same signaling pathways or function related to the disease. Therefore, if a gene has a high ranked; other genes which are highly correlated with it, that may be having highly ranked in the gene ranking method as well. This redundancy is an additional computational border, also can lead to misclassifications.

For the reasons discussed above, we proposed a new hybrid method for selecting marker genes from gene expression data. Our approach combines three machine learning methods: gene ranking, clustering and wrapper method. In this approach, we first applied feature filter algorithm to select a set of top-ranked informative genes; secondly, in order to reduce the redundancy of the informative genes, we extracted a set of prototype genes as the representative of the informative genes by heuristic K-means clustering; finally, we employed SVM-FRE to select a set of marker genes. When applied our method to ALL/AML leukemia expression dataset, our approach was capable of selecting a very small subset of marker genes with the same or better classification accuracy.

The rest of this paper is organized as follows: Section two will introduce gene ranking method to filter the noise genes; the method of extracting prototype gene is presented in section three. The result of selecting marker genes is given in section four; experimental results will be reached in section five and conclusion is in the final section.

1-4244-1120-3/07/$25.00 ©2007 IEEE

264

II. FILTERING NOISE GENES There are large numbers of genes carrying no significant

information with regard to classification; these genes are considered as noise gene. We applied a criterion ( )f g to evaluate each gene, and according to the distance the gene into two groups: noise genes and informative genes. To eliminate noise genes, we rank the all genes and select a set of the top-ranking genes as informative genes. We use the Bhattacharyya distance to evaluate the information of classification and rank all genes. The Bhattacharyya distance of a feature ( )f g is defined the following equation:

( )2 2 2

2 2

1 1( ) ln4 2 2

g g g g

g g g g

f gµ µ σ σσ σ σ σ

+ − + −

+ − + −

− += + +

(1)

Where gµ gσ are the mean and the standard deviation of the gene expression values of gene g for all the patients of class (+) or class (-).

III. EXTRACTING PROTOTYPE-GENES We proposed an extracting prototype gene method based a

heuristic clustering. This method works the following steps. The fist step is to identify equivalent clusters inside gene expression based on a heuristic clustering. The second step is to create gene prototypes that are good preventatives of these clusters.

A. Heuristic K-means clustering Clustering techniques have been proven to be helpful for

understanding microarray gene expression data [6, 7]. Co-expressed genes can be grouped in clusters based on the expression patterns. The informative genes are often highly correlated which were selected by the gene selecting algorithm. These redundancy genes will increase the computational expensive of the procedure of selected marker genes.

Here, we employed a heuristic K-means clustering [8] to analyze the informative genes for extracting prototype genes. The heuristic K-means algorithm is that the initial cluster number is not critical to the clustering results; and moreover, automatically determines a semi-optimal number of clusters according to the statistical nature of data, figure 1.

The heuristic K-means clustering works as the following four steps:

1) Cluster the informative genes using K-means algorithm.

2) Find the outliers of each cluster.

3) Adjust the set of centroids and the initial cluster number K, then use K-means clustering.

4) Delete the empty clusters.

5) Find and merge the similar clusters until there are no similar clusters.

After clustering analysis, there were 20 clusters with the number of genes more than one. Therefore, there were most of genes with a similar expression patterns. In other words, there is high redundancy in the informative genes selected by filter

1 2 3 4 5 6 7 8 9

5

10

15

20

25

Experiment Number

Num

ber of C

lusters

Initial Number

Final Number

Figure 1: Heuristic K-means analysis initial number clusters vs

final number clusters

Figure 2. The gene expression in the clusters by the heuristic K-means

method. The numbers of genes more than seven in these clusters were presented in Fig. 2.

B. Construction of prototype-genes We extract a prototype gene (as similar [9]) from each

cluster that was the result of the clustering analysis. Here, the prototype gene is meant to represent the gene in a cluster. The role of the prototype method is to reduce the redundancy of the informative genes. From a biological point of view, the prototype is characterized by an expression profiles the most similar to all genes of cluster. In other words, the co-expressed genes may belong to the same pathways or have similar function. In this paper, we select the gene as the prototype gene which has the minimum total distance to other genes in cluster. We extract the set of prototype

uP from clusteruC ; the vector

( )1, 2 .. . iy y y represents the expression of the expression of the prototype uP ; the

ijD represents the total distance of gene j to other genes in the cluster i .

( )1,...u iP y y= With argmin( )i ijy D= (2)

IV. SELECTING THE OPTIMAL MARKER GENES We select the optimal marker genes from all the prototype

genes by a wrapper method that is support vector machine based recursive feature elimination. The marker genes are the subset of the prototype genes with best classification performance and consist of genes as fewer as possible. The

265

support vector machine (SVM) is one of the most powerful supervised learning algorithms in statistical learning theory [10].

In this paper, we apply a linear support vector machine as a classifier to select the optimal marker genes based recursive feature elimination, which is called SVM-RFE [11]. The SVM-RFE algorithm works as the following iterative procedure until there is no gene in training set.

1) Compute the ranking criterion for all features.

2) Remove the feature with smallest ranking criterion.

In the step 2), the ranking criterion is defined:

( )1

( ) ( , )T

sv

j i i iX S ij

S x y K X X bx

α∈ =

∂= +

∂∑ ∑ (3)

Where TS is the set of training samples, X is the input genes

and the jx is the jth gene in X . The iy is the class label,

and ,i bα can be obtained by loving a quadratic programming

problem. ( ).K is the kernel function used inside the SVM.

V. EXPERIMENTAL RESULTS The public ALL/AML leukemia dataset [2] was used to

test our method. It consists of 72 samples, 25 of AML (acute myeloid leukemia) and 47 of ALL (acute lymphoblastic leukemia). Each sample in the dataset consists of 7129 genes. In next section, we present results that our method proposed this paper was applied to the ALL/AML leukemia datasets.

We use the Bhattacharyya distance to rank all genes in the training sample set and select the top 150 genes as informative gene. However, the informative genes are often highly correlated which were selected by the gene selecting algorithm. Therefore, the heuristic K-means clustering algorithm is employed to analyzing the informative genes extracting prototype genes for reducing their redundancy. We had discovered 20 clusters more than one genes. We found that there were most of genes with a similar expression patterns from the Figure 2. These genes in the same cluster may belong to the same signaling pathways or function related to the disease. Therefore, we extracted the prototype gene from each cluster as the representative of the cluster.

After having extracted prototype genes, we find the marker genes from the set of prototype genes by SVM-RFE. Then we use each subset of genes as the input features of SVM to evaluate the classification performance of the selected representative genes. We can obtain the ability of the subset of representative genes by the following two ways: (1) “Leave-One-Out Cross Validation” (LOOCV), (2) the “Independent Test” (IT) using the test set. Figure 3 displays the test results using different subset of the representative genes. We can know that four prototype genes is the best feature of ALL/AML classification from Figure 3. There were no error in LOOCV test and only one error in the IT test. We define these

51015200

5

10

15

20

25

30

Error

Num

bers

The number of marker genes

IT test

LOOCV test

Figure 3: Classification errors under different set of marker

genes

Table 1: Comparison of classification methods

Method Errors of

LOOCV test

Errors of

IT test

# of

genes

Golub et

al.[2] 2/38 5/34 50

Tibshirani et

al.[12] 1/38 2/34 21

Guyon et al.

[11] 0/38 0/34 8

Our method 0/38 1/34 4

four genes as the set of marker genes for classification. Table 1 reveals that our approach can achieve to select fewer marker genes with minimum redundancy but getting better classification accuracy. The results from our approach compared with the methods which were proposed previously are shown in Table1.

VI. CONCLUSION In this paper, we proposed a novel hybrid approach, which

sequentially combines gene ranking, heuristic K-means clustering and wrapper method to select marker genes for tumor classification. Co-expressed genes can be grouped in clusters based on the expression patterns. We employed the heuristic K-means clustering to extract prototype gene for reducing the redundancy genes which are highly correlated genes. And then we used SVM-REF to extract marker genes from the small set of representative genes. This hybrid method takes advantages of both gene ranking’s efficiency and wrapper methods’ high accuracy. Our method has been implemented on ALL/AML dataset. Experimental results have shown that our method can achieve to select few of marker genes with minimum redundancy but getting better classification accuracy.

ACKNOWLEDGEMENT

266

This work was supported by the National Natural Science Foundation of China under grant No. 60234020. We thank Grace Huang for extensive editing assistance.

REFERENCES [1] Blaschke C., Oliveros, J.C. and Valencia, A., “Mining functional

information associated with expression arrays”, Functional and Integrative Genomics, Springer, Berlin, 2001, 1(4), pp.256-268.

[2] Golub T R, Slonim D K, Tamayo P, et al, “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring”, Science, AAAS, New York, 1999, 286(15), pp.531-537.

[3] Inza, I., Larranaga, P., Blanco, R. and Cerrolaza, A.J., “Filter versus wrapper gene approaches in DNA microarray domains”, Artificial Intelligence in Medicine, ELSEVIER, Amsterdam, 2004, 31(2), pp.91-103.

[4] C.H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction for the analysis of gene expression data,” Bioinformatics, Oxford University Press, Oxford, 2003, 19(1), pp. 37-44.

[5] Li, J., Zhang, C. and Olihara, M., “A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression”, Bioinformatics, Oxford University Press, Oxford, 2004, 20(15), pp. 2429-2437.

[6] Jaeger, J., Sengupta, R. and Ruzzo, W. L., “Improved gene selection for classification of microarrays”, Pac. Symp. Biocomput, Hawaii, USA, 2003, pp. 53-64.

[7] [7] Daxin Jiang, Chun Tang, and Aidong Zhang, “Cluster Analysis for Gene Expression Data: A Survey”, IEEE Transactions on Knowledge and Data Engineering, IEEE, USA, 2004, 16(11), pp. 1370-1386.

[8] Blaise Hanczar, Melanie Courtine, Arriel Benis, Corneliu Hennegar,Karine Clement, Jean-Daniel Zucker, “Improving classification of microarray data using prototype-based feature selection”, ACM SIGKDD Exploration Newsletter, ACM Press, New York, USA, 2003, 5(2), pp.23-30.

[9] Y. Guan, A. Ghorbani, and N. Belacel, “K-means+: An autonomous clustering algorithm”, In submission.

[10] Vapnik, V. N., Statisitcal Learning Theory, Wiley-Interscience, New York, USA, 1998.

[11] I. Guyon, J. Weston, S.Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machine”, Machine learning, Springer, Berlin, 2002, 46(13), pp. 389-422.

[12] Tibshirani R ,Hastie T ,Narasimhan B , et al , “Diagnosis of multiple cancer types by shrunken centroids of gene expression”, PNAS, National Academy of Sciences, Washington, 2002, 99(10), pp. 6567 – 6572.

267