Microarray time series

8/7/2019 Microarray time series

http://slidepdf.com/reader/full/microarray-time-series 1/19

Survey

Cairo University

Faculty of Computers and Information

Information Systems Department

Bioinformatics Survey

Microarray time series

classification

Prepared by:

Mohamed Mahmoud Mahmoud

Mostafa Lamlom Ahmed

Hani HusseinHassan Ali

Supervised by:

Dr. Hoda Mokhtar



Microarray Time Series Classification2

List of Figures

Figure 1.1«««««««««««««««««««««««««««««««.........6

Figure 1.2 «««««««««««««««««««««««««««««««««6

Figure 2.1«««««««««««««««««««««««««««««««««.9

Figure 3.1«««««««««««««««««««««««««««««««««.14

Figure 3.2«««««««««««««««««««««««««««««««««.15




Table of Contents

Chapter 1: Introduction1.1 What is DNA microarray technology?««««««««««««««««4

1.2 What is DNA microarray technology used for?...................................................5

1.3 How does DNA microarray technology works? ««««««««««««6

1.4 Microarray types and usage««««««««««««««««««««.8

Chapter 2: Microarray methods2.1 Independent Component Analysis (ICA)««« ««««««««««««9

2.2 ICA models of gene expression data «««««.««««................................9

2.3 Gene selection««««««««««««««.............................................. 10

2.4 Classifiers««««««««««««««««««««««««««« 10

Chapter 3: Data mining for DNA microarray3.1 Gene Selection ««««««.««««««««««««««««««««11

3.2 Pattern Classifier «««««««««««««««««««««««««12

3.2.1 MLP«««««««««««««««««««««««««««..12

3.2.2 K NN«««««««««««««««««««««««««««..12

3.2.3 SVM«««««««««««««««««««««««««««..133.2.4 Ensemble«««««««««««««««««««««««««..13

Chapter 4: Classification Algorithms

4.1 Wavelet Approach««««««««««««««««««««««««««««««.14 Chapter 5: Conclusion ......................................................................«.............17

Chapter 6: Future Work ......................................................................«..........17

Chapter 7: References ......................................................................«.............18




Chapter 1: Introduction

Introduction

In [1], Microarray technology has supplied a large volume of data, which changes many problems in biology into the problems of computing. As a result techniques for extracting useful information

from the data are developed. In particular, microarray technology has been applied to prediction anddiagnosis of cancer, so that it expectedly helps us to exactly predict and diagnose cancer. To

precisely classify cancer we have to select genes related to cancer because the genes extracted frommicroarray have many noises. In this paper, we attempt to explore seven feature selection methods

and four classifiers and propose ensemble classifiers

In three benchmark datasets to systematically evaluate the performances of the feature selection

methods and machine learning classifiers. Three benchmark datasets are leukemia cancer dataset,colon cancer dataset and lymphoma cancer data set. The methods to combine the classifiers are

majority voting, weighted voting, and Bayesian approach to improve the performance of classification. Experimental results show that the ensemble with several basis classifiers produces

the best recognition rate on the benchmark datasets.

In [5], Microarray technologies facilitate the generation of vast amount of bio-signal or genomicsignal data. The major challenge in processing these signals is the extraction of the global

characteristics of the data due to their huge dimension and the complex relationship among various

genes

1.1 What is DNA microarray technology?

In [1], although all of the cells in the human body contain identical genetic material, the same genes

are not active in every cell. Studying which genes are active and which are inactive in different celltypes helps scientists to understand both how these cells function normally and how they are affected

when various genes do not perform properly. In the past, scientists have only been able to conductthese genetic analyses on a few genes at once. With the development of DNA microarray

technology, however, scientists can now examine how active thousands of genes are at any giventime.

In[2],DNA microarray technology has attracted tremendous interest in both the scientific communityand industry. Generally, microarray expression experiments allow the recording of expression levels




of thousands of genes simultaneously. These experiments primarily consist of either monitoring eachgene multiple times under various conditions

In [3], recently microarray data have brought much attention to bioinformatics research area.

Microarray data are a recording of expression levels of thousands of genes that are measured in

various experimental settings. Microarray data have typically several thousands of genes (features) but only tens of experiments (samples), referred to as a small-sample-sized-problem.

In [6], DNA microarray technology allows simultaneous monitoring and measuring of thousands of gene expression activation levels in a single experiment. This technology is currently used in

medical diagnosis and gene analysis. Many microarray research projects focus on clustering analysisand classification accuracy. In clustering analysis, the purpose of clustering is to analyze the gene

groups that show a correlated pattern of the gene expression data and provide insight into geneinteractions and function. Research on classification accuracy is aimed at building an efficient model

for predicting the class membership of data, produce a correct label on training data, and predict thelabel for any unknown data correctly.

In [8], the analysis of microarray data requires two steps: feature selection and classification. From a

variety of feature selection methods and classifiers, it is difficult to find optimal ensemblescomposed of any feature-classifier pairs. This paper proposes a novel method based on the

evolutionary algorithm (EA) to form sophisticated ensembles of features and classifiers that can beused to obtain high classification performance. In spite of the exponential number of possible

ensembles of individual feature-classifier pairs, an EA can produce the best ensemble in a reasonableamount of time. The chromosome is encoded with real values to decide the weight for each feature-

classifier pair in an ensemble. Experimental results with two well-known microarray datasets interms of time and classification rate indicate that the proposed method produces ensembles that are

superior to individual classifiers, as well as other ensembles optimized by random and greedystrategies.

1.2 What is DNA microarray technology used for?

In[1],Microarray technology will help researchers to learn more about many different diseases,including heart disease, mental illness and infectious diseases, to name only a few. One intense area

of microarray research at the National Institutes of Health (NIH) is the study of cancer. In the past,scientists have classified different types of cancers based on the organs in which the tumors develop.

With the help of microarray technology, however, they will be able to further classify these types of cancers based on the patterns of gene activity in the tumor cells. Researchers will then be able to

design treatment strategies targeted directly to each specific type of cancer. Additionally, by

examining the differences in gene activity between untreated and treated tumor cells - for examplethose that are radiated or oxygen-starved - scientists will understand exactly how different therapiesAffect tumors and be able to develop more effective treatments.




Figure 1.1

Figure 1.2




1.3 How does DNA microarray technology work?

In [1], DNA microarrays are created by robotic machines that arrange minuscule amounts of hundreds or

thousands of gene sequences on a single microscope slide. Researchers have a database of over 3.5million genetic sequences that they can use for this purpose. When a gene is activated, cellular

machinery begins to copy certain segments of that gene. The resulting product is known as messengerRNA (mRNA), which is the body's template for creating proteins. The mRNA produced by the cell is

complementary, and therefore will bind to the original portion of the DNA strand from which it wascopied.

In [1], to determine which genes are turned on and which are turned off in a given cell, a researcher must

first collect the messenger RNA molecules present in that cell. The researcher then labels each mRNAmolecule by using a reverse transcriptase enzyme (RT) that generates a complementary cDNA to the

mRNA. During that process fluorescent nucleotides are attached to the cDNA. The tumor and thenormal samples are labeled with different fluorescent dyes. Next, the researcher places the labeled

cDNAs onto a DNA microarray slide. The labeled cDNAs that represent mRNAs in the cell will then

hybridize ± or bind ± to their synthetic complementary DNAs attached on the microarray slide, leavingits fluorescent tag. A researcher must then use a special scanner to measure the fluorescent intensity foreach spot/areas on the microarray slide.

In [1], if a particular gene is very active, it produces many molecules of messenger RNA, thus, morelabeled cDNAs, which hybridize to the DNA on the microarray slide and generate a very bright

fluorescent area. Genes that are somewhat less active produce fewer mRNAs, thus, less labeled cDNAswhich results in dimmer fluorescent spots. If there is no fluorescence, none of the messenger molecules

have hybridized to the DNA, indicating that the gene is inactive. Researchers frequently use thistechnique to examine the activity of various genes at different times. When co-hybridizing Tumor

samples (Red Dye) and Normal sample (Green dye) together, they will compete for the synthetic

complementary DNAs on the microarray slide. As a result, if the spot is red, this means that that specificgene is more expressed in tumor than in normal (up-regulated in cancer). If a spot is Green, that meansthat gene is more expressed in the Normal tissue (Down regulated in cancer). If a spot is yellow that

means that that specific gene is equally expressed in normal and tumor.




1.4 Microarray types and usage

In [1], many types of array exist and the broadest distinction is whether they are spatially arranged on a

surface or on coded beads:

y The traditional solid-phase array is a collection of orderly microscopic "spots", called

features, each with a specific probe attached to a solid surface, such as glass, plastic orsilicon biochip (commonly known as a genome chip, DNA chip or gene array). Thousands

of them can be placed in known locations on a single DNA microarray.

y The alternative bead array is a collection of microscopic polystyrene beads, each with a

specific probe and a ratio of two or more dyes, which do not interfere with the fluorescent

dyes used on the target sequence.

In [1], DNA microarrays can be used to detect DNA (as in comparative genomic hybridization), ordetect RNA (most commonly as cDNA after reverse transcription) that may or may not be translated

into proteins. The process of measuring gene expression via cDNA is called expression analysis orexpression profiling.




Chapter two: Microarray methods

2.1 Independent Component Analysis (ICA)

In[2],ICA is a useful extension of PCA that has been developed in context with blind separation ofindependent sources from their linear mixtures.7 Such blind separation techniques have been used, for

example, in various applications of auditory signal separation, medical signal processing, and so onRoughly speaking, rather than requiring that the coefficients of a linear expansion of the data vectors be

uncorrelated as in PCA, in ICA these coefficients must be mutually independent (or as independent as possible). This implies that higher order statistics are needed in determining the ICA expansion.

2.2 ICA models of gene expression data

In [2], in this approach, ICA is used to find a matrix W such that the rows of U are as statistically

independent as possible. The independent eigenassays estimated by the rows of U are then used torepresent the snapshots. The representation of the snapshots consists of their corresponding coordinates

with respect to the eigenassays defined by the rows of U, These coordinates are

Figure 2.1

The gene expression data synthesis model. To find a set of independent basis snapshots (eigenassay), the

snapshots in X are considered to be a linear combination of statistically independent basis snapshots(eigenassay, the rows in S), where W is the unmixing matrix and A =Wí1 is an unknown mixing

matrix. The independent eigenassay is estimated as the output U of the learned ICA




2.3 Gene selection

In[2],this method allows to find the individual gene expression profiles that help to discriminate between

two classes by calculating for each gene expression profile gj a score based on the mean 1 j(respectively 2 j) and the standard deviation 1 j(respectively 2 j ) of each class of samples. In this

study, we ranked the genes by their scores and retained a set of the top 500, 1000 and 2000 genes of thetwo data sets for ICA, respectively.

In [3], these three methods will be used for experimental comparisons :

2.3.1 Gene ranking using correlation

Selects the best genes based on the discriminating power of the individual genes

2.3.2 Recursive backward feature elimination by SVM

2.3.3 Forward feature selection by FDAFDA is a statistical dimension reduction method [10]. It finds a linear transformation that optimizes

class separability in the reduced dimensional space. The class separability is measured by using the between-

class and within-class scatters. A linear transformation which maximizes the between-class scatter andminimizes the within-class scatter projects the original data to a one-dimensional space.

2.4 Classifiers

In [2], after processing the gene expression data using t -statistics and ICA, the final step is to classify the

data set. There have been many methods for performing the classification tasks so far, such as radial basis function neural network (RBFNN), 11, 12 radial basis probabilities neural networks,13 logistic

discrimination (LD) and quadratic discriminant analysis (QDA),16 etc. Because the dimension of DNAmicroarray gene expression data is higher even after they are processed by ICA, and there are only few

samples of the data achieved in general, we use support vector machines (SVM),8 which have been proved to be very useful, to classify the gene expression data.

SVM is a relatively new type of machine learning model, originally introduced by Vapnik and co-

workers,24 and successively extended by a number of other researchers. This model, which is ofremarkably robust performance with respect to sparse and noisy data, is becoming a system of choice in

a number of applications from text categorization to protein function prediction.

When used for classification, SVM can separate a given set of binary labeled training data with ahyperplane that is maximally distant from them (the maximal margin hyperplane). For the cases in

which no linear separation is possible, they can work in combination with the technique of ³kernels´,which automatically realizes a nonlinear mapping to a feature space. Generally, the hyperplane founded

by the SVM in feature space corresponds to a nonlinear decision boundary in the original space.




Chapter Three: Data Mining for DNA Microarray

In [1], data mining for DNA microarray is to select discriminative genes related to classification from

gene expression data and train classifier with which classifies new data. After acquiring the gene

expression data calculated from the DNA microarray, prediction system has 2 stages: feature selectionand pattern classification.

In [1], the feature selection can be thought of as the gene selection, which is to get the list of genes that

might be informative for the prediction by statistical and information theoretical methods. Since it ishighly unlikely that all the genes have the information related to cancer and using all the genes results in

too big dimensionality, it is necessary to explore the efficient way to get the best feature. We haveextracted some informative genes using seven methods and the cancer predictor classifies the category

only with these genes.

In [1], given the gene list, a classifier makes decision as to which category the gene pattern belongs at

prediction stage. We have adopted four most widely used classification methods and an ensembleclassifier.

3.1 Gene selection

In [1], among thousands of genes whose expression levels are measured, not all are needed forclassification. Microarray data consist of large number of genes in small samples. We need to select

some genes highly related to particular classes for classification, which are called informative genes.

In [1], using the statistical correlation analysis, we can see the linear relationship and the direction of

relation between two variables. Correlation coefficient r varies from -1 to +1, so that the data distributednear the line biased to (+) direction will have positive coefficients, and the data near the line biased to (-)

direction will have negative coefficients.




3.2 P attern classification

In [1], many algorithms designed for solving classification problems in machine learning have been

applied to recent research of prediction and classification of cancer with gene expression data. Thegeneral process of classification in machine learning is to train classifiers to accurately recognize patterns from given training samples and to classify test samples with the trained classifie

Representative classification algorithms such as multi-layer perception, k-nearest neighbour, supportvector machine, and structure-adaptive self-organizing maps are applied to the classification. [3]The

optimization criterion for feature selection can be independent with the classifier, which works on theselected features, or it can be combined with a classifier in both a selection process and a classification

stage.

In [4], we compare the performance of different discrimination methods for tumour classification

problems based on different feature selection methods. We consider four feature extraction methods and

five classification methods, from which 20 classification models can be derived. Each classificationmodel is a combination of one feature extraction method and one classification method. The featureextraction methods are t -statistics, non-parametric Wilcoxon statistics, ad hoc signal-to-noise statistics,

and principal component analysis (PCA), and the classification methods are Fisher linear discriminateanalysis (FLDA), the support vector machine (SVM), the k nearest-neighbour classifier (kNN), diagonal

linear discriminant analysis (DLDA), and diagonal quadratic discriminate analysis (DQDA). Theselection of the parameters used in these feature extraction and classification methods is supervised.

These discrimination methods are then applied to three well-known publicly available microarraydatasets: acute leukaemia data , prostate cancer data , and lung cancer data.

Classification methods

Multilayer P erception (ML P )

In [1], a feed-forward multilayer perception (MLP) is error back propagation neural network that is

applied to many fields due to its powerful and stable learning algorithm. The neural network learns thetraining examples by adjusting the synaptic weight of neurons according to the error occurred on the

output layer. The power of the back propagation algorithm lies in two main aspects: local for updatingthe synaptic weights and biases, and efficient for computing all the partial derivatives of the cost

function with respect to these free parameters.

K-Nearest-Neighbor (KNN)

In [1], k-nearest neighbor (KNN) is one of the most common methods among memory based induction

Given an input vector, KNN extracts k closest vectors in the reference set based on similarity measures,




and makes decision for the input vector label using the labels of the k nearest neighbors. In [4] The knearest-neighbour rule proposed by Fix and Hodges , which uses distance function for pairs of tumour

mRNA samples, classifies test set observations based on the training set. For a given sample in the testset, kNN first finds the k closest tumour samples from the training set and then predicts its class by

voting, i.e. by choosing the class that is most common among the k neighbours .

Support vector machine (SVM)

In [1], support vector machine (SVM) estimates the function classifying the data into two classes. SVM

builds up a hyperplane as the decision surface in such a way as to maximize the margin of separation between positive and negative examples. SVM achieves this by the structural risk minimizatio

principle that the error rate of a learning machine on the test data is bounded by the sum of the training-error rate and a term that depends on the Vapnik-Chervonenkis (VC) dimension. In [4] SVM finds the

best hyperplane separating the two classes in the training set. The best hyperplane is generally found by

maximizing the sum of the distances from the hyperplane to the closest positive and negative correctlyclassified observations, while penalizing for thenumber of misclassifications .

In [6], support Vector Machine (SVM) classification algorithm to evaluate the selected features, and toestablish the influence on classification accuracy. The results indicate that in terms of the number of

genes that need to be selected and classification accuracy the proposed method is superior to other methods in the literature.

Ensemble classifier

In [1], Classification can be defined as the process to approximate I/O mapping from the given

observation to the optimal solution. Generally, classification tasks consist of two parts: feature selectionand classification. Feature selection is a transformation process of observations to obtain the best

pathway to get to the optimal solution. Therefore, considering multiple features encourages obtainingvarious candidate solutions so that we can estimate a more accurate solution to the optimal than any

other local optima.

In [1], when we have multiple features available, it is important to know which features should be used.Theoretically, for many features concerned, it may be more effective for the classifier to solve the

problems. But features that are overlapped in feature spaces may cause the redundancy of irrelevantinformation and result in the counter effect such as over fitting. Therefore, it is more important to

explore and utilize independent features to train classifiers, rather than increase the number of featureswe use. Correlation between feature sets can be induced from the distribution of features.




Chapter Four: Classification Algorithms

In [1], there are many algorithms for the classification from machine learning approach, but none of them is perfect. Moreover, it is always difficult to decide what to use and how to set up its parameters.

According to the environments when the classifier is embedded, some algorithms work well and othersdo not. This is because the classifier searches in different solution space depending on the algorithms,

features and parameters used. These sets of classifiers produce their own outputs; therefore the ensembleclassifier can explore a wider solution space.

Wavelet Approach

In [5], A wavelet transform is a lossless linear transformation of a signal or data into coefficients on the basis of wavelet functions.5 ,6 The coefficients yielded by wavelet transform contain information about

characteristics of the data at different scales. Fine scales capture local details of coefficients and coarsescales capture global features of a signal. Performing the discrete wavelet transform (DWT) of a signal x

is done by passing it through low pass filters (scaling functions) and high pass filter

Figure 3.1

Wavelet power spectrum is a graphical representation having a cumulative information variationmeasures at each decomposition level as data points. The global information variation of gene

expression in a given sample can thus be consolidated by plotting wavelet power spectrum. This featureof a power spectrum may be useful in identifying the characteristics of a microarray data. In addition to

providing a visualization of features of a microarray data, the global strength of gene expression in a




given sample can be consolidated by plotting wavelet power spectrum and it may reveal many otherhidden structures of the data.

In [7], accurate classification of microarray data plays a vital role in cancer prediction and diagnosis.

Previous studies have demonstrated the usefulness of naïve Bayes classifier in solving various

classification problems. In microarray data analysis, however, the conditional independence assumptionembedded in the classifier itself and the characteristics of microarray data, e.g. the extremely highdimensionality, may severely affect the classification performance of naïve Bayes classifier. This paper

presents a sequential feature extraction approach for naïve Bayes classification of microarray data. The proposed approach consists of feature selection by stepwise regression and feature transformation by

class-conditional independent component analysis. Experimental results on five microarray datasetsdemonstrate the effectiveness of the proposed approach in improving the performance of naïve Bayes

classifier in microarray data analysis.

Figure 3.2



16 Microarray time series classification

In [1], we have applied this idea to a classification framework as shown. Given k features

and n classifiers, there is k x n feature-classifier combinations. There are k x n Cm possible ensemble classifiers when m feature-classifier combinations are selected for

ensemble classifier. Classifiers are trained using the features selected, and finally acombining module is accompanied to combine their outputs. After classifiers are trained

independently with some features to produce their own outputs, the final answer will be judged by a combining module, where the majority voting, weighted voting, or Bayesian

combination can be adopted.

y Majority Voting: It is a simple ensemble method that selects the classmost favored by base classifiers. Majority voting has some advantages that

it does not require any previous knowledge nor does it require anyadditional complex computation to decide.

y Weighted Voting: Poor classifier can affect the result of the ensemble inmajority voting because it gives the same weight to all classifiers.Weighted voting reduces the effect of poor classifier by giving a different

weight to a classifier based on the performance of each classifier.

y Bayesian Combination: While majority voting method combines classifiers with their results, Bayesian combination makes the error possibility of each classifier affect the final

result. The method combines classifier with different weight by using the previous

knowledge of each classifier.




Chapter Five: Conclusion

In this survey, we introduce DNA microarray technology and how to use

these technologies. We introduce microarray types and methods. We

introduce the benefits of data mining in microarray time series classification

by using gene selection and pattern classifier such that K-Nearest Neighbor

and Support Vector Machine

Chapter Six: Future Work

In the future, we plan to use the microarray technology for treatment of

cancer diseases as hemophilia, brain tumors and lung cancer




Chapter Seven: References

[1]SUNG-BAE CHO_ and HONG-HEE WONy,´ DATA MINING FOR GENEEXPRESSION PR OFILES FR OM DNA MICR OARRAY´, International Journal of

Software Engineering, 134 Shinchon-dong, Sudaemoon-ku, Seoul 120-749, Korea, 593-608, 2003

[2] YAN CHEN and XIU-XIA LI, YI-XUE LI, YUN-PING ZHU, CHUN-HOUZHENG, ³TUMOR CLASSIFICATION BASED ON INDEPENDENT

COMPONENT ANALYSIS´, International Journal of Pattern Recognition andArtificial Intelligence, China, pp. 297±310 , November 02, 2006

[3] C. H. PARK*, M. JEONÁ, P. PARDALOS§ and H. PARK,´ Quality assessment of

gene selection in microarray data´, Optimization Methods and Software, 220 Gung-dong,Yuseong-gu, Daejeon, 305-764, Korea, Atlantic Drive, Atlanta, GA, 30332,USA,pp. 145±154, 1055-6788, February 2007

[4] JING ZHANG, TIANZI JIANG*, BING LIU, XINGPENG JIANG and HUIZHIZHAO, ³Systematic benchmarking of microarray data feature extraction and

classification´, International Journal of Computer Mathematics , pp 803±811, May 2008

[5] S. PRABAKARAN¤ , R. SAHU and S. VERMA, " A WAVELET APPR OACH

FOR CLASSIFICATION OF MICR OARRAY DATA´, International Journal of Wavelets, Multiresolution and Information Processing, pp. 375±389, May 2008.

[6] Cheng-San Yang, Li-Yeh Chuang, Chao-Hsuan Ke, and Cheng-Hong Yang, Member ,

IAENG³A Hybrid Feature Selection Method for Microarray Classification´. IAENG

International Journal of Computer Science, pp 285-290, 2008.

[7] Fan, Liwei, Zhou, Peng ³A sequential feature extraction approach for naïve bayes

classification of microarray data´. Article, pp 9919-9923., Aug2009.

[8] Sung-Bae, Kyung-Joong, "An Evolutionary Algorithm Approach to Optimal

Ensemble Classifiers for DNA Microarray Data Analysis." IEEE Transactions onEvolutionary Computation, pp. 377-388, Jun 2008.




Documents

Microarray time series