

A novel gene-centric clustering algorithm for standardization of time series expression data

Elena Tsiporkova and Veselka Boeva

Abstract: This paper proposes a novel data transformation method aiming at multi-purpose data standardization and inspired by gene-centric clustering approaches. The idea is to perform data standardization via template matching of each expression profile with the rest of the expression profiles, employing the Dynamic Time Warping (DTW) alignment algorithm to measure the similarity between the expression profiles. This algorithm facilitates the identification of a cluster of genes whose expression profiles are related, possibly with a non-linear time shift, to the profile of the gene supplied as a template. Consequently, for each gene profile a varying number (based on the degree of similarity) of neighboring gene profiles is identified to be used in the subsequent standardization phase. The latter uses a recursive aggregation algorithm in order to reduce the set of neighboring expression profiles into a single profile representing the standardized version of the profile in question. The proposed data transformation method is evaluated and demonstrated on gene expression time series data coming from a study examining the global cell-cycle control of gene expression in fission yeast Schizosaccharomyces pombe.

Index Terms: Data Standardization, DTW Distance, RRN Algorithm, Time Series Data.

I. Introduction

The modern microarray technologies allow the measurement of the expression of almost an entire genome simultaneously. This quantitative advantage comes with high costs and complexity, and a whole arsenal of methodologies such as background correction, normalization and summarization (in short, preprocessing) is required in order to make inter-array analysis and comparison possible. And this is far from the whole story! The preprocessing phase needs to be followed by some sort of data transformation. The latter is necessary for enabling a fair comparison of the amount of variation present in the different gene expression profiles. The difference in the natural gene variability is usually tackled via various normalization and standardization techniques.

E. Tsiporkova is with the Innovation Center - East Flanders, House of Economy, Seminariestraat 2, B-9000 Ghent, Belgium (email: [email protected])

V. Boeva is with the Computer Systems and Technologies Department, Technical University of Sofia-branch Plovdiv, Tsanko Dyustabanov 25, 4000 Plovdiv, Bulgaria (email: vboeva@tu-plovdiv.bg)

The choice of a particular data transformation method is determined by the type of data analysis study to be performed. The log2-transformation involves normalization of the expression profile of each gene by the expression value at time point 0 and subsequently taking log2 of the ratios. This transformation may be essential for performing between-experiment or between-species comparison of gene expression time series. The application of different statistical methods, as for instance regression analysis or permutation tests, is also usually preceded by log2-transformation. However, a high percentage of the time expression values at time zero may be affected by various stress response phenomena associated with the particular treatment (e.g. synchronization method) or experimental conditions. Therefore the choice of the first measurement as a reference expression value bears the danger of creating a distorted perception of the gene expression behavior during the whole time of sampling. Another alternative, applied almost as a rule before performing clustering, template matching or alignment, is to standardize the expression profiles via z-transformation. The expression profile of each gene is adjusted by subtracting the profile mean and dividing by the profile standard deviation. The z-transformation can be relevant when the general shape rather than the individual gene expression amplitudes at the different time points is important. However, this transformation needs to be used with caution, bearing in mind that the expression levels of low expressed genes will be amplified by it.
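The two classical transformations discussed above can be illustrated with a minimal sketch. The function names and the toy data are hypothetical and chosen only for illustration; the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator is meant.

```python
import numpy as np

def log2_ratio_transform(profiles):
    """Divide each profile by its value at time point 0, then take log2."""
    profiles = np.asarray(profiles, dtype=float)
    return np.log2(profiles / profiles[:, [0]])

def z_transform(profiles):
    """Subtract each profile's mean and divide by its standard deviation."""
    profiles = np.asarray(profiles, dtype=float)
    mean = profiles.mean(axis=1, keepdims=True)
    std = profiles.std(axis=1, keepdims=True)  # assumption: population std
    return (profiles - mean) / std

# Hypothetical toy data: a highly expressed gene and a low expressed gene
# that share the same shape but differ in amplitude by a factor of 100.
toy = np.array([[1.0, 2.0, 4.0, 2.0],
                [0.01, 0.02, 0.04, 0.02]])

log_ratios = log2_ratio_transform(toy)
z_scores = z_transform(toy)
```

In this toy case both transformations map the two genes to the same profile, which shows why the z-transformation preserves shape while discarding amplitude, and why low expressed genes end up on the same scale as highly expressed ones.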

The quality of microarray data is also affected by many other experimental artifacts, as for instance the occurrence of peak shifts due to loss of synchrony, a poor signal-to-noise ratio for a set of sampling times resulting in partially fluctuating profiles, etc. Unfortunately, there is no universal data transformation method that offers adequate corrections for all of these. We propose here a novel data transformation method aiming at multi-purpose data standardization and inspired by gene-centric clustering approaches. The idea is to perform data standardization via template matching of each expression profile with the rest of the expression profiles, employing the Dynamic Time Warping (DTW) alignment algorithm to measure the similarity between the expression profiles. Template matching is usually employed in studies requiring gene-centric approaches since it allows mining gene expression time series for patterns that best fit a template expression profile. The DTW algorithm aims at aligning two sequences of feature vectors by warping the time axis iteratively until an optimal match (according to a suitable metric) between the two sequences is found. It facilitates the identification of a cluster of genes whose expression profiles are related, possibly with a non-linear time shift, to the profile of the gene supplied as a template. Consequently, for each gene profile a varying number (based on the degree of similarity) of neighboring gene profiles is identified to be used in the subsequent standardization phase. The latter uses a recursive aggregation algorithm in order to reduce the set of neighboring expression profiles into a single profile representing the standardized version of the profile in question.

2008 4th International IEEE Conference "Intelligent Systems"

978-1-4244-1739-1/08/$25.00 © 2008 IEEE

II. Methods

A. Data

The proposed standardization algorithm is evaluated and demonstrated on a microarray dataset coming from a study examining the global cell-cycle control of gene expression in fission yeast Schizosaccharomyces pombe [1]. The whole study includes 8 independent time-course experiments synchronized respectively by elutriation, cdc25 block-release and a combination of both methods. We have used in our tests the elutriation1 experiment, which contains expression levels of 5120 genes measured at 20 time points.

B. P-value calculation

For the test microarray dataset (see above), p-values for regulation have been calculated as described in [2]. Namely, the p-value for regulation of a particular gene results from the comparison of the gene expression variance with a randomly generated variance distribution, constructed by selecting at each time point the expression value of a randomly chosen gene. The artificial expression profiles are then mixtures of expression values corresponding to different genes with possibly very different variation amplitudes. The latter implies that some artificial profiles will be assigned very high variance values, which will lead in general to the computation of relatively high p-values for regulation. Therefore, in order to have a realistic estimate of the amount of variation present in the original dataset with respect to a randomly generated variance distribution, each variance calculation was preceded by transforming the original and artificial expression profiles to log2-ratios relative to the expression values at time zero. The p-value for regulation is calculated as the fraction of artificial profiles with a variance equal to or greater than the variance of the real expression profile.
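The permutation scheme described above might be sketched as follows. This is only an illustration under stated assumptions: the function name, the number of artificial profiles (`n_artificial`), the random seed and the toy data are hypothetical, and the expression values are assumed to be strictly positive so that the log2-ratios are defined.

```python
import numpy as np

def regulation_p_values(expr, n_artificial=1000, seed=0):
    """P-value per gene: fraction of artificial profiles whose variance
    is equal to or greater than the variance of the real profile."""
    rng = np.random.default_rng(seed)
    expr = np.asarray(expr, dtype=float)
    m, n = expr.shape

    # Artificial profiles: at each time point take the expression value
    # of a randomly chosen gene.
    rows = rng.integers(0, m, size=(n_artificial, n))
    artificial = expr[rows, np.arange(n)]

    def log2_ratios(x):
        # log2-ratios relative to the expression value at time zero
        return np.log2(x / x[:, [0]])

    real_var = log2_ratios(expr).var(axis=1)
    art_var = log2_ratios(artificial).var(axis=1)
    return (art_var[None, :] >= real_var[:, None]).mean(axis=1)

# Hypothetical toy data: 50 nearly flat genes plus one oscillating gene.
rng = np.random.default_rng(1)
flat = np.exp(rng.normal(0.0, 0.05, size=(50, 10)))
oscillating = np.array([[1.0, 4.0] * 5])
expr = np.vstack([flat, oscillating])

p_values = regulation_p_values(expr)
```

In this toy setting the oscillating gene (last row) receives a much smaller p-value than the flat genes, because hardly any random mixture of values reaches its variance.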

C. Dynamic time warping algorithm

The Dynamic Time Warping (DTW) alignment algorithm aims at aligning two sequences of feature vectors by warping the time axis iteratively until an optimal match (according to a suitable metric) between the two sequences is found. It was developed originally for speech recognition applications [3]. Due to its flexibility, DTW has been widely used in many scientific disciplines including several computational biology studies [4], [5], [6]. A detailed explanation of the DTW algorithm can be found in [3], [6], [7]. Therefore the description following below is restricted to the important steps of the algorithm.

Two sequences of feature vectors $A = [a_1, a_2, \ldots, a_n]$ and $B = [b_1, b_2, \ldots, b_m]$ can be aligned against each other by arranging them on the sides of a grid, e.g. one on the top and the other on the left hand side. Then a distance measure, comparing the corresponding elements of the two sequences, can be placed inside each cell. To find the best match or alignment between these two sequences one needs to find a path through the grid $P = (1, 1), \ldots, (i_s, j_s), \ldots, (n, m)$, ($1 \le i_s \le n$ and $1 \le j_s \le m$), which minimizes the total distance between $A$ and $B$. Thus the procedure for finding the best alignment between $A$ and $B$ involves finding all possible routes through the grid and computing for each one the overall distance, which is defined as the sum of the distances between the individual elements on the warping path. Consequently, the final DTW distance between $A$ and $B$ is the minimum overall distance over all possible warping paths:

$$dtw(A, B) = \frac{1}{n + m} \min_{P} \left( \sum_{s=1}^{k} dist(i_s, j_s) \right).$$

It is apparent that for any pair of considerably long sequences the number of possible paths through the grid will be very large. However, the power of the DTW algorithm resides in the fact that instead of finding all possible routes through the grid, the DTW algorithm makes use of dynamic programming and works by keeping track of the cost of the best path at each point in the grid.
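A minimal dynamic programming sketch of the normalized DTW distance defined above is given here. It rests on assumptions the text does not fix: the sequences are scalar time series, the local distance is the absolute difference, and a warping path may advance by one step in either sequence or in both. The function name is hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Normalized DTW distance: minimal accumulated cost divided by n + m."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # cost[i, j] holds the cost of the best path aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumption: |a_i - b_j|
            best_previous = min(cost[i - 1, j],      # advance in a only
                                cost[i, j - 1],      # advance in b only
                                cost[i - 1, j - 1])  # advance in both
            cost[i, j] = local + best_previous
    return cost[n, m] / (n + m)

# The same peak, shifted by one time point.
a = [0, 1, 2, 1, 0, 0]
b = [0, 0, 1, 2, 1, 0]

pointwise = float(np.sum(np.abs(np.array(a) - np.array(b))))
warped = dtw_distance(a, b)
```

For this pair the point-by-point comparison reports a total difference of 4, whereas the warped alignment finds a zero-cost path, which is the behaviour that makes DTW suitable for profiles related by a time shift.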

D. R-Radius Neighbours algorithm

We propose here a standardization method for time series data which employs Dynamic Time Warping (DTW) distance [3] to measure the similarity between the expression profiles. For each gene profile a varying number (based on the degree of DTW similarity) of neighboring gene profiles is identified to be used in the subsequent standardization phase.

A dedicated algorithm, called the R-Radius Neighbours (RRN) algorithm, has been developed for the purpose of generating an estimation list for each gene profile. Such an estimation list consists of genes with expression profiles which exhibit at least a minimum (predefined) similarity in terms of some distance measure (in our case DTW distance) to the expression profile of the gene in question. These profiles are consequently used for data standardization. The motivation behind this approach is that the expression values of each profile will be standardized by adjusting them relative to the expression profiles in the same microarray dataset which appear to be closely related to the target profile.

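Assume that a matrix $G$ of $m \times n$ ($m \gg n$) contains the expression values of $m$ genes measured at $n$ time points,

$$G = \begin{bmatrix} g_1 \\ \vdots \\ g_m \end{bmatrix} = \begin{bmatrix} g_{11} & \cdots & g_{1n} \\ \vdots & & \vdots \\ g_{m1} & \cdots & g_{mn} \end{bmatrix},$$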

where the row (vector) $g_i = [g_{i1}, \ldots, g_{in}]$ represents the time expression profile of the $i$-th gene. In order to standardize the values at any location of gene $i$, a set of genes all at a maximum DTW distance $R$ ($R$ is predefined) from the profile $g_i$ needs to be identified. In this process, all gene profiles are considered and a gene estimation list $E_i$, which is further used to standardize the values of gene $i$, is constructed. Let us formally define the algorithm that builds this gene estimation list. Consider a radius $R \in (0, 100)$ and an expression profile $g_i = [g_{i1}, \ldots, g_{in}]$. Construct an initial gene estimation list as $E_i = \{\text{gene } i\}$ and iterate as follows:

For $j = 1, \ldots, m$ and $j \neq i$:
(1) Calculate the DTW distance $dtw_{ij}$ between gene $i$ and gene $j \in G$.
(2) If $dtw_{ij} < R$ then add gene $j$ to the gene estimation list of gene $i$, i.e. $E_i = E_i \cup \{j\}$.
(3) Return to the first step to calculate the DTW distance between gene $i$ and the next gene in $G$.

The final estimation list $E_i = \{\text{genes } k_j \mid dtw_{ik_j} < R\}$ (see footnote 1) for gene $i$ contains only genes at a maximum $R$-radius DTW distance from gene $i$. Let $m = \#E_i$; from here on, $m$ denotes the size of the estimation list rather than the number of genes in $G$.

The values of the gene profiles in the estimation list $E_i$ can then be used to standardize the values of the expression profile $g_i$. The contribution of each gene $k_j \in E_i$ ($j = 1, \ldots, m$) is weighted by the degree of similarity of its expression profile to the expression profile of gene $i$. Thus each gene $k_j \in E_i$ is assigned a weight $w_{k_j}$:

$$w_{k_j} = \left(1 - \frac{dtw_{ik_j}}{\sum_{l=1}^{m} dtw_{ik_l}}\right) \Big/ (m - 1). \qquad (1)$$

It can easily be checked that $\sum_{j=1}^{m} w_{k_j} = 1$. Moreover, $dtw_{ik_p} < dtw_{ik_q}$ implies $w_{k_p} > w_{k_q}$, i.e. expression profiles closely matching the pattern of gene $i$ will always have a greater contribution to the standardized values than expression profiles which match the profile of gene $i$ to a lower extent.

Footnote 1: The use of double indexing $k_j$ for the elements of the estimation list $E_i$ is necessary since the gene order in the original data set is different from the one in the estimation list. Thus $k_j$ is the gene index in the original expression matrix $G$, i.e. $k_j$ refers to the gene expression profile, while $j$ merely refers to the gene position in $E_i$.

Fig. 1: Recursive aggregation algorithm.

The profile of gene $i$ will always be assigned the highest possible weight $w_i = 1/(m - 1)$ due to $dtw_{ii} = 0$. In case the estimation list of $i$ contains only one other profile besides the profile of $i$, i.e. $m = 2$, then $w_i$ will be 1. The latter implies that the second profile will not be taken into account during the standardization procedure and the profile of $i$ will remain unchanged. Only one other matching profile is therefore not sufficient to enforce data transformation, since the profile of $i$ is then considered rather unique. A minimum of 2 other profiles need to match the profile of $i$ closely in order to subject it to standardization, and $w_i = 1/2$ will still be relatively high. The degree to which the closely matching profiles of $i$ will contribute to its standardization is thus determined by the size of the estimation list.
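The construction of the estimation list and of the weights in equation (1) can be sketched as follows. The function names and toy profiles are hypothetical, the DTW helper uses the absolute difference as local distance (an assumption), and the handling of the two degenerate cases that equation (1) leaves undefined (a list containing only gene $i$, or all distances equal to zero) is likewise an assumption made for this sketch.

```python
import numpy as np

def dtw_distance(a, b):
    """Normalized DTW distance with |a_i - b_j| as local distance (assumption)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            cost[i, j] = abs(a[i - 1] - b[j - 1]) + step
    return cost[n, m] / (n + m)

def rrn_estimation_list(G, i, R):
    """Return the gene indices k_j in E_i and their DTW distances to gene i."""
    members, dists = [i], [0.0]          # E_i starts with gene i itself
    for j in range(len(G)):
        if j == i:
            continue
        d = dtw_distance(G[i], G[j])
        if d < R:                        # step (2): keep genes within radius R
            members.append(j)
            dists.append(d)
    return members, np.array(dists)

def rrn_weights(dists):
    """Weights of equation (1) for the members of an estimation list."""
    size, total = len(dists), dists.sum()
    if size == 1:                        # assumption: gene i alone gets weight 1
        return np.array([1.0])
    if total == 0.0:                     # assumption: identical profiles, uniform
        return np.full(size, 1.0 / size)
    return (1.0 - dists / total) / (size - 1)

# Hypothetical toy profiles: gene 0, two close variants and one distant gene.
G = np.array([[0.0, 1.0, 2.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0, 1.0, 0.5],
              [5.0, 5.0, 5.0, 5.0, 5.0]])

members, dists = rrn_estimation_list(G, i=0, R=0.2)
weights = rrn_weights(dists)
```

Here gene 0 keeps the weight $1/(m-1) = 1/2$, the closer neighbour (gene 2) gets $1/3$, the more distant one (gene 1) gets $1/6$, and the distant gene 3 is excluded. Calling `rrn_weights` on a two-element list returns the weights 1 and 0, matching the $m = 2$ case discussed above.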

E. A recursive aggregation algorithm

We discuss herein a recursive aggregation algorithm aiming at reducing a given data matrix (or a set of data vectors) into a single vector. This algorithm will be applied to obtain the standardized expression profile of a gene by aggregating the expression profiles of the genes in its estimation list (see Section II-D). The considered aggregation model has been inspired by our work on non-parametric recursive aggregation, where a set of aggregation operators is applied initially over a vector of input values, and then again over the result of the aggregation, and so on until a certain stop condition is met [8]. This process defines an aggregation operator that uses a vector of aggregation operators and acts as a trade-off between their conflicting behaviour.

Assume that we are supposed to aggregate the expression values of $m$ genes measured at $n$ time points. Thus a vector

$$t_j = \begin{bmatrix} g_{1j} \\ \vdots \\ g_{mj} \end{bmatrix}$$

of $m$ expression values (one per gene) will be associated with each time point $j$ ($j = 1, \ldots, n$) and consequently, a matrix of $m \times n$ values is formed as a whole, $G = [t_1, \ldots, t_n]$. Additionally, each gene (row vector of expression values) $i$ is associated with a weight $w_i$, expressing the relative degree of importance (contribution) assigned to profile $i$ in the aggregation process. Thus a vector $w = [w_1, \ldots, w_m]$, where $\sum_{i=1}^{m} w_i = 1$ and $w_i \in [0, 1]$ for $i = 1, \ldots, m$, is given.

The ultimate goal of the aggregation algorithm is to transform the above matrix into a single vector $g = [g_1, \ldots, g_n]$, consisting of one (overall) value per time point. Thus each $g_j$ ($j = 1, \ldots, n$) can be interpreted as the trade-off value, agreed between the different genes, for the expression value of the time point in question. Naturally, the aggregated values are expected to take into account, in a suitable fashion, all the individual input values of vectors $t_j$ ($j = 1, \ldots, n$). The choice of aggregation operator is therefore crucial. Some aggregation operators can lead to a significant loss of information since their values can be greatly influenced by extreme scores (arithmetic mean), while others penalize too much for low-scoring outliers (geometric and harmonic means). A possible and quite straightforward solution to the described problem is to use different aggregation operators in order to find some trade-off between their conflicting behavior. In this way, different aspects of the input values will be taken into account during the aggregation process.

We suggest applying a hybrid aggregation process, developed in [8] (see also [9], [10]), by employing a set of $k$ aggregation operators $A_1, \ldots, A_k$. The values of the matrix $G$ can initially be combined in parallel with the weighted versions of these $k$ different aggregation operators. Consequently, a new matrix $G^{(0)}$ of $n$ column vectors, i.e. $G^{(0)} = [t_1^{(0)}, \ldots, t_n^{(0)}]$, is generated as follows:

$$t_j^{(0)} = \begin{bmatrix} A_1(w, t_j) \\ \vdots \\ A_k(w, t_j) \end{bmatrix}.$$

Thus a new vector of $k$ values (one per aggregation operator) is produced for each time point $j = 1, \ldots, n$ by aggregating the expression values of vector $t_j$ (see Fig. 1). The new matrix can be aggregated again, generating a matrix $G^{(1)} = [t_1^{(1)}, \ldots, t_n^{(1)}]$, where

$$t_j^{(1)} = \begin{bmatrix} A_1(t_j^{(0)}) \\ \vdots \\ A_k(t_j^{(0)}) \end{bmatrix}.$$

In this fashion, each step is modeled via $k$ parallel aggregations applied over the results of the previous step, i.e. at step $q$ ($q = 1, 2, \ldots$) a matrix $G^{(q)} = [t_1^{(q)}, \ldots, t_n^{(q)}]$ is obtained and

$$t_j^{(q)} = \begin{bmatrix} A_1(t_j^{(q-1)}) \\ \vdots \\ A_k(t_j^{(q-1)}) \end{bmatrix},$$

for $j = 1, \ldots, n$.

Thus the final result is obtained after passing through a few layers of aggregation. At the first layer, we have the list of initial values that are to be combined. Using a vector of weighted aggregation operators, new values are obtained, and the next step is to combine these new values again using the nonparametric versions (uniform distribution of the weights) of the given aggregation operators. This process needs to be repeated until the differences between the values produced for each time point, i.e. between the maximum and minimum values in each column of the currently calculated matrix, are small enough to stop further aggregation.

In [8], [9], we have shown that any recursive aggregation process, defined via a set of continuous and strict-compensatory aggregation operators, following the algorithm described herein is convergent. For instance, any weighted mean operator with non-zero weights is continuous and strict-compensatory [12]. Thus, if $w_1, w_2, \ldots, w_n$ are positive real numbers such that $\sum_{i=1}^{n} w_i = 1$, then the weighted arithmetic $M_w = \sum_{i=1}^{n} w_i x_i$, the weighted geometric $G_w = \prod_{i=1}^{n} x_i^{w_i}$ and the weighted harmonic $H_w = 1 / \left(\sum_{i=1}^{n} w_i / x_i\right)$ means are continuous and strict-compensatory. We have shown in [8] that a recursive aggregation process, defined via a combination of the above means is, in fact, an aggregation mean operator that compensates between the conflicting properties of the different mean operators.
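The recursive aggregation of this section, instantiated with the three weighted means defined above, might look like the following sketch. The function names, the tolerance `tol`, the iteration cap `max_iter` and the toy input are hypothetical; the input values are assumed to be strictly positive, since the geometric and harmonic means are not meaningful otherwise.

```python
import numpy as np

def weighted_means(x, w):
    """Weighted arithmetic, geometric and harmonic means of positive values x."""
    arithmetic = np.sum(w * x)
    geometric = np.prod(x ** w)
    harmonic = 1.0 / np.sum(w / x)
    return np.array([arithmetic, geometric, harmonic])

def recursive_aggregate(profiles, weights, tol=1e-9, max_iter=1000):
    """Reduce an (m x n) matrix of profiles to a single profile of length n."""
    profiles = np.asarray(profiles, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Layer 0: weighted operators applied to every time-point vector t_j.
    layer = np.column_stack([weighted_means(t_j, weights) for t_j in profiles.T])

    # Later layers: the same operators with uniform (nonparametric) weights.
    uniform = np.full(layer.shape[0], 1.0 / layer.shape[0])
    for _ in range(max_iter):
        spread = layer.max(axis=0) - layer.min(axis=0)
        if np.all(spread < tol):         # stop: every column has converged
            break
        layer = np.column_stack([weighted_means(t_j, uniform) for t_j in layer.T])
    return layer.mean(axis=0)

# Hypothetical toy input: two profiles over two time points, equal weights.
result = recursive_aggregate([[1.0, 2.0],
                              [4.0, 2.0]], [0.5, 0.5])
```

For the first time point the aggregated value settles between the harmonic mean (1.6) and the arithmetic mean (2.5) of the inputs 1 and 4, and for the second time point, where both inputs equal 2, it stays at 2.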

III. Data standardization procedure

Assume that a particular biological phenomenon is monitored in a high-throughput experiment. This experiment is supposed to measure the gene expression levels of $m$ genes at $n$ different time points, i.e. an $m \times n$ matrix will be produced. Assume that p-values are estimated in some way for each gene in the above expression matrix and genes with a very high p-value are filtered out from the matrix. Then for each gene $i$ ($i = 1, \ldots, m$; see footnote 2) of the obtained expression matrix the following two distinct steps will be performed:

1. Selection of estimation genes. Construct a gene estimation list $E_i$ for gene $i$ by applying the RRN algorithm described in Section II-D. The values of the gene profiles in the constructed estimation list $E_i$ will further be used to standardize the values of the expression profile $g_i$. The contribution of each gene in $E_i$ is weighted by the degree of similarity (in terms of DTW distance) of its expression profile to the expression profile of gene $i$. Thus each estimation gene is associated with a weight calculated by equation (1).

2. Calculation of standardized expression profile. The standardized expression profile $g_i = [g_{i1}, \ldots, g_{in}]$ of gene $i$ is obtained by applying the recursive aggregation algorithm, described in Section II-E, to the gene expression values of its estimation list $E_i$. Thus the expression profiles included in the estimation list $E_i$ are initially combined in parallel with $k$ different weighted aggregation operators. In this way $k$ new expression profiles (one per aggregation operator) are produced and these new profiles are aggregated again, this time with the nonparametric versions of the given aggregation operators. The latter process is repeated until, for each time point, the difference between the aggregated values is small enough to stop further aggregation.

Footnote 2: For simplicity, the number of genes in the newly obtained matrix will again be referred to as $m$.

Fig. 2: Cumulative distribution of the number of neighboring profiles within DTW distance of 0.1.
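The two steps of the standardization procedure can be combined into one compact end-to-end sketch. All names, the tolerance and the toy matrix are hypothetical. The sketch assumes strictly positive expression values, the absolute difference as local DTW distance, and the arithmetic, geometric and harmonic means as the $k = 3$ aggregation operators. Profiles with fewer than two neighbours are left unchanged, which agrees with the weights of equation (1) for $m = 2$.

```python
import numpy as np

def dtw(a, b):
    """Normalized DTW distance with |a_i - b_j| as local distance (assumption)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            cost[i, j] = abs(a[i - 1] - b[j - 1]) + step
    return cost[n, m] / (n + m)

def three_means(x, w):
    """Weighted arithmetic, geometric and harmonic means of positive values."""
    return np.array([np.sum(w * x), np.prod(x ** w), 1.0 / np.sum(w / x)])

def aggregate(profiles, w, tol=1e-9, max_iter=1000):
    """Recursive aggregation of the rows of `profiles` into one profile."""
    layer = np.column_stack([three_means(col, w) for col in profiles.T])
    uniform = np.full(3, 1.0 / 3.0)
    for _ in range(max_iter):
        if np.all(layer.max(axis=0) - layer.min(axis=0) < tol):
            break
        layer = np.column_stack([three_means(col, uniform) for col in layer.T])
    return layer.mean(axis=0)

def standardize(G, R):
    """Standardize every profile (row) of G using its R-radius neighbours."""
    G = np.asarray(G, dtype=float)
    out = G.copy()
    for i in range(len(G)):
        # Step 1: estimation list E_i (gene i is included since dtw_ii = 0 < R).
        d = np.array([dtw(G[i], G[j]) for j in range(len(G))])
        members = np.flatnonzero(d < R)
        dist = d[members]
        if len(members) < 3 or dist.sum() == 0.0:
            continue  # too few (or identical) neighbours: profile unchanged
        w = (1.0 - dist / dist.sum()) / (len(members) - 1)  # equation (1)
        # Step 2: recursive aggregation of the profiles in E_i.
        out[i] = aggregate(G[members], w)
    return out

# Hypothetical toy matrix: three related profiles and one unrelated profile.
G = np.array([[1.0, 2.0, 3.0, 2.0, 1.0],
              [1.0, 2.0, 3.0, 2.0, 1.5],
              [1.0, 1.0, 2.0, 3.0, 2.0],
              [6.0, 6.0, 6.0, 6.0, 6.0]])
out = standardize(G, R=0.2)
```

In this toy run the first three profiles are standardized against each other, while the unrelated fourth profile has no neighbours within the radius and is returned unchanged.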

IV. Results and Discussion

The normalized data for the elutriation1 experiment (see Section II-A) has been downloaded from the website of the Sanger Institute (http://www.sanger.ac.uk/PostGenomics/S pombe/). Subsequently, the empty rows (genes with no expression measurements) have been filtered out from the expression matrix and any other missing expression entries have been imputed with the DTWimpute algorithm [11]. A complete expression matrix of dimension 4598 genes times 20 time points has been created in this way and, subsequently, p-values for regulation have been calculated by performing permutation tests as described in Section II-B. All the genes with a p-value equal to or higher than 0.1 have been filtered out from the complete expression matrix. Consequently, 455 expression profiles with p-values strictly smaller than 0.1 have been retained, which means that there is at most a 10% probability that these genes are expressed in this experiment by chance.

Subsequently, for each gene profile occurring in the newly obtained matrix a gene estimation list has been created by applying the RRN algorithm described in Section II-D. Thus for each gene profile a varying number of neighboring gene profiles (at maximum 0.1 DTW distance from the evaluated profile, i.e. R = 0.1) has been identified and further used to calculate its standardized expression profile. The cumulative distribution of the number of neighboring profiles within DTW distance of 0.1 is given in Fig. 2. Note that about 50% of the expression profiles are assigned fewer than 10 neighbouring profiles within DTW distance of 0.1.

The recursive aggregation procedure, as defined in Section II-E, has been applied to the gene expression values of the estimation list to calculate the standardized expression profile. For the purpose of the hybrid aggregation procedure, three different aggregation operators have been selected: arithmetic, geometric and harmonic means. Their definitions can be found in Section II-E. Each one of these aggregation operators exhibits certain shortcomings when used individually. For instance, the arithmetic mean values are strongly influenced by the presence of extremely low or extremely high values. This may lead in some cases to an averaged overall (standardized) value at some estimated time point, which does not adequately reflect the individual expression values at the corresponding time point of the estimation genes. In case of the geometric mean, the occurrence of a very low expression value (e.g. 0 or close to 0) in some position for a single estimation gene is sufficient to produce a low overall value for this position, no matter what the corresponding expression values for the rest of the estimation genes are. The harmonic mean behaves even more extremely in situations when single entries with very low values are present.

Fig. 3 (panels a to j): Original (solid thick line) versus standardized (dashed thick line) expression profiles. The profiles used for the standardization are in the background (dotted thin line).

Fig. 4: Cumulative distribution of the absolute difference between the original and the standardized profiles' (a) mean and (b) variance, respectively.

Fig. 3 depicts for 10 different genes the standardized and original expression profiles against the background of the profiles in the estimation list used for the standardization of each original profile. Fig. 3b, Fig. 3d and Fig. 3f are examples of a clear fluctuation reduction as a result of the standardization procedure. The latter can easily be noticed in the down-regulation part of the standardized profiles. In plots Fig. 3e and Fig. 3j, the depicted standardized profiles almost repeat the original ones, which is obviously due to the closer match between the original profile and the profiles used for the standardization. The standardized profile depicted in Fig. 3a exhibits correction for a peak shift of the original profile with respect to its neighboring profiles. Further, the profiles in Fig. 3f, Fig. 3h and Fig. 3i have been somewhat reduced in amplitude during the first peak. Finally, the profiles in plots Fig. 3c and Fig. 3g are clearly smoothed by the standardization process. In general, the results presented in Fig. 3 demonstrate that the standardization procedure operates as a sort of data correction for e.g. peak shifts, amplitude range, fluctuations, etc.

In order to evaluate the impact of the standardization procedure proposed here on the expression variance of the whole dataset, we have recorded for each gene the absolute difference between the original and the standardized profiles' mean and variance, respectively. Fig. 4 presents the cumulative distribution of these differences. Fig. 4a provides evidence that the profile means are rather conserved by the standardization procedure. Namely, for 90% of the profiles the difference between the original and standardized mean values is less than 10%. However, as depicted in Fig. 4b, the difference between the original and standardized variances is within 10% for only 40% of the profiles. Obviously the expression means are much less adjusted by the standardization than the variances. This is quite logical considering that the standardization performs profile smoothing in general.

V. Conclusion

We have proposed here a novel data transformation method aiming at multi-purpose data standardization and inspired by gene-centric clustering approaches. The method performs data standardization via template matching of each expression profile with the rest of the expression profiles, employing the DTW alignment algorithm to measure the similarity between the expression profiles. For each gene profile a varying number (based on the degree of similarity) of neighboring gene profiles is identified to be used in the standardization phase. Subsequently, a recursive aggregation algorithm is applied in order to transform the identified neighboring profiles into a single standardized profile. The proposed transformation method has been evaluated on gene expression time series data coming from a study examining the global cell-cycle control of gene expression in fission yeast Schizosaccharomyces pombe. It has been shown to be an adequate data standardization procedure aiming at fluctuation reduction, peak and amplitude correction and profile smoothing in general.

References

[1] G. Rustici, J. Mata, K. Kivinen, P. Lio, C.J. Penkett, G. Burns, J. Hayles, A. Brazma, P. Nurse, J. Bahler, "Periodic gene expression program of the fission yeast cell cycle". Nat. Genet., 36, pp. 809-817, 2004.

[2] U. de Lichtenberg, L.J. Jensen, A. Fausbøll, T.S. Jensen, P. Bork, S. Brunak, "Comparison of computational methods for the identification of cell cycle-regulated genes". Bioinformatics, vol. 21, no. 7, pp. 1164-1171, 2004.

[3] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition". IEEE Trans. on Acoust., Speech, and Sign. Proc., ASSP-26, pp. 43-49, 1978.

[4] J. Aach and G. M. Church, "Aligning gene expression time series with time warping algorithms". Bioinformatics, 17, pp. 495-508, 2001.

[5] J. Criel and E. Tsiporkova, "Gene Time Expression Warper: A tool for alignment, template matching and visualization of gene expression time series". Bioinformatics, vol. 22, no. 2, pp. 251-252, 2006.

[6] F. Hermans and E. Tsiporkova, "Merging microarray cell synchronization experiments through curve alignment". Bioinformatics, 23, e64-e70, 2007.

[7] D. Sankoff and J. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading Mass., 1983.

[8] E. Tsiporkova and V. Boeva, "Nonparametric Recursive Aggregation Process". Kybernetika. J. of the Czech Society for Cybernetics and Inf. Sciences, vol. 40, no. 1, pp. 51-70, 2004.

[9] E. Tsiporkova and V. Boeva, "Multi-step ranking of alternatives in a multi-criteria and multi-expert decision making environment". Inf. Sciences, vol. 76, no. 18, pp. 2673-2697, 2006.

[10] E. Tsiporkova and V. Boeva, "Modelling and Simulation of the Genetic Phenomena of Additivity and Dominance via Gene Networks of Parallel Aggregation Processes", S. Hochreiter and R. Wagner eds.: in LNB (Springer-Verlag Berlin Heidelberg 2007), vol. 4414, pp. 199-211, 2007.

[11] E. Tsiporkova and V. Boeva, "Two-pass imputation algorithm for missing value estimation in gene expression time series", Journal of Bioinformatics and Comp. Biology, vol. 5, no. 5, pp. 1005-1022, 2007.

[12] J. C. Fodor and M. Roubens, Fuzzy Preference Modelling and Multicriteria Decision Support. Kluwer Academic Publishers, Dordrecht, 1994.
