Research Article Modified Logistic Regression Models Using ...downloads.hindawi.com/journals/cmmm/2013/917502.pdf · logistic regression (LR) model to classify the phenotypes of

Hindawi Publishing CorporationComputational and Mathematical Methods in MedicineVolume 2013 Article ID 917502 7 pageshttpdxdoiorg1011552013917502

Research ArticleModified Logistic Regression Models Using Gene Coexpressionand Clinical Features to Predict Prostate Cancer Progression

Hongya Zhao12 Christopher J Logothetis2 Ivan P Gorlov2 Jia Zeng3 and Jianguo Dai4

1 Industrial Center Shenzhen Polytechnic Shenzhen Guangdong 518055 China2Department of Genitourinary Medical Oncology Unit 1374 The University of Texas MD Anderson Cancer Center1515 Holcombe Boulevard Houston TX 77030-4009 USA

3 School of Computer Science and Technology Soochow University Suzhou 215006 China4 School of Applied Chemistry and Biotechnology Shenzhen Polytechnic Shenzhen Guangdong 518055 China

Correspondence should be addressed to Ivan P Gorlov gorurofumailru

Received 12 June 2013 Accepted 3 September 2013

Academic Editor Qizhai Li

Copyright copy 2013 Hongya Zhao et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Predicting disease progression is one of the most challenging problems in prostate cancer research Adding gene expression datato prediction models that are based on clinical features has been proposed to improve accuracy In the current study we applied alogistic regression (LR) model combining clinical features and gene co-expression data to improve the accuracy of the predictionof prostate cancer progression The top-scoring pair (TSP) method was used to select genes for the model The proposed modelsnot only preserved the basic properties of the TSP algorithm but also incorporated the clinical features into the prognostic modelsBased on the statistical inference with the iterative cross validation we demonstrated that prediction LR models that includedgenes selected by the TSP method provided better predictions of prostate cancer progression than those using clinical variablesonly andor those that included genes selected by the one-gene-at-a-time approach Thus we conclude that TSP selection is auseful tool for feature (andor gene) selection to use in prognostic models and our model also provides an alternative for predictingprostate cancer progression

1 Introduction

Prostate cancer (PCa) is the second leading cause of cancer-related deaths among men in the USA [1 2] Screeningusing serum prostate-specific antigen (PSA) has improvedthe early detection of PCa and has resulted in an increasein the proportion of patients with disease that is curableby prostatectomy [3 4] However 20 to 30 of treatedpatients will develop a local or metastatic recurrence whichreflects the most adverse clinical outcome [4]Thus from theclinical perspective it is important to be able to predict whichpatients will experience a relapse

Traditional PCa prognosis models are based on someclinical features such as pretreatment PSA levels biopsyGleason score (GS) and clinical stage but in practice they areinadequate to accurately predict disease progression [5]Withthe development of microarray technology in recent yearsa number of studies have been conducted to characterize

the dynamics of gene expression in PCa progression byusing DNA microarrays In some studies tumor expressionsignatures associated with clinical parameters and outcomeshave been identified [6ndash9] As a result it is possible to developthe clinical models with the variables of gene signaturesidentified from microarray data and some clinical featuresto predict which men would experience progression to themetastatic form of PCa

However it has been found that none of the predictivemodels using gene expression profiles are significantly betterthan models using clinical variables only in predicting PCaprogression [10 11] In fact only a limited number of genesare used to avoid overfitting in these models The genes areusually selected through a gene-by-gene comparison Theresults of recent studies however suggest that assessing theexpression of more than one gene (ie coexpression analysis)yields a better prediction of tumor progression than theanalysis of individual genes does [12ndash15]

2 Computational and Mathematical Methods in Medicine

In this study we tried to propose suchmodels bymergingthe coexpressed genesrsquo profiles and some clinical features topredict the patients who would suffer from PCa progressionThe genes used in our models are identified by a top-scoring pair (TSP) algorithm The TSP method was initiallyintroduced by Geman et al as a classification technique formicroarray data [16] We applied the TSP-based LR modelto publishedmicroarray experiments whose patients sufferedfrom PCa progressionWe analyzed the effects of the numberof coexpressed genes included in themodels and the selectionof the clinical variables on the accuracy of the prediction Wealso compared the performance of the most commonly usedclassification methods with our proposed method

2 Materials and Methods

21 Logistic Regression Model for the Classification of GeneMicroarrays Genome-wide microarray data from differentcells give insight into the gene expression variation of var-ious genotypes and phenotypes Classification of patientsis an important aspect of cancer diagnosis and treatmentFor example microarray experiments can be employed toscreen gene expression levels from cancerous and normalphenotypes so that proper prediction rules can be built fromthese gene expression data In this section we introduce alogistic regression (LR) model to classify the phenotypes ofmicroarray data

We denote a gene expression matrix by 119863 = 119909119894119895119872times119873

where there are119872 genes and119873 samples and 119909

119894119895denotes the

expression value of the 119894th gene 119894 isin 1 119872 from the119895th sample 119895 isin 1 119873 The vector 119866

119894sdot= (1199091198941 119909

119894119873)

represents the 119894 gene expression values over all119873 samples and119878sdot119895= (1199091119895 119909

119872119895) is the expression profile of all 119872 genes

for the 119895th sample Let 119910119895be the binary phenotype of the 119895th

profile 119910119895= 0 indicates that the 119895th sample belongs to class 0

(eg normal tissues) and 119910119895= 1 indicates that the 119895th sample

belongs to class 1 (eg tumor tissues)The classification of microarray data has been intensively

researched for years But some limitations have stood outsuch as the small-sample dilemma ldquoblack boxrdquo and lackof prediction strength [16ndash18] We used LR to build theprediction models for a binary outcome Obviously theunderlying probability of labels and contribution of predictorvariables can be explicitly provided in LR models which ishelpful for biologists in discovering the genes that interactand cause the occurrence of disease

The goal of classification with LR was to find a formulathat gives the probability 119901

119895that the 119895th sample with all

its measured expressions 119878sdot119895represents a class 1 case Since

only two classes are considered the probability of the samplerepresenting class 0 is consequently 1 minus 119901

119895 We used the

following normal LR model

120578119895= log119901119895

1 minus 119901119895

= 1205730+

119872

sum

119894=1

120573119894119909119894119895 (1)

where 1205730 1205731 120573

119872are parameters that can be estimated by

maximizing the following likelihood

119871 (1205730 120573) =

119873

sum

119895=1

119910119895log119901119895+

119873

sum

119895=1

(1 minus 119910119895) log (1 minus 119901

119895) (2)

For microarray experiments of typical ldquolarge 119901 small119899rdquo the number of samples 119873 is usually on the order oftens but the number of genes 119872 is usually on the orderof thousands or even tens of thousands So the number ofsamples is much less than the number of variables (119873 ≪119872) This situation presents a number of problems whenbuilding a LR model such as overfitting multicollinearity ofthe gene expression profiles and infinite solutions for 120573

119894[17ndash

19] Feature selection can be used to identify the significantgenes that contribute tomost of the classificationThus somedimension-reduction techniques such as support vectormachines (SVMs) singular value decomposition and partialleast squares are commonly used to tackle those problemsand make the computation feasible [17ndash19] However thefeatured genes are usually selected one by one According tothe biological mechanism genes do not work by themselvesso we employed coexpressed TSP genes in the model asdescribed in the following section

22 Identification of Coexpressed Genes Recent studies havesuggested that assessing the expression of more than onegene (ie coexpression analysis) provides a better predic-tion of tumor progression than analyzing the expression ofindividual genes [20ndash22] We identified the coexpressed genewith the paired-gene approach of the top-scoring pairs (TSP)algorithm as described by Geman et al [16] The TSP algo-rithm was originally developed for the binary classificationof phenotypes according to the relative expression profiles ofone-gene pairTheTSP classifier has the following advantagesover the standard classifiers used in gene expression studies(i) it is a parameter-free and data-driven machine-learningmethod that avoids overfitting by eliminating the need toperform specific parameter tuning as in other machine-learning techniques such as SVMs and neural networks (ii)it involves only two genes which leads to easily interpretabledata and inexpensive diagnostic tests (iii) the rank-basedTSP classifiers are less affected by technical factors or normal-ization than classifiers which are based on expression levelsof individual genes and (iv) the simple and accurate resultsgenerated by TSP facilitate follow-up studies

TSP gene pairs may be considered biomarker genes ina diagnostic test from microarray experiments [16 20ndash22]The methodology is being extended from one TSP gene pairto top-scoring pair of groups (TSPG) as gene signatures[20ndash22] However there are still some unresolved issues ofbiological explanation and the selection criteria related tothe use of gene pairs instead of larger groups of significantgenes Most of the algorithms in gene selection are basedon the distribution assumption of the gene expression dataHowever the rank-based TSP algorithm is a parameter-free data-driven machine-learning method It is difficult todetermine the number of gene pairs selected but current

Computational and Mathematical Methods in Medicine 3

research indicates that only a few gene pairs with the topscores need to be considered [20 21]

For simplicity using the gene expression matrix 119863 =119909119894119895119872times119873

with 119872 genes and 119873 samples we assume that 1198731

samples are labeled class 0 119910119895= 0 (119895 isin 1 119873

1)

and 1198732samples are labeled class 1 119910

119895= 1 (119895 isin 119873

1+

1 119873) and 1198731+ 1198732= 119873 In this method we focused

on detecting ldquomarker gene pairsrdquo (119906 V) because there is asignificant difference in the probability of observing 119866

119906lt 119866V

between class 1 and class 0 where 119866119906and 119866V denote the 119906th

and Vth rows of119863 The conditional probabilities of observing119866119906lt 119866V in each class are defined as

119901119906V (0) = 119875 (119866119906 lt 119866V | 119888 = 0) =

1

1198731

1198731

sum

119895=1

119868 (119909119906119895lt 119909V119895)

119901119906V (1) = 119875 (119866119906 lt 119866V | 119888 = 1) =

1

1198732

119873

sum

119895=1198731+1

119868 (119909119906119895lt 119909V119895)

(3)

where 119868(119909119906119895lt 119909V119895) is the indicator function defined as

119868 (119909119906119895lt 119909V119895) =

1 119909119906119895lt 119909V119895

0 119909119906119895ge 119909V119895

119895 = 1 2 119873 (4)

The typical TSP method is based on maximizing thefollowing score of (119906 V) defined by German et al [16]

Δ119906V =1003816100381610038161003816119901119906V (0) minus 119901119906V (1)

1003816100381610038161003816 (5)

This approach has been shown to be as accurate as SVMsand other more sophisticated methods [20ndash22] Althoughmaximizing delta identifies the best classifier with highaccuracy it may be associated with relatively low sensitivityor specificity as pointed out by German et al and Ummanniet al [16 23] For example in the classification of cancerversus normal samples accuracy is defined as the ratiobetween the number of correctly predicted samples and thetotal number of samples and sensitivity (resp specificity) isthe ratio between the number of correctly predicted cancer(resp normal) samples and the total number of cancer(resp normal) samples [16]This low sensitivity or specificityrestricts us to use the classifier of one TSP formakingmedicaldecisions This issue was improved with the use of multiplegene pairs as the classifier which can achieve similar scoreswith high accuracy sensitivity and specificity [20ndash22] Thuswe considered not just one but multiple TSP gene pairs in ourmodels

23 Evaluation of the Model Using Published Datasets Toevaluate the efficiency of the TSP-based LRmodel we appliedour model to datasets with both clinical parameters and geneexpression values We selected a dataset with a large samplesize because we could obtain more reliable estimates of theefficiency of the classifiers The dataset was from the recentlypublished study of Sboner et al [5] who analyzed geneexpression in patients with up to 30 years of clinical follow-up data Men who died within 10 years of being diagnosed

with PCa were considered to have ldquolethalrdquo disease and thosewho survived at least 10 years after diagnosis were consideredto have ldquoindolentrdquo disease There were 165 men with lethaland 116 with indolent diseaseThe GS tumor percentage andpresence of an estrogen-regulated gene (ERG) rearrangementwere provided for each patient in the study The expressionof 6100 genes was assessed using a custom gene expressionarray (GSE 16560)

For our model we first randomly separated the 281samples into a learning set with 186 samples and a validationset with the other 95 samples with an approximately equalproportion of men with lethal and indolent PCa in eachgroup The learning set was utilized to create the modelswhose performance was evaluated in the validation set bymeans of the area under the receiver operating characteristic(ROC) curve (AUC) To compare the performance of ourmodel we performed the statistical testing based on the nullhypothesis that there is no difference between the AUCsof Sbonerrsquos models and ours Similar to the estimation ofAUCs in [5] the corresponding 95 confidence intervalsof the AUCs were computed in 100 iterative 10-fold crossvalidation procedures that enabled an unbiased estimation ofthe modelrsquos performance since the evaluation was performedon an independent dataset The model is inferred to bebetter only if its AUC is statistically larger than that of theother models In the original study the authors conducted anextensive comparative analysis of the most frequently usedclassification methods including the k-nearest neighbor thenearest template prediction diagonal linear discriminateanalysis SVMs and neural network analysis Their resultsallowed us to compare the performance of the TSP-based LRclassifiers with that of the other classifiers

To optimize and select the best models we adopted aniterative cross validation procedure within the learning setthat was similar to the procedure used by Sboner et al [5]The stratified tenfold cross validation procedure split thelearning set into 10 disjointed partitions testi (119894 = 1 10)with approximately equal proportions of lethal and indolentcases in each For a given partition testi the models werefitted using all the other cases in the learning set that isthe trainingi set and then were evaluated with AUC analysisof testi In the procedure of 10-fold cross validation themodeli (119894 = 1 10)was first parameterized in the trainingisets and then the corresponding AUC ontesti (119894 = 1 10)sets were calculated from modeli To avoid potential biasesin the selection of the 10 partitions the entire procedurewas repeated 100 times for 1000 different partitions Weidentified the best model with the largest AUC by comparingthem as obtained in the 100 iterations Furthermore thefeatured gene pairs and estimated parameters in the modelwere also considered as the best model in learning set Therationale was that the results of this procedure enable theidentification of the best model which can then be used tobuild a classifier that was finally evaluated on the validationset

During the iterations of our cross validation procedurethe feature-selection procedurewas carried out to identify thesubsets of genes that are expressed differently in the lethal andindolent samples In the study by Sboner et al [5] a two-sided


Table 1 Logistic regression models that included TSP-selected gene pairs and different combinations of clinical variables

Model number Patientrsquos age Gleason score Tumor percentage Fusion ERG arrangement TSP genes11 X12 X X13 X X14 X X15 X X16 X X X17 X X X18 X X X19 X X X110 X X X111 X X X112 X X X X113 X X X X114 X X X X115 X X X X116 X X X X X

119905-test was performed for each gene to identify the differentlyexpressed genes We then compared our models using TSP-selected coexpressed genes with the models described bySboner et al [5]

3 Results

We proposed the LR models by combining TSP-selectedgenes and clinical features to identify and predict the patientswhose PCa will progress The performance of the modelswas evaluated with dataset GSE 16560 Table 1 lists the 16LR models that we tested Our models include all possiblecombinations of the following variables age GS tumorpercentage presence of ERG gene rearrangement and TSP-selected genes The AUCs of 1000 different partitions werecalculated to select the best models Figure 1(a) shows theAUC boxplots for the 100 tenfold cross validations of the16 models listed in Table 1 with one pair of TSP-selectedgenes per model The red stars denote the AUC valuesfrom the validation datasets that correspond to the best LRmodels from the learning dataset Figure 1(b) shows the AUCboxplots for the same 16 models but with two TSP-selectedgene pairs per model

We plotted the AUC values of the validation dataset toassess the effects of the variables on the models (Figure 2)The blue line represents AUC values from the models withone TSP-selected gene pair and the black line representsthose from the models with two TSP-selected gene pairsFurthermore we tested the statistical significance of themodels based on the null hypothesis that there is no differencebetween the AUCs of Sbonerrsquos models It is found that mostof AUC values in Sbonerrsquos models were out of the 95confidence intervals of AUCs in our models So our modelscan provide an alternative in predicting prostate cancerprogression The addition of TSP-selected gene pairs canimprove our modelsrsquo prediction of PCa progression whichdiffered from Sbonerrsquos results

What is the role of TSP-selected gene pairs in comparisonwith the fusion ERG and the other clinical features especiallyGS Obviously the GS was the most statistically significantvariable because all the top models included it In Figure 2the red circles label the 8 models that include the GS TheAUCs in those 8 models were much higher than they werein the others and were very similar in the one- and two-gene-pair models The 8 AUCs were more than 08 as shownin Figure 2 so we can conclude that the models with TSP-selected gene pairs performed better than all of Sbonerrsquosmodels for which the largest AUC was 079 in [5]

The model using only the GS yielded an AUC of 076by adding fusion ERG the largest AUC observed by Sboneret al was 079 [5] Similarly the other models that used onlyGS and tumor percent (or age) without molecular profilescould yield a higher AUC if fusion ERGwas addedThereforethe addition of fusion ERG may improve the predictioncapability of models that use only clinical features [5]

However the effect of fusion ERG was a little differentin our analysis First our models could perform better byreplacing fusion ERG with TSP-selected genes In compar-ison with the best model with GS and fusion ERG (AUC079) in [5] our model 13 with the GS and TSP-selectedgene pairs performed better with an AUC of 084 (95 CI =[081 088]) our best model was model 19 which used GStumor percentage and one TSP-selected gene pair (AUC086 95 CI = [079 092]) but the corresponding modelreported by Sboner with fusion ERG for replacement yieldedan AUC of 075 [5] On the other hand the addition of fusionERG had little or no effect on our models that included TSP-selected gene pairs For example the sameAUCwas obtainedwith our Model 13 and 110 with GS fusion ERG and TSP-selected gene pairs Thus TSP-selected genes seem to have amore important effect than the fusion ERGdoes in predictingPCa progression

The addition of genes other than fusion ERG could notimprove the prediction capability because the best modelsin the Sboner study [5] lacked molecular profiles However


10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast

lowastlowast

lowastlowast

lowast

lowast

lowast lowastlowast

lowastlowast

lowast

lowastlowast

(a)

10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast

lowast lowastlowast lowast

lowastlowast

lowast lowast lowastlowast lowast

lowast

lowastlowast

(b)

Figure 1 AUC boxplots for 100 tenfold cross validations of 16 models that include TSP-selected genes The x-axis is the index of 16 modelslisted in Table 1 and the y-axis is the AUC values The red star denotes the corresponding AUC values of the validation dataset that uses thebest logistic regression models from the learning dataset (a) Models that included one pair of TSP-selected genes and (b) those that includedtwo such gene pairs

Table 2 Comparison of the performance of our logistic regression models with that of the nine models evaluated by Sboner et al [5] usingthe same number of genes

Model number Patientrsquos age Gleason score Tumor percentage Fusion ERG Number of genes AUC in ref [5] AUC of our model21 X 18 0672 076922 X X 9 0708 073223 18 0713 073624 X X 21 0726 079325 X 11 0730 071226 X X X 3 0738 080627 X X X 12 0745 080428 X 16 0749 081329 X X 12 0750 0788

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

065

070

075

080

085

Figure 2 The AUCs of the 16 best models from the validationdatasetThe x-axis is the index of the 16 models listed in Table 1 andthe y-axis is the AUC valuesThe blue line shows the AUCs from themodels with one TSP-selected gene pair and the black line showsthose from models with two TSP-selected gene pairs The pointscircled in red are theAUCs in the 8models that included theGleasonscore as a variable

some improvement was observed in our study by replacingthe molecular profiles in the Sboner models with one ortwo TSP-selected gene pairs our models performed better

than theirs did For example the best model with molecularprofiles in the Sboner study used GS age and 12 genes andyielded an AUC of 075 whereas our model 16 which usedGS age and TSP-selected gene pairs yielded an AUC of 08(95 CI = [076 085]) Thus although we added fewergenes to ourmodel its performancewas betterMoreover theprediction capability of Sbonerrsquos models was also improvedwhen the same number of genes was replaced with TSP-selected gene pairs as demonstrated in Table 2 Thereforeadding the TSP-selected genes had an important effect on theperformance of the original models

Models that use only clinical features may perform betterif appropriate genes such as those selected with the TSPalgorithm are added To explore the effect of adding geneswe compared our approach with that used in the nine modelsin the original study by Sboner et al [5] that includedgenes For the comparison we selected the same number ofgenes for our models However the featured genes in themodels differed each time because the 1000 random trainingand testing partitions were different in the iterative crossvalidation procedure

The results of our comparison were presented in Table 2As noted the AUCs in our models were often higher thanthose in the study by Sboner et al The prognostic models for


PCa can perform better if the featured genes are selected Inparticular TSP-selected genes may play an important roleFirst the AUC of the model using only 18 genes increasedfrom 071 in Sbonerrsquos study to 074 (95 CI = [071 077]) inourmodel Further theAUCs of themodels that used one andtwo TSP-selected gene pairs were 071 (95CI = [067 074])and 077 (95 CI = [073 081]) respectively Thus the TSP-based models performed better with a smaller number ofgenes

The model from the Sboner study that included theGS and 16 genes did not perform any better than theirmodel did that used the GS only with AUCs of 075 and075 respectively [5] However the AUC of our model thatincluded the GS and 16 TSP-selected genes was 081 (95CI = [076 085]) in Table 2 and the models that used GSand one (or two) TSP gene pair(s) performed better with anAUC of 084 in Figure 2

Finally of all the models tested in the original study theone that included the GS and the ERG rearrangement (withno gene expression data) had the highest AUC value 079 [5]whereas most of the AUC values for our models were higherthan that Therefore in contrast to the conclusion reached bySboner et al we believe that adding themolecular profiles canimprove the results obtained with the traditional prognosticmodels of PCa if the appropriate genes are selected

From the results in Figure 2 andTable 2 wemay concludethat the modelsrsquo performance was not improved by theaddition of large numbers of genes but was improved by theaddition of significant clinical features andmolecular profilesFor example adding one TSP-selected gene pair is enoughif the important clinical variables such as GS are includedin the model However in the case of model 11 whichincluded only one gene signature and models 12 and 18which also included age the addition of more gene pairs cangreatly improve the performance Obviously the gene selec-tion strongly depends on the patient samples and so somestatistical techniques such as bootstrap repeated samplingor cross validation were commonly used in the TSP-extendedalgorithms In the current research the computation cost ofTSP-based algorithms is not the main concern but the topicsabout the optimal number of gene pairs added to improve theclinical models are still interesting in further research

4 Conclusion and Discussion

We have introduced an LR-based classification method thatcombines TSP-selected genes and clinicalmeasurementsTheempirical results of [19 20] based on the datasets of prostatecancer progression show that the classification models usingone or two TSP-selected gene pairs perform better than mostcommonly used one-gene-at-a-time approaches With thecombination of LR our models not only preserved the basicadvantages of the TSP algorithm but also incorporated theclinical features Furthermore the LR-TSP model providesthe underlying probability of predictionand coexpressedgenes that are used as biomarkers in the model Thus ourproposed method provides explicit biologic interpretation ofthe clinical tests Based on the statistical inference with the

iterative cross validation the better performance was shownin our models

As mentioned in the report of Sboner et al [5] manyfactors can influence the performance of the models such asthe definitions of lethal and indolent PCa the use of samplescontaminated with stromal tissue the selection of genesassayed using a DASL (cDNA-mediated annealing selectionextension and ligation) array and the effect of intertumorheterogeneity Based on the study of GSE 16560 we exploredthe possible effect of genes used in the clinical models Thefeatured genes are often selected by using a one-gene-at-a-time approach Sboner et al performed a two-sided 119905-testfor each gene within the trainingi partition thereby avoidingoverfitting because the selection of the genes was performedon only training sets [5]They also implemented stepwise-likefeature selection sorted the genes according to their 119875 valuesfrom the t-testing and then added the genes to their modelsOur study on the other hand demonstrates that coexpressionanalysis yields better prediction of tumor progression thanthe analysis of individual genes does Therefore we concludethat TSP selection is a useful tool for feature (andor gene)selection to use in prognostic models

Acknowledgments

This study was supported by the David Koch Center forApplied Research in Genitourinary Cancer National CancerInstitute Grant CA16672 National Natural Science Funds ofChina (31100958) and GDHVPS (2012)

References

[1] C S Grasso Y M Wu D R Robinson et al ldquoThe mutationallandscape of lethal castration-resistant prostate cancerrdquoNaturevol 487 pp 239ndash243 2012

[2] F H Schroder J HugossonM J Roobol et al ldquoProstate-cancermortality at 11 years of follow-uprdquo The New England Journal ofMedicine vol 366 pp 981ndash990 2012

[3] A Qaseem M J Barry T D Denberg et al ldquoScreeningfor prostate cancer a guidance statement from the clinicalguidelines committee of the American college of physiciansrdquoAnnals of Internal Medicine vol 158 no 10 pp 761ndash769 2013

[4] SMDhanasekaran A Dash J Yu et al ldquoMolecular profiling ofhumanprostate tissues insights into gene expression patterns ofprostate development during pubertyrdquoThe FASEB Journal vol19 no 2 pp 243ndash245 2005

[5] A Sboner F Demichelis S Calza et al ldquoMolecular sampling ofprostate cancer a dilemma for predicting disease progressionrdquoBMCMedical Genomics vol 3 article 8 2010

[6] E LaTulippe J Satagopan A Smith et al ldquoComprehensivegene expression analysis of prostate cancer reveals distincttranscriptional programs associated with metastatic diseaserdquoCancer Research vol 62 no 15 pp 4499ndash4506 2002

[7] J-H Luo Y P Yu K Cieply et al ldquoGene expression analysisof prostate cancersrdquoMolecular Carcinogenesis vol 33 no 1 pp25ndash35 2002

[8] K TamuraM Furihata T Tsunoda et al ldquoMolecular features ofhormone-refractory prostate cancer cells by genome-wide geneexpression profilesrdquo Cancer Research vol 67 no 11 pp 5117ndash5125 2007


[9] T S Furey N Cristianini N Duffy D W Bednarski MSchummer and D Haussler ldquoSupport vector machine classifi-cation and validation of cancer tissue samples using microarrayexpression datardquo Bioinformatics vol 16 no 10 pp 906ndash9142000

[10] C Peterson and M Ringner ldquoAnalyzing tumor gene expressionfilesrdquo Artificial Intelligence in Medicine vol 28 pp 59ndash74 2003

[11] P Xu G N Brock and R S Parrish ldquoModified linear discrimi-nant analysis approaches for classification of high-dimensionalmicroarray datardquo Computational Statistics and Data Analysisvol 53 no 5 pp 1674ndash1687 2009

[12] U R Chandran C Ma R Dhir et al ldquoGene expression profilesof prostate cancer reveal involvement of multiple molecularpathways in the metastatic processrdquo BMC Cancer vol 7 article64 2007

[13] R S Hudson M Yi D Esposito et al ldquoMicroRNA-106b-25cluster expression is associated with early disease recurrenceand targets caspase-7 and focal adhesion in human prostatecancerrdquo Oncogene vol 32 no 35 pp 4139ndash4147 2012

[14] EMartinez andV Trevino ldquoModelling gene expression profilesrelated to prostate tumor progression using binary statesrdquoTheoretical Biology and Medical Modelling vol 10 article 372013

[15] S Feng O Dakhova C J Creighton and M M Ittmann ldquoTheendocrine fibroblast growth factor FGF19 promotes prostatecancer progressionrdquo Cancer Research vol 73 no 8 pp 2551ndash2562 2012

[16] D Geman C drsquoAvignon D Q Naiman and R L WinslowldquoClassifying gene expression profiles from pairwise mRNAcomparisonsrdquo Statistical Applications in Genetics and MolecularBiology vol 3 no 1 article 19 2004

[17] J Zhu and T Hastie ldquoClassification of gene microarrays bypenalized logistic regressionrdquo Biostatistics vol 5 no 3 pp 427ndash443 2004

[18] L Shen and E C Tan ldquoDimension reduction-based penalizedlogistic regression for cancer classification using microarraydatardquo IEEEACM Transactions on Computational Biology andBioinformatics vol 2 no 2 pp 166ndash175 2005

[19] J G Liao and K-V Chin ldquoLogistic regression for diseaseclassification using microarray data model selection in a largep and small n caserdquo Bioinformatics vol 23 no 15 pp 1945ndash19512007

[20] A C Tan D Q Naiman L Xu R L Winslow and D GemanldquoSimple decision rules for classifying human cancers from geneexpression profilesrdquo Bioinformatics vol 21 no 20 pp 3896ndash3904 2005

[21] L Xu A C Tan D Q Naiman D Geman and R L WinslowldquoRobust prostate cancer marker genes emerge from directintegration of inter-study microarray datardquo Bioinformatics vol21 no 20 pp 3905ndash3911 2005

[22] Y Zhang J Szustakowski and M Schinke ldquoBioinformaticsanalysis of microarray datardquoMethods in Molecular Biology vol573 pp 259ndash284 2009

[23] R Ummanni S Teller H Junker et al ldquoAltered expressionof tumor protein D52 regulates apoptosis and migration ofprostate cancer cellsrdquo FEBS Journal vol 275 no 22 pp 5703ndash5713 2008

Submit your manuscripts athttpwwwhindawicom

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


MEDIATORSINFLAMMATION

of


Behavioural Neurology

EndocrinologyInternational Journal of



Disease Markers


BioMed Research International

OncologyJournal of



Oxidative Medicine and Cellular Longevity


PPAR Research

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Immunology ResearchHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

ObesityJournal of



Computational and Mathematical Methods in Medicine

OphthalmologyJournal of


Diabetes ResearchJournal of



Research and TreatmentAIDS


Gastroenterology Research and Practice


Parkinsonrsquos Disease

Evidence-Based Complementary and Alternative Medicine

Volume 2014Hindawi Publishing Corporationhttpwwwhindawicom


In this study we tried to propose suchmodels bymergingthe coexpressed genesrsquo profiles and some clinical features topredict the patients who would suffer from PCa progressionThe genes used in our models are identified by a top-scoring pair (TSP) algorithm The TSP method was initiallyintroduced by Geman et al as a classification technique formicroarray data [16] We applied the TSP-based LR modelto publishedmicroarray experiments whose patients sufferedfrom PCa progressionWe analyzed the effects of the numberof coexpressed genes included in themodels and the selectionof the clinical variables on the accuracy of the prediction Wealso compared the performance of the most commonly usedclassification methods with our proposed method

2 Materials and Methods

21 Logistic Regression Model for the Classification of GeneMicroarrays Genome-wide microarray data from differentcells give insight into the gene expression variation of var-ious genotypes and phenotypes Classification of patientsis an important aspect of cancer diagnosis and treatmentFor example microarray experiments can be employed toscreen gene expression levels from cancerous and normalphenotypes so that proper prediction rules can be built fromthese gene expression data In this section we introduce alogistic regression (LR) model to classify the phenotypes ofmicroarray data

We denote a gene expression matrix by 119863 = 119909119894119895119872times119873

where there are119872 genes and119873 samples and 119909

119894119895denotes the

expression value of the 119894th gene 119894 isin 1 119872 from the119895th sample 119895 isin 1 119873 The vector 119866

119894sdot= (1199091198941 119909

119894119873)

represents the 119894 gene expression values over all119873 samples and119878sdot119895= (1199091119895 119909

119872119895) is the expression profile of all 119872 genes

for the 119895th sample Let 119910119895be the binary phenotype of the 119895th

profile 119910119895= 0 indicates that the 119895th sample belongs to class 0

(eg normal tissues) and 119910119895= 1 indicates that the 119895th sample

belongs to class 1 (eg tumor tissues)The classification of microarray data has been intensively

researched for years But some limitations have stood outsuch as the small-sample dilemma ldquoblack boxrdquo and lackof prediction strength [16ndash18] We used LR to build theprediction models for a binary outcome Obviously theunderlying probability of labels and contribution of predictorvariables can be explicitly provided in LR models which ishelpful for biologists in discovering the genes that interactand cause the occurrence of disease

The goal of classification with LR was to find a formulathat gives the probability 119901

119895that the 119895th sample with all

its measured expressions 119878sdot119895represents a class 1 case Since

only two classes are considered the probability of the samplerepresenting class 0 is consequently 1 minus 119901

119895 We used the

following normal LR model

120578119895= log119901119895

1 minus 119901119895

= 1205730+

119872

sum

119894=1

120573119894119909119894119895 (1)

where 1205730 1205731 120573

119872are parameters that can be estimated by

maximizing the following likelihood

119871 (1205730 120573) =

119873

sum

119895=1

119910119895log119901119895+

119873

sum

119895=1

(1 minus 119910119895) log (1 minus 119901

119895) (2)

For microarray experiments of typical ldquolarge 119901 small119899rdquo the number of samples 119873 is usually on the order oftens but the number of genes 119872 is usually on the orderof thousands or even tens of thousands So the number ofsamples is much less than the number of variables (119873 ≪119872) This situation presents a number of problems whenbuilding a LR model such as overfitting multicollinearity ofthe gene expression profiles and infinite solutions for 120573

119894[17ndash

19] Feature selection can be used to identify the significantgenes that contribute tomost of the classificationThus somedimension-reduction techniques such as support vectormachines (SVMs) singular value decomposition and partialleast squares are commonly used to tackle those problemsand make the computation feasible [17ndash19] However thefeatured genes are usually selected one by one According tothe biological mechanism genes do not work by themselvesso we employed coexpressed TSP genes in the model asdescribed in the following section

22 Identification of Coexpressed Genes Recent studies havesuggested that assessing the expression of more than onegene (ie coexpression analysis) provides a better predic-tion of tumor progression than analyzing the expression ofindividual genes [20ndash22] We identified the coexpressed genewith the paired-gene approach of the top-scoring pairs (TSP)algorithm as described by Geman et al [16] The TSP algo-rithm was originally developed for the binary classificationof phenotypes according to the relative expression profiles ofone-gene pairTheTSP classifier has the following advantagesover the standard classifiers used in gene expression studies(i) it is a parameter-free and data-driven machine-learningmethod that avoids overfitting by eliminating the need toperform specific parameter tuning as in other machine-learning techniques such as SVMs and neural networks (ii)it involves only two genes which leads to easily interpretabledata and inexpensive diagnostic tests (iii) the rank-basedTSP classifiers are less affected by technical factors or normal-ization than classifiers which are based on expression levelsof individual genes and (iv) the simple and accurate resultsgenerated by TSP facilitate follow-up studies

TSP gene pairs may be considered biomarker genes ina diagnostic test from microarray experiments [16 20ndash22]The methodology is being extended from one TSP gene pairto top-scoring pair of groups (TSPG) as gene signatures[20ndash22] However there are still some unresolved issues ofbiological explanation and the selection criteria related tothe use of gene pairs instead of larger groups of significantgenes Most of the algorithms in gene selection are basedon the distribution assumption of the gene expression dataHowever the rank-based TSP algorithm is a parameter-free data-driven machine-learning method It is difficult todetermine the number of gene pairs selected but current






1)


119895= 1 (119895 isin 119873

1+



119906lt 119866V



119901119906V (0) = 119875 (119866119906 lt 119866V | 119888 = 0) =

1

1198731

1198731

sum

119895=1

119868 (119909119906119895lt 119909V119895)

119901119906V (1) = 119875 (119866119906 lt 119866V | 119888 = 1) =

1

1198732

119873

sum

119895=1198731+1

119868 (119909119906119895lt 119909V119895)

(3)


119868 (119909119906119895lt 119909V119895) =

1 119909119906119895lt 119909V119895

0 119909119906119895ge 119909V119895

119895 = 1 2 119873 (4)


Δ119906V =1003816100381610038161003816119901119906V (0) minus 119901119906V (1)

1003816100381610038161003816 (5)











3 Results








10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast

lowastlowast

lowastlowast

lowast

lowast

lowast lowastlowast

lowastlowast

lowast

lowastlowast

(a)

10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast


lowastlowast


lowast

lowastlowast

(b)




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

065

070

075

080

085















Acknowledgments


References






























of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of





















1)


119895= 1 (119895 isin 119873

1+



119906lt 119866V



119901119906V (0) = 119875 (119866119906 lt 119866V | 119888 = 0) =

1

1198731

1198731

sum

119895=1

119868 (119909119906119895lt 119909V119895)

119901119906V (1) = 119875 (119866119906 lt 119866V | 119888 = 1) =

1

1198732

119873

sum

119895=1198731+1

119868 (119909119906119895lt 119909V119895)

(3)


119868 (119909119906119895lt 119909V119895) =

1 119909119906119895lt 119909V119895

0 119909119906119895ge 119909V119895

119895 = 1 2 119873 (4)


Δ119906V =1003816100381610038161003816119901119906V (0) minus 119901119906V (1)

1003816100381610038161003816 (5)











3 Results








10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast

lowastlowast

lowastlowast

lowast

lowast

lowast lowastlowast

lowastlowast

lowast

lowastlowast

(a)

10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast


lowastlowast


lowast

lowastlowast

(b)




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

065

070

075

080

085















Acknowledgments


References






























of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of




















3 Results








10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast

lowastlowast

lowastlowast

lowast

lowast

lowast lowastlowast

lowastlowast

lowast

lowastlowast

(a)

10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast


lowastlowast


lowast

lowastlowast

(b)




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

065

070

075

080

085















Acknowledgments


References






























of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of

















10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast

lowastlowast

lowastlowast

lowast

lowast

lowast lowastlowast

lowastlowast

lowast

lowastlowast

(a)

10

08

06

04

02

00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lowast

lowast


lowastlowast


lowast

lowastlowast

(b)




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

065

070

075

080

085















Acknowledgments


References






























of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of

























Acknowledgments


References






























of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of





































of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of





















of






Disease Markers



OncologyJournal of





PPAR Research



Journal of

ObesityJournal of
















Documents

Research Article Modified Logistic Regression Models Using ...downloads.hindawi.com/journals/cmmm/2013/917502.pdf · logistic regression (LR) model to classify the phenotypes of