7
Chakraborty et al. BMC Bioinformatics 2013, 14:236 http://www.biomedcentral.com/1471-2105/14/236 SOFTWARE Open Access svapls: an R package to correct for hidden factors of variability in gene expression studies Sutirtha Chakraborty * , Somnath Datta and Susmita Datta Abstract Background: Hidden variability is a fundamentally important issue in the context of gene expression studies. Collected tissue samples may have a wide variety of hidden effects that may alter their transcriptional landscape significantly. As a result their actual differential expression pattern can be potentially distorted, leading to inaccurate results from a genome-wide testing for the important transcripts. Results: We present an R package svapls that can be used to identify several types of unknown sample-specific sources of heterogeneity in a gene expression study and adjust for them in order to provide a more accurate inference on the original expression pattern of the genes over different varieties of samples. The proposed method implements Partial Least Squares regression to extract the hidden signals of sample-specific heterogeneity in the data and uses them to find the genes that are actually correlated with the phenotype of interest. We also compare our package with three other popular softwares for testing differential gene expression along with a detailed illustration on the widely popular Golub dataset. Results from the sensitivity analyes on simulated data with widely different hidden variation patterns reveal the improved detection power of our R package compared to the other softwares along with reasonably smaller error rates. Application on the real-life dataset exhibits the efficacy of the R package in detecting potential batch effects from the dataset. Conclusions: Overall, Our R package provides the user with a simplified framework for analyzing gene expression data with a wide range of hidden variation patterns and delivering a differential gene expression analysis with substantially improved power and accuracy. The R package svapls is freely available at http://cran.r-project.org/web/packages/svapls/index.html. Background Several types of subject/sample specific factors consti- tute an important but often overlooked source of hidden variability in differential gene expression analyses. In a wide variety of situations these factors are triggered from certain specific biological, environmental or demographic profiles of the subjects corresponding to the collected tis- sue samples. The latent effects from these hidden factors can generate spurious signals of heterogeneity that may significantly distort the original differential expression pattern of the genes. In this context, a simple example is provided by the widely known batch-effect in microarray analyses, where subject tissue samples collected in sepa- rate batches can produce an additional effect of residual *Correspondence: [email protected] Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY-40202, USA variation. This effect is still manageable as composition of the batches are known prior to analyses. But, numerous other factors may still exist that are not detectable from outside, but can potentially affect the subject-specific expression levels of the genes in different ways. They can in turn lead to complex latent expression structures in the entire genomic landscape of the data (e.g., confounded signals between the two groups of samples, correlated expression signals corresponding to a specific group of genes and samples affected by the hidden factors, etc.). The contributed impact of these factors, either acting singly or in consort can induce serious problems in mul- tiple testing of differential expression for the genes. Thus, a number of truly significant genes can pass out unde- tected while many others may be wrongly flagged as positives. The consequence is a severe reduction in power (sensitivity) of the testing procedure accompanied by a substantially high rate of erroneous discoveries. Most of © 2013 Chakraborty et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

svapls: an R package to correct for hidden factors of variability in gene expression studies

  • Upload
    susmita

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

Page 1: svapls: an R package to correct for hidden factors of variability in gene expression studies

Chakraborty et al BMC Bioinformatics 2013 14236httpwwwbiomedcentralcom1471-210514236

SOFTWARE Open Access

svapls an R package to correct for hiddenfactors of variability in gene expression studiesSutirtha Chakraborty Somnath Datta and Susmita Datta

Abstract

Background Hidden variability is a fundamentally important issue in the context of gene expression studiesCollected tissue samples may have a wide variety of hidden effects that may alter their transcriptional landscapesignificantly As a result their actual differential expression pattern can be potentially distorted leading to inaccurateresults from a genome-wide testing for the important transcripts

Results We present an R package svapls that can be used to identify several types of unknown sample-specificsources of heterogeneity in a gene expression study and adjust for them in order to provide a more accurateinference on the original expression pattern of the genes over different varieties of samples The proposed methodimplements Partial Least Squares regression to extract the hidden signals of sample-specific heterogeneity in the dataand uses them to find the genes that are actually correlated with the phenotype of interest We also compare ourpackage with three other popular softwares for testing differential gene expression along with a detailed illustrationon the widely popular Golub dataset Results from the sensitivity analyes on simulated data with widely differenthidden variation patterns reveal the improved detection power of our R package compared to the other softwaresalong with reasonably smaller error rates Application on the real-life dataset exhibits the efficacy of the R package indetecting potential batch effects from the dataset

Conclusions Overall Our R package provides the user with a simplified framework for analyzing gene expressiondata with a wide range of hidden variation patterns and delivering a differential gene expression analysis withsubstantially improved power and accuracyThe R package svapls is freely available at httpcranr-projectorgwebpackagessvaplsindexhtml

BackgroundSeveral types of subjectsample specific factors consti-tute an important but often overlooked source of hiddenvariability in differential gene expression analyses In awide variety of situations these factors are triggered fromcertain specific biological environmental or demographicprofiles of the subjects corresponding to the collected tis-sue samples The latent effects from these hidden factorscan generate spurious signals of heterogeneity that maysignificantly distort the original differential expressionpattern of the genes In this context a simple example isprovided by the widely known batch-effect in microarrayanalyses where subject tissue samples collected in sepa-rate batches can produce an additional effect of residual

Correspondence s0chak10louisvilleeduDepartment of Bioinformatics and Biostatistics University of LouisvilleLouisville KY-40202 USA

variation This effect is still manageable as composition ofthe batches are known prior to analyses But numerousother factors may still exist that are not detectable fromoutside but can potentially affect the subject-specificexpression levels of the genes in different ways They canin turn lead to complex latent expression structures in theentire genomic landscape of the data (eg confoundedsignals between the two groups of samples correlatedexpression signals corresponding to a specific group ofgenes and samples affected by the hidden factors etc)The contributed impact of these factors either actingsingly or in consort can induce serious problems in mul-tiple testing of differential expression for the genes Thusa number of truly significant genes can pass out unde-tected while many others may be wrongly flagged aspositives The consequence is a severe reduction in power(sensitivity) of the testing procedure accompanied by asubstantially high rate of erroneous discoveries Most of

copy 2013 Chakraborty et al licensee BioMed Central Ltd This is an Open Access article distributed under the terms of the CreativeCommons Attribution License (httpcreativecommonsorglicensesby20) which permits unrestricted use distribution andreproduction in any medium provided the original work is properly cited

Chakraborty et al BMC Bioinformatics 2013 14236 Page 2 of 7httpwwwbiomedcentralcom1471-210514236

available softwares for differential gene expression anal-yses either overlook this broadly general issue of hid-den variability or consider simple parametric regressionapproaches (linear regression mixed effects models etc)to address the maladies of residual heterogeneity How-ever the complexity of problem necessitates the develop-ment of a more generalized and efficient technique thatcan identify these latent effects of variation in the dataand adjust for them in order to deliver a more powerfuland accurate inference on the actual expression patternof the genes This motivated us to construct a method-ology [1] that provides an unified framework for han-dling these widely different types of spurious variabilityin the dataWe have built an R software svapls that uses the

multivariate Non-Linear Iterative Partial Least Squares(NIPALS) algorithm [2] to extract the latent unwantedeffects of variation in a gene expression data and usesthem to build an optimal ANCOVA model for detectingthe truly differentially expressed genes between two typesof samplestissues In the next section we describe theimportant functions in our package along with illustra-tive examples that explain their practical usage in detailThe following section lsquoComparative evaluation with otheravailable software packagesrsquo demonstrates its compara-tively superior performance with respect to three otherpopular softwares sam [3] limma [4] and sva [5] througha sensitivity analysis of two simulated differential geneexpression datasets affected by complicated hidden vari-ation patterns Section lsquoApplication on the Golub datarsquoelucidates an application on a real-life dataset that provesthe worth of our software through the adjustment forbatch effects and detection of some additional phenotype-related genes that are deemed to be significant from theirannotations in the literature The manuscript ends with adiscussion in Section lsquoDiscussionrsquo

ImplementationBrief overview of the packageThis R package consists of the three primary func-tions fitModel svpls and hfp Below we give a briefoutline of them The function applications are demon-strated on a simulated dataset affected by hidden variation(hidden_facdat) that is inbuilt as a part of the R package

bull The first function fitModel fits an ANCOVA modelto the original log-transformed gene expression datawith a certain number of PLS scores as surrogatevariables (specified by nsurr) or the simple ANOVAmodel [67] if no surrogate variables are specifiedThis function provides an user with the flexibility ofestimating the actual gene-variety interaction effectsfrom a certain ANCOVA model with a specificchoice on the number of surrogate variables which

can be selected depending on the complexity of thesituation under study

gt data(hidden_facdat)

gt

gt Fitting an ANCOVA model with

5 surrogate variables

gt fit lt- fitModel(1010hidden_facdat

nsurr = 5)

gt print(fit) Estimated coefficients of

the surrogate variables

[1] 00425701446 00134271227

00012466815 00041702000

-00007253327

Estimated Mean Squared Error of the

fitted model [1] 9053331

AIC value of the fitted model

[1] 5179102

bull The second function svpls calls the first functionfitModel to fit a number of ANCOVA models(specified by pmax) to the data and selects theoptimal model as the one with the minimum value ofthe Akaikersquos Information Criterion (AIC) [8] Thismodel is then used to predict the actual pattern ofdifferential expression of the genes over the twosample varieties by performing a multiple hypothesistesting at specified value of the false discovery rate(FDR) [9] (specified by fdr)

gt Fitting the optimal ANCOVA model

to the data gives

gt fit lt- svpls(1010hidden_facdat

pmax = 5fdr = 005)

gt

gt The optimal ANCOVA model its AIC

value and the positive genes detected

gt from it are given by

gt fit$optmodel [1] 5

gt

gt fit$AICopt [1] 5178912

gt

gt fit$genes

[1] g31 g38 g42 g43 g65

g33 g57 g54 g30 g34

[11] g25 g29 g41 g61 g68

g51 g62 g50 g55 g46

[21] g52 g53 g63 g60 g28

g69 g24 g59 g40 g66

[31] g21 g44 g27 g26 g37

g45 g48 g23 g39 g67

Chakraborty et al BMC Bioinformatics 2013 14236 Page 3 of 7httpwwwbiomedcentralcom1471-210514236

[41] g36 g56 g49 g14 g47

g64 g35 g1 g70 g6

[51] g4 g455 g58 g12 g8

g13 g32 g7 g10 g3

[61] g18 g22 g11 g184

gt

gt The corrected gene

expression matrix obtained after

removing the effects of

the hidden variability is given by

gt Ycorrected lt- fit$Ycorr

gt pvaladj lt- fit$pvaluesadj

While the Benjamini-Hochberg correction is used bydefault in our R package the p-values returned by thesvpls object provides an user with the flexibility ofapplying several other FDR controlling techniquesand also peforming the more specifically targetedgene set enrichment analysesWe compute p-values from a differential testing ofthe genes with the estimated effects from standardANOVA and the optimal ANCOVA model selectedby our R package A side-by-side plot of theircorresponding histograms clearly demostrate theefficacy of the function svpls in our package in termsof the proximity of the set of larger p-values towardsthe uniform distribution (Figure 1)

Figure 1 Histograms of the unadjusted and adjusted p-valuesThis figure exhibits two histograms from an analysis of the datahidden_facdat one for the unadjusted p-values for testing thevariety-based differential gene expression (found from the standardANOVA model) and the other corresponding to the adjusted p-valuesobtained after correcting the hidden variability in the data by our Rpackage svapls

bull The third function hfp produces a heatmap for thePLS-imputed estimate of the residual expressionheterogeneity corresponding to an user-specified setof genes and samples (specified by gen and indrespectively) This enables us to understand howintensely the latent factors from a certain set ofsubjects affect the true expression levels of a specifiedset of genes

gt Specifying the set

of genes and subjects

gt gen lt- c(1205570100150250450)

gt sub lt- c(1471011151720)

gt

gt genes lt- paste(ggensep=)

gt subjects lt- paste(Ssubsep=)

gt

gt hfp(resgenessubjectshidden_facdat)

This produces a plot revealing the way the hiddenvariable affects the expression pattern of the selectedgroup of genes over the specified subjects (Figure 2)Clearly we can observe a substantial difference in theexpression variability caused by the latent factor forsubjects and the rest specified under the selectedgroup

Comparative evaluation with other available softwarepackagesIn this section we illustrate the application of the R pack-age along with the other three popular software packagesthrough a family of simlulation analyses conducted withtwo sample sizes 20 and 40 under three different val-ues of the noise-to-signal ratio (η = 005 01 and 015)controlling the relative intensity of the random error andprimary signal variances from low to high [1] In eachsimulation study we generate correlated expression mea-surements on 1000 genes over k subjects (k = 20 40)classified equally into two groups 1 and 2 Overall weconsider two different settings (1) The genes are affectedby a highly complex subject-specific confounder (mixtureof two normal random variables) with a small varianceand (2) The genes are affected by two widely differentsubject-specific confounders (one mixture of two normalrandom variables and another mixture of two exponen-tial random variables) both with very high variances [1]Under both the settings the first 70 genes are consideredto be truly differentially expressed over the two varietieswhile the rest are chosen as non-significant The sim-ulation study is based on computation of the averagevalues of two right decision indicators (sensitivity speci-ficity) and two wrong detection indicators (false discoveryrate and false non-discovery rate) evaluated from 1000

Chakraborty et al BMC Bioinformatics 2013 14236 Page 4 of 7httpwwwbiomedcentralcom1471-210514236

Figure 2 Heatmap showing the hidden variability in the data hidden_facdat owing to the specified set of subjects and genes

Monte-Carlo replications (Tables 1 and 2 for setting 1and 3 4 for setting 2) The obtained results from the twosimulation settings clearly reveal the superior sensitivityof svapls compared to other three R packages sam limmaand sva under most of the combinations of group sizeand noise-to-signal ratios This illustrates the efficiencyof our R package in discovering a higher proportion ofthe truly significant genes compared to the existing soft-ware packages The sensitivity of sam is comparable to ourmethod for a higher sample size under setting 1 (Tables 1and 2) and is very close or marginally better under setting2 (Tables 3 and 4) but is adversely impacted by its signifi-cantly large false discovery rate Specifically the sensitivityobtained from our R package becomes almost similar orslightly better than sam as the group size is increased from10 to 20 (Table 4) Moreover the specificity rate is the bestfor svapls closely followed by sva while sam and limma areless efficient in this context In addition the average errorrates FDR and FNR are much lower for svapls comparedto the other three software packages Thus overall our Rpackage is capable of discovering the truly differentiallyexpressed genes with more power along with an efficientcontrol over the wrong detections (non-detections)

Application on the Golub dataNow we explore the performance of svapls on thepre-processed ALLAML dataset [1011] It containsthe log-transformed expression levels of 7129 genes

Table 1 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02287 06276 04285 02089

SAM 09239 06066 07278 00125

SVA 03311 09987 00475 00456

SVAPLS 09464 09998 00023 00039

η = 010

LIMMA 02307 06566 03724 02295

SAM 08880 06410 06596 00147

SVA 02882 09988 00481 00469

SVAPLS 09098 09994 00076 00065

η = 015

LIMMA 01956 06672 03689 02164

SAM 08522 06709 06140 00193

SVA 02474 09990 00458 00485

SVAPLS 08660 09991 00130 00097

Chakraborty et al BMC Bioinformatics 2013 14236 Page 5 of 7httpwwwbiomedcentralcom1471-210514236

Table 2 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 07863 02283 07719 05037

SAM 09793 05773 07724 00033

SVA 05659 09977 00475 00311

SVAPLS 09954 09998 00026 00003

η = 010

LIMMA 07577 02479 07432 05564

SAM 09854 06215 07055 00020

SVA 05695 09978 00471 00309

SVAPLS 09897 09994 00083 00008

η = 015

LIMMA 07307 02389 07464 05865

SAM 09816 06448 06609 00023

SVA 05393 09980 00443 00331

SVAPLS 09830 09990 00131 00012

Table 3 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02111 05367 05328 02818

SAM 06290 05879 07474 00625

SVA 00445 09998 00405 00164

SVAPLS 06408 09998 00029 00200

η = 010

LIMMA 01953 05545 04871 02942

SAM 05580 06079 07342 00606

SVA 00239 09998 00514 00148

SVAPLS 05125 09996 00077 00263

η = 015

LIMMA 01900 05597 04838 03039

SAM 05412 06126 07363 00600

SVA 00190 09999 00395 00148

SVAPLS 04544 09996 00112 00297

Table 4 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FDR

η = 005

LIMMA 05632 01690 08399 06629

SAM 07845 06199 07158 00312

SVA 01651 09993 00469 00329

SVAPLS 08324 09998 00024 00111

η = 010

LIMMA 05637 01640 08333 06967

SAM 07378 06507 06471 00342

SVA 01428 09994 00461 00331

SVAPLS 07535 09995 00071 00165

η = 015

LIMMA 05660 01619 08315 07183

SAM 06983 06553 06481 00372

SVA 01082 09995 00506 00350

SVAPLS 06806 09995 00095 00211

over-two groups of patients 47 having Acute Lym-phoblastic Leukemia (ALL) and 25 suffering from AcuteMyeloid Leukemia (AML) The patient tissue sampleswere obtained from the following four sources (1) Dana-Farber Cancer Institute (DFCI) (2) St-Judersquos ChildrenrsquosResearch Hospital (ST-Jude) (3) Cancer and LeukemiaGroup B (CALGB) and (4) Childrenrsquos Cancer Study Group(CCG) This inherent classification in the data can poten-tially generate significant batch effects that may distortthe original expression pattern of the genes This moti-vated the implementation of our R package on this datasetThe corrected expression matrix for the first 1000 genesreturned from the use of the svpls function on thisdata demonstrates that the batch effects due to variabil-ity in the sample sources have been removed effectivelyThe haphazard distribution of the samples from the foursources in the corrected gene expression matrix wipesout the additonal effects owing to the observed batch-specific clustering in the original data In this contextsvapls fares equally well compared to another popular Rpackage ber for removing batch effects in microarray data[12] (Figure 3)Overall limma detects 7128 genes followed by 3307

genes from sam 1015 genes from our svapls and 412genes from sva A Venn diagram (Figure 4) representsthe extent of overlap between the genes detected by thefour softwares Specifically limma detects all the genesthat are found to be significant from the other three

Chakraborty et al BMC Bioinformatics 2013 14236 Page 6 of 7httpwwwbiomedcentralcom1471-210514236

Figure 3 (a) (b) (c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtainedby use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our R package svapls

softwares This may be attributable to its high false dis-covery rate (FDR) as was observed in the simulationstudy Interestingly svapls detected 24 genes that aremissed by both sam as well as sva Among them thegenes CD74 TNFRSF1A LCN2 and GSN deserve spe-cial mention All these genes are either related to sometype of cancer or regulate cell growth(or apoptosis) CD74plays an important role in multiple myeloma and itshigher expression induces tumor cell malignancy [13]

Figure 4 A Venn-diagram showing the overlap pattern of thegenes detected to be significant from the Golub data by svaplssva sam and limma

An isoform of the tumor necrosis factor TNFRSF1A isassociated with the development of Acute Lymphoblas-tic Leukemia (ALL) in children [14] Specifically LCN2has been found to be connected with Acute Myel-ogenous Leukemia (AML) [15] GSN plays a signifi-cant role of suppressing tumorigenicity in lung cancer[16] and has a diminuted expression in bladder cancercells [17]

DiscussionVarious hidden sources of variation are found to exist in agene expression data that cannot be removed by the stan-dard normalization procedures But their effect may besubstantial enough to change the expression pattern of thegenes over two different varieties of samples The imme-diate consequence is a large reduction in the detectionpower of the testing procedure employed to find the trulysignificant genes followed by highly elevated error ratesIn this project we discuss the development and usage ofan R package svapls that can tackle a wide variety of hid-den effects in a gene expression analysis and can delivera more accurate inference on the differential expressionvariability of the genes between two groups of samples(tissues) We illustrate the superior performance of ourR package in comparison to three other popular soft-wares available for differential gene expression analysesThe high detection power (sensitivity) of svapls along with

Chakraborty et al BMC Bioinformatics 2013 14236 Page 7 of 7httpwwwbiomedcentralcom1471-210514236

the reasonably small error rates provides it a significantlybetter edge over the competing softwares Specificallysva is outperformed by our package in terms of the sen-sitivity (power) while sam comes close and performsmarginally better in some cases although its competenceis severely marred by the considerably high false discov-ery rate (FDR) and substantially low specificity rate Inaddition the graphical representation of the hidden vari-ation (by the function hfp) from our package enablesthe user to understand the pattern in which the hiddensources of variability affect the expression signals of anyspecified subset of genes over a selected group of sub-jectssamples This paves the way to more sophisticatedanalyses of subject-set specific gene expression variabil-ity in the data Application of our package on the Golubdata demonstrates its efficacy in removing the significantbatch effects from the collectedanalyzed samples More-over our package detects four additional genes (missed byboth sva and sam) that have been found to be connectedto Leukemia or some other type of cancer

ConclusionsThe R package svapls can detect a wide variety of hid-den factors in a gene expression study and adjust for themappropriately in order to provide a more accurate infer-ence on the expression pattern of the genes between twodifferent types of tissues In particular the superior detec-tion power and small error rate gives our R package asubstantially better edge over the competing softwaresconsidered in the analysis

Availability and requirementsProject nameR packagesvaplsProject home pagehttpcranrprojectorgwebpackagessvaplsindexhtmlOperating system and R versionThe R package is platform independent and is compatiblewith all the versions of R same as or higher than 20LicenseGPL-3

Competing interestsThe authors declare that they have no competing interests

Authorsrsquo contributionsSC wrote the R package including its evaluation and drafted the manuscriptSD and SD contributed to the structure of the simulation studies applicationand planning of the manuscript All authors read and approved the finalmanuscript

AcknowledgementsWe sincerely thank the editor and the two reviewers for their constructivecomments that lead to an improved manuscript This research was part of SCrsquosdoctoral dissertation work SC acknowledges generous support and adissertation completion award by the School of Interdisciplinary and Graduate

Studies of the University of Louisville We also acknowledge partial support bythe Department of Bioinformatics and Biostatistics of University of Louisvilletoward the processing charges This research work was partially supported byNIH grants CA133844 (Su Datta) and CA170091 minus 01A1 (Su Datta)

Received 15 April 2013 Accepted 16 July 2013Published 24 July 2013

References1 Chakraborty S Datta S Datta S Surrogate variable analysis using

partial least squares in gene expression studies Bioinformatics 201228(6)799ndash806

2 Geladi P Kowalski B Partial least squares regression a tutorialAnalytica Chimica Acta 1986 1851ndash17

3 Tusher V Tibshirani R Chu G Significance analysis of microarraysapplied to the ionizing radiation response Proc Natl Acad Sci 2001985116ndash5121

4 Smith GK Limma linear models for microarray data In Bioinformaticsand Computational Biology Solutions using R and Bioconductor Edited byGentleman R Carey V Dudoit S Irizarry R Huber W New York Springer2002

5 Leek ST et al The sva package for removing batch effects and otherunwanted variation in high-throughput experiments Bioinformatics2012 28(6)882ndash883

6 Kerr MK et al Analysis of variance for gene expression microarraydata J Comput Biol 2000 7819ndash837

7 Kerr MK et al Statistical analysis of a gene expression microarraydata Stat Sinica 2002 12(1)203ndash217

8 Hirotsugu A Likelihood and the Bayes procedure In Bayesian StatisticsEdited by Bernardo JM De Groot MH Lindley DV Smith AFM ValenciaSpain University Press 1980143ndash203

9 Benjamini Y Hochberg Y Controlling the false discovery rate apractical and powerful approach tomultiple testing J R Stat Soc 199557(1)289ndash300

10 Golub T et alMolecular classification of cancer class discovery andclass prediction by gene expression monitoring Science 1999286531ndash537

11 Dudoit S Friedlyand J Speed TP Comparison of discriminationmethods for the classification of tumors using gene expressiondata J Am Stat Assoc 2002 97(457)77ndash87

12 Giordan M A two-stage procedure for the removal of batch effects inmicroarray studies Stat Biosci 2013 doi101007s12561-013-9081-1

13 Burton JE CD74 is expressed by multiple myeloma and is apromising target for therapy Clin Cancer Res 2004 10(19)6606ndash6611

14 Wu S Levels of the soluble 55-kilodalton isoform of tumor necrosisfactor receptor in bonemarrow are correlated with the clinicaloutcome of children with acute lymphoblastic leukemia in firstrecurrence Cancer 2003 98(3)625ndash631

15 Shimada H et al Potential involvement of the AML1-MTG8 fusionprotein in the granulocytic maturation characteristic of the t(821)acute myelogenous leukemia revealed by microarray analysisLeukemia 2002 16(5)874ndash885

16 Sagawa N et al Gelsolin supresses tumorigenicity through inhibitingPKC activation in human lung cancer cell line PC10 Br J Cancer 200388(4)606ndash612

17 Haga K The mechanism for reduced expression of gelsolin tumorsupressor protein in bladder cancer Hokkaido Igaku Zasshi 200378(1)29ndash37

doi1011861471-2105-14-236Cite this article as Chakraborty et al svapls an R package to correct forhidden factors of variability in gene expression studies BMC Bioinformatics2013 14236

  • Abstract
    • Background
    • Results
    • Conclusions
      • Background
      • Implementation
        • Brief overview of the package
        • Comparative evaluation with other available software packages
        • Application on the Golub data
          • Discussion
          • Conclusions
          • Availability and requirements
            • Project name
            • Project home page
            • Operating system and R version
            • License
              • Competing interests
              • Authors contributions
              • Acknowledgements
              • References
Page 2: svapls: an R package to correct for hidden factors of variability in gene expression studies

Chakraborty et al BMC Bioinformatics 2013 14236 Page 2 of 7httpwwwbiomedcentralcom1471-210514236

available softwares for differential gene expression anal-yses either overlook this broadly general issue of hid-den variability or consider simple parametric regressionapproaches (linear regression mixed effects models etc)to address the maladies of residual heterogeneity How-ever the complexity of problem necessitates the develop-ment of a more generalized and efficient technique thatcan identify these latent effects of variation in the dataand adjust for them in order to deliver a more powerfuland accurate inference on the actual expression patternof the genes This motivated us to construct a method-ology [1] that provides an unified framework for han-dling these widely different types of spurious variabilityin the dataWe have built an R software svapls that uses the

multivariate Non-Linear Iterative Partial Least Squares(NIPALS) algorithm [2] to extract the latent unwantedeffects of variation in a gene expression data and usesthem to build an optimal ANCOVA model for detectingthe truly differentially expressed genes between two typesof samplestissues In the next section we describe theimportant functions in our package along with illustra-tive examples that explain their practical usage in detailThe following section lsquoComparative evaluation with otheravailable software packagesrsquo demonstrates its compara-tively superior performance with respect to three otherpopular softwares sam [3] limma [4] and sva [5] througha sensitivity analysis of two simulated differential geneexpression datasets affected by complicated hidden vari-ation patterns Section lsquoApplication on the Golub datarsquoelucidates an application on a real-life dataset that provesthe worth of our software through the adjustment forbatch effects and detection of some additional phenotype-related genes that are deemed to be significant from theirannotations in the literature The manuscript ends with adiscussion in Section lsquoDiscussionrsquo

ImplementationBrief overview of the packageThis R package consists of the three primary func-tions fitModel svpls and hfp Below we give a briefoutline of them The function applications are demon-strated on a simulated dataset affected by hidden variation(hidden_facdat) that is inbuilt as a part of the R package

bull The first function fitModel fits an ANCOVA modelto the original log-transformed gene expression datawith a certain number of PLS scores as surrogatevariables (specified by nsurr) or the simple ANOVAmodel [67] if no surrogate variables are specifiedThis function provides an user with the flexibility ofestimating the actual gene-variety interaction effectsfrom a certain ANCOVA model with a specificchoice on the number of surrogate variables which

can be selected depending on the complexity of thesituation under study

gt data(hidden_facdat)

gt

gt Fitting an ANCOVA model with

5 surrogate variables

gt fit lt- fitModel(1010hidden_facdat

nsurr = 5)

gt print(fit) Estimated coefficients of

the surrogate variables

[1] 00425701446 00134271227

00012466815 00041702000

-00007253327

Estimated Mean Squared Error of the

fitted model [1] 9053331

AIC value of the fitted model

[1] 5179102

bull The second function svpls calls the first functionfitModel to fit a number of ANCOVA models(specified by pmax) to the data and selects theoptimal model as the one with the minimum value ofthe Akaikersquos Information Criterion (AIC) [8] Thismodel is then used to predict the actual pattern ofdifferential expression of the genes over the twosample varieties by performing a multiple hypothesistesting at specified value of the false discovery rate(FDR) [9] (specified by fdr)

gt Fitting the optimal ANCOVA model

to the data gives

gt fit lt- svpls(1010hidden_facdat

pmax = 5fdr = 005)

gt

gt The optimal ANCOVA model its AIC

value and the positive genes detected

gt from it are given by

gt fit$optmodel [1] 5

gt

gt fit$AICopt [1] 5178912

gt

gt fit$genes

[1] g31 g38 g42 g43 g65

g33 g57 g54 g30 g34

[11] g25 g29 g41 g61 g68

g51 g62 g50 g55 g46

[21] g52 g53 g63 g60 g28

g69 g24 g59 g40 g66

[31] g21 g44 g27 g26 g37

g45 g48 g23 g39 g67

Chakraborty et al BMC Bioinformatics 2013 14236 Page 3 of 7httpwwwbiomedcentralcom1471-210514236

[41] g36 g56 g49 g14 g47

g64 g35 g1 g70 g6

[51] g4 g455 g58 g12 g8

g13 g32 g7 g10 g3

[61] g18 g22 g11 g184

gt

gt The corrected gene

expression matrix obtained after

removing the effects of

the hidden variability is given by

gt Ycorrected lt- fit$Ycorr

gt pvaladj lt- fit$pvaluesadj

While the Benjamini-Hochberg correction is used bydefault in our R package the p-values returned by thesvpls object provides an user with the flexibility ofapplying several other FDR controlling techniquesand also peforming the more specifically targetedgene set enrichment analysesWe compute p-values from a differential testing ofthe genes with the estimated effects from standardANOVA and the optimal ANCOVA model selectedby our R package A side-by-side plot of theircorresponding histograms clearly demostrate theefficacy of the function svpls in our package in termsof the proximity of the set of larger p-values towardsthe uniform distribution (Figure 1)

Figure 1 Histograms of the unadjusted and adjusted p-valuesThis figure exhibits two histograms from an analysis of the datahidden_facdat one for the unadjusted p-values for testing thevariety-based differential gene expression (found from the standardANOVA model) and the other corresponding to the adjusted p-valuesobtained after correcting the hidden variability in the data by our Rpackage svapls

bull The third function hfp produces a heatmap for thePLS-imputed estimate of the residual expressionheterogeneity corresponding to an user-specified setof genes and samples (specified by gen and indrespectively) This enables us to understand howintensely the latent factors from a certain set ofsubjects affect the true expression levels of a specifiedset of genes

gt Specifying the set

of genes and subjects

gt gen lt- c(1205570100150250450)

gt sub lt- c(1471011151720)

gt

gt genes lt- paste(ggensep=)

gt subjects lt- paste(Ssubsep=)

gt

gt hfp(resgenessubjectshidden_facdat)

This produces a plot revealing the way the hiddenvariable affects the expression pattern of the selectedgroup of genes over the specified subjects (Figure 2)Clearly we can observe a substantial difference in theexpression variability caused by the latent factor forsubjects and the rest specified under the selectedgroup

Comparative evaluation with other available softwarepackagesIn this section we illustrate the application of the R pack-age along with the other three popular software packagesthrough a family of simlulation analyses conducted withtwo sample sizes 20 and 40 under three different val-ues of the noise-to-signal ratio (η = 005 01 and 015)controlling the relative intensity of the random error andprimary signal variances from low to high [1] In eachsimulation study we generate correlated expression mea-surements on 1000 genes over k subjects (k = 20 40)classified equally into two groups 1 and 2 Overall weconsider two different settings (1) The genes are affectedby a highly complex subject-specific confounder (mixtureof two normal random variables) with a small varianceand (2) The genes are affected by two widely differentsubject-specific confounders (one mixture of two normalrandom variables and another mixture of two exponen-tial random variables) both with very high variances [1]Under both the settings the first 70 genes are consideredto be truly differentially expressed over the two varietieswhile the rest are chosen as non-significant The sim-ulation study is based on computation of the averagevalues of two right decision indicators (sensitivity speci-ficity) and two wrong detection indicators (false discoveryrate and false non-discovery rate) evaluated from 1000

Chakraborty et al BMC Bioinformatics 2013 14236 Page 4 of 7httpwwwbiomedcentralcom1471-210514236

Figure 2 Heatmap showing the hidden variability in the data hidden_facdat owing to the specified set of subjects and genes

Monte-Carlo replications (Tables 1 and 2 for setting 1and 3 4 for setting 2) The obtained results from the twosimulation settings clearly reveal the superior sensitivityof svapls compared to other three R packages sam limmaand sva under most of the combinations of group sizeand noise-to-signal ratios This illustrates the efficiencyof our R package in discovering a higher proportion ofthe truly significant genes compared to the existing soft-ware packages The sensitivity of sam is comparable to ourmethod for a higher sample size under setting 1 (Tables 1and 2) and is very close or marginally better under setting2 (Tables 3 and 4) but is adversely impacted by its signifi-cantly large false discovery rate Specifically the sensitivityobtained from our R package becomes almost similar orslightly better than sam as the group size is increased from10 to 20 (Table 4) Moreover the specificity rate is the bestfor svapls closely followed by sva while sam and limma areless efficient in this context In addition the average errorrates FDR and FNR are much lower for svapls comparedto the other three software packages Thus overall our Rpackage is capable of discovering the truly differentiallyexpressed genes with more power along with an efficientcontrol over the wrong detections (non-detections)

Application on the Golub dataNow we explore the performance of svapls on thepre-processed ALLAML dataset [1011] It containsthe log-transformed expression levels of 7129 genes

Table 1 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02287 06276 04285 02089

SAM 09239 06066 07278 00125

SVA 03311 09987 00475 00456

SVAPLS 09464 09998 00023 00039

η = 010

LIMMA 02307 06566 03724 02295

SAM 08880 06410 06596 00147

SVA 02882 09988 00481 00469

SVAPLS 09098 09994 00076 00065

η = 015

LIMMA 01956 06672 03689 02164

SAM 08522 06709 06140 00193

SVA 02474 09990 00458 00485

SVAPLS 08660 09991 00130 00097

Chakraborty et al BMC Bioinformatics 2013 14236 Page 5 of 7httpwwwbiomedcentralcom1471-210514236

Table 2 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 07863 02283 07719 05037

SAM 09793 05773 07724 00033

SVA 05659 09977 00475 00311

SVAPLS 09954 09998 00026 00003

η = 010

LIMMA 07577 02479 07432 05564

SAM 09854 06215 07055 00020

SVA 05695 09978 00471 00309

SVAPLS 09897 09994 00083 00008

η = 015

LIMMA 07307 02389 07464 05865

SAM 09816 06448 06609 00023

SVA 05393 09980 00443 00331

SVAPLS 09830 09990 00131 00012

Table 3 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02111 05367 05328 02818

SAM 06290 05879 07474 00625

SVA 00445 09998 00405 00164

SVAPLS 06408 09998 00029 00200

η = 010

LIMMA 01953 05545 04871 02942

SAM 05580 06079 07342 00606

SVA 00239 09998 00514 00148

SVAPLS 05125 09996 00077 00263

η = 015

LIMMA 01900 05597 04838 03039

SAM 05412 06126 07363 00600

SVA 00190 09999 00395 00148

SVAPLS 04544 09996 00112 00297

Table 4 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FDR

η = 005

LIMMA 05632 01690 08399 06629

SAM 07845 06199 07158 00312

SVA 01651 09993 00469 00329

SVAPLS 08324 09998 00024 00111

η = 010

LIMMA 05637 01640 08333 06967

SAM 07378 06507 06471 00342

SVA 01428 09994 00461 00331

SVAPLS 07535 09995 00071 00165

η = 015

LIMMA 05660 01619 08315 07183

SAM 06983 06553 06481 00372

SVA 01082 09995 00506 00350

SVAPLS 06806 09995 00095 00211

over-two groups of patients 47 having Acute Lym-phoblastic Leukemia (ALL) and 25 suffering from AcuteMyeloid Leukemia (AML) The patient tissue sampleswere obtained from the following four sources (1) Dana-Farber Cancer Institute (DFCI) (2) St-Judersquos ChildrenrsquosResearch Hospital (ST-Jude) (3) Cancer and LeukemiaGroup B (CALGB) and (4) Childrenrsquos Cancer Study Group(CCG) This inherent classification in the data can poten-tially generate significant batch effects that may distortthe original expression pattern of the genes This moti-vated the implementation of our R package on this datasetThe corrected expression matrix for the first 1000 genesreturned from the use of the svpls function on thisdata demonstrates that the batch effects due to variabil-ity in the sample sources have been removed effectivelyThe haphazard distribution of the samples from the foursources in the corrected gene expression matrix wipesout the additonal effects owing to the observed batch-specific clustering in the original data In this contextsvapls fares equally well compared to another popular Rpackage ber for removing batch effects in microarray data[12] (Figure 3)Overall limma detects 7128 genes followed by 3307

genes from sam 1015 genes from our svapls and 412genes from sva A Venn diagram (Figure 4) representsthe extent of overlap between the genes detected by thefour softwares Specifically limma detects all the genesthat are found to be significant from the other three

Chakraborty et al BMC Bioinformatics 2013 14236 Page 6 of 7httpwwwbiomedcentralcom1471-210514236

Figure 3 (a) (b) (c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtainedby use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our R package svapls

softwares This may be attributable to its high false dis-covery rate (FDR) as was observed in the simulationstudy Interestingly svapls detected 24 genes that aremissed by both sam as well as sva Among them thegenes CD74 TNFRSF1A LCN2 and GSN deserve spe-cial mention All these genes are either related to sometype of cancer or regulate cell growth(or apoptosis) CD74plays an important role in multiple myeloma and itshigher expression induces tumor cell malignancy [13]

Figure 4 A Venn-diagram showing the overlap pattern of thegenes detected to be significant from the Golub data by svaplssva sam and limma

An isoform of the tumor necrosis factor TNFRSF1A isassociated with the development of Acute Lymphoblas-tic Leukemia (ALL) in children [14] Specifically LCN2has been found to be connected with Acute Myel-ogenous Leukemia (AML) [15] GSN plays a signifi-cant role of suppressing tumorigenicity in lung cancer[16] and has a diminuted expression in bladder cancercells [17]

DiscussionVarious hidden sources of variation are found to exist in agene expression data that cannot be removed by the stan-dard normalization procedures But their effect may besubstantial enough to change the expression pattern of thegenes over two different varieties of samples The imme-diate consequence is a large reduction in the detectionpower of the testing procedure employed to find the trulysignificant genes followed by highly elevated error ratesIn this project we discuss the development and usage ofan R package svapls that can tackle a wide variety of hid-den effects in a gene expression analysis and can delivera more accurate inference on the differential expressionvariability of the genes between two groups of samples(tissues) We illustrate the superior performance of ourR package in comparison to three other popular soft-wares available for differential gene expression analysesThe high detection power (sensitivity) of svapls along with

Chakraborty et al BMC Bioinformatics 2013 14236 Page 7 of 7httpwwwbiomedcentralcom1471-210514236

the reasonably small error rates provides it a significantlybetter edge over the competing softwares Specificallysva is outperformed by our package in terms of the sen-sitivity (power) while sam comes close and performsmarginally better in some cases although its competenceis severely marred by the considerably high false discov-ery rate (FDR) and substantially low specificity rate Inaddition the graphical representation of the hidden vari-ation (by the function hfp) from our package enablesthe user to understand the pattern in which the hiddensources of variability affect the expression signals of anyspecified subset of genes over a selected group of sub-jectssamples This paves the way to more sophisticatedanalyses of subject-set specific gene expression variabil-ity in the data Application of our package on the Golubdata demonstrates its efficacy in removing the significantbatch effects from the collectedanalyzed samples More-over our package detects four additional genes (missed byboth sva and sam) that have been found to be connectedto Leukemia or some other type of cancer

ConclusionsThe R package svapls can detect a wide variety of hid-den factors in a gene expression study and adjust for themappropriately in order to provide a more accurate infer-ence on the expression pattern of the genes between twodifferent types of tissues In particular the superior detec-tion power and small error rate gives our R package asubstantially better edge over the competing softwaresconsidered in the analysis

Availability and requirementsProject nameR packagesvaplsProject home pagehttpcranrprojectorgwebpackagessvaplsindexhtmlOperating system and R versionThe R package is platform independent and is compatiblewith all the versions of R same as or higher than 20LicenseGPL-3

Competing interestsThe authors declare that they have no competing interests

Authorsrsquo contributionsSC wrote the R package including its evaluation and drafted the manuscriptSD and SD contributed to the structure of the simulation studies applicationand planning of the manuscript All authors read and approved the finalmanuscript

AcknowledgementsWe sincerely thank the editor and the two reviewers for their constructivecomments that lead to an improved manuscript This research was part of SCrsquosdoctoral dissertation work SC acknowledges generous support and adissertation completion award by the School of Interdisciplinary and Graduate

Studies of the University of Louisville We also acknowledge partial support bythe Department of Bioinformatics and Biostatistics of University of Louisvilletoward the processing charges This research work was partially supported byNIH grants CA133844 (Su Datta) and CA170091 minus 01A1 (Su Datta)

Received 15 April 2013 Accepted 16 July 2013Published 24 July 2013

References1 Chakraborty S Datta S Datta S Surrogate variable analysis using

partial least squares in gene expression studies Bioinformatics 201228(6)799ndash806

2 Geladi P Kowalski B Partial least squares regression a tutorialAnalytica Chimica Acta 1986 1851ndash17

3 Tusher V Tibshirani R Chu G Significance analysis of microarraysapplied to the ionizing radiation response Proc Natl Acad Sci 2001985116ndash5121

4 Smith GK Limma linear models for microarray data In Bioinformaticsand Computational Biology Solutions using R and Bioconductor Edited byGentleman R Carey V Dudoit S Irizarry R Huber W New York Springer2002

5 Leek ST et al The sva package for removing batch effects and otherunwanted variation in high-throughput experiments Bioinformatics2012 28(6)882ndash883

6 Kerr MK et al Analysis of variance for gene expression microarraydata J Comput Biol 2000 7819ndash837

7 Kerr MK et al Statistical analysis of a gene expression microarraydata Stat Sinica 2002 12(1)203ndash217

8 Hirotsugu A Likelihood and the Bayes procedure In Bayesian StatisticsEdited by Bernardo JM De Groot MH Lindley DV Smith AFM ValenciaSpain University Press 1980143ndash203

9 Benjamini Y Hochberg Y Controlling the false discovery rate apractical and powerful approach tomultiple testing J R Stat Soc 199557(1)289ndash300

10 Golub T et alMolecular classification of cancer class discovery andclass prediction by gene expression monitoring Science 1999286531ndash537

11 Dudoit S Friedlyand J Speed TP Comparison of discriminationmethods for the classification of tumors using gene expressiondata J Am Stat Assoc 2002 97(457)77ndash87

12 Giordan M A two-stage procedure for the removal of batch effects inmicroarray studies Stat Biosci 2013 doi101007s12561-013-9081-1

13 Burton JE CD74 is expressed by multiple myeloma and is apromising target for therapy Clin Cancer Res 2004 10(19)6606ndash6611

14 Wu S Levels of the soluble 55-kilodalton isoform of tumor necrosisfactor receptor in bonemarrow are correlated with the clinicaloutcome of children with acute lymphoblastic leukemia in firstrecurrence Cancer 2003 98(3)625ndash631

15 Shimada H et al Potential involvement of the AML1-MTG8 fusionprotein in the granulocytic maturation characteristic of the t(821)acute myelogenous leukemia revealed by microarray analysisLeukemia 2002 16(5)874ndash885

16 Sagawa N et al Gelsolin supresses tumorigenicity through inhibitingPKC activation in human lung cancer cell line PC10 Br J Cancer 200388(4)606ndash612

17 Haga K The mechanism for reduced expression of gelsolin tumorsupressor protein in bladder cancer Hokkaido Igaku Zasshi 200378(1)29ndash37

doi1011861471-2105-14-236Cite this article as Chakraborty et al svapls an R package to correct forhidden factors of variability in gene expression studies BMC Bioinformatics2013 14236

  • Abstract
    • Background
    • Results
    • Conclusions
      • Background
      • Implementation
        • Brief overview of the package
        • Comparative evaluation with other available software packages
        • Application on the Golub data
          • Discussion
          • Conclusions
          • Availability and requirements
            • Project name
            • Project home page
            • Operating system and R version
            • License
              • Competing interests
              • Authors contributions
              • Acknowledgements
              • References
Page 3: svapls: an R package to correct for hidden factors of variability in gene expression studies

Chakraborty et al BMC Bioinformatics 2013 14236 Page 3 of 7httpwwwbiomedcentralcom1471-210514236

[41] g36 g56 g49 g14 g47

g64 g35 g1 g70 g6

[51] g4 g455 g58 g12 g8

g13 g32 g7 g10 g3

[61] g18 g22 g11 g184

gt

gt The corrected gene

expression matrix obtained after

removing the effects of

the hidden variability is given by

gt Ycorrected lt- fit$Ycorr

gt pvaladj lt- fit$pvaluesadj

While the Benjamini-Hochberg correction is used bydefault in our R package the p-values returned by thesvpls object provides an user with the flexibility ofapplying several other FDR controlling techniquesand also peforming the more specifically targetedgene set enrichment analysesWe compute p-values from a differential testing ofthe genes with the estimated effects from standardANOVA and the optimal ANCOVA model selectedby our R package A side-by-side plot of theircorresponding histograms clearly demostrate theefficacy of the function svpls in our package in termsof the proximity of the set of larger p-values towardsthe uniform distribution (Figure 1)

Figure 1 Histograms of the unadjusted and adjusted p-valuesThis figure exhibits two histograms from an analysis of the datahidden_facdat one for the unadjusted p-values for testing thevariety-based differential gene expression (found from the standardANOVA model) and the other corresponding to the adjusted p-valuesobtained after correcting the hidden variability in the data by our Rpackage svapls

bull The third function hfp produces a heatmap for thePLS-imputed estimate of the residual expressionheterogeneity corresponding to an user-specified setof genes and samples (specified by gen and indrespectively) This enables us to understand howintensely the latent factors from a certain set ofsubjects affect the true expression levels of a specifiedset of genes

gt Specifying the set

of genes and subjects

gt gen lt- c(1205570100150250450)

gt sub lt- c(1471011151720)

gt

gt genes lt- paste(ggensep=)

gt subjects lt- paste(Ssubsep=)

gt

gt hfp(resgenessubjectshidden_facdat)

This produces a plot revealing the way the hiddenvariable affects the expression pattern of the selectedgroup of genes over the specified subjects (Figure 2)Clearly we can observe a substantial difference in theexpression variability caused by the latent factor forsubjects and the rest specified under the selectedgroup

Comparative evaluation with other available softwarepackagesIn this section we illustrate the application of the R pack-age along with the other three popular software packagesthrough a family of simlulation analyses conducted withtwo sample sizes 20 and 40 under three different val-ues of the noise-to-signal ratio (η = 005 01 and 015)controlling the relative intensity of the random error andprimary signal variances from low to high [1] In eachsimulation study we generate correlated expression mea-surements on 1000 genes over k subjects (k = 20 40)classified equally into two groups 1 and 2 Overall weconsider two different settings (1) The genes are affectedby a highly complex subject-specific confounder (mixtureof two normal random variables) with a small varianceand (2) The genes are affected by two widely differentsubject-specific confounders (one mixture of two normalrandom variables and another mixture of two exponen-tial random variables) both with very high variances [1]Under both the settings the first 70 genes are consideredto be truly differentially expressed over the two varietieswhile the rest are chosen as non-significant The sim-ulation study is based on computation of the averagevalues of two right decision indicators (sensitivity speci-ficity) and two wrong detection indicators (false discoveryrate and false non-discovery rate) evaluated from 1000

Chakraborty et al BMC Bioinformatics 2013 14236 Page 4 of 7httpwwwbiomedcentralcom1471-210514236

Figure 2 Heatmap showing the hidden variability in the data hidden_facdat owing to the specified set of subjects and genes

Monte-Carlo replications (Tables 1 and 2 for setting 1and 3 4 for setting 2) The obtained results from the twosimulation settings clearly reveal the superior sensitivityof svapls compared to other three R packages sam limmaand sva under most of the combinations of group sizeand noise-to-signal ratios This illustrates the efficiencyof our R package in discovering a higher proportion ofthe truly significant genes compared to the existing soft-ware packages The sensitivity of sam is comparable to ourmethod for a higher sample size under setting 1 (Tables 1and 2) and is very close or marginally better under setting2 (Tables 3 and 4) but is adversely impacted by its signifi-cantly large false discovery rate Specifically the sensitivityobtained from our R package becomes almost similar orslightly better than sam as the group size is increased from10 to 20 (Table 4) Moreover the specificity rate is the bestfor svapls closely followed by sva while sam and limma areless efficient in this context In addition the average errorrates FDR and FNR are much lower for svapls comparedto the other three software packages Thus overall our Rpackage is capable of discovering the truly differentiallyexpressed genes with more power along with an efficientcontrol over the wrong detections (non-detections)

Application on the Golub dataNow we explore the performance of svapls on thepre-processed ALLAML dataset [1011] It containsthe log-transformed expression levels of 7129 genes

Table 1 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02287 06276 04285 02089

SAM 09239 06066 07278 00125

SVA 03311 09987 00475 00456

SVAPLS 09464 09998 00023 00039

η = 010

LIMMA 02307 06566 03724 02295

SAM 08880 06410 06596 00147

SVA 02882 09988 00481 00469

SVAPLS 09098 09994 00076 00065

η = 015

LIMMA 01956 06672 03689 02164

SAM 08522 06709 06140 00193

SVA 02474 09990 00458 00485

SVAPLS 08660 09991 00130 00097

Chakraborty et al BMC Bioinformatics 2013 14236 Page 5 of 7httpwwwbiomedcentralcom1471-210514236

Table 2 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 07863 02283 07719 05037

SAM 09793 05773 07724 00033

SVA 05659 09977 00475 00311

SVAPLS 09954 09998 00026 00003

η = 010

LIMMA 07577 02479 07432 05564

SAM 09854 06215 07055 00020

SVA 05695 09978 00471 00309

SVAPLS 09897 09994 00083 00008

η = 015

LIMMA 07307 02389 07464 05865

SAM 09816 06448 06609 00023

SVA 05393 09980 00443 00331

SVAPLS 09830 09990 00131 00012

Table 3 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02111 05367 05328 02818

SAM 06290 05879 07474 00625

SVA 00445 09998 00405 00164

SVAPLS 06408 09998 00029 00200

η = 010

LIMMA 01953 05545 04871 02942

SAM 05580 06079 07342 00606

SVA 00239 09998 00514 00148

SVAPLS 05125 09996 00077 00263

η = 015

LIMMA 01900 05597 04838 03039

SAM 05412 06126 07363 00600

SVA 00190 09999 00395 00148

SVAPLS 04544 09996 00112 00297

Table 4 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FDR

η = 005

LIMMA 05632 01690 08399 06629

SAM 07845 06199 07158 00312

SVA 01651 09993 00469 00329

SVAPLS 08324 09998 00024 00111

η = 010

LIMMA 05637 01640 08333 06967

SAM 07378 06507 06471 00342

SVA 01428 09994 00461 00331

SVAPLS 07535 09995 00071 00165

η = 015

LIMMA 05660 01619 08315 07183

SAM 06983 06553 06481 00372

SVA 01082 09995 00506 00350

SVAPLS 06806 09995 00095 00211

over-two groups of patients 47 having Acute Lym-phoblastic Leukemia (ALL) and 25 suffering from AcuteMyeloid Leukemia (AML) The patient tissue sampleswere obtained from the following four sources (1) Dana-Farber Cancer Institute (DFCI) (2) St-Judersquos ChildrenrsquosResearch Hospital (ST-Jude) (3) Cancer and LeukemiaGroup B (CALGB) and (4) Childrenrsquos Cancer Study Group(CCG) This inherent classification in the data can poten-tially generate significant batch effects that may distortthe original expression pattern of the genes This moti-vated the implementation of our R package on this datasetThe corrected expression matrix for the first 1000 genesreturned from the use of the svpls function on thisdata demonstrates that the batch effects due to variabil-ity in the sample sources have been removed effectivelyThe haphazard distribution of the samples from the foursources in the corrected gene expression matrix wipesout the additonal effects owing to the observed batch-specific clustering in the original data In this contextsvapls fares equally well compared to another popular Rpackage ber for removing batch effects in microarray data[12] (Figure 3)Overall limma detects 7128 genes followed by 3307

genes from sam 1015 genes from our svapls and 412genes from sva A Venn diagram (Figure 4) representsthe extent of overlap between the genes detected by thefour softwares Specifically limma detects all the genesthat are found to be significant from the other three

Chakraborty et al BMC Bioinformatics 2013 14236 Page 6 of 7httpwwwbiomedcentralcom1471-210514236

Figure 3 (a) (b) (c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtainedby use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our R package svapls

softwares This may be attributable to its high false dis-covery rate (FDR) as was observed in the simulationstudy Interestingly svapls detected 24 genes that aremissed by both sam as well as sva Among them thegenes CD74 TNFRSF1A LCN2 and GSN deserve spe-cial mention All these genes are either related to sometype of cancer or regulate cell growth(or apoptosis) CD74plays an important role in multiple myeloma and itshigher expression induces tumor cell malignancy [13]

Figure 4 A Venn-diagram showing the overlap pattern of thegenes detected to be significant from the Golub data by svaplssva sam and limma

An isoform of the tumor necrosis factor TNFRSF1A isassociated with the development of Acute Lymphoblas-tic Leukemia (ALL) in children [14] Specifically LCN2has been found to be connected with Acute Myel-ogenous Leukemia (AML) [15] GSN plays a signifi-cant role of suppressing tumorigenicity in lung cancer[16] and has a diminuted expression in bladder cancercells [17]

DiscussionVarious hidden sources of variation are found to exist in agene expression data that cannot be removed by the stan-dard normalization procedures But their effect may besubstantial enough to change the expression pattern of thegenes over two different varieties of samples The imme-diate consequence is a large reduction in the detectionpower of the testing procedure employed to find the trulysignificant genes followed by highly elevated error ratesIn this project we discuss the development and usage ofan R package svapls that can tackle a wide variety of hid-den effects in a gene expression analysis and can delivera more accurate inference on the differential expressionvariability of the genes between two groups of samples(tissues) We illustrate the superior performance of ourR package in comparison to three other popular soft-wares available for differential gene expression analysesThe high detection power (sensitivity) of svapls along with

Chakraborty et al BMC Bioinformatics 2013 14236 Page 7 of 7httpwwwbiomedcentralcom1471-210514236

the reasonably small error rates provides it a significantlybetter edge over the competing softwares Specificallysva is outperformed by our package in terms of the sen-sitivity (power) while sam comes close and performsmarginally better in some cases although its competenceis severely marred by the considerably high false discov-ery rate (FDR) and substantially low specificity rate Inaddition the graphical representation of the hidden vari-ation (by the function hfp) from our package enablesthe user to understand the pattern in which the hiddensources of variability affect the expression signals of anyspecified subset of genes over a selected group of sub-jectssamples This paves the way to more sophisticatedanalyses of subject-set specific gene expression variabil-ity in the data Application of our package on the Golubdata demonstrates its efficacy in removing the significantbatch effects from the collectedanalyzed samples More-over our package detects four additional genes (missed byboth sva and sam) that have been found to be connectedto Leukemia or some other type of cancer

ConclusionsThe R package svapls can detect a wide variety of hid-den factors in a gene expression study and adjust for themappropriately in order to provide a more accurate infer-ence on the expression pattern of the genes between twodifferent types of tissues In particular the superior detec-tion power and small error rate gives our R package asubstantially better edge over the competing softwaresconsidered in the analysis

Availability and requirementsProject nameR packagesvaplsProject home pagehttpcranrprojectorgwebpackagessvaplsindexhtmlOperating system and R versionThe R package is platform independent and is compatiblewith all the versions of R same as or higher than 20LicenseGPL-3

Competing interestsThe authors declare that they have no competing interests

Authorsrsquo contributionsSC wrote the R package including its evaluation and drafted the manuscriptSD and SD contributed to the structure of the simulation studies applicationand planning of the manuscript All authors read and approved the finalmanuscript

AcknowledgementsWe sincerely thank the editor and the two reviewers for their constructivecomments that lead to an improved manuscript This research was part of SCrsquosdoctoral dissertation work SC acknowledges generous support and adissertation completion award by the School of Interdisciplinary and Graduate

Studies of the University of Louisville We also acknowledge partial support bythe Department of Bioinformatics and Biostatistics of University of Louisvilletoward the processing charges This research work was partially supported byNIH grants CA133844 (Su Datta) and CA170091 minus 01A1 (Su Datta)

Received 15 April 2013 Accepted 16 July 2013Published 24 July 2013

References1 Chakraborty S Datta S Datta S Surrogate variable analysis using

partial least squares in gene expression studies Bioinformatics 201228(6)799ndash806

2 Geladi P Kowalski B Partial least squares regression a tutorialAnalytica Chimica Acta 1986 1851ndash17

3 Tusher V Tibshirani R Chu G Significance analysis of microarraysapplied to the ionizing radiation response Proc Natl Acad Sci 2001985116ndash5121

4 Smith GK Limma linear models for microarray data In Bioinformaticsand Computational Biology Solutions using R and Bioconductor Edited byGentleman R Carey V Dudoit S Irizarry R Huber W New York Springer2002

5 Leek ST et al The sva package for removing batch effects and otherunwanted variation in high-throughput experiments Bioinformatics2012 28(6)882ndash883

6 Kerr MK et al Analysis of variance for gene expression microarraydata J Comput Biol 2000 7819ndash837

7 Kerr MK et al Statistical analysis of a gene expression microarraydata Stat Sinica 2002 12(1)203ndash217

8 Hirotsugu A Likelihood and the Bayes procedure In Bayesian StatisticsEdited by Bernardo JM De Groot MH Lindley DV Smith AFM ValenciaSpain University Press 1980143ndash203

9 Benjamini Y Hochberg Y Controlling the false discovery rate apractical and powerful approach tomultiple testing J R Stat Soc 199557(1)289ndash300

10 Golub T et alMolecular classification of cancer class discovery andclass prediction by gene expression monitoring Science 1999286531ndash537

11 Dudoit S Friedlyand J Speed TP Comparison of discriminationmethods for the classification of tumors using gene expressiondata J Am Stat Assoc 2002 97(457)77ndash87

12 Giordan M A two-stage procedure for the removal of batch effects inmicroarray studies Stat Biosci 2013 doi101007s12561-013-9081-1

13 Burton JE CD74 is expressed by multiple myeloma and is apromising target for therapy Clin Cancer Res 2004 10(19)6606ndash6611

14 Wu S Levels of the soluble 55-kilodalton isoform of tumor necrosisfactor receptor in bonemarrow are correlated with the clinicaloutcome of children with acute lymphoblastic leukemia in firstrecurrence Cancer 2003 98(3)625ndash631

15 Shimada H et al Potential involvement of the AML1-MTG8 fusionprotein in the granulocytic maturation characteristic of the t(821)acute myelogenous leukemia revealed by microarray analysisLeukemia 2002 16(5)874ndash885

16 Sagawa N et al Gelsolin supresses tumorigenicity through inhibitingPKC activation in human lung cancer cell line PC10 Br J Cancer 200388(4)606ndash612

17 Haga K The mechanism for reduced expression of gelsolin tumorsupressor protein in bladder cancer Hokkaido Igaku Zasshi 200378(1)29ndash37

doi1011861471-2105-14-236Cite this article as Chakraborty et al svapls an R package to correct forhidden factors of variability in gene expression studies BMC Bioinformatics2013 14236

  • Abstract
    • Background
    • Results
    • Conclusions
      • Background
      • Implementation
        • Brief overview of the package
        • Comparative evaluation with other available software packages
        • Application on the Golub data
          • Discussion
          • Conclusions
          • Availability and requirements
            • Project name
            • Project home page
            • Operating system and R version
            • License
              • Competing interests
              • Authors contributions
              • Acknowledgements
              • References
Page 4: svapls: an R package to correct for hidden factors of variability in gene expression studies

Chakraborty et al BMC Bioinformatics 2013 14236 Page 4 of 7httpwwwbiomedcentralcom1471-210514236

Figure 2 Heatmap showing the hidden variability in the data hidden_facdat owing to the specified set of subjects and genes

Monte-Carlo replications (Tables 1 and 2 for setting 1and 3 4 for setting 2) The obtained results from the twosimulation settings clearly reveal the superior sensitivityof svapls compared to other three R packages sam limmaand sva under most of the combinations of group sizeand noise-to-signal ratios This illustrates the efficiencyof our R package in discovering a higher proportion ofthe truly significant genes compared to the existing soft-ware packages The sensitivity of sam is comparable to ourmethod for a higher sample size under setting 1 (Tables 1and 2) and is very close or marginally better under setting2 (Tables 3 and 4) but is adversely impacted by its signifi-cantly large false discovery rate Specifically the sensitivityobtained from our R package becomes almost similar orslightly better than sam as the group size is increased from10 to 20 (Table 4) Moreover the specificity rate is the bestfor svapls closely followed by sva while sam and limma areless efficient in this context In addition the average errorrates FDR and FNR are much lower for svapls comparedto the other three software packages Thus overall our Rpackage is capable of discovering the truly differentiallyexpressed genes with more power along with an efficientcontrol over the wrong detections (non-detections)

Application on the Golub dataNow we explore the performance of svapls on thepre-processed ALLAML dataset [1011] It containsthe log-transformed expression levels of 7129 genes

Table 1 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02287 06276 04285 02089

SAM 09239 06066 07278 00125

SVA 03311 09987 00475 00456

SVAPLS 09464 09998 00023 00039

η = 010

LIMMA 02307 06566 03724 02295

SAM 08880 06410 06596 00147

SVA 02882 09988 00481 00469

SVAPLS 09098 09994 00076 00065

η = 015

LIMMA 01956 06672 03689 02164

SAM 08522 06709 06140 00193

SVA 02474 09990 00458 00485

SVAPLS 08660 09991 00130 00097

Chakraborty et al BMC Bioinformatics 2013 14236 Page 5 of 7httpwwwbiomedcentralcom1471-210514236

Table 2 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 07863 02283 07719 05037

SAM 09793 05773 07724 00033

SVA 05659 09977 00475 00311

SVAPLS 09954 09998 00026 00003

η = 010

LIMMA 07577 02479 07432 05564

SAM 09854 06215 07055 00020

SVA 05695 09978 00471 00309

SVAPLS 09897 09994 00083 00008

η = 015

LIMMA 07307 02389 07464 05865

SAM 09816 06448 06609 00023

SVA 05393 09980 00443 00331

SVAPLS 09830 09990 00131 00012

Table 3 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02111 05367 05328 02818

SAM 06290 05879 07474 00625

SVA 00445 09998 00405 00164

SVAPLS 06408 09998 00029 00200

η = 010

LIMMA 01953 05545 04871 02942

SAM 05580 06079 07342 00606

SVA 00239 09998 00514 00148

SVAPLS 05125 09996 00077 00263

η = 015

LIMMA 01900 05597 04838 03039

SAM 05412 06126 07363 00600

SVA 00190 09999 00395 00148

SVAPLS 04544 09996 00112 00297

Table 4 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FDR

η = 005

LIMMA 05632 01690 08399 06629

SAM 07845 06199 07158 00312

SVA 01651 09993 00469 00329

SVAPLS 08324 09998 00024 00111

η = 010

LIMMA 05637 01640 08333 06967

SAM 07378 06507 06471 00342

SVA 01428 09994 00461 00331

SVAPLS 07535 09995 00071 00165

η = 015

LIMMA 05660 01619 08315 07183

SAM 06983 06553 06481 00372

SVA 01082 09995 00506 00350

SVAPLS 06806 09995 00095 00211

over-two groups of patients 47 having Acute Lym-phoblastic Leukemia (ALL) and 25 suffering from AcuteMyeloid Leukemia (AML) The patient tissue sampleswere obtained from the following four sources (1) Dana-Farber Cancer Institute (DFCI) (2) St-Judersquos ChildrenrsquosResearch Hospital (ST-Jude) (3) Cancer and LeukemiaGroup B (CALGB) and (4) Childrenrsquos Cancer Study Group(CCG) This inherent classification in the data can poten-tially generate significant batch effects that may distortthe original expression pattern of the genes This moti-vated the implementation of our R package on this datasetThe corrected expression matrix for the first 1000 genesreturned from the use of the svpls function on thisdata demonstrates that the batch effects due to variabil-ity in the sample sources have been removed effectivelyThe haphazard distribution of the samples from the foursources in the corrected gene expression matrix wipesout the additonal effects owing to the observed batch-specific clustering in the original data In this contextsvapls fares equally well compared to another popular Rpackage ber for removing batch effects in microarray data[12] (Figure 3)Overall limma detects 7128 genes followed by 3307

genes from sam 1015 genes from our svapls and 412genes from sva A Venn diagram (Figure 4) representsthe extent of overlap between the genes detected by thefour softwares Specifically limma detects all the genesthat are found to be significant from the other three

Chakraborty et al BMC Bioinformatics 2013 14236 Page 6 of 7httpwwwbiomedcentralcom1471-210514236

Figure 3 (a) (b) (c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtainedby use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our R package svapls

softwares This may be attributable to its high false dis-covery rate (FDR) as was observed in the simulationstudy Interestingly svapls detected 24 genes that aremissed by both sam as well as sva Among them thegenes CD74 TNFRSF1A LCN2 and GSN deserve spe-cial mention All these genes are either related to sometype of cancer or regulate cell growth(or apoptosis) CD74plays an important role in multiple myeloma and itshigher expression induces tumor cell malignancy [13]

Figure 4 A Venn-diagram showing the overlap pattern of thegenes detected to be significant from the Golub data by svaplssva sam and limma

An isoform of the tumor necrosis factor TNFRSF1A isassociated with the development of Acute Lymphoblas-tic Leukemia (ALL) in children [14] Specifically LCN2has been found to be connected with Acute Myel-ogenous Leukemia (AML) [15] GSN plays a signifi-cant role of suppressing tumorigenicity in lung cancer[16] and has a diminuted expression in bladder cancercells [17]

DiscussionVarious hidden sources of variation are found to exist in agene expression data that cannot be removed by the stan-dard normalization procedures But their effect may besubstantial enough to change the expression pattern of thegenes over two different varieties of samples The imme-diate consequence is a large reduction in the detectionpower of the testing procedure employed to find the trulysignificant genes followed by highly elevated error ratesIn this project we discuss the development and usage ofan R package svapls that can tackle a wide variety of hid-den effects in a gene expression analysis and can delivera more accurate inference on the differential expressionvariability of the genes between two groups of samples(tissues) We illustrate the superior performance of ourR package in comparison to three other popular soft-wares available for differential gene expression analysesThe high detection power (sensitivity) of svapls along with

Chakraborty et al BMC Bioinformatics 2013 14236 Page 7 of 7httpwwwbiomedcentralcom1471-210514236

the reasonably small error rates provides it a significantlybetter edge over the competing softwares Specificallysva is outperformed by our package in terms of the sen-sitivity (power) while sam comes close and performsmarginally better in some cases although its competenceis severely marred by the considerably high false discov-ery rate (FDR) and substantially low specificity rate Inaddition the graphical representation of the hidden vari-ation (by the function hfp) from our package enablesthe user to understand the pattern in which the hiddensources of variability affect the expression signals of anyspecified subset of genes over a selected group of sub-jectssamples This paves the way to more sophisticatedanalyses of subject-set specific gene expression variabil-ity in the data Application of our package on the Golubdata demonstrates its efficacy in removing the significantbatch effects from the collectedanalyzed samples More-over our package detects four additional genes (missed byboth sva and sam) that have been found to be connectedto Leukemia or some other type of cancer

ConclusionsThe R package svapls can detect a wide variety of hid-den factors in a gene expression study and adjust for themappropriately in order to provide a more accurate infer-ence on the expression pattern of the genes between twodifferent types of tissues In particular the superior detec-tion power and small error rate gives our R package asubstantially better edge over the competing softwaresconsidered in the analysis

Availability and requirementsProject nameR packagesvaplsProject home pagehttpcranrprojectorgwebpackagessvaplsindexhtmlOperating system and R versionThe R package is platform independent and is compatiblewith all the versions of R same as or higher than 20LicenseGPL-3

Competing interestsThe authors declare that they have no competing interests

Authorsrsquo contributionsSC wrote the R package including its evaluation and drafted the manuscriptSD and SD contributed to the structure of the simulation studies applicationand planning of the manuscript All authors read and approved the finalmanuscript

AcknowledgementsWe sincerely thank the editor and the two reviewers for their constructivecomments that lead to an improved manuscript This research was part of SCrsquosdoctoral dissertation work SC acknowledges generous support and adissertation completion award by the School of Interdisciplinary and Graduate

Studies of the University of Louisville We also acknowledge partial support bythe Department of Bioinformatics and Biostatistics of University of Louisvilletoward the processing charges This research work was partially supported byNIH grants CA133844 (Su Datta) and CA170091 minus 01A1 (Su Datta)

Received 15 April 2013 Accepted 16 July 2013Published 24 July 2013

References1 Chakraborty S Datta S Datta S Surrogate variable analysis using

partial least squares in gene expression studies Bioinformatics 201228(6)799ndash806

2 Geladi P Kowalski B Partial least squares regression a tutorialAnalytica Chimica Acta 1986 1851ndash17

3 Tusher V Tibshirani R Chu G Significance analysis of microarraysapplied to the ionizing radiation response Proc Natl Acad Sci 2001985116ndash5121

4 Smith GK Limma linear models for microarray data In Bioinformaticsand Computational Biology Solutions using R and Bioconductor Edited byGentleman R Carey V Dudoit S Irizarry R Huber W New York Springer2002

5 Leek ST et al The sva package for removing batch effects and otherunwanted variation in high-throughput experiments Bioinformatics2012 28(6)882ndash883

6 Kerr MK et al Analysis of variance for gene expression microarraydata J Comput Biol 2000 7819ndash837

7 Kerr MK et al Statistical analysis of a gene expression microarraydata Stat Sinica 2002 12(1)203ndash217

8 Hirotsugu A Likelihood and the Bayes procedure In Bayesian StatisticsEdited by Bernardo JM De Groot MH Lindley DV Smith AFM ValenciaSpain University Press 1980143ndash203

9 Benjamini Y Hochberg Y Controlling the false discovery rate apractical and powerful approach tomultiple testing J R Stat Soc 199557(1)289ndash300

10 Golub T et alMolecular classification of cancer class discovery andclass prediction by gene expression monitoring Science 1999286531ndash537

11 Dudoit S Friedlyand J Speed TP Comparison of discriminationmethods for the classification of tumors using gene expressiondata J Am Stat Assoc 2002 97(457)77ndash87

12 Giordan M A two-stage procedure for the removal of batch effects inmicroarray studies Stat Biosci 2013 doi101007s12561-013-9081-1

13 Burton JE CD74 is expressed by multiple myeloma and is apromising target for therapy Clin Cancer Res 2004 10(19)6606ndash6611

14 Wu S Levels of the soluble 55-kilodalton isoform of tumor necrosisfactor receptor in bonemarrow are correlated with the clinicaloutcome of children with acute lymphoblastic leukemia in firstrecurrence Cancer 2003 98(3)625ndash631

15 Shimada H et al Potential involvement of the AML1-MTG8 fusionprotein in the granulocytic maturation characteristic of the t(821)acute myelogenous leukemia revealed by microarray analysisLeukemia 2002 16(5)874ndash885

16 Sagawa N et al Gelsolin supresses tumorigenicity through inhibitingPKC activation in human lung cancer cell line PC10 Br J Cancer 200388(4)606ndash612

17 Haga K The mechanism for reduced expression of gelsolin tumorsupressor protein in bladder cancer Hokkaido Igaku Zasshi 200378(1)29ndash37

doi1011861471-2105-14-236Cite this article as Chakraborty et al svapls an R package to correct forhidden factors of variability in gene expression studies BMC Bioinformatics2013 14236

  • Abstract
    • Background
    • Results
    • Conclusions
      • Background
      • Implementation
        • Brief overview of the package
        • Comparative evaluation with other available software packages
        • Application on the Golub data
          • Discussion
          • Conclusions
          • Availability and requirements
            • Project name
            • Project home page
            • Operating system and R version
            • License
              • Competing interests
              • Authors contributions
              • Acknowledgements
              • References
Page 5: svapls: an R package to correct for hidden factors of variability in gene expression studies

Chakraborty et al BMC Bioinformatics 2013 14236 Page 5 of 7httpwwwbiomedcentralcom1471-210514236

Table 2 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 1 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 07863 02283 07719 05037

SAM 09793 05773 07724 00033

SVA 05659 09977 00475 00311

SVAPLS 09954 09998 00026 00003

η = 010

LIMMA 07577 02479 07432 05564

SAM 09854 06215 07055 00020

SVA 05695 09978 00471 00309

SVAPLS 09897 09994 00083 00008

η = 015

LIMMA 07307 02389 07464 05865

SAM 09816 06448 06609 00023

SVA 05393 09980 00443 00331

SVAPLS 09830 09990 00131 00012

Table 3 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 20subjects (10 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FNR

η = 005

LIMMA 02111 05367 05328 02818

SAM 06290 05879 07474 00625

SVA 00445 09998 00405 00164

SVAPLS 06408 09998 00029 00200

η = 010

LIMMA 01953 05545 04871 02942

SAM 05580 06079 07342 00606

SVA 00239 09998 00514 00148

SVAPLS 05125 09996 00077 00263

η = 015

LIMMA 01900 05597 04838 03039

SAM 05412 06126 07363 00600

SVA 00190 09999 00395 00148

SVAPLS 04544 09996 00112 00297

Table 4 Average performancemeasures from a sensitivityanalysis of the simulated gene expression data on 40subjects (20 being in each group) under setting 2 with thefour software packages limma sam sva and svapls

Method Sensitivity Specificity FDR FDR

η = 005

LIMMA 05632 01690 08399 06629

SAM 07845 06199 07158 00312

SVA 01651 09993 00469 00329

SVAPLS 08324 09998 00024 00111

η = 010

LIMMA 05637 01640 08333 06967

SAM 07378 06507 06471 00342

SVA 01428 09994 00461 00331

SVAPLS 07535 09995 00071 00165

η = 015

LIMMA 05660 01619 08315 07183

SAM 06983 06553 06481 00372

SVA 01082 09995 00506 00350

SVAPLS 06806 09995 00095 00211

over-two groups of patients 47 having Acute Lym-phoblastic Leukemia (ALL) and 25 suffering from AcuteMyeloid Leukemia (AML) The patient tissue sampleswere obtained from the following four sources (1) Dana-Farber Cancer Institute (DFCI) (2) St-Judersquos ChildrenrsquosResearch Hospital (ST-Jude) (3) Cancer and LeukemiaGroup B (CALGB) and (4) Childrenrsquos Cancer Study Group(CCG) This inherent classification in the data can poten-tially generate significant batch effects that may distortthe original expression pattern of the genes This moti-vated the implementation of our R package on this datasetThe corrected expression matrix for the first 1000 genesreturned from the use of the svpls function on thisdata demonstrates that the batch effects due to variabil-ity in the sample sources have been removed effectivelyThe haphazard distribution of the samples from the foursources in the corrected gene expression matrix wipesout the additonal effects owing to the observed batch-specific clustering in the original data In this contextsvapls fares equally well compared to another popular Rpackage ber for removing batch effects in microarray data[12] (Figure 3)Overall limma detects 7128 genes followed by 3307

genes from sam 1015 genes from our svapls and 412genes from sva A Venn diagram (Figure 4) representsthe extent of overlap between the genes detected by thefour softwares Specifically limma detects all the genesthat are found to be significant from the other three

Chakraborty et al BMC Bioinformatics 2013 14236 Page 6 of 7httpwwwbiomedcentralcom1471-210514236

Figure 3 (a) (b) (c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtainedby use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our R package svapls

softwares This may be attributable to its high false dis-covery rate (FDR) as was observed in the simulationstudy Interestingly svapls detected 24 genes that aremissed by both sam as well as sva Among them thegenes CD74 TNFRSF1A LCN2 and GSN deserve spe-cial mention All these genes are either related to sometype of cancer or regulate cell growth(or apoptosis) CD74plays an important role in multiple myeloma and itshigher expression induces tumor cell malignancy [13]

Figure 4 A Venn-diagram showing the overlap pattern of thegenes detected to be significant from the Golub data by svaplssva sam and limma

An isoform of the tumor necrosis factor TNFRSF1A isassociated with the development of Acute Lymphoblas-tic Leukemia (ALL) in children [14] Specifically LCN2has been found to be connected with Acute Myel-ogenous Leukemia (AML) [15] GSN plays a signifi-cant role of suppressing tumorigenicity in lung cancer[16] and has a diminuted expression in bladder cancercells [17]

DiscussionVarious hidden sources of variation are found to exist in agene expression data that cannot be removed by the stan-dard normalization procedures But their effect may besubstantial enough to change the expression pattern of thegenes over two different varieties of samples The imme-diate consequence is a large reduction in the detectionpower of the testing procedure employed to find the trulysignificant genes followed by highly elevated error ratesIn this project we discuss the development and usage ofan R package svapls that can tackle a wide variety of hid-den effects in a gene expression analysis and can delivera more accurate inference on the differential expressionvariability of the genes between two groups of samples(tissues) We illustrate the superior performance of ourR package in comparison to three other popular soft-wares available for differential gene expression analysesThe high detection power (sensitivity) of svapls along with

Chakraborty et al BMC Bioinformatics 2013 14236 Page 7 of 7httpwwwbiomedcentralcom1471-210514236

the reasonably small error rates provides it a significantlybetter edge over the competing softwares Specificallysva is outperformed by our package in terms of the sen-sitivity (power) while sam comes close and performsmarginally better in some cases although its competenceis severely marred by the considerably high false discov-ery rate (FDR) and substantially low specificity rate Inaddition the graphical representation of the hidden vari-ation (by the function hfp) from our package enablesthe user to understand the pattern in which the hiddensources of variability affect the expression signals of anyspecified subset of genes over a selected group of sub-jectssamples This paves the way to more sophisticatedanalyses of subject-set specific gene expression variabil-ity in the data Application of our package on the Golubdata demonstrates its efficacy in removing the significantbatch effects from the collectedanalyzed samples More-over our package detects four additional genes (missed byboth sva and sam) that have been found to be connectedto Leukemia or some other type of cancer

ConclusionsThe R package svapls can detect a wide variety of hid-den factors in a gene expression study and adjust for themappropriately in order to provide a more accurate infer-ence on the expression pattern of the genes between twodifferent types of tissues In particular the superior detec-tion power and small error rate gives our R package asubstantially better edge over the competing softwaresconsidered in the analysis

Availability and requirementsProject nameR packagesvaplsProject home pagehttpcranrprojectorgwebpackagessvaplsindexhtmlOperating system and R versionThe R package is platform independent and is compatiblewith all the versions of R same as or higher than 20LicenseGPL-3

Competing interestsThe authors declare that they have no competing interests

Authorsrsquo contributionsSC wrote the R package including its evaluation and drafted the manuscriptSD and SD contributed to the structure of the simulation studies applicationand planning of the manuscript All authors read and approved the finalmanuscript

AcknowledgementsWe sincerely thank the editor and the two reviewers for their constructivecomments that lead to an improved manuscript This research was part of SCrsquosdoctoral dissertation work SC acknowledges generous support and adissertation completion award by the School of Interdisciplinary and Graduate

Studies of the University of Louisville We also acknowledge partial support bythe Department of Bioinformatics and Biostatistics of University of Louisvilletoward the processing charges This research work was partially supported byNIH grants CA133844 (Su Datta) and CA170091 minus 01A1 (Su Datta)

Received 15 April 2013 Accepted 16 July 2013Published 24 July 2013

References1 Chakraborty S Datta S Datta S Surrogate variable analysis using

partial least squares in gene expression studies Bioinformatics 201228(6)799ndash806

2 Geladi P Kowalski B Partial least squares regression a tutorialAnalytica Chimica Acta 1986 1851ndash17

3 Tusher V Tibshirani R Chu G Significance analysis of microarraysapplied to the ionizing radiation response Proc Natl Acad Sci 2001985116ndash5121

4 Smith GK Limma linear models for microarray data In Bioinformaticsand Computational Biology Solutions using R and Bioconductor Edited byGentleman R Carey V Dudoit S Irizarry R Huber W New York Springer2002

5 Leek ST et al The sva package for removing batch effects and otherunwanted variation in high-throughput experiments Bioinformatics2012 28(6)882ndash883

6 Kerr MK et al Analysis of variance for gene expression microarraydata J Comput Biol 2000 7819ndash837

7 Kerr MK et al Statistical analysis of a gene expression microarraydata Stat Sinica 2002 12(1)203ndash217

8 Hirotsugu A Likelihood and the Bayes procedure In Bayesian StatisticsEdited by Bernardo JM De Groot MH Lindley DV Smith AFM ValenciaSpain University Press 1980143ndash203

9 Benjamini Y Hochberg Y Controlling the false discovery rate apractical and powerful approach tomultiple testing J R Stat Soc 199557(1)289ndash300

10 Golub T et alMolecular classification of cancer class discovery andclass prediction by gene expression monitoring Science 1999286531ndash537

11 Dudoit S Friedlyand J Speed TP Comparison of discriminationmethods for the classification of tumors using gene expressiondata J Am Stat Assoc 2002 97(457)77ndash87

12 Giordan M A two-stage procedure for the removal of batch effects inmicroarray studies Stat Biosci 2013 doi101007s12561-013-9081-1

13 Burton JE CD74 is expressed by multiple myeloma and is apromising target for therapy Clin Cancer Res 2004 10(19)6606ndash6611

14 Wu S Levels of the soluble 55-kilodalton isoform of tumor necrosisfactor receptor in bonemarrow are correlated with the clinicaloutcome of children with acute lymphoblastic leukemia in firstrecurrence Cancer 2003 98(3)625ndash631

15 Shimada H et al Potential involvement of the AML1-MTG8 fusionprotein in the granulocytic maturation characteristic of the t(821)acute myelogenous leukemia revealed by microarray analysisLeukemia 2002 16(5)874ndash885

16 Sagawa N et al Gelsolin supresses tumorigenicity through inhibitingPKC activation in human lung cancer cell line PC10 Br J Cancer 200388(4)606ndash612

17 Haga K The mechanism for reduced expression of gelsolin tumorsupressor protein in bladder cancer Hokkaido Igaku Zasshi 200378(1)29ndash37

doi1011861471-2105-14-236Cite this article as Chakraborty et al svapls an R package to correct forhidden factors of variability in gene expression studies BMC Bioinformatics2013 14236

  • Abstract
    • Background
    • Results
    • Conclusions
      • Background
      • Implementation
        • Brief overview of the package
        • Comparative evaluation with other available software packages
        • Application on the Golub data
          • Discussion
          • Conclusions
          • Availability and requirements
            • Project name
            • Project home page
            • Operating system and R version
            • License
              • Competing interests
              • Authors contributions
              • Acknowledgements
              • References
Page 6: svapls: an R package to correct for hidden factors of variability in gene expression studies

Chakraborty et al BMC Bioinformatics 2013 14236 Page 6 of 7httpwwwbiomedcentralcom1471-210514236

Figure 3 (a) (b) (c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data (a) Heatmapfor the first 1000 genes in the original Golub expression data (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtainedby use of the R package ber (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our R package svapls

softwares This may be attributable to its high false dis-covery rate (FDR) as was observed in the simulationstudy Interestingly svapls detected 24 genes that aremissed by both sam as well as sva Among them thegenes CD74 TNFRSF1A LCN2 and GSN deserve spe-cial mention All these genes are either related to sometype of cancer or regulate cell growth(or apoptosis) CD74plays an important role in multiple myeloma and itshigher expression induces tumor cell malignancy [13]

Figure 4 A Venn-diagram showing the overlap pattern of thegenes detected to be significant from the Golub data by svaplssva sam and limma

An isoform of the tumor necrosis factor TNFRSF1A isassociated with the development of Acute Lymphoblas-tic Leukemia (ALL) in children [14] Specifically LCN2has been found to be connected with Acute Myel-ogenous Leukemia (AML) [15] GSN plays a signifi-cant role of suppressing tumorigenicity in lung cancer[16] and has a diminuted expression in bladder cancercells [17]

DiscussionVarious hidden sources of variation are found to exist in agene expression data that cannot be removed by the stan-dard normalization procedures But their effect may besubstantial enough to change the expression pattern of thegenes over two different varieties of samples The imme-diate consequence is a large reduction in the detectionpower of the testing procedure employed to find the trulysignificant genes followed by highly elevated error ratesIn this project we discuss the development and usage ofan R package svapls that can tackle a wide variety of hid-den effects in a gene expression analysis and can delivera more accurate inference on the differential expressionvariability of the genes between two groups of samples(tissues) We illustrate the superior performance of ourR package in comparison to three other popular soft-wares available for differential gene expression analysesThe high detection power (sensitivity) of svapls along with

Chakraborty et al BMC Bioinformatics 2013 14236 Page 7 of 7httpwwwbiomedcentralcom1471-210514236

the reasonably small error rates provides it a significantlybetter edge over the competing softwares Specificallysva is outperformed by our package in terms of the sen-sitivity (power) while sam comes close and performsmarginally better in some cases although its competenceis severely marred by the considerably high false discov-ery rate (FDR) and substantially low specificity rate Inaddition the graphical representation of the hidden vari-ation (by the function hfp) from our package enablesthe user to understand the pattern in which the hiddensources of variability affect the expression signals of anyspecified subset of genes over a selected group of sub-jectssamples This paves the way to more sophisticatedanalyses of subject-set specific gene expression variabil-ity in the data Application of our package on the Golubdata demonstrates its efficacy in removing the significantbatch effects from the collectedanalyzed samples More-over our package detects four additional genes (missed byboth sva and sam) that have been found to be connectedto Leukemia or some other type of cancer

ConclusionsThe R package svapls can detect a wide variety of hid-den factors in a gene expression study and adjust for themappropriately in order to provide a more accurate infer-ence on the expression pattern of the genes between twodifferent types of tissues In particular the superior detec-tion power and small error rate gives our R package asubstantially better edge over the competing softwaresconsidered in the analysis

Availability and requirementsProject nameR packagesvaplsProject home pagehttpcranrprojectorgwebpackagessvaplsindexhtmlOperating system and R versionThe R package is platform independent and is compatiblewith all the versions of R same as or higher than 20LicenseGPL-3

Competing interestsThe authors declare that they have no competing interests

Authorsrsquo contributionsSC wrote the R package including its evaluation and drafted the manuscriptSD and SD contributed to the structure of the simulation studies applicationand planning of the manuscript All authors read and approved the finalmanuscript

AcknowledgementsWe sincerely thank the editor and the two reviewers for their constructivecomments that lead to an improved manuscript This research was part of SCrsquosdoctoral dissertation work SC acknowledges generous support and adissertation completion award by the School of Interdisciplinary and Graduate

Studies of the University of Louisville We also acknowledge partial support bythe Department of Bioinformatics and Biostatistics of University of Louisvilletoward the processing charges This research work was partially supported byNIH grants CA133844 (Su Datta) and CA170091 minus 01A1 (Su Datta)

Received 15 April 2013 Accepted 16 July 2013Published 24 July 2013

References1 Chakraborty S Datta S Datta S Surrogate variable analysis using

partial least squares in gene expression studies Bioinformatics 201228(6)799ndash806

2 Geladi P Kowalski B Partial least squares regression a tutorialAnalytica Chimica Acta 1986 1851ndash17

3 Tusher V Tibshirani R Chu G Significance analysis of microarraysapplied to the ionizing radiation response Proc Natl Acad Sci 2001985116ndash5121

4 Smith GK Limma linear models for microarray data In Bioinformaticsand Computational Biology Solutions using R and Bioconductor Edited byGentleman R Carey V Dudoit S Irizarry R Huber W New York Springer2002

5 Leek ST et al The sva package for removing batch effects and otherunwanted variation in high-throughput experiments Bioinformatics2012 28(6)882ndash883

6 Kerr MK et al Analysis of variance for gene expression microarraydata J Comput Biol 2000 7819ndash837

7 Kerr MK et al Statistical analysis of a gene expression microarraydata Stat Sinica 2002 12(1)203ndash217

8 Hirotsugu A Likelihood and the Bayes procedure In Bayesian StatisticsEdited by Bernardo JM De Groot MH Lindley DV Smith AFM ValenciaSpain University Press 1980143ndash203

9 Benjamini Y Hochberg Y Controlling the false discovery rate apractical and powerful approach tomultiple testing J R Stat Soc 199557(1)289ndash300

10 Golub T et alMolecular classification of cancer class discovery andclass prediction by gene expression monitoring Science 1999286531ndash537

11 Dudoit S Friedlyand J Speed TP Comparison of discriminationmethods for the classification of tumors using gene expressiondata J Am Stat Assoc 2002 97(457)77ndash87

12 Giordan M A two-stage procedure for the removal of batch effects inmicroarray studies Stat Biosci 2013 doi101007s12561-013-9081-1

13 Burton JE CD74 is expressed by multiple myeloma and is apromising target for therapy Clin Cancer Res 2004 10(19)6606ndash6611

14 Wu S Levels of the soluble 55-kilodalton isoform of tumor necrosisfactor receptor in bonemarrow are correlated with the clinicaloutcome of children with acute lymphoblastic leukemia in firstrecurrence Cancer 2003 98(3)625ndash631

15 Shimada H et al Potential involvement of the AML1-MTG8 fusionprotein in the granulocytic maturation characteristic of the t(821)acute myelogenous leukemia revealed by microarray analysisLeukemia 2002 16(5)874ndash885

16 Sagawa N et al Gelsolin supresses tumorigenicity through inhibitingPKC activation in human lung cancer cell line PC10 Br J Cancer 200388(4)606ndash612

17 Haga K The mechanism for reduced expression of gelsolin tumorsupressor protein in bladder cancer Hokkaido Igaku Zasshi 200378(1)29ndash37

doi1011861471-2105-14-236Cite this article as Chakraborty et al svapls an R package to correct forhidden factors of variability in gene expression studies BMC Bioinformatics2013 14236

  • Abstract
    • Background
    • Results
    • Conclusions
      • Background
      • Implementation
        • Brief overview of the package
        • Comparative evaluation with other available software packages
        • Application on the Golub data
          • Discussion
          • Conclusions
          • Availability and requirements
            • Project name
            • Project home page
            • Operating system and R version
            • License
              • Competing interests
              • Authors contributions
              • Acknowledgements
              • References
Page 7: svapls: an R package to correct for hidden factors of variability in gene expression studies

Chakraborty et al BMC Bioinformatics 2013 14236 Page 7 of 7httpwwwbiomedcentralcom1471-210514236

the reasonably small error rates provides it a significantlybetter edge over the competing softwares Specificallysva is outperformed by our package in terms of the sen-sitivity (power) while sam comes close and performsmarginally better in some cases although its competenceis severely marred by the considerably high false discov-ery rate (FDR) and substantially low specificity rate Inaddition the graphical representation of the hidden vari-ation (by the function hfp) from our package enablesthe user to understand the pattern in which the hiddensources of variability affect the expression signals of anyspecified subset of genes over a selected group of sub-jectssamples This paves the way to more sophisticatedanalyses of subject-set specific gene expression variabil-ity in the data Application of our package on the Golubdata demonstrates its efficacy in removing the significantbatch effects from the collectedanalyzed samples More-over our package detects four additional genes (missed byboth sva and sam) that have been found to be connectedto Leukemia or some other type of cancer

ConclusionsThe R package svapls can detect a wide variety of hid-den factors in a gene expression study and adjust for themappropriately in order to provide a more accurate infer-ence on the expression pattern of the genes between twodifferent types of tissues In particular the superior detec-tion power and small error rate gives our R package asubstantially better edge over the competing softwaresconsidered in the analysis

Availability and requirementsProject nameR packagesvaplsProject home pagehttpcranrprojectorgwebpackagessvaplsindexhtmlOperating system and R versionThe R package is platform independent and is compatiblewith all the versions of R same as or higher than 20LicenseGPL-3

Competing interestsThe authors declare that they have no competing interests

Authorsrsquo contributionsSC wrote the R package including its evaluation and drafted the manuscriptSD and SD contributed to the structure of the simulation studies applicationand planning of the manuscript All authors read and approved the finalmanuscript

AcknowledgementsWe sincerely thank the editor and the two reviewers for their constructivecomments that lead to an improved manuscript This research was part of SCrsquosdoctoral dissertation work SC acknowledges generous support and adissertation completion award by the School of Interdisciplinary and Graduate

Studies of the University of Louisville We also acknowledge partial support bythe Department of Bioinformatics and Biostatistics of University of Louisvilletoward the processing charges This research work was partially supported byNIH grants CA133844 (Su Datta) and CA170091 minus 01A1 (Su Datta)

Received 15 April 2013 Accepted 16 July 2013Published 24 July 2013

References1 Chakraborty S Datta S Datta S Surrogate variable analysis using

partial least squares in gene expression studies Bioinformatics 201228(6)799ndash806

2 Geladi P Kowalski B Partial least squares regression a tutorialAnalytica Chimica Acta 1986 1851ndash17

3 Tusher V Tibshirani R Chu G Significance analysis of microarraysapplied to the ionizing radiation response Proc Natl Acad Sci 2001985116ndash5121

4 Smith GK Limma linear models for microarray data In Bioinformaticsand Computational Biology Solutions using R and Bioconductor Edited byGentleman R Carey V Dudoit S Irizarry R Huber W New York Springer2002

5 Leek ST et al The sva package for removing batch effects and otherunwanted variation in high-throughput experiments Bioinformatics2012 28(6)882ndash883

6 Kerr MK et al Analysis of variance for gene expression microarraydata J Comput Biol 2000 7819ndash837

7 Kerr MK et al Statistical analysis of a gene expression microarraydata Stat Sinica 2002 12(1)203ndash217

8 Hirotsugu A Likelihood and the Bayes procedure In Bayesian StatisticsEdited by Bernardo JM De Groot MH Lindley DV Smith AFM ValenciaSpain University Press 1980143ndash203

9 Benjamini Y Hochberg Y Controlling the false discovery rate apractical and powerful approach tomultiple testing J R Stat Soc 199557(1)289ndash300

10 Golub T et alMolecular classification of cancer class discovery andclass prediction by gene expression monitoring Science 1999286531ndash537

11 Dudoit S Friedlyand J Speed TP Comparison of discriminationmethods for the classification of tumors using gene expressiondata J Am Stat Assoc 2002 97(457)77ndash87

12 Giordan M A two-stage procedure for the removal of batch effects inmicroarray studies Stat Biosci 2013 doi101007s12561-013-9081-1

13 Burton JE CD74 is expressed by multiple myeloma and is apromising target for therapy Clin Cancer Res 2004 10(19)6606ndash6611

14 Wu S Levels of the soluble 55-kilodalton isoform of tumor necrosisfactor receptor in bonemarrow are correlated with the clinicaloutcome of children with acute lymphoblastic leukemia in firstrecurrence Cancer 2003 98(3)625ndash631

15 Shimada H et al Potential involvement of the AML1-MTG8 fusionprotein in the granulocytic maturation characteristic of the t(821)acute myelogenous leukemia revealed by microarray analysisLeukemia 2002 16(5)874ndash885

16 Sagawa N et al Gelsolin supresses tumorigenicity through inhibitingPKC activation in human lung cancer cell line PC10 Br J Cancer 200388(4)606ndash612

17 Haga K The mechanism for reduced expression of gelsolin tumorsupressor protein in bladder cancer Hokkaido Igaku Zasshi 200378(1)29ndash37

doi1011861471-2105-14-236Cite this article as Chakraborty et al svapls an R package to correct forhidden factors of variability in gene expression studies BMC Bioinformatics2013 14236

  • Abstract
    • Background
    • Results
    • Conclusions
      • Background
      • Implementation
        • Brief overview of the package
        • Comparative evaluation with other available software packages
        • Application on the Golub data
          • Discussion
          • Conclusions
          • Availability and requirements
            • Project name
            • Project home page
            • Operating system and R version
            • License
              • Competing interests
              • Authors contributions
              • Acknowledgements
              • References