
Pathway Analysis of Microarray Data via Regression


JOURNAL OF COMPUTATIONAL BIOLOGY

Volume 15, Number 3, 2008

© Mary Ann Liebert, Inc.

Pp. 269–277

DOI: 10.1089/cmb.2008.0002

Pathway Analysis of Microarray Data via Regression

A.J. ADEWALE,1,* I. DINU,2 J.D. POTTER,3 Q. LIU,2 and Y. YASUI2

ABSTRACT

Pathway analysis of microarray data evaluates gene expression profiles of a priori defined

biological pathways in association with a phenotype of interest. We propose a unified

pathway-analysis method that can be used for diverse phenotypes including binary, multi-

class, continuous, count, rate, and censored survival phenotypes. The proposed method also

allows covariate adjustments and correlation in the phenotype variable that is encountered

in longitudinal, cluster-sampled, and paired designs. These are accomplished by combining

the regression-based test statistic for each individual gene in a pathway of interest into a

pathway-level test statistic. Applications of the proposed method are illustrated with two

real pathway-analysis examples: one evaluating relapse-associated gene expression involving

a matched-pair binary phenotype in children with acute lymphoblastic leukemia; and the

other investigating gene expression in breast cancer tissues in relation to patients’ survival

(a censored survival phenotype). Implementations for various phenotypes are available in

R. Additionally, an Excel Add-in for a user-friendly interface is currently being developed.

Key words: gene clusters, gene expression, statistics.

1. INTRODUCTION

ANALYSIS OF MICROARRAY DATA was focused initially on identifying individual genes that are

differentially expressed between two classes of a phenotype. The focus has since expanded to include

other kinds of phenotypes, such as censored survival and continuous phenotypes, and also to identify

biological pathways (i.e., sets of genes) that are differentially expressed according to a phenotype. The

goal of this paper is to propose and illustrate a unified general analysis method of microarray data for

identifying pathways (or gene sets) whose expressions are associated with a phenotype of any kind. To

accommodate various analysis settings, our method also allows correlation among samples (e.g., paired,

clustered, or longitudinal data) and adjustments for covariates that may be associated with the phenotype

(e.g., age, sex, race), in addition to handling any type of phenotype, censored or uncensored.

1 Merck & Co., Inc., 351 N. Sumneytown Pike, UGIC-36 North Wales, Pennsylvania 19454.

2 Department of Public Health Sciences, University of Alberta, Edmonton, Alberta, Canada.

3 Cancer Prevention Program, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington.

* This research was conducted when Dr. Adewale was a Postdoctoral Fellow at the University of Alberta, Edmonton, Canada.

Many authors, including Mootha et al. (2003), Goeman et al. (2004, 2005), Mansmann and Meister (2005), and Dinu et al. (2007), have proposed methods for pathway analysis of microarray data for different types of phenotype. In particular, Goeman et al. (2004) proposed a score test that is based on

random-effects modeling of parameters corresponding to the coefficients of the individual genes in the

pathway. Their method addressed binary and continuous phenotypes as follows. Denoting the biologic

outcome of interest (phenotype) by $Y$ and the pathway expressions by $(x_1, \ldots, x_m)$, they proposed the test statistic

$$Q = \frac{(Y - \mu)^T R (Y - \mu)}{\sigma^2}$$

where $R = (1/m) X X^T$, $X = (x_1, x_2, \ldots, x_m)$ is a matrix with columns of gene expression vectors, $Y$ is the vector of outcomes, $\mu = E_{\mathrm{null}}(Y)$ is the mean outcome under the null hypothesis of no association, and $\sigma^2 = \mathrm{Var}_{\mathrm{null}}(Y)$ is the variance of the outcome under the null hypothesis of no association. Later, Goeman

et al. (2005) extended the method to the censored-survival phenotype with use of a modeling framework

incorporating random effects and Cox proportional hazards. In Dinu et al. (2007), we proposed a test

called SAM-GS for assessing differential expression of pathways between two classes of a phenotype. The

SAM-GS approach addressed the issue of the low-variability characteristics of microarray data, using an

adjustment introduced in a popular individual-gene analysis method, significance analysis of microarray

(SAM) (Tusher et al., 2001). The SAMGS statistic for pathway analysis with a binary phenotype is given by:

$$\mathrm{SAMGS} = \sum_{p=1}^{m} d_p^2$$

where $m$ is the number of genes in the pathway of interest, $d_p = \frac{\bar{x}_p(1) - \bar{x}_p(2)}{s_p + s_0}$, $\bar{x}_p(k)$ is the average expression for the pth gene in the pathway for the kth class of a binary phenotype ($k = 1, 2$), and $s_p$ is the pooled standard deviation. The constant $s_0$ was added to adjust for the small variability characteristics of microarray data (Tusher et al., 2001).
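As a minimal sketch of the SAMGS formula above, the following Python function computes the statistic for a two-class phenotype. The function name `samgs`, the input layout, and the exact form of $s_p$ (taken here as the pooled standard error of the mean difference, as in SAM) are assumptions made for illustration; the choice of $s_0$ is left to the caller and is not the SAM tuning procedure.

```python
from math import sqrt


def samgs(class1, class2, s0=0.0):
    """Sum of squared SAM-like d statistics over the genes of one pathway.

    class1, class2: lists of samples; each sample is a list of m expressions.
    s0: non-negative constant added to the denominator (supplied by the caller).
    """
    n1, n2 = len(class1), len(class2)
    m = len(class1[0])
    total = 0.0
    for p in range(m):
        x1 = [sample[p] for sample in class1]
        x2 = [sample[p] for sample in class2]
        mean1, mean2 = sum(x1) / n1, sum(x2) / n2
        ss = sum((v - mean1) ** 2 for v in x1) + sum((v - mean2) ** 2 for v in x2)
        # Assumed form of s_p: pooled standard error of the mean difference.
        s_p = sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
        d_p = (mean1 - mean2) / (s_p + s0)
        total += d_p ** 2
    return total
```

A larger `s0` shrinks every $d_p$ toward zero, which damps genes whose apparent significance comes only from a very small $s_p$.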

In this paper, we provide a particular view of SAM-GS and Goeman et al.’s global test that permits

pathway analysis of diverse phenotypes, including multi-class, continuous, and censored-survival phenotypes, while allowing covariate adjustments and correlated data. The generality of the proposed method

is achieved by use of regression methods. The proposed approach is a “self-contained hypothesis testing”

method (Goeman and Bühlmann, 2007), which evaluates the association of a pathway of interest with a

phenotype using the expressions of genes in the pathway exclusively: the expressions of genes outside of

the pathway of interest do not influence the testing of the association.

2. PROPOSED METHOD

2.1. Overview

The SAM-GS statistic of Dinu et al. (2007), $\mathrm{SAMGS} = \sum_{p=1}^{m} d_p^2$, is a sum of the squared t-like test statistics for the individual genes in the pathway:

$$d_p = \frac{\bar{x}_p(1) - \bar{x}_p(2)}{s_p + s_0}.$$

The idea of summing univariate test statistics as a basis of testing a multivariate hypothesis was in line

with the work of Dempster (1958, 1960) on the two-sample multivariate mean comparison problem with

small samples where the traditional Hotelling’s T-square test fails.

Goeman et al. (2004) also described the global test’s Q statistic as an average of the m test statistics

calculated as though each of the $m$ individual genes constitutes a pathway by itself. That is,

$$Q = \frac{1}{m} \sum_{p=1}^{m} \frac{1}{\sigma^2} \left[ x_p^T (Y - \mu) \right]^2$$

where $Q_p = \frac{1}{\sigma^2} \left[ x_p^T (Y - \mu) \right]^2$ is the test statistic for a pathway consisting of just the pth gene.
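The equivalence between the matrix form of $Q$ and the average of the per-gene statistics $Q_p$ can be checked numerically. The sketch below is illustrative only: the function names and the row-per-sample layout of `X` are assumptions, and $\mu$ and $\sigma^2$ are passed in rather than estimated.

```python
def global_test_q(X, y, mu, sigma2):
    """Return (Q, [Q_1, ..., Q_m]) with Q the average of the per-gene Q_p.

    X: n rows (samples), each a list of m gene expressions.
    """
    n, m = len(X), len(X[0])
    resid = [yi - mu for yi in y]
    q_genes = []
    for p in range(m):
        score = sum(X[i][p] * resid[i] for i in range(n))  # x_p^T (Y - mu)
        q_genes.append(score ** 2 / sigma2)
    return sum(q_genes) / m, q_genes


def q_matrix_form(X, y, mu, sigma2):
    """Compute Q = (Y - mu)^T R (Y - mu) / sigma^2 with R = (1/m) X X^T."""
    n, m = len(X), len(X[0])
    resid = [yi - mu for yi in y]
    total = 0.0
    for i in range(n):
        for j in range(n):
            r_ij = sum(X[i][p] * X[j][p] for p in range(m)) / m
            total += resid[i] * r_ij * resid[j]
    return total / sigma2
```

Both functions return the same value for any input, because $X X^T = \sum_p x_p x_p^T$.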

It is this particular view of combining component-wise test statistics for testing a multivariate hypothesis that motivated our approach to pathway analysis for various phenotypes. The proposed pathway statistic is defined as follows:

$$W = \sum_{p=1}^{m} \left( \frac{r_p}{s_p} \right)^2$$

where $r_p$ is any appropriate measure of association between the phenotype $Y$ and the expression $x_p$ for the pth gene in the pathway, and $s_p$ is the standard error of $r_p$. We propose taking $r_p$ as the regression coefficient from modeling the pth gene as a predictor of the phenotype $Y$ in an appropriate regression framework. The form of the test statistic is a sum of squares of the Wald statistics for individual genes constituting the pathway.
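Once the per-gene estimates and standard errors are available from any regression fit, assembling $W$ is a one-line computation. The helper below is a hypothetical illustration (the name `pathway_w` is not from the paper) and assumes the two lists are aligned gene by gene.

```python
def pathway_w(r, s):
    """Sum of squared Wald statistics (r_p / s_p)^2 over the genes in a pathway.

    r: per-gene association estimates (e.g., regression coefficients).
    s: the corresponding standard errors, in the same gene order.
    """
    if len(r) != len(s):
        raise ValueError("r and s must have one entry per gene")
    return sum((rp / sp) ** 2 for rp, sp in zip(r, s))
```

For example, two genes with coefficients 1.0 and -2.0 and standard errors 0.5 and 1.0 each contribute 4, so the sign of an individual effect does not matter for $W$.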

Thus, the unified pathway analysis method we propose derives rp and sp from the framework of

regression methods for the outcome variable Y of various types. We describe below the proposed test

statistic by grouping various analysis settings into three categories: (1) uncensored independent phenotype;

(2) uncensored correlated phenotype; and (3) censored phenotype.

2.2. Uncensored independent phenotype

Consider data of the form $\{(x_{i1}, \ldots, x_{im}), z_i, Y_i\}_{i=1}^{n}$ where, for the $i$th individual, $(x_{i1}, \ldots, x_{im})$ are the

gene expressions in the pathway of interest, zi is the covariate vector for which adjustment is desired,

and Yi is the phenotype of interest. In order to assess the association of the pathway with an uncensored

independent phenotype, we adopt a regression framework that accommodates diverse uncensored outcome

measures—generalized linear models (GLMs). GLMs (Nelder and Wedderburn, 1972; McCullagh and

Nelder, 1989) are a family of regression models for diverse outcome types using an exponential family of

distributions: distributions in this family include Gaussian, gamma, Poisson, binomial, and inverse Gaussian

distributions. Thus, the GLM framework provides a unified approach to modeling independent binary,

count, rate, and non-Gaussian continuous outcomes as well as classical Gaussian continuous outcomes.

The linear predictor is a component of a GLM through which the influences of the predictors are specified.

Our proposed approach is to fit each gene in the pathway, one at a time, along with covariates of interest

to the analysis. That is, we fit a GLM with the linear predictor $\eta_i = \beta_0 + \beta_1^{(p)} x_{ip} + z_i^T \alpha$. In computing the test statistic, $W = \sum_{p=1}^{m} r_p^2 / s_p^2$, the measure of association, $r_p$, between the phenotype and the expression $x_p$ for the pth gene in the pathway is the regression coefficient $\hat{\beta}_1^{(p)}$ of the above GLM and $s_p$ is its corresponding standard error. Then, the statistic $W$ is in the form of a sum of $m$ Wald statistics from $m$ GLM fits where each of $m$ genes in the pathway of interest provides a Wald statistic that is derived with an adjustment for the covariates.
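For a binary phenotype, the gene-at-a-time GLM fits described above can be sketched with a logistic model fitted by Newton-Raphson. This is a minimal standard-library illustration, not the authors' R implementation: all function names are hypothetical, the design row for gene $p$ is assumed to be $(1, x_{ip}, z_i)$, and no safeguards against separation or non-convergence are included.

```python
from math import exp


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    k = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def score_and_information(design, y, beta):
    """Score vector and Fisher information of the logistic log-likelihood."""
    k = len(beta)
    score = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for row, yi in zip(design, y):
        eta = sum(b * v for b, v in zip(beta, row))
        mu = 1.0 / (1.0 + exp(-eta))
        for a in range(k):
            score[a] += row[a] * (yi - mu)
            for b in range(k):
                info[a][b] += mu * (1.0 - mu) * row[a] * row[b]
    return score, info


def logistic_wald(design, y, iters=25):
    """Fit a logistic GLM; return (coefficients, standard errors)."""
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(iters):
        score, info = score_and_information(design, y, beta)
        cov = invert(info)
        beta = [beta[a] + sum(cov[a][b] * score[b] for b in range(k))
                for a in range(k)]
    _, info = score_and_information(design, y, beta)
    cov = invert(info)
    return beta, [cov[a][a] ** 0.5 for a in range(k)]


def pathway_w_logistic(genes, covariates, y):
    """W for a 0/1 phenotype: one covariate-adjusted logistic fit per gene."""
    w = 0.0
    for p in range(len(genes[0])):
        design = [[1.0, g[p]] + list(z) for g, z in zip(genes, covariates)]
        beta, se = logistic_wald(design, y)
        w += (beta[1] / se[1]) ** 2  # index 1 is the gene coefficient
    return w
```

Other members of the GLM family would change only the mean and variance functions inside `score_and_information`; the assembly of $W$ stays the same.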

2.3. Uncensored correlated phenotype

Suppose we have clustered data, $\{x_{ij} = (x_{ij1}, \ldots, x_{ijm}), z_{ij}, Y_{ij}\}_{i=1}^{n}$, where $x_{ij}$ is the vector of gene expressions for the $j$th observation from cluster $i$, $z_{ij}$ is the vector of covariates, and $Y_{ij}$ denotes the corresponding phenotype. The cluster might correspond to an individual where the observations within a cluster are measurements taken longitudinally over time or under varying experimental conditions.

The cluster might also consist of related subjects, for example, members of a family, clinic, or community.

Under this setting, the outcome data are no longer independent because observations within a cluster tend

to be more alike (or unalike under some circumstances) compared to observations from different clusters.

Thus, an appropriate model should account for the lack of statistical independence due to clustering.

Generalized linear mixed models (GLMMs) are a natural extension of GLMs that accommodate clustering via use of random effects in the linear predictor. In the simplest form, the linear predictor includes just a random intercept: $\eta_{ij} = \beta_0 + \beta_1^{(p)} x_{ijp} + z_{ij}^T \alpha + u_i$, where $u_i$ is the random intercept, which is usually assumed to follow a mean-zero Gaussian distribution with an unknown variance. As in the GLM framework, the measure of association of the pth gene with the phenotype is the estimated regression coefficient, $r_p = \hat{\beta}_1^{(p)}$, and $s_p$ is its standard error.


Alternative frameworks for accommodating correlated phenotypes include generalized estimating equations (GEE), a quasi-likelihood approach that requires the specification of the mean response (i.e., the specification of the linear predictor as in GLMs and GLMMs), a variance function, and a pairwise correlation pattern among observations from the same cluster, without fully specifying a particular multivariate distribution.

Binary matched-pair data are a special case of clustered data where the phenotype is binary and each

cluster is a pair of observations. The conditional logistic regression model can be used for such data (see an example in Section 3.1).

2.4. Censored-survival phenotype

Consider censored survival data, $\{(x_{i1}, \ldots, x_{im}), z_i, Y_i = (T_i, c_i)\}$, where $c_i = 1$ and $0$ denote an occurrence of the event of interest and censoring, respectively, and $T_i$ is the survival time if $c_i = 1$ or the censoring time if $c_i = 0$. Regression models for censored survival data provide the necessary elements of $W$. For example, Cox proportional hazards models specify the hazard of the event at time $t$ by $h(t \mid x_p, z) = h_0(t) \exp(\eta(x_p, z))$, where the function $h_0(t)$ is an unspecified baseline hazard function and the function $\eta(x_{ip}, z_i) = \beta^{(p)} x_{ip} + z_i^T \alpha$ captures the influence of the pth gene and covariates on the hazard function at time $t$ (Cox, 1972). The association measure for $W$ can be taken as $r_p = \hat{\beta}^{(p)}$, the parameter estimate of the log-hazard ratio associated with expression of the pth gene, with $s_p$ being its corresponding standard error.
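To illustrate how $r_p$ and $s_p$ arise from a Cox model, the sketch below maximizes the partial likelihood for a single gene with no covariates by Newton-Raphson. It is an assumption-laden illustration rather than the authors' code: function names are hypothetical, tied event times are handled in the Breslow manner simply because each risk set contains all subjects with time at least the event time, and no convergence checks are made.

```python
from math import exp


def cox_score_info(beta, x, time, event):
    """Score and information of the Cox partial likelihood (one covariate)."""
    n = len(x)
    score, info = 0.0, 0.0
    for i in range(n):
        if event[i] != 1:
            continue  # censored subjects contribute only through risk sets
        risk = [j for j in range(n) if time[j] >= time[i]]
        s0 = sum(exp(beta * x[j]) for j in risk)
        s1 = sum(x[j] * exp(beta * x[j]) for j in risk)
        s2 = sum(x[j] ** 2 * exp(beta * x[j]) for j in risk)
        score += x[i] - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info


def cox_wald(x, time, event, iters=30):
    """Return (beta_hat, standard error) for a one-covariate Cox model."""
    beta = 0.0
    for _ in range(iters):
        score, info = cox_score_info(beta, x, time, event)
        beta += score / info
    _, info = cox_score_info(beta, x, time, event)
    return beta, info ** -0.5
```

The gene's contribution to $W$ is then `(beta / se) ** 2`; adjusting for covariates would require the multi-parameter analogue of the same Newton step.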

We note that other suitable models for censored data can be used. In situations where the proportional

hazards assumption is untenable, for example, a piecewise exponential model which assumes proportional

hazards in a series of consecutive time intervals can be used. For a correlated censored-survival phenotype,

the clustering introduces dependency in the data and frailty models that incorporate random effects into

the linear component of a proportional hazards model or an accelerated failure time model can be applied

(Aalen, 1998; Hougaard, 1995).

2.5. Significance testing

An approximation of the null distribution of the proposed test statistic W , by a scaled chi-squared

distribution, may be plausible, but this option was not pursued here. In fact, distributional approximations

in the context of pathway analysis of microarray data may not be satisfactory: see, for example, Mansmann and Meister (2005) and Liu et al. (2007). Rather, statistical significance for the association between gene expression in a pathway and a phenotype is assessed by a permutation test. The permutation adopts the approach of Braun and Feng (2001), which consists of permuting the indices of the (phenotype, covariates) set, $\{Y_i, z_i\}_{i=1}^{n}$, while holding all regression parameters fixed (at the estimates obtained from the original unpermuted data) in each permutation, except the parameter corresponding to the gene effect. We note

that fixing all other parameters constant in each permutation is required to guarantee the invariance of

the test statistic under the null hypothesis of no gene effect (i.e., no association between the pathway and

phenotype after adjusting for covariate effects).
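The permutation logic can be sketched for the simplest setting with no covariates, where only the phenotype labels are shuffled. The code below is a simplified illustration and does not implement the Braun and Feng (2001) step of holding nuisance parameters fixed. The names `permutation_p_value` and `sum_sq_mean_diff`, the stand-in statistic, and the (count + 1)/(permutations + 1) p-value convention are all assumptions.

```python
import random


def sum_sq_mean_diff(genes, phenotype):
    """Stand-in pathway statistic: sum over genes of squared mean differences."""
    total = 0.0
    for p in range(len(genes[0])):
        g1 = [row[p] for row, y in zip(genes, phenotype) if y == 1]
        g0 = [row[p] for row, y in zip(genes, phenotype) if y == 0]
        total += (sum(g1) / len(g1) - sum(g0) / len(g0)) ** 2
    return total


def permutation_p_value(stat_fn, genes, phenotype, n_perm=1000, seed=0):
    """Permutation p-value for a pathway statistic (larger = more extreme)."""
    rng = random.Random(seed)
    observed = stat_fn(genes, phenotype)
    exceed = 0
    perm = list(phenotype)
    for _ in range(n_perm):
        rng.shuffle(perm)  # break the link between expression and phenotype
        if stat_fn(genes, perm) >= observed:
            exceed += 1
    # Assumed convention: count the observed data as one of the permutations.
    return (exceed + 1) / (n_perm + 1)
```

In practice `stat_fn` would compute $W$ from per-gene regression fits; any function with the same two arguments can be substituted.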

When the data are clustered, one must distinguish two cases: the case where the association of the pathway with the phenotype is assessed within clusters (e.g., paired design, cross-over design, or longitudinal

design with time defining the phenotype of interest) and the case where the association is assessed across

clusters (e.g., repeated measures of subjects with exposure status/level as the phenotype). Within-cluster

permutation is suitable for assessing association within clusters, while block-permutation with clusters taken

as “blocks” is appropriate when clusters are independent and each cluster is indexed with a phenotype

label (Good, 1994).
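The two permutation schemes for clustered data differ only in what is shuffled. The following sketch shows one possible reading of each; the function names and data layouts (a flat list of labels with cluster ids, and a mapping from cluster id to label) are hypothetical.

```python
import random


def permute_within_clusters(labels, clusters, rng):
    """Shuffle phenotype labels only among observations of the same cluster."""
    out = list(labels)
    by_cluster = {}
    for idx, c in enumerate(clusters):
        by_cluster.setdefault(c, []).append(idx)
    for idxs in by_cluster.values():
        vals = [labels[i] for i in idxs]
        rng.shuffle(vals)
        for i, v in zip(idxs, vals):
            out[i] = v
    return out


def permute_cluster_blocks(cluster_labels, rng):
    """Shuffle cluster-level labels across clusters, keeping clusters intact.

    cluster_labels: dict mapping cluster id to its phenotype label.
    """
    ids = list(cluster_labels)
    vals = [cluster_labels[c] for c in ids]
    rng.shuffle(vals)
    return dict(zip(ids, vals))
```

For a matched diagnosis/relapse design, within-cluster permutation amounts to randomly swapping the two labels inside each pair.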

The issue of multiple comparisons is a problem that must be addressed when multiple pathways are to be

assessed for their associations with a phenotype. However, we consider it separately from our development

of a pathway-analysis method because it is not necessary when a single pathway is of interest in pathway

analysis (e.g., a case where the pathway analysis is confirmatory in nature). When the interest is indeed in

multiple pathways, which is often the case where the pathway analysis is exploratory with many pathways

of potential interest, the q-value approach (Storey, 2002; Storey and Tibshirani, 2003; Storey et al., 2004)

or other methods that control for false-discovery rates (FDRs) in multiple comparisons can be employed.
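As one example of an FDR-controlling method, the sketch below computes Benjamini and Hochberg (1995) adjusted p-values across pathways. Note that this is the Benjamini-Hochberg step-up adjustment, not Storey's q-value, which additionally estimates the proportion of true null hypotheses; the function name is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Pathways whose adjusted value falls below a chosen threshold are declared significant at that FDR level.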


3. APPLICATIONS

3.1. Correlated binary phenotype: original versus relapsed childhood acute lymphoblastic leukemia data

Bhojwani et al. (2006) reported on a dataset consisting of 35 children with childhood acute lymphoblastic

leukemia (ALL) who relapsed after diagnosis and therapy. The gene expression profiles of 35 matched

diagnosis/relapse pairs were analyzed. The objective of the analysis of Bhojwani and colleagues was to

identify biological pathways associated with relapse in childhood ALL. The sample consisted of 23 children

who relapsed early (less than 36 months from diagnosis) and 12 children who relapsed late (after 36 months

from diagnosis and therapy). Bhojwani et al. (2006) addressed their objective by conducting a paired t-test

on individual genes. They then adjusted for multiple testing using Benjamini and Hochberg’s false discovery

rate (FDR) and Hochberg’s Bonferroni p-value adjustment (Hochberg, 1988; Benjamini and Hochberg,

1995). Genes meeting a specified FDR criterion were selected for manual classification into biological

functional groups (pathways). They concluded that their analyses revealed significant differences between

diagnosis and early relapse in the expression of genes involved in cell-cycle regulation, DNA repair,

and apoptosis. Although the conclusion of Bhojwani et al. (2006) may be correct, pathway analysis via

regression offers a more systematic assessment of differential pathway expression between the diagnosis

and relapse samples. We re-analyzed their data with the aim of directly identifying biologic pathways that

are associated with relapse using the proposed method.

The Affymetrix gene identifiers were mapped into a gene ontology database—Genebank. Of the 22,283

probes, 11,401 were successfully mapped to 839 distinct pathways using the C1 and C2 pathway databases on “Molecular Signature Database,” provided by the Broad Institute (www.broad.mit.edu/gsea). In Subramanian et al. (2005), Catalog C1 included 24 sets, one for each of the 24 human chromosomes, and 295 sets

corresponding to cytogenetic bands; and Catalog C2 consisted of 472 sets containing gene sets reported

in manually curated databases and 50 sets containing genes reported in various experimental papers.

As the data were paired diagnosis and relapse samples, we employed conditional logistic modeling. Conditional logistic modeling of binary matched pairs entails fitting a no-intercept unconditional logistic model to discordant pairs using the differences of matched covariates as predictors (Breslow and Powers, 1978; Chamberlain, 1980). The artificial response for use in the unconditional logistic model is $y^* = 1$ when $(y_{i1} = -1, y_{i2} = 1)$ and $y^* = 0$ when $(y_{i1} = 1, y_{i2} = -1)$, and the model takes the following form:

$$\mathrm{logit}\left[ p\left( y_i^* = 1 \mid x_{pi}^* = x_{pi2} - x_{pi1} \right) \right] = \beta^{(p)} x_{pi}^*.$$

The association measures of each gene in the pathway of interest that are to be entered into the $W$ statistic are $r_p = \hat{\beta}^{(p)}$ and $s_p = \mathrm{se}(\hat{\beta}^{(p)})$.
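The no-intercept logistic fit on within-pair differences can be sketched for one gene as follows. The sketch assumes every pair is ordered (diagnosis, relapse), so that $y^* = 1$ for all pairs and the only input is the list of differences $x_{pi}^*$; function names are hypothetical, and the estimate is not finite when all differences share the same sign.

```python
from math import exp


def score_and_information(diffs, beta):
    """Score and information for the no-intercept logistic model with y* = 1."""
    score, info = 0.0, 0.0
    for d in diffs:
        prob = 1.0 / (1.0 + exp(-beta * d))  # P(y* = 1 | x* = d)
        score += d * (1.0 - prob)
        info += d * d * prob * (1.0 - prob)
    return score, info


def paired_logistic_wald(diffs, iters=50):
    """Return (beta_hat, standard error) from relapse-minus-diagnosis diffs."""
    beta = 0.0
    for _ in range(iters):
        score, info = score_and_information(diffs, beta)
        beta += score / info
    _, info = score_and_information(diffs, beta)
    return beta, info ** -0.5
```

Summing `(beta / se) ** 2` over the genes of a pathway gives the $W$ statistic for the matched-pair design.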

One hundred sixty-eight of the 839 pathways have q-values less than 0.01, where q-values were obtained

by the method of Storey (Storey, 2002; Storey et al., 2004; Storey and Tibshirani, 2003). Bhojwani et al.

(2006) conducted a stratified analysis for early versus late relapse cases. We found no pathway that was

differentially expressed with q-value less than 0.01 between diagnosis and relapse using the late-relapsed

cases. On the other hand, 45 pathways had q-values less than 0.01 in the early-relapse cases. These

45 pathways are listed in Table 1. Most of the 45 pathways contain a mixture of down- and up-regulated

genes. Note that, in agreement with Bhojwani et al., we found that, in the early-relapse cases, pathways

associated with cell-cycle regulation, DNA repair, and apoptosis are differentially expressed. Moreover,

the pathway analysis via regression provided many other pathways that are differentially expressed.

3.2. Censored survival phenotype: breast cancer survival data

The data for this analysis came from 295 women with primary invasive breast cancer reported in Van de

Vijver et al. (2002). Using the gene-expression profiles of the previously determined 70 marker genes, Van

de Vijver et al. (2002) classified the 295 tumors into good-prognosis and poor-prognosis categories. The

predictive ability of this categorization on time to distant metastases was examined. Here we mapped all

4919 genes into pathways using Genebank’s gene ontology database. We identified 728 pathways (numbers

of genes ranged from 2 to 268 in a pathway). Our objective in this pathway analysis is to identify pathways

that are associated with time to death (overall survival). We fit a proportional hazards model

$$h(t \mid x_{pi}) = h_0(t) \exp(\beta^{(p)} x_{pi}), \quad i = 1, \ldots, n,$$

TABLE 1. LIST OF SIGNIFICANT PATHWAYS FROM CONDITIONAL-LOGISTIC REGRESSION-BASED PATHWAY ANALYSIS

| Pathway name | Set size | No. of genes with $r_p/s_p < 0$ | No. of genes with $r_p/s_p > 0$ | p-value | q-value |
|---|---|---|---|---|---|
| chr1p22 | 28 | 13 | 15 | 0.001 | 0.006 |
| chr2q12 | 8 | 3 | 5 | 0.001 | 0.006 |
| aktPathway | 14 | 11 | 3 | 0.001 | 0.006 |
| cacamPathway | 15 | 8 | 7 | 0.001 | 0.006 |
| cdc25Pathway | 9 | 3 | 6 | 0.001 | 0.006 |
| gcrPathway | 16 | 11 | 5 | 0.001 | 0.006 |
| mrpPathway | 4 | 2 | 2 | 0.001 | 0.006 |
| no1Pathway | 35 | 20 | 15 | 0.001 | 0.006 |
| pepiPathway | 7 | 2 | 5 | 0.001 | 0.006 |
| plk3Pathway | 8 | 4 | 4 | 0.001 | 0.006 |
| rbPathway | 12 | 4 | 8 | 0.001 | 0.006 |
| relaPathway | 12 | 7 | 5 | 0.001 | 0.006 |
| MYC_MUT | 4 | 1 | 3 | 0.001 | 0.006 |
| EMT_DOWN | 47 | 29 | 18 | 0.001 | 0.006 |
| atmPathway | 17 | 9 | 8 | 0.002 | 0.007 |
| cd40Pathway | 13 | 10 | 3 | 0.002 | 0.007 |
| eea1Pathway | 10 | 3 | 7 | 0.002 | 0.007 |
| freePathway | 10 | 4 | 6 | 0.002 | 0.007 |
| g2Pathway | 20 | 11 | 9 | 0.002 | 0.007 |
| il1rPathway | 24 | 14 | 10 | 0.002 | 0.007 |
| MAP00670_One_carbon_pool_by_folate | 14 | 4 | 10 | 0.002 | 0.007 |
| MAP00680_Methane_metabolism | 8 | 3 | 5 | 0.002 | 0.007 |
| nfkbPathway | 16 | 9 | 7 | 0.002 | 0.007 |
| ST_Tumor_Necrosis_Factor_Pathway | 31 | 21 | 10 | 0.002 | 0.007 |
| P53_UP | 34 | 21 | 13 | 0.002 | 0.007 |
| Chr13q12 | 31 | 17 | 14 | 0.004 | 0.009 |
| Chr19q12 | 1 | 0 | 1 | 0.004 | 0.009 |
| Chr6p25 | 11 | 6 | 5 | 0.004 | 0.009 |
| Chr16q11 | 2 | 0 | 2 | 0.004 | 0.009 |
| cptPathway | 2 | 0 | 2 | 0.004 | 0.009 |
| fibrinolysisPathway | 10 | 7 | 3 | 0.003 | 0.009 |
| fosbPathway | 6 | 5 | 1 | 0.004 | 0.009 |
| MAP00195_Photosynthesis | 2 | 0 | 2 | 0.003 | 0.009 |
| MAP00310_Lysine_degradation | 19 | 4 | 15 | 0.004 | 0.009 |
| p53hypoxiaPathway | 12 | 7 | 5 | 0.003 | 0.009 |
| tnf_&_fas_network | 27 | 12 | 15 | 0.004 | 0.009 |
| tnfr2Pathway | 15 | 12 | 3 | 0.003 | 0.009 |
| GO_ROS | 22 | 4 | 18 | 0.004 | 0.009 |
| FRASOR_ER_UP | 30 | 16 | 14 | 0.004 | 0.009 |
| chr10q26 | 29 | 11 | 18 | 0.005 | 0.010 |
| chr11q22 | 24 | 10 | 14 | 0.005 | 0.010 |
| atrbrcaPathway | 19 | 3 | 16 | 0.005 | 0.010 |
| dnafragmentPathway | 7 | 1 | 6 | 0.005 | 0.010 |
| longevityPathway | 11 | 7 | 4 | 0.005 | 0.010 |
| LEU_DOWN | 166 | 48 | 118 | 0.005 | 0.010 |

TABLE 2. UNIVARIATE COX REGRESSION OF OUTCOMES WITH DEMOGRAPHIC AND CLINICAL CHARACTERISTICS

| Characteristics | Overall mortality hazard ratio (95% CI) | p-value |
|---|---|---|
| Age | 0.94 (0.91, 0.98) | 0.004 |
| Diameter | 1.04 (1.01, 1.06) | 0.001 |
| Lymph-node status | 1.07 (0.97, 1.17) | 0.170 |
| Mastectomy (yes vs. no) | 1.20 (0.77, 1.87) | 0.410 |
| Estrogen receptor status (positive vs. negative) | 0.30 (0.20, 0.48) | <0.001 |
| Tumor grade | | |
| Intermediate (vs. good) | 4.65 (1.61, 13.40) | <0.001 |
| Poor (vs. good) | 10.22 (3.69, 28.30) | |
| Chemotherapy (yes vs. no) | 0.79 (0.49, 1.26) | 0.330 |
| Hormonal therapy (yes vs. no) | 0.61 (0.26, 1.40) | 0.240 |

where $\beta^{(p)}$ is the regression coefficient corresponding to the pth gene in a particular pathway. The estimated

coefficient from the Cox model for each individual gene was taken as the association measure between the

gene and survival, and entered into W along with its standard error estimate.

Of the 728 pathways examined, 635 were found to be significantly (q-value < 0.01) associated with

overall survival. Further, we adjusted for known demographic and clinical covariates of the overall survival

and re-examined the association of the 728 pathways with overall survival. Demographic and clinical

covariates available in the data included age, diameter of the tumor, lymph-node status (positive, coded 1;

or negative, coded 0), mastectomy (vs. no mastectomy), estrogen-receptor status (positive, coded 1; or

negative, coded 0), tumor histological grade, chemotherapy (vs. no chemotherapy) and hormonal therapy

(vs. no hormonal therapy). We first examined the association of each covariate with overall survival in

a univariate Cox regression model. The results of these univariate analyses are presented in Table 2.

Covariates with p-value less than 0.2 were earmarked to be adjusted for in the pathway analysis. There is

a substantial reduction in the number of pathways that were identified as having an association with overall

survival (q-value < 0.01), after adjustment for known demographics and clinical information. Specifically,

after the covariate adjustment, only four pathways were significantly (q-value < 0.01) associated with

overall survival. These pathways are listed in Table 3. All four were among the 635 pathways found to

be statistically significant in the analysis without covariate adjustment. Of all four pathways (Table 3),

three are tightly related: Glycine serine and threonine metabolism; Cyanoamino acid metabolism; and

Methane metabolism. Further, a key component of One_carbon_pool_by_folate pathway is coupled to

methionine synthase.

TABLE 3. LIST OF SIGNIFICANT PATHWAYS AFTER COVARIATE ADJUSTMENT (p-VALUE AND q-VALUE BOTH <0.001)

| Pathway name | Set size | No. of genes with $r_p/s_p < 0$ | No. of genes with $r_p/s_p > 0$ |
|---|---|---|---|
| MAP00260_Glycine_serine_and_threonine_metabolism | 7 | 3 | 4 |
| MAP00460_Cyanoamino_acid_metabolism | 4 | 2 | 2 |
| MAP00670_One_carbon_pool_by_folate | 5 | 1 | 4 |
| MAP00680_Methane_metabolism | 2 | 0 | 2 |


4. DISCUSSION

We have proposed a unified method of pathway analysis for diverse phenotypes via a regression framework. The term “pathway” has been used loosely to include groups of genes based on their chromosomal locations, since the proposed approach is also relevant in situations where the objective is to discover whether the phenotype is associated with expression of a group of genes that are located closely on the same chromosome.

The proposed W statistic is a sum of squares of Wald statistics from regression models assessing the

associations of individual genes in the pathway of interest with a phenotype.

There is an apparent similarity between the proposed W statistic and the SAMGS statistic. However,

SAMGS is based on linear modeling of individual gene expression in the pathway with the binary phenotype

label as a predictor. Each term $d_p$ of the SAMGS statistic is the ratio of the estimated phenotype effect to its adjusted

standard error. The issue of small variability of gene expression in microarray necessitated an adjustment

to the standard error in the SAMGS statistic in order to mitigate the chance of false discovery (Tusher

et al., 2001; Wright and Simon, 2003). In our proposed method, however, we modeled the phenotype,

not the gene expression, as the outcome variable. The proposed test statistic, therefore, does not require a

variance estimate of each gene’s expression which can lead to extremely small standard errors and high

chance of false discovery. Thus, the adjustment for the small variability used in the SAMGS statistic is not

applicable to the proposed test statistic.

The approach presented offers a significant addition to the literature in two ways. First, the proposed

method accommodates pathway analysis with respect to any phenotype that can be handled within the

existing regression framework. Second, the use of a regression framework enables the assessment of pathway-phenotype associations accounting for other known prognostic factors (covariates) and in study designs

that are subject to correlation or clustering in data. This pathway-analysis method, by incorporating these

features, should be a useful addition to the available tools.

In particular, for censored survival phenotype, the global test of Goeman et al. (2005) is the only

currently applicable method. Our proposal renders pathway analysis accessible for any phenotype that has an existing classical regression method. An important advantage of the unified approach to pathway analysis

presented here is that the method is situated within the context of well known classical regression methods.

It thus puts a statistically sound yet simple tool within the reach of biologists and clinicians with

interest in understanding the association of diverse clinical phenotypes and biological pathways. Also, the

method readily lends itself to easy implementation using existing programmable software with capability

for handling analysis involving basic regression methods with a minimal programming effort from the

analyst. R implementations for various phenotypes are available and can be requested from the authors.

We are currently preparing an Excel Add-in for a user-friendly interface.

ACKNOWLEDGMENTS

Support for this research was provided by the Alberta Heritage Foundation for Medical Research (postdoctoral fellowships to A.J.A. and I.D., senior health investigator award to Y.Y.) and the Canada Research Chairs Program, Canadian Institutes of Health Research (to Y.Y.).

DISCLOSURE STATEMENT

No competing financial interests exist.

REFERENCES

Aalen, O.O. 1998. Frailty models. In: Everitt, B.S., and Dunn, G., eds., Statistical Analysis of Medical Data: New

Developments, Arnold, London, pp. 59–74.

Agresti, A. 2002. Categorical Data Analysis. Wiley, New York.

Benjamini, Y., and Hochberg, Y. 1995. Controlling the false discovery rate: a practical and powerful approach to

multiple testing. J. R. Statist. Soc. Ser. B 57, 289–300.

Bhojwani, D., et al. 2006. Biologic pathways associated with relapse in childhood acute lymphoblastic leukemia: a

Children’s Oncology Group study. Blood 108, 711–717.

Braun, T.M., and Feng, Z. 2001. Optimal permutation tests for the analysis of group randomized trials. J. Am. Statist.

Assoc. 96, 1424–1432.

Breslow, N., and Powers, W. 1978. Are there two logistic regressions for retrospective studies? Biometrics 34, 100–105.

Chamberlain, G. 1980. Analysis of covariance with qualitative data. Rev. Econ. Stud. 47, 225–238.

Cox, D.R. 1972. Regression models and life-tables. J. R. Statist. Soc. Ser. B 34, 187–220.

Dempster, A.P. 1958. A high dimensional two sample significance test. Ann. Math. Statist. 29, 995–1010.

Dempster, A.P. 1960. A significance test for the separation of two highly multivariate small samples. Biometrics 16,

41–50.

Dinu, I., et al. 2007. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinform. 8, 242.

Goeman, J.J., and Bühlmann, P. 2007. Analyzing gene expression data in terms of gene sets: methodological issues.

Bioinformatics 23, 980–987.

Goeman, J.J., et al. 2004. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics

20, 93–99.

Goeman, J.J., et al. 2005. Testing association of a pathway with survival using gene expression data. Bioinformatics

21, 1950–1957.

Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, New

York.

Hochberg, Y. 1988. A sharper Bonferroni for multiple significance testing. Biometrika 75, 800–803.

Hougaard, P. 1995. Frailty models for survival data. Lifetime Data Anal. 1, 255–273.

Liu, Q., et al. 2007. Comparative evaluation of gene-set analysis methods. BMC Bioinform. 8, 431.

Mansmann, U., and Meister, R. 2005. Testing differential gene expression in functional groups. Goeman’s global test

versus an ANCOVA approach. Methods Inf. Med. 44, 449–453.

McCullagh, P., and Nelder, J.A. 1989. Generalized Linear Models. Chapman & Hall/CRC, New York.

McCulloch, C.E., and Searle, S.R. 2000. Generalized, Linear, and Mixed Models. Wiley, New York.

Molenberghs, G., and Verbeke, G. 2005. Models for Discrete Longitudinal Data. Springer, New York.

Mootha, V.K., et al. 2003. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately

downregulated in human diabetes. Nat. Genet. 34, 267–273.

Nelder, J.A., and Wedderburn, R.W.M. 1972. Generalized linear models. J. R. Statist. Soc. Ser. A 135, 370–384.

Storey, J.D. 2002. A direct approach to false discovery rates. J. R. Statist. Soc. Ser. B 64, 479–498.

Storey, J.D., et al. 2004. Strong control, conservative point estimation and simultaneous conservative consistency of

false discovery rates: a unified approach. J. R. Statist. Soc. Ser. B 66, 187–205.

Storey, J.D., and Tibshirani, R. 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA 100,

9440–9445.

Subramanian, A., et al. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide

expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550.

Tusher, V.G., et al. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl.

Acad. Sci. USA 98, 5116–5121.

van de Vijver, M.J., et al. 2002. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J.

Med. 347, 1999–2009.

Wright, G.W., and Simon, R.M. 2003. A random variance model for detection of differential gene expression in small

microarray experiments. Bioinformatics 19, 2448–2455.

Address reprint requests to:

Dr. Y. Yasui

Department of Public Health Sciences

University of Alberta

13-106A Clinical Sciences Building

Edmonton, Alberta T6G 2G3 Canada

E-mail: [email protected]