
A guide to the integrated application of on-line data mining tools for the inference of gene functions at the systems level


© 2008 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim 1375

Biotechnol. J. 2008, 3, 1375–1387 DOI 10.1002/biot.200800142 www.biotechnology-journal.com

Review

Stuart Meier¹ and Chris Gehring²

¹ South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
² Department of Biotechnology, University of the Western Cape, Cape Town, South Africa

Genes function in networks to achieve a common biological response. Thus, inferences into the biological role of individual genes can be gained by analyzing their association with other genes with more precisely defined functions. Here, we present a guide, using the well-characterized Arabidopsis thaliana pathogenesis-related protein 2 gene (PR-2) as an example, to document how the sequential use of web-based tools can be applied to integrate information from different databases and associate the function of an individual gene with a network of genes and additionally identify specific biological processes in which they collectively function. The analysis begins by performing a global expression correlation analysis to build a functionally associated gene network. The network is subsequently analyzed for Gene Ontology enrichment, stimuli and mutant-specific transcriptional responses and enriched putative promoter regulatory elements that may be responsible for their correlated relationships. The results for the PR-2 gene are entirely consistent with the published literature documenting the accuracy of this type of analysis. Furthermore, this type of analysis can also be performed on other organisms with the appropriate data available and will greatly assist in understanding individual gene functions in a systems context.

Keywords: Annotation · Systems biology · Transcription

Correspondence: Professor Chris Gehring, Department of Biotechnology, University of the Western Cape, Private Bag X17, 7535 Bellville, South Africa
E-mail: [email protected]
Fax: +27 21 959 13 49

Abbreviations: BTH, benzothiadiazole S-methylester; ECGG, expression correlated gene group; GO, gene ontology; GOI, gene of interest; PR-2, pathogenesis related protein-2; pwm, position weight matrix; SA, salicylic acid; SAR, systemic acquired resistance; TF, transcription factor; TFBS, TF binding site

Received 30 June 2008
Revised 10 August 2008
Accepted 27 August 2008

Supporting information available online

Functional annotation in biological systems

In the post-genomic era, one of the big challenges in biology is to determine the functional roles of the tens of thousands of individual protein-encoding genes. Computationally based homology comparisons are commonly used to infer gene functions based on similarities with previously annotated genes. While very useful, homology comparisons are limited to the identification of ‘more of the same’ type of genes and usually fail to provide information regarding the temporal, spatial and stimulus-specific context in which genes are expressed and therefore can only provide inferences to gene functions in an isolated context. This problem is particularly apparent when considering that large gene families within a genome such as kinases or some transcription factor (TF) families share high sequence homology yet function as part of distinct and often unique cellular responses.

The first step commonly used to uncover gene functions on a global scale is the study of global transcriptional responses using DNA microarray data [1]. The identification of protein-encoding genes that are transcriptionally activated (differentially expressed) in response to a defined condition implies a functional role for these genes within the response since this presumably results in increased cellular levels of their encoded protein products, which are assumed to play a role in mediating cellular adaptations to the conditions under investigation.

In eukaryotes, it is evident from the co-expression of large sets of genes in response to specific stimuli that genes do not act alone, and many cellular processes require the coherent participation of multiple gene products [2–4]. Furthermore, a number of studies have shown that genes that have been confirmed to be co-expressed in response to a range of conditions, presumably resulting from co-regulation, have correlated functional relationships including physical interactions between their proteins [1, 5–7]. Thus, inferences to the function of un-annotated genes can be made by observing conditions in which genes are expressed and by analyzing and interpreting other genes with which they commonly co-express. Results from co-expression analyses are particularly telling if one or several of the genes studied have well-defined functions.

Obtaining evidence of co-regulation is complicated by the fact that co-expression in response to a particular condition in a microarray experiment does not necessarily imply co-regulation. A specific condition can co-activate multiple transcription regulatory pathways resulting in the co-expression of genes that are regulated by different mechanisms. However, since genes that encode proteins that function in common responses have frequently been found to be co-regulated, clusters of functionally related and presumably co-regulated genes can be identified by their highly correlated transcriptional responses over large numbers of diverse microarray experiments [4, 8–10].

Public access to full genome DNA microarray data sets is a precondition to provide the experimental data required to perform such analyses. When available, these data can be analyzed with bioinformatics tools that can identify genes that share the most similar expression correlation with a defined gene of interest (GOI). Analyzing microarray expression data by correlation analysis has the advantage that a single GOI can be selected as a “reference gene” and its expression correlation with all other genes within the genome can be determined (Fig. 1). Following the establishment of an expression correlated gene group (ECGG), a number of web-based tools can be used in combination to integrate different data sets and provide a rapid global functional assessment of the biological processes in which the gene (GOI) functions.

Here, we present a report that details the use of such tools and provide an example in Arabidopsis thaliana. In brief, the method consists of the following four key steps: (1) global gene expression correlation analysis of GOI to establish an ECGG; (2) gene ontology (GO) analysis of ECGG; (3) identification of biological conditions that induce differential expression of the ECGG; and (4) promoter content analysis of the ECGG.

Functional annotation of a GOI

To demonstrate the power of this combinational analysis we provide an example using the well-characterized pathogenesis-related protein-2 (PR-2) encoding gene in Arabidopsis (AT3G57260). We chose to document the procedure with a well-characterized gene since it can serve as a positive control for which the accuracy of the results generated can be assessed by comparison with published reports. In plants, the name “pathogenesis-related protein” is a collective term that encompasses all proteins that are present at almost undetectable levels in healthy tissue but are induced at the protein level following pathogen infection. The classification of these proteins is based on their pathogen inducible expression rather than defined by specific functional roles in defense [11]. This point is brought into focus when considering PR-1, which is the quintessential marker of the systemic acquired resistance (SAR) response, yet its biological role remains largely unknown [12]. Other characteristic features of PR proteins include an N-terminal signal peptide [11] that directs the molecules into the extracellular space. The PR-2 protein as well as the PR-1 and PR-5 proteins have been isolated from the A. thaliana apoplast [13]. Induction of PR proteins at the transcript level appears to occur independently of de novo protein synthesis that is characteristic of genes encoding secreted proteins [14]. The increased transcription of PR genes is commonly used as a marker of SAR, a localized defense response that gives rise to a systemic long-lasting resistance to a broad range of virulent pathogens [15]. The SAR response is characterized by the accumulation of endogenous salicylic acid (SA) in infected tissues and later in distal uninfected tissues with a subsequent induction of a select group of pathogenesis-related genes (PR genes) [16]. SA has been shown to be essential and sufficient to induce the SAR response in plants [17]. PR-2 is annotated as having glucan β-1,3-glucosidase activity that hydrolyzes β-1,3-glycosidic bonds in linear or branched glucans that are present in plant and bacterial cell walls [18]. It is thought that the PR-2 β-glucanases in combination with chitinase family PRs play an important role in plant defense by either degrading cell walls of the pathogens or degrading β-glucans in host cell walls, thereby releasing endogenous oligosaccharides that play important roles as elicitors [18, 19].

[Figure 1 flow diagram: GOI (gene of interest) → expression correlation analysis → ECGG (expression correlated gene group) → GO analysis, condition-specific expression analysis and promoter content analysis → association of GOI ECGG network with specific biological processes.]

Figure 1. Flow diagram illustrating how the sequential use of web-based tools can be used to identify networks of genes that function with a defined GOI in specific biological processes. A correlation analysis is performed to identify genes whose expression is most strongly correlated with the GOI. The ECGG is subsequently subjected to a GO analysis, stimulus-specific expression analysis and promoter content analysis. This process links the GOI with other genes with which it is commonly co-expressed, identifies specific conditions that commonly induce expression of the ECGG and identifies putative regulatory elements that are enriched in the promoters of the ECGG, and thus may be responsible for coordinating their common transcriptional responses.

Expression correlation analysis

The first step of the sequential analysis is the correlation analysis on the GOI using the freely available Arabidopsis co-expression tool (http://www.arabidopsis.leeds.ac.uk/) [20]. This tool uses hybridization signal intensities from microarray experiments to calculate a Pearson correlation coefficient (r value), which is a scale-invariant measure of expression similarity that expresses the strength and direction of the linear relationship between the reference gene (GOI) and all other Arabidopsis genes represented on the selected chip. The tool calculates and returns both negative and positive correlations (ranging from –1 to 1), associated probability (p) and expectation (e) values, which are a measure of the statistical significance against a background of random chance correlations. The analysis is performed across all of the 322 available Ath1 22K microarrays from the NASC/GARNet dataset that contain probe sets that recognize 21 891 Arabidopsis genes, and thus provide a more complete representation of the Arabidopsis genome than the alternative 8K chip.
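The correlation step can be sketched in a few lines of Python. This is a minimal illustration of a Pearson ranking against a reference gene, not the implementation of the co-expression tool itself; the function names, the dictionary layout and the toy signal values are all assumptions made for the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length signal vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)

def correlate_with_reference(expression, reference_id):
    """Rank every other gene by its r value against the reference gene.

    `expression` maps a gene ID to its signal intensities, one value per
    array, with all genes measured on the same arrays in the same order.
    """
    ref = expression[reference_id]
    scores = {
        gene: pearson_r(ref, values)
        for gene, values in expression.items()
        if gene != reference_id
    }
    # Most positively correlated first, most negatively correlated last.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: four arrays, one reference gene and three others.
toy = {
    "GOI":   [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],  # scaled copy of the GOI profile
    "geneB": [4.0, 3.0, 2.0, 1.0],  # reversed GOI profile
    "geneC": [1.0, 3.0, 2.0, 4.0],
}
ranked = correlate_with_reference(toy, "GOI")
```

In this toy set the scaled copy (geneA) ranks first and the reversed profile (geneB) ranks last, which illustrates why r is described as scale-invariant: doubling every signal value leaves the correlation unchanged.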

The arrays included in this analysis are derived from a broad range of experimental samples from specific tissue types, developmental stages, abiotic and biotic treatments and selected mutants. The breadth of the experimental conditions is an important criterion to consider since it has been shown that the use of expression data to successfully predict genes that bind common TFs, and are thus presumably co-regulated, is crucially dependent on the use of a large number of microarrays (minimum 100) performed over diverse experimental conditions [6]. In yeast, it was shown that predictive power increased with the use of up to 100 microarrays with little improvement with numbers >100 [6]. Importantly, the Arabidopsis co-expression tool uses NASC/GARNet data from labeled, hybridized and analyzed material obtained by a standardized procedure, thus providing a homogeneous and readily comparable data set.

When performing this type of analysis we leave the gene list limit blank, which returns a global correlation analysis of the GOI ranging from the most positively to the most negatively correlated genes. The selection of the size of the ECGG to be used for further analysis is decided upon by the user and may be based on the strength of the correlation (r value cut-off) or simply a desired list size. According to the authors [20], r values above 0.9 are very high and thus likely to be biologically significant; however, r values greater than 0.7 have been found to be associated with over-representation of ontological keywords or annotations.

If the r values of the best-correlated genes are low, e.g., below 0.6, this may reflect that the expression of the gene is subject to complex control via multiple promoter motifs with inputs from multiple, potentially antagonistic, signaling pathways. Conversely, genes that are highly correlated are more likely to be regulated by a single regulatory pathway. In the example presented here, we selected PR-2 and the top 49 expression correlated genes as the ECGG (50 genes) that had r values ranging from 0.904 to 0.716 (Table 1).
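The two selection criteria mentioned above (an r value cut-off or a desired list size) can be sketched as follows. The function `select_ecgg` and its parameters are hypothetical names introduced for illustration; the first four r values are taken from Table 1, while the last entry is an invented gene added only to show the cut-off taking effect.

```python
def select_ecgg(ranked, reference_id, r_cutoff=None, max_size=None):
    """Build an ECGG from a ranked correlation list.

    `ranked` is a list of (gene_id, r) pairs sorted from the most positive
    to the most negative r value. The reference gene is always included.
    Selection stops at the first gene below `r_cutoff`, or once the group
    (reference gene included) reaches `max_size`, whichever comes first.
    """
    selected = [reference_id]
    for gene_id, r in ranked:
        if r_cutoff is not None and r < r_cutoff:
            break
        if max_size is not None and len(selected) >= max_size:
            break
        selected.append(gene_id)
    return selected

# r values for the real loci are from Table 1; the last row is hypothetical.
ranked = [
    ("AT1G75040", 0.904),
    ("AT1G08450", 0.861),
    ("AT5G10760", 0.842),
    ("AT3G53960", 0.716),
    ("HYPOTHETICAL_GENE", 0.650),
]

by_cutoff = select_ecgg(ranked, "AT3G57260", r_cutoff=0.7)
by_size = select_ecgg(ranked, "AT3G57260", max_size=3)
```

With the full ranked output of a correlation analysis, `max_size=50` would correspond to the choice made in this example (PR-2 plus its top 49 correlated genes).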

In addition, a recently developed Arabidopsis expression correlation tool named CressExpress (http://www.cressexpress.org) can be used to perform gene expression analyses in a number of defined tissue types [21].


Table 1. List of top 49 genes that are expression correlated with PR-2 (AT3G57260)

| Locus | GO terms a) | r value | Annotation |
|---|---|---|---|
| AT1G75040 | DR, SAR | 0.904 | Pathogenesis-related protein 5 (PR-5) |
| AT1G08450 | | 0.861 | Calreticulin 3 (CRT3) |
| AT5G10760 | | 0.842 | Aspartyl protease family protein |
| AT2G32680 | DR, CC, K | 0.819 | Disease resistance family protein / LRR |
| AT4G03450 | | 0.814 | Ankyrin repeat family protein |
| AT2G31880 | Ph, CC, K | 0.802 | LRR transmembrane protein kinase |
| AT3G11010 | DR, CC, K | 0.800 | Disease resistance family protein / LRR |
| AT4G23150 | Ph, K | 0.796 | Protein kinase family protein |
| AT4G04500 | Ph, K | 0.792 | Protein kinase family protein |
| AT5G24530 | | 0.792 | Oxidoreductase, 2OG-Fe(II) oxygenase |
| AT5G10380 | | 0.791 | Zinc finger (C3HC4-type RING finger) family protein |
| AT1G67865 | | 0.786 | Expressed protein |
| AT4G23140 | Ph, K | 0.781 | Receptor-like protein kinase 5 (RLK5) |
| AT5G24210 | | 0.778 | Lipase class 3 family protein |
| AT4G04220 | DR, CC, K | 0.775 | Disease resistance family protein contains LRR |
| AT1G21250 | Ph, CC, K | 0.775 | Wall-associated kinase 1 (WAK1) |
| AT1G66880 | Ph, K | 0.774 | Serine/threonine protein kinase |
| AT2G46430 | | 0.771 | Cyclic nucleotide-regulated ion channel (CNGC3) |
| AT2G29120 | | 0.770 | Glutamate receptor family protein (GLR2.7) |
| AT1G14870 | | 0.761 | Expressed protein similar to PGPS/D12 |
| AT3G56710 | | 0.754 | Identical to SigA binding protein |
| AT1G12220 | DR | 0.754 | Disease resistance protein RPS5 (CC-NBS-LRR) |
| AT3G24900 | DR, CC | 0.749 | Disease resistance family protein / LRR |
| AT5G02290 | K | 0.748 | Similar to protein kinase APK1A |
| AT2G37710 | K | 0.748 | Lectin protein kinase |
| AT2G24860 | | 0.748 | Chaperone protein dnaJ-related |
| AT5G60900 | | 0.745 | Lectin protein kinase family protein |
| AT4G23260 | K | 0.741 | Protein kinase family protein |
| AT3G56400 | DR, SAR, CC | 0.734 | WRKY70 family transcription factor |
| AT1G17610 | DR | 0.730 | Disease resistance protein-related |
| AT1G67970 | | 0.729 | Heat shock transcription factor, putative (HSTF5) |
| AT3G28540 | | 0.727 | AAA-type ATPase |
| AT2G23680 | | 0.726 | Stress-responsive protein |
| AT2G14610 | DR, SAR, CC | 0.726 | Pathogenesis-related protein 1 (PR-1) |
| AT4G29810 | CC, K | 0.725 | Mitogen-activated protein kinase kinase (MAPKK) (MKK2) |
| AT3G08870 | Ph, K | 0.725 | Lectin protein kinase |
| AT2G14560 | | 0.725 | Expressed protein |
| AT3G60420 | | 0.721 | Expressed protein |
| AT1G35710 | Ph, CC, K | 0.721 | LRR transmembrane protein kinase |
| AT1G76970 | | 0.721 | VHS domain-containing protein / GAT domain-containing |
| AT1G34420 | Ph, K | 0.720 | Leucine-rich repeat family protein |
| AT2G32800 | Ph, K | 0.719 | Protein kinase family protein |
| AT1G74710 | DR, SAR | 0.719 | Isochorismate synthase 1 (ICS1) |
| AT3G13950 | | 0.719 | Expressed protein |
| AT2G02360 | | 0.718 | F-box family protein |
| AT2G24850 | | 0.718 | Aminotransferase |
| AT4G22260 | | 0.718 | Alternative oxidase |
| AT5G12890 | | 0.717 | UDP-glucoronosyl/UDP-glucosyl transferase family protein |
| AT3G53960 | | 0.716 | Proton-dependent oligopeptide transport (POT) |

a) DR, defense response; CC, cell communication; K, kinase activity; SAR, systemic acquired resistance; Ph, phosphorylation.


GO analysis of ECGG

It has been demonstrated that genes that are co-expressed over multiple data sets show functional relatedness [7, 22]. For this reason a GO analysis of the ECGG is performed to identify any statistically enriched terms associated with the ECGG compared to the background (complete genome) frequency. If the expression of the GOI is found to be most highly correlated with genes involved in a specific type of biological response, we can infer functional roles through association with other genes with more defined roles. For Arabidopsis, the “Fatigo plus” comparing tool in the Babelomics suite can be used for such an analysis (http://fatigo.bioinfo.cnio.es; [23, 24]). This tool is not restricted to Arabidopsis and can also be applied to the following organisms: Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Bos taurus, Gallus gallus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae and Anopheles gambiae. The program uses a Fisher’s exact test and returns adjusted p values (Family Wise Error Rate) accounting for multiple testing to reduce the false-positive frequencies.

In our example we selected all the available functional annotation options for Arabidopsis in “Fatigo plus”, which include the three GO categories of biological process, cellular component and molecular function as well as InterPro motifs, KEGG pathways and Swiss-Prot keywords. The “Fatigo plus” analysis of our PR-2 ECGG identified a number of statistically significant enrichments in the biological process GO category that are generally related to pathogen defense (Table 1) with the most notable being SAR at level 8 (see Supplementary File 1 for details). Interestingly, 5 of the total 21 SAR annotated genes in the entire genome are present in this list including PR-2, PR-1, PR-5, the WRKY70 TF and isochorismate synthase 1 (ICS1), the latter being critical for SA biosynthesis that induces the SAR response [17].
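To give a feel for how unlikely the SAR overlap is by chance, the enrichment can be approximated with a one-sided Fisher's exact test, which for a 2×2 table reduces to the upper tail of the hypergeometric distribution. The sketch below is an illustration only and is not the “Fatigo plus” implementation. It assumes that the background is the 21 891 genes represented on the chip, uses a one-sided rather than a two-sided test, and applies a Bonferroni correction as one simple way of controlling the family-wise error rate; the number of tested terms (`n_terms_tested`) is a made-up value.

```python
from math import comb

def enrichment_p_value(hits_in_list, list_size, hits_in_genome, genome_size):
    """One-sided Fisher's exact test for over-representation.

    Returns the probability of drawing at least `hits_in_list` annotated
    genes when `list_size` genes are sampled without replacement from a
    genome of `genome_size` genes, `hits_in_genome` of which carry the term.
    """
    max_hits = min(list_size, hits_in_genome)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(hits_in_genome, k) * comb(genome_size - hits_in_genome, list_size - k)
        for k in range(hits_in_list, max_hits + 1)
    )
    return tail / total

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p value (a simple family-wise error rate control)."""
    return min(1.0, p_value * n_tests)

# 5 of the 21 SAR-annotated genes fall within the 50-gene ECGG.
# Assumption: background of 21 891 genes (those represented on the chip).
p_sar = enrichment_p_value(hits_in_list=5, list_size=50,
                           hits_in_genome=21, genome_size=21891)

n_terms_tested = 1000  # hypothetical number of terms tested
p_sar_adjusted = bonferroni(p_sar, n_terms_tested)
```

Under these assumptions the unadjusted p value is on the order of 10⁻⁹, since a random 50-gene list would be expected to contain about 0.05 SAR-annotated genes.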

Additionally, within this category we noted enrichment in genes (10 genes) associated with phosphorylation that are all annotated kinases, and an enrichment in kinases (17 genes) in the molecular function category at level 6. While 3 of these kinases are annotated as disease-resistance proteins, the remaining 14 are simply annotated as kinases, most likely through homology comparisons. A further 5 genes in the ECGG list were annotated as expressed proteins only. The confirmed correlated expression of these genes with those involved in defense responses extends their functional annotations since it also tightly associates them with a defined biological response.

Identification of stimuli that induce differential expression of the ECGG

To confirm the GO annotations and to identify conditions that collectively induce differential expression of these genes, Genevestigator (www.genevestigator.ethz.ch) [25] can be used to screen transcriptional responses of multiple genes across hundreds of varied conditions (over 3000 microarrays for Arabidopsis). This tool can also be used to screen transcriptional responses in H. sapiens, R. norvegicus, M. musculus and Hordeum vulgare. To obtain information on precise transcriptional responses for specific treatments or time points, it is necessary to download the relevant data sets by following the links to the repositories. This is necessary since the ratio values listed under “stimulus” are sometimes averaged responses to treatments taken over different time points and from different tissues (e.g., roots and shoots) as is the case with salt and osmotic stresses in Arabidopsis. This type of in silico analysis identifies specific stimuli that cause differential expression of the genes and thus links them to specific biological responses. The expression of PR-2 and the average expression of the ECGG in response to a number of selected treatments are presented (Fig. 2). In accordance with the enrichment in defense and SAR annotated genes from the GO analysis, expression of the ECGG is strongly induced by benzothiadiazole S-methylester (BTH), which is a functional synthetic SA analogue [14], as well as a number of biotrophic pathogens including Phytophthora infestans, Erysiphe cichoracearum and Erysiphe orontii that depend on living host tissue for survival (Fig. 2A). According to the literature, SA-dependent defenses generally act against biotrophs in contrast to jasmonic acid- and ethylene-dependent responses that counteract necrotrophs [26]. This distinction is reflected by the fact that the necrotrophic Botrytis cinerea pathogen did not greatly increase the average expression of the ECGG but did induce expression of PR-2 (1.82 log2), PR-5 (1.35 log2) and PR-1 (1.89 log2). Notably, Botrytis cinerea failed to induce large increases in expression of the ICS1 (0.25 log2) or WRKY70 (–0.17 log2), genes that are both key signaling molecules in SA-mediated responses. These results thus indicate that transcription of the PR-genes can be induced by SA-independent mechanisms.
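Because the responses are reported as log2 ratios, it can help to see how such values relate to fold changes and how a group average with its standard error might be computed from downloaded data. The following is a minimal sketch under the assumption that each gene's response is already available as a single log2 ratio; the helper names are hypothetical. The five values are the Botrytis cinerea responses quoted above and cover only those five genes, not the whole ECGG.

```python
import math
from statistics import mean, stdev

def log2_ratio(treated_signal, control_signal):
    """Log2 ratio of a treated signal over its control signal."""
    return math.log2(treated_signal / control_signal)

def fold_change(log2_value):
    """Convert a log2 ratio back to a linear fold change."""
    return 2.0 ** log2_value

def summarize(log2_ratios):
    """Mean and standard error of the mean of a list of log2 ratios."""
    n = len(log2_ratios)
    average = mean(log2_ratios)
    sem = stdev(log2_ratios) / math.sqrt(n) if n > 1 else 0.0
    return average, sem

# log2 ratios reported for Botrytis cinerea (48 h) in the text above.
botrytis = {"PR-2": 1.82, "PR-5": 1.35, "PR-1": 1.89, "ICS1": 0.25, "WRKY70": -0.17}

pr2_fold = fold_change(botrytis["PR-2"])           # roughly 3.5-fold induction
average, sem = summarize(list(botrytis.values()))  # average over these five genes only
```

Averaging in log2 space, as done here, corresponds to a geometric mean of the fold changes, so a two-fold induction and a two-fold repression cancel out symmetrically.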

The abiotic ozone and UV-B stresses also induced strong expression of the ECGG (Fig. 2B). Both these treatments have previously been shown to stimulate SA production and induce the expression of PR genes [27–30]. Additionally, expression of the ECGG is strongly induced by abiotic osmotic related stresses including K⁺ starvation, osmotic and NaCl stress. The induction of defense and SAR annotated genes by abiotic osmotic stresses as well as defense eliciting treatments may well reflect that both types of challenges lead to common homeostatic disturbances. These in turn transcriptionally activate a set of common response genes. This concept is supported by several studies that recognize a role of SA in abiotic stresses such as drought, salinity and temperature [31, 32]. The accumulation of PR proteins is in fact a common plant response to both abiotic and biotic stresses, further highlighting the overlap in biotic and abiotic defense mechanisms [11]. Indeed, changes in ion fluxes have been identified as early responses to both abiotic and biotic stresses, including an influx of H⁺ and Ca²⁺ and an efflux of K⁺ and Cl⁻ [33]. The sharp increase in expression of PR-2 and the ECGG in response to the protein synthesis inhibitor cycloheximide (CHX) implies that transcription of these genes can occur independently of de novo protein synthesis, which concurs with the definition of immediate early response genes [34] that have been proposed to play important roles in the early regulation of defense responses [35].

To gain further insights into the function of the correlated genes and pathways that may be involved in their regulation, transcriptional responses were observed in various mutants that are accessible through the “mutant surveyor” function in Genevestigator (Fig. 3). Again, consistent with the GO annotations, the expression of the genes was strongly induced in the cpr5 (a constitutive expresser of pathogenesis-related genes) and mpk4 mutants that have elevated SA levels and display constitutive expression of PR genes [36, 37]. Conversely, the expression of the genes was markedly reduced in the nahG mutant that is defective in SA production and signaling [17].

Observations of the transcriptional responses ofthe ECGG in the npr1 (AT1G64280) knockout mu-tant are interesting since the NPR1 gene encodesan ankyrin-repeat protein that is an essential pos-itive regulator of SA signal transduction, leading toactivation of the SAR response and PR gene ex-pression [38, 39]. Mutations in NPR1 have beenshown to inhibit PR gene expression with compro-mised disease resistance [40–42]. In the single npr1mutant, transcription of PR-2 is reduced moderate-ly (–0.987log2), while the average expression ofthe correlated genes is only slightly reduced(–0.134log2). Similarly, in the cpr5/npr1 double mu-tant, gene expression is only slightly reduced com-pared to the single cpr5 mutant. In contrast, in thenpr1 mutant after 24-h BTH treatment, expressionof PR-2 (–1.78log2) and the correlated genes(–1.95log2) is markedly reduced when compared to24-h BTH-treated wild-type controls. However, ex-pression of the genes in the npr1 mutant after 24-hBTH treatment is elevated compared to untreatedcontrols, although not to the same magnitude(Fig. 2A). In summary, these results indicate thatNPR1 is required for full transcriptional inductionof these genes; however, an NPR1-independentmechanism must also exist. To obtain informationon the NPR1 dependency of individual genes in thelist, individual responses must be analyzed ratherthan the average presented since the npr1 mutanteffects transcription of these genes quite different-ly. For example, expression of PR-2 (+2.18log2) andPR-5 (+2.08log2) was still largely induced in thenpr1 mutant after 24-h BTH treatment compared tountreated wild-type controls.The BTH induction of

0

1

2

3

4

5

ozone 3h

UV-B6h

NaCl 6h

Osm. 6h

K+ starv. CHX

Abiotic stresses

Rat

io (

log

2)

0

1

2

3

4

5

6

BTH8h

BTH24h

E.cich.3d

P.infest.24h

E.orontii4d

B. cinerea48h

Rat

io (

log

2)

Biotic stresses(A)

(B)

PR2

Av. ECGG

Figure 2. Expression profile of PR-2 and correlated genes in response toselected treatments. The results presented illustrate the fold change(log2) in expression of PR-2 and the average fold change for the ECGG inresponse to: (A) biotic stress and chemical treatments, and (B) abioticstresses. (A) The treatments were: BTH for 8 h (n=3) and 24 h (n=3);Erysiphe cichoracearum 3 days after inoculation (n=4); Phytophthora infes-tans 1 day after inoculation (n=3); Erysiphe orontii 4 days after inoculation(n=2); Botrytis cinerea 48 h after inoculation (n=3). (B) The treatmentswere: ozone 3 h (n=2); UV-B shoot 6 h (n=2); NaCl in the roots after 6 h(n=2); osmotic stress in the shoot after 6 h (n=2); K+ starvation, the shootafter 7 days (n=3); cycloheximide (CHX) after 3 h (n=2). Error bars repre-sent standard errors of the mean.


PR-1, however, was completely abolished, indicating that transcription of PR-1 is NPR1-dependent (data not shown).
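Because the ratios above are quoted on a log2 scale, it can help to convert them back to linear fold changes when judging their magnitude. The following is a minimal sketch of that arithmetic; the function name and the dictionary labels are illustrative only, and the values are simply those quoted in the text.

```python
def log2_ratio_to_fold_change(log2_ratio: float) -> float:
    """Convert a log2 expression ratio into a linear fold change."""
    return 2.0 ** log2_ratio


# log2 ratios quoted in the text (sample versus its stated control)
quoted_ratios = {
    "PR-2, npr1 vs. wt": -0.987,
    "Av. ECGG, npr1 vs. wt": -0.134,
    "PR-2, npr1 + BTH vs. wt + BTH": -1.78,
    "Av. ECGG, npr1 + BTH vs. wt + BTH": -1.95,
    "PR-2, npr1 + BTH vs. untreated wt": 2.18,
}

for label, ratio in quoted_ratios.items():
    print(f"{label}: {log2_ratio_to_fold_change(ratio):.2f}-fold")
```

A log2 ratio of –0.987 therefore corresponds to roughly half the control expression level, whereas +2.18 corresponds to a more than fourfold induction.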

Further, in the sni1 knockout mutant, expression of the ECGG is strongly induced. The SNI1 gene (AT4G18460) is a negative regulator of SAR and is required to maintain the low basal expression of PR genes [43]. Thus, knocking out this gene results in de-repression of SAR-related genes. Interestingly, in the correlation analysis, expression of this gene was significantly anti-correlated (r=–0.36) with PR-2, which is entirely consistent with its role as a repressor.

Studying transcriptional responses of genes in TF-related mutants is also of interest since it can provide insights into transcriptional regulation. In this case, microarray studies documenting transcriptional responses in two Arabidopsis WRKY TF mutants, WRKY18 and WRKY70, are of particular interest since both TFs have been implicated in regulating Arabidopsis pathogen responses, and SAR and PR gene expression in particular [14]. In the untreated wrky18 mutant, expression of the ECGG was not greatly changed compared to untreated wild-type controls, nor was induction largely different from that in the wild type after 24-h BTH treatment, indicating that WRKY18 is not critically involved in the induction of these genes. In the WRKY70 TF antisense knockout line, the expression of the ECGG was not altered greatly (not shown). However, in the WRKY70 overexpression line there was a strong induction of the ECGG, indicating that this WRKY TF is a positive regulator of these genes. This is further supported by the fact that the WRKY70 TF is an annotated SAR gene and its expression is highly correlated with PR-2 (r=0.73).

Promoter content analyses

To elucidate ECGG co-regulation, computational tools can be used to identify similarities in their regulatory promoter elements (cis-elements). The underlying rationale is that the regulation of eukaryotic gene expression is critically dictated by the combinatorial presence of specific cis-regulatory motifs in the promoter regions of genes that bind specific TFs and modulate gene expression [6, 8, 22, 44–47]. The identification of cis-elements that are thought to be causative for the observed transcriptional responses of the ECGG is achieved by identifying motifs that are enriched in the promoters of the ECGG as compared to background frequencies in the rest of the genome. Evidently, global computational analysis of promoter regions requires that the promoter sequence data are available and thus, this type of analysis is limited to sequenced genomes.

The enrichment of motifs in co-expressing promoter sets needs to be considered in two ways: firstly, through the percentage of promoters in the group that contain the motif and, secondly, by the number of copies of the motif in each promoter of the ECGG. In the first case, since it has been reported that most TF binding sites (TFBS) occur in the promoters of many genes, enrichment determined by presence or absence in a promoter is not critical or particularly useful in discriminating transcriptional regulation [48]. However, since we are attempting to identify motifs that are causative for the transcriptional responses of the group as a whole, the motifs should be present in a significant proportion of the promoters. Secondly, since most TFBS are indeed present in many genes, the average copy number per promoter becomes critical for the identification of enriched elements. It has been documented that multiple copies of cis-elements in promoters have biological significance, particularly


[Figure 3: bar charts of the expression ratio (log2) for PR2 and Av. ECGG. First group of categories: cpr5, cpr5/npr1, mpk4, nahG, sni1, WRKY70 +. Second group of categories: wrky18/wt, npr1/wt, BTH:wrky18/wt, BTH:npr1/wt, wrky18 BTH/untr. wt, npr1 BTH/untr. wt.]

Figure 3. Expression profile of PR-2 and the ECGG in selected A. thaliana mutants. The expression profiles of PR-2 and the correlated genes were examined in a number of SA/SAR-related mutants. The expression of PR-2 and the ECGG is strongly induced in SAR-inducing mutants including cpr5, mpk4 and sni1, as well as in the WRKY70 overexpresser line. Conversely, in the SA-deficient mutant nahG, expression of the PR-2 ECGG is markedly reduced. The BTH-induced expression of the ECGG was reduced in npr1 mutants but remained largely unaffected in wrky18 mutants. All BTH treatments were for 24 h. wt, wild type; untr. wt, untreated wt control. Error bars represent standard errors of the mean.


when clustered, making transcriptional activation stronger [44, 48].
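The two measures discussed above, the proportion of promoters that contain a motif and the copy number per promoter, can be sketched as follows. This is an illustration only: the function names and toy sequences are hypothetical, and the scan covers the given strand only, which is a simplifying assumption rather than a description of any particular tool.

```python
def count_occurrences(sequence: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of motif on the given strand."""
    sequence, motif = sequence.upper(), motif.upper()
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


def motif_summary(promoters: dict, motif: str) -> dict:
    """Summarize motif presence and copy number across a promoter set."""
    counts = {gene: count_occurrences(seq, motif) for gene, seq in promoters.items()}
    containing = [c for c in counts.values() if c > 0]
    total = sum(counts.values())
    return {
        "fraction_with_motif": len(containing) / len(counts),
        "total_copies": total,
        "mean_copies_all_promoters": total / len(counts),
        "mean_copies_containing_promoters": total / len(containing) if containing else 0.0,
    }


# Toy promoter set (hypothetical sequences, far shorter than real promoters)
toy_promoters = {
    "geneA": "TTGACTTTGAC",
    "geneB": "AAAAAAAAAAA",
    "geneC": "CCTTGACCAAA",
}
print(motif_summary(toy_promoters, "TTGAC"))
```

Reporting the mean both over all promoters and over only those containing the motif keeps the two notions of enrichment separate.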

The identification of enriched putative regulatory elements in promoter groups can be performed using motifs derived from two distinct approaches. In the first, programs search promoters for the presence of experimentally confirmed TFBS, while in the second, ab initio computationally derived motifs are used. Both approaches have specific advantages and disadvantages.

Experimentally determined cis-elements

The obvious advantage in using experimentally derived motifs is that they have been experimentally confirmed to bind TFs and are thus likely to be of biological relevance. However, due to the extensive experimental procedures required to identify and characterize TFBS, even in well-studied organisms only a small proportion of them have been identified [49]. Thus, the use of experimentally derived motifs is often restricted to studying well-characterized pathways in well-studied organisms.

For Arabidopsis, the visualization tool in the Arabidopsis Thaliana Expression Network Analysis (Athena; website http://www.bioinformatics2.wsu.edu/cgi-bin/Athena/cgi/visualize_select.pl) offers a platform that allows rapid identification of known enriched cis-elements within promoter groups [50]. The tool requires that users provide a list of Arabidopsis accession IDs (e.g., AT3G57260). A maximum of 20 genes can be submitted for the cartoon visualization, whereas up to 100 genes can be analyzed using the compact option. Users can also choose the length of the upstream region to be searched (range –250 to –3000) and whether to cut off at adjacent genes. The output returns visualizations of the individual genes and their promoter regions with color-coded marking to represent the location of cis-elements. The tool searches the promoter regions of all genes for the presence of consensus sequence motifs that correspond to 105 previously characterized plant TFBS. The consensus sequences were derived from “PLACE” (http://www.dna.affrc.go.jp/PLACE/faq.html; [51]) and Agris (http://arabidopsis.med.ohio-state.edu; [52]). Athena returns a list of all the predicted TFBS identified, the number of promoters (#P) that contained the motif, the total number of occurrences of the motif across the promoter set (#S) and a p value that is calculated using a hypergeometric probability distribution for each binding site identified. A Bonferroni correction is subsequently performed to account for multiple testing (up to 105 different TFBS) and to determine p value thresholds for significant enrichment. On the results page one can click on the promoter element names, which link to pages that provide the consensus sequence of the predicted site and often PubMed IDs.
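The general form of a hypergeometric enrichment test with a Bonferroni-corrected threshold can be sketched as follows. This is not Athena's implementation, which is not described here; the sketch assumes an upper-tail test on the number of promoters containing a motif, and the genome-wide counts, the example of 40 out of 50 promoters and the significance level alpha = 0.05 are all hypothetical choices for illustration.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n promoters from N, of which K contain the motif."""
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical numbers: 40 of 50 selected promoters contain the motif,
# against an assumed background of 15000 of 30000 genome promoters.
p_value = hypergeom_upper_tail(k=40, n=50, K=15000, N=30000)
threshold = bonferroni_threshold(alpha=0.05, n_tests=105)  # 105 TFBS tested
print(f"p = {p_value:.3g}; significant after correction: {p_value < threshold}")
```

Exact integer arithmetic via `math.comb` avoids the rounding problems that arise when very small tail probabilities are computed from floating-point factorials.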

In our example we performed the analysis on the 1000-bp upstream regions of the PR-2 ECGG in the compact mode and did not cut off at adjacent genes. The results reveal a number of statistically significant enrichments of predicted sites within our promoters (Supplementary File 2A). These include T-box (p<10⁻⁴; #P=36; #S=62; consensus = ACTTTG), TATA-box (p<10⁻³; #P=47; #S=146; consensus = TATAAA), W-box [p<10⁻³; #P=40; #S=97; consensus = TTGAC(T/G)] and LS7 (p<10⁻³; #P=2; #S=2; consensus = ACGTCTATAGA) promoter motifs. Other abundant motifs that may be of potential interest can be identified by observing their p values and the frequency with which they occur in the genes. For example, the GAREAT (gibberellic acid response element, in A. thaliana) motif, although not considered enriched, was present in 64% of the genes and had a small p value (p=0.014; #P=32; #S=48; consensus = TAACAAR). Since this type of analysis is attempting to identify elements that may be causative for the transcriptional responses of the ECGG, biologically relevant motifs should be present in a large proportion of the genes. For example, in the PR-2 analysis, the LS7 element, although considered enriched, is present in only 2/50 genes, making it a rather unlikely regulator of the group as a whole.

We note that the stringent W-boxes (TTGAC[C/T]) are present in 80% of the promoters of the ECGG, a total of 97 times, which is an average of 2.45 copies/promoter when considering only the promoters that contain them. It is particularly telling that W-boxes are enriched in the promoters of the PR-2 ECGG since W-box cis-elements are known to bind WRKY TFs [53], which have a proven role in the regulation of gene expression in response to pathogens [54] and osmotic stress in plants [55–57]. Furthermore, the WRKY70 TF is an annotated SAR gene that is highly correlated with PR-2 and present in our ECGG. The mutant analysis reveals that transcription of the ECGG is highly induced in the WRKY70 overexpresser line. These results strongly support the interpretation that this TF is a key positive regulator of the ECGG since: (i) its expression is highly correlated with the group, (ii) overexpression of WRKY70 induces expression of the ECGG, (iii) the TF has established roles in plant defense, and (iv) the group contains an enrichment in W-box elements.

While providing valuable information, this type of analysis is limited to the identification of previously defined elements that are typically restricted to well-studied responses. Further, the limitations in using incomplete motif sets to characterize promoters are exemplified by the fact that, in eukaryotic organisms, transcriptional regulation of individual genes is determined by the combinatorial presence of multiple cis-regulatory sequences in their promoters [58, 59]. Performing this analysis without complete knowledge of all important elements will therefore provide partial information at best.

Computationally derived motifs

An alternative approach for the identification of promoter regulatory elements that are thought to be causative for transcriptional responses in co-expressing gene groups is the use of computational programs that identify statistically over-represented motifs within promoters. Typically this type of analysis starts with a group of promoters that are believed to be from a co-regulated (co-expressed) gene group. The tools then identify short DNA sequence motifs that are statistically over-represented within the regulatory regions. The nature of TFBS makes accurate identification of these sites difficult. TFBS are short and degenerate since binding sequences can vary for a given TF [49]. An analysis of 292 experimentally verified vertebrate TFBS matrices extracted from the TRANSFAC database (http://www.gene-regulation.com/pub/databases.html) revealed that the mean length of degenerate consensus TFBS sequences was 14.3 nucleotides (nt) and the mean length of the conserved continuous core was only 4.1 nt [60]. The short length of the core region makes it difficult to distinguish biologically meaningful data from statistical noise present in the input sequences examined (typically 1000 nt upstream of the transcription start site), while the inherent variability in consensus sequences prevents TFBS from being characterized by their individual sequences. Computational programs have been designed that utilize IUPAC (http://www.iupac.org) consensus sequences or position weight matrices (PWM) to define the characteristics of these sites when searching for motifs. Many such programs exist and the accuracy of some was reviewed recently [49, 61].
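To illustrate how a degenerate IUPAC consensus captures several concrete binding sequences, the sketch below expands a consensus into a regular expression and lists the matching sites in a sequence. The function names and the toy sequence are hypothetical, the scan covers one strand only, and the approach is a generic illustration rather than the method of any tool named in the text.

```python
import re

# Standard IUPAC nucleotide ambiguity codes
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus: str) -> "re.Pattern":
    """Compile an IUPAC consensus into a regex that also finds overlapping sites."""
    body = "".join(IUPAC_CODES[base] for base in consensus.upper())
    # A lookahead with a capture group reports matches that overlap each other.
    return re.compile(f"(?=({body}))")


def find_sites(sequence: str, consensus: str) -> list:
    """Return (position, matched sequence) pairs for every site on the given strand."""
    pattern = iupac_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# The stringent W-box TTGAC(C/T) written as the IUPAC consensus TTGACY
print(find_sites("AATTGACTGGTTGACCA", "TTGACY"))
```

A position weight matrix generalizes this idea by assigning a score to each base at each position instead of a simple allowed or disallowed status.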

We have used the Dragon Motif Builder (DMB) system to identify motifs in the promoter regions of the ECGG (DMB; http://tr.sanbi.ac.za:8889/~dragon/dragon/MotifBuilder) [62]. In brief, this tool uses an ab initio methodology to identify groups of highly similar motifs that form a motif family. The system uses two versions of an expectation maximization (EM) algorithm. EM1 attempts to determine the best motif lengths for motifs in a motif family, while EM2 determines motif families with pre-specified motif lengths. Further details about the algorithm of DMB can be found on the system’s website.

The Martview tool on the Gramene website (www.gramene.org/biomart/martview/) provides an excellent resource for downloading genomic sequence data for Arabidopsis (TAIR 7 release), rice (TIGR 5 release) and maize. It provides for full genome downloads or filtering for selected genes. It also allows the user to specify particular sequence downloads, including user-specified upstream sequence lengths, cDNA sequences, genomic sequences and UTR sequences. For our example we have downloaded 1000-bp upstream flanking sequences relative to the coding region of the gene for all genes in the ECGG.

For this purpose, we uploaded a file that contained the promoter sequences of the ECGG in FASTA format. We then ran the analysis using the EM2 version with the following options: full-length analysis, remove poor patterns, run against background sequences and identify 50 motifs. We ran the analysis three times at the selected motif lengths of five, six and seven. The remaining parameters were left as default. The data are processed rapidly and a link to the results is returned by e-mail. Two files are returned: one contains a list of the consensus motifs and their corresponding PWM, while the other returns, for each motif, the motif families used to generate the consensus motif sequences. For each motif a p value and e value are returned in the PWM file. Importantly, the files also document how many promoters in the file contain the consensus motifs and the motif position in each promoter. Since this analysis is performed against random DNA, we subsequently use a complementary tool (POBO; http://ekhidna.biocenter.helsinki.fi/poxo/pobo [63]) to check for enrichment of the elements compared to Arabidopsis background frequencies. POBO counts the frequency of the motif in each of the promoters in the input sets and in the internal background (genome) data set and then applies a bootstrap method to generate a number of artificial promoter sets (pseudoclusters) on which statistical tests are performed to identify enrichments. This enables a reliable comparison of original data sets of different sizes. The POBO tool can be used on H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, S. cerevisiae and A. gambiae.
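The bootstrap idea described above can be sketched in a few lines. This is a simplified illustration and not POBO's actual implementation: the function names, the toy sequences and the fixed random seed are hypothetical, and only the given strand is scanned.

```python
import random
from statistics import mean


def count_motif(sequence: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of motif on the given strand."""
    sequence, motif = sequence.upper(), motif.upper()
    return sum(
        1
        for i in range(len(sequence) - len(motif) + 1)
        if sequence.startswith(motif, i)
    )


def pseudocluster_means(promoters, motif, n_pick, n_samples, rng):
    """Mean motif count in each of n_samples pseudoclusters of n_pick promoters.

    Each pseudocluster is drawn with replacement from the supplied promoters.
    """
    counts = [count_motif(seq, motif) for seq in promoters]
    return [mean(rng.choices(counts, k=n_pick)) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
cluster = ["TTGACTTGAC", "AATTGACAAA", "TTGACAAAAA"]                # toy co-expressed set
background = ["AAAAAAAAAA", "TTGACAAAAA", "CCCCCCCCCC", "GGGGTTGACG"]  # toy background

cluster_means = pseudocluster_means(cluster, "TTGAC", n_pick=50, n_samples=1000, rng=rng)
background_means = pseudocluster_means(background, "TTGAC", n_pick=50, n_samples=1000, rng=rng)
print(f"cluster: {mean(cluster_means):.2f}, background: {mean(background_means):.2f}")
```

Because both sets are resampled to pseudoclusters of the same size, the two distributions of means can be compared directly even when the original sets differ greatly in size.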

In POBO, a file containing the 1-kb upstream promoter sequences for each gene of the ECGG was uploaded, and the analysis of each motif identified in DMB was run against the A. thaliana background (BG; clean). The following settings were used: number of sequences to pick out = 50, number of samples to generate = 1000, sequence length = 1000 bp. In brief, POBO returns the following results for both the BG (genome) and the ECGG: the total number of promoters in the sets, and the number of promoters that contain the specified motif. It also lists the promoter mean, that is, the average copy number of the motif in each promoter in the set. Calculating and comparing the ratios of these parameters can be used to identify motif enrichment in the ECGG. Statistically significant enrichment of motifs in the ECGG compared to BG frequencies can be measured by following the provided link to the GraphPad website (http://www.graphpad.com/quickcalcs/DistMenu.cfm). The POBO output returns a calculated t value and degrees of freedom that can be used in GraphPad to calculate a p value and determine statistical significance. However, it is important to keep in mind that statistical significance does not necessarily imply biological significance.

In the case of PR-2, we noted that a number of motifs were identified in the DMB program that contain the core WRKY binding domain TTGAC. These include TTGAC (5 nt) and TTGACT (6 nt). The results from POBO (Fig. 4, Supplementary File 2B) indicate that the core TTGAC W-box motif was present in 96% of promoters in the ECGG compared to 88% of promoters in the Arabidopsis genome (8% enrichment). The motif was present at an average of 2.89 copies per promoter compared to 2.24 in the background (29% enrichment). Similarly, the TTGACT motif that corresponds to the more stringent W-box sequence was present in 68% of promoters of the ECGG (18% enrichment), at an average of 1.3 copies per promoter (78% enrichment). Since we know that this element was determined to be enriched from the output in Athena and other evidence supports its biological relevance, we can use these enrichment values as a positive control or benchmark to help judge likely biologically meaningful enrichment of other motifs. We based our cut-off for enrichment on copies per promoter and used 20% for all lengths. We also only used motifs that were present in at least 40% of the genes. Out of the 100 motifs examined for each length, 14 motifs of 5 nt (max. enrichment 32%), 29 motifs of 6 nt (max. enrichment 78%) and 27 motifs of 7 nt (max. enrichment 123%) passed these criteria (Supplementary Table 2B). There is a clear trend that with increasing motif length, enrichment in copy numbers also increases, which may reflect increased specificity. However, the degeneracy also increases with longer motifs, while their presence and copy numbers sharply decrease.
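The enrichment arithmetic and the two cut-offs described above can be written out explicitly. The sketch below is illustrative only: the function names are hypothetical and enrichment is assumed to be the relative increase over the background value. With that definition, the copy-number figures quoted for TTGAC (2.89 versus 2.24) give about 29%, whereas the presence figures (96% versus 88%) differ by 8 percentage points, or about 9% in relative terms.

```python
def percent_enrichment(cluster_value: float, background_value: float) -> float:
    """Relative increase of the cluster value over the background, in percent."""
    return 100.0 * (cluster_value / background_value - 1.0)


def passes_cutoffs(
    cluster_presence: float,
    cluster_copies: float,
    background_copies: float,
    min_presence: float = 0.40,          # motif in at least 40% of the genes
    min_copy_enrichment: float = 20.0,   # at least 20% more copies per promoter
) -> bool:
    """Apply the presence and copy-number cut-offs described in the text."""
    enrichment = percent_enrichment(cluster_copies, background_copies)
    return cluster_presence >= min_presence and enrichment >= min_copy_enrichment


# Values quoted in the text for the core W-box motif TTGAC
print(f"copies per promoter: {percent_enrichment(2.89, 2.24):.1f}% enrichment")
print(f"passes cut-offs: {passes_cutoffs(0.96, 2.89, 2.24)}")
```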

POBO can also be used to confirm the enrichments of the motifs identified in Athena and additionally to check other interesting non-enriched motifs identified in Athena that were present in many genes but had low p values. POBO provides additional detail to that presented in Athena since it lists motif frequencies in the genomic background. While the enrichment of specific motifs in a co-expressed gene set compared to the background of genomic frequencies may logically indicate that they could be causative for distinguishing their transcriptional responses, in reality there is no absolute standard to measure the accuracy of predictions.

The PATCH tool in the TRANSFAC database [64] allows computationally derived putative regulatory motifs to be tested to see if they correspond to binding sites of known TFs. This site can be used to compare motifs against known TFBS from a range of organisms including vertebrates, fungi, plants, insects, nematodes, consensi and viruses (total 21 621 sites in the 2008 version). We used the TRANSFAC Professional version 10.4 database to analyze our consensus motifs that were enriched in POBO to determine if they corresponded to binding sites for known plant TFs. The search parameters used in PATCH were a minimum binding site length of 5 nt and a mismatch penalty of 100. This analysis revealed that a number of the enriched putative elements identified corresponded to binding sites for known plant TFs that are involved in defense, including the previously discussed WRKYs, Jasmonate (JA)MYC2, which is involved in JA signaling [65], and GT-1, which has been shown to be involved in regulating PR gene expression in tobacco [66]. It is critical to remember, however, that only a small fraction of TFBS have been characterized even in the most extensively studied species. Thus, enriched motifs that do not correspond to known TFBS should not be dismissed but regarded as interesting candidate motifs that could be studied experimentally.

Figure 4. Frequency of occurrence of the core W-box (TTGAC) motif in the promoters of the PR-2 ECGG and in background (genomic) A. thaliana promoters. The 1-kb upstream promoter sequences of the PR-2 ECGG were analyzed in POBO [63] to determine the frequency of occurrence of the TTGAC W-box motif. The core TTGAC W-box motif was determined to be significantly enriched (t-test: p value <0.0001) in the PR-2 ECGG, being present in 96% of promoters at an average of 2.89 copies per promoter compared to 88% of promoters in the Arabidopsis genome at an average of 2.24 copies per promoter.

The use of phylogenetic footprinting can help to validate predictions. Phylogenetic footprinting is performed by comparing orthologous regulatory regions of a single gene across several species [67]. The rationale behind this approach is that selection pressure causes functional elements to evolve at a slower rate and thus retain greater conservation than non-functional sequences [44, 68]. Hence, one could compare motifs identified in Arabidopsis to the promoters of orthologous genes in rice or barley, which have both been sequenced. This type of comparative analysis has been used to identify conserved regulatory elements between human and mouse [69, 70].
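As a very rough proxy for this idea, one can ask which short sequence words are shared by the orthologous promoters of a gene in several species. The sketch below is a deliberate simplification under stated assumptions: real phylogenetic footprinting relies on sequence alignment and evolutionary models, and the function names and toy sequences here are hypothetical.

```python
def kmers(sequence: str, k: int) -> set:
    """Return the set of all k-nt words present in a sequence (one strand)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(orthologous_promoters: dict, k: int) -> set:
    """Words of length k found in every orthologous promoter supplied."""
    word_sets = [kmers(seq, k) for seq in orthologous_promoters.values()]
    return set.intersection(*word_sets) if word_sets else set()


# Toy orthologous promoter fragments (hypothetical sequences)
orthologs = {
    "arabidopsis": "AATTGACTCC",
    "rice": "GGTTGACTAA",
}
print(shared_kmers(orthologs, k=6))
```

A candidate motif that is retained in the orthologous promoters of distant species would be a stronger candidate for experimental follow-up than one found in a single species only.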

Summary and conclusion

This analysis, performed on the PR-2-encoding gene as an example, has documented how the functional role of an individual gene can be largely elucidated by the sequential use of web-based tools. The correlation analysis identifies genes within the genome that are most strongly expression correlated with PR-2 over a large number of diverse treatments in microarray experiments. The GO analysis of the PR-2 ECGG most notably identified an enrichment in defense-related genes and SAR annotated genes in particular. The in silico analysis of microarray expression profiles in response to specific stimuli and in mutants confirms and complements the correlation and GO analysis since it reveals that expression of the ECGG is collectively induced by SA- and SAR-inducing elicitors and in SA/SAR-related mutants. The promoter analysis identifies an enrichment in a number of putative regulatory elements, including the well-characterized W-box elements that bind WRKY TFs and are known to have a biological role in regulating SA/SAR-related genes. Further, the analysis strongly implicates a number of functionally un-annotated kinases present in the ECGG in defense, thus providing novel and valuable information for these proteins.

The results generated in this analysis correspond well with published reports for PR-2, documenting the power and accuracy of the approach. The analysis also linked a number of novel genes to defense responses in plants, including a number of annotated kinases and expressed proteins of unknown function. Finally, the outlined approach can be similarly applied to genes of unknown function and to different organisms that have the required information available and thereby help to identify groups of genes that function together in specific biological processes.

This project was supported by the National Research Foundation of South Africa.

The authors have declared no conflict of interest.

References

[1] Stuart, R. O., Bush, K. T., Nigam, S. K., Changes in gene expression patterns in the ureteric bud and metanephric mesenchyme in models of kidney development. Kidney Int. 2003, 64, 1997–2008.

[2] Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., Church, G. M., Systematic determination of genetic network architecture. Nat. Genet. 1999, 22, 281–285.

[3] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q. et al., Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 1999, 96, 2907–2912.

[4] Eisen, M. B., Spellman, P. T., Brown, P. O., Botstein, D., Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 1998, 95, 14863–14868.

[5] Jansen, R., Greenbaum, D., Gerstein, M., Relating whole-genome expression data with protein-protein interactions. Genome Res. 2002, 12, 37–46.

[6] Allocco, D. J., Kohane, I. S., Butte, A. J., Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics 2004, 5, 18.

[7] Lee, H. K., Hsu, A. K., Sajdak, J., Qin, J., Pavlidis, P., Coexpression analysis of human genes across many microarray data sets. Genome Res. 2004, 14, 1085–1094.

[8] Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J. et al., Functional discovery via a compendium of expression profiles. Cell 2000, 102, 109–126.

[9] Kim, S. K., Lund, J., Kiraly, M., Duke, K. et al., A gene expression map for Caenorhabditis elegans. Science 2001, 293, 2087–2092.

[10] Segal, E., Yelensky, R., Koller, D., Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics 2003, 19 Suppl. 1, i273–282.

[11] van Loon, L., Rep, M., Pieterse, C., Significance of inducible defense-related proteins in infected plants. Annu. Rev. Phytopathol. 2006, 44, 135–162.

[12] Van Loon, L., Van Strien, E., The families of pathogenesis-related proteins, their activities, and comparative analysis of PR-1 type proteins. Physiol. Mol. Plant Pathol. 1999, 55, 85–97.

[13] Boudart, G., Jamet, E., Rossignol, M., Lafitte, C. et al., Cell wall proteins in apoplastic fluids of Arabidopsis thaliana rosettes: Identification by mass spectrometry and bioinformatics. Proteomics 2004, 5, 212–221.

[14] Wang, D., Amornsiripanitch, N., Dong, X., A genomic approach to identify regulatory nodes in the transcriptional network of systemic acquired resistance in plants. PLoS Pathog. 2006, 2, e123.

[15] Durrant, W., Dong, X., Systemic acquired resistance. Annu. Rev. Phytopathol. 2004, 42, 185–209.

[16] Weigel, R. R., Pfitzner, U. M., Gatz, C., Interaction of NIMIN1 with NPR1 modulates PR gene expression in Arabidopsis. Plant Cell 2005, 17, 1279–1291.

[17] Wildermuth, M. C., Dewdney, J., Wu, G., Ausubel, F. M., Isochorismate synthase is required to synthesize salicylic acid for plant defence. Nature 2001, 414, 562–565.

[18] Pitarch, A., Nombela, C., Gil, C., Cell wall fractionation for yeast and fungal proteomics. Methods Mol. Biol. 2008, 425, 217–239.

[19] Sela-Buurlage, M. B., Ponstein, A. S., Bres-Vloemans, S. A., Melchers, L. S. et al., Only specific tobacco (Nicotiana tabacum) chitinases and β-1,3-glucanases exhibit antifungal activity. Plant Physiol. 1993, 101, 857–863.

[20] Manfield, I. W., Jen, C. H., Pinney, J. W., Michalopoulos, I. et al., Arabidopsis Co-expression Tool (ACT): web server tools for microarray-based gene expression analysis. Nucleic Acids Res. 2006, 34, W504–509.

[21] Srinivasasainagendra, V., Page, G. P., Mehta, T., Coulibaly, I., Loraine, A. E., CressExpress: a tool for large-scale mining of expression data from Arabidopsis. Plant Physiol. 2008, 147, 1004–1016.

[22] Schulze, A., Downward, J., Navigating gene expression using microarrays – A technology review. Nat. Cell Biol. 2001, 3, E190–195.

[23] Al-Shahrour, F., Diaz-Uriarte, R., Dopazo, J., FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20, 578–580.

[24] Al-Shahrour, F., Minguez, P., Tarraga, J., Medina, I. et al., FatiGO +: A functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. 2007, 35, W91–96.

[25] Zimmermann, P., Hirsch-Hoffmann, M., Hennig, L., Gruissem, W., GENEVESTIGATOR. Arabidopsis microarray database and analysis toolbox. Plant Physiol. 2004, 136, 2621–2632.

[26] McDowell, J. M., Dangl, J. L., Signal transduction in the plant immune response. Trends Biochem. Sci. 2000, 25, 79–82.

[27] Yalpani, N., Enyedi, A., Leon, J., Raskin, I., Ultraviolet light and ozone stimulate accumulation of salicylic acid, pathogenesis-related proteins and virus resistance in tobacco. Planta 1994, 193, 372–376.

[28] Yang, P. Z., Chen, C. H., Wang, Z. P., Fan, B. F., Chen, Z. X., A pathogen- and salicylic acid-induced WRKY DNA-binding activity recognizes the elicitor response element of the tobacco class I chitinase gene promoter. Plant J. 1999, 18, 141–149.

[29] Mackerness, S. A. H., Jordan, B. R., Thomas, B., Reactive oxygen species in the regulation of photosynthetic genes by ultraviolet-B radiation (UV-B: 280-320 nm) in green and etiolated buds of pea (Pisum sativum L.). J. Photochem. Photobiol. B Biol. 1999, 48, 180–188.

[30] Mackerness, S. A. H., Surplus, S. L., Blake, P., John, C. F. et al., Ultraviolet-B-induced stress and changes in gene expression in Arabidopsis thaliana: Role of signalling pathways controlled by jasmonic acid, ethylene and reactive oxygen species. Plant Cell Environ. 1999, 22, 1413–1423.

[31] Dat, J. F., Foyer, C. H., Scott, I. M., Changes in salicylic acid and antioxidants during induced thermotolerance in mustard seedlings. Plant Physiol. 1998, 118, 1455–1461.

[32] Kang, H. M., Saltveit, M. E., Chilling tolerance of maize, cucumber and rice seedling leaves and roots are differentially affected by salicylic acid. Physiol. Plant 2002, 115, 571–576.

[33] Verslues, P. E., Bray, E. A., Role of abscisic acid (ABA) and Arabidopsis thaliana ABA-insensitive loci in low water potential-induced ABA and proline accumulation. J. Exp. Bot. 2006, 57, 201–212.

Chris Gehring is Professor of Plant Biotechnology at the University of the Western Cape. After gaining his first degree from the University of Basel and a DPhil in Biochemistry from London University he undertook postdoctoral work at the University of Rome la Sapienza and La Trobe University in Melbourne. In 1996 he joined Deakin University in Geelong and in 2000 moved to Cape Town to join the University of the Western Cape. He has long been interested in molecular and cellular aspects of plant stress responses and in particular signalling molecules including second messengers and novel signalling mechanisms. The advent of the “omics” age has now opened the possibility to study many aspects of plant and in particular Arabidopsis stress responses at a systems level and promises to yield new insights into plant biology and eventually crop biotechnology.

Stuart Meier is currently a Post-Doctoral Fellow at the South African National Bioinformatics Institute. Stuart completed his undergraduate and PhD degrees at Deakin University in Geelong, Australia, and his dissertation investigated the role of Natriuretic Peptides in amphibians. Current research interests include analyzing global microarray expression data to identify co-expressing gene networks and the subsequent computational identification of putative regulatory elements within co-expressing genes that may be causative for their correlated expression. This type of analysis identifies groups of genes that function within common biological responses and thus provides insights into cellular functions from a systems biology perspective.


[34] Abel, S., Theologis, A., Early genes and auxin action. Plant Physiol. 1996, 111, 9–17.

[35] Nishiuchi, T., Suzuki, K., Kitajima, S., Sato, F., Shinshi, H., Wounding activates immediate early transcription of genes for ERFs in tobacco plants. Plant Mol. Biol. 2002, 49, 473–482.

[36] Bowling, S. A., Clarke, J. D., Liu, Y., Klessig, D. F., Dong, X., The cpr5 mutant of Arabidopsis expresses both NPR1-dependent and NPR1-independent resistance. Plant Cell 1997, 9, 1573–1584.

[37] Petersen, M., Brodersen, P., Naested, H., Andreasson, E. et al., Arabidopsis map kinase 4 negatively regulates systemic acquired resistance. Cell 2000, 103, 1111–1120.

[38] Dong, X., NPR1, all things considered. Curr. Opin. Plant Biol. 2004, 7, 547–552.

[39] Pieterse, C. M., Van Loon, L. C., NPR1: the spider in the web of induced resistance signaling pathways. Curr. Opin. Plant Biol. 2004, 7, 456–464.

[40] Delaney, T. P., Friedrich, L., Ryals, J. A., Arabidopsis signal transduction mutant defective in chemically and biologically induced disease resistance. Proc. Natl. Acad. Sci. USA 1995, 92, 6602–6606.

[41] Ryals, J., Weymann, K., Lawton, K., Friedrich, L. et al., The Arabidopsis NIM1 protein shows homology to the mammalian transcription factor inhibitor I kappa B. Plant Cell 1997, 9, 425–439.

[42] Glazebrook, J., Contrasting mechanisms of defense against biotrophic and necrotrophic pathogens. Annu. Rev. Phytopathol. 2005, 43, 205–227.

[43] Mosher, R., Durrant, W., Wang, D., Song, J., Dong, X., A comprehensive structure–function analysis of Arabidopsis SNI1 defines essential regions and transcriptional repressor activity. Plant Cell 2006, 18, 1750–1765.

[44] Brazma, A., Jonassen, I., Vilo, J., Ukkonen, E., Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 1998, 8, 1202–1215.

[45] Brazma, A., Jonassen, I., Eidhammer, I., Gilbert, D., Approaches to the automatic discovery of patterns in biosequences. J. Comp. Biol. 1998, 5, 279–305.

[46] Davidson, E. H., McClay, D. R., Hood, L., Regulatory gene networks and the properties of the developmental process. Proc. Natl. Acad. Sci. USA 2003, 100, 1475–1480.

[47] Beer, M. A., Tavazoie, S., Predicting gene expression from sequence. Cell 2004, 117, 185–198.

[48] Werner, T., The promoter connection. Nat. Genet. 2001, 29, 105–106.

[49] Tompa, M., Li, N., Bailey, T. L., Church, G. M. et al., Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotech. 2005, 23, 137–144.

[50] O’Connor, T. R., Dyreson, C., Wyrick, J. J., Athena: A resource for rapid visualization and systematic analysis of Arabidopsis promoter sequences. Bioinformatics 2005, 21, 4411–4413.

[51] Higo, K., Ugawa, Y., Iwamoto, M., Korenaga, T., Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Res. 1999, 27, 297–300.

[52] Davuluri, R. V., Sun, H., Palaniswamy, S. K., Matthews, N. et al., AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics 2003, 4, 25.

[53] Maleck, K., Levine, A., Eulgem, T., Morgan, A. et al., The transcriptome of Arabidopsis thaliana during systemic acquired resistance. Nat. Genet. 2000, 26, 403–410.

[54] Eulgem, T., Rushton, P. J., Robatzek, S., Somssich, I. E., The WRKY superfamily of plant transcription factors. Trends Plant Sci. 2000, 5, 199–206.

[55] Pnueli, L., Hallak-Herr, E., Rozenberg, M., Cohen, M. et al., Molecular and biochemical mechanisms associated with dormancy and drought tolerance in the desert legume Retama raetam. Plant J. 2002, 31, 319–330.

[56] Seki, M., Ishida, J., Narusaka, M., Fujita, M. et al., Monitoring the expression pattern of around 7,000 Arabidopsis genes under ABA treatments using a full-length cDNA microarray. Funct. Integr. Genomics 2002, 2, 282–291.

[57] Rizhsky, L., Liang, H., Shuman, J., Shulaev, V. et al., When defense pathways collide. The response of Arabidopsis to a combination of drought and heat stress. Plant Physiol. 2004, 134, 1683–1696.

[58] Pilpel, Y., Sudarsanam, P., Church, G. M., Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet. 2001, 29, 153–159.

[59] Tuch, B. B., Li, H., Johnson, A. D., Evolution of eukaryotic transcription circuits. Science 2008, 319, 1797–1799.

[60] Fogel, G. B., Weekes, D. G., Varga, G., Dow, E. R. et al., A statistical analysis of the TRANSFAC database. Biosystems 2005, 81, 137–154.

[61] Werner, T., Fessele, S., Maier, H., Nelson, P. J., Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003, 17, 1228–1237.

[62] Huang, E., Yang, L., Chowdhary, R., Kassim, A., Bajic, V., An algorithm for ab initio DNA motif detection, in: Information Processing and Living Systems, World Scientific Press, Singapore 2005, pp. 611–614.

[63] Kankainen, M., Holm, L., POBO, transcription factor binding site verification with bootstrapping. Nucleic Acids Res. 2004, 32, W222–229.

[64] Matys, V., Fricke, E., Geffers, R., Gossling, E. et al., TRANSFAC: Transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003, 31, 374–378.

[65] Boter, M., Ruiz-Rivero, O., Abdeen, A., Prat, S., Conserved MYC transcription factors play a key role in jasmonate signaling both in tomato and Arabidopsis. Genes Dev. 2004, 18, 1577–1591.

[66] Buchel, A. S., Brederode, F. T., Bol, J. F., Linthorst, H. J., Mutation of GT-1 binding sites in the Pr-1A promoter influences the level of inducible gene expression in vivo. Plant Mol. Biol. 1999, 40, 387–396.

[67] Tagle, D. A., Koop, B. F., Goodman, M., Slightom, J. L. et al., Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J. Mol. Biol. 1988, 203, 439–455.

[68] Blanchette, M., Tompa, M., Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 2002, 12, 739–748.

[69] Jareborg, N., Birney, E., Durbin, R., Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res. 1999, 9, 815–824.

[70] Wasserman, W. W., Palumbo, M., Thompson, W., Fickett, J. W., Lawrence, C. E., Human-mouse genome comparisons to locate regulatory sites. Nat. Genet. 2000, 26, 225–228.
