

[Figure 2 panels: gene density along chromosomes III, V, II and I (position in Mb) for the AGI, EuGène and GST sets.]

[Plate diagram: 16 x 24 wells; the example well is amplified with universal 3' row primer 2 and universal 5' column primer 4.]

The Complete Arabidopsis Transcriptome MicroArray (CATMA) Project

Most cDNA clones included in DNA arrays are identified by an EST covering only a portion of their length. The complete clone sequence is generally unknown and is not selected to yield hybridisation results specific to a single gene. ESTs only represent about half the genes identified in model eucaryote genomes.

To bypass these shortcomings, we are constructing a collection of high quality Gene Specific Tags (GSTs) representing most Arabidopsis genes for use in microarray transcriptome analyses and in other functional genomic approaches.

1. Gene structural annotation

The identification of each gene in the Arabidopsis genome is at the root of any genome-wide effort to study gene expression. Since the structure of only a minority of Arabidopsis genes has been determined experimentally so far, annotation still relies on gene prediction to identify the boundaries of transcription units and of the exon(s) within them (The AGI Consortium, 2000). Using the AGI nuclear genome sequence, we have generated an updated structural annotation of all 5 Arabidopsis chromosomes.

The annotation process has been automated. It uses the EuGène software (Schiex et al., 2001) with a single set of parameters and algorithms applied to all chromosome regions (Figure 1A). Its prediction quality has been tested by matching results against a set of experimentally defined full-length cDNAs, as described by Rouzé and collaborators (Pavy et al., 1999). Quality assessment parameters for chromosome 2 annotation are shown in Table 1.

EuGène identifies 29,804 genes in the Arabidopsis nuclear genome, which is higher than the 25,470 identified by the AGI (Figure 2). The detailed comparative analysis of the EuGène and AGI annotations is currently underway. Preliminary observations indicate that EuGène’s higher number results from the combination of several factors: EuGène can predict two genes where AGI annotates one, it predicts genes where none is annotated by AGI (3,369) more often than the contrary (1,533), and it seems biased towards overprediction in pericentromeric regions rich in repeated sequences.
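The counts quoted above can be put side by side with a little bookkeeping. The sketch below is only an illustration of the arithmetic, not an analysis reported by the authors: it assumes that genes found by one annotation only change the total by one each, so that whatever is left of the difference would have to come from the other factors mentioned (for example, two EuGène genes where AGI annotates one).

```python
# Gene counts quoted in the text.
eugene_total = 29_804   # genes predicted by EuGène
agi_total = 25_470      # genes annotated by the AGI
eugene_only = 3_369     # EuGène gene where AGI annotates none
agi_only = 1_533        # AGI gene where EuGène predicts none

# Overall excess of EuGène predictions over AGI annotations.
difference = eugene_total - agi_total

# Net contribution of genes seen by only one of the two annotations.
net_unmatched = eugene_only - agi_only

# Assumption of this sketch: the rest of the excess comes from loci where
# both annotations call genes but disagree on how many (e.g. split genes).
remainder = difference - net_unmatched

print(f"difference:    {difference}")
print(f"net unmatched: {net_unmatched}")
print(f"remainder:     {remainder}")
```

Under that assumption, fewer than half of the additional EuGène genes would be explained by unmatched predictions alone; the detailed comparison mentioned above would be needed to confirm how the rest is distributed.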

Table 1. Assessment of EuGène prediction results

                          Plant-Gene      Araset
actual genes              238             51
correct gene models       182 (76%)       37 (67%)
partial gene models       50 (21%)        14 (27%)
split genes               5 (2%)          0
missing genes             1 (0.5%)        0
actual exons              1639            254
missing exons             51 (3%)         15 (6%)
missing exons in 5'       33 (2%)         8 (3%)
missing exons central     12 (0.7%)       5 (2%)
missing exons in 3'       6 (0.4%)        2 (0.8%)
wrong exons               1 (0.06%)       1 (0.4%)

2. Automated design of GSTs

Conclusion

The project is based on a novel complete unified annotation of the Arabidopsis nuclear genome, generated with our upgraded EuGène software, from which GSTs are selected with SPADS. We are currently studying how best to complement the current GST collection to minimize the presence of non specific probes allowing hybridisation with transcripts from non cognate genes. Given the structure of the GST collection, it can be adapted to a variety of microarray protocols and procedures. It can also serve as a key resource for other large scale functional genomic endeavours based on specific nucleic acid hybridisations, such as systematic Arabidopsis RNAi programmes.

References

The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815.

Pavy N, Rombauts S, Déhais P, Mathé C, Ramana DV, Leroy P and Rouzé P (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15, 887-899.

Schiex T, Moisan A and Rouzé P (2001) EuGène: an eukaryotic gene finder that combines several sources of evidence. In “JOBIM 2000, LNCS 2066”, O. Gascuel, M.F. Sagot (Eds.), pp 111-125.

Thareau V, Déhais P, Rouzé P and Aubourg S (2001) Automatic design of gene specific tags for transcriptome studies. Proc. of JOBIM'2001 (Journées Ouvertes Biologie Informatique Mathématiques). Toulouse, France.

3. Structure of the GST collection

Each primer designed to synthesize a GST carries a gene-specific 3’ domain corresponding to the sequence selected by SPADS (18-25 nt) and a 5’ extension (17 nt) added to allow for reamplification of the GSTs with a limited set of universal primers. A set of 40 extensions has been designed so that each sample in a 384-well plate can be amplified with a unique combination of one row primer and one column primer, hence avoiding the cross-contamination which often plagues the storage and dissemination of large-scale clone collections. The primary amplicons, obtained in large excess from BAC DNA templates, can be conveniently reamplified and distributed. Also, amplicon production from BAC templates increases the quality of the GSTs and the fraction of successful PCR amplifications by reducing the complexity of the templates (Figure 5). All GSTs are oriented with regard to transcription, with column primers at the 5’ end (see picture above). As of 26 September 2001, the Consortium had PCR-amplified 16,280 GSTs.
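The row and column scheme can be made concrete with a short sketch. It assumes a standard 384-well layout of 16 rows (A to P) by 24 columns, which is what gives 16 + 24 = 40 extensions; the primer names (`U5_col4`, `U3_row2`) and the helper functions are hypothetical labels chosen for illustration, not the consortium's nomenclature, and no real extension sequence is shown.

```python
import string
from typing import Tuple

ROW_LABELS = string.ascii_uppercase[:16]  # plate rows A..P (assumed layout)
N_COLUMNS = 24                            # plate columns 1..24

EXTENSION_LEN = 17            # 5' universal extension, from the text
SPECIFIC_LEN_RANGE = (18, 25)  # gene-specific 3' domain, from the text


def universal_primers_for_well(well: str) -> Tuple[str, str]:
    """Return (5' column primer, 3' row primer) names for a well such as 'B4'."""
    row_label, column = well[0].upper(), int(well[1:])
    if row_label not in ROW_LABELS or not 1 <= column <= N_COLUMNS:
        raise ValueError(f"not a 384-well position: {well!r}")
    row_number = ROW_LABELS.index(row_label) + 1
    # Column primers sit at the 5' end of the GST, row primers at the 3' end.
    return (f"U5_col{column}", f"U3_row{row_number}")


def tailed_primer(extension: str, specific: str) -> str:
    """Join a universal extension and a gene-specific domain, written 5' to 3'."""
    if len(extension) != EXTENSION_LEN:
        raise ValueError("extension must be 17 nt")
    shortest, longest = SPECIFIC_LEN_RANGE
    if not shortest <= len(specific) <= longest:
        raise ValueError("gene-specific domain must be 18-25 nt")
    return extension + specific


if __name__ == "__main__":
    print(universal_primers_for_well("B4"))
    wells = [f"{r}{c}" for r in ROW_LABELS for c in range(1, N_COLUMNS + 1)]
    pairs = {universal_primers_for_well(w) for w in wells}
    print(len(wells), "wells,", len(pairs), "distinct primer pairs")
```

Because every well receives a different (column, row) pair, an amplicon carried over from another well shares at most one universal primer with its new neighbour, which is presumably why the design limits cross-contamination during reamplification.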

Figure 5. Two-step GST amplification. First round: PCR with specific primer pairs on genomic BAC DNA, yielding the primary amplicon. Second round: PCR with universal primer pairs on the primary amplicon.

Figure 3. Transcription profiling with a test set of GSTs

A. Hybridization (Cy5 cDNA): leaf, flower bud, root, plantlet.
B. Signal according to probe length (bp), flower bud sample. Series: predicted gene GST, known gene GST, intergenic region GST, highly expressed cDNA, negative control.

Figure 4. GST characteristics

A. Distribution of GST lengths: 150-200 bp, 42%; 200-300 bp, 36%; 300-500 bp, 22%.
B. Position of GSTs within the gene (ATG to stop; UTR and CDS): 5’, 3267 (16%); center, 5115 (24%); 3’, 12701 (60%).

Figure 2. Gene density according to the EuGène and AGI annotations

Figure 1. Gene identification and GST selection

A. EuGène: from genomic fragment to genes, combining RepeatMasker, Blastn, Blastx, Netstart, NetGene2 and SplicePredictor.
B. SPADS: from gene sequence and exon coordinates to GST, using Blastn and Primer3 to check GST specificity and primer specificity.

Our GST design is based on expressed sequences (EST or cDNA) or on coding regions predicted by EuGène (i.e. excluding UTRs not represented in ESTs or cDNAs). GST lengths range between 150 and 500 bp, which is sufficient to yield a reproducible microarray signal for transcriptome analysis (Figure 3). Because of the inherently duplicated nature of the Arabidopsis genome, not all genes will be represented by perfect GSTs. Rejecting candidate sequences that show over 70% identity with another sequence in the Arabidopsis nuclear genome, our process has so far identified a GST for 21,420 (72.0%) of the 29,775 genes identified on all 5 chromosomes (Figure 2).

The Specific Primer & Amplicon Design Software (SPADS) selects specific regions within genes and designs primer pairs to amplify such regions (Figure 1B; Thareau et al., 2001). The procedure is summarised in the following four steps:

1. Search for the most specific region within each gene. Each exon is tested with BLASTn against the whole genome sequence and segments with hits are removed. Primer pairs are designed in the remaining regions. If none are found, the mismatch parameter of BLASTn is decreased and only segments with stringent hits are subtracted, thus enlarging the remaining specific regions available for primer design.

2. Primer design. The specific regions are used as input for the Primer3 software.

3. Selection of specific primer pairs. Oligonucleotides designed by Primer3 are tested for specificity with BLASTn against a 2 Mb segment containing the gene and are excluded if matches indicate potential unwanted PCR amplification.

4. Analysis of amplicon specificity. Each successive amplicon is tested with BLASTn to determine its specificity. If the identity with a putative paralogous sequence is over 70%, the amplicon is removed and the next one is processed. GSTs are searched from 3’ to 5’ until one is found.
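Steps 1 and 4 of the procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SPADS implementation: BLASTn hits are represented as precomputed coordinate intervals and identity percentages rather than obtained by running BLAST, coordinates are assumed to increase from 5’ to 3’ along the transcript, and the names `subtract_hits`, `Amplicon` and `select_gst` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end), transcript coordinates


def subtract_hits(exon: Interval, hits: List[Interval]) -> List[Interval]:
    """Step 1: remove the segments covered by hits and return what remains."""
    start, end = exon
    free: List[Interval] = []
    cursor = start
    for h_start, h_end in sorted(hits):
        # Clip each hit to the exon boundaries.
        h_start, h_end = max(h_start, start), min(h_end, end)
        if h_start >= h_end:
            continue  # hit lies entirely outside the exon
        if h_start > cursor:
            free.append((cursor, h_start))
        cursor = max(cursor, h_end)
    if cursor < end:
        free.append((cursor, end))
    return free


@dataclass
class Amplicon:
    start: int
    end: int
    best_paralog_identity: float  # % identity of the best non-cognate hit


def select_gst(candidates: List[Amplicon], max_identity: float = 70.0,
               min_len: int = 150, max_len: int = 500) -> Optional[Amplicon]:
    """Step 4: scan candidates from 3' to 5' and keep the first specific one."""
    for amp in sorted(candidates, key=lambda a: a.end, reverse=True):
        if not min_len <= amp.end - amp.start <= max_len:
            continue  # outside the 150-500 bp GST length range
        if amp.best_paralog_identity > max_identity:
            continue  # too similar to a putative paralogue
        return amp
    return None  # no specific GST could be found for this gene


if __name__ == "__main__":
    print(subtract_hits((0, 1000), [(100, 200), (150, 300), (900, 1100)]))
    candidates = [Amplicon(800, 1000, 85.0), Amplicon(500, 720, 40.0),
                  Amplicon(100, 300, 10.0)]
    print(select_gst(candidates))
```

In the example, the 3’-most candidate is rejected because its identity to another sequence exceeds 70%, so the next candidate towards the 5’ end is retained; a gene for which `select_gst` returns `None` would be among those without a GST.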

[Figure 2 panel: gene density along chromosome IV (position in Mb).]

P. Hilson, T. Altmann, S. Aubourg, J. Beynon, F. Bitton, M. Caboche, M. Crowe, P. Dehais, H. Eickhoff, E. Kuhn, S. May, W. Nietfeld, J. Paz-Ares, W. Rensink, P. Reymond, P. Rouzé, U. Schneider, C. Serizet, A. Tabrett, V. Thareau, M. Trick, G. van den Ackerveken, P. Van Hummelen, P. Weisbeek, M. Zabeau

http://jic-bioinfo.bbsrc.ac.uk/CATMA/