[Figure residue: gene density panels for chromosomes III, V, II and I. Horizontal axis: position along the chromosome (Mb); vertical axis: 0 to 350. Legend: AGI, EuGène, GST.]
[Figure residue: histogram with horizontal axis 150 to 480 bp and vertical axis 0 to 3000.]
[Figure residue: 384-well plate diagram (16 x 24) showing universal primers U 3’ row 2 and U 5’ col. 4.]
The Complete Arabidopsis Transcriptome MicroArray (CATMA) Project
Most cDNA clones included in DNA arrays are identified by an EST covering only a portion of their length. The complete clone sequence is generally unknown, and clones are not selected to yield hybridisation results specific to a single gene. In addition, ESTs represent only about half the genes identified in model eukaryote genomes.
To bypass these shortcomings, we are constructing a collection of high-quality Gene Specific Tags (GSTs) representing most Arabidopsis genes for use in microarray transcriptome analyses and in other functional genomic approaches.
1. Gene structural annotation
The identification of each gene in the Arabidopsis genome is at the root of any genome-wide effort to study their expression. Since the structure of only a minority of Arabidopsis genes has been determined experimentally so far, annotation still relies on gene prediction to identify the boundaries of transcription units and of the exon(s) within it (The AGI Consortium, 2000). Using the AGI nuclear genome, we have generated an updated structural annotation of all 5 Arabidopsis chromosomes.
The annotation process has been automated. It uses the EuGène software (Schiex et al., 2001) with a single set of parameters and algorithms applied to all chromosome regions (Figure 1A). Its prediction quality has been tested by matching results against a set of experimentally defined full-length cDNAs as described by Rouzé and collaborators (Pavy et al., 1999). Quality assessment parameters for chromosome 2 annotation are shown in Table 1.
EuGène identifies 29,804 genes in the Arabidopsis nuclear genome, which is higher than the 25,470 identified by the AGI (Figure 2). The detailed comparative analysis of the EuGène and AGI annotations is currently underway. Preliminary observations indicate that EuGène’s higher number results from the combination of several factors: EuGène can predict two genes where AGI annotates one, it predicts genes where none is annotated by AGI (3,369) more often than the contrary (1,533), and it seems biased towards overprediction in pericentromeric regions rich in repeated sequences.
Table 1. Assessment of EuGène prediction results

                          Plant-Gene     Araset
Actual genes              238            51
Correct gene models       182 (76%)      37 (67%)
Partial gene models       50 (21%)       14 (27%)
Split genes               5 (2%)         0
Missing genes             1 (0.5%)       0
Actual exons              1639           254
Missing exons             51 (3%)        15 (6%)
Missing exons in 5'       33 (2%)        8 (3%)
Missing exons, central    12 (0.7%)      5 (2%)
Missing exons in 3'       6 (0.4%)       2 (0.8%)
Wrong exons               1 (0.06%)      1 (0.4%)
2. Automated design of GSTs
Conclusion
The project is based on a novel complete unified annotation of the Arabidopsis nuclear genome, generated with our upgraded EuGène software, from which GSTs are selected with SPADS. We are currently studying how best to complement the current GST collection to minimize the presence of non-specific probes allowing hybridisation with transcripts from non-cognate genes. Given the structure of the GST collection, it can be adapted to a variety of microarray protocols and procedures. It can also serve as a key resource for other large scale functional genomic endeavours based on specific nucleic acid hybridisations, such as systematic Arabidopsis RNAi programmes.

References
The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815.
Pavy N, Rombauts S, Déhais P, Mathé C, Ramana DV, Leroy P and Rouzé P (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15, 887-899.
Schiex T, Moisan A and Rouzé P (2001) EuGène: an eukaryotic gene finder that combines several sources of evidence. In “JOBIM 2000, LNCS 2066”, O. Gascuel, M.F. Sagot (Eds.), pp 111-125.
Thareau V, Déhais P, Rouzé P and Aubourg S (2001) Automatic design of gene specific tags for transcriptome studies. Proc. of JOBIM'2001 (Journées Ouvertes Biologie Informatique Mathématiques). Toulouse, France.
3. Structure of the GST collection
Each primer designed to synthesize a GST carries a gene-specific 3’ domain corresponding to the sequence selected by SPADS (18-25 nt) and a 5’ extension (17 nt) that allows the GSTs to be reamplified with a limited set of universal primers. A set of 40 extensions has been designed so that each sample in a 384-well plate can be amplified with a unique combination of one row primer and one column primer, hence avoiding the cross-contamination that often plagues the storage and dissemination of large-scale clone collections. The primary amplicons, obtained in large excess from BAC DNA templates, can be conveniently reamplified and distributed. Using BAC templates also increases the quality of the GSTs and the fraction of successful PCR amplifications by reducing the complexity of the templates (Figure 5). All GSTs are oriented with regard to transcription, with column primers at the 5’ end (see above picture). As of 26 September 2001, the Consortium had PCR-amplified 16,280 GSTs.
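The row/column scheme can be illustrated with a short sketch. It assumes a standard 384-well layout of 16 rows (A to P) by 24 columns, which gives 16 + 24 = 40 universal extensions, with the column primer at the 5’ end and the row primer at the 3’ end. The primer names (`U5_col4`, `U3_row2`) and the function `universal_pair` are hypothetical labels chosen for illustration, not identifiers used by the Consortium.

```python
import string
from itertools import product

ROWS = string.ascii_uppercase[:16]  # plate rows A..P (assumed layout)
COLS = range(1, 25)                 # plate columns 1..24 (assumed layout)


def universal_pair(well: str) -> tuple[str, str]:
    """Return hypothetical (5' column primer, 3' row primer) names for a well such as 'B4'."""
    row, col = well[0].upper(), int(well[1:])
    if row not in ROWS or col not in COLS:
        raise ValueError(f"not a 384-well position: {well!r}")
    return (f"U5_col{col}", f"U3_row{ROWS.index(row) + 1}")


# One primer pair per well of the plate.
pairs = {f"{r}{c}": universal_pair(f"{r}{c}") for r, c in product(ROWS, COLS)}

# Distinct universal primers needed to address every well.
primers = {name for pair in pairs.values() for name in pair}
```

With this layout, every well receives a distinct pair while only 40 distinct universal primers are needed, so a sample reamplified with the wrong pair of primers should not yield product from a neighbouring well's template.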
Figure 5. Two-step GST amplification
[Diagram: a first round PCR with specific primer pairs (S 5’, S 3’) on genomic BAC DNA yields the primary amplicon; a second round PCR with universal primer pairs (U 5’, U 3’) reamplifies it.]
Figure 3. Transcription profiling with a test set of GSTs
A. Hybridization (Cy5 cDNA). Samples: leaf, flower bud, root, plantlet.
B. Signal according to length. Horizontal axis: probe length (bp), 100 to 2100; vertical axis: signal, logarithmic scale from 1 to 100000. Legend: predicted gene GST, known gene GST, intergenic region GST, highly expressed cDNA, negative control.
Figure 4. GST characteristics
A. Distribution of GST lengths: 150-200 bp: 42%; 200-300 bp: 36%; 300-500 bp: 22%.
B. Position of GSTs along the gene (ATG to stop; UTR and CDS): 5’: 3267 (16%); center: 5115 (24%); 3’: 12701 (60%).

Figure 2. Gene density according to the EuGène and AGI annotations

Figure 1. Gene identification and GST selection
A. EuGène: genomic fragment to genes, using RepeatMasker, Blastn, Blastx, Netstart, NetGene2 and SplicePredictor.
B. SPADS: gene sequence and exon coordinates to GST, using Blastn (GST specificity), Primer3, and Blastn (primer specificity).
Introduction
The Specific Primer & Amplicon Design Software (SPADS) selects specific regions within genes and designs primer pairs picked to amplify such regions (Figure 1B; Thareau et al., 2001). The procedure is summarised in the following four steps:
Our GST design is based on expressed sequences (EST or cDNA) or on coding regions predicted by EuGène (i.e. excluding UTRs not represented in an EST or cDNA). GST lengths range between 150 and 500 bp, which is sufficient to yield reproducible microarray signals for transcriptome analysis (Figure 3). Because of the inherent duplicated nature of the Arabidopsis genome, not all genes will be represented by perfect GSTs. By rejecting candidate sequences that show over 70% identity with another sequence in the Arabidopsis nuclear genome, our process has so far identified a GST for 21,420 (72.0%) of the 29,775 genes identified on all 5 chromosomes (Figure 2).
1. Search for the most specific region within each gene. Each exon is tested with BLASTn against the whole genome sequence and segments with hits are removed. Primer pairs are designed in the remaining regions. If none are detected, the mismatch parameter of BLASTn is decreased and only segments with stringent hits are subtracted, thus enlarging the specific remaining regions for primer design.
2. Primer design. The specific regions are used as input for the Primer3 software.
3. Selection of specific primer pairs. Oligonucleotides designed by Primer3 are tested for specificity with BLASTn against a 2 Mb segment containing the gene and are excluded if matches indicate potential unwanted PCR amplification.
4. Analysis of amplicon specificity. Each successive amplicon is tested with BLASTn to determine its specificity. If the identity with a putative paralogous sequence is over 70%, the amplicon is removed and the next one is processed. GSTs are searched from 3’ to 5’ until one is found.
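Steps 1 and 4 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SPADS implementation: BLASTn and Primer3 are not called, hits and amplicons are plain half-open coordinate intervals on the transcribed strand (so a larger end coordinate is closer to the 3’ end), and `max_identity` is a hypothetical callable standing in for the best BLASTn identity against putative paralogues. The function names `subtract_hits` and `pick_gst` are likewise invented for this example.

```python
from typing import Callable, Iterable, Optional

Interval = tuple[int, int]  # half-open [start, end) on the transcribed strand


def subtract_hits(exon: Interval, hits: Iterable[Interval]) -> list[Interval]:
    """Step 1 (sketch): remove segments with hits from an exon, keep the rest."""
    start, end = exon
    regions: list[Interval] = []
    cursor = start
    for h_start, h_end in sorted(hits):
        # Clip the hit to the exon boundaries.
        h_start, h_end = max(h_start, start), min(h_end, end)
        if h_start >= h_end:
            continue  # hit lies outside the exon
        if h_start > cursor:
            regions.append((cursor, h_start))
        cursor = max(cursor, h_end)
    if cursor < end:
        regions.append((cursor, end))
    return regions


def pick_gst(
    amplicons: Iterable[Interval],
    max_identity: Callable[[Interval], float],
    min_len: int = 150,
    max_len: int = 500,
    threshold: float = 70.0,
) -> Optional[Interval]:
    """Step 4 (sketch): scan amplicons from 3' to 5', return the first specific one."""
    for amp in sorted(amplicons, key=lambda a: a[1], reverse=True):
        length = amp[1] - amp[0]
        if not (min_len <= length <= max_len):
            continue  # outside the 150-500 bp GST length range
        if max_identity(amp) > threshold:
            continue  # over 70% identity with a putative paralogue: rejected
        return amp
    return None


# Toy data: one exon with two overlapping hits, then three candidate amplicons.
specific_regions = subtract_hits((0, 500), [(100, 200), (150, 250)])
identities = {(300, 500): 85.0, (250, 480): 62.0, (0, 200): 40.0}
chosen = pick_gst(identities, lambda amp: identities[amp])
```

In the toy data, the 3’-most candidate is rejected for exceeding the identity threshold, so the scan moves one candidate towards the 5’ end, which mirrors the 3’-to-5’ search order described in step 4.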
[Figure residue: gene density panel for chromosome IV. Horizontal axis: 0 to 17 Mb; vertical axis: 0 to 350.]
P. Hilson, T. Altmann, S. Aubourg, J. Beynon, F. Bitton, M. Caboche, M. Crowe, P. Dehais, H. Eickhoff, E. Kuhn, S. May, W. Nietfeld, J. Paz-Ares, W. Rensink, P. Reymond, P. Rouzé, U. Schneider, C. Serizet, A. Tabrett, V. Thareau, M. Trick, G. van den Ackerveken, P. Van Hummelen, P. Weisbeek, M. Zabeau
http://jic-bioinfo.bbsrc.ac.uk/CATMA/