
Gene Prediction Exercise


1. Gene Prediction Manual

A. Gene annotation

Step 1. Accessing the EMBL database to retrieve the gene

Go to the EMBL database
Select Nucleotide sequences
Type the sequence entry name HS307871
Press the Go button
Click on the EmblEntry link
Have a look at the different entry fields: detect the mRNA and CDS exons
Click on the Text Entry link to see the plain text formatted output
This is the sequence in FASTA format
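Once the FASTA text is saved, it can help to load it into a script before pasting it into the prediction servers, for example to check its length. The following is a minimal sketch of a FASTA reader. The function name `parse_fasta` and the toy record are illustrative assumptions; the toy sequence is not the real HS307871 entry.

```python
def parse_fasta(text):
    """Return a dict mapping each record identifier to its sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[current_id].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


# Toy record for illustration only (hypothetical content).
example = """>HS307871 toy fragment, not the real entry
acgtacgtac
ggtaag
"""
sequences = parse_fasta(example)
print(len(sequences["HS307871"]))  # prints 16
```

The reader joins wrapped sequence lines and upper-cases them, so the downstream checks do not depend on the case used by the database output.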

B. Exploring ab initio gene prediction

Step 2. Running geneid

Connect to the geneid server
Paste the FASTA sequence
Choose the geneid output format
Run geneid with different parameters:
  1. Searching signals: Select acceptors, donors, start and stop codons. Look for them in the real annotation of the sequence.
  2. Searching exons: Select All exons and try to find the real ones.
  3. Finding genes: You do not need to select any option (default behaviour). Compare the predicted gene with the real gene.
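To get a feel for why the signal search returns so many candidates, the sketch below lists every position on the forward strand where the bare consensus of each signal occurs: ATG for starts, GT for donors, AG for acceptors, and TAA/TAG/TGA for stops. This is only an illustration and not how geneid works internally, since geneid scores each candidate with statistical models rather than reporting every match. The function name `find_signals` and the toy sequence are assumptions made for this example.

```python
import re

STOP_CODONS = ("TAA", "TAG", "TGA")


def find_signals(seq):
    """List 0-based forward-strand positions of unscored consensus signals."""
    seq = seq.upper()

    def positions(motif):
        # A lookahead reports overlapping occurrences as well.
        return [m.start() for m in re.finditer(f"(?={motif})", seq)]

    return {
        "start": positions("ATG"),
        "donor": positions("GT"),
        "acceptor": positions("AG"),
        "stop": sorted(p for codon in STOP_CODONS for p in positions(codon)),
    }


# Toy sequence (hypothetical): a short exon, a GT...AG intron, a second exon.
toy = "CCATGGCGGTAAGTCCTTTCAGGTCTGACC"
signals = find_signals(toy)
for name, where in signals.items():
    print(name, where)
```

Even in this 30-base toy sequence there are three GT dinucleotides and only one of them is the intended donor, which is why the real sites have to be picked out by comparing with the annotation.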

Figure 1. Signals, exons and genes predicted by geneid in the sequence HS307871

Step 3. Running other genefinders

Since several alternative programs can analyze a DNA sequence, we can run each of them and look at what their predictions have in common.

1. GENSCAN:
  Connect to the GENSCAN server
  Paste the DNA sequence
  Press the Run Genscan button
  Compare annotations and predictions

2. FGENESH:
  Connect to the Softberry homepage
  On the left frame, select GENE FINDING in Eukaryota
  Select the program FGENESH
  Paste the DNA sequence
  Press the Search button
  Compare annotations and predictions

3. GRAIL:
  Connect to the GrailEXP homepage
  Activate the Perceval Exon Candidates box
  Paste the DNA sequence
  Press the Go! button
  Check the results
  Compare annotations and predicted exons

4. NOTE: The first exon is missed in all of these predictions, and the programs have trouble detecting the donor site of exon 5. Detecting start codons is a serious weakness of current gene finding programs (see Figure 2). However, this problem can be overcome by using homology information to complete the gene prediction.
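When comparing the output of several gene finders with the annotation, it is useful to count how many annotated exons each program recovers exactly. The sketch below computes exon-level sensitivity (the fraction of annotated exons predicted with both boundaries correct) and exon-level specificity (the fraction of predicted exons that are exactly right), and lists the annotated exons that were missed. The function name `exon_accuracy` and all coordinates are hypothetical; they are not the real HS307871 exons.

```python
def exon_accuracy(annotated, predicted):
    """Compare exons given as (start, end) tuples; only exact matches count."""
    annotated, predicted = set(annotated), set(predicted)
    exact = annotated & predicted
    sensitivity = len(exact) / len(annotated) if annotated else 0.0
    specificity = len(exact) / len(predicted) if predicted else 0.0
    missed = sorted(annotated - predicted)
    return sensitivity, specificity, missed


# Hypothetical coordinates: the prediction misses the first annotated exon
# and adds one exon that is not in the annotation.
annotated_exons = [(100, 180), (400, 520), (900, 1010)]
predicted_exons = [(400, 520), (900, 1010), (1500, 1600)]

sn, sp, missed = exon_accuracy(annotated_exons, predicted_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f} missed={missed}")
```

Running the same function on the exon list from each program makes it easy to see which exons every program finds and which ones, such as a short first exon, are missed by all of them.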

Figure 2. EMBL annotation and genes predicted by Grail, GENSCAN, geneid and FGENESH in the sequence HS307871

C. Using EST/cDNA homology information

Step 4. Using GrailEXP

Connect to the GrailEXP homepage
Activate the Galahad EST/mRNA/cDNA Alignments box
Select the GrailEXP database (RefSeq/HTDB/dbEST/EGAD/Riken)
Activate exon assembly: Gawain Gene Models
Paste the DNA sequence
Press the Go! button
Check the results: predictions and supporting information
Compare the annotations, the ab initio GRAIL prediction and the five predicted alternatively spliced variants

Figure 3. Comparison between the EMBL annotation and the genes predicted ab initio by Grail vs. five alternative predictions supported by EST information in the sequence HS307871

Step 5. Using other gene finding programs + alignment of transcripts

Using blastn, we can search the database est_human for ESTs that support further predictions. Filter this output to select non-overlapping ESTs that together could form a complete cDNA sequence (see Figure 4). Moreover, reject any EST that is not split into two or more pieces when aligned to the genomic sequence, that is, one that does not span at least one pair of splice sites.

Connect to the FGENESH-C server (in the Gene finding with similarity menu)
Paste the sequence HS307871
Paste the cDNA sequence or EST you have selected
Press the search button

Notice that the predicted gene must be supported by homology information, so it will probably be mapped only in the genomic region overlapping your EST query.
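The EST filtering described in this step can be sketched as two small functions: one that rejects ESTs aligning in a single block (no splice sites), and one that greedily keeps ESTs whose genomic spans do not overlap. This assumes the blastn hits have already been grouped per EST into lists of genomic (start, end) blocks; that data layout, the function names and the toy hits are all hypothetical, and no real BLAST output format is parsed here.

```python
def spliced_only(hits):
    """Keep ESTs that align to the genomic sequence in two or more blocks."""
    return [hit for hit in hits if len(hit["blocks"]) >= 2]


def span(hit):
    """Return the genomic interval covered by all blocks of one EST."""
    starts = [start for start, _ in hit["blocks"]]
    ends = [end for _, end in hit["blocks"]]
    return min(starts), max(ends)


def non_overlapping(hits):
    """Greedily select ESTs whose genomic spans do not overlap."""
    chosen = []
    last_end = None
    # Sorting by span end is the classic interval-scheduling heuristic.
    for hit in sorted(hits, key=lambda h: span(h)[1]):
        start, end = span(hit)
        if last_end is None or start > last_end:
            chosen.append(hit)
            last_end = end
    return chosen


# Hypothetical hits: EST_C is unspliced, EST_B overlaps EST_A.
hits = [
    {"est": "EST_A", "blocks": [(100, 180), (400, 520)]},
    {"est": "EST_B", "blocks": [(450, 520), (900, 1010)]},
    {"est": "EST_C", "blocks": [(600, 700)]},
    {"est": "EST_D", "blocks": [(900, 1010), (1300, 1400)]},
]
selected = non_overlapping(spliced_only(hits))
print([hit["est"] for hit in selected])  # prints ['EST_A', 'EST_D']
```

The greedy choice is only one reasonable policy. In practice you may prefer to rank ESTs by alignment score or identity before resolving overlaps.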

Figure 4. Best human ESTs in the alignment mapped on the genomic sequence HS307871

D. Using protein homology information

Step 6. Spliced alignment

Spliced alignment is very useful when we have additional information about the content of the sequence, such as a putative homologous protein sequence. Gene prediction is then guided by fitting the protein sequence into the best splice sites predicted in the genomic sequence.

Open the NCBI blast server
Choose the blastx program (genomic query versus protein database)
Paste the genomic sequence and press the Blast! and Format! buttons
Select the first protein. Display the FASTA sequence or click here. Obviously, it is the real protein annotated in the genomic sequence.
Open the genewise web server to use this protein to predict the best gene structure
Paste both protein and genomic sequences and run the program
Compare the predicted gene (end of the file) with the annotations: look for the splice sites within the introns to check that the exon boundaries are correct
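The last check in this step, looking at the splice sites inside the introns, can be automated once the exon coordinates are known. The sketch below takes exons as 1-based inclusive (start, end) pairs, extracts each intron between consecutive exons, and reports whether it begins with GT and ends with AG, the canonical donor and acceptor dinucleotides. The function name `check_introns`, the coordinate convention and the toy sequence are assumptions for illustration; the rare non-canonical introns would be flagged as failures by this simple test.

```python
def check_introns(seq, exons):
    """Report the donor/acceptor dinucleotides of each intron.

    exons: list of (start, end) pairs, 1-based and inclusive.
    Returns tuples (intron_start, intron_end, donor, acceptor, canonical).
    """
    seq = seq.upper()
    exons = sorted(exons)
    report = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # Intron occupies 1-based positions left_end+1 .. right_start-1.
        intron = seq[left_end:right_start - 1]
        donor, acceptor = intron[:2], intron[-2:]
        canonical = donor == "GT" and acceptor == "AG"
        report.append((left_end + 1, right_start - 1, donor, acceptor, canonical))
    return report


# Toy sequence (hypothetical): exon 1..6, intron 7..20, exon 21..26.
toy = "ATGGCG" + "GTAAGTCCTTTCAG" + "GTCTGA"
good = check_introns(toy, [(1, 6), (21, 26)])
shifted = check_introns(toy, [(1, 7), (21, 26)])  # exon end off by one
print(good)
print(shifted)
```

An exon boundary that is off by even one base, as in the second call, shifts the intron so that it no longer starts with GT, which makes this a quick sanity check for predicted gene structures.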

Figure 5. Best HSPs for homologous proteins similar to the genomic sequence HS307871, obtained using blastx

Step 7. Spliced alignment using homologous proteins

From the blastx output, choose several homologous proteins and run genewise again for each one separately. Observe how accuracy improves the closer the homologue is to the original human protein:

Homo sapiens
Ovis aries
Mus musculus
Rattus norvegicus
Danio rerio
Drosophila melanogaster
Drosophila virilis
Saccharomyces cerevisiae
Schizosaccharomyces pombe

Figure 6. Graphical comparison of the real gene annotation and different genewise predictions using different homologous proteins for the gene uroporphyrinogen decarboxylase (URO-D)

Step 8. Using protein homology information: GenomeScan

Protein homology information can also be used to favor ab initio predicted exons that are supported by blastx HSPs, as GenomeScan and geneid do, thereby improving the final prediction.

GenomeScan:
  Connect to the GenomeScan web server
  Retrieve the protein from the previous blast search
  Paste both genomic and protein sequences
  Press the GenomeScan button
  Check the results

It seems that the first exon has not been detected even when using homology information. This is probably because blast programs require a minimum word length, so a match that is too short to seed an alignment is not reported.

Figure 6. GenomeScan output: the first exon is not correctly predicted, probably due to blast length restrictions

E. Using a genome annotation browser

Step 9. Golden path archive:

Open the UCSC Genome Bioinformatics Site
Select the blat link to locate the genomic coordinates of our sequence
Paste the DNA sequence in FASTA format (HS307871)
Submit the file
Click on the first hit (browser link)
Compare the graphical annotation with the EMBL entry of the gene
Analyze these different sets of output options: Genes and Gene Prediction Tracks, mRNA and EST Tracks

Figure 7. (a) UCSC genome browser representation of the region containing the gene uroporphyrinogen decarboxylase (URO-D). (b) UCSC genome browser representation of the context (100 kbp) region around the gene uroporphyrinogen decarboxylase (URO-D).

F. Results

Here you can find the solutions to every exercise:

EMBL annotation

EMBL annotation (plain text)

FASTA sequence

geneid results: signals

geneid results: exons

geneid results: genes

GENSCAN results

FGENESH results

GRAIL results

GrailEXP results

Blastn + human ESTs results

Blastx + protein results

Genewise (human protein)

Genewise (ovis protein)

Genewise (mouse protein)

Genewise (rat protein)

Genewise (Danio rerio protein)

Genewise (Drosophila melanogaster protein)

Genewise (Drosophila virilis protein)

Genewise (yeast protein)

Genewise (fission yeast protein)

GenomeScan results
