15 January 2006, PAG XIV, San Diego. Rémy Bruggmann, MIPS/IBI, GSF
A Bioinformatic Framework to Unravel the Secrets of the Tomato Genome
Outline
Introduction
Data management
Annotation
Training/Test gene set
Summary
MIPS' look at the Green Side of Life
– genome projects and database activities –
Arabidopsis thaliana, Arabidopsis lyrata *, Capsella rubella *
Maize, Rice
Medicago, Lotus
Solanum lycopersicum
Need to streamline and unify databases as well as analytical schemas and operation routines
Strong synergism and very robust
Risk of losing flexibility and "custom-tailored" attractiveness
Awareness that not every genome and every community "is just the same"
From Center Centric Strategies to distributed Approaches
Typically, genome projects undergo particular phases:
Sequenced BACs are annotated
Gene models are published to the community
Potentially generates competition rather than collaboration among groups
Consequences can be:
Underlying analytical procedures are not always tested, trained and evaluated
More or less pronounced differences exist between groups --> differing, contradicting and conflicting data
Aim of all groups:
"Information-enriched, high-quality genome backbone to address genome-scale biological questions"
An example ...
International Medicago Genome Annotation Group
Consists of groups participating in either the International or the European Medicago Genome Initiative annotation/bioinformatics programs
Agreement on common annotation standards, data exchange formats and naming conventions
Aims to produce and provide a unified, high-quality Medicago data set
Advantages of sharing efforts in genome annotation within a common annotation pipeline
Prevents:
(i) duplicated efforts
(ii) conflicts resulting from different annotation "standards"
Ensures high-quality annotation standards
Ensures common (gene) naming and a common dataset
Integrates and profits from the knowledge and expertise of the individual groups
Data management
All data should be organized in a genome database
Wishlist for a modern genome db
Complete
Comprehensive
Up-to-date
Integrated
User interface
Application interface
State-of-the-art automatic analysis
Adaptable
Cross-genome comparison
…low cost, low manpower...
PlantsDB Philosophy
Plants Genome Resource: provides and integrates sequence data from European plant sequencing consortia along with publicly available data from the international initiative
PlantsDB communicates bioinformatic analysis data (visualization, genetic elements, structural data, ontologies, domains...; BLAST, browse and search,…comparative analysis)
Integration: provides a distributed network to integrate and retrieve data from heterogeneous resources using BioMOBY (connection to other plant DBs, PlaNET)
Preliminary Annotation Pipeline
Towards a preliminary annotation
(i) Repeat detection: RepeatMasker with the Repeat Database and Repeat Ontology; produces masked sequences and a repeat annotation
(ii) Gene prediction: GenomeThreader, FGenesH++/ProtMap, GeneMarkHMM; uses external databases (EST DB: EST assemblies; Protein DB: e.g. SwissProt)
(iii) GAME XML: document of computational results
(iv) Manual annotation in the Apollo Genome Viewer
(v) PlantsDB: web access, GBrowse
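To illustrate what "masked sequences" means in the repeat detection step, here is a minimal sketch in Python. It assumes repeat positions are already known as 0-based, end-exclusive intervals; the function name `hard_mask` and the interval format are hypothetical and are not part of RepeatMasker, which produces masked sequence itself (by default it replaces repeats with N).

```python
def hard_mask(sequence: str, repeats: list[tuple[int, int]]) -> str:
    """Replace every repeat interval [start, end) with 'N'.

    Intervals are 0-based and end-exclusive (an assumption of this sketch).
    """
    masked = list(sequence)
    for start, end in repeats:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Overwrite the repeat with the same number of N characters,
        # so downstream coordinates stay unchanged.
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


print(hard_mask("ACGTACGTAC", [(2, 5)]))  # ACNNNCGTAC
```

Masking keeps the sequence length constant, so gene predictions made on the masked sequence can be mapped back to the BAC without any coordinate shift.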
First Results
RepeatMasker
5.8 Mb analysed (48 BACs)
~6.7% repetitive elements (<0.2% to 23% per BAC)
~1 min/100 kb
Whole genome (euchromatic part): ~2 days
[Figure: repeat content (%) per BAC]
Status: December 2005
Preliminary Results
Comparison of different gene finders
[Figure: ab initio predictions (FGenesH, GeneMark) compared with EST/TC alignments]
FGenesH++ and GeneMarkHMM currently often generate incomplete or incorrect gene models
There are no matrices available that are trained for tomato
Tomato-trained matrices are expected to improve prediction quality substantially
Collection of annotated high-quality genes for a training/test set for EuGene, FGenesH, GeneMarkHMM, ...
Training/Test Gene Set
How can we get a training/test set?
Map available tomato cDNA/ESTs to the BACs (use only high-confidence matches)
Link experimental data to the gene models
Use this gene set for ab initio gene finder training
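The "high-confidence matches only" step above can be sketched as a simple filter over alignment records. The record fields, the function name and both thresholds (95% coverage, 95% identity) are assumptions made for illustration; the slides do not state which cut-offs were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplicedAlignment:
    """Hypothetical record for one cDNA/EST aligned to a BAC."""
    est_id: str
    bac_id: str
    coverage: float  # fraction of the cDNA/EST that is aligned (0..1)
    identity: float  # fraction of identical positions in the alignment (0..1)


def high_confidence(alignments, min_coverage=0.95, min_identity=0.95):
    """Keep only alignments that pass both (assumed) thresholds."""
    return [
        a for a in alignments
        if a.coverage >= min_coverage and a.identity >= min_identity
    ]


candidates = [
    SplicedAlignment("est1", "bacA", coverage=0.99, identity=0.98),
    SplicedAlignment("est2", "bacA", coverage=0.60, identity=0.99),
    SplicedAlignment("est3", "bacB", coverage=0.97, identity=0.80),
]
print([a.est_id for a in high_confidence(candidates)])  # ['est1']
```

Strict thresholds shrink the gene set but reduce the risk of training a gene finder on wrong exon boundaries.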
GenomeThreader
GenomeThreader is used for EST/cDNA mapping:
Similarity-based approach: ESTs/proteins are used to predict gene structure via optimal spliced alignments
Offers many options (full user control)
Incremental updates (avoid a lot of duplicated computation)
Improved version of GeneSeqer
GenomeThreader - calculations
DB                   Entries   Size [MB]   Calc time/100 kb   Whole genome
Tomato                 32401          27               27 s   }
MicroTom               26363          21               22 s   } ~2.8 days
Potato                 38239          34               23 s   }
Tobacco                28661          20               39 s   }
Arabidopsis cDNAs      31939          45               10 s   0.3 days
Dicots                404822         311              170 s   4.3 days
Rice cds               15639          21                8 s   0.2 days
Uni_trembl Plants     185564          74               38 s   1.0 day
Uniprot_swissprot     181571          82                8 s   0.2 days
Nonred               1675230         662              437 s   11.1 days
Total                2834224        1433             14 min   22 days
(single CPU, euchromatic part)
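The whole-genome column follows from the per-100-kb time by simple scaling. The sketch below shows the arithmetic; the euchromatic size of 220 Mb is an assumption chosen because it reproduces the per-database figures in the table, not a number stated on the slides, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400
ASSUMED_EUCHROMATIN_MB = 220  # assumption, not stated on the slides


def whole_genome_days(seconds_per_100kb: float, genome_mb: float) -> float:
    """Scale a per-100-kb run time to a whole genome, in days (single CPU)."""
    windows = genome_mb * 1_000_000 / 100_000  # number of 100-kb windows
    return seconds_per_100kb * windows / SECONDS_PER_DAY


# Nonred: 437 s per 100 kb
print(round(whole_genome_days(437, ASSUMED_EUCHROMATIN_MB), 1))  # 11.1
# Dicots: 170 s per 100 kb
print(round(whole_genome_days(170, ASSUMED_EUCHROMATIN_MB), 1))  # 4.3
# Tomato + MicroTom + Potato + Tobacco: 27 + 22 + 23 + 39 s per 100 kb
print(round(whole_genome_days(27 + 22 + 23 + 39, ASSUMED_EUCHROMATIN_MB), 1))  # 2.8
```

Because the estimate is linear in both arguments, splitting the BACs across several CPUs would divide the wall-clock time accordingly.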
Examples
[Figures: examples with tracks labelled Tomato, MicroTom, Potato and Tobacco; one slide titled "Examples - UK"]
Number of high quality genes
Number of genes: 164 (covered completely by cDNA/ESTs)
~3.4 genes/BAC (range: 0 to 9 genes/BAC)
These genes can be used to train gene finders
(Only very good alignments considered)
[Figure: number of genes per BAC]
Gene Finder
Which program can be trained for tomato?
One possibility is EuGene (VIB Gent)
- performed well e.g. for Arabidopsis and Medicago
- available as soon as the test/training gene set is large enough
EuGene - overview
Plugins (information sources combined by EuGene):
Statistical contents: DNA Markov, AA Markov
Splice sites: NetGene2, SplicePredictor, SpliceMachine, GeneSplicer
Start sites: NetStart, SpliceMachine, ATRPred
Similarities: protein similarities, EST similarities, FL cDNA, exon conservation
Repeats
Workflow:
TRAINING: plugin training (needs one dataset)
OPTIM: optimize plugin combination (needs one dataset)
TEST: test (needs one new dataset)
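Since training, optimization and testing each need their own dataset, the curated gene set has to be divided into three disjoint parts. A minimal sketch of such a split follows; the 50/25/25 proportions, the seed and the function name are assumptions for illustration, not EuGene requirements.

```python
import random


def split_three_ways(gene_ids, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split gene IDs into disjoint training, optimization and test sets.

    The fractions are assumed values; the test set receives the remainder.
    """
    ids = sorted(set(gene_ids))        # deduplicate, fix a starting order
    random.Random(seed).shuffle(ids)   # reproducible shuffle
    n_train = int(len(ids) * fractions[0])
    n_optim = int(len(ids) * fractions[1])
    training = ids[:n_train]
    optim = ids[n_train:n_train + n_optim]
    test = ids[n_train + n_optim:]
    return training, optim, test


genes = [f"gene{i:03d}" for i in range(164)]
training, optim, test = split_three_ways(genes)
print(len(training), len(optim), len(test))  # 82 41 41
```

Keeping the three sets disjoint matters because a gene used to tune plugin weights would otherwise inflate the accuracy measured in the test step.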
EuGene
First round training:
- 500 high quality tomato genes
- statistical models on codon usage and splice sites of Arabidopsis will be used
Second round training:
- 2000 high quality tomato genes
- build a tomato-only version of EuGene
Approx. 150 BACs needed for first round training
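The BAC estimate can be checked from the figures given earlier in the talk: 164 high quality genes were found on 48 analysed BACs. The sketch below assumes that this gene density stays the same on future BACs; the function name is hypothetical.

```python
import math


def bacs_needed(target_genes: int, genes_found: int, bacs_analysed: int) -> int:
    """Estimate how many BACs yield target_genes at the observed gene density."""
    genes_per_bac = genes_found / bacs_analysed  # 164 / 48 is about 3.4
    return math.ceil(target_genes / genes_per_bac)


first_round = bacs_needed(500, genes_found=164, bacs_analysed=48)
print(first_round)       # 147, i.e. roughly 150 BACs
print(first_round - 48)  # 99, i.e. roughly another 100 BACs
```

The same function gives an estimate for the second round (2000 genes), with the caveat that gene density may differ between chromosomes and between BACs.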
Current state of sequenced BACs
Total number of BACs:
- unfinished: 71
- finished: 87
- available: 52
Summary
Ab initio gene finders are not yet calibrated to tomato
A test/training gene set is needed to calibrate the gene finders
We need another 100 BACs to get enough genes for a first round of EuGene training
GenomeThreader produces good alignments only with ESTs from SOL species (tomato, potato, tobacco)
More repeats will be detected (and will be included in the RepeatMasker library)
Acknowledgments
Automated annotation
MIPS: Heidrun Gundlach, Georg Haberer, Manuel Spannagl, Klaus F.X. Mayer
Manual annotation/curation/web site (Chromosome 4)
Imperial College: Daniel Buchan, James Abbot, Sarah Butcher, Gerard Bishop
Sequencing & assembly (Chromosome 4)
Sanger Institute: Christine Nicholson, Sean Humphray
MPIZ Köln: Heiko Schoof
EuGene
VIB Gent: Stephane Rombauts
GenomeThreader
University of Hamburg: Gordon Gremme, Stefan Kurtz, Volker Brendel