A Bioinformatic Framework to Unravel the Secrets of the Tomato Genome

Rémy Bruggmann, MIPS/IBI, GSF

15 January 2006, PAG XIV, San Diego


Outline

Introduction

Data management

Annotation

Training/Test gene set

Summary

MIPS' Look at the Green Side of Life

– genome projects and database activities –

Arabidopsis thaliana

Arabidopsis lyrata *

Capsella rubella *

Maize

Rice

Medicago

Lotus

Solanum lycopersicum


Need to streamline and unify databases as well as analytical schemas and operating routines

Strong synergy and very robust

Risk of losing flexibility and "custom-tailored" attractiveness

Awareness that not every genome and every community "is just the same"

From Center-Centric Strategies to Distributed Approaches

Typically, genome projects undergo particular phases:

Sequenced BACs are annotated

Gene models are published to the community

This potentially generates competition rather than collaboration among groups

Consequences can be:

Underlying analytical procedures are not always tested, trained and evaluated

More or less pronounced differences exist between groups --> differing, contradictory and conflicting data

Aim of all groups:

An "information-enriched, high-quality genome backbone to address genome-scale biological questions"


An example ...

International Medicago Genome Annotation Group

Consists of groups participating in either the International or the European Medicago Genome Initiative annotation/bioinformatics programs

Agreement on common annotation standards, data exchange formats and naming conventions

Aims to produce and provide a unified, high-quality Medicago data set


Advantages of sharing efforts in genome annotation within a common annotation pipeline

Prevents:

(i) duplication of effort

(ii) conflicts resulting from different annotation "standards"

Ensures high-quality annotation standards

Ensures common (gene) naming and a common dataset

Integrates and profits from the knowledge and expertise of the individual groups

Data management

All data should be organized in a genome database

Wishlist for a modern genome db

Complete

Comprehensive

Up-to-date

Integrated

User interface

Application interface

State-of-the-art automatic analysis

Adaptable

Cross-genome comparison

…low cost, low manpower...

PlantsDB Philosophy

Plants Genome Resource: provides and integrates sequence data from European plant sequencing consortia along with publicly available data from the international initiative

PlantsDB communicates bioinformatic analysis data (visualization, genetic elements, structural data, ontologies, domains, ...; BLAST, browse and search, ...; comparative analysis)

Integration: provides a distributed network to integrate and retrieve data from heterogeneous resources using BioMOBY (connection to other plant DBs, PlaNET)

Preliminary Annotation Pipeline

Towards a preliminary annotation

[Figure: flow chart of the annotation pipeline. Recoverable stages and labels:]

Repeat detection: RepeatMasker with a repeat ontology and a repeat database; output: masked sequences and repeat annotation

External databases: EST DB (EST assemblies) and protein DB (e.g. SwissProt)

Gene prediction programs: GenomeThreader, FGenesH++/ProtMap, GeneMarkHMM

GAME XML: document of computational results (gene predictions)

Manual annotation in the Apollo Genome Viewer

PlantsDB

Web access: GBrowse

First Results

RepeatMasker

5.8 Mb analysed (48 BACs)

~6.7% repetitive elements (<0.2% to 23% per BAC)

~1 min/100 kb

Whole genome (euchromatic part): ~2 days

[Figure: repeat content [%] per BAC, axis from 0 to 25]

State: December 2005

Preliminary Results

Comparison of different gene finders

[Figure: ab initio predictions (FGeneSH, GeneMark) shown alongside EST/TC alignments]

FGeneSH++ and GeneMarkHMM currently often generate incomplete or wrong gene models

No matrices trained for tomato are available yet

Tomato-specific matrices are expected to improve prediction quality substantially

Collect annotated high-quality genes as a training/test set for EuGene, FGeneSH, GeneMarkHMM, ...

Training/Test Gene Set

How can we get a training/test set?

Map available tomato cDNAs/ESTs to the BACs (use only high-confidence matches)

Link experimental data to the gene models

Use this gene set for ab initio gene finder training
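The selection of high-confidence matches described above could be sketched as follows. This is only an illustration, not the pipeline's actual code: the `Alignment` record, its field names and the 0.95/0.97 thresholds are assumptions made for the example; the slides only state that high-confidence matches are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one spliced cDNA/EST-to-BAC alignment."""
    bac_id: str
    est_id: str
    coverage: float  # fraction of the cDNA/EST that is aligned, 0..1
    identity: float  # fraction of matching bases in the alignment, 0..1


def high_confidence(alignments, min_coverage=0.95, min_identity=0.97):
    """Keep alignments that pass both thresholds (threshold values are assumed)."""
    return [a for a in alignments
            if a.coverage >= min_coverage and a.identity >= min_identity]


def group_by_bac(alignments):
    """Group the retained alignments by BAC as candidate training evidence."""
    by_bac = {}
    for a in alignments:
        by_bac.setdefault(a.bac_id, []).append(a)
    return by_bac


if __name__ == "__main__":
    demo = [
        Alignment("BAC1", "EST1", 0.99, 0.99),
        Alignment("BAC1", "EST2", 0.80, 0.99),  # rejected: low coverage
        Alignment("BAC2", "EST3", 0.97, 0.98),
    ]
    for bac, hits in group_by_bac(high_confidence(demo)).items():
        print(bac, [h.est_id for h in hits])
```

Gene models supported by the retained alignments would then be the candidates for the training/test set.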

GenomeThreader

GenomeThreader is used for EST/cDNA mapping:

Similarity-based approach: ESTs/proteins are used to predict gene structure via optimal spliced alignments

Offers many options (full user control)

Incremental updates (avoid a lot of duplicated computation)

Improved GeneSeqer

GenomeThreader - calculations

DB                  Entries   Size [MB]   Calc time/100 kb   Whole genome
Tomato                32401          27   27 s               ~2.8 days (a)
MicroTom              26363          21   22 s               (a)
Potato                38239          34   23 s               (a)
Tobacco               28661          20   39 s               (a)
Arabidopsis cDNAs     31939          45   10 s               0.3 days
Dicots               404822         311   170 s              4.3 days
Rice CDS              15639          21   8 s                0.2 days
Uni_trembl Plants    185564          74   38 s               1.0 day
Uniprot_swissprot    181571          82   8 s                0.2 days
Nonred              1675230         662   437 s              11.1 days
Total               2834224        1433   14 min             22 days

(a) A single value, ~2.8 days, is given for Tomato, MicroTom, Potato and Tobacco together.

(single CPU, euchromatic part)
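The whole-genome column can be related to the per-100 kb timings with a simple scaling. The sketch below assumes a euchromatic size of about 220 Mb; that figure is not stated on the slide and is used here only because it roughly reproduces the listed values. The function and constant names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
ASSUMED_EUCHROMATIN_MB = 220  # assumption, not taken from the slides


def whole_genome_days(seconds_per_100kb, genome_size_mb=ASSUMED_EUCHROMATIN_MB):
    """Scale a per-100 kb runtime to a whole-genome runtime in days (single CPU)."""
    chunks = genome_size_mb * 1_000_000 / 100_000  # number of 100 kb units
    return seconds_per_100kb * chunks / SECONDS_PER_DAY


if __name__ == "__main__":
    # Per-100 kb timings taken from the table above.
    for name, secs in [("Arabidopsis cDNAs", 10), ("Dicots", 170), ("Nonred", 437)]:
        print(f"{name}: {whole_genome_days(secs):.1f} days")
    # Tomato + MicroTom + Potato + Tobacco: 27 + 22 + 23 + 39 s per 100 kb
    print(f"SOL sets combined: {whole_genome_days(27 + 22 + 23 + 39):.1f} days")
```

Under this assumption the sketch gives 0.3, 4.3 and 11.1 days for the three single databases and 2.8 days for the four SOL sets combined, which matches the table.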

Example

[Figure: alignment tracks labelled Tomato, MicroTom, Potato and Tobacco]

Examples - UK

[Figure: further examples]

Number of high-quality genes

Number of genes: 164 (covered completely by cDNAs/ESTs)

~3.4 genes/BAC (range: 0 to 9 genes/BAC)

These genes can be used to train gene finders

(Only very good alignments considered)

[Figure: number of genes per BAC; axes: BAC, # genes (0 to 10)]

Gene Finder

Which program can be trained for tomato?

One possibility is EuGene (VIB Gent)

- performed well, e.g. for Arabidopsis and Medicago

- available as soon as the test/training gene set is large enough

EuGene - overview

[Figure: EuGene plugin overview. Recoverable labels:]

Statistical contents: DNA Markov, AA Markov

Splice sites: NetGene2, SplicePredictor, SpliceMachine, GeneSplicer

Start sites: NetStart, SpliceMachine, ATRPred

Similarities: protein similarities, EST similarities, FL cDNA, exon conservation

Repeats

Plugins

TRAINING: plugin training (needs one dataset)

OPTIM: optimize plugin combination (needs one dataset)

TEST: test (needs one new dataset)
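Since each of the three steps needs a dataset, one way to prepare them from a single collection of high-quality genes is a reproducible three-way split. The sketch below assumes the three datasets should be disjoint; the 50/25/25 proportions, the function name and the gene identifiers are hypothetical and not taken from EuGene or the slides.

```python
import random


def split_three_ways(gene_ids, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split gene IDs into disjoint training, optimization and test sets."""
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three values summing to 1")
    ids = sorted(set(gene_ids))        # de-duplicate and fix the starting order
    random.Random(seed).shuffle(ids)   # seeded shuffle keeps the split reproducible
    n_train = int(len(ids) * fractions[0])
    n_optim = int(len(ids) * fractions[1])
    return {
        "training": ids[:n_train],
        "optim": ids[n_train:n_train + n_optim],
        "test": ids[n_train + n_optim:],  # remainder goes to the test set
    }


if __name__ == "__main__":
    genes = [f"gene{i:03d}" for i in range(164)]  # placeholder identifiers
    parts = split_three_ways(genes)
    print({name: len(members) for name, members in parts.items()})
```

Keeping the test set separate from the data used for plugin training and optimization avoids evaluating the gene finder on genes it has already seen.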

EuGene

First round of training:

- 500 high-quality tomato genes

- statistical models of codon usage and splice sites from Arabidopsis will be used

Second round of training:

- 2000 high-quality tomato genes

- build a tomato-only version of EuGene

Approx. 150 BACs needed for first-round training
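The BAC estimate follows from the gene density reported earlier (164 fully covered genes on 48 BACs). The sketch below assumes that this density also holds for BACs still to be sequenced, which is an assumption rather than a result; the function name is hypothetical.

```python
import math


def bacs_needed(target_genes, genes_found=164, bacs_analysed=48):
    """Estimate the number of BACs needed to collect a target number of genes."""
    genes_per_bac = genes_found / bacs_analysed  # ~3.4 genes per BAC
    return math.ceil(target_genes / genes_per_bac)


if __name__ == "__main__":
    first_round = bacs_needed(500)
    print(first_round)                 # 147, i.e. roughly 150 BACs
    print(first_round - 48)            # 99, i.e. about another 100 BACs
    print(bacs_needed(2000))           # 586 BACs for the second round
```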

Current state of sequenced BACs

Total number of BACs:

- unfinished: 71

- finished: 87

- available: 52

Summary

Ab initio gene finders are not yet calibrated for tomato

A test/training gene set is needed to calibrate the gene finders

We need another 100 BACs to get enough genes for a first round of EuGene training

GenomeThreader produces good alignments only with ESTs from SOL species (Tomato, Potato, Tobacco)

More repeats will be detected (and will be included in the RepeatMasker library)

Acknowledgments

Automated annotation: MIPS

Heidrun Gundlach, Georg Haberer, Manuel Spannagl, Klaus F.X. Mayer

Manual annotation/curation/web site (Chromosome 4): Imperial College

Daniel Buchan, James Abbot, Sarah Butcher, Gerard Bishop

Sequencing & assembly (Chromosome 4): Sanger Institute

Christine Nicholson, Sean Humphray

MPIZ Köln: Heiko Schoof

EuGene: VIB Gent

Stephane Rombauts

GenomeThreader: University of Hamburg

Gordon Gremme, Stefan Kurtz, Volker Brendel
