MAKER Annotation Process: Example of Glossina
Karyn Mégy, Dan Hughes
VectorBase http://www.vectorbase.org
Hinxton Developer Meeting February 2012
Annotation: aims and means
• Aims
  – Preliminary
  – Locus rather than exact position
• Means
  – Automatic annotation
    • By similarity
    • Ab initio
  – Manual annotation
    • By regions
    • By gene families
Annotation: similarity vs. ab initio
• Similarity
  – Similarity to known sequences
    -> only known genes
    -> based on available data (quantity, quality)
• Ab initio
  – Follow a gene “recipe”
    -> potentially identify new genes
    -> over-predictions
Ensembl annotation
Raw genome -> Masking (RepeatModeler repeats + known repeats/transposons) -> Masked genome
1. Community annotation
2. Protein, species specific
3. Transcriptome, species specific
4. Protein, ‘close’ specific
5. Ab initio
Ensembl annotation
• Similarity-focused
• Data-rich organisms
• Fiddly, time consuming
• Rhodnius prolixus experience
• In the meantime:
Heliconius annotation using MAKER
MAKER
http://www.yandell-lab.org/software/maker.html
Cantarel et al. Genome Res. 2008. PMID 18025269
Raw genome + DATA -> Annotated genome
Intermediate gene sets
Raw genome -> Masking (RepeatModeler repeats + known repeats/transposons) -> Masked genome
Raw data
- ESTs
  - from GenBank
  - cleaned and clustered/assembled with CAP3
  - 71,700 contigs
- Insecta/metazoa proteins
  - from UniProt
  - aligned to the genome with BLAST
  - 690,000 sequences (insecta)
  - 2,200,000 sequences (metazoa)
Intermediate gene sets
Raw genome -> Masking (RepeatModeler repeats + known repeats/transposons) -> Masked genome
Gene models
- RNAseq Illumina Yale
  - cleaned
  - aligned to the genome using Tophat/Bowtie
  - build ‘transfrag’ with Cufflinks
  - 78,000 ‘transfrag’ (on 4 sets -> overlaps)
- Augustus
  - generated by Martin Swain
  - trained with SOLiD data
  - 16,963 models – high quality
Intermediate gene sets
Raw genome -> Masking (RepeatModeler repeats + known repeats/transposons) -> Masked genome
Raw data
- ESTs – aligned to the genome
  - from GenBank – clustered with CAP3
  - 71,700 clusters
- Insecta/metazoa proteins (UniProt)
  - 690,000 sequences (insecta)
  - 2,200,000 sequences (metazoa)
Gene models
- RNAseq Illumina Yale – using Tophat/Cufflinks
  - 78,000 ‘transfrag’ (on 4 sets -> overlaps)
- Augustus – SOLiD data trained
  - 16,963 models – high QC
Ab initio
- SNAP – trained for Glossina (MAKER)
- Augustus – trained for Glossina (Martin Swain)
- GenScan
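The "(on 4 sets -> overlaps)" note means that transfrags built from separate RNA-seq sets can cover the same locus, so the raw count overstates the number of distinct regions. A minimal sketch of that idea, assuming features are plain (scaffold, start, end) tuples; the function name, the scaffold names and the coordinates are hypothetical and are not taken from the Glossina data:

```python
from collections import defaultdict


def merge_overlaps(features):
    """Collapse overlapping (scaffold, start, end) intervals into loci."""
    by_scaffold = defaultdict(list)
    for scaffold, start, end in features:
        by_scaffold[scaffold].append((start, end))

    loci = []
    for scaffold in sorted(by_scaffold):
        merged = []
        for start, end in sorted(by_scaffold[scaffold]):
            if merged and start <= merged[-1][1]:
                # Overlaps the previous interval: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        loci.extend((scaffold, s, e) for s, e in merged)
    return loci


# Hypothetical transfrags from different RNA-seq sets.
transfrags = [
    ("scaf1", 100, 500),
    ("scaf1", 450, 900),
    ("scaf1", 2000, 2400),
    ("scaf2", 10, 300),
    ("scaf2", 10, 300),
]
loci = merge_overlaps(transfrags)
print(len(transfrags), "transfrags ->", len(loci), "loci")
```

This only illustrates why redundant evidence counts shrink once overlaps are collapsed; it ignores strand and exon structure, which a real pipeline would take into account.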
MAKER
Raw genome -> Masking (RepeatModeler repeats + known repeats/transposons) -> Masked genome
Inputs: Raw data (ESTs, Proteins), Ab initio, Gene models
Legend: Provided as input / Run software within MAKER
MAKER – iterative process
• Round-1:
  – Align ESTs and Insecta proteins to the genome
  – Train SNAP (1): Drosophila HMM, ESTs and protein alignments, RNA-seq Illumina Yale, Augustus (SOLiD)
• Round-2:
  – Re-train SNAP (2): same as above but HMM = output of SNAP-1
• Round-3:
  – Re-train SNAP (3): same as above but HMM = output of SNAP-2
  – Align Metazoa proteins to the genome
  – Combine final gene set
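The three rounds above form a simple bootstrap: each round trains SNAP starting from the HMM produced by the previous round. A minimal sketch of that bookkeeping, assuming each round is described by a plain dictionary; the function name and the dictionary keys are hypothetical and are not MAKER options:

```python
def plan_rounds(n_rounds=3, initial_hmm="Drosophila HMM"):
    """Describe each training round and the SNAP HMM it starts from."""
    evidence = [
        "ESTs",
        "Insecta proteins",
        "RNA-seq Illumina Yale",
        "Augustus (SOLiD)",
    ]
    rounds = []
    hmm = initial_hmm
    for i in range(1, n_rounds + 1):
        step = {"round": i, "snap_hmm": hmm, "evidence": list(evidence)}
        if i == n_rounds:
            # The last round also aligns Metazoa proteins and combines the set.
            step["evidence"].append("Metazoa proteins")
            step["combine_final_gene_set"] = True
        rounds.append(step)
        # The next round starts from the HMM trained in this one.
        hmm = f"output of SNAP-{i}"
    return rounds


for step in plan_rounds():
    print(step["round"], step["snap_hmm"], step["evidence"])
```

The sketch only records which HMM and evidence each round uses; running MAKER and training SNAP themselves are outside its scope.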
Using MAKER for…
Heliconius
Tsetse fly
Salmon louse
Centipede
Augustus (SOLiD)
Martin Swain’s stats, July 22nd, 2011
• Glossina trained:
  -> ESTs only: 14,739 predictions, 9.8% with similarity to Gl. proteins (1,455 seq., 95% seq. identity)
  -> ESTs + SOLiD: 14,739 predictions, 9.9% with similarity to Gl. proteins (1,465 seq., 95% seq. identity)
  -> Glossina GenBank proteins: 2,754 protein sequences, 53% matching Augustus models
• Glossina un-trained:
  -> 8,581 predictions, 15% with similarity to Gl. proteins (1,299 seq., exact matches)
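The percentages quoted above can be recomputed from the counts given on the slide. A minimal sketch, using only those counts; the dictionary labels and the helper name are illustrative:

```python
# (matching predictions, total predictions), as listed on the slide.
stats = {
    "trained, ESTs only": (1455, 14739),
    "trained, ESTs + SOLiD": (1465, 14739),
    "un-trained": (1299, 8581),
}


def pct(matching, total):
    """Share of predictions with a protein match, as a percentage."""
    return round(100 * matching / total, 1)


for label, (matching, total) in stats.items():
    print(f"{label}: {pct(matching, total)}%")
```

The first value comes out marginally above the 9.8% on the slide, which suggests the slide figure was truncated rather than rounded.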
ESTs
• Total: 79,292 ESTs
• [1] Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes. Genome Biol. 2003. Lehane et al.
• [2] Differential expression of fat body genes in Glossina morsitans morsitans following infection with Trypanosoma brucei brucei. Int. J. Parasitol. 2008. Lehane et al.
• [3] Analysis of fat body transcriptome from the adult tsetse fly, Glossina morsitans morsitans. Insect Mol. Biol. 2006. Attardo et al.
• [4] Functional Characterisations of odorant binding proteins and chemosensory proteins in tsetse fly Glossina morsitans morsitans. Unpublished. 2009. …., Lehane, M., Hertz-Fowler, C., Berriman, M., …
• [5] Comprehensive analysis of the transcriptome of the Tsetse fly Glossina morsitans morsitans. Unpublished. 2009. Hertz-Fowler, C., Aslett, M.A. and Berriman, M.
ESTs submitted under: GenomeProject:9563
MAKER – final gene set
• Genes:
  – Final genes: 12,220
  – Raw data:
    • EST-based genes: 23,469
    • Protein-based genes: 416,959 (~417,000; redundancy)
  – Gene sets:
    • Illumina-Yale: 70,915 (redundancy)
    • Augustus (SOLiD): 16,155
  – Ab initio:
    • SNAP: 48,464
    • Augustus (MAKER): 14,413