
Gene prediction in flies

● Background
● Gene prediction pipeline
● Resources

Background

Genome quality

Species           Contigs   Size / Mb   Coverage           Number of gaps   Nucleotides in gaps / Mb
D. simulans       17        142         ~3-fold+6x-fold1   23,640           15 (11)
D. sechellia      14,730    167         ~3-fold            6,695            9 (6)
D. yakuba         20        169         ~6-fold            13,502           6 (4)
D. erecta         5,124     153         ~12-fold           2,486            8 (5)
D. ananassae      13,749    231         ~8-fold            6,783            17 (7)
D. pseudoobscura  4,896     153         ~7-fold            8,729            7 (4)
D. persimilis     12,838    188         ~4-fold            13,975           13 (7)
D. willistoni     14,927    237         ~6-fold            5,716            12 (5)
D. virilis        13,530    206         ~9-fold            4,852            17 (8)
D. mojavensis     6,841     194         ~8-fold            5,033            14 (7)
D. grimshawi      17,440    200         ~8-fold            6,717            14 (7)

Genes in Drosophila melanogaster

● high gene density
● at least 20% with alternative transcripts
● can be nested
    on the same strand
    on different strands
● di-cistronic
● involve trans-splicing
    exons from a different strand

Gene prediction pipeline

● Gene prediction by homology
    no ab-initio predictions
    not using genomic alignments
● TBLASTN/Genewise process
    quick genome scan to find putative gene-containing regions
    aligning peptide sequence to genomic fragment using a gene model
    ● cds
    ● introns
    ● splice-sites

[Figure: pipeline flowchart. Box labels: Genome; Template transcripts; Representative transcripts; Alternative transcripts; Pre-processing; Genome scan; Regions; Gene prediction; Predicted representative transcripts; Predicted alternative transcripts; Gene assignment; Redundancy removal; Quality control; Gene predictions.]

Sensitivity – Selectivity – Speed

● Genome scan
    strict trade-off between sensitivity and memory/time
● Transcript prediction
    t = O(MN)
    ● N: length of peptide sequence = quite short
    ● M: length of DNA sequence = large
    you want to minimize
    ● the length of the genomic sequence to search
    ● the number of fragments you align
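To make the O(MN) cost concrete, here is a rough sketch that counts alignment matrix cells for a whole-genome search versus a single candidate region. The peptide length and region length are made-up illustration values; the 142 Mb genome size is the D. simulans figure from the genome quality table.

```python
def dp_cells(peptide_len: int, dna_len: int) -> int:
    """Number of matrix cells filled by an O(MN) peptide-to-DNA alignment."""
    return peptide_len * dna_len


peptide_len = 500          # hypothetical peptide of 500 residues (N)
genome_len = 142_000_000   # about 142 Mb, D. simulans assembly size (M)
region_len = 20_000        # hypothetical 20 kb region found by the genome scan

whole = dp_cells(peptide_len, genome_len)
region = dp_cells(peptide_len, region_len)

print(f"whole genome: {whole:.2e} cells")
print(f"one region:   {region:.2e} cells")
print(f"reduction:    {whole // region}x")
```

The constant factor of a real Genewise run is not modelled here; the point is only that shrinking M and the number of aligned fragments dominates the run time.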

Solutions

● ENSEMBL: Minigenes
    cut out putative introns
● My pipeline:
    priority lists
    gene structure conservation

Difficulties

● Terminal exons
    short and thus alignment signal is weak
● Spindly genes
    there is no length penalty on introns

Concepts

● Predict in three passes
    1) Predict clear cut cases
    2) Predict dubious cases only if they don't overlap with a previous prediction
    3) Predict alternative transcripts
● Iteratively search for duplications
● Accept a prediction with conserved exon boundaries
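The first two passes can be sketched as follows. This is a minimal illustration, not the pipeline's code: the `Candidate` class, the `clear_cut` flag, and the simple interval overlap test are all assumptions, and pass 3 (alternative transcripts) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    contig: str
    start: int        # 0-based, inclusive
    end: int          # exclusive
    clear_cut: bool   # hypothetical flag: True for unambiguous cases


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if both candidates share at least one base on the same contig."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def predict_in_passes(candidates):
    accepted = []
    # Pass 1: accept every clear-cut case.
    for c in candidates:
        if c.clear_cut:
            accepted.append(c)
    # Pass 2: accept a dubious case only if it overlaps nothing accepted so far.
    for c in candidates:
        if not c.clear_cut and not any(overlaps(c, a) for a in accepted):
            accepted.append(c)
    # Pass 3 (alternative transcripts) is omitted from this sketch.
    return accepted


candidates = [
    Candidate("a", "chr2L", 100, 500, clear_cut=True),
    Candidate("b", "chr2L", 400, 900, clear_cut=False),    # overlaps "a"
    Candidate("c", "chr2L", 1000, 1500, clear_cut=False),
]
print([c.name for c in predict_in_passes(candidates)])
```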

Conservation of gene structure

[Figure: five query/prediction pairs illustrating the gene structure classes: conserved, partially conserved, single exon, retrotransposed, unconserved. Exon boundaries of query and prediction are mapped onto the query protein.]
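One way to turn these five classes into code is to compare exon boundary positions once both sets are mapped onto the query protein. The rules below are assumptions made for illustration (the slides do not define them): boundaries are residue positions, a match allows `tolerance` residues of slack, and a multi-exon query with an intronless prediction is called retrotransposed.

```python
def classify_structure(query_boundaries, predicted_boundaries, tolerance=0):
    """Classify a prediction by comparing exon boundaries on the query protein."""
    q = sorted(query_boundaries)
    p = sorted(predicted_boundaries)

    if not q and not p:
        return "single exon"        # neither query nor prediction has introns
    if q and not p:
        return "retrotransposed"    # introns of the query are all missing

    def has_match(pos, others):
        return any(abs(pos - other) <= tolerance for other in others)

    matched_q = sum(has_match(b, p) for b in q)
    matched_p = sum(has_match(b, q) for b in p)

    if matched_q == len(q) and matched_p == len(p):
        return "conserved"
    if matched_q > 0:
        return "partially conserved"
    return "unconserved"


print(classify_structure([120, 250], [120, 250]))
print(classify_structure([120, 250], [120, 300]))
print(classify_structure([120, 250], []))
```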

Quality control

● Classify predictions into categories
    Full length or fragment
    Gene or pseudogene
    Conserved or not conserved gene structure
● Heuristically remove predictions
    that are redundant
    that are in conflict
    ● nested genes
    ● good predictions take precedence over bad predictions
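The "good predictions take precedence" rule could look like the sketch below: sort by category, then keep a prediction only if it does not collide with a better one already kept. The ordering of the three categories and the plain overlap test are assumptions; nested genes, which the real heuristics must allow for, are not handled here.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    name: str
    start: int         # 0-based, inclusive
    end: int           # exclusive
    full_length: bool
    is_gene: bool      # False for pseudogenes
    conserved: bool    # conserved gene structure


def rank(p: Prediction):
    # False sorts before True, so "good" properties are negated.
    # The order gene > full length > conserved is a hypothetical choice.
    return (not p.is_gene, not p.full_length, not p.conserved)


def remove_conflicts(predictions):
    kept = []
    for p in sorted(predictions, key=rank):
        if all(p.end <= k.start or k.end <= p.start for k in kept):
            kept.append(p)
    return kept


predictions = [
    Prediction("frag", 100, 400, full_length=False, is_gene=True, conserved=False),
    Prediction("full", 150, 600, full_length=True, is_gene=True, conserved=True),
    Prediction("pseudo", 700, 900, full_length=True, is_gene=False, conserved=False),
]
print([p.name for p in remove_conflicts(predictions)])
```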

Results

● http://wwwfgu.anat.ox.ac.uk:8080/cgi-bin/gbrowse

Number of predicted genes

[Figure: bar chart of the number of predicted genes per genome (dmel, dsim, dsec, dyak, dere, dana, dpse, dper, dwil, dvir, dmoj, dgri). X axis: Genome. Y axis: Genes, 0 to 16,000. Legend: genes (conserved, semi-conserved, single exon) and pseudogenes.]

Orthology assignments

[Figure: bar chart of genes in D. melanogaster with orthologs in each genome (dsim, dsec, dyak, dere, dana, dpse, dper, dwil, dvir, dmoj, dgri). X axis: Genome. Y axis: Number of D. melanogaster genes, 0 to 14,000.]

Technical details

● Hardware:
    28 dual CPU nodes with 2 GB memory
    Sun Grid Engine (SGE)
● Pipeline logic
    gmake
● Tasks
    Python scripts (and Perl scripts)
    Bash/awk scripts
● Database
    Postgres

Downstream analysis

● Pairwise orthology assignment
    PhyOP Pipeline (Leo Goodstadt (2006))
● Multiple orthology assignment
    My own concoction based on graph clustering with some consistency criteria
● Multiple alignment of cds
    Dialign (<50 sequences)
    Muscle (<500 sequences)
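As a rough idea of what graph clustering of pairwise orthologs can mean, the sketch below groups genes into connected components with a union-find structure and then applies one simple consistency check. Both the clustering method and the check (at most one gene per species) are stand-ins chosen for illustration; the actual criteria are not given on the slide. Gene names use a hypothetical `species|gene` format.

```python
from collections import defaultdict


def cluster_orthologs(pairs):
    """Group genes into connected components of the pairwise ortholog graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


def is_one_to_one(cluster):
    """Hypothetical consistency check: no species appears twice."""
    species = [gene.split("|")[0] for gene in cluster]
    return len(species) == len(set(species))


pairs = [
    ("dmel|g1", "dsim|g1"), ("dsim|g1", "dyak|g1"),
    ("dmel|g2", "dsim|g2"), ("dmel|g2", "dsim|g3"),
]
for cluster in cluster_orthologs(pairs):
    print(cluster, "consistent" if is_one_to_one(cluster) else "needs review")
```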

Phylogenetic analysis

● 14,000 Gblocks-cleaned multiple alignments
● Calculation of ka and ks with PAML
● Phylogenetic trees
    Genome trees
    Gene trees built with Fitch/Kitsch

Odds and bits

● Mapping of PDB -> UniProt -> dmel proteins
● Mapping of InterPro domains onto predictions
    not up-to-date
● Codon bias analysis
    ENC, CAI, information theoretic measures
    GC3, GC3_4D
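Of the measures listed, GC3 is the simplest to show: the fraction of G or C at third codon positions. The sketch below uses the plain definition over every codon of an in-frame coding sequence; variants that skip stop codons or restrict to fourfold degenerate sites (presumably what GC3_4D refers to) are not covered.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C at third codon positions of an in-frame CDS."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("CDS must be non-empty with a length divisible by 3")
    third_positions = cds[2::3]   # every third base, starting at codon position 3
    return sum(base in "GC" for base in third_positions) / len(third_positions)


# Codons: ATG GCC AAA TAG, third positions: G, C, A, G
print(gc3("ATGGCCAAATAG"))
```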

Comparison of measures

[Figure: comparison of codon bias measures. Panel labels: Experimental CAI; Computational CAI; Ribosomal CAI; ENC; GC3; Encoding | bias; Encoding | unbiased; Encoding | uniform.]

Other groups

● see http://rana.lbl.gov/drosophila/wiki/index.php/Main_Page

● Gene predictions by others
    Don Gilbert: SNAP
    Lior Pachter: GeneMapper (genomic alignments)
    Eisen Lab: TBLastN + Genewise/Exonerate, GeneMapper
    Batzoglou Lab: CONTRAST
    Brent Lab: N-Scan
    Guigo: geneid and SGP2

http://insects.eugenes.org/species/news/genome-summaries/genepredictions.html

Consensus predictions

● GBrowse comparison of all gene predictions
    http://rana.lbl.gov/drosophila/gbrowse/cgi-bin/gbrowse
● Mike Eisen's group: GLEAN consensus set
● Don Gilbert: http://insects.eugenes.org/species/
● Other resources
    tRNA predictions
    genome alignments