Annotation of Drosophila primer GEP Workshop – August 2015 Wilson Leung and Chris Shaffer


Annotation of Drosophila primer

GEP Workshop – August 2015

Wilson Leung and Chris Shaffer

Outline

Overview of the GEP annotation project
  - Annotation goals
  - Nomenclature

GEP annotation strategy
  - Using genomics databases
  - Interpreting RNA-Seq data
  - Understanding the phase of splice sites

“Annotation of a Drosophila gene” walkthrough


AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGAACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC

Legend: Start codon, Coding region, Stop codon, Splice donor, Splice acceptor, UTR

Annotation: Adding labels to a sequence

Genes: Novel or known genes, pseudogenes

Regulatory elements: Promoters, enhancers, silencers

Non-coding RNA: tRNAs, miRNAs, siRNAs, snoRNAs

Repeats: Transposable elements, simple repeats

Structural: Origins of replication, synteny

Experimental results:
  - DNase I Hypersensitive sites (DHS)
  - ChIP-chip and ChIP-Seq datasets (e.g., modENCODE)

GEP Drosophila annotation projects

[Figure: phylogenetic tree of Drosophila species. Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project.]

Species shown, grouped as in the tree:
  - D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai
  - D. ananassae, D. bipectinata
  - D. pseudoobscura, D. persimilis, D. willistoni
  - D. mojavensis, D. virilis, D. grimshawi

Figure legend:
  - Reference
  - Published
  - Annotation projects for Fall 2015 / Spring 2016
  - Species in the Four Genomes Paper
  - New species sequenced by modENCODE
  - Manuscript in progress

Muller element nomenclature

                    Muller elements
Species             A        B    C    D    E    F
D. simulans         X        2L   2R   3L   3R   4
D. sechellia        X        2L   2R   3L   3R   4
D. melanogaster     X        2L   2R   3L   3R   4
D. yakuba           X        2L   2R   3L   3R   4
D. erecta           X        2L   2R   3L   3R   4
D. ananassae        XL, XR   3R   3L   2R   2L   4L, 4R
D. pseudoobscura    XL       4    3    XR   2    5
D. persimilis       XL       4    3    XR   2    5
D. willistoni       XL       2R   2L   XR   3
D. mojavensis       X        3    5    4    2    6
D. virilis          X        4    5    3    2    6
D. grimshawi        X        3    2    5    4    6
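
The Muller element table can be treated as a lookup from a species' chromosome arm to a shared element letter. The sketch below is illustrative only: the names `MULLER_ARMS`, `muller_element`, and `equivalent_arm` are made up for this example, and only three species from the table are transcribed to keep it short.

```python
# Chromosome arm -> Muller element, transcribed from the table above.
# Hypothetical helper names; only three species are included for brevity.
MULLER_ARMS = {
    "D. melanogaster": {"X": "A", "2L": "B", "2R": "C", "3L": "D", "3R": "E", "4": "F"},
    "D. pseudoobscura": {"XL": "A", "4": "B", "3": "C", "XR": "D", "2": "E", "5": "F"},
    "D. virilis": {"X": "A", "4": "B", "5": "C", "3": "D", "2": "E", "6": "F"},
}


def muller_element(species, arm):
    """Return the Muller element (A-F) for a chromosome arm in a species."""
    return MULLER_ARMS[species][arm]


def equivalent_arm(species_from, arm, species_to):
    """Find the arm in species_to that carries the same Muller element."""
    element = muller_element(species_from, arm)
    for candidate, candidate_element in MULLER_ARMS[species_to].items():
        if candidate_element == element:
            return candidate
    raise KeyError(f"{species_to} has no arm recorded for element {element}")


# Arm 3L of D. melanogaster is Muller element D, which is XR in D. pseudoobscura.
print(equivalent_arm("D. melanogaster", "3L", "D. pseudoobscura"))
```

Keying on the element letter rather than the arm name is what makes cross-species comparisons possible, since the same arm name (for example "4") refers to different elements in different species.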

Gene structure nomenclature

[Figure: gene structure diagram. Labels: Gene span; Primary mRNA; Protein; Exons / Exon; UTR’s / UTR; CDS’s / CDS]


GEP annotation strategy

Technique is optimized for projects with a moderately close, well annotated neighbor species

Example: D. melanogaster

Need to apply different strategies when annotating genes in other species

Examples: corn, parrot

GEP annotation goals

Identify and annotate all the genes in your project

For each gene, identify and precisely map (accurate to the base pair) all coding exons (CDS)
  - Do this for ALL isoforms
  - Annotate the initial transcribed exon and transcription start site (TSS)

Optional curriculum not submitted to GEP
  - Clustal analysis (protein, promoter regions)
  - Repeats analysis
  - Synteny analysis

Evidence-based annotation

Human-curated analysis
  - More accurate than standard ab initio and evidence-based gene finders

Goal: collect, analyze, and synthesize all the available evidence to create the best-supported gene model

Example: 4591-4688, 5157-5490, 5747-6001
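
A gene model written in this coordinate format is easy to work with programmatically. The following is a minimal sketch, assuming the coordinates are 1-based and inclusive (an assumption made for this example); the function names are hypothetical.

```python
def parse_cds_coordinates(text):
    """Parse 'start-end, start-end, ...' into a list of (start, end) tuples."""
    cds_list = []
    for part in text.split(","):
        start, end = part.strip().split("-")
        cds_list.append((int(start), int(end)))
    return cds_list


def cds_lengths(cds_list):
    """Length of each CDS, assuming 1-based inclusive coordinates."""
    return [end - start + 1 for start, end in cds_list]


model = parse_cds_coordinates("4591-4688, 5157-5490, 5747-6001")
lengths = cds_lengths(model)
print(lengths, sum(lengths))
```

For the example above, the three CDS are 98, 334, and 255 bases long. The total of 687 bases is a multiple of three, which is what a model made of whole codons should give.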

Collect, analyze, and synthesize

Collect:
  - Genome Browser
  - Conservation (BLAST searches)

Analyze:
  - Interpreting Genome Browser evidence tracks
  - Interpreting BLAST results

Synthesize:
  - Construct the best-supported gene model based on potentially contradictory evidence

Basic annotation workflow

1. Identify the likely ortholog in D. melanogaster

2. Observe the gene structure of the ortholog

3. Map each CDS of the ortholog to the project sequence
  - Use BLASTX to identify the conserved region
  - Note the position and reading frame

4. Use these data to construct a gene model
  - Use TopHat splice junctions to identify splice site boundaries, or a combination of conservation, splice signals, and reading frame
  - Identify the exact start and stop base position for each CDS

5. Use the Gene Model Checker to verify the gene model

6. For each additional isoform, repeat steps 2-5

Annotation workflow (graphically)

[Figure: workflow diagram. A feature on the contig is used as the query in a BLASTP search of the feature against the D. melanogaster proteins database. Each CDS of the D. melanogaster gene model (1 isoform, 5 CDS) is then used in a BLASTX search against the contig, which gives alignments in reading frames 1, 2, and 3.]

Annotation workflow (graphically)

  - BLASTX search of each D. melanogaster CDS against the contig
  - Identify the exact coordinates of each CDS using the Genome Browser
  - Use the Gene Model Checker to verify the final CDS coordinates

[Figure: contig with BLASTX alignments in reading frames 1, 2, and 3. The start codon (M) and the GT and AG splice sites mark the CDS boundaries of the gene model at positions 1245, 1383, 1437, 1678, 1740, 2081, 2159, 2337, 2397, and 2511.]

Coordinates: 1245-1383, 1437-1678, 1740-2081, 2159-2337, 2397-2511

Four main web sites used by the GEP annotation strategy

1. GEP UCSC Genome Browser (http://gander.wustl.edu)

2. FlyBase (http://flybase.org)
  - Tools → Genomic/Map Tools → BLAST

3. Gene Record Finder (http://gep.wustl.edu)
  - Projects → Annotation Resources

4. NCBI BLAST (http://blast.ncbi.nlm.nih.gov)
  - BLASTX → select the checkbox:

UCSC Genome Browser

Provides a graphical view of a genomic region
  - Sequence conservation
  - Gene and splice site predictions
  - RNA-Seq data and splice junction predictions

BLAT – BLAST-Like Alignment Tool
  - Map protein or nucleotide sequence against an assembly
  - Faster but less sensitive than BLAST

Table Browser
  - Access raw data used to create the graphical browser

Two different versions of the UCSC Genome Browser

Official UCSC Version: http://genome.ucsc.edu
  - Published data, lots of species, whole genomes, used for “Chimp Chunks”

GEP Version: http://gander.wustl.edu
  - GEP data, parts of genomes, used for annotation of Drosophila species

Additional training resources

Training section on the UCSC web site: http://genome.ucsc.edu/training/index.html
  - Training videos, user guides and tutorials
  - Mailing lists

OpenHelix tutorials and training materials: http://www.openhelix.com/ucsc
  - Pre-recorded tutorials
  - Reference cards

GEP UCSC Genome Browser overview

[Figure: GEP UCSC Genome Browser screenshot with the genomic sequence and the evidence tracks labeled]

Control how evidence tracks are displayed on the Genome Browser

Five different display modes:
  - Hide: track is hidden
  - Dense: all features appear on a single line
  - Squish: overlapping features appear on separate lines; features are half the height compared to full mode
  - Pack: overlapping features appear on separate lines; features are the same height as full mode
  - Full: each feature is displayed on its own line

Set “Base Position” track to “Full” to see the amino acid translations

Some evidence tracks (e.g., RepeatMasker) only have a subset of these display modes


FlyBase – Database for the Drosophila research community

Lots of ancillary data for each gene in D. melanogaster

Curation of literature for each gene

Reference D. melanogaster annotations for all other databases

Including NCBI, EBI, and DDBJ

Fast release cycle (6-8 releases per year)

Be aware of different annotation releases

D. melanogaster Release 6 genome assembly
  - First change of the assembly since late 2006
  - Most modENCODE analysis used the Release 5 assembly

Gene annotations change much more frequently

Use FlyBase as the canonical reference

GEP data freeze:
  - Update GEP materials before the start of the semester
  - Potential discrepancies in exercise screenshots
  - Minor differences in search results
  - Let us know about major errors or discrepancies

Gene Record Finder – Observe the structure of D. melanogaster genes

Retrieve CDS and exon sequences for each gene in D. melanogaster

CDS and exon usage maps for each isoform

List of unique CDS

Optimal for the exon-by-exon annotation strategy

Nomenclature for Drosophila genes

Drosophila gene names are case-sensitive
  - Lowercase initial letter = recessive mutant phenotype
  - Uppercase initial letter = dominant mutant phenotype

Every D. melanogaster gene has an annotation symbol
  - Begins with the prefix CG (Computed Gene)

Some genes have a different gene symbol (e.g., mav)

Suffix after the gene symbol denotes different isoforms
  - mRNA = -R; protein = -P
  - mav-RA = Transcript for the A isoform of mav
  - mav-PA = Protein product for the A isoform of mav
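
The isoform suffix convention is regular enough to parse with a short pattern. This is a sketch only: `parse_isoform_name` is a hypothetical helper, and the pattern assumes the suffix is always `-R` or `-P` followed by one or more uppercase isoform letters.

```python
import re

# Assumed layout: gene symbol, "-", R (transcript) or P (protein), isoform letters.
ISOFORM_PATTERN = re.compile(r"^(?P<gene>.+)-(?P<kind>[RP])(?P<isoform>[A-Z]+)$")


def parse_isoform_name(name):
    """Split a name such as 'mav-RA' into (gene symbol, product type, isoform)."""
    match = ISOFORM_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an isoform name: {name!r}")
    kind = "transcript" if match.group("kind") == "R" else "protein"
    return match.group("gene"), kind, match.group("isoform")


print(parse_isoform_name("mav-RA"))
print(parse_isoform_name("mav-PA"))
```

The gene symbol is matched greedily so that symbols containing a hyphen are still split at the final suffix.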

NCBI – comprehensive database for biomedical and genomics information

One of the most comprehensive genomics databases
  - Quality of GenBank records will vary

PubMed for literature searches

BLAST web service
  - BLAST search against RefSeq, nr/nt databases
  - Align two (or more) sequences (bl2seq)

Common BLAST programs

Except for BLASTN, all alignments are based on comparisons of protein sequences

Alignment coordinates are relative to the original sequences

Decide which BLAST program to use based on the type of query and subject sequences:

Program    Query                   Database (Subject)
BLASTN     Nucleotide              Nucleotide
BLASTP     Protein                 Protein
BLASTX     Nucleotide -> Protein   Protein
TBLASTN    Protein                 Nucleotide -> Protein
TBLASTX    Nucleotide -> Protein   Nucleotide -> Protein
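
The table reduces to a small decision rule. As an illustration, the hypothetical function below encodes it; the `translate_nucleotides` flag is an assumption added for this sketch to distinguish BLASTN from TBLASTX when both sequences are nucleotide.

```python
def choose_blast_program(query_type, subject_type, translate_nucleotides=False):
    """Pick a BLAST program from the query and subject sequence types.

    Types are 'nucleotide' or 'protein'. For nucleotide vs. nucleotide,
    translate_nucleotides=True selects the translated comparison (TBLASTX).
    """
    query_type = query_type.lower()
    subject_type = subject_type.lower()
    valid = {"nucleotide", "protein"}
    if query_type not in valid or subject_type not in valid:
        raise ValueError("sequence types must be 'nucleotide' or 'protein'")
    if query_type == "protein" and subject_type == "protein":
        return "BLASTP"
    if query_type == "nucleotide" and subject_type == "protein":
        return "BLASTX"
    if query_type == "protein" and subject_type == "nucleotide":
        return "TBLASTN"
    return "TBLASTX" if translate_nucleotides else "BLASTN"


# A contig (nucleotide) searched against a protein database calls for BLASTX.
print(choose_blast_program("nucleotide", "protein"))
```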

Where can I run BLAST?

NCBI BLAST web service: http://blast.ncbi.nlm.nih.gov/Blast.cgi

EBI BLAST web service: http://www.ebi.ac.uk/Tools/sss/

FlyBase BLAST (Drosophila and other insects): http://flybase.org/blast/

Data on the Genome Browser is incomplete

The Genome Browser contains many different evidence tracks:
  - Sequence similarity to D. melanogaster proteins
  - RNA-Seq from different developmental stages
  - Gene and splice site predictions
  - Sequence alignments of multiple Drosophila species

Annotators still need to identify the ortholog

WARNING! Students often over-interpret the BLASTX alignment track (D. mel Proteins); use with caution

Identify ortholog based on BLASTP search of the feature against D. melanogaster proteins

Most genes (~90%) remain on the same Muller element across the different Drosophila species

Large increase in E-value from mav-PA to gbb-PB


Basic biological constraints (inviolate rules*)

Coding regions start with a methionine

Coding regions end with a stop codon

Gene should be on only one strand of DNA

Exons appear in order along the DNA (collinear)

Intron sequences should be at least 40 bp

Intron starts with a GT (or rarely GC)

Intron ends with an AG

* There are known exceptions to each rule
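
Several of these constraints can be checked mechanically once a gene model and its sequence are in hand. The sketch below is not the Gene Model Checker; it is a hypothetical helper that assumes a plus-strand model, 1-based inclusive coordinates, and a stop codon included as the last three bases of the final CDS. Because each rule has known exceptions, it reports warnings instead of rejecting a model.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_INTRON_LENGTH = 40


def check_gene_model(sequence, cds_list):
    """Return a list of warnings for a plus-strand gene model.

    cds_list holds 1-based inclusive (start, end) tuples in order; the stop
    codon is assumed to be the last three bases of the final CDS.
    """
    warnings = []
    seq = sequence.upper()
    first_start = cds_list[0][0]
    last_end = cds_list[-1][1]
    if seq[first_start - 1:first_start + 2] != "ATG":
        warnings.append("coding region does not start with ATG")
    if seq[last_end - 3:last_end] not in STOP_CODONS:
        warnings.append("coding region does not end with a stop codon")
    for (_, donor_end), (acceptor_start, _) in zip(cds_list, cds_list[1:]):
        if acceptor_start <= donor_end:
            warnings.append(f"CDS after {donor_end} is not collinear")
            continue
        intron = seq[donor_end:acceptor_start - 1]
        if len(intron) < MIN_INTRON_LENGTH:
            warnings.append(f"intron after {donor_end} is shorter than {MIN_INTRON_LENGTH} bp")
        if intron[:2] not in ("GT", "GC"):
            warnings.append(f"intron after {donor_end} does not start with GT or GC")
        if intron[-2:] != "AG":
            warnings.append(f"intron after {donor_end} does not end with AG")
    return warnings


# Toy sequence built for this example: two CDS separated by a 44 bp intron.
toy_sequence = "ATGCCAAATG" + "GT" + "A" * 40 + "AG" + "TTCTCGATTAA"
toy_model = [(1, 10), (55, 65)]
print(check_gene_model(toy_sequence, toy_model))
```

The toy model passes every check, so the printed list is empty. Moving the second CDS start to position 30 would leave a 19 bp intron that does not end in AG, and both problems would be reported.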

Evidence for gene models (in general order of importance)

1. Expression data
  - RNA-Seq, EST, cDNA

2. Conservation
  - Sequence similarity to genes in D. melanogaster
  - Sequence similarity to other Drosophila species (Multiz)

3. Computational predictions
  - Gene and splice site predictions

4. Tie-breakers of last resort
  - See the “Annotation Instruction Sheet”

modENCODE RNA-Seq data

RNA-Seq evidence tracks:
  - RNA-Seq coverage (read depth)
  - TopHat splice junction predictions
  - Assembled transcripts (Cufflinks, Oases)

Positive results very helpful

Negative results less informative
  - Lack of transcription ≠ no gene

GEP curriculum:
  - RNA-Seq Primer
  - Browser-Based Annotation and RNA-Seq Data

Generating RNA-Seq data (Illumina)

[Figure: processed mRNA (5’ cap, poly-A tail) is broken into RNA fragments (~250 bp); adapters are added to build the library; paired-end sequencing then produces forward and reverse RNA-Seq reads of ~125 bp each]

Wang Z et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.

Use the TopHat splice junction predictions to identify splice sites

[Figure: RNA-Seq reads from the processed mRNA (5’ cap, start codon M, stop codon *, poly-A tail) are mapped back to the contig; the TopHat junctions span the introns]

Use BLASTX to map D. melanogaster CDS onto the contig

Use sequence similarity to infer homology
  - D. melanogaster is very well annotated
  - Coding sequences evolve slowly
  - Exon structure changes very slowly

Change two settings in the “Algorithm parameters” section when using bl2seq

Turn off compositional adjustments

Turn off the low complexity filter

Strategies for finding small CDS

Examine RNA-Seq coverage and TopHat junctions
  - Small CDS is typically part of a larger transcribed exon

Use Query subrange to restrict the search region

Increase the Expect threshold and try again
  - Keep increasing the Expect threshold until you get matches
  - Also try decreasing the word size

Use the Small Exon Finder
  - Minimize changes in CDS size
  - Available under Projects → Annotation Resources

See the “Annotation Strategy Guide” for details


A genomic sequence has 6 different reading frames

Frame: Base to begin translation relative to the first base of the sequence

[Figure: translation of a genomic sequence in frames 1, 2, and 3]
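
The six reading frames can be generated directly from a sequence: three offsets on the given strand and three on its reverse complement. The sketch below uses the standard genetic code. The frame labels (`+1` to `+3`, `-1` to `-3`) and the choice to number minus-strand frames from the start of the reverse complement are conventions assumed for this example, and tools differ on the latter.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (N is left as N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Translations for frames +1..+3 and -1..-3, keyed by frame label."""
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for frame, protein in six_frame_translations("ATGCCAAATTAA").items():
    print(frame, protein)
```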


Splice donor and acceptor phases

Phase: Number of bases between the complete codon and the splice site

Donor phase: Number of bases between the end of the last complete codon and the splice donor site (GT/GC)

Acceptor phase: Number of bases between the splice acceptor site (AG) and the start of the first complete codon

Phase is relative to the reading frame of the CDS
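
Given these definitions, the phases at each intron follow from the number of coding bases upstream of it. The sketch below is an illustration under stated assumptions: `splice_phases` is a hypothetical helper, coordinates are 1-based and inclusive, and the first CDS is assumed to begin at the first base of the start codon. The example coordinates are the five-CDS gene model shown in the graphical workflow earlier in this primer.

```python
def splice_phases(cds_list):
    """Return (donor_phase, acceptor_phase) for each intron of a gene model.

    cds_list holds 1-based inclusive (start, end) tuples in transcript order.
    The acceptor phase returned is the one needed to complete the codon
    left unfinished at the donor site.
    """
    phases = []
    coding_bases = 0
    for start, end in cds_list[:-1]:
        coding_bases += end - start + 1
        donor_phase = coding_bases % 3          # bases after the last complete codon
        acceptor_phase = (3 - donor_phase) % 3  # bases needed to finish that codon
        phases.append((donor_phase, acceptor_phase))
    return phases


model = [(1245, 1383), (1437, 1678), (1740, 2081), (2159, 2337), (2397, 2511)]
print(splice_phases(model))
```

For this model the four introns have donor/acceptor phases of 1/2, 0/0, 0/0, and 2/1.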

Phase depends on the reading frame

Phase of acceptor site:
  - Phase 2 relative to frame +1
  - Phase 0 relative to frame +2
  - Phase 1 relative to frame +3

[Figure: splice acceptor site shown against the three forward reading frames]


Phase of the donor and acceptor sites must be compatible

Extra nucleotides from donor and acceptor phases will form an additional codon

Donor phase + acceptor phase = 0 or 3

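Example: the first CDS ends with CCA AAT G (phase 1 donor), followed by the intron (GT … AG); the next CDS begins with TT CTC GAT (phase 2 acceptor)

Spliced sequence: CCA AAT GTT CTC GAT

Translation: P N V L D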

Incompatible donor and acceptor phases result in a frame shift

Phase 0 donor is incompatible with phase 2 acceptor; use prior GT, which is a phase 1 donor.

Example: splicing after CCA AAT GGT (phase 0 donor) onto TT CTC GAT (phase 2 acceptor) gives CCA AAT GGT TTC TCG AT

Translation: P N G F S (frame shift)
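
The effect of donor choice on the translation can be reproduced in a few lines. This sketch joins two exon fragments and translates the result using the standard genetic code; the function names are hypothetical, and the fragments are the short sequences from the splice phase examples in this primer.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate complete codons only, starting at the first base."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def phases_compatible(donor_phase, acceptor_phase):
    """Donor phase + acceptor phase must equal 0 or 3."""
    return (donor_phase + acceptor_phase) in (0, 3)


def splice_and_translate(upstream_cds, downstream_cds):
    """Join two CDS fragments (intron removed) and translate the result."""
    return translate(upstream_cds + downstream_cds)


DOWNSTREAM = "TTCTCGAT"  # TT completes a codon: phase 2 acceptor

compatible = splice_and_translate("CCAAATG", DOWNSTREAM)       # phase 1 donor
frame_shifted = splice_and_translate("CCAAATGGT", DOWNSTREAM)  # phase 0 donor
print(compatible, phases_compatible(1, 2))
print(frame_shifted, phases_compatible(0, 2))
```

Using the earlier GT (phase 1 donor) gives P N V L D, while the later GT (phase 0 donor) shifts the downstream frame and gives P N G F S.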


Verify the final gene model using the Gene Model Checker

Examine the checklist and explain any errors or warnings in the GEP Annotation Report

View your gene model in the context of the other evidence tracks on the Genome Browser

Examine the dot plot and explain any discrepancies in the GEP Annotation Report

Look for large vertical and horizontal gaps
  - See the “How to do a quick check of student annotations” document on the GEP web site


Questions?

http://www.flickr.com/photos/jac_opo/240254763/sizes/l/