Searching for Transcription Start Sites in Drosophila, 01/2018, Wilson Leung


Page 1: Searching for Transcription Start Sites

Searching for Transcription Start Sites in Drosophila

01/2018

Wilson Leung

Page 2: Searching for Transcription Start Sites

Outline
Transcription start sites (TSS) annotation goals
Promoter architecture in D. melanogaster
New D. melanogaster TSS datasets
Find the initial transcribed exon
Annotate putative transcription start sites
Search for core promoter motifs

Page 3: Searching for Transcription Start Sites

Muller F element, heterochromatic, and euchromatic genes show similar expression levels

Riddle NC, et al. PLoS Genet. 2012 Sep;8(9):e1002954.

Page 4: Searching for Transcription Start Sites

TSS of F element genes show lower levels of H3K9me3 and HP1a

[Figure: enrichment profiles around the TSS; tracks: H3K9me2, H3K9me3, H3K36me3, HP1a, POF, Su(var)3-9, PolII]

Riddle NC, et al. PLoS Genet. 2012 Sep;8(9):e1002954.

Page 5: Searching for Transcription Start Sites

Three strategies for motif finding

Multiple genes in a single species
  Genes with common expression pattern
  Sequences associated with ChIP-Seq peaks

Single gene in multiple species
  Phylogenetic footprinting

Multiple genes in multiple species
  Compare multiple sequence alignment profiles of multiple genes (Magma)

Page 6: Searching for Transcription Start Sites

Motif finding using multiple genes within a single species

Sequences surrounding TSS

Predicted motif instances

Position   1   2   3   4   5   6   7   8   9  10
A          4   6  12   0   3   1  13   5   7   9
C         22  23   6  64   0  70   0  54  13  33
G          9  11  30   3  14   0   2   7  11  10
T         36  31  23   4  54   0  56   5  40  19

[Sequence logo: Trl (FlyReg_DNaseI); y-axis: bits (0.0 to 2.0); x-axis: motif position (1 to 10)]
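The link between the count matrix and the sequence logo on this slide can be sketched in a few lines. This is a minimal illustration, not the method used to draw the original logo: it assumes a uniform background and applies no small-sample correction, and the function names `column_information` and `consensus` are chosen here for illustration.

```python
import math

# Position counts copied from the matrix on this slide (rows A, C, G, T; 10 columns).
COUNTS = {
    "A": [4, 6, 12, 0, 3, 1, 13, 5, 7, 9],
    "C": [22, 23, 6, 64, 0, 70, 0, 54, 13, 33],
    "G": [9, 11, 30, 3, 14, 0, 2, 7, 11, 10],
    "T": [36, 31, 23, 4, 54, 0, 56, 5, 40, 19],
}


def column_information(counts):
    """Return the information content (bits) of each column: 2 - Shannon entropy.

    Assumes a uniform background (0.25 per base) and no small-sample correction.
    """
    n_cols = len(counts["A"])
    bits = []
    for i in range(n_cols):
        column = [counts[base][i] for base in "ACGT"]
        total = sum(column)
        entropy = -sum((c / total) * math.log2(c / total) for c in column if c > 0)
        bits.append(2.0 - entropy)
    return bits


def consensus(counts):
    """Return the most frequent base at each position."""
    n_cols = len(counts["A"])
    return "".join(max("ACGT", key=lambda b: counts[b][i]) for i in range(n_cols))


if __name__ == "__main__":
    for pos, ic in enumerate(column_information(COUNTS), start=1):
        print(f"position {pos}: {ic:.2f} bits")
    print("consensus:", consensus(COUNTS))
```

Every column of the matrix sums to 71 sites, so each count can be read directly as a frequency. Columns dominated by a single base (for example position 6) approach the 2-bit maximum of the logo.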

Page 7: Searching for Transcription Start Sites

Motif finding using single gene in multiple species

[Figure: genome browser view of D. mel chr4; tracks: Genes, PhyloP, phastCons, Conserved Elements, Multiple Sequence Alignment]

Page 8: Searching for Transcription Start Sites

Motif finding using multiple genes in multiple species (PhyloNet)

Based on Figure 1 from Wang T and Stormo GD. PNAS 2005 Nov 29;102(48):17400-5.

Promoter sequences

Conserved motifs

1. Identify conserved regions (profiles) in whole genome multiple sequence alignments

2. Identify multiple genes in the genome with similar alignment profiles

Page 9: Searching for Transcription Start Sites

Magma: Multiple Aligner of Genomic Multiple Alignments

Key features of Magma:
  Runs ~70x faster than PhyloNet
  Analyze multiple sequence alignments with gaps
  Use set-covering approach to minimize redundancy in discovered motifs

Ihuegbu NE, Stormo GD, Buhler J. J Comput Biol. 2012 Feb;19(2):139-47.

Computationally tractable to analyze conserved motifs in multiple eukaryotic genomes

Page 10: Searching for Transcription Start Sites

Goals for the transcription start sites (TSS) annotations

Research goal:
  Identify motifs that enable Muller F element genes to function within a heterochromatic environment

Annotation goals:
  Define search regions enriched in regulatory motifs
    Define precise location of TSS if possible
    Define search regions where TSS could be found
  Document the evidence used to support the TSS annotations

Page 11: Searching for Transcription Start Sites

Estimated evolutionary distances with respect to D. melanogaster

[Figure: phylogenetic tree; tip labels: D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. ananassae, D. bipectinata, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi; legend: Species sequenced by modENCODE, GEP annotation projects]

Species          Substitutions per neutral site
D. ficusphila    0.80
D. eugracilis    0.76
D. biarmipes     0.70
D. takahashii    0.65
D. elegans       0.72
D. rhopaloa      0.66
D. kikkawai      0.89
D. bipectinata   0.99

Data from Table 1 of the modENCODE comparative genomics white paper

Page 12: Searching for Transcription Start Sites

Challenges with TSS annotations

Fewer constraints on untranslated regions (UTRs)
  UTRs evolve more quickly than coding regions
  Open reading frames, compatible phases of donor and acceptor sites do not apply to UTRs
  Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs

Most gene finders do not predict UTRs

Lack of experimental data
  Cannot use RNA-Seq data to precisely define the TSS

Page 13: Searching for Transcription Start Sites

TSS annotation workflow

1. Identify the ortholog
2. Note the gene structure in D. melanogaster
3. Annotate the coding exons
4. Classify the type of core promoter in D. melanogaster
5. Annotate the initial transcribed exon
6. Identify any core promoter motifs in region
7. Define TSS positions or TSS search regions

Page 14: Searching for Transcription Start Sites

RNA Polymerase II core promoter

Initiator motif (Inr) contains the TSS
TFIID binds to the TATA box and Inr to initiate the assembly of the pre-initiation complex (PIC)

Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol. 2010 Mar 15;339(2):225-9.

Page 15: Searching for Transcription Start Sites

Peaked versus broad promoters

Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012 Jan-Feb;1(1):40-51.

Peaked promoter (Single strong TSS)

Broad promoter (Multiple weak TSS)

50-300 bp

Page 16: Searching for Transcription Start Sites

RNA-Seq biases introduced by library construction

cDNA fragmentation
  Strong bias at the 3’ end

RNA fragmentation
  More uniform coverage
  Miss the 5’ and 3’ ends of the transcript

Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.

[Figure: RNA-Seq read count (y-axis) across the gene span, 5’ to 3’ (x-axis)]

Page 17: Searching for Transcription Start Sites

Techniques for finding TSS

Identify the 5’ cap at the beginning of the mRNA
  Cap Analysis of Gene Expression (CAGE)
  RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE)
  Cap-trapped Expressed Sequence Tags (5’ ESTs)

More information on these techniques:
  Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012;786:181-200.
  Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007 Jun;8(6):424-36.

Page 18: Searching for Transcription Start Sites

Promoter architecture in Drosophila

Classify core promoter based on the Shape Index (SI)

Determined by the distribution of CAGE and 5’ RLM-RACE reads
Shape index is a continuum

Most promoters in D. melanogaster contain multiple TSS

Median width = 162 bp

~70% of vertebrate genes have broad promoters

Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
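The slide does not give the Shape Index formula. The sketch below assumes the entropy-based form SI = 2 + Σ p_i log2(p_i), where p_i is the fraction of reads that start at position i of the promoter. That form should be checked against Hoskins et al. before use, and the function name `shape_index` is illustrative.

```python
import math


def shape_index(tag_counts):
    """Compute SI = 2 + sum(p_i * log2(p_i)) over positions with at least one read.

    tag_counts: number of CAGE / 5' RLM-RACE reads starting at each position
    of one promoter. The formula is an assumption, not taken from the slide.
    """
    total = sum(tag_counts)
    if total == 0:
        raise ValueError("promoter has no reads")
    return 2.0 + sum(
        (c / total) * math.log2(c / total) for c in tag_counts if c > 0
    )


if __name__ == "__main__":
    print(shape_index([0, 0, 100, 0]))      # all reads at one position
    print(shape_index([10] * 8))            # reads spread evenly over 8 positions
    print(shape_index([90, 5, 5]))          # one dominant position
```

Under this assumption a promoter with a single TSS position scores the maximum of 2, and the score decreases as reads spread over more positions, which matches the description of the shape index as a continuum.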

Page 19: Searching for Transcription Start Sites

Genes with peaked promoters show stronger spatial and tissue specificity

46% of genes with broad promoters are expressed in all stages of embryonic development
19% of genes with peaked promoters are expressed in all stages

Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.

Page 20: Searching for Transcription Start Sites

Peaked and broad promoters are enriched in different core promoter motifs

Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73.

Page 21: Searching for Transcription Start Sites

Resources for classifying the type of core promoter in D. melanogaster

Only a subset of the modENCODE data are available through FlyBase

D. melanogaster GEP UCSC Genome Browser [Aug. 2014 (BDGP Release 6) assembly]
  FlyBase gene annotations (release 6.16)
  modENCODE TSS (Celniker) annotations
  DNase I hypersensitive sites (DHS)
  CAGE and RAMPAGE TSS datasets
  9-state and 16-state chromatin models
  Transcription factor binding site (TFBS) HOT spots

Page 22: Searching for Transcription Start Sites

9-state chromatin model

Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011 Mar 24;471(7339):480-5.

Page 23: Searching for Transcription Start Sites

DNaseI Hypersensitive Sites (DHS) correspond to accessible regions

Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119:355-62.

Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature. 2014 Aug 28;512(7515):449-52.

Page 24: Searching for Transcription Start Sites

modENCODE TSS annotations

Two sets of modENCODE TSS predictions
  TSS (Celniker)
    Most recent dataset produced by modENCODE
    Available on the GEP UCSC Genome Browser
  TSS (Embryonic)
    Older dataset available from FlyBase GBrowse

Use TSS (Celniker) dataset as the primary evidence

Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.

Page 25: Searching for Transcription Start Sites

Classify the D. melanogaster core promoter based on TSS (Celniker) annotations and DHS positions

Consider DHS positions within a 300 bp window surrounding the start of the D. melanogaster transcript

TSS classification       # Annotated TSS   # DHS positions
Peaked                   1                 0
Peaked                   0                 1
Peaked                   1                 1
Intermediate             ≤ 1               > 1
Intermediate             > 1               ≤ 1
Broad                    > 1               > 1
Insufficient evidence    0                 0
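The classification table on this slide can be written as a small decision function. This is a sketch that assumes the counts have already been tallied within the 300 bp window; the function name `classify_core_promoter` is illustrative and not part of any GEP tool.

```python
def classify_core_promoter(n_tss, n_dhs):
    """Classify a core promoter from counts within the 300 bp window.

    n_tss: number of annotated TSS (Celniker) positions
    n_dhs: number of DHS positions
    """
    if n_tss < 0 or n_dhs < 0:
        raise ValueError("counts must be non-negative")
    if n_tss == 0 and n_dhs == 0:
        return "Insufficient evidence"
    if n_tss <= 1 and n_dhs <= 1:
        return "Peaked"          # (1, 0), (0, 1), or (1, 1)
    if n_tss > 1 and n_dhs > 1:
        return "Broad"
    return "Intermediate"        # exactly one of the two counts is > 1


if __name__ == "__main__":
    for counts in [(1, 0), (0, 1), (1, 1), (0, 3), (2, 1), (2, 2), (0, 0)]:
        print(counts, classify_core_promoter(*counts))
```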

Page 26: Searching for Transcription Start Sites

DEMO: Classify the core promoter of D. melanogaster Rad23

Page 27: Searching for Transcription Start Sites

Additional DHS data from different stages of embryonic development

DHS data produced by the BDTNP project

Evidence tracks:
  Detected DHS Positions (Embryos)
  DHS Read Density (Embryos)

Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43.

[Figure: genome browser view of chr4; tracks: BG3 9-state, S2 9-state, CG2316 isoforms (RA, RB, RC, RD, RG, RH), BG3 DHS, S2 DHS, Kc DHS, TSS (Celniker), embryonic DHS at Stage 5, Stage 9, Stage 10, Stage 11, Stage 14]

Page 28: Searching for Transcription Start Sites

Additional TSS data available in FlyBase release 6.11

Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013 Jan;23(1):169-80.

Page 29: Searching for Transcription Start Sites

Benefits of RAMPAGE

RAMPAGE = RNA Annotation and Mapping of Promoters for Analysis of Gene Expression

CAGE only allows sequencing of short sequence tags (~27 bp) near the 5’ cap
  Ambiguous read mapping to large parts of the genome

RAMPAGE produces long paired-end reads instead of short sequence tags

Developed novel algorithm to identify TSS clusters
  Used paired-end information during peak calling
  Used Cufflinks to produce partial transcript models

Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5'-complete cDNAs. Curr Protoc Mol Biol. 2013 Nov 11;104:Unit 25B.11.

Page 30: Searching for Transcription Start Sites

RAMPAGE results on the GEP UCSC Genome Browser

Lifted RAMPAGE results from release 5 to release 6
Results from 36 developmental stages
Combined TSS peak call from all samples

Available under the “Expression and Regulation” section

Page 31: Searching for Transcription Start Sites

Standardize analysis of MachiBase and modENCODE CAGE data using CAGEr

Bioconductor package developed by RIKEN

Map datasets against release 6 assembly
  37 modENCODE CAGE samples; 7 MachiBase samples
Define TSS and promoters for each sample
Define consensus promoters across all samples

Haberle V, et al. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015 Apr 30;43(8):e51.

Page 32: Searching for Transcription Start Sites

TSS classifications based on CAGEr

[Figure: genome browser examples of Peaked, Intermediate, and Broad promoters; tracks: FlyBase Genes, modENCODE CAGE Peaks, modENCODE CAGE (Plus) or modENCODE CAGE (Minus)]

Page 33: Searching for Transcription Start Sites

Changes in the dominant TSS of Rad23 across different developmental stages

[Figure: CAGE Tag Clusters at Rad23; labels: Stages of Development, Adult females, Rad23]

Page 34: Searching for Transcription Start Sites

Evidence for TSS annotations (in general order of importance)

1. Experimental data
   RNA-Seq
   RNA Pol II ChIP-Seq

2. Conservation
   Type of TSS (peaked/intermediate/broad) in D. melanogaster
   Sequence similarity to initial exon in D. melanogaster
   Sequence similarity to other Drosophila species (Multiz)

3. Core promoter motifs
   Inr, TATA box, etc.

Page 35: Searching for Transcription Start Sites

Determine the gene structure in D. melanogaster

FlyBase: GBrowse UTR CDS

Gene Record Finder: Transcript Details

Page 36: Searching for Transcription Start Sites

Identify the initial transcribed exon using NCBI blastn

Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder
Use placement of the flanking exons to reduce the size of the search region if possible
Increase sensitivity of nucleotide searches
  Change Program Selection to blastn
  Change Word size to 7
  Change Match/Mismatch Scores to +1, -1
  Change Gap Costs to Existence: 2, Extension: 1
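The settings above are described for the NCBI web form. For readers who run BLAST+ locally, the sketch below builds what should be the equivalent command line. The mapping from web form fields to `blastn` options is an assumption made here, and the FASTA file names are hypothetical placeholders. The command is only assembled and printed, not executed.

```python
import shlex


def build_blastn_command(query_fasta, subject_fasta):
    """Return a BLAST+ blastn argument list with the sensitive settings from the slide."""
    return [
        "blastn",
        "-task", "blastn",         # Program Selection: blastn (not megablast)
        "-query", query_fasta,     # initial transcribed exon from D. melanogaster
        "-subject", subject_fasta, # contig from the target species
        "-word_size", "7",         # Word size: 7
        "-reward", "1",            # Match score: +1
        "-penalty", "-1",          # Mismatch score: -1
        "-gapopen", "2",           # Gap existence cost: 2
        "-gapextend", "1",         # Gap extension cost: 1
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    command = build_blastn_command("Rad23_initial_exon.fasta", "contig19.fasta")
    print(shlex.join(command))
```

Returning the arguments as a list keeps the command safe to pass to `subprocess.run` without shell quoting issues.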

Page 37: Searching for Transcription Start Sites

Extrapolate TSS position based on blastn alignment of the initial transcribed exon

Assume the length of the initial transcribed exon is conserved between D. melanogaster and D. biarmipes

blastn: D. mel: Rad23:1 (Query) vs. contig19 (Sbjct)

Query start: 6

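Extrapolate TSS position (the alignment begins at query base 6, so the 5 unaligned query bases are subtracted from the subject start):

28,941 - 5 = 28,936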
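The arithmetic on this slide generalizes to any alignment that does not start at the first base of the query. The sketch below assumes, as the slide does, that the length of the initial exon is conserved, and it reads 28,941 as the subject coordinate aligned to query base 6. The minus-strand branch and the function name `extrapolate_tss` are additions made for illustration.

```python
def extrapolate_tss(query_start, subject_start, subject_strand="+"):
    """Extrapolate the TSS coordinate on the subject contig.

    query_start:    first aligned base of the query exon (1-based)
    subject_start:  subject coordinate aligned to query_start
    subject_strand: "+" or "-" orientation of the hit on the subject
    """
    if query_start < 1:
        raise ValueError("query_start is 1-based and must be >= 1")
    unaligned = query_start - 1  # query bases upstream of the alignment
    if subject_strand == "+":
        return subject_start - unaligned
    if subject_strand == "-":
        return subject_start + unaligned
    raise ValueError("subject_strand must be '+' or '-'")


if __name__ == "__main__":
    # Rad23 example from the slide: query start 6, subject start 28,941.
    print(extrapolate_tss(6, 28941))
```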

Page 38: Searching for Transcription Start Sites

Core promoter motifs can affect gene expression levels

Juven-Gershon T, et al. Rational design of a super core promoter that enhances gene expression. Nat Methods. 2006 Nov;3(11):917-22.

SCP1:

Page 39: Searching for Transcription Start Sites

Use core promoter motifs to support TSS annotations

Some sequence motifs are enriched in the region (~300 bp) surrounding the TSS

Some motifs (e.g., Inr, TATA) are well-characterized
Other motifs are identified based on computational analysis

Presence of core promoter motifs can be used to support the TSS annotations

Inr motif (TCAKTY) overlaps with the TSS (-2 to +4)

Absence of core promoter motifs is a negative result

Most D. melanogaster TSS do not contain the Inr motif
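A search for the Inr consensus can be sketched with a regular expression. In IUPAC notation K is G or T and Y is C or T, so TCAKTY expands to `TCA[GT]T[CT]`. Because the motif spans -2 to +4, the A at the third base is position +1. This is an illustrative stand-in, not the implementation behind the UCSC Short Match tool or the GEP Core Promoter Motifs tracks, and the function name `find_inr_tss` is hypothetical.

```python
import re

# Lookaheads allow overlapping matches to be reported.
INR_PLUS = re.compile(r"(?=TCA[GT]T[CT])")    # TCAKTY on the plus strand
INR_MINUS = re.compile(r"(?=[AG]A[AC]TGA)")   # reverse complement of TCAKTY


def find_inr_tss(sequence):
    """Return sorted (position, strand) pairs for the +1 base of each Inr match.

    Positions are 1-based coordinates on the plus strand of `sequence`.
    """
    seq = sequence.upper()
    hits = []
    for match in INR_PLUS.finditer(seq):
        hits.append((match.start() + 3, "+"))   # third base of the 6-mer
    for match in INR_MINUS.finditer(seq):
        hits.append((match.start() + 4, "-"))   # third base counted from the 3' side
    return sorted(hits)


if __name__ == "__main__":
    print(find_inr_tss("GGTCAGTTCC"))   # plus-strand match TCAGTT
    print(find_inr_tss("CCAAATGACC"))   # minus-strand match TCATTT
```

Matches are candidates only. As noted above, most D. melanogaster TSS lack the Inr motif, so a hit supports a TSS annotation but its absence does not refute one.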

Page 40: Searching for Transcription Start Sites

Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs

Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087.

TATA box

Initiator (Inr)

Available under “Projects” → “Annotation Resources” → “Core Promoter Motifs” on the GEP web site: http://gander.wustl.edu/~wilson/core_promoter_motifs.html

Page 41: Searching for Transcription Start Sites

Core Promoter Motifs tracks

Show core promoter motif matches for each contig

Separated by strand

Visualize matches to different core promoter motifs
Use UCSC Table Browser (or other means) to export the list of motif matches within the search region

Page 42: Searching for Transcription Start Sites

DEMO: Use the Inr motif to support the TSS position of Rad23

Page 43: Searching for Transcription Start Sites

RNA PolII ChIP-Seq tracks (available for D. biarmipes, D. elegans, and D. ficusphila)

Show regions that are enriched in RNA Polymerase II compared to input DNA

Gene Models

RNA PolII Peaks

RNA PolII Enrichment

RNA-Seq

Page 44: Searching for Transcription Start Sites

Using RNA-Seq and RNA PolII ChIP-Seq data to define the TSS search region

[Figure: genome browser view showing a narrow TSS search region; tracks: D. mel Transcripts, RNA PolII Peaks, RNA PolII Enrichment, RNA-Seq]

Page 45: Searching for Transcription Start Sites

TSS annotation for Rad23

TSS position: 28,936
  Conservation with D. melanogaster
    blastn search of initial exon
    “D. mel Transcripts” track
  Location of the Inr motif

TSS search region: 28,716-28,936
  Enrichment of RNA PolII upstream of the TSS position
  RNA-Seq read coverage upstream of the TSS position
  Search region defined by the extent of the RNA PolII peak

Page 46: Searching for Transcription Start Sites

TSS annotation resources

Walkthroughs:
  Annotation of Transcription Start Sites in Drosophila
  Sample TSS report for onecut

Reference:
  TSS Annotation Workflow

GEP Annotation Report:
  Classify the type of core promoter
  Evidence that supports or refutes the TSS annotation
  Distribution of core promoter motifs

Page 47: Searching for Transcription Start Sites

Additional TSS annotation resources

The D. melanogaster gene annotations are the primary source of evidence

Resources that could be useful if the D. melanogaster evidence is ambiguous
  Whole genome alignments of 26 Drosophila species
    PhastCons and PhyloP conservation scores
  Genome browsers for 26 Drosophila species
    RNA Pol II ChIP-Seq (D. biarmipes, D. elegans, and D. ficusphila)
    CAGE data for D. pseudoobscura
    RNA-Seq coverage, splice junctions, assembled transcripts
    Gnomon and Augustus gene predictions

Page 48: Searching for Transcription Start Sites

Phylogenetic tree for 26 Drosophila species

[Figure: phylogenetic tree, tree scale 0.1; legend: Published, GEP projects, Reference, Additional species; labeled species include D. serrata, D. suzukii, D. miranda, D. arizonae, D. navojoa, D. busckii, D. takahashii]

Page 49: Searching for Transcription Start Sites

Genome alignments of 25 Drosophila species against D. melanogaster

Drosophila Chain/Net:

Page 50: Searching for Transcription Start Sites

Use the conservation tracks to identify regions under selection

PhyloP scores:

Under negative selection

Fast-evolving

phastCons

PhyloP

Conserved Elements

Page 51: Searching for Transcription Start Sites

Examine the ROAST alignments to identify the orthologous TSS regions

Page 52: Searching for Transcription Start Sites

TSS annotation summary

Most of the D. melanogaster core promoters have multiple TSS

  Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogaster
  Define search regions that might contain TSS

Use multiple lines of evidence to infer the TSS region
  Identify initial exon
    RNA-Seq coverage
    blastn (change search parameters)
  Distribution of core promoter motifs (e.g., Inr)
  RNA PolII ChIP-Seq peaks
  Maintain conservation compared to D. melanogaster

Page 53: Searching for Transcription Start Sites

Questions?

Page 54: Searching for Transcription Start Sites
Page 55: Searching for Transcription Start Sites

Structure of a typical mRNA

Pesole G. et al. Untranslated regions of mRNAs. Genome Biology. 2002: 3(3) reviews0004.1-reviews0004.10.

Page 56: Searching for Transcription Start Sites

D. ananassae and D. melanogaster F element genes show similar range of expression levels

Chen Z-X, et al. Genome Res. 24:1209-1223

[Figure: expression levels (rlog) versus CAI (codon bias) with LOESS regression line; panel labels: D. ananassae Adult Females, F element Adult Males, F element Adult Females; series labels: D. mel: F (modENCODE), D. ana: F (modENCODE), 4R, 4L, 4]

Page 57: Searching for Transcription Start Sites

Phylogenetic tree based on the analysis of 13 Type IIB restriction endonucleases

Simulate restriction digests of 21 genomes
DNA fragments range from 21-33 bp in size
Calculate distance between two genomes based on number of shared fragments

[Figure: phylogenetic tree; tip labels: D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. ananassae, D. bipectinata, D. santomea, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi, D. pseudoobscura]

Seetharam AS and Stuart GW. Whole genome phylogeny for 21 Drosophila species using predicted 2b-RAD fragments. PeerJ. 2013 Dec 23;1:e226.

Page 58: Searching for Transcription Start Sites

RAMPAGE protocol

Page 59: Searching for Transcription Start Sites

Signals in the FlyBase RAMPAGE and MachiBase TSS tracks are off by one base

https://wiki.flybase.org/wiki/FlyBase:GBrowse_Tracks#Aligned_Evidence

“FlyBase: GBrowse Tracks” page on the FlyBase Wiki

Page 60: Searching for Transcription Start Sites

DEMOblastn search of the initial transcribed exon of Rad23 against D. biarmipes contig19

Page 61: Searching for Transcription Start Sites

Optimize alignment parameters based on expected levels of conservation

Derive alignment scores using information theory

Relative entropy of target and background frequencies

Match +2, Mismatch -3 optimized for 90% identity
Match +1, Mismatch -1 optimized for 75% identity

States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. Methods. 1991 3:66-70.
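The two scoring schemes on this slide can be checked with a short log-odds calculation. The sketch assumes equal base frequencies (0.25 each) and equally likely mismatches, which is a simplification of the approach in States et al. The function names are illustrative. Only the ratio of mismatch to match score matters, because multiplying both scores by a constant does not change which alignment is optimal.

```python
import math


def log_odds_scores(identity):
    """Return (match, mismatch) log-odds scores in nats for a target identity.

    Target frequency of one specific matching pair:    identity / 4
    Target frequency of one specific mismatching pair: (1 - identity) / 12
    Background frequency of any ordered pair:          1 / 16
    """
    if not 0.25 < identity < 1.0:
        raise ValueError("identity must be between 0.25 and 1 (exclusive)")
    match = math.log(4.0 * identity)
    mismatch = math.log(4.0 * (1.0 - identity) / 3.0)
    return match, mismatch


def mismatch_to_match_ratio(identity):
    """Return the mismatch score divided by the match score."""
    match, mismatch = log_odds_scores(identity)
    return mismatch / match


if __name__ == "__main__":
    for identity in (0.75, 0.90):
        print(f"{identity:.0%} identity: ratio = {mismatch_to_match_ratio(identity):.2f}")
```

Under these assumptions the ratio is -1 at 75% identity, which corresponds to +1/-1, and is close to -1.5 at 90% identity, which corresponds to +2/-3.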

Page 62: Searching for Transcription Start Sites

Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS

April 2013 (BCM-HGSC/Dbia_2.0) assembly
Search for orthologous regions in D. elegans

Page 63: Searching for Transcription Start Sites

Conservation tracks on the D. melanogaster GEP UCSC Genome Browser

Whole genome alignments of multiple Drosophila species

Drosophila Chain/Net composite track

Generate multiple sequence alignments from these pairwise alignments (ROAST)

Identify conserved regions from Multiz alignments
  PhastCons: identify conserved elements
  PhyloP: measure level of selection at each nucleotide

Multiz alignment of 27 insect species available on the official UCSC Genome Browser

Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) assembly

Page 64: Searching for Transcription Start Sites

Improve the multiple sequence alignments for Drosophila

Use the most recent version of the RefSeq assemblies

UCSC uses draft assemblies instead of the CAF1 assemblies

Species-specific repeat masking

Use more sensitive alignment tools

LAST, ROAST

Manual optimizations of alignment parameters