Transcriptomics - bioinformatics.uni-muenster.de

1

Prof. Dr. Jürgen Gadau

Transcriptomics

2

What is transcriptomics?

• The sum of all transcribed components (RNA) of a genome (DNA) is referred to as its transcriptome.

3

What does my laboratory do?

[Figure: N. giraulti ♂ and N. vitripennis ♂]

4

Functional and Comparative Genomics of speciation and species differences

[Figure: N. giraulti ♂ and N. vitripennis ♂. Panels: HDL; Male courtship behavior; Male sex pheromone evolution; Cuticular hydrocarbon pattern – species and sex recognition; F2 hybrid breakdown (axis 0 to 100, bars labeled V… G… V… G…). Barriers shown: prezygotic and postzygotic.]

5

Evolution of Gene Regulation – Phenotypic Plasticity

Reproductive division of labor

Worker Subcastes

non-reproductive division of labor

Aggression

COOPERATION AND CONFLICT

6

What is Functional Genomics?

…and how does transcriptomics factor in?

7

Some Definitions…

• Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic and transcriptomic projects (such as genome sequencing projects and RNA-seq) to describe gene (and protein) functions and interactions.

• Functional genomics uses genomic data to study gene and protein expression and function on a global scale (genome-wide or system-wide), focusing on gene transcription, translation and protein-protein interactions, and often involving high-throughput methods.

8

Genotype to phenotype

• Classic Mendelian genetics, and the way we teach genetics and evolution, looks at a single locus that we can map.

• One gene (white) → one character

• Organisms, and the genetic architecture of both simple and complex traits, are much more complex.

[Figure: White-eyed mutant*]

*In January 1910, more than a century ago, Thomas Hunt Morgan discovered his first Drosophila mutant, a white-eyed male (Morgan 1910).

9

From Genotype to Phenotype… is a little bit more complex than we usually think!

[Figure labels: Environment; Good flier; Bad flier]

10

Genotype-Phenotype-Fitness-Map

[Figure labels: Environment; Development; Genotype-Space; GENE-EXPRESSION; Phenotype-Space (Aggressive -> Monogyny, Cooperative -> Polygyny); Fitness; …Or Behavior]

11

…one more complication

?

Fixed Genetic Differences versus Phenotypic Plasticity

12

Any Questions?

13

qPCR

14

Methods to measure gene regulation

15

RNA-SEQ/MICROARRAY INTRODUCTION

Microarray

16

RNA-Seq

17

What are the advantages of RNA-Seq vs Microarray?

• Prior sequence (genome or transcriptome) not required

• Microarrays are costly and time-consuming to design and produce, and each array works for only one species

• Greater dynamic range and sensitivity than microarrays

• Improved ability to discriminate regions of high sequence identity

• Greater multiplexing of samples

18

Typical RNA-Seq Strategy

• Library construction: ~300 bp insert size

• Sequencing: Illumina HiSeq 2000, 104 bp paired-end reads

• Assembly or alignment: next-gen 104 bp reads → contigs

19

https://galaxyproject.org/tutorials/rb_rnaseq/

20

Transcriptome pipeline using UNIX-based tools

• Bowtie: creates reference genome index for alignment (based on Broad Anocar2.0 assembly)

• Tophat: aligns RNA-Seq reads to reference index

• Cufflinks and Cuffdiff: tools for quantification and statistical analysis of RNA-Seq reads; can also improve annotation

• IGV/Tablet/Apollo: viewers for RNA-Seq output from Cufflinks

21

Tools for RNA-Seq Analysis (mostly UNIX/Linux)

• Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
• Tophat: http://tophat.cbcb.umd.edu/
• Samtools: http://samtools.sourceforge.net/
• Picard tools: http://picard.sourceforge.net/
• Cufflinks: http://cufflinks.cbcb.umd.edu/
• Apollo: http://apollo.berkeleybop.org/current/index.html
• IGV: http://www.broadinstitute.org/igv/
• ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss


22

Resources Needed

• Genomic sequence in FASTA format
– For alignment of RNA sequence

• Annotations in GTF (Gene Transfer Format)
– For binning RNA-Seq reads into transcripts for quantitative analysis, as well as to help with annotation in Apollo

• RNA-Seq reads in FASTQ format (sequence and quality scores)
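To make the "sequence and quality score" point concrete, here is a minimal Python sketch of reading FASTQ records. It assumes the common layout of four lines per read and Phred+33 quality encoding; older Illumina pipelines used a different offset, so check your data. The names `FastqRecord` and `read_fastq`, and the example read, are made up for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str             # read identifier (header line without the leading '@')
    sequence: str         # base calls
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per read.

    Assumption: qualities are ASCII-encoded with the given offset
    (33 is the usual value, but it is not guaranteed for every data set).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or blank line): stop
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Made-up example read: 'I' decodes to 40, '5' to 20, '!' to 0.
example = "@read_1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].name, records[0].sequence, records[0].qualities)
```

In practice an established parser (for example the one in Biopython) is a safer choice than hand-rolled code; the sketch only shows what the four lines contain.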

23

Bowtie

Creates reference genome index for alignment

• Commands needed:
– bowtie-build -f reference_in.fa out_file_name (-f marks the reference as a FASTA file)
– Runtime: ~4 hrs for vertebrate genome

• Output:
– 6 files that will be used by tophat

24

Tophat

Aligns RNA-Seq reads to reference index

• Commands needed:
– tophat -r <average gap distance for paired reads> -o <output directory> reference_input RNAseq_reads_1.fastq RNAseq_reads_2.fastq (reference_input is the name you gave the output in bowtie)
– Runtime: multiple days

• Output:
– Read alignments in .bam or .sam format
– Junctions, insertions and deletions in .bed format

25

Tophat aligns reads to reference genome

26

Cufflinks

Tool for quantification and statistical analysis of RNA-Seq reads

• Commands used:
– cufflinks -G <.GTF annotation file> input_file.sam
  • Basic command that gives FPKM values and errors for each transcript in a tab-delimited file
  • Runtime: several hours (goes for all cuff programs)
– cuffcompare -r reference_annotation.gtf input_file_tissue1.gtf input_file_tissue2.gtf (the input .gtf files come from cufflinks)
  • Secondary analysis using output from cufflinks. Gives data on pooled transcript levels, a .GTF file for annotation based on all tissues, and major isoforms found (if annotated in reference .GTF file)
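The Bowtie, Tophat and Cufflinks steps from the preceding slides can be chained in a small script. The Python sketch below only assembles the three command lines, using the flags named on the slides (`-f`, `-r`, `-o`, `-G`), and prints them. The function name `build_pipeline` and all file names are placeholders. Two things are assumptions rather than statements from the slides: the gap between mates is estimated as insert size minus twice the read length (300 - 2 × 104 = 92), and the alignment file is assumed to be `accepted_hits.bam` in the Tophat output directory, so check this against your Tophat version.

```python
import shlex
from typing import List


def build_pipeline(
    reference_fasta: str,
    index_name: str,
    annotation_gtf: str,
    reads_1: str,
    reads_2: str,
    mate_inner_dist: int,
    tophat_out_dir: str,
    alignment_file: str,
) -> List[List[str]]:
    """Return the three command lines as argument lists (nothing is executed)."""
    return [
        # 1. Build the reference index from a FASTA file.
        ["bowtie-build", "-f", reference_fasta, index_name],
        # 2. Align the paired-end reads against that index.
        ["tophat", "-r", str(mate_inner_dist), "-o", tophat_out_dir,
         index_name, reads_1, reads_2],
        # 3. Quantify annotated transcripts (FPKM) from the alignments.
        ["cufflinks", "-G", annotation_gtf, alignment_file],
    ]


# Assumption: inner distance = insert size - 2 * read length.
insert_size, read_length = 300, 104
commands = build_pipeline(
    reference_fasta="reference_in.fa",
    index_name="out_file_name",
    annotation_gtf="reference_annotation.gtf",
    reads_1="RNAseq_reads_1.fastq",
    reads_2="RNAseq_reads_2.fastq",
    mate_inner_dist=insert_size - 2 * read_length,
    tophat_out_dir="tophat_out",
    alignment_file="tophat_out/accepted_hits.bam",  # assumed output name
)
for command in commands:
    print(shlex.join(command))
```

To actually run the steps, each argument list could be passed to `subprocess.run(command, check=True)`, which stops the pipeline as soon as one tool fails. Given the runtimes quoted on the slides, a job scheduler may be the more practical choice.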

27

Cufflinks quantifies expression level based on FPKM

[Figure: FPKM (y-axis, 0 to 500) for Ttn, Acta1, Mef2a and Mstn; x-axis ticks 0 to 5]

FPKM = Fragments per kilobase-pair per million sequenced reads
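Written as a formula, the FPKM definition can be sketched as follows. The symbols are introduced here for illustration and are not from the slides: X_i is the number of fragments assigned to transcript i, l_i is the transcript length in base pairs, and N is the total number of fragments in the sample. This is the simplified textbook form; the value Cufflinks reports comes from a more elaborate statistical model.

```latex
\mathrm{FPKM}_i
  = \frac{X_i}{\left(\dfrac{l_i}{10^{3}}\right)\left(\dfrac{N}{10^{6}}\right)}
  = \frac{10^{9}\, X_i}{l_i \, N}
```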

28

RPKM (FPKM avoids counting fragments twice in paired-end sequencing)

• Count up the total reads in a sample and divide that number by 1,000,000. This is our “per million” scaling factor.

• Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).

• Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
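The three steps above translate almost line by line into code. The Python sketch below is illustrative only: the function name `rpkm`, the gene names and the counts are made up. It also assumes that the total reads in the sample equal the sum of the per-gene counts, which ignores reads that map outside annotated genes.

```python
from typing import Dict


def rpkm(read_counts: Dict[str, int], gene_lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Compute RPKM per gene from raw read counts and gene lengths in base pairs."""
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        raise ValueError("no reads in sample")

    # Step 1: the "per million" scaling factor.
    per_million = total_reads / 1_000_000

    values = {}
    for gene, count in read_counts.items():
        # Step 2: normalize for sequencing depth -> reads per million (RPM).
        rpm = count / per_million
        # Step 3: normalize for gene length in kilobases -> RPKM.
        values[gene] = rpm / (gene_lengths_bp[gene] / 1_000)
    return values


# Made-up example: 2,000,000 reads in total.
counts = {"geneA": 500_000, "geneB": 1_500_000}
lengths = {"geneA": 2_000, "geneB": 10_000}
values = rpkm(counts, lengths)
print(values)  # {'geneA': 125000.0, 'geneB': 75000.0}
```

geneB has three times as many reads as geneA but is five times longer, so its RPKM is lower. For paired-end data the same arithmetic applied to fragments instead of reads gives FPKM.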

29

WHAT TO DO WITH TRANSCRIPTOMIC DATA?

30

Transcripts found by RNA-Seq

• Sequencing cDNA

• mRNA vs unprocessed RNA (depends on library prep, but you want to get rid of the ribosomal RNAs)

• Annotate coding SNPs

• Transcript isoforms

• Splice junctions

• Count abundance of transcripts

31

Review: Elements of protein coding genes

32

Review: Processed transcript

33

Review: Alternative splicing

34

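Gene Orthology vs. Paralogy

[Figure: ancestor Gene A gives rise to Gene A1 and Gene A2, both present in lizard and in mouse. Gene A1 and Gene A2 within one species are paralogs; the lizard and mouse copies of the same gene are orthologs.]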

35

Review: Non-protein coding genes

• Not all genes in the genome encode proteins.

• Human chromosome 11 has 1,300 protein-coding genes and 200 noncoding RNA genes, whose final product is an RNA, not a protein.

• Some are involved in the regulation of other genes, whereas others may play structural roles in various nuclear or cytoplasmic processes.

• MicroRNA (miRNA) genes are noncoding RNA genes. There are at least 250 in the human genome.

• miRNAs are short (22-nucleotide-long) noncoding RNAs, at least some of which control the expression or repression of other genes during development.

From Thompson, Medical Genetics

36

Long non-coding RNAs: where they come from

37

Long non-coding RNAs have many functions in transcription and translation

38

Validation of your RNA-Seq results: different methods

• Rerun the same sample on the same instrument

• Use multiple methods for the same sample
– Quantitative: qPCR, RNA-Seq, microarrays
– Qualitative: PCR, in situ

• Run multiple samples of the same tissue type

39

How does RT-qPCR work?

40

How does RT-qPCR work?
