1
Prof. Dr. Jürgen Gadau
Transcriptomics
2
What is transcriptomics?
• The sum of all transcribed components (RNA) of a genome (DNA) is referred to as its transcriptome.
3
What does my laboratory do?
N. giraulti ♂
N. vitripennis ♂
4
HDL
N. giraulti ♂
N. vitripennis ♂
Male Courtship behavior
Male sex pheromone evolution
[Figure: axis 0 to 100; groups V…, G…, V…, G…]
F2 Hybrid breakdown
Prezygotic / Postzygotic
Functional and Comparative Genomics of speciation and species differences
Cuticular hydrocarbon pattern – species and sex recognition
5
Evolution of Gene Regulation – Phenotypic Plasticity
Reproductive division of labor
Worker Subcastes
non-reproductive division of labor
Aggression
COOPERATION AND CONFLICT
6
What is Functional Genomics?
…and how does transcriptomics factor in?
7
• Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic and transcriptomic projects (such as genome sequencing projects and RNA-seq) to describe gene (and protein) functions and interactions.
• Functional genomics uses genomic data to study gene and protein expression and function on a global scale (genome-wide or system-wide), focusing on gene transcription, translation and protein-protein interactions, and often involving high-throughput methods.
Some Definitions…
8
Genotype to phenotype
• Classic Mendelian genetics, and the way we teach genetics and evolution, looks at a single locus that we can map.
• One gene (white) → 1 character
• Organisms, and the genetic architecture of their traits (simple or complex), are much more complex than that.
*In January 1910, more than a century ago, Thomas Hunt Morgan discovered his first Drosophila mutant, a white-eyed male (Morgan 1910).
White-eyed mutant*
9
From Genotype to Phenotype… is a little bit more complex than we usually think!
Environment
Good flier
Bad flier
10
Genotype-Phenotype-Fitness Map
Environment
Development
Genotype-Space
Phenotype-Space
Aggressive -> Monogyny
Cooperative -> Polygyny
Fitness
…Or Behavior
GENE-EXPRESSION
11
…one more complication
?
Fixed Genetic Differences versus
Phenotypic Plasticity
12
Any Questions?
13
qPCR
14
Methods to measure gene regulation
15
RNA-SEQ/MICROARRAY INTRODUCTION
Microarray
16
RNA-Seq
17
What are the advantages of RNA-Seq vs Microarray?
• Prior sequence (genome or transcriptome) not required
• Microarrays are costly and time consuming to design and produce and are just for one species
• Greater dynamic range and sensitivity than microarrays
• Improved ability to discriminate regions of high sequence identity
• Greater multiplexing of samples
18
Typical RNA-Seq Strategy
• Library Construction: ~300 bp insert size
• Sequencing: Illumina HiSeq 2000, 104 bp paired-end reads
• Assembly or Alignment: next-gen 104 bp reads → contigs
19
https://galaxyproject.org/tutorials/rb_rnaseq/
20
Transcriptome pipeline using UNIX-based tools
• Bowtie: creates reference genome index for alignment (based on Broad Anocar2.0 assembly)
• Tophat: aligns RNA-Seq reads to reference index
• Cufflinks and Cuffdiff: tools for quantification and statistical analysis of RNA-Seq reads; can also improve annotation
• IGV/Tablet/Apollo: viewers for RNA-Seq output from Cufflinks
21
Tools for RNA-Seq Analysis (mostly UNIX/Linux)
• Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
• Tophat: http://tophat.cbcb.umd.edu/
• Samtools: http://samtools.sourceforge.net/
• Picard tools: http://picard.sourceforge.net/
• Cufflinks: http://cufflinks.cbcb.umd.edu/
• Apollo: http://apollo.berkeleybop.org/current/index.html
• IGV: http://www.broadinstitute.org/igv/
• ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss
22
Resources Needed
• Genomic sequence in FASTA format
– For alignment of RNA sequence
• Annotations in GTF (Gene Transfer Format) format
– For binning RNA-Seq reads into transcripts for quantitative analysis, as well as to help with annotation in Apollo
• RNA-Seq reads in FASTQ format (sequence and quality scores)
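To make the "sequence and quality scores" point concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes the common four-lines-per-read layout and a Phred+33 quality encoding; both are assumptions about the input, and the names `FastqRecord` and `read_fastq` are illustrative, not part of any tool listed here.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-lines-per-read FASTQ stream.

    Assumes Phred+33 encoding by default; older Illumina files may use +64.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or an empty line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        scores = [ord(ch) - phred_offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Tiny in-memory example; the read itself is made up for illustration.
example = io.StringIO("@read_1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records[0])
```

Each base gets one quality character, so the sequence and quality lines must have the same length; the sketch rejects records where they differ.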
23
Bowtie
Creates reference genome index for alignment
• Commands needed:
– bowtie-build -f reference_in.fa out_file_name (-f marks the reference as a FASTA file)
– Runtime: ~4 hrs for vertebrate genome
• Output:
– 6 files that will be used by tophat
24
Tophat
Aligns RNA-Seq reads to reference index
• Commands needed:
– tophat -r <inner distance> -o <output directory> <reference index> RNAseq_reads_1.fastq RNAseq_reads_2.fastq
  • -r: average gap distance between paired reads
  • -o: output directory
  • reference index: the output name you gave bowtie-build
– Runtime: multiple days
• Output:
– Read alignments in .bam or .sam format
– Junctions, insertions and deletions in .bed format
25
Tophat aligns reads to reference genome
26
Cufflinks
Tool for quantification and statistical analysis of RNA-Seq reads
• Commands used:
– cufflinks -G annotation.gtf input_file.sam (-G takes the .GTF annotation file)
  • Basic command that gives FPKM values and errors for each transcript in a tab-delimited file
  • Runtime: several hours (applies to all cuff programs)
– cuffcompare -r reference_annotation.gtf input_file_tissue1.gtf input_file_tissue2.gtf (the input .gtf files come from cufflinks)
  • Secondary analysis using output from cufflinks. Gives data on pooled transcript levels, a .GTF file for annotation based on all tissues, and major isoforms found (if annotated in the reference .GTF file)
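The Bowtie, Tophat and Cufflinks steps can be chained in a small script. The sketch below only assembles and prints the command lines (a dry run) and does not execute them. All file names are hypothetical placeholders, and two things are assumptions rather than statements from these slides: that the value for `-r` can be estimated as fragment size minus two read lengths, and that Tophat writes its alignments to `accepted_hits.bam` inside the output directory.

```python
import shlex
from typing import List


def mate_inner_distance(fragment_size: int, read_length: int) -> int:
    """Estimate the gap between paired reads: fragment minus both reads (assumption)."""
    return fragment_size - 2 * read_length


def build_pipeline(reference_fasta: str, index_name: str,
                   reads_1: str, reads_2: str,
                   annotation_gtf: str, inner_distance: int,
                   out_dir: str) -> List[List[str]]:
    """Return the three command lines in the order they would be run."""
    return [
        # 1. Build the reference index from a FASTA file.
        ["bowtie-build", "-f", reference_fasta, index_name],
        # 2. Align paired-end reads against that index.
        ["tophat", "-r", str(inner_distance), "-o", out_dir,
         index_name, reads_1, reads_2],
        # 3. Quantify against the annotation (alignment file name is assumed).
        ["cufflinks", "-G", annotation_gtf, f"{out_dir}/accepted_hits.bam"],
    ]


# ~300 bp fragments with 104 bp paired-end reads, as in the strategy slide.
gap = mate_inner_distance(300, 104)
commands = build_pipeline("genome.fa", "genome_index",
                          "RNAseq_reads_1.fastq", "RNAseq_reads_2.fastq",
                          "annotation.gtf", gap, "tophat_out")
for command in commands:
    print(shlex.join(command))  # dry run: print only, nothing is executed
```

Once the printed commands look right, each list could be passed to `subprocess.run(command, check=True)` so that a failed step stops the pipeline instead of feeding bad input to the next tool.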
27
Cufflinks quantifies expression level based on FPKM
[Figure: FPKM values (axis 0 to 500) for Ttn, Acta1, Mef2a and Mstn]
FPKM = Fragments per kilobase-pair per million sequenced reads
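Since Cufflinks reports FPKM in a tab-delimited file, the values are easy to load for plotting or filtering. The following is a hedged sketch: the column names (`gene_short_name`, `FPKM`, `FPKM_status`) are assumed to match the Cufflinks tracking-file layout, real files contain more columns, and the three rows are invented placeholders that only illustrate the format.

```python
import csv
import io
from typing import Dict, TextIO


def load_fpkm(handle: TextIO, min_fpkm: float = 0.0) -> Dict[str, float]:
    """Map gene name to FPKM, keeping rows with status OK and FPKM >= min_fpkm."""
    table: Dict[str, float] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["FPKM_status"] != "OK":
            continue  # skip rows the quantifier flagged as unreliable
        fpkm = float(row["FPKM"])
        if fpkm >= min_fpkm:
            table[row["gene_short_name"]] = fpkm
    return table


# Invented placeholder rows, not real measurements.
tracking = io.StringIO(
    "tracking_id\tgene_short_name\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\tFPKM_status\n"
    "g1\tgeneA\t120.5\t100.0\t141.0\tOK\n"
    "g2\tgeneB\t0.8\t0.0\t2.1\tOK\n"
    "g3\tgeneC\t35.0\t0.0\t90.0\tLOWDATA\n"
)
expressed = load_fpkm(tracking, min_fpkm=1.0)
print(expressed)
```

Filtering on the status column before applying an expression threshold keeps genes with too little data from being read as weakly expressed.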
28
RPKM (FPKM avoids counting fragments twice in paired-end sequencing)
• Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
• Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).
• Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
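The three steps above translate directly into a short function. This is a minimal sketch with made-up toy numbers; the function name `rpkm` and its parameters are illustrative and not taken from any of the tools listed earlier.

```python
def rpkm(read_count: int, gene_length_bp: int, total_reads: int) -> float:
    """Compute RPKM for one gene following the three steps on the slide."""
    if total_reads <= 0 or gene_length_bp <= 0:
        raise ValueError("total_reads and gene_length_bp must be positive")
    per_million = total_reads / 1_000_000   # step 1: "per million" scaling factor
    rpm = read_count / per_million          # step 2: reads per million (RPM)
    return rpm / (gene_length_bp / 1_000)   # step 3: divide by gene length in kb


# Toy example: 200 reads on a 2,000 bp gene in a sample of 4,000,000 reads.
value = rpkm(read_count=200, gene_length_bp=2_000, total_reads=4_000_000)
print(value)
```

For paired-end data, passing fragment counts instead of read counts gives the FPKM variant, because each read pair is then counted once.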
29
WHAT TO DO WITH TRANSCRIPTOMIC DATA?
30
• Sequencing cDNA
• mRNA vs unprocessed RNA (depends on library prep, but you want to get rid of the ribosomal RNAs)
• Annotate coding SNPs
• Transcript isoforms
• Splice junctions
• Count abundance of transcripts
Transcripts found by RNA-Seq
31
Review: Elements of protein coding genes
32
Review: Processed transcript
33
Review: Alternate splicing
34
Gene Orthology vs. Paralogy
Ancestor Gene A
lizard Gene A1 Gene A2
mouse Gene A1 Gene A2
paralogs
orthologues
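The distinction in the diagram can be written as a simple rule. The sketch below assumes, as the diagram suggests, that ancestral Gene A duplicated into A1 and A2 before the lizard and mouse lineages split; the `Gene` type and `relationship` function are hypothetical names for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    species: str
    copy: str  # which duplicate lineage the gene belongs to, e.g. "A1" or "A2"


def relationship(a: Gene, b: Gene) -> str:
    """Classify two genes, assuming the duplication predates the species split."""
    if a == b:
        return "same gene"
    if a.copy == b.copy:
        # Same copy in different species: separated by a speciation event.
        return "orthologues"
    # Different copies: separated by the gene duplication event.
    return "paralogs"


lizard_a1 = Gene("lizard", "A1")
lizard_a2 = Gene("lizard", "A2")
mouse_a1 = Gene("mouse", "A1")
mouse_a2 = Gene("mouse", "A2")

print(relationship(lizard_a1, mouse_a1))   # same copy, different species
print(relationship(lizard_a1, lizard_a2))  # different copies, same species
print(relationship(lizard_a1, mouse_a2))   # different copies, different species
```

Under this assumption, lizard A1 and mouse A2 are still paralogs even though they sit in different species, because their last common ancestor was the duplication, not the speciation.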
35
• Not all genes in the genome encode proteins.
• Human chromosome 11 has 1,300 protein-coding genes and 200 noncoding RNA genes, whose final product is an RNA, not a protein.
• Some are involved in the regulation of other genes, whereas others may play structural roles in various nuclear or cytoplasmic processes.
• MicroRNA (miRNA) genes are noncoding RNA genes. There are at least 250 in the human genome.
• miRNAs are short, 22-nucleotide-long noncoding RNAs, at least some of which control the expression or repression of other genes during development.
From Thompson, Medical Genetics
Review: Non-protein coding genes
36
Long non-coding RNAs: where they come from
37
Long non-coding RNAs have many functions in transcription and translation
38
Validation of your RNA-Seq results: different methods
– Rerun the same sample on the same instrument
– Use multiple sequencing methods for the same sample.
  • Quantitative
    – qPCR
    – RNA-Seq
    – Microarrays
  • Qualitative
    – PCR
    – In situ
– Run multiple samples of same tissue type
39
How does RT-qPCR work?
40
How does RT-qPCR work?