Walk-thru of CAGE exercise Also at /tag_analysis/ /tag_analysis

Walk-thru of CAGE exercise

• Also at http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/

• …together with updated slides

• And linked from web page


Interlude: a logistics problem

• The largest cDNA project so far made 102,000 cDNAs

• If you publish, you need to be able to ship these to the people asking for them

• This would take >50kg of dry ice! Expensive and a logistics nightmare since you need to keep track of the 102,000 tubes

• How can we transfer DNA?

RNA-seq

• With a high-throughput tag sequencer, we can also do the brute force approach – fragment all mRNAs in a cell and sequence the pieces (or part of the pieces)

• This is commonly referred to as RNA-seq

Compared to SAGE, CAGE

• Sequence the whole mRNA – not just the end or the start

• Can give connectivity, so that we know which exons are used, and in which isoforms

• Is actually bad at capturing 5’ and 3’ edges, due to statistical issues (white board demo)
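One possible way to see the edge problem (this may not be the argument from the whiteboard demo) is to count how many fragment placements can cover each base. The sketch below assumes fixed-length fragments whose start positions are uniform along the transcript; the function name and the lengths are made up for illustration.

```python
def fragments_covering(position, transcript_length, fragment_length):
    """Number of distinct fragment start positions whose fragment covers `position`.

    Assumption: every fragment has the same length and must lie fully
    inside the transcript (0-based coordinates).
    """
    first_start = max(0, position - fragment_length + 1)
    last_start = min(position, transcript_length - fragment_length)
    return last_start - first_start + 1


# Hypothetical transcript of 20 bases broken into fragments of 5 bases
profile = [fragments_covering(i, 20, 5) for i in range(20)]
print(profile)
```

Under these assumptions the first and last bases can each be covered by only one fragment placement, while bases in the middle can be covered by five, so the edges are sampled less often.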

Typical protocol

Isolate mRNA

Break up mRNAs

Make cDNAs of RNA fragments

Add adapters, amplify and sequence

We sequence 25-35 bp reads… randomly selected from each side of the fragment

Mapping tags

Challenge: What do we get (pros and cons) if we map the tags

a) To the genome

b) To the transcriptome (like all RefSeq transcripts)

Genome: unbiased – we could hit any transcripts. Hard to hit spliced tags, and possibly mRNAs that get modified…

Transcriptome: We hit annotated genes, and splice sites are not a problem. On the other hand, we cannot find new things

Going from tags to wigs

Showing all tags as blocks in the browser is possible, but dumb – because there are potentially thousands in the window of interest, and we go blind

Easy way to summarize is to make nucleotide histograms – whiteboard demo
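The nucleotide histogram idea can be sketched as follows: for each base in a window, count how many mapped tags cover it. This is a minimal illustration, assuming tags are given as half-open (start, end) intervals on a single chromosome; the function name and the coordinates are hypothetical.

```python
def coverage_histogram(tags, window_start, window_end):
    """Per-base tag counts for the window [window_start, window_end)."""
    counts = [0] * (window_end - window_start)
    for tag_start, tag_end in tags:
        # Clip each tag to the window, then add 1 to every base it covers
        lo = max(tag_start, window_start)
        hi = min(tag_end, window_end)
        for pos in range(lo, hi):
            counts[pos - window_start] += 1
    return counts


# Three hypothetical overlapping tags, each 5 bases long
tags = [(100, 105), (102, 107), (104, 109)]
hist = coverage_histogram(tags, 100, 110)
print(hist)
```

The resulting list of per-base counts is the kind of signal a wig track displays, instead of thousands of individual tag blocks.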

Looking at RNA-seq data

• At the tag_analysis web directory, there is a wig file, mm9_brain.wig, showing tags from an RNA-seq experiment on mouse brains. Upload this to the browser and look at the two genes below – are they expressed, and how much?

• Kcnc3

• Hoxa5

Thought challenge: from tags to expression

• We have a wig file showing where all the tags match on the genome

• We have the UCSC annotation for all known genes

• We want something like a microarray, saying
  – Gene X has an expression of Y
  – How can we do this? (2 minutes with your neighbor)

“Naïve solution”

• For each gene, count the tags that overlap it
  – Gene X has 45 tags
  – Gene Y has 4578 tags
  – Etc
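The naive count can be sketched as a simple interval-overlap test. This is only an illustration: it assumes half-open (start, end) coordinates, ignores chromosome and strand, and uses made-up positions.

```python
def count_overlapping_tags(gene, tags):
    """Count tags that share at least one base with the gene interval."""
    gene_start, gene_end = gene
    return sum(
        1
        for tag_start, tag_end in tags
        if tag_start < gene_end and tag_end > gene_start
    )


# Hypothetical gene and tags (half-open intervals)
gene = (1000, 2000)
tags = [(990, 1015), (1500, 1525), (1990, 2015), (2000, 2025), (500, 525)]
n_tags = count_overlapping_tags(gene, tags)
print(n_tags)
```

Tags that only partly overlap the gene are counted here; whether they should be is a design choice the slide leaves open.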

Problems with this?

Length of transcripts will have an effect!

• A long transcript gives more tags when broken up, and can be captured more easily

• So, the number of tags from a transcript depends on
  – Actual expression (number of RNA molecules)
  – Length of the RNAs

Normalizing for length – not that hard

• For each gene, count the tags that overlap it, and divide by gene length
  – Gene X has 45/(length of X) tags
  – Gene Y has 4578/(length of Y) tags
  – Etc
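A small worked example of the length normalization, using the tag counts from the slide. The gene lengths are invented for illustration; they are not the real lengths of any gene.

```python
def tags_per_base(tag_count, gene_length):
    """Length-normalized tag count: tags divided by gene length in bases."""
    return tag_count / gene_length


# Tag counts from the slide; lengths are hypothetical
gene_x = tags_per_base(45, 1500)
gene_y = tags_per_base(4578, 152600)
print(gene_x, gene_y)
```

With these made-up lengths both genes end up with the same tag density, even though gene Y has about a hundred times more raw tags, which is exactly the effect the normalization is meant to remove.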

What if we want to compare two experiments?

We also need to normalize for sample size, just as in SAGE, CAGE and ESTs

• Recap: TPM (tags per million) is a normalization that rescales the tag count to what we would get if the sample had exactly one million tags

• …so, 10^6* (#tags in my gene)/(total tags)
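The TPM formula translates directly into code. The sketch below compares the same raw count in two experiments of different size; the total tag counts are hypothetical.

```python
def tags_per_million(gene_tags, total_tags):
    """Rescale a raw tag count to a sample of exactly one million tags."""
    return 1e6 * gene_tags / total_tags


# Same raw count (45 tags) in two hypothetical experiments of different size
tpm_large_sample = tags_per_million(45, 2_000_000)
tpm_small_sample = tags_per_million(45, 500_000)
print(tpm_large_sample, tpm_small_sample)
```

The same 45 tags mean a four times higher TPM in the smaller sample, which is why raw counts from two experiments cannot be compared directly.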

Combining the two

• Normalize by gene length AND sample size

• Gene X has an expression of
  – Z TPMs/(N)
  – Where N is the RNA length.
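Putting both normalizations together, a minimal sketch could look like this. The function name, tag counts, sample sizes and RNA length are all assumptions made for the example.

```python
def normalized_expression(gene_tags, total_tags, rna_length):
    """TPM divided by RNA length (Z TPMs / N in the slide's notation)."""
    tpm = 1e6 * gene_tags / total_tags
    return tpm / rna_length


# Hypothetical gene of length 1500 measured in two experiments
expr_a = normalized_expression(45, 1_000_000, 1500)
expr_b = normalized_expression(90, 4_000_000, 1500)
print(expr_a, expr_b)
```

In this made-up case experiment B has twice as many raw tags for the gene, yet after normalization its expression is half that of experiment A, because B sequenced four times as many tags in total.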

Summary of tag technologies

• ESTs: old, expensive, long tags. Biased to 5’ and 3’ of genes. Can be used for exploration

• SAGE: 3’ end tags. Only gene expression, no functional data. Limited for exploration

• CAGE/5’SAGE: 5’ end tags. Promoter expression and location. Can be used for exploration

• RNA-seq: “Random” tags over the whole mRNA. Expression and location – can be used for both expression and exploration
