Upload
others
View
15
Download
0
Embed Size (px)
Citation preview
Machine Learning Methods for RNA-seq-based
Transcriptome Reconstruction
Gunnar Ratsch
Friedrich Miescher Laboratory
Max Planck Society, Tubingen, Germany
StatSeq Workshop, Gent (Mai 20, 2010)
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 1 / 28
Motivation
Discovery of the Nuclein(Friedrich Miescher, 1869) fml
Discovery of Nuclein:from lymphocyte & salmon
“multi-basic acid” (≥ 4)
Tubingen, around 1869
“If one . . . wants to assume that a single substance . . . is the specificcause of fertilization, then one should undoubtedly first and foremostconsider nuclein” (Miescher, 1874)
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 2 / 28
Motivation
Discovery of the Nuclein(Friedrich Miescher, 1869) fml
Discovery of Nuclein:from lymphocyte & salmon
“multi-basic acid” (≥ 4)
Tubingen, around 1869
“If one . . . wants to assume that a single substance . . . is the specificcause of fertilization, then one should undoubtedly first and foremostconsider nuclein” (Miescher, 1874)
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 2 / 28
Motivation
Learning about the Transcriptome What is encoded on the genome and how is it processed? fml
Computational Point of View
How to learn to predict what these processes accomplish?
How well can we predict it from the available information?
Biological View
What can we not predict yet? What is missing?
Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 3 / 28
Motivation
Learning about the Transcriptome What is encoded on the genome and how is it processed? fml
Computational Point of View
How to learn to predict what these processes accomplish?
How well can we predict it from the available information?
Biological View
What can we not predict yet? What is missing?
Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 3 / 28
Motivation
Learning about the Transcriptome What is encoded on the genome and how is it processed? fml
Computational Point of View
How to learn to predict what these processes accomplish?
How well can we predict it from the available information?
Biological View
What can we not predict yet? What is missing?
Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 3 / 28
Motivation
Learning about the Transcriptome What is encoded on the genome and how is it processed? fml
Computational Point of View
How to learn to predict what these processes accomplish?
How well can we predict it from the available information?
Biological View
What can we not predict yet? What is missing?
Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 3 / 28
Motivation
Gene-by-Gene vs. Genome-wide Analysesfml
Cloning individual genes during the last 30 years revealedimportant hints to the complex structure of eukaryotic genes
[ENCODE Project Consortium (2004). Science 306:636]
Comprehensive understanding of gene regulation necessarily hasto include the entire organism’s gene collection as well as itsintergenic regions.
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 4 / 28
Motivation
Sequencing Genomesfml
Concerted effort to sequence genomes of model organisms:
E. coli 4.5 Mb (1997)S. cerevisiae 6.0 Mb (1997)C. elegans 98 Mb (1998)Human 3 Gb (2000)
Estimated cost: $2.7 billion in 1991 dollarsEstimated time in 1990: 15 years
In 2003, NHGRI committed to develop next-generationsequencing technologies to lower the cost of 30x a humangenome (∼100 Gbp):
$100,000 genome$1,000 genome
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 5 / 28
Motivation
Sequencing Genomesfml
Concerted effort to sequence genomes of model organisms:
E. coli 4.5 Mb (1997)S. cerevisiae 6.0 Mb (1997)C. elegans 98 Mb (1998)Human 3 Gb (2000)
Estimated cost: $2.7 billion in 1991 dollarsEstimated time in 1990: 15 years
In 2003, NHGRI committed to develop next-generationsequencing technologies to lower the cost of 30x a humangenome (∼100 Gbp):
$100,000 genome$1,000 genome
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 5 / 28
Motivation
New Sequencing Technologiesfml
short many
long few
SoLID (2007)
Genome Analyzer (2007)
454 Sequencing (2005)
SMRT (2010)
[Slide provided by Ali Mortazavi]
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 6 / 28
Motivation
The Encyclopedia of DNA Elementsfml
How to make sense of the genome?
Which nucleotides are functional?What is their function?
National Human Genome Research Institute started the Encodeproject to annotate human and model organism genomes:
2004: Human 1%2007: Human whole-genome2008: Drosophila and C. elegans2010: Mouse
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 7 / 28
Motivation
The Encyclopedia of DNA Elementsfml
How to make sense of the genome?
Which nucleotides are functional?What is their function?
National Human Genome Research Institute started the Encodeproject to annotate human and model organism genomes:
2004: Human 1%2007: Human whole-genome2008: Drosophila and C. elegans2010: Mouse
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 7 / 28
Motivation
Sequencing for Functional Genomicsfml
[Wold & Myers, Nature Methods, 2007]
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 8 / 28
Motivation
contiguous reads
splice-crossing reads
enriched regions
novel transfrags
density on known exons
Map reads
Aggregate and identify
Analyze
binding sources
motif finding
associated genes
novel gene models
novel splice isoforms
expression levels
differential expression
ChIP-seq RNA-seq discovery
RNA-seq quantification
RNA-seq, ChIP-seq, and external data Integrate
de novo transcript assembly
Info
rmat
ion
extra
ctio
n
[Pep
keet
al.,
2009
]
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 9 / 28
RNA-Seq Pipeline
Deep RNA Sequencing (RNA-Seq)fml
RNA-Seq allows . . .
High-throughputtranscriptomemeasurements
Qualitative studies
Quantitative studies athigh resolution
pre-mRNAintron
exon
mRNAshort reads
junction reads
reference genomeFigure adapted from Wikipedia
Goal: Obtain complete transcriptome for further analyses
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 10 / 28
RNA-Seq Pipeline
Deep RNA Sequencing (RNA-Seq)fml
RNA-Seq allows . . .
High-throughputtranscriptomemeasurements
Qualitative studies
Quantitative studies athigh resolution
pre-mRNAintron
exon
mRNAshort reads
junction reads
reference genomeFigure adapted from Wikipedia
Goal: Obtain complete transcriptome for further analyses
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 10 / 28
RNA-Seq Pipeline
RNA-Seq Pipeline Overviewfml
PALMapper
mGene.ngs
mTim Segmentation
rQuant
Short Reads Transcripts w/ QuantitationAlignm
ents
Transcripts
Read alignment Transcript finding Quantitation
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 11 / 28
RNA-Seq Pipeline
RNA-Seq Pipeline Overviewfml
PALMapper
mGene.ngs
mTim Segmentation
rQuant
Short Reads Transcripts w/ QuantitationAlignm
ents
Transcripts
Read alignment Transcript finding Quantitation
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 11 / 28
PALMapper Read Alignment Overview
Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:
Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]
k-mer based index, well suited for smaller genomes
More info: http://fml.mpg.de/raetsch/suppl/palmapper
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 12 / 28
PALMapper Read Alignment Overview
Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:
Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]
k-mer based index, well suited for smaller genomes
ACCGTCGCGCGCGT...TCGGCG...AGAACGCT
TCGCGCGCAACGTCGC CGCG GCGC CGCG GCGC CGCA GCAA CAAC AACG
...matchingk-mers
DNA
Read
k-mers
...
More info: http://fml.mpg.de/raetsch/suppl/palmapper
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 12 / 28
PALMapper Read Alignment Overview
Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:
Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]
k-mer based index, well suited for smaller genomes
ACCGTCGCGCGCGT...TCGGCG...AGAACGCT
TCGCGCGCAACGTCGC CGCG GCGC CGCG GCGC CGCA GCAA CAAC AACG
...matchingk-mers
DNA
Read
k-mers
...
More info: http://fml.mpg.de/raetsch/suppl/palmapper
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 12 / 28
PALMapper Read Alignment Overview
Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:
Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]
k-mer based index, well suited for smaller genomes
QPALMA for spliced read alignments:
GenomeMapper identifies seed regions
Spliced alignments by QPALMA [De Bona et al., 2008]
ACCGTCGCGCGCGT...TCGGCG...AGAACGCT
TCGCGCGCAACG
DNA
ReadMore info: http://fml.mpg.de/raetsch/suppl/palmapper
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 12 / 28
PALMapper Read Alignment Overview
QPALMA: Extended Smith-Waterman Scoringfml
Classical scoring f : Σ× Σ→ R
Source of information
Sequence matches
Computational splicesite predictions
Intron length model
Read qualityinformation
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 13 / 28
PALMapper Read Alignment Overview
QPALMA: Extended Smith-Waterman Scoringfml
Classical scoring f : Σ× Σ→ R
Source of information
Sequence matches
Computational splicesite predictions
Intron length model
Read qualityinformation
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 13 / 28
PALMapper Read Alignment Overview
QPALMA: Extended Smith-Waterman Scoringfml
Classical scoring f : Σ× Σ→ R
Source of information
Sequence matches
Computational splicesite predictions
Intron length model
Read qualityinformation
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 13 / 28
PALMapper Read Alignment Overview
QPALMA: Extended Smith-Waterman Scoringfml
Quality scoring f : (Σ× R)× Σ→ R [De Bona et al., 2008]
Source of information
Sequence matches
Computational splicesite predictions
Intron length model
Read qualityinformation
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 13 / 28
PALMapper Read Alignment Overview
QPALMA RNA-Seq Read Alignmentfml
Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing
Alig
nmen
t Er
ror
Rat
e
SmithW14.19%
Intron9.96% 1.94% 1.78%
Intron+Splice
Intron+Splice+Quality [De Bona et al., 2008]
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 14 / 28
PALMapper Read Alignment Overview
QPALMA RNA-Seq Read Alignmentfml
Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing
Alig
nmen
t Er
ror
Rat
e
SmithW14.19%
Intron9.96% 1.94% 1.78%
Intron+Splice
Intron+Splice+Quality
Error vs. intron position
[De Bona et al., 2008]
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 14 / 28
PALMapper Read Alignment Overview
QPALMA RNA-Seq Read Alignmentfml
Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing
Alig
nmen
t Er
ror
Rat
e
SmithW14.19%
Intron9.96% 1.94% 1.78%
Intron+Splice
Intron+Splice+Quality
8nt12nt12nt8ntError vs. intron position
[De Bona et al., 2008]
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 14 / 28
PALMapper Read Alignment Overview
Step 2: Transcript Predictionfml
PALMapper
mGene.ngs
mTim Segmentation
rQuant
Short Reads Transcripts w/ QuantitationAlignm
ents
Transcripts
Read alignment Transcript finding Quantitation
A. Coverage segmentation algorithm mTiM for general transcripts(no coding bias/assumption)
B. Extension of the mGene gene finding system to use NGS datafor protein coding transcript prediction (mGene.ngs)
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 15 / 28
Next Generation Gene Finding Idea
Computational Gene Finding Labeling the Genome fml
DNA
pre-mRNA
mRNA
Protein
polyAcap
TSS
SpliceDonor
SpliceAcceptor
SpliceDonor
SpliceAcceptor
TIS Stop
polyA/cleavage
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 16 / 28
Next Generation Gene Finding Idea
Computational Gene Finding Labeling the Genome fml
DNA
pre-mRNA
mRNA
Protein
polyAcap
TSS Donor Acceptor Donor Acceptor
TIS Stop
polyA/cleavage
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 16 / 28
Next Generation Gene Finding Idea
Computational Gene Finding Labeling the Genome fml
DNA
pre-mRNA
mRNA
Protein
polyAcap
TSS Donor Acceptor Donor Acceptor
TIS Stop
polyA/cleavage
TSS TIS cleaveStop
Don Acc
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 16 / 28
Next Generation Gene Finding Idea
mGene-based Transcript Predictionfml
acc
don
tss
tis
stop
True gene model 2 3 4 5
STEP 1: SVM Signal Predictions
genomic position
genomic position
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 17 / 28
Next Generation Gene Finding Idea
mGene-based Transcript Predictionfml
acc
don
tss
tis
stop
True gene model 2 3 4 5
F(x,y)
transform features
STEP 1: SVM Signal Predictions
STEP 2: Integration
genomic position
genomic position
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 17 / 28
Next Generation Gene Finding Idea
mGene-based Transcript Predictionfml
acc
don
tss
tis
stop
True gene model 2 3 4 5
Wrong gene model
large margin
F(x,y)
transform features
STEP 1: SVM Signal Predictions
STEP 2: Integration
genomic position
genomic position
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 17 / 28
Next Generation Gene Finding Idea
mGene-based Transcript Predictionfml
acc
don
tss
tis
stop
True gene model 2 3 4 5
Wrong gene model
large margin
F(x,y)
STEP 1: SVM Signal Predictions
STEP 2: Integration
genomic position
genomic position
transform features
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 17 / 28
Next Generation Gene Finding Modeling Uncertainty
Learning to use Expression Measurementsfml
Two approaches:
Heuristic to incorporate ESTs/reads/tiling array measurementsto refine predictions
Directly use evidence during learning to learn to appropriatelyweight its importance
Exon Level Transcript LevelSN SP F SN SP F
ab initio 82.3 82.6 82.5 43.1 49.5 46.1ESTs heuristic 85.3 84.7 85.0 49.5 56.4 52.7ESTs trained 84.8 85.8 85.3 50.5 57.8 53.9RNA-Seq trainedRNA-Seq/ESTs trained
Gene prediction in C. elegans (CDS evaluation)
Behr et al., in prep., 2010
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 18 / 28
Next Generation Gene Finding Modeling Uncertainty
Learning to use Expression Measurementsfml
Two approaches:
Heuristic to incorporate ESTs/reads/tiling array measurementsto refine predictions
Directly use evidence during learning to learn to appropriatelyweight its importance
Exon Level Transcript LevelSN SP F SN SP F
ab initio 82.3 82.6 82.5 43.1 49.5 46.1ESTs heuristic 85.3 84.7 85.0 49.5 56.4 52.7ESTs trained 84.8 85.8 85.3 50.5 57.8 53.9RNA-Seq trained 84.6 84.9 84.8 49.1 55.2 52.0RNA-Seq/ESTs trained 84.7 86.9 85.8 50.3 60.5 54.9
Gene prediction in C. elegans (CDS evaluation)
Behr et al., in prep., 2010
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 18 / 28
Next Generation Gene Finding Results
Preliminary Evaluation (C. elegans)fml
CDS (precision+recall)/2
expression percentiles [%]
mGene ab initiomGene.ngs
10 20 30 40 50 60 70 80 90 1000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 19 / 28
Next Generation Gene Finding Results
Preliminary Evaluation (C. elegans)fml
CDS (precision+recall)/2
expression percentiles [%]
mTiMmGene ab initiomGene.ngs
10 20 30 40 50 60 70 80 90 1000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 19 / 28
Next Generation Gene Finding Alternative Transcripts
Digestionfml
mTiM and mGene.ngs predict single transcripts
mTiM exploits “uniformity” of read coverage among exons ofsame transcript
mGene.ngs uses more assumptions on structure of transcripts
Alternative Transcripts: Spliced reads for splicing graphs:
Paths through splicing graph define alternative transcripts
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28
Next Generation Gene Finding Alternative Transcripts
Digestionfml
mTiM and mGene.ngs predict single transcripts
mTiM exploits “uniformity” of read coverage among exons ofsame transcript
mGene.ngs uses more assumptions on structure of transcripts
Alternative Transcripts: Spliced reads for splicing graphs:
Paths through splicing graph define alternative transcripts
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28
Next Generation Gene Finding Alternative Transcripts
Digestionfml
mTiM and mGene.ngs predict single transcripts
mTiM exploits “uniformity” of read coverage among exons ofsame transcript
mGene.ngs uses more assumptions on structure of transcripts
Alternative Transcripts: Spliced reads for splicing graphs:
Transcript prediction(result of mGene/mTIM)
Spliced reads (result of Palmapper)
Paths through splicing graph define alternative transcripts
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28
Next Generation Gene Finding Alternative Transcripts
Digestionfml
mTiM and mGene.ngs predict single transcripts
mTiM exploits “uniformity” of read coverage among exons ofsame transcript
mGene.ngs uses more assumptions on structure of transcripts
Alternative Transcripts: Spliced reads for splicing graphs:
Transcript prediction(result of mGene/mTIM)
Spliced reads (result of Palmapper)
Infered edges
Paths through splicing graph define alternative transcripts
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28
Next Generation Gene Finding Alternative Transcripts
Digestionfml
mTiM and mGene.ngs predict single transcripts
mTiM exploits “uniformity” of read coverage among exons ofsame transcript
mGene.ngs uses more assumptions on structure of transcripts
Alternative Transcripts: Spliced reads for splicing graphs:
Transcript prediction(result of mGene/mTIM)
Spliced reads (result of Palmapper)
Infered edges
Paths through splicing graph define alternative transcripts
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28
Next Generation Gene Finding Alternative Transcripts
Prediction of Alternative Transcriptsfml
Combine gene finding with prediction of alternative splicingMachine learning challenge:
Input: DNA sequenceOutput: All possible transcripts (not just one)
First step: Predict “simple splicing graphs”
Conceptually works . . .But, only with clean data we train the model!
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 21 / 28
Next Generation Gene Finding Alternative Transcripts
Prediction of Alternative Transcriptsfml
Combine gene finding with prediction of alternative splicing
Machine learning challenge:
Input: DNA sequenceOutput: All possible transcripts (not just one)
First step: Predict “simple splicing graphs”
TSS TIS cleaveStop
Don Acc
A-Don A-AccExonskip
Conceptually works . . .
But, only with clean data we train the model!
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 21 / 28
Next Generation Gene Finding Alternative Transcripts
Prediction of Alternative Transcriptsfml
Combine gene finding with prediction of alternative splicing
Machine learning challenge:
Input: DNA sequenceOutput: All possible transcripts (not just one)
First step: Predict “simple splicing graphs”
TSS TIS cleaveStop
Don Acc
A-Don A-AccExonskip
Conceptually works . . .
But, only with clean data we train the model!
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 21 / 28
Next Generation Gene Finding Alternative Transcripts
Prediction of Alternative Transcriptsfml
Combine gene finding with prediction of alternative splicing
Machine learning challenge:
Input: DNA sequenceOutput: All possible transcripts (not just one)
First step: Predict “simple splicing graphs”
TSS TIS cleaveStop
Don Acc
A-Don A-AccExonskip
Conceptually works . . .
But, only with clean data we train the model!
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 21 / 28
Transcript Quantitation with rQuant
RNA-Seq Pipeline Overviewfml
PALMapper
mGene.ngs
mTim Segmentation
rQuant
Short Reads Transcripts w/ QuantitationAlignm
ents
Transcripts
Read alignment Transcript finding Quantitation
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 22 / 28
Transcript Quantitation with rQuant Biases
RNA-Seq Biases and Quantitationfml
Sequencer short sequence reads
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
mRNA
fragmentation
sequence library
RT
fragmen-tation
RT
exonic reads
genome
transcripts
junction reads
aligned reads
Biases due to . . .
cDNA library construction
Sequencing
Read mapping
relative transcript position 5’ -> 3’
read
cov
erag
e 0
20
40
60
80
(average over annotated
transcripts of length ≈1kb for the
C. elegans SRX001872 dataset)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 23 / 28
Transcript Quantitation with rQuant Biases
RNA-Seq Biases and Quantitationfml
Sequencer short sequence reads
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
AAAAAAA
mRNA
fragmentation
sequence library
RT
fragmen-tation
RT
exonic reads
genome
transcripts
junction reads
aligned reads
Biases due to . . .
cDNA library construction
Sequencing
Read mapping
relative transcript position 5’ -> 3’
read
cov
erag
e 0
20
40
60
80
(average over annotated
transcripts of length ≈1kb for the
C. elegans SRX001872 dataset)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 23 / 28
Transcript Quantitation with rQuant Approach
rQuant – Basic Ideafml
A
read
cov
erag
e
Short transcript
relative transcript position 5’ -> 3’
Mi = wAAi + wBBi ⇒ minwA,wB
∑i ` (Mi , Ri )
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28
Transcript Quantitation with rQuant Approach
rQuant – Basic Ideafml
A
B
read
cov
erag
e
Short transcript
relative transcript position 5’ -> 3’
read
cov
erag
e
Long transcript
relative transcript position 5’ -> 3’
Mi = wAAi + wBBi ⇒ minwA,wB
∑i ` (Mi , Ri )
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28
Transcript Quantitation with rQuant Approach
rQuant – Basic Ideafml
read
cov
erag
e
genome position 5’ -> 3’
Short transcript
read
cov
erag
e
genome position 5’ -> 3’
Long transcript
A
B
Mi = wAAi + wBBi ⇒ minwA,wB
∑i ` (Mi , Ri )
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28
Transcript Quantitation with rQuant Approach
rQuant – Basic Ideafml
read
cov
erag
e
genome position 5’ -> 3’
Mixture of transcripts
read
cov
erag
e
genome position 5’ -> 3’
Short transcript
read
cov
erag
e
genome position 5’ -> 3’
Long transcript
MA
B
Mi = wAAi + wBBi ⇒ minwA,wB
∑i ` (Mi , Ri )
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28
Transcript Quantitation with rQuant Approach
rQuant – Basic Ideafml
read
cov
erag
e
genome position 5’ -> 3’
Mixture of transcripts
expected
observed
read
cov
erag
e
genome position 5’ -> 3’
Short transcript
read
cov
erag
e
genome position 5’ -> 3’
Long transcript
MA
B R
Mi = wAAi + wBBi ⇒ minwA,wB
∑i ` (Mi , Ri )
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28
Transcript Quantitation with rQuant Approach
Galaxy-based Web Services for NGS Analysesfml
Galaxy-based web service http://galaxy.fml.mpg.de
PALMapper http://fml.mpg.de/raetsch/suppl/palmapper
mGene http://mgene.org/web
mTIM http://fml.mpg.de/raetsch/suppl/mtim (in prep.)
rQuant http://fml.mpg.de/raetsch/suppl/rquant/web
[Ratsch et al., in preparation, 2010]
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 25 / 28
Transcript Quantitation with rQuant Approach
contiguous reads
splice-crossing reads
enriched regions
novel transfrags
density on known exons
Map reads
Aggregate and identify
Analyze
binding sources
motif finding
associated genes
novel gene models
novel splice isoforms
expression levels
differential expression
ChIP-seq RNA-seq discovery
RNA-seq quantification
RNA-seq, ChIP-seq, and external data Integrate
de novo transcript assembly
Info
rmat
ion
extra
ctio
n
[Pep
keet
al.,
2009
]
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 26 / 28
Transcript Quantitation with rQuant Approach
Gene Finding + Auxiliary Data[Behr et al., 2009] fml
Is there other information that may be used by cellular processes thatcan improve prediction results?
Preliminary study: (A. thaliana) transcript level (SN + SP)/2
1 mGene (ab initio) . . . 73.3%
2 . . . with DNA methylation (1 tissue) 76.1%
3 . . . with Nucleosome position predictions 78.0%
4 . . . with RNA secondary structure predictions 76.7%
In progress:
Study effect of other information sources for gene prediction
Ideally, consider measurements from the same sample
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28
Transcript Quantitation with rQuant Approach
Gene Finding + Auxiliary Data[Behr et al., 2009] fml
Is there other information that may be used by cellular processes thatcan improve prediction results?
Preliminary study: (A. thaliana) transcript level (SN + SP)/2
1 mGene (ab initio) . . . 73.3%
2 . . . with DNA methylation (1 tissue) 76.1%
3 . . . with Nucleosome position predictions 78.0%
4 . . . with RNA secondary structure predictions 76.7%
In progress:
Study effect of other information sources for gene prediction
Ideally, consider measurements from the same sample
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28
Transcript Quantitation with rQuant Approach
Gene Finding + Auxiliary Data[Behr et al., 2009] fml
Is there other information that may be used by cellular processes thatcan improve prediction results?
Preliminary study: (A. thaliana) transcript level (SN + SP)/2
1 mGene (ab initio) . . . 73.3%
2 . . . with DNA methylation (1 tissue) 76.1%
3 . . . with Nucleosome position predictions 78.0%
4 . . . with RNA secondary structure predictions 76.7%
In progress:
Study effect of other information sources for gene prediction
Ideally, consider measurements from the same sample
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28
Transcript Quantitation with rQuant Approach
Gene Finding + Auxiliary Data[Behr et al., 2009] fml
Is there other information that may be used by cellular processes thatcan improve prediction results?
Preliminary study: (A. thaliana) transcript level (SN + SP)/2
1 mGene (ab initio) . . . 73.3%
2 . . . with DNA methylation (1 tissue) 76.1%
3 . . . with Nucleosome position predictions 78.0%
4 . . . with RNA secondary structure predictions 76.7%
In progress:
Study effect of other information sources for gene prediction
Ideally, consider measurements from the same sample
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28
Transcript Quantitation with rQuant Approach
Gene Finding + Auxiliary Data[Behr et al., 2009] fml
Is there other information that may be used by cellular processes thatcan improve prediction results?
Preliminary study: (A. thaliana) transcript level (SN + SP)/2
1 mGene (ab initio) . . . 73.3%
2 . . . with DNA methylation (1 tissue) 76.1%
3 . . . with Nucleosome position predictions 78.0%
4 . . . with RNA secondary structure predictions 76.7%
In progress:
Study effect of other information sources for gene prediction
Ideally, consider measurements from the same sample
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28
Summary
Summary and Conclusionsfml
Methods:
1 PALMapper: Splice site predictions improve alignments
2 mTiM: Identifies transcripts specific to experiment
3 mGene: Best for annotation; also finds lowly expressed genes
4 rQuant: Models library prep., sequencing, alignment biases
Unsolved problems:
1 Better prediction of alternative transcripts (without RNA-seq?)
2 Differential expression detection [Stegle et al., 2010]
3 Multi-modal modelling to handle the many different ways genesare processed
4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28
Summary
Summary and Conclusionsfml
Methods:
1 PALMapper: Splice site predictions improve alignments
2 mTiM: Identifies transcripts specific to experiment
3 mGene: Best for annotation; also finds lowly expressed genes
4 rQuant: Models library prep., sequencing, alignment biases
Unsolved problems:
1 Better prediction of alternative transcripts (without RNA-seq?)
2 Differential expression detection [Stegle et al., 2010]
3 Multi-modal modelling to handle the many different ways genesare processed
4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28
Summary
Summary and Conclusionsfml
Methods:
1 PALMapper: Splice site predictions improve alignments
2 mTiM: Identifies transcripts specific to experiment
3 mGene: Best for annotation; also finds lowly expressed genes
4 rQuant: Models library prep., sequencing, alignment biases
Unsolved problems:
1 Better prediction of alternative transcripts (without RNA-seq?)
2 Differential expression detection [Stegle et al., 2010]
3 Multi-modal modelling to handle the many different ways genesare processed
4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28
Summary
Summary and Conclusionsfml
Methods:
1 PALMapper: Splice site predictions improve alignments
2 mTiM: Identifies transcripts specific to experiment
3 mGene: Best for annotation; also finds lowly expressed genes
4 rQuant: Models library prep., sequencing, alignment biases
Unsolved problems:
1 Better prediction of alternative transcripts (without RNA-seq?)
2 Differential expression detection [Stegle et al., 2010]
3 Multi-modal modelling to handle the many different ways genesare processed
4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28
Summary
Summary and Conclusionsfml
Methods:
1 PALMapper: Splice site predictions improve alignments
2 mTiM: Identifies transcripts specific to experiment
3 mGene: Best for annotation; also finds lowly expressed genes
4 rQuant: Models library prep., sequencing, alignment biases
Unsolved problems:
1 Better prediction of alternative transcripts (without RNA-seq?)
2 Differential expression detection [Stegle et al., 2010]
3 Multi-modal modelling to handle the many different ways genesare processed
4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28
Summary
Acknowledgementsfml
Fabio De Bona
Alignments
Jonas Behr
Gene finding
Georg Zeller
Segmentation
Regina Bohnert
Quantitation
Help with Slides
Jonas Behr (FML)
Georg Zeller (EMBL)
Regina Bohnert (FML)
Ali Mortazavi (Caltech)
Funding by DFG &Max Planck Society.
Thank you for your attention!
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 29 / 28
Summary
Acknowledgementsfml
Fabio De Bona
Alignments
Jonas Behr
Gene finding
Georg Zeller
Segmentation
Regina Bohnert
Quantitation
Help with Slides
Jonas Behr (FML)
Georg Zeller (EMBL)
Regina Bohnert (FML)
Ali Mortazavi (Caltech)
Funding by DFG &Max Planck Society.
Thank you for your attention!
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 29 / 28
Summary
rQuant – Iterative Algorithmfml
1 Optimise transcript weights: minw
∑i `(∑
t w (t)p(t)i , Ri
)2 Optimise profile weights: minp
∑i `(∑
t w (t)p(t)i , Ri
)3 Repeat 1. and 2. until convergence.
gene AT1G01240chromosome 1, forward strand
99,960 100,368 100,776 101,184 101,592
70
60
50
40
30
20
10
0
read
cou
nts
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28
Summary
rQuant – Iterative Algorithmfml
1 Optimise transcript weights: minw
∑i `(∑
t w (t)p(t)i , Ri
)2 Optimise profile weights: minp
∑i `(∑
t w (t)p(t)i , Ri
)3 Repeat 1. and 2. until convergence.
gene AT1G01240chromosome 1, forward strand
AT1G01240.1 +110 1,178
625
99,960 100,368 100,776 101,184 101,592
27.20
70
60
50
40
30
20
10
0
read
cou
nts
27.20
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28
Summary
rQuant – Iterative Algorithmfml
1 Optimise transcript weights: minw
∑i `(∑
t w (t)p(t)i , Ri
)2 Optimise profile weights: minp
∑i `(∑
t w (t)p(t)i , Ri
)3 Repeat 1. and 2. until convergence.
gene AT1G01240chromosome 1, forward strand
AT1G01240.1 +110 1,178
625
AT1G01240.2 +190 1,178
436
99,960 100,368 100,776 101,184 101,592
4.80
27.20
4.80
70
60
50
40
30
20
10
0
read
cou
nts
27.20
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28
Summary
rQuant – Iterative Algorithmfml
1 Optimise transcript weights: minw
∑i `(∑
t w (t)p(t)i , Ri
)2 Optimise profile weights: minp
∑i `(∑
t w (t)p(t)i , Ri
)3 Repeat 1. and 2. until convergence.
gene AT1G01240chromosome 1, forward strand
AT1G01240.1 +110 1,178
625
AT1G01240.2 +190 1,178
436
AT1G01240.3 +1,941
99,960 100,368 100,776 101,184 101,592
6.00
4.80
27.20
6.004.80
70
60
50
40
30
20
10
0
read
cou
nts
27.20
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28
Summary
rQuant – Iterative Algorithmfml
1 Optimise transcript weights: minw
∑i `(∑
t w (t)p(t)i , Ri
)2 Optimise profile weights: minp
∑i `(∑
t w (t)p(t)i , Ri
)3 Repeat 1. and 2. until convergence.
gene AT1G01240chromosome 1, forward strand
AT1G01240.1 +110 1,178
625
AT1G01240.2 +190 1,178
436
AT1G01240.3 +1,941
99,960 100,368 100,776 101,184 101,592
6.00
4.80
27.20
6.004.80
70
60
50
40
30
20
10
0
read
cou
nts
27.20
38.00
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28
Summary
rQuant – Iterative Algorithmfml
1 Optimise transcript weights: minw
∑i `(∑
t w (t)p(t)i , Ri
)2 Optimise profile weights: minp
∑i `(∑
t w (t)p(t)i , Ri
)3 Repeat 1. and 2. until convergence.
distance to transcript start/end
pro!
le w
eigh
t
0 50 200 450 800 1250 1800 2450 "2450 "1800 "1250 "800 "450 "200 "50 0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
gene AT1G01240chromosome 1, forward strand
AT1G01240.1 +110 1,178
625
AT1G01240.2 +190 1,178
436
AT1G01240.3 +1,941
99,960 100,368 100,776 101,184 101,592
6.00
4.80
27.20
6.004.80
70
60
50
40
30
20
10
0
read
cou
nts
27.20
38.00
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28
Summary
rQuant – Iterative Algorithmfml
1 Optimise transcript weights: minw
∑i `(∑
t w (t)p(t)i , Ri
)2 Optimise profile weights: minp
∑i `(∑
t w (t)p(t)i , Ri
)3 Repeat 1. and 2. until convergence.
distance to transcript start/end
pro!
le w
eigh
t
0 50 200 450 800 1250 1800 2450 "2450 "1800 "1250 "800 "450 "200 "50 0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
gene AT1G01240chromosome 1, forward strand
AT1G01240.1 +110 1,178
625
AT1G01240.2 +190 1,178
436
AT1G01240.3 +1,941
99,960 100,368 100,776 101,184 101,592
4.95
8.35
23.99
70
60
50
40
30
20
10
0
read
cou
nts
8.354.95
23.99
37.29
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28
Summary Results
rQuant Evaluation Ifml
rQuant: Position-wise with profiles (estimating library and mapping bias)
compared to
Position-wise, without profiles
Segment-wise, without profiles (e.g., Jiang and Wong [2009] )
Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])
Estimate transcript abundances
Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])
Subset of alternatively spliced genes
Evaluation: Spearman correlation between
Simulated RNA expression level and
Predicted transcript weights
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28
Summary Results
rQuant Evaluation Ifml
rQuant: Position-wise with profiles (estimating library and mapping bias)
compared to
Position-wise, without profiles
Segment-wise, without profiles (e.g., Jiang and Wong [2009] )
Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])
Estimate transcript abundances
Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])
Subset of alternatively spliced genes
Evaluation: Spearman correlation between
Simulated RNA expression level and
Predicted transcript weights
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28
Summary Results
rQuant Evaluation Ifml
rQuant: Position-wise with profiles (estimating library and mapping bias)
compared to
Position-wise, without profiles
Segment-wise, without profiles (e.g., Jiang and Wong [2009] )
Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])
Estimate transcript abundances
Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])
Subset of alternatively spliced genes
Evaluation: Spearman correlation between
Simulated RNA expression level and
Predicted transcript weights
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28
Summary Results
rQuant Evaluation Ifml
rQuant: Position-wise with profiles (estimating library and mapping bias)
compared to
Position-wise, without profiles
Segment-wise, without profiles (e.g., Jiang and Wong [2009] )
Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])
Estimate transcript abundances
Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])
Subset of alternatively spliced genes
Evaluation: Spearman correlation between
Simulated RNA expression level and
Predicted transcript weights
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28
Summary Results
rQuant Evaluation Ifml
rQuant: Position-wise with profiles (estimating library and mapping bias)
compared to
Position-wise, without profiles
Segment-wise, without profiles (e.g., Jiang and Wong [2009] )
Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])
Estimate transcript abundances
Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])
Subset of alternatively spliced genes
Evaluation: Spearman correlation between
Simulated RNA expression level and
Predicted transcript weights
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28
Summary Results
rQuant Evaluation Ifml
rQuant: Position-wise with profiles (estimating library and mapping bias)
compared to
Position-wise, without profiles
Segment-wise, without profiles (e.g., Jiang and Wong [2009] )
Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])
Estimate transcript abundances
Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])
Subset of alternatively spliced genes
Evaluation: Spearman correlation between
Simulated RNA expression level and
Predicted transcript weights
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28
Summary Results
rQuant Evaluation IIfml
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Spea
rman
Cor
rela
tion
Across transcripts
Within genes (mean)
Segment-wise, w/o pro!les
(Bohnert et al., submitted, 2010)
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 32 / 28
Summary Results
rQuant Evaluation IIfml
Position-wise, w/o pro!les
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Spea
rman
Cor
rela
tion
Across transcripts
Within genes (mean)
Segment-wise, w/o pro!les
Segment-wise, w/ pro!les
(Bohnert et al., submitted, 2010)
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 32 / 28
Summary Results
rQuant Evaluation IIfml
rQuantPosition-wise,
w/ pro!lesPosition-wise, w/o pro!les
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Spea
rman
Cor
rela
tion
Across transcripts
Within genes (mean)
Segment-wise, w/o pro!les
Segment-wise, w/ pro!les
(Bohnert et al., submitted, 2010)
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 32 / 28
Summary Results
References I
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski,K. Schneeberger, D. Weigel, and G. Ratsch. Rna-seq and tiling arrays for improved genefinding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URLhttp:
//www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski,K. Schneeberger, D. Weigel, and G. Ratsch. Rna-seq and tiling arrays for improved genefinding. Presented at the CSHL Genome Informatics Meeting, July 2009.
RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu,G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Scholkopf, M Nordborg, G Ratsch,JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity inarabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi:10.1126/science.1138632.
F. De Bona, S. Ossowski, K. Schneeberger, and G. Ratsch. Qpalma: Optimal splicedalignments of short sequence reads. Bioinformatics, 24:i174–i180, 2008.
Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-Seq.Bioinformatics, 25(8):1026–1032, April 2009.
Shirley Pepke, Barbara Wold, and Ali Mortazavi. Computation for chip-seq and rna-seq studies.Nat Methods, 6(11 Suppl):S22–32, Nov 2009. doi: 10.1038/nmeth.1371.
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 33 / 28
Summary Results
References II
G. Ratsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. InK. Tsuda B. Schoelkopf and J.-P. Vert, editors, Kernel Methods in Computational Biology.MIT Press, 2004.
G. Ratsch, S. Sonnenburg, and B. Scholkopf. RASE: recognition of alternatively spliced exonsin C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
M. Sammeth. The Flux Capacitor. Website, 2009a. http://flux.sammeth.net/capacitor.html.
M. Sammeth. The Flux Simulator. Website, 2009b. http://flux.sammeth.net/simulator.html.
Korbinian Schneeberger, Jorg Hagmann, Stephan Ossowski, Norman Warthmann, SandraGesing, Oliver Kohlbacher, and Detlef Weigel. Simultaneous alignment of short readsagainst multiple genomes. Genome Biol, 10(9):R98, Jan 2009. doi:10.1186/gb-2009-10-9-r98. URL http://genomebiology.com/2009/10/9/R98.
Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich,Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Kruger,Soren Sonnenburg, and Gunnar Ratsch. mgene: Accurate svm-based gene finding with anapplication to nematode genomes. Genome Research, 2009. URLhttp://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html.Advance access June 29, 2009.
S. Sonnenburg, G. Ratsch, A. Jagota, and K.-R. Muller. New methods for splice-siterecognition. In Proc. International Conference on Artificial Neural Networks, 2002.
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 34 / 28
Summary Results
References III
Soren Sonnenburg, Alexander Zien, and Gunnar Ratsch. ARTS: Accurate Recognition ofTranscription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
O. Stegle, P. Drewe, R. Bohnert, K. Borgwardt, and G. Ratsch. Statistical tests for detectingdifferential rna-transcript expression from read counts. Technical report, Nature Preceedings,2010.
G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Ratsch. Detectingpolymorphic regions in arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
A. Zien, G. Ratsch, S. Mika, B. Scholkopf, T. Lengauer, and K.-R. Muller. Engineering SupportVector Machine Kernels That Recognize Translation Initiation Sites. BioInformatics, 16(9):799–807, September 2000.
c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 35 / 28