Talk in the gene discovery session at PAG XXII (https://pag.confex.com/pag/xxii/webprogram/Session2128.html). Joint work with Jonas Behr, Gabriele Schweikert, Andre Kahles and others. Abstract: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and transcripts. However, the immense dynamic range of gene expression, limitations and biases of the sequencing technology, as well as the observed complexity of the transcriptional landscape pose profound computational challenges. We discuss several of these challenges and, based on illustrative simulation examples, identify the limits of state-of-the-art tools in reconstructing multiple alternative transcripts even if sufficient information is provided. We propose a novel framework, called MiTie, for simultaneous transcript reconstruction and quantification based on combinatorial optimization. We use the negative binomial distribution to define a likelihood function and use a regularization approach to select a small number of transcripts quantitatively explaining the observed read data. We show that the resulting regularized maximum likelihood problem can be formulated as a mixed integer programming problem (MIP) which can be solved optimally using standard optimization approaches. We will also describe an extension of the discriminative gene finding system mGene that takes advantage of RNA-seq reads. We demonstrate that the extended system mGene.ngs predicts transcript annotations significantly more accurately when using RNA-seq data, and also more accurately than transcriptome reconstruction tools that are based solely on RNA-seq data. Finally, we illustrate how a combination of gene finding and transcriptome reconstruction methods like MiTie can be used to accurately annotate newly sequenced genomes without prior annotations.
RNA-Seq-based Genome Annotation
using mGene.ngs and MiTie
Gunnar Ratsch
Biomedical Data Science Group
Computational Biology Center
Memorial Sloan-Kettering Cancer Center
@gxr #mGene #MiTie #PAGXXII
Acknowledgements and Disclosures
Main contributors: Jonas Behr, Gabriele Schweikert, Andre Kahles
Funding
Financial interest disclosure
© Gunnar Ratsch (cBio@MSKCC) RNA-Seq-based Annotation using mGene.ngs and MiTie @ PAG XXII Gene Discovery Workshop
Genome Annotation Pipeline(s)
Proposed new gene finding method (mGene.ngs) for reannotation of 19 A. thaliana genomes (and genome assembly + analysis).
mGene.ngs Overview
Goal: Predict annotation based on RNA-seq and genomic sequence information
Learn function f(y|x) that scores gene models y based on different sources of information x
Train parameters such that
\[ f(y|x) \gg f(y'|x) \quad \text{for all } y' \neq y \]
("large margin")
Hidden semi-Markov Support Vector Machines (HsM-SVMs) [Altun et al., 2003, Ratsch and Sonnenburg, 2007]
Automatically adapts to quality of RNA-seq data/alignments
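The large-margin condition on this slide can be illustrated with a small sketch. It is not the HsM-SVM training used by mGene.ngs: it assumes a linear score f(y|x) = w · Φ(x, y) over a few enumerated candidate gene models and applies a perceptron-style update until the true model outscores every wrong model by a fixed margin. The feature values, the candidate models, and the names `score` and `train_large_margin` are hypothetical.

```python
def score(w, phi):
    """Linear score f(y|x) = w . Phi(x, y) for one candidate gene model."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def train_large_margin(features, true_label, margin=1.0, lr=0.1, epochs=100):
    """Perceptron-style large-margin training over enumerated candidates.

    features maps a candidate label to its (hypothetical) feature vector,
    e.g. [genomic signal score, intron support from spliced reads,
    agreement with RNA-seq coverage].
    """
    w = [0.0] * len(features[true_label])
    for _ in range(epochs):
        # Find the highest-scoring wrong gene model under the current weights.
        wrong = max((y for y in features if y != true_label),
                    key=lambda y: score(w, features[y]))
        violation = margin + score(w, features[wrong]) - score(w, features[true_label])
        if violation <= 0:
            break  # the true model already wins by at least the margin
        # Move the weights toward the true model and away from the violator.
        w = [wi + lr * (t - f)
             for wi, t, f in zip(w, features[true_label], features[wrong])]
    return w


# Hypothetical feature vectors for one true and two wrong gene models.
models = {
    "true": [2.0, 3.0, 0.9],
    "wrong_a": [1.5, 0.0, 0.4],
    "wrong_b": [2.2, 1.0, 0.5],
}
w = train_large_margin(models, "true")
margins = {y: score(w, models["true"]) - score(w, models[y])
           for y in models if y != "true"}
print(margins)
```

In this toy setting the RNA-seq-derived feature (intron support) ends up with the largest weight, which loosely mirrors the slide's remark that the method adapts to the quality of the RNA-seq evidence.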
Training of mGene
STEP 1: SVM Signal Predictions
[Figure: SVM signal predictions (acc, don, tss, tis, stop) along the genomic position; score f(y|x) of the true gene model versus a wrong gene model, separated by a large margin.]
Training of mGene.ngs
STEP 1: SVM Signal Predictions
[Figure: SVM signal predictions (acc, don, tss, tis, stop) and RNA-seq evidence (coverage, intron support from spliced reads) along the genomic position; score f(y|x) of the true gene model versus a wrong gene model, separated by a larger margin.]
Results for C. elegans
RNA-seq:
paired-end, strand-specific RNA ligation based protocol
76 bp reads, 50 million reads
Alignment with PALMapper
Evaluation:
Transcript-level F-score of coding transcripts ... for different expression levels
Compare mGene (ab initio), mGene.ngs, Cufflinks
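As a rough illustration of the evaluation measure named above, the following sketch computes a transcript-level F-score. It assumes that a predicted transcript counts as correct only if its exon structure matches an annotated transcript exactly; the slide does not state the matching criterion, so this is an assumption, and the coordinates and the function name `transcript_f_score` are made up for the example.

```python
def transcript_f_score(predicted, annotated):
    """F-score from exact matches between predicted and annotated transcripts.

    Each transcript is a tuple of (start, end) exon intervals.
    """
    pred, anno = set(predicted), set(annotated)
    true_positives = len(pred & anno)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)   # fraction of predictions that are correct
    recall = true_positives / len(anno)      # fraction of annotations that are recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotation with three transcripts and two predictions,
# one of which matches an annotated transcript exactly.
annotated = [
    ((100, 200), (300, 400)),
    ((100, 200), (350, 400)),
    ((500, 650),),
]
predicted = [
    ((100, 200), (300, 400)),
    ((100, 250), (300, 400)),
]
f = transcript_f_score(predicted, annotated)
print(round(f, 3))
```

Computing the same score separately for groups of genes binned by expression level would give a curve of the kind shown on the results slides.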
Digestion
Observations:
RNA-seq helps to improve performance
Genomic signals help a lot (compare with Cufflinks, which relies on RNA-seq only)
Problems:
Need existing annotation for training
Cannot predict non-coding transcripts
Skimming and Non-coding Transcripts
Learning Strategy
Results for C. elegans
[Figure: F-score (y-axis, 0 to 0.7) versus expression percentile (x-axis, 0 to 100) for mGene.nc (w/o annotation), mGene (ab initio, w/ annotation), mGene.ngs (w/o annotation), mGene.ngs (w/ annotation) and cufflinks (Trapnell et al. 2010).]
De novo prediction works!
Modeling noncoding transcripts improves coding transcript prediction.
Gene Finding vs. Transcript Assembly
Gene expression level (low to high)
Gene finding + RNA-seq => only one transcript
RNA transcript assembly => multiple transcripts
BIOINFORMATICS ORIGINAL PAPER, Vol. 29 no. 20 2013, pages 2529–2538, doi:10.1093/bioinformatics/btt442
Genome analysis, Advance Access publication August 25, 2013
MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples
Jonas Behr1,2,*,†, Andre Kahles1, Yi Zhong1, Vipin T. Sreedharan1, Philipp Drewe1 and Gunnar Ratsch1,*
1 Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and 2 Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tubingen, Germany
Associate Editor: Ivo Hofacker
ABSTRACT
Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.
Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques lead to significant improvements over the state-of-the-art in transcriptome reconstruction.
Availability: MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.
Contact: [email protected] and [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Received on October 14, 2012; revised on July 19, 2013; accepted on July 29, 2013
1 INTRODUCTION
Most of the complexity of higher eukaryotic transcriptomes can be attributed to the encoding of multiple transcripts at a single genic locus by means of alternative splicing, transcription start and termination (e.g. Nilsen and Graveley, 2010; Ratsch et al., 2007; Schweikert et al., 2009). A comprehensive catalog of all transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of gene expression and RNA processing regulation.
RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing technologies (ENCODE Project Consortium et al., 2012; Mortazavi et al., 2008; Wang et al., 2009). Currently available sequencing platforms typically provide several 10–100 millions of sequence fragments (reads) with a typical length of 50–150 bases. By mapping these reads back to the genome, one can determine where gene products are encoded in the genome (e.g. Denoeud et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al., 2011) and collect evidence of RNA processing such as splicing (Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing (Bahn et al., 2012).
In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible read origins within the genome. Contiguous regions covered with read alignments (possibly with small gaps) are candidates for exonic segments. Alignment tools for RNA-Seq reads, such as PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat (Trapnell et al., 2009), MapSplice (Wang et al., 2010), Star (Dobin et al., 2012) or Gsnap (Wu and Nacu, 2010) are typically able to identify new exon–exon junctions, which are candidates for introns. This information can be compiled into a segment or splicing graph, a directed acyclic graph, where the nodes correspond to exonic segments and the edges correspond to intron candidates (cf. Fig. 1 for an illustration). Assuming complete coverage, an expressed transcript corresponds to a path in the graph. Similar graphs are produced during de novo transcript assembly with the difference that the graph can potentially be cyclic, and the segments are not explicitly associated with a genomic location. In genome- and assembly-based transcript reconstruction, tools such as Scripture (Guttman et al., 2010), Cufflinks (Trapnell et al., 2010), Trans-ABySS (Robertson et al., 2010), Trinity (Grabherr et al., 2011) and OASES (Schulz et al., 2012) select a subset of paths through the graph as transcript predictions. For simplicity, we will focus on genome-based transcript reconstruction when describing the approach and discuss de novo assembly whenever necessary.
*To whom correspondence should be addressed.
†Present address: ETH Zurich D-BSSE, Mattenstrasse 26, 4058 Basel, Switzerland.
© The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Transcript prediction via combinatorial optimization that combines evidence from multiple experiments & achieves higher accuracy.
Transcript Reconstruction with RNA-seq
[Figure: workflow from reads via data processing (read alignments to genomic DNA, segment graph) to optimization; de novo assembly (Trinity, Oases) versus genome-based assembly (Cufflinks, Scripture).]
108 possible transcripts, 1028 possible subsets of transcripts
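The segment graph on this slide can be read as a directed acyclic graph whose paths are candidate transcripts. A minimal sketch of that idea follows: it enumerates all paths from a start segment to any permitted end segment and writes each path as a binary row over the segments. The graph (seven segments, two skippable segments, two possible transcript ends) and the function names are hypothetical and chosen only for illustration.

```python
def enumerate_paths(edges, start, terminals):
    """Depth-first enumeration of all paths from start to any terminal node."""
    paths = []

    def walk(node, path):
        path = path + [node]
        if node in terminals:
            paths.append(path)          # a transcript may end here
        for nxt in edges.get(node, []):
            walk(nxt, path)             # or continue along an intron/adjacency edge

    walk(start, [])
    return paths


def path_to_row(path, n_segments):
    """Binary indicator row: 1 if the segment is part of the transcript."""
    return [1 if s in path else 0 for s in range(1, n_segments + 1)]


# Hypothetical segment graph: segments 2 and 4 can be skipped,
# and a transcript may end after segment 5 or after segment 7.
edges = {1: [2, 3], 2: [3], 3: [4, 5], 4: [5], 5: [6], 6: [7]}
paths = enumerate_paths(edges, start=1, terminals={5, 7})
rows = [path_to_row(p, 7) for p in paths]
for row in rows:
    print(row)
```

Even this small graph yields eight candidate transcripts, and the number of paths can grow exponentially with the number of optional segments, which is why selecting a subset of paths becomes a combinatorial problem.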
Enumerate and Quantify all Transcripts
[Figure: segment graph, the potential transcripts derived from it, and the expected coverage.]
Potential Transcripts (matrix U, one row per transcript, one column per segment):
1 0 1 0 1 1 1
1 0 1 1 1 1 1
1 1 1 0 1 1 1
1 0 1 0 1 1 1
1 1 1 1 1 0 0
1 0 1 1 1 0 0
1 1 1 0 1 0 0
1 0 1 0 1 0 0
Abundance (W, one row per transcript):
Sample1  Sample2
0.0      0.0
0.2      0.0
0.0      0.1
0.8      0.9
0.0      0.0
0.0      0.0
0.0      0.0
0.0      0.0
\[ \min_{W} \; L(\underbrace{U^{T} \times W}_{\text{expected coverage}},\ \underbrace{C}_{\text{observed coverage}}) + \gamma \times \|W\|_{1} \]
R. Bohnert and G. Ratsch, NAR (2010) [Behr et al., 2013]
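The L1-regularized objective on this slide can be sketched in a few lines. The example below is a simplification under stated assumptions: it replaces the likelihood-based loss L with a squared error, handles a single sample, keeps abundances non-negative, and uses plain proximal gradient descent rather than the solvers used by the authors. The observed coverage C is synthetic, generated from the Sample1 abundances shown above; `fit_l1` and the other function names are hypothetical.

```python
# Potential transcripts matrix U from the slide (rows: transcripts, columns: segments).
U = [
    [1, 0, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
]


def expected_coverage(U, w):
    """Expected coverage per segment, i.e. U^T w."""
    return [sum(U[t][s] * w[t] for t in range(len(U))) for s in range(len(U[0]))]


def objective(U, w, c, gamma):
    """Squared-error stand-in for L plus the L1 penalty (w is non-negative)."""
    e = expected_coverage(U, w)
    return 0.5 * sum((ei - ci) ** 2 for ei, ci in zip(e, c)) + gamma * sum(w)


def fit_l1(U, c, gamma=0.01, iters=2000):
    """Proximal gradient descent with non-negativity and L1 shrinkage."""
    # Conservative step size: 1 / trace(U U^T) never exceeds 1 / largest eigenvalue.
    step = 1.0 / sum(v * v for row in U for v in row)
    w = [0.0] * len(U)
    for _ in range(iters):
        resid = [e - ci for e, ci in zip(expected_coverage(U, w), c)]
        grad = [sum(U[t][s] * resid[s] for s in range(len(c))) for t in range(len(U))]
        w = [max(0.0, wt - step * (g + gamma)) for wt, g in zip(w, grad)]
    return w


# Synthetic observed coverage built from the Sample1 abundances on the slide.
w_sample1 = [0.0, 0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0]
C = expected_coverage(U, w_sample1)
w_hat = fit_l1(U, C)
print([round(v, 3) for v in w_hat])
print(round(objective(U, w_hat, C, 0.01), 4))
```

Because several rows of U overlap heavily (two of them are identical in the extracted matrix), the recovered abundances are not unique; only the fit to the coverage and the total abundance are pinned down by this objective.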
Simultaneous Identification & Quantification
[Figure: segment graph; the expected coverage is computed from the transcripts matrix U (rows 1, ..., N) and the abundances W.]
Transcripts Matrix (U):
1 0 1 0 1 1 1
1 0 1 1 1 1 1
1 1 1 0 1 1 1
0 0 0 0 0 0 0
Abundance (W, one row per transcript):
Sample1  Sample2
0.8      0.9
0.2      0.0
0.0      0.1
0.0      0.0
\[ \min_{U,W} \; L(U^{T} \times W, C) + \gamma \times N \quad \text{s.t. } U \text{ is valid} \]
[Behr et al., 2013]
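To make the trade-off in this objective concrete, here is a toy sketch that searches over subsets of candidate transcripts and charges a penalty γ for every transcript used. It is only an illustration: it enumerates subsets exhaustively instead of solving a mixed integer program, uses a squared-error loss in place of the likelihood L, fits a single sample, and treats "U is valid" as "every row is one of the given candidates". The candidate rows and the synthetic coverage are taken from the matrices on the slides; `fit_nonneg` and `best_subset` are hypothetical names.

```python
from itertools import combinations


def fit_nonneg(rows, c, iters=500):
    """Non-negative least-squares fit by projected gradient descent."""
    n_seg = len(c)
    w = [0.0] * len(rows)
    if rows:
        step = 1.0 / sum(v * v for r in rows for v in r)
        for _ in range(iters):
            resid = [sum(r[s] * wt for r, wt in zip(rows, w)) - c[s] for s in range(n_seg)]
            w = [max(0.0, wt - step * sum(r[s] * resid[s] for s in range(n_seg)))
                 for r, wt in zip(rows, w)]
    resid = [sum(r[s] * wt for r, wt in zip(rows, w)) - c[s] for s in range(n_seg)]
    return w, 0.5 * sum(x * x for x in resid)


def best_subset(candidates, c, gamma):
    """Minimize loss + gamma * (number of transcripts used) by enumeration."""
    best = None
    for k in range(len(candidates) + 1):
        for idx in combinations(range(len(candidates)), k):
            w, loss = fit_nonneg([candidates[i] for i in idx], c)
            value = loss + gamma * k
            if best is None or value < best[0]:
                best = (value, idx, w)
    return best


candidates = [
    [1, 0, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 0],
]
# Synthetic coverage: 0.8 x first candidate + 0.2 x second candidate.
C = [1.0, 0.0, 1.0, 0.2, 1.0, 1.0, 1.0]

_, chosen_small_gamma, _ = best_subset(candidates, C, gamma=0.005)
_, chosen_large_gamma, _ = best_subset(candidates, C, gamma=0.05)
print(chosen_small_gamma, chosen_large_gamma)
```

With a small γ both transcripts that generated the coverage are selected; with a larger γ the weakly expressed one is dropped because its contribution to the fit no longer pays for the penalty.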
MiTie's Main Features
Uses a likelihood function L based on a probabilistic model for the read coverage.
Uses combinatorial optimization to find transcripts that explain data from multiple RNA-seq libraries
Newly predicted transcripts are penalized (once).
Can use already known/confirmed transcripts without penalty.
Provides a p-value for each transcript providing a confidence measure for presence of predicted transcript.
Log-likelihood ratio test:
\[ T_t = -2 \log\left( \frac{p(D|M)}{p(D|M_t)} \right) \]
[Behr et al., 2013]
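The two ingredients named on this slide, a likelihood for read coverage and a log-likelihood ratio test, can be sketched as follows. The abstract states that the likelihood is based on the negative binomial distribution; everything else here is an assumption made for illustration: the mean/dispersion parameterization, treating segments as independent, reading the numerator model as the more restricted one, and converting the statistic to a p-value with a chi-square distribution with one degree of freedom. The counts and means are invented.

```python
import math


def nb_logpmf(k, mu, r):
    """Log-probability of count k under a negative binomial with mean mu and dispersion r."""
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def log_likelihood(counts, means, r):
    """Sum of per-segment log-probabilities (segments treated as independent)."""
    return sum(nb_logpmf(k, mu, r) for k, mu in zip(counts, means))


def lrt(loglik_restricted, loglik_full):
    """Statistic -2 log(p_restricted / p_full) and a chi-square (1 df) p-value."""
    stat = -2.0 * (loglik_restricted - loglik_full)
    # Survival function of chi-square with 1 degree of freedom: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical per-segment read counts and expected coverage with and
# without one transcript that contributes to the fourth segment.
counts = [10, 0, 10, 2, 10, 10, 10]
means_full = [10.0, 0.1, 10.0, 2.0, 10.0, 10.0, 10.0]
means_restricted = [10.0, 0.1, 10.0, 0.1, 10.0, 10.0, 10.0]
r = 5.0

stat, p_value = lrt(log_likelihood(counts, means_restricted, r),
                    log_likelihood(counts, means_full, r))
print(round(stat, 3), round(p_value, 4))
```

A small p-value in this sketch means that dropping the transcript makes the observed counts noticeably less likely, which is the sense in which the slide describes the p-value as a confidence measure.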
MiTie Results
[Figure, panel A: Human Simulated Data. F-score on transcript level (y-axis, about 0.35 to 0.45) versus number of samples (x-axis, 1 to 5) for MITIE + MMO, MITIE, Cufflinks + Cuffmerge and Cufflinks.]
[Figure, panel B: D. melanogaster modENCODE Data. F-score on transcript level (y-axis, about 0.29 to 0.37), x-axis ticks 1 to 7, for MITIE and Cufflinks + Cuffmerge.]
[Behr et al., 2013]
Gene Finding vs. Transcript Assembly
Gene expression level (low to high)
mGene.ngs => only one transcript
MiTie => multiple transcripts
Low confidence to high confidence for alternative transcripts
Conclusions
Genome annotation pipeline:
Transcript Skimmer identifies highly expressed genes for training
mGene.ngs predicts coding and non-coding transcripts
MiTie predicts alternative transcripts for highly expressed genes
Genome annotation pipeline requires only:
Genome sequence
RNA-seq alignments
Good for annotating new genomes or improving existing ones
Sources are free: http://bioweb.me/mgene & http://bioweb.me/mitie
Functionality partially available in Galaxy instance (http://galaxy.cbio.mskcc.org)
Thank you!
Just published:
Check out:
http://oqtans.org
http://galaxy.cbio.mskcc.org
[Sreedharan et al., 2014]
References
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Ratsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
Jonas Behr, Andre Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, and Gunnar Ratsch. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–38, Oct 2013. doi: 10.1093/bioinformatics/btt442.
RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Scholkopf, M Nordborg, G Ratsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.
G. Ratsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schoelkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
G Ratsch and S Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS'06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.
G. Ratsch, S. Sonnenburg, and B. Scholkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Kruger, Soren Sonnenburg, and Gunnar Ratsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.
S. Sonnenburg, G. Ratsch, A. Jagota, and K.-R. Muller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
Soren Sonnenburg, Alexander Zien, and Gunnar Ratsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
VT Sreedharan, SJ Schultheiss, G Jean, A Kahles, R Bohnert, P Drewe, P Mudrakarta, N Gornitz, G Zeller, and Gunnar Ratsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Bioinformatics Advance Access published January 11, 2014.
G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Ratsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
A. Zien, G. Ratsch, S. Mika, B. Scholkopf, T. Lengauer, and K.-R. Muller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. BioInformatics, 16(9):799–807, September 2000.