RNA-Seq-based Genome Annotation using mGene.ngs and MiTie. Gunnar Rätsch, Biomedical Data Science Group, Computational Biology Center, Memorial Sloan-Kettering Cancer Center. @gxr #mGene #MiTie #PAGXXII

DESCRIPTION

Talk in the gene discovery session at PAG XXII (https://pag.confex.com/pag/xxii/webprogram/Session2128.html). Joint work with Jonas Behr, Gabriele Schweikert, Andre Kahles and others.

Abstract: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and transcripts. However, the immense dynamic range of gene expression, the limitations and biases of the sequencing technology, and the observed complexity of the transcriptional landscape pose profound computational challenges. We discuss several of these challenges and, based on illustrative simulation examples, identify the limits of state-of-the-art tools in reconstructing multiple alternative transcripts even when sufficient information is provided.

We propose a novel framework, called MiTie, for simultaneous transcript reconstruction and quantification based on combinatorial optimization. We use the negative binomial distribution to define a likelihood function and a regularization approach to select a small number of transcripts that quantitatively explain the observed read data. We show that the resulting regularized maximum likelihood problem can be formulated as a mixed integer programming problem (MIP), which can be solved optimally using standard optimization approaches.

We also describe an extension of the discriminative gene finding system mGene that takes advantage of RNA-seq reads. We demonstrate that the extended system, mGene.ngs, predicts transcript annotations significantly more accurately when using RNA-seq data, and more accurately than transcriptome reconstruction tools that rely solely on RNA-seq data. Finally, we illustrate how a combination of gene finding and transcriptome reconstruction methods like MiTie can be used to accurately annotate newly sequenced genomes without prior annotations.

Page 1: RNA-seq based Genome Annotation with mGene.ngs and MiTie

RNA-Seq-based Genome Annotation using mGene.ngs and MiTie

Gunnar Rätsch
Biomedical Data Science Group
Computational Biology Center
Memorial Sloan-Kettering Cancer Center

@gxr #mGene #MiTie #PAGXXII

Page 2: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Acknowledgements and Disclosures

Main contributors: Jonas Behr, Gabriele Schweikert, Andre Kahles

Funding

Financial interest disclosure

© Gunnar Rätsch (cBio@MSKCC), RNA-Seq-based Annotation using mGene.ngs and MiTie @ PAG XXII Gene Discovery Workshop

Page 6: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Genome Annotation Pipeline(s)

Page 11: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Proposed new gene finding method (mGene.ngs) for reannotation of 19 A. thaliana genomes (and genome assembly + analysis).

Page 12: RNA-seq based Genome Annotation with mGene.ngs and MiTie

mGene.ngs Overview

  • Goal: Predict annotation based on RNA-seq and genomic sequence information
  • Learn function $f(y \mid x)$ that scores gene models $y$ based on different sources of information $x$
  • Train parameters such that $f(y \mid x) \gg f(y' \mid x)$ for all $y' \neq y$ (“large margin”)
  • Hidden semi-Markov Support Vector Machines (HsM-SVMs) [Altun et al., 2003, Rätsch and Sonnenburg, 2007]
  • Automatically adapts to quality of RNA-seq data/alignments
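The large-margin condition on this slide can be illustrated with a toy linear scorer. This is a minimal sketch, not mGene.ngs itself: the feature vectors, the weights and the names `score` and `margin_violations` are hypothetical, and an HsM-SVM searches over segmentations with dynamic programming rather than over an explicit candidate list.

```python
def score(weights, features):
    """Linear score f(y|x) = <w, phi(x, y)> for one candidate gene model."""
    return sum(w * f for w, f in zip(weights, features))


def margin_violations(weights, true_features, wrong_features_list, margin=1.0):
    """Hinge loss of each wrong model y' against the true model y.

    A loss of 0 means f(y|x) >= f(y'|x) + margin, so the constraint holds.
    """
    true_score = score(weights, true_features)
    return [max(0.0, margin - (true_score - score(weights, wrong)))
            for wrong in wrong_features_list]


# Hypothetical features: [splice-signal score, coding content, intron support]
w = [1.0, 0.5, 2.0]
true_model = [0.9, 0.8, 1.0]
wrong_models = [[0.9, 0.8, 0.0],   # same signals, no spliced-read support
                [0.2, 0.1, 1.0]]   # weak signals, supported intron
losses = margin_violations(w, true_model, wrong_models)
```

Training would adjust `w` until all such losses vanish on the training genes; adding RNA-seq features such as intron support gives the scorer more ways to separate the true model from wrong ones.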

Page 17: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Training of mGene

[Figure: STEP 1: SVM Signal Predictions. Signal tracks: acc, don, tss, tis, stop. Axes: genomic position (x), score f(y|x) (y). Labels: True gene model, Wrong gene model, large margin.]

Page 18: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Training of mGene.ngs

[Figure: STEP 1: SVM Signal Predictions. Signal tracks: acc, don, tss, tis, stop. Additional RNA-seq tracks: coverage, intron support from spliced reads. Axes: genomic position (x), score f(y|x) (y). Labels: True gene model, Wrong gene model, large margin.]

Page 19: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Training of mGene.ngs

[Figure: STEP 1: SVM Signal Predictions. Signal tracks: acc, don, tss, tis, stop. Additional RNA-seq tracks: coverage, intron support from spliced reads. Axes: genomic position (x), score f(y|x) (y). Labels: True gene model, Wrong gene model, larger margin.]

Page 20: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Results for C. elegans

RNA-seq:

  • Paired-end, strand-specific RNA ligation based protocol
  • 76 bp reads, 50 million reads
  • Alignment with PALMapper

Evaluation:

  • Transcript-level F-score of coding transcripts ... for different expression levels
  • Compare mGene (ab initio), mGene.ngs, Cufflinks
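A transcript-level F-score can be computed as sketched below. The talk does not spell out its matching criterion, so this sketch assumes a predicted transcript counts as correct only if all of its exon coordinates agree with an annotated transcript; the function name and the example coordinates are made up for illustration.

```python
def transcript_f_score(predicted, annotated):
    """F-score where a prediction counts only if its full exon structure matches."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)   # fraction of predictions that are correct
    recall = true_positives / len(annotated)      # fraction of annotations recovered
    return 2 * precision * recall / (precision + recall)


# Each transcript is a (chromosome, strand, tuple of (start, end) exons) triple.
annotated = {("I", "+", ((100, 200), (300, 400))),
             ("I", "+", ((100, 200), (350, 400))),
             ("II", "-", ((50, 120),))}
predicted = {("I", "+", ((100, 200), (300, 400))),
             ("II", "-", ((50, 130),))}        # wrong exon end, so no match
f = transcript_f_score(predicted, annotated)
```

Here one of two predictions is correct and one of three annotated transcripts is recovered, so `f` is 0.4. Evaluating by expression level amounts to calling the function on the annotated transcripts within each expression percentile bin.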

Page 21: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Results for C. elegans

Page 22: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Digestion

Observations:

  • RNA-seq helps to improve performance
  • Genomic signals help a lot (compare with Cufflinks)

Problems:

  • Need existing annotation for training
  • Cannot predict non-coding transcripts

Page 23: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Skimming and Non-coding Transcripts

Page 25: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Learning Strategy

Page 30: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Results for C. elegans

[Figure: F-score (y-axis, ticks 0 to 0.7) versus expression percentile (x-axis, ticks 0 to 100). Curves: mGene.nc, w/o annotation; mGene, ab initio w/ annotation; mGene.ngs, w/o annotation; mGene.ngs, w/ annotation; Cufflinks, Trapnell et al. 2010.]

De novo prediction works! Modeling noncoding transcripts improves coding transcript prediction.

Page 31: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Gene Finding vs. Transcript Assembly

[Figure: axis "Gene expression level", from low to high]

  • Gene finding + RNA-seq => only one transcript
  • RNA transcript assembly => multiple transcripts

Page 32: RNA-seq based Genome Annotation with mGene.ngs and MiTie

BIOINFORMATICS, ORIGINAL PAPER, Vol. 29 no. 20 2013, pages 2529–2538, doi:10.1093/bioinformatics/btt442

Genome analysis. Advance Access publication August 25, 2013

MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples

Jonas Behr 1,2,*,†, Andre Kahles 1, Yi Zhong 1, Vipin T. Sreedharan 1, Philipp Drewe 1 and Gunnar Rätsch 1,*

1 Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and 2 Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany

Associate Editor: Ivo Hofacker

ABSTRACT

Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.

Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques lead to significant improvements over the state-of-the-art in transcriptome reconstruction.

Availability: MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.

Contact: [email protected] and [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on October 14, 2012; revised on July 19, 2013; accepted on July 29, 2013
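To make the likelihood idea concrete, the sketch below scores observed read counts against expected counts with a negative binomial model. It is an illustration only: the mean/dispersion parameterization (variance = mean + dispersion × mean²), the dispersion value and the function names are assumptions, not the exact model used in MITIE.

```python
import math


def nb_log_pmf(count, mean, dispersion):
    """log P(count) for a negative binomial with the given mean (> 0) and
    variance mean + dispersion * mean**2 (assumed parameterization)."""
    r = 1.0 / dispersion                      # "size" parameter
    p = r / (r + mean)                        # success probability
    return (math.lgamma(count + r) - math.lgamma(r) - math.lgamma(count + 1)
            + r * math.log(p) + count * math.log(1.0 - p))


def coverage_log_likelihood(observed, expected, dispersion=0.1):
    """Sum of per-segment log-probabilities of the observed read counts."""
    return sum(nb_log_pmf(c, mu, dispersion) for c, mu in zip(observed, expected))


observed = [12, 30, 9]
good_fit = coverage_log_likelihood(observed, [10.0, 28.0, 10.0])
poor_fit = coverage_log_likelihood(observed, [30.0, 5.0, 40.0])
```

A transcript set whose expected coverage tracks the observed counts (`good_fit`) receives a higher log-likelihood than one that does not (`poor_fit`); the extra dispersion term lets the model tolerate more count variability than a Poisson model would.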

1 INTRODUCTION

Most of the complexity of higher eukaryotic transcriptomes can be attributed to the encoding of multiple transcripts at a single genic locus by means of alternative splicing, transcription start and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al., 2007; Schweikert et al., 2009). A comprehensive catalog of all transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of gene expression and RNA processing regulation.

RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing technologies (ENCODE Project Consortium et al., 2012; Mortazavi et al., 2008; Wang et al., 2009). Currently available sequencing platforms typically provide several 10–100 millions of sequence fragments (reads) with a typical length of 50–150 bases. By mapping these reads back to the genome, one can determine where gene products are encoded in the genome (e.g. Denoeud et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al., 2011) and collect evidence of RNA processing such as splicing (Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing (Bahn et al., 2012).

In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible read origins within the genome. Contiguous regions covered with read alignments (possibly with small gaps) are candidates for exonic segments. Alignment tools for RNA-Seq reads, such as PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat (Trapnell et al., 2009), MapSplice (Wang et al., 2010), Star (Dobin et al., 2012) or Gsnap (Wu and Nacu, 2010) are typically able to identify new exon–exon junctions, which are candidates for introns. This information can be compiled into a segment or splicing graph, a directed acyclic graph, where the nodes correspond to exonic segments and the edges correspond to intron candidates (cf. Fig. 1 for an illustration). Assuming complete coverage, an expressed transcript corresponds to a path in the graph. Similar graphs are produced during de novo transcript assembly with the difference that the graph can potentially be cyclic, and the segments are not explicitly associated with a genomic location. In genome- and assembly-based transcript reconstruction, tools such as Scripture (Guttman et al., 2010), Cufflinks (Trapnell et al., 2010), Trans-ABySS (Robertson et al., 2010), Trinity (Grabherr et al., 2011) and OASES (Schulz et al., 2012) select a subset of paths through the graph as transcript predictions. For simplicity, we will focus on genome-based transcript reconstruction when describing the approach and discuss de novo assembly whenever necessary.
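The statement that an expressed transcript corresponds to a path in the segment graph can be made concrete with a small enumeration. This is a sketch under assumptions: the dictionary-based graph encoding, the function name and the five-segment locus are invented for illustration, and exhaustive enumeration only works for small graphs.

```python
def enumerate_transcripts(edges, sources, sinks):
    """List every source-to-sink path in a directed acyclic segment graph.

    edges maps a segment to the segments that can follow it; each returned
    path is one candidate transcript.
    """
    paths = []

    def extend(path):
        node = path[-1]
        if node in sinks:
            paths.append(tuple(path))
        for successor in edges.get(node, ()):
            extend(path + [successor])

    for source in sources:
        extend([source])
    return paths


# Hypothetical locus with segments 1..5; segments 2 and 4 can be skipped.
edges = {1: [2, 3], 2: [3], 3: [4, 5], 4: [5]}
candidates = enumerate_transcripts(edges, sources={1}, sinks={5})
```

With two independently skippable segments the graph yields four candidate transcripts. The number of paths grows multiplicatively with each additional alternative event, which is why reconstruction tools select a subset of paths rather than reporting all of them.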

*To whom correspondence should be addressed.
†Present address: ETH Zürich D-BSSE, Mattenstrasse 26, 4058 Basel, Switzerland.

© The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Transcript prediction via combinatorial optimization that combines evidence from multiple experiments & achieves higher accuracy.

Page 33: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Transcript Reconstruction with RNA-seq

[Figure labels: Reads; Data processing; Read alignments; Genomic DNA; Segment graph; Optimization; De novo assembly (Trinity, Oases); Genome-based assembly (Cufflinks, Scripture)]

108 possible transcripts, 1028 possible subsets of transcripts

Page 38: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Enumerate and Quantify all Transcripts

[Figure labels: Segment Graph; Potential Transcripts; Abundance; Expected coverage]

One row per potential transcript, with its abundance in each sample:

  Potential transcript    Sample1   Sample2
  1 0 1 0 1 1 1           0.0       0.0
  1 0 1 1 1 1 1           0.2       0.0
  1 1 1 0 1 1 1           0.0       0.1
  1 0 1 0 1 1 1           0.8       0.9
  1 1 1 1 1 0 0           0.0       0.0
  1 0 1 1 1 0 0           0.0       0.0
  1 1 1 0 1 0 0           0.0       0.0
  1 0 1 0 1 0 0           0.0       0.0

$$\min_{W} \; L(\underbrace{U^{T} \times W}_{\text{expected coverage}},\; \underbrace{C}_{\text{observed coverage}}) + \gamma \times \|W\|_{1}$$

R. Bohnert and G. Rätsch, NAR (2010) [Behr et al., 2013]
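The enumerate-then-quantify objective on this slide can be sketched for a single sample as follows. Two substitutions are made here and should be read as assumptions: a squared-error loss stands in for the likelihood L, and plain projected gradient descent stands in for whatever solver the original work uses. The function name, step size and toy matrix are hypothetical.

```python
def quantify(U, coverage, gamma=0.1, step=0.01, iterations=20000):
    """Projected gradient descent for
        min_{w >= 0}  0.5 * ||U^T w - c||^2 + gamma * sum(w)
    U[t][s] is 1 if transcript t contains segment s; coverage[s] is observed.
    For non-negative w, sum(w) equals the L1 norm of w.
    """
    n_transcripts, n_segments = len(U), len(coverage)
    w = [0.0] * n_transcripts
    for _ in range(iterations):
        expected = [sum(U[t][s] * w[t] for t in range(n_transcripts))
                    for s in range(n_segments)]
        residual = [e - c for e, c in zip(expected, coverage)]
        for t in range(n_transcripts):
            gradient = sum(U[t][s] * residual[s] for s in range(n_segments)) + gamma
            w[t] = max(0.0, w[t] - step * gradient)   # keep abundances non-negative
    return w


# Three candidate transcripts over four segments (toy data).
U = [[1, 1, 1, 1],
     [1, 0, 1, 1],
     [1, 1, 0, 1]]
coverage = [10.0, 2.0, 8.0, 10.0]   # generated from abundances [0, 8, 2]
abundances = quantify(U, coverage)
```

The estimated abundances of the second and third transcript come out close to the 8 and 2 used to generate the coverage, while the first transcript stays near zero. The L1 penalty shrinks abundances but does not by itself force unused transcripts to exactly zero in every case.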

Page 45: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Simultaneous Identification & Quantification

[Figure labels: Segment Graph; Transcripts Matrix U (rows 1, ..., N); Abundance W; Expected coverage]

One row of U per transcript slot, with its abundance in each sample:

  Transcripts matrix U    Sample1   Sample2
  1 0 1 0 1 1 1           0.8       0.9
  1 0 1 1 1 1 1           0.2       0.0
  1 1 1 0 1 1 1           0.0       0.1
  0 0 0 0 0 0 0           0.0       0.0

$$\min_{U, W} \; L(U^{T} \times W, C) + \gamma \times N \quad \text{s.t. } U \text{ is valid}$$

[Behr et al., 2013]
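The joint objective, which penalizes the number N of transcripts used instead of their abundances, can be illustrated by brute force on a toy locus. This sketch is not the MIP formulation from the talk: it searches all subsets of an explicit candidate list (the rows play the role of valid rows of U), uses squared error in place of the likelihood L, and handles one sample only. All names and numbers are hypothetical.

```python
from itertools import combinations


def fit_abundances(U, coverage, step=0.01, iterations=5000):
    """Non-negative least squares by projected gradient; returns (w, loss)."""
    w = [0.0] * len(U)
    n_segments = len(coverage)
    for _ in range(iterations):
        residual = [sum(U[t][s] * w[t] for t in range(len(U))) - coverage[s]
                    for s in range(n_segments)]
        w = [max(0.0, w[t] - step * sum(U[t][s] * residual[s]
                                        for s in range(n_segments)))
             for t in range(len(U))]
    residual = [sum(U[t][s] * w[t] for t in range(len(U))) - coverage[s]
                for s in range(n_segments)]
    return w, 0.5 * sum(r * r for r in residual)


def select_transcripts(candidates, coverage, gamma=1.0):
    """Exhaustively minimise loss + gamma * N over subsets of candidate paths."""
    best = (0.5 * sum(c * c for c in coverage), (), [])   # empty model, N = 0
    for n in range(1, len(candidates) + 1):
        for subset in combinations(range(len(candidates)), n):
            w, loss = fit_abundances([candidates[t] for t in subset], coverage)
            objective = loss + gamma * n
            if objective < best[0]:
                best = (objective, subset, w)
    return best


candidates = [[1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1]]
coverage = [10.0, 2.0, 8.0, 10.0]
objective, chosen, abundances = select_transcripts(candidates, coverage)
```

On this toy input the search selects the second and third candidate, which explain the coverage exactly with two transcripts. Exhaustive search scales exponentially with the number of candidates, which is the motivation for casting the problem as a mixed integer program.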

Page 46: RNA-seq based Genome Annotation with mGene.ngs and MiTie

MiTie’s Main Features

  • Uses a likelihood function L based on a probabilistic model for the read coverage.
  • Uses combinatorial optimization to find transcripts that explain data from multiple RNA-seq libraries.
  • Newly predicted transcripts are penalized (once).
  • Can use already known/confirmed transcripts without penalty.
  • Provides a p-value for each transcript as a confidence measure for the presence of the predicted transcript. Log-likelihood ratio test:

$$T_t = -2 \log\left(\frac{p(D \mid M)}{p(D \mid M_t)}\right)$$

[Behr et al., 2013]
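A log-likelihood ratio statistic of this form is commonly converted into a p-value through a chi-square approximation. The sketch below assumes one degree of freedom, which is an assumption of this example and not stated on the slide. Because the transcript does not say which of M and M_t is the restricted model, the arguments are named by role (`restricted`, `full`) instead.

```python
import math


def lrt_p_value(log_lik_restricted, log_lik_full):
    """Return (T, p) with T = -2 * (log_lik_restricted - log_lik_full).

    The p-value assumes T follows a chi-square distribution with
    1 degree of freedom under the restricted model.
    """
    statistic = max(0.0, -2.0 * (log_lik_restricted - log_lik_full))
    # Survival function of chi-square(1): P(X > T) = erfc(sqrt(T / 2)).
    return statistic, math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods of the read data under the two models.
statistic, p_value = lrt_p_value(log_lik_restricted=-105.0, log_lik_full=-100.0)
```

A large gain in log-likelihood from including the transcript gives a large statistic and a small p-value; equal log-likelihoods give a statistic of 0 and a p-value of 1.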

Page 47: RNA-seq based Genome Annotation with mGene.ngs and MiTie

MiTie Results

[Figure, panels A and B: F-score on transcript level.
Human Simulated Data: F-score (ticks 0.35 to 0.45) versus number of samples (1 to 5). Curves: MITIE + MMO, MITIE, Cufflinks + Cuffmerge, Cufflinks.
D. melanogaster modENCODE Data: F-score (ticks 0.29 to 0.37), x-axis ticks 1 to 7. Curves: MITIE, Cufflinks + Cuffmerge.]

[Behr et al., 2013]

Page 48: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Gene Finding vs. Transcript Assembly

[Figure: axis "Gene expression level", from low to high]

  • mGene.ngs => only one transcript
  • MiTie => multiple transcripts
  • Low confidence to high confidence for alternative transcripts

Page 49: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Conclusions

  • Genome annotation pipeline
      • Transcript Skimmer identifies highly expressed genes for training
      • mGene.ngs predicts coding and non-coding transcripts
      • MiTie predicts alternative transcripts for highly expressed genes
  • Genome annotation pipeline requires only
      • Genome sequence
      • RNA-seq alignments
  • Good for annotating new genomes or improving existing ones
  • Sources are free: http://bioweb.me/mgene & http://bioweb.me/mitie
  • Functionality partially available in Galaxy instance (http://galaxy.cbio.mskcc.org)

Thank you!

Page 55: RNA-seq based Genome Annotation with mGene.ngs and MiTie

Just published: [Sreedharan et al., 2014]

Check out:

http://oqtans.org

http://galaxy.cbio.mskcc.org

Page 56: RNA-seq based Genome Annotation with mGene.ngs and MiTie

References I

Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.

J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.

Jonas Behr, André Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, and Gunnar Rätsch. MiTie: Simultaneous RNA-seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–38, Oct 2013. doi: 10.1093/bioinformatics/btt442.

RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Schölkopf, M Nordborg, G Rätsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.

G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In B. Schölkopf, K. Tsuda, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.

G. Rätsch and S. Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS'06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.

G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.

Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.

S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.

Sören Sonnenburg, Alexander Zien, and Gunnar Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.

Page 57: RNA-seq based Genome Annotation with mGene.ngs and MiTie

References II

VT Sreedharan, SJ Schultheiss, G Jean, A Kahles, R Bohnert, P Drewe, P Mudrakarta, N Görnitz, G Zeller, and Gunnar Rätsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Bioinformatics Advance Access published January 11, 2014.

G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.

A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.
