
S–MART: What can I do with all this RNA-Seq data?

Matthias Zytnicki

URGI, INRA


Introduction

from The Economist, Aug. 2010

S–MART 10/26/10 Matthias Zytnicki 2 / 18

Sequencers around the world

http://pathogenomics.bham.ac.uk/hts/


Applications

• pharmacology: correlating drug response with genome variation

• metagenomics: tag sequencing

• genetics: GWAS of families of four

• epigenetics: replication timing and histone modification

• genomics: gene fusions, variant detection with exon capture


Current problem

Differences between big and small labs

big labs may have

• their own sequencers

• their own clusters

• their own storage solutions

• their own bioinformaticians

small labs may have

• nothing

Figure: number of labs vs. number of sequencers


Comparison with tiling arrays

from Wang et al., 2009


Differential expression

1 dot: number of reads overlapping 1 gene in each condition.


Count normalization problem

• each sample: 1 million reads

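• 1 gene: 0 reads in sample 1, 500,000 reads in sample 2

• other genes: expression unchanged

⇒ the other genes share only the remaining 500,000 reads in sample 2, so a gene with 2,000 / 1,000 reads is not differentially expressed!

• use housekeeping genes as reference

• use genes with average expression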
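The arithmetic behind this slide can be sketched in a few lines of Python. Only the 1 million reads per sample, the 0 / 500,000 gene and the 2,000 / 1,000 gene come from the slide; the gene names and the counts of the reference gene are made up so that the totals add up. Scaling by a reference gene is one simple reading of "use housekeeping genes as reference" and is not necessarily what S–MART does.

```python
# Toy counts: gene names and the "housekeeping" counts are hypothetical.
sample1 = {"burst": 0, "geneA": 2_000, "housekeeping": 998_000}
sample2 = {"burst": 500_000, "geneA": 1_000, "housekeeping": 499_000}


def scale_by_total(counts):
    """Reads per million: divide each count by the library size."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def scale_by_reference(counts, reference):
    """Express each gene relative to a reference gene assumed to be unchanged."""
    return {gene: c / counts[reference] for gene, c in counts.items()}


def fold_change(a, b, gene):
    """Ratio of the normalized value in sample b over sample a."""
    return b[gene] / a[gene]


# Both libraries contain exactly 1 million reads, so scaling by the total
# leaves the misleading two-fold drop of geneA untouched.
naive = fold_change(scale_by_total(sample1), scale_by_total(sample2), "geneA")

# Relative to the reference gene, geneA does not change at all.
fixed = fold_change(
    scale_by_reference(sample1, "housekeeping"),
    scale_by_reference(sample2, "housekeeping"),
    "geneA",
)

print("fold change, scaled by total reads:   ", naive)
print("fold change, scaled by reference gene:", fixed)
```

The single gene that jumps from 0 to 500,000 reads takes half of the second library, which halves every other count even though expression is unchanged.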


Size-dependent normalization problem

Figure: genes in sample 1 and sample 2


Introduction

Data:

• 1 RNA–Seq sample of wild type

• 1 RNA–Seq sample of mutant

• Get me all the genes which show differential expression.

• Use sliding windows instead.

• Actually, I want to have a bird’s eye view of the transcription throughout the genome.

• Hey! I saw some paper which shows some nice dot plots for differential expression!

⇒ Better if the biologist can perform the work himself/herself.

⇒ There is no predefined pipeline.

⇒ Few lanes are usually sufficient.

⇒ Use S–MART!


Usual pipeline

• First step: align on a reference genome (with any tool)

• Result: genomic coordinates.

• S–MART:

  • is a set of independent tools

  • works on a standard PC

  • can be installed and used easily

  • uses nested bins and SQL indices

Figure: sequencer → sequences → alignment → genomic coordinates
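To illustrate what "nested bins and SQL indices" can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The slides do not give S–MART's bin sizes or table layout, so this borrows the bin parameters of the UCSC Genome Browser scheme as a stand-in; the table name, column names and example features are hypothetical.

```python
import sqlite3

# UCSC-style nested bins (assumed here, not taken from S-MART):
# 5 levels, the smallest bins span 128 kb, each level is 8 times larger.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
FIRST_SHIFT = 17  # 2**17 = 128 kb
NEXT_SHIFT = 3    # 2**3 = 8


def bin_from_range(start, end):
    """Smallest bin that fully contains the half-open interval [start, end)."""
    s, e = start >> FIRST_SHIFT, (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if s == e:
            return offset + s
        s >>= NEXT_SHIFT
        e >>= NEXT_SHIFT
    raise ValueError("interval too large for this binning scheme")


def overlapping_bins(start, end):
    """All bins that may contain features overlapping [start, end)."""
    s, e = start >> FIRST_SHIFT, (end - 1) >> FIRST_SHIFT
    bins = []
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + s, offset + e + 1))
        s >>= NEXT_SHIFT
        e >>= NEXT_SHIFT
    return bins


def build_db(features):
    """Store (name, chrom, start, end) features with their bin, indexed by bin."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE feature "
        "(name TEXT, chrom TEXT, start_pos INT, end_pos INT, bin INT)"
    )
    db.execute("CREATE INDEX idx_chrom_bin ON feature (chrom, bin)")
    db.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?, ?)",
        [(n, c, s, e, bin_from_range(s, e)) for n, c, s, e in features],
    )
    return db


def query_overlaps(db, chrom, start, end):
    """Names of the features overlapping [start, end) on one chromosome."""
    bins = overlapping_bins(start, end)
    marks = ",".join("?" * len(bins))
    sql = (
        f"SELECT name FROM feature WHERE chrom = ? AND bin IN ({marks}) "
        "AND start_pos < ? AND end_pos > ? ORDER BY start_pos"
    )
    return [row[0] for row in db.execute(sql, [chrom, *bins, end, start])]


db = build_db([
    ("gene1", "chr1", 1_000, 5_000),
    ("gene2", "chr1", 130_000, 140_000),
    ("gene3", "chr2", 2_000, 3_000),
])
hits = query_overlaps(db, "chr1", 4_000, 131_500)
print(hits)
```

The index lets the database look only at the handful of bins that can overlap the query instead of scanning every feature, which is what makes this kind of lookup practical on a standard PC.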


Data manipulation

• use several mapping tools

• remove reads w.r.t. a reference set

• find the coverage w.r.t. a reference set

• cluster by sliding windows

• find transcription on both strands

Figure labels: Mosaik, Blat, Fasta, coordinates

Supported tools: Blast, Blat, BWA, Exonerate, MAQ, Mosaik, Nucmer, RMap, SeqMap, Shrimp, SOAP, ...
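Two of the operations listed on this slide, removing reads with respect to a reference set and clustering by sliding windows, can be sketched as follows. This is a plain-Python illustration on (chromosome, start, end) tuples with made-up coordinates, not S–MART's implementation; the function names and the window size and step are assumptions.

```python
from collections import Counter


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def remove_overlapping(reads, reference):
    """Keep only the reads that overlap no interval of the reference set."""
    return [r for r in reads if not any(overlaps(r, ref) for ref in reference)]


def count_per_window(reads, size, step):
    """Count reads per sliding window.

    Window i covers [i * step, i * step + size); a read is counted once
    in every window it overlaps.
    """
    counts = Counter()
    for chrom, start, end in reads:
        first = max(0, (start - size) // step + 1)
        last = (end - 1) // step
        for i in range(first, last + 1):
            counts[(chrom, i * step, i * step + size)] += 1
    return counts


# Hypothetical mapped reads and a hypothetical tRNA annotation to discard.
reads = [
    ("chr1", 10, 46),
    ("chr1", 120, 156),
    ("chr1", 130, 166),
    ("chr1", 400, 436),
]
trna = [("chr1", 390, 470)]

kept = remove_overlapping(reads, trna)
windows = count_per_window(kept, size=100, step=50)
for window, n in sorted(windows.items()):
    print(window, n)
```

The linear scan over the reference set is only acceptable for toy data; a real tool would use an index to find the overlapping intervals.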

Figure (remove reads w.r.t. a reference set): reads, tRNA, result

Figure (find the coverage w.r.t. a reference set): reads, refSeq, result

Figure (cluster by sliding windows): reads, number (40, 4)

Figure (find transcription on both strands): reads, result

Data visualization

• read size distribution

• nucleotide distribution

• density on the chromosomes

• distance with respect to genes

Differential expression

S–MART can find differentially expressed regions, which can be genes, TEs, miRNAs, sliding windows, etc.

It uses Fisher’s exact test for each region.

Figure labels: sample 1, sample 2, result (** / –)
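As an illustration of a per-region Fisher's exact test, the sketch below implements the two-sided test with the standard library only. The slides do not say how S–MART builds the contingency table; the layout used here (reads inside and outside the region, for each sample) and the read counts are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )


def region_p_value(in1, total1, in2, total2):
    """P-value for one region: reads inside vs. outside, sample 1 vs. sample 2."""
    return fisher_exact_two_sided(in1, total1 - in1, in2, total2 - in2)


# Hypothetical region: 30 of 1,000 reads in sample 1, 10 of 1,000 in sample 2.
p_region = region_p_value(30, 1_000, 10, 1_000)
print(p_region)
```

Exact binomial coefficients keep the sketch short but become slow for libraries with millions of reads, where a statistics library would be the practical choice.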


Normalizations

Use a dot plot of the number of reads per gene.

• no normalization

• normalization w.r.t. the number of reads

• normalization w.r.t. the interquartile

• number of reads per kb

• FDR of 5%

Spearman rho after each successive step: 0.558867, 0.697121, 0.697153, 0.752267, 0.752267
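The "number of reads per kb" step and the kind of rank correlation reported on this slide can be sketched as follows. The gene lengths and counts are made up, and the slides do not say which values S–MART feeds into its Spearman rho, so this only shows the two calculations in isolation.

```python
def reads_per_kb(counts, lengths):
    """Divide each gene's read count by its length in kilobases."""
    return {gene: counts[gene] / (lengths[gene] / 1000) for gene in counts}


def average_ranks(values):
    """Rank values starting at 1; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical gene lengths (bp) and read counts in two samples.
lengths = {"g1": 500, "g2": 2_000, "g3": 1_000, "g4": 4_000}
sample1 = {"g1": 50, "g2": 100, "g3": 80, "g4": 120}
sample2 = {"g1": 40, "g2": 130, "g3": 60, "g4": 90}

rpk1 = reads_per_kb(sample1, lengths)
rpk2 = reads_per_kb(sample2, lengths)

genes = sorted(lengths)
rho_raw = spearman_rho([sample1[g] for g in genes], [sample2[g] for g in genes])
rho_rpk = spearman_rho([rpk1[g] for g in genes], [rpk2[g] for g in genes])
print(rho_raw, rho_rpk)
```

Dividing by gene length reorders the genes within each sample (long genes no longer rank high just because they collect more reads), which is why a per-kb normalization can change a rank-based statistic.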

Pipelines

The user can chain all the S–MART tools as he/she likes

Figure: workflow with the steps sample 1, sample 2, clusterize by sliding windows (once per sample), merge, keep loci with p-value 10⁻⁴, output

Differential expression with sliding windows, plotted.


Conclusions

• S–MART: a tool for manipulating and visualizing RNA-Seq high-throughput sequencing data.

• Especially useful for detecting differential expression.

• For labs with few bioinformaticians.

• Download it at http://urgi.versailles.inra.fr/index.php/urgi/Tools/S-MART


Acknowledgements

Computer science

Hadi Quesneville, URGI

URGI lab

Fly

Dominique Anxolabehere, IJM

Danielle Nouaud, IJM

Chantale Vaury, GReD

Sophie Desset, GReD

Silke Jensen, GReD

Arabidopsis

Herve Vaucheret, IJPB

Valerie Gaudin, IJPB

Sponsors
