S–MART: What can I do with all this RNA-Seq data?

Matthias Zytnicki

URGI, INRA

Introduction

from The Economist, Aug. 2010

S–MART 10/26/10 Matthias Zytnicki 2 / 18

Sequencers around the world

http://pathogenomics.bham.ac.uk/hts/

Applications

• pharmacology: correlating drug response with genome variation

• metagenomics: tag sequencing

• genetics: GWAS of families of four

• epigenetics: replication timing and histone modification

• genomics: gene fusions, variant detection with exon capture

Current problem

Differences between big and small labs

big labs may have

• their own sequencers

• their own clusters

• their own storage solutions

• their own bioinformaticians

small labs may have

• nothing

[Figure: number of labs vs. number of sequencers]

Comparison with tiling arrays

from Wang et al., 2009

Differential expression

Each dot: the number of reads overlapping one gene in each condition.

Count normalization problem

• each sample: 1 million reads

• one gene: 0 reads in one sample, 500,000 reads in the other

• other genes: expression unchanged

⇒ a gene with 2000 / 1000 reads is not differentially expressed!

• use housekeeping genes as reference

• use genes with average expression levels
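The arithmetic behind this slide can be checked with a small sketch. It uses the slide's numbers (1 million reads per sample, one gene at 0 / 500,000, one gene at 2000 / 1000) and rescales the second sample so that a set of reference genes has the same total in both samples. The gene names, the single "housekeeping" entry standing in for all unchanged genes, and the function `scale_to_reference` are hypothetical and are not taken from S–MART.

```python
def scale_to_reference(counts_1, counts_2, reference_genes):
    """Rescale sample 2 so the reference genes sum to the same total as in sample 1."""
    ref_1 = sum(counts_1[gene] for gene in reference_genes)
    ref_2 = sum(counts_2[gene] for gene in reference_genes)
    factor = ref_1 / ref_2
    return {gene: count * factor for gene, count in counts_2.items()}


# Toy data (hypothetical names): both samples contain exactly 1 million reads.
sample_1 = {"geneA": 0, "geneB": 2000, "housekeeping": 998_000}
sample_2 = {"geneA": 500_000, "geneB": 1000, "housekeeping": 499_000}

# Normalizing by the total number of reads changes nothing here (same totals),
# so geneB still looks halved. Using the reference genes gives a factor of 2.
scaled_2 = scale_to_reference(sample_1, sample_2, ["housekeeping"])
print(sample_1["geneB"], scaled_2["geneB"])
```

After rescaling, geneB has the same value in both samples, which is the point of the slide: the apparent two-fold change came only from geneA taking up half of the reads.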

Size-dependent normalization problem

[Figure: genes, sample 1, sample 2]

Introduction

Data:

• 1 RNA–Seq sample of wild type

• 1 RNA–Seq sample of mutant

• Get me all the genes which show differential expression.

• Use sliding windows instead.

• Actually, I want to have a bird's eye view of the transcription throughout the genome.

• Hey! I saw some paper which shows some nice dot plots for differential expression!

⇒ Better if the biologists can perform the work themselves.

⇒ There is no pre-defined pipeline.

⇒ Few lanes are usually sufficient.

⇒ Use S–MART!

Usual pipeline

• First step: align on a reference genome (with any tool)

• Result: genomic coordinates.

• S–MART:

  • is a set of independent tools

  • works on a standard PC

  • can be installed and used easily

  • uses nested bins and SQL indices

[Figure: sequencer → sequences → alignment → genomic coordinates]
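The slide mentions nested bins and SQL indices without giving details. The sketch below shows one common way such a scheme can work, and it is an assumption, not S–MART's actual implementation: each interval is stored in the smallest bin that fully contains it, and a query only scans the bins that can overlap the requested region through an indexed `bin` column. The bin sizes, table layout, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical bin sizes, from finest to coarsest; one extra top bin holds the rest.
BIN_SIZES = [1_000, 10_000, 100_000, 1_000_000]
LEVEL_OFFSET = 10_000_000  # keeps bin ids of different levels apart
TOP_BIN = len(BIN_SIZES) * LEVEL_OFFSET


def bin_of(start, end):
    """Return the id of the smallest bin that fully contains [start, end]."""
    for level, size in enumerate(BIN_SIZES):
        if start // size == end // size:
            return level * LEVEL_OFFSET + start // size
    return TOP_BIN


def create_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features "
        "(name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER, bin INTEGER)"
    )
    conn.execute("CREATE INDEX idx_chrom_bin ON features (chrom, bin)")
    return conn


def add_feature(conn, name, chrom, start, end):
    conn.execute(
        "INSERT INTO features VALUES (?, ?, ?, ?, ?)",
        (name, chrom, start, end, bin_of(start, end)),
    )


def overlapping(conn, chrom, start, end):
    """Return the features overlapping [start, end], scanning candidate bins only."""
    ranges = [(TOP_BIN, TOP_BIN)]
    for level, size in enumerate(BIN_SIZES):
        ranges.append(
            (level * LEVEL_OFFSET + start // size, level * LEVEL_OFFSET + end // size)
        )
    where = " OR ".join("bin BETWEEN ? AND ?" for _ in ranges)
    sql = (
        "SELECT name, start_pos, end_pos FROM features "
        "WHERE chrom = ? AND (" + where + ") "
        "AND start_pos <= ? AND end_pos >= ? ORDER BY start_pos"
    )
    params = [chrom] + [value for pair in ranges for value in pair] + [end, start]
    return conn.execute(sql, params).fetchall()


def build_demo():
    conn = create_store()
    add_feature(conn, "read1", "chr1", 100, 150)
    add_feature(conn, "read2", "chr1", 950, 1050)  # crosses a 1 kb boundary
    add_feature(conn, "read3", "chr1", 5000, 5050)
    add_feature(conn, "read4", "chr2", 100, 150)
    return conn
```

With this layout, a query such as `overlapping(build_demo(), "chr1", 120, 1000)` touches only a handful of bins per level instead of scanning every stored read, which is what makes the approach workable on a standard PC.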

Data manipulation

• use several mapping tools

• remove reads w.r.t. a reference set

• find the coverage w.r.t. a reference set

• cluster by sliding windows

• find transcription on both strands

[Figure: Fasta → Mosaik, Blat → coordinates]

Supported tools: Blast, Blat, BWA, Exonerate, MAQ, Mosaik, Nucmer, RMap, SeqMap, Shrimp, SOAP, ...
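Two of the operations listed on this slide, removing reads with respect to a reference set and clustering by sliding windows, can be sketched in a few lines. This is only an illustration on plain `(start, end)` tuples with inclusive coordinates. The function names, the window size, and the step are hypothetical and do not reflect S–MART's command-line tools or file formats.

```python
def overlaps(first, second):
    """True if two inclusive (start, end) intervals share at least one position."""
    return first[0] <= second[1] and first[1] >= second[0]


def remove_overlapping(reads, reference):
    """Keep only the reads that overlap no interval of the reference set."""
    return [read for read in reads if not any(overlaps(read, ref) for ref in reference)]


def sliding_window_counts(reads, chrom_length, window=50, step=25):
    """Count the reads overlapping each window; returns (win_start, win_end, count)."""
    result = []
    for win_start in range(1, chrom_length + 1, step):
        win_end = min(win_start + window - 1, chrom_length)
        count = sum(1 for read in reads if overlaps(read, (win_start, win_end)))
        result.append((win_start, win_end, count))
    return result


# Toy data: four reads on a 100 bp sequence and one reference interval (e.g. a tRNA).
toy_reads = [(1, 10), (5, 20), (30, 40), (60, 70)]
toy_reference = [(28, 45)]
print(remove_overlapping(toy_reads, toy_reference))
print(sliding_window_counts(toy_reads, 100))
```

The quadratic scans are kept for readability. On real data sets an index on the coordinates would be needed.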

[Figures on the following builds of this slide, labels only: reads, tRNA, result; reads, refSeq, result; reads, number (4, 40); reads, result]

Data visualization

• read size distribution

• nucleotide distribution

• density on the chromosomes

• distance with respect to genes

Differential expression

S–MART can find differentially expressed regions, which can be genes, TEs, miRNAs, sliding windows, etc. It uses Fisher's exact test for each region.

[Figure: sample 1, sample 2, result (regions marked ** or –)]
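As an illustration of a per-region Fisher's exact test, here is a minimal standard-library sketch. It assumes a 2×2 table made of the reads inside and outside the region in each sample. The slide does not say how S–MART builds its table, so this layout and the names `fisher_exact_two_sided` and `region_p_value` are assumptions.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]]."""
    row_1, row_2, col_1 = a + b, c + d, a + c
    total = row_1 + row_2

    def log_prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return (log_choose(row_1, x) + log_choose(row_2, col_1 - x)
                - log_choose(total, col_1))

    observed = log_prob(a)
    lowest, highest = max(0, col_1 - row_2), min(row_1, col_1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    p_value = sum(
        exp(log_prob(x))
        for x in range(lowest, highest + 1)
        if log_prob(x) <= observed + 1e-9
    )
    return min(1.0, p_value)


def region_p_value(count_1, total_1, count_2, total_2):
    """P-value for one region, given its read counts and the sample totals."""
    return fisher_exact_two_sided(
        count_1, total_1 - count_1, count_2, total_2 - count_2
    )


print(fisher_exact_two_sided(3, 1, 1, 3))
```

Because the table uses the raw totals of each sample, this test inherits the count normalization problem described earlier. Any rescaling has to be decided before the counts are compared.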

Normalizations

Use a dot plot of the number of reads per gene.

• no normalization

• normalization w.r.t. the number of reads

• normalization w.r.t. the interquartile

• number of reads per kb

• FDR of 5%

Spearman rho shown on the successive builds of this slide: 0.558867, 0.697121, 0.697153, 0.752267, 0.752267
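Two of the normalizations listed here, by the number of reads and by region size, reduce to simple rescalings. The sketch below shows them on toy counts. The function names and gene names are hypothetical, and the interquartile variant is left out because the slide does not define it.

```python
def per_million(counts):
    """Scale counts by the sample's total number of reads (reads per million)."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def per_kb(counts, lengths):
    """Scale counts by region length (reads per kb)."""
    return {gene: count * 1000 / lengths[gene] for gene, count in counts.items()}


# Toy data (hypothetical genes): g2 has more reads only because it is longer.
toy_counts = {"g1": 100, "g2": 300}
toy_lengths = {"g1": 500, "g2": 3000}

print(per_million(toy_counts))
print(per_kb(toy_counts, toy_lengths))
print(per_kb(per_million(toy_counts), toy_lengths))
```

Applying both steps, as in the last line, gives reads per kb per million reads, a widely used combined measure.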

Pipelines

The user can chain all the S–MART tools as he or she likes.

[Figure: sample 1 → clusterize by sliding windows; sample 2 → clusterize by sliding windows; merge output → keep loci with p-value 10^-4]

Differential expression with sliding windows, plotted.

Conclusions

• S–MART: a tool for the manipulation and visualization of RNA-Seq high-throughput sequencing data.

• Especially useful for detecting differential expression.

• For the labs with few bioinformaticians.

• Download it at http://urgi.versailles.inra.fr/index.php/urgi/Tools/S-MART

Acknowledgements

Computer science

Hadi Quesneville, URGI

URGI lab

Fly

Dominique Anxolabehere, IJM

Danielle Nouaud, IJM

Chantale Vaury, GReD

Sophie Desset, GReD

Silke Jensen, GReD

Arabidopsis

Herve Vaucheret, IJPB

Valerie Gaudin, IJPB

Sponsors
