
Page 1: The Data Tsunami in  Biomedical  Research

The Data Tsunami in Biomedical Research

Guillaume Bourque

McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University

June 5th, 2013

Page 2: The Data Tsunami in  Biomedical  Research

2

Next-generation sequencing (NGS)

Stein, Genome Biol. 2010

Page 3: The Data Tsunami in  Biomedical  Research

3

Falling cost of sequencing

DeWitt, Nat. Biotechnol. 2012

Page 4: The Data Tsunami in  Biomedical  Research

Sequencing human genomes

The Human Genome Project (2001): ~$3 billion

1000 Genomes Project (2011): ~$10,000

Your genome (2013?): $100–1,000

Page 5: The Data Tsunami in  Biomedical  Research

5

Outline

• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions

Page 6: The Data Tsunami in  Biomedical  Research

6

Sequencing Revolution

http://www.brusselsgenetics.be

Sanger sequencing: 100s of reactions… 10,000s of base pairs…

Next-Generation sequencing: millions of reactions! Billions of base pairs!

Metzker, Nat. Rev. Genet. 2010

Page 7: The Data Tsunami in  Biomedical  Research

High-throughput Sequencing

2009: 6 Gbases per run (36 bp × 20M reads × 8 lanes)

2013: 600 Gbases per run (2 × 150 bp × 250M reads × 8 lanes)

200 Human Genomes in 1 run!
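
A quick back-of-the-envelope check of those 2013 numbers (a minimal sketch in Python, using the approximate figures from the slide):

# Sketch: verify the per-run throughput quoted for a 2013 Illumina run.
read_length = 150          # bp, one mate of a 2 x 150 bp paired-end read
read_pairs_per_lane = 250e6  # ~250M read pairs per lane
lanes = 8
bases_per_run = 2 * read_length * read_pairs_per_lane * lanes
print(f"{bases_per_run / 1e9:.0f} Gbases per run")          # ~600 Gbases

human_genome = 3e9         # ~3 billion bp
print(f"~{bases_per_run / human_genome:.0f} genome equivalents (1x) per run")  # ~200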

Page 8: The Data Tsunami in  Biomedical  Research

NGS Technology Comparison

PacBio
• Method: single-molecule real-time sequencing
• Read length: 3 kb average
• Error type: indel
• Single-pass error rate: ~13%
• Reads per run: 35,000–75,000
• Time per run: 30 minutes to 2 hours
• Cost per 1 million bases: US$2
• Advantages: longest read length; fast
• Disadvantages: low yield at high accuracy; equipment can be very expensive

Ion Torrent
• Method: ion semiconductor sequencing
• Read length: 200 bp
• Error type: indel
• Single-pass error rate: ~1%
• Reads per run: up to 4M
• Time per run: 2 hours
• Cost per 1 million bases: US$1
• Advantages: less expensive equipment; fast
• Disadvantages: homopolymer errors

454
• Method: pyrosequencing
• Read length: 700 bp
• Error type: indel
• Single-pass error rate: ~0.1%
• Reads per run: 1M
• Time per run: 24 hours
• Cost per 1 million bases: US$10
• Advantages: long read size; fast
• Disadvantages: runs are expensive; homopolymer errors

Illumina
• Method: sequencing by synthesis
• Read length: 50 to 250 bp
• Error type: substitution
• Single-pass error rate: ~0.1%
• Reads per run: up to 3.2G
• Time per run: 1 to 10 days
• Cost per 1 million bases: US$0.05 to $0.15
• Advantages: high sequence yield, cost and accuracy
• Disadvantages: equipment can be very expensive

SOLiD
• Method: sequencing by ligation
• Read length: 50+35 or 50+50 bp
• Error type: A-T bias
• Single-pass error rate: ~0.1%
• Reads per run: 1.2 to 1.4G
• Time per run: 1 to 2 weeks
• Cost per 1 million bases: US$0.13
• Advantages: low cost per base
• Disadvantages: slower than other methods; concerns over read length and longevity of the platform

Page 9: The Data Tsunami in  Biomedical  Research

9

Genome Canada

• > $915M investment and > $900M in co-funding
• 100s of large-scale genomics projects
• 5 Innovation centers

Page 10: The Data Tsunami in  Biomedical  Research

10

Outline

• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions

Page 11: The Data Tsunami in  Biomedical  Research

Applications (I)

• De novo sequencing
– From the human genome…
– To all model organisms…
– To all relevant organisms (e.g. extreme genomes)…
– To “all” organisms?

11

Page 12: The Data Tsunami in  Biomedical  Research

12

Human Genome

• 3 billion DNA base pairs (bp)
• Two human genomes are ~99.9% identical
• There are about ~3M bp differences between you and me
• Some of these differences explain variation in:
– Disease susceptibility
– Differences in drug metabolism
– …

www.dnacenter.com
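
A quick sanity check of those figures (a minimal sketch using the approximate values above):

# Sketch: 0.1% of a 3-billion-bp genome is ~3 million differing base pairs.
genome_size = 3e9       # bp
identity = 0.999        # two genomes are ~99.9% identical
differences = genome_size * (1 - identity)
print(f"~{differences / 1e6:.0f} million bp differences")   # ~3 million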

Page 13: The Data Tsunami in  Biomedical  Research

Applications (II)

• Genome re-sequencing
– Genetic disorders
– Cancer genome sequencing
– Map genomic structural variations across individuals
– Genealogy and migration
– Agricultural crops
– …

13

1000 Genomes Project

The Cancer Genome Atlas

Page 14: The Data Tsunami in  Biomedical  Research

14

Exome sequencing for Mendelian disease

“… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.”

“Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”

Page 15: The Data Tsunami in  Biomedical  Research

15

Exome sequencing

Page 16: The Data Tsunami in  Biomedical  Research

16

Cancer genome sequencing

Can obtain a full catalogue of mutations

Page 17: The Data Tsunami in  Biomedical  Research

Michael Stromberg, bioinformatics.ca

Page 18: The Data Tsunami in  Biomedical  Research

18

Mutations in paediatric glioblastoma

Jabado, Pfister and Majewski

Page 19: The Data Tsunami in  Biomedical  Research

19

Mutations in paediatric glioblastoma

Sequenced the exomes of 48 paediatric GBM samples, found:

• Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours

• Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3 in 31% of tumours

Page 20: The Data Tsunami in  Biomedical  Research

Applications (III)

• Quantitative biology of complex systems
– New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, …
– From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome
– Important systems: stem cells, cancer, infectious diseases…

20

Page 21: The Data Tsunami in  Biomedical  Research

21

Outline

• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions

Page 22: The Data Tsunami in  Biomedical  Research

High-throughput Sequencing

2009: 6 Gbases per run (36 bp × 20M reads × 8 lanes)

2013: 600 Gbases per run (2 × 150 bp × 250M reads × 8 lanes)

200 Human Genomes in 1 run!

Page 23: The Data Tsunami in  Biomedical  Research

Big Data

(Figure: raw data volumes for a 2013 sequencing run — image files ~70 TBytes; intensity files and reads + qualities on the order of 1 to 20 TBytes)

Page 24: The Data Tsunami in  Biomedical  Research

Big Data

(Figure: 2013 data volumes for intensity files and reads + qualities, scaled to 12 instruments — ~12 TBytes and ~240 TBytes)

25 TB of raw data / month
300 TB of raw data / year

From: Alexandre Montpetit
Subject: News from Illumina
Date: 4 June, 2013, 2:15:16 PM EDT
To: Guillaume Bourque

From Mark Van Oene (Illumina VP of Sales): over the next year we should expect 2x more reads in 2x less time (and 2x longer reads). Is that a problem?

Alex
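
Taken at face value, that projection (2x the reads, each read 2x longer, in half the time) implies roughly 8x more bases per unit time. A rough sketch of the storage impact, assuming raw data volume scales linearly with bases sequenced (an assumption for illustration, not a figure from the talk):

# Sketch: storage impact if output grows as the Illumina email suggests.
current_tb_per_month = 25          # from the slide
reads_factor = 2                   # 2x more reads
length_factor = 2                  # reads 2x longer
time_factor = 2                    # in 2x less time
growth = reads_factor * length_factor * time_factor   # 8x more bases per unit time
print(f"~{current_tb_per_month * growth} TB of raw data per month")   # ~200 TB/month
print(f"~{current_tb_per_month * growth * 12} TB per year")           # ~2,400 TB (~2.4 PB) per year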

Page 25: The Data Tsunami in  Biomedical  Research

Large NGS project

Cancer project with whole genome data:

• 500 tumours vs 500 matched-normal samples
• Per group: 500 × 3 lanes = 500 × 250 GB = 125 TB raw
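
The arithmetic behind those numbers, as a small sketch (per-sample size taken from the slide):

# Sketch: raw-data footprint of the whole-genome cancer project described above.
samples_per_group = 500        # 500 tumours, 500 matched normals
gb_per_sample = 250            # 3 lanes ~= 250 GB per whole genome
group_tb = samples_per_group * gb_per_sample / 1000
print(f"{group_tb:.0f} TB raw per group, {2 * group_tb:.0f} TB for tumours + normals")
# -> 125 TB per group, 250 TB total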

Page 26: The Data Tsunami in  Biomedical  Research

26

DNA bases sequenced at the Innovation Center

(Figure: DNA bases sequenced over time)

12 HiSeqs

72 trillion bases! Or 800 genomes at 30X
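
Checking the equivalence quoted on the slide (a minimal sketch):

# Sketch: 72 trillion bases is roughly 800 human genomes at 30x coverage.
genomes = 800
coverage = 30
genome_size = 3e9              # bp
total_bases = genomes * coverage * genome_size
print(f"{total_bases / 1e12:.0f} trillion bases")   # 72 trillion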

Page 27: The Data Tsunami in  Biomedical  Research

27

adventure.nationalgeographic.com

Page 28: The Data Tsunami in  Biomedical  Research

Biomedical research is built on data integration

Your data

Page 29: The Data Tsunami in  Biomedical  Research

Biomedical research is built on data integration

100X

Your data

Page 30: The Data Tsunami in  Biomedical  Research

30

Challenges

• NGS instruments generate TBs of data
• NGS instruments are getting faster and cheaper, and will increasingly be found in small research labs and hospitals
• Data sharing and integration are critical in biomedical research
• Sequencing data is sensitive personal data and is identifiable

Page 31: The Data Tsunami in  Biomedical  Research

31

Outline

• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions

Page 32: The Data Tsunami in  Biomedical  Research

32

Nanuq software

Has tracked data and meta-data for more than:

• 2.6 million sample aliquots
• 20,500 reagents
• 17,000 plates
• 140,000 tubes
• Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.)
• 3,900 external users

Page 33: The Data Tsunami in  Biomedical  Research

33

Standardized analysis pipelines

• ChIP-Seq analysis report
• RNA-Seq analysis report
• Methylation analysis report
• …

Page 34: The Data Tsunami in  Biomedical  Research

34

Data center at the Innovation Center

> 1,200 cores
> 2 PB disk
> 5 PB tape

Page 35: The Data Tsunami in  Biomedical  Research

35

Need more!

McGill Guillimin – 16000 cores

UdeS Mammouth – 39168 cores

Page 36: The Data Tsunami in  Biomedical  Research

Data processing issues

• We have many different projects all needing space and processing.

• We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users).

• This brings uniformity problems:
– Different hardware and software setups
– Different configurations
– Etc.

Page 37: The Data Tsunami in  Biomedical  Research

Our strategy

• We wrote our analysis pipelines to be easily configurable across clusters.

• Same code, one ini file to customize (we already have templates for 3 cluster sites)

• We install Linux environment modules, readable by all users, on all these clusters, so we know exactly what software is available everywhere.

• We also deploy common genomes across sites.
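
A minimal sketch of the kind of per-cluster configuration this strategy implies. The file name, section names, keys and module versions below are hypothetical illustrations, not the actual pipeline configuration:

# Sketch: same pipeline code, one ini file per cluster (hypothetical example).
import configparser

# Contents of a hypothetical "guillimin.ini"; a sibling "mammouth.ini" would
# differ only in paths, scheduler settings and module versions.
example_ini = """
[cluster]
scheduler = pbs
genome_dir = /path/to/shared/genomes
modules = java/1.7 bwa/0.6.2 samtools/0.1.18
"""

config = configparser.ConfigParser()
config.read_string(example_ini)

# The pipeline then emits job scripts that load the same named modules on
# every site, so the software environment is identical everywhere.
for module in config["cluster"]["modules"].split():
    print(f"module load {module}")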

Page 38: The Data Tsunami in  Biomedical  Research

38

Usage on Compute Canada

Page 39: The Data Tsunami in  Biomedical  Research

39

Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC)

$1.5M (2012-2017)

Page 40: The Data Tsunami in  Biomedical  Research

40

PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)

Page 41: The Data Tsunami in  Biomedical  Research

Conclusions

• NGS offers a variety of technologies and numerous exciting applications
• Many areas of NGS data analysis are still under active development (e.g. RNA-Seq)
• A major challenge is to ensure that compute and storage capacities are sufficient so they do not limit more advanced analyses
• We need to work together to avoid duplicating effort in installing tools, but also to develop efficient ways to use HPC in biomedical research

Page 42: The Data Tsunami in  Biomedical  Research

Acknowledgements

IT team
Terrance Mcquilkin
Marc-André Labonté
Genevieve Dancausse
Andras Frankel
Alexandru Guja

Development team
Nathalie Émond
David Bujold
Francois Cantin
Catherine Côté
Burak Demirtas
Daniel Guertin
Louis Dumond Joseph
Francois Korbuly
Marc Michaud
Thuong Ngo

Analysis team
Louis Letourneau
Mathieu Bourgey
Maxime Caron
Gary Lévesque
Robert Eveleigh
Francois Lefebvre
Johanna Sandoval
Pascale Marquis

[email protected]

EDCC team
David Morais (UdeS)
Carol Gauthier (UdeS)
Bryan Caron (McGill)
Alain Veilleux (UdeS)
ME Rousseau (McGill)

Page 43: The Data Tsunami in  Biomedical  Research

43

Questions?