24
Thanks to: NHLBI Grant: HL089102 Rajiv Dulepet Michael Angerman (Constantin Georgescu) Ellen Rothenberg & lab Barbara Wold & lab (Caltech) Guy Naor, Cosmin Andriescu (SparkTech Soft) Martin Morgan (FHCRC) Reproducible Research in Genomics and Systems Biology Hamid Bolouri (FHCRC), 8 th S t b 2010 High throughput molecular and cell biology Small sample sizes Commoditized equipment Low-cost per experiment Global (genome-wide) data Rapidly evolving Currently dominated by Re-sequencing ChIP-seq RNA-seq But also Proteomics, Metabolomics Screens (RNAi, drugs, mutants) Biomarker assays Single-cell assays (microfluidics, FACS, microscopy) Large, noisy datasets Complex analysis needed Results are statistical & parameter/method dependent Accessible to individual labs Global data, focused hypotheses Bursts of heavy computing Opportunity: iterative data exploration Large-scale cataloguing projects (TGCA, ENCODE, Epigenome, 1000 genomes, Bacteriome…) Need to share data + analysis Opportunity: Technology characteristics Computational implications

Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

1

Thanks to:

NHLBI Grant: HL089102

Rajiv DulepetMichael Angerman(Constantin Georgescu)

Ellen Rothenberg & labBarbara Wold & lab(Caltech)

Guy Naor, Cosmin Andriescu(SparkTech Soft)

Martin Morgan (FHCRC)

Reproducible Research in Genomics and Systems Biology Hamid Bolouri (FHCRC), 8th

S t b 2010

High throughput molecular and cell biology

• Small sample sizes• Commoditized equipment• Low-cost per experiment• Global (genome-wide) data• Rapidly evolving

• Currently dominated by– Re-sequencing– ChIP-seq– RNA-seq

• But also – Proteomics, Metabolomics– Screens (RNAi, drugs,

mutants)– Biomarker assays– Single-cell assays

(microfluidics, FACS, microscopy)

• Large, noisy datasets– Complex analysis needed– Results are statistical &

parameter/method dependent

• Accessible to individual labs– Global data, focused

hypotheses – Bursts of heavy computing– Opportunity:

iterative data exploration

• Large-scale cataloguing projects

– (TGCA, ENCODE, Epigenome, 1000 genomes, Bacteriome…)

– Need to share data + analysis– Opportunity:

Technology characteristics Computational implications

Page 2: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

2

Stra

tton

et a

l, N

atur

e, 2

009,

458

:719

—72

4

US

$ pe

r fin

ishe

d ba

se p

air

Cou

rtesy

of R

ay K

urzw

eil

Lusc

ombe

et a

l, G

enom

e B

iolo

gy 2

000,

1(

1):re

view

s001

.1–0

01.3

7

RNA polymerase

initiation complex

Genes make RNAs, which make proteins, some of which regulate gene expression

expr

essi

on

low

high

Time 2Time 1

RNAregulatory complex

Georgescu et al, PNAS 2008

Page 3: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

3

Eve Mardis, Nature Methods, 2007, (4):61

1. Cross-link proteins to DNA

2. Fragment chromatin to 100-150bp

3. Immunoprecipitate antibody-bound DNA fragments

4. Reverse cross-links and sequence fragment ends

5. Map sequence reads to genome

6. Identify genomic regions with enriched number of mapped reads

ChIP-seq Overview

Barbara Wold’s lab, Science, 2007, 16(5830):1497-502

A typical pipeline:

(Hes1 ChIPseq in HSCsBernstein lab, FHCRC)

> 500 lines of R

Uses 12 R “packages”

Includes > 10 majoruser-defined parameters

(e.g. # read mismatches, Baye’s PP, ncRNAs?)

> 200 interim files

Page 4: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

4

…………

Zhang et al, Biometrics 2010, Jun 1st [Epub ahead of print]

Two types of complexity in high throughput data processing – (1) difficult math

Two types of complexity in high throughput data processing – (2) difficult choices

e.g. for ChIP-seq:

• Average fragment length

• Scan window size

• Read count significance threshold

• Background reads distribution

• Sensitivity vs. specificity

FAR

Sen

sitiv

ity

0 10

1

FAR

Sen

sitiv

ity

0 10

1

Page 5: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

5

Torres, Metta, Ottenwälder & Schlötterer, Genome Research, 2007, 18:000

Ordahl, Johnson & CaplanNAR, 1976, 3(11):2985-2999

0 500 1000 1500 2000DNA fragment length (base pairs)

expected fragment frequency

sequenced fragment frequency

Nor

mal

ized

frag

men

t cou

nt

Fragment length (X100bp)100bp200bp300bp400bp

N1 Hes1

DNA fragments (excluding 120bp adapters) are

asymmetrically distributed ~ 50bp–280bp(data: Suzanne Furuyama, Bernstein lab, FHCRC)

Kharchenko, Tolstorukov and Park

Page 6: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

6

Sarkar, Gentleman, Lawrence, Zhang, YaoKharchenko, Tolstorukov and Park

Estimating the average DNA fragment length from the strand-specific tag shift

Assuming average fragment length=75bp Assuming average fragment length=136bp

5

Number of reads

log(

num

bero

f reg

ions

)

log(

num

bero

f reg

ions

)

Number of reads

Page 7: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

7

Dat

a S

uzan

ne F

uruy

ama

& Ir

win

Ber

nste

in, F

HC

RC

bone marrow

macrophages

microglia

osteoblasts

follicularB cells

Predicted Hes1 binding site

http

://bi

ogps

.gnf

.org

/

Ly9 expression is repressed in the bone marrow

Dat

a S

uzan

ne F

uruy

ama

& Ir

win

Ber

nste

in, F

HC

RC

Page 8: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

8

Control? ParametersAlgorithm strands? Map? Significance score FDR measure? TF/histones?

Spyrou et al, BMC Bioinformatics 2009, 10:299

GABP FOXA1

Zhang et al, Biometrics 2010, Jun 1st [Epub ahead of print]

Page 9: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

9

(+ 2 MS Excel files)

(Barbara Wold’s lab)

Page 10: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

10

292 public datasets so far

NIH Roadmap Epigenomics consortium

Order of magnitude estimates of biological data sizes

• Read data for 1 human genome ~ 250 GB

• Annotated variants for 1 human genome ~ 25 GB

• 1000 genomes ~ 25 TB

• ChIP-seq read data for 1 TF in 1 condition ~ 1.5 GB

• ChIP-seq read data for 2,500 TFs in 200 cell types ~ 750 TB

• RNA-seq read data for 1 gene (ELAND format) ~ 500 MB

• RNA-seq read data for 22,000 genes in 200 cell types ~ 2.2 PB

A typical query is NOT a search, but a complex data

transformation

Page 11: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

11

Bringing computation to the lab bench

• Point & Click user interface

• Hide the math, highlight and enable choices

• Large-scale computing resources available on demand

• No software installation/maintenance

• No IT infrastructureNo need for sys-admin, obsolescence, space, cooling, etc.

Empowering the stakeholders

• Enable experimental biologists to interactively explore their data

Alternative analysis methods

Alternative parameter choices

Alternative views of results/data

• Relieve computational biologists from non-specialist tasks

Allow focus on strengths e g method

Web-services +

Cloud Computing

Resourcesharing

Scalability

• Number of users (jobs being run)

• Number/size of scripts and datasets

• Size of individual jobs (CPU + memory required)

• CostProcessor Queues auto-scale with demand

Users can have private nodes and shared nodes

Evolvability

• Diversity of users: freely available to all, open source

• Diversity of content: data & scripts provided by the community

• Community curated

• Inclusive architecture supports multiple OS (Windows, Unix/Linux, Mac)

multiple languages (Perl Python Java coming)

Web-services +

Cloud Computing

Resourcesharing

Page 12: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

12

Page 13: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

13

Page 14: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

14

Manage script files and directories

Scripts:

create

view

Page 15: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

15

Create & delete processing queues

Create & delete processing nodes for queues

Page 16: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

16

Choose memory and processing node type

Page 17: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

17

Page 18: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

18

(1956, re-published in 1998)(The Lancet 1902, re-published in 1996 & 2002)

Page 19: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

19

eosinophil counts ( per unit volume) in 4,458 Icelandic people

σµX

frequ

ency

Gudbjartsson et al, Nature Genetics, 2009, 41(3):3423–47.

prop

ortio

n of

indi

vidu

als

0 20 40

Duramad et al, Cancer Epidemiology, Biomarkers and Preventation, 2004,13(9):1452–8

% TH1 (INF-γ expressing) cells among CD4+ cells in blood

143 healthy middle-aged blood donors

Num

ber o

f ind

ivid

uals

C-reactive protein concentration in blood (mg/L)

0.1 1.1 2.1 3.1

30

20

10

0

EM Macy et al, Clinical Chemistry 1997, 43(1):52–58.

[Cre

atin

eK

inas

eBB

]

10

100

1000

Non

-AM

I

AM

I

Zweig and Campbell, Clinical Chemistry, 1993, 39(4):561-577

Page 20: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

20

Comparative Toxicogenomics Database http://ctd.mdibl.org/

(395 substances affect Akt1 expression. The graph below excludes Akt1 and the substances that affect it & other genes.)

Of 165 Wnt-pathway associated genes, only 14 have no associations in TCD:

2,896 environmental interactions affect 151 out of 165 of Wnt-pathway genes

porc

wls

wnt9b

wnt8b

wnt8a

trappc6b

trappc5

skp1

shisa9

shisa8

shisa7

shisa6

shisa3

shisa2%

thro

mbi

in m

ale

sudd

en c

oron

ary

deat

h

acut

e th

rom

bus

plaq

ue ru

ptur

epl

aque

ero

sion

stab

le p

laqu

e

Age group

Tracy & Russell, Current Opinion in Lipidology, 2008, 19:151–157

Page 21: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

21

The human microbiome

varia

tion

intra

inte

r

intra

inte

r

intra

inte

rCostello et al, Science 2009, 326(5960):1694 - 1697

The Human Microbiome ProjectNY Times, 12th July 2010

(Stephen Quake’s genome)

Page 22: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

22

Abbreviations:

BCR=B-cell receptor,

Ca=Calcium,

ErbB=the v-erb-b2 erythroblastic leukemia viral oncogene family (Receptor Tyrosine Kinases),

Hh=Hedgehog, PI=Phosphatidylinositol,

TCR=T-cell receptor,

TLR=Toll-like receptor.

DNA sequence variations affecting cellular signaling genes in two individuals

(W=Watson, V=Venter)

frameshif tW frameshiftV nonsynonymousW nonsynonymousV UTR_W UTR_V splice_siteW splice_siteV No. genesBCR 0 0 55 52 120 117 2 2 75Ca 3 3 99 96 327 325 1 1 178ErbB 3 3 29 24 161 187 0 1 87Hh 0 0 24 32 74 83 1 3 56JakStat 0 0 77 47 228 228 0 2 155MAPK 1 0 128 127 436 451 9 3 269mTOR 0 0 20 18 94 110 0 4 52Notch 0 0 20 19 56 61 1 1 47PI 0 0 55 61 159 175 0 0 76TCR 0 0 49 42 153 163 1 0 108TGFβ 0 0 27 28 85 139 0 1 87TLR 0 0 43 39 126 139 12 8 101VEGF 0 0 50 47 114 117 1 0 76Wnt 0 2 55 51 279 246 1 2 163

DNA sequence variations affecting cellular signaling genes in two individuals

overlapWatson total Venter total (non-synonymous + frameshift)

total entries

in dbSNP

BCR Ca ErbB Hh JakStat

MAPK mTOR Notch Phosphatidylinositol TCR

TGFβ TLR VEGF Wnt

26 29 23 49 53 46 17 15 12 7 17 15 55 22 25

63 66 61 11 9 9 17 3 16 21 34 27 27 22 20

7 20 8 23 20 19 18 32 15 29 26 27

Watson Venter Watson Venter Watson Venter Watson Venter

Watson Venter Watson Venter Watson Venter Watson Venter

Watson Venter Watson Venter Watson Venter Watson Venter

Watson Venter

Watson Venter

(805) (2740) (1122) (501)

(3792) (1068) (1250) (1010)

(1468)

(1141)

(720) (1033) (883) (1273)

Page 23: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

23

• Genome annotation DBs

• RNA structure & function DBs

• Protein structure & function DBs

• Pathway structure & function DBs

• Mutation effect prediction algorithms

• Environmental effects DBs

• Drug effects and interactions DBs

• Electronic health records

• Pathway-based calculation of interaction effects

• . . .

What is the appropriate computational infrastructure?

To predict an individual’s genomic susceptibilities,

we will need to integrate data from:

SummaryEmpowering individuals, labs, and consortia

An analysis portal for (all) public data

A place to explore and replicate published work

A place to compare quality of available data &

methods

To come

Support for multiple languages & Clouds

Strong data encryption for individual genomesInteractive graphics

Semantic searches

Page 24: Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

24

Thanks to:

NHLBI Grant: HL089102

Rajiv DulepetMichael Angerman(Constantin Georgescu)

Ellen Rothenberg & labBarbara Wold & lab(Caltech)

Guy Naor, Cosmin Andriescu(SparkTech Soft)

Martin Morgan (FHCRC)