Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided

1

Thanks to:

NHLBI Grant: HL089102

Rajiv DulepetMichael Angerman(Constantin Georgescu)

Ellen Rothenberg & labBarbara Wold & lab(Caltech)

Guy Naor, Cosmin Andriescu(SparkTech Soft)

Martin Morgan (FHCRC)

Reproducible Research in Genomics and Systems Biology Hamid Bolouri (FHCRC), 8th

S t b 2010

High throughput molecular and cell biology

• Small sample sizes• Commoditized equipment• Low-cost per experiment• Global (genome-wide) data• Rapidly evolving

• Currently dominated by– Re-sequencing– ChIP-seq– RNA-seq

• But also – Proteomics, Metabolomics– Screens (RNAi, drugs,

mutants)– Biomarker assays– Single-cell assays

(microfluidics, FACS, microscopy)

• Large, noisy datasets– Complex analysis needed– Results are statistical &

parameter/method dependent

• Accessible to individual labs– Global data, focused

hypotheses – Bursts of heavy computing– Opportunity:

iterative data exploration

• Large-scale cataloguing projects

– (TGCA, ENCODE, Epigenome, 1000 genomes, Bacteriome…)

– Need to share data + analysis– Opportunity:

Technology characteristics Computational implications

2

Stra

tton

et a

l, N

atur

e, 2

009,

458

:719

—72

4

US

$ pe

r fin

ishe

d ba

se p

air

Cou

rtesy

of R

ay K

urzw

eil

Lusc

ombe

et a

l, G

enom

e B

iolo

gy 2

000,

1(

1):re

view

s001

.1–0

01.3

7

RNA polymerase

initiation complex

Genes make RNAs, which make proteins, some of which regulate gene expression

expr

essi

on

low

high

Time 2Time 1

RNAregulatory complex

Georgescu et al, PNAS 2008

3

Eve Mardis, Nature Methods, 2007, (4):61

1. Cross-link proteins to DNA

2. Fragment chromatin to 100-150bp

3. Immunoprecipitate antibody-bound DNA fragments

4. Reverse cross-links and sequence fragment ends

5. Map sequence reads to genome

6. Identify genomic regions with enriched number of mapped reads

ChIP-seq Overview

Barbara Wold’s lab, Science, 2007, 16(5830):1497-502

A typical pipeline:

(Hes1 ChIPseq in HSCsBernstein lab, FHCRC)

> 500 lines of R

Uses 12 R “packages”

Includes > 10 majoruser-defined parameters

(e.g. # read mismatches, Baye’s PP, ncRNAs?)

> 200 interim files

4

…………

Zhang et al, Biometrics 2010, Jun 1st [Epub ahead of print]

Two types of complexity in high throughput data processing – (1) difficult math

Two types of complexity in high throughput data processing – (2) difficult choices

e.g. for ChIP-seq:

• Average fragment length

• Scan window size

• Read count significance threshold

• Background reads distribution

• Sensitivity vs. specificity

FAR

Sen

sitiv

ity

0 10

1

FAR

Sen

sitiv

ity

0 10

1

5

Torres, Metta, Ottenwälder & Schlötterer, Genome Research, 2007, 18:000

Ordahl, Johnson & CaplanNAR, 1976, 3(11):2985-2999

0 500 1000 1500 2000DNA fragment length (base pairs)

expected fragment frequency

sequenced fragment frequency

Nor

mal

ized

frag

men

t cou

nt

Fragment length (X100bp)100bp200bp300bp400bp

N1 Hes1

DNA fragments (excluding 120bp adapters) are

asymmetrically distributed ~ 50bp–280bp(data: Suzanne Furuyama, Bernstein lab, FHCRC)

Kharchenko, Tolstorukov and Park

6

Sarkar, Gentleman, Lawrence, Zhang, YaoKharchenko, Tolstorukov and Park

Estimating the average DNA fragment length from the strand-specific tag shift

Assuming average fragment length=75bp Assuming average fragment length=136bp

5

Number of reads

log(

num

bero

f reg

ions

)

log(

num

bero

f reg

ions

)

Number of reads

7

Dat

a S

uzan

ne F

uruy

ama

& Ir

win

Ber

nste

in, F

HC

RC

bone marrow

macrophages

microglia

osteoblasts

follicularB cells

Predicted Hes1 binding site

http

://bi

ogps

.gnf

.org

/

Ly9 expression is repressed in the bone marrow

Dat

a S

uzan

ne F

uruy

ama

& Ir

win

Ber

nste

in, F

HC

RC

8

Control? ParametersAlgorithm strands? Map? Significance score FDR measure? TF/histones?

Spyrou et al, BMC Bioinformatics 2009, 10:299

GABP FOXA1

Zhang et al, Biometrics 2010, Jun 1st [Epub ahead of print]

9

(+ 2 MS Excel files)

(Barbara Wold’s lab)

10

292 public datasets so far

NIH Roadmap Epigenomics consortium

Order of magnitude estimates of biological data sizes

• Read data for 1 human genome ~ 250 GB

• Annotated variants for 1 human genome ~ 25 GB

• 1000 genomes ~ 25 TB

• ChIP-seq read data for 1 TF in 1 condition ~ 1.5 GB

• ChIP-seq read data for 2,500 TFs in 200 cell types ~ 750 TB

• RNA-seq read data for 1 gene (ELAND format) ~ 500 MB

• RNA-seq read data for 22,000 genes in 200 cell types ~ 2.2 PB

A typical query is NOT a search, but a complex data

transformation

11

Bringing computation to the lab bench

• Point & Click user interface

• Hide the math, highlight and enable choices

• Large-scale computing resources available on demand

• No software installation/maintenance

• No IT infrastructureNo need for sys-admin, obsolescence, space, cooling, etc.

Empowering the stakeholders

• Enable experimental biologists to interactively explore their data

Alternative analysis methods

Alternative parameter choices

Alternative views of results/data

• Relieve computational biologists from non-specialist tasks

Allow focus on strengths e g method

Web-services +

Cloud Computing

Resourcesharing

Scalability

• Number of users (jobs being run)

• Number/size of scripts and datasets

• Size of individual jobs (CPU + memory required)

• CostProcessor Queues auto-scale with demand

Users can have private nodes and shared nodes

Evolvability

• Diversity of users: freely available to all, open source

• Diversity of content: data & scripts provided by the community

• Community curated

• Inclusive architecture supports multiple OS (Windows, Unix/Linux, Mac)

multiple languages (Perl Python Java coming)

Web-services +

Cloud Computing

Resourcesharing

12

13

14

Manage script files and directories

Scripts:

create

view

15

Create & delete processing queues

Create & delete processing nodes for queues

16

Choose memory and processing node type

17

18

(1956, re-published in 1998)(The Lancet 1902, re-published in 1996 & 2002)

19

eosinophil counts ( per unit volume) in 4,458 Icelandic people

−

σµX

frequ

ency

Gudbjartsson et al, Nature Genetics, 2009, 41(3):3423–47.

prop

ortio

n of

indi

vidu

als

0 20 40

Duramad et al, Cancer Epidemiology, Biomarkers and Preventation, 2004,13(9):1452–8

% TH1 (INF-γ expressing) cells among CD4+ cells in blood

143 healthy middle-aged blood donors

Num

ber o

f ind

ivid

uals

C-reactive protein concentration in blood (mg/L)

0.1 1.1 2.1 3.1

30

20

10

0

EM Macy et al, Clinical Chemistry 1997, 43(1):52–58.

[Cre

atin

eK

inas

eBB

]

10

100

1000

Non

-AM

I

AM

I

Zweig and Campbell, Clinical Chemistry, 1993, 39(4):561-577

20

Comparative Toxicogenomics Database http://ctd.mdibl.org/

(395 substances affect Akt1 expression. The graph below excludes Akt1 and the substances that affect it & other genes.)

Of 165 Wnt-pathway associated genes, only 14 have no associations in TCD:

2,896 environmental interactions affect 151 out of 165 of Wnt-pathway genes

porc

wls

wnt9b

wnt8b

wnt8a

trappc6b

trappc5

skp1

shisa9

shisa8

shisa7

shisa6

shisa3

shisa2%

thro

mbi

in m

ale

sudd

en c

oron

ary

deat

h

acut

e th

rom

bus

plaq

ue ru

ptur

epl

aque

ero

sion

stab

le p

laqu

e

Age group

Tracy & Russell, Current Opinion in Lipidology, 2008, 19:151–157

21

The human microbiome

varia

tion

intra

inte

r

intra

inte

r

intra

inte

rCostello et al, Science 2009, 326(5960):1694 - 1697

The Human Microbiome ProjectNY Times, 12th July 2010

(Stephen Quake’s genome)

22

Abbreviations:

BCR=B-cell receptor,

Ca=Calcium,

ErbB=the v-erb-b2 erythroblastic leukemia viral oncogene family (Receptor Tyrosine Kinases),

Hh=Hedgehog, PI=Phosphatidylinositol,

TCR=T-cell receptor,

TLR=Toll-like receptor.

DNA sequence variations affecting cellular signaling genes in two individuals

(W=Watson, V=Venter)

frameshif tW frameshiftV nonsynonymousW nonsynonymousV UTR_W UTR_V splice_siteW splice_siteV No. genesBCR 0 0 55 52 120 117 2 2 75Ca 3 3 99 96 327 325 1 1 178ErbB 3 3 29 24 161 187 0 1 87Hh 0 0 24 32 74 83 1 3 56JakStat 0 0 77 47 228 228 0 2 155MAPK 1 0 128 127 436 451 9 3 269mTOR 0 0 20 18 94 110 0 4 52Notch 0 0 20 19 56 61 1 1 47PI 0 0 55 61 159 175 0 0 76TCR 0 0 49 42 153 163 1 0 108TGFβ 0 0 27 28 85 139 0 1 87TLR 0 0 43 39 126 139 12 8 101VEGF 0 0 50 47 114 117 1 0 76Wnt 0 2 55 51 279 246 1 2 163

DNA sequence variations affecting cellular signaling genes in two individuals

overlapWatson total Venter total (non-synonymous + frameshift)

total entries

in dbSNP

BCR Ca ErbB Hh JakStat

MAPK mTOR Notch Phosphatidylinositol TCR

TGFβ TLR VEGF Wnt

26 29 23 49 53 46 17 15 12 7 17 15 55 22 25

63 66 61 11 9 9 17 3 16 21 34 27 27 22 20

7 20 8 23 20 19 18 32 15 29 26 27

Watson Venter Watson Venter Watson Venter Watson Venter



Watson Venter

Watson Venter

(805) (2740) (1122) (501)

(3792) (1068) (1250) (1010)

(1468)

(1141)

(720) (1033) (883) (1273)

23

• Genome annotation DBs

• RNA structure & function DBs

• Protein structure & function DBs

• Pathway structure & function DBs

• Mutation effect prediction algorithms

• Environmental effects DBs

• Drug effects and interactions DBs

• Electronic health records

• Pathway-based calculation of interaction effects

• . . .

What is the appropriate computational infrastructure?

To predict an individual’s genomic susceptibilities,

we will need to integrate data from:

SummaryEmpowering individuals, labs, and consortia

An analysis portal for (all) public data

A place to explore and replicate published work

A place to compare quality of available data &

methods

To come

Support for multiple languages & Clouds

Strong data encryption for individual genomesInteractive graphics

Semantic searches

24

Thanks to:

NHLBI Grant: HL089102

Rajiv DulepetMichael Angerman(Constantin Georgescu)

Ellen Rothenberg & labBarbara Wold & lab(Caltech)

Guy Naor, Cosmin Andriescu(SparkTech Soft)

Martin Morgan (FHCRC)

Documents

Thanks to - Research · Sarkar, Gentleman, Lawrence, Zhang, Yao ... • Diversity of users: freely available to all, open source • Diversity of content: data & scripts provided