Thanks to:
NHLBI Grant: HL089102
Rajiv Dulepet, Michael Angerman (Constantin Georgescu)
Ellen Rothenberg & lab, Barbara Wold & lab (Caltech)
Guy Naor, Cosmin Andriescu (SparkTech Soft)
Martin Morgan (FHCRC)
Reproducible Research in Genomics and Systems Biology
Hamid Bolouri (FHCRC), 8th September 2010
High throughput molecular and cell biology

Technology characteristics
• Small sample sizes
• Commoditized equipment
• Low-cost per experiment
• Global (genome-wide) data
• Rapidly evolving
• Currently dominated by
– Re-sequencing
– ChIP-seq
– RNA-seq
• But also
– Proteomics, Metabolomics
– Screens (RNAi, drugs, mutants)
– Biomarker assays
– Single-cell assays (microfluidics, FACS, microscopy)

Computational implications
• Large, noisy datasets
– Complex analysis needed
– Results are statistical & parameter/method dependent
• Accessible to individual labs
– Global data, focused hypotheses
– Bursts of heavy computing
– Opportunity: iterative data exploration
• Large-scale cataloguing projects
– (TCGA, ENCODE, Epigenome, 1000 genomes, Bacteriome…)
– Need to share data + analysis
– Opportunity:
(Figure: US$ per finished base pair. Stratton et al, Nature, 2009, 458:719–724. Courtesy of Ray Kurzweil)
(Figure: Luscombe et al, Genome Biology 2000, 1(1):reviews001.1–001.37)
Genes make RNAs, which make proteins, some of which regulate gene expression
(Figure labels: RNA polymerase initiation complex; regulatory complex; RNA; expression low to high at Time 1 and Time 2)
Georgescu et al, PNAS 2008
ChIP-seq Overview
Eve Mardis, Nature Methods, 2007, (4):61
1. Cross-link proteins to DNA
2. Fragment chromatin to 100-150bp
3. Immunoprecipitate antibody-bound DNA fragments
4. Reverse cross-links and sequence fragment ends
5. Map sequence reads to genome
6. Identify genomic regions with enriched number of mapped reads
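Steps 5 and 6 of the overview can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the talk: it assumes reads are already mapped to single start coordinates, uses fixed non-overlapping windows, and scores each window against a uniform Poisson background. The function names, the 200bp window and the 1e-5 cutoff are all hypothetical choices.

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)          # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i            # P(X = i) from P(X = i - 1)
        cdf += term
    # Direct summation loses precision far in the tail; clamp at zero.
    return max(0.0, 1.0 - cdf)

def enriched_windows(read_starts, genome_length, window=200, p_cutoff=1e-5):
    """Return (window_start, read_count, p_value) for windows holding more
    reads than expected under a uniform background (assumed, not estimated
    from a control sample)."""
    n_windows = math.ceil(genome_length / window)
    counts = Counter(pos // window for pos in read_starts)
    lam = len(read_starts) / n_windows   # mean reads per window
    hits = []
    for idx in sorted(counts):
        p = poisson_sf(counts[idx], lam)
        if p < p_cutoff:
            hits.append((idx * window, counts[idx], p))
    return hits

# Toy data: one background read every 100bp plus a cluster of 40 reads.
reads = list(range(0, 100_000, 100)) + list(range(5000, 5040))
for start, count, p in enriched_windows(reads, 100_000):
    print(start, count, p)
```

Changing `window` or `p_cutoff` changes which regions are reported, which is the kind of parameter dependence the following slides discuss.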
Barbara Wold’s lab, Science, 2007, 316(5830):1497-502
A typical pipeline:
(Hes1 ChIP-seq in HSCs; Bernstein lab, FHCRC)
> 500 lines of R
Uses 12 R “packages”
Includes > 10 major user-defined parameters
(e.g. # read mismatches, Bayes' PP, ncRNAs?)
> 200 interim files
Zhang et al, Biometrics 2010, Jun 1st [Epub ahead of print]
Two types of complexity in high throughput data processing – (1) difficult math
Two types of complexity in high throughput data processing – (2) difficult choices
e.g. for ChIP-seq:
• Average fragment length
• Scan window size
• Read count significance threshold
• Background reads distribution
• Sensitivity vs. specificity
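The last choice in the list, sensitivity vs. specificity, can be made concrete with a small sketch. It assumes that "FAR" on the accompanying plots means false alarm rate (false positives divided by all truly unbound windows), and it uses made-up window counts with known labels; neither the data nor the function name comes from the talk.

```python
def sensitivity_and_far(counts, is_bound, threshold):
    """Call a window 'bound' when its read count >= threshold, then compare
    the calls with the known labels.

    Returns (sensitivity, false alarm rate)."""
    tp = fp = fn = tn = 0
    for count, bound in zip(counts, is_bound):
        called = count >= threshold
        if called and bound:
            tp += 1
        elif called:
            fp += 1
        elif bound:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return sensitivity, far

# Hypothetical read counts per window and whether each is truly bound.
counts = [2, 3, 5, 8, 12, 4, 15, 1, 9, 6]
is_bound = [False, False, False, True, True, False, True, False, True, False]

for threshold in (4, 6, 9, 13):
    sens, far = sensitivity_and_far(counts, is_bound, threshold)
    print(f"threshold={threshold:2d}  sensitivity={sens:.2f}  FAR={far:.2f}")
```

Raising the read count threshold lowers the false alarm rate but eventually costs sensitivity, so the threshold is a choice rather than a fact of the data.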
(Figure: two plots of sensitivity against FAR, axes from 0 to 1)
Torres, Metta, Ottenwälder & Schlötterer, Genome Research, 2007, 18:000
Ordahl, Johnson & Caplan, NAR, 1976, 3(11):2985-2999
(Figure: expected and sequenced fragment frequency against DNA fragment length, 0 to 2000 base pairs)
(Figure: normalized fragment count against fragment length, 100bp to 400bp; N1, Hes1)
DNA fragments (excluding 120bp adapters) are asymmetrically distributed ~ 50bp–280bp (data: Suzanne Furuyama, Bernstein lab, FHCRC)
Kharchenko, Tolstorukov and Park
Sarkar, Gentleman, Lawrence, Zhang, Yao
Kharchenko, Tolstorukov and Park
Estimating the average DNA fragment length from the strand-specific tag shift
(Figure: log(number of regions) against number of reads, in two panels: assuming average fragment length = 75bp, and assuming average fragment length = 136bp)
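The idea of estimating fragment length from the strand-specific tag shift can be sketched as follows. This is a simplified stand-in for the cross-correlation methods cited on the slide, not a reimplementation of them: it assumes tags are reduced to 5' start coordinates per strand and simply picks the shift at which the most plus-strand tags line up with minus-strand tags. `estimate_tag_shift` and `max_shift` are hypothetical names.

```python
from collections import Counter

def estimate_tag_shift(plus_starts, minus_starts, max_shift=300):
    """Return the shift (in bp) that maximises the number of plus-strand tags
    landing exactly on a minus-strand tag when moved downstream.

    The best shift approximates the average fragment length."""
    plus_counts = Counter(plus_starts)
    minus_counts = Counter(minus_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        score = sum(n * minus_counts.get(pos + shift, 0)
                    for pos, n in plus_counts.items())
        if score > best_score:      # strict '>' keeps the smallest best shift
            best_shift, best_score = shift, score
    return best_shift

# Toy example: every minus-strand tag sits 136bp downstream of a plus tag.
plus = [100, 250, 400, 1000, 1300]
minus = [p + 136 for p in plus]
print(estimate_tag_shift(plus, minus))
```

Real tag positions are noisy, so exact coincidence counting is too brittle in practice; binning positions or smoothing the score across neighbouring shifts would be needed before trusting the estimate.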
Ly9 expression is repressed in the bone marrow
(Figure labels: predicted Hes1 binding site; bone marrow, macrophages, microglia, osteoblasts, follicular B cells. Source: http://biogps.gnf.org/)
Data: Suzanne Furuyama & Irwin Bernstein, FHCRC
(Table columns: Algorithm, Parameters, Control?, Strands?, Map?, Significance score, FDR measure?, TF/histones?)
Spyrou et al, BMC Bioinformatics 2009, 10:299
GABP FOXA1
Zhang et al, Biometrics 2010, Jun 1st [Epub ahead of print]
(+ 2 MS Excel files)
(Barbara Wold’s lab)
292 public datasets so far
NIH Roadmap Epigenomics consortium
Order of magnitude estimates of biological data sizes
• Read data for 1 human genome ~ 250 GB
• Annotated variants for 1 human genome ~ 25 GB
• 1000 genomes ~ 25 TB
• ChIP-seq read data for 1 TF in 1 condition ~ 1.5 GB
• ChIP-seq read data for 2,500 TFs in 200 cell types ~ 750 TB
• RNA-seq read data for 1 gene (ELAND format) ~ 500 MB
• RNA-seq read data for 22,000 genes in 200 cell types ~ 2.2 PB
A typical query is NOT a search, but a complex data transformation
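The order-of-magnitude totals above follow from simple multiplication, sketched here. The sketch assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and reads the 1000 genomes figure as 1000 sets of annotated variants at about 25 GB each; that reading is an assumption, since the slide does not say how the total was derived.

```python
GB = 1
TB = 1000 * GB
PB = 1000 * TB

def total_size_gb(per_unit_gb, *multipliers):
    """Multiply a per-unit size (in GB) by each scaling factor in turn."""
    total = per_unit_gb
    for m in multipliers:
        total *= m
    return total

# ChIP-seq: ~1.5 GB per TF per condition, 2,500 TFs, 200 cell types.
chipseq_gb = total_size_gb(1.5, 2500, 200)
# RNA-seq: ~500 MB (0.5 GB) per gene, 22,000 genes, 200 cell types.
rnaseq_gb = total_size_gb(0.5, 22_000, 200)
# Annotated variants: ~25 GB per genome, 1000 genomes (assumed reading).
variants_gb = total_size_gb(25, 1000)

print(f"ChIP-seq: {chipseq_gb / TB:.0f} TB")
print(f"RNA-seq:  {rnaseq_gb / PB:.1f} PB")
print(f"Variants: {variants_gb / TB:.0f} TB")
```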
Bringing computation to the lab bench
• Point & Click user interface
• Hide the math, highlight and enable choices
• Large-scale computing resources available on demand
• No software installation/maintenance
• No IT infrastructure
No need for sys-admin, obsolescence, space, cooling, etc.
Empowering the stakeholders
• Enable experimental biologists to interactively explore their data
Alternative analysis methods
Alternative parameter choices
Alternative views of results/data
• Relieve computational biologists from non-specialist tasks
Allow focus on strengths e.g. method
Web-services + Cloud Computing
Resource sharing
Scalability
• Number of users (jobs being run)
• Number/size of scripts and datasets
• Size of individual jobs (CPU + memory required)
• Cost
Processor Queues auto-scale with demand
Users can have private nodes and shared nodes
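One way a processor queue might auto-scale with demand is sketched below. The talk does not describe the actual scaling policy, so everything here is hypothetical: the function name, the jobs-per-node capacity and the node limits are illustrative only.

```python
import math

def nodes_needed(pending_jobs, jobs_per_node=4, min_nodes=0, max_nodes=20):
    """Hypothetical scaling rule: enough nodes to hold every pending job,
    clamped to the queue's configured minimum and maximum."""
    if jobs_per_node <= 0:
        raise ValueError("jobs_per_node must be positive")
    wanted = math.ceil(pending_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, wanted))

# Queue depth over time and the node count the rule would request.
for pending in (0, 1, 9, 1000):
    print(pending, "jobs ->", nodes_needed(pending), "nodes")
```

A private queue could set `min_nodes` above zero to keep nodes warm, while a shared queue could let the count fall to zero when idle; the cap bounds cost during bursts of heavy computing.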
Evolvability
• Diversity of users: freely available to all, open source
• Diversity of content: data & scripts provided by the community
• Community curated
• Inclusive architecture supports multiple OS (Windows, Unix/Linux, Mac), multiple languages (Perl, Python, Java coming)
Manage script files and directories
Scripts: create, view
Create & delete processing queues
Create & delete processing nodes for queues
Choose memory and processing node type
(1956, re-published in 1998)
(The Lancet 1902, re-published in 1996 & 2002)
Eosinophil counts (per unit volume) in 4,458 Icelandic people
(Figure: frequency against (X − µ)/σ)
Gudbjartsson et al, Nature Genetics, 2009, 41(3):342–347.
(Figure: proportion of individuals; axis 0 to 40)
Duramad et al, Cancer Epidemiology, Biomarkers and Prevention, 2004, 13(9):1452–8
% TH1 (IFN-γ expressing) cells among CD4+ cells in blood
143 healthy middle-aged blood donors
(Figure: number of individuals, 0 to 30, against C-reactive protein concentration in blood, 0.1 to 3.1 mg/L)
EM Macy et al, Clinical Chemistry 1997, 43(1):52–58.
(Figure: [Creatine Kinase BB], 10 to 1000, for Non-AMI and AMI)
Zweig and Campbell, Clinical Chemistry, 1993, 39(4):561-577
Comparative Toxicogenomics Database http://ctd.mdibl.org/
(395 substances affect Akt1 expression. The graph below excludes Akt1 and the substances that affect it & other genes.)
2,896 environmental interactions affect 151 out of 165 Wnt-pathway genes
Of 165 Wnt-pathway associated genes, only 14 have no associations in CTD:
porc, wls, wnt9b, wnt8b, wnt8a, trappc6b, trappc5, skp1, shisa9, shisa8, shisa7, shisa6, shisa3, shisa2
(Figure: % thrombi in male sudden coronary death by age group; categories: acute thrombus, plaque rupture, plaque erosion, stable plaque)
Tracy & Russell, Current Opinion in Lipidology, 2008, 19:151–157
The human microbiome
(Figure: variation, intra vs. inter, in three panels)
Costello et al, Science 2009, 326(5960):1694-1697
The Human Microbiome Project, NY Times, 12th July 2010
(Stephen Quake’s genome)
Abbreviations:
BCR=B-cell receptor,
Ca=Calcium,
ErbB=the v-erb-b2 erythroblastic leukemia viral oncogene family (Receptor Tyrosine Kinases),
Hh=Hedgehog, PI=Phosphatidylinositol,
TCR=T-cell receptor,
TLR=Toll-like receptor.
DNA sequence variations affecting cellular signaling genes in two individuals
(W=Watson, V=Venter)
Pathway  frameshiftW  frameshiftV  nonsynonymousW  nonsynonymousV  UTR_W  UTR_V  splice_siteW  splice_siteV  No. genes
BCR      0            0            55              52              120    117    2             2             75
Ca       3            3            99              96              327    325    1             1             178
ErbB     3            3            29              24              161    187    0             1             87
Hh       0            0            24              32              74     83     1             3             56
JakStat  0            0            77              47              228    228    0             2             155
MAPK     1            0            128             127             436    451    9             3             269
mTOR     0            0            20              18              94     110    0             4             52
Notch    0            0            20              19              56     61     1             1             47
PI       0            0            55              61              159    175    0             0             76
TCR      0            0            49              42              153    163    1             0             108
TGFβ     0            0            27              28              85     139    0             1             87
TLR      0            0            43              39              126    139    12            8             101
VEGF     0            0            50              47              114    117    1             0             76
Wnt      0            2            55              51              279    246    1             2             163
DNA sequence variations affecting cellular signaling genes in two individuals
Watson total, Venter total and overlap (non-synonymous + frameshift)
Pathway               Watson only  Overlap  Venter only
BCR                   26           29       23
Ca                    49           53       46
ErbB                  17           15       12
Hh                    7            17       15
JakStat               55           22       25
MAPK                  63           66       61
mTOR                  11           9        9
Notch                 17           3        16
Phosphatidylinositol  21           34       27
TCR                   27           22       20
TGFβ                  7            20       8
TLR                   23           20       19
VEGF                  18           32       15
Wnt                   29           26       27
Total entries in dbSNP: (805) (2740) (1122) (501) (3792) (1068) (1250) (1010) (1468) (1141) (720) (1033) (883) (1273)
To predict an individual’s genomic susceptibilities, we will need to integrate data from:
• Genome annotation DBs
• RNA structure & function DBs
• Protein structure & function DBs
• Pathway structure & function DBs
• Mutation effect prediction algorithms
• Environmental effects DBs
• Drug effects and interactions DBs
• Electronic health records
• Pathway-based calculation of interaction effects
• . . .
What is the appropriate computational infrastructure?
Summary
Empowering individuals, labs, and consortia
An analysis portal for (all) public data
A place to explore and replicate published work
A place to compare quality of available data & methods
To come
Support for multiple languages & Clouds
Strong data encryption for individual genomes
Interactive graphics
Semantic searches