Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Summary of ENCODE Accomplishments
Michael Snyder
On Behalf of the ENCODE
Consortium
March 10, 2015
Conflicts: Personalis, Genapsys, AxioMx
ENCODE: Three Phases (2003-present) Many Experimental Assays
Modified from PLoS Biol 9:e1001046, 2011 Science 306:636, 2004
ENCODE Dimensions
Methods/Factors
Cells
387 C
ell
Lin
es/ T
issues
1,700 Assays (~886 different TFs/~300 RBPs)
3,331 (8,832) Experiments
Chip-seq (239 TFs
+ Histone marks;
5101 data sets)
iCLIP (308 RBPs)
RNA-seq (643)
DNAse-seq (449)
High Quality Data
• > Two biological replicates
• Multiple quality control measures
IDR Processing, QC and Blacklist Filtering
Poor reproducibility
Good reproducibility
Rep1
Re
p2
ENCODE Uniform Analysis Pipeline
Mapped reads
Uniform Peak Calling Pipeline (SPP, PeakSeq)
Poor reproducibility Good reproducibility
Rep1
Re
p2
Anshul Kundaje
Uniform peak Calling (SPP, PeakSeq)
Quality Control
Derived Data (Chromosome Segments,
Expression)
Processing & Element Calling Compatible with Other Projects: GTEx, REMC, IHEC
Established Standards For Community
• ChIP-Seq
• DNAseHS
• RNA-Seq
Antibody characterization, Biological replicates,
QC measures
Genome Res.
2012
….
Assays/Data Types
7
Human
Mouse
1
10
100
1000
10000Started/Proposed
Done/Submitted
Exp
erim
ents
1
10
100
1000
10000
Started/Proposed
Done/Submitted
Exp
erim
ents
Deep Exploration of Some Lines Using Many Assays
8
0 500 1000 1500 2000
HeLa-S3
H1-hESC
A549
GM12878
MCF-7
HepG2
K562
Histone ChIP-Seq
TF ChIP-Seq
RBP iCLIP
RBP RNAi/RNA-seq
TF RNAi/RNA-seq2
ChIA-PET
RIP-seq
Number of Factors
Some Assays Were Conducted Across a Broad Range of Biosamples
9
0
50
100
150
200
250
300
350
RN
A-s
eq
TF C
hIP
-Seq
DN
ase-
seq
His
ton
e C
hIP
-Seq
RR
BS
RA
MP
AG
E
RN
A p
rofi
ling
by
arra
y…
DN
aseI
com
par
ativ
e ge
no
mic
…
DN
A m
eth
ylat
ion
…
DN
ase-
seq
…
WG
BS
DN
ase-
seq
(d
eep
)
ATA
C-S
eq
CA
GE
FAIR
E-se
q
Rep
li-ch
ip
wh
ole
-gen
om
e…
Rep
li-se
q
Nan
ost
rin
g
RN
A-P
ET
met
hyl
C-s
eq 5C
mic
roR
NA
pro
filin
g b
y…
Ch
IA-P
ET
MeD
IP-S
EQ a
ssay
MR
E-se
q
DN
ase-
seq
(lo
ng)
RIP
-seq
shR
NA
kn
ock
do
wn
…
iCLI
P
DN
A-P
ET
MN
ase-
seq
RN
A-S
eq
fo
llow
ing
TF…
RN
A B
ind
-n-S
eq
Un
iqu
e B
iosa
mp
les
Unique Biosample Types
020406080
100120140160180200
Human
Mouse
10
ENCODE Data
11
Cloud Storage and Computing Data available at Amazon Web Services (AWS) Uniform processing pipelines will be available at DNAnexus
related projects
Highly Searchable
ENCODE Data Open Access
12
New ENCODE Portal https://www.encodeproject.org
13
10 Computational Groups
15
Analyzing data in a variety of different ways
• GWAS
• Cancer
• Regulatory principles
Software Tools
16
• >30 Different algorithms
• Wide variety of areas. Examples:
– Segmentation
– Allele calling
– 3D nuclear analysis
– Data processing and peak calling
– Data quality control
Summary of Impact • Lots of open diverse data types on same cell
lines/tissues
• Experimental standards
• New analysis methods
• Methods and standards adopted by other large communities: e.g GTEx, REMC, betaCell, CIRM
• Data are widely used
17
18
Feb 2015 >750 Papers NonENCODE + 150 mod- ENCODE
ENCODE Publications
ENCODE Community Publications
Outreach Activities • Tutorials:https://www.encodeproject.org/tutorials • (http://www.genome.gov/27553900)
• CHARGE-ENCODE workshop • User’s meeting in 2015
20
Additional High Level Impact
21
1) Segmenting genome into types of elements
2) Gene regulatory principles
3) GWAS
>85% of lead SNPs lie outside of coding regions
a b
c
de
Phenotype−associated SNPs
Random sampling of matched SNPs
Genotyped SNPs
1000 Genomes
24 Personal genomes
chr5:40,390,001-40,440,000 (50,000 bp)
rs4613763 rs17234657 rs11742570
rs6451493
rs1992660
rs6896969 rs1373692 rs9292777
HUVEC G ATA2
HUVEC cFOS
HUVEC Input
HUVEC
Jurkat
Th1
Th2
chr5:
G W AS Catalog
Human Feb. 2009 (GRCh37/hg19) chr5:39,274,501-40,819,500 (1,545,000 bp)
39500000 40000000 40500000
C9
DAB2
BC026261
PTGER4TTC33
OSRF
PRKAA1
DNase I
TFs
Crohn’s disease
ulcerative colitis
multiple sclerosis
DNaseI peaks TF
Fraction of SNPs that ov
erlap feature
0
0.1
0.2
0.3
log2 (fold en
richment or depletion)
−0.5
0.0
0.5
1.0
1.5
E WETSSPF
CTCFT
D/R
GM12878
genes above
threshold
G O:0006955 immune responseG WAS enrichment10
-log p-value
Phenotype SN
P-P
hen
o as
soci
atio
ns
over
lap
any
TF
occu
panc
y
Gm
1287
8Ebf
Gm
1287
8Pol
24h8
Gm
1287
8Pol
2
Hel
as3C
ebpb
Igg
rab
K56
2Ctc
f
Gm
1287
8Pu1
Gm
1287
8Mef
2a
K56
2Pol
2V04
1610
1
Gm
1287
8Pax
5c20
Gm
1287
8Nfk
bIg
grab
Hep
g2C
tcf
Hu
vecG
ata2
Ucd
Gm
1287
8Elf1
sc63
1V04
1610
1
Gm
1287
8Egr
1V04
1610
1
Hep
g2M
afka
b503
22Ig
grab
Hu
vecC
fosU
cd
Gm
1287
8Bcl
11a
Gm
1287
8Irf
4
Gm
1287
8Bat
f
Gm
1287
8Tcf
12
Hep
g2Fo
sl2
K56
2Tal
1sc1
2984
Igg
mus
K56
2Max
V04
1610
2
Hep
g2Tc
f4U
cd
Hep
g2Fo
xa2
Hep
g2P
300V
0416
101
Hep
g2Fo
xa1c
20
Hu
vecC
tcf
Hel
as3C
tcf
Hep
g2Ju
nd
CA
CO
2.D
S82
35
HU
VE
C.a
ll
Hep
G2.
all
Jurk
at.D
S12
659
hTH
1.al
l
hTH
2.D
S78
42
CD
34.D
S16
814
TOTAL 4860 600 78 57 69 69 72 47 47 71 54 35 54 29 44 28 48 50 38 35 45 37 37 44 62 33 57 46 62 40 55 47 70 85 118 62 192 57 81
Height 204 34 7 3 3 7 6 1 3 2 3 2 6 0 4 6 3 2 3 5 5 2 0 2 3 1 2 0 2 5 4 3 3 6 5 4 9 3 7
Systemic_lupus_erythematosus 62 10 4 6 6 2 1 1 4 0 1 4 1 1 4 2 0 1 2 3 4 2 1 0 1 0 0 0 0 1 1 1 1 2 0 0 4 2 1
Crohn's_disease 105 20 2 2 2 2 1 2 2 0 2 1 2 5 1 1 1 3 2 1 1 0 2 1 1 2 1 2 3 2 3 1 3 6 5 3 9 5 5
Ulcerative_colitis 85 11 2 3 3 0 1 2 3 1 3 3 1 2 3 2 1 1 2 1 2 2 0 2 2 1 0 2 2 0 1 1 3 2 5 3 7 2 3
Multiple_sclerosis 71 15 4 3 3 1 0 3 4 2 4 2 0 2 2 1 0 2 4 3 2 3 0 3 1 0 0 0 0 0 0 0 0 1 1 3 5 4 3
Rheumatoid_arthritis 57 11 4 2 2 1 0 4 3 0 4 4 0 0 1 1 0 0 1 0 2 2 0 1 0 0 0 0 0 0 0 0 0 2 2 1 11 3 1
LDL_cholesterol 45 8 0 0 0 2 2 1 0 4 1 0 1 0 1 0 1 0 0 0 0 0 3 0 3 2 3 2 2 1 1 3 1 0 2 1 0 1 0
Bone_mineral_density 65 9 1 1 1 1 2 2 2 1 2 1 1 0 2 2 2 0 1 2 1 1 0 0 1 0 2 2 3 1 1 1 2 2 4 3 3 2 3
Coronary_heart_disease 107 17 2 0 0 2 4 0 0 4 1 2 0 2 0 0 1 1 1 0 0 1 1 1 3 1 2 2 2 1 1 1 3 2 3 0 6 0 1
Chronic_lymphocytic_leukemia 17 8 1 4 5 0 0 3 1 0 2 1 0 0 2 0 1 0 2 1 1 2 0 1 0 1 0 0 0 0 0 0 1 0 0 0 2 0 1
Prostate_cancer 56 8 0 0 0 0 0 0 0 1 0 0 2 1 0 0 3 2 0 0 0 0 2 1 1 4 3 3 3 0 0 2 2 3 1 1 2 0 1
Triglycerides 48 10 0 0 0 1 2 0 0 2 1 0 2 1 1 0 2 2 0 0 0 0 3 1 2 1 2 2 1 2 2 3 0 2 1 0 2 1 0
Celiac_disease 54 11 4 3 3 0 2 2 1 1 1 2 0 0 1 0 0 0 0 1 1 1 1 0 1 1 1 1 2 0 1 2 0 0 0 2 2 1 2
Colorectal_cancer 18 5 0 0 0 1 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 2 0 0 2 3 3 3 0 0 2 2 0 1 0 2 0 1
Hematological_parameters 85 12 3 0 0 3 1 0 3 0 1 1 2 0 0 0 2 0 1 1 2 2 1 1 0 0 1 1 1 0 0 0 1 1 3 3 6 3 5
HIV-1_control 55 10 0 2 4 1 2 0 1 2 2 0 0 0 1 0 0 0 1 1 1 1 1 1 2 1 1 2 1 0 0 1 0 0 0 0 2 1 0
Protein_quantitative_trait_loci 48 7 2 2 2 0 0 2 1 1 1 0 0 1 2 2 1 0 2 1 2 1 0 0 1 0 0 0 0 0 0 1 1 1 2 1 2 1 1
Alzheimer's_disease 42 5 0 0 0 1 2 0 0 1 0 0 2 0 0 0 0 1 0 0 0 0 1 1 1 0 1 1 0 2 2 1 1 2 0 0 2 0 1
HDL_cholesterol 55 8 1 0 0 1 1 0 0 2 1 0 1 0 0 0 1 2 0 0 0 1 2 1 1 1 2 2 2 1 1 2 0 1 2 0 3 1 0
Cholesterol 16 6 1 0 0 0 2 0 0 2 2 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 2 1 1 0 0 2 0 1 1 0
Longevity 30 5 0 2 3 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 2 1 2 0 2 1 0 0 1
Attention_deficit_hyperactivity_disorder 102 9 0 0 0 1 2 0 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 2 0 1 0 0
Cognitive_performance 111 8 0 0 2 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 1 3 0 0 0 0
Type_2_diabetes 97 13 0 0 0 1 1 2 1 1 1 0 1 1 0 0 2 1 1 1 1 0 0 3 1 0 0 0 0 2 1 0 1 0 2 0 4 0 0
Conduct_disorder 38 5 0 1 1 1 3 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 2 0 0 1 2 2 1 0 1
Type_1_diabetes 67 7 2 1 1 0 0 2 1 0 1 1 0 0 1 0 0 0 1 0 1 0 0 1 1 1 1 2 2 0 1 0 1 0 2 1 5 1 1
Dialysis-related_mortality 26 6 1 0 0 1 1 1 0 0 0 0 2 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 2 1 0 0 1
Bipolar_disorder 110 6 1 0 0 2 1 0 0 1 0 0 0 1 0 0 0 3 0 0 0 0 1 0 1 0 0 0 0 0 0 1 2 3 1 0 2 0 1
Body_mass 98 5 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 3 0 1 0 0
C-reactive_protein 34 7 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1 0 2 1 0 0 1
Menarche_and_menopause 62 6 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 1 0 0 2 1 1 0
Breast_cancer 43 6 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 2 2 0 0
Mean_platelet_volume 15 5 1 0 0 0 0 1 0 0 1 0 0 0 0 0 2 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 2 0 1
Soluble_levels_of_adhesion_molecules 5 5 1 0 1 1 0 1 0 2 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Psoriasis 38 6 1 1 2 0 0 2 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 4 0 0
Parkinson's_disease 46 5 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0
Obesity 36 5 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fasting_glucose-related_traits 17 5 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 1 1 1 1 0 0
Reported Association
Functional SNP
GWAS (Japanese): T Yamauchi et al. A
genome-wide association study in the
Japanese population identifies susceptibility
loci for type 2 diabetes at UBE2E2 and
C2CD4A-C2CD4B. Nature Genetics 42, 864–
868 (2010).
GWAS (Danish): N. Grarup et al. The
diabetogenic VPS13C/C2CD4A/C2CD4B
rs7172432 variant impairs glucose-stimulated
insulin response in 5,722 non-diabetic Danish
individuals. Diabetologia (2011) 54:789–794
Example: rs7172432 in Type 2 Diabetes
1.0
Alan P Boyle RegulomeDB
Altered View of the Human Genome 2003
• 25,000 Protein Coding Genes (1.5%)
• Few Non Coding Genes (Mostly tRNAs, snoRNAs)
• Little regulatory information mapped
23
2015
• 20,000 Protein Coding Genes
• Thousands of noncoding genes
• More potential regulatory DNA than
protein coding DNA
The ENCODE 3 Consortium
http://www.genome.gov/26525220 24
Alternative/Additional Slides
Three Phases
I) Pilot Phase -1% of Genome (2003-2007)
II) Scale Up Phase I (2007-2012)
III) Current Production Phase (2012-2016)
Related Projects:
Mouse ENCODE (2009-2012)
modENCODE (2007-2012)
27
Cancer
Autoimmunity, allergy
Neurological/psychiatric
Cardiovascular
Developmental
Metabolism
Infectious disease
Musculo-skeletal
Reproductive/dimorphism
Ophthalmology
Dermatological
Hematology
Human Genetics
Multiple diseases
Respiratory
Aging
Cancer 35%
Autoimmunity/Allergy
15%
Neuro/ Psych 13%
CV 7%
Categories of Disease-Related ENCODE Community Publications
28
Standard ENCODE Use Cases: Hypothesis Generation
• Prediction of:
• causal variants/regulatory elements
• target genes
• target cell types
• mechanism for phenotype changes
29
The Genome Is Active!
RNA
Kellis et al 2014
ENCODE Publications
31
ENCODE Timeline
32
33
The ENCODE Consortium Phase 3 Brad Bernstein (Eric Lander, Manolis Kellis, Tony Kouzarides)
Ewan Birney (Jim Kent, Mark Gerstein, Bill Noble, Peter Bickel, Ross Hardison, Zhiping Weng)
Greg Crawford (Ewan Birney, Jason Lieb, Terry Furey, Vishy Iyer)
Jim Kent (David Haussler, Kate Rosenbloom) Mike Cherry
John Stamatoyannopoulos (Evan Eichler, George Stamatoyannopoulos, Job Dekker, Maynard Olson, Michael Dorschner, Patrick
Navas, Phil Green)
Mike Snyder (Kevin Struhl, Mark Gerstein, Peggy Farnham, Sherman Weissman)
Rick Myers (Barbara Wold)
Scott Tenenbaum (Luiz Penalva)
Tim Hubbard (Alexandre Reymond, Alfonso Valencia, David Haussler, Ewan Birney, Jim Kent, Manolis Kellis, Mark Gerstein, Michael Brent,
Roderic Guigo)
Tom Gingeras (Alexandre Reymond, David Spector, Greg Hannon, Michael Brent, Roderic Guigo, Stylianos Antonarakis, Yijun Ruan,
Yoshihide Hayashizaki)
Zhiping Weng (Nathan Trinklein, Rick Myers)
Brenton Graveley (John Rinn, Others)
.. and many senior scientists, postdocs, students, technicians, computer scientists, statisticians and administrators in these groups
NHGRI: Elise Feingold, Mike Pazin, Peter Good
Metadata-driven searches
34
ENCODE Consortium Phase 3
35
Goals of ENCODE
• Catalog the functional elements in human and mouse genomes
• Generate high quality data using high throughput pipelines
• Develop new technologies and analytical tools to generate, analyze and validate data
• Provide data and tools to the community in as useful form as possible
36
RNA-Sequencing
Wang et al. 2009 Nat Gen. Rev.
Immunoprecipitation
Transcription Factor
Sequence and align ChIP-seq
Peak 300-500 bp
Antibody
Functional data: ChIP-seq
ChIP-exo Histone Marks
Motif (8-12 bp)
DNaseI
Histone Histone
Transcription Factor
Sequence and align
DNaseI hypersensitivity peak
Functional data: DNase-seq
Region of open chromatin
DNaseI
Histone Histone
Transcription Factor
Sequence and align
DNaseI Footprint
Functional data: DNase footprints
Region of open chromatin
Comparing Mouse and Human with
Mouse ENCODE Data
Number of Datasets Per Biosamples
Human Mouse
0
1000
2000
3000
4000
5000
6000
7000
imm
ort
aliz
ed c
ell…
tiss
ue
pri
mar
y ce
ll
stem
cel
l
in v
itro
…
ind
uce
d…
cell
line
Started/Proposed
Released/Submitted
0
200
400
600
800
1000
1200
1400
tiss
ue
imm
ort
aliz
ed c
ell…
pri
mar
y ce
ll
stem
cel
l
in v
itro
…
cell
line
ind
uce
d…
42
Assays Per Biosample
43
0
500
1000
1500
2000K
56
2
Hep
G2
MC
F-7
GM
12
87
8
A5
49
H1
-hES
C
HeL
a-S3
liver
SK-N
-SH
end
oth
eli…
HC
T11
6
IMR
-90
kera
tin
oc…
sto
mac
h
fib
rob
last
…
pan
crea
s
thyr
oid
…
adre
nal
…
sple
en
adip
ose
…
Proposed
Released
0
50
100
150
liver
hea
rt
fore
bra
in
hin
db
rain
mid
bra
in
limb
neu
ral t
ub
e
MEL
cel
l lin
e
kid
ne
y
lun
g
inte
stin
e
sto
mac
h
CH
12
.LX
cran
iofa
cial
emb
ryo
nic
…
G1
E-ER
4
ES-E
14
myo
cyte
sple
en
thym
us
Proposed
Released