Biocuration activities for the International Cancer
Genome Consortium (ICGC).
December 4th 2014
B.F. Francis Ouellette [email protected]
• Senior Scientist & Associate Director,
Informatics and Biocomputing, Ontario Institute for
Cancer Research, Toronto, ON
• Associate Professor, Department of Cell and Systems Biology,
University of Toronto, Toronto, ON.
@bffo on
2
You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at;
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
3
• Cancer
• Data sharing
• Biocuration
• Access
• Relevance
• Making it better
4
Cancer: A Disease of the Genome
Challenge in Treating Cancer:
Every tumor is different
Every cancer patient is different
5
Johns Hopkins
> 18,000 genes analyzed for mutations
11 breast and 11 colon tumors
L.D. Wood et al, Science, Oct. 2007
Wellcome Trust Sanger Institute
518 genes analyzed for mutations
210 tumors of various types
C. Greenman et al, Nature, Mar. 2007
TCGA (NIH)
Multiple technologies
brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma).
F.S. Collins & A.D. Barker, Sci. Am, Mar. 2007
Large-Scale Studies of Cancer Genomes
6
Heterogeneity within and across tumor types
High rate of abnormalities (driver vs. passenger)
Sample quality matters
Consent and controlled data access are complicated
Lessons learned
7
International Cancer Genome Consortium
• Collect ~500 tumour/normal pairs from each of 50 different major
cancer types;
• Comprehensive genome analysis of each T/N pair:
– Genome
– Transcriptome
– Methylome
– Clinical data
• Make the data available to the research community & public.
Identify
genome
changes
…GATTATTCCAGGTAT… …GATTATTGCAGGTAT… …GATTATTGCAGGTAT…
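The sequences above differ at a single base. A minimal Python sketch of how such a change could be located, assuming the two sequences are already aligned and of equal length. The function name `find_mismatches` is illustrative and not part of any ICGC tool, and the slide does not say which sequence is tumour and which is normal, so the labels below are an assumption.

```python
def find_mismatches(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# Sequences copied from the slide; labelling one "normal" and the other
# "tumour" is an assumption made for illustration only.
normal = "GATTATTCCAGGTAT"
tumour = "GATTATTGCAGGTAT"

mismatches = find_mismatches(normal, tumour)
print(mismatches)  # [(7, 'C', 'G')]
```

Real somatic mutation calling works on aligned sequencing reads with quality and depth information; this only shows the core idea of comparing a tumour sequence against its matched normal.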
8
Rationale for the ICGC
• The scope is huge, such that no country can do it all.
• Coordinated cancer genome initiatives will reduce
duplication of effort for common and easy to acquire
tumor samples and ensure complete studies for many
less frequent forms of cancer.
• Standardization and uniform quality measures across
studies will enable the merging of datasets, increasing
power to detect additional targets.
• The spectrum of many tumor types varies across the
world because of environmental,
genetic and other causes.
• The ICGC will accelerate the dissemination of genomic
and analytical methods across participating sites and
the user community.
9
International Cancer Genome Consortium
(ICGC) Goals
• Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe
• Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors
• Make the data available to research community rapidly with minimal restrictions to accelerate research into the causes and control of cancer
50 tumor types and/or subtypes
500 tumors + 500 controls per subtype
50,000 Human Genome Projects!
Nature (2010) 464:993
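The "50,000 Human Genome Projects" figure follows directly from the numbers on this slide. A short worked calculation (variable names are illustrative):

```python
tumour_types = 50        # cancer types and/or subtypes
tumours_per_type = 500   # tumour genomes per type
controls_per_type = 500  # matched control genomes per type

# Each type contributes one tumour genome and one control genome per donor pair.
total_genomes = tumour_types * (tumours_per_type + controls_per_type)
print(f"{total_genomes:,} genomes")  # 50,000 genomes
```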
10
ICGC
Goals, Structure,
Policies & Guidelines
http://goo.gl/sPGLQN
11
Primary Goal: coordinate efforts to
reach goals (50 tumours)
12
http://docs.icgc.org/dcc-data-element-specifications
13
Primary Goal: be comprehensive
http://goo.gl/BE7KH1
14
Analysis Data Types
• Germline variants (SNPs)
• Simple Somatic Mutations (SSM)
• Copy Number Alterations (CNA)
• Structural Variants (SV)
• Gene Expression (micro-arrays and RNASeq)
• miRNA Expression (RNASeq)
• Epigenomics (Arrays and Methylation)
• Splicing Variation (RNASeq)
• Protein Expression (Arrays)
15
Primary Goal: generate highest quality
http://goo.gl/FXCvi9
16
17
Primary Goal: available to all
18
Primary Goal: available to all
19
• Detailed Phenotype and Outcome data
Region of residence
Risk factors
Examination
Surgery
Radiation
Sample
Slide
Specific histological features
Analyte
Aliquot
Donor notes
• Gene Expression (probe-level data)
• Raw genotype calls
• Gene-sample identifier links
• Genome sequence files
ICGC Controlled
Access Datasets
• Cancer Pathology
Histologic type or subtype
Histologic nuclear grade
• Patient/Person
Gender, Age range,
Vital status, Survival time
Relapse type, Status at follow-up
• Gene Expression (normalized)
• DNA methylation
• Computed Copy Number and
Loss of Heterozygosity
• Newly discovered somatic variants
ICGC OA
Datasets
http://goo.gl/w4mrV
20
Secondary Goal: coordinate
work to benefit productivity
http://goo.gl/K5mHC3
21
https://icgc.org/icgc/committees-and-working-groups
22
Secondary Goal: disseminate knowledge
http://goo.gl/ObcZXy
23
ICGC
Goals, Structure,
Policies & Guidelines
http://goo.gl/sPGLQN
24
Policy
ICGC membership implies compliance with Core
Bioethical Elements for samples used in ICGC
Cancer Projects:
http://goo.gl/TFrCmK
http://goo.gl/nYx6YG
25
POLICY:
The members of the International Cancer Genome
Consortium (ICGC) are committed to the principle of
rapid data release to the scientific community.
http://goo.gl/TFrCmK
26
Publication Policy
• The individual research groups in
the ICGC are free to publish the
results of their own efforts in
independent publications at any
time (subject, of course, to any
policies of any collaborations in
which they may be participating).
27
Moratorium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy
28
Publication Policy
29
Where do you find that information?
• We actually make it hard to find, but we are
working on that! (this is an example of where ICGC
would like to do what TCGA does!)
• http://cancergenome.nih.gov/publications/publicationguidelines
30
Where do you find that information?
For ICGC data:
• Need to find the policy!
• http://icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy
• Find text:
• Find date: in README on FTP file
• This is bad, we know it, and we are fixing it!
• In doubt, contact us: [email protected]
31
Policy on Intellectual Property
• All ICGC members agree not to make claims to
possible IP derived from primary data (including
somatic mutations) and to not pursue IP
protections that would prevent or block access to
or use of any element of ICGC data or conclusions
drawn directly from those data.
http://goo.gl/TCMXCl
32
ICGC Map – May 2014
72 projects launched
33
OICR and the ICGC
34
DCC Activities
DCC activities are split between two groups:
• Software Development
– DCC portal
– Submission tool
• Biocuration (which also includes Content
Management)
– Data level management
– Submitter “handling”
– Coordination with secretariat
– User support
http://dcc.icgc.org/team
35
[Figure: submitted data passes validation against the dictionary, then validation across fields, then indexing, ending with "Happy Users".]
http://goo.gl/1EcyR
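The two validation stages named on this slide (dictionary, then across fields) can be sketched as follows. This is a minimal illustration under stated assumptions: the field names, allowed values and the cross-field rule are hypothetical and are not taken from the actual ICGC dictionary or submission system.

```python
# Hypothetical controlled vocabulary; NOT the real ICGC dictionary.
DICTIONARY = {
    "donor_sex": {"male", "female"},
    "donor_vital_status": {"alive", "deceased"},
}


def validate_dictionary(record: dict) -> list[str]:
    """Stage 1: each controlled field must hold an allowed value."""
    errors = []
    for field, allowed in DICTIONARY.items():
        value = record.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")
    return errors


def validate_across_fields(record: dict) -> list[str]:
    """Stage 2: values in different fields must be mutually consistent."""
    errors = []
    diagnosis = record.get("donor_age_at_diagnosis")
    followup = record.get("donor_age_at_last_followup")
    # Assumed rule: follow-up cannot happen before diagnosis.
    if diagnosis is not None and followup is not None and followup < diagnosis:
        errors.append("donor_age_at_last_followup is less than donor_age_at_diagnosis")
    return errors


def validate(record: dict) -> list[str]:
    """Run both stages and collect every error message."""
    return validate_dictionary(record) + validate_across_fields(record)


good = {"donor_sex": "female", "donor_vital_status": "alive",
        "donor_age_at_diagnosis": 60, "donor_age_at_last_followup": 63}
bad = {"donor_sex": "unknown", "donor_vital_status": "alive",
       "donor_age_at_diagnosis": 60, "donor_age_at_last_followup": 55}

print(validate(good))
print(validate(bad))
```

Collecting all errors rather than stopping at the first one gives a submitter a full report in a single pass.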
36
http://docs.icgc.org/methods
37
http://docs.icgc.org/dcc-data-element-specifications
38
ICGC Biocuration
• Helping submitters get their data to ICGC
• Progress reporting (data audit)
• Quality checks (coverage, correctness, etc.)
• Helping users get to the data
• Validate and check (and recheck) metadata on public
repositories
• Test and integrate with other public repositories via
standard data formats, ontologies.
• Documentation, documentation, and more documentation
• Training
39
ICGC datasets to date
[Figure: ICGC Data Portal cumulative donor count for member projects, Release 7 to Release 17. Y-axis: number of donors, 0 to 14,000.]
• Cancer types: 50
• Body sites: 18
• Donors: 12,232
• Specimens: 24,661
• Simple somatic mutations: 9,871,477
• Mutated genes: 57,526
ICGC dataset version 17
Sept 11th 2014
41
Clinical Data Completeness
[Figure: Overall Donor Clinical Data Completeness. Bar chart of average percentage completeness per donor field: Donor interval of last followup; Donor tumour stage at diagnosis; Donor tumour staging system at diagnosis; Donor diagnosis ICD10; Donor survival time; Donor tumour stage at diagnosis supplemental; Donor relapse interval; Donor age at last followup; Donor relapse type; Donor age at diagnosis; Disease status last followup; Donor region of residence; Donor sex; Donor ID; Donor vital status.]
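A completeness percentage of this kind can be computed per field as the share of records with a non-empty value. The sketch below is illustrative only: the donor records are made up, and treating `None` and the empty string as "missing" is an assumption, not the DCC's documented rule.

```python
def percent_completeness(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Percentage of records holding a non-empty value, per field."""
    if not records:
        return {field: 0.0 for field in fields}
    result = {}
    for field in fields:
        # Assumption: None and "" both count as missing.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = 100.0 * filled / len(records)
    return result


# Made-up donor records for illustration.
donors = [
    {"donor_id": "D1", "donor_sex": "female", "donor_survival_time": 400},
    {"donor_id": "D2", "donor_sex": "male", "donor_survival_time": None},
    {"donor_id": "D3", "donor_sex": "", "donor_survival_time": None},
    {"donor_id": "D4", "donor_sex": "male", "donor_survival_time": 120},
]

completeness = percent_completeness(
    donors, ["donor_id", "donor_sex", "donor_survival_time"]
)
print(completeness)
```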
42
Clinical Data Completeness
[Figure: Overall Specimen Clinical Data Completeness. Bar chart of average percentage completeness (0 to 100) per specimen field: Level of cellularity; Percentage cellularity; Digital image of stained section; Tumour stage supplemental; Tumour stage; Tumour stage system; Tumour grade supplemental; Tumour grade; Tumour grading system; Tumour histological type; Specimen available; Specimen biobank ID; Specimen biobank; Tumour confirmed; Specimen storage other; Specimen storage; Specimen processing other; Specimen processing; Specimen donor treatment type; Specimen interval; Specimen type; Specimen type other; Specimen ID; Donor ID; Specimen donor treatment type other.]
44
ICGC DCC Pipeline
[Diagram labels: Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; Sequencing centres; DCC; Level II/III mutation data; Data Portal; Harmonized Data; Annotation; Researchers; DACO; Data Access Agreement; European Genome-phenome Archive; Raw sequencing data; Metadata XML files]
Hardeep Nahal
46
EGA: Controlled Access and DACO
47
ICGC DCC Pipeline
[Diagram labels: Sequencing centres; Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; DCC; Level II/III mutation data; Data Portal; Harmonized Data; Annotation; Researchers; European Genome-phenome Archive; Raw sequencing data; Metadata XML files]
Researcher: "I want raw data for Donor DO46688"
Hardeep Nahal
48
ICGC DCC Pipeline
[Diagram labels: Sequencing centres; Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; DCC; Level II/III mutation data; Data Portal; Harmonized Data; Annotation; Researchers; DACO; Data Access Agreement; European Genome-phenome Archive; Raw sequencing data; Metadata XML files]
Researcher to DCC: "I can’t find my BAM file?!"
Hardeep Nahal
49
Metadata: ICGC & EGA
[Figure: diagram relating ICGC and EGA metadata entities, annotated with 1-to-n cardinalities.]
50
ICGC-EGA Audit
Project/Study Names
Sample identifier
Donor Identifier
EGA Study/Dataset Accession
Raw data file names
Tumour/Normal designation
EGA Metadata XML Files
ICGC-EGA Audit Reports
ICGC Submitted Data
ICGC Cancer Projects
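An audit of this kind boils down to joining the two sets of records on a shared identifier and reporting what is missing or inconsistent. The sketch below is a simplified illustration: the record layout, key names and sample data are assumptions, not the actual ICGC or EGA schemas.

```python
def audit(icgc_records: list[dict], ega_records: list[dict]) -> dict[str, list[str]]:
    """Compare ICGC and EGA records, keyed by sample identifier."""
    icgc = {r["sample_id"]: r for r in icgc_records}
    ega = {r["sample_id"]: r for r in ega_records}
    report = {
        "missing_in_ega": sorted(icgc.keys() - ega.keys()),
        "missing_in_icgc": sorted(ega.keys() - icgc.keys()),
        "donor_id_mismatch": [],
        "missing_designation": [],
    }
    # Only samples present on both sides can be compared field by field.
    for sample_id in sorted(icgc.keys() & ega.keys()):
        if icgc[sample_id]["donor_id"] != ega[sample_id]["donor_id"]:
            report["donor_id_mismatch"].append(sample_id)
        if not ega[sample_id].get("designation"):
            report["missing_designation"].append(sample_id)
    return report


# Made-up records for illustration.
icgc_records = [
    {"sample_id": "S1", "donor_id": "D1"},
    {"sample_id": "S2", "donor_id": "D2"},
    {"sample_id": "S3", "donor_id": "D3"},
]
ega_records = [
    {"sample_id": "S1", "donor_id": "D1", "designation": "tumour"},
    {"sample_id": "S2", "donor_id": "X9", "designation": None},
    {"sample_id": "S4", "donor_id": "D4", "designation": "normal"},
]

report = audit(icgc_records, ega_records)
print(report)
```

In practice the sample identifiers themselves may be formatted differently on each side, so a real audit would need a normalization step before this join.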
51
Major Issues & Challenges
Differences in the formats used for clinical identifiers submitted to
ICGC and EGA
Tumour/Normal designation missing
Project/Study names differ
Missing EGA datasets
• Donor ID
• Sample ID
• Tumour/Normal designation
• Sequencing strategy
Raw data filename
EGA Study/Dataset accession
ICGC EGA
Hardeep Nahal
52
Example metadata issues
<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
  <SAMPLE alias="LFS_MB1" center_name="DKFZ-IBIOS" accession="ERS040283">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS040283</PRIMARY_ID>
      <SUBMITTER_ID namespace="DKFZ-IBIOS">LFS_MB1</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME><TAXON_ID>9606</TAXON_ID><SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME><COMMON_NAME>human</COMMON_NAME></SAMPLE_NAME>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>Sample ID</TAG><VALUE>LFS_MB1</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>Donor ID</TAG><VALUE>165304</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>
Control/tumour information?
Different Donor identifiers!!
Hardeep Nahal
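The identifiers in a sample record like the one above can be pulled out with Python's standard `xml.etree.ElementTree` module. This is a sketch, not the DCC's audit code: the function name `sample_attributes` is illustrative, the XML declaration is omitted from the string for simplicity, and treating an absent `DESCRIPTION` element as "no tumour/control information" is an assumption.

```python
import xml.etree.ElementTree as ET

# Sample record from the slide (XML declaration omitted).
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE alias="LFS_MB1" center_name="DKFZ-IBIOS" accession="ERS040283">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS040283</PRIMARY_ID>
      <SUBMITTER_ID namespace="DKFZ-IBIOS">LFS_MB1</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>Sample ID</TAG><VALUE>LFS_MB1</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>Donor ID</TAG><VALUE>165304</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""


def sample_attributes(xml_text: str) -> list[dict]:
    """Extract accession, alias, description and TAG/VALUE pairs per SAMPLE."""
    root = ET.fromstring(xml_text)
    samples = []
    for sample in root.iter("SAMPLE"):
        attributes = {
            attr.findtext("TAG"): attr.findtext("VALUE")
            for attr in sample.iter("SAMPLE_ATTRIBUTE")
        }
        samples.append({
            "accession": sample.get("accession"),
            "alias": sample.get("alias"),
            # None when the record carries no DESCRIPTION element.
            "description": sample.findtext("DESCRIPTION"),
            "attributes": attributes,
        })
    return samples


samples = sample_attributes(SAMPLE_XML)
for s in samples:
    print(s["accession"], s["attributes"].get("Donor ID"), s["description"])
```

Here the record yields a donor identifier but no description, which matches the two problems flagged on the slide: no control/tumour information, and a donor identifier that has to be reconciled with the one submitted to ICGC.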
53
<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
  <SAMPLE center_name="QCMG" alias="ICGC-ABMJ-20101130-29-ND" accession="ERS206872" xmlns:xsi…….
    <IDENTIFIERS>
      <PRIMARY_ID>ERS206872</PRIMARY_ID>
      <SUBMITTER_ID namespace="QCMG">ICGC-ABMJ-20101130-29-ND</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME><TAXON_ID>9606</TAXON_ID><SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME><COMMON_NAME>human</COMMON_NAME></SAMPLE_NAME>
    <DESCRIPTION>1:DNA|4:Normal control (other site)|Unknown</DESCRIPTION>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>Sample ID</TAG><VALUE>8029782</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>Donor ID</TAG><VALUE>ICGC_0108</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>
Example metadata issues
Free Text
Hardeep Nahal
54
http://icgc.org
55
56
57
Select “Bladder Cancer – China”
58
Select “Pancreatic cancer – Canada”
59
… But where is the data?
60
61
http://dcc.icgc.org/
62
63
64
Highlights of the new portal: dcc.icgc.org
• Faceted search capabilities for variants, genes and
donors
– Fast and easy interactive data exploration
• Mutation aggregation & counts across donors and cancers
– e.g. number of pancreatic cancer donors with the KRAS G12D mutation
• Standardized gene consequence across all projects
• Genome browser
• Data download
• Protein domains
• Links to repositories
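The mutation aggregation highlight (for example, how many pancreatic cancer donors carry KRAS G12D) amounts to counting distinct donors per mutation and cancer type. A minimal sketch with made-up observations; the record layout is an assumption and does not reflect the portal's internal data model or API.

```python
from collections import defaultdict


def donors_per_mutation(observations: list[dict]) -> dict[tuple[str, str, str], int]:
    """Count distinct donors per (cancer, gene, mutation)."""
    donors = defaultdict(set)
    for obs in observations:
        key = (obs["cancer"], obs["gene"], obs["mutation"])
        # A set ensures a donor with several specimens is counted once.
        donors[key].add(obs["donor_id"])
    return {key: len(ids) for key, ids in donors.items()}


# Made-up observations for illustration.
observations = [
    {"donor_id": "D1", "cancer": "pancreatic", "gene": "KRAS", "mutation": "G12D"},
    {"donor_id": "D1", "cancer": "pancreatic", "gene": "KRAS", "mutation": "G12D"},
    {"donor_id": "D2", "cancer": "pancreatic", "gene": "KRAS", "mutation": "G12D"},
    {"donor_id": "D3", "cancer": "pancreatic", "gene": "KRAS", "mutation": "G12V"},
    {"donor_id": "D4", "cancer": "colon", "gene": "KRAS", "mutation": "G12D"},
]

counts = donors_per_mutation(observations)
print(counts[("pancreatic", "KRAS", "G12D")])  # 2
```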
65
KRAS search
66
• Summary
• Cancer type distribution
• Other links (Cosmic, Entrez, etc)
• Mutation profile in protein
• Domains
• Genomic Context
• Mutation profile
• Most common mutations
67
http://dcc.icgc.org/genes/ENSG00000133703
68
69
70
71
Donor
• Donor ID
• Primary site
• Cancer Project
• Gender
• Tumor Stage
• Vital Status
• Disease Status
• Release type
• Age at diagnosis
• Available data types
• Analysis types
72
Genes
73
Mutations
• Consequences
• Type
• Platform
• Verification status
74
Exporting data
75
Exporting data
76
77
Exporting data
78
Can do bulk download of the data …
79
[Figure: BIG DATA. Raw data, metadata and interpreted data, with validation steps and check marks (✔).]
80
[Figure: data repositories and access. Labels: DACO; ICGC; dbGaP; EGA; TCGA; ERA; BAM files; Open; Germline; + EGA id.]
81
ICGC Data Categories
ICGC Open Access Datasets
• Cancer Pathology
– Histologic type or subtype
– Histologic nuclear grade
• Donor
– Gender
– Age range
• RNA expression (normalized)
• DNA methylation
• Genotype frequencies
• Somatic mutations (SNV, CNV and Structural Rearrangement)
ICGC Controlled Access Datasets
• Detailed Phenotype and Outcome Data
– Patient demography
– Risk factors
– Examination
– Surgery/Drugs/Radiation
– Sample/Slide
– Specific histological features
– Protocol
– Analyte/Aliquot
• Gene Expression (probe-level data)
• Raw genotype calls (germline)
• Gene-sample identifier links
• Genome sequence files
Most of the data in the portal is publicly available without restriction. However,
access to some data, such as germline mutations, requires authorization by the Data
Access Compliance Office (DACO)
http://icgc.org/daco
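The open versus controlled split on this slide can be expressed as a simple lookup. The sketch below copies a few data categories from the slide; the function name `requires_daco` and the choice to raise an error for unknown categories are illustrative assumptions, not how the portal enforces access.

```python
# Categories taken from the slide, lower-cased for matching.
OPEN_ACCESS = {
    "rna expression (normalized)",
    "dna methylation",
    "genotype frequencies",
    "somatic mutations",
}
CONTROLLED_ACCESS = {
    "gene expression (probe-level data)",
    "raw genotype calls (germline)",
    "gene-sample identifier links",
    "genome sequence files",
}


def requires_daco(data_type: str) -> bool:
    """Return True if the data category needs DACO authorization."""
    name = data_type.strip().lower()
    if name in CONTROLLED_ACCESS:
        return True
    if name in OPEN_ACCESS:
        return False
    # Assumption: refuse to guess for categories not listed on the slide.
    raise KeyError(f"unknown data category: {data_type!r}")


print(requires_daco("Genome sequence files"))  # True
print(requires_daco("DNA methylation"))        # False
```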
84
Identify yourself
Fill out a detailed form, which includes:
• Contact and Project Information
• Information Technology details and procedures for keeping data secure
• Data Access Agreement
All of these documents are put into a PDF file that you print and have your institution sign on your behalf
87
88
89
90
91
92
DACO approved projects (Dec 2014):
163 groups – 867 people
http://goo.gl/E8gHGx
93
Making sense of it all
1 project == 1 pipeline
94
Making sense of it all
70 projects == 70 pipelines
95
Making sense of it all
70 projects == 1 pipeline
96
PanCancer Analysis of Whole Genomes (PCAWG)
• 2,200 T/N pairs with clinical data, analyzed over 6 academic clouds
• 16 working groups, > 1000 scientists
• 1 alignment pipeline (8 months)
• Data freeze last month
• 3 somatic mutation pipelines (2 months?)
• 2 RNA-Seq pipelines (1 month?)
• Not scheduled yet:
– miRNA
– CNV & SV
– Pathway analysis
• Start writing papers in July 2015
97
Conclusions: What we are doing at DCC
• Working with EGA to audit missing information
and minimize the disconnect between submitters’
ICGC files and the raw data and metadata submitted to
EGA.
• Working in close collaboration with ICGC
projects and EGA to correct missing
information and data so we can harmonize
data in the ICGC/EGA submission process.
• Adding a validation step to the submission process
to better coordinate efforts with EGA.
98
Conclusions: What we are doing at DCC
• Encourage & work with submitters to supply all
clinical metadata.
• Improved data & metadata curation at EGA; better
linking of data held at DCC to ICGC data in other
repositories
• Improved data quality/integrity checking through
new submission/validation system; review of
submission file specifications
• Integration of new data submission system and
portal infrastructure with project and user
information managed at ICGC.org
• Integrating PCAWG results with ICGC data portal
99
Some thoughts:
• Curation activities are obviously crucial to the
development of a great database
• Continuous feedback between users,
developers and submitters is also critical
• Biocurators are at the important interface and
are essential team players for the
development and maintenance of any modern
database.
100
http://www.biocurator.org/
101
Nature 409:452
Bioinformatics Citizenship: What it means,
and what does it cost?
102
Important messages:
• The ICGC portal is evolving and getting better all
the time
• Lots of data provided by the ICGC
• Important to be good citizens of the scientific world
• The idea behind all of this is to provide tools to
help cure cancer
• Need to respect policies and guidelines
• There is help out there, and user feedback is
*always* welcome.
103
DCC Software
Developer
Vincent Ferretti
Daniel Chang
Anthony Cros
Jerry Lam
Brian O'Connor
Bob Tiernay
Stuart Watt
Shane Wilson
Junjun Zhang
Acknowledgments
ICGC/OICR Project leaders:
Tom Hudson
John McPherson
Lincoln Stein
Jared Simpson
Paul Boutros
Vincent Ferretti
Francis Ouellette
Jennifer Jennings
Ouellette Lab
Michelle Brazas
Emilie Chautard
Nina Palikuca
Zhibin Lu
Web Dev
Joseph Yamada
Angela Chao
Daniel Gross
Kamen Wu
Kim Cullion
Miyuki Fukuma
Wen Xu
Pipeline Development
& Evaluation
Morgan Taschuk
Michael Laszloffy
Peter Ruzanov
ICGC DCC Biocuration
Hardeep Nahal
Marc Perry
http://oicr.on.ca http://icgc.org
… and all the patients and their families that
are putting their hopes into our work!
Research IT/Systems
David Sutton
Bob Gibson
Sam Maclennan
David Magda
Rob Naccarato
Brian Ott
Gino Yearwood
EGA
Justin Paschall
Jeff Almeida-King
Ilkka Lappalainen
Jordi Rambla De Argila
Marc Sitges Puy
SeqProdBio Team
Tim Beck
Tony DeBat
Larry Heissler
Xuemei (Mei) Luo
Michael Moorhouse
Furqan Qureshi
Yogi Sundaravadanam
104
Informatics and Biocomputing at the OICR
105
http://oicr.on.ca/careers
106