Biocuration activities for the International Cancer Genome Consortium (ICGC). December 4th, 2014. B.F. Francis Ouellette [email protected], Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON; Associate Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, ON. @bffo

Page 1: Biocuration activities for the International Cancer Genome Consortium (ICGC)

Biocuration activities for the International Cancer Genome Consortium (ICGC)

December 4th, 2014

B.F. Francis Ouellette [email protected]

• Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON

• Associate Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, ON

@bffo

Page 2: Biocuration activities for the International Cancer Genome Consortium (ICGC)

2

You are free to:

Copy, share, adapt, or re-mix;

Photograph, film, or broadcast;

Blog, live-blog, or post video of

this presentation, provided that:

You attribute the work to its author and respect the rights and licenses associated with its components.

Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.

Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at:

http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites

Page 3: Biocuration activities for the International Cancer Genome Consortium (ICGC)

3

• Cancer

• Data sharing

• Biocuration

• Access

• Relevance

• Making it better

Page 4: Biocuration activities for the International Cancer Genome Consortium (ICGC)

4

Cancer: A Disease of the Genome

Challenge in Treating Cancer:

Every tumor is different

Every cancer patient is different

Page 5: Biocuration activities for the International Cancer Genome Consortium (ICGC)

5

Large-Scale Studies of Cancer Genomes

Johns Hopkins

• > 18,000 genes analyzed for mutations

• 11 breast and 11 colon tumors

• L.D. Wood et al, Science, Oct. 2007

Wellcome Trust Sanger Institute

• 518 genes analyzed for mutations

• 210 tumors of various types

• C. Greenman et al, Nature, Mar. 2007

TCGA (NIH)

• Multiple technologies

• Brain (glioblastoma multiforme), lung (squamous carcinoma), and ovarian (serous cystadenocarcinoma)

• F.S. Collins & A.D. Barker, Sci. Am, Mar. 2007

Page 6: Biocuration activities for the International Cancer Genome Consortium (ICGC)

6

Lessons learned

• Heterogeneity within and across tumor types

• High rate of abnormalities (driver vs passenger)

• Sample quality matters

• Consent and controlled data access are complicated

Page 7: Biocuration activities for the International Cancer Genome Consortium (ICGC)

7

International Cancer Genome Consortium

• Collect ~500 tumour/normal pairs from each of 50 different major cancer types;

• Comprehensive genome analysis of each T/N pair:

– Genome

– Transcriptome

– Methylome

– Clinical data

• Make the data available to the research community & public.

[Figure: identify genome changes by comparing sequences, e.g. …GATTATTCCAGGTAT… against …GATTATTGCAGGTAT… and …GATTATTGCAGGTAT…]

Page 8: Biocuration activities for the International Cancer Genome Consortium (ICGC)

8

Rationale for the ICGC

• The scope is huge, such that no country can do it all.

• Coordinated cancer genome initiatives will reduce duplication of effort for common and easy-to-acquire tumor samples and ensure complete studies for many less frequent forms of cancer.

• Standardization and uniform quality measures across studies will enable the merging of datasets, increasing power to detect additional targets.

• For many tumor types, the spectrum of cancers varies across the world because of environmental, genetic and other causes.

• The ICGC will accelerate the dissemination of genomic and analytical methods across participating sites and the user community.

Page 9: Biocuration activities for the International Cancer Genome Consortium (ICGC)

9

International Cancer Genome Consortium (ICGC) Goals

• Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe

• Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors

• Make the data available to research community rapidly with minimal restrictions to accelerate research into the causes and control of cancer

50 tumor types and/or subtypes

500 tumors + 500 controls per subtype

50,000 Human Genome Projects!

Nature (2010) 464:993

Page 10: Biocuration activities for the International Cancer Genome Consortium (ICGC)

10

ICGC

Goals, Structure,

Policies & Guidelines

http://goo.gl/sPGLQN

Page 11: Biocuration activities for the International Cancer Genome Consortium (ICGC)

11

Primary Goal: coordinate efforts to reach goals (50 tumours)

Page 12: Biocuration activities for the International Cancer Genome Consortium (ICGC)

12

http://docs.icgc.org/dcc-data-element-specifications

Page 13: Biocuration activities for the International Cancer Genome Consortium (ICGC)

13

Primary Goal: be comprehensive

http://goo.gl/BE7KH1

Page 14: Biocuration activities for the International Cancer Genome Consortium (ICGC)

14

Analysis Data Types

• Germline variants (SNPs)

• Simple Somatic Mutations (SSM)

• Copy Number Alterations (CNA)

• Structural Variants (SV)

• Gene Expression (micro-arrays and RNASeq)

• miRNA Expression (RNASeq)

• Epigenomics (Arrays and Methylation)

• Splicing Variation (RNASeq)

• Protein Expression (Arrays)

Page 15: Biocuration activities for the International Cancer Genome Consortium (ICGC)

15

Primary Goal: generate highest quality

http://goo.gl/FXCvi9

Page 16: Biocuration activities for the International Cancer Genome Consortium (ICGC)

16

Page 17: Biocuration activities for the International Cancer Genome Consortium (ICGC)

17

Primary Goal: available to all

Page 18: Biocuration activities for the International Cancer Genome Consortium (ICGC)

18

Primary Goal: available to all

Page 19: Biocuration activities for the International Cancer Genome Consortium (ICGC)

19

ICGC Controlled Access Datasets

• Detailed Phenotype and Outcome data: Region of residence, Risk factors, Examination, Surgery, Radiation, Sample, Slide, Specific histological features, Analyte, Aliquot, Donor notes

• Gene Expression (probe-level data)

• Raw genotype calls

• Gene-sample identifier links

• Genome sequence files

ICGC Open Access (OA) Datasets

• Cancer Pathology: Histologic type or subtype, Histologic nuclear grade

• Patient/Person: Gender, Age range, Vital status, Survival time, Relapse type, Status at follow-up

• Gene Expression (normalized)

• DNA methylation

• Computed Copy Number and Loss of Heterozygosity

• Newly discovered somatic variants

http://goo.gl/w4mrV
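The two-tier split on this slide can be pictured as a simple lookup from data category to access tier. The sketch below is only an illustration: the tier assignments are copied from the slide, but the dictionary keys, the `Access` enum and the `is_controlled` helper are hypothetical names, not part of any ICGC software.

```python
from enum import Enum


class Access(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"


# Tier assignments copied from the slide. The keys are illustrative labels,
# not official ICGC field names.
ACCESS_TIER = {
    "histologic type or subtype": Access.OPEN,
    "gender": Access.OPEN,
    "age range": Access.OPEN,
    "gene expression (normalized)": Access.OPEN,
    "dna methylation": Access.OPEN,
    "newly discovered somatic variants": Access.OPEN,
    "gene expression (probe-level data)": Access.CONTROLLED,
    "raw genotype calls": Access.CONTROLLED,
    "gene-sample identifier links": Access.CONTROLLED,
    "genome sequence files": Access.CONTROLLED,
}


def is_controlled(data_category: str) -> bool:
    """Return True when the category is listed under controlled access."""
    tier = ACCESS_TIER[data_category.strip().lower()]
    return tier is Access.CONTROLLED


if __name__ == "__main__":
    for name in ("DNA methylation", "Genome sequence files"):
        print(name, "->", "controlled" if is_controlled(name) else "open")
```

An unknown category raises `KeyError` here on purpose, so that a new data type cannot silently default to open access.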

Page 20: Biocuration activities for the International Cancer Genome Consortium (ICGC)

20

Secondary Goal: coordinate

work to benefit productivity

http://goo.gl/K5mHC3

Page 21: Biocuration activities for the International Cancer Genome Consortium (ICGC)

21

https://icgc.org/icgc/committees-and-working-groups

Page 22: Biocuration activities for the International Cancer Genome Consortium (ICGC)

22

Secondary Goal: disseminate knowledge

http://goo.gl/ObcZXy

Page 23: Biocuration activities for the International Cancer Genome Consortium (ICGC)

23

ICGC

Goals, Structure,

Policies & Guidelines

http://goo.gl/sPGLQN

Page 24: Biocuration activities for the International Cancer Genome Consortium (ICGC)

24

Policy

ICGC membership implies compliance with Core Bioethical Elements for samples used in ICGC Cancer Projects:

http://goo.gl/TFrCmK

http://goo.gl/nYx6YG

Page 25: Biocuration activities for the International Cancer Genome Consortium (ICGC)

25

POLICY:

The members of the International Cancer Genome Consortium (ICGC) are committed to the principle of rapid data release to the scientific community.

http://goo.gl/TFrCmK

Page 26: Biocuration activities for the International Cancer Genome Consortium (ICGC)

26

Publication Policy

• The individual research groups in the ICGC are free to publish the results of their own efforts in independent publications at any time (subject, of course, to any policies of any collaborations in which they may be participating).

Page 27: Biocuration activities for the International Cancer Genome Consortium (ICGC)

27

Moratorium: http://www.icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy

Page 28: Biocuration activities for the International Cancer Genome Consortium (ICGC)

28

Publication Policy

Page 29: Biocuration activities for the International Cancer Genome Consortium (ICGC)

29

Where do you find that information?

• We actually make it hard to find, but we are working on that! (This is an example of where ICGC would like to do what TCGA does!)

• http://cancergenome.nih.gov/publications/publicationguidelines

Page 30: Biocuration activities for the International Cancer Genome Consortium (ICGC)

30

Where do you find that information?

For ICGC data:

• Need to find the policy!

• http://icgc.org/icgc/goals-structure-policies-guidelines/e3-publication-policy

• Find text:

• Find date: in README on FTP file

• This is bad, we know it, and we are fixing it!

• In doubt, contact us: [email protected]

Page 31: Biocuration activities for the International Cancer Genome Consortium (ICGC)

31

Policy on Intellectual Property

• All ICGC members agree not to make claims to possible IP derived from primary data (including somatic mutations) and to not pursue IP protections that would prevent or block access to or use of any element of ICGC data or conclusions drawn directly from those data.

http://goo.gl/TCMXCl

Page 32: Biocuration activities for the International Cancer Genome Consortium (ICGC)

32

ICGC Map – May 2014: 72 projects launched

Page 33: Biocuration activities for the International Cancer Genome Consortium (ICGC)

33

OICR and the ICGC

Page 34: Biocuration activities for the International Cancer Genome Consortium (ICGC)

34

DCC Activities

DCC activities are split between two groups:

• Software Development

– DCC portal

– Submission tool

• Biocuration (which also includes Content Management)

– Data level management

– Submitter “handling”

– Coordination with secretariat

– User support

http://dcc.icgc.org/team

Page 35: Biocuration activities for the International Cancer Genome Consortium (ICGC)

35

[Figure: submitted data pass through validation steps (dictionary validation, then validation across fields) and indexing, ending with "Happy Users".]

http://goo.gl/1EcyR
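The two kinds of validation named on this slide, dictionary checks and checks across fields, can be sketched as successive stages applied to each submitted record. This is a minimal illustration under stated assumptions, not the DCC submission system: the field names, the allowed values and the cross-field rule are all hypothetical.

```python
Record = dict[str, str]

# Hypothetical controlled vocabulary for a single field.
ALLOWED_VALUES = {"donor_sex": {"male", "female"}}


def check_dictionary(record: Record) -> list[str]:
    """Dictionary validation: each coded field must use an allowed value."""
    errors = []
    for field, allowed in ALLOWED_VALUES.items():
        value = record.get(field, "")
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")
    return errors


def check_across_fields(record: Record) -> list[str]:
    """Cross-field validation (hypothetical rule): a deceased donor
    must come with a survival time."""
    errors = []
    deceased = record.get("donor_vital_status") == "deceased"
    if deceased and not record.get("donor_survival_time"):
        errors.append("donor_survival_time is required for deceased donors")
    return errors


def validate(record: Record) -> list[str]:
    """Run every stage in order and collect all error messages."""
    errors = []
    for stage in (check_dictionary, check_across_fields):
        errors.extend(stage(record))
    return errors


if __name__ == "__main__":
    bad = {"donor_sex": "F", "donor_vital_status": "deceased"}
    for message in validate(bad):
        print(message)
```

Collecting every error rather than stopping at the first one lets a submitter fix a file in a single round trip.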

Page 36: Biocuration activities for the International Cancer Genome Consortium (ICGC)

36

http://docs.icgc.org/methods

Page 37: Biocuration activities for the International Cancer Genome Consortium (ICGC)

37

http://docs.icgc.org/dcc-data-element-specifications

Page 38: Biocuration activities for the International Cancer Genome Consortium (ICGC)

38

ICGC Biocuration

• Helping submitters get their data to ICGC

• Progress reporting (data audit)

• Quality checks (coverage, correctness, etc.)

• Helping users get to the data

• Validate and check (and recheck) metadata on public repositories

• Test and integrate with other public repositories via standard data formats, ontologies.

• Documentation, documentation, and more documentation

• Training

Page 39: Biocuration activities for the International Cancer Genome Consortium (ICGC)

39

ICGC datasets to date

[Figure: ICGC Data Portal Cumulative Donor Count for Member Projects. Y-axis: Number of Donors (0 to 14,000); data points labelled Release 7 through Release 17.]

Page 40: Biocuration activities for the International Cancer Genome Consortium (ICGC)

ICGC dataset version 17 (Sept 11th 2014)

• Cancer types: 50

• Body sites: 18

• Donors: 12,232

• Specimens: 24,661

• Simple somatic mutations: 9,871,477

• Mutated genes: 57,526

Page 41: Biocuration activities for the International Cancer Genome Consortium (ICGC)

41

Clinical Data Completeness

[Figure: Overall Donor Clinical Data Completeness. Bar chart of Average Percentage Completeness for each of the following donor fields.]

Donor interval of last followup

Donor Tumour stage at diagnosis

Donor Tumour staging system at diagnosis

Donor diagnosis ICD10

Donor survival time

Donor Tumour stage at diagnosis supplemental

Donor relapse interval

Donor age at last followup

Donor relapse type

Donor age at diagnosis

Disease status last followup

Donor region of residence

Donor sex

Donor ID

Donor vital status
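A per-field completeness percentage like the one charted here can be computed by counting, for each field, how many donor records carry a non-empty value. The sketch below assumes records are plain dictionaries; the field names in the example and the function name are hypothetical, and the actual DCC calculation may differ.

```python
def field_completeness(records, fields):
    """Return {field: percentage of records with a non-empty value}."""
    total = len(records)
    result = {}
    for field in fields:
        filled = 0
        for record in records:
            value = record.get(field)
            # Missing keys, None and blank strings all count as not filled in.
            if value is not None and str(value).strip() != "":
                filled += 1
        result[field] = 100.0 * filled / total if total else 0.0
    return result


if __name__ == "__main__":
    # Two made-up donor records, for illustration only.
    donors = [
        {"donor_sex": "male", "donor_survival_time": ""},
        {"donor_sex": "female", "donor_survival_time": "320"},
    ]
    for name, pct in field_completeness(
        donors, ["donor_sex", "donor_survival_time"]
    ).items():
        print(f"{name}: {pct:.0f}%")
```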

Page 43: Biocuration activities for the International Cancer Genome Consortium (ICGC)

43

Clinical Data Completeness

[Figure: Overall Specimen Clinical Data Completeness. Bar chart of Average Percentage Completeness (0 to 100) for each of the following specimen fields.]

Level of cellularity

Percentage cellularity

Digital Image of Stained Section

Tumour Stage Supplemental

Tumour Stage

Tumour Stage System

Tumour Grade Supplemental

Tumour Grade

Tumour Grading System

Tumour Histological Type

Specimen available

Specimen Biobank ID

Specimen Biobank

Tumour confirmed

Specimen storage other

Specimen storage

Specimen processing other

Specimen processing

Specimen donor treatment type

Specimen Interval

Specimen type

Specimen type other

Specimen ID

Donor ID

Specimen donor treatment type other

Page 45: Biocuration activities for the International Cancer Genome Consortium (ICGC)

45

ICGC DCC Pipeline

[Figure: pipeline diagram with the elements Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; Sequencing centres; DCC; Level II/III mutation data; Harmonized Data; Annotation; Data Portal; Researchers; DACO; Data Access Agreement; European Genome-phenome Archive; Raw Sequencing data; Metadata XML Files.]

Hardeep Nahal

Page 46: Biocuration activities for the International Cancer Genome Consortium (ICGC)

46

EGA: Controlled Access and DACO

Page 47: Biocuration activities for the International Cancer Genome Consortium (ICGC)

47

ICGC DCC Pipeline

[Figure: pipeline diagram with the elements Sequencing centres; Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; DCC; Level II/III mutation data; Harmonized Data; Annotation; Data Portal; Researchers; European Genome-phenome Archive; Raw Sequencing data; Metadata XML Files. Researcher callout: "I want raw data for Donor DO46688".]

Hardeep Nahal

Page 48: Biocuration activities for the International Cancer Genome Consortium (ICGC)

48

ICGC DCC Pipeline

[Figure: pipeline diagram with the elements Sequencing centres; Donor; Consent; Tumour; Normal (blood); ICGC Cancer Projects; DCC; Level II/III mutation data; Harmonized Data; Annotation; Data Portal; Researchers; DACO; Data Access Agreement; European Genome-phenome Archive; Raw Sequencing data; Metadata XML Files. Researcher callout directed at the DCC: "I can’t find my BAM file?!"]

Hardeep Nahal

Page 49: Biocuration activities for the International Cancer Genome Consortium (ICGC)

49

Metadata: ICGC & EGA

[Figure: diagram of ICGC and EGA metadata entities linked by relationships labelled 1 and n.]

Page 50: Biocuration activities for the International Cancer Genome Consortium (ICGC)

50

ICGC-EGA Audit

[Figure: ICGC Submitted Data from ICGC Cancer Projects and EGA Metadata XML Files feed into ICGC-EGA Audit Reports.]

Audited items:

• Project/Study Names

• Sample identifier

• Donor Identifier

• EGA Study/Dataset Accession

• Raw data file names

• Tumour/Normal designation

Page 51: Biocuration activities for the International Cancer Genome Consortium (ICGC)

51

Major Issues & Challenges

• Differences in the formats used for clinical identifiers submitted to ICGC and EGA

• Tumour/Normal designation missing

• Project/Study names differ

• Missing EGA datasets

[Figure: metadata linking ICGC and EGA: Donor ID; Sample ID; Tumour/Normal designation; Sequencing strategy; Raw data filename; EGA Study/Dataset accession.]

Hardeep Nahal
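An audit of this kind can be pictured as a set comparison between the identifiers submitted to ICGC and those found in the EGA metadata. The sketch below is only an illustration of that idea; the function name, the report keys and the example identifiers are hypothetical, and the real audit covers more than identifiers.

```python
def audit_identifiers(icgc_ids, ega_ids):
    """Compare two collections of identifiers and report the differences."""
    icgc, ega = set(icgc_ids), set(ega_ids)
    return {
        "matched": sorted(icgc & ega),
        "missing_from_ega": sorted(icgc - ega),
        "missing_from_icgc": sorted(ega - icgc),
    }


if __name__ == "__main__":
    # Made-up identifiers: the same donor written in two different formats
    # never matches, which is the kind of mismatch listed above.
    report = audit_identifiers(
        icgc_ids=["DONOR_0001", "DONOR_0002"],
        ega_ids=["DONOR_0001", "0002"],
    )
    for section, ids in report.items():
        print(section, ids)
```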

Page 52: Biocuration activities for the International Cancer Genome Consortium (ICGC)

52

Example metadata issues

<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
  <SAMPLE alias="LFS_MB1" center_name="DKFZ-IBIOS" accession="ERS040283">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS040283</PRIMARY_ID>
      <SUBMITTER_ID namespace="DKFZ-IBIOS">LFS_MB1</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME>
      <TAXON_ID>9606</TAXON_ID>
      <SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME>
      <COMMON_NAME>human</COMMON_NAME>
    </SAMPLE_NAME>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>Sample ID</TAG>
        <VALUE>LFS_MB1</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>Donor ID</TAG>
        <VALUE>165304</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>

Control/tumour information?

Different Donor identifiers!!

Hardeep Nahal
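Checks like the ones flagged on this slide can be automated by reading the TAG/VALUE pairs out of each sample record. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the record shown above; the helper names and the "Tumour/Normal designation" tag name are assumptions made for illustration, not an EGA convention.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample record from the slide (XML declaration and
# the IDENTIFIERS / SAMPLE_NAME blocks omitted for brevity).
SAMPLE_XML = """
<SAMPLE_SET>
  <SAMPLE alias="LFS_MB1" center_name="DKFZ-IBIOS" accession="ERS040283">
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>Sample ID</TAG><VALUE>LFS_MB1</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>Donor ID</TAG><VALUE>165304</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>
"""

# The last tag name is hypothetical; the slide only notes that
# control/tumour information is absent.
REQUIRED_TAGS = ("Sample ID", "Donor ID", "Tumour/Normal designation")


def sample_attributes(xml_text):
    """Map each sample accession to its {TAG: VALUE} attributes."""
    root = ET.fromstring(xml_text.strip())
    samples = {}
    for sample in root.iter("SAMPLE"):
        attrs = {}
        for attr in sample.iter("SAMPLE_ATTRIBUTE"):
            attrs[attr.findtext("TAG")] = attr.findtext("VALUE")
        samples[sample.get("accession")] = attrs
    return samples


def missing_tags(attrs, required=REQUIRED_TAGS):
    """List the required tags that a sample does not provide."""
    return [tag for tag in required if not attrs.get(tag)]


if __name__ == "__main__":
    for accession, attrs in sample_attributes(SAMPLE_XML).items():
        print(accession, attrs, "missing:", missing_tags(attrs))
```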

Page 53: Biocuration activities for the International Cancer Genome Consortium (ICGC)

53

Example metadata issues

<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET>
  <SAMPLE center_name="QCMG" alias="ICGC-ABMJ-20101130-29-ND" accession="ERS206872" xmlns:xsi…….
    <IDENTIFIERS>
      <PRIMARY_ID>ERS206872</PRIMARY_ID>
      <SUBMITTER_ID namespace="QCMG">ICGC-ABMJ-20101130-29-ND</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME>
      <TAXON_ID>9606</TAXON_ID>
      <SCIENTIFIC_NAME>Homo sapiens</SCIENTIFIC_NAME>
      <COMMON_NAME>human</COMMON_NAME>
    </SAMPLE_NAME>
    <DESCRIPTION>1:DNA|4:Normal control (other site)|Unknown</DESCRIPTION>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>Sample ID</TAG>
        <VALUE>8029782</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>Donor ID</TAG>
        <VALUE>ICGC_0108</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>

Free Text

Hardeep Nahal

Page 54: Biocuration activities for the International Cancer Genome Consortium (ICGC)

54

http://icgc.org

Page 55: Biocuration activities for the International Cancer Genome Consortium (ICGC)

55

Page 56: Biocuration activities for the International Cancer Genome Consortium (ICGC)

56

Page 57: Biocuration activities for the International Cancer Genome Consortium (ICGC)

57

Select “Bladder Cancer – China”

Page 58: Biocuration activities for the International Cancer Genome Consortium (ICGC)

58

Select “Pancreatic cancer – Canada”

Page 59: Biocuration activities for the International Cancer Genome Consortium (ICGC)

59

… But where is the data?

Page 60: Biocuration activities for the International Cancer Genome Consortium (ICGC)

60

Page 61: Biocuration activities for the International Cancer Genome Consortium (ICGC)

61

http://dcc.icgc.org/

Page 62: Biocuration activities for the International Cancer Genome Consortium (ICGC)

62

Page 63: Biocuration activities for the International Cancer Genome Consortium (ICGC)

63

Page 64: Biocuration activities for the International Cancer Genome Consortium (ICGC)

64

Highlights of the new portal: dcc.icgc.org

• Faceted search capabilities for variants, genes and donors

– Makes interactive data exploration fast and easy

• Mutation aggregation & counts across donors and cancers

– e.g. number of pancreatic cancer donors with the KRAS G12D mutation

• Standardized gene consequence across all projects

• Genome browser

• Data download

• Protein domains

• Links to repositories
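The mutation aggregation highlighted on this slide amounts to counting distinct donors per mutation and cancer type. The sketch below illustrates that idea on made-up observations; the tuple layout and function name are hypothetical and say nothing about how the portal implements it.

```python
from collections import defaultdict


def donors_per_mutation(observations):
    """Count distinct donors for each (cancer type, gene, protein change).

    `observations` is an iterable of
    (donor_id, cancer_type, gene, protein_change) tuples.
    """
    donors = defaultdict(set)
    for donor_id, cancer_type, gene, change in observations:
        # A set ensures a donor is counted once, even if the same
        # mutation is reported in several of their specimens.
        donors[(cancer_type, gene, change)].add(donor_id)
    return {key: len(ids) for key, ids in donors.items()}


if __name__ == "__main__":
    # Made-up observations, for illustration only.
    observations = [
        ("D1", "pancreatic", "KRAS", "G12D"),
        ("D2", "pancreatic", "KRAS", "G12D"),
        ("D2", "pancreatic", "KRAS", "G12D"),  # second specimen, same donor
        ("D3", "bladder", "KRAS", "G12D"),
    ]
    counts = donors_per_mutation(observations)
    print(counts[("pancreatic", "KRAS", "G12D")])
```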

Page 65: Biocuration activities for the International Cancer Genome Consortium (ICGC)

65

KRAS search

Page 66: Biocuration activities for the International Cancer Genome Consortium (ICGC)

66

• Summary

• Cancer type distribution

• Other links (Cosmic, Entrez, etc)

• Mutation profile in protein

• Domains

• Genomic Context

• Mutation profile

• Most common mutations

Page 67: Biocuration activities for the International Cancer Genome Consortium (ICGC)

67

http://dcc.icgc.org/genes/ENSG00000133703

Page 68: Biocuration activities for the International Cancer Genome Consortium (ICGC)

68

Page 69: Biocuration activities for the International Cancer Genome Consortium (ICGC)

69

Page 70: Biocuration activities for the International Cancer Genome Consortium (ICGC)

70

Page 71: Biocuration activities for the International Cancer Genome Consortium (ICGC)

71

Donor

• Donor ID

• Primary site

• Cancer Project

• Gender

• Tumor Stage

• Vital Status

• Disease Status

• Release type

• Age at diagnosis

• Available data types

• Analysis types

Page 72: Biocuration activities for the International Cancer Genome Consortium (ICGC)

72

Genes

Page 73: Biocuration activities for the International Cancer Genome Consortium (ICGC)

73

Mutations

• Consequences

• Type

• Platform

• Verification status

Page 74: Biocuration activities for the International Cancer Genome Consortium (ICGC)

74

Exporting data

Page 75: Biocuration activities for the International Cancer Genome Consortium (ICGC)

75

Exporting data

Page 76: Biocuration activities for the International Cancer Genome Consortium (ICGC)

76

Page 77: Biocuration activities for the International Cancer Genome Consortium (ICGC)

77

Exporting data

Page 78: Biocuration activities for the International Cancer Genome Consortium (ICGC)

78

Can do bulk download of the data …

Page 79: Biocuration activities for the International Cancer Genome Consortium (ICGC)

79

[Figure: "BIG DATA" diagram with the labels RAW DATA, MetaDATA, Validation (twice) and Interpreted data.]

Page 80: Biocuration activities for the International Cancer Genome Consortium (ICGC)

80

[Figure: data access diagram with the labels DACO, ICGC, dbGaP, EGA, TCGA, ERA, Open, BAM, Germ Line and "+ EGA id".]

Page 81: Biocuration activities for the International Cancer Genome Consortium (ICGC)

81

ICGC Data Categories

ICGC Open Access Datasets

• Cancer Pathology: Histologic type or subtype, Histologic nuclear grade

• Donor: Gender, Age range

• RNA expression (normalized)

• DNA methylation

• Genotype frequencies

• Somatic mutations (SNV, CNV and Structural Rearrangement)

ICGC Controlled Access Datasets

• Detailed Phenotype and Outcome Data: Patient demography, Risk factors, Examination, Surgery/Drugs/Radiation, Sample/Slide, Specific histological features, Protocol, Analyte/Aliquot

• Gene Expression (probe-level data)

• Raw genotype calls (germline)

• Gene-sample identifier links

• Genome sequence files

Most of the data in the portal is publicly available without restriction. However, access to some data, like the germline mutations, requires authorization by the Data Access Compliance Office (DACO).

Page 82: Biocuration activities for the International Cancer Genome Consortium (ICGC)
Page 83: Biocuration activities for the International Cancer Genome Consortium (ICGC)

http://icgc.org/daco

Page 84: Biocuration activities for the International Cancer Genome Consortium (ICGC)

84

ICGC Controlled Access Datasets

• Detailed Phenotype and Outcome data: Region of residence, Risk factors, Examination, Surgery, Radiation, Sample, Slide, Specific histological features, Analyte, Aliquot, Donor notes

• Gene Expression (probe-level data)

• Raw genotype calls

• Gene-sample identifier links

• Genome sequence files

ICGC Open Access (OA) Datasets

• Cancer Pathology: Histologic type or subtype, Histologic nuclear grade

• Patient/Person: Gender, Age range, Vital status, Survival time, Relapse type, Status at follow-up

• Gene Expression (normalized)

• DNA methylation

• Computed Copy Number and Loss of Heterozygosity

• Newly discovered somatic variants

http://goo.gl/w4mrV

Page 85: Biocuration activities for the International Cancer Genome Consortium (ICGC)

Identify yourself

Fill out the detailed form, which includes:

• Contact and Project Information

• Information Technology details and procedures for keeping data secure

• Data Access Agreement

All of these documents are put into a PDF file that you print and get your institution to sign off on your behalf.

Page 86: Biocuration activities for the International Cancer Genome Consortium (ICGC)
Page 87: Biocuration activities for the International Cancer Genome Consortium (ICGC)

87

Page 88: Biocuration activities for the International Cancer Genome Consortium (ICGC)

88

Page 89: Biocuration activities for the International Cancer Genome Consortium (ICGC)

89

Page 90: Biocuration activities for the International Cancer Genome Consortium (ICGC)

90

Page 91: Biocuration activities for the International Cancer Genome Consortium (ICGC)

91

Page 92: Biocuration activities for the International Cancer Genome Consortium (ICGC)

92

DACO approved projects (Dec 2014):

163 groups – 867 people

http://goo.gl/E8gHGx

Page 93: Biocuration activities for the International Cancer Genome Consortium (ICGC)

93

Making sense of it all

1 project == 1 pipeline

Page 94: Biocuration activities for the International Cancer Genome Consortium (ICGC)

94

Making sense of it all

70 projects == 70 pipelines

Page 95: Biocuration activities for the International Cancer Genome Consortium (ICGC)

95

Making sense of it all

70 projects == 1 pipeline

Page 96: Biocuration activities for the International Cancer Genome Consortium (ICGC)

96

PanCancer Analysis of Whole Genomes (PCAWG)

• 2,200 T/N pairs with clinical data analyzed over 6 academic clouds

• 16 working groups, > 1000 scientists

• 1 alignment pipeline (8 months)

• Data freeze last month

• 3 somatic mutation pipelines (2 months?)

• 2 RNA-Seq pipelines (1 month?)

• Not scheduled yet:

– miRNA

– CNV & SV

– Pathway analysis

• Start writing papers in July 2015

Page 97: Biocuration activities for the International Cancer Genome Consortium (ICGC)

97

Conclusions: What we are doing at DCC

• Working with EGA to audit missing information and minimize the disconnect between submitters’ ICGC files and the raw data and metadata submitted to EGA.

• Working in close collaboration with ICGC projects and EGA to correct missing information and data so we can harmonize data in the ICGC/EGA submission process.

• Adding a validation step to the submission process to better coordinate efforts with EGA.

Page 98: Biocuration activities for the International Cancer Genome Consortium (ICGC)

98

Conclusions: What we are doing at DCC

• Encourage & work with submitters to supply all clinical metadata.

• Improved data & metadata curation at EGA; better linking of data held at DCC to ICGC data in other repositories

• Improved data quality/integrity checking through new submission/validation system; review of submission file specifications

• Integration of new data submission system and portal infrastructure with project and user information managed at ICGC.org

• Integrating PCAWG results with ICGC data portal

Page 99: Biocuration activities for the International Cancer Genome Consortium (ICGC)

99

Some thoughts:

• Curation activities are obviously crucial to the development of a great database

• Continuous feedback between users, developers and submitters is also critical

• Biocurators are at the important interface and are essential team players for the development and maintenance of any modern database.

Page 100: Biocuration activities for the International Cancer Genome Consortium (ICGC)

100

http://www.biocurator.org/

Page 101: Biocuration activities for the International Cancer Genome Consortium (ICGC)

101

Nature 409:452

Bioinformatics Citizenship: What it means, and what does it cost?

Page 102: Biocuration activities for the International Cancer Genome Consortium (ICGC)

102

Important messages:

• The ICGC portal is evolving and getting better all the time

• Lots of data provided by the ICGC

• Important to be good citizens of the scientific world

• The idea behind all of this is to provide tools to help cure cancer

• Need to respect policies and guidelines

• There is help out there, and user feedback is *always* welcome.

Page 103: Biocuration activities for the International Cancer Genome Consortium (ICGC)

103

Acknowledgments

ICGC/OICR Project leaders: Tom Hudson, John McPherson, Lincoln Stein, Jared Simpson, Paul Boutros, Vincent Ferretti, Francis Ouellette, Jennifer Jennings

DCC Software Developer: Vincent Ferretti, Daniel Chang, Anthony Cros, Jerry Lam, Brian O'Connor, Bob Tiernay, Stuart Watt, Shane Wilson, Junjun Zhang

Ouellette Lab: Michelle Brazas, Emilie Chautard, Nina Palikuca, Zhibin Lu

Web Dev: Joseph Yamada, Angela Chao, Daniel Gross, Kamen Wu, Kim Cullion, Miyuki Fukuma, Wen Xu

Pipeline Development & Evaluation: Morgan Taschuk, Michael Laszloffy, Peter Ruzanov

ICGC DCC Biocuration: Hardeep Nahal, Marc Perry

Research IT/Systems: David Sutton, Bob Gibson, Sam Maclennan, David Magda, Rob Naccarato, Brian Ott, Gino Yearwood

EGA: Justin Paschall, Jeff Almeida-King, Ilkka Lappalainen, Jordi Rambla De Argila, Marc Sitges Puy

SeqProdBio Team: Tim Beck, Tony DeBat, Larry Heissler, Xuemei (Mei) Luo, Michael Moorhouse, Furqan Qureshi, Yogi Sundaravadanam

http://oicr.on.ca http://icgc.org

… and all the patients and their families that are putting their hopes into our work!

Page 104: Biocuration activities for the International Cancer Genome Consortium (ICGC)

104

Informatics and Biocomputing at the OICR

Page 105: Biocuration activities for the International Cancer Genome Consortium (ICGC)

105

http://oicr.on.ca/careers

Page 106: Biocuration activities for the International Cancer Genome Consortium (ICGC)

106

Page 107: Biocuration activities for the International Cancer Genome Consortium (ICGC)

107

http://icgc.org

http://dcc.icgc.org

http://docs.icgc.org

[email protected]

@bffo