Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers

Richard H. Scheuermann, Ph.D.
Department of Pathology
Division of Biomedical Informatics
U.T. Southwestern Medical Center



Genome Sequencing Centers for Infectious Disease (GSCID)

Bioinformatics Resource Centers (BRC)

www.viprbrc.org www.fludb.org

High Throughput Sequencing

• Enabling technology
  – Epidemiology of outbreaks
  – Pathogen evolution
  – Host range restriction
  – Genetic determinants of virulence and pathogenicity

• Metadata requirements
  – Temporal-spatial information about isolates
  – Selective pressures
  – Host species of specimen source
  – Disease severity and clinical manifestations

Metadata Submission Spreadsheets

Complex Query Interface

Metadata Inconsistencies

• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis
• Required extensive custom bioinformatics system development

GSC-BRC Metadata Standards Working Group

• NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs

• Develop metadata standards for pathogen isolate sequencing projects

• Bottom-up approach
• Assemble into a semantic framework

Metadata Standards Process

• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide a common set of attributes, including definitions, synonyms, allowed value sets (preferably using controlled vocabularies), and expected syntax
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble set of pathogen-specific and project-specific metadata fields to be used in conjunction with core fields
• Compare, harmonize, map to other relevant initiatives, including OBI, MIGS, MIMARKS, BioProjects, BioSamples (ongoing)
• Assemble all metadata fields into a semantic network
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
• Draft data submission spreadsheets to be used for all white paper and BRC-associated projects
• Finalize version 1.0 metadata standard and version 1.0 data submission spreadsheet
• Beta test version 1.0 standard with new white paper projects, collecting feedback
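The step that separates core fields from project-specific fields can be illustrated with a small sketch. The project names, field names, and the `min_fraction` threshold below are hypothetical and chosen only for illustration; the working group's actual procedure was a manual review and is not described in this detail.

```python
from collections import Counter


def split_core_fields(projects, min_fraction=1.0):
    """Split field names into core (shared) and project-specific sets.

    projects: dict mapping a project name to the set of field names it collects.
    min_fraction: share of projects that must use a field for it to count as
                  core (1.0 means every project; the threshold is an assumption).
    """
    counts = Counter(f for fields in projects.values() for f in set(fields))
    n = len(projects)
    core = {f for f, c in counts.items() if c / n >= min_fraction}
    specific = {name: set(fields) - core for name, fields in projects.items()}
    return core, specific


# Hypothetical example projects and field names, for illustration only.
example_projects = {
    "project_a": {"specimen_id", "collection_date", "host_species", "vaccination_status"},
    "project_b": {"specimen_id", "collection_date", "host_species", "antibiotic_resistance"},
    "project_c": {"specimen_id", "collection_date", "host_species"},
}

core, specific = split_core_fields(example_projects)
print(sorted(core))
print(specific["project_a"])
```

Lowering `min_fraction` would treat fields used by most, but not all, projects as core, which may be closer to how "appear to be common" is judged in practice.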

Data Fields: Core Project Core Sample

Attributes

Metadata Standards Process

• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
  – definitions
  – synonyms
  – allowed value sets, preferably using controlled vocabularies
  – expected syntax
  – examples
  – data categories
  – data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network (Scheuermann)
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI) (Stoeckert, Zheng)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
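The per-field attributes listed above (definition, synonyms, allowed value set, expected syntax, example) map naturally onto a small record type. The following is a minimal sketch, not the working group's schema: the class name `FieldSpec`, the two example fields, their value sets, and the date pattern are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FieldSpec:
    """Attributes recorded for one metadata field (hypothetical structure)."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None  # None means free text
    syntax: Optional[str] = None               # regular expression, if any
    example: str = ""

    def matches(self, label):
        """True if a spreadsheet column label is this field's name or a synonym."""
        known = {self.name.lower()} | {s.lower() for s in self.synonyms}
        return label.strip().lower() in known

    def validate(self, value):
        """Return a list of problems with value; an empty list means valid."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: '{value}' not in allowed value set")
        if self.syntax is not None and not re.fullmatch(self.syntax, value):
            problems.append(f"{self.name}: '{value}' does not match expected syntax")
        return problems


# Hypothetical field definitions, for illustration only.
collection_date = FieldSpec(
    name="collection_date",
    definition="Date on which the specimen was collected",
    synonyms=["sampling date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    example="2011-03-15",
)
specimen_type = FieldSpec(
    name="specimen_type",
    definition="Kind of material collected from the specimen source",
    allowed_values={"blood", "serum", "nasal swab"},
    example="blood",
)

print(collection_date.validate("15/03/2011"))
print(specimen_type.validate("blood"))
```

Keeping synonyms on the field record lets a submission spreadsheet with differently named columns be mapped back to the standard field before values are checked.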

Specimen Isolation

[Figure: semantic network for specimen isolation. Nodes: organism, organism part, environmental material, environment, equipment, person, name, affiliation, ID, specimen X, specimen type, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, hypothesis, IRB/IACUC approval, organism pathogenic disposition, comments, and temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Roles: specimen source role, specimen capture role, specimen collector role. Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality, has_disposition. Nodes carry the labels v2–v9, v15–v19, b18 and b22–b30.]
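One way to hold a network like the Specimen Isolation figure in code is as a list of subject-relation-object triples. The sketch below reuses node and relation names from the figure, but the specific edges are a simplified guess, since the exact wiring of the diagram cannot be recovered from the slide text; the helper `objects_of` is hypothetical.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject relation object")

# Node and relation names come from the figure; the edges are illustrative.
triples = [
    Triple("specimen isolation procedure X", "has_input", "organism"),
    Triple("specimen isolation procedure X", "has_output", "specimen X"),
    Triple("specimen isolation procedure X", "has_specification", "isolation protocol"),
    Triple("specimen isolation procedure X", "located_in", "temporal-spatial region"),
    Triple("organism", "plays", "specimen source role"),
    Triple("person", "plays", "specimen collector role"),
    Triple("ID", "denotes", "specimen X"),
    Triple("specimen X", "instance_of", "specimen type"),
]


def objects_of(subject, relation, graph):
    """Return every object linked to subject by relation, in graph order."""
    return [t.object for t in graph if t.subject == subject and t.relation == relation]


print(objects_of("specimen isolation procedure X", "has_output", triples))
```

A production system would more likely use an RDF store with OBI identifiers; plain tuples are used here only to keep the example self-contained.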

Metadata Processes

[Figure: overview of the metadata processes, in panels labeled Investigation, Specimen Isolation, Material Processing, Sequencing Assay and Data Processing, with a Core-Specimen label. Nodes: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen, microorganism, sample processing, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly), sequence data, data transformations (variant detection, serotype marker detection, gene detection), genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, temporal-spatial region, and type, ID and qualities. Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of.]

Generic Assay

[Figure: generic assay pattern. Nodes: assay X, assay type, assay protocol, run ID, objectives, sample material X, input sample material X, sample ID, sample type, material X, reagent type, lot #, person X, name, species, equipment X, equipment type, serial #, analyte X, quality x, primary data, and temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Roles: target role, reagent role, technician role, signal detection role. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, located_in, instance_of, is_about.]

Generic Material Transformation

[Figure: generic material transformation pattern. Nodes: material transformation X, material transformation type, material transformation protocol, run ID, objectives, sample material X, sample ID, sample type, material X, material type, reagent type, lot #, person X, name, species, equipment X, equipment type, serial #, output material X, quality x, and temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Roles: target role, reagent role, technician role, signal detection role. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, located_in, instance_of.]

Generic Data Transformation

[Figure: generic data transformation pattern. Nodes: data transformation X, data transformation type, input data, output data, material X, algorithm, software, run ID, person X, name, and temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Roles: data analyst role. Relations: has_input, has_output, has_specification, has_part, plays, denotes, located_in, instance_of, is_about.]

Generic Material (IC)

[Figure: generic material pattern. Nodes: material X, ID, material type, quality x, material Y, material Z, quality y, and two temporal-spatial regions (each with spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_part, has_quality, denotes, instance_of, located_in.]
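The generic assay, material transformation and data transformation patterns share one shape: a process with a type, a run ID, a specification (protocol or algorithm), inputs and outputs. A minimal sketch of that shared shape follows; the class `Process`, the helper `trace_back`, the run IDs and the four-step chain are hypothetical, loosely modeled on the node names in the figures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Process:
    kind: str           # "assay", "material transformation" or "data transformation"
    run_id: str         # identifier that denotes this process instance
    specification: str  # protocol or algorithm (has_specification)
    inputs: List[str] = field(default_factory=list)   # has_input
    outputs: List[str] = field(default_factory=list)  # has_output


def trace_back(item, processes):
    """Follow has_output/has_input links from item back to its earliest inputs."""
    lineage = []
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for p in processes:
            if current in p.outputs and p not in lineage:
                lineage.append(p)
                frontier.extend(p.inputs)
    return [p.run_id for p in lineage]


# Hypothetical chain of runs, for illustration only.
processes = [
    Process("material transformation", "run-1", "isolation protocol",
            ["specimen source"], ["specimen"]),
    Process("material transformation", "run-2", "sample processing protocol",
            ["specimen"], ["enriched NA sample"]),
    Process("assay", "run-3", "assay protocol",
            ["enriched NA sample"], ["primary data"]),
    Process("data transformation", "run-4", "assembly algorithm",
            ["primary data"], ["sequence data"]),
]

print(trace_back("sequence data", processes))
```

Because every step uses the same structure, provenance questions such as "which runs produced this sequence record?" reduce to walking output-to-input links.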

Conclusions

• Metadata standards for microorganism sequencing projects
• Bottom-up approach focuses standard on important features
• Two flavors of “minimum information” – MIBBI vs. dbMIBBI
  – Distinguish between minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
• Utility of semantic representation
  – Identified gaps in data field list (e.g. temporal components)
  – Includes logical structure for other, project-specific, data fields - extensible
  – Identified gaps in ontology data standards (use case-driven standard development)
  – Identified commonalities in data structures (reusable)
  – Support for semantic queries and inferential analysis in future
• Ontology-based framework is extensible
  – Sequencing => “omics”
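As a small illustration of what "semantic queries and inferential analysis" could mean, the sketch below answers a regional query by following `located_in` transitively, so a specimen annotated with a city is also found when querying its state or country. The specimen IDs and function names are hypothetical, and the approach is a plain-Python stand-in for what an ontology reasoner would do.

```python
def containing_regions(place, located_in):
    """Return every region that contains place, following located_in transitively."""
    seen = []
    while place in located_in and located_in[place] not in seen:
        place = located_in[place]
        seen.append(place)
    return seen


def specimens_in(region, specimens, located_in):
    """Return sorted specimen IDs collected in region or in any place inside it."""
    return sorted(
        sid for sid, place in specimens.items()
        if place == region or region in containing_regions(place, located_in)
    )


# located_in facts between places; specimen IDs are hypothetical.
located_in = {"Dallas": "Texas", "Texas": "United States"}
specimens = {
    "specimen-001": "Dallas",
    "specimen-002": "Texas",
    "specimen-003": "Ontario",
}

print(specimens_in("Texas", specimens, located_in))
```

The query never needed the submitter to record the state or country explicitly; that information is inferred from the relation between places.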

GSC-BRC Metadata Working Groups