Describing Scientific Datasets: The HCLS Community Profile

Alasdair J G Gray (alasdairjggray.co.uk, @gray_alasdair); Michel Dumontier, Stanford University; M. Scott Marshall, MAASTRO Clinic


DESCRIPTION

Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data are made available, researchers are finding it increasingly difficult to discover and reuse these data. One problem is that data are insufficiently described to understand what they are or how they were produced. A second issue is that no single vocabulary provides all the key metadata fields required to support basic scientific use cases. A third issue is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, which prevents easy search and aggregation of data. We therefore need a community profile that indicates which metadata are essential and how they should be expressed.

The W3C Health Care and Life Sciences Interest Group has developed such a community profile. It defines the properties required to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable; see http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information provided by PROV-O and PAV. The community profile is based on a three-tier model: the summary description captures catalogue-style metadata about the dataset, each version of the dataset is described separately, and so is each distribution format of each version.

The resulting community profile is generic and applicable to a wide variety of scientific data. Tools are being developed to help with the creation and validation of these descriptions. Several datasets, including those from Bio2RDF, EBI and IntegBio, are already moving to release descriptions conforming to the community profile.
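The three-tier model described above (summary, version, distribution) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the example.org IRIs and the choice of properties (dct:title, dct:isVersionOf, pav:version, dcat:distribution, dct:format, dcat:downloadURL) are picked here to show how the levels link together and are not taken from the profile text itself.

```python
from dataclasses import dataclass, field
from typing import List

# Namespaces of vocabularies named in the abstract. Which properties the
# profile actually requires at each level is NOT shown here; the ones used
# below are illustrative assumptions.
PREFIXES = {
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "pav": "http://purl.org/pav/",
}


@dataclass
class Distribution:
    iri: str           # hypothetical IRI for one downloadable file
    media_type: str    # e.g. "text/turtle"
    download_url: str


@dataclass
class Version:
    iri: str
    label: str         # version string, e.g. "16"
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Summary:
    iri: str
    title: str
    versions: List[Version] = field(default_factory=list)


def to_turtle(summary: Summary) -> str:
    """Serialise the three levels as Turtle (no string escaping, sketch only)."""
    lines = [f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items()]
    lines.append("")
    # Tier 1: version-independent summary description.
    lines.append(f'<{summary.iri}> dct:title "{summary.title}" .')
    for v in summary.versions:
        # Tier 2: one description per released version.
        lines.append(f"<{v.iri}> dct:isVersionOf <{summary.iri}> ;")
        lines.append(f'    pav:version "{v.label}" .')
        for d in v.distributions:
            # Tier 3: one description per file format of that version.
            lines.append(f"<{v.iri}> dcat:distribution <{d.iri}> .")
            lines.append(f"<{d.iri}> a dcat:Distribution ;")
            lines.append(f'    dct:format "{d.media_type}" ;')
            lines.append(f"    dcat:downloadURL <{d.download_url}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    example = Summary(
        iri="http://example.org/dataset",
        title="Example dataset",
        versions=[
            Version(
                iri="http://example.org/dataset/v1",
                label="1",
                distributions=[
                    Distribution(
                        iri="http://example.org/dataset/v1/turtle",
                        media_type="text/turtle",
                        download_url="http://example.org/dataset/v1/data.ttl",
                    )
                ],
            )
        ],
    )
    print(to_turtle(example))
```

Keeping the levels as separate resources means a new release only adds a version and its distributions; the summary description stays untouched.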


Page 1: Describing Scientific Datasets: The HCLS Community Profile

Describing Scientific Datasets: The HCLS Community Profile

Alasdair J G Gray

alasdairjggray.co.uk

@gray_alasdair

Michel Dumontier, Stanford University

M. Scott Marshall, MAASTRO Clinic

Page 2: Describing Scientific Datasets: The HCLS Community Profile
Page 3: Describing Scientific Datasets: The HCLS Community Profile
Page 4: Describing Scientific Datasets: The HCLS Community Profile

[Figure: Core Platform architecture diagram. Components: Data Cache (Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Identifier Management Service. Labels: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Data sources: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD. Version labels: v13, v12, “v2 or v8”.]

Page 5: Describing Scientific Datasets: The HCLS Community Profile

[Figure: the Core Platform architecture diagram from the previous page, repeated.]

Open PHACTS Discovery Platform

Historic Use Case, ~January 2012

Open PHACTS v1.3, ChEMBL 16

http://tiny.cc/ops-datasets

Page 6: Describing Scientific Datasets: The HCLS Community Profile

Challenges

Datasets available

In many versions over time

In different formats

From many mirrors/registries

Datasets build on each other

Files do not carry metadata

Registries

Can be out-of-date

Can contain conflicting information

25 September 2014 EUON - HCLS Dataset Description 5

Scientists require data provenance!

Page 7: Describing Scientific Datasets: The HCLS Community Profile

Dublin Core Metadata Initiative

Widely used

Broadly applicable

Documents

Datasets

✗ Generic terms

✗ Not comprehensive

✗ No required properties


“Date: A point or period of time associated with an event in the lifecycle of the resource.”

Page 8: Describing Scientific Datasets: The HCLS Community Profile

VoID: Vocabulary of Interlinked Datasets

Metadata carried with data

Directly embedded: void:inDataset

✗ No versioning

✗ No checklist of requisite fields

✗ Only for RDF data
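As a rough illustration of how VoID lets metadata travel with the data, the sketch below appends a void:inDataset back-link to an RDF document serialised as Turtle. The function name and the example.org IRIs are hypothetical; only the void:inDataset property and the VoID namespace come from the vocabulary itself.

```python
VOID_NS = "http://rdfs.org/ns/void#"


def embed_dataset_link(turtle_doc: str, document_iri: str, dataset_iri: str) -> str:
    """Append a void:inDataset triple linking a document to its dataset.

    Simplification: the predicate is written as a full IRI so that no
    @prefix declaration has to be added to the existing document.
    """
    triple = f"<{document_iri}> <{VOID_NS}inDataset> <{dataset_iri}> ."
    return turtle_doc.rstrip("\n") + "\n\n" + triple + "\n"


if __name__ == "__main__":
    data = '<http://example.org/thing/1> <http://purl.org/dc/terms/title> "Thing" .\n'
    print(embed_dataset_link(
        data,
        document_iri="http://example.org/data/part1.ttl",
        dataset_iri="http://example.org/dataset/v1",
    ))
```

A consumer that downloads the file can follow the link to the dataset description, which is the point of embedding. As the slide notes, this only works for RDF data.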

Page 9: Describing Scientific Datasets: The HCLS Community Profile

DCAT: Data Catalog

Separates Dataset and Distribution

✗ No versioning

✗ No prescribed properties


Page 10: Describing Scientific Datasets: The HCLS Community Profile

W3C HCLS Group


Page 11: Describing Scientific Datasets: The HCLS Community Profile

HCLS Dataset Descriptions


Page 12: Describing Scientific Datasets: The HCLS Community Profile

VoID Editor


Page 13: Describing Scientific Datasets: The HCLS Community Profile

Validator


New version using ShEx in development
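The idea behind such a validator can be sketched as a checklist lookup: for a given level of the description, report which expected properties are absent. This is a simplified stand-in in plain Python, not ShEx and not the validator shown on the slide. The REQUIRED table is a hypothetical placeholder and does not reproduce the profile's actual requirements.

```python
from typing import Dict, List

# Hypothetical checklist per description level. The real profile defines
# its own required properties; these entries only make the sketch runnable.
REQUIRED: Dict[str, List[str]] = {
    "summary": ["dct:title", "dct:description", "dct:publisher"],
    "version": ["dct:title", "pav:version", "dct:isVersionOf"],
    "distribution": ["dct:title", "dct:format", "dcat:downloadURL"],
}


def missing_properties(level: str, description: Dict[str, List[str]]) -> List[str]:
    """Return the checklist properties that have no value in the description.

    `description` maps a property name to its list of values; a missing key
    or an empty list both count as "not provided".
    """
    if level not in REQUIRED:
        raise ValueError(f"unknown description level: {level!r}")
    return [prop for prop in REQUIRED[level] if not description.get(prop)]


if __name__ == "__main__":
    version_description = {
        "dct:title": ["Example dataset, version 1"],
        "pav:version": ["1"],
    }
    print(missing_properties("version", version_description))
```

A shape language such as ShEx expresses the same checklist declaratively and can additionally constrain value types and cardinalities, which this sketch does not attempt.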

Page 14: Describing Scientific Datasets: The HCLS Community Profile

Future Vision

Provide a rich and accurate provenance trail of data

Write once, use many times

Automatic pipeline from description file to registries

FAIR Data


Page 15: Describing Scientific Datasets: The HCLS Community Profile

Thank you

Editors’ Draft:

http://tiny.cc/hcls-datadesc-ed

W3C Interest Group Note:

http://tiny.cc/hcls-datadesc

Acknowledgements to W3C HCLS Group

www.alasdairjggray.co.uk


@gray_alasdair
