DESCRIPTION
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data are made available, researchers are finding it increasingly difficult to discover and reuse them. One problem is that data are insufficiently described to understand what they are or how they were produced. A second is that no single vocabulary provides all the key metadata fields required to support basic scientific use cases. A third is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, which prevents easy search and aggregation of data. We therefore need a community profile that states which metadata are essential and how they should be expressed.

The W3C Health Care and Life Sciences Interest Group has developed such a community profile. It defines the properties required to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable; http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information provided by PROV-O and PAV. The community profile is based on a three-tier model: the summary description captures catalogue-style metadata about the dataset, each version of the dataset is described separately, and so are the various distribution formats of each version.

The resulting community profile is generic and applicable to a wide variety of scientific data. Tools are being developed to help with the creation and validation of these descriptions. Several datasets, including those from Bio2RDF, EBI and IntegBio, are already moving to release descriptions that conform to the community profile.
Describing Scientific Datasets: The HCLS Community Profile
Alasdair J G Gray
alasdairjggray.co.uk
@gray_alasdair
Michel Dumontier, Stanford University
M. Scott Marshall, MAASTRO Clinic
[Figure: platform architecture diagram. Components: Data Cache (Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Identifier Management Service, Core Platform. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Dataset labels: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD, v13, v12, v2 or v8.]
Open PHACTS Discovery Platform
Historic Use Case (~January 2012)
Open PHACTS v1.3: ChEMBL 16
http://tiny.cc/ops-datasets
Challenges
Datasets available
  In many versions over time
  In different formats
  From many mirrors/registries
Datasets build on each other
Files do not carry metadata
Registries
  Can be out-of-date
  Can contain conflicting information
25 September 2014 EUON - HCLS Dataset Description 5
Scientists require data provenance!
Dublin Core Metadata Initiative
Widely used
Broadly applicable
  Documents
  Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time associated with an event in the lifecycle of the resource.”
VoID: Vocabulary of Interlinked Datasets
Metadata carried with data
Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data
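As a minimal sketch of what "directly embedded" could look like, the snippet below builds a single void:inDataset statement that could be appended to an RDF data file so that the file points at its own dataset description. The function name and the example.org IRIs are hypothetical and used only for illustration.

```python
VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"  # VoID namespace + term


def embed_dataset_link(document_iri: str, dataset_iri: str) -> str:
    """Return one N-Triples statement linking a data file to its dataset.

    Hypothetical helper: the IRIs are passed in as plain strings and are
    not validated or escaped.
    """
    return f"<{document_iri}> <{VOID_IN_DATASET}> <{dataset_iri}> ."


# Illustrative IRIs only; they do not refer to real resources.
triple = embed_dataset_link(
    "http://example.org/data/chembl16.nt",
    "http://example.org/dataset/chembl/16",
)
print(triple)
```

Carrying this one statement inside the file means a consumer who obtains the data from any mirror can still find the description it belongs to.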
DCAT: Data Catalog Vocabulary
Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
W3C HCLS Group
HCLS Dataset Descriptions
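The three-tier model from the description (summary, version, distribution) can be sketched roughly as follows. This is an illustrative sketch, not the profile itself: the class names, the example.org IRIs and the choice of linking properties (dct:isVersionOf, pav:version, dcat:distribution, dct:format, dcat:downloadURL) are assumptions drawn from the vocabularies the profile reuses, and the Turtle is produced by naive string formatting with no escaping.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    iri: str            # hypothetical IRI for one file format of a version
    media_format: str   # e.g. "text/turtle"
    download_url: str


@dataclass
class Version:
    iri: str
    version: str        # version label, e.g. "16"
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Summary:
    iri: str
    title: str
    versions: List[Version] = field(default_factory=list)


def to_turtle(summary: Summary) -> str:
    """Serialise the three tiers as Turtle (assumes values need no escaping)."""
    lines = [
        "@prefix dct: <http://purl.org/dc/terms/> .",
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix pav: <http://purl.org/pav/> .",
        "",
        # Tier 1: version-independent, catalogue-style metadata.
        f'<{summary.iri}> dct:title "{summary.title}" .',
    ]
    for v in summary.versions:
        # Tier 2: one description per released version.
        lines.append(f"<{v.iri}> dct:isVersionOf <{summary.iri}> ;")
        lines.append(f'    pav:version "{v.version}" .')
        for d in v.distributions:
            # Tier 3: one description per file format of that version.
            lines.append(f"<{v.iri}> dcat:distribution <{d.iri}> .")
            lines.append(f'<{d.iri}> dct:format "{d.media_format}" ;')
            lines.append(f"    dcat:downloadURL <{d.download_url}> .")
    return "\n".join(lines)


# Illustrative data only; the IRIs do not refer to real resources.
chembl = Summary(
    iri="http://example.org/dataset/chembl",
    title="ChEMBL",
    versions=[
        Version(
            iri="http://example.org/dataset/chembl/16",
            version="16",
            distributions=[
                Distribution(
                    iri="http://example.org/dataset/chembl/16/ttl",
                    media_format="text/turtle",
                    download_url="http://example.org/files/chembl16.ttl",
                )
            ],
        )
    ],
)
turtle = to_turtle(chembl)
print(turtle)
```

Keeping the version label off the summary tier is what lets a single summary description stay valid across releases, while each release and each file format gets its own record.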
VoID Editor
Validator
New version using ShEx in development
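To give a feel for what a checklist-style check involves, here is a minimal sketch that reports which required properties are missing from a description at a given tier. It is not the HCLS validator and does not use ShEx. The REQUIRED table is a deliberately small, illustrative assumption rather than the profile's actual list of mandatory properties, and the function name is hypothetical.

```python
from typing import Dict, List

# Illustrative requirement table (an assumption, not the profile's real list).
REQUIRED = {
    "summary": {"dct:title", "dct:description"},
    "version": {"dct:title", "dct:isVersionOf", "pav:version"},
    "distribution": {"dct:title", "dct:format"},
}


def missing_properties(level: str, description: Dict[str, str]) -> List[str]:
    """Return the required properties absent from a description, sorted."""
    if level not in REQUIRED:
        raise ValueError(f"unknown description level: {level}")
    return sorted(REQUIRED[level] - set(description))


# A version-level description that lacks its version label.
version_description = {
    "dct:title": "ChEMBL 16",
    "dct:isVersionOf": "http://example.org/dataset/chembl",
}
report = missing_properties("version", version_description)
print(report)
```

A per-tier table like this is the kind of checklist that Dublin Core, VoID and DCAT do not provide on their own, which is the gap the community profile is meant to fill.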
Future Vision
Provide rich and accurate provenance trail of data
Write once, use many times
Automatic pipeline from description file to registries
FAIR Data
Thank you
Editors’ Draft:
http://tiny.cc/hcls-datadesc-ed
W3C Interest Group Note:
http://tiny.cc/hcls-datadesc
Acknowledgements to W3C HCLS Group
www.alasdairjggray.co.uk
@gray_alasdair