Persistent Identifier Linking
Tom Demeranville, THOR Senior Project Officer and ORCID Software Engineer, https://orcid.org/0000-0003-0902-4386
Martin Fenner, DataCite Technical Director, https://orcid.org/0000-0003-1419-2405
Laura Rueda, DataCite Communications Director, https://orcid.org/0000-0001-5952-7630
Linking Data and Data: Challenges
How to cite data with the right granularity?
How to link data and contributors with the right granularity?
Datasets that are part of larger datasets or heterogeneous collections
Multiple versions of the same dataset
Dynamic data
Linking Data and Data: Challenges – Granularity of Data in ORCID Record
http://search.datacite.org/contributors/0000-0002-8635-8390
Linking Data and Data: Challenges – Versioned Data in ORCID Record
http://orcid.org/0000-0003-1419-2405
Linking Data and Data: Research – Granularity
Attribution vs. Specificity
Persistent identifiers for datasets need to support different levels of granularity
Ideally this is done by multiple persistent identifiers linked via a Has Part/Is Part Of relationship
Collections will play an increasingly important role
Linking Data and Data: Research – Data Versioning
Versioning of data is important for specificity and verifiability
Practices and expectations for versioning of data vary widely between communities and data centers
The data repository is ultimately responsible for decisions about versioning
General recommendations can only include high-level best practices and common vocabulary
Linking Data and Data: Research – Cross-Linking of Databases
Cross-linking of major life sciences resources at EMBL-EBI (source: EMBL-EBI)
Linking Data and Data: Implementation – Cross-Linking of Databases
Cross-linking between different databases is not conceptually different from article-data linking; implementation should follow the same principles (see next section)
Linking Data and Data: Implementation – Collections
http://search.datacite.org/works/10.1594/PANGAEA.611088
Linking Data and Data: Demo
Collection of climate data from ship logbooks: http://search.datacite.org/works/10.1594/PANGAEA.611088
Dryad Datasets associated with a specific publication: http://search.datacite.org/works/10.5061/DRYAD.9R161.1
Linking Data and Articles: Challenges
Data underlying the findings described in a manuscript is not always fully available
Data underlying the findings described in a manuscript is made available, but hidden in supplementary information and not easily findable
Data underlying the findings described in a manuscript is made available, but not properly linked to/from the article
Linking Data and Articles: Implementation – Follow FAIR Data Principles
From: http://slideshare.net/lshtm/preparing-data-for-sharing-the-fair-principles
Linking Data and Articles: Research - Conceptual Model
Linkage as Triples. In the form subject-predicate-object, consistent with the Resource Description Framework (RDF) data model.
Describing the relation. Additional information such as relation type (e.g. A is new version of B) and provenance.
Persistent Identifiers as HTTP URIs. This makes them actionable, and compatible with the RDF data model.
Centralized infrastructure for persistent identifier linking. Provided for example by ORCID and DataCite, facilitating discovery.
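As a minimal sketch of the conceptual model above, a link can be held as a subject-predicate-object triple whose subject and object are persistent identifiers expressed as HTTP URIs, with the relation type as the predicate and a provenance field alongside. The `Link` class, its field names and the `source` field are illustrative assumptions, not part of any ORCID or DataCite API; the relation name follows the style of the DataCite relation type vocabulary, and the DOIs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One link between two resources, stored as a subject-predicate-object triple."""
    subj: str    # persistent identifier as an HTTP URI
    pred: str    # relation type, e.g. "IsNewVersionOf"
    obj: str     # persistent identifier as an HTTP URI
    source: str  # provenance: who asserted the link (hypothetical field)

    def __post_init__(self):
        # Identifiers must be HTTP URIs so that they are actionable.
        for uri in (self.subj, self.obj):
            if not uri.startswith(("http://", "https://")):
                raise ValueError(f"not an HTTP URI: {uri!r}")

    def as_triple(self):
        """Return the bare (subject, predicate, object) triple."""
        return (self.subj, self.pred, self.obj)


# Hypothetical DOIs, for illustration only.
link = Link(
    subj="https://doi.org/10.5555/example-dataset.v2",
    pred="IsNewVersionOf",
    obj="https://doi.org/10.5555/example-dataset.v1",
    source="example-repository",
)
print(link.as_triple())
```

Keeping provenance outside the triple itself is one possible design choice: the triple stays compatible with the RDF data model, while the extra field records where the assertion came from.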
Linking Data with Articles: Implementation – Discover Article/Data Links
DataCite Event Data (https://eventdata.datacite.org): collects, aggregates and makes available article/data links from DataCite metadata and other sources
Crossref Event Data (https://api.eventdata.crossref.org): collects and makes available article/data links from Crossref metadata and other sources
OpenAIRE Data/Literature Linking Service (http://dliservice.research-infrastructures.eu): collects and makes available article/data links from a variety of sources
Linking Data with Articles: Implementation – Exchange Article/Data Links
Standard metadata for exchanging Article/Data Links: joint collaboration within RDA/WDS Data Publishing Services WG (http://www.scholix.org/guidelines)
Link exchange between Crossref and DataCite: using the same open source software (https://github.com/lagotto/lagotto) for their respective Event Data services
Linking Data with Articles: Demo
Supplementary Information hosted in Data Repository: http://search.datacite.org/works/10.6084/M9.FIGSHARE.3427304
Five datasets from Cambridge Crystallographic Data Centre linked to the same article: http://search.datacite.org/works/10.1021/acs.cgd.6b00527
Software library described in Journal of Open Source Software: http://search.datacite.org/works/10.21105/joss.00026
PLOS articles linked with at least one DataCite DOI: http://search.datacite.org/data-centers/340
DataCite DOI -> Crossref DOI links exported from DataCite to Crossref: http://api.eventdata.crossref.org/works?source_id=datacite_crossref
In practical terms...
Real interoperability is much more than a framework:
• Compatible data models
• Metadata quality
• Development effort
• Coordination
During this first year, THOR has:
• Assessed how artefacts, contributors, organisations and others are modelled
• Explored different implementations (ADS, Dryad… )
• Proposed approaches to overcome mismatches
Metadata compatibility - ORCID/DataCite
• Personal names (single and multiple fields)
Multiple fields:
<creators>
  <creator>
    <creatorName>Miller, Elizabeth</creatorName>
    <givenName>Elizabeth</givenName>
    <familyName>Miller</familyName>
    <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0001-5000-0007</nameIdentifier>
    <affiliation>DataCite</affiliation>
  </creator>
</creators>
Single field:
<creators>
  <creator>
    <creatorName>Miller, Elizabeth</creatorName>
    <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0001-5000-0007</nameIdentifier>
    <affiliation>DataCite</affiliation>
  </creator>
</creators>
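To illustrate the name mismatch, here is a minimal sketch of how a consumer might read both variants of the creator record above. It prefers the separate given and family name fields and otherwise falls back to splitting the single `creatorName` on its first comma. The function name `parse_creator` is hypothetical, the comma split is only a heuristic (it will mislabel organisational names or names written without a comma), and real DataCite records declare an XML namespace that this sketch ignores because the slide snippets omit it.

```python
import xml.etree.ElementTree as ET

MULTI = """<creators><creator>
  <creatorName>Miller, Elizabeth</creatorName>
  <givenName>Elizabeth</givenName>
  <familyName>Miller</familyName>
  <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0001-5000-0007</nameIdentifier>
</creator></creators>"""

SINGLE = """<creators><creator>
  <creatorName>Miller, Elizabeth</creatorName>
  <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0001-5000-0007</nameIdentifier>
</creator></creators>"""


def parse_creator(creator):
    """Return (given, family, orcid) for one <creator> element."""
    given = creator.findtext("givenName")
    family = creator.findtext("familyName")
    if given is None or family is None:
        # Heuristic fallback: assume the single field is "Family, Given".
        name = creator.findtext("creatorName", default="")
        family, _, given = (part.strip() for part in name.partition(","))
    else:
        given, family = given.strip(), family.strip()
    orcid = None
    for ident in creator.findall("nameIdentifier"):
        if ident.get("nameIdentifierScheme") == "ORCID":
            orcid = (ident.text or "").strip()
    return given, family, orcid


for xml_text in (MULTI, SINGLE):
    for creator in ET.fromstring(xml_text).findall("creator"):
        print(parse_creator(creator))
```

Both records should yield the same result here, which is the point of the comparison: the single-field form only works because this particular name happens to follow the "Family, Given" convention.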
Metadata compatibility - ORCID/DataCite
• Contributor roles
Metadata compatibility - ORCID/DataCite
• Relation types
Metadata compatibility - ORCID/DataCite
• Lack of standards
• Low adoption
• Organisations:
  • ISNI / Ringgold / Others
  • Open standard?
• Funding, projects:
  • Crossref’s Open Funder Registry
  • Coverage and quality?
The results!
• ORCID Auto-Update: Whenever a publication or a dataset receives a DOI and its metadata contains ORCID iDs, the ORCID record of the author(s) can be updated automatically!
• Authors receive a notification (inbox)
• They can configure:
  • Accept updates automatically
  • Level of privacy
The results!
• DataCite and Crossref Event Data:
The results!
• EThOS is the UK’s thesis service, offering search and discovery of all UK theses, and direct access to all those that are digitally, openly available.
The results!
• PANGAEA archives, publishes and distributes geo-referenced data about climate variability, the marine environment and geological research.
• PANGAEA attempts to resolve ORCID iDs and annotate author names using a heuristic algorithm
• Data citations from literature are rare!
• PANGAEA is keeping track of the links from datasets back to articles (“reverse links”)
Linking Data and Contributors: Implementation – ORCID Search and Link
http://search.datacite.org/works?query=martin+fenner
Linking Data and Contributors: Implementation – ORCID Auto-Update
https://profiles.datacite.org/users/me
Linking Data and Contributors: Demo
Link Works via ORCID record: https://orcid.org/my-orcid
DataCite/ORCID Search and Link after authenticating with ORCID: https://profiles.datacite.org/users/me
Linking identifier types
There are LOTS of identifier types. Attempting to work with them all raises LOTS of questions.
Remember this? It’s the crosslinks between EMBL-EBI databases.
Most of those databases use different identifier types.
There are 560 collections! This can make things tricky.
Linking identifier types: Case study - identifier types in the life sciences
ORCID currently supports 33 identifier types, such as DOIs.
These are part of a fixed vocabulary, with associated rules about validation and how to resolve them.
Adding a new one can be difficult; adding 500 is really difficult.
We now know that this does not scale. But to fully realise our mission, we need to be able to do it, and so do others.
Linking identifier types: Case study - External identifiers at ORCID
Linking identifier types: Challenges
• Resolution
• Equivalence
• Maintenance
• Usability
Linking identifier types: Challenges - Resolution
Not all of them are resolvable. Ideally, they’d already be URIs, but that’s not the case. Mandating URIs is problematic as it could exclude large parts of the community with established practice.
How do we turn the “foo” identifier with value “bar” into a URI so that the identifier can be resolved?
Do we need a set of transformation rules?
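One way to picture such transformation rules is a lookup table from identifier type to a URI template. This is only a sketch under stated assumptions: the table `RESOLUTION_RULES` and the function `to_uri` are hypothetical names, the set of rules is illustrative rather than any registry's actual data, and no escaping of special characters is attempted. The `pdb` template follows the identifiers.org pattern that appears later in these slides.

```python
# Hypothetical rule table: identifier type -> URI template.
RESOLUTION_RULES = {
    "doi": "https://doi.org/{value}",
    "orcid": "https://orcid.org/{value}",
    "pdb": "http://identifiers.org/pdb/{value}",
}


def to_uri(id_type, value):
    """Turn (type, value) into a resolvable URI, or None if no rule is known."""
    template = RESOLUTION_RULES.get(id_type.lower())
    if template is None:
        # The "foo"/"bar" case: without a rule the identifier cannot be resolved.
        return None
    return template.format(value=value.strip())


print(to_uri("pdb", "3coj"))
print(to_uri("foo", "bar"))
```

The open questions on the slide remain: a table like this still has to be owned, published and kept up to date by someone.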
Linking identifier types: Challenges - Equivalence
Identifiers as URIs can introduce another problem: some have more than one representation, in more than one place.
The Protein Data Bank (PDB) identifier “3coj” can be resolved in lots of places:
• PDB Europe: http://www.ebi.ac.uk/pdbe/entry/pdb/3coj
• PDB Japan: http://pdbj.org/mine/summary/3coj
• RCSB Protein Data Bank: http://www.rcsb.org/pdb/explore/explore.do?structureId=3coj
• Proteopedia: http://proteopedia.org/wiki/index.php/3coj
• PDBsum: https://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=3coj
Linking identifier types: Challenges - Equivalence (2)
These URLs all point at the same conceptual entity. But for systems that group entities by identifiers, this can be a problem.
How do we check for equivalence?
How do we transform the URI into an identifier?
Can we separate the location of things from their identifier?
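The equivalence questions can be made concrete with a small sketch that maps each of the five PDB URLs listed earlier back to a single (type, value) pair and compares those pairs instead of the URLs. The patterns are derived only from those five example URLs and assume four-character alphanumeric PDB identifiers; the names `PDB_URL_PATTERNS`, `normalise` and `equivalent` are hypothetical.

```python
import re

# One pattern per location from the slide; each captures the PDB identifier.
PDB_URL_PATTERNS = [
    r"^https?://www\.ebi\.ac\.uk/pdbe/entry/pdb/(?P<id>[0-9a-z]{4})$",
    r"^https?://pdbj\.org/mine/summary/(?P<id>[0-9a-z]{4})$",
    r"^https?://www\.rcsb\.org/pdb/explore/explore\.do\?structureId=(?P<id>[0-9a-z]{4})$",
    r"^https?://proteopedia\.org/wiki/index\.php/(?P<id>[0-9a-z]{4})$",
    r"^https?://www\.ebi\.ac\.uk/thornton-srv/databases/cgi-bin/pdbsum/"
    r"GetPage\.pl\?pdbcode=(?P<id>[0-9a-z]{4})$",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in PDB_URL_PATTERNS]


def normalise(url):
    """Map a known PDB URL to ("pdb", identifier), or None if unrecognised."""
    for pattern in _COMPILED:
        match = pattern.match(url.strip())
        if match:
            return ("pdb", match.group("id").lower())
    return None


def equivalent(url_a, url_b):
    """Two URLs are equivalent if both normalise to the same identifier."""
    key_a, key_b = normalise(url_a), normalise(url_b)
    return key_a is not None and key_a == key_b


print(equivalent(
    "http://www.ebi.ac.uk/pdbe/entry/pdb/3coj",
    "http://pdbj.org/mine/summary/3coj",
))
```

Hand-written patterns like these do not scale to hundreds of identifier types, which is the maintenance problem the following slides describe.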
Linking identifier types: Challenges - Maintenance
People may define the same thing in different ways. For example, the display name, validation rules or resolution URIs.
Working with multiple identifiers from multiple sources quickly becomes difficult. It’s a jumbled pile of bilateral agreements.
Who owns the definition, who updates it, where is it kept?
How do we handle overlaps and conflicts?
How do we make the process hassle-free and timely?
Linking identifier types: Challenges - Usability
Presenting a list of a thousand identifier types to a user is bad.
Where do definitions and display names come from? What about internationalisation, etc.?
Are users expected to know the URI of their identifiers or the identifier itself?
Should systems be able to recognise and transform between representations?
Linking identifier types: What are we doing to address these issues?
1: ORCID is working with EBI to integrate with systems such as MIRIAM and identifiers.org
2: Refactoring the ORCID registry to streamline the addition of identifier types
3: Investigating how ORCID might enable member-defined identifier types
The life sciences community realised the issues and did something about it. They developed the MIRIAM registry.
It provides the data required to transform local identifiers into URIs, enabling resolution of metadata and the data itself.
Decouples the identification of an entity from its location on the Web.
Linking identifier types: Integration - identifiers in the life sciences
Identifiers.org is a service built on top of the MIRIAM registry
It turns the URNs used by MIRIAM into URLs for the web
It provides persistent resolvable identifiers. The PDB identifier “3coj” can be resolved at http://identifiers.org/pdb/3coj
Linking identifier types: Integration - identifiers in the life sciences
Image from: Identifiers.org and MIRIAM Registry: community resources to provide persistent identification, http://doi.org/10.1093/nar/gkr1097
Linking identifier types: Integration - identifiers in the life sciences
ORCID will reference these services for life science identifiers, but there are still unanswered questions, which may have multiple correct answers.
Does ORCID work with “3coj”, the identifier of type PDB?
Or with “http://identifiers.org/pdb/3coj”, of the type identifiers.org?
Or is it some hybrid system that works with both?
THOR provides the platform to help answer these types of questions.
Controlled vocabularies can, in fact, impede interoperability by restricting links to specific systems. Yet we need to know what is valid and what isn’t.
ORCID is moving to a system whereby the identifier vocabulary is well understood and defined, yet is not fixed and can easily be extended on demand.
Clients can query the current list of identifier types using the public API. We will soon add the rules associated with them.
https://pub.sandbox.orcid.org/v2.0_rc2/#!/Identifier_API/viewIdentifierTypes
Linking identifier types: Integration - ‘un’controlled vocabularies identifier types
The communities that use identifiers and the databases that create them are the best places to define and maintain their definitions
We’re investigating whether the ORCID registry could enable external clients to define identifier types and the rules that go with them, on the fly, for re-use by themselves and others.
We’re evaluating whether this will meet the needs of scholarly communication, including EBI, CERN, DRYAD, PANGAEA and the communities they serve.
Linking identifier types: Integration - ‘un’controlled vocabularies identifier types
Some of the images in these slides were designed by freepik.com
THOR is funded by the European Commission under call H2020-EINFRA-2014-2, project number 654039