Persistent Identifier Linking

THOR Workshop - Persistent Identifier Linking


Page 1: THOR Workshop - Persistent Identifier Linking

Persistent Identifier Linking

Laura Rueda
Agenda: https://docs.google.com/document/d/1IhyGkjW0Ip1ERLcklgv2EmFYhV5OIcFrSqNhV2LD8Vw/edit
Page 2: THOR Workshop - Persistent Identifier Linking

Tom Demeranville
THOR Senior Project Officer and ORCID Software Engineer
https://orcid.org/0000-0003-0902-4386

Martin Fenner
DataCite Technical Director
https://orcid.org/0000-0003-1419-2405

Laura Rueda
DataCite Communications Director
https://orcid.org/0000-0001-5952-7630

Page 3: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Challenges

How to cite data with the right granularity?

How to link data and contributors with the right granularity?

Datasets that are part of larger datasets or heterogeneous collections

Multiple versions of the same dataset

Dynamic data

Page 4: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Challenges – Granularity of Data in ORCID Record

http://search.datacite.org/contributors/0000-0002-8635-8390

Page 5: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Challenges – Versioned Data in ORCID Record

http://orcid.org/0000-0003-1419-2405

Page 6: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Research – Granularity

Attribution vs. Specificity

Persistent identifiers for datasets need to support different levels of granularity

Ideally this is done by multiple persistent identifiers linked via a Has Part/Is Part Of relationship

Collections will play an increasingly important role
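
A minimal Python sketch of the Has Part/Is Part Of idea follows. It models a collection and its member datasets as paired links, then lists every level at which a given dataset could be cited. The DOIs and the function names are hypothetical and for illustration only; the relation names simply follow the wording of the slide.

```python
from collections import defaultdict

# Inverse of each relation type used in this sketch.
INVERSE = {"HasPart": "IsPartOf", "IsPartOf": "HasPart"}


def link_parts(collection, parts):
    """Return (subject, relation, object) links in both directions."""
    links = []
    for part in parts:
        links.append((collection, "HasPart", part))
        links.append((part, INVERSE["HasPart"], collection))
    return links


def citable_levels(links, identifier):
    """List the identifier itself plus every collection it is part of."""
    parents = defaultdict(list)
    for subject, relation, obj in links:
        if relation == "IsPartOf":
            parents[subject].append(obj)
    levels, queue = [identifier], list(parents[identifier])
    while queue:
        parent = queue.pop(0)
        if parent not in levels:
            levels.append(parent)
            queue.extend(parents[parent])
    return levels


# Hypothetical DOIs, for illustration only.
links = link_parts("10.5555/collection", ["10.5555/part-a", "10.5555/part-b"])
print(citable_levels(links, "10.5555/part-a"))
```

Storing both directions of each link means either end can be the starting point for discovery: a reader can go from the collection to its parts, or from a part up to the collection.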

Page 7: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Research – Data Versioning

Versioning of data is important for specificity and verifiability

Practices and expectations for versioning of data vary widely between communities and data centers

The data repository is ultimately responsible for decisions about versioning

General recommendations can only include high-level best practices and common vocabulary

Page 8: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Research – Cross-Linking of Databases

Cross-linking of major life sciences resources at EMBL-EBI (source: EMBL-EBI)

Page 9: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Implementation – Cross-Linking of Databases

Cross-linking between different databases is not conceptually different from article-data linking; implementation should follow the same principles (see next section)

Page 10: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Implementation – Collections

http://search.datacite.org/works/10.1594/PANGAEA.611088

Page 11: THOR Workshop - Persistent Identifier Linking

Linking Data and Data
Demo

Collection of climate data from ship logbooks
http://search.datacite.org/works/10.1594/PANGAEA.611088

Dryad datasets associated with a specific publication
http://search.datacite.org/works/10.5061/DRYAD.9R161.1

Page 12: THOR Workshop - Persistent Identifier Linking

Linking Data and Articles
Challenges

Data underlying the findings described in a manuscript not always fully available

Data underlying the findings described in a manuscript made available, but hidden in supplementary information and not easily findable

Data underlying the findings described in a manuscript made available, but not properly linked to/from article

Page 13: THOR Workshop - Persistent Identifier Linking

Linking Data and Articles
Implementation – Follow FAIR Data Principles

From: http://slideshare.net/lshtm/preparing-data-for-sharing-the-fair-principles

Page 14: THOR Workshop - Persistent Identifier Linking

Linking Data and Articles
Research - Conceptual Model

Linkage as Triples. In the form subject-predicate-object, consistent with the Resource Description Framework (RDF) data model.

Describing the relation. Additional information such as relation type (e.g. A is new version of B) and provenance.

Persistent Identifiers as HTTP URIs. This makes them actionable, and compatible with the RDF data model.

Centralized infrastructure for persistent identifier linking. Provided for example by ORCID and DataCite, facilitating discovery.
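
The conceptual model above can be sketched in a few lines of Python: a link is a subject-predicate-object triple whose subject and object are persistent identifiers written as HTTP URIs, plus provenance. The DOIs, the `https://example.org/relation/` predicate namespace and the provenance value are hypothetical placeholders, not part of any real vocabulary.

```python
from typing import NamedTuple

# Hypothetical namespace for relation types; a real deployment would
# reuse an agreed vocabulary instead.
RELATION_NS = "https://example.org/relation/"


class Link(NamedTuple):
    subject: str     # persistent identifier as an HTTP URI
    predicate: str   # relation type as a URI
    obj: str         # persistent identifier as an HTTP URI
    provenance: str  # who asserted the link


def doi_uri(doi):
    """Express a DOI as an actionable HTTP URI."""
    return "https://doi.org/" + doi


def to_ntriple(link):
    """Serialise the triple part of a link as one N-Triples-style line."""
    return "<{}> <{}> <{}> .".format(link.subject, link.predicate, link.obj)


# "A is new version of B", using hypothetical DOIs.
link = Link(
    subject=doi_uri("10.5555/dataset.v2"),
    predicate=RELATION_NS + "IsNewVersionOf",
    obj=doi_uri("10.5555/dataset.v1"),
    provenance="example-repository",
)
print(to_ntriple(link))
```

Provenance is kept alongside the triple rather than inside it, so the same statement asserted by two sources stays recognisable as one link.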

Page 15: THOR Workshop - Persistent Identifier Linking

Linking Data with Articles
Implementation – Discover Article/Data Links

DataCite Event Data (https://eventdata.datacite.org)
Collect, aggregate and make available article/data links from DataCite metadata and other sources

Crossref Event Data (https://api.eventdata.crossref.org)
Collect and make available article/data links from Crossref metadata and other sources

OpenAIRE Data/Literature Linking Service (http://dliservice.research-infrastructures.eu)
Collect and make available article/data links from a variety of sources

Page 16: THOR Workshop - Persistent Identifier Linking

Linking Data with Articles
Implementation – Exchange Article/Data Links

Standard metadata for exchanging Article/Data Links
Joint collaboration within the RDA/WDS Data Publishing Services WG (http://www.scholix.org/guidelines)

Link exchange between Crossref and DataCite
Using the same open source software (https://github.com/lagotto/lagotto) for their respective Event Data services

Page 17: THOR Workshop - Persistent Identifier Linking

Linking Data with Articles
Demo

Supplementary Information hosted in Data Repository
http://search.datacite.org/works/10.6084/M9.FIGSHARE.3427304

Five datasets from Cambridge Crystallographic Data Centre linked to the same article
http://search.datacite.org/works/10.1021/acs.cgd.6b00527

Software library described in Journal of Open Source Software
http://search.datacite.org/works/10.21105/joss.00026

PLOS articles linked with at least one DataCite DOI
http://search.datacite.org/data-centers/340

DataCite DOI -> Crossref DOI links exported from DataCite to Crossref
http://api.eventdata.crossref.org/works?source_id=datacite_crossref

Page 18: THOR Workshop - Persistent Identifier Linking

In practical terms...

Real interoperability is much more than a framework:

• Compatible data models
• Metadata quality
• Development effort
• Coordination

During this first year, THOR has:

• Assessed how artefacts, contributors, organisations and others are modelled
• Explored different implementations (ADS, Dryad…)
• Proposed approaches to overcome mismatches

Page 19: THOR Workshop - Persistent Identifier Linking

Metadata compatibility - ORCID/DataCite

• Personal names (single and multiple fields)

Page 20: THOR Workshop - Persistent Identifier Linking

Metadata compatibility - ORCID/DataCite

• Personal names (single and multiple fields)

<creators>
  <creator>
    <creatorName>Miller, Elizabeth</creatorName>
    <givenName>Elizabeth</givenName>
    <familyName>Miller</familyName>
    <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0001-5000-0007</nameIdentifier>
    <affiliation>DataCite</affiliation>
  </creator>
</creators>

<creators>
  <creator>
    <creatorName>Miller, Elizabeth</creatorName>
    <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0001-5000-0007</nameIdentifier>
    <affiliation>DataCite</affiliation>
  </creator>
</creators>
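
One way to bridge the two name representations is sketched below in Python. It reads a `<creator>` element, prefers the separate `givenName` and `familyName` fields, and otherwise falls back to splitting `creatorName`. The fallback assumes the "Family, Given" form shown in the slide, which will not hold for every record, and the snippet omits the XML namespace that full DataCite metadata would carry. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Single-field variant from the slide (no namespace, for brevity).
SINGLE_FIELD = """<creators><creator>
  <creatorName>Miller, Elizabeth</creatorName>
  <nameIdentifier schemeURI="http://orcid.org/" nameIdentifierScheme="ORCID">0000-0001-5000-0007</nameIdentifier>
  <affiliation>DataCite</affiliation>
</creator></creators>"""


def split_name(creator):
    """Return (given, family) for a <creator> element."""
    given = creator.findtext("givenName")
    family = creator.findtext("familyName")
    if given and family:
        return given.strip(), family.strip()
    # Assumption: creatorName is written "Family, Given". Without a
    # comma, the whole string is returned as the family name.
    full = creator.findtext("creatorName", default="")
    family, _, given = full.partition(",")
    return given.strip(), family.strip()


def orcid_of(creator):
    """Return the ORCID iD of a <creator>, or None if absent."""
    for ident in creator.findall("nameIdentifier"):
        if ident.get("nameIdentifierScheme") == "ORCID":
            return (ident.text or "").strip()
    return None


creator = ET.fromstring(SINGLE_FIELD).find("creator")
print(split_name(creator), orcid_of(creator))
```

Stripping whitespace from the identifier matters in practice: the extracted slide text shows a stray space before the iD.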

Page 21: THOR Workshop - Persistent Identifier Linking

Metadata compatibility - ORCID/DataCite

• Contributor roles

Page 22: THOR Workshop - Persistent Identifier Linking

Metadata compatibility - ORCID/DataCite

• Relation types

Page 23: THOR Workshop - Persistent Identifier Linking

Metadata compatibility - ORCID/DataCite

• Lack of standards
• Low adoption
• Organisations:
  • ISNI / Ringgold / Others
  • Open standard?
• Funding, projects:
  • Crossref’s Open Funder Registry
  • Coverage and quality?

Page 24: THOR Workshop - Persistent Identifier Linking

The results!

• ORCID Auto-Update: Whenever a publication or a dataset receives a DOI and its metadata contains ORCID iDs, the ORCID record of the author(s) can be updated automatically!
• Authors receive a notification (inbox)
• They can configure:
  • Accept updates automatically
  • Level of privacy
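
The auto-update flow described above can be sketched as follows. This is a simplified, hypothetical model written for illustration, not the ORCID or DataCite implementation: the `Record` class, its field names and the `on_doi_registered` function are all assumptions, and the DOI and second iD are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical, simplified stand-in for an ORCID record."""
    orcid_id: str
    auto_accept: bool = False   # "Accept updates automatically"
    visibility: str = "public"  # "Level of privacy" (values are assumed)
    works: list = field(default_factory=list)
    inbox: list = field(default_factory=list)


def on_doi_registered(doi, creator_orcids, records):
    """Notify each identified author; add the work if they opted in."""
    for orcid_id in creator_orcids:
        record = records.get(orcid_id)
        if record is None:
            continue  # iD appears in the metadata but no record is known
        record.inbox.append("New work with DOI " + doi)
        if record.auto_accept:
            record.works.append({"doi": doi, "visibility": record.visibility})


records = {
    "0000-0001-5000-0007": Record("0000-0001-5000-0007", auto_accept=True),
    "0000-0000-0000-0000": Record("0000-0000-0000-0000"),  # placeholder iD
}
on_doi_registered("10.5555/example", list(records), records)
print(records["0000-0001-5000-0007"].works)
```

In this sketch every identified author is notified, but only those who opted in have the work added without further action.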

Page 25: THOR Workshop - Persistent Identifier Linking

The results!

• DataCite and Crossref Event Data:

Page 26: THOR Workshop - Persistent Identifier Linking

The results!

• EThOS is the UK’s thesis service, offering search and discovery of all UK theses, and direct access to all those that are digitally, openly available.

Page 27: THOR Workshop - Persistent Identifier Linking

The results!

• PANGAEA archives, publishes and distributes geo-referenced data about climate variability, the marine environment and geological research.

• PANGAEA attempts to resolve ORCID iDs and annotate author names using a heuristic algorithm

• Data citations from literature are rare!

• PANGAEA is keeping track of the link from datasets back to articles (“reverse links”)

Page 28: THOR Workshop - Persistent Identifier Linking

Linking Data and Contributors
Implementation – ORCID Search and Link

http://search.datacite.org/works?query=martin+fenner

Page 29: THOR Workshop - Persistent Identifier Linking

Linking Data and Contributors
Implementation – ORCID Auto-Update

https://profiles.datacite.org/users/me

Page 30: THOR Workshop - Persistent Identifier Linking

Linking Data and Contributors
Demo

Link Works via ORCID record
https://orcid.org/my-orcid

DataCite/ORCID Search and Link after authenticating with ORCID
https://profiles.datacite.org/users/me

Page 31: THOR Workshop - Persistent Identifier Linking

Linking identifier types

There are LOTS of identifier types

Attempting to work with them all raises LOTS of questions

Page 32: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Case study - identifier types in the life sciences

Remember this? It’s the crosslinks between EMBL-EBI databases

Most of those databases use different identifier types

There are 560 collections! This can make things tricky

Page 33: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Case study - External identifiers at ORCID

ORCID currently supports 33 identifier types, such as DOIs.

These are part of a fixed vocabulary, with associated rules about validation and how to resolve them.

Adding a new one can be difficult; adding 500 is really difficult.

We now know that this does not scale.

But to fully realise our mission, we need to be able to do it, and so do others.

Page 34: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Challenges

Resolution
Equivalence
Maintenance
Usability

Page 35: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Challenges - Resolution

Not all of them are resolvable

Ideally, they’d already be URIs, but that’s not the case.

Mandating URIs is problematic as it could exclude large parts of the community with established practice

How do we turn the “foo” identifier with value “bar” into a URI so that the identifier can be resolved?

Do we need a set of transformation rules?
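
As a minimal sketch of what such transformation rules might look like, the Python below pairs each identifier type with a validation pattern and a URI template. The rule table, the patterns and the `to_uri` function are assumptions made for illustration; they are not the rules used by ORCID or any other registry.

```python
import re

# Hypothetical rules: identifier type -> (validation pattern, URI template).
RULES = {
    "doi": (r"10\.\d{4,9}/\S+", "https://doi.org/{value}"),
    "orcid": (r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", "https://orcid.org/{value}"),
    "pdb": (r"[0-9][A-Za-z0-9]{3}", "http://identifiers.org/pdb/{value}"),
}


def to_uri(id_type, value):
    """Turn (type, value) into a URI, or return None for unknown types."""
    rule = RULES.get(id_type.lower())
    if rule is None:
        return None  # no rule: the identifier cannot be resolved
    pattern, template = rule
    if re.fullmatch(pattern, value) is None:
        raise ValueError("{!r} is not a valid {} identifier".format(value, id_type))
    return template.format(value=value)


print(to_uri("pdb", "3coj"))
print(to_uri("foo", "bar"))
```

The "foo"/"bar" case returns `None` here, which is exactly the gap the slide points at: without an agreed rule for a type, a system can store the identifier but cannot make it actionable.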

Page 36: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Challenges - Equivalence

Identifiers as URIs can introduce another problem: some have more than one representation, in more than one place

The Protein Data Bank identifier (PDB) “3coj” can be resolved in lots of places:

• PDB Europe: http://www.ebi.ac.uk/pdbe/entry/pdb/3coj
• PDB Japan: http://pdbj.org/mine/summary/3coj
• RCSB Protein Data Bank: http://www.rcsb.org/pdb/explore/explore.do?structureId=3coj
• Proteopedia: http://proteopedia.org/wiki/index.php/3coj
• PDBsum: https://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=3coj

Page 37: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Challenges - Equivalence (2)

These URLs all point at the same conceptual entity. But for systems that group entities by identifiers, this can be a problem.

How do we check for equivalence?

How do we transform the URI into an identifier?

Can we separate the location of things from their identifier?
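
One possible approach to the equivalence question is sketched below in Python: reduce each known URL form to a canonical (type, value) pair and compare those pairs rather than the URLs. The patterns are written by hand from the five PDB URLs listed in these slides, and the function names are illustrative; a real system would need these patterns maintained somewhere, which is the maintenance problem in miniature.

```python
import re

# URL patterns derived from the PDB locations listed in the slides.
PDB_URL_PATTERNS = [
    r"https?://www\.ebi\.ac\.uk/pdbe/entry/pdb/(\w{4})",
    r"https?://pdbj\.org/mine/summary/(\w{4})",
    r"https?://www\.rcsb\.org/pdb/explore/explore\.do\?structureId=(\w{4})",
    r"https?://proteopedia\.org/wiki/index\.php/(\w{4})",
    r"https?://www\.ebi\.ac\.uk/thornton-srv/databases/cgi-bin/pdbsum/"
    r"GetPage\.pl\?pdbcode=(\w{4})",
]


def canonical(url):
    """Map a known URL form to ("pdb", value), or None if unrecognised."""
    for pattern in PDB_URL_PATTERNS:
        match = re.fullmatch(pattern, url)
        if match:
            return ("pdb", match.group(1).lower())
    return None


def equivalent(url_a, url_b):
    """True when both URLs reduce to the same (type, value) pair."""
    key = canonical(url_a)
    return key is not None and key == canonical(url_b)


print(equivalent(
    "http://www.ebi.ac.uk/pdbe/entry/pdb/3coj",
    "http://www.rcsb.org/pdb/explore/explore.do?structureId=3coj",
))
```

Comparing canonical pairs separates the identifier from its location: the grouping key is ("pdb", "3coj") regardless of which site served the page.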

Page 38: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Challenges - Maintenance

People may define the same thing in different ways, for example the display name, validation rules or resolution URIs

Working with multiple identifiers from multiple sources quickly becomes difficult. It’s a jumbled pile of bilateral agreements.

Who owns the definition, who updates it, where is it kept?

How do we handle overlaps and conflicts?

How do we make the process hassle-free and timely?

Page 39: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Challenges - Usability

Presenting a list of a thousand identifier types to a user is bad.

Where do definitions and display names come from, what about internationalisation etc?

Are users expected to know the URI of their identifiers or the identifier itself?

Should systems be able to recognise and transform between representations?

Page 40: THOR Workshop - Persistent Identifier Linking

Linking identifier types
What are we doing to address these issues?

1: ORCID are working with EBI to integrate with systems such as MIRIAM and identifiers.org
2: Refactoring the ORCID registry to streamline the addition of identifier types
3: Investigating how ORCID might enable member-defined identifier types

Page 41: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Integration - identifiers in the life sciences

The life sciences community realised the issues and did something about it. They developed the MIRIAM registry.

It provides the data required to transform local identifiers into URIs, enabling resolution of metadata and the data itself.

It decouples the identification of an entity from its location on the Web.

Page 42: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Integration - identifiers in the life sciences

Identifiers.org is a service built on top of the MIRIAM registry

It turns the URNs used by MIRIAM into URLs for the web

It provides persistent resolvable identifiers. The PDB identifier “3coj” can be resolved at http://identifiers.org/pdb/3coj

Image from: Identifiers.org and MIRIAM Registry: community resources to provide persistent identification, http://doi.org/10.1093/nar/gkr1097
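
The URN-to-URL step can be illustrated with a short Python sketch. It assumes MIRIAM URNs have the shape `urn:miriam:<namespace>:<local id>` and that the matching URL is `http://identifiers.org/<namespace>/<local id>`, as in the PDB example above. Any escaping of special characters in the local identifier is ignored here, and the function name is hypothetical.

```python
def miriam_urn_to_url(urn):
    """Convert an assumed urn:miriam:<namespace>:<id> URN to a URL."""
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0].lower() != "urn" or parts[1].lower() != "miriam":
        raise ValueError("not a MIRIAM URN: {!r}".format(urn))
    namespace, local_id = parts[2], parts[3]
    if not namespace or not local_id:
        raise ValueError("namespace and identifier are both required")
    return "http://identifiers.org/{}/{}".format(namespace, local_id)


print(miriam_urn_to_url("urn:miriam:pdb:3coj"))
```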

Page 43: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Integration - identifiers in the life sciences

ORCID will reference these services for life science identifiers, but there are still unanswered questions, which may have multiple correct answers.

Does ORCID work with “3coj”, the identifier of type PDB?

Or with “http://identifiers.org/pdb/3coj”, of the type identifiers.org?

Or is it some hybrid system that works with both?

THOR provides the platform to help answer these types of questions.

Page 44: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Integration - ‘un’controlled vocabularies identifier types

Controlled vocabularies can, in fact, impede interoperability by restricting links to specific systems. Yet we need to know what is valid and what isn’t.

ORCID is moving to a system in which the identifier vocabulary is well understood and defined, yet not fixed, and easily extensible on demand.

Clients can query the current list of identifier types using the public API. We will soon add the rules associated with them.

https://pub.sandbox.orcid.org/v2.0_rc2/#!/Identifier_API/viewIdentifierTypes

Page 45: THOR Workshop - Persistent Identifier Linking

Linking identifier types
Integration - ‘un’controlled vocabularies identifier types

The communities that use identifiers and the databases that create them are the best places to define and maintain their definitions

We’re investigating whether the ORCID registry could enable external clients to define identifier types and the rules that go with them, on the fly, for re-use by themselves and others.

We’re evaluating whether this will meet the needs of scholarly communication, including EBI, CERN, DRYAD, PANGAEA and the communities they serve.
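
To make the idea of client-defined identifier types concrete, here is a hypothetical in-memory registry in Python. It is a sketch of the concept only and does not reflect the ORCID API: the class, its methods and the ownership rule (only the original owner may redefine a type) are assumptions chosen to show one way of handling the conflicts raised earlier.

```python
import re


class IdentifierTypeRegistry:
    """Hypothetical registry of identifier types; not the ORCID API."""

    def __init__(self):
        self._types = {}

    def register(self, name, pattern, uri_template, owner):
        """Define or update a type; only its owner may redefine it."""
        key = name.lower()
        existing = self._types.get(key)
        if existing is not None and existing["owner"] != owner:
            raise ValueError("type already defined by " + existing["owner"])
        re.compile(pattern)  # fail early on an invalid validation rule
        self._types[key] = {
            "pattern": pattern,
            "uri_template": uri_template,
            "owner": owner,
        }

    def names(self):
        """Return the current, extensible list of identifier types."""
        return sorted(self._types)

    def is_valid(self, name, value):
        entry = self._types.get(name.lower())
        return entry is not None and re.fullmatch(entry["pattern"], value) is not None

    def resolve(self, name, value):
        """Return a URI for a valid identifier, else None."""
        if not self.is_valid(name, value):
            return None
        return self._types[name.lower()]["uri_template"].format(value=value)


registry = IdentifierTypeRegistry()
# Owner name and validation pattern are illustrative assumptions.
registry.register("pdb", r"[0-9][A-Za-z0-9]{3}",
                  "http://identifiers.org/pdb/{value}", owner="ebi")
print(registry.names(), registry.resolve("pdb", "3coj"))
```

The vocabulary here is defined (every type has rules) without being fixed (new types can be added at run time), which is the balance the slides describe.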

Page 46: THOR Workshop - Persistent Identifier Linking

Some of the images in these slides were designed by freepik.com

THOR is funded by the European Commission under call H2020-EINFRA-2014-2, project number 654039