Sharing Resources in CLARIN-NL Jan Odijk, Arjan van Hessen LRTS Workshop IJCNLP Chiang Mai, Thailand, 12 Nov 2011


• Context
• Documentation
• Visibility
• Referability
• Accessibility
• Long Term Preservation
• Interoperability
• Conclusions

Overview

• CLARIN-NL
• National project in the Netherlands
• 2009-2015
• Budget: 9.01 million euro
• Funding by NWO (National Roadmap Large Scale Infrastructures)
• Coordinated by Utrecht University
• 24 partners (universities, royal academy institutes, independent institutes, libraries, etc.)

Context

http://www.clarin.nl/

http://www.clarin.nl/node/7

• Dutch national contribution to the Europe-wide CLARIN infrastructure
• Prepared by the CLARIN preparatory project (2008-2011)
  – Also coordinated by Utrecht University
• From Dec 2011 to be coordinated by the CLARIN-ERIC
  – ERIC: a legal entity at the European level specifically for research infrastructures

Context

http://www.clarin.eu/external/

http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=eric

• A technical research infrastructure in which a humanities researcher who works with language-related resources
  – Can find all data relevant for the research
  – Can find all tools relevant for the research
  – Can apply the tools to the data without any technical background or ad-hoc adaptations
  – Can store data resulting from the research
  – Can store tools resulting from the research
via one portal

CLARIN infrastructure (NL)

• This requires systematic sharing of resources (= data, tools, web services, …)
• Systematic sharing requires
  – Documentation
  – Visibility
  – Referability
  – Accessibility
  – Long Term Preservation
  – Interoperability
of resources

CLARIN infrastructure (NL)

• Resource curation projects
  – Curate an existing resource
• Demonstrator projects
  – Curate an existing tool and supply a demonstration scenario
• Number of subprojects: 21 (12-14 in 2012)
• Data Curation Service
  – Offers the service of curating existing data
• Where curation includes
  – Documentation, Visibility, Referability, Accessibility, Long Term Preservation, Interoperability

CLARIN-NL subprojects

• CLARIN infrastructure is virtual and distributed
  – CLARIN-Centres work together to implement the infrastructure
  – Each stores and makes available a part of the resources
  – Some also provide computational facilities
  – Centres must meet a list of requirements and be certified by CLARIN
• Candidate CLARIN Centres in NL
  – Institute for Dutch Lexicology (INL)
  – Max Planck Institute for Psycholinguistics (MPI)
  – Meertens Institute (MI)
  – Huygens ING Institute (HI)
  – Data Archiving and Networked Services (DANS)

CLARIN-NL Centres

http://www.inl.nl/

http://www.mpi.nl/

http://www.meertens.knaw.nl/

http://www.huygensinstituut.knaw.nl/

http://www.dans.knaw.nl/

• Implementation of basic infrastructure functionality
  – Setting up authentication and authorization systems
  – Several registries (e.g. ISOCAT, RELCAT, Metadata Registry)
  – Various other infrastructure services
• Search Facilities
  – In resource descriptions ('metadata')
    • Centralized after metadata harvesting
  – In the data themselves
    • Via federated search
• Using Webservices in Workflow systems
  – Cooperation with Flanders
  – Based on work done in the STEVIN-programme
  – (as a severe test for interoperability)

Infrastructure Implementation

http://www.clarin.nl/node/76#infra

http://www.clarin.nl/node/76#S_D

http://www.clarin.nl/node/76#TTNWW

http://taalunieversum.org/taal/technologie/stevin/

• Is always necessary, so hardly any additional effort
• Partly in natural language
• Partly formalized
  – Described under a particular formally identifiable attribute
  – With an explicit type for the value of the attribute
  – Possibly with further restrictions on the values (patterns, finite lists of values, constraints, etc.)
  – Represented formally and unambiguously
• Any piece of documentation that can be formalized must be formalized, and must be put in the resource description (metadata of the resource)

Documentation
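What "formalized" documentation could look like can be sketched in a few lines of Python: each attribute gets an explicit value type plus optional restrictions (a pattern or a finite list of values). The attribute names, the restrictions and the `AttributeSpec` class are assumptions made for illustration only; they are not taken from any CLARIN schema.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AttributeSpec:
    """Formal description of one metadata attribute (hypothetical model)."""
    name: str
    value_type: type
    pattern: Optional[str] = None            # regular expression a string value must match
    allowed: Optional[Sequence[str]] = None  # finite list of permitted values

    def validate(self, value) -> list:
        """Return a list of problems; an empty list means the value is acceptable."""
        if not isinstance(value, self.value_type):
            return [f"{self.name}: expected {self.value_type.__name__}"]
        problems = []
        if self.pattern is not None and not re.fullmatch(self.pattern, value):
            problems.append(f"{self.name}: does not match pattern {self.pattern}")
        if self.allowed is not None and value not in self.allowed:
            problems.append(f"{self.name}: not one of {list(self.allowed)}")
        return problems


# Illustrative specs; names and restrictions are assumptions, not a real profile.
SPECS = [
    AttributeSpec("languageCode", str, pattern=r"[a-z]{3}"),
    AttributeSpec("modality", str, allowed=("spoken", "written")),
    AttributeSpec("sizeInTokens", int),
]


def validate_description(description: dict) -> list:
    """Check a resource description against all attribute specs."""
    problems = []
    for spec in SPECS:
        if spec.name not in description:
            problems.append(f"{spec.name}: missing")
        else:
            problems.extend(spec.validate(description[spec.name]))
    return problems
```

A description such as `{"languageCode": "nld", "modality": "spoken", "sizeInTokens": 9000000}` passes without problems, whereas a free-text value like `"Dutch"` for `languageCode` is rejected by the pattern.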

• Resource Descriptions
  – Component-based MetaData Infrastructure (CMDI)
  – One can define resource profiles as collections of components (which can contain components)
  – Many generally usable components are available
  – Resource profiles for most common resources are available
  – Component-based flexibility
  – Danger of flexibility: diversity, no interoperability
  – Controlled by semantic interoperability (see below)
  – Not yet available but needed: profile(s) for tools
• Supported by tools
  – Component and profile editors
  – Component and profile registries
  – Metadata editor

Documentation

http://www.clarin.eu/cmdi
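The idea that a profile is a collection of components, which can themselves contain components, can be sketched as a small recursive data structure. This is only an illustration of the nesting principle: the `Component` class and all component and element names are invented here and do not reproduce the actual CMDI format or any registered component.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A reusable metadata building block; may contain further components."""
    name: str
    elements: List[str] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def element_paths(self, prefix: str = "") -> List[str]:
        """List every element reachable from this component as a slash-separated path."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        paths = [f"{here}/{element}" for element in self.elements]
        for child in self.components:
            paths.extend(child.element_paths(here))
        return paths


# Generally usable components, defined once and reused (names are made up).
contact = Component("Contact", elements=["name", "email"])
access = Component("Access", elements=["licence"], components=[contact])

# A profile is itself a collection of components.
corpus_profile = Component(
    "SpokenCorpusProfile",
    elements=["title", "language"],
    components=[access, contact],
)

for path in corpus_profile.element_paths():
    print(path)
```

The same `Contact` component is reused at two places in the profile, which is the kind of flexibility (and potential diversity) the slide refers to.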

• Each resource and its resource description must be stored at a CLARIN-centre
• CLARIN-centres make resource descriptions available for metadata harvesting (using OAI-PMH)
• Via harvesting the metadata, the metadata become available in the CLARIN resource catalogue
  – Browsing via the Virtual Language Observatory (VLO) using faceted browsing
  – Search via a search interface (under development)
    • In the metadata and in the data
    • String search and structured search
    • Results if desired collected in a Virtual Collection

Visibility

http://www.openarchives.org/OAI/openarchivesprotocol.html

http://www.clarin.eu/vlo/
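A harvesting request in OAI-PMH is an ordinary HTTP request with a `verb` parameter. The sketch below builds a `ListRecords` request and extracts record identifiers from a response, using only the Python standard library. The endpoint `https://centre.example.org/oai`, the metadata prefix `cmdi` and the sample response are assumptions for illustration; a real centre publishes its own base URL and supported prefixes, and a real harvester would also handle resumption tokens and errors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH ListRecords request URL for the given endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_identifiers(response_xml: str) -> list:
    """Extract the identifier of every record header in a ListRecords response."""
    root = ET.fromstring(response_xml)
    found = root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    return [element.text for element in found]


# Hand-written, heavily shortened sample response (hypothetical identifiers).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:centre.example.org:corpus-1</identifier></header></record>
    <record><header><identifier>oai:centre.example.org:lexicon-7</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://centre.example.org/oai", "cmdi"))
print(record_identifiers(SAMPLE_RESPONSE))
```

A central catalogue would run such requests against every centre and index the harvested descriptions for browsing and search.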

• By name or title is not sufficient
  – All the problems that natural language poses for communication:
    • Not always unique (ambiguity)
    • Language-specific: Corpus Gesproken Nederlands
      – Variants in other languages: Spoken Dutch Corpus
      – Limited knowledge of the foreign language variants: Corpus Spoken Dutch, Dutch Spoken Corpus
    • Long, too redundant
      – Abbreviations/acronyms: CGN
    • Invites errors
      – Spoken Dutch Cropus, Spken Dutch Corpus
• URLs
  – Still too long/redundant (unless one uses shortened URLs)
  – Unstable, volatile
• Persistent Identifiers (PIDs) are needed

Referability

• PIDs
• Each CLARIN-Centre
  – Must assign a PID to each resource (and/or to subresources)
  – Must keep the PID resolution registry up-to-date
• PID systems
  – Handle (preferred)
  – URN
  – Perhaps others (e.g. DOI)

Referability

http://www.handle.net/
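Why a PID is more stable than a URL can be shown with a toy resolution registry: references cite the PID, and only the registry entry changes when a resource moves. The `PidRegistry` class, the PID `00000/cgn-example` and the URLs are hypothetical; this is not the Handle System API, only a sketch of the indirection it provides.

```python
class PidRegistry:
    """Toy PID resolution registry (an illustration, not the Handle System)."""

    def __init__(self):
        self._locations = {}

    def register(self, pid: str, url: str) -> None:
        """Assign a new PID; a PID is never reassigned to another resource."""
        if pid in self._locations:
            raise ValueError(f"PID already assigned: {pid}")
        self._locations[pid] = url

    def update_location(self, pid: str, new_url: str) -> None:
        """Keep the registry up to date when the resource moves."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        """Return the current location of the resource."""
        return self._locations[pid]


registry = PidRegistry()
registry.register("00000/cgn-example", "https://old-server.example.org/corpora/cgn")

# The resource moves; citations that use the PID remain valid.
registry.update_location("00000/cgn-example", "https://new-server.example.org/data/cgn")
print(registry.resolve("00000/cgn-example"))
```

The centre's obligation to keep the resolution registry up to date corresponds to calling `update_location` whenever a resource is relocated.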

• CLARIN infrastructure
  – Accessible at any time and from any place
• IPR
  – CLARIN-NL promotes maximal open access of resources
  – CLARIN-NL is working on plans to implement policies and functionality to properly handle IPR and ethical restrictions
• Researchers’ Mindset
  – Many researchers in the humanities are hesitant or even unwilling to share their resources with others
  – How to resolve this? With a carrot and a stick
    • CLARIN must accommodate reasonable wishes
    • CLARIN must prove benefits for researchers who put their resources there
    • Funding agencies must oblige researchers to do so (this is already partially the case)

Accessibility

• Necessary to make sure the resources can be shared with future researchers (who may include the producer!)
• Each CLARIN-Centre is obliged to ensure long term preservation
• Usually outsourced to specialized centres
  – MI outsources to DANS
  – MPI outsources to an internal Max Planck Gesellschaft organisation

Long Term Preservation

• Interoperability of resources is the ability of resources to seamlessly work together
  – No manual ad-hoc adaptations
  – Adaptations occur automatically 'behind the scenes'
• Need for interoperability is high
  – Humanities researchers often lack the required technical background
• Interoperability
  – Syntactic interoperability and semantic interoperability
• Each subproject must try to achieve interoperability
  – Report any problems and make suggestions for adaptations
  – So that the resources are adapted to the infrastructure (in some cases) and vice versa (in other cases)
• Not easy, but the only way to get further is to actually try this and learn from it.

Interoperability

• the formats of data are selected from a limited set of (de facto) standards or best practices supported by CLARIN

• software tools and applications take input and yield output in these formats

Syntactic Interoperability
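Syntactic interoperability in a workflow can be checked mechanically: every tool declares its input and output format, and a chain is acceptable only if all formats come from the supported set and each output matches the next input. The format names, the tools and the `chain_problems` function below are placeholders invented for this sketch; they are not the formats actually selected by CLARIN.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tool:
    """A tool or web service with one declared input and output format."""
    name: str
    input_format: str
    output_format: str


# Limited set of supported formats (placeholder names, not a CLARIN list).
SUPPORTED_FORMATS = {"plain-text", "tokenized-xml", "tagged-xml"}


def chain_problems(tools: List[Tool]) -> List[str]:
    """Return all syntactic mismatches in a chain of tools; empty means compatible."""
    problems = []
    for tool in tools:
        for fmt in (tool.input_format, tool.output_format):
            if fmt not in SUPPORTED_FORMATS:
                problems.append(f"{tool.name}: unsupported format {fmt}")
    # Each tool's output must be exactly what the next tool expects as input.
    for producer, consumer in zip(tools, tools[1:]):
        if producer.output_format != consumer.input_format:
            problems.append(
                f"{producer.name} -> {consumer.name}: "
                f"{producer.output_format} != {consumer.input_format}"
            )
    return problems


tokenizer = Tool("tokenizer", "plain-text", "tokenized-xml")
tagger = Tool("tagger", "tokenized-xml", "tagged-xml")

print(chain_problems([tokenizer, tagger]))  # compatible chain
print(chain_problems([tagger, tokenizer]))  # wrong order: formats do not match
```

With such declarations a workflow system can refuse an incompatible chain before running it, so the researcher never has to adapt formats by hand.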

• Focus on the semantics of Data Categories (DCs)
• A privileged data category registry (DCR) is set up containing DCs:
  – Unique persistent identifiers for DCs (PIDs),
  – Their semantics,
  – A definition,
  – Examples,
  – Lexicalizations in various languages.
• Each resource-specific DC is mapped to a DC from the privileged DCR. As a result:
  – Every researcher can use his/her own DCs
  – Different DCs from different resources can be interpreted as identical in meaning, via the DC of the privileged DCR
• In CLARIN-NL multiple (complementary) privileged DCRs are allowed. The primary one is ISOCAT

Semantic Interoperability
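The mapping idea can be sketched with two dictionaries: each resource keeps its own labels, and two labels count as identical in meaning when they map to the same DC in the privileged registry. The identifiers `DC-0001` and `DC-0002`, the resource names and the tags are invented for illustration and are not real ISOCAT entries.

```python
# Privileged DCR: persistent identifier -> short definition (both invented).
DCR = {
    "DC-0001": "noun",
    "DC-0002": "verb",
}

# Each resource maps its own, resource-specific DCs to DCs in the registry.
RESOURCE_MAPPINGS = {
    "corpus_a": {"N": "DC-0001", "WW": "DC-0002"},
    "lexicon_b": {"noun": "DC-0001", "V": "DC-0002"},
}


def same_meaning(resource1: str, tag1: str, resource2: str, tag2: str) -> bool:
    """Two resource-specific DCs are equivalent if they map to the same registry DC."""
    pid1 = RESOURCE_MAPPINGS[resource1].get(tag1)
    pid2 = RESOURCE_MAPPINGS[resource2].get(tag2)
    return pid1 is not None and pid1 == pid2


print(same_meaning("corpus_a", "N", "lexicon_b", "noun"))  # both map to DC-0001
print(same_meaning("corpus_a", "WW", "lexicon_b", "noun"))  # DC-0002 vs DC-0001
```

Researchers keep their own labels; only the mapping to the registry has to be maintained, and an unmapped label is simply never treated as equivalent to anything.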

• Achieving semantic interoperability is very hard
  – Many DCs are almost identical (principled/pragmatic/arbitrary reasons)
  – Some DCs in ISOCAT are not defined clearly
  – There are many similar DCs in ISOCAT
  – Relevant DCs are not easy to find in ISOCAT
• Three actions taken
  – Held several workshops to discuss problems
  – Appointed a coordinator to deal with problems
  – Decided to implement RELCAT registry to specify relations between DCs

Semantic Interoperability

• CLARIN-NL requires systematic sharing of resources
• Therefore requires researchers to work on
  – Documentation
  – Visibility
  – Referability
  – Accessibility
  – Long Term Preservation
  – Interoperability
of resources
• For certain aspects this is relatively easy but it must be done
• For other aspects this is very hard but it must be done so that we can learn
• The approach described here may be a model for other countries working on the CLARIN-infrastructure
• It may be a model for other resource sharing facilities (e.g. META-SHARE)

Conclusions

Thanks for your attention!
